Statistics 2
J.S. Abdey
ST104B
2023
Undergraduate study in
Economics, Management,
Finance and the Social Sciences
This subject guide is for a 100 course offered as part of the University of London’s
undergraduate study in Economics, Management, Finance and the Social
Sciences. This is equivalent to Level 4 within the Framework for Higher Education
Qualifications in England, Wales and Northern Ireland (FHEQ).
For more information see: london.ac.uk
This guide was prepared for the University of London by:
James S. Abdey, BA (Hons), MSc, PGCertHE, PhD, Department of Statistics, London
School of Economics and Political Science.
This is one of a series of subject guides published by the University. We regret that
due to pressure of work the author is unable to enter into any correspondence
relating to, or arising from, the guide. If you have any comments on this subject
guide, please communicate these through the discussion forum on the virtual
learning environment.
University of London
Publications office
Stewart House
32 Russell Square
London WC1B 5DN
United Kingdom
london.ac.uk
Contents
0 Preface
0.1 Route map to the subject guide
0.2 Introduction to the subject area
0.3 Syllabus
0.4 Aims and objectives
0.5 Learning outcomes
0.6 Employability outcomes
0.7 Overview of learning resources
0.7.1 The subject guide
0.7.2 Essential reading
0.7.3 Further reading
0.7.4 Online study resources
0.7.5 The VLE
0.7.6 Making use of the Online Library
0.8 Examination advice
1 Probability theory
1.1 Synopsis of chapter
1.2 Learning outcomes
1.3 Introduction
1.4 Set theory: the basics
1.5 Axiomatic definition of probability
1.5.1 Basic properties of probability
1.6 Classical probability and counting rules
1.6.1 Brute force: listing and counting
1.6.2 Combinatorial counting methods
1.7 Conditional probability and Bayes' theorem
1.7.1 Independence of multiple events
1.7.2 Independent versus mutually exclusive events
1.7.3 Conditional probability of independent events
Chapter 0
Preface
By successfully completing this course, you will understand the ideas of randomness and
variability, and the way in which they link to probability theory. This will allow the use
of a systematic and logical collection of statistical techniques of great practical
importance in many applied areas. The examples in this subject guide will concentrate
on the social sciences, but the methods are important for the physical sciences too. This
subject aims to provide a grounding in probability theory, point estimation and analysis
of variance.
The material in ST104B Statistics 2 is necessary as preparation for other subjects
you may study later on in your degree. The full details of the ideas discussed in this
subject guide will not always be required in these other subjects, but you will need to
have a solid understanding of the main concepts. This can only be achieved by seeing the ideas in use and by working through problems yourself.
For statistics, you need some familiarity with abstract mathematical ideas, as well as
the ability and common sense to apply these to real-life problems. The concepts you will
encounter in probability and statistical inference are hard to absorb by just reading
about them in a book. You need to read, then think a little, then try some problems,
and then read and think some more. This procedure should be repeated until the
problems are easy to do; you should not spend a long time reading and forget about
solving problems.
0.3 Syllabus
The up-to-date course syllabus for ST104B Statistics 2 can be found in the course
information sheet, which is available on the course VLE (virtual learning environment)
page.
On completing this course, you should, among other things, be able to apply and be competent users of standard statistical operators and be able to recall a variety of well-known distributions.
0.6 Employability outcomes
1. complex problem-solving
2. decision making
3. communication.
Week Chapter
1&2 Chapter 1: Probability theory
3 Chapter 2: Discrete probability distributions
4 Chapter 3: Continuous probability distributions
5 Chapter 4: Multivariate random variables
6 Chapter 5: Sampling distributions of statistics
7 Chapter 6: Estimator properties
8 Chapter 7: Point estimation
9 & 10 Chapter 8: Analysis of variance (ANOVA)
The last step is the most important. It is easy to think that you have understood the
material after reading it, but working through problems is the crucial test of
understanding. Problem-solving should take up most of your study time.
To prepare for the examination, you will only need to read the material in the subject
guide, but it may be helpful from time to time to look at the suggested ‘Further
reading’ below.
Basic notation
We often use the symbol ∎ to denote the end of a proof, where we have finished
explaining why a particular result is true. This is just to make it clear where the proof
ends and the following text begins.
Calculators
A calculator may be used when answering questions on the examination paper for
ST104B Statistics 2. It must comply in all respects with the specification given in the
Regulations. You should also refer to the admission notice you will receive when
entering the examination and the ‘Notice on permitted materials’.
Computers
If you are aiming to carry out serious statistical analysis (which is beyond the level of
this course) you will probably want to use some statistical software package, such as R.
It is not necessary for this course to have such software available, but if you do have
access to it you may benefit from using it in your study of the material. On a few
occasions in this subject guide R will be used for illustrative purposes only. You will not
be examined on R.
This subject guide is ‘self-contained’ meaning that this is the only resource which is
essential reading for ST104B Statistics 2. Throughout the subject guide there are
many worked examples, practice problems and sample examination questions replicating
resources typically provided in statistical textbooks. You may, however, feel you could
benefit from reading textbooks, and a suggested list of these is provided below.
Statistical tables
Lindley, D.V. and W.F. Scott New Cambridge Statistical Tables. (Cambridge:
Cambridge University Press, 1995) second edition [ISBN 9780521484855].
These relevant extracts can be found at the end of this subject guide, and are the same
as those distributed for use in the examination. It is advisable that you become familiar
with them, rather than those at the end of a textbook which may differ in presentation.
Freedman, D., R. Pisani and R. Purves Statistics. (New York: W.W. Norton &
Company, 2007) fourth edition [ISBN 9780393930436].
Johnson, R.A. and G.K. Bhattacharyya Statistics: Principles and Methods. (New
York: John Wiley and Sons, 2010) sixth edition [ISBN 9780470505779].
Larsen, R.J. and M.J. Marx An Introduction to Mathematical Statistics and Its
Applications. (London: Pearson, 2017) sixth edition [ISBN 9780134114217].
Newbold, P., W.L. Carlson and B.M. Thorne Statistics for Business and
Economics. (London: Pearson, 2012) eighth edition [ISBN 9780273767060].
Course materials: Subject guides and other course materials available for
download. In some courses, the content of the subject guide is transferred into the
VLE and additional resources and activities are integrated with the text.
Readings: Direct links, wherever possible, to essential readings in the Online
Library, including journal articles and ebooks.
Video content: Including introductions to courses and topics within courses,
interviews, lessons and debates.
Screencasts: Videos of PowerPoint presentations, animated podcasts and
on-screen worked examples.
External material: Links out to carefully selected third-party resources.
Self-test activities: Multiple-choice, numerical and algebraic quizzes to check
your understanding.
Collaborative activities: Work with fellow students to build a body of
knowledge.
Discussion forums: A space where you can share your thoughts and questions
with fellow students. Many forums will be supported by a ‘course moderator’, a
subject expert employed by LSE to facilitate the discussion and clarify difficult
topics.
Past examination papers: We provide up to three years of past examinations
alongside Examiners’ commentaries that provide guidance on how to approach the
questions.
Study skills: Expert advice on getting started with your studies, preparing for
examinations and developing your digital literacy skills.
Some of these resources are available for certain courses only, but we are expanding our
provision all the time and you should check the VLE regularly for updates.
The easiest way to locate relevant content and journal articles in the Online Library is
to use the Summon search engine.
If you are having trouble finding an article listed in a reading list, try removing any
punctuation from the title, such as single quotation marks, question marks and colons.
For further advice, please use the online help pages
(http://onlinelibrary.london.ac.uk/resources/summon) or contact the Online Library
team using the ‘Chat with us’ function.
0.8 Examination advice

Where available, make use of past examination papers and Examiners’ commentaries for the course, which give advice on how each question might best be answered.
Chapter 1
Probability theory
1.2 Learning outcomes

By the end of this chapter, you should be able to:

explain the fundamental ideas of random experiments, sample spaces and events
list the axioms of probability and be able to derive all the common probability
rules from them
list the formulae for the number of combinations and permutations of k objects out
of n, and be able to routinely use such results in problems
explain conditional probability and the concept of independent events
prove the law of total probability and apply it to problems where there is a
partition of the sample space
prove Bayes’ theorem and apply it to find conditional probabilities.
1.3 Introduction
Consider the following hypothetical example. A country will soon hold a referendum
about whether it should leave the European Union (EU). An opinion poll of a random
sample of people in the country is carried out.
950 respondents say that they plan to vote in the referendum. They answer the question
‘Will you vote ‘Yes’ or ‘No’ to leaving the EU?’ as follows:
Answer        Yes     No     Total
Count         513     437      950
%             54%     46%     100%
However, we are not interested in just this sample of 950 respondents, but in the
population which they represent, that is, all likely voters.
Statistical inference will allow us to say things like the following about the
population.
‘The null hypothesis that π = 0.50, against the alternative hypothesis that
π > 0.50, is rejected at the 5% significance level.’
In short, the opinion poll gives statistically significant evidence that ‘Yes’ voters are in
the majority among likely voters. Such methods of statistical inference will be discussed
later in the course.
The inferential statements about the opinion poll rely on the following assumptions and
results.
In the next few chapters, we will learn about the terms in bold, among others.
In statistical inference, the data we have observed are regarded as a sample from a
broader population, selected with a random process.
Values in a sample are also random. We cannot predict the precise values which
will be observed before we actually collect the sample.
A preview of probability
Experiment: for example, rolling a single die and recording the outcome.
Sample space S: the set of all possible outcomes, here {1, 2, 3, 4, 5, 6}.
Event: any subset A of the sample space, for example A = {4, 5, 6} or
B = {1, 2, 3, 4, 5}.
1.4 Set theory: the basics

The symbol ∈ denotes membership of a set, and ∉ denotes non-membership. For example, if A = {1, 2, 3, 4, 5}, then:

1 ∈ A and 2 ∈ A

6 ∉ A and 1.5 ∉ A.
The familiar Venn diagrams help to visualise statements about sets. However, Venn
diagrams are not formal proofs of results in set theory.
Example 1.3 In Figure 1.1, the darkest area in the middle is A ∩ B, the total
shaded area is A ∪ B, and the white area is (A ∪ B)^c = A^c ∩ B^c.
Subset

A ⊂ B when x ∈ A ⇒ x ∈ B, i.e. every element of A is also an element of B.
Example 1.4 An example of the distinction between subsets and non-subsets is:
{1, 2, 3} ⊂ {1, 2, 3, 4}, because all elements appear in the larger set
{1, 2, 5} ⊄ {1, 2, 3, 4}, because the element 5 does not appear in the larger set.
Two sets A and B are equal (A = B) if they have exactly the same elements. This
implies that A ⊂ B and B ⊂ A.
Union (‘or’)

A ∪ B = {x | x ∈ A or x ∈ B}.
That is, the set of those elements which belong to A or B (or both). An example is
shown in Figure 1.3.
For example, if A = {1, 2, 3, 4}, B = {2, 3} and C = {4, 5, 6}, then:

A ∪ B = {1, 2, 3, 4}

A ∪ C = {1, 2, 3, 4, 5, 6}

B ∪ C = {2, 3, 4, 5, 6}.
Intersection (‘and’)

A ∩ B = {x | x ∈ A and x ∈ B}.
That is, the set of those elements which belong to both A and B. An example is
shown in Figure 1.4.
A ∩ B = {2, 3}
A ∩ C = {4}
B ∩ C = ∅.
Both set operators can also be applied to more than two sets, such as A ∩ B ∩ C.
Concise notation for the unions and intersections of sets A1, A2, . . . , An is:

⋃_{i=1}^{n} Ai = A1 ∪ A2 ∪ · · · ∪ An

and:

⋂_{i=1}^{n} Ai = A1 ∩ A2 ∩ · · · ∩ An.
These can also be used for an infinite number of sets, i.e. when n is replaced by ∞.
Complement (‘not’)
Suppose S is the set of all possible elements which are under consideration. In
probability, S will be referred to as the sample space.
It follows that A ⊂ S for every set A we may consider. The complement of A with
respect to S is:
A^c = {x | x ∈ S and x ∉ A}.
That is, the set of those elements of S that are not in A. An example is shown in
Figure 1.5.
We now consider some useful properties of set operators. In proofs and derivations
about sets, you can use the following results without proof.
Commutativity:
A ∩ B = B ∩ A and A ∪ B = B ∪ A.
Associativity:
A ∩ (B ∩ C) = (A ∩ B) ∩ C and A ∪ (B ∪ C) = (A ∪ B) ∪ C.
Distributive laws:
A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C)
and:
A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C).
De Morgan’s laws: (A ∩ B)^c = A^c ∪ B^c and (A ∪ B)^c = A^c ∩ B^c.
If S is the sample space and A and B are any sets in S, you can also use the following
results without proof:
∅c = S.
∅ ⊂ A, A ⊂ A and A ⊂ S.
A ∩ A = A and A ∪ A = A.
A ∩ Ac = ∅ and A ∪ Ac = S.
If B ⊂ A, A ∩ B = B and A ∪ B = A.
A ∩ ∅ = ∅ and A ∪ ∅ = A.
A ∩ S = A and A ∪ S = S.
∅ ∩ ∅ = ∅ and ∅ ∪ ∅ = ∅.
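As noted in the Preface, R is occasionally used in this guide for illustration only. As a brief illustrative sketch (not examinable), the set operations above can be checked directly in R; the sets A, B and C below are the small example sets used in this section, with S = {1, 2, 3, 4, 5, 6} taken as the sample space.

# Example sets and sample space (assumed for illustration)
A <- c(1, 2, 3, 4)
B <- c(2, 3)
C <- c(4, 5, 6)
S <- 1:6

union(A, B)      # A union B = {1, 2, 3, 4}
intersect(A, C)  # A intersect C = {4}
intersect(B, C)  # empty: B and C are disjoint
setdiff(S, A)    # complement of A with respect to S = {5, 6}

# Check one of De Morgan's laws: (A union B)^c equals A^c intersect B^c
setequal(setdiff(S, union(A, B)),
         intersect(setdiff(S, A), setdiff(S, B)))  # TRUE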
Two sets A and B are disjoint, or mutually exclusive, if they have no elements in common, i.e. if A ∩ B = ∅.

Sets A1, A2, . . . , An are pairwise disjoint if all pairs of sets from them are disjoint, i.e. Ai ∩ Aj = ∅ for all i ≠ j.
Partition
The sets A1, A2, . . . , An form a partition of the set A if they are pairwise disjoint and if ⋃_{i=1}^{n} Ai = A, that is, A1, A2, . . . , An are collectively exhaustive of A.

Therefore, a partition divides the entire set A into non-overlapping pieces Ai, as shown in Figure 1.6 for n = 3. Similarly, an infinite collection of sets A1, A2, . . . forms a partition of A if they are pairwise disjoint and ⋃_{i=1}^{∞} Ai = A.
Figure 1.6: A partition of a set A into three non-overlapping pieces A1, A2 and A3.
For example, suppose A ⊂ B. Then A and B ∩ A^c form a partition of B, since:

A ∩ (B ∩ A^c) = (A ∩ A^c) ∩ B = ∅ ∩ B = ∅

and:

A ∪ (B ∩ A^c) = (A ∪ B) ∩ (A ∪ A^c) = B ∩ S = B

using A ∪ B = B (because A ⊂ B). Hence A and B ∩ A^c are mutually exclusive and collectively exhaustive of B, and so they form a partition of B.
1.5 Axiomatic definition of probability
Example 1.8 If the experiment is ‘select a trading day at random and record the
% change in the FTSE 100 index from the previous trading day’, then the outcome
is the % change in the FTSE 100 index.
S = [−100, +∞) for the % change in the FTSE 100 index (in principle).
An event of interest might be A = {x | x > 0} – the event that the daily change is
positive, i.e. the FTSE 100 index gains value from the previous trading day.
The sample space and events are represented as sets. For two events A and B, set
operations are then interpreted as follows.
Axioms of probability
‘Probability’ is formally defined as a function P (·) from subsets (events) of the sample
space S onto real numbers.1 Such a function is a probability function if it satisfies
the following axioms (‘self-evident truths’).
Axiom 1: P(A) ≥ 0 for all events A.

Axiom 2: P(S) = 1.

Axiom 3: if A1, A2, . . . are pairwise disjoint (mutually exclusive) events, then P(A1 ∪ A2 ∪ · · ·) = P(A1) + P(A2) + · · ·.
The axioms require that a probability function must always satisfy these requirements.
¹ The precise definition also requires a careful statement of which subsets of S are allowed as events, which we can skip on this course.
Axiom 2 requires that the outcome is some element from the sample space with
certainty (that is, with probability 1). In other words, the experiment must have
some outcome.
Probability property

If A1, A2, . . . , An are mutually exclusive (pairwise disjoint) events, then:

P(A1 ∪ A2 ∪ · · · ∪ An) = P(A1) + P(A2) + · · · + P(An).

In pictures, this result means that in a situation like the one shown in Figure 1.7, the probability of the combined event A = A1 ∪ A2 ∪ A3 is simply the sum of the probabilities of the individual events:

P(A) = P(A1) + P(A2) + P(A3).
That is, we can simply sum probabilities of mutually exclusive sets. This is very useful
for deriving further results.
Figure 1.7: Venn diagram depicting three mutually exclusive sets, A1 , A2 and A3 . Note
although A2 and A3 have touching boundaries, there is no actual intersection and hence
they are (pairwise) mutually exclusive.
Probability property

For any two events A and B:

P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

To see why, note that A ∪ B can be split into the mutually exclusive pieces A ∩ B^c, A ∩ B and A^c ∩ B, and that:
P(A) = P(A ∩ B^c) + P(A ∩ B)

P(B) = P(A^c ∩ B) + P(A ∩ B)

and hence:

P(A ∪ B) = (P(A) − P(A ∩ B)) + P(A ∩ B) + (P(B) − P(A ∩ B)) = P(A) + P(B) − P(A ∩ B).
These show that the probability function has the kinds of values we expect of something
called a ‘probability’.
In particular, P(A^c) = 1 − P(A) for any event A.
For example, suppose that in a survey of television and newspaper habits:

86% spend at least 1 hour watching television (event A, with P (A) = 0.86)
19% spend at least 1 hour reading newspapers (event B, with P (B) = 0.19)
15% spend at least 1 hour watching television and at least 1 hour reading
newspapers (P (A ∩ B) = 0.15).
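Although the original question for this example is not reproduced here, the addition rule derived above can be applied directly to these figures. For instance, the probability that a randomly chosen respondent spends at least 1 hour on at least one of the two activities is:

P(A ∪ B) = P(A) + P(B) − P(A ∩ B) = 0.86 + 0.19 − 0.15 = 0.90

and hence the probability of spending at least 1 hour on neither activity is 1 − 0.90 = 0.10.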
Probability theory tells us how to work with the probability function and derive
‘probabilities of events’ from it. However, it does not tell us what ‘probability’ really
means.
There are several alternative interpretations of the real-world meaning of ‘probability’
in this sense. One of them is outlined below. The mathematical theory of probability
and calculations on probabilities are the same whichever interpretation we assign to
‘probability’. So, in this course, we do not need to discuss the matter further.
Example 1.10 How should we interpret the following, as statements about the real
world of coins and babies?
‘The probability that a tossed coin comes up heads is 0.5.’ If we tossed a coin a
large number of times, and the proportion of heads out of those tosses was 0.5,
the ‘probability of heads’ could be said to be 0.5, for that coin.
‘The probability is 0.51 that a child born in the UK today is a boy.’ If the
proportion of boys among a large number of live births was 0.51, the
‘probability of a boy’ could be said to be 0.51.
A key question is how to determine appropriate numerical values of P (A) for the
probabilities of particular events.
This is usually done empirically, by observing actual realisations of the experiment and
using them to estimate probabilities. In the simplest cases, this basically applies the
frequency definition to observed data.
If I toss a coin 10,000 times, and 5,023 of the tosses come up heads, it seems
that, approximately, P (heads) = 0.5, for that coin.
Of the 7,098,667 live births in England and Wales in the period 1999–2009,
51.26% were boys. So we could assign the value of about 0.51 to the probability
of a boy in this population.
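A short simulation in R illustrates this empirical, relative frequency approach (a sketch for illustration only; the number of tosses and the random seed are arbitrary choices):

set.seed(104)  # arbitrary seed so the illustration is reproducible
tosses <- sample(c("heads", "tails"), size = 10000, replace = TRUE)
mean(tosses == "heads")  # proportion of heads: close to, but not exactly, 0.5

# Relative frequency of heads after 10, 100, 1,000 and 10,000 tosses
running <- cumsum(tosses == "heads") / seq_along(tosses)
running[c(10, 100, 1000, 10000)]  # settles down towards 0.5 as the number of tosses grows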
1.6 Classical probability and counting rules

Standard illustrations of classical probability are devices used in games of chance, such as tossing coins, rolling dice and drawing cards from a shuffled deck.
We will use these often, not because they are particularly important but because they
provide simple examples for illustrating various results in probability.
Suppose that the sample space, S, contains m equally likely outcomes, and that event A
consists of k ≤ m of these outcomes. Therefore:
P(A) = k/m = (number of outcomes in A) / (total number of outcomes in the sample space, S).
That is, the probability of A is the proportion of outcomes which belong to A out of all
possible outcomes.
In the classical case, the probability of any event can be determined by counting the
number of outcomes which belong to the event, and the total number of possible
outcomes.
Example 1.12 Rolling two dice, what is the probability that the sum of the two
scores is 5?
S = {(1, 1), (1, 2), (1, 3), (1, 4) , (1, 5), (1, 6),
(2, 1), (2, 2), (2, 3) , (2, 4), (2, 5), (2, 6),
(3, 1), (3, 2) , (3, 3), (3, 4), (3, 5), (3, 6),
(4, 1) , (4, 2), (4, 3), (4, 4), (4, 5), (4, 6),
(5, 1), (5, 2), (5, 3), (5, 4), (5, 5), (5, 6),
(6, 1), (6, 2), (6, 3), (6, 4), (6, 5), (6, 6)}.
The event of interest is A = {(1, 4), (2, 3), (3, 2), (4, 1)}, and so P(A) = 4/36 = 1/9 ≈ 0.11.
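The same answer can be obtained by brute-force enumeration in R (illustration only):

# All 36 equally likely outcomes of rolling two fair dice
outcomes <- expand.grid(die1 = 1:6, die2 = 1:6)

# Event A: the sum of the two scores is 5
in_A <- outcomes$die1 + outcomes$die2 == 5
sum(in_A)   # 4 outcomes belong to A
mean(in_A)  # P(A) = 4/36 = 1/9, using the classical probability formula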
Now that we have a way of obtaining probabilities for events in the classical case, we
can use it together with the rules of probability.
The formula P (A) = 1 − P (Ac ) is convenient when we want P (A) but the probability of
the complementary event Ac , i.e. P (Ac ), is easier to find.
Example 1.13 When rolling two fair dice, what is the probability that the sum of
the dice is greater than 3?
The complement is that the sum is at most 3, i.e. the complementary event is
A^c = {(1, 1), (1, 2), (2, 1)}. Hence P(A) = 1 − P(A^c) = 1 − 3/36 = 33/36 ≈ 0.92.
The formula:
P (A ∪ B) = P (A) + P (B) − P (A ∩ B)
says that the probability that A or B happens (or both happen) is the sum of the
probabilities of A and B, minus the probability that both A and B happen.
Example 1.14 When rolling two fair dice, what is the probability that the two
scores are equal (event A) or that the total score is greater than 10 (event B)?

Here A = {(1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6)}, so P(A) = 6/36, and B = {(5, 6), (6, 5), (6, 6)}, so P(B) = 3/36. Also A ∩ B = {(6, 6)}, so P(A ∩ B) = 1/36. Therefore:

P(A ∪ B) = P(A) + P(B) − P(A ∩ B) = 6/36 + 3/36 − 1/36 = 8/36 = 2/9.
Example 1.15 Consider a group of four people, where each pair of people is either
connected (= friends) or not. How many different patterns of connections are there
(ignoring the identities of who is friends with whom)?
The answer is 11. See the patterns in Figure 1.8.
When counting the number of ways of selecting k objects out of n, it matters:

whether the selection is with replacement (an object can be selected more than once) or without replacement (an object can be selected only once)

whether the selected set is treated as ordered or unordered.
Consider first ordered selection with replacement. Therefore:

n objects are available for selection for the 1st object in the sequence

n objects are available for selection for the 2nd object in the sequence

. . . and so on, until n objects are available for selection for the kth object in the sequence.

Hence the number of such ordered sequences is n × n × · · · × n = n^k.
Next consider ordered selection without replacement. Now:

n objects are available for selection for the 1st object in the sequence

n − 1 objects are available for selection for the 2nd object in the sequence

. . . and so on, until n − k + 1 objects are available for selection for the kth object.

Hence the number of such ordered sequences is:

n × (n − 1) × · · · × (n − k + 1).     (1.2)
Factorials
The number of ordered sets of n objects, selected without replacement from n objects,
is:
n! = n × (n − 1) × · · · × 2 × 1.
The number n! (read ‘n factorial’) is the total number of different ways in which
n objects can be arranged in an ordered sequence. This is known as the number of
permutations of n objects.
We also define 0! = 1.
Using factorials, (1.2) can also be written as:

n × (n − 1) × · · · × (n − k + 1) = n!/(n − k)!.
Suppose now that the identities of the objects in the selection matter, but the order
does not.
For example, the sequences (1, 2, 3), (1, 3, 2), (2, 1, 3), (2, 3, 1), (3, 1, 2), (3, 2, 1) are
now all treated as the same, because they all contain the elements 1, 2 and 3.
                With replacement      Without replacement
Ordered         n^k                   n!/(n − k)!
Unordered       (n+k−1)Ck             nCk = n!/(k! (n − k)!)
We have not discussed the unordered, with replacement case which is non-examinable.
It is provided here only for completeness.
Example 1.16 Suppose we have k = 3 people (Amy, Bob and Sam). How many
different sets of birthdays can they have (day and month, ignoring the year, and
pretending 29 February does not exist, so that n = 365) in the following cases?
1. It makes a difference who has which birthday (ordered), i.e. Amy (1 January),
Bob (5 May) and Sam (5 December) is different from Amy (5 May), Bob (5
December) and Sam (1 January), and different people can have the same
birthday (with replacement). The number of different sets of birthdays is:
(365)3 = 48,627,125.
2. It makes a difference who has which birthday (ordered), and different people
must have different birthdays (without replacement). The number of different
sets of birthdays is:
365!/(365 − 3)! = 365 × 364 × 363 = 48,228,180.
3. Only the dates matter, but not who has which one (unordered), i.e. Amy (1
January), Bob (5 May) and Sam (5 December) is treated as the same as Amy (5
May), Bob (5 December) and Sam (1 January), and different people must have
different birthdays (without replacement). The number of different sets of
birthdays is:
365C3 = 365!/(3! (365 − 3)!) = (365 × 364 × 363)/(3 × 2 × 1) = 8,038,030.
Example 1.17 Consider a room with r people in it. What is the probability that
at least two of them have the same birthday (call this event A)? In particular, what
is the smallest r for which P (A) > 1/2?
Assume that all days are equally likely.
Label the people 1 to r, so that we can treat them as an ordered list and talk about
person 1, person 2 etc. We want to know how many ways there are to assign
birthdays to this list of people. We note the following.
1. The number of all possible sequences of birthdays, allowing repeats (i.e. with
replacement) is (365)r .
2. The number of sequences where all birthdays are different (i.e. without
replacement) is 365!/(365 − r)!.
Here ‘1.’ is the size of the sample space, and ‘2.’ is the number of outcomes which
satisfy Ac , the complement of the case in which we are interested.
Therefore:

P(A^c) = [365!/(365 − r)!] / (365)^r = [365 × 364 × · · · × (365 − r + 1)] / (365)^r

and:

P(A) = 1 − P(A^c) = 1 − [365 × 364 × · · · × (365 − r + 1)] / (365)^r.
Values of P(A), the probability that at least two people share a birthday, can now be computed for different numbers of people r. Notably, r = 23 is the smallest number of people for which P(A) > 1/2.
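These probabilities are easily reproduced in R (a sketch; the particular values of r shown are an arbitrary selection):

# P(at least two of r people share a birthday), assuming 365 equally likely days
birthday_prob <- function(r) {
  1 - prod((365 - r + 1):365) / 365^r   # 1 - P(all r birthdays are different)
}

r_values <- c(5, 10, 20, 22, 23, 30, 50)
round(sapply(r_values, birthday_prob), 4)
# The probability first exceeds 1/2 at r = 23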
1.7 Conditional probability and Bayes’ theorem

In this section we consider:

independence

conditional probability

Bayes’ theorem

updating probabilities of events, after we learn that some other event has happened.
Independence

Two events A and B are independent if:

P(A ∩ B) = P(A) P(B).

Intuitively, independence means that:

if A happens, this does not affect the probability of B happening (and vice versa)

if you are told that A has happened, this does not give you any new information about the value of P(B) (and vice versa).
Example 1.18 Suppose we roll two dice. We assume that all combinations of the
values of them are equally likely. Define the events:
Therefore:
Example 1.19 It can be cold in London. Four impoverished teachers dress to feel
warm. Teacher A has a hat and a scarf and gloves, Teacher B only has a hat, Teacher
C only has a scarf and Teacher D only has gloves. One teacher out of the four is
selected at random. It is shown that although each pair of events H = ‘the teacher
selected has a hat’, S = ‘the teacher selected has a scarf’, and G = ‘the teacher
selected has gloves’ are independent, all three of these events are not independent.
Two teachers have a hat, two teachers have a scarf, and two teachers have gloves, so:
P(H) = 2/4 = 1/2,   P(S) = 2/4 = 1/2   and   P(G) = 2/4 = 1/2.
Only one teacher has both a hat and a scarf, so:
P(H ∩ S) = 1/4

and similarly:

P(H ∩ G) = 1/4 and P(S ∩ G) = 1/4.
From these results, we can verify that:
P (H ∩ S) = P (H) P (S)
P (H ∩ G) = P (H) P (G)
P (S ∩ G) = P (S) P (G)
and so the events are pairwise independent. However, one teacher has a hat, a scarf
and gloves, so:
P(H ∩ S ∩ G) = 1/4 ≠ P(H) P(S) P(G).
Hence the three events are not independent. If the selected teacher has a hat and a
scarf, then we know that the teacher has gloves. There is no independence for all
three events together.
Conditional probability
Consider two events A and B. Suppose you are told that B has occurred. How does
this affect the probability of event A?
The answer is given by the conditional probability of A given that B has occurred,
or the conditional probability of A given B for short, defined as:
P(A | B) = P(A ∩ B) / P(B)
assuming that P (B) > 0. The conditional probability is not defined if P (B) = 0.
Example 1.20 Suppose we roll two independent fair dice again. Consider the
following events.
These are shown in Figure 1.10. Now P (A) = 11/36 ≈ 0.31, P (B) = 15/36 and
P (A ∩ B) = 2/36. Therefore, the conditional probability of A given B is:
P(A | B) = P(A ∩ B)/P(B) = (2/36)/(15/36) = 2/15 ≈ 0.13.
Figure 1.10: The 36 outcomes from rolling two fair dice, with the events A and B marked.
Example 1.21 In Example 1.20, when we are told that the conditioning event B
has occurred, we know we are within the solid green line in Figure 1.10. So the 15
outcomes within it become the new sample space. There are 2 outcomes which
satisfy A and which are inside this new sample space, so:
P(A | B) = 2/15 = (number of cases of A within B) / (number of cases of B).
Rearranging the definition of conditional probability gives:

P(A ∩ B) = P(A | B) P(B).
That is, the probability that both A and B occur is the probability that A occurs given
that B has occurred, multiplied by the probability that B occurs.
More generally, the chain rule for events A1, A2, . . . , An is:

P(A1 ∩ A2 ∩ · · · ∩ An) = P(A1) P(A2 | A1) P(A3 | A1, A2) · · · P(An | A1, A2, . . . , An−1)

where, for example, P(A3 | A1, A2) is shorthand for P(A3 | A1 ∩ A2). The events can be taken in any order, as shown in Example 1.22.
Example 1.23 Suppose you draw 4 cards from a deck of 52 playing cards. What is
the probability of A = ‘the cards are the 4 aces (cards of rank 1)’ ?
We could calculate this using counting rules. There are 52C4 = 270,725 possible subsets of 4 different cards, and only 1 of these consists of the 4 aces. Therefore, P(A) = 1/270,725.
Let us try with conditional probabilities. Define Ai as ‘the ith card is an ace’, so
that A = A1 ∩ A2 ∩ A3 ∩ A4 . The necessary probabilities are:
P (A1 ) = 4/52 since there are initially 4 aces in the deck of 52 playing cards
P (A2 | A1 ) = 3/51. If the first card is an ace, 3 aces remain in the deck of 51
playing cards from which the second card will be drawn
P (A3 | A1 , A2 ) = 2/50
P (A4 | A1 , A2 , A3 ) = 1/49.
Putting these together with the chain rule gives:

P(A) = P(A1) P(A2 | A1) P(A3 | A1, A2) P(A4 | A1, A2, A3) = (4/52) × (3/51) × (2/50) × (1/49) = 1/270,725

as before.
We now return to probabilities of partitions like the situation shown in Figure 1.11.
Figure 1.11: On the left, a Venn diagram depicting A = A1 ∪ A2 ∪ A3 , and on the right
the ‘paths’ to A.
Both diagrams in Figure 1.11 represent the partition A = A1 ∪ A2 ∪ A3 . For the next
results, it will be convenient to use diagrams like the one on the right in Figure 1.11,
where A1 , A2 and A3 are symbolised as different ‘paths’ to A.
We now develop powerful methods of calculating sums like:
P (A) = P (A1 ) + P (A2 ) + P (A3 ).
Total probability formula

If B1, B2, . . . , BK form a partition of the sample space S, with P(Bi) > 0 for each i, then for any event A:

P(A) = P(A | B1) P(B1) + P(A | B2) P(B2) + · · · + P(A | BK) P(BK).
Figure 1.12: On the left, a Venn diagram depicting the set A and the partition of S, and
on the right the ‘paths’ to A.
Example 1.24 Any event B has the property that B and its complement B c
partition the sample space. So if we take K = 2, B1 = B and B2 = B c in the formula
of total probability, we get:
P (A) = P (A | B) P (B) + P (A | B c ) P (B c )
= P (A | B) P (B) + P (A | B c )(1 − P (B)).
Example 1.25 Suppose that 1 in 10,000 people (0.01%) has a particular disease. A
diagnostic test for the disease has 99% sensitivity. If a person has the disease, the
test will give a positive result with a probability of 0.99. The test has 99% specificity.
If a person does not have the disease, the test will give a negative result with a
probability of 0.99.
Let B denote the presence of the disease, and B c denote no disease. Let A denote a
positive test result. We want to calculate P (A).
P (A) = P (A | B) P (B) + P (A | B c ) P (B c )
= 0.99 × 0.0001 + 0.01 × 0.9999
= 0.010098.
What is the probability that we got there via, say, B1 ? In other words, what is the
conditional probability P (B1 | A)? This situation is depicted in Figure 1.14.
So we need:
P(Bj | A) = P(A ∩ Bj) / P(A)
and we already know how to get this.
Bayes’ theorem
Using the chain rule and the total probability formula, we have:
P(Bj | A) = [P(A | Bj) P(Bj)] / [P(A | B1) P(B1) + P(A | B2) P(B2) + · · · + P(A | BK) P(BK)].
Example 1.26 Continuing with Example 1.25, let B denote the presence of the
disease, B c denote no disease, and A denote a positive test result.
We want to calculate P (B | A), i.e. the probability that a person has the disease,
given that the person has received a positive test result.
The probabilities we need are P(A | B) = 0.99, P(B) = 0.0001, P(A | B^c) = 0.01 and P(B^c) = 0.9999. Therefore:

P(B | A) = P(A | B) P(B) / [P(A | B) P(B) + P(A | B^c) P(B^c)] = (0.99 × 0.0001)/0.010098 ≈ 0.0098.
Why is this so small? The reason is because most people do not have the disease and
the test has a small, but non-zero, false positive rate P (A | B c ). Therefore, most
positive test results are actually false positives.
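For those who wish to experiment, the calculation in Examples 1.25 and 1.26 can be reproduced in R (illustration only):

# Disease screening: B = person has the disease, A = positive test result
p_B      <- 0.0001  # prevalence, P(B)
p_A_B    <- 0.99    # sensitivity, P(A | B)
p_A_notB <- 0.01    # false positive rate, P(A | B^c)

# Total probability formula: P(A)
p_A <- p_A_B * p_B + p_A_notB * (1 - p_B)

# Bayes' theorem: P(B | A)
p_B_given_A <- p_A_B * p_B / p_A

c(P_A = p_A, P_B_given_A = p_B_given_A)  # approximately 0.010098 and 0.0098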
1.10 Sample examination questions
1. A county consists of three communities, A, B and C. The proportion of the county’s residents belonging to each community, and the probability that a member of each community is vaccinated, are as follows:

   Community                          A       B       C
   Proportion of residents           0.20    0.50    0.30
   Probability of being vaccinated   0.80    0.70    0.60
(a) We choose a person from the county at random. What is the probability that
the person is not vaccinated?
(b) We choose a person from the county at random. Find the probability that the
person is in community A, given the person is vaccinated.
(c) In words, briefly explain how the ‘probability of being vaccinated’ for each
community would be known in practice.
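You may find it helpful to check your numerical answers to parts (a) and (b) with a short calculation such as the following R sketch (this is not a model answer, merely a numerical check):

# Proportions of the county in each community, and vaccination probabilities
p_community  <- c(A = 0.20, B = 0.50, C = 0.30)
p_vacc_given <- c(A = 0.80, B = 0.70, C = 0.60)

# (a) Total probability formula: P(vaccinated), then P(not vaccinated)
p_vacc <- sum(p_vacc_given * p_community)
1 - p_vacc                                             # 0.31

# (b) Bayes' theorem: P(community A | vaccinated)
unname(p_vacc_given["A"] * p_community["A"] / p_vacc)  # approximately 0.232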
Solutions to the sample examination questions

2. We are given that P((A ∪ B)^c) = π1, P(A) = π2, and that A and B are independent. Hence:

P(A ∪ B) = 1 − π1   and   P(A ∪ B) = P(A) + P(B) − P(A) P(B) = π2 + P(B)(1 − π2).

Therefore:

P(B) = (1 − π1 − π2)/(1 − π2).
(c) Any reasonable answer accepted, such as relative frequency estimate, or from
health records.
Chapter 2
Discrete probability distributions
formally define a random variable and distinguish it from the values which it takes
2.3 Introduction
A random variable is a ‘mapping’ of the elementary outcomes in the sample space to
real numbers. This allows us to attach probabilities to the experimental outcomes.
Hence the concept of a random variable is that of a measurement which takes a
particular value for each possible trial (experiment). Frequently, this will be a numerical
value.
Example 2.1 Suppose we sample five people and measure their heights, hence
‘height’ is the random variable and the five (observed) values of this random variable
are the realised measurements for the heights of these five people.
Example 2.2 Suppose a fair die is thrown four times and we observe two 6s, a 3
and a 1. The random variable is the ‘score on the die’, and for these four trials it
takes the values 6, 6, 3 and 1. (In this case, since we do not know the true order in
which the values occurred, we could also say that the results were 1, 6, 3 and 6, or 1,
3, 6 and 6, or . . ..)
Discrete: Synonymous with ‘count data’, that is, as far as this course is
concerned, random variables which take non-negative integer values, such as
0, 1, 2, . . .. For example, the number of heads in n coin tosses.
2.4 Probability distribution
Example 2.4 Let X = ‘the score of a fair die’. If the die results in a 3, then this is
written as x = 3.
The probability distribution of X is:
X=x 1 2 3 4 5 6
P (X = x) 1/6 1/6 1/6 1/6 1/6 1/6
² When dealing with continuous random variables the analogous condition is integrating, rather than summing, to 1. More of this in Chapter 3.

³ At school, ‘uniforms’ are worn, i.e. all pupils wear the same clothes (possibly slight differences across genders), hence when the term ‘uniform’ is applied to a probability distribution, we have the same probability of occurrence for each sample space value.
Figure 2.1: Probability distribution for the score on a fair die in Example 2.4.
Example 2.5 Let X = ‘the number of heads when five fair coins are tossed’. The
probability distribution of X is:
X = x        0           1              2               3               4              5
P(X = x)    0.03        0.16           0.31            0.31            0.16           0.03
          = (0.5)^5   = 5 × (0.5)^5   = 10 × (0.5)^5   = 10 × (0.5)^5   = 5 × (0.5)^5   = (0.5)^5
Figure 2.2: Probability distribution for the number of heads when five fair coins are
tossed.
2.5 Binomial distribution

A Bernoulli trial has only two possible outcomes (i.e. it is dichotomous) which are
typically called ‘success’ and ‘failure’ – such as ‘heads’ and ‘tails’. We usually code
a success as 1 and a failure as 0.
Bernoulli distribution

If a random variable X takes the value 1 (a ‘success’) with probability π and the value 0 (a ‘failure’) with probability 1 − π, then X has a Bernoulli distribution with parameter π, with probability function P(X = x) = π^x (1 − π)^(1−x) for x = 0, 1, and 0 otherwise.
Example 2.6 Other potential examples of Bernoulli trials are: (i.) the sex of
new-born babies (male or female), (ii.) the classification of factory output (defective
or not defective), and (iii.) voters supporting a candidate (support or not support).
In fact, many sampling situations become Bernoulli trials if we are only interested in
classifying the result categorically in one of two ways – for example, heights of people if
we are only interested in whether or not each person is taller than 180 cm, say.
Extending this idea, if we have n successive Bernoulli trials, then we define the binomial
distribution.
Binomial distribution
X ∼ Bin(n, π)
where the terms n and π are called parameters, since the values of these define
which specific binomial distribution we have. Its probability function is:
P(X = x) = nCx π^x (1 − π)^(n−x)   for x = 0, 1, 2, . . . , n     (2.1)

and P(X = x) = 0 otherwise.
n is the number of Bernoulli trials, π is the (constant) probability of success for each
trial, P (X = x) is the probability that the number of successes in the n trials is equal
to x. That is, we are seeking to count the number of successes, and each P (X = x)
is the probability that the discrete (count) random variable X takes the value x.
(2.1) can be used to calculate probabilities for any binomial distribution, provided n
and π are both specified. Note that a binomial random variable can take n + 1 different
values, not n, since the variable measures the number of successes. The smallest number
of successes in n trials is zero (i.e. if all trials resulted in failure); the largest number of
successes is n (i.e. if all trials resulted in success); with the intervening number of
successes being 1, 2, . . . , n − 1. Therefore, there are n + 1 different values in total.
A binomial distribution arises when the following conditions hold:

there is a fixed number, n, of trials

each trial has only two possible outcomes – success and failure

the probability of success, π, is the same for each trial

the trials are independent of each other.
⁴ Read ‘∼’ as ‘is distributed as’.
2.6 Cumulative distribution functions
For discrete random variables taking non-negative integer values, the cumulative
distribution function (cdf) is:5
F (x) = P (X = 0) + P (X = 1) + P (X = 2) + · · · + P (X = x)
= p(0) + p(1) + p(2) + · · · + p(x).
It follows that we can easily find the probability function from the cumulative
distribution function, or vice versa, using this relationship. Specifically, note that:
P (X = x) = F (x) − F (x − 1).
Example 2.7 Consider ten test tubes of bacterial solution and let us suppose that
the probability of any single test tube showing bacterial growth is 0.2. Let X denote
the number of test tubes showing bacterial growth. Hence:
P(exactly 4 show growth) = P(X = 4) = 10C4 × (0.2)^4 × (0.8)^6 ≈ 0.0881

and, for example, using the complement:

P(at least 1 shows growth) = P(X ≥ 1) = 1 − P(X = 0) = 1 − (0.8)^10 ≈ 0.8926.
Note this technique also illustrates the advantage of computing the probability of an
event by calculating the probability of it not happening and subtracting this from 1.6
⁵ Note you can use either form of notation p(x) or P(X = x), whichever you prefer.

⁶ Recall P(A) = 1 − P(A^c).
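For reference, the binomial calculations in Example 2.7 can be reproduced with R's built-in binomial functions (illustration only; R is not examinable):

# X ~ Bin(10, 0.2): number of test tubes showing bacterial growth
dbinom(4, size = 10, prob = 0.2)      # P(X = 4), approximately 0.0881

# Complement approach: P(X >= 1) = 1 - P(X = 0)
1 - dbinom(0, size = 10, prob = 0.2)  # approximately 0.8926

# The cdf F(x) = P(X <= x) is available directly
pbinom(4, size = 10, prob = 0.2)      # P(X <= 4), approximately 0.9672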
F (x) = P (X ≤ x).
Example 2.8 If X ∼ Bin(2, π), then each time x reaches an integer value in the
range [0, 2], the cdf ‘jumps’ by P (X = x), until the sum of the P (X = x)s reaches 1.
This is shown in Figure 2.3.
Figure 2.3: Step function showing the cdf of X ∼ Bin(2, π) in Example 2.8.
The same pattern shown in Figure 2.3 applies to any version of Bin(n, π), or indeed to
any other distribution for which there is a largest possible integer value. For
distributions like the Poisson (discussed next) which count, but which do not have a
largest possible value, the pattern is similar, but the value 1 is never reached.
⁷ Technically speaking for n = 1, if π = 0 or 1 then P(X = π) = 1, but this means a success is impossible or certain, respectively. Hence we no longer have two possible outcomes, but one certain outcome, i.e. a failure or success, respectively.

⁸ Note that the argument x is not the same as the random variable X, nor is it the same as a realisation or value of X. This is because X is not continuous, but takes only (selected) integer values. The x value simply tells us the range of values in which we are interested.
2.7 Poisson distribution
In this situation the random variable X is the number of points in a particular unit of the medium. Such an X follows a Poisson distribution, denoted X ∼ Pois(λ), with probability function:

P(X = x) = e^(−λ) λ^x / x!   for x = 0, 1, 2, . . .

and P(X = x) = 0 otherwise, where λ is the average number of points per unit of the medium, and is known as the rate parameter. Note that, unlike the binomial distribution, there is no upper bound on the value of x.
Example 2.9 Examples of a Poisson process include (i) machine breakdowns per
unit of time, (ii) arrivals at an airport per unit of time, and (iii) flaws along a rope
per unit of length.
Example 2.10 Consider a machine which breaks down, on average, 3.2 times per
week, hence λ = 3.2 per week. The probability that it will break down exactly once
next week is:
P(X = 1) = e^(−3.2) (3.2)^1 / 1! = 0.1304.
The probability that it will break down exactly four times in the next two weeks
(hence λ is now 6.4) is:
P(X = 4) = e^(−6.4) (6.4)^4 / 4! = 0.1162.
Note that if we know λ for one unit of time (here, per week) and we want to look at
k units of time (in this example, k = 2), then we need to proportionally change λ to
reflect this, i.e. the revised rate parameter is k × λ (hence in this example the revised
λ for a two-week period is 2 × 3.2 = 6.4).
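The Poisson probabilities in Example 2.10 can likewise be checked in R (illustration only):

# Machine breakdowns: lambda = 3.2 per week
dpois(1, lambda = 3.2)  # P(X = 1) in one week, approximately 0.1304

# Over two weeks the rate is rescaled to lambda = 2 * 3.2 = 6.4
dpois(4, lambda = 6.4)  # P(X = 4) in two weeks, approximately 0.1162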
Example 2.11 Suppose we sample 100 items at random from a production line
which is providing, on average, 2% defective items. What is the probability of
exactly 3 defective items in our random sample?
First, we have to check that the relevant criteria for using the Poisson approximation
are satisfied. Indeed they are. n = 100 > 30, π = 0.02 is sufficiently small such that
nπ = 2 < 10 and x = 3 is small relative to n. Hence:
P(X = 3) = 100C3 × (0.02)^3 × (0.98)^97 = 0.1823 (the true binomial probability)

≈ e^(−2) 2^3 / 3! = 0.1804 (the Poisson approximation).
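As a quick numerical check of the approximation in Example 2.11 (illustration only):

# Exact binomial probability versus the Poisson approximation
dbinom(3, size = 100, prob = 0.02)  # approximately 0.1823
dpois(3, lambda = 100 * 0.02)       # approximately 0.1804

# The two probability functions are close across the whole range here
max(abs(dbinom(0:10, size = 100, prob = 0.02) - dpois(0:10, lambda = 2)))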
2.9 Expected value of a discrete random variable
It would be foolish to simply take the arithmetic average of all the values taken by the
random variable, as this would mean that very unlikely values (those with small
probabilities of occurrence) would receive the same weighting as very likely values
(those with large probabilities of occurrence). The obvious approach is to use the probability-weighted average of the sample space values, known as the expected value of X:

E(X) = Σ_i xi p(xi)

where the sum is over all values xi which X can take, and p(xi) = P(X = xi).
Note that the expected value is also referred to as the population mean, which can be
written as E(X) (in words ‘the expectation of the random variable X’), or µ (in words
‘the (population) mean of X’). Also, note the distinction between the sample mean, x̄,
(introduced in ST104a Statistics 1) based on observed sample values, and the
population mean, µ, based on the theoretical probability distribution.
Why this reduces to nπ is beyond the scope of this course, but the fact that
E(X) = nπ for the binomial distribution is a useful result!
Similarly, for the Poisson distribution E(X) = λ. Again, why this is so is beyond the scope of this course, but it is another useful result.
Above we have labelled the population mean as the ‘expectation’ of the random variable and introduced the expectation operator, E(·). This operator, like the summation operator Σ, is a linear operator and hence this property can be used to find the expectation of a new random variable, be it a transformation of a single random variable or a linear combination of two (or more) random variables.
Given random variables X and Y, and constants α and β (both non-zero), define T = αX ± βY. It follows that:

E(T) = E(αX ± βY) = α E(X) ± β E(Y).
More generally, for a function g(X) of a discrete random variable X which takes the values x1, x2, . . . , xN with probabilities p1, p2, . . . , pN, we have E(g(X)) = Σ_{i=1}^{N} g(xi) pi. For example:

E(ln(X)) = Σ_{i=1}^{N} ln(xi) pi   for all xi > 0

and:

E(X^2) = Σ_{i=1}^{N} xi^2 pi.
2.10 Variance of a discrete random variable

One very important average associated with a distribution is the expected value of the square of the deviation of the random variable from its mean, µ:

Var(X) = E((X − µ)^2) = Σ_{i=1}^{N} (xi − µ)^2 pi.

This can be seen to be a measure – not the only one, but the most widely used by far – of the dispersion of the distribution and is known as the (population) variance of the random variable. The (positive) square root of the variance is known as the standard deviation and, given the variance is typically denoted by σ^2, is denoted by σ.
Example 2.18 Let X represent the value shown when a fair die is thrown once.
We now compute the mean and variance of X as follows.

X = x          1      2      3      4      5      6     Total
P(X = x)      1/6    1/6    1/6    1/6    1/6    1/6      1
x P(X = x)    1/6    2/6    3/6    4/6    5/6    6/6    21/6 = 3.5
It helps to have (and to calculate) the ‘Total’ column since, for example, if a
probability, P (X = x), has been miscalculated or miscopied then the row total
will not be 1 (recall axiom 2). Therefore, this would highlight an error so, with a
little work, could be identified.
In words, ‘the (population) variance is equal to the mean of the square minus the square of the mean’, i.e. σ^2 = E(X^2) − µ^2. Rearranging gives:

E(X^2) = σ^2 + µ^2.
This representation is useful since we often want to know E(X 2 ), but start by knowing
the usual details of a distribution, i.e. µ and σ 2 .
X = x           1      2      3      4      5      6     Total
x^2 P(X = x)   1/6    4/6    9/6   16/6   25/6   36/6    91/6

Hence µ = E(X) = 3.5 and E(X^2) = 91/6, so the variance is 91/6 − (3.5)^2 = 2.92, as
before. However, this method is usually easier.
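Both calculations for the fair die can be reproduced as probability-weighted sums in R (illustration only):

# Fair die: sample space values and their probabilities
x <- 1:6
p <- rep(1/6, 6)

mu     <- sum(x * p)    # E(X) = 3.5
ex2    <- sum(x^2 * p)  # E(X^2) = 91/6
sigma2 <- ex2 - mu^2    # Var(X) = E(X^2) - mu^2, approximately 2.92

c(mean = mu, variance = sigma2)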
If X ∼ Pois(λ), then:
Var(X) = λ.
Note that for the Poisson distribution the mean and variance are equal.
As in the case of the expected value, we might want to look at linear combinations of
random variables.
Given random variables X and Y and non-zero constants α and β, by defining two new random variables U = αX and T = αX ± βY, then:¹²

Var(U) = α^2 Var(X)

and, provided X and Y are independent:¹³

Var(T) = α^2 Var(X) + β^2 Var(Y).
¹² One way to remember this is to think of ‘Var’ as a homogeneous function of degree 2, like the Cobb–Douglas utility and production functions which crop up in economics.

¹³ We have already met independent events, but we do not yet know what it means for random variables to be independent. This will be covered later.
2.11 Distributions related to the binomial distribution

If:

P(X = x) = (1 − π)^(x−1) π   for x = 1, 2, . . .

and P(X = x) = 0 otherwise, then X has a geometric distribution, denoted X ∼ Geo(π). It can be shown that for the geometric distribution E(X) = 1/π and Var(X) = (1 − π)/π^2.

If:

P(X = x) = x−1Cr−1 π^r (1 − π)^(x−r)   for x = r, r + 1, r + 2, . . .

and P(X = x) = 0 otherwise, then X has a negative binomial distribution, denoted X ∼ Neg. Bin(r, π). It can be shown that for the negative binomial distribution E(X) = r/π and Var(X) = r(1 − π)/π^2.
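A word of caution if you explore these distributions in R: the built-in functions count the number of failures rather than the total number of trials, so the arguments must be shifted relative to the definitions above (a sketch, with π = 0.3, x and r chosen arbitrarily):

p <- 0.3  # probability of success on each trial

# Geometric: X = trial on which the first success occurs
x <- 4
dgeom(x - 1, prob = p)  # R counts failures, hence the shift by 1
(1 - p)^(x - 1) * p     # matches the probability function above

# Negative binomial: X = trial on which the r-th success occurs
r <- 2; x <- 5
dnbinom(x - r, size = r, prob = p)            # shift by r failures
choose(x - 1, r - 1) * p^r * (1 - p)^(x - r)  # matches the formula above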
Sample examination questions

2. Suppose that a particle starts at the origin of the real line and moves along the line
in jumps of one unit. For each jump, the probability is π (where 0 ≤ π ≤ 1) that
the particle will jump one unit to the left, and hence the probability is 1 − π that
the particle will jump one unit to the right. Find the expected value of the position
of the particle after n jumps.
Solutions to the sample examination questions

2. Let Xi = 1 if the ith jump of the particle is one unit to the right, and let Xi = −1 if the ith jump is one unit to the left. Therefore, for i = 1, 2, . . . , n, we have:

E(Xi) = −1 × π + 1 × (1 − π) = 1 − 2π.

The position of the particle after n jumps is X1 + X2 + · · · + Xn, so its expected value is E(X1) + E(X2) + · · · + E(Xn) = n(1 − 2π).
3. We have that:

P(X = 0) = e^(−λ) λ^0 / 0! = e^(−λ) = 1/3   ⇒   λ ≈ 1.10.

Therefore:

P(X ≥ 2) = 1 − P(X ≤ 1) = 1 − e^(−1.10) (1.10)^1 / 1! − 1/3 = 0.3010.
Chapter 3
Continuous probability distributions
3.3 Introduction
So far, we have considered discrete distributions such as the binomial and Poisson.
These have dealt with count (or frequency) data, giving sample space values which are
non-negative integers. In such cases, as with other discrete distributions, there is always
a ‘gap’ between two possible values within which there is no other possible value. Hence
the probability of an event involving several possible values is just the sum of the
relevant probabilities of interest.
In contrast, a continuous-valued random variable, say X, can take any value over some
continuous range, or interval. So suppose x1 and x2 are distinct possible values of X,
then there is another possible value between them, for example the mid-point
(x1 + x2 )/2. Although in practice our measurements will obviously only have so many
decimal places (a consequence of limits to measurement accuracy), in principle it is
possible to measure continuous variables to infinitely many decimal places. Hence it is
mathematically convenient to use functions which can take any value over some defined
continuous interval.
The main transition from the discrete world of thinking in Chapter 2 is that in the
continuous world only intervals are of interest, not single (point) values.
A random variable X is continuous if:

there is an interval S ⊆ R such that the possible values of X are the point values in S

P(X = x) = p(x) = 0 for every single point value x     (3.1)

for any pair of individual point values u and v, say, in S with u < v we can always work out P(u < X < v).
A consequence of (3.1), i.e. that in the continuous world the probability of a single point value is zero, is that we can be somewhat blasé about our use of < and ≤, since P(X < x) = P(X ≤ x). Hence for any constants a and b, such that a < b, all of the following are equivalent:

P(a < X < b),   P(a ≤ X < b),   P(a < X ≤ b)   and   P(a ≤ X ≤ b).
3.4 Probability density function and cumulative distribution function
We need to be able to actually calculate various probabilities such as P(a < X < b). This is possible using either the probability density function or the cumulative distribution function.
Example 3.2 If we wanted P (1 < X < 3), say, then we would compute the area
under the curve defined by f (x) and above the x-axis interval (1, 3). This is
illustrated in Figure 3.1.
Figure 3.1: For an arbitrary pdf, P (1 < X < 3) is shown as the area under the pdf and
above the x-axis interval (1, 3).
In this way the pdf will give us the probabilities associated with any interval of interest,
but there is never any interest in wanting the probability for a point value of a
continuous random variable (which, remember, is zero). With this in mind, it is clear
that integration is very important in the theory of continuous random variables, because
of its role in determining areas. Hence the following properties for Example 3.2 should
be readily apparent:
P(1 < X < 3) = the area under f(x) above the x-axis interval (1, 3) = ∫_1^3 f(x) dx
the total area under the curve is 1, since this represents the probability of X taking
any possible value.
Any function f(x) defined on an interval S ⊆ R can be the pdf for the probability
distribution of a (continuous) random variable X, provided that it satisfies the
following two criteria.
1. f (x) ≥ 0 for all x ∈ S (since you cannot have negative probabilities – recall
axiom 1).
2. ∫_S f(x) dx = 1, where S represents the sample space of x values. Hence the total area under the curve (i.e. the total probability) is 1 – recall axiom 2.
So, if we want to calculate P (a < X < b), for any constants a and b in S such that
a < b, then:
P(a < X < b) = ∫_a^b f(x) dx.
Therefore, this integration/area approach helps explain why, for any single point u ∈ S, we have P(X = u) = 0. We can think of this probability as being the area of the (vertical) line segment from the x-axis to f(u), which is ∫_u^u f(x) dx = 0.
Just as with a discrete random variable, for a continuous random variable we want to
describe key features of the distribution. We now define various measures of location
and dispersion. The main difference from Chapter 2 is the use of integrals instead of
summations in the definitions.
σ^2 = E(X^2) − µ^2

where:

E(X^2) = ∫_S x^2 f(x) dx.
If the sample space of X has a lower bound of a then substitute a for −∞, and
if the sample space of X has an upper bound of b, then substitute b for ∞.
The mode of X is the value of X (if any) at which f (x) achieves a maximum.
Note the pdf could be multimodal.
We also have:
E(X^2) = ∫_0^1 6x^3 (1 − x) dx = 6 ∫_0^1 (x^3 − x^4) dx = 6 [x^4/4 − x^5/5]_0^1 = 0.30
hence:
Var(X) = E(X 2 ) − (E(X))2 = 0.30 − (0.50)2 = 0.05.
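The integrals in this example can also be checked numerically in R. The sketch below assumes, consistent with the integrand above, that the pdf in question is f(x) = 6x(1 − x) on [0, 1] (illustration only):

f <- function(x) 6 * x * (1 - x)  # assumed pdf for this example

integrate(f, lower = 0, upper = 1)$value              # total probability = 1
mu  <- integrate(function(x) x * f(x), 0, 1)$value    # E(X) = 0.5
ex2 <- integrate(function(x) x^2 * f(x), 0, 1)$value  # E(X^2) = 0.30
ex2 - mu^2                                            # Var(X) = 0.05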
If X is a continuous random variable and x ∈ R, then the cdf for X is the probability
that X is less than or equal to x, such that:
F(x) = P(X ≤ x) = ∫_{−∞}^{x} f(t) dt.     (3.2)
(Note the cdf is written F (x), i.e. with a capital ‘F ’, while the pdf is written f (x),
i.e. with a lower case ‘f ’.)
Hence we see an important relationship between the cdf and the pdf, that is, we obtain
the cdf by integrating the pdf from the lower bound of X (−∞ in (3.2)) to x. Therefore,
this implies we can obtain the pdf by differentiating the cdf with respect to x.
f(x) = d/dx F(x) = F′(x).
It is very important to remember these methods for obtaining the pdf (cdf) from the cdf (pdf). For instance, if we needed to work out E(X), but only had the cdf, then we would need to differentiate this first to obtain f(x) for use in ∫_{−∞}^{∞} x f(x) dx.
The following points are worth noting.
The events X = x and X < x are mutually exclusive, so by the additive law we
know that, if X is continuous:
F (x) = P (X ≤ x) = P (X < x) + P (X = x)
= P (X < x) + 0
= P (X < x).
It is important to realise that the equality F (x) = P (X < x) only holds for
continuous distributions.
3.5. Continuous uniform distribution
If X has a uniform distribution over the continuous interval [a, b], then:
f(x) = 1/(b − a)   for a ≤ x ≤ b     (3.3)

and f(x) = 0 otherwise.
We can easily check that (3.3) is a valid pdf since, clearly, f (x) ≥ 0 for all values of x,
and it integrates to 1 since:
∫_{−∞}^{∞} f(x) dx = ∫_{−∞}^{a} 0 dx + ∫_a^b 1/(b − a) dx + ∫_b^{∞} 0 dx = 0 + [x/(b − a)]_a^b + 0 = 1.
Figure 3.2: The pdf of X when X has a uniform distribution over [0, 10], showing the
region denoting P (2 < X < 6).
Example 3.4 Suppose X has a continuous uniform distribution over [0, 10], such
that 1/(b − a) = 1/(10 − 0) = 0.1. We have:
P(2 < X < 6) = ∫_2^6 0.1 dx = [0.1x]_2^6 = 0.4

and:

P(X < 8) = ∫_0^8 0.1 dx = [0.1x]_0^8 = 0.8.
Of course, for this distribution these probabilities can simply be found geometrically
as areas of appropriate rectangles, as illustrated in Figure 3.2 for P (2 < X < 6).
Also, note that geometrically we can determine the median to be 5.
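The same uniform probabilities are available directly in R (illustration only):

# X ~ Uniform[0, 10]
punif(6, min = 0, max = 10) - punif(2, min = 0, max = 10)  # P(2 < X < 6) = 0.4
punif(8, min = 0, max = 10)                                # P(X < 8) = 0.8
qunif(0.5, min = 0, max = 10)                              # median = 5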
² We have previously encountered discrete uniform distributions in Chapter 2. For example, the score when rolling a fair die.
Figure 3.3: The cdf of X when X has a uniform distribution over [0, 10].
We can only have a continuous uniform distribution for an interval of finite length,
since any rectangle of infinite length would have infinite area, not an area of 1.
The ‘full’ version of the cdf in Example 3.4 is an example of a definition by cases,
which is an important technique where we use different rules for a function
depending on which variable values we are talking about.
The graph of F (x) has a minimum of 0 (for x ≤ a, and a = 0 in Example 3.4) and
a maximum of 1 (for x ≥ b, and b = 10 in Example 3.4), which it must have since it
is a (cumulative) probability so must satisfy axioms 1 and 2. Note the cdf is
non-decreasing, as it must be; otherwise this would suggest the possibility, for
u < v say, that P (X ≤ u) > P (X ≤ v), which implies that P (u < X ≤ v) < 0, i.e.
a negative probability. This is ‘illegal’ since it would violate axiom 1! Hence all cdfs
are non-decreasing functions bounded by 0 and 1.

3.6 Exponential distribution
Exponential pdf

If X has an exponential distribution with rate parameter λ > 0, then its pdf is:

f(x) = λe^{−λx} for x ≥ 0, and f(x) = 0 otherwise.
and:

P(X < 6) = ∫_0^6 3e^{−3x} dx = [−e^{−3x}]_0^6 = e^0 − e^{−18} = 1 − e^{−18} ≈ 1.
Note these last two results were obtained using integration by parts (a
non-examinable technique).
³ Also referred to as the Gaussian distribution, after Carl Friedrich Gauss (1777–1855).
3.7. Normal distribution
The normal distribution is often used as the distribution of the error term in
standard statistical and econometric models such as linear regression. This
assumption can be, and should be, checked. We will see the distributional
assumption of normality applied in the context of analysis of variance (ANOVA) in
Chapter 8.
We are justified in assuming normality for statistics which are sample means, or
linear transformations of them.
The CLT was introduced above ‘providing the sample size is reasonably large’. In
practice a sample size of 30 or more is usually sufficient (and can be used as a
rule-of-thumb), although the distribution of X̄ may be normal for n much less than 30.
This depends on the distribution of the original (population) variable. If this population
distribution is in fact normal, then all sample means computed from it will be normal.
However, if the population distribution is very non-normal, such as the exponential,
then a sample size of (at least) 30 would be needed to justify normality.
4
Note the use of the word ‘modelled’. This is due to the distributional assumption of normality. A
normal random variable X is defined over the entire real line, i.e. −∞ < x < ∞, but we know a person
cannot have a negative height, even though the normal distribution has positive, non-zero probability
over negative values. Also, nobody is of infinite height (the world’s tallest man ever, Robert Wadlow,
was 272 cm), so clearly there is a finite upper bound to height, rather than ∞. Therefore, height does
not follow a true normal distribution, but it is a good enough approximation for modelling purposes.
Table 4 of the New Cambridge Statistical Tables lists cumulative probabilities, which
can be represented as Φ(z) = P(Z ≤ z).
We now consider some examples of working out probabilities from Z ∼ N(0, 1).
So, for P (Z > 1.2), we require the upper-tail probability shaded in red in Figure 3.5.
This is simply 1 − Φ(1.2), which is 0.1151 from Table 4.
Figure 3.5: The standard normal distribution with the total shaded area depicting the
value of P (Z > 1.2).
Figure 3.6: The standard normal distribution with the total shaded area depicting the
value of P (−1.24 < Z < 1.86).
Z = (X − µ)/σ
creates a standard normal random variable, i.e. Z ∼ N (0, 1). So to standardise X
we subtract its mean and divide by its standard deviation.
To see why, first note that any linear transformation of a normal random variable is also
normally distributed. Therefore, as X is normal, so too is Z, since the standardisation
transformation is linear in X. It remains to show that standardisation results in a
random variable with a zero mean and a unit variance. This is easy to show and is
worth remembering.
Since X ∼ N (µ, σ 2 ), then:
E(Z) = E((X − µ)/σ) = (1/σ) E(X − µ) = (1/σ)(E(X) − µ) = (1/σ)(µ − µ) = 0.
This result exploits the fact that σ is a constant, hence it can be taken outside the
‘E’ operator. Similarly:

Var(Z) = Var((X − µ)/σ) = (1/σ²) Var(X − µ) = (1/σ²) Var(X) = (1/σ²) × σ² = 1.
This result uses the fact that we must square a constant when taking it outside the
‘Var’ operator.
If X ∼ N(µ, σ²) and a and b are constants, then:

a + bX ∼ N(a + bµ, b²σ²).

If X1 and X2 are independent normal random variables, such that X1 ∼ N(µ1, σ1²)
and X2 ∼ N(µ2, σ2²), then:

X1 ± X2 ∼ N(µ1 ± µ2, σ1² + σ2²).

Note that the variances are added even when dealing with the difference between
independent random variables.
Hence any linear combination of independent normal random variables is also
normally distributed. This can be used, via the standard normal distribution, to
calculate probabilities for the original random variable.
Example 3.9 For asbestos fibres, if Y represents the length of a fibre, then the
distribution of Y is very positively-skewed and is definitely non-normal. However, if
we transform this to X = ln(Y ), then X is found to have a normal distribution. Any
Y for which such a transformation leads to a normal random variable is said to have
a log-normal distribution. Another common transformation which converts some
non-normal distributions into normal distributions is taking the square root.
Example 3.10 Suppose that X = ln(Y ) and that X ∼ N (0.5, 0.16) and we seek
P (Y > 1). It follows that, using Table 4:
P(Y > 1) = P(X > ln(1)) = P(X > 0) = P(Z > (0 − 0.5)/√0.16) = P(Z > −1.25) = Φ(1.25) = 0.8944.
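We can verify this in R (a minimal sketch using the parameters of Example 3.10):

pnorm(0, mean = 0.5, sd = 0.4, lower.tail = FALSE)        # P(X > 0) = 0.8944
plnorm(1, meanlog = 0.5, sdlog = 0.4, lower.tail = FALSE) # P(Y > 1), the same value via the log-normal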
Unfortunately, there is one small caveat. The binomial distribution is discrete, but the
normal distribution is continuous. To see why this is problematic, consider the following.
Suppose X ∼ Bin(40, 0.4). Since X is discrete, such that x = 0, 1, 2, . . . , 40, then:
P (X ≤ 4) = P (X ≤ 4.5) = P (X < 5)
since P(4 < X ≤ 4.5) = 0 and P(4.5 < X < 5) = 0 due to the ‘gaps’ in the probability
mass for this distribution. In contrast, if Y ∼ N(16, 9.6), then:

P(Y ≤ 4) < P(Y ≤ 4.5) < P(Y < 5)

since P(4 < Y < 4.5) > 0 and P(4.5 < Y < 5) > 0 because this is a continuous
distribution.
The accepted way to circumvent this problem is to use a continuity correction which
corrects for the effects of the transition from a discrete Bin(n, π) distribution to a
continuous N (nπ, nπ(1 − π)) distribution.
Continuity correction
Example 3.11 A fair coin is tossed 100 times. What is the probability of getting
more than 60 heads?
Let X be the number of heads, hence X ∼ Bin(100, 0.5). Here n > 30 and π is
moderate, hence a normal approximation to the binomial is appropriate. We use
Y ∼ N (50, 25) as the approximating distribution. So:
P(X > 60) ≈ P(Y > 60.5) = P(Z > (60.5 − 50)/√25) = P(Z > 2.10) = 0.01786.
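A quick comparison of the exact binomial probability with the normal approximation in R (a minimal sketch):

pbinom(60, size = 100, prob = 0.5, lower.tail = FALSE)  # exact P(X > 60), about 0.018
pnorm(60.5, mean = 50, sd = 5, lower.tail = FALSE)      # approximation with continuity correction, 0.01786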
2. The amount of coffee, C, dispensed into a coffee cup by a coffee machine follows a
normal distribution with mean 150 ml and standard deviation 10 ml. The coffee is
sold at the price of £1 per cup. However, the coffee cups are marked at the 137 ml
level, and any cup with coffee below this level will be given away free of charge.
The amounts of coffee dispensed in different cups are independent of each other.
(a) Find the probability that the total amount of coffee in 2 cups exceeds 280 ml.
(b) Find the probability that one cup is filled below the level of 137 ml.
(c) Find the expected income from selling one cup of coffee.
Let Y denote the number of the 20 observations which are in the interval (0.5, 1).
Calculate E(Y ) and Var(Y ).
3.12. Solutions to Sample examination questions
(b) We have:

E(X) = ∫_0^1 x f(x) dx = ∫_0^1 3x³ dx = [3x⁴/4]_0^1 = 3/4

and:

E(X²) = ∫_0^1 x² f(x) dx = ∫_0^1 3x⁴ dx = [3x⁵/5]_0^1 = 3/5.

Hence:

Var(X) = E(X²) − (E(X))² = 3/5 − (3/4)² = 3/80 = 0.0375.
2. (a) The total amount of coffee in 2 cups, T , follows a normal distribution with a
mean of:
E(T ) = E(X1 + X2 ) = E(X1 ) + E(X2 ) = 150 + 150 = 300
and, due to independence, a variance of:
Var(T ) = Var(X1 + X2 ) = Var(X1 ) + Var(X2 ) = (10)2 + (10)2 = 200.
Hence T ∼ N (300, 200). Therefore:
P(T > 280) = P(Z > (280 − 300)/√200) = P(Z > −1.41) = Φ(1.41) = 0.9207.
(b) Since C ∼ N (150, 100), one cup to be filled below the level of 137 ml has
probability:
P(C < 137) = P(Z < (137 − 150)/√100) = P(Z < −1.30) = 1 − Φ(1.30) = 0.0968.
(c) Let X denote the income from selling one cup of coffee. This is £0 if the cup is
filled below the level of 137 ml (with probability 0.0968), and is £1 otherwise
(with probability 1 − 0.0968 = 0.9032). Hence:
X=x 0 1
P (X = x) 0.0968 0.9032
Hence:

E(X) = Σ_{x=0}^{1} x p(x) = 0 × 0.0968 + 1 × 0.9032 = £0.9032.
There are two kinds of statistics, the kind you look up and the kind you make
up.
(Rex Stout)
Chapter 4
Multivariate random variables
4.3 Introduction
So far, we have considered univariate situations, that is one random variable at a time.
Now we will consider multivariate situations, that is two or more random variables at
once, and together.
In particular, we consider two somewhat different types of multivariate situations.
X = (X1, X2, . . . , Xn)′
Example 4.2 Consider a randomly selected football match in the English Premier
League (EPL), and the two random variables:

X = the number of goals scored by the home team
Y = the number of goals scored by the visiting (away) team.

Suppose both variables have possible values 0, 1, 2 and 3 (to keep this example
simple, we have recorded the small number of scores of 4 or greater also as 3).
Consider the joint distribution of (X, Y ). We use probabilities based on data from
the 2009–10 EPL season.
Suppose the values of pX,Y (x, y) = p(x, y) = P (X = x, Y = y) are the following:
Y =y
X=x 0 1 2 3
0 0.100 0.031 0.039 0.031
1 0.100 0.146 0.092 0.015
2 0.085 0.108 0.092 0.023
3 0.062 0.031 0.039 0.006
The joint probability function gives probabilities of values of (X, Y ), for example:
A 1–1 draw, which is the most probable single result, has probability
P (X = 1, Y = 1) = p(1, 1) = 0.146.
where the sum is of the values of the joint pf of (X1 , X2 , X3 , X4 ) over all possible
values of X3 and X4 .
The simplest marginal distributions are those of individual variables in the multivariate
random variable.
The marginal pf is then obtained by summing the joint pf over all the other variables.
The resulting marginal distribution is univariate, and its pf is a univariate pf.
For the bivariate distribution of (X, Y ) the univariate marginal distributions are
those of X and Y individually. Their marginal pfs are:
pX(x) = Σ_y p(x, y)   and   pY(y) = Σ_x p(x, y).
Example 4.4 Continuing with the football example introduced in Example 4.2, the
joint and marginal probability functions are:
Y =y
X=x 0 1 2 3 pX (x)
0 0.100 0.031 0.039 0.031 0.201
1 0.100 0.146 0.092 0.015 0.353
2 0.085 0.108 0.092 0.023 0.308
3 0.062 0.031 0.039 0.006 0.138
pY (y) 0.347 0.316 0.262 0.075 1.000
Even for a multivariate random variable, expected values E(Xi ), variances Var(Xi ) and
medians of individual variables are obtained from the univariate (marginal)
distributions of Xi , as defined in Chapter 2.
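As an illustration, a minimal R sketch which reproduces the marginal pfs above as row and column sums of the joint table (rows are x = 0, 1, 2, 3 and columns are y = 0, 1, 2, 3):

p <- matrix(c(0.100, 0.031, 0.039, 0.031,
              0.100, 0.146, 0.092, 0.015,
              0.085, 0.108, 0.092, 0.023,
              0.062, 0.031, 0.039, 0.006),
            nrow = 4, byrow = TRUE)        # joint pf of (X, Y) from the football example

rowSums(p)   # marginal pf of X: 0.201 0.353 0.308 0.138
colSums(p)   # marginal pf of Y: 0.347 0.316 0.262 0.075
sum(p)       # the joint pf sums to 1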
4.6. Conditional distributions
Let x be one possible value of X, for which pX(x) > 0. The conditional
distribution of Y given that X = x is the discrete probability distribution with
the pf:

pY|X(y | x) = pX,Y(x, y) / pX(x)

for any value y.
Example 4.6 Recall that in the football example the joint and marginal pfs were:
Y =y
X=x 0 1 2 3 pX (x)
0 0.100 0.031 0.039 0.031 0.201
1 0.100 0.146 0.092 0.015 0.353
2 0.085 0.108 0.092 0.023 0.308
3 0.062 0.031 0.039 0.006 0.138
pY (y) 0.347 0.316 0.262 0.075 1.000
We can now calculate the conditional pf of Y given X = x for each x, i.e. of away
goals given home goals. For example:
pY |X (y | x) when y is:
X=x 0 1 2 3 Sum
0 0.498 0.154 0.194 0.154 1.00
1 0.283 0.414 0.261 0.042 1.00
2 0.276 0.351 0.299 0.075 1.00
3 0.449 0.225 0.283 0.043 1.00
if the home team scores 0 goals, the probability that the visiting team scores 1
goal is pY |X (1 | 0) = 0.154
if the home team scores 1 goal, the probability that the visiting team wins the
match is pY |X (2 | 1) + pY |X (3 | 1) = 0.261 + 0.042 = 0.303.
The conditional distribution and pf of X given Y = y (for any y such that pY (y) > 0) is
defined similarly, with the roles of X and Y reversed:
pX|Y(x | y) = pX,Y(x, y) / pY(y)

for any value x.
Conditional distributions are general and are not limited to the bivariate case. If X
and/or Y are vectors of random variables, the conditional pf of Y given X = x is:
pY|X(y | x) = pX,Y(x, y) / pX(x)
where pX,Y (x, y) is the joint pf of the random vector (X, Y), and pX (x) is the marginal
pf of the random vector X.
So, if the home team scores 0 goals, the expected number of goals by the visiting
team is EY |X (Y | 0) = 1.00.
EY |X (Y | x) for x = 1, 2 and 3 are obtained similarly.
Here X is the number of goals by the home team, and Y is the number of goals by
the visiting team:
pY |X (y | x) when y is:
X=x 0 1 2 3 EY |X (Y | x)
0 0.498 0.154 0.194 0.154 1.00
1 0.283 0.414 0.261 0.042 1.06
2 0.276 0.351 0.299 0.075 1.17
3 0.449 0.225 0.283 0.043 0.92
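Continuing the R sketch above (and assuming the joint matrix p defined earlier), the conditional pfs of Y given X = x and the conditional expectations E(Y | X = x) can be obtained directly:

p_cond <- sweep(p, 1, rowSums(p), "/")   # divide each row of the joint pf by its row sum
round(p_cond, 3)                          # conditional pfs of Y given X = 0, 1, 2, 3

as.vector(p_cond %*% 0:3)                 # E(Y | X = x): approximately 1.00, 1.06, 1.17, 0.92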
(Figure: expected away goals E(Y | x) plotted against home goals x.)
4.7.1 Covariance
Definition of covariance

The covariance of two random variables X and Y is defined as:

Cov(X, Y) = E(XY) − E(X) E(Y).

(Note that these involve expected values of products of two random variables, which
have not been defined yet.)
Properties of covariance
The covariance of a random variable with itself is the variance of the random
variable:
Cov(X, X) = E(XX) − E(X) E(X) = E(X 2 ) − (E(X))2 = Var(X).
4.7.2 Correlation
Definition of correlation

The correlation of two random variables X and Y is defined as:

Corr(X, Y) = Cov(X, Y) / √(Var(X) Var(Y)).

When Cov(X, Y) = 0, then Corr(X, Y) = 0. When this is the case, we say that X
and Y are uncorrelated.
Correlation and covariance are measures of the strength of the linear (‘straight-line’)
association between X and Y .
The further the correlation is from 0, the stronger is the linear association. The most
extreme possible values of correlation are −1 and +1, which are obtained when Y is an
exact linear function of X.
Corr(X, Y ) = +1 when Y = aX + b with a > 0.
Corr(X, Y ) = −1 when Y = aX + b with a < 0.
Example 4.8 Recall the joint pf pX,Y (x, y) in the football example:
Y =y
X=x 0 1 2 3
0 0 0 0 0
0.100 0.031 0.039 0.031
1 0 1 2 3
0.100 0.146 0.092 0.015
2 0 2 4 6
0.085 0.108 0.092 0.023
3 0 3 6 9
0.062 0.031 0.039 0.006
Here, the numbers in bold are the values of xy for each combination of x and y.
From these and their probabilities, we can derive the probability distribution of XY .
For example:
XY = xy 0 1 2 3 4 6 9
P (XY = xy) 0.448 0.146 0.200 0.046 0.092 0.062 0.006
Hence:

E(X) = 1.383,  E(Y) = 1.065  and  E(XY) = 1.478

also:

E(X²) = 2.827  and  E(Y²) = 2.039

hence:

Var(X) = 2.827 − (1.383)² = 0.9143

and:

Var(Y) = 2.039 − (1.065)² = 0.9048.

Therefore, the covariance of X and Y is:

Cov(X, Y) = E(XY) − E(X) E(Y) = 1.478 − 1.383 × 1.065 ≈ 0.005

and hence Corr(X, Y) ≈ 0.005/√(0.9143 × 0.9048) ≈ 0.006. The numbers of goals scored by the home and visiting teams are very nearly
uncorrelated (i.e. not linearly associated).
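Continuing the R sketch (and again assuming the joint matrix p from before), the covariance and correlation can be computed directly from the joint pf:

x <- 0:3; y <- 0:3
px <- rowSums(p); py <- colSums(p)

EX  <- sum(x * px);  EY <- sum(y * py)
EXY <- sum(outer(x, y) * p)              # E(XY) from the joint pf

cov_xy  <- EXY - EX * EY                 # about 0.005
corr_xy <- cov_xy / sqrt(sum(x^2 * px) - EX^2) / sqrt(sum(y^2 * py) - EY^2)
corr_xy                                  # close to 0, i.e. nearly uncorrelated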
4.8 Independent random variables

What if the conditional distribution of Y given X = x is the same as the marginal
distribution of Y, i.e. if:

pY|X(y | x) = pX,Y(x, y)/pX(x) = pY(y)   for all x and y

so that knowing the value of X does not help to predict Y? This implies that
pX,Y(x, y) = pX(x) pY(y) for all x and y. More generally, discrete random variables
X1, X2, . . . , Xn are independent if and only if their joint pf is:

p(x1, x2, . . . , xn) = p1(x1) p2(x2) · · · pn(xn)
for all numbers x1 , x2 , . . . , xn , where p1 (x1 ), p2 (x2 ), . . . , pn (xn ) are the univariate
marginal pfs of X1 , X2 , . . . , Xn , respectively.
Similarly, continuous random variables X1, X2, . . . , Xn are independent if and only
if their joint pdf is:

f(x1, x2, . . . , xn) = f1(x1) f2(x2) · · · fn(xn)

for all x1, x2, . . . , xn, where f1(x1), f2(x2), . . . , fn(xn) are the univariate marginal pdfs
of X1, X2, . . . , Xn, respectively.
If two random variables are independent, they are also uncorrelated, i.e. we have
Cov(X, Y) = 0 and hence Corr(X, Y) = 0.
The reverse is not true, i.e. two random variables can be dependent even when their
correlation is 0. This can happen when the dependence is non-linear.
f(xi) = (1/√(2πσ²)) exp(−(xi − µ)²/(2σ²))
Example 4.12 In the football example, the sum Z = X + Y is the total number of
goals scored in a match.
Its probability function is obtained from the joint pf pX,Y (x, y), that is:
Z=z 0 1 2 3 4 5 6
pZ (z) 0.100 0.131 0.270 0.293 0.138 0.062 0.006
However, what can we say about such distributions in general, in cases where we cannot
derive them as easily?
4.9. Sums of random variables
If X1, X2, . . . , Xn are independent random variables, then E(Σ Xi) = Σ E(Xi) and
Var(Σ Xi) = Σ Var(Xi). In particular, for n = 2:

E(X1 + X2) = E(X1) + E(X2)   and   Var(X1 + X2) = Var(X1) + Var(X2).

These results also hold whenever Cov(Xi, Xj) = 0 for all i ≠ j, even if the random
variables are not independent.
An easy proof that the mean and variance of X ∼ Bin(n, π) are E(X) = nπ and
Var(X) = nπ(1 − π) is as follows. Write X = X1 + X2 + · · · + Xn, where X1, X2, . . . , Xn
are independent Bernoulli(π) random variables with E(Xi) = π and Var(Xi) = π(1 − π)
for each i. Applying the above results for sums of independent random variables
immediately gives E(X) = nπ and Var(X) = nπ(1 − π).
All sums (linear combinations) of normally distributed random variables are also
normally distributed.
Suppose X1 , X2 , . . . , Xn are normally distributed random variables, with Xi ∼ N (µi , σi2 )
for i = 1, 2, . . . , n, and a1 , a2 , . . . , an and b are constants, then:
Σ_{i=1}^{n} ai Xi + b ∼ N(µ, σ²)

where:

µ = Σ_{i=1}^{n} ai µi + b   and   σ² = Σ_{i=1}^{n} ai² σi² + 2 ΣΣ_{i&lt;j} ai aj Cov(Xi, Xj).

If the Xi s are independent (or just uncorrelated), i.e. if Cov(Xi, Xj) = 0 for all i ≠ j,
the variance simplifies to σ² = Σ_{i=1}^{n} ai² σi².
i=1
Example 4.14 Suppose that in the population of English people aged 16 or over:
the heights of men (in cm) follow a normal distribution with mean 174.9 and
standard deviation 7.39
the heights of women (in cm) follow a normal distribution with mean 161.3 and
standard deviation 6.85.
Suppose we select one man and one woman at random and independently of each
other. Denote the man’s height by X and the woman’s height by Y . What is the
probability that the man is at most 10 cm taller than the woman?
In other words, what is the probability that the difference between X and Y is at
most 10?
Since X and Y are independent we have:

D = X − Y ∼ N(µX − µY, σX² + σY²) = N(174.9 − 161.3, (7.39)² + (6.85)²) = N(13.6, (10.08)²).

The probability we need is:

P(D ≤ 10) = P((D − 13.6)/10.08 ≤ (10 − 13.6)/10.08) = P(Z ≤ −0.36) = P(Z ≥ 0.36) = 0.3594
using Table 4 of the New Cambridge Statistical Tables.
The probability that a randomly selected man is at most 10 cm taller than a
randomly selected woman is about 0.3594.
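A quick check of Example 4.14 in R (a minimal sketch):

mu    <- 174.9 - 161.3              # mean of D = X - Y
sigma <- sqrt(7.39^2 + 6.85^2)      # sd of D (variances add for a difference)
pnorm(10, mean = mu, sd = sigma)    # P(D <= 10), about 0.36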
X = −1 X=0 X=1
Y = −1 0.09 0.16 0.15
Y =0 0.09 0.08 0.03
Y =1 0.12 0.16 0.12
(a) Determine the marginal distributions and calculate the expected values of X
and Y , respectively.
(d) Define U = |X| and V = Y . Calculate E(U ) and the covariance of U and V .
Are U and V correlated?
2. Suppose X and Y are two independent random variables with the following
probability distributions:
X=x −1 0 1 Y =y −1 0 1
and
P (X = x) 0.30 0.40 0.30 P (Y = y) 0.40 0.20 0.40
S = X2 + Y 2 and T = X + Y.
ii. Cov(S, T ).
(c) Are S and T uncorrelated? Are S and T independent? Justify your answers.
4.13 Solutions to Sample examination questions

1. (a) The marginal distribution of X is:

X         −1     0      1
pX(x)     0.30   0.40   0.30
The marginal distribution of Y is:
Y −1 0 1
pY (y) 0.40 0.20 0.40
Hence:

E(X) = Σ_x x pX(x) = (−1 × 0.30) + (0 × 0.40) + (1 × 0.30) = 0

and:

E(Y) = Σ_y y pY(y) = (−1 × 0.40) + (0 × 0.20) + (1 × 0.40) = 0.
(b) We have:
Therefore:
P(X = −1 | Y = 0) = 0.09/0.20 = 0.45
P(X = 0 | Y = 0) = 0.08/0.20 = 0.40
P(X = 1 | Y = 0) = 0.03/0.20 = 0.15
and therefore:

P(X = 0 | X + Y = 1) = 0.16/0.19 = 16/19

P(X = 1 | X + Y = 1) = 0.03/0.19 = 3/19

and therefore:

E(X | X + Y = 1) = 0 × 16/19 + 1 × 3/19 = 3/19 = 0.1579.
U =0 U =1
V = −1 0.16 0.24
V =0 0.08 0.12
V =1 0.16 0.24
We then have that P(U = 0) = 0.16 + 0.08 + 0.16 = 0.40 and also that
P(U = 1) = 1 − P(U = 0) = 0.60. Also, we have that P(V = −1) = 0.40,
P(V = 0) = 0.20 and P(V = 1) = 0.40. So:

E(U) = 0 × 0.40 + 1 × 0.60 = 0.60   and   E(V) = −1 × 0.40 + 0 × 0.20 + 1 × 0.40 = 0

and:

E(UV) = −1 × 1 × 0.24 + 1 × 1 × 0.24 = 0.

Hence Cov(U, V) = E(UV) − E(U) E(V) = 0 − 0.60 × 0 = 0, so U and V are uncorrelated.
Var(T) = E(T²) = Σ_{t=−2}^{2} t² p(t) = 4(0.12) + 1(0.22) + 0(0.32) + 1(0.22) + 4(0.12) = 1.40.

iii. We have:

E(S | T = 0) = Σ_s s pS|T(s | t = 0) = 0 × 0.08/0.32 + 2 × 0.24/0.32 = 1.5.
(c) The random variables S and T are uncorrelated, since Cov(S, T) = 0. However,
since P(T = −2) > 0 and P(S = 0) > 0, but:

P({T = −2} ∩ {S = 0}) = 0 ≠ P(T = −2) P(S = 0)

this is sufficient to show that S and T are not independent.
Chapter 5
Sampling distributions of statistics
prove and apply the results for the mean and variance of the sampling distribution
of the sample mean when a random sample is drawn with replacement
state the central limit theorem and recall when the limit is likely to provide a good
approximation to the distribution of the sample mean.
5.3 Introduction
Suppose we have a sample of n observations of a random variable X:
{X1 , X2 , . . . , Xn }.
We use f (x) to denote both the pdf of a continuous random variable, and the pf of
a discrete random variable.
The parameter(s) of a distribution are generally denoted as θ. For example, for the
Poisson distribution θ stands for λ, and for the normal distribution θ stands for
(µ, σ 2 ).
Parameters are often included in the notation: f (x; θ) denotes the pf/pdf of a
distribution with parameter(s) θ, and F (x; θ) is its cdf.
For simplicity, we may often use phrases like ‘distribution f (x; θ)’ or ‘distribution
F (x; θ)’ when we mean ‘distribution with the pf/pdf f (x; θ)’ and ‘distribution with the
cdf F (x; θ)’, respectively.
The simplest assumptions about the joint distribution of the sample are as follows.
We will assume this most of the time from now. So you will see many examples and
questions which begin something like:
Not all problems can be seen as IID random samples of a single random variable. There
are other possibilities, which you will see more of in the future.
the sample variance S² = Σ_{i=1}^{n} (Xi − X̄)²/(n − 1) and standard deviation S = √S²
Here we focus on single (univariate) statistics. More generally, we could also consider
vectors of statistics, i.e. multivariate statistics.
Here is one such random sample (with values rounded to 2 decimal places):
6.28 5.22 4.19 3.56 4.15 4.11 4.03 5.81 5.43 6.09
4.98 4.11 5.55 3.95 4.97 5.68 5.66 3.37 4.98 6.58
For this random sample, the values of our statistics are:
x̄ = 4.94
s2 = 0.90
maxx = 6.58.
Here is another such random sample (with values rounded to 2 decimal places):
5.44 6.14 4.91 5.63 3.89 4.17 5.79 5.33 5.09 3.90
5.47 6.62 6.43 5.84 6.19 5.63 3.61 5.49 4.55 4.27
For this sample, the values of our statistics are:
x̄ = 5.22
s² = 0.80
max x = 6.62.
The sampling distribution of a statistic is the distribution of the values of the statistic
in (infinitely) many repeated samples. However, typically we only have one sample
which was actually observed. Therefore, the sampling distribution seems like an
essentially hypothetical concept.
Nevertheless, it is possible to derive the forms of sampling distributions of statistics
under different assumptions about the sampling schemes and population distribution
f (x; θ).
There are two main ways of doing this.
Example 5.3 Consider again a random sample of size n = 20 from the population
X ∼ N (5, 1), and the statistics X̄, S 2 and maxX .
Figures 5.1, 5.2 and 5.3 show histograms of the statistics for these 10,000
random samples.
We now consider deriving the exact sampling distribution. Here this is possible. For
a random sample of size n from N (µ, σ 2 ) we have:
X̄ ∼ N(µ, σ²/n),   (n − 1)S²/σ² ∼ χ²_{n−1}

and the pdf of max X is n[FX(x)]^{n−1} fX(x), where FX(x) and fX(x) are the cdf and pdf of X ∼ N(µ, σ²), respectively.
Curves of the densities of these distributions are also shown in Figures 5.1, 5.2 and
5.3.
5.6 Sample mean from a normal population

If X1, X2, . . . , Xn are independent random variables and a1, a2, . . . , an are constants, then:

E(Σ_{i=1}^{n} ai Xi) = Σ_{i=1}^{n} ai E(Xi)

and:

Var(Σ_{i=1}^{n} ai Xi) = Σ_{i=1}^{n} ai² Var(Xi).
For a random sample, all Xi s are independent and E(Xi) = E(X) is the same for all of
them, since the Xi s are identically distributed. X̄ = Σ_i Xi/n is of the form Σ_i ai Xi,
with ai = 1/n for all i = 1, 2, . . . , n.
Therefore:

E(X̄) = Σ_{i=1}^{n} (1/n) E(X) = n × (1/n) E(X) = E(X)

and:

Var(X̄) = Σ_{i=1}^{n} (1/n²) Var(X) = n × (1/n²) Var(X) = Var(X)/n.
So the mean and variance of X̄ are E(X) and Var(X)/n, respectively, for a random
sample from any population distribution of X. What about the form of the sampling
distribution of X̄?
This depends on the distribution of X, and is not generally known. However, when the
distribution of X is normal, we do know that the sampling distribution of X̄ is also
normal.
Suppose that {X1 , X2 , . . . , Xn } is a random sample from a normal distribution with
mean µ and variance σ 2 , then:
X̄ ∼ N(µ, σ²/n).
For example, the pdf drawn on the histogram in Figure 5.1 is that of N (5, 1/20).
We have E(X̄) = E(X) = µ.
√
We also have Var(X̄) = Var(X)/n = σ 2 /n, and hence also sd(X̄) = σ/ n.
More interestingly, the sampling variance gets smaller when the sample size n
increases.
Example 5.4 Suppose that the heights (in cm) of men (aged over 16) in a
population follow a normal distribution with some unknown mean µ and a known
standard deviation of 7.39.
(Figure: the pdf of X̄ for sample sizes n = 5, 20 and 100.)
We plan to select a random sample of n men from the population, and measure their
heights. How large should n be so that there is a probability of at least 0.95 that the
sample mean X̄ will be within 1 cm of the population mean µ?
Here X ∼ N(µ, (7.39)²), so X̄ ∼ N(µ, (7.39/√n)²). What we need is the smallest n
such that:

P(|X̄ − µ| ≤ 1) ≥ 0.95.

So:

P(|X̄ − µ| ≤ 1) ≥ 0.95
P(−1 ≤ X̄ − µ ≤ 1) ≥ 0.95
P(−1/(7.39/√n) ≤ (X̄ − µ)/(7.39/√n) ≤ 1/(7.39/√n)) ≥ 0.95
P(−√n/7.39 ≤ Z ≤ √n/7.39) ≥ 0.95
P(Z > √n/7.39) &lt; 0.05/2 = 0.025

where Z ∼ N(0, 1). From Table 4 of the New Cambridge Statistical Tables, we see
that the smallest z which satisfies P(Z > z) &lt; 0.025 is z = 1.97. Therefore:

√n/7.39 ≥ 1.97  ⇔  n ≥ (7.39 × 1.97)² = 211.9.
Therefore, n should be at least 212.
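A quick numerical check in R (a minimal sketch):

n <- 212
pnorm(1, mean = 0, sd = 7.39 / sqrt(n)) - pnorm(-1, mean = 0, sd = 7.39 / sqrt(n))
# probability that |Xbar - mu| <= 1, just over 0.95 for n = 212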
It may appear that the CLT is still somewhat limited, in that it applies only to sample
means calculated from random (IID) samples. However, this is not really true, for two
main reasons.
There are more general versions of the CLT which do not require the observations
Xi to be IID.
Even the basic version applies very widely, when we realise that the ‘X’ can also be
a function of the original variables in the data. For example, if X and Y are
random variables in the sample, we can also apply the CLT to:
Σ_{i=1}^{n} ln(Xi)/n   or   Σ_{i=1}^{n} Xi Yi/n.
Therefore, the CLT can also be used to derive sampling distributions for many statistics
which do not initially look at all like X̄ for a single random variable in an IID sample.
You may get to do this in future courses.
The larger the sample size n, the better the normal approximation provided by the CLT
is. In practice, we have various rules-of-thumb for what is ‘large enough’ for the
approximation to be ‘accurate enough’. This also depends on the population
distribution of Xi. For example:

Example 5.5 In the first case, we simulate 10,000 independent random samples of
sizes:
n = 1, 5, 10, 30, 100 and 1,000
from the Exp(0.25) distribution (for which µ = 4 and σ² = 16). This is clearly a
skewed distribution, as shown by the histogram for n = 1 in Figure 5.5.
10,000 independent random samples of each size were generated. Histograms of the
values of X̄ in these random samples are shown in Figure 5.5. Each plot also shows
the pdf of the approximating normal distribution, N (4, 16/n). The normal
approximation is reasonably good already for n = 30, very good for n = 100, and
practically perfect for n = 1,000.
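A minimal R sketch of this kind of simulation, for a single sample size (here n = 30):

set.seed(1)
n <- 30
xbar <- replicate(10000, mean(rexp(n, rate = 0.25)))      # 10,000 sample means

hist(xbar, freq = FALSE, main = "Sampling distribution of the sample mean")
curve(dnorm(x, mean = 4, sd = sqrt(16 / n)), add = TRUE)  # CLT approximation N(4, 16/n)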
Example 5.6 In the second case, we simulate 10,000 independent random samples
of sizes:
n = 1, 10, 30, 50, 100 and 1,000
from the Bernoulli(0.2) distribution (for which µ = 0.2 and σ 2 = 0.16).
Here the distribution of Xi itself is not even continuous, and has only two possible
values, 0 and 1. Nevertheless, the sampling distribution of X̄ can be very
well-approximated by the normal distribution, when n is large enough.
Note that since here Xi = 1 or Xi = 0 for all i, X̄ = Σ_{i=1}^{n} Xi/n = m/n, where m is the
number of observations for which Xi = 1. In other words, X̄ is the sample
proportion of the value X = 1.
The normal approximation is clearly very bad for small n, but reasonably good
already for n = 50, as shown by the histograms in Figure 5.6.
Figure 5.5: Sampling distributions of X̄ for various n when sampling from the Exp(0.25)
distribution.
The χ2k distribution is a continuous distribution, which can take values of x ≥ 0. Its
mean and variance are:
Figure 5.6: Sampling distributions of X̄ for various n when sampling from the
Bernoulli(0.2) distribution.
E(X) = k
Var(X) = 2k.
where:

Γ(α) = ∫_0^∞ x^{α−1} e^{−x} dx
is the gamma function, which is defined for all α > 0. (Note the formula of the pdf of
X ∼ χ2k is not examinable.)
The shape of the pdf depends on the degrees of freedom k, as illustrated in Figure 5.7.
In most applications of the χ2 distribution the appropriate value of k is known, in which
case it does not need to be estimated from data.
If X1 , X2 , . . . , Xm are independent random variables and Xi ∼ χ2ki , then their sum is
also χ2 -distributed where the individual degrees of freedom are added, such that:
X1 + X2 + · · · + Xm ∼ χ²_{k1 + k2 + · · · + km}.
(Figure 5.7: pdfs of the χ²k distribution for k = 1, 2, 4, 6, 10, 20, 30 and 40.)
If {X1, X2, . . . , Xn} is a random sample from the population N(µ, σ²), and S² is the sample variance, then:

(n − 1)S²/σ² ∼ χ²_{n−1}.
This result is used to derive basic tools of statistical inference for both µ and σ 2 for the
normal distribution.
In exercises and the examination, you will need a table of some probabilities for the χ2
distribution. Table 8 of the New Cambridge Statistical Tables shows the following
information.
The rows correspond to different degrees of freedom k (denoted in the table by ν).
The table shows values of k up to 100.
The numbers in the table are values of x such that P (X > x) = α for the k and α
in that row and column.
Example 5.7 Consider two numbers in the ‘ν = 5’ row, the 9.236 in the ‘α = 0.10
(P = 10)’ column and the 11.07 in the ‘α = 0.05 (P = 5)’ column. These mean that
for X ∼ χ25 we have:
P(X > 9.236) = 0.10   and   P(X > 11.07) = 0.05.
These also provide bounds for probabilities of other values. For example, since 10.0
is between 9.236 and 11.07, we can conclude that 0.05 &lt; P(X > 10.0) &lt; 0.10.
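The corresponding quantities in R (a minimal sketch):

qchisq(0.90, df = 5)                      # 9.236
qchisq(0.95, df = 5)                      # 11.07
pchisq(10, df = 5, lower.tail = FALSE)    # P(X > 10), between 0.05 and 0.10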
Suppose Z ∼ N (0, 1), X ∼ χ2k , and Z and X are independent. The distribution of
the random variable:
T = Z / √(X/k)
is the t distribution with k degrees of freedom. This is denoted T ∼ tk or
T ∼ t(k). The distribution is also known as ‘Student’s t distribution’.
(Figure: pdfs of the tk distribution for k = 1, 3, 8 and 20, compared with N(0, 1).)
For any finite value of k, the tk distribution has heavier tails than the standard
normal distribution, i.e. tk places more probability on values far from 0 than
N (0, 1) does.
For T ∼ tk we have E(T) = 0 for k > 1, and:

Var(T) = k/(k − 2)   for k > 2.
This means that for t1 neither E(T ) nor Var(T ) exist, and for t2 , Var(T ) does not exist.
In exercises and the examination, you will need a table of some probabilities for the t
distribution. Table 10 of the New Cambridge Statistical Tables shows the following
information.
The rows correspond to different degrees of freedom k (denoted in the table by ν).
The table shows values of k up to 120, and then ‘∞’, which is N (0, 1).
If you need a tk distribution for which k is not in the table, use the nearest value or
use interpolation.
The numbers in the table are values of t such that P (T > t) = α for the k and α in
that row and column.
Example 5.8 Consider the number 2.132 in the ‘ν = 4’ row, and the ‘α = 0.05
(P = 5)’ column. This means that for T ∼ t4 we have P(T > 2.132) = 0.05.
The table also provides bounds for other probabilities. For example, the number in
the ‘α = 0.025 (P = 2.5)’ column is 2.776, so P (T > 2.776) = 0.025. Since
2.132 < 2.5 < 2.776, we know that 0.025 < P (T > 2.5) < 0.05.
Results for left-tail probabilities P (T < t) = α can also be obtained, because the t
distribution is symmetric around 0. This means that P (T < t) = P (T > −t). For
example:
P (T < −2.132) = P (T > 2.132) = 0.05
and P (T < −2.5) < 0.05 since P (T > 2.5) < 0.05.
This is the same trick we used for the standard normal distribution.
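These values can be confirmed in R (a minimal sketch):

qt(0.95, df = 4)                      # 2.132
pt(2.5, df = 4, lower.tail = FALSE)   # P(T > 2.5), between 0.025 and 0.05
pt(-2.132, df = 4)                    # P(T < -2.132) = 0.05, by symmetry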
Let U and V be two independent random variables, where U ∼ χ2p and V ∼ χ2k .
The distribution of:
F = (U/p) / (V/k)
is the F distribution with degrees of freedom (p, k), denoted F ∼ Fp, k or
F ∼ F (p, k).
(Figure: pdfs of the F distribution with degrees of freedom (10, 3), (10, 10) and (10, 50).)
For F ∼ Fp, k , E(F ) = k/(k − 2), for k > 2. If F ∼ Fp, k , then 1/F ∼ Fk, p . If T ∼ tk ,
then T 2 ∼ F1, k .
Tables of F distributions will be needed for some purposes. They will be available in the
examination. We now consider how to use them in the following example.
Example 5.9 Here we practise use of Table A.3 of the Dougherty Statistical Tables
to obtain critical values for the F distribution.
Table A.3 can be used to find the top 100αth percentile of the Fν1 , ν2 distribution for
α = 0.05, 0.01 and 0.001.
For example, for ν1 = 3 and ν2 = 5, then:
P(F3, 5 > 5.41) = 0.05   and   P(F3, 5 > 12.06) = 0.01
and:
P (F3, 5 > 33.20) = 0.001.
To find the bottom 100αth percentile, we note that F1−α, ν1 , ν2 = 1/Fα, ν2 , ν1 . So, for
ν1 = 3 and ν2 = 5, we have:
P(F3, 5 &lt; 1/F0.05, 5, 3) = P(F3, 5 &lt; 1/9.01) = P(F3, 5 &lt; 0.111) = 0.05

P(F3, 5 &lt; 1/F0.01, 5, 3) = P(F3, 5 &lt; 1/28.24) = P(F3, 5 &lt; 0.035) = 0.01

and:

P(F3, 5 &lt; 1/F0.001, 5, 3) = P(F3, 5 &lt; 1/134.58) = P(F3, 5 &lt; 0.007) = 0.001.
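The same percentiles can be obtained in R (a minimal sketch):

qf(0.95, df1 = 3, df2 = 5)     # upper 5% point of F(3, 5)
qf(0.999, df1 = 3, df2 = 5)    # upper 0.1% point, 33.20
qf(0.05, df1 = 3, df2 = 5)     # lower 5% point, 0.111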
2. Suppose Xi ∼ N (0, 9), for i = 1, 2, 3, 4. Assume all these random variables are
independent. Derive the value of k in each of the following.
(a) P(X1 + 6X2 &lt; k) = 0.3974.

(b) P(Σ_{i=1}^{4} Xi² &lt; k) = 0.90.

(c) P(X1 > (k(X2² + X3²))^{1/2}) = 0.10.

5.12 Solutions to Sample examination questions
Therefore, by independence:
X̄ − Ȳ ∼ N (2, 10)
so:

P(X̄ − Ȳ > 0) = P(Z > −2/√10) = P(Z > −0.63) = 0.7357.
2. (a) Since X1 + 6X2 ∼ N(0, 9 + 36 × 9) = N(0, 333), we have:

P(X1 + 6X2 &lt; k) = P(Z &lt; k/√333) = 0.3974.

Since, using Table 4 of the New Cambridge Statistical Tables, we can deduce
that Φ(−0.26) = 0.3974, we have:

k/√333 = −0.26  ⇒  k = −4.7446.
(b) Xi/√9 ∼ N(0, 1), and so Xi²/9 ∼ χ²1. Hence we have that Σ_{i=1}^{4} Xi²/9 ∼ χ²4.
Therefore, using Table 8 of the New Cambridge Statistical Tables, we have:

P(Σ_{i=1}^{4} Xi² &lt; k) = P(X &lt; k/9) = 0.90  ⇒  k/9 = 7.779  ⇒  k = 70.011

where X ∼ χ²4.
(c) We have:

P(X1 > (k(X2² + X3²))^{1/2}) = P((X1/√9)/√((X2² + X3²)/(9 × 2)) > √2 × √k) = P(T > √2 × √k) = 0.10

where T ∼ t2. From Table 10 of the New Cambridge Statistical Tables,
√2 × √k = 1.886, hence k = 1.7785.
Chapter 6
Estimator properties
6.3 Introduction
Reiterating previous discussions, one of the main uses of statistics, and of sampling, is
to estimate the value of some unknown population characteristic (parameter).1 Given a
relevant set of data (a random sample) drawn from the population, the problem is to
perform a calculation using these data values in order to arrive at a value which, in some
sense, comes ‘close’ to the unknown population parameter which we wish to estimate.
Whatever it is that we are trying to estimate, this (numerical) statistic which we
calculate is known as a point estimate, and the general name for making such value
estimates is ‘point estimation’. Note that, in principle, we might want to estimate any
characteristic of the (true) population distribution, but some will be far more common
than others.
The statistic used to obtain a point estimate is known as an estimator, and Chapter 5
derived the sampling distribution of arguably the most common estimator, i.e. the
sample mean, X̄, which is used as our preferred estimator of the population mean, µ.
1
Clearly, if the population parameter is known, then there is no need to estimate it!
Bias of an estimator

The bias of an estimator θ̂ of θ is defined as Bias(θ̂) = E(θ̂) − θ.² An estimator is:

positively biased if E(θ̂) − θ > 0

unbiased if E(θ̂) − θ = 0

negatively biased if E(θ̂) − θ &lt; 0.

² The ‘hat’ notation is frequently deployed by statisticians to denote an estimator of the symbol
beneath the hat. So, for example, λ̂ denotes an estimator of the Poisson rate parameter, λ.
In words, the expected value of the estimator is the true parameter being estimated, i.e.
on average, under repeated sampling, the estimator correctly estimates θ.
We view bias as a ‘bad’ thing, so, other things being equal, the smaller an estimator’s
bias the better.
P = X̄ ∼ N(µ, σ²/n) = N(π, π(1 − π)/n)   (approximately).
It is clear that in both (6.2) and (6.3) increasing the sample size n decreases the
estimator’s variance (and hence the standard error), so in turn increases the precision of
the estimator.3 We conclude that variance is also a ‘bad’ thing so, other things being
equal, the smaller an estimator’s variance the better.
A popular quality metric for assessing how ‘good’ an estimator is based on the
estimator’s average squared error is the ‘mean squared error’.
The mean squared error (MSE) of an estimator is the average squared error.
Formally, this is defined as:
MSE(θ̂) = E((θ̂ − θ)²).   (6.4)
It is possible to decompose this into components involving both the bias and the
variance of an estimator. Recall that Var(X) = E(X²) − (E(X))² for any random variable X.
Also, note that for any constant k, Var(X ± k) = Var(X), hence adding or subtracting a
constant has no effect on the variance of a random variable. Noting that the true
parameter θ is some (unknown) constant,4 it immediately follows, by setting
X = θ̂ − θ in this expression, that:

MSE(θ̂) = E((θ̂ − θ)²) = Var(θ̂) + (Bias(θ̂))².   (6.5)

³ Remember, however, that this increased precision comes at a cost – namely the increased expenditure
on data collection.
⁴ Even though θ is an unknown constant, it is known to be a constant!
It is the form of the MSE given by (6.5), rather than (6.4), which we will use in practice.
We have already established that both the bias and the variance of an estimator are
‘bad’ things, so the MSE (being the sum of a bad thing and a bad thing squared) can
also be viewed as a ‘bad’ thing.5 Therefore, when faced with several competing
estimators, we prefer the estimator with the smallest MSE.
So, although an unbiased estimator is intuitively appealing, it is perfectly possible that
a biased estimator might be preferred if the ‘cost’ of the bias is offset by a substantial
reduction in the variance. Hence the MSE provides us with a formal criterion to assess
the trade-off between the bias and variance of different estimators.
Suppose {X1, X2, . . . , Xn} is a random sample from a population with mean µ and
variance σ², and consider three estimators of µ: T1 = X̄, T2 = (X1 + Xn)/2 and T3 = X̄ + 3.
We have:

E(T1) = E(X̄) = µ   and   Var(T1) = Var(X̄) = σ²/n.

So MSE(T1) = σ²/n. Moving to T2, note:

E(T2) = E((X1 + Xn)/2) = (E(X1) + E(Xn))/2 = (µ + µ)/2 = µ

and:

Var(T2) = Var((X1 + Xn)/2) = (Var(X1) + Var(Xn))/2² = 2σ²/4 = σ²/2.

So T2 is also an unbiased estimator of µ, hence MSE(T2) = σ²/2. Finally, consider
T3, noting:

E(T3) = E(X̄ + 3) = E(X̄) + 3 = µ + 3
⁵ Or, for that matter, a ‘very bad’ thing!
and:

Var(T3) = Var(X̄ + 3) = Var(X̄) = σ²/n.

So T3 is a biased estimator of µ, with a bias of:

Bias(T3) = E(T3) − µ = µ + 3 − µ = 3

hence MSE(T3) = σ²/n + 3² = σ²/n + 9.
We seek the estimator with the smallest MSE. Clearly, MSE(T1 ) < MSE(T3 ) so we
can eliminate T3. Now comparing T1 with T2, we note that MSE(T1) = σ²/n ≤ σ²/2 = MSE(T2)
whenever n ≥ 2, so we prefer T1 (the two are equivalent only when n = 2).

For any unbiased estimator θ̂ of θ, since Bias(θ̂) = 0 we have:

MSE(θ̂) = Var(θ̂) + (Bias(θ̂))² = Var(θ̂) + 0² = Var(θ̂).
So, minimising the MSE for unbiased estimators is the same as choosing the estimator
with the smallest variance, hence we term such an estimator the minimum variance
unbiased estimator. Therefore, if we had two unbiased estimators of θ, say θb1 and θb2 ,
then we prefer θb1 if Var(θb1 ) < Var(θb2 ). If this is the case, then θb1 is called the more
efficient estimator.
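As an illustration of these ideas, the following is a minimal R simulation sketch which compares the three estimators T1, T2 and T3 from the example above (assuming, for concreteness, a N(10, 4) population and n = 20; these values are illustrative only):

set.seed(2)
mu <- 10; sigma <- 2; n <- 20

sim <- replicate(10000, {
  x <- rnorm(n, mean = mu, sd = sigma)
  c(T1 = mean(x), T2 = (x[1] + x[n]) / 2, T3 = mean(x) + 3)
})

rowMeans((sim - mu)^2)   # estimated MSEs: about sigma^2/n, sigma^2/2 and sigma^2/n + 9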
Examples of unbiased estimators of parameters include the sample mean X̄ (for a
population mean µ), the sample variance S² (for a population variance σ²)⁶ and the
sample proportion (for a population proportion π).

However, a function of an unbiased estimator is not, in general, an unbiased estimator
of the same function of the parameter. For example, if θ̂ is an unbiased estimator of θ
with Var(θ̂) > 0, then:

E(θ̂²) = Var(θ̂) + θ² > θ²

so θ̂² is a positively-biased estimator of θ².

⁶ This justifies use of the n − 1 divisor when computing the sample variance, since this results in an
unbiased estimator of σ².
Unbiased estimators of functions of parameters, when the parameter is drawn from the
population distribution, can, however, often be found by an appropriate adjustment.
Example 6.2 Suppose we have a random sample of n values from N (µ, σ 2 ), and we
wish to find an unbiased estimator of µ2 . If we try X̄ 2 , then:
E(X̄²) = Var(X̄) + (E(X̄))² = σ²/n + µ² ≠ µ².

However, we know E(S²) = σ², so by combining this with the above, it follows that:

E(X̄² − S²/n) = E(X̄²) − E(S²/n) = σ²/n + µ² − σ²/n = µ².
6.8. Sample examination questions
(a) Consider the estimators T1 = X and T2 = 2X 2 /3. Show that they are both
unbiased estimators of α.
(a) For each of the three estimators above, determine whether or not they are
unbiased estimators of λ.
1. (a) We have:

E(T1) = Σ_{x=0}^{2} x p(x) = 0 × (1 − 3α/4) + 1 × (α/2) + 2 × (α/4) = α

and:

E(T2) = Σ_{x=0}^{2} (2x²/3) p(x) = 0 × (1 − 3α/4) + (2/3) × (α/2) + (8/3) × (α/4) = α/3 + 2α/3 = α
(b) Since T1 and T2 are unbiased estimators we prefer the one with the smallest
variance. Since E(T1 ) = E(T2 ) = α, this is equivalent to choosing the estimator
with the smaller value of E(T12 ) and E(T22 ). We have:
E(T1²) = Σ_{x=0}^{2} x² p(x) = 0 × (1 − 3α/4) + 1² × (α/2) + 2² × (α/4) = α/2 + 4α/4 = 3α/2
and:

E(T2²) = Σ_{x=0}^{2} (4x⁴/9) p(x) = 0 × (1 − 3α/4) + (4/9) × (α/2) + (64/9) × (α/4) = 4α/18 + 64α/36 = 2α

so we prefer T1.
2. (a) Using the formula sheet, E(X) = λ and E(Y/3) = E(Y)/3 = 3λ/3 = λ, and:

E((X + Y)/4) = (E(X) + E(Y))/4 = (λ + 3λ)/4 = λ

hence these are all unbiased estimators of λ.
(b) Because each estimator is unbiased, minimising the mean squared error is
equivalent to minimising the variance. Again, using the formula sheet, due to
independence of X and Y we have that:
Var(X) = λ,   Var(Y/3) = Var(Y)/9 = 3λ/9 = λ/3

and:

Var((X + Y)/4) = (Var(X) + Var(Y))/16 = (λ + 3λ)/16 = λ/4.
Of these three unbiased estimators we prefer the one with the smallest
variance, hence we prefer (X + Y )/4.
Chapter 7
Point estimation
7.3 Introduction
We have previously seen a selection of families of theoretical probability distributions.
Some of these were discrete distributions, such as the Bernoulli, binomial and Poisson
distributions (seen in Chapter 2), while others were continuous, such as the exponential
and normal distributions (seen in Chapter 3). These are ‘families’ of distributions in
that the different members of these families vary in terms of the values of the
parameter(s) of the distribution.
Let θ be the general notation for the parameter of a probability distribution. If its value
is unknown, we need to estimate it using an estimator. In general, how should we find
an estimator of θ in a practical situation? There are three conventional methods:
method of moments estimation
least squares estimation
maximum likelihood estimation.
7.4 Method of moments (MM) estimation
Let {X1 , X2 , . . . , Xn } be a random sample from a population with cdf F (x; θ), i.e.
a cdf which depends on θ. Suppose θ has p components (for example, for a normal
population N (µ, σ 2 ), p = 2; for a Poisson population with parameter λ, p = 1).
Let:
µk = µk (θ) = E(X k )
denote the kth population moment, for k = 1, 2, . . .. Therefore, µk depends on the
unknown parameter θ, as everything else about the distribution F (x; θ) is known.
Denote the kth sample moment by:

Mk = (1/n) Σ_{i=1}^{n} Xi^k = (X1^k + X2^k + · · · + Xn^k)/n.

The method of moments estimator (MME) θ̂ is then obtained by solving the p equations:

µk(θ̂) = Mk   for k = 1, 2, . . . , p.
For example, the MME of σ² based on the first two moments is σ̂² = M2 − M1² = (1/n) Σ_{i=1}^{n} (Xi − X̄)², for which:

E(σ̂²) = E(X²) − E(X̄²) = σ² + µ² − (σ²/n + µ²) = (n − 1)σ²/n.
Since:

E(σ̂²) − σ² = −σ²/n &lt; 0

σ̂² is a negatively-biased estimator of σ².
The sample variance, defined as:

S² = (1/(n − 1)) Σ_{i=1}^{n} (Xi − X̄)²

is an unbiased estimator of σ².
Note the MME does not use any information on F (x; θ) beyond the moments.
The idea is that Mk should be pretty close to µk when n is sufficiently large. In fact:
Mk = (1/n) Σ_{i=1}^{n} Xi^k

converges to:

µk = E(X^k)
as n → ∞. This is due to the law of large numbers (LLN). We illustrate this
phenomenon by simulation using R.
Example 7.3 For N (2, 4), we have µ1 = 2 and µ2 = 8. We use the sample moments
M1 and M2 as estimators of µ1 and µ2 , respectively. Note how the sample moments
converge to the population moments as the sample size increases.
For a sample of size n = 10, we obtained m1 = 0.5145838 and m2 = 2.171881.
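A minimal R sketch of this kind of simulation, showing the sample moments approaching the population moments as n grows:

set.seed(3)
for (n in c(10, 100, 10000)) {
  x <- rnorm(n, mean = 2, sd = 2)        # sample from N(2, 4)
  cat(n, mean(x), mean(x^2), "\n")       # M1 and M2 approach 2 and 8
}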
7.5 Least squares (LS) estimation

The estimator X̄ is also the least squares estimator (LSE) of µ, defined as the value of a
which minimises the sum of squares:

S = Σ_{i=1}^{n} (Xi − a)².
To see this, write S = Σ_{i=1}^{n} (Xi − a)² = Σ_{i=1}^{n} (Xi − X̄)² + n(X̄ − a)², where all terms are
non-negative; then the value of a for which S is minimised is when n(X̄ − a)² = 0, i.e. µ̂ = a = X̄.
Estimator accuracy

MSE(µ̂) = E((µ̂ − µ)²) = σ²/n.

In order to determine the distribution of µ̂ we require knowledge of the underlying
distribution. Even if the relevant knowledge is available, one may only compute the
exact distribution of µ̂ explicitly for a limited number of cases.
By the central limit theorem, as n → ∞, we have:
P((X̄ − µ)/(σ/√n) ≤ z) → Φ(z)
for any z, where Φ(z) is the cdf of N (0, 1), i.e. when n is large, X̄ ∼ N (µ, σ 2 /n)
approximately.
Some remarks are the following.
Example 7.5 Suppose that you are given independent observations y1 , y2 and y3
such that:
y1 = 3α + β + ε1
y2 = α − 2β + ε2
y3 = −α + 2β + ε3 .
We have to minimise:

S = Σ_{i=1}^{3} εi² = (y1 − 3α − β)² + (y2 − α + 2β)² + (y3 + α − 2β)².

We have:

∂S/∂α = −6(y1 − 3α − β) − 2(y2 − α + 2β) + 2(y3 + α − 2β) = 22α − 2β − 2(3y1 + y2 − y3)

and:

∂S/∂β = −2(y1 − 3α − β) + 4(y2 − α + 2β) − 4(y3 + α − 2β) = −2α + 18β − 2(y1 − 2y2 + 2y3).

The estimators α̂ and β̂ are the solutions of the equations ∂S/∂α = 0 and
∂S/∂β = 0. Hence:

22α̂ − 2β̂ = 6y1 + 2y2 − 2y3

and:

−2α̂ + 18β̂ = 2y1 − 4y2 + 4y3.

Solving yields:

α̂ = (4y1 + y2 − y3)/14   and   β̂ = (2y1 − 3y2 + 3y3)/14.

They are unbiased estimators since:

E(α̂) = E((4y1 + y2 − y3)/14) = (12α + 4β + α − 2β + α − 2β)/14 = α

and:

E(β̂) = E((2y1 − 3y2 + 3y3)/14) = (6α + 2β − 3α + 6β − 3α + 6β)/14 = β.
7.6. Maximum likelihood (ML) estimation
Example 7.6 Suppose we toss a coin 10 times, and record the number of ‘heads’ as
a random variable X. Therefore:
X ∼ Bin(10, π)
Nevertheless, π = 0.8 is the most likely, or ‘maximally’ likely value of the parameter.
Why do we think ‘π = 0.8’ is most likely?
Let:
L(π) = P(X = 8) = (10!/(8! 2!)) π⁸ (1 − π)².
Since x = 8 is the event which occurred in the experiment, this probability would be
very large. Figure 7.1 shows a plot of L(π) as a function of π.
The most likely value of π should make this probability as large as possible. This
value is taken as the maximum likelihood estimate of π.
Maximising L(π) is equivalent to maximising the log-likelihood:

l(π) = ln L(π) = constant + 8 ln(π) + 2 ln(1 − π).

Setting:

d l(π)/dπ = 8/π − 2/(1 − π) = 0

we obtain the maximum likelihood estimate π̂ = 0.8.
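A minimal R sketch which maximises this likelihood numerically:

L <- function(p) dbinom(8, size = 10, prob = p)            # likelihood of pi given x = 8
optimise(L, interval = c(0, 1), maximum = TRUE)$maximum    # about 0.8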
θ̂ = θ̂(X1, X2, . . . , Xn).
The likelihood function reflects the information about the unknown parameter θ in
the data {X1 , X2 , . . . , Xn }.
iii. It is often more convenient to use the log-likelihood function¹ denoted as:

l(θ) = ln L(θ) = Σ_{i=1}^{n} ln f(Xi; θ)

and the MLE is then the value θ̂ which maximises l(θ), i.e. θ̂ = arg max_θ l(θ).

iv. For a smooth likelihood function, the MLE is often the solution of the equation:

d l(θ)/dθ = 0.
vi. Unlike the MME or LSE, the MLE uses all the information about the population
distribution. It is often more efficient (i.e. more accurate) than the MME or LSE.
1
Throughout where ‘log’ is used in log-likelihood functions, it will be assumed to be the logarithm to
the base e, i.e. the natural logarithm.
The log-likelihood function is l(λ) = 2n ln(λ) − nλX̄ + c, where c = ln(Π_{i=1}^{n} Xi) is a
constant.

Setting:

d l(λ)/dλ = 2n/λ̂ − nX̄ = 0

we obtain λ̂ = 2/X̄.

Note the MLE λ̂ may be obtained from maximising L(λ) directly. However, it is
much easier to work with l(λ) instead.
Case I: σ 2 is known.
The likelihood function is:
L(µ) = (2πσ²)^{−n/2} exp(−(1/(2σ²)) Σ_{i=1}^{n} (Xi − µ)²)
     = (2πσ²)^{−n/2} exp(−(1/(2σ²)) Σ_{i=1}^{n} (Xi − X̄)²) exp(−(n/(2σ²)) (X̄ − µ)²).
It follows from the lemma below that σ̂² = Σ_{i=1}^{n} (Xi − X̄)²/n.
(b) Is the estimator of θ derived in part (a) biased or unbiased? Justify your
answer.
(c) Determine the variance of the estimator derived in part (a) and check whether
it is a consistent estimator of θ.
Use this sample to calculate the method of moments estimate of θ using the
estimator derived in part (a), and sketch the above probability density
function based on this estimate.
(b) Show that the estimator derived in part (a) is mean square consistent for θ.
Hint: You may use the fact that E(X) = αθ and Var(X) = αθ2 .
7.10 Solutions to Sample examination questions

MSE(θ̂) = Var(θ̂) + (Bias(θ̂))² = (3 − θ²)/n + 0² → 0

as n → ∞, hence θ̂ is a consistent estimator of θ.
(d) The sample mean is x̄ = 0.24, hence θ̂ = 3x̄ = 3 × 0.24 = 0.72. Therefore:

f(x; θ̂) = 0.50 + 0.36x for −1 ≤ x ≤ 1, and 0 otherwise.

A sketch of f(x; θ̂) is a straight line rising from f(−1; θ̂) = 0.14 to f(1; θ̂) = 0.86 over
[−1, 1], and zero outside this interval.
2. (a) For α > 0 known, due to independence the likelihood function is:

L(θ) = Π_{i=1}^{n} f(xi; α, θ) = (1/(((α − 1)!)^n θ^{nα})) (Π_{i=1}^{n} xi)^{α−1} exp(−(1/θ) Σ_{i=1}^{n} xi).
The group was alarmed to find that if you are a labourer, cleaner or dock
worker, you are twice as likely to die than a member of the professional classes.
(The Sunday Times, 31 August 1980)
Chapter 8
Analysis of variance (ANOVA)
restate and interpret the models for one-way and two-way analysis of variance
perform hypothesis tests and construct confidence intervals for one-way and
two-way analysis of variance
8.3 Introduction
Analysis of variance (ANOVA) is a popular tool which has an applicability and power
which we can only start to appreciate in this course. The idea of analysis of variance is
to investigate how variation in structured data can be split into pieces associated with
components of that structure. We look only at one-way and two-way classifications,
providing tests and confidence intervals which are widely used in practice.
Example 8.1 To assess the teaching quality of class teachers, a random sample of
6 examination marks was selected from each of three classes. The examination marks
for each class are listed in the table below.
Can we infer from these data that there is no significant difference in the
examination marks among all three classes?
Suppose examination marks from Class j follow the distribution N (µj , σ 2 ), for
j = 1, 2, 3. So we assume examination marks are normally distributed with the same
variance in each class, but possibly different means.
We need to test the hypothesis:
H0 : µ1 = µ2 = µ3 .
The data form a 6 × 3 array. Denote the data point at the (i, j)th position as Xij .
We compute the column means first where the jth column mean is:

X̄·j = (X1j + X2j + · · · + Xnj,j)/nj
Observation
1 2 3 4 5 6 Mean
Class 1 85 75 82 76 71 85 79
Class 2 71 75 73 74 69 82 74
Class 3 59 64 62 69 75 67 66
Note that similar problems arise from other practical situations. For example:
If H0 is true, the three observed sample means x̄·1 , x̄·2 and x̄·3 should be very close to
each other, i.e. all of them should be close to the overall sample mean, x̄, which is:
x̄ = (x̄·1 + x̄·2 + x̄·3)/3 = (79 + 74 + 66)/3 = 73
Hence we would reject H0 for large values of T . (Note t = 0 if x̄·1 = x̄·2 = x̄·3 which
would mean that there is no variation at all between the sample means. In this case
all the sample means would equal x̄.)
It remains to determine the distribution of T under H0 .
8.5 One-way analysis of variance
where n = Σ_{j=1}^{k} nj is the total number of observations across all k groups.

The total variation is:

Σ_{j=1}^{k} Σ_{i=1}^{nj} (Xij − X̄)²

with n − 1 degrees of freedom, and the within-groups variation is:

W = Σ_{j=1}^{k} Σ_{i=1}^{nj} (Xij − X̄·j)²

with n − k = Σ_{j=1}^{k} (nj − 1) degrees of freedom.
The ANOVA decomposition is:

Σ_{j=1}^{k} Σ_{i=1}^{nj} (Xij − X̄)² = Σ_{j=1}^{k} nj (X̄·j − X̄)² + Σ_{j=1}^{k} Σ_{i=1}^{nj} (Xij − X̄·j)².
We have already discussed the jth sample mean and overall sample mean. The total
variation is a measure of the overall (total) variability in the data from all k groups
about the overall sample mean. The ANOVA decomposition decomposes this into two
components: between-groups variation (which is attributable to the factor level) and
within-groups variation (which is attributable to the variation within each group and is
assumed to be the same σ 2 for each group).
Some remarks are the following.
ii. W/σ² = Σ_{j=1}^{k} Σ_{i=1}^{nj} (Xij − X̄·j)²/σ² ∼ χ²_{n−k}.

iii. Under H0: µ1 = · · · = µk, then B/σ² = Σ_{j=1}^{k} nj (X̄·j − X̄)²/σ² ∼ χ²_{k−1}.

Under H0 the test statistic F = (B/(k − 1))/(W/(n − k)) ∼ F_{k−1, n−k}, and we reject H0 at the
100α% significance level if f > Fα, k−1, n−k, where Fα, k−1, n−k is the top 100αth percentile of the
F_{k−1, n−k} distribution, i.e. P(F > Fα, k−1, n−k) = α, and f is the observed test statistic value.
p-value = P (F > f ).
It is clear that f > Fα, k−1, n−k if and only if the p-value < α, as we must reach the same
conclusion regardless of whether we use the critical value approach or the p-value
approach to hypothesis testing.
Example 8.2 Continuing with Example 8.1, for the given data, k = 3,
n1 = n2 = n3 = 6, n = n1 + n2 + n3 = 18, x̄·1 = 79, x̄·2 = 74, x̄·3 = 66 and x̄ = 73.
The sample variances are calculated to be s1² = 34, s2² = 20 and s3² = 32. Therefore:

b = Σ_{j=1}^{3} 6(x̄·j − x̄)² = 6 × ((79 − 73)² + (74 − 73)² + (66 − 73)²) = 516

and:

w = Σ_{j=1}^{3} Σ_{i=1}^{6} (xij − x̄·j)² = Σ_{j=1}^{3} Σ_{i=1}^{6} xij² − 6 Σ_{j=1}^{3} x̄·j² = Σ_{j=1}^{3} 5sj² = 5 × (34 + 20 + 32) = 430.

Hence:

f = (b/(k − 1))/(w/(n − k)) = (516/2)/(430/15) = 9.
Under H0 : µ1 = µ2 = µ3 , F ∼ Fk−1, n−k = F2, 15 . Since F0.01, 2, 15 = 6.36 < 9, using
Table A.3 of the Dougherty Statistical Tables, we reject H0 at the 1% significance
level. In fact the p-value (using a computer) is P (F > 9) = 0.003. Therefore, we
conclude that there is a significant difference among the mean examination marks
across the three classes.
Source DF SS MS F p-value
Class 2 516 258 9 0.003
Error 15 430 28.67
Total 17 946
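The same one-way ANOVA can be reproduced in R (a minimal sketch, entering the marks from Example 8.1 directly):

marks <- c(85, 75, 82, 76, 71, 85,
           71, 75, 73, 74, 69, 82,
           59, 64, 62, 69, 75, 67)
class <- factor(rep(c("Class 1", "Class 2", "Class 3"), each = 6))

anova(aov(marks ~ class))   # F = 9 on (2, 15) degrees of freedom, p-value about 0.003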
> attach(UhAh)
> summary(UhAh)
Frequency Department
Min. : 0.00 English :100
1st Qu.: 4.00 Mathematics :100
Median : 5.00 Political Science:100
Mean : 5.48
3rd Qu.: 7.00
Max. :11.00
> xbar <- tapply(Frequency, Department, mean)
> s <- tapply(Frequency, Department, sd)
> n <- tapply(Frequency, Department, length)
> sem <- s/sqrt(n)
> list(xbar,s,n,sem)
[[1]]
English Mathematics Political Science
5.81 5.30 5.33
[[2]]
English Mathematics Political Science
2.493203 2.012587 1.974867
[[3]]
English Mathematics Political Science
100 100 100
[[4]]
English Mathematics Political Science
0.2493203 0.2012587 0.1974867
Surprisingly, professors in English say ‘uh’ or ‘ah’ more on average than those in
Mathematics and Political Science (compare the sample means of 5.81, 5.30 and
5.33), but the difference seems small. However, we need to formally test whether the
(seemingly small) differences are statistically significant.
Using the data, R produces the following one-way ANOVA table:
Response: Frequency
Df Sum Sq Mean Sq F value Pr(>F)
Department 2 16.38 8.1900 1.7344 0.1783
Residuals 297 1402.50 4.7222
Since the p-value for the F test is 0.1783, we cannot reject the following hypothesis:
H0 : µ1 = µ2 = µ3 .
An estimator of σ is:

σ̂ = S = √(W/(n − k)).

A 95% confidence interval for µj is:

X̄·j ± t0.025, n−k × S/√nj   for j = 1, 2, . . . , k
where t0.025, n−k is the top 2.5th percentile of the Student’s tn−k distribution, which
can be obtained from Table 10 of the New Cambridge Statistical Tables.
Example 8.4 Assuming a common variance for each group, from the preceding
output in Example 8.3 we see that:
σ̂ = s = √(1,402.50/297) = √4.72 = 2.173.
Since t0.025, 297 ≈ t0.025, ∞ = 1.96, using Table 10 of the New Cambridge Statistical
Tables, we obtain the following 95% confidence intervals for µ1 , µ2 and µ3 ,
respectively:
j = 1:  5.81 ± 1.96 × 2.173/√100  ⇒  (5.38, 6.24)
j = 2:  5.30 ± 1.96 × 2.173/√100  ⇒  (4.87, 5.73)
j = 3:  5.33 ± 1.96 × 2.173/√100  ⇒  (4.90, 5.76).
Example 8.5 In early 2001, the American economy was slowing down and
companies were laying off workers. A poll conducted during February 2001 asked a
random sample of workers how long (in months) it would be before they faced
significant financial hardship if they lost their jobs, with the data available in the file
‘GallupPoll.csv’ (available on the VLE). They are classified into four groups
according to their incomes. Below is part of the R output of the descriptive statistics
of the classified data. Can we infer that income group has a significant impact on the
mean length of time before facing financial hardship?
Hardship Income.group
Min. : 0.00 $20 to 30K: 81
1st Qu.: 8.00 $30 to 50K:114
Median :15.00 Over $50K : 39
Mean :16.11 Under $20K: 67
3rd Qu.:22.00
Max. :50.00
[[2]]
$20 to 30K $30 to 50K Over $50K Under $20K
9.233260 9.507464 11.029099 8.087043
[[3]]
$20 to 30K $30 to 50K Over $50K Under $20K
81 114 39 67
[[4]]
$20 to 30K $30 to 50K Over $50K Under $20K
1.0259178 0.8904556 1.7660693 0.9879896
Inspection of the sample means suggests that there is a difference between income
groups, but we need to conduct a one-way ANOVA test to see whether the
differences are statistically significant.
We apply one-way ANOVA to test whether the means in the k = 4 groups are equal,
i.e. H0 : µ1 = µ2 = µ3 = µ4 , from highest to lowest income groups.
We have n1 = 39, n2 = 114, n3 = 81 and n4 = 67, hence:
n = Σ_{j=1}^{k} nj = 39 + 114 + 81 + 67 = 301.

Also x̄·1 = 22.21, x̄·2 = 18.456, x̄·3 = 15.49, x̄·4 = 9.313 and:

x̄ = (1/n) Σ_{j=1}^{k} nj x̄·j = (39 × 22.21 + 114 × 18.456 + 81 × 15.49 + 67 × 9.313)/301 = 16.109.
Now:
b = Σⱼ₌₁ᵏ nj (x̄·j − x̄)² = 5,205.097.
We have s1² = (11.03)² = 121.661, s2² = (9.507)² = 90.383, s3² = (9.23)² = 85.193 and s4² = (8.087)² = 65.400, hence:
w = Σⱼ₌₁ᵏ Σᵢ₌₁^nj (xij − x̄·j)² = Σⱼ₌₁ᵏ (nj − 1) sj² = 38 × 121.661 + 113 × 90.383 + 80 × 85.193 + 66 × 65.400 = 25,968.24.
Consequently:
f = (b/(k − 1)) / (w/(n − k)) = (5,205.097/3) / (25,968.24/(301 − 4)) = 19.84.
Under H0 , F ∼ Fk−1, n−k = F3, 297 . Since F0.01, 3, 297 ≈ 3.85 < 19.84, we reject H0 at
the 1% significance level, i.e. there is strong evidence that income group has a
significant impact on the mean length of time before facing financial hardship.
The pooled estimate of σ is:
s = √(w/(n − k)) = √(25,968.24/(301 − 4)) = 9.351.
Hence a 95% confidence interval for each group mean is:
x̄·j ± t0.025, 297 × s/√nj = x̄·j ± 1.96 × 9.351/√nj = x̄·j ± 18.328/√nj.
For the highest and lowest income groups, for example, these are:
22.21 ± 18.328/√39  ⇒  (19.28, 25.14)
and:
9.313 ± 18.328/√67  ⇒  (7.07, 11.55).
Notice that these two confidence intervals do not overlap, which is consistent with
our conclusion that there is a difference between the group means.
R output for the data is:
Response: Hardship
Df Sum Sq Mean Sq F value Pr(>F)
Income.group 3 5202.1 1734.03 19.828 9.636e-12 ***
Residuals 297 25973.3 87.45
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Note that minor differences are due to rounding errors in calculations.
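As a hedged sketch, output of this kind could be produced along the following lines, assuming the file 'GallupPoll.csv' contains the columns Hardship and Income.group shown in the descriptive output above:

gallup <- read.csv("GallupPoll.csv")                  # data available on the VLE
anova(lm(Hardship ~ Income.group, data = gallup))     # one-way ANOVA table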
8.7 Two-way analysis of variance
In total, there are n = r × c observations. We now consider the conditions to make the
parameters µ, γi and βj identifiable for i = 1, 2, . . . , r and j = 1, 2, . . . , c. The conditions
are:
γ1 + γ2 + · · · + γr = 0 and β1 + β2 + · · · + βc = 0.
The (two-way) ANOVA decomposition of the total sum of squares is:
Σᵢ₌₁ʳ Σⱼ₌₁ᶜ (Xij − X̄)² = c Σᵢ₌₁ʳ (X̄i· − X̄)² + r Σⱼ₌₁ᶜ (X̄·j − X̄)² + Σᵢ₌₁ʳ Σⱼ₌₁ᶜ (Xij − X̄i· − X̄·j + X̄)².
The total variation is a measure of the overall (total) variability in the data and the
(two-way) ANOVA decomposition decomposes this into three components:
between-blocks variation (which is attributable to the row factor level),
between-treatments variation (which is attributable to the column factor level) and
residual variation (which is attributable to the variation not explained by the row and
column factors).
The following are some useful formulae for manual computations.
Row sample means: X̄i· = Σⱼ₌₁ᶜ Xij / c, for i = 1, 2, . . . , r.
Column sample means: X̄·j = Σᵢ₌₁ʳ Xij / r, for j = 1, 2, . . . , c.
Overall sample mean: X̄ = Σᵢ₌₁ʳ Σⱼ₌₁ᶜ Xij / n = Σᵢ₌₁ʳ X̄i· / r = Σⱼ₌₁ᶜ X̄·j / c.
Total SS = Σᵢ₌₁ʳ Σⱼ₌₁ᶜ Xij² − rcX̄².
Between-blocks (rows) variation: Brow = c Σᵢ₌₁ʳ X̄i·² − rcX̄².
Between-treatments (columns) variation: Bcol = r Σⱼ₌₁ᶜ X̄·j² − rcX̄².
Residual SS = (Total SS) − Brow − Bcol = Σᵢ₌₁ʳ Σⱼ₌₁ᶜ Xij² − c Σᵢ₌₁ʳ X̄i·² − r Σⱼ₌₁ᶜ X̄·j² + rcX̄².
As with one-way ANOVA, two-way ANOVA results are presented in a table as follows:
Source          DF               SS            MS                             F                           p-value
Row factor      r − 1            Brow          Brow/(r − 1)                   (c − 1)Brow/Residual SS     p
Column factor   c − 1            Bcol          Bcol/(c − 1)                   (r − 1)Bcol/Residual SS     p
Residual        (r − 1)(c − 1)   Residual SS   Residual SS/((r − 1)(c − 1))
Total           rc − 1           Total SS
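These formulae are easy to check numerically. The following is a minimal R sketch (the function name two_way_ss is illustrative only); it takes an r × c matrix with one observation per cell and returns the sums of squares in the decomposition above:

two_way_ss <- function(x) {
  r <- nrow(x)                      # number of blocks (rows)
  k <- ncol(x)                      # number of treatments (columns)
  xbar <- mean(x)                   # overall sample mean
  total <- sum(x^2) - r * k * xbar^2
  b_row <- k * sum(rowMeans(x)^2) - r * k * xbar^2
  b_col <- r * sum(colMeans(x)^2) - r * k * xbar^2
  c(Total = total, Rows = b_row, Columns = b_col,
    Residual = total - b_row - b_col)
}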
8.8 Residuals
Before considering an example of two-way ANOVA, we briefly consider residuals.
Recall the original two-way ANOVA model:
Xij = µ + γi + βj + εij .
We now decompose the observations as follows:
Xij = X̄ + (X̄i· − X̄) + (X̄·j − X̄) + (Xij − X̄i· − X̄·j + X̄)
for i = 1, 2, . . . , r and j = 1, 2, . . . , c, where we have the following point estimators.
µ̂ = X̄ is the point estimator of µ.
γ̂i = X̄i· − X̄ is the point estimator of γi.
β̂j = X̄·j − X̄ is the point estimator of βj.
ε̂ij = Xij − X̄i· − X̄·j + X̄ is the point estimator (residual) of εij,
for i = 1, 2, . . . , r and j = 1, 2, . . . , c.
The two-way ANOVA model assumes εij ∼ N (0, σ 2 ) and so, if the model structure is
correct, then the εbij s should behave like independent N (0, σ 2 ) random variables.
Example 8.6 The following table lists the percentage annual returns (calculated
four times per annum) of the Common Stock Index at the New York Stock Exchange
during 1981–85, available in the data file ‘NYSE.csv’ (available on the VLE).
Total SS = Σᵢ₌₁ʳ Σⱼ₌₁ᶜ xij² − rcx̄² = 559.06 − 20 × (5.17)² = 559.06 − 534.578 = 24.482.
brow = c Σᵢ₌₁ʳ x̄i·² − rcx̄² = 4 × 138.6112 − 534.578 = 19.867.
bcol = r Σⱼ₌₁ᶜ x̄·j² − rcx̄² = 5 × 107.036 − 534.578 = 0.602.
Source DF SS MS F p-value
Year 4 19.867 4.967 14.852 < 0.01
Quarter 3 0.602 0.201 0.600 > 0.10
Residual 12 4.013 0.334
Total 19 24.482
We could also provide 95% confidence interval estimates for each block and
treatment level by using the pooled estimator of σ 2 , which is:
S² = Residual SS / ((r − 1)(c − 1)) = Residual MS.
Response: Return
Df Sum Sq Mean Sq F value Pr(>F)
Year 4 19.867 4.9667 14.852 0.0001349 ***
Quarter 3 0.602 0.2007 0.600 0.6271918
Residuals 12 4.013 0.3344
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
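For reference, output of this kind could be produced along the following lines. This is a sketch only, assuming 'NYSE.csv' contains columns named Return, Year and Quarter:

nyse <- read.csv("NYSE.csv")                                   # data available on the VLE
fit  <- aov(Return ~ factor(Year) + factor(Quarter), data = nyse)
summary(fit)                                                   # two-way ANOVA table
qqnorm(residuals(fit))                                         # rough normality check of the residuals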
Note that the confidence intervals for years 1 and 2 (corresponding to 1981 and
1982) are separated from those for years 3 to 5 (that is, 1983 to 1985), which is
consistent with rejection of H0 in the no row effect test. In contrast, the confidence
intervals for each quarter all overlap, which is consistent with our failure to reject H0
in the no column effect test.
Finally, we may also look at the residuals:
ε̂ij = Xij − µ̂ − γ̂i − β̂j   for i = 1, 2, . . . , r and j = 1, 2, . . . , c.
If the assumed normal model (structure) is correct, the ε̂ij s should behave like independent N(0, σ²) random variables.
8.11 Sample examination questions
Test at the 5% significance level whether the true mean price-earnings ratios for
the three market sectors are the same. Use the ANOVA table format to summarise
your calculations. You may exclude the p-value.
2. The audience shares (in %) of three major television networks’ evening news
broadcasts in four major cities were examined. The average audience share for the
three networks (A, B and C) were 21.35%, 17.28% and 20.18%, respectively. The
following is the calculated ANOVA table with some entries missing.
For sample examination question 1, the between-sectors sum of squares is b = 76.31. Therefore, w = 186.80 − 76.31 = 110.49. Hence the ANOVA table is:
Source DF SS MS F
Sector 2 76.31 38.16 11.39
Error 33 110.49 3.35
Total 35 186.80
We test:
H0 : PE ratio means are equal vs. H1 : PE ratio means are not equal
and we reject H0 if:
f > F0.05, 2, 33 ≈ 3.30.
Since 3.30 < 11.39, we reject H0 and conclude that there is evidence of a difference
in the mean price-earnings ratios across the sectors.
A total of 4,000 cans are opened around the world every second. Ten babies are
conceived around the world every second. Each time you open a can, you stand
a 1-in-400 chance of falling pregnant.
(True or false?)
Appendix A
Probability theory
Therefore:
2π² − 3π + 0.8 = 0  ⇒  π = (3 ± √(9 − 6.4))/4.
Hence π = 0.346887, since the other root is > 1!
However:
P(A | B) = P(A ∩ B)/P(B) > P(A), i.e. P(A ∩ B) > P(A) P(B).
Hence:
P(Aᶜ | Bᶜ) > (1 − P(A) − P(B) + P(A) P(B))/(1 − P(B)) = 1 − P(A) = P(Aᶜ).
4. A and B are any two events in the sample space S. The binary set operator ∨
denotes an exclusive union, such that:
A ∨ B = (A ∪ B) ∩ (A ∩ B)c = {s | s ∈ A or B, and s 6∈ (A ∩ B)}.
Show, from the axioms of probability, that:
(a) P (A ∨ B) = P (A) + P (B) − 2 × P (A ∩ B)
(b) P (A ∨ B | A) = 1 − P (B | A).
Solution:
(a) We have:
A ∨ B = (A ∩ B c ) ∪ (B ∩ Ac ).
By axiom 3, noting that (A ∩ B c ) and (B ∩ Ac ) are disjoint:
P (A ∨ B) = P (A ∩ B c ) + P (B ∩ Ac ).
We can write A = (A ∩ B) ∪ (A ∩ B c ), hence (using axiom 3):
P (A ∩ B c ) = P (A) − P (A ∩ B).
Similarly, P (B ∩ Ac ) = P (B) − P (A ∩ B), hence:
P (A ∨ B) = P (A) + P (B) − 2 × P (A ∩ B).
(b) We have:
P(A ∨ B | A) = P((A ∨ B) ∩ A)/P(A) = P(A ∩ Bᶜ)/P(A) = (P(A) − P(A ∩ B))/P(A) = 1 − P(B | A).
Solution:
Bayes' theorem is:
P(Bj | A) = P(A | Bj) P(Bj) / Σᵢ₌₁ᴷ P(A | Bi) P(Bi).
By definition:
P(Bj | A) = P(Bj ∩ A)/P(A) = P(A | Bj) P(Bj)/P(A).
If {Bi}, for i = 1, 2, . . . , K, is a partition of the sample space S, then:
P(A) = Σᵢ₌₁ᴷ P(A ∩ Bi) = Σᵢ₌₁ᴷ P(A | Bi) P(Bi).
6. A man has two bags. Bag A contains five keys and bag B contains seven keys. Only
one of the twelve keys fits the lock which he is trying to open. The man selects a
bag at random, picks out a key from the bag at random and tries that key in the
lock. What is the probability that the key he has chosen fits the lock?
Solution:
Define a partition {Ci }, such that:
C1 = key in bag A and bag A chosen  ⇒  P(C1) = 5/12 × 1/2 = 5/24
C2 = key in bag B and bag A chosen  ⇒  P(C2) = 7/12 × 1/2 = 7/24
C3 = key in bag A and bag B chosen  ⇒  P(C3) = 5/12 × 1/2 = 5/24
C4 = key in bag B and bag B chosen  ⇒  P(C4) = 7/12 × 1/2 = 7/24.
Hence we require, defining the event F = 'key fits':
P(F) = 1/5 × P(C1) + 1/7 × P(C4) = 1/5 × 5/24 + 1/7 × 7/24 = 1/12.
7. Continuing with Question 6, suppose the first key chosen does not fit the lock.
What is the probability that the bag chosen:
(a) is bag A?
(b) contains the required key?
Solution:
(a) We have:
P(bag A | Fᶜ) = (P(Fᶜ | C1) P(C1) + P(Fᶜ | C2) P(C2)) / Σᵢ₌₁⁴ P(Fᶜ | Ci) P(Ci)
where:
P(Fᶜ | C1) = 4/5,  P(Fᶜ | C2) = 1,  P(Fᶜ | C3) = 1  and  P(Fᶜ | C4) = 6/7.
Hence:
P(bag A | Fᶜ) = (4/5 × 5/24 + 1 × 7/24) / (4/5 × 5/24 + 1 × 7/24 + 1 × 5/24 + 6/7 × 7/24) = 1/2.
(b) Similarly:
P(right bag | Fᶜ) = (P(Fᶜ | C1) P(C1) + P(Fᶜ | C4) P(C4)) / Σᵢ₌₁⁴ P(Fᶜ | Ci) P(Ci) = (4/24 + 6/24)/(22/24) = 5/11.
8. Assume that a calculator has a ‘random number’ key and that when the key is
pressed an integer between 0 and 999 inclusive is generated at random, all numbers
being generated independently of one another.
(a) What is the probability that the number generated is less than 300?
(b) If two numbers are generated, what is the probability that both are less than
300?
(c) If two numbers are generated, what is the probability that the first number
exceeds the second number?
(d) If two numbers are generated, what is the probability that the first number
exceeds the second number, and their sum is exactly 300?
(e) If five numbers are generated, what is the probability that at least one number
occurs more than once?
Solution:
(d) There are 150 favourable outcomes (the first number is one of 151, 152, . . . , 300, and the second is then determined), each with probability 1/1,000,000, so the required probability is 150/1,000,000 = 0.00015.
(e) Note that the first number can be any number (with probability 1). The probability that all five numbers are different is then (999/1,000) × (998/1,000) × (997/1,000) × (996/1,000) = 0.990035. Subtracting from 1 gives the required probability, i.e. 0.009965.
Solution:
(a) Since the component failures are independent, the probability of system failure
is π1 π2 π3 .
(b) The probability that component i does not fail is 1 − πi , hence the probability
that the system does not fail is (1 − π1 )(1 − π2 )(1 − π3 ), and so the probability
that the system fails is:
1 − (1 − π1 )(1 − π2 )(1 − π3 ).
1 − (1 − π1 )(1 − π2 π3 ) = π1 + π2 π3 − π1 π2 π3 .
10. Why is S = {1, 1, 2}, not a sensible way to try to define a sample space?
Solution:
Because there is no need to list the elementary outcome ‘1’ twice. It is much clearer
to write S = {1, 2}.
11. Write out all the events for the sample space S = {a, b, c}. (There are eight of
them.)
Solution:
The possible events are {a}, {b}, {c}, {a, b}, {a, c}, {b, c}, {a, b, c} (the sample
space S) and ∅.
12. For an event A, work out a simpler way to express the events A ∩ S, A ∪ S, A ∩ ∅
and A ∪ ∅.
Solution:
We have:
A ∩ S = A, A ∪ S = S, A ∩ ∅ = ∅ and A ∪ ∅ = A.
13. If all elementary outcomes are equally likely, S = {a, b, c, d}, A = {a, b, c} and
B = {c, d}, find P (A | B) and P (B | A).
Solution:
S has 4 elementary outcomes which are equally likely, so each elementary outcome
has probability 1/4.
We have:
P(A | B) = P(A ∩ B)/P(B) = P({c})/P({c, d}) = (1/4)/(1/4 + 1/4) = 1/2
and:
P(B | A) = P(B ∩ A)/P(A) = P({c})/P({a, b, c}) = (1/4)/(1/4 + 1/4 + 1/4) = 1/3.
14. Suppose that we toss a fair coin twice. The sample space is given by
S = {HH, HT, T H, T T }, where the elementary outcomes are defined in the
obvious way – for instance HT is heads on the first toss and tails on the second
toss. Show that if all four elementary outcomes are equally likely, then the events
‘heads on the first toss’ and ‘heads on the second toss’ are independent.
Solution:
Note carefully here that we have equally likely elementary outcomes (due to the
coin being fair), so that each has probability 1/4, and the independence follows.
The event ‘heads on the first toss’ is A = {HH, HT } and has probability 1/2,
because it is specified by two elementary outcomes. The event ‘heads on the second
toss’ is B = {HH, T H} and has probability 1/2. The event ‘heads on the first toss
and the second toss' is A ∩ B = {HH} and has probability 1/4. So P(A ∩ B) = 1/4 = (1/2) × (1/2) = P(A) P(B), and hence the two events are independent.
15. Show that if A and B are disjoint events, and are also independent, then P (A) = 0
or P (B) = 0. (Note that independence and disjointness are not similar ideas.)
Solution:
It is important to get the logical flow in the right direction here. We are told that
A and B are disjoint events, that is:
A ∩ B = ∅.
So:
P (A ∩ B) = 0.
We are also told that A and B are independent, that is:
P (A ∩ B) = P (A) P (B).
It follows that:
0 = P (A) P (B)
and so either P (A) = 0 or P (B) = 0.
16. Write down the condition for three events A, B and C to be independent.
Solution:
Applying the product rule, we must have: P(A ∩ B ∩ C) = P(A) P(B) P(C).
Therefore, since all subsets of two events from A, B and C must be independent,
we must also have:
P (A ∩ B) = P (A) P (B)
P (A ∩ C) = P (A) P (C)
and:
P (B ∩ C) = P (B) P (C).
One must check that all four conditions hold to verify independence of A, B and C.
17. Prove the simplest version of Bayes’ theorem from first principles.
Solution:
Applying the definition of conditional probability, we have:
P(B | A) = P(B ∩ A)/P(A) = P(A ∩ B)/P(A) = P(A | B) P(B)/P(A).
18. A statistics teacher knows from past experience that a student who does their
homework consistently has a probability of 0.95 of passing the examination,
whereas a student who does not do their homework has a probability of 0.30 of
passing.
(b) If a student chosen at random from the group gets a pass, what is the
probability that the student has done their homework consistently?
Solution:
(a) The first part of the example asks for the denominator of Bayes' theorem, i.e. the overall pass rate:
P(Pass) = P(Pass | Homework) P(Homework) + P(Pass | No homework) P(No homework) = 0.95 × 0.25 + 0.30 × 0.75 = 0.4625.
(b) We have:
P(Homework | Pass) = P(Homework ∩ Pass)/P(Pass) = P(Pass | Homework) P(Homework)/P(Pass) = (0.95 × 0.25)/0.4625 = 0.5135.
(Note that F = Pᶜ and N = Cᶜ.)
1. (a) A, B and C are any three events in the sample space S. Prove that:
P(A ∩ B) ≤ (P(A) + P(B))/2 ≤ P(A ∪ B).
3. (a) Show that if A and B are independent events in a sample space, then Ac and
B c are also independent.
(b) Show that if X and Y are mutually exclusive events in a sample space, then
X c and Y c are not in general mutually exclusive.
Appendix B
Discrete probability distributions
Solution:
Each face has an equal chance of 1/6 of turning up, so we get the following table:
X=x 1 2 3 4 5 6 Total
P (X = x) 1/6 1/6 1/6 1/6 1/6 1/6 1
x P (X = x) 1/6 2/6 3/6 4/6 5/6 6/6 21/6 = 3.5
x2 P (X = x) 1/6 4/6 9/6 16/6 25/6 36/6 91/6
Hence the mean is E(X) = 3.5. The variance is E(X 2 ) − µ2 = 91/6 − (3.5)2 = 2.92.
(b) Determine the probability distribution of the absolute difference of the two
dice, Y , and find its mean and variance.
Solution:
(a) The pattern is made clearer by using the same denominator (i.e. 36) below.
X=x 2 3 4 5 6 7
P (X = x) 1/36 2/36 3/36 4/36 5/36 6/36
x P (X = x) 2/36 6/36 12/36 20/36 30/36 42/36
x2 P (X = x) 4/36 18/36 48/36 100/36 180/36 294/36
X=x 8 9 10 11 12 Total
P (X = x) 5/36 4/36 3/36 2/36 1/36 1
x P (X = x) 40/36 36/36 30/36 22/36 12/36 252/36
x2 P (X = x) 320/36 324/36 300/36 242/36 144/36 1,974/36
[Figures: the probability distribution of the value of the sum of the two dice (2 to 12) and of the absolute difference of the two dice (0 to 5), with probability on the vertical axis.]
(a) Draw the probability distribution of X, and find its mean and variance.
(b) The examiner calculates a rescaled mark using the formula Y = 10 + 22.5X.
Find the mean and variance of Y .
Solution:
(a) The distribution of X is binomial with n = 4 and π = 1/3.
X=x 0 1 2 3 4 Total
P (X = x) 0.1975 0.3951 0.2963 0.0988 0.0123 1
x P (X = x) 0 0.3951 0.5926 0.2964 0.0492 1.33
x2 P (X = x) 0 0.3951 1.1852 0.8892 0.1968 2.67
[Figure: the probability distribution of X ∼ Bin(4, 1/3), with probability on the vertical axis and x = 0, 1, 2, 3, 4 on the horizontal axis.]
(b) We have:
E(Y ) = 10 + 22.5 × E(X) = 10 + 22.5 × 1.33 = 39.93
and:
Var(Y ) = (22.5)2 × Var(X) = (22.5)2 × 0.89 = 450.6.
4. In a game show each contestant has two chances out of three of winning a prize,
independently of other contestants. If six contestants take part, determine the
probability distribution of the number of winners. Find the mean and variance of
the number of winners.
Solution:
The number of winners, X, is binomial with n = 6 and π = 2/3.
X=x 0 1 2 3 4 5 6 Total
P (X = x) 0.0014 0.0165 0.0823 0.2195 0.3292 0.2634 0.0878 1
x P (X = x) 0 0.0165 0.1646 0.6585 1.3168 1.3170 0.5268 4
x2 P (X = x) 0 0.0165 0.3292 1.9755 5.2672 6.5850 3.1608 17.33
[Figure: the probability distribution of the number of winners, X ∼ Bin(6, 2/3), with probability on the vertical axis and the number of winners (0 to 6) on the horizontal axis.]
5. Suppose that the probability of a warship hitting a target with any shot is 0.2.
(a) What are the probabilities that in six shots it will hit the target:
i. exactly twice
ii. at least three times
iii. at most twice?
(b) What assumptions are you implicitly making in (a)? Are the assumptions
reasonable?
Solution:
(a) Using the binomial distribution with n = 6 and π = 0.2 we obtain the
following:
i. P (X = 2) = 15 × (0.2)2 × (0.8)4 = 0.2458.
ii. P (X = 0) = (0.8)6 = 0.2621 and P (X = 1) = 6 × (0.2)1 × (0.8)5 = 0.3932.
So:
P (at most 2) = P (X ≤ 2) = P (X = 0) + P (X = 1) + P (X = 2)
= 0.2621 + 0.3932 + 0.2458
= 0.9011.
Therefore, the probability of at least three hits is P(X ≥ 3) = 1 − P(X ≤ 2) = 1 − 0.9011 = 0.0989.
(b) This assumes independence and that the probability of a hit is constant at 0.2.
In practice these assumptions might not be valid. For example, there might be
an improvement in accuracy due to experience – naval and artillery gunners
often use ‘ranging shots’ (so that, for instance, if the shell has landed too far
to the left, then they can aim a bit more to the right next time). These
answers might, nevertheless, be reasonable approximations.
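These binomial calculations are easily checked in R, for example:

dbinom(2, size = 6, prob = 0.2)        # P(X = 2), approximately 0.2458
pbinom(2, size = 6, prob = 0.2)        # P(X <= 2), approximately 0.9011
1 - pbinom(2, size = 6, prob = 0.2)    # P(X >= 3), approximately 0.0989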
6. Components for assembly are delivered in batches of 100 and experience shows that
5% of each batch are defective. On arrival, five pieces are selected at random from
each batch and tested. If two or more of the five are found to be faulty, the entire
batch is rejected. What is the probability that a 5% defective batch will be
accepted?
Solution:
Since this question involves sampling without replacement, we ought in principle to
use the ‘hypergeometric distribution’ (not covered in this course). However, we
frequently want to save effort in these calculations (with a nod to pre-computer
approaches), and here n = 5 (both the sample size and the upper limit of the
values of interest) and N = 100, so that π = n/N = 5/100 = 0.05 is small.
Moreover, N π = n = 5 is also ‘small’.
Therefore, we can use a binomial approximation to these preferable, but more
complicated, methods, with n = 5 and π = 0.05. We find that the probability the batch is accepted is P(X ≤ 1) = P(X = 0) + P(X = 1) = (0.95)⁵ + 5 × 0.05 × (0.95)⁴ = 0.7738 + 0.2036 = 0.9774.
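If you have access to R, the exact hypergeometric calculation can be compared with the binomial approximation. A minimal sketch, with 5 defective and 95 non-defective components and a sample of size 5, is:

phyper(1, m = 5, n = 95, k = 5)        # exact P(at most 1 defective in the sample)
pbinom(1, size = 5, prob = 0.05)       # binomial approximation

The two values are very close, which is why the approximation is acceptable here.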
7. One in ten of the new cars leaving a factory has minor faults of one kind or
another.
(a) Assuming that a batch of ten cars delivered to a dealer represents a random
sample of the output, what is the probability that:
i. at least 1 will be faulty
On receiving a delivery of ten new cars from the manufacturer, the dealer checks
out four of these, chosen at random, before they are delivered to customers.
(b) If, in fact, two of the cars are faulty, what is the probability that both faults
will be discovered?
Solution:
(a) Using the binomial distribution with n = 10 and π = 0.1 we have the
following:
i. P (at least 1 faulty) = 1 − P (X = 0) = 1 − (0.9)10 = 0.6513.
ii. We have:
P (more than 3 faulty) = 1 − P (X = 0) − P (X = 1) − P (X = 2) − P (X = 3)
= 1 − 0.3487 − 0.3874 − 0.1937 − 0.0574
= 0.0128.
8. Over a period of time the number of break-ins per month in a given district has
been observed to follow a Poisson distribution with mean 2.
(a) For a given month, find the probability that the number of break-ins is:
i. fewer than 2
ii. more than 4
iii. at least 1, but no more than 3.
(b) What is the probability that there will be fewer than ten break-ins in a
six-month period?
Solution:
(a) The first few entries in the table of the probability function are given below.
They all follow from the Poisson probability function:
P(X = x) = e^−λ λ^x / x!
with λ = 2.
X=x 0 1 2 3 4 5
P (X = x) 0.1353 0.2707 0.2707 0.1804 0.0902 ...
i. We have:
P (X < 2) = P (X = 0) + P (X = 1)
= 0.1353 + 0.2707
= 0.4060.
ii. We have:
P (X > 4) = 1 − P (X = 0) − P (X = 1) − P (X = 2) − P (X = 3) − P (X = 4)
= 1 − 0.1353 − 0.2707 − 0.2707 − 0.1804 − 0.0902
= 0.0527.
iii. We have:
P (1 ≤ X ≤ 3) = P (X = 1) + P (X = 2) + P (X = 3)
= 0.2707 + 0.2707 + 0.1804
= 0.7218.
(b) If there are an average of 2 break-ins per month, there will be an average of 12
break-ins in a 6-month period. Therefore, the number of break-ins will have a
Poisson distribution with λ = 12. We need to calculate P (X < 10). This is
time-consuming (though not difficult) by hand, resulting in a probability of
0.2424.
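In R, for instance, the value could be obtained directly with:

ppois(9, lambda = 12)    # P(X <= 9) = P(X < 10), approximately 0.2424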
9. Two per cent of the videotapes produced by a company are known to be defective.
If a random sample of 100 videotapes is selected for inspection, calculate the
probability of getting no defectives by using:
(a) the binomial distribution
(b) the Poisson distribution.
Solution:
(a) Using the binomial with n = 100 and π = 0.02, P (X = 0) = (0.98)100 = 0.1326.
(b) Using the Poisson with λ = 100 × 0.02 = 2, P (X = 0) = exp(−2) = 0.1353.
Note that the answers are almost equal. This is because n is large and π is small.
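The two values can also be compared directly in R:

dbinom(0, size = 100, prob = 0.02)    # exact binomial, approximately 0.1326
dpois(0, lambda = 2)                  # Poisson approximation, approximately 0.1353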
10. The probability that a marksman hits the bullseye in target practice is 0.7. Assume
that successive shots are independent.
(a) What is the probability that, out of seven shots, he hits the target:
i. exactly once
ii. every time
iii. at least five times
iv. at least five times in succession (with all hits being in succession)?
(b) Suppose he bets £1 that he can hit the target at least five times out of seven
(i.e. he gains £1 if he succeeds and loses £1 if he fails). What are his expected
winnings?
Solution:
iii. We have:
P(X ≥ 5) = P(X = 5) + P(X = 6) + P(X = 7)
= 7C5 × (0.7)⁵ × (0.3)² + 7C6 × (0.7)⁶ × 0.3 + (0.7)⁷
= 0.6471.
iv. We have:
1. On average a fire brigade in a large town receives 1.6 emergency calls per hour.
Assuming a suitable probability distribution for this variable, find the probability
that the number of calls is:
(a) none in 1 hour
(b) exactly 5 in 1 hour
(c) more than 4 in 2 hours.
2. A glacier in Greenland ‘calves’ (lets fall off into the sea) an iceberg on average
twice every five weeks. (Seasonal effects can be ignored for this question.)
(a) Explain which distribution you would use to estimate the probabilities of
different numbers of icebergs being calved in different periods, justifying your
selection.
(b) What is the probability that no iceberg is calved in the next two weeks?
(c) What is the probability that no iceberg is calved in the two weeks after the
next two weeks?
(d) What is the probability that exactly three icebergs are calved in the next four
weeks?
(e) If exactly three icebergs are calved in the next four weeks, what is the
probability that exactly three more icebergs will be calved in the four-week
period after the next four weeks?
(f) Comment on the relationship between your answers to (d) and (e).
Appendix C
Continuous probability distributions
Determine the probability density function (pdf) of X, and find its mean and
standard deviation.
Solution:
We obtain the pdf by differentiating the cdf with respect to x. Hence:
f(x) = 2x for 0 ≤ x ≤ 1, and f(x) = 0 otherwise.
By definition the mean, µ, is:
µ = E(X) = ∫ x f(x) dx = ∫₀¹ 2x² dx = [2x³/3]₀¹ = 2/3 = 0.6667.
We obtain the variance using Var(X) = E(X²) − (E(X))², so we also need E(X²):
E(X²) = ∫ x² f(x) dx = ∫₀¹ 2x³ dx = [2x⁴/4]₀¹ = 1/2 = 0.5.
Hence Var(X) = σ² = 1/2 − (2/3)² = 1/18 ≈ 0.0556. Therefore, the standard deviation is σ = √0.0556 = 0.2357.
Solution:
(a) The constant c must be such that the area between the curve representing the pdf and the (horizontal) x-axis equals 1. Hence the value c satisfies:
1 = ∫ f(x) dx = c ∫₀¹ x² dx = c [x³/3]₀¹ = c/3  ⇒  c = 3.
(b) i. We have:
P(X ≤ 1/2) = ∫₀^(1/2) 3x² dx = [x³]₀^(1/2) = 1/8 = 0.125.
ii. We have:
P(1/4 ≤ X ≤ 3/4) = ∫_(1/4)^(3/4) 3x² dx = [x³]_(1/4)^(3/4) = 13/32 = 0.4063.
(c) To determine the mean and variance, we compute E(X) and E(X²):
E(X) = ∫ x f(x) dx = ∫₀¹ 3x³ dx = [3x⁴/4]₀¹ = 3/4 = 0.75
and:
E(X²) = ∫ x² f(x) dx = ∫₀¹ 3x⁴ dx = [3x⁵/5]₀¹ = 3/5 = 0.6.
Hence Var(X) = E(X²) − (E(X))² = 0.6 − (0.75)² = 0.0375.
Solution:
(a) For the continuous random variable X, we know that:
P(0.5 < X < 1) = F(1) − F(0.5) = (1 − e^−1) − (1 − e^−0.5) = e^−0.5 − e^−1 = 0.2387.
(b) P(X > x) = 0.05 means F(x) = 0.95, i.e. 1 − e^−x = 0.95. That is, e^−x = 0.05, so x = −ln(0.05) = 3.00.
This results from the following calculations. Firstly, for x < 0, we have:
F(x) = ∫ f(t) dt = 0.
For 1 ≤ x ≤ 3, we have:
F(x) = ∫ f(t) dt = ∫_(−∞)^0 0 dt + ∫₀¹ (2t/3) dt + ∫₁ˣ ((3 − t)/3) dt
= 0 + [t²/3]₀¹ + [t − t²/6]₁ˣ = 1/3 + (x − x²/6) − (1 − 1/6) = x − x²/6 − 1/2.
(b) We want to find the value a such that P (10 − a < X < 10 + a) = 0.95, that is:
0.95 = P(((10 − a) − 10)/2 < Z < ((10 + a) − 10)/2)
= P(−a/2 < Z < a/2)
= 1 − P(Z > a/2) − P(Z < −a/2)
= 1 − 2 × P(Z > a/2).
This is the same as 2 × P (Z > a/2) = 0.05, i.e. P (Z > a/2) = 0.025. Hence,
from Table 4, a/2 = 1.96, and so a = 3.92.
(c) We want to find the value b such that P (10 − b < X < 10 + b) = 0.99. Similar
reasoning shows that P (Z > b/2) = 0.005. Hence, from Table 4, b/2 = 2.58, so
that b = 5.16.
(e) We want x such that P (Z < x) = 0.05. This means that x < 0 and
P (Z > |x|) = 0.05, so, from Table 4, |x| = 1.65 and hence x = −1.65.
7. Your company requires a special type of light bulb which is available from only two
suppliers. Supplier A’s light bulbs have a mean lifetime of 2,000 hours with a
standard deviation of 180 hours. Supplier B’s light bulbs have a mean lifetime of
1,850 hours with a standard deviation of 100 hours. The distribution of the
lifetimes of each type of light bulb is normal. Your company requires that the
lifetime of a light bulb be not less than 1,500 hours. All other things being equal,
which type of bulb should you buy, and why?
Solution:
Let A and B be the random variables representing the lifetimes (in hours) of light
bulbs from supplier A and supplier B, respectively. We are told that A ∼ N(2,000, (180)²) and B ∼ N(1,850, (100)²).
Since the relevant criterion is that light bulbs last at least 1,500 hours, the
company should choose the supplier whose light bulbs have a greater probability of
doing so. We find that:
P(A > 1,500) = P(Z > (1,500 − 2,000)/180) = P(Z > −2.78) = 1 − P(Z > 2.78) = 0.9973
and:
P(B > 1,500) = P(Z > (1,500 − 1,850)/100) = P(Z > −3.50) = 1 − P(Z > 3.50) = 0.9998.
Therefore, the company should buy light bulbs from supplier B, since they have a
greater probability of lasting the required time.
Note it is good practice to define notation and any units of measurement and to
state the distributions of the random variables. Note also that here it is not essential
to compute the probability values in order to determine what the company should
do, since −2.78 > −3.5 implies that P (Z > −2.78) < P (Z > −3.5).
8. The life, in hours, of a light bulb is normally distributed with a mean of 200 hours.
If a consumer requires at least 90% of the light bulbs to have lives exceeding 150
hours, what is the largest value that the standard deviation can have?
Solution:
Let X be the random variable representing the lifetime of a light bulb (in hours),
so that for some value σ we have X ∼ N (200, σ 2 ). We want P (X > 150) = 0.9,
such that:
P(X > 150) = P(Z > (150 − 200)/σ) = P(Z > −50/σ) = 0.9.
Note that this is the same as P (Z > 50/σ) = 1 − 0.9 = 0.1, so 50/σ = 1.28, giving
σ = 39.06.
Solution:
Let X and Y , respectively, denote the random variables for the diameters of the
rods and of the holes (in millimetres), so that:
Therefore, P(rod fits hole) = P(X < Y) = P(Y − X > 0). Now note that Y − X is normally distributed with mean 0.2 and variance 0.0074, i.e. Y − X ∼ N(0.2, 0.0074).
So:
P(rod fits hole) = P(Z > (0 − 0.2)/√0.0074) = P(Z > −2.33) = 1 − P(Z > 2.33) = 0.9901.
10. An investor has the choice of two out of four investments: X1 , X2 , X3 and X4 . The
profits (in £000s per annum) from these may be assumed to be independently
distributed as: X1 ∼ N(2, 1), X2 ∼ N(3, 3), X3 ∼ N(1, 0.25) and X4 ∼ N(2.5, 4) (profits in £000s per annum).
Which pair of investments should the investor choose in order to maximise the
probability of making a total profit of at least £2,000? What is this maximum
probability?
Solution:
Let A1 , A2 , A3 and A4 be the profits from the investments X1 , X2 , X3 and X4 ,
respectively. There are 4 C2 = 6 possible pairs, and we simply have to work out the
probabilities for each. From the information in the question, we see that:
2−5
A1 + A2 ∼ N (5, 4) ⇒ P (A1 + A2 > 2) = P Z > √ = 0.9332
4
2−3
A1 + A3 ∼ N (3, 1.25) ⇒ P (A1 + A3 > 2) = P Z > √ = 0.8133
1.25
2 − 4.5
A1 + A4 ∼ N (4.5, 5) ⇒ P (A1 + A4 > 2) = P Z > √ = 0.8686
5
2−4
A2 + A3 ∼ N (4, 3.25) ⇒ P (A2 + A3 > 2) = P Z > √ = 0.8665
3.25
2 − 5.5
A2 + A4 ∼ N (5.5, 7) ⇒ P (A2 + A4 > 2) = P Z > √ = 0.9066
7
2 − 3.5
A3 + A4 ∼ N (3.5, 4.25) ⇒ P (A3 + A4 > 2) = P Z > √ = 0.7673.
4.25
Therefore, the investor should choose X1 and X2 , for which the maximum
probability is 0.9332.
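Since pnorm() is vectorised, all six probabilities can be computed at once. A minimal sketch (the two vectors simply collect the means and variances of the six sums used above):

pair_means <- c(5, 3, 4.5, 4, 5.5, 3.5)       # means of A1+A2, A1+A3, A1+A4, A2+A3, A2+A4, A3+A4
pair_vars  <- c(4, 1.25, 5, 3.25, 7, 4.25)    # corresponding variances
round(pnorm(2, mean = pair_means, sd = sqrt(pair_vars), lower.tail = FALSE), 4)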
C.2 Practice questions
1. An advertising agency claims that 40% of all television viewers watch a particular
programme. In a random sample of 500 viewers, what is the probability that fewer
than 170 will be watching the programme if the agency’s claim is correct?
Appendix D
Multivariate random variables
Solution:
(a) The joint distribution (with marginal probabilities) is:
W =w
0 2 4 pZ (z)
−1 0.00 0.00 0.16 0.16
Z=z 0 0.00 0.08 0.24 0.32
1 0.16 0.12 0.00 0.28
2 0.24 0.00 0.00 0.24
pW (w) 0.40 0.20 0.40 1.00
(b) It is straightforward to see that:
P(W = 2 | Z = 1) = P(W = 2 ∩ Z = 1)/P(Z = 1) = 0.12/0.28 = 3/7.
For E(W | Z = 0), we have:
E(W | Z = 0) = Σ_w w P(W = w | Z = 0) = 0 × 0.00/0.32 + 2 × 0.08/0.32 + 4 × 0.24/0.32 = 3.5.
We see E(W) = 2 (by symmetry), and:
E(Z) = −1 × 0.16 + 0 × 0.32 + 1 × 0.28 + 2 × 0.24 = 0.6.
Also:
E(WZ) = Σ_w Σ_z wz p(w, z) = −4 × 0.16 + 2 × 0.12 = −0.4
hence:
Cov(W, Z) = E(WZ) − E(W) E(Z) = −0.4 − 2 × 0.6 = −1.6.
X=x
−1 0 1
−1 0.05 0.15 0.10
Y =y 0 0.10 0.05 0.25
1 0.10 0.05 0.15
Solution:
(a) The conditional distribution of X given Y = 1 is:
X = x | Y = 1        −1    0    1
pX|Y=1(x | Y = 1)   1/3  1/6  1/2
(b) From the conditional distribution we see:
E(X | Y = 1) = −1 × 1/3 + 0 × 1/6 + 1 × 1/2 = 1/6.
E(Y) = 0 (by symmetry), and so Var(Y) = E(Y²) = 0.6.
E(X) = 0.25 and Var(X) = E(X²) − (E(X))² = 0.75 − 0.0625 = 0.6875.
(Note that Var(X) and Var(Y) are not strictly necessary here!)
Next:
E(XY) = Σ_x Σ_y xy p(x, y) = 0.
So:
Cov(X, Y) = E(XY) − E(X) E(Y) = 0 ⇒ Corr(X, Y) = 0.
(c) X and Y are not independent random variables since, for example, P(X = −1, Y = −1) = 0.05, whereas P(X = −1) P(Y = −1) = 0.25 × 0.30 = 0.075.
4. The random variables X1 and X2 are independent and have the common
distribution given in the table below:
X=x 0 1 2 3
pX (x) 0.2 0.4 0.3 0.1
Solution:
which is:
W =w
0 1 2 3
0 0.04 0.16 0.12 0.04
Y =y 1 0.00 0.16 0.24 0.08
2 0.00 0.00 0.09 0.06
3 0.00 0.00 0.00 0.01
0.04 0.32 0.45 0.19
5. Consider two random variables X and Y . X can take the values −1, 0 and 1, and
Y can take the values 0, 1 and 2. The joint probabilities for each pair are given by
the following table:
X = −1 X = 0 X = 1
Y =0 0.10 0.20 0.10
Y =1 0.10 0.05 0.10
Y =2 0.10 0.05 0.20
Solution:
(a) The marginal distribution of X is:
X=x −1 0 1
pX (x) 0.3 0.3 0.4
(b) We have:
Cov(U, V ) = Cov(X + Y, X − Y )
= E((X + Y )(X − Y )) − E(X + Y ) E(X − Y )
= E(X 2 − Y 2 ) − (E(X) + E(Y ))(E(X) − E(Y ))
hence Cov(U, V) = E(X²) − E(Y²) − ((E(X))² − (E(Y))²) = Var(X) − Var(Y).
(c) U = 1 is achieved for (X, Y ) pairs (−1, 2), (0, 1) or (1, 0). The corresponding
values of V are −3, −1 and 1. We have:
P(V = −3 | U = 1) = 0.10/0.25 = 2/5
P(V = −1 | U = 1) = 0.05/0.25 = 1/5
P(V = 1 | U = 1) = 0.10/0.25 = 2/5
hence:
E(V | U = 1) = −3 × 2/5 + (−1) × 1/5 + 1 × 2/5 = −1.
6. Two refills for a ballpoint pen are selected at random from a box containing three
blue refills, two red refills and three green refills. Define the following random
variables:
X = the number of blue refills selected
Y = the number of red refills selected.
(a) Show that P (X = 1, Y = 1) = 3/14.
(b) Form the table showing the joint probability distribution of X and Y .
(c) Calculate E(X), E(Y ) and E(X | Y = 1).
(d) Find the covariance between X and Y .
(e) Are X and Y independent random variables? Give a reason for your answer.
Solution:
(a) With the obvious notation B = blue and R = red:
P(X = 1, Y = 1) = P(BR) + P(RB) = 3/8 × 2/7 + 2/8 × 3/7 = 3/14.
(b) We have:
X=x
0 1 2
0 3/28 9/28 3/28
Y =y 1 3/14 3/14 0
2 1/28 0 0
(c) The marginal distribution of X is:
X=x 0 1 2
pX (x) 10/28 15/28 3/28
Hence:
E(X) = 0 × 10/28 + 1 × 15/28 + 2 × 3/28 = 3/4.
The marginal distribution of Y is:
Y = y     0      1      2
pY(y)   15/28  12/28   1/28
Hence:
E(Y) = 0 × 15/28 + 1 × 12/28 + 2 × 1/28 = 1/2.
The conditional distribution of X given Y = 1 is:
X = x | Y = 1        0    1
pX|Y=1(x | y = 1)   1/2  1/2
Hence:
E(X | Y = 1) = 0 × 1/2 + 1 × 1/2 = 1/2.
(d) The distribution of XY is:
XY = xy 0 1
pXY (xy) 22/28 6/28
Hence:
E(XY) = 0 × 22/28 + 1 × 6/28 = 3/14
and:
Cov(X, Y) = E(XY) − E(X) E(Y) = 3/14 − 3/4 × 1/2 = −9/56.
(e) Since Cov(X, Y ) 6= 0, a necessary condition for independence fails to hold. The
random variables are not independent.
D.2 Practice questions
1. X and Y are discrete random variables which can assume values 0, 1 and 2 only.
(a) Draw up a table to describe the joint distribution of X and Y and find the
value of the constant A.
(b) Describe the marginal distributions of X and Y .
(c) Give the conditional distribution of X | Y = 1 and find E(X | Y = 1).
(d) Are X and Y independent? Give a reason for your answer.
2. Consider two random variables X and Y which both take the values 0, 2 and 4.
The joint probabilities for each pair are given by the following table:
Appendix E
Sampling distributions of statistics
Solution:
where F ∼ F5, 17, using Table A.3 of the Dougherty Statistical Tables (practice of which will be covered later in the course; although we have yet to 'formally' introduce Table A.3, you should be able to see how this works).
(d) A chi-squared random variable only assumes non-negative values. Hence each of A, B and C is non-negative, so A³ + B³ + C³ ≥ 0, and:
P(A³ + B³ + C³ < 0) = 0.
Solution:
(a) Z1² ∼ χ²₁
(b) Z1²/Z2² ∼ F1, 1
(c) Z1/√(Z2²) ∼ t1
(d) Σᵢ₌₁ᵏ Zi/k ∼ N(0, 1/k)
(e) Σᵢ₌₁ᵏ Zi² ∼ χ²ₖ
Solution:
(a) We have X1 ∼ N (0, 9) and X2 ∼ N (0, 9). Hence 2X2 ∼ N (0, 36) and
X1 + 2X2 ∼ N (0, 45). So:
P(X1 + 2X2 > 9) = P(Z > 9/√45) = P(Z > 1.34) = 0.0901.
(b) We have X1 /3 ∼ N (0, 1) and X2 /3 ∼ N (0, 1). Hence X12 /9 ∼ χ21 and
X22 /9 ∼ χ21 . Therefore, X12 /9 + X22 /9 ∼ χ22 . So:
where Y ∼ χ22 .
(c) We have X12 /9 + X22 /9 ∼ χ22 and also X32 /9 + X42 /9 ∼ χ22 . So:
Hence:
P ((X12 + X22 ) > 99(X32 + X42 )) = P (Y > 99) = 0.01
where Y ∼ F2, 2 .
Solution:
(a) We have Xi ∼ N (0, 4), for i = 1, 2, 3, hence:
X1 − X2 − X3 ∼ N (0, 12).
So:
P (X1 > X2 + X3 ) = P (X1 − X2 − X3 > 0) = P (Z > 0) = 0.5.
(b) So:
P(X1² > 9.25(X2² + X3²)) = P(2X1²/(X2² + X3²) > 9.25 × 2) = P(Y > 18.5) = 0.05
where Y ∼ F1, 2.
(c) We have:
P(X1 > 5(X2² + X3²)^(1/2)) = P(X1/2 > 5(X2²/4 + X3²/4)^(1/2))
= P(X1/2 > 5√2 × ((X2²/4 + X3²/4)/2)^(1/2))
i.e. P(Y1 > 5√2 Y2), where Y1 ∼ N(0, 1) and Y2 = √(χ²₂/2), or P(Y3 > 7.07), where Y3 ∼ t2. From Table 10 of the New Cambridge Statistical Tables, this is approximately 0.01.
Solution:
(a) We have Xi ∼ N(0, 4), for i = 1, 2, 3, 4, hence 3X1 ∼ N(0, 36) and 4X2 ∼ N(0, 64). Therefore:
(3X1 + 4X2)/10 = Z ∼ N(0, 1).
So, P(3X1 + 4X2 > 5) = k = P(Z > 0.5) = 0.3085.
(b) We have Xi/2 ∼ N(0, 1), for i = 1, 2, 3, 4, hence (X3² + X4²)/4 ∼ χ²₂. So:
P(X1 > k √(X3² + X4²)) = 0.025 = P(T > k√2)
where T ∼ t2 and hence k√2 = 4.303, so k = 3.04268.
(c) We have (X1² + X2² + X3²)/4 ∼ χ²₃, so:
P(X1² + X2² + X3² < k) = 0.9 = P(X < k/4)
where X ∼ χ²₃. Therefore, k/4 = 6.251. Hence k = 25.004.
(d) P(X2² + X3² + X4² > 19X1² + 20X3²) = k simplifies to:
P(X2² + X4² > 19(X1² + X3²)) = k
and:
(X2² + X4²)/(X1² + X3²) ∼ F2, 2.
So, from Table A.3 of the Dougherty Statistical Tables, k = 0.05.
6. Suppose that the heights of students are normally distributed with a mean of 68.5
inches and a standard deviation of 2.7 inches. If 200 random samples of size 25 are
drawn from this population with means recorded to the nearest 0.1 inch, find:
(a) the expected mean and standard deviation of the sampling distribution of the
mean
(b) the expected number of recorded sample means which fall between 67.9 and
69.2 inclusive
(c) the expected number of recorded sample means falling below 67.0.
Solution:
(a) The sampling distribution of the mean of 25 observations has the same mean
as the population, which is 68.5 inches. The standard deviation (standard error) of the sample mean is 2.7/√25 = 0.54.
(b) Notice that the samples are random, so we cannot be sure exactly how many
will have means between 67.9 and 69.2 inches. We can work out the probability
that the sample mean will lie in this interval using the sampling distribution:
X̄ ∼ N (68.5, (0.54)2 ).
We need to make a continuity correction, to account for the fact that the
recorded means are rounded to the nearest 0.1 inch. For example, the
probability that the recorded mean is ≥ 67.9 inches is the same as the
probability that the sample mean is > 67.85. Therefore, the probability we
want is:
P(67.85 < X̄ < 69.25) = P((67.85 − 68.5)/0.54 < Z < (69.25 − 68.5)/0.54)
= P(−1.20 < Z < 1.39)
= Φ(1.39) − Φ(−1.20)
= 0.9177 − (1 − 0.8849)
= 0.8026.
As usual, the values of Φ(1.39) and Φ(−1.20) can be found from Table 4 of the
New Cambridge Statistical Tables. Since there are 200 independent random
samples drawn, we can now think of each as a single trial. The recorded mean
lies between 67.9 and 69.2 with probability 0.8026 at each trial. We are dealing
with a binomial distribution with n = 200 trials and probability of success
π = 0.8026. The expected number of successes is 200 × 0.8026 ≈ 160.5.
(c) The probability that the recorded mean is < 67.0 inches is:
P(X̄ < 66.95) = P(Z < (66.95 − 68.5)/0.54) = P(Z < −2.87) = Φ(−2.87) = 0.00205
so the expected number of recorded means below 67.0 out of a sample of 200 is 200 × 0.00205 = 0.41.
Solution:
We can compute the probability in two different ways. Working with the standard
normal distribution, we have:
P(Z² < 3.841) = P(−√3.841 < Z < √3.841) = P(−1.96 < Z < 1.96) = 0.95.
Alternatively, we can use the fact that Z 2 follows a χ21 distribution. From Table 8
of the New Cambridge Statistical Tables we can see that 3.841 is the 5% right-tail
value for this distribution, and so P (Z 2 < 3.84) = 0.95, as before.
Since X1 /2 and X2 /2 are independent N (0, 1) random variables, the sum of their
squares will follow a χ22 distribution. Using Table 8 of the New Cambridge
Statistical Tables, we see that 9.210 is the 1% right-tail value, so the probability we
are looking for is 0.99.
P(X1² + X2² < 7.236Y − X3²) = P(X1² + X2² + X3² < 7.236Y)
= P((X1² + X2² + X3²)/Y < 7.236)
= P(((X1² + X2² + X3²)/3)/(Y/5) < (5/3) × 7.236)
= P(((X1² + X2² + X3²)/3)/(Y/5) < 12.060).
Since X12 + X22 + X32 ∼ χ23 , we have a ratio of independent χ23 and χ25 random
variables, each divided by its degrees of freedom. By definition, this follows an F3, 5
distribution. From Table A.3 of the Dougherty Statistical Tables, we see that 12.06
is the 1% upper-tail value for this distribution, so the probability we want is equal
to 0.99.
2. Suppose that we plan to take a random sample of size n from a normal distribution
with mean µ and standard deviation σ = 2.
(a) Suppose µ = 4 and n = 20.
i. What is the probability that the mean X̄ of the sample is greater than 5?
ii. What is the probability that X̄ is smaller than 3?
iii. What is P (|X̄ − µ| ≤ 1) in this case?
(b) How large should n be in order that P (|X̄ − µ| ≤ 0.5) ≥ 0.95 for every possible
value of µ?
(c) It is claimed that the true value of µ is 5 in a population. A random sample of
size n = 100 is collected from this population, and the mean for this sample is
x̄ = 5.8. Based on the result in (b), what would you conclude from this value
of X̄?
Appendix F
Estimator properties
(b) We have:
Var(X̄) = Var((1/n) Σᵢ₌₁ⁿ Xᵢ) = (1/n²) Var(X1 + X2 + · · · + Xn) = (1/n²) nσ² = σ²/n.
Show that µ̂ is a biased estimator of µ and comment briefly on the nature of the bias. Determine the bias in terms of n and µ, and suggest how the bias can be removed to create an unbiased estimator of µ.
Solution:
We have:
E(µ̂) = E((1/(n + 1)) Σᵢ₌₁ⁿ Xᵢ) = (1/(n + 1)) Σᵢ₌₁ⁿ E(Xᵢ) = nµ/(n + 1) < µ
(assuming µ > 0), so µ̂ systematically underestimates µ. The bias is E(µ̂) − µ = −µ/(n + 1), which shrinks as n increases. Multiplying µ̂ by (n + 1)/n gives (1/n) Σᵢ₌₁ⁿ Xᵢ = X̄, which is an unbiased estimator of µ.
3. Given the mean squared error of an estimator θ̂ is defined as E((θ̂ − θ)²), show that this can also be expressed as:
Var(θ̂) + (Bias(θ̂))².
Solution:
Since Var(X) = E(X²) − (E(X))², we have E(X²) = Var(X) + (E(X))². Let X = θ̂ − θ. Substituting in we get:
E((θ̂ − θ)²) = Var(θ̂ − θ) + (E(θ̂ − θ))² = Var(θ̂) + (Bias(θ̂))²
since θ is a constant, so Var(θ̂ − θ) = Var(θ̂), and E(θ̂ − θ) = Bias(θ̂). Hence:
MSE(θ̂) = E((θ̂ − θ)²) = Var(θ̂) + (Bias(θ̂))².
Solution:
No. Let θ = φ and T1 = T2 = T . Hence E(T1 T2 ) = E(T 2 ) > (E(T ))2 = θ2 = θφ.
P(X = 0) = 1 − 3α/4,  P(X = 1) = α/2  and  P(X = 2) = α/4
such that 0 < α < 4/3. One observation is taken and we want to estimate α.
Consider the estimators T1 = X and T2 = 2X(X − 1) of α.
(a) Show that T1 and T2 are both unbiased estimators of α.
Solution:
(a) We have:
E(T1) = E(X) = Σ_x x p(x) = 0 + α/2 + 2 × α/4 = α
and:
E(T2) = Σ_x 2x(x − 1) p(x) = 0 + 0 + 4 × α/4 = α.
Hence both T1 and T2 are unbiased estimators of α.
(b) Since both estimators are unbiased, we prefer the minimum variance estimator. We have:
Var(T1) = Var(X) = E(X²) − α² = 3α/2 − α²
and:
Var(T2) = E(T2²) − α² = 4α − α².
Since 3α/2 < 4α for all 0 < α < 4/3, Var(T1) < Var(T2), so we choose T1.
1. Let T1 and T2 be two unbiased estimators of the parameter θ. T1 and T2 have the
same variance and they are independent. Consider the following estimators of θ:
S = (T1 + T2)/2    and    R = 2T1 − T2.
Appendix G
Point estimation
2. Let {X1 , X2 , . . . , Xn } be a random sample from the distribution N (µ, 1). Find the
maximum likelihood estimator (MLE) of µ.
Solution:
The joint pdf of the observations is:
f(x1, x2, . . . , xn; µ) = ∏ᵢ₌₁ⁿ (1/√(2π)) exp(−(xi − µ)²/2) = (2π)^(−n/2) exp(−(1/2) Σᵢ₌₁ⁿ (xi − µ)²).
Viewed as a function of µ, the likelihood is therefore:
L(µ) = C exp(−(1/2) Σᵢ₌₁ⁿ (Xi − µ)²)
where C > 0 is a constant. The MLE µ̂ maximises this function, and also maximises the function:
l(µ) = ln L(µ) = −(1/2) Σᵢ₌₁ⁿ (Xi − µ)² + ln(C).
Therefore, the MLE effectively minimises Σᵢ₌₁ⁿ (Xi − µ)², i.e. the MLE is also the least squares estimator (LSE), i.e. µ̂ = X̄.
P(X = x) = e^−λ λ^x / x!.
The likelihood and log-likelihood functions are, respectively:
L(λ) = ∏ᵢ₌₁ⁿ e^−λ λ^Xi / Xi! = e^(−nλ) λ^(nX̄) / ∏ᵢ₌₁ⁿ Xi!
and:
l(λ) = ln L(λ) = nX̄ ln(λ) − nλ + C = n(X̄ ln(λ) − λ) + C
where C is a constant (i.e. it may depend on the Xi but cannot depend on the parameter). Setting:
(d/dλ) l(λ) = n(X̄/λ − 1) = 0
we obtain the MLE λ̂ = X̄, which is also the MME.
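The result can also be illustrated numerically. A hedged sketch in R (using simulated data purely for illustration) maximises the log-likelihood directly and compares the answer with the sample mean:

set.seed(1)
x <- rpois(100, lambda = 3)                               # simulated Poisson sample
negll <- function(lambda) -sum(dpois(x, lambda, log = TRUE))
optimize(negll, interval = c(0.01, 20))$minimum           # numerical MLE of lambda
mean(x)                                                   # analytical MLE: the sample mean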
Solution:
(a) The pdf of Uniform[0, θ] is:
f(x; θ) = 1/θ for 0 ≤ x ≤ θ, and 0 otherwise.
where X(n) = maxᵢ Xᵢ denotes the largest observation. Hence the MLE is θ̂ = X(n), which is different from the MME. For example, if x(n) = 1.16, the maximum likelihood estimate is θ̂ = 1.16.
(b) For the given data, the maximum observation is x(3) = 3.6. Therefore, the
maximum likelihood estimate is θb = 3.6.
5. Use the observed random sample x1 = 8.2, x2 = 10.6, x3 = 9.1 and x4 = 4.9 to
calculate the maximum likelihood estimate of λ in the exponential pdf:
f(x; λ) = λe^(−λx) for x ≥ 0, and 0 otherwise.
Solution:
We derive a general formula with a random sample {X1, X2, . . . , Xn} first. The joint pdf is:
f(x1, x2, . . . , xn; λ) = λⁿ e^(−λnx̄) for x1, x2, . . . , xn ≥ 0, and 0 otherwise.
Hence the log-likelihood is l(λ) = n ln(λ) − λnX̄. Setting:
(d/dλ) l(λ) = n/λ − nX̄ = 0  ⇒  λ̂ = 1/X̄.
For the given sample, x̄ = (8.2 + 10.6 + 9.1 + 4.9)/4 = 8.2. Therefore, λ̂ = 0.1220.
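As a quick numerical check, the same estimate can be obtained in R with:

x <- c(8.2, 10.6, 9.1, 4.9)
1 / mean(x)    # MLE of lambda, approximately 0.1220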
6. The following data show the number of occupants in passenger cars observed
during one hour at a busy junction. It is assumed that these data follow a
geometric distribution with pf:
p(x; π) = (1 − π)^(x−1) π for x = 1, 2, . . ., and 0 otherwise.
However, we only know that there are 678 xᵢs equal to 1, 227 xᵢs equal to 2, . . ., and 14 xᵢs equal to some integers not smaller than 6.
Note that:
P(Xᵢ ≥ 6) = Σ_(x=6)^∞ p(x; π) = π(1 − π)⁵ (1 + (1 − π) + (1 − π)² + · · · ) = π(1 − π)⁵ × 1/π = (1 − π)⁵.
Hence the likelihood function is:
L(π) = p(1, π)⁶⁷⁸ p(2, π)²²⁷ p(3, π)⁵⁶ p(4, π)²⁸ p(5, π)⁸ ((1 − π)⁵)¹⁴
= π^(1,011−14) (1 − π)^(227+56×2+28×3+8×4+14×5)
= π⁹⁹⁷ (1 − π)⁵²⁵
hence:
l(π) = ln L(π) = 997 ln(π) + 525 ln(1 − π).
Setting:
(d/dπ) l(π) = 997/π̂ − 525/(1 − π̂) = 0  ⇒  π̂ = 997/(997 + 525) = 0.655.
Remark: Since P (Xi = 1) = π, πb = 0.655 indicates that about 2/3 of cars have only
one occupant. Note E(Xi ) = 1/π. In order to ensure that the average number of
occupants is not smaller than k, we require π < 1/k.
Solution:
The mean of the Uniform[a, b] distribution is (a + b)/2. In our case, this gives
E(X) = (−θ + θ)/2 = 0. The first population moment does not depend on θ, so we
need to move to the next (i.e. second) population moment.
Recall that the variance of the Uniform[a, b] distribution is (b − a)²/12. Hence the second population moment is:
E(X²) = Var(X) + (E(X))² = (2θ)²/12 = θ²/3.
Equating this to the second sample moment, (1/n) Σᵢ Xᵢ², and solving for θ gives the method of moments estimator θ̂_MM = √(3 Σᵢ Xᵢ²/n).
Solution:
The point estimate is:
θ̂_MM = √((3/8) Σᵢ₌₁⁸ xᵢ²) ≈ 2.518
which implies that the data came from a Uniform[−2.518, 2.518] distribution.
However, this clearly cannot be true since the observation x5 = 2.8 falls outside this
range! The method of moments does not take into account that all of the
observations need to lie in the interval [−θ, θ], and so it fails to produce a useful
estimate.
y 1 = α + β + ε1
y2 = −α + β + ε2
y 3 = α − β + ε3
y4 = −α − β + ε4 .
Appendix H
Analysis of variance (ANOVA)
Solution:
(a) The means are 440/5 = 88, 630/7 = 90 and 690/10 = 69. We will perform a
one-way ANOVA. First, we calculate the overall mean. This is (440 + 630 + 690)/(5 + 7 + 10) = 1,760/22 = 80.
(b) As 5.56 > 3.52 = F0.05, 2, 19 , which is the top 5th percentile of the F2, 19
distribution (interpolated from Table A.3 of the Dougherty Statistical Tables),
we reject H0 : µ1 = µ2 = µ3 and conclude that there is evidence that the means
are not equal.
(c) We have:
90 − 69 ± 2.093 × √(200.53 × (1/7 + 1/10)) = 21 ± 14.61.
Here 2.093 is the top 2.5th percentile point of the t distribution with 19
degrees of freedom. A 95% confidence interval is (6.39, 35.61). As zero is not
included, there is evidence of a difference.
2. The total times spent by three basketball players on court were recorded. Player A
was recorded on three occasions and the times were 29, 25 and 33 minutes. Player
B was recorded twice and the times were 16 and 30 minutes. Player C was recorded
on three occasions and the times were 12, 14 and 16 minutes. Use analysis of
variance to test whether there is any difference in the average times the three
players spend on court.
Solution:
We have x̄·A = 29, x̄·B = 23, x̄·C = 14 and x̄ = 21.875. Hence:
Source DF SS MS F p-value
Players 2 340.875 170.4375 6.175 ≈ 0.045
Error 5 138 27.6
Total 7 478.875
We test H0 : µA = µB = µC (i.e. the average times they play are the same) vs. H1 : The average times they play are not the same.
As 6.175 > 5.79 = F0.05, 2, 5, which is the top 5th percentile of the F2, 5 distribution, we reject H0 and conclude that there is evidence of a difference between the means.
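The same analysis can be reproduced in R. A minimal sketch using the recorded times is:

times  <- c(29, 25, 33, 16, 30, 12, 14, 16)
player <- factor(rep(c("A", "B", "C"), times = c(3, 2, 3)))
summary(aov(times ~ player))    # should reproduce the one-way ANOVA table above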
Solution:
We will perform a one-way ANOVA. First we calculate the overall mean:
(4 × 24 + 6 × 20 + 5 × 18)/15 = 20.4.
We can now calculate the sum of squares between groups:
4(24 − 20.4)² + 6(20 − 20.4)² + 5(18 − 20.4)² = 81.6.
Hence the ANOVA table is:
Source DF SS MS F p-value
Sample 2 81.6 40.8 1.229 ≈ 0.327
Error 12 398.4 33.2
Total 14 480
As 1.229 < 3.89 = F0.05, 2, 12 , which is the top 5th percentile of the F2, 12
distribution, we see that there is no evidence that the means are not equal.
4. Four suppliers were asked to quote prices for seven different building materials. The
average quote of supplier A was 1,315.8. The average quotes of suppliers B, C and
D were 1,238.4, 1,225.8 and 1,200.0, respectively. The following is the calculated
two-way ANOVA table with some entries missing.
Source      DF   SS        MS       F   p-value
Materials                  17,800
Suppliers
Error
Total            358,700
(a) Complete the table using the information provided above.
(b) Is there a significant difference between the quotes of different suppliers?
Explain your answer.
(c) Construct a 90% confidence interval for the difference between suppliers A and
D. Would you say there is a difference?
Solution:
(a) The average quote of all suppliers is:
(1,315.8 + 1,238.4 + 1,225.8 + 1,200.0)/4 = 1,245.
Hence the sum of squares (SS) due to suppliers is:
7 × ((1,315.8 − 1,245)² + (1,238.4 − 1,245)² + (1,225.8 − 1,245)² + (1,200.0 − 1,245)²) = 52,148.88.
The F values are 17,800/11,097.28 = 1.604 and 17,382.96/11,097.28 = 1.567
for materials and suppliers, respectively. The two-way ANOVA table is:
Source DF SS MS F p-value
Materials 6 106,800 17,800 1.604 ≈ 0.203
Suppliers 3 52,148.88 17,382.96 1.567 ≈ 0.232
Error 18 199,751.12 11,097.28
Total 27 358,700
(b) We test H0 : µ1 = µ2 = µ3 = µ4 (i.e. there is no difference between suppliers)
vs. H1 : There is a difference between suppliers. The F value is 1.567 and at a
5% significance level the critical value from Table A.3 of the Dougherty
Statistical Tables (degrees of freedom 3 and 18) is 3.16, hence we do not reject
H0 and conclude that there is not enough evidence that there is a difference.
(c) The top 5th percentile of the t distribution with 18 degrees of freedom is 1.734
and the MS value is 11,097.28. So a 90% confidence interval is:
1,315.8 − 1,200 ± 1.734 × √(11,097.28 × (1/7 + 1/7)) = 115.8 ± 97.64
giving (18.16, 213.44). Since zero is not in the interval, there appears to be a
difference between suppliers A and D.
Source DF SS MS F p-value
Drinker 1.56
Beer 303.5
Error 695.6
Total
giving (6.91, 26.09). As the interval does not contain zero, there is evidence of
a difference between the effects of beers C and D.
A B C D E
Early shift 102 93 85 110 72
Late shift 85 87 71 92 73
Night shift 75 80 75 77 76
Solution:
Here r = 3 and c = 5. We may obtain the two-way ANOVA table as follows:
Source DF SS MS F
Shift 2 652.13 326.07 5.62
Plant 4 761.73 190.43 3.28
Error 8 463.87 57.98
Total 14 1,877.73
Under the null hypothesis of no shift effect, F ∼ F2, 8 . Since F0.05, 2, 8 = 4.46 < 5.62,
we can reject the null hypothesis at the 5% significance level. (Note the p-value
= 0.030.)
Under the null hypothesis of no plant effect, F ∼ F4, 8 . Since F0.05, 4, 8 = 3.84 > 3.28,
we cannot reject the null hypothesis at the 5% significance level. (Note the p-value
= 0.072.)
Overall, the data collected show some evidence of a shift effect but little evidence
of a plant effect.
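The table can be reproduced in R. A sketch using the data above follows (the response is simply called y here, since the original variable name is not given in the extract):

y     <- c(102, 93, 85, 110, 72,  85, 87, 71, 92, 73,  75, 80, 75, 77, 76)
shift <- factor(rep(c("Early", "Late", "Night"), each = 5))
plant <- factor(rep(c("A", "B", "C", "D", "E"), times = 3))
summary(aov(y ~ shift + plant))    # two-way ANOVA: shift and plant effects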
7. Complete the two-way ANOVA table below. In the places of p-values, indicate in
the form such as ‘< 0.01’ appropriately and use the closest value which you may
find from the Dougherty Statistical Tables.
Source DF SS MS F p-value
Row factor 4 ? 234.23 ? ?
Column factor 6 270.84 45.14 1.53 ?
Error ? 708.00 ?
Total 34 1,915.76
Solution:
First, the row factor SS = (row factor MS) × 4 = 234.23 × 4 = 936.92.
The degrees of freedom for Error is 34 − 4 − 6 = 24. Therefore, Error MS = 708.00/24 = 29.5.
Hence the F statistic for testing no row factor effect is 234.23/29.5 = 7.94. From Table A.3 of the Dougherty Statistical Tables, F0.001, 4, 24 = 6.59 < 7.94. Therefore, the corresponding p-value is smaller than 0.001.
Since F0.05, 6, 24 = 2.51 > 1.53, the p-value for testing the column factor effect is greater than 0.05.
The complete ANOVA table is as follows:
Source          DF    SS         MS       F      p-value
Row factor       4    936.92     234.23   7.94   < 0.001
Column factor    6    270.84      45.14   1.53   > 0.05
Error           24    708.00      29.5
Total           34  1,915.76
H.2 Practice questions
(a) Based on these data, can we infer at the 5% significance level that the
population mean expenditures on prepared frozen meals are the same for the
three different income groups?
(b) Produce a one-way ANOVA table.
(c) Construct 95% confidence intervals for the mean expenditures of the first
(under $15,000) and the third (over $30,000) income groups.
2. Does the level of success of publicly-traded companies affect the way their board
members are paid? The annual payments (in $000s) of randomly selected
publicly-traded companies to their board members were recorded. The companies
were divided into four quarters according to the returns in their stocks, and the
payments from each quarter were grouped together. Some summary statistics are
provided below.
Descriptive Statistics: 1st quarter, 2nd quarter, 3rd quarter, 4th quarter
Appendix I
Solutions to Practice questions
(b) Use the result that if X ⊂ Y then P (X) ≤ P (Y ) for events X and Y .
Since A ⊂ A ∪ B and B ⊂ A ∪ B, we have P (A) ≤ P (A ∪ B) and
P (B) ≤ P (A ∪ B).
Adding these inequalities, P(A) + P(B) ≤ 2P(A ∪ B), so:
(P(A) + P(B))/2 ≤ P(A ∪ B).
Similarly, since A ∩ B ⊂ A and A ∩ B ⊂ B, we have P(A ∩ B) ≤ P(A) and P(A ∩ B) ≤ P(B), and adding these gives:
P(A ∩ B) ≤ (P(A) + P(B))/2.
(b) To show that X c and Y c are not necessarily mutually exclusive when X and Y
are mutually exclusive, the best approach is to find a counterexample.
Attempts to ‘prove’ the result directly are likely to be logically flawed.
Look for a simple example. Suppose we roll a die. Let X = {6} be the event of
obtaining a 6, and let Y = {5} be the event of obtaining a 5. Obviously X and
Y are mutually exclusive, but X c = {1, 2, 3, 4, 5} and Y c = {1, 2, 3, 4, 6} have
X c ∩ Y c 6= ∅, so X c and Y c are not mutually exclusive.
I.2. Appendix B – Discrete probability distributions
(c) We have:
P(X > 4) = 1 − (P(X = 0) + P(X = 1) + P(X = 2) + P(X = 3) + P(X = 4))
= 1 − e^−3.2 ((3.2)⁰/0! + (3.2)¹/1! + (3.2)²/2! + (3.2)³/3! + (3.2)⁴/4!)
= 0.2194.
2. (a) If we assume that the calving process is random (as the remark about
seasonality hints) then we are counting events over periods of time (with, in
particular, no obvious upper maximum), and hence the appropriate probability
distribution is the Poisson distribution.
(b) The rate parameter for one week is 0.4, so for two weeks we use λ = 0.8, hence:
e−0.8 × (0.8)0
P (X = 0) = = e−0.8 = 0.4493.
0!
(c) If it is correct to use the Poisson distribution then events are independent, and
so:
P (none in weeks 1 & 2) = P (none in weeks 3 & 4) = · · · = 0.4493.
(f) The fact that the results are identical in the two cases is a consequence of the
independence built into the assumption that the Poisson distribution is the
appropriate one to use. A Poisson process does not ‘remember’ what happened
before the start of a period under consideration.
R=r 0 1 2 3 4 Total
P (R = r) 0.4096 0.4096 0.1536 0.0256 0.0016 1
T =t 0 1 4 Total
P (T = t) 0.1536 0.4352 0.4112 1
Note E(T²) = 7.0144, hence Var(T) = E(T²) − µ_T² = 7.0144 − (2.08)² = 2.6880. Hence the standard deviation is σ = √2.6880 = 1.6395.
2. (a) With a population of just 20, and 40% of them, i.e. 8, being EMFSS FC
supporters, the probability of success (a ‘yes’ response) changes each time a
respondent is asked, and the change depends on the respondent’s answer. For
instance:
• The probability is π1 = 0.4 that the first respondent supports EMFSS FC.
• If the first respondent supports EMFSS FC, the probability is
π2|1=‘yes’ = 7/19 = 0.3684 that the second respondent supports EMFSS FC
too.
• If the first respondent does not support EMFSS FC, the probability is
π2|1=‘no’ = 8/19 = 0.4211 that the second respondent supports EMFSS FC.
This means that we cannot assume that the probability of a ‘yes’ is (virtually)
the same from one respondent to the next. Therefore, we cannot regard the
successive questions to different people as being successive identical Bernoulli
trials. Equivalently, we must treat the problem as if we were using sampling
without replacement.
So, if ‘S’ means ‘supports EMFSS FC’ and ‘S^c’ means ‘does not support
EMFSS FC’, there are four possible sequences of responses which give exactly
three Ss, and we want the sum of their separate probabilities. So we have:

P(SSSS^c) = 8/20 × 7/19 × 6/18 × 12/17 = (12 × 8 × 7 × 6)/(20 × 19 × 18 × 17)

P(SSS^cS) = 8/20 × 7/19 × 12/18 × 6/17 = (12 × 8 × 7 × 6)/(20 × 19 × 18 × 17)

P(SS^cSS) = 8/20 × 12/19 × 7/18 × 6/17 = (12 × 8 × 7 × 6)/(20 × 19 × 18 × 17)

P(S^cSSS) = 12/20 × 8/19 × 7/18 × 6/17 = (12 × 8 × 7 × 6)/(20 × 19 × 18 × 17).
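As a quick numerical check (not part of the solution), the following Python sketch sums the four sequence probabilities and compares the total with the corresponding hypergeometric probability of obtaining exactly three supporters in a sample of four:

from math import comb

p_sequence = (12 * 8 * 7 * 6) / (20 * 19 * 18 * 17)   # each ordering has the same probability
p_three = 4 * p_sequence                               # four orderings give exactly three Ss
p_hypergeom = comb(8, 3) * comb(12, 1) / comb(20, 4)   # sampling without replacement
print(round(p_three, 4), round(p_hypergeom, 4))        # both equal 0.1387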
(b) In this case we can assume that the probability of an ‘S’ is (for practical
purposes) constant, hence we have a sequence of independent and identical
Bernoulli trials (and/or that we are using sampling with replacement).
Therefore, we can assume that, if X is the random variable which counts the
number of Ss out of forty people surveyed, then X ∼ Bin(40, 0.4). Hence:
P(X = 20) = (40 choose 20) × (0.4)^20 × (0.6)^20 = 0.0554.
(c) With n = 100 (and a ‘large’ population) we know that we can use the normal
approximation to X ∼ Bin(100, 0.4). Since E(X) = 40 and Var(X) = 24, we
can approximate X with Y , where Y ∼ N (40, 24). Hence (using the continuity
correction):
P(X ≥ 30) = P(Y ≥ 29.5) = P(Z ≥ (29.5 − 40)/√24)
          = P(Z ≥ −2.14)
          = 1 − P(Z ≥ 2.14)
          = 0.98382.
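If you have access to Python with SciPy installed (this is an assumption, and the sketch is purely illustrative), the approximation in (c) can be compared with the exact binomial tail probability:

from scipy.stats import binom, norm

n, pi = 100, 0.4
# normal approximation with continuity correction: P(Y >= 29.5), Y ~ N(40, 24)
approx = norm.sf(29.5, loc=n * pi, scale=(n * pi * (1 - pi)) ** 0.5)
exact = binom.sf(29, n, pi)   # P(X >= 30) = P(X > 29) for X ~ Bin(100, 0.4)
print(round(approx, 4), round(exact, 4))   # approximately 0.9840 and 0.9852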
(d) The only approximation used here is the normal approximation to the
binomial, used in (b) and (c), and it is justified because:
• n is ‘large’ (although whether n = 40 in (b) can be considered large is
debatable)
• the population is large enough to justify using the binomial in the first
place
• nπ > 5 and n(1 − π) > 5 in each case.
(e) Three possible comments are the following.
• For (a), if we had (wrongly) used Bin(4, 0.4) then we would have obtained
P (X = 3) = 0.1536, which is quite a long way from the true value (roughly
11%, proportionally) – we might have reached the wrong conclusion.
• For (b) and (c) it was important, to justify using the binomial, that the
population was ‘large’.
• The true value (to 4 decimal places) for (c) is 0.9852, so the
approximation obtained is pretty good – it is very unlikely that we might
have reached the wrong conclusion.
Note that we have assumed that daily newspaper sales are independent, which we
were not told explicitly in the question. We therefore need to (i) consider whether
independence is a reasonable assumption in the given context, and (ii) state clearly
that we have assumed it.
I.4. Appendix D – Multivariate random variables
(d) Let s be the required stock, then we require P(X > s) = 0.1. Hence:

P(Z > (s − 350)/30) = 0.1

⇒ (s − 350)/30 ≥ 1.28

⇒ s ≥ 350 + 1.28 × 30 = 388.4.
(d) Even though the distributions of X and X | Y = 1 are the same, X and Y are
not independent. For example, P(X = 0, Y = 0) = 0 although P(X = 0) ≠ 0
and P(Y = 0) ≠ 0.
2. (a) We have:
P(X = 2 | Y < 3) = P(X = 2, Y < 3)/P(Y < 3)
                 = (P(X = 2, Y = 0) + P(X = 2, Y = 2))/(P(Y = 0) + P(Y = 2))
                 = (0.10 + 0.16)/(0.30 + 0.30)
                 = 0.4333.
         U = 0   U = 2
V = 0    0.10    0.20
V = 2    0.16    0.14
V = 4    0.20    0.20

We then have:

E(U) = 0 × 0.46 + 2 × 0.54 = 1.08  and  E(V) = 0 × 0.30 + 2 × 0.30 + 4 × 0.40 = 2.20

and:

E(UV) = 2 × 2 × 0.14 + 2 × 4 × 0.20 = 2.16.

Hence:

Cov(U, V) = E(UV) − E(U) E(V) = 2.16 − 1.08 × 2.20 = −0.216.
I.5. Appendix E – Sampling distributions of statistics
2. (a) Let {X1 , X2 , . . . , Xn } denote the random sample. We know that the sampling
distribution of X̄ is N(µ, σ²/n), here N(4, 2²/20) = N(4, 0.2).
i. The probability we need is:

P(X̄ > 5) = P((X̄ − 4)/√0.2 > (5 − 4)/√0.2) = P(Z > 2.24) = 0.0126
where, as usual, Z ∼ N (0, 1).
ii. P (X̄ < 3) is obtained similarly. Note that this leads to
P (Z < −2.24) = 0.0126, which is equal to the P (X̄ > 5) = P (Z > 2.24)
result obtained above. This is because 5 is one unit above the mean µ = 4,
and 3 is one unit below the mean, and because the normal distribution is
symmetric around its mean.
iii. One way of expressing this is:
P (X̄ − µ > 1) = P (X̄ − µ < −1) = 0.0126
for µ = 4. This also shows that:
P (X̄ − µ > 1) + P (X̄ − µ < −1) = P (|X̄ − µ| > 1) = 2 × 0.0126 = 0.0252
and hence:
P (|X̄ − µ| ≤ 1) = 1 − 2 × 0.0126 = 0.9748.
In other words, the probability is 0.9748 that the sample mean is within
one unit of the true population mean, µ = 4.
(b) We can use the same ideas as in (a). Since X̄ ∼ N (µ, 4/n) we have:
P(|X̄ − µ| ≤ 0.5) = 1 − 2 × P(X̄ − µ > 0.5)
                  = 1 − 2 × P((X̄ − µ)/√(4/n) > 0.5/√(4/n))
                  = 1 − 2 × P(Z > 0.25√n)
                  ≥ 0.95

which holds if:

P(Z > 0.25√n) ≤ 0.05/2 = 0.025.
From Table 4 of the New Cambridge Statistical Tables, we see that this is true
when 0.25√n ≥ 1.96, i.e. when n ≥ (1.96/0.25)² = 61.5. Rounding up to the
nearest integer, we get n ≥ 62. The sample size should be at least 62 for us to
be 95% confident that the sample mean will be within 0.5 units of the true
mean, µ.
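A quick check of this sample size calculation in Python (standard library only, purely illustrative):

from math import ceil

z = 1.96        # upper 2.5% point of N(0, 1), from the tables
sigma = 2.0     # population standard deviation, since the variance is 4
n = ceil((z * sigma / 0.5) ** 2)   # smallest n with 0.25*sqrt(n) >= 1.96
print(n)        # 62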
(c) Here n > 62, yet x̄ is further than 0.5 units from the claimed mean of µ = 5.
Based on the result in (b), this would be quite unlikely if µ is really 5. One
explanation of this apparent contradiction is that µ is not really equal to 5.
This kind of reasoning will be the basis of statistical hypothesis testing, which
will be discussed later in the course.
3. (a) The sample average is composed of 25 randomly sampled data which are
subject to sampling variability, hence the average is also subject to this
variability. Its sampling distribution describes its probability properties. If a
large number of such averages were independently sampled, then their
histogram would approximate the sampling distribution.
(b) It is reasonable to assume that this sampling distribution is normal due to the
CLT, although the sample size is rather small. If n = 25 and µ = 54 and
σ = 10, then the CLT says that:
X̄ ∼ N(µ, σ²/n) = N(54, 100/25).
(c) i. We have:
P(X̄ > 60) = P(Z > (60 − 54)/√(100/25)) = P(Z > 3) = 0.0013
Var(S) = (Var(T1) + Var(T2))/4 = σ²/2
I.7. Appendix G – Point estimation
and:
Var(R) = 4Var(T1) + Var(T2) = 5σ².
Since Var(S) < Var(T1) < Var(R), and given that they are all unbiased
estimators, S is the best estimator and R is the worst estimator: for unbiased
estimators the MSE equals the variance, so the ranking by MSE is the same.
2. (a) We have:
(b) We know E(S 2 ) = σ 2 , so that, by combining this with the above, it follows
that:
E(X̄² − S²/n) = E(X̄²) − E(S²)/n = (σ²/n + µ²) − σ²/n = µ².
Hence X̄ 2 − S 2 /n is an unbiased estimator of µ2 .
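This unbiasedness can also be illustrated by simulation. The following Python sketch (standard library only, with made-up values µ = 3 and σ = 2) is purely illustrative:

import random
from statistics import mean, variance

random.seed(1)
mu, sigma, n = 3.0, 2.0, 10
estimates = []
for _ in range(20000):
    x = [random.gauss(mu, sigma) for _ in range(n)]
    # statistics.variance uses the n - 1 denominator, i.e. the sample variance S^2
    estimates.append(mean(x) ** 2 - variance(x) / n)
print(round(mean(estimates), 2))   # close to mu^2 = 9.0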
3. We have:
E(X) = E(X1/2 + X2/2) = (1/2) × E(X1) + (1/2) × E(X2) = (1/2) × µ + (1/2) × µ = µ

and:

E(Y) = E(X1/3 + 2X2/3) = (1/3) × E(X1) + (2/3) × E(X2) = (1/3) × µ + (2/3) × µ = µ.
d l(λ)/dλ = n/λ − nX̄ = 0  ⇒  λ̂ = 1/X̄.
Checking the second-order condition:

d²l(λ)/dλ² = −n/λ² < 0

so λ̂ = 1/X̄ does indeed maximise the log-likelihood.
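The following Python sketch (standard library only, with simulated data, purely illustrative) shows numerically that the log-likelihood n log λ − nλX̄ (up to a constant) is largest at λ = 1/X̄:

import math, random

random.seed(1)
x = [random.expovariate(2.0) for _ in range(200)]   # simulated data, true rate 2
xbar = sum(x) / len(x)

def loglik(lam):
    # log-likelihood for the rate parameter, up to an additive constant
    return len(x) * math.log(lam) - lam * len(x) * xbar

grid = [0.5 + 0.01 * k for k in range(300)]
best = max(grid, key=loglik)
print(round(best, 2), round(1 / xbar, 2))   # the two values should agree closely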
S = Σ_{i=1}^{4} ε_i² = (y1 − α − β)² + (y2 + α − β)² + (y3 − α + β)² + (y4 + α + β)².
∂S/∂α = −2(y1 − α − β) + 2(y2 + α − β) − 2(y3 − α + β) + 2(y4 + α + β)
       = −2(y1 − y2 + y3 − y4) + 8α

and:

∂S/∂β = −2(y1 − α − β) − 2(y2 + α − β) + 2(y3 − α + β) + 2(y4 + α + β)
      = −2(y1 + y2 − y3 − y4) + 8β.
Setting both partial derivatives equal to zero and solving gives:

α̂ = (y1 − y2 + y3 − y4)/4  and  β̂ = (y1 + y2 − y3 − y4)/4.
(b) α̂ is an unbiased estimator of α since:

E(α̂) = E((y1 − y2 + y3 − y4)/4) = (α + β + α − β + α − β + α + β)/4 = α.
(c) We have:
Var(α̂) = Var((y1 − y2 + y3 − y4)/4) = 4σ²/16 = σ²/4.
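The closed-form estimators can be checked against a general least-squares solve. A sketch in Python (assuming NumPy is available; the data values are made up and purely illustrative):

import numpy as np

y = np.array([2.0, -1.0, 3.0, 0.5])    # made-up observations y1, ..., y4
# design matrix implied by the four residuals above; columns correspond to alpha and beta
X = np.array([[1, 1], [-1, 1], [1, -1], [-1, -1]], dtype=float)
alpha_hat, beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
closed_alpha = (y[0] - y[1] + y[2] - y[3]) / 4
closed_beta = (y[0] + y[1] - y[2] - y[3]) / 4
print(alpha_hat, closed_alpha)   # both 1.375
print(beta_hat, closed_beta)     # both -0.625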
I.8. Appendix H – Analysis of variance (ANOVA)
X̄·j ± t_{0.025, n−k} × S/√n_j = X̄·j ± t_{0.025, 116} × 15.09/√30 = X̄·j ± 5.46.
Appendix J
Formula sheet in the examination
Discrete distributions:
Distribution   p(x)                                                  E(X)         Var(X)
Uniform        1/k for x = 1, 2, . . . , k                           (k + 1)/2    (k² − 1)/12
Binomial       C(n, x) π^x (1 − π)^{n−x} for x = 0, 1, 2, . . . , n  nπ           nπ(1 − π)
Geometric      (1 − π)^{x−1} π for x = 1, 2, . . .                   1/π          (1 − π)/π²
Poisson        e^{−λ} λ^x / x! for x = 0, 1, 2, . . .                λ            λ

Continuous distributions:

Distribution   f(x)                                       F(x)                      E(X)    Var(X)
Exponential    λ e^{−λx} for x ≥ 0                        1 − e^{−λx} for x ≥ 0     1/λ     1/λ²
Normal         (1/√(2πσ²)) e^{−(x−µ)²/(2σ²)} for all x                              µ       σ²
One-way ANOVA:
Total variation: Σ_{j=1}^{k} Σ_{i=1}^{n_j} (X_{ij} − X̄)² = Σ_{j=1}^{k} Σ_{i=1}^{n_j} X_{ij}² − nX̄².

Between-treatments variation: B = Σ_{j=1}^{k} n_j (X̄_{·j} − X̄)² = Σ_{j=1}^{k} n_j X̄_{·j}² − nX̄².

Within-treatments variation: W = Σ_{j=1}^{k} Σ_{i=1}^{n_j} (X_{ij} − X̄_{·j})² = Σ_{j=1}^{k} Σ_{i=1}^{n_j} X_{ij}² − Σ_{j=1}^{k} n_j X̄_{·j}².
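As an illustrative aside (not part of the formula sheet), the decomposition Total = B + W can be verified numerically with a short Python sketch using made-up data:

# made-up treatment groups, purely for illustration
groups = [[5.0, 6.0, 7.0], [8.0, 9.0], [4.0, 5.0, 6.0, 5.0]]
n = sum(len(g) for g in groups)
grand = sum(x for g in groups for x in g) / n
total = sum((x - grand) ** 2 for g in groups for x in g)
between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
print(round(total, 4), round(between + within, 4))   # the two values are equal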
Two-way ANOVA:
Total variation: Σ_{i=1}^{r} Σ_{j=1}^{c} (X_{ij} − X̄)² = Σ_{i=1}^{r} Σ_{j=1}^{c} X_{ij}² − rcX̄².

Between-blocks (rows) variation: B_row = c Σ_{i=1}^{r} (X̄_{i·} − X̄)² = c Σ_{i=1}^{r} X̄_{i·}² − rcX̄².

Between-treatments (columns) variation: B_col = r Σ_{j=1}^{c} (X̄_{·j} − X̄)² = r Σ_{j=1}^{c} X̄_{·j}² − rcX̄².
Appendix K
Sample examination paper
SECTION A
Answer all parts of Question 1 (40 marks).
1. (a) For each one of the statements below say whether the statement is true or
false, explaining your answer. Note that A and B (and its complement B^c) are
events, while X and Y are random variables.
P(A | B^c) > P(A)
iii. If X ∼ Bin(n = 20, π = 0.50), then E(X) is less than the median of X.
(b) A box contains 12 light bulbs, of which two are defective. If a person selects 7
light bulbs at random, without replacement, what is the probability that both
defective light bulbs will be selected?
(5 marks)
Let Y denote the number out of the 60 observations which are in the interval
(0.75, 2.75). Calculate E(Y ) and Var(Y ).
(7 marks)
i. Sketch the above probability density function. The sketch can be drawn
on ordinary paper – no graph paper needed.
(3 marks)
ii. Derive the cumulative distribution function of X.
(6 marks)
iii. Calculate the lower quartile of X, denoted Q1 .
(2 marks)
iv. What is the mode of X? Briefly justify your answer.
(2 marks)
SECTION B
Answer all three questions from this section (60 marks in total).
2. (a) A random variable X can take the values −1, 0 and 1. We know that
β̂1 = X  and  β̂2 = X²/5.
(b) X1 , X2 , X3 and X4 are independent normal random variables with mean 0 and
variance 36. Using the nearest values in the statistical tables provided,
calculate the following probabilities:
(3 marks)
iii. P(X1 < 7.02 √(X2² + X3²)).
(4 marks)
3. (a) Suppose that you are given observations y1 , y2 and y3 such that:
y1 = α − β + ε1
y2 = α + 2β + ε2
y3 = −α − β + ε3 .
4. (a) A cinema chain decided to analyse their sales in five countries over a period of
twelve months. Annual sales (i.e. in total across the year, in $ millions) for the
five countries were 71.7, 78.6, 80.1, 81.9 and 89.7. The following is the
calculated ANOVA table with some entries missing.
Source    Degrees of freedom    Sum of squares    Mean square    F-value
Country
Month 0.915
Error 42.236
Total
Write out the full two-way ANOVA table using the information provided
above, showing your working.
(8 marks)
(b) Consider two random variables X and Y , where X can take the values −1, 0
and 1, and Y can take the values 0 and 1. You are provided with the following
information:
P (X = 0) = 0.2, P (X = 1) = 0.5
also:
P (Y = 1 | X = −1) = 0.8
P (Y = 1 | X = 0) = 0.5
P (Y = 1 | X = 1) = 0.6.
[END OF PAPER]
Appendix L
Sample examination paper – Solutions
Section A
(b) The sample space consists of all (unordered) subsets of 7 out of the 12 light
bulbs in the box. There are C(12, 7) such subsets. The number of subsets which
contain the two defective bulbs is the number of subsets of size 5 out of the
other 10 bulbs, C(10, 5), so the probability we want is:

C(10, 5)/C(12, 7) = (7 × 6)/(12 × 11) = 0.3182.
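A one-line check in Python (standard library only, purely illustrative):

from math import comb
print(round(comb(10, 5) / comb(12, 7), 4))   # 0.3182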
(c) We have:

0.40 = P(X < 4) = P(Z < (4 − 8)/σ) = P(Z < −4/σ)

where Z ∼ N(0, 1). Since P(Z > 0.255) = P(Z < −0.255) ≈ 0.40, we have:

−0.255 = −4/σ  ⇒  σ = 15.686

so, approximately, Var(X) = (15.686)² = 246.05.
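A sketch of the same calculation in Python (assuming SciPy is installed, and using the exact 0.40-quantile of N(0, 1) rather than the rounded table value 0.255):

from scipy.stats import norm

z = norm.ppf(0.40)            # approximately -0.2533
sigma = (4 - 8) / z
print(round(sigma, 3), round(sigma ** 2, 2))   # approximately 15.79 and 249.28

The slight difference from the figures above comes only from the rounding of the tabulated quantile.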
(e) i. We have:
ii. We determine the cdf by integrating the pdf over the appropriate range,
hence:
         0                       for x < 0
F(x) =   x²/16                   for 0 ≤ x < 2
         x/3 − x²/48 − 1/3       for 2 ≤ x ≤ 8
         1                       for x > 8.

This results from the following calculations. Firstly, for x < 0, we have:

F(x) = ∫_{−∞}^{x} f(t) dt = ∫_{−∞}^{x} 0 dt = 0.

For 2 ≤ x ≤ 8, we have:

F(x) = ∫_{−∞}^{x} f(t) dt = ∫_{−∞}^{0} 0 dt + ∫_{0}^{2} (t/8) dt + ∫_{2}^{x} ((8 − t)/24) dt

     = 0 + [t²/16]_{0}^{2} + [t/3 − t²/48]_{2}^{x}

     = 1/4 + (x/3 − x²/48) − (2/3 − 1/12)

     = x/3 − x²/48 − 1/3.
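The piecewise cdf can be checked numerically. The following Python sketch (standard library only, purely illustrative) integrates the pdf implied by the cdf above, f(t) = t/8 on [0, 2) and (8 − t)/24 on [2, 8], with a fine Riemann sum:

def f(t):
    return t / 8 if t < 2 else (8 - t) / 24

def F(x):
    return x ** 2 / 16 if x < 2 else x / 3 - x ** 2 / 48 - 1 / 3

for x in (1.0, 4.0, 8.0):
    steps = 100000
    riemann = sum(f(k * x / steps) * (x / steps) for k in range(steps))
    print(round(riemann, 4), round(F(x), 4))   # the two columns agree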
Section B
2. (a) i. We have:
E(β̂1) = E(X) = Σ_{∀x} x p(x) = −2β + 0 + 3β = β

and:

E(β̂2) = Σ_{∀x} (x²/5) p(x) = 2β/5 + 0 + 3β/5 = β.
Therefore, both estimators are unbiased estimators of β.
ii. Since both estimators are unbiased, we prefer the estimator with the smaller
variance (and hence the smaller mean squared error). We have:

Var(β̂1) = E(β̂1²) − β² = 5β − β²

and:

Var(β̂2) = E(β̂2²) − β² = β/5 − β².

Hence Var(β̂2) < Var(β̂1), so we prefer β̂2.
X1 + X2 + X3 ∼ N (0, 108).
Hence:
P(X1 > −X2 − X3 + 22) = P(X1 + X2 + X3 > 22) = P(Z > 22/√108)
                       ≈ P(Z > 2.12)
                       = 0.0170.
ii. Xi/6 ∼ N(0, 1), and so Xi²/36 ∼ χ²_1. Hence Σ_{i=1}^{4} Xi²/36 ∼ χ²_4.
Therefore:

P(Σ_{i=1}^{4} Xi² > 342) = P(Σ_{i=1}^{4} Xi²/36 > 342/36) = P(X > 9.5) ≈ 0.05

where X ∼ χ²_4.
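A sketch of this tail probability in Python (assuming SciPy is installed, purely illustrative):

from scipy.stats import chi2
print(round(chi2.sf(342 / 36, df=4), 3))   # approximately 0.05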
iii. We have:

P(X1 < 7.02 √(X2² + X3²)) = P( (X1/6)/√((X2² + X3²)/(36 × 2)) < √2 × 7.02 )
                          = P(T < 9.928)
                          ≈ 0.995

where T ∼ t_2.
and:
∂S/∂β = 2(y1 − α + β) − 4(y2 − α − 2β) + 2(y3 + α + β)
      = 2(2α + 6β + (y1 − 2y2 + y3)).

The estimators α̂ and β̂ are the solutions of the equations:

∂S/∂α = 0  and  ∂S/∂β = 0.

Hence:

3α̂ + 2β̂ = y1 + y2 − y3  and  2α̂ + 6β̂ = −y1 + 2y2 − y3.

Solving yields:

α̂ = (4y1 + y2 − 2y3)/7  and  β̂ = (−5y1 + 4y2 − y3)/14.
Estimating the first population moment with the first sample moment, we
get:
θ̂/(θ̂ + 1) = X̄  ⇒  θ̂ = X̄/(1 − X̄).
ii. Due to independence, the likelihood function is:
L(θ) = ∏_{i=1}^{n} θ X_i^{θ−1} = θ^n ∏_{i=1}^{n} X_i^{θ−1}.
The log-likelihood function is:
l(θ) = log L(θ) = n log θ + (θ − 1) Σ_{i=1}^{n} log X_i.
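The stationary point of this log-likelihood, obtained by setting its derivative n/θ + Σ log X_i to zero, is θ = −n/Σ log X_i. The following Python sketch (standard library only, with simulated data, purely illustrative) confirms numerically that this is where the log-likelihood is largest:

import math, random

random.seed(2)
theta_true = 3.0
x = [random.random() ** (1 / theta_true) for _ in range(500)]   # X has pdf theta * x^(theta - 1) on (0, 1)
s = sum(math.log(xi) for xi in x)

def loglik(th):
    return len(x) * math.log(th) + (th - 1) * s

grid = [0.1 * k for k in range(1, 100)]
print(round(max(grid, key=loglik), 1), round(-len(x) / s, 2))   # both close to the true value 3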
4. (a) The average monthly sales in each country were 71.7/12 = 5.975,
78.6/12 = 6.550, 80.1/12 = 6.675, 81.9/12 = 6.825 and 89.7/12 = 7.475. The
average of these values is 6.70. Hence the SS due to country is:
P (Y = 0) = P (Y = 0, X = −1) + P (Y = 0, X = 0) + P (Y = 0, X = 1)
= P (Y = 0 | X = −1) P (X = −1) + P (Y = 0 | X = 0) P (X = 0)
+ P (Y = 0 | X = 1) P (X = 1)
= (1 − 0.8) × 0.3 + (1 − 0.5) × 0.2 + (1 − 0.6) × 0.5
= 0.36
and:
E(Y) = Σ_{y=0}^{1} y P(Y = y) = 0 × 0.36 + 1 × 0.64 = 0.64.
ii. We find:

P(Y = 0 | X + Y ≥ 0) = P(Y = 0, X + Y ≥ 0)/P(X + Y ≥ 0)

                     = P(Y = 0, X ≥ 0)/(1 − P(X + Y = −1))

                     = (P(Y = 0 | X = 0) P(X = 0) + P(Y = 0 | X = 1) P(X = 1))/(1 − P(Y = 0 | X = −1) P(X = −1))

                     = (0.5 × 0.2 + 0.4 × 0.5)/(1 − 0.2 × 0.3)

                     = 0.3191.
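Calculations of this kind are easy to verify by enumerating the joint distribution. A Python sketch (standard library only, purely illustrative) using the probabilities given in the question:

px = {-1: 0.3, 0: 0.2, 1: 0.5}
py1_given_x = {-1: 0.8, 0: 0.5, 1: 0.6}
joint = {(x, y): px[x] * (py1_given_x[x] if y == 1 else 1 - py1_given_x[x])
         for x in px for y in (0, 1)}
num = sum(p for (x, y), p in joint.items() if y == 0 and x + y >= 0)
den = sum(p for (x, y), p in joint.items() if x + y >= 0)
print(round(num / den, 4))   # 0.3191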
STATISTICAL TABLES
Cumulative normal distribution
Critical values of the t distribution
Critical values of the F distribution
Critical values of the chi-squared distribution
© C. Dougherty 2001, 2002 ([email protected]). These tables have been computed to accompany the text C. Dougherty Introduction to
Econometrics (second edition 2002, Oxford University Press, Oxford). They may be reproduced freely provided that this attribution is retained.
TABLE A.1
Cumulative normal distribution: A(z) = P(Z ≤ z) for Z ∼ N(0, 1)
z A(z)
1.645 0.9500 Lower limit of right 5% tail
1.960 0.9750 Lower limit of right 2.5% tail
2.326 0.9900 Lower limit of right 1% tail
2.576 0.9950 Lower limit of right 0.5% tail
3.090 0.9990 Lower limit of right 0.1% tail
3.291 0.9995 Lower limit of right 0.05% tail
z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0.0 0.5000 0.5040 0.5080 0.5120 0.5160 0.5199 0.5239 0.5279 0.5319 0.5359
0.1 0.5398 0.5438 0.5478 0.5517 0.5557 0.5596 0.5636 0.5675 0.5714 0.5753
0.2 0.5793 0.5832 0.5871 0.5910 0.5948 0.5987 0.6026 0.6064 0.6103 0.6141
0.3 0.6179 0.6217 0.6255 0.6293 0.6331 0.6368 0.6406 0.6443 0.6480 0.6517
0.4 0.6554 0.6591 0.6628 0.6664 0.6700 0.6736 0.6772 0.6808 0.6844 0.6879
0.5 0.6915 0.6950 0.6985 0.7019 0.7054 0.7088 0.7123 0.7157 0.7190 0.7224
0.6 0.7257 0.7291 0.7324 0.7357 0.7389 0.7422 0.7454 0.7486 0.7517 0.7549
0.7 0.7580 0.7611 0.7642 0.7673 0.7704 0.7734 0.7764 0.7794 0.7823 0.7852
0.8 0.7881 0.7910 0.7939 0.7967 0.7995 0.8023 0.8051 0.8078 0.8106 0.8133
0.9 0.8159 0.8186 0.8212 0.8238 0.8264 0.8289 0.8315 0.8340 0.8365 0.8389
1.0 0.8413 0.8438 0.8461 0.8485 0.8508 0.8531 0.8554 0.8577 0.8599 0.8621
1.1 0.8643 0.8665 0.8686 0.8708 0.8729 0.8749 0.8770 0.8790 0.8810 0.8830
1.2 0.8849 0.8869 0.8888 0.8907 0.8925 0.8944 0.8962 0.8980 0.8997 0.9015
1.3 0.9032 0.9049 0.9066 0.9082 0.9099 0.9115 0.9131 0.9147 0.9162 0.9177
1.4 0.9192 0.9207 0.9222 0.9236 0.9251 0.9265 0.9279 0.9292 0.9306 0.9319
1.5 0.9332 0.9345 0.9357 0.9370 0.9382 0.9394 0.9406 0.9418 0.9429 0.9441
1.6 0.9452 0.9463 0.9474 0.9484 0.9495 0.9505 0.9515 0.9525 0.9535 0.9545
1.7 0.9554 0.9564 0.9573 0.9582 0.9591 0.9599 0.9608 0.9616 0.9625 0.9633
1.8 0.9641 0.9649 0.9656 0.9664 0.9671 0.9678 0.9686 0.9693 0.9699 0.9706
1.9 0.9713 0.9719 0.9726 0.9732 0.9738 0.9744 0.9750 0.9756 0.9761 0.9767
2.0 0.9772 0.9778 0.9783 0.9788 0.9793 0.9798 0.9803 0.9808 0.9812 0.9817
2.1 0.9821 0.9826 0.9830 0.9834 0.9838 0.9842 0.9846 0.9850 0.9854 0.9857
2.2 0.9861 0.9864 0.9868 0.9871 0.9875 0.9878 0.9881 0.9884 0.9887 0.9890
2.3 0.9893 0.9896 0.9898 0.9901 0.9904 0.9906 0.9909 0.9911 0.9913 0.9916
2.4 0.9918 0.9920 0.9922 0.9925 0.9927 0.9929 0.9931 0.9932 0.9934 0.9936
2.5 0.9938 0.9940 0.9941 0.9943 0.9945 0.9946 0.9948 0.9949 0.9951 0.9952
2.6 0.9953 0.9955 0.9956 0.9957 0.9959 0.9960 0.9961 0.9962 0.9963 0.9964
2.7 0.9965 0.9966 0.9967 0.9968 0.9969 0.9970 0.9971 0.9972 0.9973 0.9974
2.8 0.9974 0.9975 0.9976 0.9977 0.9977 0.9978 0.9979 0.9979 0.9980 0.9981
2.9 0.9981 0.9982 0.9982 0.9983 0.9984 0.9984 0.9985 0.9985 0.9986 0.9986
3.0 0.9987 0.9987 0.9987 0.9988 0.9988 0.9989 0.9989 0.9989 0.9990 0.9990
3.1 0.9990 0.9991 0.9991 0.9991 0.9992 0.9992 0.9992 0.9992 0.9993 0.9993
3.2 0.9993 0.9993 0.9994 0.9994 0.9994 0.9994 0.9994 0.9995 0.9995 0.9995
3.3 0.9995 0.9995 0.9995 0.9996 0.9996 0.9996 0.9996 0.9996 0.9996 0.9997
3.4 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9998
3.5 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998
3.6 0.9998 0.9998 0.9999
TABLE A.2
t Distribution: Critical Values of t
Significance level
Degrees of Two-tailed test: 10% 5% 2% 1% 0.2% 0.1%
freedom One-tailed test: 5% 2.5% 1% 0.5% 0.1% 0.05%
1 6.314 12.706 31.821 63.657 318.309 636.619
2 2.920 4.303 6.965 9.925 22.327 31.599
3 2.353 3.182 4.541 5.841 10.215 12.924
4 2.132 2.776 3.747 4.604 7.173 8.610
5 2.015 2.571 3.365 4.032 5.893 6.869
6 1.943 2.447 3.143 3.707 5.208 5.959
7 1.894 2.365 2.998 3.499 4.785 5.408
8 1.860 2.306 2.896 3.355 4.501 5.041
9 1.833 2.262 2.821 3.250 4.297 4.781
10 1.812 2.228 2.764 3.169 4.144 4.587
11 1.796 2.201 2.718 3.106 4.025 4.437
12 1.782 2.179 2.681 3.055 3.930 4.318
13 1.771 2.160 2.650 3.012 3.852 4.221
14 1.761 2.145 2.624 2.977 3.787 4.140
15 1.753 2.131 2.602 2.947 3.733 4.073
16 1.746 2.120 2.583 2.921 3.686 4.015
17 1.740 2.110 2.567 2.898 3.646 3.965
18 1.734 2.101 2.552 2.878 3.610 3.922
19 1.729 2.093 2.539 2.861 3.579 3.883
20 1.725 2.086 2.528 2.845 3.552 3.850
21 1.721 2.080 2.518 2.831 3.527 3.819
22 1.717 2.074 2.508 2.819 3.505 3.792
23 1.714 2.069 2.500 2.807 3.485 3.768
24 1.711 2.064 2.492 2.797 3.467 3.745
25 1.708 2.060 2.485 2.787 3.450 3.725
26 1.706 2.056 2.479 2.779 3.435 3.707
27 1.703 2.052 2.473 2.771 3.421 3.690
28 1.701 2.048 2.467 2.763 3.408 3.674
29 1.699 2.045 2.462 2.756 3.396 3.659
30 1.697 2.042 2.457 2.750 3.385 3.646
32 1.694 2.037 2.449 2.738 3.365 3.622
34 1.691 2.032 2.441 2.728 3.348 3.601
36 1.688 2.028 2.434 2.719 3.333 3.582
38 1.686 2.024 2.429 2.712 3.319 3.566
40 1.684 2.021 2.423 2.704 3.307 3.551
42 1.682 2.018 2.418 2.698 3.296 3.538
44 1.680 2.015 2.414 2.692 3.286 3.526
46 1.679 2.013 2.410 2.687 3.277 3.515
48 1.677 2.011 2.407 2.682 3.269 3.505
50 1.676 2.009 2.403 2.678 3.261 3.496
60 1.671 2.000 2.390 2.660 3.232 3.460
70 1.667 1.994 2.381 2.648 3.211 3.435
80 1.664 1.990 2.374 2.639 3.195 3.416
90 1.662 1.987 2.368 2.632 3.183 3.402
100 1.660 1.984 2.364 2.626 3.174 3.390
120 1.658 1.980 2.358 2.617 3.160 3.373
150 1.655 1.976 2.351 2.609 3.145 3.357
200 1.653 1.972 2.345 2.601 3.131 3.340
300 1.650 1.968 2.339 2.592 3.118 3.323
400 1.649 1.966 2.336 2.588 3.111 3.315
500 1.648 1.965 2.334 2.586 3.107 3.310
600 1.647 1.964 2.333 2.584 3.104 3.307
∞ 1.645 1.960 2.326 2.576 3.090 3.291
TABLE A.3
F Distribution: Critical Values of F (5% significance level)
v1 1 2 3 4 5 6 7 8 9 10 12 14 16 18 20
v2
1 161.45 199.50 215.71 224.58 230.16 233.99 236.77 238.88 240.54 241.88 243.91 245.36 246.46 247.32 248.01
2 18.51 19.00 19.16 19.25 19.30 19.33 19.35 19.37 19.38 19.40 19.41 19.42 19.43 19.44 19.45
3 10.13 9.55 9.28 9.12 9.01 8.94 8.89 8.85 8.81 8.79 8.74 8.71 8.69 8.67 8.66
4 7.71 6.94 6.59 6.39 6.26 6.16 6.09 6.04 6.00 5.96 5.91 5.87 5.84 5.82 5.80
5 6.61 5.79 5.41 5.19 5.05 4.95 4.88 4.82 4.77 4.74 4.68 4.64 4.60 4.58 4.56
6 5.99 5.14 4.76 4.53 4.39 4.28 4.21 4.15 4.10 4.06 4.00 3.96 3.92 3.90 3.87
7 5.59 4.74 4.35 4.12 3.97 3.87 3.79 3.73 3.68 3.64 3.57 3.53 3.49 3.47 3.44
8 5.32 4.46 4.07 3.84 3.69 3.58 3.50 3.44 3.39 3.35 3.28 3.24 3.20 3.17 3.15
9 5.12 4.26 3.86 3.63 3.48 3.37 3.29 3.23 3.18 3.14 3.07 3.03 2.99 2.96 2.94
10 4.96 4.10 3.71 3.48 3.33 3.22 3.14 3.07 3.02 2.98 2.91 2.86 2.83 2.80 2.77
11 4.84 3.98 3.59 3.36 3.20 3.09 3.01 2.95 2.90 2.85 2.79 2.74 2.70 2.67 2.65
12 4.75 3.89 3.49 3.26 3.11 3.00 2.91 2.85 2.80 2.75 2.69 2.64 2.60 2.57 2.54
13 4.67 3.81 3.41 3.18 3.03 2.92 2.83 2.77 2.71 2.67 2.60 2.55 2.51 2.48 2.46
14 4.60 3.74 3.34 3.11 2.96 2.85 2.76 2.70 2.65 2.60 2.53 2.48 2.44 2.41 2.39
15 4.54 3.68 3.29 3.06 2.90 2.79 2.71 2.64 2.59 2.54 2.48 2.42 2.38 2.35 2.33
16 4.49 3.63 3.24 3.01 2.85 2.74 2.66 2.59 2.54 2.49 2.42 2.37 2.33 2.30 2.28
17 4.45 3.59 3.20 2.96 2.81 2.70 2.61 2.55 2.49 2.45 2.38 2.33 2.29 2.26 2.23
18 4.41 3.55 3.16 2.93 2.77 2.66 2.58 2.51 2.46 2.41 2.34 2.29 2.25 2.22 2.19
19 4.38 3.52 3.13 2.90 2.74 2.63 2.54 2.48 2.42 2.38 2.31 2.26 2.21 2.18 2.16
20 4.35 3.49 3.10 2.87 2.71 2.60 2.51 2.45 2.39 2.35 2.28 2.22 2.18 2.15 2.12
21 4.32 3.47 3.07 2.84 2.68 2.57 2.49 2.42 2.37 2.32 2.25 2.20 2.16 2.12 2.10
22 4.30 3.44 3.05 2.82 2.66 2.55 2.46 2.40 2.34 2.30 2.23 2.17 2.13 2.10 2.07
23 4.28 3.42 3.03 2.80 2.64 2.53 2.44 2.37 2.32 2.27 2.20 2.15 2.11 2.08 2.05
24 4.26 3.40 3.01 2.78 2.62 2.51 2.42 2.36 2.30 2.25 2.18 2.13 2.09 2.05 2.03
25 4.24 3.39 2.99 2.76 2.60 2.49 2.40 2.34 2.28 2.24 2.16 2.11 2.07 2.04 2.01
26 4.22 3.37 2.98 2.74 2.59 2.47 2.39 2.32 2.27 2.22 2.15 2.09 2.05 2.02 1.99
27 4.21 3.35 2.96 2.73 2.57 2.46 2.37 2.31 2.25 2.20 2.13 2.08 2.04 2.00 1.97
28 4.20 3.34 2.95 2.71 2.56 2.45 2.36 2.29 2.24 2.19 2.12 2.06 2.02 1.99 1.96
29 4.18 3.33 2.93 2.70 2.55 2.43 2.35 2.28 2.22 2.18 2.10 2.05 2.01 1.97 1.94
30 4.17 3.32 2.92 2.69 2.53 2.42 2.33 2.27 2.21 2.16 2.09 2.04 1.99 1.96 1.93
35 4.12 3.27 2.87 2.64 2.49 2.37 2.29 2.22 2.16 2.11 2.04 1.99 1.94 1.91 1.88
40 4.08 3.23 2.84 2.61 2.45 2.34 2.25 2.18 2.12 2.08 2.00 1.95 1.90 1.87 1.84
50 4.03 3.18 2.79 2.56 2.40 2.29 2.20 2.13 2.07 2.03 1.95 1.89 1.85 1.81 1.78
60 4.00 3.15 2.76 2.53 2.37 2.25 2.17 2.10 2.04 1.99 1.92 1.86 1.82 1.78 1.75
70 3.98 3.13 2.74 2.50 2.35 2.23 2.14 2.07 2.02 1.97 1.89 1.84 1.79 1.75 1.72
80 3.96 3.11 2.72 2.49 2.33 2.21 2.13 2.06 2.00 1.95 1.88 1.82 1.77 1.73 1.70
90 3.95 3.10 2.71 2.47 2.32 2.20 2.11 2.04 1.99 1.94 1.86 1.80 1.76 1.72 1.69
100 3.94 3.09 2.70 2.46 2.31 2.19 2.10 2.03 1.97 1.93 1.85 1.79 1.75 1.71 1.68
120 3.92 3.07 2.68 2.45 2.29 2.18 2.09 2.02 1.96 1.91 1.83 1.78 1.73 1.69 1.66
150 3.90 3.06 2.66 2.43 2.27 2.16 2.07 2.00 1.94 1.89 1.82 1.76 1.71 1.67 1.64
200 3.89 3.04 2.65 2.42 2.26 2.14 2.06 1.98 1.93 1.88 1.80 1.74 1.69 1.66 1.62
250 3.88 3.03 2.64 2.41 2.25 2.13 2.05 1.98 1.92 1.87 1.79 1.73 1.68 1.65 1.61
300 3.87 3.03 2.63 2.40 2.24 2.13 2.04 1.97 1.91 1.86 1.78 1.72 1.68 1.64 1.61
400 3.86 3.02 2.63 2.39 2.24 2.12 2.03 1.96 1.90 1.85 1.78 1.72 1.67 1.63 1.60
500 3.86 3.01 2.62 2.39 2.23 2.12 2.03 1.96 1.90 1.85 1.77 1.71 1.66 1.62 1.59
600 3.86 3.01 2.62 2.39 2.23 2.11 2.02 1.95 1.90 1.85 1.77 1.71 1.66 1.62 1.59
750 3.85 3.01 2.62 2.38 2.23 2.11 2.02 1.95 1.89 1.84 1.77 1.70 1.66 1.62 1.58
1000 3.85 3.00 2.61 2.38 2.22 2.11 2.02 1.95 1.89 1.84 1.76 1.70 1.65 1.61 1.58
TABLE A.3 (continued)
F Distribution: Critical Values of F (1% significance level)
v1 1 2 3 4 5 6 7 8 9 10 12 14 16 18 20
v2
1 4052.18 4999.50 5403.35 5624.58 5763.65 5858.99 5928.36 5981.07 6022.47 6055.85 6106.32 6142.67 6170.10 6191.53 6208.73
2 98.50 99.00 99.17 99.25 99.30 99.33 99.36 99.37 99.39 99.40 99.42 99.43 99.44 99.44 99.45
3 34.12 30.82 29.46 28.71 28.24 27.91 27.67 27.49 27.35 27.23 27.05 26.92 26.83 26.75 26.69
4 21.20 18.00 16.69 15.98 15.52 15.21 14.98 14.80 14.66 14.55 14.37 14.25 14.15 14.08 14.02
5 16.26 13.27 12.06 11.39 10.97 10.67 10.46 10.29 10.16 10.05 9.89 9.77 9.68 9.61 9.55
6 13.75 10.92 9.78 9.15 8.75 8.47 8.26 8.10 7.98 7.87 7.72 7.60 7.52 7.45 7.40
7 12.25 9.55 8.45 7.85 7.46 7.19 6.99 6.84 6.72 6.62 6.47 6.36 6.28 6.21 6.16
8 11.26 8.65 7.59 7.01 6.63 6.37 6.18 6.03 5.91 5.81 5.67 5.56 5.48 5.41 5.36
9 10.56 8.02 6.99 6.42 6.06 5.80 5.61 5.47 5.35 5.26 5.11 5.01 4.92 4.86 4.81
10 10.04 7.56 6.55 5.99 5.64 5.39 5.20 5.06 4.94 4.85 4.71 4.60 4.52 4.46 4.41
11 9.65 7.21 6.22 5.67 5.32 5.07 4.89 4.74 4.63 4.54 4.40 4.29 4.21 4.15 4.10
12 9.33 6.93 5.95 5.41 5.06 4.82 4.64 4.50 4.39 4.30 4.16 4.05 3.97 3.91 3.86
13 9.07 6.70 5.74 5.21 4.86 4.62 4.44 4.30 4.19 4.10 3.96 3.86 3.78 3.72 3.66
14 8.86 6.51 5.56 5.04 4.69 4.46 4.28 4.14 4.03 3.94 3.80 3.70 3.62 3.56 3.51
15 8.68 6.36 5.42 4.89 4.56 4.32 4.14 4.00 3.89 3.80 3.67 3.56 3.49 3.42 3.37
16 8.53 6.23 5.29 4.77 4.44 4.20 4.03 3.89 3.78 3.69 3.55 3.45 3.37 3.31 3.26
17 8.40 6.11 5.18 4.67 4.34 4.10 3.93 3.79 3.68 3.59 3.46 3.35 3.27 3.21 3.16
18 8.29 6.01 5.09 4.58 4.25 4.01 3.84 3.71 3.60 3.51 3.37 3.27 3.19 3.13 3.08
19 8.18 5.93 5.01 4.50 4.17 3.94 3.77 3.63 3.52 3.43 3.30 3.19 3.12 3.05 3.00
20 8.10 5.85 4.94 4.43 4.10 3.87 3.70 3.56 3.46 3.37 3.23 3.13 3.05 2.99 2.94
21 8.02 5.78 4.87 4.37 4.04 3.81 3.64 3.51 3.40 3.31 3.17 3.07 2.99 2.93 2.88
22 7.95 5.72 4.82 4.31 3.99 3.76 3.59 3.45 3.35 3.26 3.12 3.02 2.94 2.88 2.83
23 7.88 5.66 4.76 4.26 3.94 3.71 3.54 3.41 3.30 3.21 3.07 2.97 2.89 2.83 2.78
24 7.82 5.61 4.72 4.22 3.90 3.67 3.50 3.36 3.26 3.17 3.03 2.93 2.85 2.79 2.74
25 7.77 5.57 4.68 4.18 3.85 3.63 3.46 3.32 3.22 3.13 2.99 2.89 2.81 2.75 2.70
26 7.72 5.53 4.64 4.14 3.82 3.59 3.42 3.29 3.18 3.09 2.96 2.86 2.78 2.72 2.66
27 7.68 5.49 4.60 4.11 3.78 3.56 3.39 3.26 3.15 3.06 2.93 2.82 2.75 2.68 2.63
28 7.64 5.45 4.57 4.07 3.75 3.53 3.36 3.23 3.12 3.03 2.90 2.79 2.72 2.65 2.60
29 7.60 5.42 4.54 4.04 3.73 3.50 3.33 3.20 3.09 3.00 2.87 2.77 2.69 2.63 2.57
30 7.56 5.39 4.51 4.02 3.70 3.47 3.30 3.17 3.07 2.98 2.84 2.74 2.66 2.60 2.55
35 7.42 5.27 4.40 3.91 3.59 3.37 3.20 3.07 2.96 2.88 2.74 2.64 2.56 2.50 2.44
40 7.31 5.18 4.31 3.83 3.51 3.29 3.12 2.99 2.89 2.80 2.66 2.56 2.48 2.42 2.37
50 7.17 5.06 4.20 3.72 3.41 3.19 3.02 2.89 2.78 2.70 2.56 2.46 2.38 2.32 2.27
60 7.08 4.98 4.13 3.65 3.34 3.12 2.95 2.82 2.72 2.63 2.50 2.39 2.31 2.25 2.20
70 7.01 4.92 4.07 3.60 3.29 3.07 2.91 2.78 2.67 2.59 2.45 2.35 2.27 2.20 2.15
80 6.96 4.88 4.04 3.56 3.26 3.04 2.87 2.74 2.64 2.55 2.42 2.31 2.23 2.17 2.12
90 6.93 4.85 4.01 3.53 3.23 3.01 2.84 2.72 2.61 2.52 2.39 2.29 2.21 2.14 2.09
100 6.90 4.82 3.98 3.51 3.21 2.99 2.82 2.69 2.59 2.50 2.37 2.27 2.19 2.12 2.07
120 6.85 4.79 3.95 3.48 3.17 2.96 2.79 2.66 2.56 2.47 2.34 2.23 2.15 2.09 2.03
150 6.81 4.75 3.91 3.45 3.14 2.92 2.76 2.63 2.53 2.44 2.31 2.20 2.12 2.06 2.00
200 6.76 4.71 3.88 3.41 3.11 2.89 2.73 2.60 2.50 2.41 2.27 2.17 2.09 2.03 1.97
250 6.74 4.69 3.86 3.40 3.09 2.87 2.71 2.58 2.48 2.39 2.26 2.15 2.07 2.01 1.95
300 6.72 4.68 3.85 3.38 3.08 2.86 2.70 2.57 2.47 2.38 2.24 2.14 2.06 1.99 1.94
400 6.70 4.66 3.83 3.37 3.06 2.85 2.68 2.56 2.45 2.37 2.23 2.13 2.05 1.98 1.92
500 6.69 4.65 3.82 3.36 3.05 2.84 2.68 2.55 2.44 2.36 2.22 2.12 2.04 1.97 1.92
600 6.68 4.64 3.81 3.35 3.05 2.83 2.67 2.54 2.44 2.35 2.21 2.11 2.03 1.96 1.91
750 6.67 4.63 3.81 3.34 3.04 2.83 2.66 2.53 2.43 2.34 2.21 2.11 2.02 1.96 1.90
1000 6.66 4.63 3.80 3.34 3.04 2.82 2.66 2.53 2.43 2.34 2.20 2.10 2.02 1.95 1.90
TABLE A.3 (continued)
F Distribution: Critical Values of F (0.1% significance level)
v1 1 2 3 4 5 6 7 8 9 10 12 14 16 18 20
v2
1 4.05e05 5.00e05 5.40e05 5.62e05 5.76e05 5.86e05 5.93e05 5.98e05 6.02e05 6.06e05 6.11e05 6.14e05 6.17e05 6.19e05 6.21e05
2 998.50 999.00 999.17 999.25 999.30 999.33 999.36 999.37 999.39 999.40 999.42 999.43 999.44 999.44 999.45
3 167.03 148.50 141.11 137.10 134.58 132.85 131.58 130.62 129.86 129.25 128.32 127.64 127.14 126.74 126.42
4 74.14 61.25 56.18 53.44 51.71 50.53 49.66 49.00 48.47 48.05 47.41 46.95 46.60 46.32 46.10
5 47.18 37.12 33.20 31.09 29.75 28.83 28.16 27.65 27.24 26.92 26.42 26.06 25.78 25.57 25.39
6 35.51 27.00 23.70 21.92 20.80 20.03 19.46 19.03 18.69 18.41 17.99 17.68 17.45 17.27 17.12
7 29.25 21.69 18.77 17.20 16.21 15.52 15.02 14.63 14.33 14.08 13.71 13.43 13.23 13.06 12.93
8 25.41 18.49 15.83 14.39 13.48 12.86 12.40 12.05 11.77 11.54 11.19 10.94 10.75 10.60 10.48
9 22.86 16.39 13.90 12.56 11.71 11.13 10.70 10.37 10.11 9.89 9.57 9.33 9.15 9.01 8.90
10 21.04 14.91 12.55 11.28 10.48 9.93 9.52 9.20 8.96 8.75 8.45 8.22 8.05 7.91 7.80
11 19.69 13.81 11.56 10.35 9.58 9.05 8.66 8.35 8.12 7.92 7.63 7.41 7.24 7.11 7.01
12 18.64 12.97 10.80 9.63 8.89 8.38 8.00 7.71 7.48 7.29 7.00 6.79 6.63 6.51 6.40
13 17.82 12.31 10.21 9.07 8.35 7.86 7.49 7.21 6.98 6.80 6.52 6.31 6.16 6.03 5.93
14 17.14 11.78 9.73 8.62 7.92 7.44 7.08 6.80 6.58 6.40 6.13 5.93 5.78 5.66 5.56
15 16.59 11.34 9.34 8.25 7.57 7.09 6.74 6.47 6.26 6.08 5.81 5.62 5.46 5.35 5.25
16 16.12 10.97 9.01 7.94 7.27 6.80 6.46 6.19 5.98 5.81 5.55 5.35 5.20 5.09 4.99
17 15.72 10.66 8.73 7.68 7.02 6.56 6.22 5.96 5.75 5.58 5.32 5.13 4.99 4.87 4.78
18 15.38 10.39 8.49 7.46 6.81 6.35 6.02 5.76 5.56 5.39 5.13 4.94 4.80 4.68 4.59
19 15.08 10.16 8.28 7.27 6.62 6.18 5.85 5.59 5.39 5.22 4.97 4.78 4.64 4.52 4.43
20 14.82 9.95 8.10 7.10 6.46 6.02 5.69 5.44 5.24 5.08 4.82 4.64 4.49 4.38 4.29
21 14.59 9.77 7.94 6.95 6.32 5.88 5.56 5.31 5.11 4.95 4.70 4.51 4.37 4.26 4.17
22 14.38 9.61 7.80 6.81 6.19 5.76 5.44 5.19 4.99 4.83 4.58 4.40 4.26 4.15 4.06
23 14.20 9.47 7.67 6.70 6.08 5.65 5.33 5.09 4.89 4.73 4.48 4.30 4.16 4.05 3.96
24 14.03 9.34 7.55 6.59 5.98 5.55 5.23 4.99 4.80 4.64 4.39 4.21 4.07 3.96 3.87
25 13.88 9.22 7.45 6.49 5.89 5.46 5.15 4.91 4.71 4.56 4.31 4.13 3.99 3.88 3.79
26 13.74 9.12 7.36 6.41 5.80 5.38 5.07 4.83 4.64 4.48 4.24 4.06 3.92 3.81 3.72
27 13.61 9.02 7.27 6.33 5.73 5.31 5.00 4.76 4.57 4.41 4.17 3.99 3.86 3.75 3.66
28 13.50 8.93 7.19 6.25 5.66 5.24 4.93 4.69 4.50 4.35 4.11 3.93 3.80 3.69 3.60
29 13.39 8.85 7.12 6.19 5.59 5.18 4.87 4.64 4.45 4.29 4.05 3.88 3.74 3.63 3.54
30 13.29 8.77 7.05 6.12 5.53 5.12 4.82 4.58 4.39 4.24 4.00 3.82 3.69 3.58 3.49
35 12.90 8.47 6.79 5.88 5.30 4.89 4.59 4.36 4.18 4.03 3.79 3.62 3.48 3.38 3.29
40 12.61 8.25 6.59 5.70 5.13 4.73 4.44 4.21 4.02 3.87 3.64 3.47 3.34 3.23 3.14
50 12.22 7.96 6.34 5.46 4.90 4.51 4.22 4.00 3.82 3.67 3.44 3.27 3.41 3.04 2.95
60 11.97 7.77 6.17 5.31 4.76 4.37 4.09 3.86 3.69 3.54 3.32 3.15 3.02 2.91 2.83
70 11.80 7.64 6.06 5.20 4.66 4.28 3.99 3.77 3.60 3.45 3.23 3.06 2.93 2.83 2.74
80 11.67 7.54 5.97 5.12 4.58 4.20 3.92 3.70 3.53 3.39 3.16 3.00 2.87 2.76 2.68
90 11.57 7.47 5.91 5.06 4.53 4.15 3.87 3.65 3.48 3.34 3.11 2.95 2.82 2.71 2.63
100 11.50 7.41 5.86 5.02 4.48 4.11 3.83 3.61 3.44 3.30 3.07 2.91 2.78 2.68 2.59
120 11.38 7.32 5.78 4.95 4.42 4.04 3.77 3.55 3.38 3.24 3.02 2.85 2.72 2.62 2.53
150 11.27 7.24 5.71 4.88 4.35 3.98 3.71 3.49 3.32 3.18 2.96 2.80 2.67 2.56 2.48
200 11.15 7.15 5.63 4.81 4.29 3.92 3.65 3.43 3.26 3.12 2.90 2.74 2.61 2.51 2.42
250 11.09 7.10 5.59 4.77 4.25 3.88 3.61 3.40 3.23 3.09 2.87 2.71 2.58 2.48 2.39
300 11.04 7.07 5.56 4.75 4.22 3.86 3.59 3.38 3.21 3.07 2.85 2.69 2.56 2.46 2.37
400 10.99 7.03 5.53 4.71 4.19 3.83 3.56 3.35 3.18 3.04 2.82 2.66 2.53 2.43 2.34
500 10.96 7.00 5.51 4.69 4.18 3.81 3.54 3.33 3.16 3.02 2.81 2.64 2.52 2.41 2.33
600 10.94 6.99 5.49 4.68 4.16 3.80 3.53 3.32 3.15 3.01 2.80 2.63 2.51 2.40 2.32
750 10.91 6.97 5.48 4.67 4.15 3.79 3.52 3.31 3.14 3.00 2.78 2.62 2.49 2.39 2.31
1000 10.89 6.96 5.46 4.65 4.14 3.78 3.51 3.30 3.13 2.99 2.77 2.61 2.48 2.38 2.30
Dennis V. Lindley, William F. Scott, New Cambridge Statistical Tables, (1995) © Cambridge University Press, reproduced with permission.