Stat 231 Course Notes
4. ESTIMATION 135
4.1 Statistical Models and Estimation . . . 135
4.2 Estimators and Sampling Distributions . . . 136
4.3 Interval Estimation Using the Likelihood Function . . . 141
4.4 Confidence Intervals and Pivotal Quantities . . . 145
4.5 The Chi-squared and t Distributions . . . 154
4.6 Likelihood-Based Confidence Intervals . . . 158
Preface
These notes are a work-in-progress with contributions from those students taking the
courses and the instructors teaching them. An original version of these notes was prepared
by Jerry Lawless. Additions and revisions were made by Cyntha Struthers, Don McLeish,
Jock MacKay, and others. Richard Cook supplied the example in Chapter 8. In order
to provide improved versions of the notes for students in subsequent terms, please email
typos and errors, or sections that are confusing, or additional comments/suggestions to
[email protected].
Specific topics in these notes also have associated video files or PowerPoint shows that
can be accessed at www.watstat.ca.
1. INTRODUCTION TO
STATISTICAL SCIENCES
making conclusions. In particular, Statistical Sciences deal with the study of variability
in populations and processes, and with informative and cost-effective ways to collect and
analyze data about such populations and processes.
Statistical data analysis occurs in a huge number of areas. For example, statistical
algorithms are the basis for software involved in the automated recognition of handwritten
or spoken text; statistical methods are commonly used in law cases, for example in DNA
profiling; statistical process control is used to increase the quality and productivity of
manufacturing and service processes; individuals are selected for direct mail marketing
campaigns through a statistical analysis of their characteristics. With modern information
technology, massive amounts of data are routinely collected and stored. But data do not
equal information, and it is the purpose of Statistical Sciences to provide and analyze data
so that the maximum amount of information or knowledge may be obtained¹. Poor or
improperly analyzed data may be useless or misleading. The same could be said about
poorly collected data.
Probability models are used to represent many phenomena, populations, or processes
and to deal with problems that involve variability. You studied these models in your proba-
bility course and you have seen how they can be used to describe variability. This course will
focus on the collection, analysis and interpretation of data and the probability models you
studied previously will be used extensively. The most important material from your proba-
bility course is the material dealing with random variables, including distributions such as
the Binomial, Poisson, Multinomial, Normal or Gaussian, Uniform and Exponential. It is
important to review this material on your own.
Statistical Sciences is a large discipline and this course is only an introduction. The
broad objective of this course is to discuss all aspects of: problem formulation, planning of
an empirical study, formal and informal analysis of data, and the conclusions and limitations
of such an analysis. We must remember that data are collected and models are constructed
for a specific reason. In any given application we should keep the big picture in mind (e.g.
Why are we studying this? What else do we know about it?) even when considering one
specific aspect of a problem.
Here is a quote² from Hal Varian, Google's chief economist.
"The ability to take data - to be able to understand it, to process it, to extract value
from it, to visualize it, to communicate it - that's going to be a hugely important skill in the
next decades, not only at the professional level but even at the educational level for elementary
school kids, for high school kids, for college kids. Because now we really do have essentially
free and ubiquitous data. So the complimentary (sic) scarce factor is the ability to
understand that data and extract value from it.
I think statisticians are part of it, but it's just a part. You also want to be able to
visualize the data, communicate the data, and utilize it effectively. But I do think those
skills - of being able to access, understand, and communicate the insights you get from data
analysis - are going to be extremely important. Managers need to be able to access and
understand the data themselves."

¹A brilliant example of how to create information through data visualization is found in the video by Hans Rosling at: http://www.youtube.com/watch?v=jbkSRLYSojo
²For the complete article see "How the web challenges managers", Hal Varian, The McKinsey Quarterly, January 2009.

1.2 Data Collection
image is associated with each city (the unit) then the image is also a complex variate.
The values of a variate typically vary across the units in a population or process. This
variability generates uncertainty and makes it necessary to study populations and processes
by collecting data about them. By data, we mean the values of the variates for a sample
of units drawn from a population or process. It is important to identify the types of
variates in an empirical study since this identification will help us in choosing
statistical models for the data, which will aid us in the analysis of the data.
We are interested in functions of the variates over the population or process; for example
the average drop in blood pressure due to a treatment for individuals with hypertension
or the proportion of a population having a certain characteristic. We call these functions
attributes of the population or process.
In planning to collect data about a population or process, we must carefully specify
what the objectives are. Then, we must consider feasible methods for collecting data as
well as the extent to which it will be possible to answer questions of interest. This sounds
simple but is usually difficult to do well, especially since resources are always limited.
There are several ways in which we can obtain data. One way is purely according to
what is available: that is, data are provided by some existing source. Huge amounts of
data collected by many technological systems are of this type, for example, data on credit
card usage or on purchases made by customers in a supermarket. Sometimes it is not
clear what available data represent and they may be unsuitable for serious analysis. For
example, people who voluntarily provide data in a web survey may not be representative of
the population at large. Alternatively, we may plan and execute a sampling plan to collect
new data. Statistical Sciences stress the importance of obtaining data that will be objective
and provide maximal information at a reasonable cost.
Recall that an empirical study is one in which we learn by observation or experiment.
Most often this is done by collecting data. The empirical studies we will consider will
usually be one of the following types:
(i) Sample surveys: The object of many empirical studies is to learn about a finite
population (e.g. all persons over 19 in Ontario as of September 1 in a given year). In this
case information about the population may be obtained by selecting a "representative"
sample of units from the population and determining the variates of interest for
each unit in the sample. Obtaining such a sample can be challenging and expensive.
In a survey sample the variates of interest are most often collected using a question-
naire. Sample surveys are widely used in government statistical studies, economics,
marketing, public opinion polls, sociology, quality assurance and other areas.
(ii) Observational studies: An observational study is one in which data are collected
about a population or process without any attempt to change the value of one or
more variates for the sampled units. For example, in studying risk factors associated
with a disease such as lung cancer, we might investigate all cases of the disease at a
particular hospital (or perhaps a sample of them) that occur over a given time period.
We would also examine a sample of individuals who did not have the disease. A
distinction between a sample survey and an observational study is that for observational
studies the population of interest is usually infinite or conceptual. For example, in
investigating risk factors for a disease, we prefer to think of the population of interest
as a conceptual one consisting of persons at risk from the disease recently or in the
future.
These three types of empirical studies are not mutually exclusive, and many studies
involve aspects of all of them. Here are some slightly more detailed examples.
Consider, for example, soft drinks sold in nominal 355 ml cans. Because of inherent
variation in the filling process, the amount of liquid y that goes into a can varies over a
small range. Note that the manufacturer would like the variability in y to be as small as
possible, and for cans to contain at least 355 ml. Suppose that the manufacturer has just
added a new filling machine to increase the plant's capacity. The process engineer wants to
compare the new machine with an old one. Here the population of interest is the cans filled
in the future by both machines. The process engineer decides to do this by sampling some
filled cans from each machine and accurately measuring the amount of liquid y in each can.
This is an observational study.
How exactly should the sample be chosen? The machines may drift over time (that is,
the average of the y values or the variability in the y values may vary systematically up or
down over time) so we should select cans over time from each machine. We have to decide
how many, over what time period, and when to collect the cans from each machine.
1.3 Data Summaries

Numerical Summaries
We now describe some numerical summaries which are useful for describing features of a
single variate in a data set. These summaries fall generally into three categories: measures of
location (mean, median, and mode), measures of variability or dispersion (variance, range,
and interquartile range), and measures of shape (skewness and kurtosis). These summaries
are used when the variate is either discrete or continuous.
Measures of location
The sample median $\hat{m}$ is the middle value when n is odd and the sample is ordered
from smallest to largest, and the average of the two middle values when n is even.
The sample mode, or the value of y which appears in the sample with the highest
frequency (not necessarily unique).
The sample mean, median and mode describe the "center" of the distribution of variate
values in a data set. The units for the mean, median and mode (e.g. centimeters, degrees
Celsius, etc.) are the same as for the original variate.
Since the median is less affected by a few extreme observations (see Problem 1), it is a
more robust measure of location.
The range $= y_{(n)} - y_{(1)}$, where $y_{(n)} = \max(y_1, y_2, \ldots, y_n)$ and $y_{(1)} = \min(y_1, y_2, \ldots, y_n)$.
The sample variance and sample standard deviation measure the variability or spread of
the variate values in a data set. The units for standard deviation, range, and interquartile
range (e.g. centimeters, degrees Celsius, etc.) are the same as for the original variate.
Since the interquartile range is less affected by a few extreme observations (see Problem
2), it is a more robust measure of variability.
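The measures of location and variability above can be computed directly. The following sketch (illustrative only; Python, the `statistics` module, and the made-up sample are not part of the notes) also shows the robustness of the median:

```python
import statistics

# Hypothetical sample of variate values (for illustration only)
y = [2.1, 3.5, 3.5, 4.0, 4.4, 5.2, 6.8, 19.0]  # note the extreme value 19.0

mean = statistics.mean(y)          # sample mean
median = statistics.median(y)      # average of the two middle ordered values here (n even)
mode = statistics.mode(y)          # most frequent value: 3.5
variance = statistics.variance(y)  # sample variance (divisor n - 1)
stdev = statistics.stdev(y)        # sample standard deviation
rng = max(y) - min(y)              # range = y_(n) - y_(1)

print(mean, median, mode, stdev, rng)

# Robustness of the median: replace the extreme value 19.0 by 6.9.
# The mean changes substantially; the median does not change at all.
y2 = y[:-1] + [6.9]
print(statistics.mean(y2), statistics.median(y2))
```

Running this shows the mean dropping from about 6.06 to 4.55 while the median stays at 4.2, which is the sense in which the median is "robust".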
Measures of shape
Measures of shape generally indicate how the data, in terms of a relative frequency
histogram, differ from the Normal bell-shaped curve; for example, whether one "tail" of
the relative frequency histogram is substantially larger than the other so that the histogram is
asymmetric, or whether both tails of the relative frequency histogram are large so that the data
are more prone to extreme values than data from a Normal distribution.
Sample skewness and sample kurtosis have no units.
[Figure 1.1: Relative frequency histogram for data with positive skewness (sample skewness = 1.15)]
The sample skewness is a measure of the (lack of) symmetry in the data. When the
relative frequency histogram of the data is approximately symmetric then there is an
approximately equal balance between the positive and negative values in the sum
$\sum_{i=1}^{n} (y_i - \bar{y})^3$, and this results in a value for the sample skewness that
is approximately zero.
If the relative frequency histogram of the data has a long right tail (see Figure 1.1), then
the positive values of $(y_i - \bar{y})^3$ dominate the negative values in the sum and the value of
the skewness will be positive. Similarly, if the relative frequency histogram of the data has
a long left tail (see Figure 1.2), then the negative values of $(y_i - \bar{y})^3$ dominate the positive
values in the sum and the value of the skewness will be negative.

[Figure 1.2: Relative frequency histogram for data with negative skewness (sample skewness = -1.35)]
[Figure 1.3: Relative frequency histogram for data with kurtosis > 3, with a G(0.15, 1.52) p.d.f. superimposed]
The sample kurtosis measures the heaviness of the tails and the peakedness of the data
relative to data that are Normally distributed. Since the term $(y_i - \bar{y})^4$ is always positive,
the kurtosis is always positive. If the sample kurtosis is greater than 3 then this indicates
heavier tails (and a more peaked center) than data that are Normally distributed. For data
that arise from a model with no tails, for example the Uniform distribution, the sample
kurtosis will be less than 3. See Figures 1.3 and 1.4. Typical financial data such as the
S&P500 index have kurtosis values greater than three, because the extreme returns (both
large and small) are more frequent than one would expect for Normally distributed data.

[Figure 1.4: Relative frequency histogram for data with kurtosis < 3 (sample skewness = 0.08, kurtosis = 1.73), with a G(4.9, 2.9) p.d.f. superimposed]
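These shape measures can be computed directly. The notes' exact formulas for sample skewness and kurtosis are not reproduced in this excerpt, so the sketch below uses the common moment-ratio definitions $m_3/m_2^{3/2}$ and $m_4/m_2^2$, where $m_k = \frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^k$; the notes' conventions may differ slightly:

```python
def sample_skewness(y):
    # Moment-ratio skewness m3 / m2^(3/2); near 0 for symmetric data,
    # positive for a long right tail, negative for a long left tail.
    n = len(y)
    ybar = sum(y) / n
    m2 = sum((v - ybar) ** 2 for v in y) / n
    m3 = sum((v - ybar) ** 3 for v in y) / n
    return m3 / m2 ** 1.5

def sample_kurtosis(y):
    # Moment-ratio kurtosis m4 / m2^2; approximately 3 for Normal data,
    # below 3 for tail-free data such as Uniform samples.
    n = len(y)
    ybar = sum(y) / n
    m2 = sum((v - ybar) ** 2 for v in y) / n
    m4 = sum((v - ybar) ** 4 for v in y) / n
    return m4 / m2 ** 2

print(sample_skewness([1, 2, 3, 4, 5]))      # 0.0 (perfectly symmetric data)
print(sample_skewness([1, 1, 2, 2, 3, 12]))  # positive: long right tail
print(sample_kurtosis([1, 2, 3, 4, 5]))      # below 3: no tails
```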
Definition 1 Let $y_{(1)}, y_{(2)}, \ldots, y_{(n)}$, where $y_{(1)} \le y_{(2)} \le \cdots \le y_{(n)}$, be the order statistic
for the data set $\{y_1, y_2, \ldots, y_n\}$. For $0 < p < 1$, the $p$th (sample) quantile (also called the
$100p$th (sample) percentile) is a value, call it $q(p)$, determined as follows:

If $k$ is not an integer but $1 < k < n$, then determine the closest integer $j$ such that
$j < k < j + 1$ and then $q(p) = \frac{1}{2}\left(y_{(j)} + y_{(j+1)}\right)$.
The quantiles $q(0.25)$, $q(0.5)$ and $q(0.75)$ are often used to summarize a data set and
are given special names.

Definition 2 The quantiles $q(0.25)$, $q(0.5)$ and $q(0.75)$ are called the lower or first quartile,
the median, and the upper or third quartile respectively.
Example 1.3.1
Consider the data set of 12 observations which has already been ordered from smallest
to largest:
1.2  6.6  6.8  7.6  7.9  9.1  10.9  11.5  12.2  12.7  13.1  14.3
A way to quantify the variability of the variate values in a data set is to use the
interquartile range (IQR), which is the difference IQR $= q(0.75) - q(0.25)$ between the
upper and lower quartiles.
The five number summary provides a concise numerical summary of a data set which
provides information about the location (through the median), the spread (through the
lower and upper quartiles) and the range (through the minimum and maximum values).

Definition 4 The five number summary of a data set consists of the smallest observation,
the lower quartile, the median, the upper quartile and the largest value, that is, the five
values: $y_{(1)}, q(0.25), q(0.5), q(0.75), y_{(n)}$.
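As an illustration, the five number summary for the data of Example 1.3.1 can be computed as follows. Since Definition 1 is only partially reproduced in this excerpt, the quantile rule below is one common convention (average the two straddling ordered values when $np$ is an integer) and may differ from the notes' rule in edge cases:

```python
import math

def quantile(y, p):
    """pth sample quantile of y for 0 < p < 1, under one common convention:
    if k = np is an integer, average the kth and (k+1)th ordered values;
    otherwise take the ceil(k)th ordered value. (Assumed convention; the
    notes' Definition 1 is only partially shown.)"""
    ys = sorted(y)
    n = len(ys)
    k = n * p
    if k == int(k):
        k = int(k)
        return 0.5 * (ys[k - 1] + ys[k])
    return ys[math.ceil(k) - 1]

# Ordered data from Example 1.3.1 (n = 12)
data = [1.2, 6.6, 6.8, 7.6, 7.9, 9.1, 10.9, 11.5, 12.2, 12.7, 13.1, 14.3]

five_number = (min(data), quantile(data, 0.25), quantile(data, 0.5),
               quantile(data, 0.75), max(data))
print(tuple(round(v, 2) for v in five_number))  # (1.2, 7.2, 10.0, 12.45, 14.3)
```

With these values the IQR is $q(0.75) - q(0.25) = 12.45 - 7.2 = 5.25$.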
$$\text{BMI} = \frac{\text{weight (kg)}}{[\text{height (m)}]^2}$$
BMI is a continuous variate. Often the value of BMI is used to classify a subject as being
"overweight", "normal weight", "underweight", etc. One possible classification is given in
Table 1.1.
Suppose Table 1.1 was used to determine the BMI class for each subject and we called this
new variate “BMI class”. BMI class is an example of an ordinal variate.
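A sketch of computing BMI and deriving the ordinal "BMI class" variate. Table 1.1 is not reproduced in this excerpt, so the cutoffs below follow the widely used WHO-style convention; they are an assumption and need not match the classes in Table 1.1:

```python
def bmi(weight_kg, height_m):
    # BMI = weight (kg) / [height (m)]^2
    return weight_kg / height_m ** 2

def bmi_class(b):
    # Cutoffs are the common WHO-style convention (an assumption here);
    # Table 1.1 in the notes may use different classes or cutoffs.
    if b < 18.5:
        return "underweight"
    elif b < 25.0:
        return "normal weight"
    elif b < 30.0:
        return "overweight"
    else:
        return "obese"

b = bmi(70.0, 1.75)                # 70 / 1.75^2 = 22.857...
print(round(b, 1), bmi_class(b))   # 22.9 normal weight
```

Deriving a categorical class from a continuous variate like this loses information, but, as in the discussion below, it makes simple relative frequency tables possible.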
The data are available in the file bmidata.txt posted on the course website. To analyse
the data, it is convenient to record the data in row-column format (see Table 1.2). The
first row of the file gives the variate names, in this case, subject number, sex (M = male or
F = female), height, weight and BMI. Each subsequent row gives the variate values for a
particular subject.
The five number summaries for the variate BMI for each sex are given in Table 1.3, along
with the sample mean and standard deviation. We see that there are only small differences
in the median and the mean. For the standard deviation, IQR and the range we notice
that the values are all larger for the females. In other words, there is more variability in
the BMI measurements for females than for males in this sample.
We can also construct a relative frequency table that gives the proportion of subjects
that fall within each BMI class by sex (see Table 1.4). From the table we can see that
the reason that the variability in the BMI variate for females is larger than for males is
because there is a larger proportion of females in the two extreme classes “underweight”
and “severely obese” as compared to the males.
Sample correlation
So far we have looked only at numerical summaries of a data set $\{y_1, y_2, \ldots, y_n\}$. Often
we have bivariate data of the form $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$. A numerical summary
of such data is the sample correlation.
Definition 5 The sample correlation, denoted by $r$, for data $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$
is
$$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$$
where
$$S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2$$
$$S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)$$
$$S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} y_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} y_i\right)^2$$
The sample correlation, which takes on values between $-1$ and $1$, is a measure of the
linear relationship between the two variates $x$ and $y$. If the value of $r$ is close to $1$ then
we say that there is a strong positive linear relationship between the two variates, while if
the value of $r$ is close to $-1$ then we say that there is a strong negative linear relationship
between the two variates. If the value of $r$ is close to $0$ then we say that there is no linear
relationship between the two variates.
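Definition 5 translates directly into code. A sketch, with made-up data chosen so the extreme cases $r = 1$ and $r = -1$ are visible:

```python
import math

def sample_correlation(x, y):
    # r = Sxy / sqrt(Sxx * Syy), with Sxx, Sxy, Syy as in Definition 5
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
print(sample_correlation(x, [2, 4, 6, 8, 10]))   # 1.0  (exact positive linear relation)
print(sample_correlation(x, [10, 8, 6, 4, 2]))   # -1.0 (exact negative linear relation)
```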
Relative risk
Recall that values for a categorical variate are category names that do not necessarily
have any ordering. If two variates of interest in a study are categorical variates then the
sample correlation cannot be used as a measure of the relationship between the two variates.
          $A$                $\bar{A}$          Total
$B$       $y_{11}$           $y_{12}$           $y_{11} + y_{12}$
$\bar{B}$ $y_{21}$           $y_{22}$           $y_{21} + y_{22}$
Total     $y_{11} + y_{21}$  $y_{12} + y_{22}$  $n$

Table 1.6: General two-way table
Recall that events $A$ and $B$ are independent events if $P(A \cap B) = P(A)P(B)$, or
equivalently $P(A) = P(A|B) = P(A|\bar{B})$. If $A$ and $B$ are independent events then
$$\frac{P(A|B)}{P(A|\bar{B})} = 1$$
and otherwise the ratio is not equal to one. In the Physicians' Health Study, if we let $A$ =
takes daily aspirin and $B$ = experienced CHD, then we can estimate this ratio using the
ratio of the sample proportions.
Definition 6 For categorical data in the form of Table 1.6, the relative risk of event $A$ in
group $B$ as compared to group $\bar{B}$ is
$$\text{relative risk} = \frac{y_{11}/(y_{11} + y_{12})}{y_{21}/(y_{21} + y_{22})}$$
The data suggest that the group taking the placebo are nearly twice as likely to experience
CHD as compared to the group taking the daily aspirin. Can we conclude that daily aspirin
reduces the occurrence of CHD? The topic of causation will be discussed in more detail in
Chapter 8.
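The ratio of sample proportions described above is easy to compute from a two-way table laid out as in Table 1.6. The counts below are hypothetical, chosen only for illustration (they are not the actual Physicians' Health Study data):

```python
def relative_risk(y11, y12, y21, y22):
    """Estimated relative risk of event A in group B versus group B-bar,
    from a 2x2 table laid out as in Table 1.6 (rows B, B-bar; columns A, A-bar)."""
    p_a_given_b = y11 / (y11 + y12)      # sample proportion with A among group B
    p_a_given_bbar = y21 / (y21 + y22)   # sample proportion with A among group B-bar
    return p_a_given_b / p_a_given_bbar

# Hypothetical counts: A occurs in 50 of 1000 units in group B
# and in 100 of 1000 units in group B-bar.
rr = relative_risk(50, 950, 100, 900)
print(rr)  # 0.5: A is estimated to be half as likely in group B as in group B-bar
```

A relative risk near 1 is consistent with independence of the two events; values well away from 1 suggest an association (though, as discussed above, not necessarily causation).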
Graphical Summaries
Graphical summaries or data visualizations are important tools for seeing patterns in data
and for communicating results. Although the graphical summaries we present here are
quite simple, they provide the building blocks for more advanced visualizations used in
data science and data mining.
We consider graphical summaries for both univariate data sets $\{y_1, y_2, \ldots, y_n\}$ and
bivariate data sets $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$.
Frequency histograms
Consider measurements fy1 ; y2 ; : : : ; yn g on a variate y. Partition the range of y into k
non-overlapping intervals Ij = [aj 1 ; aj ); j = 1; 2; : : : ; k and then calculate
for j = 1; 2; : : : ; k. The fj are called the observed frequencies for I1 ; I2 ; : : : ; Ik ; note that
Pk
fj = n.
j=1
(a) a "standard" frequency histogram, where the intervals $I_j$ are of equal length. The
height of the rectangle for $I_j$ is the frequency $f_j$ or relative frequency $f_j/n$.

(b) a "relative" frequency histogram, where the intervals $I_j = [a_{j-1}, a_j)$ may or may not
be of equal length. The height of the rectangle for $I_j$ is set equal to
$$\frac{f_j/n}{a_j - a_{j-1}}$$
so that the area of the $j$th rectangle equals $f_j/n$. With this choice of height we have
$$\sum_{j=1}^{k} \frac{f_j/n}{a_j - a_{j-1}}\,(a_j - a_{j-1}) = \frac{1}{n} \sum_{j=1}^{k} f_j = \frac{n}{n} = 1$$
If intervals of equal length are used then a standard frequency histogram and a relative
frequency histogram look identical except for the labeling of the vertical axis. As just
shown, the sum of the areas of the rectangles for a relative frequency histogram equals
one. Recall that the area under a probability density function for a continuous random
variable equals one. Therefore, if we wish to superimpose a probability density function
on a histogram to see how well the model fits the data, we must use a relative frequency
histogram. If we wish to compare two data sets which have different sample sizes then a
relative frequency histogram must always be used. The vertical axis is labelled "density"
to emphasize that such a histogram is being used.
To construct a frequency histogram, the number and location of the intervals must be
chosen. The intervals are typically selected so that there are ten to fifteen intervals and
each interval contains at least one $y$ value from the sample (that is, each $f_j \ge 1$). If a
software package is used to produce the frequency histogram then the intervals are usually
chosen automatically. An option for user-specified intervals is also usually provided.
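The construction in (b) can be sketched as follows; the data and interval endpoints below are made up for illustration. The key check is that the rectangle areas sum to one:

```python
def relative_frequency_histogram(y, edges):
    """Given interval endpoints a_0 < a_1 < ... < a_k, return for each
    interval I_j = [a_{j-1}, a_j) its observed frequency f_j and its
    density height (f_j / n) / (a_j - a_{j-1})."""
    n = len(y)
    k = len(edges) - 1
    freqs = [0] * k
    for v in y:
        for j in range(k):
            if edges[j] <= v < edges[j + 1]:   # intervals closed on the left
                freqs[j] += 1
                break
    heights = [(f / n) / (edges[j + 1] - edges[j]) for j, f in enumerate(freqs)]
    return freqs, heights

y = [0.5, 1.1, 1.2, 1.9, 2.5, 3.7, 4.2, 4.8, 6.0, 9.0]
freqs, heights = relative_frequency_histogram(y, [0, 2, 4, 10])  # unequal lengths
print(freqs)  # [4, 2, 4]

widths = [2, 2, 6]
areas = [h * w for h, w in zip(heights, widths)]
print(round(sum(areas), 10))  # 1.0, as required for a relative frequency histogram
```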
[Relative frequency histogram of body mass index (skewness = 0.41, kurtosis = 3.03)]

[Relative frequency histogram of body mass index (skewness = 0.30, kurtosis = 2.79)]

[Relative frequency histogram of lifetimes (skewness = 1.28)]
Bar Graphs
For categorical data, a bar graph or bar chart is a useful graphical summary. A bar
graph has a bar for each of the possible values of the categorical variate, with height equal
to the frequency or relative frequency of that category. Usually the order of the different
possible categories is not important. The width of the bars is also not important. Gaps are
left between the bars to emphasize that the data are categorical.
Figure 1.9 illustrates how a bar graph can be used to compare the global market share
of browsers in June 2015, June 2016, and June 2017.
[Figure 1.9: Bar graph of the relative frequencies of global browser market share (Chrome, Safari, IE, Firefox, UC Browser, Other) in June 2015, June 2016, and June 2017]
Pie charts, which are another way to display categorical data, are often used in the
media. Pie charts are used very infrequently by statisticians since the human eye is not
good at judging how much area is taken up by a wedge.
Bar graphs and pie charts are often used incorrectly in the media. See Chapter 1,
Problems 19-23.
3.1  0.6  1.6  1.8  0.3  3.8  1.0  0.8  2.9  1.7
Order the observations from smallest to largest to obtain the order statistic
0.3  0.6  0.8  1.0  1.6  1.7  1.8  2.9  3.1  3.8
Suppose we assume that these observations come from an unknown cumulative distribution
function $F(y) = P(Y \le y)$. If we wanted to estimate $F(1.5) = P(Y \le 1.5)$ then intuitively
it seems reasonable to estimate this probability by determining the proportion of observations
which are less than or equal to 1.5. Since there are four such values (0.3, 0.6, 0.8 and 1.0), we
estimate $F(1.5)$ by $\hat{F}(1.5) = \frac{4}{10} = 0.4$. Since there are no observations between
1.0 and 1.6, for any $y \in [1.0, 1.6)$ we would estimate $F(y) = P(Y \le y)$ using $\hat{F}(y) = 0.4$.

We can estimate $F(y) = P(Y \le y)$ in a similar way for any value of $y$. This leads us
to the following definition:
Definition 7 For a data set $\{y_1, y_2, \ldots, y_n\}$, the empirical cumulative distribution function
or e.c.d.f. is defined by
$$\hat{F}(y) = \frac{\text{number of values in } \{y_1, y_2, \ldots, y_n\} \text{ which are} \le y}{n}$$
For the data set of 10 observations, the graph of $\hat{F}(y)$ is given in Figure 1.10. The
vertical lines are added to make the graph look visually more like a cumulative distribution
function. We note that $\hat{F}(y)$ jumps a height of 0.1 at each of the unique values in the
ordered data set.

More generally, for an ordered data set $y_{(1)}, y_{(2)}, \ldots, y_{(n)}$ of unique observations,
$\hat{F}(y_{(j)}) = j/n$ and the jumps are all of size $1/n$.
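Definition 7 can be implemented directly; here the data are the 10 observations from the example above:

```python
import bisect

def ecdf(sample):
    """Return the empirical c.d.f. of the sample as a function:
    F-hat(y) = (number of observations <= y) / n."""
    ys = sorted(sample)
    n = len(ys)
    def fhat(y):
        # bisect_right counts how many sorted values are <= y
        return bisect.bisect_right(ys, y) / n
    return fhat

data = [3.1, 0.6, 1.6, 1.8, 0.3, 3.8, 1.0, 0.8, 2.9, 1.7]
fhat = ecdf(data)
print(fhat(1.5))  # 0.4: four observations (0.3, 0.6, 0.8, 1.0) are <= 1.5
print(fhat(0.2))  # 0.0: below the smallest observation
print(fhat(3.8))  # 1.0: at or above the largest observation
```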
[Figure 1.10: Empirical cumulative distribution function for the data set of 10 observations]
Example 1.3.6
Figure 1.11 shows a graph of the empirical cumulative distribution function for 100
observations which were randomly generated from an Exponential model and then rounded
to one decimal place. The observations are not all unique. In particular, at $y = 4.3$ there
is a jump of height 0.04, which would indicate that there are 4 observations equal to 4.3.
The plot of the empirical cumulative distribution function does not show the shape
of the distribution quite as clearly as a plot of the relative frequency histogram does. It
requires more effort to determine if the distribution is symmetric or skewed. We can see
from Figure 1.11 that for this data set the values of $q(p)$ are changing more rapidly for
$p \ge 0.8$. This means the data are not symmetric but are positively skewed with a long right
tail.
Often when large data sets are reported in the media or research journals, the individual
observations are not reported. Sometimes only a graph like the empirical cumulative
distribution function is given. What information can we obtain from the graph of the
empirical cumulative distribution function? In addition to the information about the shape
mentioned above, the graph allows us to determine the $p$th quantile or $100p$th percentile
$q(p)$. For example, from Figure 1.11 we can determine, using the red dashed lines, that
the lower quartile $= q(0.25) = 0.9$, the median $= q(0.5) = 2.6$, and the upper quartile
$= q(0.75) = 5.3$. These are not exactly the same values that would be obtained if we had
all the data and used Definition 1; however, the values would be very close. From $q(0.75)$
and $q(0.25)$ we can determine that the IQR $= q(0.75) - q(0.25) = 5.3 - 0.9 = 4.4$. Finally,
we can also see that $y_{(1)} = 0.0$ and $y_{(100)} = 16.1$ and therefore the range $= 16.1 - 0.0 = 16.1$.
[Figure 1.11: Empirical cumulative distribution function for 100 observations generated from an Exponential model]
The empirical cumulative distribution function can also be used to compare two data
sets by graphing their empirical cumulative distribution functions on the same graph as
shown in the next example.
[Figure 1.12: Empirical cumulative distribution functions of heights for males and for females]
Boxplots
In many situations, we want to compare the values of a variate for two or more groups.
For example, in Example 1.3.2 Continued we compared the heights for males versus females
by plotting side-by-side empirical distribution functions. When the number of groups is
large or the sample sizes within groups are small, side-by-side boxplots (also called box and
whisker plots) are a convenient way to display the data.
A boxplot gives a graphical summary of the shape of the distribution and is usually
displayed vertically. The line inside the box corresponds to the median $q(0.5)$. The top
edge of the box corresponds to the upper quartile $q(0.75)$ and the lower edge of the box
corresponds to the lower quartile $q(0.25)$. The so-called whiskers extend down and up
from the box to a horizontal line. The lower line is placed at the smallest observed data
value that is larger than the value $q(0.25) - 1.5 \times \text{IQR}$, where IQR $= q(0.75) - q(0.25)$ is
the interquartile range. The upper line is placed at the largest observed data value that
is smaller than the value $q(0.75) + 1.5 \times \text{IQR}$. Values beyond the whiskers (often called
outliers) are plotted with special symbols.
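The boxplot quantities just described can be computed as follows. The quartile convention here is the "inclusive" one from Python's `statistics` module; software packages differ slightly in this choice, and the weights below are made up for illustration:

```python
import statistics

def boxplot_summary(y):
    """Quartiles, whisker endpoints, and outliers as described above.
    Quartiles use statistics.quantiles(..., method='inclusive'), one of
    several conventions in use."""
    q1, med, q3 = statistics.quantiles(y, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [v for v in y if lower_fence <= v <= upper_fence]
    whiskers = (min(inside), max(inside))  # where the whisker lines are drawn
    outliers = [v for v in y if v < lower_fence or v > upper_fence]
    return q1, med, q3, whiskers, outliers

# Hypothetical weights (kg) with two extreme values
y = [40, 62, 64, 66, 70, 72, 75, 79, 80, 111]
q1, med, q3, whiskers, outliers = boxplot_summary(y)
print(q1, med, q3)  # the box edges and median line
print(whiskers)     # (62, 80): smallest/largest non-outlying values
print(outliers)     # [40, 111]: plotted individually with special symbols
```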
[Figure 1.13: Side-by-side boxplots of weights (kg) for males and females]
Figure 1.13 displays side-by-side boxplots of male and female weights from Example
1.3.2. As mentioned previously, when large data sets are reported in the media or research
journals the individual observations are not reported. What information can we obtain
from these two boxplots?
The shape and spread of the two distributions are very similar. For the males and the
females, the center line in the box, which corresponds to the median, divides both the box
and the whiskers approximately in half which indicates that both distributions are roughly
symmetric about the median. For the females there are two large outliers.
From the boxplots we can determine that the median weight for females is approximately
70 and for males the median weight is approximately 81. For females $q(0.25) = 62$,
$q(0.75) = 79$, IQR $= 79 - 62 = 17$, and range $= 111 - 40 = 71$. For males $q(0.25) = 73$,
$q(0.75) = 91$, IQR $= 91 - 73 = 18$, and range $= 117 - 52 = 65$. The IQR and range for
females are very similar to the IQR and range for males.

Since the boxplot for the males is shifted up relative to the boxplot for females, this
implies that males generally weigh more than females.
Boxplots are particularly useful for comparing more than two groups.
[Figure 1.14: Side-by-side boxplots of internet users (per 100 people) in 2015 for countries, grouped by continent]
Figure 1.14 shows a comparison of internet users (per 100 people) in 2015 for countries
in the world classified by continent (worldbank.org). The side-by-side boxplots make it
easy to see the differences and similarities between the countries in different continents. In
this example a unit is a country. The variate of interest measured for each country is the
number of internet users per 100 people. What type of variate is this? Why is the total
number of internet users not used?
For which continent is the median number of internet users per 100 people the smallest?
For which continent is the median number of internet users per 100 people the largest?
For which continent is the IQR the smallest? For which continent is the IQR the
largest? For which continent is the range the smallest? For which continent is the range
the largest? For which continent is the variability the smallest? For which continent is the
variability the largest?
For which continent is the distribution most symmetric? For which continent is the
distribution most asymmetric?
The graphical summaries discussed to this point deal with a single variate. If we have
data on two variates $x$ and $y$ for each unit in the sample, then the data set is represented as
$\{(x_i, y_i) : i = 1, 2, \ldots, n\}$. We are often interested in examining the relationships between
the two variates.
Scatterplots
A scatterplot, which is a plot of the points $(x_i, y_i)$, $i = 1, 2, \ldots, n$, can be used to see
whether two variates are related in some way.
[Figure 1.15: Scatterplot of weight versus height for males (r = 0.55)]
[Figure 1.16: Scatterplot of weight versus height for females (r = 0.31)]
Figures 1.15 and 1.16 give the scatterplots of y = weight versus x = height for males
and females respectively for the data in Example 1.3.2. As expected, there is a tendency
for weight to increase as height increases for both sexes. What might be surprising is the
variability in weights for a given height.
Run charts
A run chart is another type of two dimensional plot which is used when we are interested
in a graphical summary which illustrates how a single variate is changing over time.
In Figure 1.17 the run chart shows the closing value of the Canadian dollar in Chinese
yuan for the 67 business days between May 1 and August 1, 2017. For example, on
August 1, 2017 the Canadian dollar was worth 5.3543 Chinese yuan. The data are from
google.com/finance. In a run chart consecutive points are joined with straight lines.
[Figure 1.17: Run chart of the closing value of the Canadian dollar in Chinese yuan, May 1 to August 1, 2017]
In Figure 1.18 the market share for the browsers Chrome, Safari and Internet Explorer
is graphed versus the months between June 2016 and July 2017 (gs.stat.counter.com).
[Run charts of market share (percentage) versus month for Chrome, Safari, and IE]
Figure 1.18: Market share for browsers June 2016 to July 2017
Note that for these data sets the sample correlation coefficient is not meaningful. Why?
1.4. PROBABILITY DISTRIBUTIONS AND STATISTICAL MODELS 29
- the variate values vary, so random variables can describe this variation
- empirical studies usually lead to inferences that involve some degree of uncertainty, and probability is used to quantify this uncertainty
- models allow us to characterize processes and to simulate them via computer experiments
f(y; θ) = P(Y = y; θ) = (500 choose y) θ^y (1 − θ)^(500 − y)   for y = 0, 1, ..., 500 and 0 ≤ θ ≤ 1

(Note that the sampling would be done without replacement so we are assuming that the
number sampled is small relative to the total number in the population.) The parameter θ
in the probability function represents the unknown proportion of smokers in the population
of young adults aged 14–20 living in Ontario at the time of the study, which is one attribute
of interest in the study.
Note that we use the notation P(Y = y; θ) and f(y; θ) to emphasize the importance of
the parameter θ in the model.
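Once a value of θ is chosen, the model produces concrete probabilities. A quick sketch in Python (the value θ = 0.3 is hypothetical, chosen only for illustration):

```python
from math import comb

def binomial_pf(y, n=500, theta=0.3):
    """f(y; theta) = (n choose y) * theta^y * (1 - theta)^(n - y)"""
    return comb(n, y) * theta**y * (1 - theta)**(n - y)

# Probability of observing exactly 150 smokers in the sample of 500
# when theta = 0.3 (a hypothetical value of the parameter)
p150 = binomial_pf(150)

# Sanity check: the probabilities over y = 0, 1, ..., 500 sum to 1
total = sum(binomial_pf(y) for y in range(501))
```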
Example 1.4.2 An Exponential distribution example
In Example 1.3.4, we examined the lifetime (in 1000 km) of a sample of 200 front brake
pads taken from the population of all cars of a particular model produced in a given time
period. We can model the lifetime of a single brake pad by a continuous random variable
Y with Exponential probability density function (p.d.f.)

f(y; θ) = (1/θ) e^(−y/θ)   for y > 0

The parameter θ > 0 represents the mean lifetime of the brake pads in the population since,
in the model, the expected value of Y is E(Y) = θ.
To model the sampling procedure, we assume that the data {y1, y2, ..., y200} represent
200 independent realizations of the random variable Y. That is, we let Yi = the lifetime
for the ith brake pad in the sample, i = 1, 2, ..., 200, and we assume that Y1, Y2, ..., Y200
are independent Exponential random variables each having the same mean θ.
We can use the model and the data to estimate θ and other attributes of interest such
as the proportion of brake pads that fail in the first 100,000 km of use. In terms of the
model, we can represent this proportion by

P(Y ≤ 100; θ) = ∫₀¹⁰⁰ f(y; θ) dy = 1 − e^(−100/θ)
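Since E(Y) = θ, a natural estimate of θ is the sample mean of the observed lifetimes. The 200 observed lifetimes are not reproduced here, so the Python sketch below uses a small made-up data set purely to illustrate the two-step calculation:

```python
from math import exp

# Hypothetical lifetimes (in 1000 km); in the example there would be 200 values
lifetimes = [35.2, 88.1, 120.5, 61.3, 14.9, 210.7, 97.4, 52.0]

# Estimate theta by the sample mean, since E(Y) = theta in the model
theta_hat = sum(lifetimes) / len(lifetimes)

# Estimated proportion failing in the first 100,000 km:
# P(Y <= 100; theta) = 1 - exp(-100 / theta)
p_fail = 1 - exp(-100 / theta_hat)
```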
linearly on her height x and we write this as E(Y | x) = α + βx. It would be possible to
reverse the roles of the two variates and consider weight to be the explanatory variate and
height to be the response variate if, for example, we wished to predict height using data on
individuals' weights.
Models for describing the relationships among two or more variates are considered in
more detail in Chapters 6 and 7.
Prediction methods are used when we use the observed data to predict a future value
for a variate of a unit to be selected from the process or population. For example, based
on the results of a clinical trial such as Example 1.2.3, we may wish to predict how much
an individual’s blood pressure would drop for a given dosage of a new drug, or, given the
past performance of a stock and other data, to predict the value of the stock at some point
in the future. Examples of prediction methods are given in Sections 4.7 and 6.2.
Statistical analysis involves the use of both descriptive statistics and formal methods of
estimation, prediction and hypothesis testing. As brief illustrations, we return to the first
two examples of Section 1.2.
Suppose we are interested in the question "Is the smoking rate among females higher than
the rate among males?" From the data, we see that the sample proportion of females
who smoke is 82/250 = 0.328 or 32.8% and the sample proportion of males who smoke is
71/250 = 0.284 or 28.4%. In the sample, the smoking rate for females is higher. But what
can we say about the whole population? To proceed, we formulate the hypothesis that
there is no difference in the population rates. Then, assuming the hypothesis is true, we
construct two Binomial models as in Example 1.4.1, each with a common parameter θ. We
can estimate θ using the combined data, so that θ̂ = 153/500 = 0.306 or 30.6%. Then, using
the model and the estimate, we can calculate the probability of observing a difference in
rates as large as the one in the data. Such a difference occurs about 20% of the time (if we
selected samples over and over and the hypothesis of no difference is true). Since a difference
this large happens fairly often, the observed data provide no evidence of a difference in the
population smoking rates. In Chapter 7 we discuss a formal method for testing the
hypothesis of no difference in rates between females and males.
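The probability of "a difference at least as large as the one observed" can be approximated by simulation. The Python sketch below is not the formal method of Chapter 7; it is a rough Monte Carlo illustration, and the value it produces depends on exactly how "as large a difference" is made precise (here, a two-sided comparison of counts), so it need not reproduce the 20% figure quoted above:

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

theta = 153 / 500   # pooled estimate of the common smoking rate
observed = 82 - 71  # observed difference of 11 smokers (out of 250 per group)

def binomial(n, p):
    """Simulate one Binomial(n, p) observation."""
    return sum(random.random() < p for _ in range(n))

reps = 10000
count = 0
for _ in range(reps):
    females = binomial(250, theta)  # simulated number of female smokers
    males = binomial(250, theta)    # simulated number of male smokers
    if abs(females - males) >= observed:
        count += 1

# Proportion of simulated samples with a difference at least as large
prop = count / reps
print(prop)
```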
Figure 1.19: Run chart of the volume (ml) for the new machine over time
First we examine whether the behaviour of the two machines is stable over time. In Figures
1.19 and 1.20, a run chart of the volumes over time for each machine is given. There is no
indication of a systematic pattern for either machine, so we have some confidence that the
data can be used to predict the performance of the machines in the near future.
Figure 1.20: Run chart of the volume for old machine over time
The sample mean and standard deviation for the new machine are 356.8 and 0.54 ml
respectively and, for the old machine, are 357.5 and 0.80 ml. Figures 1.21 and 1.22 show the
relative frequency histograms of the volumes for the new machine and the old machine
respectively. To see how well a Gaussian model might fit these data we superimpose Gaussian
probability density functions, with the mean equal to the sample mean and the standard
deviation equal to the sample standard deviation, on each histogram. The agreement is
reasonable given that the sample size for both data sets is only 40.
1.5. DATA ANALYSIS AND STATISTICAL INFERENCE 35
[Relative frequency histogram with superimposed G(356.76, 0.54) probability density function; sample skewness = 0.22, sample kurtosis = 2.38]
Figure 1.21: Relative frequency histogram of volumes (ml) for the new machine
[Relative frequency histogram with superimposed G(357.5, 0.80) probability density function]
Figure 1.22: Relative frequency histogram of volumes (ml) for the old machine
We can use the Gaussian model to estimate the long term proportion of cans that
fall below the required volume of 355 ml. For the new machine, Y ~ G(356.8, 0.54) and
P(Y ≤ 355) = 0.0005, so about 5 in 10,000 cans will be under-filled. For the old machine,
Y ~ G(357.5, 0.80) and P(Y ≤ 355) = 0.0008, so about 8 in 10,000 cans will be under-filled.
Of course these estimates are subject to a high degree of uncertainty because they are based
on small sample sizes.
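These two probabilities are values of the standard normal c.d.f., Φ((355 − μ)/σ), evaluated at the fitted means and standard deviations. A quick check in Python (the notes round the results to 0.0005 and 0.0008):

```python
from math import erf, sqrt

def phi(z):
    """Standard normal c.d.f. computed via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# P(Y <= 355) for Y ~ G(mean, sd) is Phi((355 - mean) / sd)
p_new = phi((355 - 356.8) / 0.54)  # new machine
p_old = phi((355 - 357.5) / 0.80)  # old machine
```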
We can see that the new machine is superior because of its smaller sample mean, which
translates into less overfill and hence less cost to the manufacturer. It is possible to adjust
the mean of the new machine to a lower value because of its smaller standard deviation.
(b) Suppose the data are transformed using ui = a + byi, i = 1, 2, ..., n where a and
b are constants with b ≠ 0. How are the sample mean and sample median of the
data set {u1, u2, ..., un} related to ȳ and m̂?
(c) Suppose the data are transformed using vi = yi², i = 1, 2, ..., n. How are the
sample mean and sample median of {v1, v2, ..., vn} related to ȳ and m̂?
(d) Suppose another observation y0 is added to the data set. Determine the mean
of the augmented data set in terms of ȳ and y0. What happens to the sample
mean as the magnitude of y0 increases?
(e) Suppose another observation y0 is added to the data set. Determine the median
of the augmented data set. What happens to the sample median as the
magnitude of y0 increases?
(f) Use (d) and (e) to explain why the sample median income of a country might be
a more appropriate summary than the sample mean income.
(g) Show that V(θ) = Σ_{i=1}^{n} (yi − θ)² is minimized when θ = ȳ.
(h) Challenge Problem: Show that W(θ) = Σ_{i=1}^{n} |yi − θ| is minimized when θ = m̂.
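The contrast in parts (d) and (e) of Problem 1 is easy to see numerically. A small Python illustration with made-up data (it previews the direction of the answer, so try the algebra first):

```python
import statistics

# A small hypothetical data set, and the same data with one extreme
# observation y0 added
y = [2.0, 3.0, 3.5, 4.0, 5.0]
y0 = 1000.0
augmented = y + [y0]

mean_before = statistics.mean(y)          # 3.5
median_before = statistics.median(y)      # 3.5
mean_after = statistics.mean(augmented)   # pulled far toward y0
median_after = statistics.median(augmented)  # barely moves
```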
2. The sample standard deviation s, the interquartile range IQR = q(0.75) − q(0.25),
and the range = y(n) − y(1) are three different measures of the variability of a data
set {y1, y2, ..., yn}.
(b) Suppose the data are transformed using ui = a + byi, i = 1, 2, ..., n where a and
b are constants and b ≠ 0. How are the sample standard deviation, IQR, and
range of the transformed data set {u1, u2, ..., un} related to the sample standard
deviation, IQR, and range of the original data set {y1, y2, ..., yn}?
(c) Suppose another observation y0 is added to the data set. Use the result in (a)
to write the sample standard deviation of the augmented data set in terms of
y0, s, and ȳ. What happens to the sample variance of the augmented data set
as the magnitude of y0 increases?
(d) If another observation y0 is added to the data set, what happens to the IQR of
the augmented data set as the magnitude of y0 increases?
(e) If another observation y0 is added to the data set, what happens to the range of
the augmented data set as the magnitude of y0 increases?
3. The sample skewness and kurtosis are two different measures of the shape of a data
set {y1, y2, ..., yn}. Let g1 be the sample skewness and let g2 be the sample kurtosis
of the data set. Suppose we transform the data so that ui = a + byi, i = 1, 2, ..., n
where a and b are constants and b ≠ 0. How are the sample skewness and sample
kurtosis of the data set {u1, u2, ..., un} related to g1 and g2?
4. Suppose the data {c1, c2, ..., c24} represent the costs of production for a firm every
month from January 2018 to December 2019. For this data set the sample mean was
$2500, the sample standard deviation was $5500, the sample median was $2600, the
sample skewness was 1.2, the sample kurtosis was 3.9, and the range was $7500. The
relationship between cost and revenue is given by ri = 7ci + 1000, i = 1, 2, ..., 24.
Find the sample mean, standard deviation, median, skewness, kurtosis and range of
the revenues.
5. The data below are the (centered) diameters of 50 pistons sampled from one day's
production; the data are available in the file diameterdata.txt posted on the course
website.
−12.8  −7.3  −3.9  −3.4  −2.9  −2.7  −2.5  −2.3  −1.0  −0.9
−0.8  −0.7  −0.6  −0.4  −0.4  −0.2  0.0  0.5  0.6  0.7
1.2  1.8  1.8  2.0  2.1  2.5  2.6  2.6  2.7  2.8
3.3  3.4  3.5  3.8  4.3  4.6  4.7  5.1  5.4  5.7
5.8  6.6  6.6  7.0  7.2  7.9  8.5  8.6  8.7  8.9

Σ_{i=1}^{50} yi = 100.7     Σ_{i=1}^{50} yi² = 1110.79
(a) Plot a relative frequency histogram of the data. Is the process producing pistons
within the specifications?
(b) Calculate the sample mean ȳ and the sample median of the diameters.
(c) Calculate the sample standard deviation s and the IQR.
(d) Give the five number summary for these data.
1.7. CHAPTER 1 PROBLEMS 39
(e) Such data are often summarized using a single performance index called Ppk,
defined as

Ppk = min( (U − ȳ)/(3s), (ȳ − L)/(3s) )

where (L, U) = (−10, 10) are the lower and upper specification limits. Calculate
Ppk for these data.
(f) Explain why larger values of Ppk (i.e. greater than 1) are desirable.
(g) Suppose we fit a Gaussian model to the data with mean and standard deviation
equal to the corresponding sample quantities, that is, with μ = ȳ and σ = s. Use
the fitted model to estimate the proportion of diameters (in the process) that
are out of specification.
6. In the above problem, we saw how to estimate the performance measure Ppk based
on a sample of 50 pistons, a very small proportion of one day's production. To get an
idea of how reliable this estimate is, we can model the process output by a Gaussian
random variable Y with mean and standard deviation equal to the corresponding
sample quantities. The following R code generates 50 observations and calculates
Ppk. This is done 1000 times using a loop statement.
#Import dataset diameterdata.txt from the course website using RStudio
avgx<-mean(diameterdata$diameter) #sample mean
sdx<-sd(diameterdata$diameter) #sample standard deviation
temp<-rep(0,1000) #Store the 1000 generated Ppk values in vector temp
for (i in 1:1000) { #Begin loop
y<-rnorm(50, avgx, sdx) #Generate 50 new observations from a
# G(avgx,sdx) distribution
avg<-mean(y) #sample mean of new data
s<-sd(y) #sample std of new data
ppk<-min((10-avg)/(3*s),(avg+10)/(3*s)) #Ppk for new data
temp[i]<-ppk #Store value of Ppk for this iteration
}
hist(temp) #Plot histogram of 1000 Ppk values
mean(temp) #average of the 1000 Ppk values
sd(temp) #standard deviation of the 1000 Ppk values
(a) Compare the Ppk from the original data with the average Ppk value from the
1000 iterations. Mark the original Ppk value on the histogram of generated Ppk
values. What do you notice? What would you conclude about how good the
original estimate of Ppk was?
(b) Repeat the above exercise but this time use a sample of 300 pistons rather than
50 pistons. What conclusion would you make about using a sample of 300 versus
50 pistons?
7. Graph the empirical cumulative distribution function and boxplot for the data
7.6 4.3 5.2 4.5 1.1 8.5 14.0 6.3 3.9 7.2
8. Run the following code on the can filling data and compare with the summaries given
in Example 1.5.2.
#Import dataset canfillingdata.txt from the course website using RStudio
v1<-canfillingdata$volume[canfillingdata$machine==1] # New Machine
v2<-canfillingdata$volume[canfillingdata$machine==2] # Old Machine
skewness<-function(x) {(sum((x-mean(x))^3)/length(x))/
(sum((x-mean(x))^2)/length(x))^(3/2)}
kurtosis<- function(x) {(sum((x-mean(x))^4)/length(x))/
(sum((x-mean(x))^2)/length(x))^2}
# Numerical summaries by machine
c(mean(v1),sd(v1),skewness(v1),kurtosis(v1))
fivenum(v1) # Gives the 5 number summary
# R defines the 1st and 3rd quartiles slightly differently than Definition 1
c(mean(v2),sd(v2),skewness(v2),kurtosis(v2))
fivenum(v2)
# Plot run charts by machine, one above the other,
# type="l" joins the points on the plots
par(mfrow=c(2,1)) # Creates 2 plotting areas, one above the other
plot(1:40,v1,xlab="Hour",ylab="Volume",main="New Machine",
ylim=c(355,360),type="l")
plot(1:40,v2,xlab="Hour",ylab="Volume",main="Old Machine",
ylim=c(355,360),type="l")
# Plot side by side relative frequency histograms with same intervals
par(mfrow=c(1,2)) # Creates 2 plotting areas side by side
# Plot relative frequency histogram for New Machine
library(MASS) # truehist is in MASS library
truehist(v1,h=0.5,xlim=c(355,361),xlab="Volume",ylab="Density",main="New Machine")
# Superimpose Gaussian pdf onto histogram
curve(dnorm(x,mean(v1),sd(v1)),add=TRUE,from=355,to=359,lwd=2)
# Plot relative frequency histogram for Old Machine
truehist(v2,h=0.5,xlim=c(355,361),xlab="Volume",ylab="Density",main="Old Machine")
# Superimpose Gaussian pdf onto histogram
curve(dnorm(x,mean(v2),sd(v2)),add=TRUE,from=355,to=361,lwd=2)
par(mfrow=c(1,1)) # Change back to one plotting area
9. The data below show the lengths in centimeters of 43 male coyotes and 40 female
coyotes captured in Nova Scotia. (Based on Table 2.3.2 in Wild and Seber 1999.)
The data are available in the file coyotedata.txt posted on the course website.
Females x
71.0 73.7 80.0 81.3 83.5 84.0 84.0 84.5 85.0 85.0 86.0 86.4
86.5 86.5 88.0 87.0 88.0 88.0 88.5 89.5 90.0 90.0 90.2 91.0
91.4 91.5 91.7 92.0 93.0 93.0 93.5 93.5 93.5 96.0 97.0 97.0
97.8 98.0 101.6 102.5

Σ_{i=1}^{40} xi = 3569.6     Σ_{i=1}^{40} xi² = 320223.38
Males y
78.0 80.0 80.0 81.3 83.8 84.5 85.0 86.0 86.4 86.5 87.0 88.0
88.0 88.9 88.9 90.0 90.5 91.0 91.0 91.0 91.4 92.0 92.5 93.0
93.5 95.0 95.0 95.0 94.0 95.5 96.0 96.0 96.0 96.0 97.0 98.5
100.0 100.5 101.0 101.6 103.0 104.1 105.0

Σ_{i=1}^{43} yi = 3958.4     Σ_{i=1}^{43} yi² = 366276.84
(a) Plot relative frequency histograms of the lengths for females and males
separately. Be sure to use the same intervals.
(b) Determine the five number summary for each data set.
(c) Plot side by side boxplots for the females and males. What do you notice?
(d) Compute the sample mean and sample standard deviation for the lengths of the
female and male coyotes separately. Assuming μ = sample mean and
σ = sample standard deviation, overlay the corresponding Gaussian probability
density function on the histograms for the females and males separately.
Comment on how well the Gaussian model fits each data set.
(e) Plot the empirical distribution function of the lengths for females and males
separately on the same graph. What do you notice?
11. Does the value of an actor influence the amount grossed by a movie? The “value
of an actor” will be measured by the average amount the actor's movies have made.
The “amount grossed by a movie” is measured by taking the highest grossing movie
in which that actor played a major part. For example, Tom Hanks, whose value is
103.2, had his best results with Toy Story 3 (gross 415.0). All numbers are corrected
to 2012 dollar amounts and have units “millions of U.S. dollars”. Twenty actors
were selected by taking the first twenty alphabetically listed by name on the website
(http://boxofficemojo.com/people/). For each of the twenty actors, the value of the
actor (x) and the gross of their highest grossing movie (y) were determined. The data
are given below as well as in the file actordata.txt posted on the course website.
Actor      1      2      3      4      5      6      7      8      9     10
Value (x)  67   49.6   37.7   47.3   47.3   32.9   36.5   92.8   17.6   14.4
Gross (y) 177.2 201.6  183.4  55.1  154.7  182.8  277.5  415    90.8   83.9

Actor     11     12     13     14     15     16     17     18     19     20
Value (x) 51.1   54     30.5   42.1   23.6   62.4   32.9   26.9   43.7   50.3
Gross (y) 158.7 242.8   37.1  220    146.3  168.4  173.8  58.4  199    533

Σ_{i=1}^{20} xi = 860.6     Σ_{i=1}^{20} xi² = 43315.04     Σ_{i=1}^{20} xi yi = 184540.93

Σ_{i=1}^{20} yi = 3759.5     Σ_{i=1}^{20} yi² = 971560.19
(a) What are the two variates in this data set? Choose one variate to be an explana-
tory variate and the other to be a response variate. Justify your choice.
(b) Plot a scatterplot of the data.
(c) Calculate the sample correlation for the data (xi, yi), i = 1, 2, ..., 20. Is there a
strong positive or negative relationship between the two variates?
(d) Is it reasonable to conclude that the explanatory variate in this problem causes
the response variate? Explain.
(e) Here is R code to plot the scatterplot (in blue) and calculate the sample corre-
lation:
#Import dataset actordata.txt from the course website using RStudio
attach(actordata)
12. In this course we mainly focus on methods for analyzing univariate and bivariate
data sets. In the real world multivariate data sets are much more common. Learning
how to analyze univariate and bivariate data sets gives us the basic tools for analyzing
multivariate data sets. This problem looks at simple numerical and graphical
summaries for a multivariate data set.
Computers and smartphones are just two of the many devices that use integrated
circuits. A silicon wafer is a thin slice of semiconductor material, such as a silicon
crystal, used in the fabrication of integrated circuits. The thickness of such wafers is
very important since thinner wafers are less costly. However the wafers cannot be too
thin since then they can crack more easily. The thickness is measured at nine locations
on a wafer selected for inspection for each batch of wafers. The data for 182 consecutive
batches are available in the file waferdata.txt posted on the course website. The data
have been approximately centered and scaled. These data could be used to study the
relationships between the thicknesses at different locations.
(a) Comment on any similarities or differences in the numerical summaries for each
of the 9 locations.
(b) What do you notice about the sample correlations between location 1 with
locations 2, 3, 4, 5 as compared to the sample correlations between location 1 and
locations 6, 7, 8, 9? Does what you observe make sense?
(c) Compare the variability in the points for the different scatterplots. What do you
notice?
14. In a very large population a proportion θ of people have blood type A. Suppose n
people are selected at random. Define the random variable Y = number of people
with blood type A in the sample of size n.
(a) What is the probability function for Y? What assumptions have you made?
(b) What are E(Y) and Var(Y)?
(c) Suppose n = 50. What is the probability of observing 20 people with blood type
A as a function of θ?
(d) If for n = 50 we observed y = 20 people with blood type A what is a reasonable
estimate of θ based on this information? Estimate the probability that in a
sample of n = 10 there will be at least one person with blood type A.
(e) More generally, suppose in a given experiment the random variable of interest Y
has a Binomial(n, θ) distribution. If the experiment is conducted and y successes
are observed what is a good estimate of θ based on this information?
(f) Let Y ~ Binomial(n, θ). Find E(Y/n) and Var(Y/n). What happens to Var(Y/n)
as n → ∞? What does this imply about how far Y/n is from θ for large n?
Approximate

P( Y/n − 1.96 √(θ(1 − θ)/n) ≤ θ ≤ Y/n + 1.96 √(θ(1 − θ)/n) )
15. Visits to a particular website occur at random at the average rate of λ visits per
second. Suppose it is reasonable to use a Poisson process to model this process.
Define the random variable Y = number of visits to the website in one second.
16. Suppose it is reasonable to model the IQs of UWaterloo Math students using a
G(μ, σ) distribution. Define the random variable Y = IQ of a randomly chosen
UWaterloo Math student.
127 108 127 136 125 130 127 117 123 112 129 109 109 112 91 134
Σ_{i=1}^{16} yi = 1916     Σ_{i=1}^{16} yi² = 231618
(i) What is a reasonable estimate of μ based on these data?
(ii) What is a reasonable estimate of σ² based on these data?
(iii) Based on these data, estimate the probability that a randomly chosen
UWaterloo Math student has an IQ greater than 120.
(c) Suppose Yi ~ G(μ, σ), i = 1, 2, ..., n independently.
(i) What is the distribution of

Ȳ = (1/n) Σ_{i=1}^{n} Yi ?

Find E(Ȳ) and Var(Ȳ). What happens to Var(Ȳ) as n → ∞? What does
this imply about how far Ȳ is from μ for large n?
(ii) Find P( Ȳ − 1.96 σ/√n ≤ μ ≤ Ȳ + 1.96 σ/√n ).
(iii) Find the smallest value of n such that P( |Ȳ − μ| ≤ 1.0 ) ≥ 0.95 if σ = 12.
17. Suppose it is reasonable to model the battery life of a certain type of laptop using
the Exponential(θ) distribution. Define the random variable Y = battery life of a
randomly chosen laptop.

48.0 1047.2 802.3 165.6 76.7 64.2 158.6 338.3 200.6 362.8
119.5 55.9 411.3 706.9 16.2 1277.6 49.4 22.6 1078.4 440.7

Σ_{i=1}^{20} yi = 7442.8
S² = (1/(n − 1)) Σ_{i=1}^{n} (Yi − Ȳ)²
   = (1/(n − 1)) ( Σ_{i=1}^{n} Yi² − n Ȳ² )
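The two expressions for S² are algebraically identical, which is convenient when only the sums Σyi and Σyi² are reported (as in several problems above). A quick numerical check in Python with made-up data:

```python
import statistics

# A small data set for checking the identity
y = [4.1, 5.6, 3.2, 7.8, 5.0, 6.3]
n = len(y)
ybar = sum(y) / n

# Definition: sum of squared deviations divided by n - 1
lhs = sum((yi - ybar) ** 2 for yi in y) / (n - 1)

# Shortcut: uses only sum(yi^2) and ybar
rhs = (sum(yi ** 2 for yi in y) - n * ybar ** 2) / (n - 1)
```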
19. Is the graph in Figure 1.24 effective in conveying information about the snacking
behaviour of students at Ridgemont High School?
[Figure 1.24: Bar chart of snack choices (Candy, Chips, Chocolate Bars, Cookies, Crackers, Fruit, Ice Cream, Popcorn, Pretzels, Vegetables) for boys and girls]
20. The pie chart in Figure 1.25, from Fox News, shows the support for various Republican
Presidential candidates in 2012. What do you notice about this pie chart?
Figure 1.25: Pie chart of support for Republican Presidential candidates
21. The graphs in Figures 1.26 and 1.27 are two more classic Fox News graphs. What do
you notice? What political message do you think they were trying to convey to their
audience?
22. Information about the mortality from malignant neoplasms (cancer) for females living
in Ontario is given in Figures 1.28 and 1.29 for the years 1970 and 2000 respectively.
The same information displayed in these two pie charts is also displayed in the bar
graph in Figure 1.30. Which display seems to carry the most information?
[Pie chart with sectors: Lung, Breast, Colorectal, Stomach, Other]
Figure 1.28: Mortality from malignant neoplasms for females in Ontario 1970
[Pie chart with sectors: Lung, Breast, Colorectal, Stomach, Other]
Figure 1.29: Mortality from malignant neoplasms for females in Ontario in 2000
[Grouped bar graph (percentages, 0–40) comparing 1970 and 2000 for Lung, Leuk. & Lymph., Breast, Colorectal, Stomach, and Other]
Figure 1.30: Mortality from malignant neoplasms for females living in Ontario,
1970 and 2000
24. A study led by Beth Israel Medical Center in New York City has found that live
music can be beneficial to premature babies. In the study, music therapists helped
parents transform their favorite tunes into lullabies. The researchers concluded that
live music, played or sung, helped to slow infants' heartbeats, calmed their breathing,
improved their sucking behaviors (important for feeding), aided their sleep and
promoted their states of quiet alertness. Doctors and researchers say that by reducing
stress and stabilizing vital signs, music can allow infants to devote more energy to
normal development.
The two-year study was conducted between January 2011 and December 2012 in 11
hospitals in New York state. Only hospitals which received approval from their
institutional review boards were included in the study. The study involved 272
premature infants aged 32 weeks with respiratory distress syndrome, clinical sepsis
(a life-threatening condition that arises when the body's response to infection causes
injury to its own tissues and organs), and/or SGA (small for gestational age). Over a
two week period the babies experienced 4 different musical “treatments”. Two of the
treatments involved musical instruments, one involved singing, and the control
treatment was no music at all. The instruments and singing were intended to
approximate womb sounds.
The first musical instrument was the Remo ocean disc, a round instrument filled with
tiny metal balls. When the disc is rotated, the metal balls move slowly to create a
sound effect that is quiet and meant to simulate the fluid sounds of the womb. The
second musical instrument was a gato box, a small rectangular tuned instrument used
to simulate the heartbeat sound that the baby would hear in the womb. The singing
treatment consisted of live singing of a lullaby chosen by a parent. If a parent did not
choose a song then “Twinkle, Twinkle, Little Star” was used.
Each of the four treatments was given 2 times per week over the course of the two
week study period. The presentation of the treatments was varied by day of the
week within each week and by the time of day and randomized (either morning or
afternoon) across the 2 weeks. For each treatment the baby’s heart rate (beats per
minute), respiratory rate (number of breaths per minute), oxygen saturation (amount
of oxygen in the blood), sucking pattern (active/medium/slow/none), and activity
level (active/quiet/irritable/sleeping) were recorded.
Researchers found that the gato box, the ocean disc and singing all slowed a baby's
heart rate, though singing seemed to be most effective. Singing also increased the time
babies stayed quietly alert. Sucking behavior improved most with the gato box. The
breathing rate slowed the most and sleeping was the best with the ocean disc. Babies
hearing songs their parents chose had a better sucking pattern than those who
heard “Twinkle, Twinkle, Little Star.” But the “Twinkle” babies had slightly more
oxygen saturation in their blood. Dr. Loewy, who trains therapists worldwide, said
it did not matter whether parents or music therapists sang, or whether babies were
in incubators or held.
Dr. Lance A. Parton, associate director of the regional neonatal intensive care unit
at Westchester Medical Center’s Maria Fareri Children’s Hospital, which participated
in the research, said it would be useful to see if music could help the sickest and most
premature babies, who were not in the study. “Live music is optimal because it’s in
the moment and can adapt to changing conditions,” said Dr. Standley, a professor
of medical music therapy at Florida State University. “If the baby appears to be
falling asleep, you can sing quieter. Recorded music can’t do that. But there are so
many premature babies and so few trained live producers of music therapy that it’s
important to know what recorded music can do.”
25. Many people do not realize how important statistics is in our everyday life. We are
surrounded by examples. Here is a wonderful example given by John Sall, co-founder
and Executive VP of the statistical software company SAS, on the occasion of the
International Year of Statistics in 2013.
“You brush your teeth. The fluoride in the toothpaste was studied by scientists using
statistical methods to carefully assure the safety and effectiveness of the ingredient
and the proper concentration. The toothpaste was formulated through a series of
designed experiments that determined the optimal formulation through statistical
modeling. The toothpaste production was monitored by statistical process control to
ensure quality and consistency, and to reduce variability.
The attributes of the product were studied in consumer trials using statistical meth-
ods. The pricing, packaging and marketing were determined through studies that used
statistical methods to determine the best marketing decisions. Even the location of
the toothpaste on the supermarket shelf was the result of statistically based studies.
The advertising was monitored using statistical methods. Your purchase transaction
became data that was analyzed statistically. The credit card used for the purchase
was scrutinized by a statistical model to make sure that it wasn’t fraudulent.
So statistics is important to the whole process of not just toothpaste, but every prod-
uct we consume, every service we use, every activity we choose. Yet we don’t need
to be aware of it, since it is just an embedded part of the process. Statistics is useful
everywhere you look.”
Think of an example in your everyday life in which statistics played an important
role.
2. STATISTICAL MODELS AND
MAXIMUM LIKELIHOOD
ESTIMATION
2. Past experience with data sets from the population or process, which has shown that
certain distributions are suitable.
³The material in this section is largely a review of material you have seen in a previous probability course.
This material is available in the STAT 220/230 Notes which are posted on the course website.
⁴The University of Wisconsin-Madison statistician George E. P. Box (18 October 1919 – 28 March 2013)
said of statistical models that “All models are wrong but some are useful”, which is to say that although
models rarely fit very large amounts of data perfectly, they do assist in describing and drawing inferences
from real data.
56 2. STATISTICAL MODELS AND MAXIMUM LIKELIHOOD ESTIMATION
In probability, there is a large emphasis on factor 1 above, and there are many “families”
of probability distributions that describe certain types of situations. For example, the
Binomial distribution was derived as a model for outcomes in repeated independent trials
with two possible outcomes on each trial while the Poisson distribution was derived as a
model for the random occurrence of events in time or space. The Gaussian or Normal
distribution, on the other hand, is often used to represent the distributions of continuous
measurements such as the heights or weights of individuals. This choice is based largely on
past experience that such models are suitable and on mathematical convenience.
In choosing a model we usually consider families of probability distributions. To be specific, we suppose that for a random variable Y we have a family of probability functions/probability density functions f(y; θ), indexed by the parameter θ (which may be a vector of values). In order to apply the model to a specific problem we need to choose a value for θ. The process of selecting a value for θ based on the observed data is referred to as "estimating" the value of θ, or "fitting" the model. The next section describes the method of maximum likelihood, which is the most widely used method for estimating θ.
Most applications require a sequence of steps in the formulation (the word "specification" is also used) of a model. In particular, we often start with some family of models in mind, but find after examining the data set and fitting the model that it is unsuitable in certain respects. (Methods for checking the suitability of a model will be discussed in Section 2.4.) We then try other models, and perhaps look at more data, in order to work towards a satisfactory model. This is usually an iterative process, which is sometimes represented by a diagram.
Statistics devotes considerable effort to the steps of this process. We will focus on settings in which the models are not too complicated, so that model formulation problems are minimized. There are several distributions that you should review before continuing since they will appear in many examples. See the STAT 220/230/240 Course Notes available on the course website. You should also consult the Table of Distributions given in Chapter 10 for a condensed table of properties of these distributions, including their means, variances and moment generating functions.
2.1. CHOOSING A STATISTICAL MODEL 57
                        Discrete                                              Continuous

    c.d.f.              F(x) = P(X ≤ x) = Σ_{t ≤ x} P(X = t)                  F(x) = P(X ≤ x) = ∫_{−∞}^{x} f(t) dt
                        F is a right-continuous step                          F is a continuous
                        function for all x ∈ ℝ                                function for all x ∈ ℝ

    p.f./p.d.f.         f(x) = P(X = x)                                       f(x) = (d/dx) F(x) ≠ P(X = x) = 0

    Probability         P(X ∈ A) = Σ_{x ∈ A} P(X = x)                         P(a < X ≤ b) = F(b) − F(a)
    of an event                  = Σ_{x ∈ A} f(x)                                          = ∫_{a}^{b} f(x) dx

    Total probability   Σ_{all x} P(X = x) = Σ_{all x} f(x) = 1               ∫_{−∞}^{∞} f(x) dx = 1

    Expectation         E[g(X)] = Σ_{all x} g(x) f(x)                         E[g(X)] = ∫_{−∞}^{∞} g(x) f(x) dx
Binomial Distribution
The discrete random variable (r.v.) Y has a Binomial distribution if its probability function is of the form

    P(Y = y; θ) = f(y; θ) = (n choose y) θ^y (1 − θ)^(n−y)   for y = 0, 1, …, n

where θ is a parameter with 0 < θ < 1. For convenience we write Y ~ Binomial(n, θ). Recall that E(Y) = nθ and Var(Y) = nθ(1 − θ).
Poisson distribution
The discrete random variable Y has a Poisson distribution if its probability function is of the form

    f(y; θ) = (θ^y e^(−θ)) / y!   for y = 0, 1, 2, …

where θ is a parameter with θ ≥ 0. We write Y ~ Poisson(θ). Recall that E(Y) = θ and Var(Y) = θ.
58 2. STATISTICAL MODELS AND MAXIMUM LIKELIHOOD ESTIMATION
Exponential distribution
The continuous random variable Y has an Exponential distribution if its probability density function is of the form

    f(y; θ) = (1/θ) e^(−y/θ)   for y ≥ 0

where θ is a parameter with θ > 0. We write Y ~ Exponential(θ). Recall that E(Y) = θ and Var(Y) = θ².
Multinomial distribution
The Multinomial distribution is a multivariate distribution in which the discrete random variables Y1, Y2, …, Yk (k ≥ 2) have the joint probability function

    P(Y1 = y1, Y2 = y2, …, Yk = yk; θ) = f(y1, y2, …, yk; θ)
                                       = [n! / (y1! y2! ⋯ yk!)] θ1^(y1) θ2^(y2) ⋯ θk^(yk)    (2.1)

where yi = 0, 1, … for i = 1, 2, …, k and Σ_{i=1}^{k} yi = n. The elements of the parameter vector θ = (θ1, θ2, …, θk) satisfy 0 ≤ θi ≤ 1 for i = 1, 2, …, k and Σ_{i=1}^{k} θi = 1. This distribution is a generalization of the Binomial distribution. It arises when there are repeated independent trials, where each trial has k possible outcomes (call them outcomes 1, 2, …, k), and the probability outcome i occurs is θi. If Yi, i = 1, 2, …, k is the number of times that outcome i occurs in a sequence of n independent trials, then (Y1, Y2, …, Yk) have the joint probability function given in (2.1). We write (Y1, Y2, …, Yk) ~ Multinomial(n, θ).
2.2. MAXIMUM LIKELIHOOD ESTIMATION 59
Since Σ_{i=1}^{k} Yi = n, we can rewrite f(y1, y2, …, yk; θ) using only k − 1 variables, say y1, y2, …, y(k−1), by replacing yk with n − y1 − ⋯ − y(k−1). We see that the Multinomial distribution with k = 2 is just the Binomial distribution, where the two possible outcomes are S (Success) and F (Failure).
We now turn to the problem of fitting a model. This requires estimating or assigning numerical values to the parameters in the model, for example, θ in an Exponential model or μ and σ in the Gaussian model.
Let the discrete (vector) random variable Y represent potential data that will be used to estimate θ, and let y represent the actual observed data that are obtained in a specific application. Note that to apply the method of maximum likelihood, we must know (or make assumptions about) how the data y were collected. It is usually assumed here that the data set consists of measurements on a random sample of units from a population or process.
    L(θ) = L(θ; y) = P(Y = y; θ)   for θ ∈ Ω
Note that the likelihood function is a function of the parameter θ and the given data y. For convenience we usually write just L(θ). Also, the likelihood function is the probability that we observe the data y, considered as a function of the parameter θ. Obviously values of the parameter that make the observed data y more probable would seem more credible or likely than those that make the data less probable. Therefore values of θ for which L(θ) is large are more consistent with the observed data y. This seems like a "sensible" approach, and it turns out to have very good properties.
Definition 10 The value of θ which maximizes L(θ) for given data y is called the maximum likelihood estimate⁵ (m.l. estimate) of θ. It is the value of θ which maximizes the probability of observing the data y. This value is denoted θ̂.
We are surrounded by polls. They guide the policies of political leaders, the products that are developed by manufacturers, and increasingly the content of the media. The following is an example of a public opinion poll.
[Figure: bar chart of poll responses, with categories Support, Somewhat Support, Unsure, Oppose, Somewhat Oppose]
of 1000 Canadians is 3.1 percentage points, 19 times out of 20. How do we interpret this
statement?
Suppose that the random variable Y represents the number of units in a sample of
n units drawn from a very large population who have a certain characteristic of interest.
Suppose we assume that Y is closely modelled by a Binomial distribution with probability function

    P(Y = y; θ) = f(y; θ) = (n choose y) θ^y (1 − θ)^(n−y)   for y = 0, 1, …, n and 0 ≤ θ ≤ 1

where θ represents the proportion of the large population that have the characteristic. Suppose that y units in the sample of size n have the characteristic. The likelihood function for θ based on these data is

    L(θ) = (n choose y) θ^y (1 − θ)^(n−y)   for 0 ≤ θ ≤ 1    (2.3)
If y ≠ 0 and y ≠ n then it can be shown that (2.3) attains its maximum value at θ = θ̂ = y/n by solving dL(θ)/dθ = 0. The estimate θ̂ = y/n is called the sample proportion.
For the Nanos poll, suppose we are interested in θ = proportion of Canadian adults who support or somewhat support the recreational use of marijuana. From this poll we have y = 680 people out of n = 1000 people who support or somewhat support the recreational use of marijuana, so the likelihood function for θ is

    L(θ) = (1000 choose 680) θ^680 (1 − θ)^320   for 0 ≤ θ ≤ 1    (2.4)
The maximum likelihood estimate of θ for these data is θ̂ = y/n = 680/1000 = 0.68 or 68%, which can also easily be seen from the graph of the likelihood function (2.4) given in Figure 2.2. The interval suggested by the pollsters was 68 ± 3.1%, or [64.9, 71.1]. Looking at Figure 2.2 we see that the interval [0.649, 0.711] is a reasonable interval for the parameter θ since it seems to contain most of the values of θ with large values of the likelihood L(θ). We will return to the construction of such interval estimates in Chapter 4.
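As a quick numerical check, the likelihood in (2.4) can be evaluated on a grid of θ values and the maximizing value read off. The sketch below is in Python rather than the R used elsewhere in these notes, and the grid-search approach is only an illustration of the idea:

```python
import math

y, n = 680, 1000  # Nanos poll data

def log_likelihood(theta):
    # log of L(theta) in (2.4) with the constant binomial coefficient dropped
    return y * math.log(theta) + (n - y) * math.log(1 - theta)

# evaluate on a fine grid over (0, 1) and take the maximizing value
grid = [i / 10000 for i in range(1, 10000)]
theta_hat = max(grid, key=log_likelihood)
print(theta_hat)  # 0.68, agreeing with y/n
```

The grid spacing controls the accuracy of the answer; here it recovers θ̂ = y/n exactly because 0.68 is a grid point.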
[Figure 2.2: Graph of the likelihood function L(θ) in (2.4) for 0.64 ≤ θ ≤ 0.72]
The shape of the likelihood function and the value of θ at which it is maximized are not affected if L(θ) is multiplied by a constant. Indeed it is not the absolute value of the likelihood function that is important but the relative values at two different values of the parameter, e.g. L(θ1)/L(θ2). You might think of this ratio as how much more or less consistent the data are with the parameter θ1 versus θ2. The ratio L(θ1)/L(θ2) is also unaffected if L(θ) is multiplied by a constant. In view of this the likelihood may be defined as P(Y = y; θ) or as any constant multiple of it, so, for example, we could drop the term (n choose y) in (2.3) and define L(θ) = θ^y (1 − θ)^(n−y). This function and (2.3) are maximized by the same value θ = θ̂ = y/n and have the same shape. Indeed we might rescale the likelihood function by dividing through by its maximum value L(θ̂) so that the new function has a maximum value equal to one.
Sometimes it is easier to work with the log (log = ln) of the likelihood function:

    l(θ) = ln L(θ) = log L(θ)   for θ ∈ Ω
Figure 2.3: The functions L(θ) (upper graph) and l(θ) (lower graph) are both maximized at the same value θ = θ̂
Figure 2.3 displays the graph of a likelihood function L(θ), rescaled to have a maximum value of one at θ = θ̂, and the corresponding log likelihood function l(θ) = log L(θ) with a maximum value of log(1) = 0. We see that l(θ), the lower of the two curves, is a monotone function of L(θ) so that the two functions increase together and decrease together. Both functions have a maximum at the same value θ = θ̂.
Because functions are often (but not always!) maximized by setting their derivatives equal to zero, we can usually obtain θ̂ by solving the equation

    (d/dθ) l(θ) = 0

For example, from L(θ) = θ^y (1 − θ)^(n−y) we get l(θ) = y log θ + (n − y) log(1 − θ) and

    (d/dθ) l(θ) = y/θ − (n − y)/(1 − θ) = (y − nθ) / [θ(1 − θ)]   for 0 < θ < 1

Solving dl/dθ = 0 gives θ = y/n. The First Derivative Test can be used to verify that this corresponds to a maximum value, so the maximum likelihood estimate of θ is θ̂ = y/n. This derivation holds if y ≠ 0 and y ≠ n. See Problem 2 for the derivation of θ̂ if y = 0 or y = n.
Recall that if Y1, Y2, …, Yn are independent random variables then their joint probability function is the product of their individual probability functions. Thus, for a random sample y1, y2, …, yn from a Poisson(θ) model,

    L(θ) = ∏_{i=1}^{n} (θ^(yi) e^(−θ) / yi!)

or more simply (dropping the constant not involving θ)

    L(θ) = θ^(nȳ) e^(−nθ)   for θ ≥ 0

The log likelihood is

    l(θ) = n(ȳ log θ − θ)   for θ > 0

with derivative

    (d/dθ) l(θ) = n(ȳ/θ − 1) = (n/θ)(ȳ − θ)   for θ > 0

The First Derivative Test can be used to verify that the value θ = ȳ maximizes l(θ) and so θ̂ = ȳ is the maximum likelihood estimate of θ.
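A quick numerical check of this result, sketched in Python with hypothetical count data (not data from the notes):

```python
import math

ys = [2, 0, 3, 1, 4, 2, 1, 0, 2, 3]  # hypothetical Poisson counts
n, ybar = len(ys), sum(ys) / len(ys)

def l(theta):
    # Poisson log likelihood up to an additive constant: n(ybar log(theta) - theta)
    return n * (ybar * math.log(theta) - theta)

# l should be largest at theta = ybar among any candidate values
candidates = [0.5, 1.0, ybar, 2.5, 3.0]
best = max(candidates, key=l)
print(best == ybar)  # True
```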
    L(θ) = L1(θ) L2(θ)   for θ ∈ Ω
Example 2.2.3
In 2011, Harris/Decima (a research polling company) conducted a poll of the Canadian
adult population in which they asked respondents whether they agreed with the statement:
“University and college teachers earn too much”. In 2011, y2 = 540 people agreed with
the statement. In a previous poll conducted by Harris/Decima in 2010, y1 = 520 people
agreed with the same statement. If we assume that θ = the proportion of the Canadian adult population that agree with the statement is the same in both years, then θ may be estimated using the data from these two independent polls. The combined likelihood would be
    L(θ) = P(Y1 = y1, Y2 = y2; θ)
         = P(Y1 = y1; θ) P(Y2 = y2; θ)
         = (2000 choose 520) θ^520 (1 − θ)^1480 × (2000 choose 540) θ^540 (1 − θ)^1460
         = (2000 choose 520)(2000 choose 540) θ^1060 (1 − θ)^2940   for 0 ≤ θ ≤ 1

or, ignoring the constants with respect to θ, we have

    L(θ) = θ^1060 (1 − θ)^2940   for 0 ≤ θ ≤ 1
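Since the combined likelihood has the Binomial form θ^1060 (1 − θ)^2940, its maximum is at the pooled sample proportion. A short Python sketch of this calculation:

```python
# combined data from the two polls in Example 2.2.3
y_total = 520 + 540    # total who agreed
n_total = 2000 + 2000  # total respondents across the two polls

# L(theta) = theta^1060 (1 - theta)^2940, so theta_hat = 1060/4000
theta_hat = y_total / n_total
print(theta_hat)  # 0.265
```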
Sometimes the likelihood function for a given set of data can be constructed in more
than one way as the following example illustrates.
Example 2.2.4
Suppose that the random variable Y represents the number of persons infected with the human immunodeficiency virus (HIV) in a randomly selected group of n persons. We assume the data are reasonably modeled by Y ~ Binomial(n, θ) with probability function

    P(Y = y; θ) = f(y; θ) = (n choose y) θ^y (1 − θ)^(n−y)   for y = 0, 1, …, n

where θ represents the proportion of the population that are infected. In this case, if we select a random sample of n persons and test them for HIV, we have Y = Y and y = y as the observed number infected. Thus

    L(θ) = (n choose y) θ^y (1 − θ)^(n−y)   for 0 ≤ θ ≤ 1

or more simply

    L(θ) = θ^y (1 − θ)^(n−y)   for 0 ≤ θ ≤ 1    (2.5)

and again L(θ) is maximized by the value θ̂ = y/n.
For this random sample of n persons who are tested for HIV, we could also define the indicator random variables

    Yi = 1 if person i is infected, and Yi = 0 otherwise, for i = 1, 2, …, n

Since the Yi are independent with P(Yi = yi; θ) = θ^(yi) (1 − θ)^(1−yi), the likelihood function is

    L(θ) = ∏_{i=1}^{n} θ^(yi) (1 − θ)^(1−yi) = θ^y (1 − θ)^(n−y)   for 0 ≤ θ ≤ 1

where y = Σ_{i=1}^{n} yi. This is the same likelihood function as (2.5). The reason for this is that the random variable Σ_{i=1}^{n} Yi has a Binomial(n, θ) distribution.
    P(Y = y; θ) = f(y; θ) = ((θv)^y / y!) e^(−θv)   for y = 0, 1, …; θ ≥ 0    (2.6)

where θ is the average number of bacteria per milliliter of water.
There is an inexpensive test which can detect the presence (but not the number) of
bacteria in a water sample. In this case we do not observe Y , but rather the “presence”
indicator I(Y > 0), or
    Z = 1 if Y > 0, and Z = 0 if Y = 0
[Figure: graph of the log likelihood function l(θ) for the water-testing example, for 0.2 ≤ θ ≤ 1]
    vi (ml)             8    4    2    1
    number of samples  10   10   10   10
    number with zi = 1 10    8    7    3

This gives

    l(θ) = 10 log(1 − e^(−8θ)) + 8 log(1 − e^(−4θ)) + 7 log(1 − e^(−2θ)) + 3 log(1 − e^(−θ)) − 21θ   for θ ≥ 0
A few remarks about numerical methods are in order. Aside from a few simple models, it is not possible to maximize likelihood functions explicitly. However, software exists which implements powerful numerical methods which can easily maximize (or minimize) functions of one or more variables. Multi-purpose optimizers can be found in many software packages; in R the function nlm() is powerful and easy to use. In addition, statistical software packages contain special functions for fitting and analyzing a large number of statistical models. The R package MASS (which can be accessed by the command library(MASS)) has a function fitdistr that will fit many common models.
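For instance, the log likelihood from the water-testing example above can be maximized numerically. The sketch below uses a simple grid search in Python, standing in for a purpose-built optimizer such as R's nlm():

```python
import math

def log_lik(theta):
    # log likelihood for the water-testing example:
    # 10 log(1-e^{-8t}) + 8 log(1-e^{-4t}) + 7 log(1-e^{-2t}) + 3 log(1-e^{-t}) - 21t
    return (10 * math.log(1 - math.exp(-8 * theta))
            + 8 * math.log(1 - math.exp(-4 * theta))
            + 7 * math.log(1 - math.exp(-2 * theta))
            + 3 * math.log(1 - math.exp(-theta))
            - 21 * theta)

# crude but effective: evaluate on a fine grid over (0, 2)
grid = [i / 10000 for i in range(1, 20000)]
theta_hat = max(grid, key=log_lik)
print(round(theta_hat, 3))
```

A grid search is slow compared with the methods real optimizers use, but it makes the idea transparent: the maximizing θ is simply the grid point with the largest log likelihood.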
    L(θ) = L(θ; y) = P(Y = y; θ)   for θ ∈ Ω

    P(1.05 ≤ Y ≤ 1.15; θ) = ∫_{1.05}^{1.15} f(y; θ) dy ≈ (0.1) f(1.1; θ)
assuming the function f(y; θ) is reasonably smooth over the interval. More generally, suppose y1, y2, …, yn are the observations from a random sample from the distribution with probability density function f(y; θ) which have been rounded to the nearest Δ, which is assumed to be small. Then

    P(Y = y; θ) ≈ ∏_{i=1}^{n} Δ f(yi; θ) = Δ^n ∏_{i=1}^{n} f(yi; θ)

If we assume that the precision Δ does not depend on the unknown parameter θ, then the term Δ^n can be ignored. This argument leads us to adopt the following definition of the likelihood function for a random sample from a continuous distribution.
with derivative

    (d/dθ) l(θ) = n(−1/θ + ȳ/θ²) = (n/θ²)(ȳ − θ)

Now (d/dθ) l(θ) = 0 for θ = ȳ. The First Derivative Test can be used to verify that the value θ = ȳ maximizes l(θ) and so θ̂ = ȳ is the maximum likelihood estimate of θ.
Table 2.1
Summary of Maximum Likelihood Method for Named Distributions

Binomial(n, θ), 0 < θ < 1; data: y
    θ̂ = y/n,  θ̃ = Y/n
    R(θ) = (θ/θ̂)^y [(1 − θ)/(1 − θ̂)]^(n−y)

Poisson(θ), θ > 0; data: y1, y2, …, yn
    θ̂ = ȳ,  θ̃ = Ȳ
    R(θ) = (θ/θ̂)^(nθ̂) e^(n(θ̂ − θ))

Geometric(θ), 0 < θ < 1; data: y1, y2, …, yn
    θ̂ = 1/(1 + ȳ),  θ̃ = 1/(1 + Ȳ)
    R(θ) = (θ/θ̂)^n [(1 − θ)/(1 − θ̂)]^(nȳ)

Negative Binomial(k, θ), 0 < θ < 1; data: y1, y2, …, yn
    θ̂ = k/(k + ȳ),  θ̃ = k/(k + Ȳ)
    R(θ) = (θ/θ̂)^(nk) [(1 − θ)/(1 − θ̂)]^(nȳ)

Exponential(θ), θ > 0; data: y1, y2, …, yn
    θ̂ = ȳ,  θ̃ = Ȳ
    R(θ) = (θ̂/θ)^n e^(n(1 − θ̂/θ))
2.3. LIKELIHOOD FUNCTIONS FOR CONTINUOUS DISTRIBUTIONS 71
Since

    Σ_{i=1}^{n} (yi − ȳ) = Σ_{i=1}^{n} yi − Σ_{i=1}^{n} ȳ = Σ_{i=1}^{n} yi − nȳ = Σ_{i=1}^{n} yi − Σ_{i=1}^{n} yi = 0

and

    Σ_{i=1}^{n} (yi − μ)² = Σ_{i=1}^{n} (yi − ȳ + ȳ − μ)²
                         = Σ_{i=1}^{n} (yi − ȳ)² + 2(ȳ − μ) Σ_{i=1}^{n} (yi − ȳ) + n(ȳ − μ)²
                         = Σ_{i=1}^{n} (yi − ȳ)² + n(ȳ − μ)²

we can write

    L(μ, σ) = (1/σ^n) exp[−(1/(2σ²)) Σ_{i=1}^{n} (yi − ȳ)²] exp[−n(ȳ − μ)²/(2σ²)]

and

    l(θ) = l(μ, σ) = −n log σ − (1/(2σ²)) Σ_{i=1}^{n} (yi − ȳ)² − n(ȳ − μ)²/(2σ²)   for μ ∈ ℝ and σ > 0
To maximize l(μ, σ) with respect to both parameters μ and σ we solve⁶ the two equations⁷

    ∂l/∂μ = (n/σ²)(ȳ − μ) = 0   and   ∂l/∂σ = −(n/σ) + (1/σ³) Σ_{i=1}^{n} (yi − ȳ)² = 0
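Solving these equations gives μ̂ = ȳ and σ̂² = (1/n) Σ (yi − ȳ)². A numerical illustration in Python (the measurements are hypothetical) of the closed-form solutions, together with the decomposition used above:

```python
import math

ys = [4.1, 5.3, 6.0, 4.8, 5.5, 5.1, 4.6, 5.9]  # hypothetical measurements
n = len(ys)

mu_hat = sum(ys) / n  # solves dl/dmu = 0
sigma_hat = math.sqrt(sum((y - mu_hat) ** 2 for y in ys) / n)  # solves dl/dsigma = 0

# verify the decomposition sum((yi - mu)^2) = sum((yi - ybar)^2) + n(ybar - mu)^2
mu = 5.0  # an arbitrary value of mu
lhs = sum((y - mu) ** 2 for y in ys)
rhs = sum((y - mu_hat) ** 2 for y in ys) + n * (mu_hat - mu) ** 2
print(abs(lhs - rhs) < 1e-9)  # True
```

Note that σ̂ divides by n rather than n − 1, so it differs slightly from the usual sample standard deviation s.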
    P(Y1 = y1, Y2 = y2, Y3 = y3) = [n!/(y1! y2! y3!)] θ1^(y1) θ2^(y2) θ3^(y3)

with

    θ1 = θ²,  θ2 = 2θ(1 − θ),  θ3 = (1 − θ)²

so that

    P(Y1 = y1, Y2 = y2, Y3 = y3) = [n!/(y1! y2! y3!)] (θ²)^(y1) [2θ(1 − θ)]^(y2) [(1 − θ)²]^(y3)

The likelihood function is therefore

    L(θ) = [n!/(y1! y2! y3!)] (θ²)^(y1) [2θ(1 − θ)]^(y2) [(1 − θ)²]^(y3)
         = [n!/(y1! y2! y3!)] 2^(y2) θ^(2y1+y2) (1 − θ)^(y2+2y3)   for 0 ≤ θ ≤ 1

or more simply

    L(θ) = θ^(2y1+y2) (1 − θ)^(y2+2y3)   for 0 ≤ θ ≤ 1

with

    dl/dθ = (2y1 + y2)/θ − (y2 + 2y3)/(1 − θ)

and

    dl/dθ = 0 if θ = (2y1 + y2)/(2y1 + 2y2 + 2y3) = (2y1 + y2)/(2n)

so

    θ̂ = (2y1 + y2)/(2n)

is the maximum likelihood estimate of θ.
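The closed-form estimate can be checked against a direct numerical maximization. A Python sketch with hypothetical counts y1, y2, y3:

```python
import math

y1, y2, y3 = 30, 50, 20  # hypothetical counts with y1 + y2 + y3 = n
n = y1 + y2 + y3

def l(theta):
    # log of L(theta) = theta^(2 y1 + y2) (1 - theta)^(y2 + 2 y3)
    return (2 * y1 + y2) * math.log(theta) + (y2 + 2 * y3) * math.log(1 - theta)

theta_closed = (2 * y1 + y2) / (2 * n)  # closed-form maximum likelihood estimate
grid = [i / 10000 for i in range(1, 10000)]
theta_grid = max(grid, key=l)
print(theta_closed, theta_grid)  # both equal 0.55 for these counts
```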
Example 2.5.1
Suppose we want to estimate attributes associated with BMI for some population of individuals (for example, Canadian males age 21-35). If the distribution of BMI values in the population is well described by a Gaussian model, Y ~ G(μ, σ), then by estimating μ and σ we can estimate any attribute associated with the BMI distribution. For example:

(i) The mean BMI in the population corresponds to μ = E(Y) for the Gaussian distribution.

(ii) The median BMI in the population corresponds to the median of the Gaussian distribution, which equals μ since the Gaussian distribution is symmetric about its mean.

(iii) For the BMI population, the 0.1 (population) quantile is Q(0.1) = μ − 1.28σ. (To see this, note that P(Y ≤ μ − 1.28σ) = P(Z ≤ −1.28) = 0.1, where Z = (Y − μ)/σ has a G(0, 1) distribution.)

(iv) The fraction of the population with BMI over 35.0 is given by

    p = 1 − Φ((35.0 − μ)/σ)

where Φ is the G(0, 1) cumulative distribution function.
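These attributes are easy to compute once μ and σ are estimated. The Python sketch below uses hypothetical parameter values and the standard library's NormalDist:

```python
from statistics import NormalDist

mu, sigma = 27.0, 4.0  # hypothetical estimates for the BMI model Y ~ G(mu, sigma)
bmi = NormalDist(mu, sigma)

median = mu                    # the Gaussian median equals its mean
q10 = bmi.inv_cdf(0.1)         # 0.1 quantile, approximately mu - 1.28*sigma
p_over_35 = 1 - bmi.cdf(35.0)  # fraction of the population with BMI over 35.0

print(round(q10, 2), round(p_over_35, 4))
```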
2.6. CHECKING THE MODEL

Example 2.6.1 Rutherford and Geiger study of alpha-particles and the Poisson model
In 1910 the physicists Ernest Rutherford and Hans Geiger conducted an experiment in which they recorded the number of alpha particles emitted from a polonium source (as detected by a Geiger counter) during 2608 time intervals, each of length 1/8 minute. The number of particles j detected in the time interval and the frequency fj of that number of particles is given in Table 2.3.

We can see whether a Poisson model fits these data by comparing the observed frequencies with the expected frequencies calculated assuming a Poisson model. To calculate these expected frequencies we need to specify the mean θ of the Poisson model. We estimate θ using the sample mean for the data, which is

    θ̂ = (1/2608) Σ_{j=0}^{14} j fj = (1/2608)(10097) = 3.8715
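Under the fitted Poisson model the expected frequencies are ej = 2608 e^(−θ̂) θ̂^j / j!. A Python sketch of this calculation (the observed frequencies fj from Table 2.3 are not reproduced here):

```python
import math

theta_hat = 10097 / 2608  # sample mean of the Rutherford-Geiger data, 3.8715

def expected_freq(j, total=2608):
    # expected number of intervals with j particles under a Poisson(theta_hat) model
    return total * math.exp(-theta_hat) * theta_hat ** j / math.factorial(j)

e = [expected_freq(j) for j in range(15)]
print(round(sum(e), 2))  # close to 2608; the tail beyond j = 14 is negligible
```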
This comparison of observed and expected frequencies to check the fit of a model can also be used for data that have arisen from a continuous model. The following is an example.
    ej = 200 ∫_{a(j−1)}^{a(j)} (1/49.0275) e^(−y/49.0275) dy = 200 [e^(−a(j−1)/49.0275) − e^(−a(j)/49.0275)]
The expected frequencies are also given in Table 2.4. We notice that the observed and expected frequencies are not close in this case and therefore the Exponential model does not seem to be a good model for these data.
    Interval       Observed frequency fj    Expected frequency ej
    [0, 15)              21                       52.72
    [15, 30)             45                       38.82
    [30, 45)             50                       28.59
    [45, 60)             27                       21.05
    [60, 75)             21                       15.50
    [75, 90)              9                       11.42
    [90, 105)            12                        8.41
    [105, 120)            7                        6.19
    [120, ∞)              8                       17.30
    Total               200                      200

Table 2.4: Frequency table for brake pad data
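The expected frequencies in Table 2.4 follow directly from the formula for ej above. A Python sketch that reproduces them:

```python
import math

theta_hat = 49.0275  # sample mean lifetime, used as the Exponential estimate
edges = [0, 15, 30, 45, 60, 75, 90, 105, 120]

def expected(a, b):
    # 200 * P(a <= Y < b) for Y ~ Exponential(theta_hat)
    return 200 * (math.exp(-a / theta_hat) - math.exp(-b / theta_hat))

e = [expected(a, b) for a, b in zip(edges, edges[1:])]
e.append(200 * math.exp(-120 / theta_hat))  # final interval [120, infinity)
print([round(x, 2) for x in e])  # starts with 52.72, as in Table 2.4
```

The expected frequencies sum to 200 exactly because the interval probabilities telescope.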
The drawback of this method for continuous data is that the intervals must be selected and this adds a degree of arbitrariness to the method. The following graphical methods provide better techniques for checking the fit of the model for continuous data.
The first graphical method is to superimpose the probability density function of the proposed model on the relative frequency histogram of the data. Figure 2.5 gives the relative frequency histogram of the female BMI data with a superimposed Gaussian probability density function. Since the mean μ is unknown we estimate it using the sample mean ȳ = 26.9, and since the standard deviation σ is unknown we estimate it using the sample standard deviation s = 4.60.

Figure 2.6 gives the relative frequency histogram of the male BMI data with a superimposed Gaussian probability density function. Since the mean μ is unknown we estimate it using the sample mean ȳ = 27.08, and since the standard deviation σ is unknown we estimate it using the sample standard deviation s = 3.56. In both figures the relative frequency histograms are in reasonable agreement with the superimposed Gaussian probability density functions.
[Figure 2.5: Relative frequency histogram of female BMI data with Gaussian p.d.f.]
[Figure 2.6: Relative frequency histogram of male BMI data with Gaussian p.d.f.]
The drawback of this technique is that the intervals for the relative frequency histogram
must be chosen.
[Figure: relative frequency histogram with superimposed G(15.4, 7.4) probability density function]
A second graphical method which can be used to check the fit of a model is to plot the empirical cumulative distribution function F̂(y), which was defined in Chapter 1, and then superimpose on this a plot of the cumulative distribution function P(Y ≤ y; θ) = F(y; θ) for the proposed model. If the graphs of the two functions differ a great deal, this would suggest that the proposed model is a poor fit to the data. Systematic departures may also suggest a better model for the data.

Figure 2.9 is a graph of the empirical cumulative distribution function F̂(y) for the data in Figure 2.7 with an Exponential cumulative distribution function superimposed. The unknown mean θ is estimated using the sample mean ȳ = 15.4. In this case the agreement between the two curves is very poor. The disagreement between the curves suggests that the proposed Exponential model disagrees with the observed distribution in both tails of the distribution.
[Figures: graphs of empirical cumulative distribution functions for these data]
[Figure 2.10: Empirical c.d.f. of times between eruptions of Old Faithful and G(72.3, 13.9) c.d.f.]
The relative frequency histogram in Figure 2.11 seems to indicate that the distribution
of the times appears to have two modes. The plot of the empirical cumulative distribution
function does not show the shape of the distribution as clearly as the histogram.
[Figure 2.11: Relative frequency histogram for times between eruptions of Old Faithful and Gaussian p.d.f.]
[Figure 2.12: Empirical c.d.f. of female heights and G(1.62, 0.064) c.d.f.]
[Figure 2.13: Relative frequency histogram of female heights and Gaussian p.d.f.]
Figure 2.13 shows a relative frequency histogram for these data with the G(1.62, 0.0637) probability density function superimposed. The two types of plots give complementary but consistent pictures. An advantage of the distribution function comparison is that the exact heights in the sample are used, whereas in the histogram plot the data are grouped into intervals to form the histogram. However, the histogram and probability density function show the distribution of heights more clearly. Both graphs indicate that a Gaussian model seems reasonable for these data.
Since the Gaussian model is used frequently for modeling data, we look at one more graphical technique, called a (Gaussian) qqplot, for checking how well a Gaussian model fits a set of data. The idea behind this method is that we expect the empirical cumulative distribution function and the cumulative distribution function for a Gaussian random variable to agree if a Gaussian model is appropriate for the data, as we saw in Figure 2.12. Deciding if two curves are in agreement is usually more difficult than deciding if a set of points lie along a straight line. A qqplot is a graph for which the expected plot would reasonably be a straight line if the Gaussian model is a good fit.
Suppose for the moment that we want to check if a G(μ, σ) model fits the set of data {y1, y2, …, yn} where μ and σ are known. As usual we let {y(1), y(2), …, y(n)} represent the order statistic, or the data ordered from smallest to largest. Let Q(p) be the pth (theoretical) quantile for the G(μ, σ) distribution, that is, Q(p) satisfies P(Y ≤ Q(p)) = p where Y ~ G(μ, σ). Recall also that q(p) is the pth sample quantile defined in Chapter 1. If the Gaussian model is appropriate then for a reasonable size data set, we would expect Q(0.5) = median = μ to be close in value to the sample quantile q(0.5) = sample median, Q(0.25) to be close in value to the lower quartile q(0.25), Q(0.75) to be close in value to the upper quartile q(0.75), and so on. More generally we would expect Q(i/(n+1)) to be close in value to the sample quantile q(i/(n+1)) (see Definition 1) for i = 1, 2, …, n. (Note that we use i/(n+1) rather than i/n since Q(1) = ∞.) For a reasonably large data set we also have q(i/(n+1)) ≈ y(i), i = 1, 2, …, n. Therefore if the Gaussian model fits the data we expect Q(i/(n+1)) to be close in value to q(i/(n+1)), i = 1, 2, …, n. If we plot the points (Q(i/(n+1)), q(i/(n+1))), i = 1, 2, …, n, then we should see a set of points that lie reasonably along a straight line.
But what if μ and σ are unknown? Let QZ(p) be the pth quantile for the G(0, 1) distribution. We know that if Y ~ G(μ, σ) then (Y − μ)/σ ~ G(0, 1), and therefore Q(p) = μ + σ QZ(p). Therefore if we plot the points (QZ(i/(n+1)), q(i/(n+1))), i = 1, 2, …, n, we should still see a set of points that lie reasonably along a straight line if a Gaussian model is a reasonable model for the data. Such a plot is called a (Normal) qqplot. The advantage of a qqplot is that the unknown parameters μ and σ do not need to be estimated. Qqplots exist for other models but we only use Gaussian qqplots.
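The construction of the qqplot points can be sketched as follows, in Python using the standard library's NormalDist (the data values are hypothetical):

```python
from statistics import NormalDist

# hypothetical data; any real-valued sample is handled the same way
ys = sorted([1.52, 1.60, 1.65, 1.58, 1.71, 1.55, 1.63, 1.68, 1.61, 1.57])
n = len(ys)
z = NormalDist()  # the G(0, 1) distribution

# qqplot points (Q_Z(i/(n+1)), y_(i)) for i = 1, ..., n
points = [(z.inv_cdf(i / (n + 1)), ys[i - 1]) for i in range(1, n + 1)]
for q_theory, q_sample in points:
    print(round(q_theory, 3), q_sample)
```

Plotting these pairs and judging whether they lie along a straight line is exactly the qqplot check described above.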
Since reading qqplots requires some experience, it is a good idea to generate many plots where we know the correct answer. This can be done by generating data from a known distribution and then plotting a qqplot. See Chapter 2, Problems 20 and 21. A qqplot of 100 observations randomly generated from a G(−2, 3) distribution is given in Figure 2.14.
The theoretical quantiles are plotted on the horizontal axis and the empirical or sample quantiles are plotted on the vertical axis. The line in the qqplot is the line joining the lower and upper quartiles of the empirical and Gaussian distributions, that is, the line joining (QZ(0.25), q(0.25)) and (QZ(0.75), q(0.75)), where QZ(0.75) = 0.674.
[Figure 2.14: Qqplot of 100 observations randomly generated from a Gaussian distribution]
We do not expect the points to lie exactly along a straight line since the sample quantiles are based on the observed data, which in general will be different every time the experiment is conducted. We only expect Q(i/(n+1)) to be close in value to the sample quantile q(i/(n+1)) for a reasonably large data set.
[Figure 2.15: G(0, 1) probability density function divided into sixteen equal areas]
As well, the points at both ends of the line can be expected to lie further from the line since the quantiles of the Gaussian distribution change in value more rapidly in the tails of the distribution. To understand this consider Figure 2.15. The area under a G(0, 1) probability density function, which is equal to one, has been divided into sixteen areas all of the same size equal to 1/16. The theoretical quantiles Q(i/16), i = 1, 2, …, 15 can be read from the z axis. For example, Q(1/16) = −1.53 and Q(10/16) = 0.32. Since the area under the G(0, 1) probability density function is more concentrated near zero, the values of the quantiles increase more quickly in the tails of the distribution. In Figure 2.15 this is illustrated by the vertical lines being closer together near z = 0 and further apart for z < −1 and z > 1. This means we would not expect the sample quantiles in both tails to be as close to the theoretical quantiles as compared to what we observe in the center of the distribution.
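The quantiles in Figure 2.15 can be computed directly, here as a Python sketch using the standard library's NormalDist:

```python
from statistics import NormalDist

z = NormalDist()  # G(0, 1)
# the 15 quantiles that cut the G(0, 1) density into sixteen equal areas
q = [z.inv_cdf(i / 16) for i in range(1, 16)]

print(round(q[0], 2), round(q[9], 2))  # Q(1/16) and Q(10/16), as quoted in the text
gaps = [b - a for a, b in zip(q, q[1:])]
print(gaps[0] > gaps[6])  # the outermost gap is wider than the gap nearest zero
```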
A qqplot of the female heights is given in Figure 2.16. Overall the points lie reasonably
along a straight line with the points at both ends lying not as close to the line which is
what we expect. As was the case for the relative frequency histogram and the empirical
cumulative distribution function, the qqplot indicates that the Gaussian model is reasonable
for these data. Since the heights in meters are rounded to two decimal places there are
many repeated values in the dataset. The repeated values result in the qqplot looking like
a set of small steps.
[Figure 2.16: Qqplot of female heights]
[Figure 2.17: Qqplot of a random sample of 100 observations from an Exponential(1) distribution]
[Figure 2.18: Exponential(1) probability density function]
If we plot the theoretical quantiles of an Exponential(1) distribution versus the theoretical quantiles of a G(0, 1) distribution for n = 15 we obtain the U-shaped graph in Figure 2.19. Since we are using the theoretical quantiles for both distributions the points lie along a curve. For real data the qqplot would look similar to the plot in Figure 2.17. In general if a dataset has a relative frequency histogram with a long right tail then the qqplot will exhibit this U-shape behaviour. Such a qqplot suggests that a Gaussian model is not reasonable for the data and a model with a long right tail like the Exponential distribution would be more suitable.
[Figure 2.19: Theoretical quantiles of an Exponential(1) distribution versus G(0, 1) quantiles]
A qqplot of the lifetimes of brake pads (Example 1.3.4) is given in Figure 2.20. The
points form a U-shaped curve. This pattern is consistent with the long right tail and
positive skewness that we observed before. The Gaussian model is not a reasonable model
for these data.
[Figure 2.20: Qqplot of the lifetimes of brake pads]
A qqplot of 100 observations randomly generated from a Uniform(0, 1) distribution is given in Figure 2.21.
[Figure 2.21: Qqplot of 100 observations randomly generated from a Uniform(0, 1) distribution]
We notice that the points form an S-shape. This is typical of data which are best modeled by a Uniform distribution.

To understand why this happens, the area under a Uniform(0, 1) probability density function has been divided into sixteen areas all of the same size equal to 1/16 in Figure 2.22. The theoretical quantiles can be read from the x axis. The values of the quantiles increase uniformly.
[Figure 2.22: Uniform(0, 1) probability density function divided into sixteen equal areas]
If we plot the theoretical quantiles of a Uniform(0, 1) distribution versus the theoretical quantiles of a G(0, 1) distribution for n = 15 we obtain the S-shaped graph in Figure 2.23. Since we are using the theoretical quantiles for both distributions the points lie along a curve. For real data the qqplot would look similar to the plot in Figure 2.21. In general if a dataset has a relative frequency histogram which is quite symmetric and with short tails then the qqplot will exhibit this S-shape behaviour. Such a qqplot suggests that a Gaussian model is not reasonable for these data and a model such as the Uniform distribution would be more suitable.
[Figure 2.23: Theoretical quantiles of a Uniform(0,1) distribution versus G(0,1) quantiles]
A qqplot of the times between eruptions of Old Faithful is given in Figure 2.24. The
points do not lie along a straight line which indicates as we saw before that the Gaussian
model is not a reasonable model for these data. The two places at which the shape of
the points changes direction correspond to the two modes of these data that we observed
previously.
[Figure 2.24: Qqplot of the times between eruptions of Old Faithful (sample quantiles versus G(0,1) quantiles)]
(a) G(θ) = θ^a (1 − θ)^b; 0 ≤ θ ≤ 1

(b) G(θ) = θ^a e^(−b/θ); θ > 0

(c) G(θ) = θ^a e^(−bθ); θ ≥ 0

(d) G(θ) = e^(−a(θ−b)²); −∞ < θ < ∞

L(θ) = θ^y (1 − θ)^(n−y) for 0 ≤ θ ≤ 1
3. Consider the following two experiments whose purpose was to estimate θ, the fraction
of a large population with blood type B.
Experiment 1: Individuals were selected at random until 10 with blood type B were
found. The total number of people examined was 100.
Experiment 2: One hundred individuals were selected at random and it was found
that 10 of them have blood type B.
(a) Find the likelihood function for θ for each experiment and show that the like-
lihood functions are proportional. Show that the maximum likelihood estimate θ̂ is
the same in each case.
(b) Suppose n people came to a blood donor clinic. Assuming θ = 0.10, use the Normal
approximation to the Binomial distribution (remember to use a continuity
correction) to determine how large n should be to ensure that the probability of
getting 10 or more donors with blood type B is at least 0.90. Use the R function
pbinom to determine the exact value of n.
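As a sketch of how pbinom could be used for the exact calculation in part (b) (an illustration, not a supplied solution): since P(X ≥ 10) = 1 − P(X ≤ 9) for X ~ Binomial(n, 0.1), one can simply search for the smallest n that meets the requirement.

```r
# Smallest n with P(X >= 10) >= 0.90 when X ~ Binomial(n, 0.1)
n <- 10
while (1 - pbinom(9, n, 0.1) < 0.90) {
  n <- n + 1          # increase n until the requirement is met
}
n                      # exact answer; compare with the Normal approximation
```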
2.7. CHAPTER 2 PROBLEMS 91
4. Specimens of a high-impact plastic are tested by repeatedly striking them with a hammer
until they fracture. Let Y = the number of blows required to fracture a specimen.
If the specimen has a constant probability θ of surviving a blow, independently of the
number of previous blows received, then the probability function for Y is

f(y; θ) = P(Y = y; θ) = θ^(y−1) (1 − θ) for y = 1, 2, . . . and 0 ≤ θ < 1
(a) For observed data y1, y2, . . . , yn, find the likelihood function L(θ) and the maximum
likelihood estimate θ̂.
(b) Find the relative likelihood function R(θ). Plot R(θ) if n = 200 and Σ_{i=1}^{200} y_i = 400.
(c) Estimate the probability that a specimen fractures on the first blow using the
data in (b).
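For the given probability function, log L(θ) = (Σy_i − n) log θ + n log(1 − θ). A quick numerical check of the closed-form answer you obtain in part (a) might look like this (an illustration, not part of the original problem):

```r
# Numerical maximization of the log likelihood with n = 200 and sum(y_i) = 400
n <- 200
sum_y <- 400
loglik <- function(theta) (sum_y - n) * log(theta) + n * log(1 - theta)
opt <- optimize(loglik, interval = c(1e-6, 1 - 1e-6), maximum = TRUE)
opt$maximum            # should agree with your algebraic theta-hat
```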
P(Y = y; θ) = f(y; θ) = (θt)^y e^(−θt) / y!  for y = 0, 1, . . . and θ ≥ 0
(a) The numbers of transactions received in 10 separate one-minute intervals were
8, 3, 2, 4, 5, 3, 6, 5, 4, 1. Write down the likelihood function for θ and find the
maximum likelihood estimate θ̂.
(b) Estimate the probability that no transactions arrive during a two-minute interval
using the data in (a).
(c) Use the R function rpois with the value θ = 4.1 to simulate the number of
transactions received in 100 one-minute intervals. Calculate the sample mean
and sample variance. Are they approximately the same?
(Note that E(Y) = Var(Y) = θt for the Poisson model.)
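As a sketch (not part of the original problem): the ML estimate in part (a) is the sample mean of the ten counts, which is also the value θ = 4.1 used in part (c).

```r
y <- c(8, 3, 2, 4, 5, 3, 6, 5, 4, 1)   # the ten one-minute counts from part (a)
theta_hat <- mean(y)                   # ML estimate of the rate per minute
theta_hat                              # 4.1
exp(-2 * theta_hat)                    # estimated P(no transactions in two minutes)
sim <- rpois(100, theta_hat)           # part (c): simulate 100 one-minute counts
c(mean(sim), var(sim))                 # both should be near 4.1
```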
(a) Find the likelihood function L(θ) and the maximum likelihood estimate θ̂.

(b) Find the relative likelihood function R(θ).

(c) Plot R(θ) for n = 20 and Σ_{i=1}^{20} y_i² = 72.
92 2. STATISTICAL MODELS AND MAXIMUM LIKELIHOOD ESTIMATION
(a) If σ is known, find the likelihood function L(μ) and the maximum likelihood
estimate μ̂.

(b) If μ is known, find the likelihood function L(σ) and the maximum likelihood
estimate σ̂.
(a) Find the likelihood function L(θ) and the maximum likelihood estimate θ̂.

(b) Find the log relative likelihood function r(θ) = log R(θ). If n = 15 and
Σ_{i=1}^{15} log y_i = 34.5, then plot r(θ).
9. Suppose that in a population of twins, males (M) and females (F) are equally likely
to occur and that the probability that a pair of twins is identical is θ. If twins are
not identical, their sexes are independent.
10. When Wayne Gretzky played for the Edmonton Oilers (1979–88) he scored an incredible
1669 points in 696 games. The data are given in the frequency table below:
Number of points Observed number of
in a game: y games with y points: fy
0 69
1 155
2 171
3 143
4 79
5 57
6 14
7 6
8 2
9 0
Total 696
The Poisson(θ) model has been proposed for the random variable Y = number of
points Wayne scores in a game.

(a) Show that the likelihood function for θ based on the Poisson model and the data
in the frequency table simplifies to

L(θ) = θ^1669 e^(−696θ) for θ ≥ 0
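Given the simplified likelihood L(θ) = θ^1669 e^(−696θ), the maximizing value and a plot of the relative likelihood can be obtained numerically. This sketch is an illustration, not part of the notes:

```r
# theta-hat maximizes L(theta) = theta^1669 * exp(-696 * theta)
theta_hat <- 1669 / 696                 # about 2.4 points per game
log_R <- function(theta) {              # log relative likelihood r(theta)
  1669 * log(theta) - 696 * theta - (1669 * log(theta_hat) - 696 * theta_hat)
}
curve(exp(log_R(x)), from = 2.1, to = 2.7, xlab = "theta", ylab = "R(theta)")
```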
11. Here are the data for Sidney Crosby playing for the Pittsburgh Penguins in the years
2005-2016.
Number of points Observed number of
in a game: y games with y points: fy
0 219
1 259
2 185
3 90
4 24
5 4
6 2
7 0
Total 783
How well does the Poisson model fit these data?
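One simple check (an illustration, not a supplied solution) is to compare the observed frequencies with the expected frequencies n·f(y; θ̂) under a fitted Poisson model, with θ estimated by the sample mean points per game:

```r
y   <- 0:7
obs <- c(219, 259, 185, 90, 24, 4, 2, 0)   # Crosby's frequency table
n   <- sum(obs)                            # 783 games
theta_hat <- sum(y * obs) / n              # sample mean points per game
expected  <- n * dpois(y, theta_hat)       # expected counts under the Poisson model
round(rbind(observed = obs, expected = expected), 1)
```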
12. The following model has been proposed for the distribution of Y = the number of
children in a family, for a large population of families:

P(Y = 0; θ) = (1 − 2θ)/(1 − θ),  P(Y = y; θ) = θ^y for y = 1, 2, . . . and 0 ≤ θ ≤ 1/2   (2.7)
(a) What does the parameter θ represent?
(b) Suppose that n families are selected at random and the observed data were
(c) Consider a different type of sampling in which a single child is selected at random
and then the number of offspring in that child's family is determined. Let
X = the number of children in the family of a randomly chosen child. Assuming
that the model (2.7) holds, show that

P(X = x; θ) = c x θ^(x−1) for x = 1, 2, . . . and 0 ≤ θ < 1/2

where c = (1 − θ)².
x 1 2 3 4 >4 Total
fx 22 7 3 1 0 33
Find the probability of observing these data and thus determine the maximum
likelihood estimate of θ. Estimate the probability that a couple has no children using
these data.
(e) Suppose the sample in (d) was incorrectly assumed to have arisen from the
sampling plan in (b). What would θ̂ be found to be? This problem shows that
the way the data have been collected can affect the model.
13. Radioactive particles are emitted randomly over time from a source at an average rate
of θ per second. In n time periods of varying lengths t1, t2, . . . , tn (seconds), the numbers
of particles emitted (as determined by an automatic counter) were y1, y2, . . . , yn
respectively. Let Yi = the number of particles emitted in time interval i of length
ti, i = 1, 2, . . . , n. Suppose it is reasonable to assume that Yi has a Poisson(θti)
distribution, i = 1, 2, . . . , n, independently.
(a) Show that the likelihood function for θ based on the Poisson model and the data
(yi, ti), i = 1, 2, . . . , n, can be simplified to

L(θ) = θ^(nȳ) e^(−nθt̄) for θ ≥ 0

where t̄ = (1/n) Σ_{i=1}^{n} ti and ȳ = (1/n) Σ_{i=1}^{n} yi. Find the maximum likelihood estimate of θ.
(b) Suppose that the intervals are all of equal length (t1 = t2 = · · · = tn = t) and that
instead of knowing the yi's, we know only whether or not one or more
particles were emitted in each time interval of length t. Find the likelihood function
for θ based on these data, and determine the maximum likelihood estimate of θ.
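Whatever closed form you obtain in part (a), a small simulation makes it easy to check numerically. This sketch is an illustration; the true rate θ = 2 and the interval lengths are arbitrary choices, not values from the problem:

```r
# Simulated check of part (a): Y_i ~ Poisson(theta * t_i)
set.seed(1)
t <- c(5, 10, 4, 8, 12, 6)             # hypothetical interval lengths (seconds)
y <- rpois(length(t), 2 * t)           # counts generated with true rate theta = 2
loglik <- function(theta) sum(y * log(theta * t) - theta * t)  # up to a constant
opt <- optimize(loglik, c(1e-6, 10), maximum = TRUE)
opt$maximum        # numerical ML estimate; compare with your answer in (a)
```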
14. Run the following R code for checking the Gaussian model using numerical and graphical
summaries.
# Gaussian Data Example
# Note: truehist() requires the MASS package; skewness() and kurtosis() are
# not in base R -- the moments package is one possible source (an assumption)
library(MASS)
library(moments)
set.seed(456458)
yn<-rnorm(200,5,2) # 200 observations from G(5,2) distribution
c(mean(yn),sd(yn)) # display sample mean and standard deviation
skewness(yn) # sample skewness
kurtosis(yn) # sample kurtosis
fivenum(yn) # five number summary
IQR(yn) # IQR
#plot relative frequency histogram and superimpose Gaussian pdf
truehist(yn,main="Relative Frequency Histogram of Data")
curve(dnorm(x,mean(yn),sd(yn)),col="red",add=T,lwd=2)
#plot Empirical cdf’s and superimpose Gaussian cdf
plot(ecdf(yn),verticals=T,do.points=F,xlab="y",ylab="ecdf",main="")
title(main="Empirical and Gaussian C.D.F.’s")
curve(pnorm(x,mean(yn),sd(yn)),add=T,col="red",lwd=2)
#plot qqplot of the data
qqnorm(yn,xlab="Standard Normal Quantiles",main="Qqplot of Data")
qqline(yn,col="red",lwd=1.5) # add line for comparison
#
#
# Exponential Data Example
ye<-rexp(200,1/5) # 200 observations from Exponential(5) dist’n
c(mean(ye),sd(ye)) # display sample mean and standard deviation
skewness(ye) # sample skewness
kurtosis(ye) # sample kurtosis
fivenum(ye) # five number summary
IQR(ye) # IQR
#plot relative frequency histogram and superimpose Gaussian pdf
truehist(ye,main="Relative Frequency Histogram of Data")
curve(dnorm(x,mean(ye),sd(ye)),col="red",add=T,lwd=2)
#plot Empirical cdf’s and superimpose Gaussian cdf
plot(ecdf(ye),verticals=T,do.points=F,xlab="y",ylab="ecdf",main="")
title(main="Empirical and Gaussian C.D.F.’s")
curve(pnorm(x,mean(ye),sd(ye)),add=T,col="red",lwd=2)
#plot qqplot of the data
qqnorm(ye,xlab="Standard Normal Quantiles",main="Qqplot of Data")
qqline(ye,col="red") # add line for comparison in red
For both examples assume that you don’t know how the data were generated. Use the
numerical and graphical summaries obtained by running the R code to assess whether
it is reasonable to assume that the data have approximately a Gaussian distribution.
Support your conclusion with clear reasons written in complete sentences.
15. The marks out of 30 for 100 students on a tutorial test in STAT 231 were:

The data are available in the file tutorialtestdata.txt posted on the course website.
For these data

Σ_{i=1}^{100} y_i = 1913 and Σ_{i=1}^{100} y_i² = 38556
[Boxplot of the marks, and qqplot of the marks: sample quantiles versus N(0,1) quantiles]
(a) Determine the sample mean y and the sample standard deviation s for these
data.
(b) Determine the proportion of observations in the intervals [ȳ − s, ȳ + s] and
[ȳ − 2s, ȳ + 2s]. Compare these proportions with P(Y ∈ [μ − σ, μ + σ]) and
P(Y ∈ [μ − 2σ, μ + 2σ]) where Y ~ G(μ, σ).
(c) Find the sample skewness and sample kurtosis for these data. Are these values
close to what you would expect for Normally distributed data?
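The sample mean and standard deviation in part (a) can be recovered from the two sums given above, and pnorm gives the Gaussian benchmark probabilities needed for part (b). This sketch is an illustration of the arithmetic, not a supplied solution:

```r
n <- 100
sum_y  <- 1913
sum_y2 <- 38556
ybar <- sum_y / n                              # sample mean, 19.13
s    <- sqrt((sum_y2 - n * ybar^2) / (n - 1))  # sample standard deviation
c(ybar, s)
c(pnorm(1) - pnorm(-1), pnorm(2) - pnorm(-2))  # approx 0.683 and 0.954
```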
17. Consider the data on heights of adult males and females from Chapter 1. The data
are available in the file bmidata.txt posted on the course website.
(a) Assume that for each sex the heights in the population from which the samples
were drawn can be modeled by a Gaussian distribution. Obtain the maximum
likelihood estimates of the mean and standard deviation in each case.
(b) Give the maximum likelihood estimates for q (0:1) and q (0:9), the 10th and 90th
percentiles of the height distribution for males and for females.
(c) Give the maximum likelihood estimate for the probability P(Y > 1.83) for males
and females (i.e. the fraction of the population over 1.83 m, or 6 ft).
(d) A simpler estimate of P(Y > 1.83) that does not use the Gaussian model is the sample proportion of observations greater than 1.83.
18. The qqplot of the brake pad data in Figure 2.20 indicates that the Normal distribution
is not a reasonable model for these data. Sometimes transforming the data gives a
data set for which the Normal model is more reasonable. A log transformation is often
used. Plot a qqplot of the log lifetimes and indicate whether the Normal distribution
is a reasonable model for these data. The data are posted on the course website.
19. In a large population of males ages 40–50, the proportion who are regular smokers is α
where 0 ≤ α ≤ 1 and the proportion who have hypertension (high blood pressure)
is β where 0 ≤ β ≤ 1. If the events S (a person is a smoker) and H (a person has
hypertension) are independent, then for a man picked at random from the population
the probabilities he falls into the four categories SH, S̄H, SH̄, S̄H̄ are, respectively,
αβ, (1 − α)β, α(1 − β), (1 − α)(1 − β). Explain why this is true.
(a) Suppose that 100 men are selected and the numbers in each of the four categories
are as follows:
Category:   SH   S̄H   SH̄   S̄H̄
Frequency:  20   15   22   43
Assuming that S and H are independent events, determine the likelihood function
for α and β based on the Multinomial distribution, and find the maximum
likelihood estimates of α and β.
(b) Compute the expected frequencies for each of the four categories using the maximum
likelihood estimates. Do you think the model used is appropriate? Why
might it be inappropriate?
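The expected frequencies in part (b) can be checked numerically. This sketch (an illustration, not a supplied solution) maximizes the multinomial log likelihood directly rather than using the closed-form estimates you derive in (a):

```r
# Categories in order: SH, (not S)H, S(not H), (not S)(not H)
freq <- c(20, 15, 22, 43)
loglik <- function(par) {
  a <- plogis(par[1])                 # alpha = P(S), kept inside (0,1)
  b <- plogis(par[2])                 # beta  = P(H)
  p <- c(a * b, (1 - a) * b, a * (1 - b), (1 - a) * (1 - b))
  sum(freq * log(p))                  # multinomial log likelihood (up to a constant)
}
fit <- optim(c(0, 0), loglik, control = list(fnscale = -1))  # maximize
est <- plogis(fit$par)                # numerical ML estimates of alpha and beta
est
a <- est[1]; b <- est[2]
round(sum(freq) * c(a * b, (1 - a) * b, a * (1 - b), (1 - a) * (1 - b)), 1)
```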
22. A qqplot was generated for 100 values of a variate. See Figure 2.27. Based on this
qqplot, answer the following questions:
(a) What is the approximate value of the sample median of these data?
(b) What is the approximate value of the IQR of these data?
(c) Would the frequency histogram of these data be reasonably symmetric about
the sample mean?
(d) Would the frequency histogram for these data most resemble a Normal probability
density function, an Exponential probability density function, or a Uniform
probability density function?
[Figure 2.27: Qqplot of 100 values of a variate (sample quantiles versus N(0,1) quantiles)]
24. Challenge Problem: Censored lifetime data. Consider the Exponential distribution
as a model for the lifetimes of equipment. In experiments, it is often not
feasible to run the study long enough that all the pieces of equipment fail. For example,
suppose that n pieces of equipment are each tested for a maximum of c hours
(c is called a “censoring time”). The observed data are: k (where 0 ≤ k ≤ n) pieces
fail, at times y1, y2, . . . , yk, and n − k pieces are still working after time c.
θ̂ = (1/k) [ Σ_{i=1}^{k} y_i + (n − k)c ]
(c) What does part (b) give when k = 0? Explain this intuitively.
(d) A standard test for the reliability of electronic components is to subject them
to large fluctuations in temperature inside specially designed ovens. For one
particular type of component, 50 units were tested and k = 5 failed before
c = 400 hours, when the test was terminated, with Σ_{i=1}^{5} y_i = 450 hours. Find the
maximum likelihood estimate of θ.
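Plugging the part (d) numbers into the estimate θ̂ = (1/k)[Σy_i + (n − k)c] displayed earlier in the problem gives a one-line arithmetic check (an illustration, not a supplied solution):

```r
k <- 5; n <- 50
cens  <- 400      # censoring time c, in hours
sum_y <- 450      # total of the k observed failure times
theta_hat <- (sum_y + (n - k) * cens) / k
theta_hat         # 3690 hours
```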
(b) For observed k, n and y, find the value N̂ that maximizes the probability in
part (a). Does this ever differ much from the intuitive estimate Ñ = kn/y?
(Hint: The likelihood L(N) depends on the discrete parameter N, and a good
way to find where L(N) is maximized over {1, 2, 3, . . .} is to examine the ratios
L(N + 1)/L(N).)
(c) When might the model in part (a) be unsatisfactory?
26. Challenge Problem: Poisson model with a covariate Let Y represent the
number of claims in a given year for a single general insurance policy holder. Each
policy holder has a numerical “risk score” x assigned by the company, based on
available information. The risk score may be used as an explanatory variate when
modeling the distribution of Y, and it has been found that models of the form

P(Y = y | x) = [λ(x)]^y e^(−λ(x)) / y!  for y = 0, 1, . . .
(a) Suppose that n randomly chosen policy holders with risk scores x1, x2, . . . , xn
had y1, y2, . . . , yn claims, respectively, in a given year. Determine the likelihood
function for the unknown parameters based on these data.

(b) Can the maximum likelihood estimates be found explicitly?
3. PLANNING AND
CONDUCTING EMPIRICAL
STUDIES
Problem: a clear statement of the study’s objectives, usually involving one or more
questions
Plan: the procedures used to carry out the study including how the data will be
collected

Data: the data collected according to the Plan

Analysis: the analysis of the data collected in light of the Problem and the Plan

Conclusion: the conclusions that are drawn about the Problem and their limitations
We will use this set of steps, which we will refer to as PPDAC, to discuss the important
ideas which must be considered when planning an empirical study. These steps, which are
designed to emphasize the statistical aspects of empirical studies, are described in more
detail in Section 3.2.
PPDAC can be used in two ways: first, to actively formulate, plan and carry out investigations,
and second, as a framework to critically scrutinize reported empirical investigations.
These reports include articles in the popular press, scienti…c papers, government policy
statements, and various business reports. If you see the phrase “evidence based decision”
or “evidence based management”, look for an empirical study. In this course we will use
PPDAC most often to critically assess empirical studies reported in the media.
The following example will be used in the next section to show how PPDAC can be
used to describe and critically examine how the empirical study was conducted.
drinking. Those in the combined group did both. They were reminded of their answers via email
during the one month course of the study and asked to continue practising this mental simulation.
All groups completed an online survey at various points, indicating how much they had drunk
the previous week. Over the course of one month, Dr Conroy found that students who imagined
positive outcomes of non-drinking reduced their weekly alcohol consumption from 20 units to 14
units on average. Similarly, students who imagined required strategies for non-drinking reduced the
frequency of binge drinking episodes – classified as six or more units in one session for women, and
eight or more units for men – from 1.05 episodes a week to 0.73 episodes a week on average.
Interestingly, the research indicates that perceptions of non-drinkers were also more favourable
after taking part in the study. Dr Conroy says this could not be directly linked to the intervention
but was an interesting additional feature of the study. He says: “Studies have suggested that holding
negative views of non-drinkers may be closely linked to personal drinking behaviour and we were
interested to see in the current study that these views may have improved as a result of taking
part in a non-drinking exercise. “I think this shows that health campaigns need to be targeted
and easy to fit into daily life but also help support people to accomplish changes in behaviour that
might sometimes involve ‘going against the grain’, such as periodically not drinking even when in
the company of other people who are drinking.”
Problem
The Problem step describes what the experimenters are trying to learn or what questions
they want to answer. Often this can be done using questions starting with “What”.
To what group of things or people do the experimenters want the conclusions to apply?
Types of problems
Three common types of statistical problems that are encountered are described below.
involves problems of this type. For example, the government needs to know the
national unemployment rate and whether it has increased or decreased over the past
month.
“Does taking a low dose of aspirin reduce the risk of heart disease among men over the
age of 50?”
“Does changing from assignments to multiple term tests improve student learning in
STAT 231?”
“Does second-hand smoke from parents cause asthma in their children?”
“Does compulsory driver training reduce the incidence of accidents among new drivers?”
Predictive: The problem is to predict a future value for a variate of a unit to be selected
from the process or population. This is often the case in finance or in economics. For
example, financial institutions need to predict the price of a stock or interest rates in
a week or a month because this affects the value of their investments.
Definition 15 The target population or target process is the collection of units to which
the experimenters conducting the empirical study wish the conclusions to apply.
3.2. THE STEPS OF PPDAC 109
In the drinking study the units are university students and the target population consists
of English university students aged 18–25 in the United Kingdom at the time of the
study. Note that “all university students aged 18–25 in the world” would not be a
suitable target population since it would not make much sense to include countries in which
the consumption of alcohol is not allowed. A target population of “all English university
students aged 18–25” with no time mentioned is also not a suitable target population for
this study since we might expect the drinking behaviour of university students to change
over time.
In Chapter 1 we considered a survey of Ontario residents aged 14–20 in a specific
week to learn about their smoking behaviour. In this study the units are young adults and
the target population is all young adults aged 14–20 living in Ontario at the time of the
survey. Since smoking behaviour varies from province to province and year to year, the
target population of young adults aged 14–20 in Ontario at the time of the study is the
best choice.
In Chapter 1 we considered the comparison of two can-filling machines used by a manufacturer
with respect to the volume of liquid in the filled cans. The units are the individual
cans. The target process is all cans which could be filled by the manufacturer using the
two machines, now and into the future under current operating conditions. Note that in
defining the target process the expression “under current operating conditions” has not
been well defined.
The values of the variates change from unit to unit in the population/process. There
are usually many variates associated with each unit.
In the drinking study the most important variates are the weekly alcohol consumption
measured over the course of a month, and which mental exercise the student was assigned
to. Other variates which were collected are the age of the student and the sex of the student.
In the smoking survey, whether or not each young adult (unit) in the target population
smokes is the variate of primary interest. Other variates of interest defined for each unit
might be age and sex. In the can-filling example, the volume of liquid in each can (unit) is
a variate. Whether the old machine or the new machine was used to fill the can is another
variate.
The questions of interest in the Problem are specified in terms of attributes of the target
population/process. In the university student drinking study the mean (average) alcohol
consumption for the different mental exercise groups is the most important attribute. In
the smoking example, one important attribute is the proportion of young adults in the
target population who smoke. In the can-filling example, the attributes of interest were the
mean (average) volume and the variability (standard deviation) of the volumes for all
cans filled by each machine under current conditions. Possible questions of interest (among
others) are:
“Is there a difference in the mean alcohol consumption between the four different mental
exercise groups?”
“What proportion of young adults aged 14–20 in Ontario smoke?”
“Is the standard deviation of volumes of cans filled by the new machine less than that
of the old machine?”
We can also ask questions about graphical attributes of the target population such
as the population histogram, population cumulative distribution function, or a
scatterplot of one variate versus another over the target population.
It is very important that the Problem step contain clear questions about one or more
attributes of the target population.
Plan
The Plan step depends on the questions posed in the Problem step. The Plan step includes
a description of the population or process of units from which units will be selected, what
variates will be collected for the units selected, and how the variates will be measured.
In most cases, the attributes of interest for the target population/process cannot be
estimated directly since only units from a subset of the target population/process, or
even units from a different population altogether, can be considered for study.
This may be due to lack of resources and time, as in the smoking survey, in which it would be
very costly and nearly impossible to create a list of all young adults aged 14–20 in Ontario
at the time of the study. It may also be a physical impossibility, such as in the development
of a new product where the manufacturer may wish to make conclusions about a production
process in the future but only units produced in a pilot process can be examined. It may
also be unethical, such as in a clinical trial of a new treatment whose side effects for humans
are unknown and which could be life threatening, and therefore only laboratory animals
such as mice can be used.
Definition 18 The study population or study process is the collection of units available to
be included in the study.
The study population is often but not always a subset of the target population. In many
surveys, the study population is a list of people de…ned by their telephone number. The
sample is selected by calling a subset of the telephone numbers. The study population is
a subset of the target population which excludes those people without telephones or with
unlisted numbers. In the clinical trial example the study population only consists of the
laboratory animals that are available for the study which is not a subset of any target
population of humans. In the development of new products example, the units in the pilot
process are not a subset of the target process which are the units produced in the future.
3.2. THE STEPS OF PPDAC 111
The news item for the drinking study does not indicate how the students in the study
were recruited. To determine this information we need to check the research journal article.
The more detailed article in the British Journal of Health Psychology indicated that
administrators at 80 academic departments across 45 English universities were asked to forward
a pre-prepared recruitment message to their students containing a URL to an online
survey. No reason was given for choosing only these universities. Note that there are over
100 English universities in the United Kingdom as well as other universities in Scotland
and Wales (also part of the United Kingdom) which English students could attend. The
study population is therefore English university students aged 18–25 at the time of the
study at these 45 English universities, which is a subset of the target population.
In the smoking survey, it would be difficult to create a list of all young adults aged
14–20 living in Ontario at the time of the survey. Since schools must keep a list of
students attending their school as well as student contact information, the researchers may
decide to choose a study population of all young adults aged 14–20 living in Ontario at
the time of the survey who are attending school. The study population is a subset of the
target population.
In the can-filling study a possible study process is all cans which are available at the
time of the study and could possibly be filled by the manufacturer using the two machines
under current operating conditions. In this case the study process is a subset of the target
process.
The study population/process is nearly always different from the target population/process
since there are always restrictions on the units which are available to be studied.
Definition 19 If the attributes in the study population/process differ from the attributes
in the target population/process then the difference is called study error.
Study error cannot be quantified since the values of the target population/process attributes
and the study population/process attributes are unknown. (If these attributes
were known then an empirical study would not be necessary!) Context experts would
need to be consulted, for example, in order to decide whether or not it is reasonable to
assume that conclusions from an investigation using mice are relevant to the human target
population. The statistician’s role is to warn the context experts of the possibility of
such error, especially when the study population/process is very different from the target
population/process.
In the drinking study, the study population only included English students at the 45
English universities contacted. If the mean alcohol consumption under the various mental
exercises at these universities was systematically different from the mean alcohol consumption
under the various mental exercises for students in the target population, then this difference
would be study error.
Suppose in the smoking survey that young adults aged 14–20 living in Ontario at
the time of the survey who are attending school were less likely to smoke (people with
more education tend to smoke less). In this case the proportion of smokers in the target
population would be different from the proportion of smokers in the study population and
this difference would be study error.
Definition 20 The sampling protocol is the procedure used to select a sample of units from
the study population/process. The number of units sampled is called the sample size.
In Chapter 2, we discussed modeling the data and often claimed that we had a “random
sample” so that our model was simple. In practice, it is exceedingly difficult and expensive
to select a random sample of units from the study population, and so other less rigorous
methods are used. Often researchers “take what they can get”.
Sample size is usually driven by cost or availability. In Section 4.4 we will see how to
use the Binomial model to determine sample sizes, and in Section 4.6 we will see how to
use the Gaussian model to determine sample sizes.
In the drinking study, the sampling protocol involved asking administrators at 80 academic
departments across 45 English universities to forward a pre-prepared recruitment
message to their students containing a URL to an online survey. Departments could decide
whether or not to forward the message to their students, and students who received the message
could decide whether or not to take part in the study. The sample size, as reported in
the news item, was 211. Although not indicated in the news item, the journal article indicates
that students who agreed to participate were randomly assigned by the researchers to
one of the four mental health exercises (imagining positive outcomes of non-drinking during
a social occasion; imagining strategies required to successfully not drink during a social
occasion; imagining both positive outcomes and required strategies; or completing a drinks
diary task). The importance of randomization in making a cause and effect conclusion is
discussed in Chapter 8. The students were then asked to report their alcohol consumption
in units in the week before they completed the various online surveys over a period of one
month.
Definition 21 If the attributes in the sample differ from the attributes in the study population/process
then the difference is called sample error.
Sample error cannot be quantified since the values of the study population/process
attributes are unknown. Different random sampling protocols can produce different sample
errors. We will see in Chapter 4 how models can be used to get an idea of how large this
error might be.
In the university student drinking study, not all academic departments forwarded the
recruitment message (only 23 according to the journal article). Suppose only departments
who thought students at their university had drinking issues forwarded the message, and
then only students who were heavy drinkers chose to participate in the study. If the
mean alcohol consumption under the various mental exercises for students who received the
recruitment message and decided to participate was systematically higher than the mean
alcohol consumption under the various mental exercises for students in the study population
then this difference is sample error. Sample error should be suspected in all surveys in
which the participants are volunteers.
The experimenters must decide which variates are going to be measured or determined
for the units in the sample. For any attributes of interest, as de…ned in the Problem
step, the corresponding variates must certainly be measured. Other variates which may
aid the analysis may also need to be measured. In the smoking survey, experimenters
must determine whether each young adult in the sample smokes or not (this requires a
careful definition). They may also determine other demographic variates such as age and
sex so that they can compare the smoking rate across age groups, sex, etc. In experimental
studies, the experimenters assign the value of a variate they are controlling to each unit in
the sample. For example, in a clinical trial, sampled units can be assigned to the treatment
group or the placebo group by the experimenters.
When the value of a variate is determined for a given unit, errors are often introduced
by the measurement system which determines the value.
Definition 22 If the measured value and the true value of a variate are not identical the difference is called measurement error.
Measurement errors are unknown since the true value of the variate is unknown. (If we
knew the true value we would not need to measure it!) In practice, experimenters try to
ensure that the processes used to take the measurements, referred to as the measurement
systems, do not contribute substantial error to the conclusions. They may have to study
the measurement systems which are used in separate studies to ensure that this is true.
See, for example, the case study in Section 3.3.
One variate which was determined for each unit (student) in the drinking study was which mental exercise group the student was assigned to. If the actual group assignment was recorded incorrectly then this is measurement error. The students were also asked to report their daily alcohol consumption in the week before they completed the online surveys. The journal article indicated that students measured their daily alcohol consumption in UK units (an alcohol unit in the United Kingdom is defined as 10 milliliters of pure ethyl alcohol) with the help of a visual aid and that the online surveys occurred at the beginning of the study, at two weeks, and at four weeks. Note that a different variate would be associated with each time that the student reported their alcohol consumption. These variates were self-reported by each student. If a student does not accurately report their alcohol consumption then this is measurement error. Measurement error should always be suspected when variates are measured by self-reporting.
and a uniformed police officer is sent to each address to interview an adult resident. Is there a possible bias in this study? It is likely that those who are strong supporters of the police are quite happy to respond, but those with misgivings about the police will either change some of their responses to favour the police or not respond at all. This type of bias is called response bias. When those that do respond have somewhat different characteristics than the population at large, the quality of the data is threatened, especially when the response rate (the proportion who do respond to the survey) is low. For example, in Canada in 2011, the long form of the Canadian Census (response rate around 98%) was replaced by the National Household Survey (a voluntary version with similar questions, response rate around 68%) and there was considerable discussion [9] of the resulting response bias. See for example the CBC story "Census Mourned on World Statistics Day" [10].
The figure below shows the steps in the Plan and the sources of error:

Target Population/Process
  ↓ (study error)
Study Population/Process
  ↓ (sample error)
Sample
  ↓ (measurement error)
Measured variate values
A person using PPDAC for an empirical study should, by the end of the Plan step, have
a good understanding of the study population/process, the sampling protocol, the variates
which are to be measured, and the quality of the measurement systems that are intended
for use.
In this course you will most often use PPDAC to critically examine a study done by
someone else. You should examine each step in the Plan (you may have to ask to see the
Plan since many reports omit it) for strengths and weaknesses. You must also pay attention
to the various types of error that may occur and how they might impact the conclusions.
Data
The goal of the Data step is to collect the data according to the Plan. Any deviations from
the Plan should be noted. The data must be stored in a way that facilitates the Analysis.
[9] http://www.youtube.com/watch?v=0A7ojjsmSsY
[10] http://www.cbc.ca/news/technology/story/2010/10/20/long-form-census-world-statistics-day.html
The previous sections noted the need to define variates clearly and to have satisfactory methods of measuring them. It is difficult to discuss the Data step except in the context of specific examples, but we mention a few relevant points.
Mistakes can occur in recording or entering data into a data base. For complex investigations, it is useful to put checks in place to avoid these mistakes. For example, if a field is missed, the data base should prompt the data entry person to complete the record if possible.
In many studies the units must be tracked and measured over a long period of time
(e.g. consider a study examining the ability of aspirin to reduce strokes in which
persons are followed for 3 to 5 years). This requires careful management.
When data are recorded over time or in different locations, the time and place for
each measurement should be recorded.
There may be departures from the study Plan that arise over time (e.g. persons may
drop out of a long term medical study because of adverse reactions to a treatment; it
may take longer than anticipated to collect the data so the number of units sampled
must be reduced). Departures from the Plan should be recorded since they may have
an important impact on the Analysis and Conclusion.
In some studies the amount of data may be extremely large, so data base design and
management is important.
Analysis
The Analysis step includes both simple and complex calculations to process the data into
information. Numerical and graphical methods such as those discussed in Chapter 1, as
well as others, are used in this step to summarize the data.
A key component of the Analysis step is the selection of an appropriate model that describes the data and how the data were collected. As indicated in Chapter 1, variates can be of different types: continuous, discrete, categorical, ordinal, and complex. It is important to identify the types of variates collected in a study since this helps in selecting appropriate models. In the Problem step, the problems of interest were stated in terms of the attributes of interest. These attributes need to be described in terms of the parameters and properties of the model. It is also very important to check whether a proposed model is appropriate. Some methods for checking the fit of a model were discussed in Chapter 2. Other methods will be discussed in Chapter 7.
It is difficult to describe this step in more detail except in the context of specific examples. You will see many examples of formal analyses in the following chapters.
In the drinking study the researchers conducted a formal analysis to test for differences between the mean alcohol consumption for the four groups across the three time points. This type of analysis is beyond the scope of this course. However in Chapter 6 we will see how to test for a difference between means when the data consist of two independent groups and the data are assumed to arise from different Gaussian distributions.
Conclusions
The purpose of the Conclusion step is to answer the questions posed in the Problem. An
attempt should be made to quantify (or at least discuss) potential errors as described in
the Plan step. Limitations to the conclusions should be discussed. The conclusions for the
drinking study are given below.
Here is a PPDAC for the drinking study based on the published information:
Problem The problem was to study the differences in mean alcohol consumption if different mental exercises related to non-drinking were used. The target population was English university students aged 18 to 25 in the United Kingdom at the time of the study. The problem is causative since the researchers wanted to study the effect of the different mental exercises on mean alcohol consumption.
Plan The study population was English university students aged 18 to 25 at the
time of the study at 45 English universities. The sampling protocol involved asking
administrators at 80 academic departments across 45 English universities to forward
a pre-prepared recruitment message to their students containing a URL to an online
survey. Departments could decide whether or not to forward the message to their
students and students who received the message could decide whether or not to take
part in the study. The sample size was 211. Students who agreed to participate
were randomly assigned by the researchers to one of the four mental health exercises
(imagining positive outcomes of non-drinking during a social occasion; imagining
strategies required to successfully not drink during a social occasion; imagining both
positive outcomes and required strategies; or completing a drinks diary task). The
age and sex of each student was also recorded. At the beginning of the study, at two
weeks, and at four weeks the students self-reported, using an online survey, how much
alcohol they had consumed in the previous week in UK units using a visual aid.
Data The data included which mental exercise group the student was assigned to,
their age, their sex, and self-reported information about their alcohol consumption at
three di¤erent time points.
Analysis The researchers conducted a formal analysis to test for differences between the mean alcohol consumption for the four groups across the three time points.
Conclusion In the drinking study the researchers concluded that completing mental exercises relating to non-drinking was more effective in encouraging safer drinking behaviour than completing a drinks diary alone. The researchers should have indicated that the conclusion only applies to students in the study population, not students in the target population and certainly not students in other countries. This is an experimental study since the researchers determined group assignment for each student by randomization. There are several serious drawbacks in this study. Students were not recruited from all English universities. This could lead to study error. Also not all contacted departments forwarded the recruitment message and participants were volunteers. Both of these issues could lead to sample error. Alcohol consumption was self-reported which could lead to measurement error.
Background
An automatic in-line gauge measures the diameter of a crankshaft journal on 100% of
the 500 parts produced per shift. The measurement system does not involve an operator
directly except for calibration and maintenance. Figure 3.1 shows the diameter in question.
The journal is a "cylindrical" part of the crankshaft. The diameter of the journal must be defined since the cross-section of the journal is not perfectly round and there may be taper along the axis of the cylinder. The gauge measures the maximum diameter as the crankshaft is rotated at a fixed distance from the end of the cylinder.

The specification for the diameter is $-10$ to $+10$ units with a target of 0. The measurements are re-scaled automatically by the gauge to make it easier to see deviations from the target. If the measured diameter is less than $-10$, the crankshaft is scrapped and a cost is incurred. If the diameter exceeds $+10$, the crankshaft can be reworked, again at considerable cost. Otherwise, the crankshaft is judged acceptable.
Overall Project
A project is planned by a crankshaft manufacturer to reduce scrap/rework by reducing part-to-part variation in the diameter. A first step involves an investigation of the measurement system itself. There is some speculation that the measurement system contributes
substantially to the overall process variation and that bias in the measurement system is
resulting in the scrapping and reworking of good parts. To decide if the measurement
system is making a substantial contribution to the overall process variability, we also need
a measure of this attribute for the current and future population of crankshafts. Since
there are three different attributes of interest, it is convenient to split the project into three
separate applications of PPDAC.
Study 1
In this application of PPDAC, we estimate the properties of the errors produced by the
measurement system. In terms of the model, we will estimate the bias and variability due
to the measurement system. We hope that these estimates can be used to predict the future
performance of the system.
Problem
The target process is all future measurements made by the gauge on crankshafts to be
produced by the manufacturer. The response variate is the measured diameter associated
with each unit. The attributes of interest are the average measurement error and the
population standard deviation of these errors. We can quantify these concepts using a
model (see below). A detailed fishbone diagram for the measurement system is also shown
in Figure 3.2. In such a diagram, we list explanatory variates organized by the major
“bones” that might be responsible for variation in the response variate, here the measured
journal diameter. We can use the diagram in formulating the Plan.
Note that the measurement system includes the gauge itself, the way the part is loaded
into the gauge, who loads the part, the calibration procedure (every two hours, a master part is put through the gauge and adjustments are made based on the measured diameter of the master part; that is, "the gauge is zeroed"), and so on.
[Figure 3.2: Fishbone diagram for the measured journal diameter. Major bones: Gauge (condition, wear on head), Journal (actual size, position of part, dirt, out-of-round), Calibration (frequency), Operator (training), and Environment.]
Plan
To determine the properties of the measurement errors we must measure crankshafts with known diameters. "Known" implies that the diameters were measured by an off-line measurement system that is very reliable. For any measurement system study in which bias is an issue, there must be a reference measurement system which is known to have negligible bias and variability which is much smaller than the system under study.
There are many issues in establishing a study process or a study population. For convenience, we want to conduct the study quickly using only a few parts. However, this restriction may lead to study error if the bias and variability of the measurement system change as other explanatory variates change over time or parts. We guard against this latter possibility by using three crankshafts with known diameters as part of the definition of the study process. Since the units are the taking of measurements, we define the study population as all measurements that can be taken in one day on the three selected crankshafts. These crankshafts were selected so that the known diameters were spread out over the range of diameters normally seen. This will allow us to see if the attributes of the system depend on the size of the diameter being measured. The known diameters which were used were $-10$, $0$, and $+10$. Remember the diameters have been rescaled, so a diameter of $-10$ is still acceptable.
No other explanatory variates were measured. To define the sampling protocol, it
was proposed to measure the three crankshafts ten times each in a random order. Each
measurement involved the loading of the crankshaft into the gauge. Note that this was to
be done quickly to avoid delay of production of the crankshafts. The whole procedure took
Data
The repeated measurements on the three crankshafts are shown below. Note that due to
poor explanation of the sampling protocol, the operator measured each part ten times in
a row and did not use a random ordering. (Unfortunately non-adherence to the sampling
protocol often happens when real data are collected and it is important to consider the
effects of this in the Analysis and Conclusion steps.)
Analysis
We model the measurements as
$$Y_{ij} = \mu_i + R_{ij}, \quad R_{ij} \sim G(0, \sigma_m) \text{ independently} \tag{3.1}$$
where $i = 1, 2, 3$ indexes the three crankshafts and $j = 1, 2, \ldots, 10$ indexes the ten repeated measurements. The parameter $\mu_i$ represents the long term average measurement for crankshaft $i$. The random variables $R_{ij}$ (called the residuals) represent the variability of the measurement system, while $\sigma_m$ quantifies this variability. Note that we have assumed, for simplicity, that the variability $\sigma_m$ is the same for all three crankshafts in the study.
We can rewrite the model in terms of the random variables $Y_{ij}$ so that $Y_{ij} \sim G(\mu_i, \sigma_m)$. Now we can write the likelihood as in Example 2.3.2 and maximize it with respect to the four parameters $\mu_1$, $\mu_2$, $\mu_3$, and $\sigma_m$ (the trick is to solve $\partial \ell / \partial \mu_i = 0$, $i = 1, 2, 3$ first). Not surprisingly the maximum likelihood estimates for $\mu_1$, $\mu_2$, $\mu_3$ are the sample averages for each crankshaft so that
$$\hat{\mu}_i = \bar{y}_i = \frac{1}{10} \sum_{j=1}^{10} y_{ij} \quad \text{for } i = 1, 2, 3$$
To examine the assumption that $\sigma_m$ is the same for all three crankshafts we can calculate the sample standard deviation for each of the three crankshafts. Let
$$s_i = \sqrt{\frac{1}{9} \sum_{j=1}^{10} (y_{ij} - \bar{y}_i)^2} \quad \text{for } i = 1, 2, 3$$
             $\bar{y}_i$   $s_i$
Crankshaft 1   $-10.3$     1.49
Crankshaft 2   $0.6$       1.17
Crankshaft 3   $10.3$      1.42
The estimate of the bias for crankshaft 1 is the difference between the observed average $\bar{y}_1$ and the known diameter value, which is equal to $-10$ for crankshaft 1; that is, the estimated bias is $-10.3 - (-10) = -0.3$. For crankshafts 2 and 3 the estimated biases are $0.6 - 0 = 0.6$ and $10.3 - 10 = 0.3$ respectively, so the estimated biases in this study are all small.
Note that the sample standard deviations $s_1, s_2, s_3$ are all about the same size and our assumption about a common value seems reasonable. (Note: it is possible to test this assumption more formally.) An estimate of $\sigma_m$ is given by
$$s_m = \sqrt{\frac{s_1^2 + s_2^2 + s_3^2}{3}} = 1.37$$
Note that this estimate is not the average of the three sample standard deviations but the
square root of the average of the three sample variances. (Why does this estimate make
sense? Is it the maximum likelihood estimate of $\sigma_m$? What if the number of measurements
for each crankshaft were not equal?)
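The point estimates above are straightforward to reproduce numerically. The Python sketch below uses simulated stand-in data, since the raw 3 × 10 table of gauge readings is not reproduced in these notes; the numbers will therefore differ from the case study, but the formulas mirror the text.

```python
import numpy as np

rng = np.random.default_rng(1)
known = np.array([-10.0, 0.0, 10.0])   # known diameters of the three crankshafts

# Simulated stand-in for the 3 x 10 table of repeated measurements;
# the real gauge readings are not listed in the notes.
y = known[:, None] + rng.normal(0.0, 1.4, size=(3, 10))

ybar = y.mean(axis=1)          # maximum likelihood estimates of mu_1, mu_2, mu_3
s = y.std(axis=1, ddof=1)      # per-crankshaft sample standard deviations s_i
bias = ybar - known            # estimated biases
s_m = np.sqrt((s**2).mean())   # pooled estimate of sigma_m (root mean of the variances)

print("means: ", ybar.round(2))
print("biases:", bias.round(2))
print("pooled s_m:", s_m.round(2))
```

Note that `s_m` averages the three sample variances and then takes a square root, matching the pooled estimate in the text rather than a simple average of the standard deviations.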
Conclusion
The observed biases $-0.3$, $0.6$, $0.3$ appear to be small, especially when measured against the estimate of $\sigma_m$, and there is no apparent dependence of bias on crankshaft diameter.
To interpret the variability, we can use the model (3.1). Recall that if $Y_{ij} \sim G(\mu_i, \sigma_m)$ then
$$P(\mu_i - 2\sigma_m \le Y_{ij} \le \mu_i + 2\sigma_m) \approx 0.95$$
Therefore if we repeatedly measure the same journal diameter, then about 95% of the time we would expect to see the observations vary by about $2(1.37) = 2.74$.
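The 95% figure is just the Gaussian two-standard-deviation rule. It can be checked with the standard normal CDF, written here via the error function from Python's standard library:

```python
from math import erf, sqrt

# P(mu - 2*sigma <= Y <= mu + 2*sigma) = P(|Z| <= 2) for a standard normal Z,
# and P(|Z| <= z) = erf(z / sqrt(2)).
coverage = erf(2 / sqrt(2))
print(round(coverage, 4))    # 0.9545

# Half-width of the corresponding interval using s_m = 1.37 from this study
print(2 * 1.37)              # 2.74
```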
There are several limitations to these conclusions. Because we have carried out the
study on one day only and used only three crankshafts, the conclusion may not apply to
all future measurements (study error). The fact that the measurements were taken within
a few minutes on one day might be misleading if something special was happening at that
time (sample error). Since the measurements were not taken in random order, another
source of sample error is the possible drift of the gauge over time.
We could recommend that, if the study were to be repeated, more than three known-
value crankshafts could be used, that the time frame for taking the measurements could be
extended and that more measurements be taken on each crankshaft. Of course, we would
also note that these recommendations would add to the cost and complexity of the study.
We would also insist that the operator be better informed about the Plan.
Study 2
The second study is designed to estimate the overall population standard deviation of the
diameters of current and future crankshafts (the target population). We need to estimate this attribute to determine what variation is due to the process and what is due to the measurement system. A cause-and-effect or fishbone diagram listing some possible explanatory variates for the variability in journal diameter is given in Figure 3.3. Note that there are
many explanatory variates other than the measurement system. Variability in the response
variate is induced by changes in the explanatory variates, including those associated with
the measurement system.
[Figure 3.3: Fishbone diagram for journal diameter. Major bones: Measurements, Method, Machine, Environment, Material, Operator; explanatory variates include maintenance, speed of rotation, set-up of tooling, angle of cut, line speed, calibration, cutting tool edge, positioning in gauge, set-up method, hardness, dirt on part, training, quenchant, temperature, casting chemistry, environment, and casting lot.]
Plan
The study population is defined as those crankshafts available over the next week, about
7500 parts (500 per shift times 15 shifts). No other explanatory variates were measured.
Initially it was proposed to select a sample of 150 parts over the week (ten from each
shift). However, when it was learned that the gauge software stores the measurements for
the most recent 2000 crankshafts measured, it was decided to select a point in time near the
end of the week and use the 2000 measured values from the gauge memory to be the sample.
One could easily criticize this choice (sample error), but the data were easily available and
inexpensive.
Data
The individual observed measurements are too numerous to list but a histogram of the data is shown in Figure 3.4. From this, we can see that the measured diameters vary from $-14$ to $+16$.
Figure 3.4: Histogram of 2000 measured values from the gauge memory
Analysis
We model the measurements as
$$Y_i = \mu + R_i, \quad R_i \sim G(0, \sigma) \text{ independently}$$
where $Y_i$ represents the measurement of the $i$th diameter, $\mu$ represents the study population mean diameter, and the residual $R_i$ represents the variability due to sampling and the measurement system. We let $\sigma$ quantify this variability. We have not included a bias term in the model because we assume, based on our results from Study 1, that the measurement system bias is small. As well we assume that the sampling protocol does not contribute substantial bias.
The histogram of the 2000 measured diameters shows that there is considerable spread in
the measured diameters. About 4.2% of the parts require reworking and 1.8% are scrapped.
The shape of the histogram is approximately symmetrical and centred close to zero. The
sample mean is
P
1 2000
y= yi = 0:82
2000 i=1
which gives us an estimate of (the maximum likelihood estimate) and the sample standard
deviation is s
P
1 2000
s= (yi y)2 = 5:17
1999 i=1
which gives us an estimate of (not quite the maximum likelihood estimate).
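These summary calculations can be sketched as follows. The 2000 stored readings are not listed in the notes, so simulated stand-in data with roughly the reported mean and spread are used; the exact numbers will differ.

```python
import numpy as np

rng = np.random.default_rng(7)
# Simulated stand-in for the 2000 gauge readings (mean ~0.82, sd ~5.17)
y = rng.normal(0.82, 5.17, size=2000)

ybar = y.mean()            # estimate of mu (the maximum likelihood estimate)
s = y.std(ddof=1)          # sample standard deviation, divisor n - 1 = 1999
rework = (y > 10).mean()   # fraction above the upper specification limit
scrap = (y < -10).mean()   # fraction below the lower specification limit

print(round(ybar, 2), round(s, 2), rework, scrap)
```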
Conclusion
The overall process variation is estimated by $s$. Since the sample contained 2000 parts measured consecutively, many of the explanatory variates did not have time to change as they would in the study population. Thus, there is a danger of sample error producing an estimate of the variation that is too small.
The variability due to the measurement system, estimated to be 1.37 in Study 1, is much less than the overall variability which is estimated to be 5.17. One way to compare the two standard deviations $\sigma_m$ and $\sigma$ is to separate the total variability into the variability due to the measurement system $\sigma_m$ and that due to all other sources. In other words, we are interested in estimating the variability that would be present if there were no variability in the measurement system ($\sigma_m = 0$). If we assume that the total variability arises from two independent sources, the measurement system and all other sources, then we have $\sigma^2 = \sigma_m^2 + \sigma_p^2$ or $\sigma_p = \sqrt{\sigma^2 - \sigma_m^2}$, where $\sigma_p$ quantifies the variability due to all other uncontrollable variates (sampling variability). An estimate of $\sigma_p$ is given by
$$\sqrt{s^2 - s_m^2} = \sqrt{(5.17)^2 - (1.37)^2} = 4.99$$
Hence, eliminating all of the variability due to the measurement system would produce an
estimated variability of 4:99 which is a small reduction from 5:17. The measurement system
seems to be performing well and not contributing substantially to the overall variation.
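The variance decomposition is simple arithmetic, and a quick check confirms the reported value:

```python
import math

s_total = 5.17   # overall standard deviation estimate from Study 2
s_m = 1.37       # measurement-system standard deviation estimate from Study 1

# Independent sources add on the variance scale: sigma^2 = sigma_m^2 + sigma_p^2,
# so the process-only standard deviation is the root of the difference of variances.
s_p = math.sqrt(s_total**2 - s_m**2)
print(round(s_p, 2))   # 4.99
```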
Comments
Study 3 revealed that the measurement system had a serious long term problem. At first,
it was suspected that the cause of the variability was the fact that the gauge was not
calibrated over the course of the study. Study 3 was repeated with a calibration before
each measurement. A pattern similar to that for Study 3 was seen. A detailed examination
of the gauge by a repairperson from the manufacturer revealed that one of the electronic
components was not working properly. This was repaired and Study 3 was repeated. This
study showed variation similar to the variation of the short term study (Study 1) so that
the overall project could continue. When Study 2 was repeated, the overall variation and
the number of scrap and reworked crankshafts was substantially reduced. The project was
considered complete and long term monitoring showed that the scrap rate was reduced to about 0.7%, which produced an annual savings of more than $100,000.
As well, three similar gauges that were used in the factory were put through the “long
term” test. All were working well.
Summary
An important part of any Plan is the choice and assessment of the measurement
system.
The measurement system may contribute substantial error that can result in poor
decisions (e.g. scrapping good parts, accepting bad parts).
We represent systematic measurement error by bias in the model. The bias can be
assessed only by measuring units with known values, taken from another reference
measurement system. The bias may be constant or depend on the size of the unit
being measured, the person making the measurements, and so on.
Variability can be assessed by repeatedly measuring the same unit. The variability
may depend on the unit being measured or any other explanatory variates.
Both bias and variability may be a function of time. This can be assessed by examining these attributes over a sufficiently long time span as in Study 3.
2. Four weeks before a national election, a political party conducts a poll to assess what
proportion of eligible voters plan to vote and, of those, what proportion support the
party. This will determine how they run the rest of the campaign. They are able to
obtain a list of eligible voters and their telephone numbers in the 20 most populated
areas. They select 3000 names from the list and call them. Of these, 1104 eligible
voters agree to participate in the survey with the results summarized in the table
below. Answer the questions below based on this information.
Support Party
Plan to Vote YES NO
YES 351 381
NO 107 265
(a) Define the Problem for this study. What type of Problem is this and why?
resource. At six weeks, the benchmarking assessment was repeated and the pre- and
post-training scores were compared. The di¤erence in benchmarking scores provided
the measure of generalized cognitive improvement resulting from training. Similarly,
for each training task, the first and last scores were compared to give a measure of
speci…c improvement on that task.
The relationship between the number of training sessions and changes in benchmark
performance was negligible in all groups for all tests. These results provide no evi-
dence for any generalized improvements in cognitive function following brain training
in a large sample of healthy adults.
Answer the questions below based on the information provided.
4. U.S. to fund study of Ontario math curriculum, Globe & Mail, January 17,
2014, Caroline Alphonso - Education Reporter (article has been condensed)
The U.S. Department of Education has funded a $2.7-million (U.S.) project, led by
a team of Canadian researchers at Toronto’s Hospital for Sick Children. The study
will look at how elementary students at several Ontario public schools fare in math
using the current provincial curriculum as compared to the JUMP math program,
which combines the conventional way of learning the subject with so-called discovery
learning. Math teaching has come under scrutiny since OECD results that measured
the scholastic abilities of 15-year-olds in 65 countries showed an increasing percentage
of Canadian students failing the math test in nearly all provinces. Dr. Tracy Solomon
and her team are collecting and analyzing two years of data on students in primary
and junior grades from one school board, which she declined to name. The students
were in Grades 2 and 5 when the study began, and are now in Grades 3 and 6, which
means they will participate in Ontario’s standardized testing program this year. The
research team randomly assigned some schools to teach math according to the Ontario
curriculum, which allows open-ended student investigations and problem-solving. The
other schools are using the JUMP program. Dr. Solomon said the research team is
using classroom testing data, lab tests on how children learn and other measures to
study the impact of the two programs on student learning.
Answer the questions below based on this article.
5. Playing racing games may encourage risky driving, study finds, Globe &
Mail, January 8, 2015 (article has been condensed)
Playing an intense racing game makes players more likely to take risks such as speed-
ing, passing on the wrong side, running red lights or using a cellphone in a simulated
driving task shortly afterwards, according to a new study. Young adults with more
adventurous personalities were more inclined to take risks, and more intense games
led to greater risk-taking, the authors write in the journal Injury Prevention. Other
research has found a connection between racing games and inclination to risk-taking
while driving, so the new results broaden that evidence base, said lead author of the
new study, Mingming Deng of the School of Management at Xi’an Jiaotong University
in Xi’an, China. “I think racing gamers should be [paying] more attention in their
6. Higher coffee consumption associated with lower risk of early death, European Society of Cardiology, August 27, 2017
Higher coffee consumption is associated with a lower risk of death, according to research presented today at ESC Congress. The study in nearly 20,000 participants suggests that coffee can be part of a healthy diet in healthy people. "Coffee is one of the most widely consumed beverages around the world," said Dr Adela Navarro,
(j) Suppose you are not a coffee drinker. On the basis of this study, do you think it would be a good idea to start drinking four cups of coffee a day? Why or why not?
7. Answer the following questions based on the study given in Chapter 1, Problem 24.
(a) Define the Problem for this study in one or two sentences.
(b) What type of Problem is this? Explain why.
(c) Define a suitable target population for this study.
(d) Define a suitable study population for this study.
(e) Describe possible sources of study error for this study.
(f) Describe the sampling protocol for this study in as much detail as possible.
(g) What is the sample and sample size for this study?
(h) Describe possible sources of sample error for this study.
(i) Describe possible sources of measurement error for this study.
(j) What is the most serious limitation to the conclusion(s) of this study?
(k) Use the information in the article and your answers to the above questions to
construct a PPDAC for this empirical study in as much detail as possible.
8. Suppose you wish to study the smoking habits of teenagers and young adults, in order
to understand what personal factors are related to whether, and how much, a person
smokes. Briefly describe the main components of such a study, using the PPDAC framework. Be specific about the target and study population, the sample, and the
variates you would collect.
9. Suppose you wanted to study the relationship between a person's "resting" pulse rate
(heart beats per minute) and the amount and type of exercise they get.
(a) List some factors (including exercise) that might a¤ect resting pulse rate. You
may wish to draw a cause and e¤ect (…shbone) diagram to represent potential
causal factors.
(b) Describe brie‡y how you might study the relationship between pulse rate and
exercise using (i) an observational study, and (ii) an experimental study.
10. A large company uses photocopiers leased from two suppliers A and B. The lease rates are slightly lower for B’s machines but there is a perception among workers that they break down and cause disruptions in work flow substantially more often. Describe briefly how you might design and carry out a study of this issue, with the ultimate objective being a decision whether to continue the lease with company B. What additional factors might affect this decision?
11. For a study like the one in Example 1.3.2, where heights x and weights y of individuals
are to be recorded, discuss sources of variability due to the measurement of x and y
on any individual.
4. ESTIMATION
(1) Where do we get our probability model? What if it is not a good description of the population or process?
We discussed the first question in Chapters 1 and 2. It is important to check the adequacy (or “fit”) of the model; some ways of doing this were discussed in Chapter 2 and more formal methods will be considered in Chapter 7. If the model used is not satisfactory, it is not wise to use the estimates based on it. For the lifetimes of brake pads data introduced in Example 1.3.4, a Gaussian model did not fit the data well. Sometimes the data can be transformed in such a way that the Gaussian model does fit (see Chapter 2, Problem 18).
(2) The estimation of parameters or population attributes depends on data collected from
the population or process, and the likelihood function is based on the probability of
the observed data. This implies that factors associated with the selection of sample
units or the measurement of variates (e.g. measurement error) must be included in
the model. In many examples it is assumed that the variate of interest is measured
without error for a random sample of units from the population. We will typically
assume that the data come from a random sample of population units, but in any
given application we would need to design the data collection plan to ensure this
assumption is valid.
(3) Suppose in the model chosen the population mean is represented by the parameter μ. The sample mean ȳ is an estimate of μ, but not usually equal to it. How far away from μ is ȳ likely to be? If we take a sample of only n = 50 units, would we expect the estimate ȳ to be as “good” as ȳ based on 150 units? What does “good” mean?
We focus on the third point in this chapter and assume that we can deal with the first two points with the methods discussed in Chapters 1 and 2.
    θ̂ = ȳ = (1/n) Σᵢ₌₁ⁿ yᵢ

is a point estimate of θ if y₁, y₂, …, yₙ is an observed random sample from a Poisson distribution with mean θ.
The method of maximum likelihood provides a general method for obtaining estimates, but other methods exist. For example, if θ = E(Y) = μ is the average (mean) value of y in the population, then the sample mean θ̂ = ȳ is an intuitively sensible estimate; it is the maximum likelihood estimate of θ if Y has a G(μ, σ) distribution, but because of the Central Limit Theorem it is a good estimate of θ more generally. Thus, while we will use maximum likelihood estimation a great deal, you should remember that the discussion below applies to estimates of any type.
The problem facing us in this chapter is how to determine or quantify the uncertainty in an estimate. We do this using sampling distributions, which are based on the following idea. If we select random samples on repeated occasions, then the estimates μ̂ obtained from the different samples will vary. For example, five separate random samples of n = 50 persons from the same male population described in Example 1.3.2 gave five different estimates μ̂ = ȳ of E(Y) as:

    1.723   1.743   1.734   1.752   1.736

Estimates vary as we take repeated samples and therefore we associate a random variable and a distribution with these estimates.
More precisely, we define this idea as follows. Let the random variables Y₁, Y₂, …, Yₙ represent potential observations in an empirical study. Associate with the estimate μ̂ = ȳ the corresponding estimator

    μ̃ = Ȳ = (1/n) Σᵢ₌₁ⁿ Yᵢ

The distribution of the estimator μ̃ is called the sampling distribution of μ̃.
Example 4.2.1
Suppose we have a variate of interest (for example, the height in meters of a male in the population of Example 1.3.2) whose distribution it is reasonable to model as a G(μ, σ) random variable. Suppose also that we plan to take a random sample Y₁, Y₂, …, Yₙ to estimate the unknown mean μ where Yᵢ ~ G(μ, σ), i = 1, 2, …, n. The maximum likelihood estimator of μ is

    μ̃ = Ȳ = (1/n) Σᵢ₌₁ⁿ Yᵢ

From properties of Gaussian random variables (see Chapter 1, Problem 16) we know that μ̃ = Ȳ ~ G(μ, σ/√n), and so the sampling distribution of Ȳ is G(μ, σ/√n).
If we knew σ we could determine how often the estimator μ̃ = Ȳ is within a specified amount of the unknown mean μ. For example, if the variate is height and heights are measured in meters then we could determine how often the estimator μ̃ = Ȳ is within 0.01 meters of the true mean μ as follows:

    P(|μ̃ − μ| ≤ 0.01) = P(μ − 0.01 ≤ Ȳ ≤ μ + 0.01)
                      = P(−0.01/(σ/√n) ≤ (Ȳ − μ)/(σ/√n) ≤ 0.01/(σ/√n))
                      = P(−0.01√n/σ ≤ Z ≤ 0.01√n/σ)   where Z ~ G(0, 1)
and if n = 100 and σ = 0.07,

    P(|μ̃ − μ| ≤ 0.01) = P(−0.01√100/0.07 ≤ Z ≤ 0.01√100/0.07) = P(−1.43 ≤ Z ≤ 1.43) = 0.847

This illustrates the rather intuitive fact that, the larger the sample size, the higher the probability the estimator μ̃ = Ȳ is within 0.01 meters of the true but unknown mean height μ in the population. It also allows us to express the uncertainty in an estimate μ̂ = ȳ from an observed sample y₁, y₂, …, yₙ by indicating the probability that any single random sample will give an estimate within a certain distance of μ.
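The probability above is easy to check with a few lines of code. The Course Notes use R, but a minimal Python sketch using only the standard library (the `erf` function gives the G(0, 1) cumulative distribution function) is:

```python
from math import erf, sqrt

def phi(z):
    """Cumulative distribution function of G(0, 1)."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def prob_within(d, sigma, n):
    """P(|mu_tilde - mu| <= d) when Ybar ~ G(mu, sigma/sqrt(n))."""
    return 2 * phi(d * sqrt(n) / sigma) - 1

print(round(prob_within(0.01, 0.07, 100), 3))  # 0.847
print(round(prob_within(0.01, 0.07, 50), 3))   # smaller, since n is smaller
```

The second call illustrates the point made in the text: with only n = 50 units the probability of being within 0.01 meters of μ is lower.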
Example 4.2.2
In Example 4.2.1 the distribution of the estimator μ̃ = Ȳ could be determined exactly. Sometimes the distribution of the estimator can only be determined approximately using the Central Limit Theorem. For example, for Binomial data with n trials and y successes the estimator θ̃ = Y/n has E(θ̃) = θ and Var(θ̃) = θ(1 − θ)/n. By the Normal approximation to the Binomial we have

    (θ̃ − θ) / √(θ(1 − θ)/n) ~ G(0, 1)   approximately
This result could be used, for example, to determine how large n should be to ensure that the estimator θ̃ is within a specified distance of θ with high probability.
In some cases the sampling distribution can be approximated using a simulation study
as illustrated in the next example.
Example 4.2.3
Suppose the population of interest is a finite population consisting of 500 units. Suppose associated with each unit is a number between 1 and 10 which is the variate of interest. If we wanted to estimate the mean μ of this population we could select a random sample y₁, y₂, …, yₙ without replacement and estimate μ using the estimate μ̂ = ȳ. Let us examine how good the estimator μ̃ = Ȳ is in the case of the population which has the distribution of variate values as indicated in Table 4.1.
Table 4.1

    Variate value    1    2    3    4    5    6    7    8    9   10   Total
    No. of units   210  127   66   39   23   13   11    7    3    1    500
In Figure 4.1 a histogram of the variate values is plotted. We notice that the population of variate values is very positively skewed. The population mean and the population standard deviation are μ = 2.362 and σ = 1.7433 respectively.

Figure 4.1: Histogram of the variate values for the finite population of Table 4.1
Note that the population variance is divided by 500 and not 499. To determine how good an estimator μ̃ = Ȳ is we need the sampling distribution of Ȳ. This could be determined exactly but would require a great deal of effort. Another way to approximate the sampling distribution is to use a computer simulation. The simulation can be done in two steps. First a random sample y₁, y₂, …, yₙ is drawn at random without replacement from the population. Secondly the sample mean ȳ for this sample is determined. These two steps are repeated k times. The k sample means, ȳ₁, ȳ₂, …, ȳₖ, can then be considered as a random sample from the distribution of the random variable μ̃ = Ȳ, and we can study the distribution by plotting a histogram of the values ȳ₁, ȳ₂, …, ȳₖ. The R code for such a simulation is given in Chapter 4, Problem 1.
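The two-step simulation just described can be sketched as follows. This is a Python analogue of the R code referenced in Chapter 4, Problem 1; the seed and the choice of k = 10000 repetitions are arbitrary.

```python
import random
from statistics import mean, stdev

# Finite population from Table 4.1: variate value v appears counts[v] times (N = 500).
counts = {1: 210, 2: 127, 3: 66, 4: 39, 5: 23, 6: 13, 7: 11, 8: 7, 9: 3, 10: 1}
population = [v for v, c in counts.items() for _ in range(c)]

random.seed(1)                 # arbitrary seed, for reproducibility
k, n = 10000, 15
# Step 1: draw a sample of size n without replacement; step 2: record its mean.
ybars = [mean(random.sample(population, n)) for _ in range(k)]

mu = mean(population)          # population mean, 2.362
print(round(mean(ybars), 3))   # close to mu = 2.362
print(round(stdev(ybars), 3))  # close to sigma/sqrt(n) = 0.4501 (a bit less, per footnote 11)
print(mean(abs(y - mu) <= 0.5 for y in ybars))  # estimate of P(|Ybar - mu| <= 0.5), about 0.74
```

Plotting a histogram of `ybars` gives a picture like Figure 4.2 below.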
Figure 4.2: Relative frequency histogram of sample means from 10000 samples of size 15 drawn from the population defined by Table 4.1
The histogram in Figure 4.2 was obtained by drawing k = 10000 samples of size n = 15 from the population defined by Table 4.1, calculating the sample means ȳ₁, ȳ₂, …, ȳ₁₀₀₀₀ and then plotting the relative frequency histogram. The histogram represents an approximation to the sampling distribution of the estimator Ȳ. The number of simulations k only affects how good the approximation is. It can be shown¹¹ that the mean and standard deviation of the true sampling distribution of Ȳ are

    E(Ȳ) = μ = 2.362   and   sd(Ȳ) ≈ σ/√n = 1.7433/√15 = 0.4501

Does the histogram, which represents the approximate sampling distribution, agree with these statements? What do you notice about the symmetry of the histogram? Does the histogram look like a Gaussian distribution?
Based on this simulation we can approximate P(|Ȳ − 2.362| ≤ 0.5), the probability that the sample mean Ȳ is within 0.5 of the population mean μ = 2.362, by counting the number of sample means in the simulation which are within 0.5 of the value 2.362. For the simulation in Figure 4.2 this value was 0.7422.
¹¹For a sample of size n drawn without replacement from a finite population of size N, sd(Ȳ) = (σ/√n) √((N − n)/(N − 1)).
If samples of size n = 30 were drawn, how would the location, variability and symmetry of the histogram of simulated means change? How would the estimate of P(|Ȳ − 2.362| ≤ 0.5) be affected? See Chapter 4, Problem 1.
The estimates and estimators we have discussed so far are often referred to as point estimates and point estimators respectively. This is because they consist of a single value or “point”. Sampling distributions allow us to address the uncertainty in a point estimate. The uncertainty in a point estimate is usually conveyed by an interval estimate, which takes the form [L(y), U(y)] where the endpoints, L(y) and U(y), are both functions of the observed data y. If we let L(Y) and U(Y) represent the random variables associated with L(y) and U(y), then [L(Y), U(Y)] is called a random interval since the endpoints are random variables. The probability that the parameter θ falls in the random interval [L(Y), U(Y)] is P(θ ∈ [L(Y), U(Y)]) = P[L(Y) ≤ θ ≤ U(Y)]. This probability tells us how good the rule is by which the interval estimate was obtained. It tells us, for example, how often we would expect the statement θ ∈ [L(y), U(y)] to be true if we were to draw many random samples from the same population and each time we constructed the interval [L(y), U(y)] based on the observed data y.
For example, suppose P[L(Y) ≤ θ ≤ U(Y)] = 0.95. If we drew a large number of random samples and each time we constructed the interval [L(y), U(y)] from the data y, then we would expect the true value of the parameter θ to lie in approximately 95% of these constructed intervals. This means we can be reasonably confident that if we construct one interval based on one observed data set y, then the interval [L(y), U(y)] will contain the true value of the unknown parameter θ. In general, uncertainty in a point estimate is explicitly stated by giving the interval estimate along with the probability P(θ ∈ [L(Y), U(Y)]).
We will discuss this idea of confidence related to interval estimates in more detail in Section 4.4. First we show how the likelihood function can be used to construct interval estimates.
Definition 24 Suppose θ is scalar and that some observed data (say a random sample y₁, y₂, …, yₙ) have given a likelihood function L(θ). The relative likelihood function R(θ) is defined as

    R(θ) = L(θ) / L(θ̂)   for θ ∈ Ω

where θ̂ is the maximum likelihood estimate and Ω is the parameter space.
Note: 0 ≤ R(θ) ≤ 1 for all θ ∈ Ω.
    Poll 1: n = 200, y = 80
    Poll 2: n = 1000, y = 400

In each case θ̂ = 0.40, but the relative likelihood function is more “concentrated” around θ̂ for the larger poll (Poll 2). The 10% likelihood intervals also reflect this. From Figure 4.3 we can determine that the 10% likelihood interval is [0.33, 0.47] for Poll 1 and [0.37, 0.43] for Poll 2. The interval for Poll 2 is narrower than for Poll 1 which reflects the fact that the larger poll contains more information about θ.
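The intervals read off the graph can also be found numerically. A rough Python sketch (a simple grid search over θ, not the method the Course Notes use) is:

```python
from math import log

def likelihood_interval(n, y, p):
    """Endpoints of the 100p% likelihood interval {theta : R(theta) >= p}
    for a Binomial(n, theta) model, found by a fine grid search."""
    th_hat = y / n
    def r(th):  # log relative likelihood r(theta) = log R(theta)
        return y * log(th / th_hat) + (n - y) * log((1 - th) / (1 - th_hat))
    grid = (i / 100000 for i in range(1, 100000))
    inside = [th for th in grid if r(th) >= log(p)]
    return inside[0], inside[-1]

print(likelihood_interval(200, 80, 0.10))    # roughly (0.33, 0.47) for Poll 1
print(likelihood_interval(1000, 400, 0.10))  # roughly (0.37, 0.43) for Poll 2, narrower
```

The grid search is crude but adequate here; a root finder applied to r(θ) − log p would be more precise.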
Figure 4.3: Relative likelihood functions R(θ) for two polls with different sample sizes (n = 200 and n = 1000); the horizontal line at 0.1 marks the 10% likelihood intervals
Table 4.2 gives guidelines for interpreting likelihood intervals. These are only guidelines for this course.

Table 4.2
Guidelines for Interpreting Likelihood Intervals
Values of θ inside a 50% likelihood interval are very plausible in light of the observed data.
Values of θ inside a 10% likelihood interval are plausible in light of the observed data.
Values of θ outside a 10% likelihood interval are implausible in light of the observed data.
Values of θ outside a 1% likelihood interval are very implausible in light of the observed data.
The values 1%, 10%, and 50% are typically used because they are nice round numbers and they provide useful summaries. Other values could also be used. In Section 4.6 we will see that 15% likelihood intervals have a connection with 95% confidence intervals. (Values inside a 15% likelihood interval are also plausible in light of the observed data.)
A 10% likelihood interval is useful because it excludes parameter values for which the probability of the observed data is less than 1/10 of the probability when θ = θ̂. In other words, a 10% likelihood interval summarizes the interval of values for the unknown parameter which are reasonably supported by the observed data in an empirical study. A 50% likelihood interval contains values of the parameter for which the probability of the observed data is at least 1/2 of the probability when θ = θ̂. A narrower 50% likelihood interval might be used if decisions made on the basis of the plausible values of the unknown parameter in light of the data had serious consequences in terms of money or lives of people. A 1% likelihood interval, which is wider than a 10% likelihood interval, would be used if the aim of the empirical study was to summarize all the parameter values which are supported in some way by the observed data. Which likelihood interval is used, therefore, depends very much on the goals of the empirical study that is being conducted.
A drawback of likelihood intervals (as well as confidence intervals as we will see in the next section) is that we never know whether the interval obtained contains the true value of the parameter or not. In Section 4.6 we will see that the construction of a likelihood interval ensures that we can be reasonably confident that it does.
Sometimes it is more convenient to compute the log of the relative likelihood function, r(θ) = log R(θ), instead of R(θ).
The plots of the relative likelihood function R(θ) and the log relative likelihood function r(θ) are both unimodal. As well, both R(θ) and r(θ) attain their maximum value at θ = θ̂ (note: R(θ̂) = 1 while r(θ̂) = 0). The plots of R(θ) and r(θ), however, differ in terms of their shape. The plot of the relative likelihood function R(θ) (see, for example, Figure 4.3) often resembles a Gaussian probability density function in shape while the plot of the log relative likelihood r(θ) resembles a quadratic function of θ (see, for example, Figure 4.4).
The log relative likelihood function can also be used to compute a 100p% likelihood interval since R(θ) ≥ p if and only if r(θ) ≥ log p. In other words, a 100p% likelihood interval can also be defined as {θ : r(θ) ≥ log p}. For example, {θ : r(θ) ≥ log(0.1) = −2.30} is a 10% likelihood interval.
The idea of a likelihood interval for a parameter θ can also be extended to the case of a vector of parameters θ = (θ₁, θ₂, …, θₖ). In this case R(θ) ≥ p gives likelihood “regions” for θ.
Figure 4.4: Log relative likelihood functions r(θ) for two polls with different sample sizes (n = 200 and n = 1000); the horizontal line at log(0.1) marks the 10% likelihood intervals
The parameter θ in (4.1) is an unknown constant associated with the population. It is not a random variable and therefore does not have a distribution. The probability in (4.1) can be interpreted in the following way. Suppose we were about to draw a random sample of the same size from the same population and the true value of the parameter was θ. Suppose also that we knew that we would construct an interval of the form [L(y), U(y)] once we had collected the data. Then the probability that θ will be contained in this new interval is given by (4.1). When we use the observed data y to construct the interval [L(y), U(y)] we note that L(y) and U(y) are numerical values, not random variables. Since θ is an unknown constant we do not know whether the statement θ ∈ [L(y), U(y)] is true or false. How then should we interpret a confidence interval?
Suppose p = 0.95 and we drew a very large number of random samples from the model. Suppose also that each time we observed a random sample, we constructed a 95% confidence interval [L(y), U(y)] based on the observed data y. Then (4.2) indicates that 95% of these constructed intervals would contain the true value of the parameter θ (and of course 5% do not). This gives us some confidence that for a particular sample, the true value of the parameter θ is contained in the confidence interval constructed from the sample.
The following example illustrates that the confidence coefficient sometimes does not depend on the unknown parameter.
Example 4.4.1 Gaussian distribution with unknown mean and known standard deviation
Suppose Y₁, Y₂, …, Yₙ is a random sample from a G(μ, 1) distribution, that is, μ = E(Yᵢ) is unknown but sd(Yᵢ) = 1 is known. Consider the interval

    [Ȳ − 1.96 n^(−1/2), Ȳ + 1.96 n^(−1/2)]

where Ȳ = (1/n) Σᵢ₌₁ⁿ Yᵢ is the sample mean. Since the sampling distribution of Ȳ is G(μ, 1/√n),

    P(Ȳ − 1.96/√n ≤ μ ≤ Ȳ + 1.96/√n) = P(−1.96 ≤ √n (Ȳ − μ) ≤ 1.96)
                                      = P(−1.96 ≤ Z ≤ 1.96)   where Z ~ G(0, 1)
                                      = 0.95

and the interval [ȳ − 1.96/√n, ȳ + 1.96/√n] is a 95% confidence interval for the unknown mean μ. Note that the confidence coefficient did not depend on the value of the unknown parameter μ.
Suppose for a particular sample of size n = 16 the observed mean was ȳ = 10.4; then the 95% confidence interval would be [ȳ − 1.96/4, ȳ + 1.96/4], or [9.91, 10.89]. We cannot say that P(μ ∈ [9.91, 10.89]) = 0.95. We can only say that we are 95% confident that the interval [9.91, 10.89] contains μ.
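A quick Python sketch of this computation (the function simply evaluates ȳ ± zσ/√n for a known σ):

```python
from math import sqrt

def gaussian_ci(ybar, sigma, n, z=1.96):
    """Two-sided interval ybar +/- z*sigma/sqrt(n) for a G(mu, sigma) model, sigma known."""
    half = z * sigma / sqrt(n)
    return ybar - half, ybar + half

lo, hi = gaussian_ci(10.4, 1, 16)
print(round(lo, 2), round(hi, 2))  # 9.91 10.89
```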
We repeat the very important interpretation of a 95% confidence interval. Suppose the experiment which was used to estimate μ was conducted a large number of times and each time a 95% confidence interval for μ was constructed using the observed data and the interval [ȳ − 1.96/√n, ȳ + 1.96/√n]. Then approximately 95% of these constructed intervals would contain the true but unknown value of the parameter μ. Of course, approximately 5% of these constructed intervals would not contain the true but unknown value of the parameter μ. Since we only have one interval [ȳ − 1.96/√n, ȳ + 1.96/√n] we do not know whether it contains the true value of μ or not. We can only say that we are 95% confident that the interval [ȳ − 1.96/√n, ȳ + 1.96/√n] contains the true value of μ. In other words, we hope we were one of the “lucky” 95% who constructed an interval containing the true value of μ.

Warning: P(μ ∈ [ȳ − 1.96/√n, ȳ + 1.96/√n]) = 0.95 is an incorrect statement!
Recall that the coverage probability for the interval in Example 4.4.1 did not depend on the unknown parameter μ. This is a highly desirable property to have because we would like to know the coverage probability without knowing the value of the unknown parameter. We next consider a general method for finding confidence intervals which have this property.
Pivotal Quantities
Definition 28 A pivotal quantity Q = Q(Y; θ) is a function of the data Y and the unknown parameter θ such that the distribution of the random variable Q is fully known. That is, probability statements such as P(Q ≤ b) and P(Q ≥ a) depend on a and b but not on θ or any other unknown information.
Example 4.4.1 Revisited Gaussian distribution with unknown mean and known
standard deviation
In Example 4.4.1 the parameter μ = E(Yᵢ) was unknown but the standard deviation sd(Yᵢ) = 1 was known. Since Y₁, Y₂, …, Yₙ is a random sample from a G(μ, 1) distribution, E(Ȳ) = μ and sd(Ȳ) = 1/√n, it follows that

    (Ȳ − μ) / (1/√n) = √n (Ȳ − μ) ~ G(0, 1)

In other words the distribution of the random variable √n (Ȳ − μ) is completely known and therefore √n (Ȳ − μ) is a pivotal quantity. In particular we know that

    P(√n (Ȳ − μ) ≤ −1.96) = 0.025

and

    P(√n (Ȳ − μ) ≥ 1.96) = 0.025
We now describe how a pivotal quantity can be used to construct a confidence interval. We begin with the statement P[a ≤ Q(Y; θ) ≤ b] = p where Q(Y; θ) is a pivotal quantity whose distribution is completely known. Suppose that we can re-express the inequality a ≤ Q(Y; θ) ≤ b in the form L(Y) ≤ θ ≤ U(Y) for some functions L and U. Then since

    p = P[a ≤ Q(Y; θ) ≤ b]
      = P[L(Y) ≤ θ ≤ U(Y)]
      = P(θ ∈ [L(Y), U(Y)])

the interval [L(y), U(y)] is a 100p% confidence interval for θ. The confidence coefficient for the interval [L(y), U(y)] is equal to p, which does not depend on θ. The confidence coefficient does depend on a and b, but these are determined by the known distribution of Q(Y; θ).
Example 4.4.2 Suppose Y₁, Y₂, …, Yₙ is a random sample from a G(μ, σ) distribution where σ is known. Since

    Q = Q(Y; μ) = (Ȳ − μ) / (σ/√n) ~ G(0, 1)   (4.3)

and G(0, 1) is a completely known distribution, Q is a pivotal quantity. To obtain a 95% confidence interval for μ we first note that 0.95 = P(−1.96 ≤ Z ≤ 1.96) where Z ~ G(0, 1). From (4.3) it follows that

    0.95 = P(−1.96 ≤ (Ȳ − μ)/(σ/√n) ≤ 1.96)
         = P(Ȳ − 1.96σ/√n ≤ μ ≤ Ȳ + 1.96σ/√n)

so that

    [ȳ − 1.96σ/√n, ȳ + 1.96σ/√n]

is a 95% confidence interval for μ based on the observed data y = (y₁, y₂, …, yₙ).
Note that if a and b are values such that 0.95 = P(a ≤ Z ≤ b) where Z ~ G(0, 1) then the interval [ȳ − bσ/√n, ȳ − aσ/√n] is also a 95% confidence interval for μ. The interval [ȳ − 1.96σ/√n, ȳ + 1.96σ/√n] can be shown to be the narrowest possible 95% confidence interval for μ.
The interval [ȳ − 1.96σ/√n, ȳ + 1.96σ/√n], or ȳ ± 1.96σ/√n, is often referred to as a two-sided confidence interval. Note that this interval takes the form ȳ ± aσ/√n where a is a value from the G(0, 1) table. Many two-sided confidence intervals we will encounter will take a similar form.
Since P(μ ∈ [Ȳ − 1.645σ/√n, ∞)) = 0.95, the interval [ȳ − 1.645σ/√n, ∞) is also a 95% confidence interval for μ. The interval [ȳ − 1.645σ/√n, ∞) is usually referred to as a one-sided confidence interval. This type of interval is useful when we are interested in determining a lower bound on the value of μ.
Remark
It is important to understand that confidence intervals vary when we take repeated samples. In Example 4.4.2, suppose σ = 2 is known, and the sample size n is 16. A 95% confidence interval based on one sample with observed sample mean ȳ is

    [ȳ − 1.96(2)/√16, ȳ + 1.96(2)/√16] = [ȳ − 0.98, ȳ + 0.98]
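The repeated-sampling interpretation can be illustrated by simulation. A hedged Python sketch follows; the true mean μ = 5 and the seed are arbitrary choices for illustration, while σ = 2 and n = 16 come from the Remark above.

```python
import random
from math import sqrt

random.seed(2)                  # arbitrary seed
mu, sigma, n = 5.0, 2.0, 16     # mu = 5 is a hypothetical "true" mean; sigma = 2 as in the Remark
half = 1.96 * sigma / sqrt(n)   # = 0.98

trials, covered = 20000, 0
for _ in range(trials):
    ybar = sum(random.gauss(mu, sigma) for _ in range(n)) / n
    if ybar - half <= mu <= ybar + half:
        covered += 1
print(covered / trials)         # close to 0.95: about 95% of the intervals cover mu
```

Each trial builds the interval [ȳ − 0.98, ȳ + 0.98] from a fresh sample; roughly 95% of these intervals contain μ, and the rest do not, exactly as the text describes.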
therefore

    [θ̂ − 1.96 √(θ̂(1 − θ̂)/n), θ̂ + 1.96 √(θ̂(1 − θ̂)/n)]   (4.4)

    = θ̂ ± 1.96 √(θ̂(1 − θ̂)/n)   (4.5)

is an approximate 95% confidence interval for θ, where θ̂ = y/n and y is the observed data.
Note Asymptotic Gaussian pivotal quantities exist for other models. See Problem 15 (Poisson), Problem 24 (Exponential), and Problem 17. See Table 4.3 in Section 4.8 for a summary of the approximate confidence intervals for these models.
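Formula (4.5) is easy to wrap as a function. A Python sketch, reusing the Poll 1 numbers (n = 200, y = 80) from Section 4.3 as an illustration:

```python
from math import sqrt

def binomial_ci(y, n, z=1.96):
    """Approximate CI (4.5): theta_hat +/- z*sqrt(theta_hat*(1 - theta_hat)/n)."""
    th = y / n
    half = z * sqrt(th * (1 - th) / n)
    return th - half, th + half

lo, hi = binomial_ci(80, 200)      # Poll 1 data from Section 4.3
print(round(lo, 3), round(hi, 3))  # close to the 10% likelihood interval [0.33, 0.47]
```

That the approximate 95% confidence interval nearly matches the 10% likelihood interval for the same data anticipates the connection made precise in Section 4.6.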
Example 4.4.4
Suppose we want to estimate the probability θ from a Binomial experiment in which Y has a Binomial(n, θ) distribution. We use the asymptotic pivotal quantity

    Qₙ = (θ̃ − θ) / √(θ̃(1 − θ̃)/n)

which was introduced in Example 4.4.3 and which has approximately a G(0, 1) distribution for large n, to obtain confidence intervals for θ.
Here is a criterion that is widely used for choosing the size of n: choose n large enough so that the width of a 95% confidence interval for θ is no wider than 2(0.03) = 0.06. Let us see where this leads and why this rule is used.
From Example 4.4.3, we know that

    [θ̂ − 1.96 √(θ̂(1 − θ̂)/n), θ̂ + 1.96 √(θ̂(1 − θ̂)/n)]   (4.6)

is an approximate 0.95 confidence interval for θ and that the width of this interval is

    2(1.96) √(θ̂(1 − θ̂)/n)
To make this confidence interval narrower than 2(0.03), we need n large enough so that

    1.96 √(θ̂(1 − θ̂)/n) ≤ 0.03

or

    n ≥ (1.96/0.03)² θ̂(1 − θ̂)   (4.7)

Of course we don’t know what θ̂ is because we have not taken a sample, but we note that the interval (4.6) is the widest when θ̂ = 0.5. To ensure that the inequality (4.7) holds for all values of θ̂, we find n such that

    n ≥ (1.96/0.03)² (0.5)(0.5) ≈ 1067.1

Thus, choosing n = 1068 (or larger) will result in an approximate 95% confidence interval of the form θ̂ ± c, where c ≤ 0.03.
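The sample-size calculation can be written as a small helper. A Python sketch (the tiny epsilon is just a defensive choice against floating-point rounding landing a hair above an integer):

```python
from math import ceil

def sample_size(half_width, z=1.96, theta=0.5):
    """Smallest n with z*sqrt(theta*(1-theta)/n) <= half_width; theta = 0.5 is the worst case."""
    n = (z / half_width) ** 2 * theta * (1 - theta)
    return ceil(n - 1e-9)  # guard against rounding just above an integer

print(sample_size(0.03))  # 1068
print(sample_size(0.05))  # 385
print(sample_size(0.02))  # 2401
```

The second and third calls reproduce the figures quoted below for accuracy to within 5 and 2 percentage points.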
If you look or listen carefully when polling results are announced, you will often hear words like “this poll is accurate to within 3 percentage points 19 times out of 20.” What this really means is that the estimator θ̃ (which is usually given in percentage form) approximately satisfies P(|θ̃ − θ| ≤ 0.03) = 0.95, or equivalently, that the actual estimate θ̂ is the centre of an approximate 95% confidence interval θ̂ ± c, for which c = 0.03. In practice, many polls are based on 1050 to 1100 people, giving “accuracy to within 3 percent” with probability 0.95. Of course, one needs to be able to afford to collect a sample of this size. If we were satisfied with an accuracy of 5 percent, then we’d only need n = 385 (can you show this?). In many situations, however, this might not be sufficiently accurate for the purpose of the study.
Exercise Show that to ensure that the width of the approximate 95% confidence interval is 2(0.02) = 0.04 or smaller, you need n = 2401. What should n be to ensure the width of a 99% confidence interval is less than 2(0.02) = 0.04?
Remark Very large Binomial polls (n ≥ 2000) are not done very often. Although we can in theory estimate θ very precisely with an extremely large poll, there are two problems:
2. In many settings the value of θ fluctuates over time. A poll is at best a snapshot at one point in time.
As a result, the “real” accuracy of a poll cannot generally be made arbitrarily high.
(2) take a random sample of households in the municipality and then interview a member of each household.
If a random sample is used it is estimated that each interview will take approximately 20 minutes (travel time plus interview time). If a census is used it is estimated that each interview will take approximately 10 minutes since there is less travel time. We can summarize the costs and precision one would obtain for one question on the form which asks whether a person agrees/disagrees with a statement about the funding levels for higher education. Let θ be the proportion in the population who agree. Suppose we decide that a “good” estimate of θ is one that is accurate to within 2% of the true value 95% of the time.
For a census, six interviews can be completed in one hour. At $20 per hour the interviewer cost for the census is approximately

    (50000/6) × $20 = $166,667

since there are 50,000 households.
For a random sample, three interviews can be completed in one hour. An approximate 95% confidence interval for θ of the form θ̂ ± 0.02 requires n = 2401. The cost of the random sample of size n = 2401 is

    (2401/3) × $20 ≈ $16,000

as compared to $166,667 for the census, which is more than ten times the cost of the random sample!
Of course, we have also not compared the costs of processing 50,000 versus 2401 surveys but it is obvious again that the random sample will be less costly and time consuming.
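The cost comparison is simple arithmetic; a Python sketch using the rates from the text:

```python
# Interviewer cost at $20/hour: the census does 6 interviews/hour over 50,000 households,
# the random sample does 3 interviews/hour over n = 2401 households.
census_cost = 50000 / 6 * 20
sample_cost = 2401 / 3 * 20
print(round(census_cost))                   # 166667
print(round(sample_cost))                   # 16007, i.e. about $16,000
print(round(census_cost / sample_cost, 1))  # the census costs over 10 times as much
```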
4.5 The Chi-squared and t Distributions
The χ² (Chi-squared) Distribution
To define the Chi-squared distribution we first recall the Gamma function and its properties:

    Γ(α) = ∫₀^∞ y^(α−1) e^(−y) dy   for α > 0
A random variable X is said to have a Chi-squared distribution with parameter k, written X ~ χ²(k), if its probability density function is

    f(x; k) = (1 / (2^(k/2) Γ(k/2))) x^((k/2) − 1) e^(−x/2)   for x > 0   (4.8)
k is referred to as the “degrees of freedom” (d.f.) parameter. In Figure 4.5 you see the characteristic shapes of the Chi-squared probability density functions. For k = 2, the probability density function is the Exponential(2) probability density function. For k > 2, the probability density function is unimodal with maximum value at x = k − 2. For values of k ≥ 30, the probability density function resembles that of a N(k, 2k) probability density function.
The cumulative distribution function, F(x; k), can be given in closed algebraic form for even values of k. Probabilities for the χ²(k) distribution are provided in the Chi-squared table at the end of these Course Notes. In R the function dchisq(x,k) gives the probability density function f(x; k), pchisq(x,k) gives the cumulative distribution function F(x; k) = P(X ≤ x; k), and qchisq(p,k) gives the value a such that P(X ≤ a; k) = p.
Figure 4.5: Probability density functions f(x; k) of the Chi-squared distribution for k = 1, 2, 15 and 30

If X ~ χ²(k) then E(X) = k and E(X²) = k(k + 2), and therefore

    Var(X) = E(X²) − [E(X)]² = k(k + 2) − k² = 2k
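Equation (4.8) can be checked numerically against these moment results. A Python sketch using a crude midpoint rule (k = 5 and the integration grid are arbitrary choices):

```python
from math import gamma, exp

def chi2_pdf(x, k):
    """Chi-squared probability density function (4.8)."""
    return x ** (k / 2 - 1) * exp(-x / 2) / (2 ** (k / 2) * gamma(k / 2))

k, h = 5, 0.001                              # k = 5 is an arbitrary test case
xs = [h * (i + 0.5) for i in range(100000)]  # midpoint rule on (0, 100)
ps = [chi2_pdf(x, k) for x in xs]
total = sum(p * h for p in ps)
mean_ = sum(x * p * h for x, p in zip(xs, ps))
var_ = sum(x * x * p * h for x, p in zip(xs, ps)) - mean_ ** 2
print(round(total, 4), round(mean_, 3), round(var_, 3))  # about 1.0, 5.0 and 10.0
```

The density integrates to 1, the mean comes out near k = 5 and the variance near 2k = 10, as the formulas above require.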
The following results will also be very useful.
Theorem 30 If Z ~ G(0, 1) then W = Z² ~ χ²(1).
Proof. Suppose W = Z² where Z ~ G(0, 1). Let Φ represent the cumulative distribution function of a G(0, 1) random variable and let φ represent the probability density function of a G(0, 1) random variable. For w > 0,

    P(W ≤ w) = P(−√w ≤ Z ≤ √w) = Φ(√w) − Φ(−√w)

and therefore the probability density function of W is

    d/dw P(W ≤ w) = d/dw [Φ(√w) − Φ(−√w)]
                  = [φ(√w) + φ(−√w)] (1/2) w^(−1/2)
                  = (1/√(2π)) w^(−1/2) e^(−w/2)

which is the probability density function (4.8) with k = 1.
Corollary If Z₁, Z₂, …, Zₙ are independent G(0, 1) random variables then Σᵢ₌₁ⁿ Zᵢ² ~ χ²(n).
Proof. Since Zᵢ ~ G(0, 1), by Theorem 30, Zᵢ² ~ χ²(1) and the result follows by Theorem 29.
Useful Results:
1. If W ~ χ²(1) then P(W ≥ w) = 2[1 − P(Z ≤ √w)] where Z ~ G(0, 1).
Student’s t Distribution
Student’s t distribution (or more simply the t distribution) has probability density function

    f(t; k) = cₖ (1 + t²/k)^(−(k+1)/2)   for t ∈ ℝ and k = 1, 2, …

where cₖ is a normalizing constant. The parameter k is called the degrees of freedom. We write T ~ t(k) to indicate that the random variable T has a t distribution with k degrees of freedom. In Figure 4.6 the probability density function f(t; k) for k = 2 and k = 25 is plotted together with the G(0, 1) probability density function.
Figure 4.6: Probability density functions for t(k) (solid blue) and G(0, 1) (dashed red)
The t probability density function is similar to that of the G(0, 1) distribution in several respects: it is symmetric about the origin, it is unimodal, and indeed for large values of k the graph of the probability density function f(t; k) is indistinguishable from that of the G(0, 1) probability density function. The primary difference, for small k such as the one plotted, is in the tails of the distribution. The t probability density function has fatter “tails”, or more area in the extreme left and right tails. Problem 21 at the end of this chapter considers some properties of f(t; k).
Probabilities for the t distribution are provided in the t table at the end of these Course
Notes. In R the function dt(t,k) gives the probability density function f (t; k), pt(t,k)
gives the cumulative distribution function F (t; k) = P (T t; k), and qt(p,k) gives the
value a such that P (T a; k) = p.
The t distribution arises as a result of the following theorem. The proof of this theorem is beyond the scope of this course.
Theorem Suppose Z ~ G(0, 1) and U ~ χ²(k) are independent random variables. Then

    T = Z / √(U/k)

has a t distribution with k degrees of freedom.
4.6 Likelihood-Based Confidence Intervals
The relative likelihood function R(θ) = L(θ)/L(θ̂) is a function of the maximum likelihood estimate θ̂. Replace the estimate θ̂ by the random variable (the estimator) θ̃ and define the random variable Λ(θ) by

    Λ(θ) = −2 log [L(θ) / L(θ̃)]

where θ̃ is the maximum likelihood estimator. The random variable Λ(θ) is called the likelihood ratio statistic. The following theorem implies that Λ(θ) is an asymptotic pivotal quantity.
This theorem means that ( ) can be used as a pivotal quantity for su¢ ciently large n
in order to obtain approximate con…dence intervals for . More importantly we can use this
result to show that the likelihood intervals discussed in Section 4.3 are also approximate
con…dence intervals.
By Theorem 33 the confidence coefficient for this interval can be approximated by

    P[ Λ(θ) ≤ −2 log p ] = P[ −2 log( L(θ)/L(θ̃) ) ≤ −2 log p ]
                         ≈ P( W ≤ −2 log p )   where W ~ χ²(1)
                         = P( |Z| ≤ √(−2 log p) )   where Z ~ N(0, 1)
                         = 2 P( Z ≤ √(−2 log p) ) − 1

as required.
Exercise
(a) Show that a 1% likelihood interval is an approximate 99.8% confidence interval.
(b) Show that a 50% likelihood interval is an approximate 76% confidence interval.
Theorem 33 can also be used to find a likelihood interval which is also an approximate 100p% confidence interval.
Theorem 35 If a is a value such that p = 2P(Z ≤ a) − 1 where Z ~ N(0, 1), then the likelihood interval {θ : R(θ) ≥ e^(−a²/2)} is an approximate 100p% confidence interval.

Proof. The confidence coefficient corresponding to the interval {θ : R(θ) ≥ e^(−a²/2)} is

    P[ L(θ)/L(θ̃) ≥ e^(−a²/2) ] = P[ −2 log( L(θ)/L(θ̃) ) ≤ a² ]
                               ≈ P( W ≤ a² )   where W ~ χ²(1), by Theorem 33
                               = 2P(Z ≤ a) − 1   where Z ~ N(0, 1)
                               = p

as required.
Example
Since

    0.95 = 2P(Z ≤ 1.96) − 1   where Z ~ N(0, 1)

and

    e^(−(1.96)²/2) = e^(−1.9208) ≈ 0.1465 ≈ 0.15

therefore a 15% likelihood interval for θ is also an approximate 95% confidence interval for θ.
Exercise
(a) Show that a 26% likelihood interval is an approximate 90% confidence interval.
(b) Show that a 4% likelihood interval is an approximate 99% confidence interval.
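The conversions in these exercises can be checked numerically. The sketch below uses Python's statistics.NormalDist in place of the R functions pnorm and qnorm; the two functions invert one another via r = e^(−a²/2) and p = 2P(Z ≤ a) − 1.

```python
# Sketch of the likelihood-interval <-> confidence-interval conversion of Theorem 35.
from math import exp, log, sqrt
from statistics import NormalDist

Z = NormalDist()  # G(0,1)

def confidence_from_likelihood(r):
    """Approximate confidence coefficient of a 100r% likelihood interval."""
    a = sqrt(-2 * log(r))        # r = e^{-a^2/2}  =>  a = sqrt(-2 log r)
    return 2 * Z.cdf(a) - 1      # p = 2 P(Z <= a) - 1

def likelihood_from_confidence(p):
    """Likelihood level whose interval is an approximate 100p% confidence interval."""
    a = Z.inv_cdf((1 + p) / 2)   # P(Z <= a) = (1+p)/2
    return exp(-a * a / 2)

print(round(confidence_from_likelihood(0.01), 3))   # 1% likelihood -> ~99.8% confidence
print(round(confidence_from_likelihood(0.50), 2))   # 50% likelihood -> ~76% confidence
print(round(likelihood_from_confidence(0.95), 4))   # 95% confidence -> ~0.1465 likelihood level
```
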
Suppose n = 100 and y = 40 so that θ̂ = 40/100 = 0.4. From the graph of the relative likelihood function given in Figure 4.7 we can read off the 15% likelihood interval as [0.31, 0.495], which is also an approximate 95% confidence interval.
Figure 4.7: Relative likelihood function for Binomial with n = 100 and y = 40
4.6. LIKELIHOOD-BASED CONFIDENCE INTERVALS 161
The approximate 95% confidence interval

    θ̂ ± 1.96 √( θ̂(1 − θ̂)/n )    (4.9)

is [0.304, 0.496]. The two intervals differ slightly but are very close.
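The endpoints of the 15% likelihood interval can also be found numerically rather than read off the graph. The sketch below uses Python bisection in place of the notes' R function uniroot, and compares the result with the interval (4.9); the tolerance and bracketing values are implementation choices, not from the notes.

```python
# Sketch: 15% likelihood interval for Binomial(n=100, y=40) by bisection,
# compared with the approximate Gaussian interval (4.9).
from math import log, sqrt

n, y = 100, 40
theta_hat = y / n

def log_rel_lik(theta):
    """log R(theta) for the Binomial relative likelihood."""
    return y * log(theta / theta_hat) + (n - y) * log((1 - theta) / (1 - theta_hat))

def root(lo, hi, level=0.15, tol=1e-8):
    """Bisection for log R(theta) = log(level) on [lo, hi]."""
    f = lambda t: log_rel_lik(t) - log(level)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(lo) * f(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

lower, upper = root(1e-6, theta_hat), root(theta_hat, 1 - 1e-6)
half = 1.96 * sqrt(theta_hat * (1 - theta_hat) / n)
print([round(lower, 3), round(upper, 3)])                        # close to the graph-read [0.31, 0.495]
print([round(theta_hat - half, 3), round(theta_hat + half, 3)])  # interval (4.9)
```
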
If n = 30 and θ̂ = 0.1 then from Figure 4.8 the 15% likelihood interval is [0.03, 0.24], which is also an approximate 95% confidence interval. The approximate 95% confidence
[Figure 4.8: Relative likelihood function R(θ) for Binomial with n = 30 and θ̂ = 0.1]
interval based on (4.9) is [−0.0074, 0.2074], which is quite different from the likelihood-based approximate confidence interval and which also contains negative values for θ. Of course θ can only take on values between 0 and 1. This happens because the confidence interval in (4.9) is always symmetric about θ̂, and if θ̂ is close to 0 or 1 and n is not very large then the interval can contain values less than 0 or bigger than 1. The graph of the relative likelihood function in Figure 4.8 is not symmetric about θ̂. In this case the 15% likelihood interval is a better summary of the values which are supported by the data.
More generally, if θ̂ is close to 0.5 or n is large then the likelihood interval will be fairly symmetric about θ̂ and there will be little difference between the two approximate confidence intervals. If θ̂ is close to 0 or 1 and n is not large then the likelihood interval will not be symmetric about θ̂ and the two approximate confidence intervals will not be similar. In this case the 15% likelihood interval will be a better summary of the values which are supported by the data.
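The n = 30, θ̂ = 0.1 case above can be reproduced directly; a crude grid search (an implementation choice for this sketch, not from the notes) locates the likelihood interval, showing that it stays inside (0, 1) while the interval (4.9) does not.

```python
# Sketch: with n = 30 and y = 3 (theta_hat = 0.1) the interval (4.9) escapes the
# parameter space while the 15% likelihood interval does not.
from math import sqrt

n, y = 30, 3
theta_hat = y / n

half = 1.96 * sqrt(theta_hat * (1 - theta_hat) / n)
wald = (theta_hat - half, theta_hat + half)
print([round(t, 4) for t in wald])   # lower endpoint is negative, as in the text

def rel_lik(theta):
    """Binomial relative likelihood R(theta)."""
    return (theta / theta_hat) ** y * ((1 - theta) / (1 - theta_hat)) ** (n - y)

# grid approximation to the set {theta : R(theta) >= 0.15}
grid = [i / 10000 for i in range(1, 10000)]
inside = [t for t in grid if rel_lik(t) >= 0.15]
print(round(min(inside), 3), round(max(inside), 3))  # close to the graph-read [0.03, 0.24]
```
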
    T = (Ȳ − μ) / (S/√n) ~ t(n − 1)    (4.12)
4.7. CONFIDENCE INTERVALS FOR PARAMETERS IN THE G(μ, σ) MODEL 163
Note: The random variable T is a pivotal quantity since it is a function of the data Y1, Y2, …, Yn and the unknown parameter μ whose distribution t(n − 1) is completely known.

To see how

    T = (Ȳ − μ)/(S/√n) ~ t(n − 1)

follows from Theorem 32, let

    Z = (Ȳ − μ)/(σ/√n) ~ G(0, 1)

and

    U = (n − 1)S²/σ²

We choose this function of S² since it can be shown that U ~ χ²(n − 1). It can also be shown that Z and U are independent random variables. The proofs of these very important results are beyond the scope of this course and are covered in a third year mathematical statistics course.

By Theorem 32 with k = n − 1, we have

    T = Z/√(U/k) = [ (Ȳ − μ)/(σ/√n) ] / √(S²/σ²) = (Ȳ − μ)/(S/√n) ~ t(n − 1)
In other words, if we replace σ in the pivotal quantity (4.10) by its estimator S, the resulting pivotal quantity has a t(n − 1) distribution rather than a G(0, 1) distribution. The degrees of freedom of the t distribution are determined by the degrees of freedom of the Chi-squared random variable U.
We now show how to use the pivotal quantity (4.12) to obtain a confidence interval for μ when σ is unknown. Since the t distribution is symmetric, we determine the constant a such that P(−a ≤ T ≤ a) = p using the t table provided in these Course Notes or R. Note that, due to symmetry, P(−a ≤ T ≤ a) = p is equivalent to P(T ≤ a) = (1 + p)/2 (you should verify this), and since the t table tabulates the cumulative distribution function P(T ≤ t), it is easier to find a such that P(T ≤ a) = (1 + p)/2. Then since

    p = P(−a ≤ T ≤ a)
      = P( −a ≤ (Ȳ − μ)/(S/√n) ≤ a )
      = P( Ȳ − aS/√n ≤ μ ≤ Ȳ + aS/√n )
a 100p% confidence interval for μ is [ȳ − as/√n, ȳ + as/√n]. (Note that if we attempted to use (4.10) to build a confidence interval we would have two unknowns in the inequality since both μ and σ are unknown.) As usual, the method used to construct this interval implies that 100p% of the confidence intervals constructed from samples drawn from this population contain the true value of μ.

We note that this interval is of the form ȳ ± as/√n. Recall that a confidence interval for μ in the case of a G(μ, σ) population when σ is known has a similar form, except that the standard deviation of the estimator is known in that case and the value of a is taken from a G(0, 1) distribution rather than the t distribution.
[Figure 4.9: qqplot of the sample quantiles versus G(0, 1) quantiles]
and a qqplot of the data are given in Figure 4.9. Since the points in the qqplot lie reasonably along a straight line, with more variability at both ends (which is expected), we would conclude that a Gaussian model is reasonable for these data.
Since

    P(T ≤ 2.0086) = (1 + 0.95)/2 = 0.975   for T ~ t(50)

a 95% confidence interval for μ based on (4.13) is

    ȳ ± 2.0086 s/√51 = 150.1412 ± (2.0086)(5.3302)/√51
                     = 150.1412 ± 1.4992
                     = [148.6420, 151.6403]
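This computation can be reproduced directly; the sketch below takes the t(50) quantile a = 2.0086 from the t table as a given constant, since Python's standard library (used here in place of R's qt) has no t quantile function.

```python
# Sketch of the interval ybar +/- a*s/sqrt(n) for the example above.
from math import sqrt

n, ybar, s = 51, 150.1412, 5.3302
a = 2.0086                     # P(T <= a) = 0.975 for T ~ t(50), from the t table

half = a * s / sqrt(n)
print(round(half, 4))                                  # close to 1.4992
print([round(ybar - half, 4), round(ybar + half, 4)])  # close to [148.6420, 151.6403]
```
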
We would usually choose n a little larger than this formula gives, to accommodate the fact that we used G(0, 1) quantiles rather than the quantiles of the t distribution, which are larger in value, and that we only know σ approximately.
Confidence intervals for σ² and σ

    U = (n − 1)S²/σ² = (1/σ²) Σ_{i=1}^{n} (Y_i − Ȳ)² ~ χ²(n − 1)    (4.14)

since S² = (1/(n − 1)) Σ_{i=1}^{n} (Y_i − Ȳ)².
Note: The random variable U is a pivotal quantity since it is a function of the data Y1, Y2, …, Yn and the unknown parameter σ² whose distribution χ²(n − 1) is completely known.
While the proof of this result is beyond the scope of this course, we will try to explain the puzzling number of degrees of freedom, n − 1, which at first glance seems wrong since Σ_{i=1}^{n} (Y_i − Ȳ)² is the sum of n squared Normal random variables. Does this contradict Corollary 31? It is true that each W_i = Y_i − Ȳ is a Normally distributed random variable. However W_i does not have a N(0, 1) distribution and, more importantly, the W_i's are not independent! (See Problem 23.) One way to see that W1, W2, …, Wn are not independent random variables is to note that since Σ_{i=1}^{n} W_i = 0, we have W_n = −Σ_{i=1}^{n−1} W_i, so the last term can be determined from the sum of the first n − 1 terms. Therefore in the sum Σ_{i=1}^{n} (Y_i − Ȳ)² = Σ_{i=1}^{n} W_i² there are really only n − 1 terms that are linearly independent or "free". This is an intuitive explanation for the n − 1 degrees of freedom for the pivotal quantities (4.14) and (4.12). In both cases, the degrees of freedom are determined by S² and are related to the dimension of the subspace inhabited by the terms in the sum for S², that is, the terms W_i = Y_i − Ȳ, i = 1, 2, …, n.
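A small simulation (a sketch, not from the notes; sample size, seed, and number of repetitions are arbitrary choices) illustrates the n − 1 degrees of freedom: since U ~ χ²(n − 1), the average of U over many samples should be close to n − 1, not n.

```python
# Simulation sketch: average of U = sum (Y_i - Ybar)^2 / sigma^2 over many
# Gaussian samples of size n is close to n - 1, not n.
import random

random.seed(1)
n, sigma, nsim = 5, 2.0, 4000
total = 0.0
for _ in range(nsim):
    y = [random.gauss(10.0, sigma) for _ in range(n)]
    ybar = sum(y) / n
    total += sum((yi - ybar) ** 2 for yi in y) / sigma ** 2
print(round(total / nsim, 2))  # close to n - 1 = 4
```
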
We will now show how to use the pivotal quantity (4.14) to construct a 100p% confidence interval for the parameter σ² or σ. Using the Chi-squared table or R we can find constants a and b such that

    P(a ≤ U ≤ b) = p

Then

    p = P(a ≤ U ≤ b)
      = P( a ≤ (n − 1)S²/σ² ≤ b )
      = P( (n − 1)S²/b ≤ σ² ≤ (n − 1)S²/a )
      = P( √((n − 1)S²/b) ≤ σ ≤ √((n − 1)S²/a) )

so that a 100p% confidence interval for σ² is

    [ (n − 1)s²/b , (n − 1)s²/a ]    (4.15)

and a 100p% confidence interval for σ is

    [ √((n − 1)s²/b) , √((n − 1)s²/a) ]    (4.16)

The choice of a and b is not unique. For convenience, a and b are usually chosen such that

    P(U ≤ a) = P(U > b) = (1 − p)/2    (4.17)

where U ~ χ²(n − 1). Note that since the Chi-squared table provided in these Course Notes tabulates the cumulative distribution function P(U ≤ u), this means using the table to find a and b such that

    P(U ≤ a) = (1 − p)/2   and   P(U ≤ b) = p + (1 − p)/2 = (1 + p)/2

The intervals (4.15) and (4.16) are called equal-tailed confidence intervals. The choice (4.17) for a and b does not give the narrowest confidence interval; the narrowest interval must be found numerically. For large n the equal-tailed interval and the narrowest interval are nearly the same.
Note that, unlike confidence intervals for μ, the confidence interval for σ² is not symmetric about s², the estimate of σ². This happens of course because the χ²(n − 1) distribution is not a symmetric distribution.
In some applications we are interested in an upper bound on σ (because small σ is "good" in some sense). In this case we take b = ∞ and find a such that P(a ≤ U) = p, or P(U ≤ a) = 1 − p, so that a one-sided 100p% confidence interval for σ is

    [ 0 , s √((n − 1)/a) ]
so a = 5.63 and b = 26.12. Substituting these values along with (14)s² = 0.002347 into (4.16) we obtain

    [ 0.013 √(14/26.119) , 0.013 √(14/5.629) ] = [0.00952, 0.0205]
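This interval for σ can be reproduced directly from (4.16); the sketch below takes the χ²(14) quantiles a = 5.629 and b = 26.119 from the table as given constants (the Python standard library, used here in place of R's qchisq, has no Chi-squared quantile function).

```python
# Sketch of the interval (4.16) for sigma with table quantiles for chi-squared(14).
from math import sqrt

n = 15
s2 = 0.002347 / 14          # since (n - 1) s^2 = 0.002347
a, b = 5.629, 26.119        # P(U <= a) = 0.025, P(U <= b) = 0.975 for U ~ chi^2(14)

lower = sqrt((n - 1) * s2 / b)
upper = sqrt((n - 1) * s2 / a)
print([round(lower, 4), round(upper, 4)])  # close to [0.00952, 0.0205] above
```
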
    E(Y − Ȳ) = μ − μ = 0

and variance

    Var(Y − Ȳ) = Var(Y) + Var(Ȳ) = σ² + σ²/n

Since

    (Y − Ȳ) / ( σ √(1 + 1/n) ) ~ G(0, 1)

independently of

    (n − 1)S²/σ² ~ χ²(n − 1)

then by Theorem 32

    [ (Y − Ȳ) / ( σ √(1 + 1/n) ) ] / √(S²/σ²) = (Y − Ȳ) / ( S √(1 + 1/n) ) ~ t(n − 1)
    p = P(−a ≤ T ≤ a)
      = P( −a ≤ (Y − Ȳ)/( S √(1 + 1/n) ) ≤ a )
      = P( Ȳ − aS √(1 + 1/n) ≤ Y ≤ Ȳ + aS √(1 + 1/n) )

therefore

    [ ȳ − as √(1 + 1/n) , ȳ + as √(1 + 1/n) ]    (4.18)

is an interval of values for the future observation Y with confidence coefficient p. The interval (4.18) is called a 100p% prediction interval rather than a confidence interval, since Y is not a parameter but a random variable. Note that the interval (4.18) is wider than a 100p% confidence interval for the mean μ. This makes sense since μ is an unknown constant with no variability, while Y is a random variable with its own variability, Var(Y) = σ².
A 100p% prediction interval summarizes a set of values for an unknown future observation (a random variable) based on the observed data. Confidence intervals are for unknown but fixed parameters (not random variables). The procedure for constructing the prediction interval is based on the probability statement

    P[ L(Y1, Y2, …, Yn) ≤ Y ≤ U(Y1, Y2, …, Yn) ] = p    (4.19)

where Y (a random variable) is the future observation and Y1, Y2, …, Yn (random variables) are the data from an experiment. Suppose you conduct the experiment once and observe the data y. The constructed interval based on the probability statement (4.19) and the observed data y is [L(y), U(y)].

To interpret a prediction interval, suppose you conducted the same experiment independently a large number of times and each time you constructed the interval [L(y), U(y)] based on your observed data y. (Of course y won't be the same every time you conduct the experiment.) Approximately 100p% of these constructed intervals would contain the future unknown observation. Of course, you usually only conduct the experiment once and you only have one interval [L(y), U(y)]. You would then say that you are 100p% confident that your constructed interval contains the value of the future observation.
Note that this interval is much wider than a 95% confidence interval for μ, the mean of the population of lens thicknesses produced by this manufacturing process, which is given by

    25.009 ± 2.1448 (0.013)/√15 = 25.009 ± 0.0072 = [25.0018, 25.0162]
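The widths of the two intervals can be compared directly; a sketch using the table value a = 2.1448 for t(14) (Python standing in for R, with the quantile hardcoded from the t table). With n = 15 the prediction interval's half-width is √(n + 1) = 4 times the confidence interval's, since as√(1 + 1/n) = (as/√n)·√(n + 1).

```python
# Sketch: 95% confidence interval for mu versus 95% prediction interval (4.18)
# for the lens data.
from math import sqrt

n, ybar, s = 15, 25.009, 0.013
a = 2.1448                  # P(T <= a) = 0.975 for T ~ t(14), from the t table

ci_half = a * s / sqrt(n)               # half-width of confidence interval for mu
pi_half = a * s * sqrt(1 + 1 / n)       # half-width of prediction interval for Y
print(round(ci_half, 4), round(pi_half, 4))  # ~0.0072 vs ~0.0288
print([round(ybar - pi_half, 4), round(ybar + pi_half, 4)])
```
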
R.)
These results are derived from the fact that −2 log R(θ; Y) is an asymptotic pivotal quantity with approximately a χ²(1) distribution.
Table 4.3
Approximate Confidence Intervals for Named Distributions based on Asymptotic Gaussian Pivotal Quantities

Named            Observed         Point         Point         Asymptotic Gaussian           Approximate 100p%
Distribution     Data             Estimate θ̂    Estimator θ̃   Pivotal Quantity              Confidence Interval
Binomial(n, θ)   y                y/n           Y/n           (θ̃ − θ)/√(θ̃(1 − θ̃)/n)        θ̂ ± a √(θ̂(1 − θ̂)/n)
Poisson(θ)       y1, y2, …, yn    ȳ             Ȳ             (θ̃ − θ)/√(θ̃/n)               θ̂ ± a √(θ̂/n)
Exponential(θ)   y1, y2, …, yn    ȳ             Ȳ             (θ̃ − θ)/(θ̃/√n)               θ̂ ± a θ̂/√n

Note: The value a is given by P(Z ≤ a) = (1 + p)/2 where Z ~ G(0, 1). In R, a = qnorm((1+p)/2).
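The three intervals in Table 4.3 can be sketched as follows, with statistics.NormalDist supplying the quantile a in place of R's qnorm. The Binomial and Poisson inputs echo earlier examples in this chapter; the Exponential inputs are made-up illustrative values, not from the notes.

```python
# Sketch of the approximate 100p% confidence intervals of Table 4.3.
from math import sqrt
from statistics import NormalDist

def gaussian_a(p):
    """Quantile a with P(Z <= a) = (1+p)/2, Z ~ G(0,1); same as qnorm((1+p)/2)."""
    return NormalDist().inv_cdf((1 + p) / 2)

def binomial_ci(y, n, p=0.95):
    th = y / n
    h = gaussian_a(p) * sqrt(th * (1 - th) / n)
    return (th - h, th + h)

def poisson_ci(ybar, n, p=0.95):
    h = gaussian_a(p) * sqrt(ybar / n)
    return (ybar - h, ybar + h)

def exponential_ci(ybar, n, p=0.95):
    h = gaussian_a(p) * ybar / sqrt(n)
    return (ybar - h, ybar + h)

print([round(t, 3) for t in binomial_ci(40, 100)])   # ~[0.304, 0.496], as in Section 4.6
print([round(t, 3) for t in poisson_ci(5.0, 25)])    # ybar = 5, n = 25
print([round(t, 3) for t in exponential_ci(2.0, 50)])  # hypothetical ybar = 2, n = 50
```
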
4.8. CHAPTER 4 SUMMARY 173
Table 4.4
Confidence/Prediction Intervals for Gaussian and Exponential Models

Model                          Pivotal Quantity                        100p% Interval
G(μ, σ), σ known               (Ȳ − μ)/(σ/√n) ~ G(0, 1)               for μ: ȳ ± a σ/√n
G(μ, σ), σ unknown             (Ȳ − μ)/(S/√n) ~ t(n − 1)              for μ: ȳ ± b s/√n
G(μ, σ), μ, σ unknown          (Y − Ȳ)/(S√(1 + 1/n)) ~ t(n − 1)       100p% prediction interval for Y: ȳ ± b s√(1 + 1/n)
G(μ, σ), μ unknown             (n − 1)S²/σ² ~ χ²(n − 1)               for σ²: [(n − 1)s²/d, (n − 1)s²/c]
G(μ, σ), μ unknown             (n − 1)S²/σ² ~ χ²(n − 1)               for σ: [√((n − 1)s²/d), √((n − 1)s²/c)]
Exponential(θ)                 2nȲ/θ ~ χ²(2n)                         for θ: [2nȳ/d1, 2nȳ/c1]

Notes:
(1) The value a is given by P(Z ≤ a) = (1 + p)/2 where Z ~ G(0, 1). In R, a = qnorm((1+p)/2).
(2) The value b is given by P(T ≤ b) = (1 + p)/2 where T ~ t(n − 1). In R, b = qt((1+p)/2, n-1).
(3) The values c and d are given by P(W ≤ c) = (1 − p)/2 = P(W > d) where W ~ χ²(n − 1). In R, c = qchisq((1-p)/2, n-1) and d = qchisq((1+p)/2, n-1).
(4) The values c1 and d1 are given by P(W ≤ c1) = (1 − p)/2 = P(W > d1) where W ~ χ²(2n). In R, c1 = qchisq((1-p)/2, 2*n) and d1 = qchisq((1+p)/2, 2*n).
(a) Run the R code and compare the output with the answers in Example 4.2.3.
(b) Run the R code replacing n<-15 with n<-30 and compare the results with
those for n = 15.
(c) Explain how the mean, standard deviation and symmetry of the original population affect the histogram of simulated means.
(d) Explain how the sample size n affects the histogram of simulated means.
abline(a=0.15,b=0,col="red",lwd=2)
title(main="Binomial Likelihood for y=15 and n=40")
Modify this code for y = 75 successes in n = 200 trials and y = 150 successes in
n = 400 trials and observe what happens to the width of the 15% likelihood interval.
thetahat<-5
n<-25
theta<-seq(3.7,6.5,0.001)
Rtheta<-exp(n*thetahat*log(theta/thetahat)+n*(thetahat-theta))
plot(theta,Rtheta,type="l")
# draw a horizontal line at 0.15
abline(a=0.15,b=0,col="red",lwd=2)
title(main="Poisson Likelihood for ybar=5 and n=25")
Modify this code for larger sample sizes n = 100 and n = 400, and observe what
happens to the width of the 15% likelihood interval.
4. For Chapter 2, Problem 4(b) determine a 15% likelihood interval for θ. The likelihood interval can be found from the graph of R(θ) or by using the function uniroot in R.

5. For Chapter 2, Problem 6(b) determine a 15% likelihood interval for θ. The likelihood interval can be found from the graph of R(θ) or by using the function uniroot in R.

6. For Chapter 2, Problem 8(b) determine a 15% likelihood interval for θ. The likelihood interval can be found from the graph of r(θ) or by using the function uniroot in R.
7. (a) For Chapter 2, Problem 9 plot the relative likelihood function R(θ) and determine a 10% likelihood interval. The likelihood interval can be found from the graph of R(θ) or by using the function uniroot in R. How well can θ be determined based on these data?

(b) Suppose that we can find out whether each pair of twins is identical or not, and that it is determined that of 50 pairs, 17 were identical. Obtain the likelihood function, the maximum likelihood estimate and a 10% likelihood interval for θ in this case. Plot the relative likelihood function on the same graph as the one in (a), and compare how well θ can be determined based on the two data sets.

8. For Chapter 2, Problem 12(c) determine a 15% likelihood interval for θ. The likelihood interval can be found from the graph of R(θ) or by using the function uniroot in R.
9. Suppose that a fraction θ of a large population of persons are infected with a certain virus. Let n and k be integers. Suppose that blood samples for nk people are to be tested to obtain information about θ. In order to save time and money, pooled testing is used, that is, samples are mixed together k at a time to give a total of n pooled samples. A pooled sample will test negative if all k individuals in that sample are not infected.

(a) Find the probability that y out of n samples will be negative, if the nk people are a random sample from the population. State any assumptions you make.
(b) Obtain a general expression for the maximum likelihood estimate of θ in terms of n, k and y.
(c) Suppose n = 100, k = 10 and y = 89. Find the maximum likelihood estimate of θ, and a 10% likelihood interval for θ.
10. Suppose that a fraction θ of a large population of persons over 18 years of age never drink alcohol. In order to estimate θ, a random sample of n persons is to be selected and the number y who do not drink determined; the maximum likelihood estimate of θ is then θ̂ = y/n. We want our estimate θ̂ to have a high probability of being close to θ, and want to know how large n should be to achieve this. Consider the random variable Y and the estimator θ̃ = Y/n.

(a) Determine P(−0.03 ≤ θ̃ − θ ≤ 0.03) if n = 1000 and θ = 0.5, using the Normal approximation to the Binomial. You do not need to use a continuity correction.
(b) If θ = 0.50, determine how large n should be to ensure that
(a) Run this code to determine the proportion of approximate 95% confidence intervals which contain the true value.
(b) Run this code for n<-100 and n<-1000 and observe what happens to the pro-
portion.
(c) Run this code for theta<-0.1 and observe what happens to the proportion.
12. The following excerpt is from a March 2, 2012 cbc.ca news article:
“Canadians lead in time spent online: Canadians are spending more time online
than users in 10 other countries, a new report has found. The report, 2012 Canada
Digital Future in Focus, by the internet marketing research company comScore, found
Canadians spent an average of 45.3 hours on the internet in the fourth quarter of 2011.
The report also states that smartphones now account for 45% of all mobile phone use
by Canadians.”
Assume that these results are based on a random sample of 1000 Canadians.
(a) Suppose a 95% confidence interval for μ, the mean time Canadians spent on the internet in this quarter, is reported to be [42.8, 47.8]. How should this interval be interpreted?
(b) Construct an approximate 95% confidence interval for the proportion of Canadians whose mobile phone is a smartphone.
(c) Since this study was conducted in March 2012, the research company has been asked to conduct a new survey to determine if the proportion of Canadians whose mobile phone is a smartphone has changed. What size sample should be used to ensure that the width of an approximate 95% confidence interval is less than 2(0.02)?
13. Two hundred adults are chosen at random from a population and each adult is asked
whether information about abortions should be included in high school public health
sessions. Suppose that 70% say they should.
(a) Obtain an approximate 95% confidence interval for the proportion of the population who support abortion information being included in high school public health sessions.
(b) Suppose you found out that the 200 persons interviewed consisted of 50 married
couples and 100 other persons. The 50 couples were randomly selected, as were
the other 100 persons. Discuss the validity (or non-validity) of the analysis in
(a).
14. In the United States, the prevalence of HIV (Human Immunodeficiency Virus) infections in the population of child-bearing women has been estimated by doing blood tests (anonymously) on all women giving birth in a hospital. One study tested 29,000 women and found that 64 were HIV positive (had the virus). Give an approximate 99% confidence interval for θ, the fraction of the population that is HIV positive. State any concerns you have about the accuracy of this estimate.
    (Ȳ − θ)/√(Ȳ/n)

(a) Show how this asymptotic pivotal quantity leads to an approximate 95% confidence interval for θ given by

    ȳ ± 1.96 √(ȳ/n)

(b) Use the result from (a) to construct an approximate 95% confidence interval for θ in Chapter 2, Problem 10.
(c) Compare the approximate 95% confidence interval for θ with a 15% likelihood interval. What do you notice?
16. Company A leased photocopiers to the federal government, but at the end of their
recent contract the government declined to renew the arrangement and decided to
lease from a new vendor, Company B. One of the main reasons for this decision was
a perception that the reliability of Company A’s machines was poor.
(a) Over the preceding year the monthly numbers of failures requiring a service call
from Company A were
12 14 15 16 18 19 19 22 23 25 28 29
Assuming that the number of service calls needed in a one month period has a Poisson distribution with mean θ, obtain and graph the relative likelihood function R(θ) based on the data above.
4.9. CHAPTER 4 PROBLEMS 179
(b) In the first year using Company B's photocopiers, the monthly numbers of service calls were

7 8 9 10 10 12 12 13 13 14 15 17

Under the same assumption as in part (a), obtain R(θ) for these data and graph it on the same graph as used in (a).
(c) Determine the 15% likelihood interval for θ, which is also an approximate 95% confidence interval for θ, for each company. The intervals can be obtained from the graphs of the relative likelihood functions or by using the function uniroot in R. Do you think the government's decision was a good one, as far as the reliability of the machines is concerned?
(d) What conditions would need to be satisfied to make the assumptions and analysis in (a) to (c) valid?
(e) Use the result from Problem 15 to determine approximate 95% confidence intervals for θ for each company. Compare these intervals with the intervals obtained in (c).
17. A manufacturing process produces fibers of varying lengths. The length of a fiber Y is a continuous random variable with probability density function

    f(y; θ) = (y/θ²) e^(−y/θ)   for y ≥ 0 and θ > 0

(e) Show how you would use the statement in (d) to construct an approximate 95% confidence interval for θ.
(f) Suppose n = 18 fibers were selected at random and the lengths were recorded. For these data Σ_{i=1}^{18} y_i = 88.92. Give the maximum likelihood estimate of θ and an approximate 95% confidence interval for θ using your result from (e).
18. The lifetime T (in days) of a particular type of light bulb is assumed to have a distribution with probability density function

    f(t; θ) = (1/2) θ³ t² e^(−θt)   for t > 0 and θ > 0
(a) Suppose t1, t2, …, tn is a random sample from this distribution. Find the maximum likelihood estimate θ̂ and the relative likelihood function R(θ).
(b) If n = 20 and Σ_{i=1}^{20} t_i = 996, graph R(θ) and determine the 15% likelihood interval for θ, which is also an approximate 95% confidence interval for θ. The interval can be obtained from the graph of R(θ) or by using the function uniroot in R.
(c) Suppose we wish to estimate the mean lifetime of a light bulb. Show E(T) = 3/θ. Hint: Use the Gamma function. Find an approximate 95% confidence interval for the mean.
(d) Show that the probability p that a light bulb lasts less than 50 days is

    p = p(θ) = P(T ≤ 50; θ) = 1 − e^(−50θ) (1250θ² + 50θ + 1)
(a) Use the Chi-squared table provided at the end of these Course Notes to answer the following:
(i) If X ~ χ²(10) find P(X ≤ 2.6) and P(X > 16).
(ii) If X ~ χ²(4) find P(X > 15).
(iii) If X ~ χ²(40) find P(X ≤ 24.4) and P(X ≥ 55.8). Compare these values with P(Y ≤ 24.4) and P(Y ≥ 55.8) if Y ~ N(40, 80).
(iv) If X ~ χ²(25) find a and b such that P(X ≤ a) = 0.025 and P(X > b) = 0.025.
(v) If X ~ χ²(12) find a and b such that P(X ≤ a) = 0.05 and P(X > b) = 0.05.
(b) Use the R functions pchisq(x,k) and qchisq(p,k) to check the values in (a).
(c) Determine the following WITHOUT using the Chi-squared table:
(i) If X ~ χ²(1) find P(X ≤ 2) and P(X > 1.4).
(ii) If X ~ χ²(2) find P(X ≤ 2) and P(X > 3).
(d) If X ~ G(3, 2) and Y_i ~ Exponential(2), i = 1, 2, …, 5, all independently, then what is the distribution of W = Σ_{i=1}^{5} Y_i + ((X − 3)/2)²?
(e) If X_i ~ χ²(i), i = 1, 2, …, 10, independently, then what is the distribution of Σ_{i=1}^{10} X_i?
(a) Show that this probability density function integrates to one for k = 1, 2, … using the properties of the Gamma function.
(b) Plot the probability density function for k = 5, k = 10 and k = 25 on the same
graph. What do you notice?
(c) Show that the moment generating function of X is given by

    M(t) = E(e^(tX)) = (1 − 2t)^(−k/2)   for t < 1/2

and use this to show that E(X) = k and Var(X) = 2k.
(d) Prove Theorem 29 using moment generating functions.
(a) Plot the probability density function for k = 1, 5, 25. Plot the N(0, 1) probability density function on the same graph. What do you notice?
(b) Show that f(t; k) is unimodal.
(c) Use Theorem 32 to show that E(T) = 0. Hint: If X and Y are independent random variables then E[g(X)h(Y)] = E[g(X)] E[h(Y)].
(d) Use the t table provided at the end of these Course Notes to answer the following:
(i) If T ~ t(10) find P(T ≤ 0.88), P(T > 0.88) and P(|T| ≤ 0.88).
(ii) If T ~ t(17) find P(|T| > 2.90).
(iii) If T ~ t(30) find P(T ≤ 2.04) and P(T ≥ 0.26). Compare these values with P(Z ≤ 2.04) and P(Z ≥ 0.26) if Z ~ N(0, 1).
(iv) If T ~ t(18) find a and b such that P(T ≤ a) = 0.025 and P(T > b) = 0.025.
(v) If T ~ t(13) find a and b such that P(T ≤ a) = 0.05 and P(T > b) = 0.05.
(e) Use the R functions pt(x,k) and qt(p,k) to check the values in (d).
(c) Show that Cov(W_i, W_j) = −σ²/n for all i ≠ j, which implies that the W_i's are correlated random variables and therefore not independent random variables.
    (Ȳ − θ)/(θ/√n)

has approximately a G(0, 1) distribution. It also follows that

    Q = (Ȳ − θ)/(Ȳ/√n)

has approximately a G(0, 1) distribution. Show how the asymptotic pivotal quantity Q leads to an approximate 100p% confidence interval for θ given by

    ȳ ± a ȳ/√n

where P(Z ≤ a) = (1 + p)/2 and Z ~ G(0, 1).
25. In an early study concerning survival time for patients diagnosed with Acquired Immune Deficiency Syndrome (AIDS), the survival times (i.e. times between diagnosis of AIDS and death) of 30 male patients were such that Σ_{i=1}^{30} y_i = 11,400 days. Assume that the survival times are Exponentially distributed with mean θ days.
(a) Use the result in Problem 24 to obtain an approximate 90% confidence interval for θ.
(b) Graph the relative likelihood function for these data and obtain an approximate 90% likelihood-based confidence interval for θ. Compare this with the interval obtained in (a).
(c) Show that m = θ ln 2 is the median survival time. Give an approximate 90% confidence interval for m based on your interval from (b).
(c) Show how the pivotal quantity U can be used to construct an exact confidence interval for θ.
(d) Refer to the data in Problem 25. Obtain an exact 90% confidence interval for θ based on the pivotal quantity U. Compare this with the approximate confidence intervals for θ obtained in Problem 25.
is a pivotal quantity. Show how this pivotal quantity can be used to construct a 100p% confidence interval for σ² and σ.
28. A study on the common octopus (Octopus vulgaris) was conducted by researchers at the University of Vigo in Vigo, Spain. Nineteen octopi were caught in July 2008
in the Ria de Vigo (a large estuary on the northwestern coast of Spain). Several
measurements were made on each octopus including their weight in grams. These
weights are given in the table below.
680 1030 1340 1330 1260 770 830 1470 1380 1220
920 880 1020 1050 1140 960 1060 1140 860
    Σ_{i=1}^{19} y_i = 20340   and   Σ_{i=1}^{19} (y_i − ȳ)² = 884095
(a) Use a qqplot to determine how reasonable the Gaussian model is for these data.
(b) Describe a suitable study population for this study. The parameters μ and σ correspond to what attributes of interest in the study population?
(c) The researchers at the University of Vigo were interested in determining whether the octopi in the Ria de Vigo are healthy. For common octopi, a population mean weight of 1100 grams is considered healthy. Determine a 95% confidence interval for μ. What should the researchers conclude about the health of the octopi, in terms of weight, in the Ria de Vigo?
(d) Determine a 90% confidence interval for σ based on these data.
29. Consider the data on weights of adult males and females from Chapter 1. The data
are available in the …le bmidata.txt posted on the course website.
(a) Determine whether it is reasonable to assume a Gaussian model for the female heights and a different Gaussian model for the male heights.
(b) Obtain a 95% confidence interval for the mean for the females and males separately. Does there appear to be a difference in the means for females and males? (We will see how to test this formally in Chapter 6.)
(c) Obtain a 95% confidence interval for the standard deviation for the females and males separately. Does there appear to be a difference in the standard deviations?
30. Sixteen packages are randomly selected from the production of a detergent packaging
machine. Let yi = weight in grams of the i’th package, i = 1; 2; : : : ; 16.
287 293 295 295 297 298 299 300
300 302 302 303 306 307 308 311
For these data

    Σ_{i=1}^{16} y_i = 4803   and   Σ_{i=1}^{16} y_i² = 1442369

To analyze these data the model Y_i ~ G(μ, σ), i = 1, 2, …, 16 independently is assumed, where μ and σ are unknown parameters.
(a) Describe a suitable study population for this study. The parameters μ and σ correspond to what attributes of interest in the study population?
(b) Obtain 95% confidence intervals for μ and σ.
(c) Let Y represent the weight of a future, independent, randomly selected package.
Obtain a 95% prediction interval for Y .
31. Radon is a colourless, odourless gas that is naturally released by rocks and soils and
may concentrate in highly insulated houses. Because radon is slightly radioactive,
there is some concern that it may be a health hazard. Radon detectors are sold to
homeowners worried about this risk, but the detectors may be inaccurate. University
researchers in Waterloo purchased 12 detectors of the same type at Home Depot. The
detectors were placed in a chamber where they were exposed to 105 picocuries per
liter of radon over 3 days. The readings given by the detectors were:
91:9 97:8 111:4 122:3 105:4 95:0 103:8 99:6 96:6 119:3 104:8 101:7
Let y_i = reading for the i'th detector, i = 1, 2, …, 12. For these data

    Σ_{i=1}^{12} y_i = 1249.6   and   Σ_{i=1}^{12} (y_i − ȳ)² = 971.4267
(a) Describe a suitable study population for this study. The parameters μ and σ correspond to what attributes of interest in the study population?
(b) Obtain a 95% confidence interval for μ. Does it contain the value μ = 105?
(c) Obtain a 95% confidence interval for σ.
(d) As a statistician what would you say to the university researchers about the
accuracy and precision of the detectors?
(e) University researchers purchased one more radon detector. It is to be exposed to
105 picocuries per liter of radon over 3 days. Calculate a 95% prediction interval
for the reading for this new radon detector.
(f) Suppose the researchers wanted to determine the mean level of radon detected by the radon detectors to "within 3 picocuries per liter". As a statistician we would interpret this as requiring that the 95% confidence interval for μ should have width at most 6. How many detectors in total would you advise the researchers to test?
32. A chemist has two ways of measuring a particular quantity μ; one has more random error than the other. For method I, measurements X1, X2, …, Xm follow a Normal distribution with mean μ and variance σ1², whereas for method II, measurements Y1, Y2, …, Yn have a Normal distribution with mean μ and variance σ2².
(a) Assuming that σ1² and σ2² are known, find the combined likelihood function for μ based on the observed data x1, x2, …, xm and y1, y2, …, yn and show that the maximum likelihood estimate of μ is

    μ̂ = (w1 x̄ + w2 ȳ)/(w1 + w2)

where w1 = m/σ1² and w2 = n/σ2².
(b) Suppose that 1 = 1, 2 = 0:5 and n = m = 10. How would you rationalize
to a non-statistician why you were using the estimate (x + 4y) =5 instead of
(x + y) =2?
(c) Suppose that 1 = 1, 2 = 0:5 and n = m = 10, determine the standard
deviation of the maximum likelihood estimator
w1 X + w2 Y
~=
w1 + w2
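A quick numerical check of parts (b) and (c), assuming the information weights $w_1 = m/\sigma_1^2$ and $w_2 = n/\sigma_2^2$ (this choice reproduces the $(\bar{x} + 4\bar{y})/5$ estimate in part (b)):

```python
# Numerical check of parts (b) and (c), assuming the information weights
# w1 = m/sigma1^2 and w2 = n/sigma2^2 (this choice reproduces the
# (xbar + 4*ybar)/5 estimate in part (b)).
m = n = 10
sigma1, sigma2 = 1.0, 0.5
w1 = m / sigma1**2                       # 10
w2 = n / sigma2**2                       # 40
ratio = w2 / w1                          # weight on ybar relative to xbar
sd = (w1 + w2) ** -0.5                   # part (c): sd of the MLE, 1/sqrt(w1+w2)
print(ratio, sd)                         # prints 4.0 and about 0.141
```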
We denote this by writing $X_n \xrightarrow{p} c$.
(a) If $\{X_n\}$ and $\{Y_n\}$ are two sequences of random variables with $X_n \xrightarrow{p} c_1$ and $Y_n \xrightarrow{p} c_2$, show that $X_n + Y_n \xrightarrow{p} c_1 + c_2$ and $X_n Y_n \xrightarrow{p} c_1 c_2$.
(b) Let $X_1, X_2, \ldots$ be independent and identically distributed random variables with probability density function $f(x; \theta)$. A point estimator $\tilde{\theta}_n$ based on a random sample $X_1, X_2, \ldots, X_n$ is said to be consistent for $\theta$ if $\tilde{\theta}_n \xrightarrow{p} \theta$ as $n \to \infty$.
(i) Let $X_1, X_2, \ldots, X_n$ be independent and identically distributed Uniform$(0, \theta)$ random variables. Show that $\tilde{\theta}_n = \max(X_1, X_2, \ldots, X_n)$ is consistent for $\theta$.
(ii) Let $X \sim$ Binomial$(n, \theta)$. Show that $\tilde{\theta}_n = X/n$ is consistent for $\theta$.
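A simulation sketch of part (b)(i): for Uniform$(0, \theta)$ the sample maximum approaches $\theta$ as $n$ grows ($\theta = 2$ is an arbitrary choice here).

```python
# Simulation sketch of part (b)(i): for Uniform(0, theta) the sample maximum
# approaches theta as n grows (theta = 2 is an arbitrary choice).
import random

random.seed(1)
theta = 2.0
gaps = []
for n in (10, 100, 10000):
    est = max(random.uniform(0, theta) for _ in range(n))
    gaps.append(theta - est)             # gap shrinks toward 0 as n grows
print(gaps)
```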
35. Challenge Problem Refer to the definition of consistency in part (b) of the preceding problem. Difficulties can arise when the number of parameters increases with the amount of data. Suppose that two independent measurements of blood sugar are taken on each of $n$ individuals and consider the model
$$X_{i1}, X_{i2} \sim N(\mu_i, \sigma^2) \quad \text{for } i = 1, 2, \ldots, n$$
where $X_{i1}$ and $X_{i2}$ are the independent measurements. The variance $\sigma^2$ is to be estimated, but the $\mu_i$'s are also unknown.
(a) Find the maximum likelihood estimator $\tilde{\sigma}^2$ and show that it is not consistent.
(b) Suggest an alternative way to estimate $\sigma^2$ by considering the differences $W_i = X_{i1} - X_{i2}$.
(c) What does $\sigma$ represent physically if the measurements are taken very close together in time?
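A simulation sketch of part (a): the maximum likelihood estimator of $\sigma^2$ converges to $\sigma^2/2$ rather than $\sigma^2$, while the estimator based on the differences $W_i$ from part (b) is consistent. The values $\sigma = 1$ and the per-person means used below are arbitrary choices.

```python
# Simulation sketch of part (a): the MLE of sigma^2 converges to sigma^2/2,
# not sigma^2, while the estimator based on the differences W_i (part (b))
# is consistent. sigma = 1 and the per-person means mu_i are arbitrary here.
import random

random.seed(2)
sigma, n = 1.0, 50000
mle = w_based = 0.0
for i in range(n):
    mu_i = float(i % 10)                 # unknown, different for each person
    x1 = random.gauss(mu_i, sigma)
    x2 = random.gauss(mu_i, sigma)
    w = x1 - x2                          # W_i ~ N(0, 2 sigma^2)
    mle += w * w / 4                     # person i's contribution to the MLE
    w_based += w * w / 2                 # consistent estimator from part (b)
mle /= n
w_based /= n
print(mle, w_based)                      # about 0.5 and about 1.0
```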
36. Challenge Problem Proof of Central Limit Theorem (Special Case) Suppose $Y_1, Y_2, \ldots$ are independent random variables with $E(Y_i) = \mu$, $Var(Y_i) = \sigma^2$ and that they have the same distribution, whose moment generating function exists.
(a) Show that $(Y_i - \mu)/\sigma$ has moment generating function of the form $\left(1 + \frac{t^2}{2} + \text{terms in } t^3, t^4, \ldots\right)$ and thus that $(Y_i - \mu)/(\sigma\sqrt{n})$ has moment generating function of the form $\left[1 + \frac{t^2}{2n} + o(1/n)\right]$, where $o(1/n)$ signifies a remainder term $R_n$ with the property that $nR_n \to 0$ as $n \to \infty$.
(b) Let
$$Z_n = \frac{\sum_{i=1}^{n} (Y_i - \mu)}{\sigma\sqrt{n}} = \frac{\sqrt{n}(\bar{Y} - \mu)}{\sigma}$$
and note that its moment generating function is of the form $\left[1 + \frac{t^2}{2n} + o(1/n)\right]^n$. Show that as $n \to \infty$ this approaches the limit $e^{t^2/2}$, which is the moment generating function for a $G(0, 1)$ random variable. (Hint: For any real number $a$, $(1 + a/n)^n \to e^a$ as $n \to \infty$.)
5. HYPOTHESIS TESTING
5.1 Introduction
What does it mean to test a hypothesis in the light of observed data or information? Suppose a statement has been formulated such as "I have extrasensory perception." or "This drug that I developed reduces pain better than those currently available." and an experiment is conducted to determine how credible the statement is in light of observed data. How do we measure credibility? If there are two alternatives, "I have ESP." and "I do not have ESP.", should they both be considered a priori as equally plausible? If I correctly guess the outcome on 53 of 100 tosses of a fair coin, would you conclude that my gift is real since I was correct more than 50% of the time? If I develop a treatment for pain in my basement laboratory using a mixture of seaweed and tofu, would you treat the claims "this product is superior to aspirin" and "this product is no better than aspirin" symmetrically?
To understand a test of hypothesis it is helpful to draw an analogy with the criminal court system used in many places in the world, where the two hypotheses "the defendant is innocent" and "the defendant is guilty" are not treated symmetrically. In these courts, the court assumes a priori that the first hypothesis, "the defendant is innocent", is true, and then the prosecution attempts to find sufficient evidence to show that this hypothesis of innocence is not plausible. There is no requirement that the defendant be proved innocent. At the end of the trial the judge or jury may conclude that there was insufficient evidence for a finding of guilty and the defendant is then exonerated. Of course there are two types of errors that this system can (and inevitably does) make: convict an innocent defendant or fail to convict a guilty defendant. The two hypotheses are usually not given equal weight a priori because these two errors have very different consequences.
A test of hypothesis is analogous to this legal example. We often begin by specifying a single "default" hypothesis ("the defendant is innocent" in the legal context) and then check whether the data collected are unlikely under this hypothesis. This default hypothesis is often referred to as the "null" hypothesis and is denoted by $H_0$ ("null" is used because it often means a new treatment has no effect). Of course, there is an alternative hypothesis, which may not always be specified. In many cases the alternative hypothesis is simply that $H_0$ is not true.
We will outline the logic of a test of hypothesis in the first example, the claim that I have ESP. In an effort to prove or disprove this claim, an unbiased observer tosses a fair coin 100 times and before each toss I guess the outcome of the toss. We count $Y$, the number
of correct guesses, which we can assume has a Binomial distribution with $n = 100$. The probability that I guess the outcome correctly on a given toss is an unknown parameter $\theta$. If I have no unusual ESP capacity at all, then we would assume $\theta = 0.5$, whereas if I have some form of ESP, either a positive attraction or an aversion to the correct answer, then we expect $\theta \neq 0.5$. We begin by asking the following questions in this context:
(1) Which of the two possibilities, $\theta = 0.5$ or $\theta \neq 0.5$, should be assigned to $H_0$, the null hypothesis?
(2) What observed values of $Y$ are highly inconsistent with $H_0$ and what observed values of $Y$ are compatible with $H_0$?
(3) What observed values of $Y$ would lead us to conclude that the data provide no evidence against $H_0$ and what observed values of $Y$ would lead us to conclude that the data provide strong evidence against $H_0$?
In answer to question (1), hopefully you observed that the two hypotheses ESP and NO ESP are not equally credible and decided that the null hypothesis should be $H_0: \theta = 0.5$, or $H_0$: I do not have ESP.
To answer question (2), we note that observed values of $Y$ that are very small (e.g. 0 to 10) or very large (e.g. 90 to 100) would clearly lead us to believe that $H_0$ is false, whereas values near 50 are perfectly consistent with $H_0$. This leads naturally to the concept of a test statistic or discrepancy measure $D$. Usually we define $D$ so that $D = 0$ represents the best possible agreement between the data and $H_0$, and values of $D$ not close to 0 indicate poor agreement. A general method for constructing test statistics will be described in Section 5.3, but in this example, it seems natural to use $D(Y) = |Y - 50|$.
Question (3) could be resolved easily if we could specify a threshold value for $D$, or equivalently some function of $D$. In the given example, the observed value of $Y$ was $y = 52$ and so the observed value of $D$ is $d = |52 - 50| = 2$. One might ask what is the probability, assuming $H_0$ is true, that the discrepancy measure is greater than or equal to $d$. In other words we want to determine $P(D \geq d; H_0)$ where the notation "$; H_0$" means "assuming that $H_0$ is true". We can compute this probability easily for this Binomial model: $P(D \geq 2; H_0) = 1 - P(49 \leq Y \leq 51) \approx 0.76$.
How can we interpret this value in terms of the test of $H_0$? Roughly 76% of claimants similarly tested for ESP, who have no abilities at all but simply guess randomly, will perform as well as or better than I did (that is, will give at least as large a value of $D$ as the observed value of 2). This does not prove I do not have ESP, but it does indicate that we have failed to find any evidence in these data to support rejecting $H_0$. There is no evidence against $H_0$ in the observed value $d = 2$, and this was indicated by the high probability that, when $H_0$ is true, we obtain at least this much measured disagreement with $H_0$.
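The probability for the ESP example can be reproduced exactly from the Binomial model ($n = 100$ tosses, $y = 52$ correct, $D = |Y - 50|$):

```python
# Exact p-value for the ESP example above: n = 100 tosses, y = 52 correct,
# D = |Y - 50|, observed d = 2, Y ~ Binomial(100, 0.5) under H0.
from math import comb

def pmf(y, n=100, p=0.5):
    return comb(n, y) * p**y * (1 - p) ** (n - y)

p_value = 1 - sum(pmf(y) for y in (49, 50, 51))   # P(D >= 2; H0)
print(round(p_value, 2))                          # about 0.76
```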
We now proceed to a more formal treatment of a test of hypothesis. We will concentrate on two types of hypotheses:
(1) the hypothesis $H_0: \theta = \theta_0$ where it is assumed that the data $\mathbf{Y}$ have arisen from a family of distributions with probability (density) function $f(y; \theta)$ with parameter $\theta$;
(2) the hypothesis $H_0: Y \sim f_0(y)$ where it is assumed that the data $Y$ have a specified probability (density) function $f_0(y)$.

Definition 39 Suppose we use the test statistic $D = D(\mathbf{Y})$ to test the hypothesis $H_0$. Suppose also that $d = D(\mathbf{y})$ is the observed value of $D$. The p-value or observed significance level of the test of hypothesis $H_0$ using test statistic $D$ is
$$\text{p-value} = P(D \geq d; H_0)$$
statistic. If $d$ (the observed value of $D$) is large and consequently the p-value is small then one of the following two statements is correct:
(1) $H_0$ is true but by chance we have observed an outcome that does not happen very often when $H_0$ is true, or
(2) $H_0$ is false.
If the p-value is close to 0.05, then the event of observing a value of $D$ as unusual as or more unusual than the one observed happens only 5 times out of 100, that is, not very often. Therefore we interpret a p-value close to 0.05 as indicating that the observed data provide evidence against $H_0$. If the p-value is very small, for example less than 0.001, then such an event happens only 1 time out of 1000, that is, very rarely. Therefore we interpret a p-value close to 0.001 as indicating that the observed data provide strong evidence against $H_0$. If the p-value is greater than 0.1, then such an event happens more than 1 time out of 10, that is, fairly often, and therefore the observed data are consistent with $H_0$.
Remarks
(1) Note that the p-value is defined as $P(D \geq d; H_0)$ and not $P(D = d; H_0)$ even though the event that has been observed is $D = d$. If $D$ is a continuous random variable then $P(D = d; H_0)$ is always equal to zero, which is not very useful. If $D$ is a discrete random variable with many possible values then $P(D = d; H_0)$ will be small, which is also not very useful. Therefore to determine how unusual the observed result is we compare it to all the other results which are as unusual as or more unusual than what has been observed.
(2) The p-value is NOT the probability that $H_0$ is true. This is a common misinterpretation.

The following table gives a rough guideline for interpreting p-values. These are only guidelines for this course. The interpretation of p-values must always be made in the context of a given study.

p-value                     Interpretation
p-value > 0.10              No evidence against H0 based on the observed data.
0.05 < p-value ≤ 0.10       Weak evidence against H0 based on the observed data.
0.01 < p-value ≤ 0.05       Evidence against H0 based on the observed data.
0.001 < p-value ≤ 0.01      Strong evidence against H0 based on the observed data.
p-value ≤ 0.001             Very strong evidence against H0 based on the observed data.

Table 5.1: Guidelines for interpreting p-values
which can be calculated using R or using the Normal approximation to the Binomial since $n$ is large. Using the Normal approximation (without a continuity correction since it is not essential to have an exact p-value) we obtain
$$\begin{aligned}
\text{p-value} &= P(D \geq 14; H_0) \\
&= P(Y \geq 44) \quad \text{where } Y \sim \text{Binomial}\left(180, \tfrac{1}{6}\right) \\
&= \sum_{y=44}^{180} \binom{180}{y} \left(\tfrac{1}{6}\right)^y \left(\tfrac{5}{6}\right)^{180-y} \\
&= 0.005
\end{aligned}$$
which provides strong evidence against $H_0$, and suggests that $\theta$ is bigger than $1/6$. This is an example of a one-sided test.
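The Binomial tail in the display above is small enough to evaluate term by term:

```python
# Term-by-term evaluation of the Binomial tail in the display above:
# P(Y >= 44) where Y ~ Binomial(180, 1/6).
from math import comb

n, p = 180, 1 / 6
p_value = sum(comb(n, y) * p**y * (1 - p) ** (n - y) for y in range(44, n + 1))
print(round(p_value, 3))   # about 0.005
```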
$= 0.18$
and this probability is not especially small. Indeed almost one die in five, though fair, would show this level of discrepancy with $H_0$. We conclude that there is no evidence against $H_0$ in light of the observed data.
Note that we do not claim that $H_0$ is true, only that there is no evidence in light of the data that it is not true. Similarly in the legal example, if we do not find evidence against $H_0$: "the defendant is innocent", this does not mean we have proven he or she is innocent, only that, for the given data, the amount of evidence against $H_0$ was insufficient to conclude otherwise.
The approach to testing a hypothesis described above is very general and straightforward, but a few points should be stressed:
(1) If the p-value is very small then we would conclude that there is strong evidence against $H_0$ in light of the observed data; this is often termed "statistically significant" evidence against $H_0$. We believe that statistical evidence is best measured when we interpret p-values as in Table 5.1. However, it is still common in some areas of research to adopt a threshold p-value such as 0.05 and "reject $H_0$" whenever the p-value is below this threshold. This may be necessary when there are only two possible decisions from which to choose. For example in a trial, a person is either convicted or acquitted of a crime. In the examples in these Course Notes we report the p-value and use the guidelines in Table 5.1 to make a conclusion about whether there is evidence against $H_0$ or not. We emphasize the point that any decisions which are made after determining the p-value for a given hypothesis $H_0$ must be made in the context of the empirical study.
(2) If the p-value is not small, we do not conclude that $H_0$ is true. We simply say there is no evidence against $H_0$ in light of the observed data. The reason for this "hedging" is that in most settings a hypothesis may never be strictly "true". For example, one might argue when testing $H_0: \theta = 1/6$ in Example 5.1.2 that no real die ever has a probability of exactly $1/6$ for side 1. Hypotheses can be "disproved" (with a small degree of possible error) but not proved.
(3) Just because there is strong evidence against a hypothesis $H_0$, there is no implication about how "wrong" $H_0$ is. A test of hypothesis should always be supplemented with an interval estimate that indicates the magnitude of the departure from $H_0$.
(4) It is important to keep in mind that although we might be able to find statistically significant evidence against a given hypothesis, this does not mean that the differences found are of practical significance. For example, suppose an insurance company randomly selects a large number of policies in two different years and finds a statistically significant difference in the mean value of policies sold in those two years of $5.21. This difference would probably not be of practical significance if the average value of policies sold in a year was greater than $1000. Similarly, if we collect large amounts of financial data, it is quite easy to find evidence against the hypothesis that stock or stock index returns are Normally distributed. Nevertheless for small amounts of data and for the pricing of options, such an assumption is usually made and considered useful. Finally suppose we compared two cryptographic algorithms using the number of cycles per byte as the unit of measurement. A mean difference of two cycles per byte might be found to be statistically significant but the decision about whether this difference is of practical importance or not is best left to a computer scientist who studies algorithms.
(5) When the observed data provide strong evidence against the null hypothesis, researchers often have an "alternative" hypothesis in mind. For example, suppose a standard pain reliever provides relief in about 50% of cases and researchers at a pharmaceutical company have developed a new pain reliever that they wish to test. The null hypothesis is $H_0: P(\text{relief}) = 0.5$. Suppose there is strong evidence against $H_0$ based on the data. The researchers will want to know in which direction that evidence lies. If the probability of relief is greater than 0.5 the researchers might consider adopting the drug or doing further testing, but if the probability of relief is less than 0.5, then the pain reliever would probably be abandoned. The choice of the discrepancy measure $D$ is often made with a particular alternative in mind.
A drawback with the approach to testing described so far is that we do not have a general method for choosing the test statistic or discrepancy measure $D$. Often there are "intuitively obvious" test statistics that can be used; this was the case in the examples in this section. In Section 5.3 we will see how to use the likelihood function to construct a test statistic in more complicated situations where it is not always easy to come up with an intuitive test statistic.
For the Gaussian model with unknown mean and standard deviation we use test statistics based on the pivotal quantities that were used in Chapter 4 for constructing confidence intervals.
5.2 Hypothesis Testing for Parameters in the G(μ, σ) Model

We use the sample variance $S^2$ to estimate $\sigma^2$.
Recall from Chapter 4 that
$$T = \frac{\bar{Y} - \mu}{S/\sqrt{n}} \sim t(n-1)$$
We use this pivotal quantity to construct a test of hypothesis for the parameter $\mu$ when the standard deviation $\sigma$ is unknown.
$$\begin{aligned}
\text{p-value} &= P(D \geq d; H_0 \text{ is true}) \\
&= P(|T| \geq d) \quad \text{where } T \sim t(n-1) \\
&= 2[1 - P(T \leq d)]
\end{aligned}$$
Since values of $\bar{Y}$ which are larger or smaller than $\mu_0$ provide evidence against the null hypothesis, this test is called a two-sided test of hypothesis.
5.2. HYPOTHESIS TESTING FOR PARAMETERS IN THE G( ; ) MODEL 197
A: 1:026 0:998 1:017 1:045 0:978 1:004 1:018 0:965 1:010 1:000
B: 1:011 0:966 0:965 0:999 0:988 0:987 0:956 0:969 0:980 0:988
Let Y represent a single measurement on one of the scales, and let represent the
average measurement E(Y ) in repeated weighings of a single 1 kg weight. If an experiment
involving n weighings is conducted then a test of H0 : = 1 can be based on the test
statistic (5.1) with observed value (5.2) and 0 = 1.
The samples from scales A and B above give us
$$\begin{aligned}
\text{p-value} &= P(D \geq 0.839; \mu = 1) \\
&= P(|T| \geq 0.839) \quad \text{where } T \sim t(9) \\
&= 2[1 - P(T \leq 0.839)] \\
&= 2(1 - 0.7884) \\
&\approx 0.42
\end{aligned}$$
where the probability is obtained using R. Alternatively if we use the t table provided in these notes we obtain $P(T \leq 0.5435) = 0.7$ and $P(T \leq 0.8834) = 0.8$ so $0.4 < \text{p-value} < 0.6$.
In either case we have that the p-value $> 0.1$ and thus there is no evidence of bias, that is, there is no evidence against $H_0: \mu = 1$ for scale A based on the observed data.
For scale B, however, we obtain
$$\begin{aligned}
\text{p-value} &= P(D \geq 3.534; \mu = 1) \\
&= P(|T| \geq 3.534) \quad \text{where } T \sim t(9) \\
&= 2[1 - P(T \leq 3.534)] \\
&= 0.0064
\end{aligned}$$
where the probability is obtained using R. Alternatively if we use the t table we obtain $P(T \leq 3.2498) = 0.995$ and $P(T \leq 4.2968) = 0.999$ so $0.002 < \text{p-value} < 0.01$.
In either case we have that the p-value $< 0.01$ and thus there is strong evidence against $H_0: \mu = 1$. The observed data suggest strongly that scale B is biased.
Finally, note that although there is strong evidence against $H_0$ for scale B, the degree of bias in its measurements is not necessarily large enough to be of practical concern. In fact, we can obtain a 95% confidence interval for $\mu$ for scale B by using the pivotal quantity
$$T = \frac{\bar{Y} - \mu}{S/\sqrt{10}} \sim t(9)$$
From the t table we have $P(T \leq 2.2622) = 0.975$ and a 95% confidence interval for $\mu$ is given by
$$\bar{y} \pm 2.2622\, s/\sqrt{10} = 0.981 \pm 0.012$$
or $[0.969, 0.993]$. Evidently scale B consistently understates the weight, but the bias in measuring the one kg weight is likely fairly small (about 1% to 3%).
Remark The function t.test in R will give confidence intervals and test hypotheses about $\mu$. See Problem 3.
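The scale B calculation can also be redone from scratch; 2.2622 is the $t(9)$ table value quoted above, and small differences from the 3.534 reported in the text come from rounding of intermediate values:

```python
# From-scratch version of the scale B calculation above; 2.2622 is the
# t(9) table value quoted in the text. Small differences from the 3.534
# reported above come from rounding of intermediate values.
b = [1.011, 0.966, 0.965, 0.999, 0.988, 0.987, 0.956, 0.969, 0.980, 0.988]
n = len(b)
ybar = sum(b) / n
s = (sum((y - ybar) ** 2 for y in b) / (n - 1)) ** 0.5

d = abs(ybar - 1) / (s / n**0.5)         # observed |T| for H0: mu = 1
half = 2.2622 * s / n**0.5
ci = (ybar - half, ybar + half)          # 95% confidence interval for mu
print(round(d, 2), [round(x, 3) for x in ci])
```

Since the observed $|T|$ exceeds 2.2622, the value $\mu = 1$ falls outside the 95% confidence interval, illustrating the duality between tests and intervals discussed next.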
$$D = \frac{\bar{Y} - \mu_0}{S/\sqrt{n}}$$
so that large values of $D$ provide evidence against $H_0$ in the direction of the alternative $\mu > \mu_0$. Let the observed value of $D$ be
$$d = \frac{\bar{y} - \mu_0}{s/\sqrt{n}}$$
Then
$$\begin{aligned}
\text{p-value} &= P(D \geq d; H_0 \text{ is true}) \\
&= P(T \geq d) \\
&= 1 - P(T \leq d) \quad \text{where } T \sim t(n-1)
\end{aligned}$$
The p-value is greater than or equal to 0.05
if and only if $P\left(\dfrac{|\bar{Y} - \mu_0|}{S/\sqrt{n}} \geq \dfrac{|\bar{y} - \mu_0|}{s/\sqrt{n}};\ H_0: \mu = \mu_0 \text{ is true}\right) \geq 0.05$
if and only if $P\left(|T| \geq \dfrac{|\bar{y} - \mu_0|}{s/\sqrt{n}}\right) \geq 0.05$ where $T \sim t(n-1)$
if and only if $P\left(|T| \leq \dfrac{|\bar{y} - \mu_0|}{s/\sqrt{n}}\right) \leq 0.95$
if and only if $\dfrac{|\bar{y} - \mu_0|}{s/\sqrt{n}} \leq a$ where $P(|T| \leq a) = 0.95$
if and only if $\mu_0 \in \left[\bar{y} - as/\sqrt{n},\ \bar{y} + as/\sqrt{n}\right]$
which is a 95% confidence interval for $\mu$. In other words, the p-value for testing $H_0: \mu = \mu_0$ is greater than or equal to 0.05 if and only if the value $\mu = \mu_0$ is an element of a 95% confidence interval for $\mu$ (assuming we use the same pivotal quantity). Note that both endpoints of the 95% confidence interval correspond to a p-value equal to 0.05 while the values inside the 95% confidence interval have p-values greater than 0.05.
More generally, suppose we have data $\mathbf{y}$ and a model $f(\mathbf{y}; \theta)$. Suppose also that we use the same pivotal quantity to construct the (approximate) confidence interval for $\theta$ and to test the hypothesis $H_0: \theta = \theta_0$. Then the parameter value $\theta = \theta_0$ is an element of the $100q\%$ (approximate) confidence interval for $\theta$ if and only if the p-value for testing $H_0: \theta = \theta_0$ is greater than or equal to $1 - q$.
Recall from Chapter 4 that we used the pivotal quantity
$$\frac{(n-1)S^2}{\sigma^2} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(Y_i - \bar{Y})^2 \sim \chi^2(n-1)$$
to construct confidence intervals for the parameter $\sigma$. We may also wish to test a hypothesis such as $H_0: \sigma = \sigma_0$, or equivalently $H_0: \sigma^2 = \sigma_0^2$. One approach is to use a likelihood ratio test statistic, which is described in the next section. Alternatively we could use the test statistic
$$U = \frac{(n-1)S^2}{\sigma_0^2}$$
for testing $H_0: \sigma = \sigma_0$. Large values of $U$ and small values of $U$ provide evidence against $H_0$. (Why is this?) Now $U$ has a Chi-squared distribution with $n - 1$ degrees of freedom when $H_0$ is true, and the Chi-squared distribution is not symmetric, which makes the determination of "large" and "small" values somewhat problematic. The following simpler calculation approximates the p-value:
p value:
1. Let u = (n 1)s2 = 2
0 denote the observed value of U from the data.
p value = 2P (U u)
where U 2 (n 1).
p value = 2P (U u)
where U 2 (n 1).
1
Figure 5.1 shows a picture for a large observed value of u. In this case P (U u) > 2
and the p value = 2P (U u).
Note:
Only one of the two values 2P (U u) and 2P (U u) will be less than one and this
value is the desired p value.
Figure 5.1: Chi-squared probability density function showing the areas $P(U < u)$ and $P(U > u)$ for a large observed value $u$
Example 5.2.2
Suppose for the manufacturing process in Example 4.7.2, we wish to test the hypothesis $H_0: \sigma = 0.008$ (0.008 is the desired or target value the manufacturer would like to achieve). Since the 95% confidence interval for $\sigma$ was found to be $[0.0095, 0.0204]$, which does not contain the value $\sigma = 0.008$, we already know that the p-value for a test of $H_0$ based on the test statistic $U = (n-1)S^2/\sigma_0^2$ will be less than 0.05.
To find the p-value, we use the procedure given above:
1. $u = (n-1)s^2/\sigma_0^2 = 14s^2/(0.008)^2 = 0.002347/(0.008)^2 = 36.67$
2. The p-value is
$$\text{p-value} = 2P(U \geq u) = 2P(U \geq 36.67) = 0.0017 \quad \text{where } U \sim \chi^2(14)$$
Since the p-value $< 0.01$ there is strong evidence based on the observed data against $H_0: \sigma = 0.008$. Both the observed value of $s = \sqrt{0.002347/14} = 0.0129$ and the 95% confidence interval for $\sigma$ suggest that $\sigma$ is bigger than 0.008.
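This p-value can be reproduced without a statistics library: for even degrees of freedom $2m$ the chi-squared survival function has the closed (Erlang) form $P(U > u) = e^{-u/2}\sum_{k=0}^{m-1} (u/2)^k/k!$, a standard identity.

```python
# Reproducing the p-value above without a statistics library. For even
# degrees of freedom 2m the chi-squared survival function has the closed
# (Erlang) form P(U > u) = exp(-u/2) * sum_{k=0}^{m-1} (u/2)^k / k!.
from math import exp, factorial

def chi2_sf_even_df(u, df):
    half = u / 2
    return exp(-half) * sum(half**k / factorial(k) for k in range(df // 2))

u = 0.002347 / 0.008**2                  # (n-1)s^2 / sigma_0^2 = 36.67
p_value = 2 * chi2_sf_even_df(u, 14)
print(round(u, 2), round(p_value, 4))    # 36.67 and about 0.0017
```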
5.3 Likelihood Ratio Test of Hypothesis - One Parameter

$$R(\theta_0) = \frac{L(\theta_0)}{L(\hat{\theta})}$$
$$\Lambda(\theta_0) = -2\log\left[\frac{L(\theta_0)}{L(\tilde{\theta})}\right] \tag{5.4}$$
(remember $\log = \ln$) which is a one-to-one function of $L(\theta_0)/L(\tilde{\theta})$.¹² We choose this particular function because, if $H_0: \theta = \theta_0$ is true, then $\Lambda(\theta_0)$ has approximately a $\chi^2(1)$ distribution.

¹² Recall that $L(\theta) = L(\theta; \mathbf{y})$ is a function of the observed data $\mathbf{y}$. Replacing $\mathbf{y}$ by the corresponding random variable $\mathbf{Y}$ means that $L(\theta; \mathbf{Y})$ is a random variable. The random variable $L(\theta_0)/L(\tilde{\theta}) = L(\theta_0; \mathbf{Y})/L(\tilde{\theta}; \mathbf{Y})$ is a function of $\mathbf{Y}$ in several places including $\tilde{\theta} = g(\mathbf{Y})$.
Note that small values of $R(\theta_0)$ correspond to large observed values of $\Lambda(\theta_0)$, and therefore large observed values of $\Lambda(\theta_0)$ indicate evidence against the hypothesis $H_0: \theta = \theta_0$. We illustrate this in Figure 5.2. Notice that the more plausible values of the parameter $\theta$ correspond to larger values of $R(\theta)$ or equivalently, in the bottom panel, to small values of $\Lambda(\theta) = -2\log[R(\theta)]$. The particular value displayed, $\theta_0$, is around 0.3 and it appears that $\Lambda(\theta_0) = -2\log[R(\theta_0)]$ is quite large, in this case around 9. To know whether this is too large to be consistent with $H_0$, we need to compute the p-value.
Figure 5.2: The relative likelihood function $R(\theta)$ (top panel) and $\Lambda(\theta) = -2\log[R(\theta)]$ (bottom panel); more plausible values of $\theta$ correspond to larger values of $R(\theta)$ and smaller values of $\Lambda(\theta)$, and the value $\theta_0 = 0.3$ is marked
The approximate p-value is then
$$\begin{aligned}
\text{p-value} &\approx P[W \geq \Lambda(\theta_0)] \quad \text{where } W \sim \chi^2(1) \qquad (5.5) \\
&= P\left(|Z| \geq \sqrt{\Lambda(\theta_0)}\right) \quad \text{where } Z \sim G(0, 1) \\
&= 2\left[1 - P\left(Z \leq \sqrt{\Lambda(\theta_0)}\right)\right]
\end{aligned}$$
Let us summarize the construction of a test from the likelihood function. Let the random variable (or vector of random variables) $\mathbf{Y}$ represent data generated from a distribution with probability function or probability density function $f(y; \theta)$ which depends on the scalar parameter $\theta$. Let $\Omega$ be the parameter space (set of possible values) for $\theta$. Consider a hypothesis of the form
$$H_0: \theta = \theta_0$$
where $\theta_0$ is a single point (hence of dimension 0). We can test $H_0$ using as our test statistic the likelihood ratio test statistic $\Lambda$, defined by (5.4). Then large observed values of $\Lambda(\theta_0)$ correspond to a disagreement between the hypothesis $H_0: \theta = \theta_0$ and the data and so provide evidence against $H_0$. Moreover if $H_0: \theta = \theta_0$ is true, $\Lambda(\theta_0)$ has approximately a $\chi^2(1)$ distribution so that an approximate p-value is obtained from (5.5). The theory behind the approximation is based on a result which shows that under $H_0$, the distribution of $\Lambda$ approaches $\chi^2(1)$ as the size of the data set becomes large.
$$\Lambda(\theta_0) = -2\log R(\theta_0) = -2\log\left[\left(\frac{\theta_0}{\hat{\theta}}\right)^y \left(\frac{1 - \theta_0}{1 - \hat{\theta}}\right)^{n-y}\right]$$
(Note that since $R(0.5) = 0.367 > 0.1$ we already know that $\theta = 0.5$ is a plausible value of $\theta$.) The approximate p-value for testing $H_0: \theta = 0.5$ is
$$\begin{aligned}
\text{p-value} &\approx P(W \geq 2.003) \quad \text{where } W \sim \chi^2(1) \\
&= 2\left[1 - P\left(Z \leq \sqrt{2.003}\right)\right] \quad \text{where } Z \sim G(0, 1) \\
&= 2[1 - P(Z \leq 1.42)] = 2(1 - 0.9222) \\
&= 0.1556
\end{aligned}$$
and there is no evidence against $H_0: \theta = 0.5$ based on the data. Note that the test statistic $D = |Y - 100|$ used in Example 5.1.1 and the likelihood ratio test statistic $\Lambda(0.5)$ give nearly identical results. This is because $n = 200$ is large.
$$L(\theta) = \frac{1}{\theta^n}\exp\left(-\frac{1}{\theta}\sum_{i=1}^{n} y_i\right) = \theta^{-n} e^{-n\bar{y}/\theta} \quad \text{for } \theta > 0$$
Since the maximum likelihood estimate is $\hat{\theta} = \bar{y}$, the relative likelihood function can be written as
$$R(\theta) = \frac{L(\theta)}{L(\hat{\theta})} = \frac{\theta^{-n} e^{-n\bar{y}/\theta}}{\hat{\theta}^{-n} e^{-n\bar{y}/\hat{\theta}}} = \left(\frac{\hat{\theta}}{\theta}\right)^n e^{n(1 - \hat{\theta}/\theta)} \quad \text{for } \theta > 0$$
$$\Lambda(\theta_0) = -2\log R(\theta_0) = -2\log\left[\left(\frac{\hat{\theta}}{\theta_0}\right)^n e^{n(1 - \hat{\theta}/\theta_0)}\right]$$
with $\sum_{i=1}^{50} y_i = 93840$. For these data the maximum likelihood estimate of $\theta$ is $\hat{\theta} = \bar{y} = 93840/50 = 1876.8$. To check whether the Exponential model is reasonable for these data we plot the empirical cumulative distribution function for these data and then superimpose the cumulative distribution function for an Exponential(1876.8) random variable. See Figure 5.3. Since the agreement between the empirical cumulative distribution function and the Exponential(1876.8) cumulative distribution function is quite good, we assume the Exponential model to test the hypothesis that the mean lifetime of the light bulbs is 2000 hours. The observed value of the likelihood ratio test statistic for testing $H_0: \theta = 2000$ is
Figure 5.3: Empirical cumulative distribution function of the lifetimes of the light bulbs with the Exponential(1876.8) cumulative distribution function superimposed
(Note that since $R(2000) = 0.9058 > 0.1$ we already know that $\theta = 2000$ is a plausible value of $\theta$.) The approximate p-value for testing $H_0: \theta = 2000$ is
$$\begin{aligned}
\text{p-value} &\approx P(W \geq 0.1979) \quad \text{where } W \sim \chi^2(1) \\
&= 2\left[1 - P\left(Z \leq \sqrt{0.1979}\right)\right] \quad \text{where } Z \sim G(0, 1) \\
&= 2[1 - P(Z \leq 0.44)] = 2(1 - 0.67003) \\
&= 0.65994
\end{aligned}$$
and there is no evidence against $H_0: \theta = 2000$ based on the data. Therefore there is no evidence against the manufacturer's claim that $\theta$ is 2000 hours. Although the maximum likelihood estimate $\hat{\theta}$ was under 2000 hours (1876.8), it was not sufficiently far below to give evidence against $H_0: \theta = 2000$.
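The light bulb example can be checked numerically from the summary data ($n = 50$, total lifetime 93840), using the Normal form of the $\chi^2(1)$ tail as in (5.5); small differences from the figures above come from rounding:

```python
# Numerical check of the light bulb example from the summary data
# (n = 50, total lifetime 93840), using the Normal form of the
# chi-squared(1) tail.
from math import erf, exp, log, sqrt

n, total = 50, 93840
theta_hat = total / n                    # 1876.8
theta0 = 2000.0

lam = 2 * n * (log(theta0 / theta_hat) + theta_hat / theta0 - 1)
r = exp(-lam / 2)                        # R(2000) = exp(-Lambda/2)
phi = 0.5 * (1 + erf(sqrt(lam) / sqrt(2)))    # P(Z <= sqrt(Lambda))
p_value = 2 * (1 - phi)
print(round(lam, 3), round(r, 4), round(p_value, 2))   # about 0.198, 0.906, 0.66
```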
$$L(\mu) = \frac{1}{\sigma^n}\exp\left[-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \bar{y})^2\right]\exp\left[-\frac{n(\bar{y} - \mu)^2}{2\sigma^2}\right]$$
Since $\sigma$ is known, the factors that do not depend on $\mu$ are constant and the likelihood function can be taken to be
$$L(\mu) = \exp\left[-\frac{n(\bar{y} - \mu)^2}{2\sigma^2}\right] \quad \text{for } \mu \in \mathbb{R}$$
$$l(\mu) = -\frac{n(\bar{y} - \mu)^2}{2\sigma^2} \quad \text{for } \mu \in \mathbb{R}$$
$$l'(\mu) = \frac{n(\bar{y} - \mu)}{\sigma^2} = 0$$
$$\tilde{\mu} = \bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i$$
$$R(\mu) = \frac{L(\mu)}{L(\hat{\mu})} = \exp\left[-\frac{n(\bar{y} - \mu)^2}{2\sigma^2}\right] \quad \text{for } \mu \in \mathbb{R}$$
$$\begin{aligned}
\Lambda(\mu_0) &= -2\log\left[\frac{L(\mu_0)}{L(\tilde{\mu})}\right] \\
&= -2\log\left[\exp\left(-\frac{n(\bar{Y} - \mu_0)^2}{2\sigma^2}\right)\right] \quad \text{since } \tilde{\mu} = \bar{Y} \\
&= \frac{n(\bar{Y} - \mu_0)^2}{\sigma^2} \\
&= \left(\frac{\bar{Y} - \mu_0}{\sigma/\sqrt{n}}\right)^2 \qquad (5.6)
\end{aligned}$$
The purpose of writing the likelihood ratio statistic in the form (5.6) is to draw attention to the fact that, in this special case, $\Lambda(\mu_0)$ has exactly a $\chi^2(1)$ distribution for all values of $n$ since $\frac{\bar{Y} - \mu_0}{\sigma/\sqrt{n}} \sim G(0, 1)$.
More generally it is not as obvious that the likelihood ratio test statistic has an approximate $\chi^2(1)$ distribution.
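The exactness of the $\chi^2(1)$ distribution in (5.6) can be checked by simulation. This sketch (with the arbitrary choices $n = 10$, $\mu_0 = 0$, $\sigma = 1$) estimates $P[\Lambda(\mu_0) \leq 3.841]$, which should be close to 0.95 since 3.841 is the 0.95 quantile of $\chi^2(1)$:

```python
# Simulation sketch: for Gaussian data with known sigma, Lambda(mu0) =
# n(Ybar - mu0)^2 / sigma^2 is exactly chi-squared(1), so the fraction of
# simulated values below 3.841 (the 0.95 quantile) should be close to 0.95
# for any n. The choices n = 10, mu0 = 0, sigma = 1 are arbitrary.
import random

random.seed(3)
n, reps = 10, 20000
below = 0
for _ in range(reps):
    ybar = sum(random.gauss(0, 1) for _ in range(n)) / n
    lam = n * ybar * ybar          # Lambda(mu0) with mu0 = 0, sigma = 1
    if lam <= 3.841:
        below += 1
print(below / reps)                # close to 0.95
```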
5.4 Likelihood Ratio Test of Hypothesis - Multiparameter

$$H_0: \theta \in \Omega_0$$
$$L(\hat{\theta}) = \max_{\theta \in \Omega} L(\theta)$$
Similarly we let $\hat{\theta}_0$ denote the maximum likelihood estimate of $\theta$ over $\Omega_0$ (i.e. we maximize the likelihood with the parameter constrained to lie in the set $\Omega_0$) so that
$$L(\hat{\theta}_0) = \max_{\theta \in \Omega_0} L(\theta)$$
$$\text{p-value} = P(\Lambda \geq \lambda; H_0) \approx P(W \geq \lambda) \qquad (5.8)$$
where $W \sim \chi^2(k - p)$ and $\lambda$ is the observed value of $\Lambda$.
The likelihood ratio test covers a great many different types of examples, but we only provide a few examples here.
where $\theta = E(Y)$ is the mean number of failures per month. (This ignores that the number of days that the copiers are used varies a little across months. Adjustments could be made to the analysis to deal with this.) Denote the value of $\theta$ for Company A's copiers as $\theta_A$ and the value for Company B's as $\theta_B$. Let us test the hypothesis that the two photocopiers have the same mean number of failures:
$$H_0: \theta_A = \theta_B$$
Essentially we have data from two Poisson distributions with possibly different parameters. For convenience let $x_1, \ldots, x_n$ denote the observations for Company A's photocopier which are assumed to be a random sample from the model
$$P(X = x; \theta_A) = \frac{\theta_A^x e^{-\theta_A}}{x!} \quad \text{for } x = 0, 1, \ldots \text{ and } \theta_A \geq 0$$
Similarly let $y_1, y_2, \ldots, y_m$ denote the observations for Company B's photocopier which are assumed to be a random sample from the model
$$P(Y = y; \theta_B) = \frac{\theta_B^y e^{-\theta_B}}{y!} \quad \text{for } y = 0, 1, \ldots \text{ and } \theta_B \geq 0$$
independently of the observations for Company A's photocopier. In this case the parameter vector is the two-dimensional vector $\theta = (\theta_A, \theta_B)$ and $\Omega = \{(\theta_A, \theta_B): \theta_A \geq 0, \theta_B \geq 0\}$. Note that the dimension of $\Omega$ is $k = 2$. Since the null hypothesis specifies that the two parameters $\theta_A$ and $\theta_B$ are equal but does not otherwise specify their values, we have $\Omega_0 = \{(\theta, \theta): \theta \geq 0\}$ which is a space of dimension $p = 1$.
To construct the likelihood ratio test of $H_0: \theta_A = \theta_B$ we need the likelihood function for the parameter vector $\theta = (\theta_A, \theta_B)$. We first note that the likelihood function for $\theta_A$ only, based on the data $x_1, x_2, \ldots, x_n$, is
$$L_1(\theta_A) = \prod_{i=1}^{n} f(x_i; \theta_A) = \prod_{i=1}^{n} \frac{\theta_A^{x_i} e^{-\theta_A}}{x_i!} \quad \text{for } \theta_A \geq 0$$
or more simply
$$L_1(\theta_A) = \theta_A^{n\bar{x}} e^{-n\theta_A} \quad \text{for } \theta_A \geq 0$$
Similarly the likelihood function for $\theta_B$ only, based on $y_1, y_2, \ldots, y_m$, is given by
$$L_2(\theta_B) = \theta_B^{m\bar{y}} e^{-m\theta_B} \quad \text{for } \theta_B \geq 0$$
Since the datasets are independent, the likelihood function for $\theta = (\theta_A, \theta_B)$ is obtained as the product of the individual likelihoods:
$$L(\theta) = L(\theta_A, \theta_B) = L_1(\theta_A) L_2(\theta_B) = \theta_A^{n\bar{x}} e^{-n\theta_A} \theta_B^{m\bar{y}} e^{-m\theta_B} \quad \text{for } (\theta_A, \theta_B) \in \Omega$$
The number of photocopy failures in twelve consecutive months for Company A and Company B are given below:

Month:      1   2   3   4   5   6   7   8   9   10  11  12  Total
Company A:  16  14  25  19  23  12  22  28  19  15  18  29  $\sum_{i=1}^{12} x_i = 240$
Company B:  13  7   12  9   15  17  10  13  8   10  12  14  $\sum_{j=1}^{12} y_j = 140$

The values of $\theta_A$ and $\theta_B$ which maximize $l(\theta_A, \theta_B)$ are obtained by solving the two equations
$$\frac{\partial l}{\partial \theta_A} = 0 \qquad \frac{\partial l}{\partial \theta_B} = 0$$
which gives two equations in two unknowns:
$$-12 + \frac{240}{\theta_A} = 0 \qquad -12 + \frac{140}{\theta_B} = 0$$
The maximum likelihood estimates of $\theta_A$ and $\theta_B$ (unconstrained) are $\hat{\theta}_A = \bar{x} = \frac{240}{12} = 20.0$ and $\hat{\theta}_B = \bar{y} = \frac{140}{12} = 11.667$, and $\hat{\theta} = (\bar{x}, \bar{y}) = (20.0, 11.667)$.
To determine
$$L(\hat{\theta}_0) = \max_{\theta \in \Omega_0} L(\theta)$$
we need to find the (constrained) maximum likelihood estimate $\hat{\theta}_0$, which is the value of $\theta = (\theta_A, \theta_B)$ which maximizes $l(\theta_A, \theta_B)$ under the constraint $\theta_A = \theta_B$. To do this we merely let $\theta = \theta_A = \theta_B$ in (5.9) to obtain
$$\begin{aligned}
\text{p-value} &= P(\Lambda \geq 26.64; H_0) \\
&\approx P(W \geq 26.64) \quad \text{where } W \sim \chi^2(1) \\
&= 2\left[1 - P\left(Z \leq \sqrt{26.64}\right)\right] \quad \text{where } Z \sim G(0, 1) \\
&\approx 0
\end{aligned}$$
Based on the data there is very strong evidence against the hypothesis $H_0: \theta_A = \theta_B$. The data suggest that Company B's photocopiers have a lower rate of failure than Company A's photocopiers.
Note that we could also follow up this conclusion by giving a con…dence interval for the
mean di¤erence A B since this would indicate the magnitude of the di¤erence in the
two failure rates. The maximum likelihood estimates ^ A = 20:0 average failures per month
and ^B = 11:67 failures per month di¤er by quite a bit, but we could also give a con…dence
interval in order to express the uncertainty in such estimates.
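As a numerical check, the observed likelihood ratio statistic and its approximate p-value can be reproduced from the two monthly totals. This is a sketch in Python rather than R (used here purely as a calculator, with only the standard library):

```python
import math

# Check of the likelihood ratio test of H0: lambda_A = lambda_B for the
# photocopier data (totals 240 and 140 over n = m = 12 months each).
n, m = 12, 12
sx, sy = 240, 140

def poisson_loglik(lam, total, nobs):
    # Poisson log-likelihood up to an additive constant
    # (the log(x_i!) terms cancel when the ratio is formed)
    return total * math.log(lam) - nobs * lam

lam_a, lam_b = sx / n, sy / m        # unconstrained MLEs: 20.0 and 11.667
lam_0 = (sx + sy) / (n + m)          # constrained MLE under H0: 15.833

lrt = 2 * (poisson_loglik(lam_a, sx, n) + poisson_loglik(lam_b, sy, m)
           - poisson_loglik(lam_0, sx + sy, n + m))

def std_normal_cdf(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# P(W >= lrt) for W ~ chi-squared(1) equals 2[1 - Phi(sqrt(lrt))]
p_value = 2 * (1 - std_normal_cdf(math.sqrt(lrt)))
print(round(lrt, 2))   # approximately 26.6
```

The statistic agrees with the value 26.64 quoted above (up to rounding of the summary figures), and the p-value is far below 0.001.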
For a Gaussian example, suppose $Y_1, Y_2, \ldots, Y_n$ is a random sample from $G(\mu, \sigma)$ and consider testing $H_0: \sigma = \sigma_0$ with $\mu$ unknown. The log-likelihood is
$$l(\mu, \sigma) = c - n\log\sigma - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \mu)^2 \quad \text{where} \quad c = \log\left[(2\pi)^{-n/2}\right]$$
The maximum likelihood estimators under the full model are
$$\tilde{\mu} = \bar{Y} \quad \text{and} \quad \tilde{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(Y_i - \bar{Y})^2$$
so that
$$\Lambda(\sigma_0) = 2l(\bar{Y}, \tilde{\sigma}) - 2l(\bar{Y}, \sigma_0)$$
$$= -2n\log(\tilde{\sigma}) - \frac{1}{\tilde{\sigma}^2}\sum_{i=1}^{n}(Y_i - \bar{Y})^2 + 2n\log(\sigma_0) + \frac{1}{\sigma_0^2}\sum_{i=1}^{n}(Y_i - \bar{Y})^2$$
$$= 2n\log\left(\frac{\sigma_0}{\tilde{\sigma}}\right) + \left(\frac{1}{\sigma_0^2} - \frac{1}{\tilde{\sigma}^2}\right)n\tilde{\sigma}^2$$
$$= n\left[\frac{\tilde{\sigma}^2}{\sigma_0^2} - 1 - \log\left(\frac{\tilde{\sigma}^2}{\sigma_0^2}\right)\right]$$
This is not as obviously a Chi-squared random variable. It is, as one might expect, a function of $\tilde{\sigma}^2/\sigma_0^2$, the ratio of the maximum likelihood estimator of the variance to the value of $\sigma^2$ under $H_0$. In fact the value of $\Lambda(\sigma_0)$ increases as the quantity $\tilde{\sigma}^2/\sigma_0^2$ gets further away from the value 1 in either direction.
The test proceeds by determining the observed value of $\Lambda(\sigma_0)$,
$$\lambda(\sigma_0) = n\left[\frac{\hat{\sigma}^2}{\sigma_0^2} - 1 - \log\left(\frac{\hat{\sigma}^2}{\sigma_0^2}\right)\right]$$
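The behaviour of $\lambda(\sigma_0)$ described above can be checked numerically. This small Python sketch (with made-up values) confirms that the statistic is zero when $\hat{\sigma}^2 = \sigma_0^2$ and is positive on either side of that point:

```python
import math

def lrt_sigma(n, sigma_hat_sq, sigma0_sq):
    # observed likelihood ratio statistic for H0: sigma = sigma_0,
    # lambda(sigma_0) = n * (r - 1 - log(r)) with r = sigmahat^2 / sigma_0^2
    r = sigma_hat_sq / sigma0_sq
    return n * (r - 1 - math.log(r))

print(lrt_sigma(20, 4.0, 4.0))   # 0.0 when the ratio equals 1
# positive whether the ratio is above or below 1:
print(lrt_sigma(20, 8.0, 4.0), lrt_sigma(20, 2.0, 4.0))
```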
For the Multinomial model the joint probability function is
$$f(y_1, y_2, \ldots, y_k; \theta_1, \ldots, \theta_k) = \frac{n!}{y_1! y_2! \cdots y_k!}\, \theta_1^{y_1}\theta_2^{y_2}\cdots\theta_k^{y_k} \quad \text{for } y_j = 0, 1, \ldots, n \text{ and } \sum_{j=1}^{k} y_j = n$$
or more simply
$$L(\theta) = \prod_{j=1}^{k} \theta_j^{y_j}$$
The p-value is computed as before, where
$$\lambda = 2l(\hat{\theta}) - 2l(\hat{\theta}_0)$$
is the observed value of $\Lambda$.
We will give specific examples of the Multinomial model in Chapter 7.
5.5. CHAPTER 5 SUMMARY
This result is based on the fact that $-2\log R(\theta_0; \mathbf{Y})$ has approximately a $\chi^2(1)$ distribution assuming $H_0: \theta = \theta_0$ is true.
Table 5.2
Hypothesis Tests for Named Distributions based on Asymptotic Gaussian Pivotal Quantities

Model: Binomial$(n, \theta)$, $H_0: \theta = \theta_0$, estimate $\hat{\theta} = y/n$
  Test statistic: $\dfrac{|\tilde{\theta} - \theta_0|}{\sqrt{\theta_0(1-\theta_0)/n}}$ where $\tilde{\theta} = Y/n$
  Approximate p-value: $2P\left(Z \geq \dfrac{|\hat{\theta} - \theta_0|}{\sqrt{\theta_0(1-\theta_0)/n}}\right)$ where $Z \sim G(0, 1)$

Model: Poisson$(\theta)$, $H_0: \theta = \theta_0$, estimate $\hat{\theta} = \bar{y}$
  Test statistic: $\dfrac{|\tilde{\theta} - \theta_0|}{\sqrt{\theta_0/n}}$ where $\tilde{\theta} = \bar{Y}$
  Approximate p-value: $2P\left(Z \geq \dfrac{|\hat{\theta} - \theta_0|}{\sqrt{\theta_0/n}}\right)$ where $Z \sim G(0, 1)$

Model: Exponential$(\theta)$, $H_0: \theta = \theta_0$, estimate $\hat{\theta} = \bar{y}$
  Test statistic: $\dfrac{|\tilde{\theta} - \theta_0|}{\theta_0/\sqrt{n}}$ where $\tilde{\theta} = \bar{Y}$
  Approximate p-value: $2P\left(Z \geq \dfrac{|\hat{\theta} - \theta_0|}{\theta_0/\sqrt{n}}\right)$ where $Z \sim G(0, 1)$
Table 5.3
Hypothesis Tests for Gaussian and Exponential Models

Model: $G(\mu, \sigma)$ with $\sigma$ known; Hypothesis: $H_0: \mu = \mu_0$
  Test statistic: $\dfrac{|\bar{Y} - \mu_0|}{\sigma/\sqrt{n}}$
  Exact p-value: $2P\left(Z \geq \dfrac{|\bar{y} - \mu_0|}{\sigma/\sqrt{n}}\right)$ where $Z \sim G(0, 1)$

Model: $G(\mu, \sigma)$ with $\sigma$ unknown; Hypothesis: $H_0: \mu = \mu_0$
  Test statistic: $\dfrac{|\bar{Y} - \mu_0|}{S/\sqrt{n}}$
  Exact p-value: $2P\left(T \geq \dfrac{|\bar{y} - \mu_0|}{s/\sqrt{n}}\right)$ where $T \sim t(n-1)$

Model: $G(\mu, \sigma)$ with $\mu$ unknown; Hypothesis: $H_0: \sigma = \sigma_0$
  Test statistic: $\dfrac{(n-1)S^2}{\sigma_0^2}$
  Exact p-value: $\min\left[2P\left(W \geq \dfrac{(n-1)s^2}{\sigma_0^2}\right),\ 2P\left(W \leq \dfrac{(n-1)s^2}{\sigma_0^2}\right)\right]$ where $W \sim \chi^2(n-1)$

Model: Exponential$(\theta)$; Hypothesis: $H_0: \theta = \theta_0$
  Test statistic: $\dfrac{2n\bar{Y}}{\theta_0}$
  Exact p-value: $\min\left[2P\left(W \geq \dfrac{2n\bar{y}}{\theta_0}\right),\ 2P\left(W \leq \dfrac{2n\bar{y}}{\theta_0}\right)\right]$ where $W \sim \chi^2(2n)$
Notes:
(1) To find $P(Z \geq d)$ where $Z \sim G(0, 1)$ in R, use 1 - pnorm(d).
(2) To find $P(T \geq d)$ where $T \sim t(n-1)$ in R, use 1 - pt(d, n-1).
(3) To find $P(W \leq d)$ where $W \sim \chi^2(n-1)$ in R, use pchisq(d, n-1).
5.6. CHAPTER 5 PROBLEMS
(a) Let $\theta$ be the probability the woman guesses the card correctly and let $Y$ be the number of correct guesses in $n$ repetitions of the procedure. Discuss why $Y \sim \text{Binomial}(n, \theta)$ would be an appropriate model. If you wanted to test the hypothesis that the woman is guessing at random, what is the appropriate null hypothesis $H_0$ in terms of the parameter $\theta$?
(b) Suppose the woman guessed correctly 8 times in 20 repetitions. Using the test statistic $D = |Y - E(Y)|$, calculate the p-value for your hypothesis $H_0$ in (a) and give a conclusion about whether you think the woman has any special guessing ability.
(c) In a longer sequence of 100 repetitions over two days, the woman guessed correctly 32 times. Using the test statistic $D = |Y - E(Y)|$, calculate the p-value for these data. What would you conclude now?
2. The accident rate over a certain stretch of highway was about $\theta = 10$ per year for a period of several years. In the most recent year, however, the number of accidents was 25. We want to know whether this many accidents is very probable if $\theta = 10$; if not, we might conclude that the accident rate has increased for some reason. Investigate this question by assuming that the number of accidents in the current year follows a Poisson distribution with mean $\theta$ and then testing $H_0: \theta = 10$. Use the test statistic $D = \max(0, Y - 10)$ where $Y$ represents the number of accidents in the most recent year.
3. A hospital lab has just purchased a new instrument for measuring levels of dioxin
(in parts per billion). To calibrate the new instrument, 20 samples of a “standard”
water solution known to contain 45 parts per billion dioxin are measured by the new
instrument. The observed data are given below:
44.1 46.0 46.6 41.3 44.8 47.8 44.5 45.1 42.9 44.5
42.5 41.5 39.6 42.0 45.8 48.9 46.6 42.9 47.0 43.7
For these data
$$\sum_{i=1}^{20} y_i = 888.1 \quad \text{and} \quad \sum_{i=1}^{20} y_i^2 = 39545.03$$
(a) Use a qqplot to check whether a $G(\mu, \sigma)$ model is reasonable for these data.
(b) Describe a suitable study population for this study. To what attributes of interest in the study population do the parameters $\mu$ and $\sigma$ correspond?
(c) Assuming a $G(\mu, \sigma)$ model for these data, test the hypothesis $H_0: \mu = 45$. Determine a 95% confidence interval for $\mu$. What would you conclude about how well the new instrument is working?
(d) The manufacturer of these instruments claims that the variability in measurements is less than two parts per billion. Test the hypothesis $H_0: \sigma = 2$ and determine a 95% confidence interval for $\sigma$. What would you conclude about the manufacturer's claim?
(e) Suppose the hospital lab rechecks the new instrument one week later by taking 25 new measurements on a standard solution of 45 parts per billion dioxin. If the new data give
$$\bar{y} = 44.1 \quad \text{and} \quad s = 2.1$$
what would you conclude about how well the instrument is working now? Explain the difference between a result which is statistically significant and a result which is of practical significance in the context of this study.
(f) Run the following R code, which does the calculations for (c) and (d):
y<-c(44.1,46,46.6,41.3,44.8,47.8,44.5,45.1,42.9,44.5,
42.5,41.5,39.6,42,45.8,48.9,46.6,42.9,47,43.7)
t.test(y,mu=45,conf.level=0.95) # test hypothesis mu=45
# and gives a 95% confidence interval
df<-length(y)-1 # degrees of freedom
s2<-var(y) # sample variance
p<-0.95 # p=0.95 for 95% confidence interval
a<-qchisq((1-p)/2,df) # lower value from Chi-squared dist’n
b<-qchisq((1+p)/2,df) # upper value from Chi-squared dist’n
c(s2*df/b,s2*df/a) # confidence interval for sigma squared
c(sqrt(s2*df/b),sqrt(s2*df/a)) # confidence interval for sigma
sigma0sq<-2^2 # test hypothesis sigma=2 or sigmasq=4
chitest<-s2*df/sigma0sq
q<-pchisq(chitest,df)
min(2*q,2*(1-q)) # p-value for testing sigma=2
with $\theta = 105$.
8. Data on the number of accidents at a busy intersection in Waterloo over the last 5 years indicated that the average number of accidents at the intersection was 3 accidents per week. After the installation of new traffic signals the numbers of accidents per week for a 25-week period were recorded as follows:
4 5 0 4 2 0 1 4 1 3 1 1 2
2 2 1 1 3 2 3 2 0 2 2 3
(a) To decide whether the mean number of accidents at this intersection has changed after the installation of the new traffic signals we wish to test the hypothesis $H_0: \theta = 3$. Why is the discrepancy measure $D = \left|\sum_{i=1}^{25} Y_i - 75\right|$ reasonable? Calculate the exact p-value for testing $H_0: \theta = 3$. What would you conclude?
(b) Justify the following statement:
$$P\left(\frac{\bar{Y} - \theta}{\sqrt{\theta/n}} \leq c\right) \approx P(Z \leq c) \quad \text{where } Z \sim N(0, 1)$$
9. Use the likelihood ratio test statistic to test $H_0: \theta = 3$ for the data in Problem 8. Compare this answer to the answers in 8 (a) and 8 (c).
10. For Chapter 2, Problem 6 (b) test the hypothesis $H_0: \theta = 5$ using the likelihood ratio test statistic. Is this result consistent with the approximate 95% confidence interval for $\theta$ that you found in Chapter 4, Problem 5?
11. For Chapter 2, Problem 8 (b) test the hypothesis $H_0: \theta = 0.1$ using the likelihood ratio test statistic. Is this result consistent with the approximate 95% confidence interval for $\theta$ that you found in Chapter 4, Problem 6?
12. Data from the 2011 Canadian census indicate that 18% of all families in Canada have one child. Suppose the data in Chapter 2, Problem 12 (d) represented 33 children chosen at random from the Waterloo Region. Based on these data, test the hypothesis that the percentage of families with one child in Waterloo Region is the same as the national percentage using the likelihood ratio test statistic. Is this result consistent with the approximate 95% confidence interval for $\theta$ that you found in Chapter 4, Problem 8?
13. A company that produces power systems for personal computers has to demonstrate a high degree of reliability for its systems. Because the systems are very reliable under normal use conditions, it is customary to 'stress' the systems by running them at a considerably higher temperature than they would normally encounter, and to measure the time until the system fails. According to a contract with one personal computer manufacturer, the average time to failure for systems run at 70°C should be no less than 1,000 hours. From one production lot, 20 power systems were put on test and observed until failure at 70°C. The 20 failure times $y_1, y_2, \ldots, y_{20}$ were (in hours):
374.2 544.0 509.4 1113.9 1244.3 551.9 853.2 3391.2 297.0 1501.4
250.2 678.1 379.6 1818.9 1191.1 162.8 332.2 1060.1 63.1 2382.0
Note: $\sum_{i=1}^{20} y_i = 18{,}698.6$. Failure times are assumed to have an Exponential$(\theta)$ distribution.
(a) Check whether the Exponential model is reasonable for these data. (See Example 5.3.2.)
(b) Use a likelihood ratio test to test $H_0: \theta = 1000$ hours. Is there any evidence that the company's power systems do not meet the contracted standard?
14. The R function runif() generates pseudo random Uniform(0; 1) random variables.
The command y<-runif(n) will produce a vector of n values y1 ; y2 ; : : : ; yn .
(a) Suggest a test statistic which could be used to test that the yi ’s, i = 1; 2; : : : ; n
are consistent with a random sample from Uniform(0; 1).
(See: www.dtic.mil/cgi-bin/GetTRDoc?AD=ADA393366)
(b) Generate 1000 yi ’s and carry out the test in (a).
15. The Poisson model is often used to compare rates of occurrence for certain types of events in different geographic regions. For example, consider $K$ regions with populations $P_1, P_2, \ldots, P_K$ and let $\lambda_j$, $j = 1, 2, \ldots, K$ be the annual expected number of events per person for region $j$. By assuming that the number of events $Y_j$ for region $j$ in a given $t$-year period has a Poisson distribution with mean $P_j \lambda_j t$, we can estimate and compare the $\lambda_j$'s or test that they are equal.
(a) Under what conditions might the stated Poisson model be reasonable?
(b) Suppose you observe values $y_1, y_2, \ldots, y_K$ for a given $t$-year period. Describe how to test the hypothesis that $\lambda_1 = \lambda_2 = \cdots = \lambda_K$.
(c) The data below show the numbers of children $y_j$ born with "birth defects" for 5 regions over a given five-year period, along with the total numbers of births $P_j$ for each region. Test the hypothesis that the five rates of birth defects are equal.
16. Using the data from Chapter 2, Problems 10 and 11 and assuming the Poisson model
holds for each dataset, test the hypothesis that the mean number of points per game is
the same for Wayne Gretzky and Sidney Crosby. Hint: See Example 5.4.3. Comment
on whether you think this is a reasonable way to compare these two great hockey
players.
17. Challenge Problem: Likelihood ratio test statistic for Gaussian model, $\mu$ and $\sigma$ unknown. Suppose that $Y_1, Y_2, \ldots, Y_n$ are independent $G(\mu, \sigma)$ observations.
(a) Show that the likelihood ratio test statistic for testing $H_0: \mu = \mu_0$ ($\sigma$ unknown) is given by
$$\Lambda(\mu_0) = n\log\left(1 + \frac{T^2}{n-1}\right)$$
where $T = \sqrt{n}(\bar{Y} - \mu_0)/S$ and $S$ is the sample standard deviation. Note: you will want to use the identity
$$\sum_{i=1}^{n}(Y_i - \mu_0)^2 = \sum_{i=1}^{n}(Y_i - \bar{Y})^2 + n(\bar{Y} - \mu_0)^2$$
(b) Show that the likelihood ratio test statistic for testing $H_0: \sigma = \sigma_0$ ($\mu$ unknown) can be written as $\Lambda(\sigma_0) = U - n\log(U/n) - n$ where
$$U = \frac{(n-1)S^2}{\sigma_0^2}$$
18. Challenge Problem: Likelihood ratio test statistic for comparing two Exponential means. Suppose that $X_1, X_2, \ldots, X_m$ is a random sample from the Exponential$(\theta_1)$ distribution and independently $Y_1, Y_2, \ldots, Y_n$ is a random sample from the Exponential$(\theta_2)$ distribution. Determine the likelihood ratio test statistic for testing $H_0: \theta_1 = \theta_2$.
6. GAUSSIAN RESPONSE MODELS

6.1 Introduction
A response variate $Y$ is one whose distribution has parameters which depend on the value of other variates. For the Gaussian models we have studied so far, we assumed that we had a random sample $Y_1, Y_2, \ldots, Y_n$ from the same Gaussian distribution $G(\mu, \sigma)$. A Gaussian response model generalizes this to permit the parameters of the Gaussian distribution for $Y_i$ to depend on a vector $\mathbf{x}_i$ of covariates (explanatory variates which are measured for the response variate $Y_i$). Gaussian models are by far the most common models used in statistics.
Definition 40. A Gaussian response model is one for which the distribution of the response variate $Y$, given the associated vector of covariates $\mathbf{x} = (x_1, x_2, \ldots, x_k)$ for an individual unit, is of the form
$$Y \sim G(\mu(\mathbf{x}),\ \sigma(\mathbf{x}))$$
In most examples we will assume $\sigma(\mathbf{x}_i) = \sigma$ is constant. This assumption is not necessary but it does make the models easier to analyze. The choice of $\mu(\mathbf{x})$ is guided by past information and by current data from the population or process. The difference between various Gaussian response models lies in the choice of the function $\mu(\mathbf{x})$ and the covariates. We often assume $\mu(\mathbf{x}_i)$ is a linear function of the covariates. These models are called Gaussian linear models and can be written as
$$Y_i \sim G(\mu(\mathbf{x}_i), \sigma) \text{ independently, with } \mu(\mathbf{x}_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik}$$
where $\mathbf{x}_i = (x_{i1}, x_{i2}, \ldots, x_{ik})$ is the vector of known covariates associated with unit $i$ and $\beta_0, \beta_1, \ldots, \beta_k$ are unknown parameters. These models are also referred to as linear regression models¹³, and the $\beta_j$'s are called the regression coefficients. Linear regression models are used in both machine learning and data science.
Here are some examples of settings where Gaussian response models can be used.
In this case there is no formula relating $\mu$ and $\sigma$ to the machines; they are simply different. Notice that an important feature of a machine is the variability of its production so we have, in this case, permitted the two variance parameters to be different.
$$Y_i \sim G(\beta_0 + \beta_1 x_i,\ \sigma) \quad \text{for } i = 1, 2, \ldots, n \text{ independently}$$
¹³ The word regression is a historical term introduced in the 19th century in connection with these models.
where $\beta_0$, $\beta_1$ and $\sigma$ are unknown parameters and $x_1, x_2, \ldots, x_n$ are known constants. Note that, although this model assumes that the mean of the response variate $Y$ depends on the explanatory variate $x$, the model assumes that the standard deviation $\sigma$ does not depend on $x$.
[Figure: scatterplot of Price versus Size]
$$Y_i \sim G(\beta_0 + \beta_1 x_i,\ \sigma) \quad \text{for } i = 1, 2, \ldots, n \text{ independently}$$
where $\beta_0$, $\beta_1$ and $\sigma$ are unknown parameters and $x_1, x_2, \ldots, x_n$ are known constants. The standard deviation $\sigma$ is assumed to be the same for all $x$.
[Figure: scatterplot of Strength versus Diameter]
In this form we can see that $Y_i$ is the sum of a deterministic component, $\mu(x_i)$ (a constant), and a stochastic component, $R_i$ (a random variable).
We now consider estimation and testing procedures for these Gaussian response models.
We begin with models which have no covariates so that the observations are all from the
same Gaussian distribution.
The $G(\mu, \sigma)$ Model
In Chapters 4 and 5 we discussed estimation and testing hypotheses for samples from a Gaussian distribution. Suppose that $Y \sim G(\mu, \sigma)$ models a response variate $y$ in some population or process. A random sample $Y_1, Y_2, \ldots, Y_n$ is selected, and we want to estimate the model parameters and possibly to test hypotheses about them. We can write this model in the form
$$Y_i = \mu + R_i \quad \text{where } R_i \sim G(0, \sigma) \quad (6.2)$$
so this is a special case of the Gaussian response model in which the mean function is constant. The estimator of the parameter $\mu$ that we used is the maximum likelihood estimator $\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i$. This estimator is also a "least squares estimator": $\bar{Y}$ has the property that it is closer to the data than any other constant, in the sense that
$$\min_{\mu} \sum_{i=1}^{n}(Y_i - \mu)^2 = \sum_{i=1}^{n}(Y_i - \bar{Y})^2$$
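This minimizing property is easy to check numerically. The following Python sketch, with made-up data, compares the sum of squares at $\bar{y}$ with a grid of nearby candidate values:

```python
# Numeric check (made-up data) that the sample mean minimizes sum (y_i - mu)^2.
y = [4.1, 5.6, 3.9, 6.2, 5.0]
ybar = sum(y) / len(y)

def ssq(mu):
    # sum of squared deviations of the data from a candidate constant mu
    return sum((yi - mu) ** 2 for yi in y)

# the sum of squares at ybar is no larger than at any nearby candidate value
candidates = [ybar + d / 10 for d in range(-20, 21)]
assert all(ssq(ybar) <= ssq(mu) for mu in candidates)
print(round(ybar, 2))
```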
You should be able to verify this. It will turn out that the methods for estimation, constructing confidence intervals, and tests of hypothesis discussed earlier for the single Gaussian model $G(\mu, \sigma)$ are all special cases of the more general methods derived in Section 6.5.
In the next section we begin with a simple generalization of (6.2) to the case in which the mean is a linear function of a single covariate.
where
$$S_{xx} = \sum_{i=1}^{n}(x_i - \bar{x})^2, \quad S_{yy} = \sum_{i=1}^{n}(y_i - \bar{y})^2, \quad \text{and} \quad S_{xy} = \sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$$
Such estimates are called least squares estimates. To find the least squares estimates we need to solve the two equations
$$\frac{\partial g}{\partial \alpha} = -2\sum_{i=1}^{n}(y_i - \alpha - \beta x_i) = 0 \quad \text{and} \quad \frac{\partial g}{\partial \beta} = -2\sum_{i=1}^{n}(y_i - \alpha - \beta x_i)x_i = 0$$
simultaneously. We note that this is equivalent to solving the maximum likelihood equations (6.4) and (6.5).
In summary, the least squares estimates and the maximum likelihood estimates obtained assuming the model (6.3) are the same. Of course the method of least squares only provides point estimates of the unknown parameters $\alpha$ and $\beta$, while assuming the model (6.3) allows us to obtain both estimates and confidence intervals for the unknown parameters.
Note that the line $y = \hat{\alpha} + \hat{\beta}x$ is often called the fitted regression line for $y$ on $x$ or, more simply, the fitted line.
We now show how to obtain confidence intervals based on the model (6.3).
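As a minimal numerical sketch (Python used as a calculator, made-up $(x, y)$ pairs), the least squares estimates can be computed directly from the $S_{xx}$ and $S_{xy}$ summaries via $\hat{\beta} = S_{xy}/S_{xx}$ and $\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}$:

```python
# Least squares / maximum likelihood estimates for simple linear regression,
# computed from made-up data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

beta_hat = Sxy / Sxx                  # slope estimate
alpha_hat = ybar - beta_hat * xbar    # intercept estimate
print(round(alpha_hat, 3), round(beta_hat, 3))
```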
The maximum likelihood estimator of $\beta$ is
$$\tilde{\beta} = \frac{1}{S_{xx}}\sum_{i=1}^{n} x_i(Y_i - \bar{Y})$$
Since
$$\sum_{i=1}^{n} x_i(Y_i - \bar{Y}) = \sum_{i=1}^{n}(x_i - \bar{x})Y_i$$
we can write
$$\tilde{\beta} = \frac{1}{S_{xx}}\sum_{i=1}^{n}(x_i - \bar{x})Y_i = \sum_{i=1}^{n} a_i Y_i \quad \text{where } a_i = \frac{x_i - \bar{x}}{S_{xx}}$$
which shows that $\tilde{\beta}$ is a linear combination of the Gaussian random variables $Y_i$ and therefore has a Gaussian distribution. To find the mean and variance of $\tilde{\beta}$ we use the identities
$$\sum_{i=1}^{n} a_i = 0, \quad \sum_{i=1}^{n} a_i x_i = 1, \quad \sum_{i=1}^{n} a_i^2 = \frac{1}{S_{xx}}$$
It follows that
$$E(\tilde{\beta}) = \sum_{i=1}^{n} a_i E(Y_i) = \sum_{i=1}^{n} a_i(\alpha + \beta x_i) = \alpha\sum_{i=1}^{n} a_i + \beta\sum_{i=1}^{n} a_i x_i = \beta$$
and
$$Var(\tilde{\beta}) = \sum_{i=1}^{n} a_i^2\, Var(Y_i) \quad \text{since the } Y_i \text{ are independent random variables}$$
$$= \sigma^2\sum_{i=1}^{n} a_i^2 = \frac{\sigma^2}{S_{xx}} \quad \text{since } \sum_{i=1}^{n} a_i^2 = \frac{1}{S_{xx}}$$
In summary,
$$\tilde{\beta} \sim G\left(\beta,\ \frac{\sigma}{\sqrt{S_{xx}}}\right)$$
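The three identities used above hold for any set of $x$ values; a small Python check (made-up values) confirms them:

```python
# Numeric check of the identities: for a_i = (x_i - xbar) / Sxx,
# sum a_i = 0, sum a_i x_i = 1, and sum a_i^2 = 1 / Sxx.
x = [0.5, 1.2, 2.8, 3.3, 4.9, 6.0]
n = len(x)
xbar = sum(x) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
a = [(xi - xbar) / Sxx for xi in x]

print(sum(a))                                   # essentially 0
print(sum(ai * xi for ai, xi in zip(a, x)))     # essentially 1
print(sum(ai ** 2 for ai in a) * Sxx)           # essentially 1
```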
Confidence intervals for $\beta$ are important because the parameter $\beta$ represents the increase in the mean value of $Y$ resulting from an increase of one unit in the value of $x$. As well, if $\beta = 0$ then $x$ has no effect on $Y$ (within this model). Since
$$\frac{\tilde{\beta} - \beta}{\sigma/\sqrt{S_{xx}}} \sim G(0, 1)$$
holds independently of
$$\frac{(n-2)S_e^2}{\sigma^2} \sim \chi^2(n-2) \quad (6.6)$$
then by Theorem 32 it follows that
$$\frac{\tilde{\beta} - \beta}{S_e/\sqrt{S_{xx}}} \sim t(n-2)$$
This pivotal quantity can be used to obtain confidence intervals for $\beta$ and to construct tests of hypotheses about $\beta$.
Using the t table or R we find the constant $a$ such that $P(-a \leq T \leq a) = p$ where $T \sim t(n-2)$. Since
$$p = P(-a \leq T \leq a) = P\left(-a \leq \frac{\tilde{\beta} - \beta}{S_e/\sqrt{S_{xx}}} \leq a\right) = P\left(\tilde{\beta} - aS_e/\sqrt{S_{xx}} \leq \beta \leq \tilde{\beta} + aS_e/\sqrt{S_{xx}}\right)$$
a 100p% confidence interval for $\beta$ is $\hat{\beta} \pm a s_e/\sqrt{S_{xx}}$.
Note that (6.6) can be used to obtain confidence intervals for $\sigma^2$ or $\sigma$. A 100p% confidence interval for $\sigma^2$ is
$$\left[\frac{(n-2)s_e^2}{b},\ \frac{(n-2)s_e^2}{a}\right]$$
and a 100p% confidence interval for $\sigma$ is
$$\left[s_e\sqrt{\frac{n-2}{b}},\ s_e\sqrt{\frac{n-2}{a}}\right]$$
where
$$P(U \leq a) = P(U > b) = \frac{1-p}{2}$$
and $U \sim \chi^2(n-2)$.
If we redefine the covariate as $x - c$ for some constant $c$, then
$$E(Y|x) = \alpha + \beta x = (\alpha + \beta c) + \beta(x - c)$$
Thus, the intercept changes if we redefine $x$, but $\beta$ does not. In the examples we consider here we have kept the given definition of $x_i$, for simplicity.
The maximum likelihood estimator of $\mu(x) = \alpha + \beta x$ is
$$\tilde{\mu}(x) = \tilde{\alpha} + \tilde{\beta}x = \bar{Y} + \tilde{\beta}(x - \bar{x})$$
since $\tilde{\alpha} = \bar{Y} - \tilde{\beta}\bar{x}$. Since
$$\tilde{\beta} = \sum_{i=1}^{n}\frac{(x_i - \bar{x})Y_i}{S_{xx}}$$
we can rewrite $\tilde{\mu}(x)$ as
$$\tilde{\mu}(x) = \bar{Y} + \tilde{\beta}(x - \bar{x}) = \sum_{i=1}^{n} b_i Y_i \quad \text{where } b_i = \frac{1}{n} + \frac{(x - \bar{x})(x_i - \bar{x})}{S_{xx}}$$
6.2. SIMPLE LINEAR REGRESSION
Since $\tilde{\mu}(x)$ is a linear combination of Gaussian random variables it has a Gaussian distribution. To find the mean and variance of $\tilde{\mu}(x)$ we use the identities
$$\sum_{i=1}^{n} b_i = 1, \quad \sum_{i=1}^{n} b_i x_i = x, \quad \text{and} \quad \sum_{i=1}^{n} b_i^2 = \frac{1}{n} + \frac{(x - \bar{x})^2}{S_{xx}}$$
Then
$$E[\tilde{\mu}(x)] = \sum_{i=1}^{n} b_i E(Y_i) = \sum_{i=1}^{n} b_i(\alpha + \beta x_i) = \alpha\sum_{i=1}^{n} b_i + \beta\sum_{i=1}^{n} b_i x_i = \alpha + \beta x = \mu(x)$$
since $\sum_{i=1}^{n} b_i = 1$ and $\sum_{i=1}^{n} b_i x_i = x$, and
$$Var[\tilde{\mu}(x)] = \sum_{i=1}^{n} b_i^2\, Var(Y_i) \quad \text{since the } Y_i \text{ are independent random variables}$$
$$= \sigma^2\sum_{i=1}^{n} b_i^2 = \sigma^2\left[\frac{1}{n} + \frac{(x - \bar{x})^2}{S_{xx}}\right]$$
Note that the variance of $\tilde{\mu}(x)$ is smallest when $x$ is close to $\bar{x}$ (the center of the data) and much larger when $(x - \bar{x})^2$ is large. In summary, we have shown that
$$\tilde{\mu}(x) \sim G\left(\mu(x),\ \sigma\sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{S_{xx}}}\right)$$
Since
$$\frac{\tilde{\mu}(x) - \mu(x)}{\sigma\sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{S_{xx}}}} \sim G(0, 1)$$
holds independently of (6.6), then by Theorem 32 we obtain the pivotal quantity
$$\frac{\tilde{\mu}(x) - \mu(x)}{S_e\sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{S_{xx}}}} \sim t(n-2)$$
which can be used to obtain confidence intervals for $\mu(x)$ in the usual manner. Using the t table or R we find the constant $a$ such that $P(-a \leq T \leq a) = p$ where $T \sim t(n-2)$. Since
$$p = P(-a \leq T \leq a) = P\left(-a \leq \frac{\tilde{\mu}(x) - \mu(x)}{S_e\sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{S_{xx}}}} \leq a\right)$$
$$= P\left(\tilde{\mu}(x) - aS_e\sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{S_{xx}}} \leq \mu(x) \leq \tilde{\mu}(x) + aS_e\sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{S_{xx}}}\right)$$
a 100p% confidence interval for $\mu(x)$ is given by
$$\hat{\mu}(x) \pm a s_e\sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{S_{xx}}} \quad (6.8)$$
where $\hat{\mu}(x) = \hat{\alpha} + \hat{\beta}x$.
Remark. Note that since $\alpha = \mu(0)$, a 95% confidence interval for $\alpha$ is given by (6.8) with $x = 0$, which gives
$$\hat{\alpha} \pm a s_e\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}} \quad (6.9)$$
In fact one can see from (6.9) that if $\bar{x}$ is large in magnitude (which means the average $x_i$ is large), then the confidence interval for $\alpha$ will be very wide. This would be disturbing if the value $x = 0$ is a value of interest, but often it is not.
Since $R$ is independent of $\tilde{\mu}(x)$ (it is not connected to the existing sample), (6.10) is the sum of independent Normally distributed random variables and is consequently Normally distributed, with
$$E[Y - \tilde{\mu}(x)] = 0 \quad \text{and} \quad Var[Y - \tilde{\mu}(x)] = \sigma^2\left[1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{S_{xx}}\right]$$
If we compare (6.8) and (6.12), we observe that the prediction interval is wider than the confidence interval, particularly if $n$ is large. The prediction interval is an interval for a future observation $Y$, which is a random variable, whereas the confidence interval is an interval for the unknown mean $\mu(x) = \alpha + \beta x$. The width of the confidence interval depends on the uncertainty in the estimation of the parameters $\alpha$ and $\beta$, that is, it depends on the variances of the estimators $\tilde{\alpha}$ and $\tilde{\beta}$. The width of the prediction interval depends on the uncertainty in the estimation of the parameters $\alpha$ and $\beta$ as well as the variance $\sigma^2$ of the random variable. In other words, the uncertainty in determining an interval for a random variable $Y$ is greater than the uncertainty in determining an interval for the constant $\mu(x) = \alpha + \beta x$.
For the exam-mark data,
$$n = 65, \quad \bar{x} = 65.06154, \quad \bar{y} = 75.38462$$
$$S_{xx} = 10813.75, \quad S_{xy} = 6869.462, \quad S_{yy} = 8665.385$$
so we find
$$\hat{\beta} = \frac{S_{xy}}{S_{xx}} = 0.6352523 \quad \text{and} \quad \hat{\alpha} = \bar{y} - \hat{\beta}\bar{x} = 34.05413$$
Note that when calculating these values using a calculator you should use as many decimal places as possible, otherwise the values are affected by round-off error. The estimate $\hat{\beta} = 0.6352523$ indicates an increase in average exam mark of 0.6352523 for each one-mark increase in midterm mark $x$.
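These estimates can be reproduced directly from the summary statistics. The Python sketch below also computes $s_e$ via the identity $s_e^2 = (S_{yy} - \hat{\beta}S_{xy})/(n-2)$ used in Table 6.1:

```python
# Reproducing the exam-mark regression estimates from the quoted summaries.
n = 65
xbar, ybar = 65.06154, 75.38462
Sxx, Sxy, Syy = 10813.75, 6869.462, 8665.385

beta_hat = Sxy / Sxx                    # about 0.63525
alpha_hat = ybar - beta_hat * xbar      # about 34.054
# residual sum of squares via the identity Syy - beta_hat * Sxy
se2 = (Syy - beta_hat * Sxy) / (n - 2)
se = se2 ** 0.5                         # about 8.263
print(round(alpha_hat, 3), round(beta_hat, 5), round(se, 3))
```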
Table 6.1
Confidence/Prediction Intervals for Simple Linear Regression Model

Unknown quantity: $\beta$; estimate $\hat{\beta} = S_{xy}/S_{xx}$
  Pivotal quantity: $\dfrac{\tilde{\beta} - \beta}{S_e/\sqrt{S_{xx}}} \sim t(n-2)$ where $\tilde{\beta} = \frac{1}{S_{xx}}\sum_{i=1}^{n}(x_i - \bar{x})Y_i$
  100p% confidence interval: $\hat{\beta} \pm a s_e/\sqrt{S_{xx}}$

Unknown quantity: $\alpha$; estimate $\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}$
  Pivotal quantity: $\dfrac{\tilde{\alpha} - \alpha}{S_e\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}}} \sim t(n-2)$ where $\tilde{\alpha} = \bar{Y} - \tilde{\beta}\bar{x}$
  100p% confidence interval: $\hat{\alpha} \pm a s_e\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}}$

Unknown quantity: $\mu(x) = \alpha + \beta x$; estimate $\hat{\mu}(x) = \hat{\alpha} + \hat{\beta}x$
  Pivotal quantity: $\dfrac{\tilde{\mu}(x) - \mu(x)}{S_e\sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{S_{xx}}}} \sim t(n-2)$ where $\tilde{\mu}(x) = \tilde{\alpha} + \tilde{\beta}x$
  100p% confidence interval: $\hat{\mu}(x) \pm a s_e\sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{S_{xx}}}$

Unknown quantity: $\sigma^2$; estimate $s_e^2 = \frac{1}{n-2}\left(S_{yy} - \hat{\beta}S_{xy}\right)$
  Pivotal quantity: $\dfrac{(n-2)S_e^2}{\sigma^2} \sim \chi^2(n-2)$ where $S_e^2 = \frac{1}{n-2}\sum_{i=1}^{n}(Y_i - \tilde{\alpha} - \tilde{\beta}x_i)^2$
  100p% confidence interval: $\left[\dfrac{(n-2)s_e^2}{b},\ \dfrac{(n-2)s_e^2}{a}\right]$ where $P(U \leq a) = P(U > b) = \frac{1-p}{2}$, $U \sim \chi^2(n-2)$

Future response $Y$ at $x$
  Pivotal quantity: $\dfrac{Y - \tilde{\mu}(x)}{S_e\sqrt{1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{S_{xx}}}} \sim t(n-2)$
  100p% prediction interval: $\hat{\mu}(x) \pm a s_e\sqrt{1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{S_{xx}}}$
Table 6.2
Hypothesis Tests for Simple Linear Regression Model

Hypothesis: $H_0: \beta = \beta_0$
  Test statistic: $\dfrac{|\tilde{\beta} - \beta_0|}{S_e/\sqrt{S_{xx}}}$
  p-value: $2P\left(T \geq \dfrac{|\hat{\beta} - \beta_0|}{s_e/\sqrt{S_{xx}}}\right)$ where $T \sim t(n-2)$

Hypothesis: $H_0: \alpha = \alpha_0$
  Test statistic: $\dfrac{|\tilde{\alpha} - \alpha_0|}{S_e\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}}}$
  p-value: $2P\left(T \geq \dfrac{|\hat{\alpha} - \alpha_0|}{s_e\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}}}\right)$ where $T \sim t(n-2)$

Hypothesis: $H_0: \sigma = \sigma_0$
  Test statistic: $\dfrac{(n-2)S_e^2}{\sigma_0^2}$
  p-value: $\min\left[2P\left(W \geq \dfrac{(n-2)s_e^2}{\sigma_0^2}\right),\ 2P\left(W \leq \dfrac{(n-2)s_e^2}{\sigma_0^2}\right)\right]$ where $W \sim \chi^2(n-2)$
Figure 6.4 shows the scatterplot of the data together with the fitted line, $y = \hat{\alpha} + \hat{\beta}x = 34.05413 + 0.6352523x$. The fitted line passes through the points but we notice that there is quite a bit of variability about the fitted line.
The p-value for testing $H_0: \beta = 0$ is
$$2P\left(T \geq \frac{|\hat{\beta} - 0|}{s_e/\sqrt{S_{xx}}}\right) = 2P\left(T \geq \frac{|0.6352523 - 0|}{8.263079/\sqrt{10813.75}}\right) = 2P(T \geq 7.994522) \approx 0$$
where $T \sim t(63)$. Therefore there is very strong evidence against the hypothesis $H_0: \beta = 0$, the hypothesis of no relationship between exam mark and midterm mark, based on the data, which is consistent with what we see in Figure 6.4.
Figure 6.4: Scatterplot and fitted line for exam mark versus midterm mark
Similarly, the p-value for testing $H_0: \alpha = 0$ is
$$2P\left(T \geq \frac{|\hat{\alpha} - 0|}{s_e\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}}}\right) = 2P\left(T \geq \frac{34.05413}{5.27046}\right) = 2P(T \geq 6.461) \approx 0$$
where $T \sim t(63)$. Therefore there is very strong evidence against the hypothesis $H_0: \alpha = 0$. Note that $\alpha = \mu(0)$ corresponds to a midterm mark of $x = 0$, which is well outside the range of observed midterm marks. In other words, we are assuming the linear relationship holds outside the range of observed $x$ values, which might not be valid. In this example the hypothesis $H_0: \alpha = 0$ is not of particular interest.
These results can also be obtained more easily by using the command summary(lm(y~x)) in R. The table below gives the parts of the output which are of interest to us for this course.
Coefficients:
             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)  34.05413  5.27046     6.461    1.72e-08 ***
x            0.63525   0.07946     7.995    3.65e-11 ***

In this output the Std. Error column gives $s_e\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}}$ for the intercept and $s_e/\sqrt{S_{xx}}$ for x, the t value column gives the ratio of each estimate to its standard error, and Pr(>|t|) gives the corresponding p-value $2P(T \geq |t\text{ value}|)$ where $T \sim t(n-2)$. The Residual standard error is equal to $s_e$, the estimate of $\sigma$. The entry 3.65e-11 *** in the row labeled x in the table indicates that the p-value for testing $H_0: \beta = 0$ is equal to $3.65 \times 10^{-11}$, which is less than 0.001.
A 95% confidence interval for $\beta$ is
$$\hat{\beta} \pm 1.998341\, s_e/\sqrt{S_{xx}} = 0.6352523 \pm 1.998341\,(8.263079)/\sqrt{10813.75} = [0.47646, 0.79404]$$
This interval does not contain any values of $\beta$ close to zero, which is consistent with the fact that the p-value for testing $H_0: \beta = 0$ was approximately zero.
A 95% confidence interval for $\alpha$ is
$$\hat{\alpha} \pm 1.998341\, s_e\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}} = 34.05413 \pm 1.998341\,(8.263079)\sqrt{\frac{1}{65} + \frac{(65.06154)^2}{10813.75}} = [23.52194, 44.58632]$$
This interval does not contain any values of $\alpha$ close to zero, which is consistent with the fact that the p-value for testing $H_0: \alpha = 0$ was approximately zero.
A 95% confidence interval for $\mu(50) = \alpha + \beta(50)$, the mean exam mark for students with a midterm mark of $x = 50$, is
$$\hat{\alpha} + \hat{\beta}(50) \pm 1.998341\, s_e\sqrt{\frac{1}{n} + \frac{(\bar{x} - 50)^2}{S_{xx}}}$$
$$= 65.81674 \pm 1.998341\,(8.263079)\sqrt{\frac{1}{65} + \frac{(65.06154 - 50)^2}{10813.75}} = [62.66799, 68.96549]$$
Note that this is a confidence interval for the mean or average exam mark for students who obtain a midterm mark of $x = 50$. If we want to give an interval of values for an individual student who obtained a midterm mark of $x = 50$ then we should use a prediction interval. A 95% prediction interval is
$$\hat{\alpha} + \hat{\beta}(50) \pm 1.998341\, s_e\sqrt{1 + \frac{1}{n} + \frac{(\bar{x} - 50)^2}{S_{xx}}}$$
$$= 65.81674 \pm 1.998341\,(8.263079)\sqrt{1 + \frac{1}{65} + \frac{(65.06154 - 50)^2}{10813.75}} = [49.00675, 82.62673]$$
As we have indicated before, this interval is much wider than the confidence interval for the mean exam mark. Based on this interval, what advice would you give to a student who obtained a mark of 50 on the midterm?
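Both intervals can be reproduced from the summary statistics. A Python sketch, taking the $t(63)$ quantile 1.998341 quoted above as given:

```python
# Confidence and prediction intervals at x = 50 for the exam-mark data.
n = 65
xbar = 65.06154
Sxx = 10813.75
alpha_hat, beta_hat, se = 34.05413, 0.6352523, 8.263079
a = 1.998341                  # P(-a <= T <= a) = 0.95 for T ~ t(63)

x = 50
mu_hat = alpha_hat + beta_hat * x
# half-widths of the confidence and prediction intervals
ci_half = a * se * (1 / n + (x - xbar) ** 2 / Sxx) ** 0.5
pi_half = a * se * (1 + 1 / n + (x - xbar) ** 2 / Sxx) ** 0.5
print([round(mu_hat - ci_half, 3), round(mu_hat + ci_half, 3)])
print([round(mu_hat - pi_half, 3), round(mu_hat + pi_half, 3)])
```

The prediction interval's extra "1" under the square root is what makes it so much wider than the confidence interval.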
These intervals can also be easily obtained using R. For example, the R commands
confint(lm(y~x),level=0.95)
predict(lm(y~x),data.frame("x"=50),interval="confidence",lev=0.95)
predict(lm(y~x),data.frame("x"=50),interval="prediction",lev=0.95)
give the output
                 2.5 %     97.5 %
(Intercept) 23.5219441 44.5863083
x            0.4764623  0.7940423
       fit      lwr      upr
1 65.81674 62.66799 68.96549
       fit      lwr      upr
1 65.81674 49.00676 82.62672
The values in these tables can be compared to the intervals obtained above.
to the data. For these data we obtain $\hat{\alpha} = 1.6668$, $\hat{\beta} = 2.8378$, $s_e^2 = 0.002656$ (so $s_e = 0.0515$) and $S_{x_1 x_1} = 0.2244$. The fitted regression line is shown on the scatterplot in Figure 6.5. The model appears to fit the data well.
Figure 6.5: Scatterplot plus fitted line ($y = 1.67 + 2.84x$) for strength versus diameter squared
The parameter $\beta$ represents the increase in average strength $\mu(x_1)$ from increasing $x_1 = x^2$ by one unit. Using (6.7) and the fact that $P(T \leq 2.0484) = 0.975$ for $T \sim t(28)$, a 95% confidence interval for $\beta$ is given by
$$\hat{\beta} \pm 2.0484\, s_e/\sqrt{S_{x_1 x_1}} = 2.8378 \pm 0.2228 = [2.6149, 3.0606]$$
The Gaussian response model involves two main assumptions:
(1) The assumption that $Y_i$ (given any covariates $x_i$) is Gaussian with constant standard deviation $\sigma$.
(2) The assumption that $E(Y_i) = \mu(x_i)$ is a linear combination of observed covariates with unknown coefficients.
Models should always be checked. In problems with only one $x$ covariate, a plot of the fitted line superimposed on the scatterplot of the data (as in Figures 6.4 and 6.5) shows pretty clearly how well the model fits. If there are two or more covariates in the model, residual plots, which are described below, are very useful for checking the model assumptions.
Consider the simple linear regression model for which $Y_i \sim G(\mu_i, \sigma)$ where $\mu_i = \alpha + \beta x_i$ and $R_i = Y_i - \mu_i \sim G(0, \sigma)$, $i = 1, 2, \ldots, n$ independently. Residuals are defined as the difference between the observed response and the fitted response, that is, $\hat{r}_i = y_i - \hat{\mu}_i$, $i = 1, 2, \ldots, n$, where $y_i$ is the observed response and $\hat{\mu}_i = \hat{\alpha} + \hat{\beta}x_i$ is the fitted response. The idea behind the $\hat{r}_i$'s is that they can be thought of as "observed" $R_i$'s. This isn't exactly correct since we are using $\hat{\mu}_i$ instead of $\mu_i$, but if the model is correct, then the $\hat{r}_i$'s should behave roughly like a random sample from the $G(0, \sigma)$ distribution. Another reason why the $\hat{r}_i$'s only behave roughly like a random sample from the $G(0, \sigma)$ distribution is because $\sum_{i=1}^{n}\hat{r}_i = 0$. To see this, recall that the maximum likelihood estimate of $\alpha$ is $\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}$, which implies
$$0 = \bar{y} - \hat{\alpha} - \hat{\beta}\bar{x} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{\alpha} - \hat{\beta}x_i\right) = \frac{1}{n}\sum_{i=1}^{n}\hat{r}_i$$
or
$$\sum_{i=1}^{n}\hat{r}_i = 0$$
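This zero-sum property can be confirmed numerically. A Python sketch with made-up data, fitting by least squares and summing the residuals:

```python
# Numeric check (made-up data) that least squares residuals sum to zero.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.3, 2.9, 4.8, 5.1, 6.9]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
beta_hat = Sxy / Sxx
alpha_hat = ybar - beta_hat * xbar

residuals = [yi - alpha_hat - beta_hat * xi for xi, yi in zip(x, y)]
print(abs(sum(residuals)) < 1e-10)   # True up to floating-point error
```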
(1) Plot the points $(x_i, \hat{r}_i)$, $i = 1, 2, \ldots, n$. If the model is satisfactory the points should lie more or less horizontally within a constant band around the line $\hat{r}_i = 0$.
(2) Plot the points $(\hat{\mu}_i, \hat{r}_i)$, $i = 1, 2, \ldots, n$. If the model is satisfactory the points should lie more or less horizontally within a constant band around the line $\hat{r}_i = 0$.
(3) Plot a Normal qqplot of the residuals $\hat{r}_i$. If the model is satisfactory the points should lie more or less along a straight line. (Note that since the $y_i$'s do not all have the same mean, it does not make sense to do a qqplot of the $y_i$'s.)
Figure 6.6: Residual plot for example in which model assumptions hold
Figure 6.6 shows a residual plot in which the points lie reasonably within a horizontal constant band around the line $\hat{r}_i = 0$, which suggests that the model assumptions are reasonable.
Systematic departures from the "expected" pattern suggest problems with the model assumptions. In Figure 6.7, the points do not lie within a constant band around the line $\hat{r}_i = 0$. As $x$ increases the points lie above the line $\hat{r}_i = 0$, then below, and then above. This pattern of points suggests that the mean function $\mu_i = \mu(x_i)$ is not correctly specified. A quadratic form for the mean such as $\mu(x_i) = \alpha + \beta x_i + \gamma x_i^2$ might provide a better fit to the data.
[Figure 6.7: Residual plot showing a systematic pattern in the residuals]
Figure 6.8: Example of residual plot which indicates that the assumption $Var(Y_i) = \sigma^2$ does not hold
Reading these plots requires practice. You should try not to read too much into plots, particularly if the plots are based on a small number of points.
Often we prefer to use standardized residuals
$$\hat{r}_i^* = \frac{\hat{r}_i}{s_e} = \frac{y_i - \hat{\mu}_i}{s_e} = \frac{y_i - \hat{\alpha} - \hat{\beta}x_i}{s_e} \quad \text{for } i = 1, 2, \ldots, n$$
Standardized residuals were used in Figures 6.7 and 6.8. The patterns in the plots are unchanged whether we use $\hat{r}_i$ or $\hat{r}_i^*$; however, the $\hat{r}_i^*$ values tend to lie in the interval $[-3, 3]$. The reason for this is that, since the $\hat{r}_i$'s behave roughly like a random sample from the $G(0, \sigma)$ distribution, the $\hat{r}_i^*$'s should behave roughly like a random sample from the $G(0, 1)$ distribution. Since $P(-3 \leq Z \leq 3) = 0.9973$ where $Z \sim G(0, 1)$, roughly 99.73% of the standardized residuals should lie in the interval $[-3, 3]$.
Figure 6.9: Standardized residuals versus diameter squared for bolt data
A qqplot of the standardized residuals is given in Figure 6.10. There are only 30 points. The points lie reasonably along a straight line, with more variability in the tails, which is expected. The Gaussian assumption seems reasonable based on this small number of points.
[Figure 6.10: Normal qqplot of the standardized residuals for the bolt data]
and obtain the conclusions below as a special case of the linear model. Below we derive the estimates from the likelihood directly.
The likelihood function for $\mu_1$, $\mu_2$, $\sigma$ is
$$L(\mu_1, \mu_2, \sigma) = \prod_{j=1}^{2}\prod_{i=1}^{n_j}\frac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{1}{2\sigma^2}(y_{ji} - \mu_j)^2\right] \quad \text{for } \mu_1 \in \Re,\ \mu_2 \in \Re,\ \sigma > 0$$
The maximum likelihood estimates are
$$\hat{\mu}_1 = \frac{1}{n_1}\sum_{i=1}^{n_1} y_{1i} = \bar{y}_1, \qquad \hat{\mu}_2 = \frac{1}{n_2}\sum_{i=1}^{n_2} y_{2i} = \bar{y}_2$$
$$\text{and} \quad \hat{\sigma}^2 = \frac{1}{n_1 + n_2}\left[\sum_{i=1}^{n_1}(y_{1i} - \bar{y}_1)^2 + \sum_{i=1}^{n_2}(y_{2i} - \bar{y}_2)^2\right]$$
For inference we use the pooled estimate of variance
$$s_p^2 = \frac{1}{n_1 + n_2 - 2}\left[\sum_{i=1}^{n_1}(y_{1i} - \bar{y}_1)^2 + \sum_{i=1}^{n_2}(y_{2i} - \bar{y}_2)^2\right] = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} = \frac{n_1 + n_2}{n_1 + n_2 - 2}\,\hat{\sigma}^2$$
where
$$s_1^2 = \frac{1}{n_1 - 1}\sum_{i=1}^{n_1}(y_{1i} - \bar{y}_1)^2 \quad \text{and} \quad s_2^2 = \frac{1}{n_2 - 1}\sum_{i=1}^{n_2}(y_{2i} - \bar{y}_2)^2$$
are the sample variances obtained from the individual samples. The estimate $s_p^2$ can be written as
$$s_p^2 = \frac{w_1 s_1^2 + w_2 s_2^2}{w_1 + w_2}$$
to show that $s_p^2$ is a weighted average of the sample variances $s_j^2$ with weights equal to $w_j = n_j - 1$. With these weights the sample variance from the larger sample is weighted more heavily. Why does this make sense?
We will use the estimate $s_p^2$ for $\sigma^2$ rather than $\hat{\sigma}^2$ since
$$E(S_p^2) = E\left[\frac{\sum_{i=1}^{n_1}(Y_{1i}-\bar{Y}_1)^2 + \sum_{i=1}^{n_2}(Y_{2i}-\bar{Y}_2)^2}{n_1+n_2-2}\right] = \sigma^2$$
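The identity between the definition of $s_p^2$ and its weighted-average form can be illustrated numerically. The following is a sketch in Python with simulated data (the sample sizes, means, and common $\sigma$ below are our own choices, not from the notes):

```python
import numpy as np

# Simulated two-sample data (illustrative only; means and sigma are our own choices)
rng = np.random.default_rng(1)
y1 = rng.normal(5.0, 2.0, size=15)   # sample 1 from G(5, 2)
y2 = rng.normal(7.0, 2.0, size=25)   # sample 2 from G(7, 2), same sigma
n1, n2 = len(y1), len(y2)
s1sq = y1.var(ddof=1)                # sample variance s_1^2
s2sq = y2.var(ddof=1)                # sample variance s_2^2

# pooled estimate from the definition
sp_sq = (np.sum((y1 - y1.mean())**2) + np.sum((y2 - y2.mean())**2)) / (n1 + n2 - 2)
# the same quantity written as a weighted average of the sample variances
sp_sq_w = ((n1 - 1) * s1sq + (n2 - 1) * s2sq) / (n1 + n2 - 2)
print(np.isclose(sp_sq, sp_sq_w))  # True
```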
To determine whether the two populations differ and by how much we will need to generate confidence intervals for the difference $\mu_1 - \mu_2$. First note that the maximum likelihood estimator of this difference is $\bar{Y}_1 - \bar{Y}_2$ which has expected value
$$E(\bar{Y}_1 - \bar{Y}_2) = \mu_1 - \mu_2$$
and variance
$$Var(\bar{Y}_1 - \bar{Y}_2) = Var(\bar{Y}_1) + Var(\bar{Y}_2) = \frac{\sigma^2}{n_1} + \frac{\sigma^2}{n_2} = \sigma^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)$$
which is estimated by
$$S_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)$$
Theorem 41 If $Y_{11}, Y_{12}, \ldots, Y_{1n_1}$ is a random sample from the $G(\mu_1, \sigma)$ distribution and independently $Y_{21}, Y_{22}, \ldots, Y_{2n_2}$ is a random sample from the $G(\mu_2, \sigma)$ distribution then
$$\frac{(\bar{Y}_1 - \bar{Y}_2) - (\mu_1 - \mu_2)}{S_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \sim t\,(n_1+n_2-2)$$
and
$$\frac{(n_1+n_2-2)S_p^2}{\sigma^2} = \frac{1}{\sigma^2}\sum_{j=1}^{2}\sum_{i=1}^{n_j}(Y_{ji}-\bar{Y}_j)^2 \sim \chi^2\,(n_1+n_2-2)$$
Confidence intervals or tests of hypotheses for $\mu_1 - \mu_2$ and $\sigma$ can be obtained using these pivotal quantities.
In particular a 100p% confidence interval for $\mu_1 - \mu_2$ is
$$\bar{y}_1 - \bar{y}_2 \pm a\, s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}} \tag{6.13}$$
6.3. COMPARISON OF TWO POPULATION MEANS 249
The test statistic for $H_0: \mu_1 - \mu_2 = 0$ is
$$D = \frac{\bar{Y}_1 - \bar{Y}_2 - 0}{S_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} = \frac{\bar{Y}_1 - \bar{Y}_2}{S_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \tag{6.14}$$
with
$$p\text{-value} = P\left(|T| \ge \frac{|\bar{y}_1 - \bar{y}_2 - 0|}{s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}\right) = 2\left[1 - P\left(T \le \frac{|\bar{y}_1 - \bar{y}_2 - 0|}{s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}\right)\right]$$
where
$$P(U \le a) = \frac{1-p}{2}, \qquad P(U \le b) = \frac{1+p}{2}, \qquad \text{and } U \sim \chi^2\,(n_1+n_2-2)$$
so that a 100p% confidence interval for $\sigma^2$ is
$$\left[\frac{(n_1+n_2-2)s_p^2}{b},\ \frac{(n_1+n_2-2)s_p^2}{a}\right]$$
with
$$p\text{-value} = P\left(|Z| \ge \frac{|\bar{y}_1 - \bar{y}_2 - 0|}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}\right) = 2\left[1 - P\left(Z \le \frac{|\bar{y}_1 - \bar{y}_2 - 0|}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}\right)\right]$$
Table 6.3
Confidence Intervals for Two Sample Gaussian Model

Model: $G(\mu_1, \sigma_1)$, $G(\mu_2, \sigma_2)$; $\sigma_1$, $\sigma_2$ known
Parameter: $\mu_1 - \mu_2$
Pivotal quantity: $\dfrac{\bar{Y}_1 - \bar{Y}_2 - (\mu_1 - \mu_2)}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}} \sim G(0,1)$
100p% confidence interval: $\bar{y}_1 - \bar{y}_2 \pm a\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$

Model: $G(\mu_1, \sigma)$, $G(\mu_2, \sigma)$; $\sigma_1 = \sigma_2 = \sigma$ unknown
Parameter: $\mu_1 - \mu_2$
Pivotal quantity: $\dfrac{\bar{Y}_1 - \bar{Y}_2 - (\mu_1 - \mu_2)}{S_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \sim t\,(n_1+n_2-2)$
100p% confidence interval: $\bar{y}_1 - \bar{y}_2 \pm b\, s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$

Model: $G(\mu_1, \sigma_1)$, $G(\mu_2, \sigma_2)$; $\sigma_1 \ne \sigma_2$, $\sigma_1$, $\sigma_2$ unknown
Parameter: $\mu_1 - \mu_2$
Approximate pivotal quantity (asymptotically Gaussian for large $n_1$, $n_2$): $\dfrac{\bar{Y}_1 - \bar{Y}_2 - (\mu_1 - \mu_2)}{\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}}$, approximately $G(0,1)$
Approximate 100p% confidence interval: $\bar{y}_1 - \bar{y}_2 \pm a\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$

Notes:
The value $a$ is given by $P(Z \le a) = \frac{1+p}{2}$ where $Z \sim G(0,1)$.
The value $b$ is given by $P(T \le b) = \frac{1+p}{2}$ where $T \sim t\,(n_1+n_2-2)$.
The values $c$ and $d$ are given by $P(W \le c) = \frac{1-p}{2} = P(W > d)$ where $W \sim \chi^2\,(n_1+n_2-2)$.
Table 6.4
Hypothesis Tests for Two Sample Gaussian Model

Model: $G(\mu_1, \sigma_1)$, $G(\mu_2, \sigma_2)$; $\sigma_1$, $\sigma_2$ known
Hypothesis: $H_0: \mu_1 = \mu_2$
Test statistic: $\dfrac{|\bar{Y}_1 - \bar{Y}_2 - (\mu_1 - \mu_2)|}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$ where $Z \sim G(0,1)$
p-value: $2P\left(Z \ge \dfrac{|\bar{y}_1 - \bar{y}_2 - (\mu_1 - \mu_2)|}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}\right)$

Model: $G(\mu_1, \sigma)$, $G(\mu_2, \sigma)$; $\sigma$ unknown
Hypothesis: $H_0: \mu_1 = \mu_2$
Test statistic: $\dfrac{|\bar{Y}_1 - \bar{Y}_2 - (\mu_1 - \mu_2)|}{S_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$ where $T \sim t\,(n_1+n_2-2)$
p-value: $2P\left(T \ge \dfrac{|\bar{y}_1 - \bar{y}_2 - (\mu_1 - \mu_2)|}{s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}\right)$

Model: $G(\mu_1, \sigma_1)$, $G(\mu_2, \sigma_2)$; $\sigma_1 \ne \sigma_2$, $\sigma_1$, $\sigma_2$ unknown (large $n_1$, $n_2$)
Hypothesis: $H_0: \mu_1 = \mu_2$
Test statistic: $\dfrac{|\bar{Y}_1 - \bar{Y}_2 - (\mu_1 - \mu_2)|}{\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}}$ where $Z \sim G(0,1)$
Approximate p-value: $2P\left(Z \ge \dfrac{|\bar{y}_1 - \bar{y}_2 - (\mu_1 - \mu_2)|}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}\right)$

Under $H_0: \mu_1 = \mu_2$ the value $\mu_1 - \mu_2 = 0$ is substituted in each statistic.
In the case in which $\sigma_1$ and $\sigma_2$ are unknown there is no exact pivotal quantity which can be used. However if we replace the quantities $\sigma_1^2$ and $\sigma_2^2$ in the pivotal quantity (6.15) by their respective estimators $S_1^2$ and $S_2^2$ to obtain the random variable
$$\frac{\bar{Y}_1 - \bar{Y}_2 - (\mu_1 - \mu_2)}{\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}} \tag{6.16}$$
then it can be shown that this asymptotic pivotal quantity has approximately a $G(0,1)$ distribution if $n_1$ and $n_2$ are both large. An approximate 100p% confidence interval for $\mu_1 - \mu_2$ based on this pivotal quantity is
$$\bar{y}_1 - \bar{y}_2 \pm a\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \tag{6.17}$$
where $P(Z \le a) = \frac{1+p}{2}$ and $Z \sim G(0,1)$.
These results are summarized in Tables 6.3 and 6.4.
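As an illustrative sketch (in Python rather than the R used in these notes), the approximate interval (6.17) can be wrapped in a small helper; the function name `approx_ci`, its summary-statistic arguments, and the numbers in the usage line are all our own choices:

```python
from math import sqrt
from scipy.stats import norm

def approx_ci(y1bar, y2bar, s1sq, s2sq, n1, n2, p=0.95):
    """Approximate 100p% CI (6.17) for mu1 - mu2; unequal unknown variances, large n1, n2."""
    a = norm.ppf((1 + p) / 2)              # P(Z <= a) = (1 + p)/2
    half = a * sqrt(s1sq / n1 + s2sq / n2)
    diff = y1bar - y2bar
    return diff - half, diff + half

# hypothetical summary statistics, for illustration only
lo, hi = approx_ci(10.0, 9.0, 1.2, 3.2, 100, 120)
print(round(lo, 3), round(hi, 3))
```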
Paint A: 12.5 11.7 9.9 9.6 10.3 9.6 9.4 11.3 8.7 11.5 10.6 9.7
Paint B: 9.4 11.6 9.7 10.4 6.9 7.3 8.4 7.2 7.0 8.2 12.7 9.2
The objectives of the experiment were to test whether the average reflectivities for paints A and B are the same, and if there is evidence of a difference, to obtain a confidence interval for their difference. (In many problems where two attributes are to be compared we start by testing the hypothesis that they are equal, even if we feel there may be a difference. If there is no statistical evidence of a difference then we stop there.)
To do this it is assumed that, to a close approximation, the reflectivity measurements $Y_{1i}$, $i = 1, 2, \ldots, 12$ for paint A are independent $G(\mu_1, \sigma_1)$ random variables, and independently the measurements $Y_{2i}$, $i = 1, 2, \ldots, 12$ for paint B are independent $G(\mu_2, \sigma_2)$ random variables. We can test $H_0: \mu_1 - \mu_2 = 0$ and get confidence intervals for $\mu_1 - \mu_2$ by using the pivotal quantity
$$\frac{\bar{Y}_1 - \bar{Y}_2 - (\mu_1 - \mu_2)}{S_p\sqrt{\frac{1}{12} + \frac{1}{12}}} \sim t\,(22) \tag{6.18}$$
Using this pivotal quantity means we have assumed that the two population variances are equal, $\sigma_1 = \sigma_2 = \sigma$, and that we are using the estimator $S_p$ for $\sigma$. If the observed sample variances differed by a great deal we would not make this assumption. Unfortunately if the variances are not assumed equal the problem becomes more difficult. The case of unequal variances is discussed in the next section.
From these data we have
$$n_1 = 12, \quad \bar{y}_1 = 10.4, \quad \sum_{i=1}^{12}(y_{1i}-\bar{y}_1)^2 = 14.08, \quad s_1 = 1.1314$$
$$n_2 = 12, \quad \bar{y}_2 = 9.0, \quad \sum_{i=1}^{12}(y_{2i}-\bar{y}_2)^2 = 38.64, \quad s_2 = 1.8742$$
$$s_p = \sqrt{\frac{1}{12+12-2}\left[\sum_{i=1}^{12}(y_{1i}-\bar{y}_1)^2 + \sum_{i=1}^{12}(y_{2i}-\bar{y}_2)^2\right]} = 1.5480$$
The observed value of the test statistic is
$$d = \frac{|\bar{y}_1 - \bar{y}_2 - 0|}{s_p\sqrt{\frac{1}{12} + \frac{1}{12}}} = \frac{1.4}{1.5480\sqrt{\frac{1}{6}}} = 2.215$$
with
$$p\text{-value} = P(|T| \ge 2.215) = 2\left[1 - P(T \le 2.215)\right]$$
where $T \sim t\,(22)$. Since $0.01 < p\text{-value} < 0.05$, there is evidence based on the data against $H_0: \mu_1 = \mu_2$.
Since $\bar{y}_1 > \bar{y}_2$, the indication is that paint A keeps its visibility better. Since $P(T \le 2.074) = 0.975$ where $T \sim t\,(22)$, a 95% confidence interval for $\mu_1 - \mu_2$ based on (6.13) is
$$10.4 - 9.0 \pm 2.074\,(1.5480)\sqrt{\frac{1}{12} + \frac{1}{12}} = 1.4 \pm 1.3107 = [0.089,\ 2.711]$$
This suggests that although the difference in reflectivity (and durability) of the paint is statistically significant, the size of the difference is not really large relative to the sizes of $\mu_1$ and $\mu_2$. This can be seen by noting that $\hat{\mu}_1 = \bar{y}_1 = 10.4$ and $\hat{\mu}_2 = \bar{y}_2 = 9.0$, whereas $\hat{\mu}_1 - \hat{\mu}_2 = 1.4$, so the relative difference is of the order of 10%.
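The paint calculations above can be reproduced numerically. Here is a sketch in Python (the notes use R), assuming numpy and scipy are available:

```python
import numpy as np
from scipy.stats import t

# Paint reflectivity data from the example above
y1 = np.array([12.5, 11.7, 9.9, 9.6, 10.3, 9.6, 9.4, 11.3, 8.7, 11.5, 10.6, 9.7])
y2 = np.array([9.4, 11.6, 9.7, 10.4, 6.9, 7.3, 8.4, 7.2, 7.0, 8.2, 12.7, 9.2])
n1, n2 = len(y1), len(y2)

ss1 = np.sum((y1 - y1.mean())**2)              # 14.08
ss2 = np.sum((y2 - y2.mean())**2)              # 38.64
sp = np.sqrt((ss1 + ss2) / (n1 + n2 - 2))      # pooled estimate, about 1.5480

# observed test statistic and two-sided p-value for H0: mu1 = mu2
d = abs(y1.mean() - y2.mean()) / (sp * np.sqrt(1/n1 + 1/n2))
p_value = 2 * (1 - t.cdf(d, df=n1 + n2 - 2))

# 95% confidence interval (6.13) with the 0.975 quantile of t(22)
b = t.ppf(0.975, df=n1 + n2 - 2)
half = b * sp * np.sqrt(1/n1 + 1/n2)
lo, hi = y1.mean() - y2.mean() - half, y1.mean() - y2.mean() + half
print(round(d, 3), round(lo, 3), round(hi, 3))  # 2.215 0.089 2.711
```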
and the difference $\mu_1 - \mu_2$. However, the heights of related persons are not independent. If we know that one sibling from a family is tall (short) then on average we would expect other siblings in the family to also be tall (short), so heights of siblings are correlated and therefore not independent. The method in the preceding section should not be used to estimate $\mu_1 - \mu_2$ since it would require independent random samples of males and females. In fact, the primary reason for collecting these data was to consider the joint distribution of $(Y_{1i}, Y_{2i})$ and to examine their relationship. A clear picture of the relationship could be obtained by plotting the observed points $(y_{1i}, y_{2i})$ in a scatterplot.
be of much interest to consider E(Y1i ) and E(Y2i ) separately, since there is only a single
observation on each car type for each fuel.
There are two types of Gaussian models which can be used to model paired data. The first involves what is called a Bivariate Normal distribution for $(Y_{1i}, Y_{2i})$, and it could be used in the fuel consumption example. The Bivariate Normal distribution is a continuous bivariate model for which each component has a Normal distribution and the components may be dependent. We will not describe this model here (it is studied in third year courses), except to note one fundamental property: if $(Y_{1i}, Y_{2i})$ has a Bivariate Normal distribution then the difference $Y_{1i} - Y_{2i}$ is also Normally distributed, where $\sigma^2 = Var(Y_{1i}) + Var(Y_{2i}) - 2Cov(Y_{1i}, Y_{2i})$. Thus, if we are interested in making inferences about $\mu_1 - \mu_2$ then we can do this by analyzing the within-pair differences $Y_i = Y_{1i} - Y_{2i}$ and using the model
$$Y_i = Y_{1i} - Y_{2i} \sim N(\mu_1 - \mu_2,\ \sigma^2) \quad i = 1, 2, \ldots, n \text{ independently}$$
or equivalently
$$Y_i \sim G(\mu, \sigma) \quad i = 1, 2, \ldots, n \text{ independently} \tag{6.19}$$
where $\mu = \mu_1 - \mu_2$. The methods for a $G(\mu, \sigma)$ model discussed in Sections 4.7 and 5.2 can then be used to estimate and test hypotheses about the parameters $\mu$ and $\sigma$.
The second Gaussian model used with paired data assumes
$$Y_{1i} \sim G(\mu_1 + \alpha_i,\ \sigma_1) \quad \text{and} \quad Y_{2i} \sim G(\mu_2 + \alpha_i,\ \sigma_2) \quad \text{independently}$$
where the $\alpha_i$'s are unknown constants. The $\alpha_i$'s represent factors specific to the different pairs so that some pairs can have larger (smaller) expected values than others. This model also gives a Gaussian distribution like (6.19), since $Y_{1i} - Y_{2i}$ has a Gaussian distribution with
$$E(Y_{1i} - Y_{2i}) = \mu_1 - \mu_2 = \mu$$
Such a model might be reasonable for Example 6.3.4, where $\alpha_i$ refers to the $i$'th car type.
Thus, whenever we encounter paired data in which the random variables $Y_{1i}$ and $Y_{2i}$ are adequately modeled by Gaussian distributions, we will make inferences about $\mu_1 - \mu_2$ by working with the model (6.19).
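A paired analysis of this kind reduces to a one-sample analysis of the differences. The following Python sketch (the helper name `paired_ci` and the toy data are our own, not from the notes) illustrates the computation:

```python
import numpy as np
from scipy.stats import t

def paired_ci(y1, y2, p=0.95):
    """100p% CI for mu = mu1 - mu2 based on the within-pair differences (model 6.19)."""
    d = np.asarray(y1) - np.asarray(y2)
    n = len(d)
    b = t.ppf((1 + p) / 2, df=n - 1)      # P(T <= b) = (1 + p)/2, T ~ t(n - 1)
    half = b * d.std(ddof=1) / np.sqrt(n)
    return d.mean() - half, d.mean() + half

# toy data: every within-pair difference equals 1, so the interval collapses to [1, 1]
lo, hi = paired_ci([2.0, 3.0, 4.0], [1.0, 2.0, 3.0])
print(lo == hi == 1.0)  # True
```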
$$\bar{y} = 4.895 \text{ inches}$$
and
$$s^2 = \frac{1}{1400}\sum_{i=1}^{1401}(y_i - \bar{y})^2 = 6.5480 \text{ (inches)}^2$$
Using the pivotal quantity
$$\frac{\bar{Y} - \mu}{S/\sqrt{n}} \sim t\,(1400)$$
a 95% confidence interval for $\mu = E(Y_i)$ is given by
$$\bar{y} \pm 1.96\, s/\sqrt{n} = 4.895 \pm 1.96\sqrt{6.5480/1401} = 4.895 \pm 0.134 = [4.76,\ 5.03]$$
Note that $t\,(1400)$ is indistinguishable from $G(0,1)$ so we use the value 1.96 from the $G(0,1)$ distribution.
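This interval can be verified from the summary statistics alone; a Python sketch (assuming scipy is available):

```python
from math import sqrt
from scipy.stats import t

# Summary statistics for the 1401 (brother, sister) height differences
n, ybar, s_sq = 1401, 4.895, 6.5480
b = t.ppf(0.975, df=n - 1)            # close to 1.96 since t(1400) is nearly G(0,1)
half = b * sqrt(s_sq / n)
print(round(ybar - half, 2), round(ybar + half, 2))  # 4.76 5.03
```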
Remark The method above assumes that the (brother, sister) pairs are a random sample from the population of families with a living adult brother and sister. The question arises as to whether $E(Y_i)$ also represents the difference in the average heights of all adult males and all adult females (call them $\mu_1'$ and $\mu_2'$) in the population. If $\mu_1' = \mu_1$ (that is, the average height of all adult males equals the average height of all adult males who also have an adult sister) and similarly $\mu_2' = \mu_2$, then $E(Y_i)$ does represent this difference.
We expect the heights of siblings in the same family to be positively correlated since they share parents, and this reduces $Var(\bar{Y}_1 - \bar{Y}_2)$. Therefore if we can collect a sample of pairs $(Y_{1i}, Y_{2i})$, this is better than two independent random samples (one of $Y_{1i}$'s and one of $Y_{2i}$'s) for estimating $\mu_1 - \mu_2$. Note on the other hand that if $\sigma_{12} < 0$, then pairing is a bad idea since it increases the value of $Var(\bar{Y}_1 - \bar{Y}_2)$.
Table 6.5 shows the cholesterol levels (in millimoles per litre) for each subject, measured at the end of each 6 week period. We let the random variables $Y_{1i}$, $Y_{2i}$ represent the cholesterol levels for subject $i$ on the high fibre and low fibre diets, respectively. We'll also assume that the differences can be modeled using (6.19). The observed differences $y_i$, shown in Table 6.5, give $\bar{y} = 0.020$ and $s = 0.411$. Since
6.4. GENERAL GAUSSIAN RESPONSE MODELS 259
This confidence interval includes $\mu_1 - \mu_2 = 0$, and there is clearly no evidence that the high fibre diet gives a lower cholesterol level, at least in the time frame represented in this study.
Remark The results here can be obtained using the R function t.test.
Exercise Compute the p-value for the test of hypothesis $H_0: \mu_1 - \mu_2 = 0$, using the test statistic (5.1).
Final Remarks When you see data from a comparative study (that is, one whose objective is to compare two distributions, often through their means), you have to determine whether it involves paired data or not. Of course, a sample of $Y_{1i}$'s and $Y_{2i}$'s cannot be from a paired study unless there are equal numbers of each, but if there are equal numbers the study might be either "paired" or "unpaired". Note also that there is a subtle difference in the study populations in paired and unpaired studies. In the former it is pairs of individual units that form the population whereas in the latter there are (conceptually at least) separate individual units for $Y_1$ and $Y_2$ measurements.
$$Y_i \sim G(\mu_i, \sigma) \quad \text{with} \quad \mu(x_i) = \sum_{j=1}^{k}\beta_j x_{ij} \quad \text{for } i = 1, 2, \ldots, n \text{ independently}$$
(Note: To facilitate the matrix proof below we have taken $\beta_0 = 0$ in (6.1). The estimator of $\beta_0$ can be obtained from the result below by letting $x_{i1} = 1$ for $i = 1, 2, \ldots, n$ and $\beta_0 = \beta_1$.)
For convenience we define the $n \times k$ (where $n > k$) matrix $X = [x_{ij}]$ of covariate values and the $n \times 1$ vector of responses $Y_{n \times 1} = (Y_1, Y_2, \ldots, Y_n)^T$. We assume that the values $x_{ij}$ are non-random quantities which we observe. We now summarize some results about the maximum likelihood estimators of the parameters $\beta = (\beta_1, \beta_2, \ldots, \beta_k)^T$ and $\sigma$.
Maximum Likelihood Estimators of $\beta = (\beta_1, \beta_2, \ldots, \beta_k)^T$ and of $\sigma$
$$\tilde{\beta} = (X^T X)^{-1} X^T Y \tag{6.20}$$
and
$$\tilde{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(Y_i - \tilde{\mu}_i)^2 \quad \text{where } \tilde{\mu}_i = \sum_{j=1}^{k}\tilde{\beta}_j x_{ij} \tag{6.21}$$
The log likelihood is
$$l(\beta, \sigma) = \log L(\beta, \sigma) = -n\log\sigma - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \mu_i)^2$$
Note that if we take the derivative with respect to a particular $\beta_j$ and set this derivative equal to 0, we obtain
$$\frac{\partial l}{\partial \beta_j} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(y_i - \mu_i)\frac{\partial \mu_i}{\partial \beta_j} = 0$$
or, since $\partial \mu_i / \partial \beta_j = x_{ij}$,
$$\sum_{i=1}^{n}(y_i - \mu_i)\,x_{ij} = 0$$
In matrix notation these $k$ equations are
$$X^T(y - X\beta) = 0 \quad \text{or} \quad X^T y = X^T X\beta$$
Assuming that the $k \times k$ matrix $X^T X$ has an inverse we can solve these equations to obtain the maximum likelihood estimate of $\beta$, in matrix notation, as
$$\hat{\beta} = (X^T X)^{-1} X^T y$$
with corresponding maximum likelihood estimator
$$\tilde{\beta} = (X^T X)^{-1} X^T Y$$
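The normal equations can be illustrated numerically. Here is a NumPy sketch with simulated data (the dimensions and coefficients are our own choices, not from the notes), checking that solving $X^T X \beta = X^T y$ agrees with a built-in least squares routine:

```python
import numpy as np

# Simulated design matrix and response (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                    # n = 50 observations, k = 3 covariates
beta = np.array([1.0, -2.0, 0.5])
y = X @ beta + rng.normal(scale=0.1, size=50)

# solve the normal equations X^T X beta_hat = X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
# agrees with the built-in least squares solver
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_hat, beta_ls))  # True
```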
In order to find the maximum likelihood estimator of $\sigma$, we take the derivative with respect to $\sigma$ and set the derivative equal to zero to obtain
$$\frac{\partial l}{\partial \sigma} = \frac{\partial}{\partial \sigma}\left[-n\log\sigma - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \mu_i)^2\right] = 0$$
or
$$-\frac{n}{\sigma} + \frac{1}{\sigma^3}\sum_{i=1}^{n}(y_i - \mu_i)^2 = 0$$
Recall that when we estimated the variance for a single sample from the Gaussian distribution we considered a minor adjustment to the denominator, and with this in mind we also define the following estimator$^{14}$ of the variance $\sigma^2$:
$$S_e^2 = \frac{1}{n-k}\sum_{i=1}^{n}(Y_i - \tilde{\mu}_i)^2 = \frac{n}{n-k}\,\tilde{\sigma}^2$$
Note that for large $n$ there will be small differences between the observed values of $\tilde{\sigma}^2$ and $S_e^2$.
Theorem 43 1. The estimators $\tilde{\beta}_j$ are all Normally distributed random variables with expected value $\beta_j$ and with variance given by the $j$'th diagonal element of the matrix $\sigma^2(X^T X)^{-1}$, $j = 1, 2, \ldots, k$.
Proof. The estimator $\tilde{\beta}_j$ can be written using (6.20) as a linear combination of the Normal random variables $Y_i$,
$$\tilde{\beta}_j = \sum_{i=1}^{n} b_{ji} Y_i$$
$^{14}$It is clear why we needed to assume $k < n$: otherwise $n - k \le 0$ and we have no "degrees of freedom" left for estimating the variance.
so that
$$E(\tilde{\beta}_j) = \sum_{i=1}^{n} b_{ji}\, E(Y_i) = \sum_{i=1}^{n} b_{ji}\,\mu_i \quad \text{where } \mu_i = \sum_{l=1}^{k}\beta_l\, x_{il}$$
Note that $\mu_i = \sum_{l=1}^{k}\beta_l\, x_{il}$ is the $i$'th component of the vector $X\beta$, which implies that $E(\tilde{\beta}_j)$ is the $j$'th component of the vector $BX\beta$, where $B = (X^T X)^{-1}X^T$ has rows $(b_{j1}, \ldots, b_{jn})$. But since $BX$ is the identity matrix, this is the $j$'th component of the vector $\beta$, or $\beta_j$. Thus $E(\tilde{\beta}_j) = \beta_j$ for all $j$. The calculation of the variance is similar.
$$Var(\tilde{\beta}_j) = \sum_{i=1}^{n} b_{ji}^2\, Var(Y_i) = \sigma^2\sum_{i=1}^{n} b_{ji}^2$$
and an easy matrix calculation will show, since $BB^T = (X^T X)^{-1}$, that $\sum_{i=1}^{n} b_{ji}^2$ is the $j$'th diagonal element of the matrix $(X^T X)^{-1}$. We will not attempt to prove part (3) here, which is usually proved in a subsequent statistics course.
Remark The maximum likelihood estimate $\hat{\beta}$ is also called a least squares estimate of $\beta$ in that it is obtained by taking the sum of squared vertical distances between the observations $y_i$ and the corresponding fitted values $\hat{\mu}_i$ and then adjusting the values of the estimated $\beta_j$ until this sum is minimized. Least squares is a method of estimation in linear models that predates the method of maximum likelihood. Problem 16 describes the method of least squares.
Remark$^{15}$ From Theorem 32 we can obtain confidence intervals and test hypotheses for the regression coefficients using the pivotal quantity
$$\frac{\tilde{\beta}_j - \beta_j}{S_e\sqrt{c_j}} \sim t\,(n-k) \tag{6.23}$$
where $c_j$ is the $j$'th diagonal element of the matrix $(X^T X)^{-1}$.
$^{15}$Recall: If $Z \sim G(0,1)$ and $W \sim \chi^2(m)$ then the random variable $T = Z/\sqrt{W/m} \sim t\,(m)$. Let $Z = \dfrac{\tilde{\beta}_j - \beta_j}{\sigma\sqrt{c_j}}$, $W = \dfrac{(n-k)S_e^2}{\sigma^2}$ and $m = n - k$ to obtain this result.
to obtain
$$\hat{\beta}_j - a\, s_e\sqrt{c_j} \le \beta_j \le \hat{\beta}_j + a\, s_e\sqrt{c_j}$$
where
$$s_e^2 = \frac{1}{n-k}\sum_{i=1}^{n}(y_i - \hat{\mu}_i)^2 \quad \text{and} \quad \hat{\mu}_i = \sum_{j=1}^{k}\hat{\beta}_j x_{ij}$$
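Putting the pieces together, the interval for a coefficient $\beta_j$ uses $\hat{\beta}$, $s_e^2$, and $c_j$. The following Python sketch works through the computation on simulated data (the seed, sample size, and true coefficients are all our own choices):

```python
import numpy as np
from scipy.stats import t

# Simulated data for a straight-line model (illustrative only)
rng = np.random.default_rng(2)
n, k = 40, 2
X = np.column_stack([np.ones(n), rng.uniform(0, 10, n)])  # column of 1's plus one covariate
y = X @ np.array([2.0, 0.7]) + rng.normal(scale=1.0, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # least squares / ML estimate
resid = y - X @ beta_hat
se_sq = resid @ resid / (n - k)                # s_e^2 with n - k in the denominator
C = np.linalg.inv(X.T @ X)                     # c_j is the j'th diagonal element of C

a = t.ppf(0.975, df=n - k)                     # P(T <= a) = 0.975, T ~ t(n - k)
j = 1                                          # interval for the slope coefficient
half = a * np.sqrt(se_sq * C[j, j])
lo, hi = beta_hat[j] - half, beta_hat[j] + half
print(round(lo, 3), round(hi, 3))
```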
We now consider a special case of the Gaussian response models. We have already
seen this case in Chapter 4, but it provides a simple example to validate the more general
formulae.
$$\frac{(n-1)S^2}{\sigma^2}$$
$$\sum_{i=1}^{n} b_i = 1, \quad \sum_{i=1}^{n} b_i x_i = x \quad \text{and} \quad \sum_{i=1}^{n} b_i^2 = \frac{1}{n} + \frac{(x-\bar{x})^2}{S_{xx}}$$
where
$$b_i = \frac{1}{n} + \frac{(x-\bar{x})(x_i-\bar{x})}{S_{xx}}$$
2. Solve the three equations
$$\frac{\partial l}{\partial \alpha} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(y_i - \alpha - \beta x_i) = 0$$
$$\frac{\partial l}{\partial \beta} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(y_i - \alpha - \beta x_i)\,x_i = 0$$
$$\frac{\partial l}{\partial \sigma} = -\frac{n}{\sigma} + \frac{1}{\sigma^3}\sum_{i=1}^{n}(y_i - \alpha - \beta x_i)^2 = 0$$
3. Twenty-five female nurses working at a large hospital were selected at random and their age ($x$) and systolic blood pressure ($y$) were recorded. The data are:
x y x y x y x y x y
46 136 37 115 58 139 48 134 59 142
36 132 45 129 50 156 35 120 54 135
62 138 39 127 41 132 42 137 57 150
26 115 28 134 31 115 27 120 60 159
53 143 32 133 51 143 34 128 38 127
$$\bar{x} = 43.20 \quad \bar{y} = 133.56$$
$$S_{xx} = 2802.00 \quad S_{yy} = 3284.16 \quad S_{xy} = 2325.20$$
To analyze these data assume the simple linear regression model $Y_i \sim G(\alpha + \beta x_i, \sigma)$, $i = 1, 2, \ldots, 25$ independently, where the $x_i$'s are known constants.
6.5. CHAPTER 6 PROBLEMS 265
(a) Determine the maximum likelihood (least squares) estimates of $\alpha$ and $\beta$ and an unbiased estimate of $\sigma^2$.
(b) Use the plots discussed in Section 6.2 to check the adequacy of the model.
(c) Construct a 95% confidence interval for $\beta$.
(d) Construct a 90% confidence interval for the mean systolic blood pressure of nurses aged $x = 35$.
(e) Construct a 99% prediction interval for the systolic blood pressure $Y$ of a nurse aged $x = 50$.
4. This problem is designed to cover concepts in this chapter as well as previous chapters. The data below are the STAT 230 final grades ($x$) and STAT 231 final grades ($y$) for 30 students chosen at random from the group of students enrolled in STAT 231 in Winter 2013. The data are available in the file statgradedata.txt posted on the course website.
x y x y x y x y x y x y
76 76 60 60 87 76 65 69 83 83 94 94
77 79 81 85 71 50 71 43 88 88 83 83
57 54 86 82 63 75 66 60 52 52 51 37
75 64 96 88 77 72 90 96 75 75 77 90
74 64 79 72 96 84 50 50 99 99 77 67
$$\bar{x} = 76.73 \quad \bar{y} = 72.23$$
$$S_{xx} = 5135.86 \quad S_{yy} = 7585.36 \quad S_{xy} = 5106.86$$
amount grossed by movies for actors whose value is x = 100. What assumption
is being made in constructing the interval for x = 100?
6. Consider the price versus size of commercial buildings in Example 6.1.2. For these data
$$n = 30 \quad \bar{x} = 0.9543 \quad \bar{y} = 548.9700$$
$$S_{xx} = 22.9453 \quad S_{xy} = 3316.6771 \quad S_{yy} = 489{,}624.723$$
(d) These data were used to decide a fair assessment value for a large building of size $x = 4.47$ ($\times 10^5$ m$^2$). Determine a 95% confidence interval for the mean price of a building of this size.
(e) Determine a 95% prediction interval for a building of size $x = 4.47$ ($\times 10^5$ m$^2$).
(f) If you were an assessor deciding the fair assessment for a building of size $x = 4.47$ ($\times 10^5$ m$^2$), would you use the interval in (d) or (e)?
(a) Construct a 95% confidence interval for the mean breaking strength of bolts of diameter $x = 0.35$, that is, $x_1 = (0.35)^2 = 0.1225$.
(b) Construct a 95% prediction interval for the breaking strength $Y$ of a single bolt of diameter $x = 0.35$. Compare this with the interval in (a).
(c) Suppose that a bolt of diameter 0.35 is exposed to a large force $V$ that could potentially break it. In structural reliability and safety calculations, $V$ is treated as a random variable and if $Y$ represents the breaking strength of the bolt (or some other part of a structure), then the probability of a "failure" of the bolt is $P(V > Y)$. Give a point estimate of this value if $V \sim G(1.60, 0.10)$, where $V$ and $Y$ are independent.
8. There are often both expensive (and highly accurate) and cheaper (and less accurate)
ways of measuring concentrations of various substances (e.g. glucose in human blood,
salt in a can of soup). The table below gives the actual concentration x (determined
by an expensive but very accurate procedure) and the measured concentration y
x y x y x y x y
4.01 3.7 13.81 13.02 24.85 24.69 36.9 37.54
6.24 6.26 15.9 16 28.51 27.88 37.26 37.2
8.12 7.8 17.23 17.27 30.92 30.8 38.94 38.4
9.43 9.78 20.24 19.9 31.44 31.03 39.62 40.03
12.53 12.4 24.81 24.9 33.22 33.01 40.15 39.4
$$\bar{x} = 23.7065 \quad \bar{y} = 23.5505$$
$$S_{xx} = 2818.946855 \quad S_{yy} = 2820.862295 \quad S_{xy} = 2818.556835$$
The data are available in the file expensivevscheapdata.txt posted on the course website. To analyze these data assume the regression model $Y_i \sim G(\alpha + \beta x_i, \sigma)$, $i = 1, 2, \ldots, 20$ independently.
(a) Fit the model to these data. Use the plots discussed in Section 6.2 to check the adequacy of the model.
(b) Construct a 95% confidence interval for the slope $\beta$ and test the hypothesis $\beta = 1$. Construct a 95% confidence interval for the intercept $\alpha$ and test the hypothesis $\alpha = 0$. Why are these hypotheses of interest?
(c) Describe briefly how you would characterize the cheap measurement process's accuracy to a lay person.
(d) If the units to be measured have true concentrations in the range 0 to 40, do you think that the cheap method tends to produce a value that is lower than the true concentration? Support your answer based on the data and the assumed model.
is the maximum likelihood estimate of $\beta$ and also the least squares estimate of $\beta$.
(b) Show that
$$\tilde{\beta} = \frac{\sum_{i=1}^{n} x_i Y_i}{\sum_{i=1}^{n} x_i^2} \sim N\left(\beta,\ \frac{\sigma^2}{\sum_{i=1}^{n} x_i^2}\right)$$
Hint: Write $\tilde{\beta}$ in the form $\sum_{i=1}^{n} a_i Y_i$.
(c) Prove the identity
$$\sum_{i=1}^{n}\left(y_i - \hat{\beta} x_i\right)^2 = \sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} x_i y_i\right)^2 \Big/ \sum_{i=1}^{n} x_i^2$$
and define
$$s_e^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(y_i - \hat{\beta} x_i\right)^2$$
Confidence intervals for $\beta$ can be based on the pivotal quantity
$$\frac{\tilde{\beta} - \beta}{S_e\Big/\sqrt{\sum_{i=1}^{n} x_i^2}} \sim t\,(n-1)$$
and the hypothesis $\beta = \beta_0$ can be tested using the statistic
$$\frac{\tilde{\beta} - \beta_0}{S_e\Big/\sqrt{\sum_{i=1}^{n} x_i^2}}$$
(c) Construct a 95% confidence interval for the slope $\beta$ and test the hypothesis $\beta = 1$.
(d) Using the results of this analysis as well as the analysis in Problem 8, what would you conclude about using the model $Y_i \sim G(\alpha + \beta x_i, \sigma)$ versus the simpler model $Y_i \sim G(\beta x_i, \sigma)$ for these data?
11. The following data were recorded concerning the relationship between drinking
(x = per capita wine consumption) and y = death rate from cirrhosis of the liver in
n = 46 states of the U.S.A. (for simplicity the data have been rounded).
x y x y x y x y x y x y
5 41 12 77 7 67 4 52 7 41 16 91
4 32 7 57 18 57 16 87 13 67 2 30
3 39 14 81 6 38 9 67 8 48 6 28
7 58 12 34 31 130 6 40 28 123 3 52
11 75 10 53 13 70 6 56 23 92 8 56
9 60 10 55 20 104 21 58 22 76 13 56
6 54 14 58 19 84 15 74 23 98
3 48 9 63 10 66 17 98 7 34
$$\bar{x} = 11.5870 \quad \bar{y} = 63.5870$$
$$S_{xx} = 2155.1522 \quad S_{yy} = 24801.1521 \quad S_{xy} = 6175.1522$$
The data are available in the …le liverdata.txt posted on the course website.
12. Skinfold body measurements are used to approximate the body density of individuals. The data on n = 92 men, aged 20-25, where x = skinfold measurement and Y = body density, are available in the file SkinfoldData.txt posted on the course website.
Note: The R function lm, with the command lm(y~x), gives the calculations for linear regression. The command summary(lm(y~x)) gives a summary of the calculations.
(a) Run the given R code. What is the equation of the fitted line?
(b) What is the value of the test statistic and the p-value for the hypothesis of no relationship? What would you conclude?
(c) Give an estimate of $\sigma$.
(d) What do the plots indicate about the fit of the model?
(e) What is a 95% confidence interval for $\beta$?
(f) What is a 90% confidence interval for the mean body density of males with a skinfold measurement of 2?
(g) What is a 99% prediction interval for the body density of a male with skinfold measurement of 1.8?
(h) What is a 95% confidence interval for $\sigma$?
(i) Do you think that the skinfold measurements provide a reasonable approximation to body density measurements?
13. The following data, collected by the British botanist Joseph Hooker in the Himalaya Mountains between 1848 and 1850, relate atmospheric pressure to the boiling point of water. Hooker wanted to estimate altitude above sea level from measurements of the boiling point of water. He knew that the altitude could be determined from the atmospheric pressure, measured with a barometer, with lower pressures corresponding to higher altitudes. His interest in the above modelling problem was motivated by the difficulty of transporting the fragile barometers of the 1840's. Measuring the boiling point would give travelers a quick way to estimate elevation, using both the known relationship between elevation and atmospheric pressure, and the model relating atmospheric pressure to the boiling point of water. The data in the table below are also available in the file boilingpointdata.txt on the course website.
(a) Let y = atmospheric pressure (in Hg) and x = boiling point of water (in F). Fit a simple linear regression model to the data $(x_i, y_i)$, $i = 1, 2, \ldots, 31$. Prepare a scatterplot of y versus x and draw on the fitted line. Plot the standardized residuals versus x. How well does the model fit these data?
(b) Let z = log y. Fit a simple linear regression model to the data $(x_i, z_i)$, $i = 1, 2, \ldots, 31$. Prepare a scatterplot of z versus x and draw on the fitted line. Plot the standardized residuals versus x. How well does the model fit these data?
(c) Based on the results in (a) and (b), which data are best fit by a linear model? Does this confirm the theory's model?
(d) Obtain a 95% confidence interval for the mean atmospheric pressure if the boiling
42 43 55 26 62 37 33 41 19 54 20 85
Control Group:
46 10 17 60 53 42 37 42 55 28 48
The data are available in the file treatmentvscontroldata.txt posted on the course website.
Let $y_{1j}$ = the DRP test score for the treatment group, $j = 1, 2, \ldots, 21$.
Let $y_{2j}$ = the DRP test score for the control group, $j = 1, 2, \ldots, 23$. For these data
$$\bar{y}_1 = 51.4762 \quad \sum_{j=1}^{21}(y_{1j}-\bar{y}_1)^2 = 2423.2381$$
$$\bar{y}_2 = 41.5217 \quad \sum_{j=1}^{23}(y_{2j}-\bar{y}_2)^2 = 6469.7391$$
$$Y_{1j} \sim G(\mu_1, \sigma), \quad j = 1, 2, \ldots, 21 \text{ independently}$$
$$Y_{2j} \sim G(\mu_2, \sigma), \quad j = 1, 2, \ldots, 23 \text{ independently}$$
(d) Test the hypothesis of no difference between the means, that is, test the hypothesis $H_0: \mu_1 = \mu_2$. What conclusion should the educator make based on these data? Be sure to indicate any limitations to these conclusions.
(e) Here is the R code for doing this analysis
#Import dataset treatmentvscontroldata.txt in folder S231Datasets
y<-treatmentvscontroldata$DRP
y1<-y[seq(1,21,1)] # data for Treatment Group
y2<-y[seq(22,44,1)] # data for Control Group
# qqplots
qqnorm(y1,main="Qqplot for Treatment Group")
qqnorm(y2,main="Qqplot for Control Group")
# t test for hypothesis of no difference in means
# and 95% confidence interval for mean difference mu
# note that R uses mu = mu_control - mu_treatment
t.test(DRP~Group,data=treatmentvscontroldata,var.equal=T,
conf.level=0.95)
15. A study was done to compare the durability of diesel engine bearings made of two different compounds. Ten bearings of each type were tested. The following table gives the "times" until failure (in units of millions of cycles):
Type I: $y_{1i}$ 3.03 5.53 5.60 9.30 9.92 12.51 12.95 15.21 16.04 16.84
Type II: $y_{2i}$ 3.19 4.26 4.47 4.53 4.67 4.69 12.78 6.79 9.37 12.75
$$\bar{y}_1 = 10.693 \quad \sum_{i=1}^{10}(y_{1i}-\bar{y}_1)^2 = 209.02961 \qquad \bar{y}_2 = 6.75 \quad \sum_{i=1}^{10}(y_{2i}-\bar{y}_2)^2 = 116.7974$$
$$Y_{1j} \sim G(\mu_1, \sigma), \quad j = 1, 2, \ldots, 10 \text{ independently}$$
$$Y_{2j} \sim G(\mu_2, \sigma), \quad j = 1, 2, \ldots, 10 \text{ independently}$$
(a) Obtain a 90% confidence interval for the difference in the means $\mu_1 - \mu_2$.
(c) It has been suggested that log failure times are approximately Normally distributed, but not failure times. Assuming that the log Y's for the two types of bearing are Normally distributed with the same variance, test the hypothesis that the two distributions have the same mean. How does the answer compare with that in part (b)?
(d) How might you check whether Y or log Y is closer to Normally distributed?
(e) Give a plot of the data which could be used to describe the data and your
analysis.
16. To compare the mathematical abilities of incoming first year students in Mathematics and Engineering, 30 Math students and 30 Engineering students were selected randomly from their first year classes and given a mathematics aptitude test. A summary of the resulting marks $y_{1i}$ (for the math students) and $y_{2i}$ (for the engineering students), $i = 1, 2, \ldots, 30$, is as follows:
Math students: $n = 30$, $\bar{y}_1 = 120$, $\sum_{i=1}^{30}(y_{1i}-\bar{y}_1)^2 = 3050$
Engineering students: $n = 30$, $\bar{y}_2 = 114$, $\sum_{i=1}^{30}(y_{2i}-\bar{y}_2)^2 = 2937$
$$Y_{1j} \sim G(\mu_1, \sigma), \quad j = 1, 2, \ldots, 30 \text{ independently}$$
$$Y_{2j} \sim G(\mu_2, \sigma), \quad j = 1, 2, \ldots, 30 \text{ independently}$$
(a) Obtain a 95% confidence interval for the difference in mean scores for first year Math and Engineering students.
(b) Test the hypothesis that the difference is zero.
17. Fourteen welded girders were cyclically stressed at 1900 pounds per square inch and the numbers of cycles to failure were observed. The sample mean and variance of the log failure times were $\bar{y}_1 = 14.564$ and $s_1^2 = 0.0914$. Similar tests on ten additional girders with repaired welds gave $\bar{y}_2 = 14.291$ and $s_2^2 = 0.0422$. Log failure times are assumed to be independent with a Gaussian distribution. Assuming equal variances for the two types of girders, obtain a 95% confidence interval for the difference in mean log failure times and test the hypothesis of no difference.
18. Consider the data in Chapter 1 on the lengths of male and female coyotes. The data are available in the file coyotedata.txt posted on the course website.
(a) Construct a 95% confidence interval for the difference in mean lengths for the two sexes. State your assumptions.
(b) Estimate $P(Y_1 > Y_2)$ (give the maximum likelihood estimate), where $Y_1$ is the length of a randomly selected female and $Y_2$ is the length of a randomly selected male. Can you suggest how you might get a confidence interval?
(c) Give separate confidence intervals for the average length of males and females.
19. To assess the effect of a low dose of alcohol on reaction time, a sample of 24 student volunteers took part in a study. Twelve of the students (randomly chosen from the 24) were given a fixed dose of alcohol (adjusted for body weight) and the other twelve got a nonalcoholic drink which looked and tasted the same as the alcoholic drink. Each student was then tested using software that flashes a coloured rectangle randomly placed on a screen; the student has to move the cursor into the rectangle and double click the mouse. As soon as the double click occurs, the process is repeated, up to a total of 20 times. The response variate is the total reaction time (i.e. time to complete the experiment) over the 20 trials. The data are given below.
"Alcohol" Group:
1.33 1.55 1.43 1.35 1.17 1.35 1.17 1.80 1.68 1.19 0.96 1.46
$$\bar{y}_1 = \frac{16.44}{12} = 1.370 \quad \sum_{i=1}^{12}(y_{1i}-\bar{y}_1)^2 = 0.608$$
"Non-Alcohol" Group:
1.68 1.30 1.85 1.64 1.62 1.69 1.57 1.82 1.41 1.78 1.40 1.43
$$\bar{y}_2 = \frac{19.19}{12} = 1.599 \quad \sum_{i=1}^{12}(y_{2i}-\bar{y}_2)^2 = 0.35569$$
$$Y_{1j} \sim G(\mu_1, \sigma), \quad j = 1, 2, \ldots, 12 \text{ independently}$$
for the Alcohol Group and
$$Y_{2j} \sim G(\mu_2, \sigma), \quad j = 1, 2, \ldots, 12 \text{ independently}$$
for the Non-Alcohol Group, where $\mu_1$, $\mu_2$ and $\sigma$ are unknown parameters. Determine a 95% confidence interval for the difference in the means $\mu_1 - \mu_2$. What can the researchers conclude on the basis of this study?
20. An experiment was conducted to compare gas mileages of cars using a synthetic oil
and a conventional oil. Eight cars were chosen as representative of the cars in general
use. Each car was run twice under as similar conditions as possible (same drivers,
routes, etc.), once with the synthetic oil and once with the conventional oil, the order
of use of the two oils being randomized.
The gas mileages were as follows:
Car 1 2 3 4 5 6 7 8
Synthetic: $y_{1i}$ 21.2 21.4 15.9 37.0 12.1 21.1 24.5 35.7
Conventional: $y_{2i}$ 18.0 20.6 14.2 37.8 10.6 18.5 25.9 34.7
$y_i = y_{1i} - y_{2i}$ 3.2 0.8 1.7 -0.8 1.5 2.6 -1.4 1.0
$$\bar{y}_1 = 23.6125 \quad \sum_{i=1}^{8}(y_{1i}-\bar{y}_1)^2 = 535.16875$$
$$\bar{y}_2 = 22.5375 \quad \sum_{i=1}^{8}(y_{2i}-\bar{y}_2)^2 = 644.83875$$
$$\bar{y} = 1.075 \quad \sum_{i=1}^{8}(y_i-\bar{y})^2 = 17.135$$
i=1
(a) Obtain a 95% confidence interval for the difference in mean gas mileage, and state the assumptions on which your analysis depends.
(b) Repeat (a) if the natural pairing of the data is (improperly) ignored.
(c) Why is it better to take pairs of measurements on eight cars rather than taking only one measurement on each of 16 cars?
21. The following table gives the number of staff hours per month lost due to accidents in eight factories of similar size over a period of one year before and after the introduction of an industrial safety program.
Factory i 1 2 3 4 5 6 7 8
After: $y_{1i}$ 28.7 62.2 28.9 0.0 93.5 49.6 86.3 40.2
Before: $y_{2i}$ 48.5 79.2 25.3 19.7 130.9 57.6 88.8 62.1
$y_i = y_{1i} - y_{2i}$ -19.8 -17.0 3.6 -19.7 -37.4 -8.0 -2.5 -21.9
$$\bar{y} = -15.3375 \quad \sum_{i=1}^{8}(y_i - \bar{y})^2 = 1148.79875$$
There is a natural pairing of the data by factory. Factories with the best safety records
before the safety program tend to have the best records after the safety program as
well. The analysis of the data must take this pairing into account and therefore the
model
$Y_i \sim G(\mu, \sigma)$, $i = 1, 2, \ldots, 8$ independently
is used.
(a) The parameters $\mu$ and $\sigma$ correspond to what attributes of interest in the study
population?
(b) Calculate a 95% confidence interval for $\mu$.
(c) Test the hypothesis of no difference due to the safety program, that is, test the
hypothesis $H_0: \mu = 0$.
22. Comparing sorting algorithms: Suppose you want to compare two algorithms
A and B that will sort a set of numbers into an increasing sequence. (The R function,
sort(x), will, for example, sort the elements of the numeric vector x.) To compare
the speed of algorithms A and B, you decide to “present” A and B with random
permutations of n numbers, for several values of n. Explain exactly how you would
set up such a study, and discuss what pairing would mean in this context.
23. Sorting algorithms continued: Two sort algorithms as in the preceding problem
were each run on (the same) 20 sets of numbers (there were 500 numbers in each set).
Times for the two algorithms to sort the sets of numbers are shown below.
Set:    1      2      3      4      5      6      7      8      9     10
A:    3.85   2.81   6.47   7.59   4.58   5.47   4.72   3.56   3.22   5.58
B:    2.66   2.98   5.35   6.43   4.28   5.06   4.36   3.91   3.28   5.19
y_i:  1.19  -0.17   1.12   1.16   0.30   0.41   0.36  -0.35  -0.06   0.39

Set:   11     12     13     14     15     16     17     18     19     20
A:    4.58   5.46   3.31   4.33   4.26   6.29   5.04   5.08   5.08   3.47
B:    4.05   4.78   3.77   3.81   3.17   6.02   4.84   4.81   4.34   3.48
y_i:  0.53   0.68  -0.46   0.52   1.09   0.27   0.20   0.27   0.74  -0.01

$\bar{y} = 0.409$, $s^2 = \frac{1}{19} \sum_{i=1}^{20} (y_i - \bar{y})^2 = 0.237483$
Data are available in the file sortdata.txt available on the course website.
(a) Since the two algorithms are each run on the same 20 sets of numbers we analyse
the differences $y_i = y_{Ai} - y_{Bi}$, $i = 1, 2, \ldots, 20$. Construct a 99% confidence
interval for the difference in the average time to sort with algorithms A and B,
assuming the differences have a Gaussian distribution.
(b) Use a Normal qqplot to determine if a Gaussian model is reasonable for the
differences.
(c) Give a point estimate of the probability that algorithm B will sort a randomly
selected list faster than A.
(d) Another way to estimate the probability $p$ in part (c) is to notice that of the 20
sets of numbers in the study, B sorted faster on 15 sets of numbers. Obtain an
approximate 95% confidence interval for $p$. (It is also possible to get a confidence
interval using the Gaussian model.)
(e) Suppose the study had actually been conducted using two independent samples of
size 20 each. Using the two sample Normal analysis determine a 99% confidence
interval for the difference in the average time to sort with algorithms A and B.
Note:
$\bar{y}_1 = 4.7375$, $s_1^2 = 1.4697$, $\bar{y}_2 = 4.3285$, $s_2^2 = 0.9945$
How much better is the paired study as compared to the two sample study?
(f) Here is the R code for doing the t tests and confidence intervals for the paired
analysis and the unpaired analysis:
# Import dataset sortdata.txt in folder S231Datasets
t.test(Time~Alg,data=sortdata,paired=T,conf.level=0.99)
t.test(Time~Alg,data=sortdata,paired=F,var.equal=T,conf.level=0.99)
25. Challenge Problem Readings produced by a set of scales are independent and
Normally distributed about the true weight of the item being measured. A study
is carried out to assess whether the standard deviation of the measurements varies
according to the weight of the item.
(a) Ten weighings of a 10 kilogram weight yielded $\bar{y} = 10.004$ and $s = 0.013$ as the
sample mean and standard deviation. Ten weighings of a 40 kilogram weight
yielded $\bar{y} = 39.989$ and $s = 0.034$. Is there any evidence of a difference in the
standard deviations for the measurements of the two weights?
(b) Suppose you had a further set of weighings of a 20 kilogram item. How could
you study the question of interest further?
26. Challenge Problem Suppose you have a model where the mean of the response
variable $Y_i$ given the covariates $x_i = (x_{i1}, \ldots, x_{ik})$ is linear in unknown parameters
$\beta = (\beta_1, \ldots, \beta_k)$. Show that the least squares estimate of $\beta$ is the same as the maximum likelihood
estimate of $\beta$ in the Gaussian model $Y_i \sim G(\mu_i, \sigma)$, when $\mu_i$ is of the form
$$\mu_i = \mu(x_i; \beta) = \sum_{j=1}^{k} \beta_j x_{ij}$$
27. Challenge Problem Optimal Prediction In many settings we want to use covariates
$x$ to predict a future value $Y$. (For example, we use economic factors $x$ to
predict the price $Y$ of a commodity a month from now.) The value $Y$ is random, but
suppose we know $\mu(x) = E(Y \mid x)$ and $\sigma(x)^2 = Var(Y \mid x)$.
(a) Predictions take the form $\hat{Y} = g(x)$, where $g(\cdot)$ is our "prediction" function.
Show that $E[(\hat{Y} - Y)^2]$ is minimized by choosing
$g(x) = \mu(x)$.
(b) Show that the minimum achievable value of $E[(\hat{Y} - Y)^2]$, that is, its value when
$g(x) = \mu(x)$, is $\sigma(x)^2$.
This shows that if we can determine or estimate $\mu(x)$, then "optimal" prediction
(in terms of Euclidean distance) is possible. Part (b) shows that we should try
to find covariates $x$ for which $\sigma(x)^2 = Var(Y \mid x)$ is as small as possible.
(c) What happens when $\sigma(x)^2$ is close to zero? (Explain this in ordinary English.)
7. MULTINOMIAL MODELS
AND GOODNESS OF FIT TESTS
$$L(\theta_1, \theta_2, \ldots, \theta_k) = \frac{n!}{y_1! y_2! \cdots y_k!}\, \theta_1^{y_1} \theta_2^{y_2} \cdots \theta_k^{y_k}$$
or more simply
$$L(\theta) = \prod_{j=1}^{k} \theta_j^{y_j} \qquad (7.2)$$
$$H_0: \theta_j = \theta_j(\alpha) \quad \text{for } j = 1, 2, \ldots, k \qquad (7.3)$$
$$H_0: \theta_1 = \alpha_1, \quad \theta_2 = \alpha_1 + \alpha_2, \quad \theta_3 = \alpha_2, \quad \theta_4 = 1 - 2(\alpha_1 + \alpha_2)$$
then $\alpha = (\alpha_1, \alpha_2)$ and $p = 2$.
A likelihood ratio test of (7.3) is based on the likelihood ratio statistic
$$\Lambda = -2 \log \left[ \frac{L(\tilde{\theta}_0)}{L(\tilde{\theta})} \right] \qquad (7.4)$$
Letting
$$E_j = n\, \theta_j(\tilde{\alpha}) \quad \text{for } j = 1, 2, \ldots, k$$
we can rewrite $\Lambda$ as
$$\Lambda = 2 \sum_{j=1}^{k} Y_j \log \left( \frac{Y_j}{E_j} \right) \qquad (7.5)$$
Let
$$\lambda = 2 \sum_{j=1}^{k} y_j \log \left( \frac{y_j}{e_j} \right)$$
denote the observed value of $\Lambda$; the p-value is then approximated by $P(W \geq \lambda)$
where $W \sim \chi^2(k - 1 - p)$. This approximation is accurate when $n$ is large and none of the $\theta_j$'s is too small. In
particular, the expected frequencies determined assuming $H_0$ is true should all be at least
5 to use the Chi-squared approximation.
An alternative test statistic that was developed historically before the likelihood ratio
test statistic is the Pearson goodness of fit statistic
$$D = \sum_{j=1}^{k} \frac{(Y_j - E_j)^2}{E_j} \qquad (7.6)$$
The Pearson goodness of fit statistic has similar properties to $\Lambda$: the observed value $d$ takes on small
values if the $y_j$'s and $e_j$'s are close in value and $d$ takes on large values if the $y_j$'s and $e_j$'s
differ greatly. It also turns out that, like $\Lambda$, the statistic $D$ has a limiting $\chi^2(k - 1 - p)$
distribution when $H_0$ is true.
The remainder of this chapter consists of the application of the general methods above
to some important testing problems.
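To see how $\Lambda$ and $D$ behave on the same data, here is a small Python sketch (not part of the original notes, with made-up frequencies) for $k = 4$ equally probable categories, so no parameters are estimated ($p = 0$) and both statistics would be compared to a $\chi^2(3)$ distribution. The two statistics are typically close when $H_0$ is plausible:

```python
from math import log

# Hypothetical observed frequencies for k = 4 equally likely categories
y = [18, 25, 16, 21]
n = sum(y)                       # n = 80
e = [n / 4] * 4                  # expected frequencies under H0: all 20

# Likelihood ratio statistic (7.5) and Pearson statistic (7.6)
lam = 2 * sum(yj * log(yj / ej) for yj, ej in zip(y, e))
d = sum((yj - ej) ** 2 / ej for yj, ej in zip(y, e))

print(round(lam, 3), round(d, 3))  # 2.273 2.3
```

Both observed values are far below the upper tail of a $\chi^2(3)$ distribution, so for these made-up data there would be no evidence against $H_0$.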
$$L_1(\theta) = L(\theta_1(\theta), \theta_2(\theta), \theta_3(\theta)) = c\,(\theta^2)^{17} [2\theta(1 - \theta)]^{46} [(1 - \theta)^2]^{37} = c\, \theta^{80} (1 - \theta)^{120} \quad \text{for } 0 \leq \theta \leq 1$$
where $c$ is a constant with respect to $\theta$. We easily find that $\hat{\theta} = 0.40$. The observed
expected frequencies under (7.7) are $e_1 = 100\hat{\theta}^2 = 16$, $e_2 = 100[2\hat{\theta}(1 - \hat{\theta})] = 48$, $e_3 = 100[(1 - \hat{\theta})^2] = 36$. The observed value of the likelihood ratio statistic (7.5) is
$$\lambda = 2 \sum_{j=1}^{3} y_j \log \left( \frac{y_j}{e_j} \right) = 2 \left[ 17 \log \frac{17}{16} + 46 \log \frac{46}{48} + 37 \log \frac{37}{36} \right] = 0.17$$
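The calculation above is easy to verify numerically; a short Python check (not part of the original notes):

```python
from math import log

y = [17, 46, 37]           # observed frequencies
theta_hat = 80 / 200       # maximum likelihood estimate under the hypothesized model
e = [100 * theta_hat ** 2,                     # e1 = 16
     100 * 2 * theta_hat * (1 - theta_hat),    # e2 = 48
     100 * (1 - theta_hat) ** 2]               # e3 = 36

# Observed likelihood ratio statistic (7.5)
lam = 2 * sum(yj * log(yj / ej) for yj, ej in zip(y, e))
print(round(lam, 2))  # 0.17
```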
The collapsed table has five categories so $k = 5$ and only one parameter has been estimated
under $H_0$ so $p = 1$. The degrees of freedom for the Chi-squared approximation equal
$k - 1 - p = 5 - 1 - 1 = 3$. The approximate p-value is
$$p\text{-value} \approx P(W \geq 0.43) = 0.93 \quad \text{where } W \sim \chi^2(3)$$
Since the p-value $> 0.1$ there is no evidence based on the data against the hypothesis that
the Poisson model fits these data.
Since there is only one unknown parameter under (7.8), $p = 1$. It is possible to maximize
$L(\theta)$ to obtain $\hat{\theta} = 310.0$. The expected frequencies, $e_j = 100\, p_j(\hat{\theta})$, $j = 1, 2, \ldots, 7$, are
given in the table.
A goodness of fit test has some arbitrary elements, since we could have used different
intervals and a different number of intervals. Theory has been developed on how best to
choose the intervals. For this course we only give rough guidelines: choose 4 to 10
intervals, so that the observed expected frequencies under $H_0$ are all at least 5.
Interval       Observed Frequency: f_j    Expected Frequency: e_j
[0, 15)                 21                        52.72
[15, 30)                45                        38.82
[30, 45)                50                        28.59
[45, 60)                27                        21.05
[60, 75)                21                        15.50
[75, 90)                 9                        11.42
[90, 105)               12                         8.41
[105, 120)               7                         6.19
[120, +∞)                8                        17.30
Total                  200                       200
Table 7.1: Frequency table for brake pad data
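The observed value of the likelihood ratio statistic quoted below ($\lambda = 50.36$) can be reproduced directly from Table 7.1; a short Python check (not part of the original notes):

```python
from math import log

# Observed and expected frequencies from Table 7.1 (brake pad data)
f = [21, 45, 50, 27, 21, 9, 12, 7, 8]
e = [52.72, 38.82, 28.59, 21.05, 15.50, 11.42, 8.41, 6.19, 17.30]

# Observed likelihood ratio statistic (7.5)
lam = 2 * sum(fj * log(fj / ej) for fj, ej in zip(f, e))
print(round(lam, 2))  # 50.36
```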
The expected frequencies are all at least five and so $k = 9$. There is only one parameter
to be estimated under the hypothesized Exponential($\theta$) model so $p = 1$. The degrees of
freedom equal $k - 1 - p = 9 - 1 - 1 = 7$ and the approximate p-value is
$$p\text{-value} = P(\Lambda \geq 50.36; H_0) \approx P(W \geq 50.36) \approx 0 \quad \text{where } W \sim \chi^2(7)$$
and there is very strong evidence based on the data against the hypothesis that an Exponential
model fits these data. This conclusion is not unexpected since, as we noted in
Example 2.6.2, the observed and expected frequencies are not in close agreement. We could
have chosen a different set of intervals for these continuous data but the same conclusion
of a lack of fit would be obtained for any reasonable choice of intervals.

7.3 Two-Way (Contingency) Tables
where $r_i = \sum_{j=1}^{b} y_{ij}$ are the row totals, $c_j = \sum_{i=1}^{a} y_{ij}$ are the column totals, and $\sum_{i=1}^{a} \sum_{j=1}^{b} y_{ij} = n$.
Let $\theta_{ij}$ be the probability a randomly selected individual is combined type $(A_i, B_j)$ and
note that $\sum_{i=1}^{a} \sum_{j=1}^{b} \theta_{ij} = 1$. The $ab$ frequencies $(Y_{11}, Y_{12}, \ldots, Y_{ab})$ follow a Multinomial
distribution with $k = ab$ classes.
To test independence of the A and B classifications, we test the hypothesis
$$H_0: \theta_{ij} = \alpha_i \beta_j \quad \text{for } i = 1, 2, \ldots, a; \; j = 1, 2, \ldots, b \qquad (7.9)$$
where $0 < \alpha_i < 1$, $0 < \beta_j < 1$, $\sum_{i=1}^{a} \alpha_i = 1$, $\sum_{j=1}^{b} \beta_j = 1$. Note that $\alpha_i = P(A_i)$ and $\beta_j = P(B_j)$,
and that (7.9) is the standard definition for independent events: $P(A_i \cap B_j) = P(A_i) P(B_j)$.
We note that testing (7.9) falls into the general framework of Section 7.1, where $k = ab$,
and the number of parameters estimated under (7.9) is $p = (a - 1) + (b - 1) = a + b - 2$.
All that needs to be done in order to use the statistics (7.5) or (7.6) to test $H_0$ is to obtain
the maximum likelihood estimates $\hat{\alpha}_i$, $\hat{\beta}_j$ under the model (7.9), and then calculate the
expected frequencies $e_{ij}$.
Under the model (7.9), the likelihood function for the $y_{ij}$'s is
$$L_1(\alpha, \beta) = \prod_{i=1}^{a} \prod_{j=1}^{b} [\theta_{ij}(\alpha, \beta)]^{y_{ij}} = \prod_{i=1}^{a} \prod_{j=1}^{b} (\alpha_i \beta_j)^{y_{ij}}$$
The log likelihood function $\ell(\alpha, \beta) = \log L_1(\alpha, \beta)$ must be maximized subject to the linear
constraints $\sum_{i=1}^{a} \alpha_i = 1$, $\sum_{j=1}^{b} \beta_j = 1$. The maximum likelihood estimates can be shown to be
$$\hat{\alpha}_i = \frac{r_i}{n}, \quad \hat{\beta}_j = \frac{c_j}{n} \quad i = 1, 2, \ldots, a; \; j = 1, 2, \ldots, b$$
and the expected frequencies are given by
$$e_{ij} = n \hat{\alpha}_i \hat{\beta}_j = \frac{r_i c_j}{n} \quad i = 1, 2, \ldots, a; \; j = 1, 2, \ldots, b \qquad (7.10)$$
The observed value of the likelihood ratio statistic for testing $H_0$ is
$$\lambda = 2 \sum_{i=1}^{a} \sum_{j=1}^{b} y_{ij} \log \left( \frac{y_{ij}}{e_{ij}} \right)$$
(expected frequencies under independence in brackets)

Rh+    82 (77.27)    89 (94.35)    54 (49.61)    19 (22.77)    244
Rh-    13 (17.73)    27 (21.65)     7 (11.39)     9 (5.23)      56
the hypothesis of independence based on the data. Note that by comparing the eij ’s and
the yij ’s we see that the degree of dependence does not appear large.
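The mechanics of (7.10) and of the likelihood ratio statistic for this table can be checked numerically. The Python sketch below (not part of the original notes) uses the observed counts from the table above; with $(a - 1)(b - 1) = 3$ degrees of freedom, the resulting $\lambda$ would be compared to a $\chi^2(3)$ distribution:

```python
from math import log

obs = [[82, 89, 54, 19],   # Rh+ row
       [13, 27, 7, 9]]     # Rh- row

n = sum(sum(row) for row in obs)         # total count: 300
r = [sum(row) for row in obs]            # row totals r_i
c = [sum(col) for col in zip(*obs)]      # column totals c_j

# Expected frequencies e_ij = r_i * c_j / n under independence, as in (7.10)
e = [[ri * cj / n for cj in c] for ri in r]

# Observed likelihood ratio statistic
lam = 2 * sum(obs[i][j] * log(obs[i][j] / e[i][j])
              for i in range(2) for j in range(4))
print(round(e[0][0], 2), round(lam, 2))
```

The first expected frequency agrees with the 77.27 shown in brackets in the table.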
$$\theta_{i1} + \theta_{i2} + \cdots + \theta_{ib} = 1 \quad \text{for each } i = 1, 2, \ldots, a$$
$$H_0: \theta_1 = \theta_2 = \cdots = \theta_a \qquad (7.11)$$
where $\theta_i = (\theta_{i1}, \theta_{i2}, \ldots, \theta_{ib})$.
and 236 were assigned to the placebo group. (There were actually an equal number in each
group but four patients withdrew from the placebo group during the study.) The patients
were followed for three years, and it was determined for each person whether they had a
stroke during that period or not. The data were as follows (expected frequencies are given
in brackets).
                  Stroke       No Stroke      Total
Aspirin Group    64 (75.6)    176 (164.4)      240
Placebo Group    86 (74.4)    150 (161.6)      236
Total               150           326           476
We can think of the persons receiving aspirin and those receiving placebo as two groups,
and test the hypothesis
$$H_0: \theta_{11} = \theta_{21}$$
where $\theta_{11} = P(\text{stroke})$ for a person in the aspirin group and $\theta_{21} = P(\text{stroke})$ for a person
in the placebo group. The expected frequencies under $H_0: \theta_{11} = \theta_{21}$ are
$$e_{ij} = \frac{(y_{i+})(y_{+j})}{476} \quad \text{for } i = 1, 2; \; j = 1, 2$$
This gives the expected frequencies shown in the table in brackets. The observed value of
the likelihood ratio statistic is
$$\lambda = 2 \sum_{i=1}^{2} \sum_{j=1}^{2} y_{ij} \log \left( \frac{y_{ij}}{e_{ij}} \right) = 5.25$$
so there is evidence against $H_0$ based on the data. A look at the $y_{ij}$'s and the $e_{ij}$'s indicates
that persons receiving aspirin have had fewer strokes than expected under $H_0$, suggesting
that $\theta_{11} < \theta_{21}$.
This test can be followed up with estimates for $\theta_{11}$ and $\theta_{21}$. Because each row of the
table follows a Binomial distribution, the quantity
$$\frac{(\tilde{\theta}_{11} - \tilde{\theta}_{21}) - (\theta_{11} - \theta_{21})}{\sqrt{\tilde{\theta}_{11}(1 - \tilde{\theta}_{11})/n_1 + \tilde{\theta}_{21}(1 - \tilde{\theta}_{21})/n_2}}$$
has approximately a $G(0, 1)$ distribution and can be used as an approximate pivotal quantity
to construct a confidence interval for $\theta_{11} - \theta_{21}$.
Remark This and other tests involving Binomial probabilities and contingency tables
can be carried out using the R function prop.test which uses the Pearson goodness of fit
statistic.
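The approximate pivotal quantity above can be used to build a confidence interval for $\theta_{11} - \theta_{21}$ with the aspirin data; a Python sketch (not part of the original notes; the multiplier 1.96 gives an approximate 95% interval):

```python
from math import sqrt

y11, n1 = 64, 240    # strokes, total in the aspirin group
y21, n2 = 86, 236    # strokes, total in the placebo group

t1, t2 = y11 / n1, y21 / n2     # sample proportions (theta-tildes)
se = sqrt(t1 * (1 - t1) / n1 + t2 * (1 - t2) / n2)

lo = (t1 - t2) - 1.96 * se
hi = (t1 - t2) + 1.96 * se
print(round(lo, 3), round(hi, 3))  # -0.181 -0.015
```

Since the interval lies entirely below zero, it is consistent with the conclusion above that $\theta_{11} < \theta_{21}$. (In R, prop.test(c(64, 86), c(240, 236), correct=FALSE) should give essentially the same interval.)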
7.4 Chapter 7 Problems
Test the hypothesis that each of the colours has the same probability, $H_0: \theta_i = \frac{1}{8}$,
$i = 1, 2, \ldots, 8$.
The following R code calculates the observed values of the likelihood ratio test statistic
$\lambda$ and the Pearson goodness of fit statistic $d$ and the corresponding p-values.
y<-c(556,678,739,653,725,714,566,797) # observed frequencies
e<-sum(y)/8 # expected frequencies
lambda<-2*sum(y*log(y/e)) # observed value of LR statistic
df<-7 # degrees of freedom for this example equal 7
pvalue<-1-pchisq(lambda,df) # p-value for LR test
c(lambda,df,pvalue)
d<-sum((y-e)^2/e) # observed value of Pearson goodness of fit statistic
pvalue<-1-pchisq(d,df) # p-value for Pearson goodness of fit test
c(d,df,pvalue)
What would you conclude about the distribution of colours in boxes of Smarties?
2. Test whether a Poisson model for Y = the number of alpha particles emitted in a time
interval of 1=8 minute is consistent with the Rutherford and Geiger data of Example
2.6.1.
3. Test whether a Poisson model for Y = the number of points per game is consistent
with the data for Wayne Gretzky given in Chapter 2, Problem 10.
4. Test whether a Poisson model for Y = the number of points per game is consistent
with the data for Sidney Crosby given in Chapter 2, Problem 11.
5. In the Wintario lottery draw, six digit numbers were produced by six machines that
operate independently and which each simulate a random selection from the digits
0; 1; : : : ; 9. Of 736 numbers drawn over a period from 1980-82, the following frequen-
cies were observed for position 1 in the six digit numbers:
Digit (i): 0 1 2 3 4 5 6 7 8 9 Total
Frequency (fi ): 70 75 63 59 81 92 75 100 63 58 736
If the machines operate in a truly “random” fashion, then we should have $\theta_j = 0.1$,
$j = 0, 1, \ldots, 9$.
(a) Test this hypothesis using the likelihood ratio test. What do you conclude?
(b) The data above were for digits in the first position of the six digit Wintario
numbers. Suppose you were told that similar likelihood ratio tests had in fact
been carried out for each of the six positions, and that position one had been
singled out for presentation above because it gave the largest observed value of
the likelihood ratio statistic $\lambda$. How would you test the hypothesis $\theta_j = 0.1$,
$j = 0, 1, 2, \ldots, 9$ for all six positions simultaneously?
1 1 6 8 10 22 12 15 0 0
2 26 1 20 4 2 0 10 4 19
2 3 0 5 2 8 1 6 14 2
2 2 21 4 3 0 0 7 2 4
4 7 16 18 2 13 22 7 3 5
(a) Give an appropriate probability model for the number of digits between two
successive zeros, if the pseudo random number generator is truly producing digits
for which $P(\text{any digit} = j) = 0.1$, $j = 0, 1, \ldots, 9$, independent of any other digit.
(b) Construct a frequency table and test the goodness of fit of your model.
(a) Test the hypothesis that the number of defective items $Y$ in a single carton has
a Binomial(12, $\theta$) distribution.
(b) Why might the Binomial not be a suitable model?
8. The table below records data on 292 litters of mice classified according to litter size
and number of females in the litter. Note that $y_{n+} = \sum_{j} y_{nj}$.
(a) For litters of size $n$ ($n = 1, 2, 3, 4$) assume that the number of females in a litter
of size $n$ has a Binomial distribution with parameters $n$ and $\theta_n = P(\text{female})$. Test
the Binomial model separately for each of the litter sizes $n = 2$, $n = 3$ and $n = 4$.
(Why is it of scientific interest to do this?)
(b) Assuming that the Binomial model is appropriate for each litter size, test the
hypothesis that $\theta_1 = \theta_2 = \theta_3 = \theta_4$.
9. The following data on heights of 210 married couples were presented by Yule in 1900.
Test the hypothesis that the heights of husbands and wives are independent.
The following R code determines the p-value for testing the hypothesis of independence.
# matrix of observed frequencies
f<-matrix(c(18,28,19,20,51,28,12,25,9),ncol=3,byrow=TRUE)
row<-margin.table(f,1) # row totals
col<-margin.table(f,2) # column totals
e<-outer(row,col)/sum(f) # matrix of expected frequencies
lambda<-2*sum(f*log(f/e)) # observed value of likelihood ratio statistic
df<-(length(row)-1)*(length(col)-1) # degrees of freedom
pvalue<-1-pchisq(lambda,df)
c(lambda,df,pvalue)
10. A study was undertaken to determine whether there is an association between the
birth weights of infants and the smoking habits of their parents. Out of 50 infants of
above average weight, 9 had parents who both smoked, 6 had mothers who smoked
but fathers who did not, 12 had fathers who smoked but mothers who did not, and
23 had parents of whom neither smoked. The corresponding results for 50 infants of
below average weight were 21, 10, 6, and 13, respectively.
(a) Test whether these results are consistent with the hypothesis that birth weight
is independent of parental smoking habits.
(b) Are these data consistent with the hypothesis that, given the smoking habits of
the mother, the smoking habits of the father are not related to birth weight?
11. School children with tonsils were classified according to tonsil size and absence or
presence of the carrier for Streptococcus pyogenes. The results were as follows:
12. A random sample of 1000 Canadians aged 25 to 34 were classified according to their
highest level of education and whether they were employed or not (data based on
2011 Canadian census data).
                                               Employed   Not Employed   Total
High school diploma or equivalent                 185           16         201
Postsecondary certificate, diploma or degree      683           40         723
Test the hypothesis that level of education is independent of whether or not a Canadian
aged 25 to 34 is employed.
13. In the following table, 64 sets of triplets are classified according to the age of their
mother at their birth and their sex distribution:
(a) Is there any evidence of an association between the sex distribution and the age
of the mother?
(b) Suppose that the probability of a male birth is 0.5, and that the sexes of triplets
are determined independently. Find the probability that there are $y$ boys in a
set of triplets, $y = 0, 1, 2, 3$, and test whether the column totals are consistent with
this distribution.
14. To investigate the effectiveness of a rust-proofing procedure, 50 cars that had been
rust-proofed and 50 cars that had not were examined for rust five years after purchase.
For each car it was noted whether rust was present (actually defined as having
moderate or heavy rust) or absent (light or no rust). The data are as follows:
(a) Test the hypothesis that the probability of rust occurring is the same for the
rust-proofed cars as for those not rust-proofed. What do you conclude?
(b) Do you have any concerns about inferring that the rust-proofing prevents rust?
How might a better study be designed?
independent set of 1400 utterances algorithm B made 62 errors. Test the hypothesis
that the probability of an error is the same for both algorithms.
                        B
                 Correct    Incorrect
A    Correct      y_11        y_12
     Incorrect    y_21        y_22

where $n = y_{11} + y_{12} + y_{21} + y_{22}$ and, for example, $y_{11}$ = the number of utterances correctly identified by both algorithms
A and B. Since $(Y_{11}, Y_{12}, Y_{21}, Y_{22}) \sim \text{Multinomial}(n; \theta_{11}, \theta_{12}, \theta_{21}, \theta_{22})$, the
hypothesis that the probability of an error is the same for both algorithms is
$$H_0: P(\text{A identifies utterance correctly}) = \theta_{11} + \theta_{12} = P(\text{B identifies utterance correctly}) = \theta_{11} + \theta_{21}$$
or equivalently $H_0: \theta_{12} = \theta_{21}$.
(b) Use the likelihood ratio test to test the hypothesis $H_0: \theta_{12} = \theta_{21}$ for the data

                        B
                 Correct    Incorrect
A    Correct      1325          3
     Incorrect      13         59

(n = 1400)
8. CAUSAL RELATIONSHIPS
leaving x as the causal agent. This is much easier to do in experimental studies, where
explanatory variates may be controlled, than in observational studies. The following are
brief examples.
There are many more lung cancer cases among the smokers, but without further information
or assumptions we cannot conclude that a causal relationship (smoking causes lung cancer)
exists. Alternative explanations might explain some or all of the observed difference. (This
is an observational study and other possible explanatory variates are not controlled.) For
example, family history is an important factor in many cancers; maybe smoking is also
related to family history. Moreover, smoking tends to be connected with other factors such
as diet and alcohol consumption; these may explain some of the effect seen.
The last example illustrates that association (statistical dependence) between
two variates x and y does not imply that a causal relationship exists. Suppose for
example that we observe a positive correlation between x and y; higher values of x tend to
go with higher values of y in a unit. Then there are at least three “explanations” for this
correlation:
(2) y causes x
We’ll now consider the question of cause and effect in experimental and observational
studies in a little more detail.

8.2 Experimental Studies
Here’s how we rule out alternative explanations: suppose you claim that it’s not the aspirin
but dietary factors and blood pressure that cause this observed effect. I respond that the
randomization procedure has led to those factors being balanced in the two treatment
groups. That is, the aspirin group and the placebo group both have similar variations in
dietary and blood pressure values across the subjects in the group. Thus, a difference in
the two groups should not be due to these factors.
8 cars of eight different types were used; each car was used for 8 test drives.

The cars were each driven twice for 600 km on the track at each of four speeds: 80,
100, 120, and 140 km/hr.

8 drivers were involved, each driving each of the 8 cars for one test, and each driving
two tests at each of the four speeds.

The cars had similar initial mileages and were carefully checked and serviced so as to
make them as comparable as possible; they used comparable fuels.

The drivers were instructed to drive steadily for the 600 km. Each was allowed a 30
minute rest stop after 300 km.

The order in which each driver did their 8 test drives was randomized. The track
was large enough that all 8 drivers could be on it at the same time. (The tests were
conducted over 8 days.)
The response variate was the amount of fuel consumed for each test drive. Obviously
in the analysis we must deal with the fact that the cars differ in size and engine type, and
their fuel consumption will depend on that as well as on driving speed. A simple approach
would be to add the fuel amounts consumed for the 16 test drives at each speed, and to
compare them (other methods are also possible). Then, for example, we might find that
the average consumption (across the 8 cars) at 80, 100, 120, and 140 km/hr were 43.0, 44.1,
45.8, and 47.2 liters respectively. Statistical methods of testing and estimation could then
be used to test or estimate the differences in average fuel consumption at each of the four
speeds. (Can you think of a way to do this?)

8.3 Observational Studies
We want to see if females have a lower probability of admission than males. If we looked
only at the totals for Engineering plus Arts, then it would appear that the probability a
male applicant is admitted is a little higher than the probability for a female applicant.
However, if we look separately at Arts and Engineering, we see the probability for females
being admitted appears higher in each case! The reason for the reverse direction in the
totals is that Engineering has a higher admission rate than Arts, but the fraction of women
applying to Engineering is much lower than for Arts.
In cause and e¤ect language, we would say that the faculty one applies to (i.e. Engi-
neering or Arts) is a causative factor with respect to probability of admission. Furthermore,
it is related to the sex (male or female) of an applicant, so we cannot ignore it in trying to
see if sex is also a causative factor.
Remark The feature illustrated in the example above is sometimes called Simpson's Paradox.
In probabilistic terms, it says that for events $A$, $B_1$, $B_2$ and $C_1, \ldots, C_k$, we can have
$$P(A \mid B_1 C_i) > P(A \mid B_2 C_i) \quad \text{for each } i = 1, \ldots, k$$
but have
$$P(A \mid B_1) < P(A \mid B_2)$$
(Note that $P(A \mid B_1) = \sum_{i=1}^{k} P(A \mid B_1 C_i) P(C_i \mid B_1)$ and similarly for $P(A \mid B_2)$, so they depend
on what $P(C_i \mid B_1)$ and $P(C_i \mid B_2)$ are.) In the example above we can take $B_1$ = {person
is female}, $B_2$ = {person is male}, $C_1$ = {person applies to Engineering}, $C_2$ = {person
applies to Arts}, and $A$ = {person is admitted}.
Exercise Write down estimated probabilities for the various events based on Example
8.3.1, and so illustrate Simpson’s paradox.
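A small numeric illustration of the mechanism, with made-up counts (not the data of Example 8.3.1), can make the paradox concrete:

```python
# Hypothetical admission counts: (admitted, applicants).
# Women apply mostly to the faculty with the lower admission rate (Arts).
data = {
    "Engineering": {"women": (60, 100),  "men": (300, 600)},
    "Arts":        {"women": (120, 600), "men": (10, 100)},
}

def rate(admitted, applicants):
    return admitted / applicants

# Within each faculty the women's admission rate is higher...
for faculty, d in data.items():
    print(faculty, rate(*d["women"]), ">", rate(*d["men"]))

# ...but pooled over faculties the comparison reverses, because the
# pooled rates are weighted by where each group applies.
w_tot = rate(60 + 120, 100 + 600)    # 180/700, about 0.26
m_tot = rate(300 + 10, 600 + 100)    # 310/700, about 0.44
print(round(w_tot, 2), "<", round(m_tot, 2))  # 0.26 < 0.44
```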
The association between x and y must be observed in many studies of di¤erent types
among di¤erent groups. This reduces the chance that an observed association is due
to a defect in one type of study or a peculiarity in one group of subjects.
The association between x and y must continue to hold when the e¤ects of plausible
confounding variates are taken into account.
There must be a consistent response, that is, y always increases (decreases) when x
increases.
explain the association. For example, data about nonsmokers who are exposed to second-hand
smoke contradict the genetic hypothesis. Animal experiments have demonstrated
conclusively that tobacco smoke contains substances that cause cancerous tumors. Therefore
there is a known pathway by which smoking causes lung cancer. The lung cancer rates
for ex-smokers decrease over time since smoking cessation. The evidence for causation here
is about as strong as non-experimental evidence can be.
8.4 Clofibrate Study

Problem
Investigate the effect of clofibrate on the risk of fatal heart attack for patients with a
history of a previous heart attack.
The target population consists of all individuals with a previous non-fatal heart attack
who are at risk for a subsequent heart attack. The response of interest is the occurrence/non-occurrence
of a fatal heart attack. This is primarily a causative problem in that the investigators
are interested in determining whether the prescription of clofibrate causes a reduction
in the risk of subsequent heart attack. The fishbone diagram (Figure 8.1) indicates a broad
variety of factors affecting the occurrence (or not) of a heart attack.
Plan
The study population consists of men aged 30 to 64 who had a previous heart attack not
more than three months prior to initial contact. The sample consists of subjects from the
study population who were contacted by participating physicians, asked to participate in
the study, and provided informed consent. (All patients eligible to participate had to sign a
consent form to participate in the study. The consent form usually describes current state
of knowledge regarding the best available relevant treatments, the potential advantages and
disadvantages of the new treatment, and the overall purpose of the study.)
The following treatment protocol was developed:
[16] The Coronary Drug Research Group, New England Journal of Medicine (1980), pg. 1038.
[Figure 8.1: Fishbone (cause-and-effect) diagram for the occurrence of a fatal heart attack. The branches Measurement, Material, Personnel, Environment, and Methods collect explanatory variates such as follow-up time, follow-up method, definition of heart attack, drug, dose, method of administration, age, stress, mental health, diet, personality type, gender, exercise, smoking status, drinking status, medications, family history, physical traits, personal history, doctor, and weather.]
Randomly assign eligible men to either clofibrate or placebo treatment groups. (This
is an attempt to make the clofibrate and placebo groups alike with respect to most
explanatory variates other than the focal explanatory variate. See the fishbone diagram
above.)

Follow patients for 5 years and record the occurrence of any fatal heart attacks
experienced in either treatment group.

Data
1,103 patients were assigned to clofibrate and 2,789 were assigned to the placebo
group.
221 of the patients in the clofibrate group died and 586 of the patients in the placebo
group died.
Analysis
The proportions of patients in the two groups having subsequent fatal heart attacks
(clofibrate: $221/1103 = 0.20$ and placebo: $586/2789 = 0.21$) are comparable.

Conclusions
Based on these data we would conclude that clofibrate does not reduce mortality due
to heart attacks in high risk patients.
This conclusion has several limitations. For example, study error has been introduced
by restricting the study population to male subjects alone. While clofibrate might be
discarded as a beneficial treatment for the target population, there is no information
in this study regarding its effects on female patients at risk for secondary heart attacks.
Problem
Investigate the occurrence of fatal heart attacks in the group of patients assigned to
clofibrate who were adherers.

Plan
Compare the occurrence of heart attacks in patients assigned to clofibrate who maintained
the designated treatment schedule with the patients assigned to clofibrate who
abandoned their assigned treatment schedule.

Data
In the clofibrate group, 708 patients were adherers and 357 were non-adherers. The
remaining 38 patients could not be classified as adherers or non-adherers and so were
excluded from this analysis. Of the 708 adherers, 106 had a fatal heart attack during
the five years of follow up. Of the 357 non-adherers, 88 had a fatal heart attack during
the five years of follow up.
Analysis
The proportion of adherers suffering a subsequent heart attack is $106/708 = 0.15$
while this proportion for the non-adherers is $88/357 = 0.25$.

Conclusions
It would appear based on these data that clofibrate does reduce mortality due to
heart attack for high risk patients if properly administered.
However, great care must be taken in interpreting the above results since they are
based on an observational plan. While the data were collected based on an experimental
plan, only the treatment was controlled. The comparison of the mortality
rates between the adherers and non-adherers is based on an explanatory variate (adherence)
that was not controlled in the original experiment. The investigators did not
decide who would adhere to the protocol and who would not; the subjects decided
themselves.
Now the possibility of confounding is substantial. Perhaps adherers are more health
conscious and exercised more or ate a healthier diet. Detailed measurements of these
variates are needed to control for them and reduce the possibility of confounding.
8.5 Chapter 8 Problems
(a) Test the hypothesis that birth weight is independent of the mother’s smoking
habits.
(b) Explain why it is that these results do not prove that birth weights would increase
if mothers stopped smoking during pregnancy. How should a study to obtain
such proof be designed?
(c) A similar, though weaker, association exists between birth weight and the amount
smoked by the father. Explain why this is to be expected even if the father’s
smoking habits are irrelevant.
2. One hundred and fifty Statistics students took part in a study to evaluate computer-assisted
instruction (CAI). Seventy-five received the standard lecture course while
the other 75 received some CAI. All 150 students then wrote the same examination.
Fifteen students in the standard course and 29 of those in the CAI group received a
mark over 80%.
(a) Are these results consistent with the hypothesis that the probability of achieving
a mark over 80% is the same for both groups?
(b) Based on these results, the instructor concluded that CAI increases the chances
of a mark over 80%. How should the study have been carried out in order for
this conclusion to be valid?
3. (a) The following data were collected some years ago in a study of possible sex bias
in graduate admissions at a large university:
Test the hypothesis that admission status is independent of sex. Do these data
indicate a lower admission rate for females?
(b) The following table shows the numbers of male and female applicants and the
percentages admitted for the six largest graduate programs in (a):
Men Women
Program Applicants % Admitted Applicants % Admitted
A 825 62 108 82
B 560 63 25 68
C 325 37 593 34
D 417 33 375 35
E 191 28 393 24
F 373 6 341 7
Test the independence of admission status and sex for each program. Do any of
the programs show evidence of a bias against female applicants?
(c) Why is it that the totals in (a) seem to indicate a bias against women, but the
results for individual programs in (b) do not?
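Part (c) is an instance of what is often called Simpson's paradox: the aggregate comparison can point in the opposite direction from every within-program comparison when the two sexes apply to programs with very different admission rates. A minimal sketch of the arithmetic follows; the admitted counts are reconstructed from the rounded percentages in the table, so the aggregate rates are approximate:

```python
# Applicants and rounded % admitted from the table in (b)
data = {  # program: (men applicants, men % admitted, women applicants, women % admitted)
    "A": (825, 62, 108, 82),
    "B": (560, 63, 25, 68),
    "C": (325, 37, 593, 34),
    "D": (417, 33, 375, 35),
    "E": (191, 28, 393, 24),
    "F": (373, 6, 341, 7),
}

# Reconstruct approximate admitted counts from the rounded percentages
men_apps = sum(m for m, _, _, _ in data.values())
women_apps = sum(w for _, _, w, _ in data.values())
men_admitted = sum(round(m * mp / 100) for m, mp, _, _ in data.values())
women_admitted = sum(round(w * wp / 100) for _, _, w, wp in data.values())

men_rate = 100 * men_admitted / men_apps        # aggregate % admitted, men
women_rate = 100 * women_admitted / women_apps  # aggregate % admitted, women

# Programs in which the women's admission rate is at least the men's
women_not_worse = [prog for prog, (_, mp, _, wp) in data.items() if wp >= mp]

print(round(men_rate, 1), round(women_rate, 1), women_not_worse)
```

In aggregate the men's rate (roughly 45%) exceeds the women's (roughly 30%), even though in four of the six programs the women's rate is at least as high as the men's. The reversal occurs because most of the female applicants applied to the programs with low overall admission rates.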
(a) Test the hypothesis that there is no difference between the mean amount of rust
for rust-proofed cars as compared to non-rust-proofed cars.
(b) The manufacturer was surprised to find that the data did not show a beneficial
effect of rust-proofing. Describe problems with their study and outline how you
might carry out a study designed to demonstrate a causal effect of rust-proofing.
7. In randomized clinical trials that compare two (or more) medical treatments, it is
customary not to let either the subjects or their physicians know which treatment
has been randomly assigned. Such studies are referred to as double blind studies.
Discuss why conducting an experimental study double blind is a good idea.
9. REFERENCES AND SUPPLEMENTARY RESOURCES

9.1 References

R.J. MacKay and R.W. Oldford (2001). Statistics 231: Empirical Problem Solving (Stat
231 Course Notes).

C.J. Wild and G.A.F. Seber (1999). Chance Encounters: A First Course in Data Analysis
and Inference. John Wiley and Sons, New York.

J. Utts (2003). What Educated Citizens Should Know About Statistics and Probability.
The American Statistician 57, 74-79.
10. DISTRIBUTIONS AND STATISTICAL TABLES
Summary of Discrete Distributions

For each distribution the probability function $f(y)$, mean $E(Y)$, variance $Var(Y)$ and moment generating function $M(t)$ are given.

Discrete Uniform$(a, b)$: $a, b$ integers with $b \ge a$
  $f(y) = \dfrac{1}{b-a+1}$, $y = a, a+1, \ldots, b$
  $E(Y) = \dfrac{a+b}{2}$,  $Var(Y) = \dfrac{(b-a+1)^2 - 1}{12}$
  $M(t) = \dfrac{1}{b-a+1} \sum_{x=a}^{b} e^{tx}$, $t \in \mathbb{R}$

Hypergeometric$(N, r, n)$: $N = 1, 2, \ldots$; $n = 0, 1, \ldots, N$; $r = 0, 1, \ldots, N$
  $f(y) = \dfrac{\binom{r}{y}\binom{N-r}{n-y}}{\binom{N}{n}}$, $y = \max(0, n - (N - r)), \ldots, \min(r, n)$
  $E(Y) = \dfrac{nr}{N}$,  $Var(Y) = \dfrac{nr}{N}\left(1 - \dfrac{r}{N}\right)\dfrac{N-n}{N-1}$
  $M(t)$ not tractable

Binomial$(n, p)$: $0 \le p \le 1$, $q = 1 - p$, $n = 1, 2, \ldots$
  $f(y) = \binom{n}{y} p^y q^{n-y}$, $y = 0, 1, \ldots, n$
  $E(Y) = np$,  $Var(Y) = npq$
  $M(t) = (pe^t + q)^n$, $t \in \mathbb{R}$

Bernoulli$(p)$: $0 \le p \le 1$, $q = 1 - p$
  $f(y) = p^y q^{1-y}$, $y = 0, 1$
  $E(Y) = p$,  $Var(Y) = pq$
  $M(t) = pe^t + q$, $t \in \mathbb{R}$

Negative Binomial$(k, p)$: $0 < p \le 1$, $q = 1 - p$, $k = 1, 2, \ldots$
  $f(y) = \binom{y+k-1}{y} p^k q^y = \binom{-k}{y} p^k (-q)^y$, $y = 0, 1, \ldots$
  $E(Y) = \dfrac{kq}{p}$,  $Var(Y) = \dfrac{kq}{p^2}$
  $M(t) = \left(\dfrac{p}{1 - qe^t}\right)^k$, $t < -\ln q$

Geometric$(p)$: $0 < p \le 1$, $q = 1 - p$
  $f(y) = pq^y$, $y = 0, 1, \ldots$
  $E(Y) = \dfrac{q}{p}$,  $Var(Y) = \dfrac{q}{p^2}$
  $M(t) = \dfrac{p}{1 - qe^t}$, $t < -\ln q$

Poisson$(\mu)$: $\mu \ge 0$
  $f(y) = \dfrac{e^{-\mu} \mu^y}{y!}$, $y = 0, 1, \ldots$
  $E(Y) = \mu$,  $Var(Y) = \mu$
  $M(t) = e^{\mu(e^t - 1)}$, $t \in \mathbb{R}$

Multinomial$(n; p_1, p_2, \ldots, p_k)$: $0 \le p_i \le 1$, $i = 1, 2, \ldots, k$, with $\sum_{i=1}^{k} p_i = 1$
  $f(y_1, y_2, \ldots, y_k) = \dfrac{n!}{y_1! \, y_2! \cdots y_k!} \, p_1^{y_1} p_2^{y_2} \cdots p_k^{y_k}$, $y_i = 0, 1, \ldots, n$ with $\sum_{i=1}^{k} y_i = n$
  $E(Y_i) = np_i$,  $Var(Y_i) = np_i(1 - p_i)$, $i = 1, 2, \ldots, k$
  $M(t_1, t_2, \ldots, t_{k-1}) = \left(p_1 e^{t_1} + p_2 e^{t_2} + \cdots + p_{k-1} e^{t_{k-1}} + p_k\right)^n$, $t_i \in \mathbb{R}$
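Any row of the table can be checked numerically from the probability function alone. A minimal sketch (the choice of Binomial$(10, 0.3)$ is arbitrary) verifies that the probabilities sum to 1, that the mean and variance agree with $np$ and $npq$, and that the MGF computed from the definition $E(e^{tY})$ matches the closed form $(pe^t + q)^n$ at one value of $t$:

```python
import math

# Check the Binomial(n, p) row directly from the probability function
n, p = 10, 0.3
q = 1 - p

f = [math.comb(n, y) * p ** y * q ** (n - y) for y in range(n + 1)]

total = sum(f)                                             # should be 1
mean = sum(y * fy for y, fy in enumerate(f))               # should equal np = 3.0
var = sum((y - mean) ** 2 * fy for y, fy in enumerate(f))  # should equal npq = 2.1

# MGF from the definition E(e^{tY}) versus the closed form (pe^t + q)^n
t = 0.5
mgf_direct = sum(math.exp(t * y) * fy for y, fy in enumerate(f))
mgf_formula = (p * math.exp(t) + q) ** n

print(round(mean, 6), round(var, 6))
```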
Summary of Continuous Distributions

For each distribution the probability density function $f(y)$, mean $E(Y)$, variance $Var(Y)$ and moment generating function $M(t)$ are given.

Uniform$(a, b)$: $b > a$
  $f(y) = \dfrac{1}{b-a}$, $a \le y \le b$
  $E(Y) = \dfrac{a+b}{2}$,  $Var(Y) = \dfrac{(b-a)^2}{12}$
  $M(t) = \dfrac{e^{bt} - e^{at}}{(b-a)t}$ for $t \ne 0$, and $M(0) = 1$

Exponential$(\theta)$: $\theta > 0$
  $f(y) = \dfrac{1}{\theta} e^{-y/\theta}$, $y \ge 0$
  $E(Y) = \theta$,  $Var(Y) = \theta^2$
  $M(t) = \dfrac{1}{1 - \theta t}$, $t < \dfrac{1}{\theta}$

Normal $N(\mu, \sigma^2) = G(\mu, \sigma)$: $\mu \in \mathbb{R}$, $\sigma^2 > 0$
  $f(y) = \dfrac{1}{\sigma\sqrt{2\pi}} e^{-(y-\mu)^2/(2\sigma^2)}$, $y \in \mathbb{R}$
  $E(Y) = \mu$,  $Var(Y) = \sigma^2$
  $M(t) = e^{\mu t + \sigma^2 t^2 / 2}$, $t \in \mathbb{R}$

Chi-squared $\chi^2(k)$: $k = 1, 2, \ldots$
  $f(y) = \dfrac{y^{k/2-1} e^{-y/2}}{2^{k/2}\, \Gamma(k/2)}$, $y > 0$, where $\Gamma(a) = \int_0^{\infty} x^{a-1} e^{-x}\, dx$
  $E(Y) = k$,  $Var(Y) = 2k$
  $M(t) = (1 - 2t)^{-k/2}$, $t < \dfrac{1}{2}$

Student $t(k)$: $k = 1, 2, \ldots$
  $f(y) = c_k \left(1 + \dfrac{y^2}{k}\right)^{-(k+1)/2}$, $y \in \mathbb{R}$, where $c_k = \dfrac{\Gamma\!\left(\frac{k+1}{2}\right)}{\sqrt{\pi k}\, \Gamma\!\left(\frac{k}{2}\right)}$
  $E(Y) = 0$ if $k = 2, 3, \ldots$ (does not exist if $k = 1$)
  $Var(Y) = \dfrac{k}{k-2}$ if $k = 3, 4, \ldots$ (does not exist if $k = 1, 2$)
  $M(t)$ does not exist
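Two of the rows above lend themselves to quick numerical checks: the $N(0,1)$ cumulative distribution function can be written in terms of the error function, and the Exponential$(\theta)$ mean $E(Y) = \theta$ can be recovered by crude numerical integration of $y f(y)$. A sketch (the step size and integration range are arbitrary choices, not part of the notes):

```python
import math

# N(mu, sigma^2) cdf via the error function:
# F(y) = (1 + erf((y - mu) / (sigma * sqrt(2)))) / 2
def normal_cdf(y, mu=0.0, sigma=1.0):
    return 0.5 * (1.0 + math.erf((y - mu) / (sigma * math.sqrt(2.0))))

# Exponential(theta) mean E(Y) = theta by a left Riemann sum of y * f(y)
theta = 2.0
dy = 1e-3
upper = 50.0 * theta  # the integrand is negligible beyond this point
mean = sum(y * (1.0 / theta) * math.exp(-y / theta) * dy
           for y in (i * dy for i in range(int(upper / dy))))

print(round(normal_cdf(1.96), 4), round(mean, 3))
```

The first printed value is close to 0.975, matching the familiar $N(0,1)$ probability $P(Z \le 1.96)$, and the second is close to $\theta = 2$.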
N(0,1) Cumulative Distribution Function
[table of values not reproduced here]
Student t Quantiles
This table gives values of x for p = P(X ≤ x) = F(x), for p ≥ 0.6
df \ p 0.6 0.7 0.8 0.9 0.95 0.975 0.99 0.995 0.999 0.9995
1 0.3249 0.7265 1.3764 3.0777 6.3138 12.7062 31.8205 63.6567 318.3088 636.6192
2 0.2887 0.6172 1.0607 1.8856 2.9200 4.3027 6.9646 9.9248 22.3271 31.5991
3 0.2767 0.5844 0.9785 1.6377 2.3534 3.1824 4.5407 5.8409 10.2145 12.9240
4 0.2707 0.5686 0.9410 1.5332 2.1318 2.7764 3.7469 4.6041 7.1732 8.6103
5 0.2672 0.5594 0.9195 1.4759 2.0150 2.5706 3.3649 4.0321 5.8934 6.8688
6 0.2648 0.5534 0.9057 1.4398 1.9432 2.4469 3.1427 3.7074 5.2076 5.9588
7 0.2632 0.5491 0.8960 1.4149 1.8946 2.3646 2.9980 3.4995 4.7853 5.4079
8 0.2619 0.5459 0.8889 1.3968 1.8595 2.3060 2.8965 3.3554 4.5008 5.0413
9 0.2610 0.5435 0.8834 1.3830 1.8331 2.2622 2.8214 3.2498 4.2968 4.7809
10 0.2602 0.5415 0.8791 1.3722 1.8125 2.2281 2.7638 3.1693 4.1437 4.5869
11 0.2596 0.5399 0.8755 1.3634 1.7959 2.2010 2.7181 3.1058 4.0247 4.4370
12 0.2590 0.5386 0.8726 1.3562 1.7823 2.1788 2.6810 3.0545 3.9296 4.3178
13 0.2586 0.5375 0.8702 1.3502 1.7709 2.1604 2.6503 3.0123 3.8520 4.2208
14 0.2582 0.5366 0.8681 1.3450 1.7613 2.1448 2.6245 2.9768 3.7874 4.1405
15 0.2579 0.5357 0.8662 1.3406 1.7531 2.1314 2.6025 2.9467 3.7328 4.0728
16 0.2576 0.5350 0.8647 1.3368 1.7459 2.1199 2.5835 2.9208 3.6862 4.0150
17 0.2573 0.5344 0.8633 1.3334 1.7396 2.1098 2.5669 2.8982 3.6458 3.9651
18 0.2571 0.5338 0.8620 1.3304 1.7341 2.1009 2.5524 2.8784 3.6105 3.9216
19 0.2569 0.5333 0.8610 1.3277 1.7291 2.0930 2.5395 2.8609 3.5794 3.8834
20 0.2567 0.5329 0.8600 1.3253 1.7247 2.0860 2.5280 2.8453 3.5518 3.8495
21 0.2566 0.5325 0.8591 1.3232 1.7207 2.0796 2.5176 2.8314 3.5272 3.8193
22 0.2564 0.5321 0.8583 1.3212 1.7171 2.0739 2.5083 2.8188 3.5050 3.7921
23 0.2563 0.5317 0.8575 1.3195 1.7139 2.0687 2.4999 2.8073 3.4850 3.7676
24 0.2562 0.5314 0.8569 1.3178 1.7109 2.0639 2.4922 2.7969 3.4668 3.7454
25 0.2561 0.5312 0.8562 1.3163 1.7081 2.0595 2.4851 2.7874 3.4502 3.7251
26 0.2560 0.5309 0.8557 1.3150 1.7056 2.0555 2.4786 2.7787 3.4350 3.7066
27 0.2559 0.5306 0.8551 1.3137 1.7033 2.0518 2.4727 2.7707 3.4210 3.6896
28 0.2558 0.5304 0.8546 1.3125 1.7011 2.0484 2.4671 2.7633 3.4082 3.6739
29 0.2557 0.5302 0.8542 1.3114 1.6991 2.0452 2.4620 2.7564 3.3962 3.6594
30 0.2556 0.5300 0.8538 1.3104 1.6973 2.0423 2.4573 2.7500 3.3852 3.6460
40 0.2550 0.5286 0.8507 1.3031 1.6839 2.0211 2.4233 2.7045 3.3069 3.5510
50 0.2547 0.5278 0.8489 1.2987 1.6759 2.0086 2.4033 2.6778 3.2614 3.4960
60 0.2545 0.5272 0.8477 1.2958 1.6706 2.0003 2.3901 2.6603 3.2317 3.4602
70 0.2543 0.5268 0.8468 1.2938 1.6669 1.9944 2.3808 2.6479 3.2108 3.4350
80 0.2542 0.5265 0.8461 1.2922 1.6641 1.9901 2.3739 2.6387 3.1953 3.4163
90 0.2541 0.5263 0.8456 1.2910 1.6620 1.9867 2.3685 2.6316 3.1833 3.4019
100 0.2540 0.5261 0.8452 1.2901 1.6602 1.9840 2.3642 2.6259 3.1737 3.3905
>100 0.2535 0.5247 0.8423 1.2832 1.6479 1.9647 2.3338 2.5857 3.1066 3.3101
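As the degrees of freedom grow, the quantiles in each column approach the corresponding $N(0,1)$ quantiles, which is why the rows change slowly beyond 30 degrees of freedom. The following sketch (function names are illustrative) recovers the limiting quantiles by bisection on the normal cdf written with the error function; compare 1.6449 and 1.9600 with 1.6479 and 1.9647 in the final row of the table:

```python
import math

def normal_cdf(x):
    # N(0,1) cdf written via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def normal_quantile(p, lo=-10.0, hi=10.0, tol=1e-10):
    # Bisection: the cdf is continuous and strictly increasing,
    # so there is a unique x with F(x) = p for 0 < p < 1
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if normal_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

print(round(normal_quantile(0.95), 4), round(normal_quantile(0.975), 4))
```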