
STATISTICS 221/231 COURSE NOTES

Department of Statistics and Actuarial Science, University of Waterloo

Winter 2021 Edition


Contents

1. INTRODUCTION TO STATISTICAL SCIENCES 1
   1.1 Empirical Studies and Statistical Sciences 1
   1.2 Data Collection 3
   1.3 Data Summaries 7
   1.4 Probability Distributions and Statistical Models 29
   1.5 Data Analysis and Statistical Inference 32
   1.6 Statistical Software and R 36
   1.7 Chapter 1 Problems 37

2. STATISTICAL MODELS AND MAXIMUM LIKELIHOOD ESTIMATION 55
   2.1 Choosing a Statistical Model 55
   2.2 Maximum Likelihood Estimation 59
   2.3 Likelihood Functions for Continuous Distributions 68
   2.4 Likelihood Functions for Multinomial Models 72
   2.5 Invariance Property of Maximum Likelihood Estimate 74
   2.6 Checking the Model 75
   2.7 Chapter 2 Problems 90

3. PLANNING AND CONDUCTING EMPIRICAL STUDIES 105
   3.1 Empirical Studies 105
   3.2 The Steps of PPDAC 107
   3.3 Case Study 117
   3.4 Chapter 3 Problems 126

4. ESTIMATION 135
   4.1 Statistical Models and Estimation 135
   4.2 Estimators and Sampling Distributions 136
   4.3 Interval Estimation Using the Likelihood Function 141
   4.4 Confidence Intervals and Pivotal Quantities 145
   4.5 The Chi-squared and t Distributions 154
   4.6 Likelihood-Based Confidence Intervals 158
   4.7 Confidence Intervals for Parameters in the G(μ, σ) Model 162
   4.8 Chapter 4 Summary 172
   4.9 Chapter 4 Problems 174

5. HYPOTHESIS TESTING 189
   5.1 Introduction 189
   5.2 Hypothesis Testing for Parameters in the G(μ, σ) Model 196
   5.3 Likelihood Ratio Test of Hypothesis - One Parameter 202
   5.4 Likelihood Ratio Test of Hypothesis - Multiparameter 209
   5.5 Chapter 5 Summary 215
   5.6 Chapter 5 Problems 217

6. GAUSSIAN RESPONSE MODELS 223
   6.1 Introduction 223
   6.2 Simple Linear Regression 228
   6.3 Comparison of Two Population Means 247
   6.4 General Gaussian Response Models 259
   6.5 Chapter 6 Problems 264

7. MULTINOMIAL MODELS AND GOODNESS OF FIT TESTS 281
   7.1 Likelihood Ratio Test for the Multinomial Model 281
   7.2 Goodness of Fit Tests 283
   7.3 Two-Way (Contingency) Tables 287
   7.4 Chapter 7 Problems 293

8. CAUSAL RELATIONSHIPS 299
   8.1 Establishing Causation 299
   8.2 Experimental Studies 301
   8.3 Observational Studies 303
   8.4 Clofibrate Study 305
   8.5 Chapter 8 Problems 309

9. REFERENCES AND SUPPLEMENTARY RESOURCES 313
   9.1 References 313
   9.2 Departmental Web Resources 313

10. DISTRIBUTIONS AND STATISTICAL TABLES 315

Preface

These notes are a work-in-progress with contributions from those students taking the courses and the instructors teaching them. An original version of these notes was prepared by Jerry Lawless. Additions and revisions were made by Cyntha Struthers, Don McLeish, Jock MacKay, and others. Richard Cook supplied the example in Chapter 8. In order to provide improved versions of the notes for students in subsequent terms, please email typos and errors, or sections that are confusing, or additional comments/suggestions to [email protected].

Specific topics in these notes also have associated video files or Powerpoint shows that can be accessed at www.watstat.ca.
1. INTRODUCTION TO STATISTICAL SCIENCES

1.1 Empirical Studies and Statistical Sciences

An empirical study is one in which knowledge is gained by observation or by experiment.

Empirical studies may be conducted to further knowledge, improve systems, or determine public policy. For example, in disciplines such as insurance or finance, decisions must be made about what premium to charge for an insurance policy or whether to buy or sell a stock on the basis of available data. In medical research, decisions must be made about the safety and efficacy of new treatments for diseases such as cancer based on clinical trials. Government scientists collect data on fish stocks in order to provide information to policy makers who must set quotas or limits on commercial fishing.

Empirical studies deal with populations and processes, which are collections of individual units. To study a population, a sample of units is carefully selected from that population. To study a process, a sample of units generated by the process is examined. Since only a sample from the population or process is observed, and not all of the units are the same, there will be uncertainty in the conclusions drawn from such a study. For example, researchers at a pharmaceutical company may conduct a study to assess the effect of a new drug for controlling hypertension (high blood pressure). For cost and ethical reasons, they can only involve a relatively small sample of subjects in the study. Since people have varying degrees of hypertension, react differently to the drug, and have different side effects, there will be uncertainty in the conclusions drawn from the study. In another example, a financial engineer may collect data on currency or stock values during a previous time period to try to predict their values in a future time period. These predictions would involve uncertainty due to the variability present in such data. Finally, a commercial website may conduct a study to examine changes in the number of website hits before and after an advertising campaign. Data would be collected over a fixed period before and after the campaign. Since only a sample of the process is collected, the conclusions about any changes would also involve uncertainty.
Statistical Sciences are concerned with all aspects of empirical studies including formulating the problem, planning the experiment, collecting the data, analyzing the data, and making conclusions. In particular, Statistical Sciences deal with the study of variability in populations and processes, and with informative and cost-effective ways to collect and analyze data about such populations and processes.
Statistical data analysis occurs in a huge number of areas. For example, statistical algorithms are the basis for software involved in the automated recognition of handwritten or spoken text; statistical methods are commonly used in law cases, for example in DNA profiling; statistical process control is used to increase the quality and productivity of manufacturing and service processes; individuals are selected for direct mail marketing campaigns through a statistical analysis of their characteristics. With modern information technology, massive amounts of data are routinely collected and stored. But data do not equal information, and it is the purpose of Statistical Sciences to provide and analyze data so that the maximum amount of information or knowledge may be obtained¹. Poor or improperly analyzed data may be useless or misleading. The same could be said about poorly collected data.

Probability models are used to represent many phenomena, populations, or processes and to deal with problems that involve variability. You studied these models in your probability course and you have seen how they can be used to describe variability. This course will focus on the collection, analysis and interpretation of data, and the probability models you studied previously will be used extensively. The most important material from your probability course is the material dealing with random variables, including distributions such as the Binomial, Poisson, Multinomial, Normal or Gaussian, Uniform and Exponential. It is important to review this material on your own.

Statistical Sciences is a large discipline and this course is only an introduction. The broad objective of this course is to discuss all aspects of: problem formulation, planning of an empirical study, formal and informal analysis of data, and the conclusions and limitations of such an analysis. We must remember that data are collected and models are constructed for a specific reason. In any given application we should keep the big picture in mind (e.g. Why are we studying this? What else do we know about it?) even when considering one specific aspect of a problem.
Here is a quote² from Hal Varian, Google's chief economist.

"The ability to take data - to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it - that's going to be a hugely important skill in the next decades, not only at the professional level but even at the educational level for elementary school kids, for high school kids, for college kids. Because now we really do have essentially free and ubiquitous data. So the complimentary (sic) scarce factor is the ability to understand that data and extract value from it.

I think statisticians are part of it, but it's just a part. You also want to be able to visualize the data, communicate the data, and utilize it effectively. But I do think those skills - of being able to access, understand, and communicate the insights you get from data analysis - are going to be extremely important. Managers need to be able to access and understand the data themselves."

¹A brilliant example of how to create information through data visualization is found in the video by Hans Rosling at: http://www.youtube.com/watch?v=jbkSRLYSojo
²For the complete article see "How the web challenges managers", Hal Varian, The McKinsey Quarterly, January 2009.

1.2 Data Collection


A population is a collection of units. Examples are: the population of all students taking
STAT 231 this term; the population of all persons aged 18-25 living in Ontario on January
1, 2020; and the population of all car insurance policies issued by a particular insurance
company in the year 2019. A process is a system by which units are produced. For example, the hits on a website could be considered as units in a process. Of course this process would be quite complex and difficult to describe. Students taking STAT 231 now and into the future, or claims made by car insurance policy holders, could also be considered as units in a process. A key feature of processes is that they usually occur over time, whereas populations are often static (defined at one moment in time).
We pose questions about populations or processes by defining variates, which are characteristics of the units. Variates can be of different types. If the units are
people then variates such as their height, weight, age, and time until recurrence of disease
after medical treatment are all examples of continuous variates. The lifetime of an electrical
component (the unit) is also a continuous variate.
Suppose the unit of interest is a production run of smartphones made by a particular company; then the number of defective smartphones produced in a run is an example of a
discrete variate. If the units are trees in a particular forest then the number of aphids on
a tree is also a discrete variate. The number of deaths in a year on a particular section
of dangerous highway is another example of a discrete variate. What is the unit in this
example?
Variates such as hair colour, university program or marital status of a person (the unit)
are examples of categorical variates since these variates do not take on numerical values.
Another example of a categorical variate would be the presence or absence of a disease in
a unit. Sometimes, to facilitate the analysis of the data, we might redefine the variate of
interest to be 1 if the disease is present and 0 if the disease is absent. We would now call
the variate a discrete variate. Since the variate only takes on values 0 or 1 such a variate
is often referred to as a binary variate.
If a categorical variate has a natural ordering then it is called an ordinal variate. For
example, the size of a unit is an ordinal variate if the categories for size are: large, medium,
small. Another example of an ordinal variate would be the opinion of a person (the unit)
on a given statement in a poll for which the categories might be: strongly agree, agree,
neutral, disagree, strongly disagree.
Variates can also be complex. The open-ended response by a person (the unit) to a question on a survey is an example of a complex variate. If the units are cities and an aerial image is associated with each city (the unit), then the image is also a complex variate.
The values of a variate typically vary across the units in a population or process. This
variability generates uncertainty and makes it necessary to study populations and processes
by collecting data about them. By data, we mean the values of the variates for a sample
of units drawn from a population or process. It is important to identify the types of
variates in an empirical study, since this identification will help us in choosing statistical models for the data, which will aid us in the analysis of the data.
We are interested in functions of the variates over the population or process; for example
the average drop in blood pressure due to a treatment for individuals with hypertension
or the proportion of a population having a certain characteristic. We call these functions
attributes of the population or process.
In planning to collect data about a population or process, we must carefully specify
what the objectives are. Then, we must consider feasible methods for collecting data, as well as the extent to which it will be possible to answer the questions of interest. This sounds simple but is usually difficult to do well, especially since resources are always limited.
There are several ways in which we can obtain data. One way is purely according to
what is available: that is, data are provided by some existing source. Huge amounts of
data collected by many technological systems are of this type, for example, data on credit
card usage or on purchases made by customers in a supermarket. Sometimes it is not
clear what available data represent and they may be unsuitable for serious analysis. For
example, people who voluntarily provide data in a web survey may not be representative of
the population at large. Alternatively, we may plan and execute a sampling plan to collect
new data. Statistical Sciences stress the importance of obtaining data that will be objective
and provide maximal information at a reasonable cost.
Recall that an empirical study is one in which we learn by observation or experiment.
Most often this is done by collecting data. The empirical studies we will consider will
usually be one of the following types:

(i) Sample surveys: The object of many empirical studies is to learn about a finite population (e.g. all persons over 19 in Ontario as of September 1 in a given year). In this case information about the population may be obtained by selecting a "representative" sample of units from the population and determining the variates of interest for each unit in the sample. Obtaining such a sample can be challenging and expensive. In a sample survey the variates of interest are most often collected using a questionnaire. Sample surveys are widely used in government statistical studies, economics, marketing, public opinion polls, sociology, quality assurance and other areas.

(ii) Observational studies: An observational study is one in which data are collected about a population or process without any attempt to change the value of one or more variates for the sampled units. For example, in studying risk factors associated with a disease such as lung cancer, we might investigate all cases of the disease at a particular hospital (or perhaps a sample of them) that occur over a given time period. We would also examine a sample of individuals who did not have the disease. A distinction between a sample survey and an observational study is that for observational studies the population of interest is usually infinite or conceptual. For example, in investigating risk factors for a disease, we prefer to think of the population of interest as a conceptual one consisting of persons at risk from the disease recently or in the future.

(iii) Experimental studies: An experimental study is one in which the experimenter (that is, the person conducting the study) intervenes and changes or sets the values of one or more variates for the units in the sample. For example, in an engineering experiment to quantify the effect of temperature on the performance of a certain type of computer chip, the experimenter might decide to run a study with 40 chips, ten of which are operated at each of four temperatures: 10, 20, 30, and 40 degrees Celsius. Since the experimenter decides the temperature level for each chip in the sample, this is an experiment.

These three types of empirical studies are not mutually exclusive, and many studies
involve aspects of all of them. Here are some slightly more detailed examples.

Example 1.2.1 A sample survey about smoking

Suppose we wish to study the smoking behaviour of Ontario residents aged 14–20 years. (Think about reasons why such studies are considered important.) Of course, the population of Ontario residents aged 14–20 years and their smoking habits both change over time, so we will content ourselves with a snapshot of the population at some point in time (e.g. the second week of September in a given year). Since we cannot afford to contact all persons in the population, we decide to select a sample of persons from the population of interest. (Think about how we might do this - it is quite difficult!) We decide to measure the following variates on each person in the sample: age, sex, place of residence, occupation, current smoking status, length of time smoked, etc.

Note that we have to decide how we are going to obtain our sample and how large it should be. The former question is very important if we want to ensure that our sample provides a good picture of the overall population. The amount of time and money available to carry out the study heavily influences how we will proceed.

Example 1.2.2 A study of a manufacturing process

When a manufacturer produces a product in packages stated to weigh or contain a certain amount, they are generally required by law to provide at least the stated amount in each package. Since there is always some inherent variation in the amount of product which the manufacturing process deposits in each package, the manufacturer has to understand this variation and set up the process so that no packages, or only a very small fraction of packages, contain less than the required amount.

Consider, for example, soft drinks sold in nominal 355 ml cans. Because of inherent variation in the filling process, the amount of liquid y that goes into a can varies over a small range. Note that the manufacturer would like the variability in y to be as small as possible, and for cans to contain at least 355 ml. Suppose that the manufacturer has just added a new filling machine to increase the plant's capacity. The process engineer wants to compare the new machine with an old one. Here the population of interest is the cans filled in the future by both machines. The process engineer decides to do this by sampling some filled cans from each machine and accurately measuring the amount of liquid y in each can. This is an observational study.

How exactly should the sample be chosen? The machines may drift over time (that is, the average of the y values or the variability in the y values may vary systematically up or down over time), so we should select cans over time from each machine. We have to decide how many, over what time period, and when to collect the cans from each machine.

Example 1.2.3 A clinical trial in medicine

In studies of the treatment of disease, it is common to compare alternative treatments in experiments called clinical trials. Consider, for example, a population of persons who are at high risk of a stroke. Some years ago it was established in clinical trials that small daily doses of aspirin (which acts as a blood thinner) could lower the risk of stroke. This was done by giving some high-risk subjects daily doses of aspirin (call this Treatment 1) and others a daily dose of a placebo (an inactive compound) given in the same form as the aspirin (call this Treatment 2). The two treatment groups were then followed for a period of time, and the number of strokes in each group was observed. Note that this is an experimental study because the researchers decided which subjects in the sample received Treatment 1 and which subjects received Treatment 2.

This sounds like a simple plan to implement but there are several important points. For example, patients should be assigned to receive Treatment 1 or Treatment 2 in some random fashion to avoid unconscious bias (e.g. doctors might otherwise tend to put persons at higher risk of stroke in the aspirin group) and to balance other factors (e.g. age, sex, severity of condition) across the two groups. It is also best not to let the patients or their doctors know which treatment they are receiving. This type of study is called a double-blind study. Many other questions must also be addressed. For example, what variates should we measure other than the occurrence of a stroke? What should we do about patients who are forced to drop out of the study because of adverse side effects? Is it possible that the aspirin treatment works for certain types of patients but not others? How long should the study go on? How many persons should be included?

As an example of a statistical setting where the data are not obtained by a sample survey, an experimental study, or even an observational study, consider the following.

Example 1.2.4 Direct marketing campaigns

Nearly every major retailer has a predictive analytics department devoted to understanding not just consumers' shopping habits but also their personal habits, so they can market to them more efficiently. The retail chain Target has been particularly good at this. Since Target sells everything from food to toys to lawn furniture to electronics, one of its primary goals is to try to convince customers that Target is the only store they need. Once consumers adopt certain shopping habits, however, it is very difficult to change them even with the most ingenious ad campaigns. One group that is more open to changes in their buying habits is new parents. Because birth records are usually public, new parents are bombarded with offers and advertisements from all sorts of companies as soon as the baby arrives. Target hypothesized that if they could identify women earlier in their pregnancy, before the baby was born, and send them specially designed ads, then there was a good chance of getting them to shop at Target for years. How did Target determine if a woman was pregnant?

Target has collected large amounts of data on their customers for decades. Every person who makes a credit card purchase, fills out a survey, mails in a refund on a purchase, calls the customer help line, or visits the website is assigned a guest ID. Linked to the guest ID is information on credit card purchases as well as demographic information like age, marital status, number of children, address, estimated salary, types of credit cards and websites visited. Target also buys data about ethnicity, job history, magazines read, college attended, topics discussed online, etc. The data scientist working for Target was able to identify a large number of variates that, when analyzed together, allowed him to assign each shopper a "pregnancy prediction" score. Based on these scores Target could then select which women to send the specially designed ads. For more information on how Target used these scores to increase sales see www.nytimes.com/2012/02/19/magazine/shopping-habits.html.

1.3 Data Summaries


When we study a population or process we collect data. We cannot answer the questions of interest without summarizing the data. Summaries are especially important when we report the conclusions of the study. Summaries must be clear and informative with respect to the questions of interest and, since they are summaries, we need to make sure that they are not misleading. There are two classes of summaries: numerical and graphical.

We represent variates by letters such as $x$, $y$, $z$. For example, we might define a variate $y$ as the size in dollars of an insurance claim, or the first language that a person learned to speak.
Suppose that data on a variate $y$ is collected for $n$ units in a population or process. By convention, we label the units as $1, 2, \ldots, n$ and denote their respective $y$ values as $y_1, y_2, \ldots, y_n$. We might also collect data on a second variate $x$ for each unit, and we would denote the values as $x_1, x_2, \ldots, x_n$. We refer to $n$ as the sample size and to $\{x_1, x_2, \ldots, x_n\}$, $\{y_1, y_2, \ldots, y_n\}$ or $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$ as data sets. Most data sets contain the values for many variates.

Numerical Summaries

We now describe some numerical summaries which are useful for describing features of a single variate in a data set. These summaries fall generally into three categories: measures of location (mean, median, and mode), measures of variability or dispersion (variance, range, and interquartile range), and measures of shape (skewness and kurtosis). These summaries are used when the variate is either discrete or continuous.

Measures of location

The sample mean, also called the sample average:
$$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$$

The sample median $\hat{m}$: the middle value when $n$ is odd and the sample is ordered from smallest to largest, and the average of the two middle values when $n$ is even.

The sample mode: the value of $y$ which appears in the sample with the highest frequency (not necessarily unique).

The sample mean, median and mode describe the "center" of the distribution of variate values in a data set. The units for mean, median and mode (e.g. centimeters, degrees Celsius, etc.) are the same as for the original variate. Since the median is less affected by a few extreme observations (see Problem 1), it is a more robust measure of location.
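To make these definitions concrete, here is a short sketch of the three measures of location and of the robustness remark about the median. (The software used in these notes is R; for illustration here we use Python, and the sample itself is made up.)

```python
import statistics

y = [2.1, 3.4, 2.9, 3.0, 2.7, 3.2, 2.9]   # hypothetical sample, n = 7

mean = sum(y) / len(y)             # sample mean: (1/n) * sum of the y_i
median = statistics.median(y)      # middle value of the ordered sample (n odd here)
mode = statistics.mode(y)          # most frequent value (2.9 occurs twice)
print(mean, median, mode)          # mean is about 2.886; median 2.9; mode 2.9

# Robustness: replace one value with an extreme observation.
# The mean is dragged toward the outlier; the median barely moves.
y_out = y[:-1] + [100.0]
print(sum(y_out) / len(y_out))     # mean jumps to about 16.76
print(statistics.median(y_out))    # median changes only from 2.9 to 3.0
```

The contrast in the last two lines is exactly why the median is called a robust measure of location: a single wild observation moves the mean by a large amount but the median by almost nothing.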

Measures of dispersion or variability

The sample variance:
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} y_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} y_i\right)^{2}\right]$$
and the sample standard deviation: $s = \sqrt{s^2}$.

The range $= y_{(n)} - y_{(1)}$, where $y_{(n)} = \max(y_1, y_2, \ldots, y_n)$ and $y_{(1)} = \min(y_1, y_2, \ldots, y_n)$.

The interquartile range $IQR$ (see Definition 3).

The sample variance and sample standard deviation measure the variability or spread of the variate values in a data set. The units for standard deviation, range, and interquartile range (e.g. centimeters, degrees Celsius, etc.) are the same as for the original variate. Since the interquartile range is less affected by a few extreme observations (see Problem 2), it is a more robust measure of variability.
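The two expressions for $s^2$ above are algebraically identical: the second ("one-pass") form only needs $\sum y_i$ and $\sum y_i^2$. The following sketch (Python again, same made-up sample as before) checks this numerically and also computes $s$ and the range:

```python
import math

y = [2.1, 3.4, 2.9, 3.0, 2.7, 3.2, 2.9]   # hypothetical sample, n = 7
n = len(y)
ybar = sum(y) / n

# Definitional form: sum of squared deviations, divided by n - 1
s2_def = sum((yi - ybar) ** 2 for yi in y) / (n - 1)

# One-pass computational form: uses sum(y_i^2) and (sum y_i)^2 only
s2_onepass = (sum(yi ** 2 for yi in y) - sum(y) ** 2 / n) / (n - 1)

assert abs(s2_def - s2_onepass) < 1e-9    # the two forms agree

s = math.sqrt(s2_def)                     # sample standard deviation
sample_range = max(y) - min(y)            # y_(n) - y_(1)
print(s2_def, s, sample_range)
```

Note the divisor is $n - 1$, not $n$; the reason for this choice is discussed when estimators are studied in Chapter 4.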

Measures of shape

The sample skewness:
$$g_1 = \frac{\frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^3}{\left[\frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^2\right]^{3/2}}$$

The sample kurtosis:
$$g_2 = \frac{\frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^4}{\left[\frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^2\right]^{2}}$$

Measures of shape generally indicate how the data, in terms of a relative frequency histogram, differ from the Normal bell-shaped curve; for example, whether one "tail" of the relative frequency histogram is substantially larger than the other so that the histogram is asymmetric, or whether both tails of the relative frequency histogram are large so that the data are more prone to extreme values than data from a Normal distribution. Sample skewness and sample kurtosis have no units.
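Both formulas are ratios of central moments $\frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^k$, which makes them straightforward to compute. A minimal sketch (Python, with made-up samples chosen to show the sign behaviour discussed below):

```python
def central_moment(y, k):
    """k-th sample central moment: (1/n) * sum (y_i - ybar)^k."""
    n = len(y)
    ybar = sum(y) / n
    return sum((v - ybar) ** k for v in y) / n

def sample_skewness(y):
    # g1 = m3 / m2^(3/2)
    return central_moment(y, 3) / central_moment(y, 2) ** 1.5

def sample_kurtosis(y):
    # g2 = m4 / m2^2
    return central_moment(y, 4) / central_moment(y, 2) ** 2

# Symmetric data: positive and negative cubed deviations balance, so g1 = 0.
print(sample_skewness([1, 2, 3, 4, 5]))          # 0.0
# Flat, tail-free data give a kurtosis below 3.
print(sample_kurtosis([1, 2, 3, 4, 5]))          # 1.7
# A long right tail makes the positive cubes dominate, so g1 > 0.
print(sample_skewness([1, 1, 2, 2, 3, 10]) > 0)  # True
```

Dividing by a power of the second central moment is what makes $g_1$ and $g_2$ unit-free: rescaling every $y_i$ by a constant leaves both values unchanged.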

Figure 1.1: Relative frequency histogram for data with positive skewness (skewness = 1.15)

The sample skewness is a measure of the (lack of) symmetry in the data. When the
relative frequency histogram of the data is approximately symmetric then there is an
approximately equal balance between the positive and negative values in the sum \sum_{i=1}^{n} (y_i - \bar{y})^3
and this results in a value for the sample skewness that is approximately zero.
If the relative frequency histogram of the data has a long right tail (see Figure 1.1), then
the positive values of (y_i - \bar{y})^3 dominate the negative values in the sum and the value of

Figure 1.2: Relative frequency histogram for data with negative skewness (skewness = -1.35)

the skewness will be positive. Similarly if the relative frequency histogram of the data has
a long left tail (see Figure 1.2) then the negative values of (y_i - \bar{y})^3 dominate the positive
values in the sum and the value of the skewness will be negative.

Figure 1.3: Relative frequency histogram for data with kurtosis > 3 (skewness = 0.71, kurtosis = 5.24; G(0.15, 1.52) p.d.f. superimposed)

The sample kurtosis measures the heaviness of the tails and the peakedness of the data
relative to data that are Normally distributed. Since the term (y_i - \bar{y})^4 is always positive,
the kurtosis is always positive. If the sample kurtosis is greater than 3 then this indicates
heavier tails (and a more peaked center) than data that are Normally distributed. For data
that arise from a model with no tails, for example the Uniform distribution, the sample

Figure 1.4: Relative frequency histogram for data with kurtosis < 3 (skewness = 0.08, kurtosis = 1.73; G(4.9, 2.9) p.d.f. superimposed)

kurtosis will be less than 3. See Figures 1.3 and 1.4. Typical financial data such as the
S&P500 index have kurtosis values greater than three, because the extreme returns (both
large and small) are more frequent than one would expect for Normally distributed data.
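The formulas for g_1 and g_2 translate directly into code. Here is a small Python sketch (the function names are ours, not standard library routines):

```python
def sample_skewness(y):
    """g1: third central moment over the 3/2 power of the second."""
    n = len(y)
    ybar = sum(y) / n
    m2 = sum((yi - ybar) ** 2 for yi in y) / n
    m3 = sum((yi - ybar) ** 3 for yi in y) / n
    return m3 / m2 ** 1.5

def sample_kurtosis(y):
    """g2: fourth central moment over the square of the second."""
    n = len(y)
    ybar = sum(y) / n
    m2 = sum((yi - ybar) ** 2 for yi in y) / n
    m4 = sum((yi - ybar) ** 4 for yi in y) / n
    return m4 / m2 ** 2
```

For perfectly symmetric data such as 1, 2, 3, 4, 5 the skewness is exactly zero, and the kurtosis (1.7) is below 3, consistent with the flat, no-tail shape described above.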

Another way to numerically summarize data is to use sample percentiles or quantiles.

Sample Quantiles and Percentiles


For 0 < p < 1, the pth quantile (also called the 100pth percentile) is a value q(p) such
that approximately a fraction p of the y values in the data set are less than q(p) and
approximately 1 - p are greater than q(p). Depending on the size of the data set, quantiles
are not uniquely defined for all values of p. There are different conventions for defining
quantiles in these cases. If the sample size is large, the differences in the quantiles based on
the various definitions are small. We will use the following definition to determine quantiles.

Definition 1  Let y_{(1)}, y_{(2)}, \ldots, y_{(n)} where y_{(1)} \le y_{(2)} \le \cdots \le y_{(n)} be the order statistic
for the data set \{y_1, y_2, \ldots, y_n\}. For 0 < p < 1, the pth (sample) quantile (also called the
100pth (sample) percentile), is a value, call it q(p), determined as follows:

Let k = (n + 1)p where n is the sample size.

If k is an integer and 1 \le k \le n, then q(p) = y_{(k)}.

If k is not an integer but 1 < k < n then determine the closest integer j such that
j < k < j + 1 and then q(p) = \frac{1}{2}\left(y_{(j)} + y_{(j+1)}\right).

The quantiles q(0.25), q(0.5) and q(0.75) are often used to summarize a data set and
are given special names.

Definition 2  The quantiles q(0.25), q(0.5) and q(0.75) are called the lower or first quartile,
the median, and the upper or third quartile respectively.

Example 1.3.1
Consider the data set of 12 observations which has already been ordered from smallest
to largest:

1.2  6.6  6.8  7.6  7.9  9.1  10.9  11.5  12.2  12.7  13.1  14.3

For p = 0.25, k = (12 + 1)(0.25) = 3.25 so

lower quartile = q(0.25) = \frac{1}{2}\left(y_{(3)} + y_{(4)}\right) = \frac{1}{2}(6.8 + 7.6) = 7.2

For p = 0.5, k = (12 + 1)(0.5) = 6.5 so

median = \hat{m} = q(0.5) = \frac{1}{2}\left(y_{(6)} + y_{(7)}\right) = \frac{1}{2}(9.1 + 10.9) = 10

For p = 0.75, k = (12 + 1)(0.75) = 9.75 so

upper quartile = q(0.75) = \frac{1}{2}\left(y_{(9)} + y_{(10)}\right) = \frac{1}{2}(12.2 + 12.7) = 12.45

Also for p = 0.1, k = (12 + 1)(0.1) = 1.3 so

q(0.1) = \frac{1}{2}\left(y_{(1)} + y_{(2)}\right) = \frac{1}{2}(1.2 + 6.6) = 3.9
2 2

A way to quantify the variability of the variate values in a data set is to use the
interquartile range (IQR), which is the difference between the upper and lower quartiles.

Definition 3  The interquartile range is IQR = q(0.75) - q(0.25).

The five number summary provides a concise numerical summary of a data set, giving
information about the location (through the median), the spread (through the lower and
upper quartiles) and the range (through the minimum and maximum values).

Definition 4  The five number summary of a data set consists of the smallest observation,
the lower quartile, the median, the upper quartile and the largest value, that is, the five
values: y_{(1)}, q(0.25), q(0.5), q(0.75), y_{(n)}.

Example 1.3.2 Comparison of body mass index


In a study of obesity in New Zealand, a sample of 150 men and 150 women were selected
from workers aged 18 to 60. The height in meters and weight in kilograms was measured
for each subject (unit). These variates are both continuous variates. Height and weight
were recorded to 2 decimal places. This means that there are a finite number of possible
values for the recorded height and the recorded weight. This does not imply that height
and weight are discrete variates. The accuracy of the measuring device does not change
the type of the variate. The type of a variate is important because we use this information
in choosing a probability model to analyse the data as discussed in Section 1.4. Variates
such as height and weight are typically modelled using a continuous distribution such as
the Gaussian distribution.
For each subject the body mass index (BMI) was also calculated using

BMI = \frac{\text{weight (kg)}}{[\text{height (m)}]^2}

BMI is a continuous variate. Often the value of BMI is used to classify a subject as being
“overweight”, “normal weight”, “underweight”, etc. One possible classification is given in
Table 1.1.

Underweight        BMI < 18.5
Normal Weight      18.5 ≤ BMI < 25.0
Overweight         25.0 ≤ BMI < 30.0
Moderately Obese   30.0 ≤ BMI < 35.0
Severely Obese     35.0 ≤ BMI
Table 1.1: BMI classification

Suppose Table 1.1 was used to determine the BMI class for each subject and we called this
new variate “BMI class”. BMI class is an example of an ordinal variate.
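The BMI calculation and the classification in Table 1.1 can be expressed as a short Python sketch (the function names here are ours):

```python
def bmi(weight_kg, height_m):
    """Body mass index: weight (kg) divided by height (m) squared."""
    return weight_kg / height_m ** 2

def bmi_class(b):
    """Ordinal BMI class according to Table 1.1."""
    if b < 18.5:
        return "Underweight"
    elif b < 25.0:
        return "Normal Weight"
    elif b < 30.0:
        return "Overweight"
    elif b < 35.0:
        return "Moderately Obese"
    else:
        return "Severely Obese"
```

For subject 1 in Table 1.2, bmi(63.81, 1.76) is about 20.6, which falls in the Normal Weight class.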
The data are available in the file bmidata.txt posted on the course website. To analyse
the data, it is convenient to record the data in row-column format (see Table 1.2). The
first row of the file gives the variate names, in this case, subject number, sex (M = male or
F = female), height, weight and BMI. Each subsequent row gives the variate values for a
particular subject.

subject  sex  height  weight  BMI
1        M    1.76    63.81   20.6
2        M    1.77    89.60   28.6
3        M    1.91    88.65   24.3
4        M    1.80    74.84   23.1
Table 1.2: First 5 rows of the file bmidata.txt

The five number summaries for the variate BMI for each sex are given in Table 1.3 along
with the sample mean and standard deviation. We see that there are only small differences
in the median and the mean. For the standard deviation, IQR and the range we notice
that the values are all larger for the females. In other words, there is more variability in
the BMI measurements for females than for males in this sample.

Sex     y(1)  q(0.25)  q(0.5)  q(0.75)  y(n)  ȳ      s
Female  16.4  23.4     26.8    29.75    38.8  26.92  4.60
Male    18.3  24.6     26.75   29.15    37.5  27.08  3.56
Table 1.3: Summary of BMI by sex

We can also construct a relative frequency table that gives the proportion of subjects
that fall within each BMI class by sex (see Table 1.4). From the table we can see that
the reason that the variability in the BMI variate for females is larger than for males is
because there is a larger proportion of females in the two extreme classes “underweight”
and “severely obese” as compared to the males.

BMI Class          Males  Females
Underweight        0.01   0.02
Normal Weight      0.28   0.33
Overweight         0.50   0.42
Moderately Obese   0.19   0.17
Severely Obese     0.02   0.06
Total              1.00   1.00
Table 1.4: BMI Class Relative Frequency Table by Sex

Sample correlation
So far we have looked only at numerical summaries of a data set \{y_1, y_2, \ldots, y_n\}. Often
we have bivariate data of the form \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}. A numerical summary
of such data is the sample correlation.

Definition 5  The sample correlation, denoted by r, for data \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}
is

r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}

where

S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2

S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right) \left( \sum_{i=1}^{n} y_i \right)

S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} y_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} y_i \right)^2

The sample correlation, which takes on values between -1 and 1, is a measure of the
linear relationship between the two variates x and y. If the value of r is close to 1 then
we say that there is a strong positive linear relationship between the two variates while if
the value of r is close to -1 then we say that there is a strong negative linear relationship
between the two variates. If the value of r is close to 0 then we say that there is no linear
relationship between the two variates.
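Definition 5 can be sketched in a few lines of Python (our own helper, not a library routine):

```python
def sample_correlation(x, y):
    """r = Sxy / sqrt(Sxx * Syy) for paired data of equal length."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    return sxy / (sxx * syy) ** 0.5
```

For data that lie exactly on a line with positive slope, r = 1; reversing the y values gives r = -1.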

Example 1.3.2 Continued

If we let x = height and y = weight then the sample correlation for the males is r = 0.55
which indicates that there is a positive linear relationship between height and weight which
is what we would expect. For the females r = 0.31 which also indicates that there is a
positive linear relationship between height and weight but the relationship is not as strong
as for males.

Relative risk
Recall that values for a categorical variate are category names that do not necessarily
have any ordering. If two variates of interest in a study are categorical variates then the
sample correlation cannot be used as a measure of the relationship between the two variates.

Example 1.3.3 Physicians’ Health Study

During the 1980’s in the United States a very large study called the Physicians’ Health
Study was conducted to study the relationship between taking daily aspirin and the occurrence
of coronary heart disease (CHD). For each physician (unit) in the study two categorical
variates were collected: (1) whether the physician was assigned to the daily aspirin group
or the placebo group and (2) whether or not the physician experienced CHD during the
study. The data can be summarized by giving the observed frequency for each of the four
possible outcomes as shown in Table 1.5.

               CHD  No CHD  Total
Placebo        189  10845   11034
Daily Aspirin  104  10933   11037
Total          293  21778   22071
Table 1.5: Physicians’ Health Study

To summarize the relationship between two categorical variates consider a generalized
version of Table 1.5 given by

        A          B̄ (complement of A column: Ā)
        A                Ā               Total
B       y11              y12             y11 + y12
B̄       y21              y22             y21 + y22
Total   y11 + y21        y12 + y22       n
Table 1.6: General two-way table

Recall that events A and B are independent events if P(A ∩ B) = P(A)P(B) or equivalently
P(A) = P(A|B) = P(A|B̄). If A and B are independent events then

\frac{P(A \mid B)}{P(A \mid \bar{B})} = 1

and otherwise the ratio is not equal to one. In the Physicians’ Health Study if we let A =
takes daily aspirin and B = experienced CHD then we can estimate this ratio using the
ratio of the sample proportions.

Definition 6  For categorical data in the form of Table 1.6 the relative risk of event A in
group B as compared to group B̄ is

relative risk = \frac{y_{11} / (y_{11} + y_{12})}{y_{21} / (y_{21} + y_{22})}

Example 1.3.3 Revisited

For the Physicians’ Health Study the relative risk of CHD in the placebo group as
compared to the aspirin group is

relative risk = \frac{189 / (189 + 10845)}{104 / (104 + 10933)} = 1.82

The data suggest that the group taking the placebo are nearly twice as likely to experience
CHD as compared to the group taking the daily aspirin. Can we conclude that daily aspirin
reduces the occurrence of CHD? The topic of causation will be discussed in more detail in
Chapter 8.
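Definition 6 applied to Table 1.5 can be checked with a one-line Python function (a sketch; the variable names follow the layout of Table 1.6):

```python
def relative_risk(y11, y12, y21, y22):
    """Relative risk of event A in group B versus group B-bar (Definition 6)."""
    return (y11 / (y11 + y12)) / (y21 / (y21 + y22))

# Physicians' Health Study (Table 1.5): rows are Placebo and Daily Aspirin,
# columns are CHD and No CHD.
rr = relative_risk(189, 10845, 104, 10933)
```

Here rr is approximately 1.82, reproducing the value computed above.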

In Chapter 7 we consider methods for analyzing data which can be summarized in a
two-way table like Table 1.6.

Graphical Summaries
Graphical summaries or data visualizations are important tools for seeing patterns in data
and for communicating results. Although the graphical summaries we present here are
quite simple, they provide the building blocks for more advanced visualizations used in
data science and data mining.
We consider graphical summaries for both univariate data sets \{y_1, y_2, \ldots, y_n\} and
bivariate data sets \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}.

Frequency histograms
Consider measurements \{y_1, y_2, \ldots, y_n\} on a variate y. Partition the range of y into k
non-overlapping intervals I_j = [a_{j-1}, a_j), j = 1, 2, \ldots, k and then calculate

f_j = number of values from \{y_1, y_2, \ldots, y_n\} that are in I_j

for j = 1, 2, \ldots, k. The f_j are called the observed frequencies for I_1, I_2, \ldots, I_k; note that
\sum_{j=1}^{k} f_j = n.

A histogram is a graph in which a rectangle is constructed above each interval I_1, I_2, \ldots, I_k.
The height of the rectangle for interval I_j is chosen so that the area of the rectangle is
proportional to f_j. Two main types of frequency histograms are:

(a) a “standard” frequency histogram where the intervals I_j are of equal length. The
height of the rectangle for I_j is the frequency f_j or relative frequency f_j/n.

(b) a “relative” frequency histogram, where the intervals I_j = [a_{j-1}, a_j) may or may not
be of equal length. The height of the rectangle for I_j is set equal to

\frac{f_j/n}{a_j - a_{j-1}}

so that the area of the jth rectangle equals f_j/n. With this choice of height we have

\sum_{j=1}^{k} \frac{f_j/n}{a_j - a_{j-1}} (a_j - a_{j-1}) = \frac{1}{n} \sum_{j=1}^{k} f_j = \frac{n}{n} = 1

so the total area of the rectangles is equal to one.

If intervals of equal length are used then a standard frequency histogram and a relative
frequency histogram look identical except for the labeling of the vertical axis. As just
shown, the sum of the areas of the rectangles for a relative frequency histogram equals
one. Recall that the area under a probability density function for a continuous random
variable equals one. Therefore if we wish to superimpose a probability density function
on a histogram to see how well the model fits the data we must use a relative frequency
histogram. If we wish to compare two data sets which have different sample sizes then a
relative frequency histogram must always be used. The vertical axis is labelled “density”
to emphasize that such a histogram is being used.

To construct a frequency histogram, the number and location of the intervals must be
chosen. The intervals are typically selected so that there are ten to fifteen intervals and
each interval contains at least one y value from the sample (that is, each f_j ≥ 1). If a
software package is used to produce the frequency histogram then the intervals are usually
chosen automatically. An option for user specified intervals is also usually provided.
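The height formula for a relative frequency histogram can be checked numerically. Below is a Python sketch (our own helper; intervals follow the [a_{j-1}, a_j) convention above) showing that the areas sum to one even with unequal interval lengths:

```python
def density_heights(y, edges):
    """Heights (f_j/n) / (a_j - a_{j-1}) for intervals [a_{j-1}, a_j)."""
    n = len(y)
    heights = []
    for a_lo, a_hi in zip(edges[:-1], edges[1:]):
        f = sum(1 for yi in y if a_lo <= yi < a_hi)  # observed frequency f_j
        heights.append((f / n) / (a_hi - a_lo))
    return heights

# Unequal interval lengths: the areas (height * width) still sum to one.
y = [0.5, 1.5, 2.5, 3.5]
h = density_heights(y, [0, 1, 2, 4])
area = sum(ht * w for ht, w in zip(h, [1, 1, 2]))
```

Here area equals 1, matching the algebra above.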

Example 1.3.2 Continued


Figures 1.5 and 1.6 give the relative frequency histograms for BMI for males and females
separately. We often say that histograms show the distribution of the data. Here the shapes
of the two distributions are somewhat bell-shaped. In each case the skewness is positive
but close to zero while the kurtosis is close to three.

Figure 1.5: Relative frequency histogram for male BMI data (skewness = 0.41, kurtosis = 3.03)

Figure 1.6: Relative frequency histogram for female BMI data (skewness = 0.30, kurtosis = 2.79)



Example 1.3.4 Lifetimes of brake pads

A frequency histogram can have many different shapes. Figure 1.7 shows a relative
frequency histogram of the lifetimes (in terms of number of thousand km driven) for the
front brake pads on 200 new mid-size cars of the same type. The variate lifetime is a
continuous variate.
The data are available in the file brakepaddata.txt posted on the course website. Notice
that the distribution of the brake pad lifetimes has a very different shape compared to the
BMI histograms. The shape does not resemble a bell-shaped curve. The distribution is not
symmetric and has a long right tail which is consistent with a skewness value equal to 1.28
which is positive and not close to zero. The sample mean is ȳ = 49.03 thousand km and
the sample standard deviation is s = 36.65 thousand km. The large variability in lifetimes
is due to the wide variety of driving conditions which different cars are exposed to, as well
as to variability in how soon car owners decide to replace their brake pads.

Figure 1.7: Relative frequency histogram of brake pad lifetime data (skewness = 1.28)

Bar Graphs
For categorical data, a bar graph or bar chart is a useful graphical summary. A bar
graph has a bar for each of the possible values of the categorical variate with height equal
to the frequency or relative frequency of that category. Usually the order of the different
possible categories is not important. The width of the bar is also not important. Gaps are
left between the bars to emphasize that the data are categorical.

Example 1.3.5 Global market share of browsers


The bar chart in Figure 1.8 shows the global market share of browsers in June 2017
according to StatCounter, a web analytics company. What data might StatCounter have
collected to create this graphical summary? See gs.stat.counter.com for details.

Figure 1.9 illustrates how a bar graph can be used to compare the global market share
of browsers in June 2015, June 2016, and June 2017.

Figure 1.8: Global market share of browsers June 2017

Figure 1.9: Global market share of browsers June 2015-2017



Pie charts, which are another way to display categorical data, are often used in the
media. Pie charts are used very infrequently by statisticians since the human eye is not
good at judging how much area is taken up by a wedge.
Bar graphs and pie charts are often used incorrectly in the media. See Chapter 1,
Problems 19-23.

Empirical cumulative distribution function

Consider the following data set of 10 observations

3.1  0.6  1.6  1.8  0.3  3.8  1.0  0.8  2.9  1.7

Order the observations from smallest to largest to obtain the order statistic

0.3  0.6  0.8  1.0  1.6  1.7  1.8  2.9  3.1  3.8

Suppose we assume that these observations come from an unknown cumulative distribution
F(y) = P(Y ≤ y). If we wanted to estimate F(1.5) = P(Y ≤ 1.5) then intuitively it seems
reasonable to estimate this probability by determining the proportion of observations which
are less than or equal to 1.5. Since there are four such values (0.3, 0.6, 0.8 and 1.0), we
estimate F(1.5) by F̂(1.5) = 4/10 = 0.4. Since there are no observations between 1.0 and
1.6 then for any y ∈ [1.0, 1.6) we would estimate F(y) = P(Y ≤ y) using F̂(y) = 0.4.
We can estimate F(y) = P(Y ≤ y) in a similar way for any value of y. This leads us
to the following definition:

Definition 7  For a data set \{y_1, y_2, \ldots, y_n\}, the empirical cumulative distribution function
or e.c.d.f. is defined by

\hat{F}(y) = \frac{\text{number of values in the set } \{y_1, y_2, \ldots, y_n\} \text{ which are} \le y}{n} \quad \text{for all } y \in \mathbb{R}

The empirical cumulative distribution function is an estimate, based on the data, of the
population cumulative distribution function.

A graph of F̂(y) gives us a graphical summary of the data set \{y_1, y_2, \ldots, y_n\}.

For the data set of 10 observations, the graph of F̂(y) is given in Figure 1.10. The
vertical lines are added to make the graph look visually more like a cumulative distribution
function. We note that F̂(y) has a jump of height 0.1 at each of the unique values in the
ordered data set.
More generally, for an ordered data set y_{(1)}, y_{(2)}, \ldots, y_{(n)} of unique observations, F̂(y_{(j)}) =
j/n and the jumps are all of size 1/n.
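Definition 7 in code: the Python sketch below (our own helper) returns F̂ as a function and is applied to the 10 observations above:

```python
def ecdf(data):
    """Return the empirical c.d.f. F-hat of Definition 7 as a function."""
    ys = sorted(data)
    n = len(ys)
    def F_hat(y):
        # proportion of observations less than or equal to y
        return sum(1 for v in ys if v <= y) / n
    return F_hat

data = [3.1, 0.6, 1.6, 1.8, 0.3, 3.8, 1.0, 0.8, 2.9, 1.7]
F = ecdf(data)
```

Here F(1.5) returns 0.4, and F(y) is 0.4 for any y in [1.0, 1.6), matching the discussion above.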

Figure 1.10: Empirical cumulative distribution function for 10 observations

Example 1.3.6
Figure 1.11 shows a graph of the empirical cumulative distribution function for 100
observations which were randomly generated from an Exponential model and then rounded
to one decimal place. The observations are not all unique. In particular, at y = 4.3 there
is a jump of height 0.04 which would indicate that there are 4 observations equal to 4.3.
The plot of the empirical cumulative distribution function does not show the shape
of the distribution quite as clearly as a plot of the relative frequency histogram does. It
requires more effort to determine if the distribution is symmetric or skewed. We can see
from Figure 1.11 that for this data set the values of q(p) are changing more rapidly for
p ≥ 0.8. This means the data are not symmetric but are positively skewed with a long right
tail.
Often when large data sets are reported in the media or research journals the individual
observations are not reported. Sometimes only a graph like the empirical cumulative
distribution function is given. What information can we obtain from the graph of the
empirical cumulative distribution function? In addition to the information about the shape
mentioned above, the graph allows us to determine the pth quantile or 100pth percentile
q(p). For example, from Figure 1.11 we can determine, using the red dashed lines, that
the lower quartile = q(0.25) = 0.9, the median = q(0.5) = 2.6, and the upper quartile
= q(0.75) = 5.3. These are not exactly the same values that would be obtained if we had
all the data and used Definition 1, however the values would be very close. From q(0.75)
and q(0.25) we can determine that the IQR = q(0.75) - q(0.25) = 5.3 - 0.9 = 4.4. Finally
we can also see that y_{(1)} = 0.0 and y_{(100)} = 16.1 and therefore the range = 16.1 - 0.0 = 16.1.

Figure 1.11: Empirical cumulative distribution function of 100 observations

The empirical cumulative distribution function can also be used to compare two data
sets by graphing their empirical cumulative distribution functions on the same graph as
shown in the next example.

Example 1.3.2 Continued

Figure 1.12 shows the empirical cumulative distribution function for male and female
heights on the same plot for the data in the file bmidata.txt posted on the course website. As
you might expect we see that the distribution of male heights is similar to the distribution
of female heights but shifted to the right reflecting the fact that males are generally taller
than females.
We can also determine from Figure 1.12 that the median height for females is 1.60 and
for males the median height is 1.73. The symmetry of the two curves about their respective
medians indicates that the distribution of heights is reasonably symmetric for both males
and females.
For females q(0.25) = 1.57, q(0.75) = 1.67, IQR = 1.67 - 1.57 = 0.1, and range
= 1.79 - 1.41 = 0.38. For males q(0.25) = 1.71, q(0.75) = 1.79, IQR = 1.79 - 1.71 = 0.08,
and range = 1.93 - 1.56 = 0.37. The IQR and range for females are very similar to the
IQR and range for males.

Figure 1.12: Empirical cumulative distribution function of heights for males and
for females

Boxplots
In many situations, we want to compare the values of a variate for two or more groups.
For example, in Example 1.3.2 Continued we compared the heights for males versus females
by plotting side-by-side empirical distribution functions. When the number of groups is
large or the sample sizes within groups are small, side-by-side boxplots (also called box and
whisker plots) are a convenient way to display the data.
A boxplot gives a graphical summary about the shape of the distribution and is usually
displayed vertically. The line inside the box corresponds to the median q(0.5). The top
edge of the box corresponds to the upper quartile q(0.75) and the lower edge of the box
corresponds to the lower quartile q(0.25). The so-called whiskers extend down and up
from the box to a horizontal line. The lower line is placed at the smallest observed data
value that is larger than the value q(0.25) - 1.5 × IQR where IQR = q(0.75) - q(0.25) is
the interquartile range. The upper line is placed at the largest observed data value that
is smaller than the value q(0.75) + 1.5 × IQR. Values beyond the whiskers (often called
outliers) are plotted with special symbols.
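The whisker rule can be sketched in Python. Here the quartiles are taken as given (computed, say, via Definition 1); the function name and its interface are ours, not a plotting-library API:

```python
def boxplot_fences(y, q25, q75):
    """Whisker ends and outliers for the 1.5 * IQR rule described above."""
    iqr = q75 - q25
    lo_fence = q25 - 1.5 * iqr
    hi_fence = q75 + 1.5 * iqr
    inside = [v for v in y if lo_fence <= v <= hi_fence]
    outliers = [v for v in y if v < lo_fence or v > hi_fence]
    return min(inside), max(inside), outliers

# Hypothetical data with one extreme value: whiskers run from 1 to 4,
# and 100 is flagged as an outlier.
w_lo, w_hi, outs = boxplot_fences([1, 2, 3, 4, 100], 2, 4)
```

Note that the whiskers end at observed data values inside the fences, not at the fences themselves, which matches the description above.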

Figure 1.13: Boxplots of weights for males and females

Figure 1.13 displays side-by-side boxplots of male and female weights from Example
1.3.2. As mentioned previously, when large data sets are reported in the media or research
journals the individual observations are not reported. What information can we obtain
from these two boxplots?
The shape and spread of the two distributions are very similar. For the males and the
females, the center line in the box, which corresponds to the median, divides both the box
and the whiskers approximately in half which indicates that both distributions are roughly
symmetric about the median. For the females there are two large outliers.
From the boxplots we can determine that the median weight for females is approximately
70 and for males the median weight is approximately 81. For females q(0.25) = 62,
q(0.75) = 79, IQR = 79 - 62 = 17, and range = 111 - 40 = 71. For males q(0.25) = 73,
q(0.75) = 91, IQR = 91 - 73 = 18, and range = 117 - 52 = 65. The IQR and range for
females are very similar to the IQR and range for males.
Since the boxplot for the males is shifted up relative to the boxplot for females this
implies that males generally weigh more than females.

Boxplots are particularly useful for comparing more than two groups.

Figure 1.14: Boxplots of internet use by different continents

Figure 1.14 shows a comparison of internet users (per 100 people) in 2015 for countries
in the world classified by continent (worldbank.org). The side-by-side boxplots make it
easy to see the differences and similarities between the countries in different continents. In
this example a unit is a country. The variate of interest measured for each country is the
number of internet users per 100 people. What type of variate is this? Why is the total
number of internet users not used?
For which continent is the median number of internet users per 100 people the smallest?
For which continent is the median number of internet users per 100 people the largest?
For which continent is the IQR the smallest? For which continent is the IQR the
largest? For which continent is the range the smallest? For which continent is the range
the largest? For which continent is the variability the smallest? For which continent is the
variability the largest?
For which continent is the distribution most symmetric? For which continent is the
distribution most asymmetric?

The graphical summaries discussed to this point deal with a single variate. If we have
data on two variates x and y for each unit in the sample then the data set is represented as
\{(x_i, y_i), i = 1, 2, \ldots, n\}. We are often interested in examining the relationships between
the two variates.

Scatterplots
A scatterplot, which is a plot of the points (x_i, y_i), i = 1, 2, \ldots, n, can be used to see
whether two variates are related in some way.

Figure 1.15: Scatterplot of weight versus height for males (r = 0.55)

Figure 1.16: Scatterplot of weight versus height for females (r = 0.31)

Figures 1.15 and 1.16 give the scatterplots of y = weight versus x = height for males
and females respectively for the data in Example 1.3.2. As expected, there is a tendency
for weight to increase as height increases for both sexes. What might be surprising is the
variability in weights for a given height.

Run charts
A run chart is another type of two dimensional plot which is used when we are interested
in a graphical summary which illustrates how a single variate is changing over time.
In Figure 1.17 the run chart shows the closing value of the Canadian dollar in Chinese
yuan for the 67 business days between May 1 and August 1, 2017. For example on
August 1, 2017 the Canadian dollar was worth 5.3543 Chinese yuan. The data are from
google.com/finance. In a run chart consecutive points are joined with straight lines.

Figure 1.17: Value of Canadian dollar in Chinese yuan May-July 2017

In Figure 1.18 the market share for the browsers Chrome, Safari and Internet Explorer
is graphed versus the months between June 2016 and July 2017 (gs.stat.counter.com).

Figure 1.18: Market share for browsers June 2016 to July 2017

Note that for these data sets the sample correlation coefficient is not meaningful. Why?

1.4 Probability Distributions and Statistical Models


In your previous probability course you were introduced to the following statistical models
and their physical setups: Binomial, Poisson, Uniform, Exponential, and Gaussian (Normal).
In this course we use statistical models to describe processes such as the daily closing
value of a stock or the occurrence and size of claims over time in a portfolio of insurance
policies. For populations, we use a statistical model to describe the selection of the units
and the measurement of the variates. The model depends on the distribution of variate
values in the population (the population histogram is a graphical summary of this distribution)
and the selection procedure. We exploit this connection when we want to estimate
attributes of the population and quantify the uncertainty in our conclusions. We use the
models in several ways:

- questions are often formulated in terms of parameters of the model
- the variate values vary so random variables can describe this variation
- empirical studies usually lead to inferences that involve some degree of uncertainty, and probability is used to quantify this uncertainty
- procedures for making decisions are often formulated in terms of models
- models allow us to characterize processes and to simulate them via computer experiments

Data summaries and properties of probability models


If we model the selection of a data set {y1, y2, ..., yn} as n independent realizations of a random variable Y, we can draw strong parallels between summaries of the data set described in Section 1.3 and properties of the corresponding probability model for Y. For example,

- The sample mean ȳ corresponds to the population mean E(Y) = μ.

- The sample median m̂ corresponds to the population median m. For continuous distributions the population median is the solution m of the equation F(m) = 0.5, where F(y) = P(Y ≤ y) is the cumulative distribution function of Y. For discrete distributions, it is a point m chosen such that P(Y ≤ m) ≥ 0.5 and P(Y ≥ m) ≥ 0.5.

- The sample standard deviation s corresponds to σ, the population standard deviation of Y, where σ² = E[(Y − μ)²].

- The relative frequency histogram corresponds to the probability histogram of Y for discrete distributions and the probability density function of Y for continuous distributions.
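These parallels can be checked by simulation. The following sketch (not from the notes; the G(5, 2) model and the sample size are arbitrary choices for illustration) generates a large sample and compares the sample summaries to the corresponding model quantities:

```r
# Simulate n = 10000 independent realizations of Y ~ G(5, 2)
set.seed(1)
y <- rnorm(10000, mean = 5, sd = 2)
mean(y)    # sample mean, close to E(Y) = 5
median(y)  # sample median, close to the population median (also 5 by symmetry)
sd(y)      # sample standard deviation, close to sigma = 2
```

As the sample size grows, each summary settles down near its model counterpart; this connection is made precise in later chapters.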

Example 1.4.1 A Binomial distribution example


Consider again the survey of smoking habits of teenagers described in Example 1.2.1. To select a sample of 500 units (young adults aged 14-20), suppose we had a list of most of the units in the population of interest (young adults aged 14-20 living in Ontario at the time of the study). Getting such a list would be expensive and time consuming, so the actual selection procedure is likely to be very different. We select a sample of 500 units from the list at random and count the number of smokers in the sample. We model this selection process using a Binomial random variable Y with probability function (p.f.)

    f(y; θ) = P(Y = y; θ) = (500 choose y) θ^y (1 − θ)^(500−y)   for y = 0, 1, ..., 500 and 0 ≤ θ ≤ 1

(Note that the sampling would be done without replacement, so we are assuming that the number sampled is small relative to the total number in the population.) The parameter θ in the probability function represents the unknown proportion of smokers in the population of young adults aged 14-20 living in Ontario at the time of the study, which is one attribute of interest in the study.
Note that we use the notation P(Y = y; θ) and f(y; θ) to emphasize the importance of the parameter θ in the model.
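Probabilities under this model are easy to evaluate in R with dbinom and pbinom. For instance, if the smoking proportion were θ = 0.3 (a value chosen purely for illustration, not an estimate from the study), then:

```r
# P(Y = 150) for Y ~ Binomial(500, 0.3), i.e. f(150; 0.3)
dbinom(150, size = 500, prob = 0.3)
# P(Y <= 150), the corresponding cumulative probability
pbinom(150, size = 500, prob = 0.3)
```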
Example 1.4.2 An Exponential distribution example
In Example 1.3.4, we examined the lifetime (in 1000 km) of a sample of 200 front brake pads taken from the population of all cars of a particular model produced in a given time period. We can model the lifetime of a single brake pad by a continuous random variable Y with Exponential probability density function (p.d.f.)

    f(y; θ) = (1/θ) e^(−y/θ)   for y > 0

The parameter θ > 0 represents the mean lifetime of the brake pads in the population since, in the model, the expected value of Y is E(Y) = θ.
To model the sampling procedure, we assume that the data {y1, y2, ..., y200} represent 200 independent realizations of the random variable Y. That is, we let Yi = the lifetime for the ith brake pad in the sample, i = 1, 2, ..., 200, and we assume that Y1, Y2, ..., Y200 are independent Exponential random variables each having the same mean θ.
We can use the model and the data to estimate θ and other attributes of interest such as the proportion of brake pads that fail in the first 100,000 km of use. In terms of the model, we can represent this proportion by

    P(Y ≤ 100; θ) = ∫₀¹⁰⁰ f(y; θ) dy = 1 − e^(−100/θ)
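For any given value of θ this proportion is simple to evaluate in R. Taking θ = 50 (an illustrative value only, not an estimate from the brake pad data), the probability can be computed from the formula or with pexp, which parameterizes the Exponential distribution by the rate 1/θ:

```r
theta <- 50                # illustrative mean lifetime (in 1000 km)
1 - exp(-100/theta)        # P(Y <= 100; theta) from the formula above
pexp(100, rate = 1/theta)  # the same probability via R's Exponential c.d.f.
```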

Example 1.4.3 A Gaussian distribution example


Earlier, we described an experiment where the goal was to see if there is a relationship between operating performance y of a computer chip and ambient temperature x. In the experiment, there were four groups of 10 chips and each group operated at a different temperature x = 10, 20, 30, 40. The data are {(x1, y1), (x2, y2), ..., (x40, y40)}. A model for Y1, Y2, ..., Y40 should depend on the temperatures xi and one possibility is to assume Yi ~ G(β0 + β1 xi, σ), i = 1, 2, ..., 40 independently. In this model, the mean of Yi is a linear function of the temperature xi. The parameter σ allows for variability in performance among chips operating at the same temperature. We will consider such models in detail in Chapter 6.
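One way to get a feel for this model is to simulate a data set from it. The sketch below uses made-up parameter values (β0 = 100, β1 = −0.5, σ = 2; these are not estimates from the chip experiment) with the same design of four temperature groups:

```r
set.seed(123)
x <- rep(c(10, 20, 30, 40), each = 10)   # four groups of 10 chips
beta0 <- 100; beta1 <- -0.5; sigma <- 2  # illustrative parameter values
# Yi ~ G(beta0 + beta1*xi, sigma), independently
y <- rnorm(40, mean = beta0 + beta1 * x, sd = sigma)
tapply(y, x, mean)  # group sample means, roughly beta0 + beta1*x
plot(x, y, xlab = "Temperature", ylab = "Performance")
```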

Response versus explanatory variates


Suppose we wanted to study the relationship between second hand smoke and asthma among children aged 10 and under. The two variates of interest could be defined as:
x = whether the child lives in a household where adults smoke,
Y = whether the child suffers from asthma.
In this study there is a natural division of the variates into two classes: response variate and explanatory variate. In this example Y, the asthma status, is the response variate (often coded as Y = 1 if the child suffers from asthma, Y = 0 otherwise) and x, whether the child lives in a household where adults smoke, is the explanatory variate (also often coded as x = 1 if the child lives in a household where adults smoke and x = 0 otherwise). The explanatory variate x is in the study to investigate whether the distribution of the response variate Y is different for different observed values of x.
Similarly, in an observational study of 1718 men aged 40-55, the men were classified according to whether they were heavy coffee drinkers (more than 100 cups/month) or not (less than 100 cups/month) and whether or not they suffered from CHD (coronary heart disease). In this study there are also two categorical variates. One variate is the amount of coffee consumption while the other is whether or not the subject had experienced CHD. The question of interest is whether there is a relationship between coffee consumption and CHD. Unlike Example 1.4.3, neither variate is under the control of the researchers. We might be interested in whether coffee consumption can be used to "explain" CHD. In this case we would call coffee consumption an explanatory variate while CHD would be the response variate. However, if we were interested in whether CHD can be used to explain coffee consumption (a somewhat unlikely proposition to be sure) then CHD would be the explanatory variate and coffee habits would be the response variate.
In some cases it is not clear which is the explanatory variate and which is the response variate. For example, the response variate Y might be the weight (in kg) of a randomly selected female in the age range 16-25, in some population. A person's weight is related to their height. We might want to study this relationship by considering females with a given height x (in meters), and proposing that the distribution of Y, given x, is Gaussian, G(α + βx, σ). That is, we propose that the average (expected) weight of a female depends linearly on her height x and we write this as E(Y|x) = α + βx. It would be possible to reverse the roles of the two variates and consider weight to be the explanatory variate and height to be the response variate, if for example we wished to predict height using data on individuals' weights.

Models for describing the relationships among two or more variates are considered in
more detail in Chapters 6 and 7.

1.5 Data Analysis and Statistical Inference


Whether we are collecting data to increase our knowledge or to serve as a basis for making
decisions, proper analysis of the data is crucial. We distinguish between two broad aspects
of the analysis and interpretation of data. The first is what we refer to as descriptive statistics. This is the portrayal of the data, or parts of it, in numerical and graphical ways so as to show features of interest. (On a historical note, the word "statistics" in its original usage referred to numbers generated from data; today the word is used both in this sense and to denote the discipline of Statistics.) We have considered a few methods of descriptive statistics in Section 1.3. The terms data mining and knowledge discovery in databases (KDD) refer to exploratory data analysis where the emphasis is on descriptive statistics. This is often carried out on very large databases. The goal, often vaguely specified, is to find interesting patterns and relationships.
A second aspect of a statistical analysis of data is what we refer to as statistical inference.
That is, we use the data obtained in the study of a process or population to draw general
conclusions about the process or population itself. This is a form of inductive inference, in
which we reason from the specific (the observed data on a sample of units) to the general (the target population or process). This may be contrasted with deductive inference (as in logic and mathematics) in which we use general results (e.g. axioms) to prove specific things (e.g. theorems).
This course introduces some basic methods of statistical inference. Three main types
of statistical methods will be discussed, loosely referred to as estimation, hypothesis tests,
and prediction. Methods of estimation are used when we are interested in estimating one or more attributes of a process or population based on observed data. For example, we may wish to estimate the proportion of Ontario residents aged 14-20 who smoke, or to estimate the distribution of survival times for certain types of AIDS patients. Another type of estimation problem is that of "fitting" or selecting a probability model for a process. Methods of estimation are discussed in all chapters of these Course Notes.
Hypothesis tests involve using the data to assess the truth of some question or hypothesis about the population or process. For example, we may hypothesize that in the 14-20 age group a higher proportion of females than males smoke, or that the use of a new treatment will increase the average survival time of AIDS patients by at least 50 percent. Tests of hypotheses will be discussed in more detail in Chapter 5.

Prediction methods are used when we use the observed data to predict a future value for a variate of a unit to be selected from the process or population. For example, based on the results of a clinical trial such as Example 1.2.3, we may wish to predict how much an individual's blood pressure would drop for a given dosage of a new drug, or, given the past performance of a stock and other data, to predict the value of the stock at some point in the future. Examples of prediction methods are given in Sections 4.7 and 6.2.
Statistical analysis involves the use of both descriptive statistics and formal methods of estimation, prediction and hypothesis testing. As brief illustrations, we return to the first two examples of Section 1.2.

Example 1.5.1 Smoking behaviour survey


Suppose in Example 1.2.1, we sampled 250 males and 250 females aged 14-20 as described in Example 1.4.1. Here we focus only on the sex of each person in the sample, and whether or not they smoked. The data are summarized in the following two-way table:

            Smokers   Non-smokers   Total
  Female       82         168        250
  Male         71         179        250
  Total       153         347        500

Suppose we are interested in the question "Is the smoking rate among females higher than the rate among males?" From the data, we see that the sample proportion of females who smoke is 82/250 = 0.328 or 32.8% and the sample proportion of males who smoke is 71/250 = 0.284 or 28.4%. In the sample, the smoking rate for females is higher. But what can we say about the whole population? To proceed, we formulate the hypothesis that there is no difference in the population rates. Then, assuming the hypothesis is true, we construct two Binomial models as in Example 1.4.1, each with a common parameter θ. We can estimate θ using the combined data so that θ̂ = 153/500 = 0.306 or 30.6%. Then, using the model and the estimate, we can calculate the probability of such a large difference in the observed rates. Such a large difference occurs about 20% of the time (if we selected samples over and over and the hypothesis of no difference is true), so a difference this large in the observed rates happens fairly often and therefore, based on the observed data, there is no evidence of a difference in the population smoking rates. In Chapter 7 we discuss a formal method for testing the hypothesis of no difference in rates between females and males.
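A calculation in this spirit can be mimicked by simulation (this is only an informal sketch, not the formal test of Chapter 7): repeatedly generate two samples of 250 under the no-difference hypothesis with θ = 0.306, and record how often the simulated proportions differ by at least the observed 0.328 − 0.284 = 0.044.

```r
set.seed(1)
theta.hat <- 153/500         # pooled estimate under the no-difference hypothesis
obs.diff <- 82/250 - 71/250  # observed difference in sample proportions (0.044)
nsim <- 10000
f <- rbinom(nsim, 250, theta.hat) / 250  # simulated female smoking proportions
m <- rbinom(nsim, 250, theta.hat) / 250  # simulated male smoking proportions
mean(abs(f - m) >= obs.diff) # fraction of simulations with a difference this large
```

A fraction that is not small (nowhere near, say, 0.05) indicates that a difference like the one observed arises routinely when the rates are in fact equal.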

Example 1.5.2 Can filler study


Recall Example 1.2.2 where the purpose of the study was to compare the performance of the two machines in the future. A study was conducted in which one can from the new machine and one can from the old machine were selected each hour over a period of 40 hours. The volume in milliliters of each selected can was measured. Volume is a continuous variate. The data are available in the file canfillingdata.txt posted on the course website.

Figure 1.19: Run chart of the volume (ml) for the new machine over time

First we examine if the behaviour of the two machines is stable over time. In Figures 1.19 and 1.20, a run chart of the volumes over time for each machine is given. There is no indication of a systematic pattern for either machine, so we have some confidence that the data can be used to predict the performance of the machines in the near future.

Figure 1.20: Run chart of the volume for old machine over time

The sample mean and standard deviation for the new machine are 356.8 and 0.54 ml respectively and, for the old machine, are 357.5 and 0.80 ml. Figures 1.21 and 1.22 show the relative frequency histograms of the volumes for the new machine and the old machine respectively. To see how well a Gaussian model might fit these data we superimpose Gaussian probability density functions, with the mean equal to the sample mean and the standard deviation equal to the sample standard deviation, on each histogram. The agreement is reasonable given that the sample size for both data sets is only 40.

[Histogram of volumes with superimposed G(356.76, 0.54) probability density function; sample skewness = 0.22, sample kurtosis = 2.38]
Figure 1.21: Relative frequency histogram of volumes (ml) for the new machine

[Histogram of volumes with superimposed G(357.5, 0.80) probability density function; sample skewness = 0.54, sample kurtosis = 2.84]
Figure 1.22: Relative frequency histogram of volumes (ml) for the old machine

We can use the Gaussian model to estimate the long term proportion of cans that fall below the required volume of 355 ml. For the new machine, Y ~ G(356.8, 0.54) and P(Y ≤ 355) = 0.0005, so about 5 in 10,000 cans will be under-filled. For the old machine, Y ~ G(357.5, 0.80) and P(Y ≤ 355) = 0.0008, so about 8 in 10,000 cans will be under-filled. Of course, these estimates are subject to a high degree of uncertainty because they are based on small sample sizes.
We can see that the new machine is superior because of its smaller sample mean, which translates into less overfill and hence less cost to the manufacturer. It is possible to adjust the mean of the new machine to a lower value because of its smaller standard deviation.
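The under-fill probabilities quoted above come from the Gaussian c.d.f. and are a one-line computation in R (using the unrounded sample mean 356.76 for the new machine; small discrepancies from the values in the text are due to rounding of the sample summaries):

```r
pnorm(355, mean = 356.76, sd = 0.54)  # new machine: P(Y <= 355)
pnorm(355, mean = 357.5, sd = 0.80)   # old machine: P(Y <= 355)
```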

1.6 Statistical Software and R


Statistical software is essential for data manipulation and analysis. It is also used to deal with numerical calculations, to produce graphics, and to simulate probability models. There are many statistical software systems; some of the most comprehensive and popular are SAS, S-Plus, SPSS, Stata, Systat, Minitab and R. Spreadsheet software such as EXCEL is also useful.
We will use the R software system since it has the lowest cost (free!) and the greatest functionality. It is an open source package that has extensive statistical capabilities and very good graphics procedures. R is the most common programming language among data scientists according to the 2016 O'Reilly Data Science Salary Survey (www.oreilly.com/ideas).
Information about how to use R is available in the document Introduction to R and RStudio which is posted on the course website.
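As a small taste of R, the numerical summaries used throughout this chapter each take a single function call (the data vector below is made up for illustration):

```r
y <- c(4.1, 5.0, 3.7, 6.2, 5.5, 4.8)  # a made-up data set
mean(y); median(y); sd(y)             # location and variability summaries
quantile(y, c(0.25, 0.75))            # sample quartiles
```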

1.7 Chapter 1 Problems


1. The sample mean ȳ and the sample median m̂ are two ways to measure the location of a data set {y1, y2, ..., yn}.

(a) Prove the identity

    ∑_{i=1}^{n} (yi − ȳ) = 0

(b) Suppose the data are transformed using ui = a + byi, i = 1, 2, ..., n where a and b are constants with b ≠ 0. How are the sample mean and sample median of the data set {u1, u2, ..., un} related to ȳ and m̂?
(c) Suppose the data are transformed using vi = yi², i = 1, 2, ..., n. How are the sample mean and sample median of {v1, v2, ..., vn} related to ȳ and m̂?
(d) Suppose another observation y0 is added to the data set. Determine the mean of the augmented data set in terms of ȳ and y0. What happens to the sample mean as the magnitude of y0 increases?
(e) Suppose another observation y0 is added to the data set. Determine the median of the augmented data set. What happens to the sample median as the magnitude of y0 increases?
(f) Use (d) and (e) to explain why the sample median income of a country might be a more appropriate summary than the sample mean income.
(g) Show that V(α) = ∑_{i=1}^{n} (yi − α)² is minimized when α = ȳ.
(h) Challenge Problem: Show that W(α) = ∑_{i=1}^{n} |yi − α| is minimized when α = m̂.

2. The sample standard deviation s, the interquartile range IQR = q(0.75) − q(0.25), and the range = y(n) − y(1) are three different measures of the variability of a data set {y1, y2, ..., yn}.

(a) Prove the identity

    Syy = ∑_{i=1}^{n} (yi − ȳ)² = ∑_{i=1}^{n} yi(yi − ȳ) = ∑_{i=1}^{n} yi² − (1/n)(∑_{i=1}^{n} yi)² = ∑_{i=1}^{n} yi² − n(ȳ)²

(b) Suppose the data are transformed using ui = a + byi, i = 1, 2, ..., n where a and b are constants and b ≠ 0. How are the sample standard deviation, IQR, and range of the transformed data set {u1, u2, ..., un} related to the sample standard deviation, IQR, and range of the original data set {y1, y2, ..., yn}?
(c) Suppose another observation y0 is added to the data set. Use the result in (a) to write the sample standard deviation of the augmented data set in terms of y0, s, and ȳ. What happens to the sample variance of the augmented data set as the magnitude of y0 increases?

(d) If another observation y0 is added to the data set, what happens to the IQR of
the augmented data set as the magnitude of y0 increases?
(e) If another observation y0 is added to the data set, what happens to the range of
the augmented data set as the magnitude of y0 increases?

3. The sample skewness and kurtosis are two different measures of the shape of a data set {y1, y2, ..., yn}. Let g1 be the sample skewness and let g2 be the sample kurtosis of the data set. Suppose we transform the data so that ui = a + byi, i = 1, 2, ..., n where a and b are constants and b ≠ 0. How are the sample skewness and sample kurtosis of the data set {u1, u2, ..., un} related to g1 and g2?

4. Suppose the data {c1, c2, ..., c24} represent the costs of production for a firm every month from January 2018 to December 2019. For this data set the sample mean was $2500, the sample standard deviation was $5500, the sample median was $2600, the sample skewness was 1.2, the sample kurtosis was 3.9, and the range was $7500. The relationship between cost and revenue is given by ri = 7ci + 1000, i = 1, 2, ..., 24. Find the sample mean, standard deviation, median, skewness, kurtosis and range of the revenues.

5. Mass production of complicated assemblies such as automobiles depends on the ability to manufacture components to very tight specifications. The component manufacturer tracks performance by measuring a sample of parts and comparing the measurements to the specification. Suppose the specification for the diameter of a piston is a nominal value ± 10 microns (10⁻⁶ meters). The data below are the diameters of 50 pistons collected from the more than 10,000 pistons produced in one day. (The measurements are the diameters minus the nominal value in microns.) The data are available in the file diameterdata.txt posted on the course website.

−12.8  −7.3  −3.9  −3.4  −2.9  −2.7  −2.5  −2.3  −1.0  −0.9
 −0.8  −0.7  −0.6  −0.4  −0.4  −0.2   0.0   0.5   0.6   0.7
  1.2   1.8   1.8   2.0   2.1   2.5   2.6   2.6   2.7   2.8
  3.3   3.4   3.5   3.8   4.3   4.6   4.7   5.1   5.4   5.7
  5.8   6.6   6.6   7.0   7.2   7.9   8.5   8.6   8.7   8.9

∑_{i=1}^{50} yi = 100.7    ∑_{i=1}^{50} yi² = 1110.79

(a) Plot a relative frequency histogram of the data. Is the process producing pistons within the specifications?
(b) Calculate the sample mean ȳ and the sample median of the diameters.
(c) Calculate the sample standard deviation s and the IQR.
(d) Give the five number summary for these data.

(e) Such data are often summarized using a single performance index called Ppk, defined as

    Ppk = min( (U − ȳ)/(3s), (ȳ − L)/(3s) )

where (L, U) = (−10, 10) are the lower and upper specification limits. Calculate Ppk for these data.
(f) Explain why larger values of Ppk (i.e. greater than 1) are desirable.
(g) Suppose we fit a Gaussian model to the data with mean and standard deviation equal to the corresponding sample quantities, that is, with μ = ȳ and σ = s. Use the fitted model to estimate the proportion of diameters (in the process) that are out of specification.

6. In the above problem, we saw how to estimate the performance measure Ppk based on a sample of 50 pistons, a very small proportion of one day's production. To get an idea of how reliable this estimate is, we can model the process output by a Gaussian random variable Y with mean and standard deviation equal to the corresponding sample quantities. The following R code generates 50 observations and calculates Ppk. This is done 1000 times using a loop statement.
#Import dataset diameterdata.txt from the course website using RStudio
avgx<-mean(diameterdata$diameter) #sample mean
sdx<-sd(diameterdata$diameter) #sample standard deviation
temp<-rep(0,1000) #Store the 1000 generated Ppk values in vector temp
for (i in 1:1000) { #Begin loop
y<-rnorm(50, avgx, sdx) #Generate 50 new observations from a
# G(avgx,sdx) distribution
avg<-mean(y) #sample mean of new data
s<-sd(y) #sample std of new data
ppk<-min((10-avg)/(3*s),(avg+10)/(3*s)) #Ppk for new data
temp[i]<-ppk #Store value of Ppk for this iteration
}
hist(temp) #Plot histogram of 1000 Ppk values
mean(temp) #average of the 1000 Ppk values
sd(temp) #standard deviation of the 1000 Ppk values

(a) Compare the Ppk from the original data with the average Ppk value from the 1000 iterations. Mark the original Ppk value on the histogram of generated Ppk values. What do you notice? What would you conclude about how good the original estimate of Ppk was?
(b) Repeat the above exercise but this time use a sample of 300 pistons rather than 50 pistons. What conclusion would you make about using a sample of 300 versus 50 pistons?

7. Graph the empirical cumulative distribution function and boxplot for the data

    7.6  4.3  5.2  4.5  1.1  8.5  14.0  6.3  3.9  7.2

without using statistical software.

8. Run the following code on the can filling data and compare with the summaries given in Example 1.5.2.
#Import dataset canfillingdata.txt from the course website using RStudio
v1<-canfillingdata$volume[canfillingdata$machine==1] # New Machine
v2<-canfillingdata$volume[canfillingdata$machine==2] # Old Machine
skewness<-function(x) {(sum((x-mean(x))^3)/length(x))/
(sum((x-mean(x))^2)/length(x))^(3/2)}
kurtosis<- function(x) {(sum((x-mean(x))^4)/length(x))/
(sum((x-mean(x))^2)/length(x))^2}
# Numerical summaries by machine
c(mean(v1),sd(v1),skewness(v1),kurtosis(v1))
fivenum(v1) # Gives the 5 number summary
# R defines the 1st and 3rd quartiles slightly differently than Def'n 1
c(mean(v2),sd(v2),skewness(v2),kurtosis(v2))
fivenum(v2)
# Plot run charts by machine, one above the other,
# type="l" joins the points on the plots
par(mfrow=c(2,1)) # Creates 2 plotting areas, one above the other
plot(1:40,v1,xlab="Hour",ylab="Volume",main="New Machine",
ylim=c(355,360),type="l")
plot(1:40,v2,xlab="Hour",ylab="Volume",main="Old Machine",
ylim=c(355,360),type="l")
# Plot side by side relative frequency histograms with same intervals
par(mfrow=c(1,2)) # Creates 2 plotting areas side by side
# Plot relative frequency histogram for New Machine
library(MASS) # truehist is in MASS library
truehist(v1,h=0.5,xlim=c(355,361),xlab="Volume",ylab="Density",main="New Machine")
# Superimpose Gaussian pdf onto histogram
curve(dnorm(x,mean(v1),sd(v1)),add=TRUE,from=355,to=359,lwd=2)
# Plot relative frequency histogram for Old Machine
truehist(v2,h=0.5,xlim=c(355,361),xlab="Volume",ylab="Density",main="Old Machine")
# Superimpose Gaussian pdf onto histogram
curve(dnorm(x,mean(v2),sd(v2)),add=TRUE,from=355,to=361,lwd=2)
par(mfrow=c(1,1)) # Change back to one plotting area

# Plot side by side boxplots


boxplot(v1,v2,names=c("New Machine","Old Machine"))
# Plot empirical cdf's on same graph
plot(ecdf(v1),verticals=TRUE,do.points=FALSE,col="red",xlab="Volume",
ylab="e.c.d.f.",main="Empirical c.d.f.'s")
legend(356,0.8,c("New Machine (Red)","Old Machine (Blue)"))
plot(ecdf(v2),verticals=TRUE,do.points=FALSE,add=TRUE,col="blue")

9. The data below show the lengths in centimeters of 43 male coyotes and 40 female coyotes captured in Nova Scotia. (Based on Table 2.3.2 in Wild and Seber 1999.) The data are available in the file coyotedata.txt posted on the course website.

Females (x)
71.0  73.7  80.0  81.3  83.5  84.0  84.0  84.5  85.0  85.0  86.0  86.4
86.5  86.5  88.0  87.0  88.0  88.0  88.5  89.5  90.0  90.0  90.2  91.0
91.4  91.5  91.7  92.0  93.0  93.0  93.5  93.5  93.5  96.0  97.0  97.0
97.8  98.0  101.6  102.5

∑_{i=1}^{40} xi = 3569.6    ∑_{i=1}^{40} xi² = 320223.38

Males (y)
78.0  80.0  80.0  81.3  83.8  84.5  85.0  86.0  86.4  86.5  87.0  88.0
88.0  88.9  88.9  90.0  90.5  91.0  91.0  91.0  91.4  92.0  92.5  93.0
93.5  95.0  95.0  95.0  94.0  95.5  96.0  96.0  96.0  96.0  97.0  98.5
100.0  100.5  101.0  101.6  103.0  104.1  105.0

∑_{i=1}^{43} yi = 3958.4    ∑_{i=1}^{43} yi² = 366276.84

(a) Plot relative frequency histograms of the lengths for females and males separately. Be sure to use the same intervals.
(b) Determine the five number summary for each data set.
(c) Plot side by side boxplots for the females and males. What do you notice?
(d) Compute the sample mean and sample standard deviation for the lengths of the female and male coyotes separately. Assuming μ = sample mean and σ = sample standard deviation, overlay the corresponding Gaussian probability density function on the histograms for the females and males separately. Comment on how well the Gaussian model fits each data set.
(e) Plot the empirical distribution function of the lengths for females and males separately on the same graph. What do you notice?

10. Prove the identities

    Sxx = ∑_{i=1}^{n} (xi − x̄)² = ∑_{i=1}^{n} xi(xi − x̄) = ∑_{i=1}^{n} xi² − (1/n)(∑_{i=1}^{n} xi)²

    Sxy = ∑_{i=1}^{n} (xi − x̄)(yi − ȳ) = ∑_{i=1}^{n} xi(yi − ȳ) = ∑_{i=1}^{n} (xi − x̄)yi

11. Does the value of an actor influence the amount grossed by a movie? The "value of an actor" will be measured by the average amount the actor's movies have made. The "amount grossed by a movie" is measured by taking the highest grossing movie in which that actor played a major part. For example, Tom Hanks, whose value is 103.2, had his best results with Toy Story 3 (gross 415.0). All numbers are corrected to 2012 dollar amounts and have units "millions of U.S. dollars". Twenty actors were selected by taking the first twenty alphabetically listed by name on the website (http://boxofficemojo.com/people/). For each of the twenty actors, the value of the actor (x) and their highest grossing movie (y) were determined. The data are given below as well as in the file actordata.txt posted on the course website.
Actor       1      2      3      4      5      6      7      8      9     10
Value (x)  67    49.6   37.7   47.3   47.3   32.9   36.5   92.8   17.6   14.4
Gross (y) 177.2  201.6  183.4   55.1  154.7  182.8  277.5  415     90.8   83.9

Actor      11     12     13     14     15     16     17     18     19     20
Value (x)  51.1   54     30.5   42.1   23.6   62.4   32.9   26.9   43.7   50.3
Gross (y) 158.7  242.8   37.1  220    146.3  168.4  173.8   58.4  199    533

∑_{i=1}^{20} xi = 860.6    ∑_{i=1}^{20} xi² = 43315.04    ∑_{i=1}^{20} xi yi = 184540.93
∑_{i=1}^{20} yi = 3759.5    ∑_{i=1}^{20} yi² = 971560.19

(a) What are the two variates in this data set? Choose one variate to be an explanatory variate and the other to be a response variate. Justify your choice.
(b) Plot a scatterplot of the data.
(c) Calculate the sample correlation for the data (xi, yi), i = 1, 2, ..., 20. Is there a strong positive or negative relationship between the two variates?
(d) Is it reasonable to conclude that the explanatory variate in this problem causes the response variate? Explain.
(e) Here is R code to plot the scatterplot (in blue) and calculate the sample correlation:
#Import dataset actordata.txt from the course website using RStudio
attach(actordata)

cor(Value,Gross) # Calculates sample correlation


plot(Value,Gross,main ="Actor Data",col="blue") # scatterplot
# round correlation to 4 decimal places and convert to character
crt<-as.character(round(cor(Value,Gross),4))
txt<-paste("Sample Correlation = ",crt) # create text
text(30,500,txt) # add text to plot

12. In this course we mainly focus on methods for analyzing univariate and bivariate data sets. In the real world, multivariate data sets are much more common. Learning how to analyze univariate and bivariate data sets gives us the basic tools for analyzing multivariate data sets. This problem looks at simple numerical and graphical summaries for a multivariate data set.

Computers and smartphones are just two of the many devices that use integrated circuits. A silicon wafer is a thin slice of semiconductor material, such as a silicon crystal, used in the fabrication of integrated circuits. The thickness of such wafers is very important since thinner wafers are less costly. However, the wafers cannot be too thin since then they can crack more easily.

To gain information about wafer thicknesses at a particular semiconductor fabrication plant, or fab, the thickness of a single wafer is measured at 9 locations as shown in Figure 1.23. A single wafer is removed from a tray of wafers, always at the same position, for each batch of wafers.

Figure 1.23: Locations at which wafer thickness is measured

The data for 182 consecutive batches are available in the file waferdata.txt posted on the course website. The data have been approximately centered and scaled. These data could be used to study the relationships between the thicknesses at different locations.

Run the following R code:


#Import dataset waferdata.txt in folder S231Datasets using RStudio
#Install the packages moments and car if necessary
install.packages("moments")
install.packages("car")
library(moments)
library(car)
attach(waferdata)
apply(waferdata,2,mean) # sample means for each location
apply(waferdata,2,sd) # sample standard deviations for each location
apply(waferdata,2,fivenum) # five number summaries for each location
apply(waferdata,2,skewness) # sample skewness for each location
apply(waferdata,2,kurtosis) # sample kurtosis for each location
cor(waferdata) # all sample correlations
scatterplotMatrix(waferdata[,1:5],smooth=F,regLine=F,
var.labels=colnames(waferdata[,1:5]),cex.labels=1,
diagonal=list(method="histogram"),plot.points=T)
scatterplotMatrix(waferdata[,c(1,6:9)],smooth=F,regLine=F,
var.labels=colnames(waferdata[,c(1,6:9)]),cex.labels=1,
diagonal=list(method="histogram"),plot.points=T)

(a) Comment on any similarities or differences in the numerical summaries for each of the 9 locations.
(b) What do you notice about the sample correlations between location 1 and locations 2, 3, 4, 5 as compared to the sample correlations between location 1 and locations 6, 7, 8, 9? Does what you observe make sense?
(c) Compare the variability in the points for the different scatterplots. What do you notice?

13. Two hundred volunteers participated in an experiment to examine the effectiveness of vitamin C in preventing colds. One hundred were selected at random to receive daily doses of vitamin C and the others received a placebo. (None of the volunteers knew which group they were in.) During the study period, twenty of those taking vitamin C and thirty of those receiving the placebo caught colds.

(a) Create a two-way table for these data.


(b) Calculate the relative risk of a cold in the vitamin C group as compared to the
placebo group.
(c) What do these data suggest? Can you conclude that vitamin C reduces the
chances of catching a cold?
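For part (b), the relative risk is just the ratio of the two observed risks. The following R sketch (one reasonable way to lay out the two-way table, not the official solution) computes it from the counts given above:

```r
# Counts from the problem: 20/100 colds with vitamin C, 30/100 with placebo
colds <- matrix(c(20, 80, 30, 70), nrow = 2, byrow = TRUE,
                dimnames = list(Group = c("VitaminC", "Placebo"),
                                Outcome = c("Cold", "NoCold")))
risk <- colds[, "Cold"] / rowSums(colds)          # risk of a cold in each group
relative.risk <- unname(risk["VitaminC"] / risk["Placebo"])
relative.risk                                     # (20/100) / (30/100) = 2/3
```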
1.7. CHAPTER 1 PROBLEMS 45

*Problems 14 to 18 are based on material covered in STAT 220/230/240.
This material will be used frequently in these notes. You may wish to
review the relevant material from STAT 220/230/240 before attempting
these problems.

14. In a very large population a proportion θ of people have blood type A. Suppose n
people are selected at random. Define the random variable Y = number of people
with blood type A in the sample of size n.

(a) What is the probability function for Y? What assumptions have you made?
(b) What are E(Y) and Var(Y)?
(c) Suppose n = 50. What is the probability of observing 20 people with blood type
A as a function of θ?
(d) If for n = 50 we observed y = 20 people with blood type A, what is a reasonable
estimate of θ based on this information? Estimate the probability that in a
sample of n = 10 there will be at least one person with blood type A.
(e) More generally, suppose in a given experiment the random variable of interest Y
has a Binomial(n, θ) distribution. If the experiment is conducted and y successes
are observed, what is a good estimate of θ based on this information?
(f) Let Y ~ Binomial(n, θ). Find E(Y/n) and Var(Y/n). What happens to Var(Y/n)
as n → ∞? What does this imply about how far Y/n is from θ for large n?
Approximate

    P( Y/n − 1.96 √(θ(1 − θ)/n) ≤ θ ≤ Y/n + 1.96 √(θ(1 − θ)/n) )

You may ignore the continuity correction.


(g) There are actually 4 blood types: A, B, AB, O. In a sample of size n let
Y1 = number with type A, Y2 = number with type B,
Y3 = number with type AB, and Y4 = number with type O.
Let θ1 = proportion of type A, θ2 = proportion of type B,
θ3 = proportion of type AB, and θ4 = proportion of type O in the population.

What is the joint probability function of Y1, Y2, Y3, Y4?

(h) Suppose in a sample of n people the observed data were y1, y2, y3, y4. What are
reasonable estimates of θ1, θ2, θ3, θ4 based on these data?
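Parts (c), (d) and (f) can be checked numerically in R. The sketch below assumes the sample-proportion estimate θ̂ = y/n (which Chapter 2 derives as the maximum likelihood estimate):

```r
n <- 50; y <- 20
theta.hat <- y / n                        # sample proportion: 0.4
p.observed <- dbinom(y, n, theta.hat)     # P(Y = 20; theta) evaluated at theta.hat
# Estimated probability of at least one type A person in a sample of 10
p.at.least.one <- 1 - dbinom(0, 10, theta.hat)
c(theta.hat, p.at.least.one)
```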

15. Visits to a particular website occur at random at the average rate of λ visits per
second. Suppose it is reasonable to use a Poisson process to model this process.
Define the random variable Y = number of visits to the website in one second.

(a) Give the probability function for Y, E(Y) and Var(Y).
(b) How well do you think the Poisson process assumptions might hold in this case?
(c) Suppose over a 10 second period the number of visits to the website in each
second were

    1 4 5 1 0 2 5 4 3 2

    (i) What is the probability of observing these data as a function of λ?
    (ii) What is a reasonable estimate of λ based on these data?
    (iii) Based on these data, estimate the probability that there is at least one visit
    to the website in a one second interval.
(d) Suppose Yi ~ Poisson(λ), i = 1, 2, ..., n independently.
    (i) Find E(Ȳ) and Var(Ȳ). What happens to Var(Ȳ) as n → ∞? What does
    this imply about how far Ȳ is from λ for large n?
    (ii) Approximate P( Ȳ − 1.96 √(λ/n) ≤ λ ≤ Ȳ + 1.96 √(λ/n) ). You may ignore
    the continuity correction.
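Parts (c)(ii) and (c)(iii) can be checked in R, again using the sample mean as the estimate of λ (shown in Chapter 2 to be the maximum likelihood estimate):

```r
visits <- c(1, 4, 5, 1, 0, 2, 5, 4, 3, 2)
lambda.hat <- mean(visits)                  # estimate of lambda: 2.7
# Estimated probability of at least one visit in a one second interval
p.at.least.one <- 1 - dpois(0, lambda.hat)  # equals 1 - exp(-lambda.hat)
c(lambda.hat, p.at.least.one)
```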

16. Suppose it is reasonable to model the IQ's of UWaterloo Math students using a
G(μ, σ) distribution. Define the random variable Y = IQ of a randomly chosen
UWaterloo Math student.

(a) Give the probability density function of Y, E(Y) and Var(Y).
(b) Suppose that the IQ's for a random sample of 16 students were:

    127 108 127 136 125 130 127 117 123 112 129 109 109 112 91 134

    with Σ_{i=1}^{16} yi = 1916 and Σ_{i=1}^{16} yi² = 231618.

    (i) What is a reasonable estimate of μ based on these data?
    (ii) What is a reasonable estimate of σ² based on these data?
    (iii) Based on these data, estimate the probability that a randomly chosen UWaterloo
    Math student has an IQ greater than 120.
(c) Suppose Yi ~ G(μ, σ), i = 1, 2, ..., n independently.
    (i) What is the distribution of

        Ȳ = (1/n) Σ_{i=1}^{n} Yi ?

    Find E(Ȳ) and Var(Ȳ). What happens to Var(Ȳ) as n → ∞? What does
    this imply about how far Ȳ is from μ for large n?
    (ii) Find P( Ȳ − 1.96 σ/√n ≤ μ ≤ Ȳ + 1.96 σ/√n ).
    (iii) Find the smallest value of n such that P( |Ȳ − μ| ≤ 1.0 ) ≥ 0.95 if σ = 12.
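The estimates in part (b) can be checked in R. Note that R's var() uses the divisor n − 1; whether that is the "reasonable estimate" intended depends on the estimator chosen in the solution:

```r
iq <- c(127, 108, 127, 136, 125, 130, 127, 117, 123, 112,
        129, 109, 109, 112, 91, 134)
mu.hat <- mean(iq)          # 1916 / 16 = 119.75
sigma2.hat <- var(iq)       # sample variance with divisor n - 1
# Estimated P(IQ > 120) under the fitted Gaussian model
p.gt.120 <- 1 - pnorm(120, mean = mu.hat, sd = sqrt(sigma2.hat))
c(mu.hat, sigma2.hat, p.gt.120)
```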

17. Suppose it is reasonable to model the battery life of a certain type of laptop using
the Exponential(θ) distribution. Define the random variable Y = battery life of a
randomly chosen laptop.

(a) Give the probability density function of Y, E(Y) and Var(Y).
(b) Suppose the lifetimes (in minutes) of a random sample of twenty laptop batteries
were:

    48.0 1047.2 802.3 165.6 76.7 64.2 158.6 338.3 200.6 362.8
    119.5 55.9 411.3 706.9 16.2 1277.6 49.4 22.6 1078.4 440.7

    with Σ_{i=1}^{20} yi = 7442.8.

    (i) What is a reasonable estimate of θ based on these data?
    (ii) Based on these data, estimate the probability that the lifetime of a randomly
    chosen laptop is longer than 100 hours.
(c) Suppose Yi ~ Exponential(θ), i = 1, 2, ..., n independently.
    (i) Find E(Ȳ) and Var(Ȳ). What happens to Var(Ȳ) as n → ∞? What does
    this imply about how far Ȳ is from θ for large n?
    (ii) Approximate P( Ȳ − 1.6449 θ/√n ≤ θ ≤ Ȳ + 1.6449 θ/√n ).
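Part (b) can be checked in R. Note that the lifetimes are recorded in minutes, so 100 hours is 6000 minutes, and that R parameterizes the Exponential distribution by its rate 1/θ:

```r
life <- c(48.0, 1047.2, 802.3, 165.6, 76.7, 64.2, 158.6, 338.3, 200.6, 362.8,
          119.5, 55.9, 411.3, 706.9, 16.2, 1277.6, 49.4, 22.6, 1078.4, 440.7)
theta.hat <- mean(life)     # 7442.8 / 20 = 372.14 minutes
# Estimated P(lifetime > 100 hours) = P(Y > 6000 minutes)
p.gt.100h <- 1 - pexp(6000, rate = 1 / theta.hat)   # equals exp(-6000 / theta.hat)
c(theta.hat, p.gt.100h)
```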

18. Suppose Y1, Y2, ..., Yn are independent and identically distributed random variables
with E(Yi) = μ and Var(Yi) = σ², i = 1, 2, ..., n.

(a) Find E(Yi²).
    Hint: Rearrange the equation Var(Y) = E(Y²) − [E(Y)]².
(b) Find E(Ȳ), Var(Ȳ) and E(Ȳ²).
(c) Use (a) and (b) to show that E(S²) = σ², where

    S² = (1/(n − 1)) Σ_{i=1}^{n} (Yi − Ȳ)²
       = (1/(n − 1)) ( Σ_{i=1}^{n} Yi² − n Ȳ² )
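Writing out parts (a) and (b) and substituting into the second expression for S² gives the result in part (c); a sketch of the standard derivation, using only the definitions above:

```latex
\begin{aligned}
E(Y_i^2) &= \sigma^2 + \mu^2, \qquad
E(\bar{Y}) = \mu, \qquad
\operatorname{Var}(\bar{Y}) = \frac{\sigma^2}{n}, \qquad
E(\bar{Y}^2) = \frac{\sigma^2}{n} + \mu^2, \\[4pt]
E(S^2) &= \frac{1}{n-1}\left[\sum_{i=1}^{n} E(Y_i^2) - n\,E(\bar{Y}^2)\right]
        = \frac{1}{n-1}\left[n(\sigma^2+\mu^2) - n\left(\frac{\sigma^2}{n}+\mu^2\right)\right]
        = \sigma^2
\end{aligned}
```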

19. Is the graph in Figure 1.24 effective in conveying information about the snacking
behaviour of students at Ridgemont High School?

[Figure 1.24 is a horizontal bar graph: for each snack (Candy, Chips, Chocolate Bars,
Cookies, Crackers, Fruit, Ice Cream, Popcorn, Pretzels, Vegetables) the number of boys
and the number of girls (horizontal axis, 0 to 300 students) are shown.]

Figure 1.24: Preferred snacks of students at Ridgemont High School

20. The pie chart in Figure 1.25, from Fox News, shows the support for various Republican
Presidential candidates in 2012. What do you notice about this pie chart?

Figure 1.25: Pie chart for support for Republican Presidential candidates

21. The graphs in Figures 1.26 and 1.27 are two more classic Fox News graphs. What do
you notice? What political message do you think they were trying to convey to their
audience?

Figure 1.26: Unemployment rate under President Obama

Figure 1.27: Federal welfare in the US



22. Information about the mortality from malignant neoplasms (cancer) for females living
in Ontario is given in Figures 1.28 and 1.29 for the years 1970 and 2000 respectively.
The same information displayed in these two pie charts is also displayed in the bar
graph in Figure 1.30. Which display seems to carry the most information?

[Figure 1.28 is a pie chart with segments for Lung, Breast, Colorectal, Stomach,
Leukemia & Lymphoma, and Other.]

Figure 1.28: Mortality from malignant neoplasms for females in Ontario 1970

[Figure 1.29 is a pie chart with segments for Lung, Breast, Colorectal, Stomach,
Leukemia & Lymphoma, and Other.]

Figure 1.29: Mortality from malignant neoplasms for females in Ontario in 2000

[Figure 1.30 is a bar graph comparing 1970 and 2000 mortality (vertical axis from 0
to 40) for the categories Lung, Leuk. & Lymph., Breast, Colorectal, Stomach and
Other.]

Figure 1.30: Mortality from malignant neoplasms for females living in Ontario,
1970 and 2000

23. The following article and graphical summary appeared on canadiansinternet.com on


May 24, 2016.
Canadians love using social media in 2016 and Facebook continues to be the social
network they use most, a new survey from InsightsWest has determined. Facebook,
YouTube and Instagram use is still growing at a healthy pace overall in Canada, while
Twitter, Google+, Pinterest, LinkedIn, Tumblr and Reddit usage has slowed down a
bit. A respectable 18% of social interactions are with businesses. In spite of social
media’s popularity in Canada, websites are the most common way for Canadians to
interact with businesses online.
Canadian millennials use social media differently than other age groups. YouTube,
Instagram, Twitter and Snapchat are growing in usage among millennials in this
country. Conversely, the older our residents are, the less likely they are to have tried
each social media network. In all age groups combined, only 15% said they had never
tried YouTube and just 16% said they’d never used Facebook.
Social media usage among women is growing steadily across all networks. Growth
among Canadian men is slower by comparison. The ladies are using visual social
networks more, with Instagram and Pinterest seeing more growth by comparison to
men. LinkedIn growth among Canadian males is almost double the growth of women
using the network.
The popularity of social networks is based on more than the number of members they
have. The following statistics show how many of the Canadians surveyed visit each
social network at least twice per week. Comment on how effective you think the graphical
summary is.

24. A study led by Beth Israel Medical Center in New York City has found that live
music can be beneficial to premature babies. In the study, music therapists helped
parents transform their favorite tunes into lullabies. The researchers concluded that
live music, played or sung, helped to slow infants' heartbeats, calmed their breathing,
improved their sucking behaviors (important for feeding), aided their sleep and promoted
their states of quiet alertness. Doctors and researchers say that by reducing
stress and stabilizing vital signs, music can allow infants to devote more energy to
normal development.
The two-year study was conducted between January 2011 and December 2012 in 11
hospitals in New York state. Only hospitals which received approval from their hospital's
institutional review boards were included in the study. The study involved 272
premature infants aged 32 weeks with respiratory distress syndrome, clinical sepsis
(a life-threatening condition that arises when the body's response to infection causes
injury to its own tissues and organs), and/or SGA (small for gestational age). Over a
two week period the babies experienced 4 different musical "treatments". Two of the
treatments involved musical instruments, one involved singing, and the control treatment
was no music at all. The instruments and singing were intended to approximate
womb sounds.
The first musical instrument was the Remo ocean disc, a round musical instrument
filled with tiny metal balls. When the disc is rotated, the metal balls move slowly to
create a quiet sound effect meant to simulate the fluid sounds of the womb. The second
musical instrument was a gato box, a small rectangular tuned musical instrument
used to simulate the heartbeat sound that the baby would hear in the womb. The
singing treatment consisted of live singing of a
lullaby chosen by a parent. If a parent did not choose a song then "Twinkle, Twinkle,
Little Star" was used.
Each of the four treatments was given 2 times per week over the course of the two
week study period. The presentation of the treatments was varied by day of the
week within each week, and the time of day (either morning or afternoon) was
randomized across the 2 weeks. For each treatment the baby's heart rate (beats per
minute), respiratory rate (number of breaths per minute), oxygen saturation (amount
of oxygen in the blood), sucking pattern (active/medium/slow/none), and activity
level (active/quiet/irritable/sleeping) were recorded.
Researchers found that the gato box, the ocean disc and singing all slowed a baby's
heart rate, though singing seemed to be most effective. Singing also increased the time
babies stayed quietly alert. Sucking behavior improved most with the gato box. The
breathing rate slowed the most and sleeping was best with the ocean disc. Babies
hearing songs their parents chose had a better sucking pattern than those who
heard "Twinkle, Twinkle, Little Star." But the "Twinkle" babies had slightly more
oxygen saturation in their blood. Dr. Loewy, who trains therapists worldwide, said
it did not matter whether parents or music therapists sang, or whether babies were
in incubators or held.
Dr. Lance A. Parton, associate director of the regional neonatal intensive care unit
at Westchester Medical Center's Maria Fareri Children's Hospital, which participated
in the research, said it would be useful to see if music could help the sickest and most
premature babies, who were not in the study. "Live music is optimal because it's in
the moment and can adapt to changing conditions," said Dr. Standley, a professor
of medical music therapy at Florida State University. "If the baby appears to be
falling asleep, you can sing quieter. Recorded music can't do that. But there are so
many premature babies and so few trained live producers of music therapy that it's
important to know what recorded music can do."

(a) Is this a sample survey, an observational study or an experimental study? Explain
why.
(b) What are the units of interest in this study? Based on the given information,
what population or collection of units are the researchers interested in?
(c) Give at least 6 important variates in this study and indicate the type of each.
(d) Give at least 5 attributes of interest for this study.

25. Many people do not realize how important statistics is in our everyday life. We are
surrounded by examples. Here is a wonderful example given by John Sall, co-founder
and Executive VP of the statistical software company SAS, on the occasion of the
International Year of Statistics in 2013.
"You brush your teeth. The fluoride in the toothpaste was studied by scientists using
statistical methods to carefully assure the safety and effectiveness of the ingredient
and the proper concentration. The toothpaste was formulated through a series of
designed experiments that determined the optimal formulation through statistical
modeling. The toothpaste production was monitored by statistical process control to
ensure quality and consistency, and to reduce variability.
The attributes of the product were studied in consumer trials using statistical meth-
ods. The pricing, packaging and marketing were determined through studies that used
statistical methods to determine the best marketing decisions. Even the location of
the toothpaste on the supermarket shelf was the result of statistically based studies.
The advertising was monitored using statistical methods. Your purchase transaction
became data that was analyzed statistically. The credit card used for the purchase
was scrutinized by a statistical model to make sure that it wasn’t fraudulent.
So statistics is important to the whole process of not just toothpaste, but every product
we consume, every service we use, every activity we choose. Yet we don't need
to be aware of it, since it is just an embedded part of the process. Statistics is useful
everywhere you look.”
Think of an example in your everyday life in which statistics played an important
role.
2. STATISTICAL MODELS AND MAXIMUM LIKELIHOOD ESTIMATION

2.1 Choosing a Statistical Model

A statistical model is a mathematical model that incorporates probability³ in some way.
As described in Chapter 1, our interest here is in studying variability and uncertainty in
populations and processes and drawing inferences, where warranted, in the presence of
this uncertainty. This will be done by using random variables to model characteristics of
randomly selected units in the population or process. It is very important to be clear about
what the "target" population or process is, and exactly how the variates being considered
are defined and measured. These issues are discussed in Chapter 3.

An important step in statistics is the choice of a statistical model⁴ to suit a given
application. The choice of a model is usually determined by some combination of the
following three factors:

1. Background knowledge or assumptions about the population or process which lead to
certain distributions.

2. Past experience with data sets from the population or process, which has shown that
certain distributions are suitable.

3. A current data set, against which models can be assessed.

³ The material in this section is largely a review of material you have seen in a previous probability course.
This material is available in the STAT 220/230 Notes which are posted on the course website.
⁴ The University of Wisconsin-Madison statistician George E.P. Box (18 October 1919 – 28 March 2013)
said of statistical models that "All models are wrong but some are useful", which is to say that although
models rarely fit very large amounts of data perfectly, they do assist in describing and drawing inferences
from real data.


In probability, there is a large emphasis on factor 1 above, and there are many “families”
of probability distributions that describe certain types of situations. For example, the
Binomial distribution was derived as a model for outcomes in repeated independent trials
with two possible outcomes on each trial while the Poisson distribution was derived as a
model for the random occurrence of events in time or space. The Gaussian or Normal
distribution, on the other hand, is often used to represent the distributions of continuous
measurements such as the heights or weights of individuals. This choice is based largely on
past experience that such models are suitable and on mathematical convenience.
In choosing a model we usually consider families of probability distributions. To be
specific, we suppose that for a random variable Y we have a family of probability
functions/probability density functions f(y; θ) indexed by the parameter θ (which may be a
vector of values). In order to apply the model to a specific problem we need to choose a
value for θ. The process of selecting a value for θ based on the observed data is referred
to as "estimating" the value of θ or "fitting" the model. The next section describes the
method of maximum likelihood, which is the most widely used method for estimating θ.

Most applications require a sequence of steps in the formulation (the word "specification"
is also used) of a model. In particular, we often start with some family of models in
mind, but find after examining the data set and fitting the model that it is unsuitable in
certain respects. (Methods for checking the suitability of a model will be discussed in Section
2.4.) We then try other models, and perhaps look at more data, in order to work towards
a satisfactory model. This is usually an iterative process, which is sometimes represented
by diagrams such as:

Collect and examine data set
            ↓
 Propose a (revised?) model
     ↓              ↑
 Fit model   →   Check model
     ↓
 Draw conclusions

Statistics devotes considerable effort to the steps of this process. We will focus on
settings in which the models are not too complicated, so that model formulation problems
are minimized. There are several distributions that you should review before continuing
since they will appear in many examples. See the STAT 220/230/240 Course Notes available
on the course website. You should also consult the Table of Distributions given in Chapter
10 for a condensed table of properties of these distributions including their means, variances
and moment generating functions.

Property          Discrete                                       Continuous

c.d.f.            F(x) = P(X ≤ x) = Σ_{t ≤ x} P(X = t)           F(x) = P(X ≤ x) = ∫_{−∞}^{x} f(t) dt
                  F is a right continuous step                   F is a continuous
                  function for all x ∈ ℝ                         function for all x ∈ ℝ

p.f./p.d.f.       f(x) = P(X = x)                                f(x) = (d/dx) F(x) ≠ P(X = x) = 0

Probability       P(X ∈ A) = Σ_{x ∈ A} P(X = x)                  P(a < X ≤ b) = F(b) − F(a)
of an event                = Σ_{x ∈ A} f(x)                                 = ∫_{a}^{b} f(x) dx

Total             Σ_{all x} P(X = x) = Σ_{all x} f(x) = 1        ∫_{−∞}^{∞} f(x) dx = 1
probability

Expectation       E[g(X)] = Σ_{all x} g(x) f(x)                  E[g(X)] = ∫_{−∞}^{∞} g(x) f(x) dx

Table 2.1 Properties of discrete versus continuous random variables

Binomial distribution
The discrete random variable (r.v.) Y has a Binomial distribution if its probability
function is of the form

    P(Y = y; θ) = f(y; θ) = (n choose y) θ^y (1 − θ)^(n−y)   for y = 0, 1, ..., n

where θ is a parameter with 0 < θ < 1. For convenience we write Y ~ Binomial(n, θ).
Recall that E(Y) = nθ and Var(Y) = nθ(1 − θ).

Poisson distribution
The discrete random variable Y has a Poisson distribution if its probability function is
of the form

    f(y; θ) = (θ^y e^(−θ)) / y!   for y = 0, 1, 2, ...

where θ is a parameter with θ ≥ 0. We write Y ~ Poisson(θ). Recall that E(Y) = θ and
Var(Y) = θ.

Exponential distribution
The continuous random variable Y has an Exponential distribution if its probability
density function is of the form

    f(y; θ) = (1/θ) e^(−y/θ)   for y ≥ 0

where θ is a parameter with θ > 0. We write Y ~ Exponential(θ). Recall that E(Y) = θ and
Var(Y) = θ².

Gaussian (Normal) distribution

The continuous random variable Y has a Gaussian or Normal distribution if its probability
density function is of the form

    f(y; μ, σ) = (1/(√(2π) σ)) exp( −(y − μ)² / (2σ²) )   for y ∈ ℝ

where μ and σ are parameters, with μ ∈ ℝ and σ > 0. Recall that E(Y) = μ,
Var(Y) = σ², and the standard deviation of Y is sd(Y) = σ. We write either Y ~ G(μ, σ)
or Y ~ N(μ, σ²). Note that in the former case, G(μ, σ), the second parameter is the standard
deviation σ whereas in the latter, N(μ, σ²), the second parameter is the variance σ².
Most software syntax, including R, requires that you input the standard deviation for the
parameter. As seen in examples in Chapter 1, the Gaussian distribution provides a suitable
model for the distribution of measurements on characteristics like the height or weight of
individuals in certain populations, but is also used in many other settings. It is particularly
useful in finance where it is the most commonly used model for asset prices, exchange rates,
interest rates, etc.
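For later computations it helps to know how R names and parameterizes these distributions. A small sketch (parameter values chosen only for illustration); note that dexp() uses the rate 1/θ and dnorm() takes the standard deviation, matching the G(μ, σ) convention:

```r
theta <- 0.3
dbinom(2, size = 10, prob = theta)   # Binomial(10, 0.3) probability at y = 2
dpois(2, lambda = theta)             # Poisson(0.3) probability at y = 2
dexp(2, rate = 1 / theta)            # Exponential(theta = 0.3) density at y = 2
dnorm(2, mean = 0, sd = 1)           # G(0, 1) density at y = 2
```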

Multinomial distribution
The Multinomial distribution is a multivariate distribution in which the discrete random
variables Y1, Y2, ..., Yk (k ≥ 2) have the joint probability function

    P(Y1 = y1, Y2 = y2, ..., Yk = yk; θ) = f(y1, y2, ..., yk; θ)
        = (n! / (y1! y2! ··· yk!)) θ1^(y1) θ2^(y2) ··· θk^(yk)        (2.1)

where yi = 0, 1, ... for i = 1, 2, ..., k and Σ_{i=1}^{k} yi = n. The elements of the parameter vector
θ = (θ1, θ2, ..., θk) satisfy 0 ≤ θi ≤ 1 for i = 1, 2, ..., k and Σ_{i=1}^{k} θi = 1. This distribution is
a generalization of the Binomial distribution. It arises when there are repeated independent
trials, where each trial has k possible outcomes (call them outcomes 1, 2, ..., k), and the
probability that outcome i occurs is θi. If Yi, i = 1, 2, ..., k, is the number of times that outcome i
occurs in a sequence of n independent trials, then (Y1, Y2, ..., Yk) have the joint probability
function given in (2.1). We write (Y1, Y2, ..., Yk) ~ Multinomial(n, θ).

Since Σ_{i=1}^{k} Yi = n, we can rewrite f(y1, y2, ..., yk; θ) using only k − 1 variables, say
y1, y2, ..., y_{k−1}, by replacing yk with n − y1 − ··· − y_{k−1}. We see that the Multinomial
distribution with k = 2 is just the Binomial distribution, where the two possible outcomes
are S (Success) and F (Failure).
We now turn to the problem of fitting a model. This requires estimating or assigning
numerical values to the parameters in the model, for example, θ in an Exponential model
or μ and σ in the Gaussian model.
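The joint probability function (2.1) is available in R as dmultinom(); a quick sketch with illustrative numbers confirms the formula:

```r
# Joint probability (2.1) with n = 10 trials and k = 3 outcomes
theta <- c(0.2, 0.3, 0.5)
y <- c(2, 3, 5)                             # y1 + y2 + y3 = n = 10
p.direct <- factorial(10) / prod(factorial(y)) * prod(theta^y)
p.builtin <- dmultinom(y, prob = theta)     # same value from the built-in
c(p.direct, p.builtin)
```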

2.2 Maximum Likelihood Estimation


Suppose a probability distribution that serves as a model for some random process depends
on an unknown parameter θ (possibly a vector). In order to use the model we have to
"estimate" or specify a value for θ. To do this we usually rely on some data set that has
been collected for the random variable in question. It is important that a data set be
collected carefully, and we consider this issue in Chapter 3. For example, suppose that the
random variable Y represents the weight of a randomly chosen female in some population,
and that we consider a Gaussian model, Y ~ G(μ, σ). Since E(Y) = μ, we might decide to
randomly select, say, 50 females from the population, measure their weights y1, y2, ..., y50,
and use the average

    μ̂ = ȳ = (1/50) Σ_{i=1}^{50} yi        (2.2)

to estimate μ. This seems sensible (why?) and similar ideas can be developed for other
parameters; in particular, note that σ must also be estimated, and you might think about
how you could use y1, y2, ..., y50 to do this. (Hint: what does σ or σ² represent in the
Gaussian model?) Note that although we are estimating the parameter μ we did not write
μ = ȳ. We introduced a special notation μ̂. This serves a dual purpose, both to remind you
that ȳ is not exactly equal to the unknown value of the parameter μ, but also to indicate
that μ̂ is a quantity derived from the data yi, i = 1, 2, ..., 50, and depends on the sample.
A different draw of the sample yi, i = 1, 2, ..., 50, will result in a different value for μ̂.

Definition 8 A point estimate of a parameter θ is the value of a function of the observed
data y1, y2, ..., yn and other known quantities such as the sample size n. We use θ̂ to denote
an estimate of the parameter θ.

Note that θ̂ = θ̂(y1, y2, ..., yn) = θ̂(y) depends on the sample y = (y1, y2, ..., yn) drawn.
A function of the data which does not involve any unknown quantities such as unknown
parameters is called a statistic. The numerical summaries discussed in Chapter 1 are all
examples of statistics. A point estimate is also a statistic.
Instead of ad hoc approaches to estimation as in (2.2), it is desirable to have a general
method for estimating parameters. The method of maximum likelihood is a very general
method, which we now describe.

Let the discrete (vector) random variable Y represent potential data that will be used
to estimate θ, and let y represent the actual observed data that are obtained in a specific
application. Note that to apply the method of maximum likelihood, we must know (or
make assumptions about) how the data y were collected. It is usually assumed here that
the data set consists of measurements on a random sample of units from a population or
process.

Definition 9 The likelihood function for θ is defined as

    L(θ) = L(θ; y) = P(Y = y; θ)   for θ ∈ Ω

where the parameter space Ω is the set of possible values for θ.

Note that the likelihood function is a function of the parameter θ and the given data y.
For convenience we usually write just L(θ). Also, the likelihood function is the probability
that we observe the data y, considered as a function of the parameter θ. Obviously values
of the parameter that make the observed data y more probable would seem more credible or
likely than those that make the data less probable. Therefore values of θ for which L(θ) is
large are more consistent with the observed data y. This seems like a "sensible" approach,
and it turns out to have very good properties.

Definition 10 The value of θ which maximizes L(θ) for given data y is called the maximum
likelihood estimate⁵ (m.l. estimate) of θ. It is the value of θ which maximizes the
probability of observing the data y. This value is denoted θ̂.

We are surrounded by polls. They guide the policies of political leaders, the products
that are developed by manufacturers, and increasingly the content of the media. The
following is an example of a public opinion poll.

Example 2.2.1 Nanos Research poll

Between February 22nd and 24th, 2016 Nanos Research (a Canadian public opinion and
research company) conducted a survey of Canadian adults, 18 years or older, to determine
support for the legalization of marijuana. The 1000 participants were recruited using live
agents and random digit dialing (land- and cell-lines) across Canada with a maximum of five
call backs. Respondents were asked "Do you support, somewhat support, somewhat oppose
or oppose legalizing the recreational use of marijuana?". The data are summarized in a
bar graph in Figure 2.1. Thirty-nine percent of respondents indicated that they supported
the recreational use of marijuana while 29% indicated that they somewhat supported the
recreational use of marijuana. Nanos reported that the margin of error for a random survey
⁵ We distinguish between the random variable, the maximum likelihood estimator, which is the function
of the potential data, and its numerical value for the given data, referred to as the maximum likelihood
estimate.

[Figure 2.1 is a bar graph showing the percentage of respondents (vertical axis, 0 to 40)
in each response category: Support, Somewhat Support, Unsure, Oppose, Somewhat
Oppose.]

Figure 2.1: Nanos Research poll on use of marijuana

of 1000 Canadians is ±3.1 percentage points, 19 times out of 20. How do we interpret this
statement?
Suppose that the random variable Y represents the number of units in a sample of
n units drawn from a very large population who have a certain characteristic of interest.
Suppose we assume that Y is closely modelled by a Binomial distribution with probability
function

    P(Y = y; θ) = f(y; θ) = (n choose y) θ^y (1 − θ)^(n−y)   for y = 0, 1, ..., n and 0 ≤ θ ≤ 1

where θ represents the proportion of the large population that have the characteristic.
Suppose that y units in the sample of size n have the characteristic. The likelihood function
for θ based on these data is

    L(θ) = P(y units have characteristic; θ)
         = (n choose y) θ^y (1 − θ)^(n−y)   for 0 ≤ θ ≤ 1        (2.3)

If y ≠ 0 and y ≠ n then it can be shown that (2.3) attains its maximum value at θ = θ̂ = y/n
by solving dL(θ)/dθ = 0. The estimate θ̂ = y/n is called the sample proportion.
For the Nanos poll, suppose we are interested in θ = proportion of Canadian adults who
support or somewhat support the recreational use of marijuana. From this poll we have
y = 680 people out of n = 1000 people who support or somewhat support the recreational use
of marijuana, so the likelihood function for θ is

    L(θ) = (1000 choose 680) θ^680 (1 − θ)^320   for 0 ≤ θ ≤ 1        (2.4)

The maximum likelihood estimate of θ for these data is θ̂ = y/n = 680/1000 = 0.68 or
68%, which can also easily be seen from the graph of the likelihood function (2.4) given in
Figure 2.2. The interval suggested by the pollsters was 68 ± 3.1% or [64.9, 71.1]. Looking at
Figure 2.2 we see that the interval [0.649, 0.711] is a reasonable interval for the parameter θ
since it seems to contain most of the values of θ with large values of the likelihood L(θ).
We will return to the construction of such interval estimates in Chapter 4.
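Figure 2.2 and the estimate θ̂ = 0.68 can be reproduced numerically; a sketch (the plotting range is chosen to match the figure):

```r
n <- 1000; y <- 680
L <- function(theta) dbinom(y, n, theta)    # likelihood function (2.4)
# Numerical maximization agrees with the closed form estimate y/n = 0.68
theta.hat <- optimize(L, interval = c(0, 1), maximum = TRUE)$maximum
theta <- seq(0.62, 0.74, by = 0.001)
plot(theta, L(theta), type = "l", xlab = "theta", ylab = "L(theta)")
theta.hat
```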

[Figure 2.2 plots the likelihood L(θ) against θ for θ between 0.64 and 0.72; the curve
peaks at θ = 0.68.]

Figure 2.2: Likelihood function for for the Nanos poll

The shape of the likelihood function and the value of θ at which it is maximized are
not affected if L(θ) is multiplied by a constant. Indeed it is not the absolute value of the
likelihood function that is important but the relative values at two different values of the
parameter, e.g. L(θ1)/L(θ2). You might think of this ratio as how much more or less
consistent the data are with the parameter θ1 versus θ2. The ratio L(θ1)/L(θ2) is also
unaffected if L(θ) is multiplied by a constant. In view of this the likelihood may be defined
as P(Y = y; θ) or as any constant multiple of it, so, for example, we could drop the term
(n choose y) in (2.3) and define L(θ) = θ^y (1 − θ)^(n−y). This function and (2.3) are maximized by the
same value θ = θ̂ = y/n and have the same shape. Indeed we might rescale the likelihood
function by dividing through by its maximum value L(θ̂) so that the new function has a
maximum value equal to one.

Definition 11 The relative likelihood function is defined as

    R(θ) = L(θ) / L(θ̂)   for θ ∈ Ω

Note that 0 ≤ R(θ) ≤ 1 for all θ ∈ Ω.

Sometimes it is easier to work with the logarithm (log = ln) of the likelihood function.

Definition 12 The log likelihood function is defined as
$$l(\theta) = \ln L(\theta) = \log L(\theta) \quad \text{for } \theta \in \Omega$$


Figure 2.3: The functions $L(\theta)$ (upper graph) and $l(\theta)$ (lower graph) are both
maximized at the same value $\theta = \hat\theta$

Figure 2.3 displays the graph of a likelihood function $L(\theta)$, rescaled to have a maximum
value of one at $\theta = \hat\theta$, and the corresponding log likelihood function $l(\theta) = \log L(\theta)$ with a
maximum value of $\log(1) = 0$. We see that $l(\theta)$, the lower of the two curves, is a monotone
function of $L(\theta)$ so that the two functions increase together and decrease together. Both
functions have a maximum at the same value $\theta = \hat\theta$.

Because functions are often (but not always!) maximized by setting their derivatives
equal to zero, we can usually obtain $\hat\theta$ by solving the equation
$$\frac{d}{d\theta}\, l(\theta) = 0$$
For example, from $L(\theta) = \theta^y(1-\theta)^{n-y}$ we get $l(\theta) = y\log(\theta) + (n-y)\log(1-\theta)$ and
$$\frac{d}{d\theta}\, l(\theta) = \frac{y}{\theta} - \frac{n-y}{1-\theta} = \frac{y - n\theta}{\theta(1-\theta)} \quad \text{for } 0 < \theta < 1$$
Solving $dl/d\theta = 0$ gives $\theta = y/n$. The First Derivative Test can be used to verify that this
corresponds to a maximum value so the maximum likelihood estimate of $\theta$ is $\hat\theta = y/n$. This
derivation holds if $y \neq 0$ and $y \neq n$. See Problem 2 for the derivation of $\hat\theta$ if $y = 0$ or $y = n$.

Likelihood function for a random sample

In many applications the data $Y = (Y_1, Y_2, \ldots, Y_n)$ are independent and identically
distributed (i.i.d.) random variables each with probability function $f(y; \theta)$, $\theta \in \Omega$. We refer
to $Y = (Y_1, Y_2, \ldots, Y_n)$ as a random sample from the distribution $f(y; \theta)$. In this case the
observed data are $y = (y_1, y_2, \ldots, y_n)$ and
$$L(\theta) = L(\theta; y) = \prod_{i=1}^{n} f(y_i; \theta) \quad \text{for } \theta \in \Omega$$
Recall that if $Y_1, Y_2, \ldots, Y_n$ are independent random variables then their joint probability
function is the product of their individual probability functions.

Example 2.2.2 Likelihood function for Poisson distribution

Suppose $y_1, y_2, \ldots, y_n$ is an observed random sample from a Poisson($\theta$) distribution.
The likelihood function is
$$L(\theta) = \prod_{i=1}^{n} f(y_i; \theta) = \prod_{i=1}^{n} P(Y_i = y_i; \theta)
           = \prod_{i=1}^{n} \frac{\theta^{y_i} e^{-\theta}}{y_i!}
           = \left(\prod_{i=1}^{n} \frac{1}{y_i!}\right) \theta^{\sum_{i=1}^{n} y_i}\, e^{-n\theta} \quad \text{for } \theta \ge 0$$
or more simply
$$L(\theta) = \theta^{n\bar{y}}\, e^{-n\theta} \quad \text{for } \theta \ge 0$$
The log likelihood is
$$l(\theta) = n(\bar{y}\log\theta - \theta) \quad \text{for } \theta > 0$$
with derivative
$$\frac{d}{d\theta}\, l(\theta) = n\left(\frac{\bar{y}}{\theta} - 1\right) = \frac{n}{\theta}(\bar{y} - \theta) \quad \text{for } \theta > 0$$
The First Derivative Test can be used to verify that the value $\theta = \bar{y}$ maximizes $l(\theta)$ and so
$\hat\theta = \bar{y}$ is the maximum likelihood estimate of $\theta$.
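The algebra can be confirmed numerically by evaluating the Poisson log likelihood on a grid and checking that it peaks at $\bar{y}$ (a minimal sketch with made-up counts, not data from the notes):

```python
from math import log

data = [3, 0, 2, 1, 4, 2, 2, 2]       # hypothetical Poisson counts (ybar = 2)
n, ybar = len(data), sum(data) / len(data)

def l(theta):
    """Poisson log likelihood l(theta) = n*(ybar*log(theta) - theta)."""
    return n * (ybar * log(theta) - theta)

# The grid maximum should land on theta = ybar (to within the grid spacing).
grid = [k / 100 for k in range(1, 1001)]
best = max(grid, key=l)
print(ybar, best)
```

Because $l(\theta)$ is strictly concave here, the grid maximum is unique and sits at the sample mean.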

Combining likelihoods based on independent experiments

If we have two data sets $y_1$ and $y_2$ from two independent studies for estimating $\theta$, then
since the corresponding random variables $Y_1$ and $Y_2$ are independent we have
$$P(Y_1 = y_1, Y_2 = y_2; \theta) = P(Y_1 = y_1; \theta)\, P(Y_2 = y_2; \theta)$$
and we obtain the "combined" likelihood function $L(\theta)$ based on $y_1$ and $y_2$ together as
$$L(\theta) = L_1(\theta)\, L_2(\theta) \quad \text{for } \theta \in \Omega$$
where $L_j(\theta) = P(Y_j = y_j; \theta)$, $j = 1, 2$. This idea, of course, can be extended to more than
two independent studies.

Example 2.2.3
In 2011, Harris/Decima (a research polling company) conducted a poll of 2000 members of
the Canadian adult population in which they asked respondents whether they agreed with
the statement: "University and college teachers earn too much". In 2011, $y_2 = 540$ people
agreed with the statement. In a previous poll of 2000 people conducted by Harris/Decima
in 2010, $y_1 = 520$ people agreed with the same statement. If we assume that $\theta$, the
proportion of the Canadian adult population that agrees with the statement, is the same in
both years then $\theta$ may be estimated using the data from these two independent polls. The
combined likelihood would be
$$L(\theta) = P(Y_1 = y_1, Y_2 = y_2; \theta) = P(Y_1 = y_1; \theta)\, P(Y_2 = y_2; \theta)
           = \binom{2000}{520}\theta^{520}(1-\theta)^{1480}\binom{2000}{540}\theta^{540}(1-\theta)^{1460}
           = \binom{2000}{520}\binom{2000}{540}\theta^{1060}(1-\theta)^{2940} \quad \text{for } 0 \le \theta \le 1$$
or, ignoring the constants with respect to $\theta$, we have
$$L(\theta) = \theta^{1060}(1-\theta)^{2940} \quad \text{for } 0 \le \theta \le 1$$
The maximum likelihood estimate of $\theta$ based on the two independent experiments is
$\hat\theta = 1060/4000 = 0.265$.
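A grid search over the combined log likelihood confirms the pooled estimate (a sketch in Python rather than the course's R; the counts are those of the two polls):

```python
from math import log

y1, n1 = 520, 2000   # 2010 poll
y2, n2 = 540, 2000   # 2011 poll
y, n = y1 + y2, n1 + n2

def l(theta):
    """Combined log likelihood (constants with respect to theta dropped)."""
    return y * log(theta) + (n - y) * log(1 - theta)

grid = [k / 10000 for k in range(1, 10000)]
best = max(grid, key=l)
print(best)   # the grid maximum sits at 1060/4000 = 0.265
```

Multiplying the two likelihoods corresponds to adding the two log likelihoods, which is why the pooled estimate is simply the combined proportion of agreement.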

Sometimes the likelihood function for a given set of data can be constructed in more
than one way as the following example illustrates.

Example 2.2.4
Suppose that the random variable Y represents the number of persons infected with
the human immunodeficiency virus (HIV) in a randomly selected group of n persons. We
assume the data are reasonably modeled by $Y \sim \text{Binomial}(n, \theta)$ with probability function
$$P(Y = y; \theta) = f(y; \theta) = \binom{n}{y}\theta^{y}(1-\theta)^{n-y} \quad \text{for } y = 0, 1, \ldots, n$$
where $\theta$ represents the proportion of the population that is infected. In this case, if we
select a random sample of n persons and test them for HIV, we observe $y$, the
number infected. Thus
$$L(\theta) = \binom{n}{y}\theta^{y}(1-\theta)^{n-y} \quad \text{for } 0 \le \theta \le 1$$
or more simply
$$L(\theta) = \theta^{y}(1-\theta)^{n-y} \quad \text{for } 0 \le \theta \le 1 \qquad (2.5)$$
and again $L(\theta)$ is maximized by the value $\hat\theta = y/n$.

For this random sample of n persons who are tested for HIV, we could also define the
indicator random variable
$$Y_i = I(\text{person } i \text{ tests positive for HIV})$$
for $i = 1, 2, \ldots, n$. (Note: $I(A)$ is the indicator function; it equals 1 if A is true and 0 if A
is false.) Now $Y_i \sim \text{Binomial}(1, \theta)$ with probability function
$$f(y_i; \theta) = \theta^{y_i}(1-\theta)^{1-y_i} \quad \text{for } y_i = 0, 1 \text{ and } 0 \le \theta \le 1$$
The likelihood function for the observed random sample $y_1, y_2, \ldots, y_n$ is
$$L(\theta) = \prod_{i=1}^{n} f(y_i; \theta) = \prod_{i=1}^{n} \theta^{y_i}(1-\theta)^{1-y_i}
           = \theta^{\sum_{i=1}^{n} y_i}\,(1-\theta)^{\sum_{i=1}^{n}(1-y_i)}
           = \theta^{y}(1-\theta)^{n-y} \quad \text{for } 0 \le \theta \le 1$$
where $y = \sum_{i=1}^{n} y_i$. This is the same likelihood function as (2.5), because the random
variable $\sum_{i=1}^{n} Y_i$ has a Binomial($n, \theta$) distribution.

In many applications we encounter likelihood functions which cannot be maximized
mathematically and we need to resort to numerical methods. The following example
provides an illustration.

Example 2.2.5 Coliform bacteria in water

The number of coliform bacteria Y in a random sample of water of volume v milliliters
is assumed to have a Poisson distribution:
$$P(Y = y; \theta) = f(y; \theta) = \frac{(\theta v)^y e^{-\theta v}}{y!} \quad \text{for } y = 0, 1, \ldots \text{ and } \theta \ge 0 \qquad (2.6)$$
where $\theta$ is the average number of bacteria per milliliter of water.
There is an inexpensive test which can detect the presence (but not the number) of
bacteria in a water sample. In this case we do not observe Y, but rather the "presence"
indicator $I(Y > 0)$, or
$$Z = \begin{cases} 1 & \text{if } Y > 0 \\ 0 & \text{if } Y = 0 \end{cases}$$

From (2.6) we have
$$P(Z = 0; \theta) = P(Y = 0; \theta) = e^{-\theta v}$$
and so
$$P(Z = 1; \theta) = 1 - P(Z = 0; \theta) = 1 - e^{-\theta v}$$
Suppose that n water samples, of volumes $v_1, v_2, \ldots, v_n$, are selected. Let $z_1, z_2, \ldots, z_n$
be the observed values of the presence indicators. Note that these observed values will be
either 0 or 1.
The likelihood function is
$$L(\theta) = \prod_{i=1}^{n} P(Z_i = z_i; \theta)
           = \prod_{i=1}^{n} (1 - e^{-\theta v_i})^{z_i}\, (e^{-\theta v_i})^{1-z_i} \quad \text{for } \theta \ge 0$$
and the log likelihood function is
$$l(\theta) = \sum_{i=1}^{n} \left[ z_i \log(1 - e^{-\theta v_i}) - (1 - z_i)\theta v_i \right] \quad \text{for } \theta > 0$$


Figure 2.4: The log likelihood function $l(\theta)$ for Example 2.2.5

We cannot maximize $l(\theta)$ mathematically by solving $dl/d\theta = 0$, so we use numerical
methods.

Suppose, for example, that n = 40 samples gave data as follows:

    vi (ml)                 8    4    2    1
    number of samples      10   10   10   10
    number with zi = 1     10    8    7    3

This gives
$$l(\theta) = 10\log(1 - e^{-8\theta}) + 8\log(1 - e^{-4\theta}) + 7\log(1 - e^{-2\theta}) + 3\log(1 - e^{-\theta}) - 21\theta \quad \text{for } \theta \ge 0$$
Either by maximizing $l(\theta)$ numerically for $\theta \ge 0$, or by solving $dl/d\theta = 0$ numerically, we
find the maximum likelihood estimate of $\theta$ to be $\hat\theta = 0.478$. A simple way to maximize $l(\theta)$
is to plot it, as shown in Figure 2.4; the maximum likelihood estimate can then be found
by inspection or, for more accuracy, by using a method like Newton's method.
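The numerical root-finding can be sketched in a few lines. The notes work in R (e.g. with nlm()); the version below uses Python and a simple bisection on $dl/d\theta$, which has one sign change on $(0, \infty)$ for these data:

```python
from math import exp

# (volume v, number of samples with z = 1, number with z = 0) from the table above
groups = [(8, 10, 0), (4, 8, 2), (2, 7, 3), (1, 3, 7)]

def dl(theta):
    """Derivative of l(theta): sum of z_i*v_i*e^(-theta*v_i)/(1 - e^(-theta*v_i)) - (1 - z_i)*v_i."""
    return sum(pos * v * exp(-theta * v) / (1 - exp(-theta * v)) - neg * v
               for v, pos, neg in groups)

# dl(theta) is positive for small theta and negative for large theta: bisect for the root.
lo, hi = 0.01, 5.0
for _ in range(60):
    mid = (lo + hi) / 2
    if dl(mid) > 0:
        lo = mid
    else:
        hi = mid
theta_mle = (lo + hi) / 2
print(round(theta_mle, 3))   # agrees with the estimate 0.478 quoted above
```

The $-21$ in $dl/d\theta$ comes from $\sum (1 - z_i) v_i = 0(8) + 2(4) + 3(2) + 7(1) = 21$.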

A few remarks about numerical methods are in order. Aside from a few simple models,
it is not possible to maximize likelihood functions explicitly. However, software exists which
implements powerful numerical methods which can easily maximize (or minimize) functions
of one or more variables. Multi-purpose optimizers can be found in many software packages;
in R the function nlm() is powerful and easy to use. In addition, statistical software packages
contain special functions for fitting and analyzing a large number of statistical models. The
R package MASS (which can be accessed by the command library(MASS)) has a function
fitdistr that will fit many common models.

2.3 Likelihood Functions for Continuous Distributions

Recall that we defined likelihoods for discrete random variables as the probability of
observing the data y, or
$$L(\theta) = L(\theta; y) = P(Y = y; \theta) \quad \text{for } \theta \in \Omega$$
For a continuous random variable, $P(Y = y; \theta)$ is unsuitable as a definition of
the likelihood since this probability always equals zero.
Suppose Y is a continuous random variable with probability density function $f(y; \theta)$.
For continuous data we usually observe only the value of Y rounded to some degree of
precision; for example, data on waiting times are rounded to the closest second and data on
heights are rounded to the closest centimeter. The actual observation is really a discrete
random variable. For example, suppose we observe Y correct to one decimal place. Then
$$P(\text{we observe } 1.1; \theta) = \int_{1.05}^{1.15} f(y; \theta)\, dy \approx (0.1) f(1.1; \theta)$$

assuming the function $f(y; \theta)$ is reasonably smooth over the interval. More generally, suppose
$y_1, y_2, \ldots, y_n$ are the observations from a random sample from the distribution with
probability density function $f(y; \theta)$ which have been rounded to the nearest $\Delta$, which is
assumed to be small. Then
$$P(Y = y; \theta) \approx \prod_{i=1}^{n} \Delta f(y_i; \theta) = \Delta^n \prod_{i=1}^{n} f(y_i; \theta)$$
If we assume that the precision $\Delta$ does not depend on the unknown parameter $\theta$, then the
term $\Delta^n$ can be ignored. This argument leads us to adopt the following definition of the
likelihood function for a random sample from a continuous distribution.

Definition 13 If $y_1, y_2, \ldots, y_n$ are the observed values of a random sample from a
distribution with probability density function $f(y; \theta)$, then the likelihood function is defined
as
$$L(\theta) = L(\theta; y) = \prod_{i=1}^{n} f(y_i; \theta) \quad \text{for } \theta \in \Omega$$

Example 2.3.1 Likelihood function for Exponential distribution

Suppose that the random variable Y represents the lifetime of a randomly selected light
bulb in a large population of bulbs, and that $Y \sim \text{Exponential}(\theta)$ is a reasonable model for
such a lifetime.
If a random sample of light bulbs is tested and the lifetimes $y_1, y_2, \ldots, y_n$ are observed,
then the likelihood function for $\theta$ is
$$L(\theta) = \prod_{i=1}^{n} \frac{1}{\theta}\, e^{-y_i/\theta}
           = \frac{1}{\theta^n} \exp\left(-\sum_{i=1}^{n} y_i/\theta\right)
           = \theta^{-n}\, e^{-n\bar{y}/\theta} \quad \text{for } \theta > 0$$
The log likelihood function is
$$l(\theta) = -n\left(\log\theta + \frac{\bar{y}}{\theta}\right) \quad \text{for } \theta > 0$$
with derivative
$$\frac{d}{d\theta}\, l(\theta) = -n\left(\frac{1}{\theta} - \frac{\bar{y}}{\theta^2}\right) = \frac{n}{\theta^2}(\bar{y} - \theta)$$
Now $\frac{d}{d\theta} l(\theta) = 0$ for $\theta = \bar{y}$. The First Derivative Test can be used to verify that the value
$\theta = \bar{y}$ maximizes $l(\theta)$ and so $\hat\theta = \bar{y}$ is the maximum likelihood estimate of $\theta$.

Table 2.1
Summary of Maximum Likelihood Method for Named Distributions

Binomial$(n, \theta)$, $0 < \theta < 1$:
  observed data: $y$
  maximum likelihood estimate: $\hat\theta = y/n$;  estimator: $\tilde\theta = Y/n$
  relative likelihood: $R(\theta) = \left(\theta/\hat\theta\right)^{y}\left[(1-\theta)/(1-\hat\theta)\right]^{n-y}$

Poisson$(\theta)$, $\theta > 0$:
  observed data: $y_1, y_2, \ldots, y_n$
  maximum likelihood estimate: $\hat\theta = \bar{y}$;  estimator: $\tilde\theta = \bar{Y}$
  relative likelihood: $R(\theta) = \left(\theta/\hat\theta\right)^{n\hat\theta} e^{n(\hat\theta - \theta)}$

Geometric$(\theta)$, $0 < \theta < 1$:
  observed data: $y_1, y_2, \ldots, y_n$
  maximum likelihood estimate: $\hat\theta = \dfrac{1}{1+\bar{y}}$;  estimator: $\tilde\theta = \dfrac{1}{1+\bar{Y}}$
  relative likelihood: $R(\theta) = \left(\theta/\hat\theta\right)^{n}\left[(1-\theta)/(1-\hat\theta)\right]^{n\bar{y}}$

Negative Binomial$(k, \theta)$, $0 < \theta < 1$:
  observed data: $y_1, y_2, \ldots, y_n$
  maximum likelihood estimate: $\hat\theta = \dfrac{k}{k+\bar{y}}$;  estimator: $\tilde\theta = \dfrac{k}{k+\bar{Y}}$
  relative likelihood: $R(\theta) = \left(\theta/\hat\theta\right)^{nk}\left[(1-\theta)/(1-\hat\theta)\right]^{n\bar{y}}$

Exponential$(\theta)$, $\theta > 0$:
  observed data: $y_1, y_2, \ldots, y_n$
  maximum likelihood estimate: $\hat\theta = \bar{y}$;  estimator: $\tilde\theta = \bar{Y}$
  relative likelihood: $R(\theta) = \left(\hat\theta/\theta\right)^{n} e^{n(1 - \hat\theta/\theta)}$

Example 2.3.2 Likelihood function for Gaussian distribution

As an example involving more than one parameter, suppose that $y_1, y_2, \ldots, y_n$ is an
observed random sample from the $G(\mu, \sigma)$ distribution. The likelihood function for
$\theta = (\mu, \sigma)$ is
$$L(\theta) = L(\mu, \sigma) = \prod_{i=1}^{n} f(y_i; \mu, \sigma)
           = \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{1}{2\sigma^2}(y_i - \mu)^2\right]
           = (2\pi)^{-n/2}\, \sigma^{-n} \exp\left[-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \mu)^2\right]$$
for $\mu \in \mathbb{R}$ and $\sigma > 0$, or more simply (ignoring constants with respect to $\mu$ and $\sigma$)
$$L(\theta) = L(\mu, \sigma) = \sigma^{-n} \exp\left[-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \mu)^2\right] \quad \text{for } \mu \in \mathbb{R} \text{ and } \sigma > 0$$
Since
$$\sum_{i=1}^{n}(y_i - \bar{y}) = \sum_{i=1}^{n} y_i - n\bar{y} = \sum_{i=1}^{n} y_i - \sum_{i=1}^{n} y_i = 0$$
and
$$\sum_{i=1}^{n}(y_i - \mu)^2 = \sum_{i=1}^{n}(y_i - \bar{y} + \bar{y} - \mu)^2
 = \sum_{i=1}^{n}(y_i - \bar{y})^2 + 2(\bar{y} - \mu)\sum_{i=1}^{n}(y_i - \bar{y}) + n(\bar{y} - \mu)^2
 = \sum_{i=1}^{n}(y_i - \bar{y})^2 + n(\bar{y} - \mu)^2$$
we can write the likelihood function as
$$L(\mu, \sigma) = \sigma^{-n} \exp\left[-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \bar{y})^2\right] \exp\left[-\frac{n(\bar{y} - \mu)^2}{2\sigma^2}\right]$$
The log likelihood function for $\theta = (\mu, \sigma)$ is
$$l(\theta) = l(\mu, \sigma) = -n\log\sigma - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \bar{y})^2 - \frac{n(\bar{y} - \mu)^2}{2\sigma^2} \quad \text{for } \mu \in \mathbb{R} \text{ and } \sigma > 0$$
To maximize $l(\mu, \sigma)$ with respect to both parameters $\mu$ and $\sigma$ we solve$^6$ the two equations$^7$
$$\frac{\partial l}{\partial \mu} = \frac{n}{\sigma^2}(\bar{y} - \mu) = 0 \quad \text{and} \quad \frac{\partial l}{\partial \sigma} = -\frac{n}{\sigma} + \frac{1}{\sigma^3}\sum_{i=1}^{n}(y_i - \bar{y})^2 = 0$$
simultaneously. We find that the maximum likelihood estimate of $\theta$ is $\hat\theta = (\hat\mu, \hat\sigma)$, where
$$\hat\mu = \frac{1}{n}\sum_{i=1}^{n} y_i = \bar{y} \quad \text{and} \quad \hat\sigma = \left[\frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^2\right]^{1/2}$$
$^6$ To maximize a function of two variables, set the derivative with respect to each variable equal to zero.
Of course finding values at which the derivatives are zero does not prove this is a maximum. Showing it is
a maximum is another exercise in calculus.
$^7$ $\frac{\partial}{\partial\mu}$ is the derivative with respect to $\mu$ holding the parameter $\sigma$ constant. Similarly $\frac{\partial}{\partial\sigma}$ is the derivative
with respect to $\sigma$ holding $\mu$ constant.

2.4 Likelihood Functions For Multinomial Models

Multinomial models are used in many statistical applications. From Section 2.1, the
Multinomial joint probability function is
$$f(y_1, y_2, \ldots, y_k; \theta) = \frac{n!}{y_1!\, y_2! \cdots y_k!} \prod_{i=1}^{k} \theta_i^{y_i} \quad \text{for } y_i = 0, 1, \ldots \text{ where } \sum_{i=1}^{k} y_i = n$$
The likelihood function for $\theta = (\theta_1, \theta_2, \ldots, \theta_k)$ based on data $y_1, y_2, \ldots, y_k$ is given by
$$L(\theta) = L(\theta_1, \theta_2, \ldots, \theta_k) = \frac{n!}{y_1!\, y_2! \cdots y_k!} \prod_{i=1}^{k} \theta_i^{y_i}$$
or more simply
$$L(\theta) = \prod_{i=1}^{k} \theta_i^{y_i}$$
The log likelihood function is
$$l(\theta) = \sum_{i=1}^{k} y_i \log \theta_i$$
If $y_i$ represents the number of times outcome $i$ occurred in $n$ "trials", $i = 1, 2, \ldots, k$, then
it can be shown that
$$\hat\theta_i = \frac{y_i}{n} \quad \text{for } i = 1, 2, \ldots, k$$
are the maximum likelihood estimates of $\theta_1, \theta_2, \ldots, \theta_k$.$^8$

Example 2.4.1 A, B, AB, O blood types

Each person is one of four blood types, labelled A, B, AB and O. (Which type a person
is has important consequences, for example in determining to whom they can donate a
blood transfusion.) Let $\theta_1, \theta_2, \theta_3, \theta_4$ be the fractions of a population that have types A, B,
AB, O, respectively. Now suppose that in a random sample of 400 persons whose blood
was tested, the numbers who were types A, B, AB, O were $y_1 = 172$, $y_2 = 38$, $y_3 = 14$ and
$y_4 = 176$ respectively. (Note that $y_1 + y_2 + y_3 + y_4 = 400$.) Let the random variables
$Y_1, Y_2, Y_3, Y_4$ represent the numbers of type A, B, AB, O persons respectively that are in a
random sample of size $n = 400$. Then $(Y_1, Y_2, Y_3, Y_4)$ follows a Multinomial$(400; \theta_1, \theta_2, \theta_3, \theta_4)$
distribution. The maximum likelihood estimates from the observed data are therefore
$$\hat\theta_1 = \frac{172}{400} = 0.43, \quad \hat\theta_2 = \frac{38}{400} = 0.095, \quad \hat\theta_3 = \frac{14}{400} = 0.035, \quad \hat\theta_4 = \frac{176}{400} = 0.44$$
(as a check, note that $\sum_{i=1}^{4}\hat\theta_i = 1$). These give estimates of the population fractions
$\theta_1, \theta_2, \theta_3, \theta_4$. (Note: studies involving much larger numbers of people put the values of the $\theta_i$'s
for Caucasians at close to $\theta_1 = 0.448$, $\theta_2 = 0.083$, $\theta_3 = 0.034$, $\theta_4 = 0.436$.)
$^8$ $l(\theta) = \sum_{i=1}^{k} y_i \log \theta_i$ is a little tricky to maximize because the $\theta_i$'s satisfy a linear constraint, $\sum_{i=1}^{k} \theta_i = 1$.
The Lagrange multiplier method (Multivariate Calculus) for constrained optimization allows us to find the
solution $\hat\theta_i = y_i/n$, $i = 1, 2, \ldots, k$.
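The blood-type estimates in Example 2.4.1 are immediate to compute (a minimal sketch; the counts are those given in the example):

```python
counts = {"A": 172, "B": 38, "AB": 14, "O": 176}   # observed frequencies, n = 400
n = sum(counts.values())

# Multinomial maximum likelihood estimates: theta_i = y_i / n
theta_hat = {blood_type: y / n for blood_type, y in counts.items()}
print(theta_hat)
print(sum(theta_hat.values()))   # the estimates sum to 1 (up to floating-point rounding)
```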

In some problems the Multinomial parameters $\theta_1, \theta_2, \ldots, \theta_k$ may be functions of fewer
than $k - 1$ parameters. The following is an example.

Example 2.4.2 MM, MN, NN blood types

Another way of classifying a person's blood is through their "M-N" type. Each person
is one of three types, labelled MM, MN and NN, and we can let $\theta_1, \theta_2, \theta_3$ be the fractions
of the population that are each of the three types. In a sample of size n we let $Y_1$ = number
of MM types observed, $Y_2$ = number of MN types observed and $Y_3$ = number of NN types
observed. The joint probability function of $Y_1, Y_2, Y_3$ is
$$P(Y_1 = y_1, Y_2 = y_2, Y_3 = y_3) = \frac{n!}{y_1!\, y_2!\, y_3!}\, \theta_1^{y_1} \theta_2^{y_2} \theta_3^{y_3}$$
According to a model in genetics, the $\theta_i$'s can be expressed in terms of a single parameter
$\theta$ for human populations:
$$\theta_1 = \theta^2, \quad \theta_2 = 2\theta(1 - \theta), \quad \theta_3 = (1 - \theta)^2$$
where $\theta$ is a parameter with $0 \le \theta \le 1$. In this case
$$P(Y_1 = y_1, Y_2 = y_2, Y_3 = y_3) = \frac{n!}{y_1!\, y_2!\, y_3!}\, [\theta^2]^{y_1} [2\theta(1 - \theta)]^{y_2} [(1 - \theta)^2]^{y_3}$$
If the observed data are $y_1, y_2, y_3$ then the likelihood function for $\theta$ is
$$L(\theta) = \frac{n!}{y_1!\, y_2!\, y_3!}\, [\theta^2]^{y_1} [2\theta(1 - \theta)]^{y_2} [(1 - \theta)^2]^{y_3}
           = \frac{n!}{y_1!\, y_2!\, y_3!}\, 2^{y_2}\, \theta^{2y_1 + y_2} (1 - \theta)^{y_2 + 2y_3} \quad \text{for } 0 \le \theta \le 1$$
or more simply
$$L(\theta) = \theta^{2y_1 + y_2}(1 - \theta)^{y_2 + 2y_3} \quad \text{for } 0 \le \theta \le 1$$
The log likelihood function is
$$l(\theta) = (2y_1 + y_2)\log\theta + (y_2 + 2y_3)\log(1 - \theta) \quad \text{for } 0 < \theta < 1$$
with
$$\frac{dl}{d\theta} = \frac{2y_1 + y_2}{\theta} - \frac{y_2 + 2y_3}{1 - \theta}$$
and
$$\frac{dl}{d\theta} = 0 \quad \text{if} \quad \theta = \frac{2y_1 + y_2}{2y_1 + 2y_2 + 2y_3} = \frac{2y_1 + y_2}{2n}$$
so
$$\hat\theta = \frac{2y_1 + y_2}{2n}$$
is the maximum likelihood estimate of $\theta$.
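The closed form can be verified against a direct grid maximization of $l(\theta)$ (a sketch with hypothetical counts, not data from the notes):

```python
from math import log

y1, y2, y3 = 180, 240, 80    # hypothetical MM, MN, NN counts
n = y1 + y2 + y3

theta_hat = (2 * y1 + y2) / (2 * n)   # closed-form maximum likelihood estimate

def l(theta):
    """Log likelihood (2*y1 + y2)*log(theta) + (y2 + 2*y3)*log(1 - theta)."""
    return (2 * y1 + y2) * log(theta) + (y2 + 2 * y3) * log(1 - theta)

grid = [k / 10000 for k in range(1, 10000)]
best = max(grid, key=l)
print(theta_hat, best)   # the grid maximum agrees with the formula
```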

2.5 Invariance Property of Maximum Likelihood Estimate

Many statistical problems involve the estimation of attributes of a population or process.
These attributes can often be represented as an unknown parameter or parameters in a
statistical model. The method of maximum likelihood gives us a general method for
estimating these unknown parameters. Sometimes the attribute of interest is a function of the
unknown parameters. Fortunately the method of maximum likelihood allows us to estimate
functions of unknown parameters with very little extra work. This property is called the
invariance property of maximum likelihood estimates and can be stated as follows:

Theorem 14 If $\hat\theta = (\hat\theta_1, \hat\theta_2, \ldots, \hat\theta_k)$ is the maximum likelihood estimate of $\theta = (\theta_1, \theta_2, \ldots, \theta_k)$
then $g(\hat\theta)$ is the maximum likelihood estimate of $g(\theta)$.

Example 2.5.1
Suppose we want to estimate attributes associated with BMI for some population of
individuals (for example, Canadian males age 21-35). If the distribution of BMI values in
the population is well described by a Gaussian model, $Y \sim G(\mu, \sigma)$, then by estimating $\mu$
and $\sigma$ we can estimate any attribute associated with the BMI distribution. For example:
(i) The mean BMI in the population corresponds to $\mu = E(Y)$ for the Gaussian distribution.
(ii) The median BMI in the population corresponds to the median of the Gaussian
distribution, which equals $\mu$ since the Gaussian distribution is symmetric about its mean.
(iii) For the BMI population, the 0.1 (population) quantile is $Q(0.1) = \mu - 1.28\sigma$. (To
see this, note that $P(Y \le \mu - 1.28\sigma) = P(Z \le -1.28) = 0.1$, where $Z = (Y - \mu)/\sigma$ has a
$G(0, 1)$ distribution.)
(iv) The fraction of the population with BMI over 35.0 is given by
$$p = 1 - \Phi\left(\frac{35.0 - \mu}{\sigma}\right)$$
where $\Phi$ is the cumulative distribution function for a $G(0, 1)$ random variable.

Suppose a random sample of 150 males gave observations $y_1, y_2, \ldots, y_{150}$ and that the
maximum likelihood estimates based on the results derived in Example 2.3.2 were
$$\hat\mu = \bar{y} = 27.1 \quad \text{and} \quad \hat\sigma = \left[\frac{1}{150}\sum_{i=1}^{150}(y_i - \bar{y})^2\right]^{1/2} = 3.56$$
The estimates of the attributes in (i)-(iv) would be:
(i) and (ii) $\hat\mu = \hat{m} = 27.1$
(iii) $\hat{Q}(0.1) = \hat\mu - 1.28\hat\sigma = 27.1 - 1.28(3.56) = 22.54$ and
(iv) $\hat{p} = 1 - \Phi\left(\frac{35.0 - \hat\mu}{\hat\sigma}\right) = 1 - \Phi(2.22) = 1 - 0.98679 = 0.01321$.
Note that (iii) and (iv) follow from the invariance property of maximum likelihood
estimates.
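The calculations in (iii) and (iv) can be reproduced with Python's statistics.NormalDist, which supplies $\Phi$ and its inverse (a sketch; the notes themselves work from Gaussian tables):

```python
from statistics import NormalDist

mu_hat, sigma_hat = 27.1, 3.56          # ML estimates from the BMI sample
fitted = NormalDist(mu=mu_hat, sigma=sigma_hat)

q10 = fitted.inv_cdf(0.1)               # estimated 0.1 quantile, roughly mu_hat - 1.28*sigma_hat
p35 = 1 - fitted.cdf(35.0)              # estimated fraction with BMI over 35.0
print(round(q10, 2), round(p35, 5))
```

The tail probability comes out near 0.0132 rather than exactly 0.01321 because the hand calculation above rounds $(35.0 - \hat\mu)/\hat\sigma$ to 2.22 before using the table.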

2.6 Checking the Model

The models used in this course are probability distributions for random variables that
represent variates in a population or process. A typical model has probability density
function $f(y; \theta)$ if the variate Y is continuous, or probability function $f(y; \theta)$ if Y is discrete,
where $\theta$ is (possibly) a vector of parameter values. If a family of models is to be used for some
purpose then it is important to check that the model adequately represents the variability
in Y. This can be done by comparing the model with random samples $y_1, y_2, \ldots, y_n$ of
y-values from the population or process.

Comparing Observed and Expected Frequencies

One method for checking how well the model fits the data is to compare observed
frequencies with the expected frequencies calculated using the assumed model. This method
is particularly useful for data that have arisen from a discrete probability model. The two
examples below illustrate the method.

Example 2.6.1 Rutherford and Geiger study of alpha-particles and the Poisson
model
In 1910 the physicists Ernest Rutherford and Hans Geiger conducted an experiment
in which they recorded the number of alpha particles emitted from a polonium source (as
detected by a Geiger counter) during 2608 time intervals each of length 1/8 minute. The
number of particles $j$ detected in a time interval and the frequency $f_j$ of that number of
particles is given in Table 2.3.
We can see whether a Poisson model fits these data by comparing the observed frequencies
with the expected frequencies calculated assuming a Poisson model. To calculate these
expected frequencies we need to specify the mean $\theta$ of the Poisson model. We estimate $\theta$
using the sample mean for the data, which is
$$\hat\theta = \frac{1}{2608}\sum_{j=0}^{14} j f_j = \frac{1}{2608}(10097) = 3.8715$$
The expected number of intervals in which $j$ particles is observed is
$$e_j = (2608)\frac{(3.8715)^j\, e^{-3.8715}}{j!} \quad \text{for } j = 0, 1, \ldots$$
The expected frequencies are also given in Table 2.3.
Since the observed and expected frequencies are reasonably close, the Poisson model
seems to fit these data well. Of course, we have not specified how close the expected and
observed frequencies need to be in order to conclude that the model is reasonable. We will
look at a formal method for doing this in Chapter 7.
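The expected-frequency column of Table 2.3 is straightforward to recompute (a sketch using the fitted mean quoted above):

```python
from math import exp, factorial

theta_hat = 3.8715          # sample mean number of particles per interval
n_intervals = 2608

def expected(j):
    """Expected number of intervals containing j particles under the fitted Poisson model."""
    return n_intervals * theta_hat ** j * exp(-theta_hat) / factorial(j)

# The first few values match the e_j column of Table 2.3 (54.3, 210.3, 407.1, ...)
for j in range(5):
    print(j, round(expected(j), 1))
```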

    Number of alpha-       Observed          Expected
    particles detected: j  Frequency: fj     Frequency: ej
     0                        57                54.3
     1                       203               210.3
     2                       383               407.1
     3                       525               525.3
     4                       532               508.4
     5                       408               393.7
     6                       273               254.0
     7                       139               140.5
     8                        45                68.0
     9                        27                29.2
    10                        10                11.3
    11                         4                 4.0
    12                         0                 1.3
    13                         1                 0.4
    14                         1                 0.1
    Total                   2608              2607.9

Table 2.3 Frequency table for Rutherford/Geiger data

This comparison of observed and expected frequencies to check the fit of a model can
also be used for data that have arisen from a continuous model. The following is an example.

Example 2.6.2 Lifetimes of brake pads and the Exponential model

Suppose we want to check whether an Exponential model is reasonable for modeling the
data in Example 1.3.4 on lifetimes of brake pads. To do this we need to estimate the mean $\theta$
of the Exponential distribution. We use the sample mean $\bar{y} = 49.0275$ to estimate $\theta$.
Since the lifetime Y is a continuous random variable taking on all real values greater
than zero, the intervals for the observed and expected frequencies are not obvious as they
were in the discrete case. For the lifetime of brake pads data we choose the same intervals
which were used to produce the relative frequency histogram in Example 1.3.4 except we
have collapsed the last four intervals into one interval $[120, +\infty)$. The intervals are given
in Table 2.4.
The expected frequency in the interval $[a_{j-1}, a_j)$ is calculated using
$$e_j = 200 \int_{a_{j-1}}^{a_j} \frac{1}{49.0275}\, e^{-y/49.0275}\, dy
     = 200\left(e^{-a_{j-1}/49.0275} - e^{-a_j/49.0275}\right)$$
The expected frequencies are also given in Table 2.4. We notice that the observed and
expected frequencies are not close in this case and therefore the Exponential model does
not seem to be a good model for these data.

    Interval        Observed          Expected
                    Frequency: fj     Frequency: ej
    [0, 15)              21              52.72
    [15, 30)             45              38.82
    [30, 45)             50              28.59
    [45, 60)             27              21.05
    [60, 75)             21              15.50
    [75, 90)              9              11.42
    [90, 105)            12               8.41
    [105, 120)            7               6.19
    [120, +∞)             8              17.3
    Total               200             200

Table 2.4: Frequency table for brake pad data
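The expected-frequency column of Table 2.4 can be recomputed from the interval formula above (a sketch using the fitted mean $\bar{y} = 49.0275$):

```python
from math import exp, inf

theta_hat = 49.0275         # sample mean lifetime from the brake pad data
edges = [0, 15, 30, 45, 60, 75, 90, 105, 120, inf]

def expected(a_lo, a_hi):
    """Expected count of 200 lifetimes in [a_lo, a_hi) under the fitted Exponential model."""
    # exp(-inf) evaluates to 0.0, so the final open-ended interval needs no special case
    return 200 * (exp(-a_lo / theta_hat) - exp(-a_hi / theta_hat))

for a_lo, a_hi in zip(edges, edges[1:]):
    print(a_lo, a_hi, round(expected(a_lo, a_hi), 2))
```

The computed values match the $e_j$ column (52.72 for the first interval, 17.3 for the last), and the mismatch with the observed counts is what flags the Exponential model as a poor fit.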

The drawback of this method for continuous data is that the intervals must be selected
and this adds a degree of arbitrariness to the method. The following graphical methods
provide better techniques for checking the fit of the model for continuous data.

Graphical Checks of Models

We may also use graphical techniques for checking the fit of a model. These methods are
particularly useful for continuous data.

Relative frequency histograms and probability density functions

The first graphical method is to superimpose the probability density function of the
proposed model on the relative frequency histogram of the data. Figure 2.5 gives the relative
frequency histogram of the female BMI data with a superimposed Gaussian probability
density function. Since the mean $\mu$ is unknown we estimate it using the sample mean $\bar{y} = 26.9$
and since the standard deviation $\sigma$ is unknown we estimate it using the sample standard
deviation $s = 4.60$.
Figure 2.6 gives the relative frequency histogram of the male BMI data with a superimposed
Gaussian probability density function. Since the mean $\mu$ is unknown we estimate it
using the sample mean $\bar{y} = 27.08$ and since the standard deviation $\sigma$ is unknown we estimate
it using the sample standard deviation $s = 3.56$. In both figures the relative frequency
histograms are in reasonable agreement with the superimposed Gaussian probability density
functions.
78 2. STATISTICAL MODELS AND MAXIMUM LIKELIHOOD ESTIMATION


Figure 2.5: Relative frequency histogram of female BMI data with Gaussian
p.d.f.


Figure 2.6: Relative frequency histogram of male BMI data with Gaussian p.d.f.

If we observe an obvious systematic departure between the relative frequency histogram
and the superimposed probability model, the nature of the systematic departure may
suggest a better model for the data. For example Figure 2.7 suggests that a more appropriate
model would be a model with a longer right tail than the Gaussian distribution.

The drawback of this technique is that the intervals for the relative frequency histogram
must be chosen.


Figure 2.7: Example of systematic departure from Gaussian model

Empirical cumulative distribution functions and cumulative distribution functions

A second graphical method which can be used to check the fit of a model is to plot the
empirical cumulative distribution function $\hat{F}(y)$ which was defined in Chapter 1 and then
superimpose on this a plot of the cumulative distribution function, $P(Y \le y; \theta) = F(y; \theta)$,
for the proposed model. If the graphs of the two functions differ a great deal, this would
suggest that the proposed model is a poor fit to the data. Systematic departures may also
suggest a better model for the data.

Example 2.6.3 Checking an Exponential($\theta$) model

Figure 2.8 is a graph of the empirical cumulative distribution function $\hat{F}(y)$ for the data
in Example 1.6.3 with an Exponential cumulative distribution function superimposed. The
unknown mean $\theta$ is estimated using the sample mean $\bar{y} = 3.5$. Not surprisingly (since the
data were randomly generated from an Exponential distribution) the agreement between
the two curves is very good.

Figure 2.9 is a graph of the empirical cumulative distribution function $\hat{F}(y)$ for the data
in Figure 2.7 with an Exponential cumulative distribution function superimposed. The
unknown mean $\theta$ is estimated using the sample mean $\bar{y} = 15.4$. In this case the agreement
between the two curves is very poor. The disagreement between the curves suggests that
the proposed Exponential model disagrees with the observed distribution in both tails of
the distribution.


Figure 2.8: Empirical c.d.f. and Exponential c.d.f.


Figure 2.9: Empirical c.d.f. and Exponential c.d.f.

Example 2.6.4 Old Faithful data

Consider data for the time in minutes between 300 eruptions of the geyser Old Faithful
in Yellowstone National Park, between the first and the fifteenth of August 1985. The
data are available in the file oldfaithfuldata.txt posted on the course website. The empirical
cumulative distribution function for the data is plotted in Figure 2.10. One might
hypothesize that the distribution of times between consecutive eruptions follows a Gaussian
distribution. To see how well a Gaussian model fits the data we could superimpose a
Gaussian cumulative distribution function on the plot of the empirical cumulative distribution
function. To do this we need to estimate the parameters $\mu$ and $\sigma$ of the Gaussian
model since they are unknown. We estimate the mean $\mu$ using the sample mean $\bar{y} = 72.3$
and the standard deviation $\sigma$ using the sample standard deviation $s = 13.9$. In Figure 2.10

the cumulative distribution function of a G (72:3; 13:9) random variable is superimposed


on the empirical cumulative distribution function for the data. There is poor agreement
between the two curves suggesting a Gaussian model is not suitable for these data.


Figure 2.10: Empirical c.d.f. of times between eruptions of Old Faithful and
Gaussian c.d.f.

The relative frequency histogram in Figure 2.11 indicates that the distribution
of the times appears to have two modes. The plot of the empirical cumulative distribution
function does not show the shape of the distribution as clearly as the histogram.


Figure 2.11: Relative frequency histogram for times between eruptions of Old
Faithful and Gaussian p.d.f.

Example 2.6.5 Heights of females

For the data on female heights in Chapter 1 and using the results from Example 2.3.2
we obtain $\hat\mu = 1.62$, $\hat\sigma = 0.064$ as the maximum likelihood estimates of $\mu$ and $\sigma$. Figure
2.12 shows a plot of the empirical cumulative distribution function with the $G(1.62, 0.064)$
cumulative distribution function superimposed.


Figure 2.12: Empirical c.d.f. of female heights and Gaussian c.d.f.


Figure 2.13: Relative frequency histogram of female heights and Gaussian p.d.f.

Figure 2.13 shows a relative frequency histogram for these data with the $G(1.62, 0.064)$
probability density function superimposed. The two types of plots give complementary but
consistent pictures. An advantage of the distribution function comparison is that the exact
heights in the sample are used, whereas in the histogram plot the data are grouped into
intervals to form the histogram. However, the histogram and probability density function
show the distribution of heights more clearly. Both graphs indicate that a Gaussian model
seems reasonable for these data.

Qqplots for checking Gaussian model

Since the Gaussian model is used frequently for modeling data we look at one more graphical
technique called a (Gaussian) qqplot for checking how well a Gaussian model …ts a set of
data. The idea behind this method is that we expect the empirical cumulative distribution
function and the cumulative distribution for a Gaussian random variable to agree if a
Gaussian model is appropriate for the data as we saw in Figure 2.12. Deciding if two curves
are in agreement is usually more di¢ cult than deciding if a set of points lie along a straight
line. A qqplot is a graph for which the expected plot would reasonably be a straight line
plot if the Gaussian model is a good …t.
Suppose for the moment that we want to check if a G(μ, σ) model fits the set of data
{y1, y2, ..., yn} where μ and σ are known. As usual we let {y(1), y(2), ..., y(n)} represent
the order statistic, that is, the data ordered from smallest to largest. Let Q(p) be the pth
(theoretical) quantile for the G(μ, σ) distribution, that is, Q(p) satisfies P(Y ≤ Q(p)) = p
where Y ~ G(μ, σ). Recall also that q(p) is the pth sample quantile defined in Chapter
1. If the Gaussian model is appropriate then, for a reasonable size data set, we would
expect Q(0.5) = median = μ to be close in value to the sample quantile q(0.5) = sample
median, Q(0.25) to be close in value to the lower quartile q(0.25), Q(0.75) to be close in
value to the upper quartile q(0.75), and so on. More generally we would expect Q(i/(n+1))
to be close in value to the sample quantile q(i/(n+1)) (see Definition 1) for i = 1, 2, ..., n.
(Note that we use i/(n+1) rather than i/n since Q(1) = ∞.) For a reasonably large data set
we also have q(i/(n+1)) ≈ y(i), i = 1, 2, ..., n. Therefore if the Gaussian model fits the data
we expect Q(i/(n+1)) to be close in value to q(i/(n+1)), i = 1, 2, ..., n. If we plot the points
(Q(i/(n+1)), q(i/(n+1))), i = 1, 2, ..., n then we should see a set of points that lie reasonably
along a straight line.
But what if μ and σ are unknown? Let Qz(p) be the pth quantile for the G(0, 1)
distribution. We know that if Y ~ G(μ, σ) then (Y − μ)/σ ~ G(0, 1) and therefore
Q(p) = μ + σ Qz(p). Therefore if we plot the points (Qz(i/(n+1)), q(i/(n+1))), i = 1, 2, ..., n
we should still see a set of points that lie reasonably along a straight line (now with
intercept μ and slope σ) if a Gaussian model is a reasonable model for the data. Such a
plot is called a (Normal) qqplot. The advantage of a qqplot is that the unknown
parameters μ and σ do not need to be estimated.
Qqplots exist for other models but we only use Gaussian qqplots.
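To make the construction concrete, the following sketch builds the qqplot coordinates by hand for a simulated sample (written in Python for illustration; the plotting itself is done in R with qqnorm in Problems 20 and 21). The G(−2, 3) parameters, the seed, and the sample size are arbitrary choices, not part of the notes:

```python
import random
from statistics import NormalDist

random.seed(1)
n = 100
# Ordered sample y_(1) <= ... <= y_(n) from a G(-2, 3) distribution
y = sorted(random.gauss(-2, 3) for _ in range(n))

# Theoretical G(0,1) quantiles Q_z(i/(n+1)), i = 1, ..., n; n+1 avoids p = 1
q = [NormalDist().inv_cdf(i / (n + 1)) for i in range(1, n + 1)]

# Plotting the pairs (q[i], y[i]) gives the qqplot.  For Gaussian data the
# least-squares line through these points has slope near sigma = 3 and
# intercept near mu = -2.
qbar, ybar = sum(q) / n, sum(y) / n
slope = (sum((qi - qbar) * (yi - ybar) for qi, yi in zip(q, y))
         / sum((qi - qbar) ** 2 for qi in q))
intercept = ybar - slope * qbar
```

Reading the fitted slope and intercept as rough estimates of σ and μ is exactly why the points fall near a straight line for Gaussian data.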

Since reading qqplots requires some experience, it is a good idea to generate many plots
where we know the correct answer. This can be done by generating data from a known
distribution and then plotting a qqplot. See Chapter 2, Problems 20 and 21. A qqplot of
100 observations randomly generated from a G(−2, 3) distribution is given in Figure 2.14.
84 2. STATISTICAL MODELS AND MAXIMUM LIKELIHOOD ESTIMATION

The theoretical quantiles are plotted on the horizontal axis and the empirical or sample
quantiles are plotted on the vertical axis. The line in the qqplot is the line joining the lower
and upper quartiles of the empirical and Gaussian distributions, that is, the line joining
(Qz(0.25), q(0.25)) and (Qz(0.75), q(0.75)) where Qz(0.75) = 0.674.


Figure 2.14: Qqplot of a random sample of 100 observations from a G(−2, 3) distribution

We do not expect the points to lie exactly along a straight line since the sample quantiles
are based on the observed data, which in general will be different every time the experiment
is conducted. We only expect Q(i/(n+1)) to be close in value to the sample quantile q(i/(n+1))
for a reasonably large data set.


Figure 2.15: Quantiles of the G(0, 1) distribution for n = 15



As well, the points at both ends of the line can be expected to lie further from the line
since the quantiles of the Gaussian distribution change in value more rapidly in the tails
of the distribution. To understand this consider Figure 2.15. The area under a G(0, 1)
probability density function, which is equal to one, has been divided into sixteen areas all
of the same size equal to 1/16. The theoretical quantiles Q(i/16), i = 1, 2, ..., 15 can be
read from the z axis. For example, Q(1/16) = −1.53 and Q(10/16) = 0.32. Since the area
under the G(0, 1) probability density function is more concentrated near zero, the values
of the quantiles increase more quickly in the tails of the distribution. In Figure 2.15 this
is illustrated by the vertical lines being closer together near z = 0 and further apart for
z < −1 and z > 1. This means we would not expect the sample quantiles in both tails to
be as close to the theoretical quantiles as compared to what we observe in the center of the
distribution.

A qqplot of the female heights is given in Figure 2.16. Overall the points lie reasonably
along a straight line, with the points at both ends lying somewhat further from the line,
which is what we expect. As was the case for the relative frequency histogram and the empirical
cumulative distribution function, the qqplot indicates that the Gaussian model is reasonable
for these data. Since the heights in meters are rounded to two decimal places there are
many repeated values in the dataset. The repeated values result in the qqplot looking like
a set of small steps.


Figure 2.16: Qqplot of heights of females



A qqplot of 100 observations randomly generated from an Exponential(1) distribution
is given in Figure 2.17. We notice that the points form a U-shape. This is typical of data
which are best modeled by an Exponential distribution.

Figure 2.17: Qqplot of a random sample of 100 observations from an Exponential(1)
distribution


To understand why this happens, the area under an Exponential(1) probability density
function has been divided into sixteen areas all of the same size equal to 1/16 in Figure
2.18. The theoretical quantiles can be read from the x axis. The values of the quantiles
increase more quickly in the right tail of the distribution.


Figure 2.18: Quantiles of the Exponential (1) distribution for n = 15

If we plot the theoretical quantiles of an Exponential(1) distribution versus the theoretical
quantiles of a G(0, 1) distribution for n = 15 we obtain the U-shaped graph in Figure 2.19.
Since we are using the theoretical quantiles for both distributions the points
lie along a curve. For real data the qqplot would look similar to the plot in Figure 2.17. In
general if a dataset has a relative frequency histogram with a long right tail then the qqplot
will exhibit this U-shape behaviour. Such a qqplot suggests that a Gaussian model is not
reasonable for the data and a model with a long right tail like the Exponential distribution
would be more suitable.
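This behaviour can be checked numerically from the two quantile functions alone: the secant slopes of the Exponential-versus-Gaussian quantile curve increase from left to right, which is exactly the convex U-shape seen in Figure 2.19. A sketch (in Python for illustration; only the two quantile formulas are inputs):

```python
import math
from statistics import NormalDist

# Theoretical quantiles at p_i = i/16, i = 1, ..., 15 (the n = 15 case)
ps = [i / 16 for i in range(1, 16)]
q_exp = [-math.log(1 - p) for p in ps]        # Exponential(1): Q(p) = -ln(1 - p)
q_z = [NormalDist().inv_cdf(p) for p in ps]   # G(0,1) quantiles Q_z(p)

# Secant slopes between successive points of the quantile-quantile curve;
# a convex (U-shaped) curve has increasing secant slopes
slopes = [(q_exp[i + 1] - q_exp[i]) / (q_z[i + 1] - q_z[i]) for i in range(14)]
is_convex = all(b > a for a, b in zip(slopes, slopes[1:]))
```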


Figure 2.19: Exponential versus Gaussian quantiles

A qqplot of the lifetimes of brake pads (Example 1.3.4) is given in Figure 2.20. The
points form a U-shaped curve. This pattern is consistent with the long right tail and
positive skewness that we observed before. The Gaussian model is not a reasonable model
for these data.


Figure 2.20: Qqplot of lifetimes of brake pads



A qqplot of 100 observations randomly generated from a Uniform(0, 1) distribution is
given in Figure 2.21.


Figure 2.21: Qqplot of a random sample of 100 observations from a Uniform(0, 1) distribution

We notice that the points form an S-shape. This is typical of data which are best
modeled by a Uniform distribution.
To understand why this happens, the area under a Uniform(0, 1) probability density
function has been divided into sixteen areas all of the same size equal to 1/16 in Figure
2.22. The theoretical quantiles can be read from the x axis. The values of the quantiles
increase uniformly.


Figure 2.22: Quantiles of the Uniform(0, 1) distribution for n = 15

If we plot the theoretical quantiles of a Uniform(0, 1) distribution versus the theoretical
quantiles of a G(0, 1) distribution for n = 15 we obtain the S-shaped graph in Figure
2.23. Since we are using the theoretical quantiles for both distributions the points lie along
a curve. For real data the qqplot would look similar to the plot in Figure 2.21. In general if
a dataset has a relative frequency histogram which is quite symmetric and with short tails
then the qqplot will exhibit this S-shape behaviour. Such a qqplot suggests that a Gaussian
model is not reasonable for these data and a model such as the Uniform distribution would
be more suitable.
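The S-shape can also be verified numerically: the secant slopes of the Uniform-versus-Gaussian quantile curve rise on the left half of the plot and fall on the right half. A sketch (in Python for illustration):

```python
from statistics import NormalDist

ps = [i / 16 for i in range(1, 16)]            # p_i = i/16 for the n = 15 case
q_u = ps[:]                                    # Uniform(0,1): Q(p) = p
q_z = [NormalDist().inv_cdf(p) for p in ps]    # G(0,1) quantiles

slopes = [(q_u[i + 1] - q_u[i]) / (q_z[i + 1] - q_z[i]) for i in range(14)]
# Convex to the left of center, concave to the right: the S-shape
left_rising = all(b > a for a, b in zip(slopes[:7], slopes[1:7]))
right_falling = all(b < a for a, b in zip(slopes[7:], slopes[8:]))
```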


Figure 2.23: Uniform versus Gaussian quantiles

A qqplot of the times between eruptions of Old Faithful is given in Figure 2.24. The
points do not lie along a straight line which indicates as we saw before that the Gaussian
model is not a reasonable model for these data. The two places at which the shape of
the points changes direction correspond to the two modes of these data that we observed
previously.


Figure 2.24: Qqplot of times between eruptions of Old Faithful



2.7 Chapter 2 Problems


1. To find maximum likelihood estimates we usually find θ̂ such that (d/dθ) log L(θ) = 0.
For each of the functions G(θ) given below, find the value of θ which maximizes G(θ)
by finding the value of θ which maximizes g(θ) = log G(θ). Use the First Derivative
Test to verify that the value corresponds to a maximum. Note: a and b are positive
real numbers.

(a) G(θ) = θ^a (1 − θ)^b, 0 ≤ θ ≤ 1
(b) G(θ) = θ^a e^(−b/θ), θ > 0
(c) G(θ) = θ^a e^(−bθ), θ ≥ 0
(d) G(θ) = e^(−a(θ−b)²), θ ∈ ℝ

2. If y successes are observed in a Binomial experiment with n trials and θ = P(success),
the likelihood function for θ is

L(θ) = θ^y (1 − θ)^(n−y) for 0 ≤ θ ≤ 1

If y = 1, 2, ..., n − 1, the maximum likelihood estimate of θ is θ̂ = y/n, which is found
by solving (d/dθ) L(θ) = 0 or equivalently (d/dθ) log L(θ) = 0. Show that if y = 0 or y = n
then the maximum likelihood estimate is not found by solving (d/dθ) log L(θ) = 0 but
the maximum likelihood estimate is still θ̂ = y/n.
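The boundary cases can be seen numerically by evaluating L(θ) on a grid of θ values: for y = 0 the maximum occurs at the endpoint θ = 0 and for y = n at θ = 1, where the derivative is not zero, while an interior y gives the usual y/n. A quick check (in Python for illustration; n = 10 and the grid resolution are arbitrary choices):

```python
n = 10
grid = [i / 1000 for i in range(1001)]   # theta values 0, 0.001, ..., 1
theta_hat = {}
for y in (0, 3, 10):
    # Likelihood L(theta) = theta^y (1 - theta)^(n - y) on the grid
    L = [t ** y * (1 - t) ** (n - y) for t in grid]
    theta_hat[y] = grid[L.index(max(L))]  # grid point maximizing L(theta)
# theta_hat[y] equals y/n in every case, including the endpoints y = 0 and y = n
```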

3. Consider the following two experiments whose purpose was to estimate θ, the fraction
of a large population with blood type B.
Experiment 1: Individuals were selected at random until 10 with blood type B were
found. The total number of people examined was 100.
Experiment 2: One hundred individuals were selected at random and it was found
that 10 of them have blood type B.

(a) Find the likelihood function for θ for each experiment and show that the like-
lihood functions are proportional. Show that the maximum likelihood estimate θ̂ is
the same in each case.
(b) Suppose n people came to a blood donor clinic. Assuming θ = 0.10, use the Nor-
mal approximation to the Binomial distribution (remember to use a continuity
correction) to determine how large n should be to ensure that the probability of
getting 10 or more donors with blood type B is at least 0.90. Use the R function
pbinom to determine the exact value of n.
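The exact calculation that pbinom performs can be sketched directly (in Python for illustration): search for the smallest n with P(X ≥ 10) ≥ 0.90 when X ~ Binomial(n, 0.1).

```python
import math

def prob_at_least_10(n, theta=0.10):
    # P(X >= 10) for X ~ Binomial(n, theta), computed via the complement
    return 1 - sum(math.comb(n, x) * theta ** x * (1 - theta) ** (n - x)
                   for x in range(10))

n = 10
while prob_at_least_10(n) < 0.90:
    n += 1
# n is now the smallest clinic size with P(10 or more type B donors) >= 0.90;
# compare it with the value suggested by the Normal approximation
```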

4. Specimens of a high-impact plastic are tested by repeatedly striking them with a hammer
until they fracture. Let Y = the number of blows required to fracture a specimen.
If the specimen has a constant probability θ of surviving a blow, independently of the
number of previous blows received, then the probability function for Y is

f(y; θ) = P(Y = y; θ) = θ^(y−1) (1 − θ) for y = 1, 2, ...; 0 ≤ θ < 1

(a) For observed data y1, y2, ..., yn, find the likelihood function L(θ) and the maxi-
mum likelihood estimate θ̂.
(b) Find the relative likelihood function R(θ). Plot R(θ) if n = 200 and Σ(i=1 to 200) yi = 400.
(c) Estimate the probability that a specimen fractures on the first blow using the
data in (b).

5. In modelling the number of transactions of a certain type received by a central computer
for a company with many on-line terminals, the Poisson distribution can be used.
If the transactions arrive at random at the rate of λ per minute then the probability
of y transactions in a time interval of length t minutes is

P(Y = y; λ) = f(y; λ) = ((λt)^y / y!) e^(−λt) for y = 0, 1, ... and λ ≥ 0

(a) The numbers of transactions received in 10 separate one minute intervals were
8, 3, 2, 4, 5, 3, 6, 5, 4, 1. Write down the likelihood function for λ and find the
maximum likelihood estimate λ̂.
(b) Estimate the probability that no transactions arrive during a two-minute interval
using the data in (a).
(c) Use the R function rpois with the value λ = 4.1 to simulate the number of
transactions received in 100 one minute intervals. Calculate the sample mean
and sample variance. Are they approximately the same?
(Note that the mean and variance are equal for the Poisson model.)
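Once part (a) is done, the arithmetic can be checked numerically. The sketch below (in Python for illustration; the course software is R) relies on the standard result that the Poisson maximum likelihood estimate of a rate is the sample mean:

```python
import math

counts = [8, 3, 2, 4, 5, 3, 6, 5, 4, 1]   # transactions in 10 one-minute intervals
lam_hat = sum(counts) / len(counts)       # Poisson MLE of the rate: the sample mean

# Estimated probability of no transactions in a two-minute interval:
# a Poisson count with mean 2*lam_hat is zero with probability exp(-2*lam_hat)
p_none = math.exp(-2 * lam_hat)
```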

6. Suppose y1, y2, ..., yn is an observed random sample from the distribution with prob-
ability density function

f(y; θ) = (2y/θ) e^(−y²/θ) for y > 0 and θ > 0

(a) Find the likelihood function L(θ) and the maximum likelihood estimate θ̂.
(b) Find the relative likelihood function R(θ).
(c) Plot R(θ) for n = 20 and Σ(i=1 to 20) yi² = 72.

7. Suppose y1, y2, ..., yn is an observed random sample from the G(μ, σ) distribution.

(a) If σ is known, find the likelihood function L(μ) and the maximum likelihood
estimate μ̂.
(b) If μ is known, find the likelihood function L(σ) and the maximum likelihood
estimate σ̂.

8. Suppose y1, y2, ..., yn is an observed random sample from the distribution with prob-
ability density function

f(y) = (θ + 1) y^θ for 0 < y < 1 and θ > −1

(a) Find the likelihood function L(θ) and the maximum likelihood estimate θ̂.
(b) Find the log relative likelihood function r(θ) = log R(θ). If n = 15 and
Σ(i=1 to 15) log yi = −34.5 then plot r(θ).

9. Suppose that in a population of twins, males (M) and females (F) are equally likely
to occur and that the probability that a pair of twins is identical is α. If twins are
not identical, their sexes are independent.

(a) Show that

P(MM) = P(FF) = (1 + α)/4 and P(MF) = (1 − α)/2

(b) Suppose that n pairs of twins are randomly selected; it is found that n1 are MM,
n2 are FF, and n3 are MF, but it is not known whether each set is identical or
fraternal. Use these data to find the maximum likelihood estimate of α. What
is the value of α̂ if n = 50 and n1 = 16, n2 = 16, n3 = 18?

10. When Wayne Gretzky played for the Edmonton Oilers (1979-88) he scored an incred-
ible 1669 points in 696 games. The data are given in the frequency table below:
Number of points Observed number of
in a game: y games with y points: fy
0 69
1 155
2 171
3 143
4 79
5 57
6 14
7 6
8 2
9 0
Total 696

The Poisson(λ) model has been proposed for the random variable Y = number of
points Wayne scores in a game.

(a) Show that the likelihood function for λ based on the Poisson model and the data
in the frequency table simplifies to

L(λ) = λ^1669 e^(−696λ) for λ ≥ 0

What does the parameter λ represent?

(b) Find the maximum likelihood estimate of λ.
(c) Determine the expected frequencies based on the Poisson model and λ = λ̂.
Comment on how well the Poisson model fits the data. What does this imply
about the type of hockey player Wayne was during his time with the Edmonton
Oilers? (Recall the assumptions for a Poisson process.)
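The expected-frequency computation asked for in part (c) can be sketched as follows (in Python for illustration; in R, 696*dpois(0:9, 1669/696) gives the same numbers):

```python
import math

obs = {0: 69, 1: 155, 2: 171, 3: 143, 4: 79, 5: 57, 6: 14, 7: 6, 8: 2, 9: 0}
n_games = sum(obs.values())          # 696 games
lam_hat = 1669 / 696                 # MLE: total points / total games

def pois_pmf(y, lam):
    # Poisson probability function: lam^y e^{-lam} / y!
    return lam ** y * math.exp(-lam) / math.factorial(y)

# Expected number of games with y points under the fitted Poisson model
expected = {y: n_games * pois_pmf(y, lam_hat) for y in obs}
```

Comparing each expected[y] with the observed frequency is the informal goodness-of-fit check the problem asks for.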

11. Here are the data for Sidney Crosby playing for the Pittsburgh Penguins in the years
2005-2016.
Number of points Observed number of
in a game: y games with y points: fy
0 219
1 259
2 185
3 90
4 24
5 4
6 2
7 0
Total 783
How well does the Poisson model fit these data?

12. The following model has been proposed for the distribution of Y = the number of
children in a family, for a large population of families:

P(Y = 0; θ) = (1 − 2θ)/(1 − θ); P(Y = y; θ) = θ^y for y = 1, 2, ... and 0 ≤ θ < 1/2 (2.7)

(a) What does the parameter θ represent?
(b) Suppose that n families are selected at random and the observed data were

y:  0   1   ...  ymax      > ymax  Total
fy: f0  f1  ...  f(ymax)   0       n

where fy = the observed number of families with y children and ymax = maximum
number of children observed in a family. Find the probability of observing these
data and thus determine the maximum likelihood estimate of θ.

(c) Consider a different type of sampling in which a single child is selected at random
and then the number of offspring in that child's family is determined. Let
X = the number of children in the family of a randomly chosen child. Assuming
that the model (2.7) holds, show that

P(X = x; θ) = c x θ^(x−1) for x = 1, 2, ... and 0 ≤ θ < 1/2

where c = (1 − θ)².
Hint: How do you determine the mean of a Geometric random variable?

(d) Suppose that the type of sampling in part (c) was used and that the following
data were obtained:

x:  1   2  3  4  > 4  Total
fx: 22  7  3  1  0    33

Find the probability of observing these data and thus determine the maximum
likelihood estimate of θ. Estimate the probability a couple has no children using
these data.
(e) Suppose the sample in (d) was incorrectly assumed to have arisen from the
sampling plan in (b). What would θ̂ be found to be? This problem shows that
the way the data have been collected can affect the model.

13. Radioactive particles are emitted randomly over time from a source at an average rate
of λ per second. In n time periods of varying lengths t1, t2, ..., tn (seconds), the num-
bers of particles emitted (as determined by an automatic counter) were y1, y2, ..., yn
respectively. Let Yi = the number of particles emitted in time interval i of length
ti, i = 1, 2, ..., n. Suppose it is reasonable to assume that Yi has a Poisson(λti)
distribution, i = 1, 2, ..., n independently.

(a) Show that the likelihood function for λ based on the Poisson model and the data
(yi, ti), i = 1, 2, ..., n can be simplified to

L(λ) = λ^(nȳ) e^(−nt̄λ) for λ ≥ 0

where ȳ = (1/n) Σ(i=1 to n) yi and t̄ = (1/n) Σ(i=1 to n) ti. Find the maximum likelihood estimate of λ.
(b) Suppose that the intervals are all of equal length (t1 = t2 = ... = tn = t) and that
instead of knowing the yi's, we know only whether or not there were one or more
particles emitted in each time interval of length t. Find the likelihood function
for λ based on these data, and determine the maximum likelihood estimate of λ.
2.7. CHAPTER 2 PROBLEMS 95

14. Run the following R code for checking the Gaussian model using numerical and graph-
ical summaries.
# Gaussian Data Example
# truehist() is in the MASS package; skewness() and kurtosis() are in the
# moments package
library(MASS)
library(moments)
set.seed(456458)
yn<-rnorm(200,5,2) # 200 observations from G(5,2) distribution
c(mean(yn),sd(yn)) # display sample mean and standard deviation
skewness(yn) # sample skewness
kurtosis(yn) # sample kurtosis
fivenum(yn) # five number summary
IQR(yn) # IQR
#plot relative frequency histogram and superimpose Gaussian pdf
truehist(yn,main="Relative Frequency Histogram of Data")
curve(dnorm(x,mean(yn),sd(yn)),col="red",add=T,lwd=2)
#plot empirical cdf and superimpose Gaussian cdf
plot(ecdf(yn),verticals=T,do.points=F,xlab="y",ylab="ecdf",main="")
title(main="Empirical and Gaussian C.D.F.'s")
curve(pnorm(x,mean(yn),sd(yn)),add=T,col="red",lwd=2)
#plot qqplot of the data
qqnorm(yn,xlab="Standard Normal Quantiles",main="Qqplot of Data")
qqline(yn,col="red",lwd=1.5) # add line for comparison
#
#
# Exponential Data Example
ye<-rexp(200,1/5) # 200 observations from Exponential(5) dist'n
c(mean(ye),sd(ye)) # display sample mean and standard deviation
skewness(ye) # sample skewness
kurtosis(ye) # sample kurtosis
fivenum(ye) # five number summary
IQR(ye) # IQR
#plot relative frequency histogram and superimpose Gaussian pdf
truehist(ye,main="Relative Frequency Histogram of Data")
curve(dnorm(x,mean(ye),sd(ye)),col="red",add=T,lwd=2)
#plot empirical cdf and superimpose Gaussian cdf
plot(ecdf(ye),verticals=T,do.points=F,xlab="y",ylab="ecdf",main="")
title(main="Empirical and Gaussian C.D.F.'s")
curve(pnorm(x,mean(ye),sd(ye)),add=T,col="red",lwd=2)
#plot qqplot of the data
qqnorm(ye,xlab="Standard Normal Quantiles",main="Qqplot of Data")
qqline(ye,col="red") # add line for comparison in red

For both examples assume that you don’t know how the data were generated. Use the
numerical and graphical summaries obtained by running the R code to assess whether
it is reasonable to assume that the data have approximately a Gaussian distribution.
Support your conclusion with clear reasons written in complete sentences.

15. The marks out of 30 for 100 students on a tutorial test in STAT 231 were:

3 5 11.5 13 13 13 13.5 13.5 13.5 13.5
14 14 14.5 14.5 14.5 15 15 15 15.5 15.5
15.5 16 16 16 16 16.5 16.5 17 17 17
17 17 17 17 17 17 17.5 17.5 18 18
18.5 18.5 18.5 18.5 19 19 19 19 19 19.5
19.5 19.5 20 20 20 20 20 20 20 20
20 20 20.5 20.5 20.5 20.5 21 21 21 21.5
21.5 21.5 22 22 22 22 22 22.5 22.5 22.5
23 23 23 23 23 23.5 24.5 25 25 25
25 25 25.5 26 26 26 26.5 27 27 30

The data are available in the file tutorialtestdata.txt posted on the course website.
For these data

Σ(i=1 to 100) yi = 1913 and Σ(i=1 to 100) yi² = 38556

The sample skewness is −0.50 and the sample kurtosis is 4.32.


A boxplot and qqplot of the data are given in Figures 2.25 and 2.26.

(a) Determine the five-number summary for these data.
(b) Determine the sample mean ȳ and the sample standard deviation s for these
data.
(c) Determine the proportion of observations in the interval [ȳ − s, ȳ + s].
Compare this with P(Y ∈ [μ − σ, μ + σ]) where Y ~ G(μ, σ).
(d) Find the interquartile range (IQR) for these data. Show that for Normally
distributed data IQR = 1.349σ. How well do these data satisfy this relationship?
(e) Using both the numerical and graphical summaries for these data, assess whether
it is reasonable to assume that the data are approximately Normally distributed.
Be sure to support your conclusion with clear reasons.
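Parts (b)–(d) need only the two sums given above, via the identity s² = (Σ yi² − nȳ²)/(n − 1). A quick numerical check of this identity (in Python, with arbitrary simulated data):

```python
import random

random.seed(7)
y = [random.gauss(20, 4) for _ in range(100)]
n = len(y)
ybar = sum(y) / n

# Definition of the sample variance ...
s2_direct = sum((v - ybar) ** 2 for v in y) / (n - 1)
# ... and the shortcut that needs only sum(y) and sum(y^2)
s2_shortcut = (sum(v * v for v in y) - n * ybar ** 2) / (n - 1)
```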


Figure 2.25: Boxplot of tutorial test marks


Figure 2.26: Qqplot of tutorial test marks



16. In a study of osteoporosis, the heights in centimeters of a sample of 351 elderly
women randomly selected from a community were recorded. The observed data are
given below. The data are available in the file osteoporosisdata.txt posted on the
course website.
Heights of Elderly Women
142 145 145 145 146 147 147 147 147 148 148 149 150 150 150
150 150 150 151 151 151 151 151 151 152 152 152 152 152 152
152 152 152 152 152 152 153 153 153 153 153 153 153 153 153
153 153 153 153 153 153 153 153 154 154 154 154 154 154 154
154 154 154 154 155 155 155 155 155 155 155 155 155 155 155
155 155 155 155 155 155 155 155 155 155 156 156 156 156 156
156 156 156 156 156 156 156 156 156 156 156 156 156 156 156
157 157 157 157 157 157 157 157 157 157 157 157 157 157 157
157 157 157 157 157 158 158 158 158 158 158 158 158 158 158
158 158 158 158 158 158 158 158 158 158 158 158 158 158 158
158 158 158 158 158 158 159 159 159 159 159 159 159 159 159
159 159 159 159 159 159 159 159 160 160 160 160 160 160 160
160 160 160 160 160 160 160 160 160 160 160 160 160 160 161
161 161 161 161 161 161 161 161 161 161 161 161 161 161 161
161 161 161 161 162 162 162 162 162 162 162 162 162 162 162
162 162 162 162 162 162 162 163 163 163 163 163 163 163 163
163 163 163 163 163 163 163 163 163 163 163 163 163 163 163
163 163 163 163 163 163 163 164 164 164 164 164 164 164 164
164 164 164 164 164 164 164 164 164 165 165 165 165 165 165
165 165 165 165 165 165 165 165 165 165 165 165 166 166 166
166 166 166 166 166 166 166 166 167 167 167 167 167 167 167
168 168 168 168 168 168 169 169 169 169 169 169 169 169 170
170 170 170 170 170 170 170 170 170 170 171 171 171 173 174
173 174 176 177 178 178
For these data

Σ(i=1 to 351) yi = 56081 and Σ(i=1 to 351) yi² = 8973063

(a) Determine the sample mean ȳ and the sample standard deviation s for these
data.
(b) Determine the proportion of observations in the intervals [ȳ − s, ȳ + s] and
[ȳ − 2s, ȳ + 2s]. Compare these proportions with P(Y ∈ [μ − σ, μ + σ]) and
P(Y ∈ [μ − 2σ, μ + 2σ]) where Y ~ G(μ, σ).
(c) Find the sample skewness and sample kurtosis for these data. Are these values
close to what you would expect for Normally distributed data?
(d) Find the five-number summary for these data.
(e) Find the IQR for these data. Does the IQR agree with what you expect for
Normally distributed data?
(f) Construct a relative frequency histogram and superimpose a Gaussian probabil-
ity density function with μ = ȳ and σ = s.
(g) Construct an empirical distribution function for these data and superimpose a
Gaussian cumulative distribution function with μ = ȳ and σ = s.
(h) Draw a boxplot for these data.
(i) Plot a qqplot for these data. Do you observe anything unusual about the qqplot?
What might cause this?
(j) Based on the above information indicate whether it is reasonable to assume a
Gaussian distribution for these data.

17. Consider the data on heights of adult males and females from Chapter 1. The data
are available in the file bmidata.txt posted on the course website.

(a) Assume that for each sex the heights in the population from which the samples
were drawn can be modeled by a Gaussian distribution. Obtain the maximum
likelihood estimates of the mean and standard deviation in each case.
(b) Give the maximum likelihood estimates for q(0.1) and q(0.9), the 10th and 90th
percentiles of the height distribution, for males and for females.
(c) Give the maximum likelihood estimate for the probability P(Y > 1.83) for males
and females (i.e. the fraction of the population over 1.83 m, or 6 ft).
(d) A simpler estimate of P(Y > 1.83) that does not use the Gaussian model is

(number of persons in sample with y > 1.83) / n

where n = 150. Obtain these estimates for males and for females. Can you think
of any advantages for this estimate over the one in part (c)? Can you think of
any disadvantages?
(e) Suggest and try a method of estimating the 10th and 90th percentiles of the
height distribution that is similar to that in part (d).

18. The qqplot of the brake pad data in Figure 2.20 indicates that the Normal distribution
is not a reasonable model for these data. Sometimes transforming the data gives a
data set for which the Normal model is more reasonable. A log transformation is often
used. Plot a qqplot of the log lifetimes and indicate whether the Normal distribution
is a reasonable model for these data. The data are posted on the course website.

19. In a large population of males ages 40–50, the proportion who are regular smokers is
α where 0 ≤ α ≤ 1 and the proportion who have hypertension (high blood pressure)
is β where 0 ≤ β ≤ 1. If the events S (a person is a smoker) and H (a person has
hypertension) are independent, then for a man picked at random from the population
the probabilities that he falls into the four categories SH, SH̄, S̄H, S̄H̄ are, respectively,
αβ, α(1 − β), (1 − α)β, (1 − α)(1 − β). Explain why this is true.

(a) Suppose that 100 men are selected and the numbers in each of the four categories
are as follows:

Category:  SH  SH̄  S̄H  S̄H̄
Frequency: 20  15  22  43

Assuming that S and H are independent events, determine the likelihood func-
tion for α and β based on the Multinomial distribution, and find the maximum
likelihood estimates of α and β.
(b) Compute the expected frequencies for each of the four categories using the max-
imum likelihood estimates. Do you think the model used is appropriate? Why
might it be inappropriate?

20. Run the following R code:


par(mfrow=c(2,2))
for (i in 1:4) {
qqnorm(rnorm(30),xlab='G(0,1) Quantiles',main=" ")}
for (i in 1:4) {
qqnorm(rnorm(100),xlab='G(0,1) Quantiles',main=" ")}
Compare the qqplots that you observe for a sample size of n = 30 with the qqplots
for a sample size of n = 100.

21. Run the following R code:


par(mfrow=c(2,2))
qqnorm(runif(100),xlab='G(0,1) Quantiles',main=" ")
qqnorm(rexp(100),xlab='G(0,1) Quantiles',main=" ")
qqnorm(rgamma(100,4,1),xlab='G(0,1) Quantiles',main=" ")
qqnorm(rt(100,3),xlab='G(0,1) Quantiles',main=" ")
On the basis of the qqplot determine whether the underlying distribution is symmetric.
If the distribution is not symmetric indicate if the skewness is positive or negative.
If the distribution is symmetric indicate if the kurtosis is larger or smaller than the
Gaussian kurtosis which is 3.

22. A qqplot was generated for 100 values of a variate. See Figure 2.27. Based on this
qqplot, answer the following questions:

(a) What is the approximate value of the sample median of these data?
(b) What is the approximate value of the IQR of these data?
(c) Would the frequency histogram of these data be reasonably symmetric about
the sample mean?
(d) Would the frequency histogram for these data most resemble a Normal proba-
bility density function, an Exponential probability density function, or a Uniform
probability density function?


Figure 2.27: Qqplot for 100 observations

23. Challenge Problem: Uniform data Suppose y1, y2, ..., yn is an observed random
sample from the Uniform(0, θ) distribution.

(a) Find the likelihood function, L(θ).

(b) Obtain the maximum likelihood estimate of θ. Hint: The maximum likelihood
estimate is not found by solving ℓ′(θ) = 0.

24. Challenge Problem: Censored lifetime data Consider the Exponential dis-
tribution as a model for the lifetimes of equipment. In experiments, it is often not
feasible to run the study long enough that all the pieces of equipment fail. For ex-
ample, suppose that n pieces of equipment are each tested for a maximum of c hours
(c is called a "censoring time"). The observed data are: k (where 0 ≤ k ≤ n) pieces
fail, at times y1, y2, ..., yk, and n − k pieces are still working after time c.

(a) If Y ~ Exponential(θ), show that P(Y > c; θ) = e^(−c/θ), for c > 0.

(b) Determine the likelihood function for θ based on the observed data described
above. Show that the maximum likelihood estimate of θ is

θ̂ = (1/k) [Σ(i=1 to k) yi + (n − k)c]

(c) What does part (b) give when k = 0? Explain this intuitively.
(d) A standard test for the reliability of electronic components is to subject them
to large fluctuations in temperature inside specially designed ovens. For one
particular type of component, 50 units were tested and k = 5 failed before
c = 400 hours, when the test was terminated, with Σ(i=1 to 5) yi = 450 hours. Find the
maximum likelihood estimate of θ.
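The estimator in part (b) can be sanity-checked by simulation: the sum of the failure times plus (n − k)c is just the sum of the censored observations min(yi, c), so the estimate is easy to compute, and for a large sample it should land near the true θ. A sketch (in Python for illustration; the true θ = 10, censoring time c = 15, sample size, and seed are arbitrary choices):

```python
import random

random.seed(2)
theta_true, c, n = 10.0, 15.0, 4000
lifetimes = [random.expovariate(1 / theta_true) for _ in range(n)]

observed = [min(y, c) for y in lifetimes]   # what the experimenter records
k = sum(1 for y in lifetimes if y <= c)     # number of observed failures

# theta_hat = (sum of failure times + (n - k)*c) / k = sum(observed) / k
theta_hat = sum(observed) / k
```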

25. Challenge Problem: Estimation from capture-recapture studies In order to
estimate the number of animals, N, in a wild habitat the capture-recapture method
is often used. In this scheme k animals are caught, tagged, and then released. Later
on, n animals are caught and the number Y of these that have tags is noted. The
idea is to use this information to estimate N.

(a) Show that under suitable assumptions

P(Y = y) = C(k, y) C(N − k, n − y) / C(N, n)

where C(a, b) denotes the binomial coefficient "a choose b".

(b) For observed k, n and y find the value N̂ that maximizes the probability in
part (a). Does this ever differ much from the intuitive estimate Ñ = kn/y?
(Hint: The likelihood L(N) depends on the discrete parameter N, and a good
way to find where L(N) is maximized over {1, 2, 3, ...} is to examine the ratios
L(N + 1)/L(N).)
(c) When might the model in part (a) be unsatisfactory?
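Besides the ratio argument in the hint, L(N) can also be maximized by direct search over a range of N values. A sketch (in Python for illustration) with hypothetical counts k = 100, n = 50, y = 5, for which the intuitive estimate is kn/y = 1000:

```python
from math import comb

k, n, y = 100, 50, 5     # tagged; recaptured; tagged among the recaptured

def lik(N):
    # P(Y = y) from part (a); defined for N >= k + (n - y)
    return comb(k, y) * comb(N - k, n - y) / comb(N, n)

N_hat = max(range(k + n - y, 5000), key=lik)   # maximize L(N) by direct search
N_tilde = k * n / y                            # intuitive estimate kn/y
```

The direct search and the intuitive estimate agree to within one animal here, which matches the conclusion the ratio argument gives.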

26. Challenge Problem: Poisson model with a covariate Let Y represent the
number of claims in a given year for a single general insurance policy holder. Each
policy holder has a numerical “risk score” x assigned by the company, based on
available information. The risk score may be used as an explanatory variate when
modeling the distribution of Y , and it has been found that models of the form

[ (x)]y (x)
P (Y = yjx) = e for y = 0; 1; : : :
y!

where (x) = e + x, are useful.

(a) Suppose that n randomly chosen policy holders with risk scores x1, x2, . . . , xn
had y1, y2, . . . , yn claims, respectively, in a given year. Determine the likelihood
function for α and β based on these data.
(b) Can α̂ and β̂ be found explicitly?
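For part (b), the likelihood equations have no closed-form solution in general, so α̂ and β̂ are typically found numerically. A minimal sketch with made-up data (a crude grid search stands in for a proper optimizer such as Newton's method):

```python
from math import exp

# Log-likelihood for the Poisson model with mu(x) = exp(alpha + beta * x),
# dropping the additive log(y!) constant.  Data values are made up for illustration.
x = [0.5, 1.0, 1.5, 2.0, 2.5]
y = [0, 1, 1, 3, 4]

def loglik(alpha, beta):
    return sum(yi * (alpha + beta * xi) - exp(alpha + beta * xi)
               for xi, yi in zip(x, y))

# Crude grid search over alpha in [-3, 1], beta in [-1, 2] at step 0.01;
# in practice one would use Newton's method or a library optimizer.
best = max(((loglik(a / 100, b / 100), a / 100, b / 100)
            for a in range(-300, 101) for b in range(-100, 201)),
           key=lambda t: t[0])
```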
3. PLANNING AND CONDUCTING EMPIRICAL STUDIES

3.1 Empirical Studies


An empirical study is one which is carried out to learn about a population or process by
collecting data. We have given several examples in the preceding two chapters but we have
not yet considered the details of such studies. In this chapter we consider how to conduct
an empirical study in a systematic way. Well-conducted empirical studies are needed to
produce maximal information within existing cost and time constraints. A poorly planned
or executed study can be worthless or even misleading. For example, in the field of medicine
thousands of empirical studies are conducted every year at very high costs to society and
with critical consequences. These investigations must be well planned and executed so that
the knowledge they produce is useful, reliable and obtained at reasonable cost.
It is helpful to think about planning and conducting a study using a set of steps such
as the following:

Problem: a clear statement of the study’s objectives, usually involving one or more
questions

Plan: the procedures used to carry out the study including how the data will be
collected

Data: the physical collection of the data, as described in the Plan

Analysis: the analysis of the data collected in light of the Problem and the Plan

Conclusion: The conclusions that are drawn about the Problem and their limitations

We will use this set of steps, which we will refer to as PPDAC, to discuss the important
ideas which must be considered when planning an empirical study. These steps, which are
designed to emphasize the statistical aspects of empirical studies, are described in more
detail in Section 3.2.


PPDAC can be used in two ways - first to actively formulate, plan and carry out investi-
gations and second as a framework to critically scrutinize reported empirical investigations.
These reports include articles in the popular press, scientific papers, government policy
statements, and various business reports. If you see the phrase “evidence based decision”
or “evidence based management”, look for an empirical study. In this course we will use
PPDAC most often to critically assess empirical studies reported in the media.

The following example will be used in the next section to show how PPDAC can be
used to describe and critically examine how the empirical study was conducted.

Example 3.1 An empirical study on university student drinking


The following news item, published by the University of Sussex, United Kingdom on Feb-
ruary 16, 2015, describes an empirical investigation in the field of psychology.
Campaigns to get young people to drink less should focus on the benefits of not
drinking and how it can be achieved:
Pointing out the advantages and achievability of staying sober is more effective than traditional
approaches that warn of the risks of heavy drinking, according to the research carried out at the
University of Sussex by researcher Dr Dominic Conroy. The study, published this week in the British
Journal of Health Psychology, found that university students were more likely to reduce their overall
drinking levels if they focused on the benefits of abstaining, such as more money and better health.
They were also less likely to binge drink if they had imagined strategies for how non-drinking might
be achieved – for example, being direct but polite when declining a drink, or choosing to spend
time with supportive friends. Typical promotions around healthy drinking focus on the risks of high
alcohol consumption and encourage people to monitor their drinking behaviour (e.g. by keeping a
drinks diary). However, the current study found that completing a drinks diary was less effective in
encouraging safer drinking behaviour than completing an exercise relating to non-drinking.
Dr Conroy says: “We focused on students because, in the UK, they remain a group who drink
heavily relative to their non-student peers of the same age. Similarly, attitudes about the acceptabil-
ity of heavy drinking are relatively lenient among students. “Recent campaigns, such as the NHS
Change4Life initiative, give good online guidance as to how many units you should be drinking
and how many units are in specific drinks. “Our research contributes to existing health promotion
advice, which seeks to encourage young people to consider taking ‘dry days’ yet does not always
indicate the range of benefits nor suggest how non-drinking can be more successfully ‘managed’ in
social situations.”
Dr Conroy studied 211 English university students aged 18-25 over the course of a month. Par-
ticipants in the study completed one of four exercises involving either: imagining positive outcomes
of non-drinking during a social occasion; imagining strategies required to successfully not drink
during a social occasion; imagining both positive outcomes and required strategies; or completing a
drinks diary task.
At the start of the study, participants in the outcome group were asked to list positive outcomes
of not drinking and those in the process group listed what strategies they might use to reduce their

drinking. Those in the combined group did both. They were reminded of their answers via email
during the one month course of the study and asked to continue practising this mental simulation.
All groups completed an online survey at various points, indicating how much they had drunk
the previous week. Over the course of one month, Dr Conroy found that students who imagined
positive outcomes of non-drinking reduced their weekly alcohol consumption from 20 units to 14
units on average. Similarly, students who imagined required strategies for non-drinking reduced the
frequency of binge drinking episodes – classified as six or more units in one session for women, and
eight or more units for men – from 1.05 episodes a week to 0.73 episodes a week on average.
Interestingly, the research indicates that perceptions of non-drinkers were also more favourable
after taking part in the study. Dr Conroy says this could not be directly linked to the intervention
but was an interesting additional feature of the study. He says: “Studies have suggested that holding
negative views of non-drinkers may be closely linked to personal drinking behaviour and we were
interested to see in the current study that these views may have improved as a result of taking
part in a non-drinking exercise. “I think this shows that health campaigns need to be targeted
and easy to fit into daily life but also help support people to accomplish changes in behaviour that
might sometimes involve ‘going against the grain’, such as periodically not drinking even when in
the company of other people who are drinking.”

To discuss the steps of PPDAC in detail we need to introduce a number of technical


terms. Every subject has its own jargon, that is, words with special meaning, and you need
to learn the terms describing the details of PPDAC to be successful in this course.

3.2 The Steps of PPDAC

Problem
The Problem step describes what the experimenters are trying to learn or what questions
they want to answer. Often this can be done using questions starting with “What”.

What conclusions are the experimenters trying to draw?

To what group of things or people do the experimenters want the conclusions to apply?

What variates can be defined?

What is(are) the question(s) the experimenters are trying to answer?

Types of problems
Three common types of statistical problems that are encountered are described below.

Descriptive: The problem is to determine a particular attribute of a population or


process. Much of the function of official statistical agencies such as Statistics Canada

involves problems of this type. For example, the government needs to know the
national unemployment rate and whether it has increased or decreased over the past
month.

Causative: The problem is to determine the existence or non-existence of a causal


relationship between two variates. For example:

“Does taking a low dose of aspirin reduce the risk of heart disease among men over the
age of 50?”
“Does changing from assignments to multiple term tests improve student learning in
STAT 231?”
“Does second-hand smoke from parents cause asthma in their children?”
“Does compulsory driver training reduce the incidence of accidents among new drivers?”

Predictive: The problem is to predict a future value for a variate of a unit to be selected
from the process or population. This is often the case in finance or in economics. For
example, financial institutions need to predict the price of a stock or interest rates in
a week or a month because this affects the value of their investments.

In a causative problem, the experimenter is interested in whether one variate x tends


to cause an increase or a decrease in another variate Y . Where possible this is conducted
in a controlled experiment in which x is increased or decreased while holding everything
else in the experiment constant and we observe the changes in Y . As indicated in Chapter
1, an experimental study is one in which the experimenter manipulates the values of the
explanatory variates while an observational study is one in which the explanatory variates
are only observed and not controlled. In the study of the relationship between second-hand
smoke and asthma described in Chapter 1, it is unlikely that the experimenter would be
able to manipulate the explanatory variate (child lives in household where adults smoke)
and so the experimenter can only conduct an observational study. In Chapter 8 we will see
how an experimental study can be designed to investigate a causative problem. A single
observational study in which the experimenter is not in control of the explanatory variates
cannot be used to investigate a causative problem.
In the drinking study in Example 3.1, which is an experimental study, the problem
is causative since the researchers wanted to study the effect of different mental exercises
related to non-drinking on the drinking behaviour of university students.

Defining the problem


The first step in describing the Problem is to define the units and the target population
or target process.

Definition 15 The target population or target process is the collection of units to which
the experimenters conducting the empirical study wish the conclusions to apply.

In the drinking study the units are university students and the target population consists
of English university students aged 18–25 in the United Kingdom at the time of the
study. Note that “all university students aged 18–25 in the world” would not be a
suitable target population since it would not make much sense to include countries in which
the consumption of alcohol is not allowed. A target population of “all English university
students aged 18–25” with no time mentioned is also not a suitable target population for
this study since we might expect the drinking behaviour of university students to change
over time.
In Chapter 1 we considered a survey of Ontario residents aged 14–20 in a specific
week to learn about their smoking behaviour. In this study the units are young adults and
the target population is all young adults aged 14–20 living in Ontario at the time of the
survey. Since smoking behaviour varies from province to province and year to year, the
target population of young adults aged 14–20 in Ontario at the time of the study is the
best choice.
In Chapter 1 we considered the comparison of two can filling machines used by a manu-
facturer with respect to the volume of liquid in the filled cans. The units are the individual
cans. The target process is all cans, which could be filled by the manufacturer using the
two machines, now and into the future under current operating conditions. Note that in
defining the target process the expression “under current operating conditions” has not
been well defined.

Definition 16 A variate is a characteristic associated with each unit.

The values of the variates change from unit to unit in the population/process. There
are usually many variates associated with each unit.
In the drinking study the most important variates are the weekly alcohol consumption
measured over the course of a month, and which mental exercise the student was assigned
to. Other variates which were collected are the age of the student and the sex of the student.
In the smoking survey, whether or not each young adult (unit) in the target population
smokes is the variate of primary interest. Other variates of interest defined for each unit
might be age and sex. In the can-filling example, the volume of liquid in each can (unit) is
a variate. Whether the old machine or the new machine was used to fill the can is another
variate.

Definition 17 An attribute is a function of the variates over a population or process.

The questions of interest in the Problem are specified in terms of attributes of the target
population/process. In the university student drinking study the mean (average) alcohol
consumption for the different mental exercise groups is the most important attribute. In
the smoking example, one important attribute is the proportion of young adults in the
target population who smoke. In the can-filling example, the attributes of interest were the
mean (average) volume and the variability (standard deviation) of the volumes for all

cans filled by each machine under current conditions. Possible questions of interest (among
others) are:
“Is there a difference in the mean alcohol consumption between the four different mental
exercise groups?”
“What proportion of young adults aged 14–20 in Ontario smoke?”
“Is the standard deviation of volumes of cans filled by the new machine less than that
of the old machine?”
We can also ask questions about graphical attributes of the target population such
as the population histogram, population cumulative distribution function, or a
scatterplot of one variate versus another over the target population.
It is very important that the Problem step contain clear questions about one or more
attributes of the target population.
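As a concrete illustration, attributes are simply functions computed from the variate values over all units. A minimal sketch with made-up values (not data from the studies above):

```python
from statistics import mean, pstdev

# Made-up variate values over a tiny hypothetical population of cans and young adults.
volumes = [355.2, 354.8, 355.5, 354.9, 355.1]  # volume variate (mL), one per can
smokes = [True, False, False, True, False]     # smoking indicator, one per person

# Attributes are functions of the variates over the whole population:
mean_volume = mean(volumes)               # mean volume
sd_volume = pstdev(volumes)               # population standard deviation of volume
prop_smokers = sum(smokes) / len(smokes)  # proportion who smoke
```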

Plan
The Plan step depends on the questions posed in the Problem step. The Plan step includes
a description of the population or process of units from which units will be selected, what
variates will be collected for the units selected, and how the variates will be measured.
In most cases, the attributes of interest for the target population/process cannot be
estimated directly since only units from a subset of the target population/process, or even
units from an entirely different population, can be considered for study.
This may be due to lack of resources and time, as in the smoking survey in which it would be
very costly and nearly impossible to create a list of all young adults aged 14–20 in Ontario
at the time of the study. It may also be a physical impossibility such as in the development
of a new product where the manufacturer may wish to make conclusions about a production
process in the future but only units produced in a pilot process can be examined. It may
also be unethical such as in a clinical trial of a new treatment whose side effects for humans
are unknown and which could be life threatening and therefore only laboratory animals
such as mice can be used.

Definition 18 The study population or study process is the collection of units available to
be included in the study.

The study population is often but not always a subset of the target population. In many
surveys, the study population is a list of people de…ned by their telephone number. The
sample is selected by calling a subset of the telephone numbers. The study population is
a subset of the target population which excludes those people without telephones or with
unlisted numbers. In the clinical trial example the study population only consists of the
laboratory animals that are available for the study which is not a subset of any target
population of humans. In the development of new products example, the units in the pilot
process are not a subset of the target process which are the units produced in the future.

The news item for the drinking study does not indicate how the students in the study
were recruited. To determine this information we need to check the research journal arti-
cle. The more detailed article in the British Journal of Health Psychology indicated that
administrators at 80 academic departments across 45 English universities were asked to for-
ward a pre-prepared recruitment message to their students containing a URL to an online
survey. No reason was given for choosing only these universities. Note that there are over
100 English universities in the United Kingdom as well as other universities in Scotland
and Wales (also part of the United Kingdom) which English students could attend. The
study population is therefore English university students aged 18–25 at the time of the
study at these 45 English universities which is a subset of the target population.
In the smoking survey, it would be difficult to create a list of all young adults aged
14–20 living in Ontario at the time of the survey. Since schools must keep a list of
students attending their school as well as student contact information, the researchers may
decide to choose a study population of all young adults aged 14–20 living in Ontario at
the time of the survey who are attending school. The study population is a subset of the
target population.
In the can-filling study a possible study process is all cans which are available at the
time of the study and could possibly be filled by the manufacturer using the two machines
under current operating conditions. In this case the study process is a subset of the target
process.
The study population/process is nearly always different than the target population/process
since there are always restrictions on the units which are available to be studied.

Definition 19 If the attributes in the study population/process differ from the attributes
in the target population/process then the difference is called study error.

Study error cannot be quantified since the values of the target population/process at-
tributes and the study population/process attributes are unknown. (If these attributes
were known then an empirical study would not be necessary!) Context experts would
need to be consulted, for example, in order to decide whether or not it is reasonable to
assume that conclusions from an investigation using mice are relevant to the human tar-
get population. The statistician’s role is to warn the context experts of the possibility of
such error, especially when the study population/process is very different from the target
population/process.
In the drinking study, the study population only included English students at the 45
English universities contacted. If the mean alcohol consumption under various mental exer-
cises at these universities was systematically different than the mean alcohol consumption
under various mental exercises for students in the target population then this difference
would be study error.
Suppose in the smoking survey that young adults aged 14–20 living in Ontario at
the time of the survey who are attending school were less likely to smoke (people with
more education tend to smoke less). In this case the proportion of smokers in the target

population would be different than the proportion of smokers in the study population and
this difference would be study error.

Definition 20 The sampling protocol is the procedure used to select a sample of units from
the study population/process. The number of units sampled is called the sample size.

In Chapter 2, we discussed modeling the data and often claimed that we had a “random
sample” so that our model was simple. In practice, it is exceedingly difficult and expensive
to select a random sample of units from the study population and so other less rigorous
methods are used. Often researchers “take what they can get”.
Sample size is usually driven by cost or availability. In Section 4.4 we will see how to
use the Binomial model to determine sample sizes, and in Section 4.6 we will see how to
use the Gaussian model to determine sample sizes.
In the drinking study, the sampling protocol involved asking administrators at 80 aca-
demic departments across 45 English universities to forward a pre-prepared recruitment
message to their students containing a URL to an online survey. Departments could decide
whether or not to forward the message to their students and students who received the mes-
sage could decide whether or not to take part in the study. The sample size, as reported in
the news item, was 211. Although not indicated in the news item, the journal article indi-
cates that students who agreed to participate were randomly assigned by the researchers to
one of the four mental health exercises (imagining positive outcomes of non-drinking dur-
ing a social occasion; imagining strategies required to successfully not drink during a social
occasion; imagining both positive outcomes and required strategies; or completing a drinks
diary task). The importance of randomization in making a cause and effect conclusion is
discussed in Chapter 8. The students were then asked to report their alcohol consumption
in units in the week before they completed the various online surveys over a period of one
month.
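Random assignment of participants to groups, as used in the drinking study, is straightforward to implement. Below is a minimal sketch (the group labels and the use of Python's random module are our own illustration, not the researchers' actual procedure):

```python
import random

# Hypothetical labels for the four exercise groups in the drinking study.
groups = ["outcomes", "strategies", "combined", "diary"]

def randomize(participant_ids, groups, seed=None):
    # Shuffle the participants, then deal them to the groups in round-robin
    # order so that group sizes differ by at most one.
    rng = random.Random(seed)
    ids = list(participant_ids)
    rng.shuffle(ids)
    return {g: ids[i::len(groups)] for i, g in enumerate(groups)}

assignment = randomize(range(211), groups, seed=2015)
```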

Definition 21 If the attributes in the sample differ from the attributes in the study popu-
lation/process the difference is called sample error.

Sample error cannot be quantified since the values of the study population/process
attributes are unknown. Different random sampling protocols can produce different sample
errors. We will see in Chapter 4 how models can be used to get an idea of how large this
error might be.
In the university student drinking study, not all academic departments forwarded the
recruitment message (only 23 according to the journal article). Suppose only departments
who thought students at their university had drinking issues forwarded the message and
then only students who were heavy drinkers chose to participate in the study. If the
mean alcohol consumption under various mental exercises for students who received the
recruitment message and decided to participate was systematically higher than the mean
alcohol consumption under various mental exercises for students in the study population

then this difference is sample error. Sample error should be suspected in all surveys in
which the participants are volunteers.
The experimenters must decide which variates are going to be measured or determined
for the units in the sample. For any attributes of interest, as defined in the Problem
step, the corresponding variates must certainly be measured. Other variates which may
aid the analysis may also need to be measured. In the smoking survey, experimenters
must determine whether each young adult in the sample smokes or not (this requires a
careful definition). They may also determine other demographic variates such as age and
sex so that they can compare the smoking rate across age groups, sex, etc. In experimental
studies, the experimenters assign the value of a variate they are controlling to each unit in
the sample. For example, in a clinical trial, sampled units can be assigned to the treatment
group or the placebo group by the experimenters.
When the value of a variate is determined for a given unit, errors are often introduced
by the measurement system which determines the value.

Definition 22 If the measured value and the true value of a variate are not identical the
difference is called measurement error.

Measurement errors are unknown since the true value of the variate is unknown. (If we
knew the true value we would not need to measure it!) In practice, experimenters try to
ensure that the processes used to take the measurements, referred to as the measurement
systems, do not contribute substantial error to the conclusions. They may have to study
the measurement systems which are used in separate studies to ensure that this is true.
See, for example, the case study in Section 3.3.
One variate which was determined for each unit (student) in the drinking study was
which mental exercise group the student was assigned to. If the actual group assignment
was recorded incorrectly then this is measurement error. The students were also asked to
report their daily alcohol consumption in the week before they completed the online sur-
veys. The journal article indicated that students measured their daily alcohol consumption
in UK units (an alcohol unit in the United Kingdom is defined as 10 milliliters of pure ethyl
alcohol) with the help of a visual aid and that the online surveys occurred at the begin-
ning of the study, at two weeks, and at four weeks. Note that a different variate would
be associated with each time that the student reported their alcohol consumption. These
variates were self-reported by each student. If a student does not accurately report their
alcohol consumption then this is measurement error. Measurement error should always be
suspected when variates are measured by self-reporting.

Response bias and missing data


Suppose city officials wish to conduct a study to determine if ethnic residents of a city
are satis…ed with police service in their neighbourhood. A questionnaire is prepared. A
random sample of 300 mailing addresses in a predominantly ethnic neighbourhood is chosen

and a uniformed police officer is sent to each address to interview an adult resident. Is there
a possible bias in this study? It is likely that those who are strong supporters of the police
are quite happy to respond but those with misgivings about the police will either change
some of their responses to favour the police or not respond at all. This
type of bias is called response bias. When those that do respond have somewhat different
characteristics than the population at large, the quality of the data is threatened, especially
when the response rate (the proportion who do respond to the survey) is low. For example,
in Canada in 2011, the long form of the Canadian Census (response rate around 98%) was
replaced by the National Household Survey (a voluntary version with similar questions,
response rate around 68%) and there was considerable discussion9 of the resulting response
bias. See for example the CBC story “Census Mourned on World Statistics Day”10.

The figure below shows the steps in the Plan and the sources of error:

Target Population/Process
        ↓ Study error
Study Population/Process
        ↓ Sample error
Sample
        ↓ Measurement error
Measured variate values

Steps in the plan and sources of error

A person using PPDAC for an empirical study should, by the end of the Plan step, have
a good understanding of the study population/process, the sampling protocol, the variates
which are to be measured, and the quality of the measurement systems that are intended
for use.
In this course you will most often use PPDAC to critically examine a study done by
someone else. You should examine each step in the Plan (you may have to ask to see the
Plan since many reports omit it) for strengths and weaknesses. You must also pay attention
to the various types of error that may occur and how they might impact the conclusions.

Data
The goal of the Data step is to collect the data according to the Plan. Any deviations from
the Plan should be noted. The data must be stored in a way that facilitates the Analysis.
9 http://www.youtube.com/watch?v=0A7ojjsmSsY
10 http://www.cbc.ca/news/technology/story/2010/10/20/long-form-census-world-statistics-day.html

The previous sections noted the need to define variates clearly and to have satisfactory
methods of measuring them. It is difficult to discuss the Data step except in the context
of specific examples, but we mention a few relevant points.

Mistakes can occur in recording or entering data into a data base. For complex
investigations, it is useful to put checks in place to avoid these mistakes. For example,
if a field is missed, the data base should prompt the data entry person to complete
the record if possible.

In many studies the units must be tracked and measured over a long period of time
(e.g. consider a study examining the ability of aspirin to reduce strokes in which
persons are followed for 3 to 5 years). This requires careful management.

When data are recorded over time or in different locations, the time and place for
each measurement should be recorded.

There may be departures from the study Plan that arise over time (e.g. persons may
drop out of a long term medical study because of adverse reactions to a treatment; it
may take longer than anticipated to collect the data so the number of units sampled
must be reduced). Departures from the Plan should be recorded since they may have
an important impact on the Analysis and Conclusion.

In some studies the amount of data may be extremely large, so data base design and
management is important.

Analysis
The Analysis step includes both simple and complex calculations to process the data into
information. Numerical and graphical methods such as those discussed in Chapter 1, as
well as others, are used in this step to summarize the data.
A key component of the Analysis step is the selection of an appropriate model that
describes the data and how the data were collected. As indicated in Chapter 1 variates
can be of di¤erent types: continuous, discrete, categorical, ordinal, and complex. It is
important to identify the types of variates collected in a study since this helps in selecting
appropriate models. In the Problem step, the problems of interest were stated in terms of
the attributes of interest. These attributes need to be described in terms of the parameters
and properties of the model. It is also very important to check whether a proposed model
is appropriate. Some methods for checking the fit of a model were discussed in Chapter 2.
Other methods will be discussed in Chapter 7.
It is difficult to describe this step in more detail except in the context of specific exam-
ples. You will see many examples of formal analyses in the following chapters.

In the drinking study the researchers conducted a formal analysis to test for differences
between the mean alcohol consumption for the four groups across the three time points.
This type of analysis is beyond the scope of this course. However in Chapter 6 we will
see how to test for a difference between means when the data consist of two independent
groups and the data are assumed to arise from different Gaussian distributions.
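As a preview of that two-group comparison, here is a minimal sketch with made-up weekly consumption data (the pooled-variance t statistic shown is one common choice; its justification and distribution are left to Chapter 6):

```python
from math import sqrt
from statistics import mean, stdev

# Made-up weekly consumption (UK units) for two independent groups.
group_a = [20, 18, 25, 15, 22, 19]
group_b = [14, 16, 12, 17, 13, 15]

diff = mean(group_a) - mean(group_b)  # observed difference in means

# Pooled estimate of a common standard deviation, then the t statistic.
n_a, n_b = len(group_a), len(group_b)
sp = sqrt(((n_a - 1) * stdev(group_a) ** 2 + (n_b - 1) * stdev(group_b) ** 2)
          / (n_a + n_b - 2))
t_stat = diff / (sp * sqrt(1 / n_a + 1 / n_b))
```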

Conclusions
The purpose of the Conclusion step is to answer the questions posed in the Problem. An
attempt should be made to quantify (or at least discuss) potential errors as described in
the Plan step. Limitations to the conclusions should be discussed. The conclusions for the
drinking study are given below.

Here is a PPDAC for the drinking study based on the published information:

Problem The problem was to study the differences in mean alcohol consumption if
different mental exercises related to non-drinking were used. The target population
was English university students aged 18–25 in the United Kingdom at the time of
the study. The problem is causative since the researchers wanted to study the effect
of the different mental exercises on mean alcohol consumption.

Plan The study population was English university students aged 18–25 at the
time of the study at 45 English universities. The sampling protocol involved asking
administrators at 80 academic departments across 45 English universities to forward
a pre-prepared recruitment message to their students containing a URL to an online
survey. Departments could decide whether or not to forward the message to their
students and students who received the message could decide whether or not to take
part in the study. The sample size was 211. Students who agreed to participate
were randomly assigned by the researchers to one of the four mental health exercises
(imagining positive outcomes of non-drinking during a social occasion; imagining
strategies required to successfully not drink during a social occasion; imagining both
positive outcomes and required strategies; or completing a drinks diary task). The
age and sex of each student was also recorded. At the beginning of the study, at two
weeks, and at four weeks the students self-reported, using an online survey, how much
alcohol they had consumed in the previous week in UK units using a visual aid.

Data The data included which mental exercise group the student was assigned to,
their age, their sex, and self-reported information about their alcohol consumption at
three different time points.

Analysis The researchers conducted a formal analysis to test for differences between
the mean alcohol consumption for the four groups across the three time points.

Conclusion In the drinking study the researchers concluded that completing mental
exercises relating to non-drinking was more effective in encouraging safer drinking
behaviour than completing a drinks diary alone. The researchers should have indicated
that the conclusion only applies to students in the study population, not students in
the target population, and certainly not students in other countries. This is an
experimental study since the researchers determined group assignment for each student
by randomization. There are several serious drawbacks in this study. Students were
not recruited from all English universities. This could lead to study error. Also, not
all contacted departments forwarded the recruitment message and participants were
volunteers. Both of these issues could lead to sample error. Alcohol consumption was
self-reported, which could lead to measurement error.

3.3 Case Study


Introduction
This case study is an example of more than one use of PPDAC which demonstrates some
real problems that arise with measurement systems. The documentation given here has
been rewritten from the original report to emphasize the underlying PPDAC framework.

Background
An automatic in-line gauge measures the diameter of a crankshaft journal on 100% of
the 500 parts produced per shift. The measurement system does not involve an operator
directly except for calibration and maintenance. Figure 3.1 shows the diameter in question.
The journal is a “cylindrical” part of the crankshaft. The diameter of the journal must
be defined since the cross-section of the journal is not perfectly round and there may be
taper along the axis of the cylinder. The gauge measures the maximum diameter as the
crankshaft is rotated at a fixed distance from the end of the cylinder.
The specification for the diameter is −10 to +10 units with a target of 0. The
measurements are re-scaled automatically by the gauge to make it easier to see deviations from
the target. If the measured diameter is less than −10, the crankshaft is scrapped and a
cost is incurred. If the diameter exceeds +10, the crankshaft can be reworked, again at
considerable cost. Otherwise, the crankshaft is judged acceptable.

Overall Project
A project is planned by a crankshaft manufacturer to reduce scrap/rework by reducing
part-to-part variation in the diameter. A first step involves an investigation of the
measurement system itself. There is some speculation that the measurement system contributes
substantially to the overall process variation and that bias in the measurement system is
resulting in the scrapping and reworking of good parts. To decide if the measurement
system is making a substantial contribution to the overall process variability, we also need
a measure of this attribute for the current and future population of crankshafts. Since
there are three different attributes of interest, it is convenient to split the project into three
separate applications of PPDAC.

Figure 3.1: Crankshaft with arrow pointing to “journal”

Study 1
In this application of PPDAC, we estimate the properties of the errors produced by the
measurement system. In terms of the model, we will estimate the bias and variability due
to the measurement system. We hope that these estimates can be used to predict the future
performance of the system.

Problem

The target process is all future measurements made by the gauge on crankshafts to be
produced by the manufacturer. The response variate is the measured diameter associated
with each unit. The attributes of interest are the average measurement error and the
population standard deviation of these errors. We can quantify these concepts using a
model (see below). A detailed fishbone diagram for the measurement system is also shown
in Figure 3.2. In such a diagram, we list explanatory variates, organized by the major
“bones”, that might be responsible for variation in the response variate, here the measured
journal diameter. We can use the diagram in formulating the Plan.
Note that the measurement system includes the gauge itself, the way the part is loaded
into the gauge, who loads the part, the calibration procedure (every two hours, a master
part is put through the gauge and adjustments are made based on the measured diameter
of the master part; that is, “the gauge is zeroed”), and so on.

Figure 3.2: Fishbone diagram for variation in measured journal diameter (major bones:
Gauge, Journal, Calibration, Operator, Environment, Measurements)

Plan

To determine the properties of the measurement errors we must measure crankshafts with
known diameters. “Known” implies that the diameters were measured by an off-line
measurement system that is very reliable. For any measurement system study in which bias is
an issue, there must be a reference measurement system which is known to have negligible
bias and variability which is much smaller than that of the system under study.
There are many issues in establishing a study process or a study population. For
convenience, we want to conduct the study quickly using only a few parts. However, this
restriction may lead to study error if the bias and variability of the measurement system
change as other explanatory variates change over time or parts. We guard against this
latter possibility by using three crankshafts with known diameters as part of the definition
of the study process. Since the units are the taking of measurements, we define the study
population as all measurements that can be taken in one day on the three selected
crankshafts. These crankshafts were selected so that the known diameters were spread out over
the range of diameters normally seen. This will allow us to see if the attributes of the system
depend on the size of the diameter being measured. The known diameters used were −10,
0, and +10. (Remember the diameters have been rescaled, so a diameter of −10 is acceptable.)
No other explanatory variates were measured. To define the sampling protocol, it
was proposed to measure the three crankshafts ten times each in a random order. Each
measurement involved the loading of the crankshaft into the gauge. Note that this was to
be done quickly to avoid delay of production of the crankshafts. The whole procedure took
only a few minutes.


The preparation for the data collection was very simple. One operator was instructed
to follow the sampling protocol and write down the measured diameters in the order that
they were collected.

Data

The repeated measurements on the three crankshafts are shown below. Note that due to
poor explanation of the sampling protocol, the operator measured each part ten times in
a row and did not use a random ordering. (Unfortunately non-adherence to the sampling
protocol often happens when real data are collected and it is important to consider the
effects of this in the Analysis and Conclusion steps.)

Crankshaft 1 Crankshaft 2 Crankshaft 3


−10  −8     2   1     9  11
−12 −12    −2   2     8  12
 −8 −10     0   1    10   9
−11 −10     1   1    12  10
−12 −10     0   0    10  12

Analysis

A model to describe the repeated measurement of the known diameters is

$$Y_{ij} = \mu_i + R_{ij}, \qquad R_{ij} \sim G(0, \sigma_m) \text{ independent} \qquad (3.1)$$

where $i = 1, 2, 3$ indexes the three crankshafts and $j = 1, 2, \ldots, 10$ indexes the ten repeated
measurements. The parameter $\mu_i$ represents the long term average measurement for
crankshaft $i$. The random variables $R_{ij}$ (called the residuals) represent the variability of the
measurement system, while $\sigma_m$ quantifies this variability. Note that we have assumed, for
simplicity, that the variability $\sigma_m$ is the same for all three crankshafts in the study.
We can rewrite the model in terms of the random variables $Y_{ij}$ so that $Y_{ij} \sim G(\mu_i, \sigma_m)$.
Now we can write the likelihood as in Example 2.3.2 and maximize it with respect to the
four parameters $\mu_1$, $\mu_2$, $\mu_3$, and $\sigma_m$ (the trick is to solve $\partial \ell / \partial \mu_i = 0$, $i = 1, 2, 3$ first). Not
surprisingly the maximum likelihood estimates for $\mu_1$, $\mu_2$, $\mu_3$ are the sample averages for
each crankshaft so that

$$\hat{\mu}_i = \bar{y}_i = \frac{1}{10} \sum_{j=1}^{10} y_{ij} \qquad \text{for } i = 1, 2, 3$$

To examine the assumption that $\sigma_m$ is the same for all three crankshafts we can calculate
the sample standard deviation for each of the three crankshafts. Let

$$s_i = \sqrt{\frac{1}{9} \sum_{j=1}^{10} (y_{ij} - \bar{y}_i)^2} \qquad \text{for } i = 1, 2, 3$$

The data can be summarized as:



               $\bar{y}_i$    $s_i$
Crankshaft 1     −10.3        1.49
Crankshaft 2       0.6        1.17
Crankshaft 3      10.3        1.42

The estimate of the bias for crankshaft 1 is the difference between the observed average
$\bar{y}_1$ and the known diameter value, which is equal to −10 for crankshaft 1; that is, the
estimated bias is $-10.3 - (-10) = -0.3$. For crankshafts 2 and 3 the estimated biases are
$0.6 - 0 = 0.6$ and $10.3 - 10 = 0.3$ respectively, so the estimated biases in this study are all
small.
Note that the sample standard deviations $s_1, s_2, s_3$ are all about the same size and
our assumption about a common value seems reasonable. (Note: it is possible to test this
assumption more formally.) An estimate of $\sigma_m$ is given by

$$s_m = \sqrt{\frac{s_1^2 + s_2^2 + s_3^2}{3}} = 1.37$$

Note that this estimate is not the average of the three sample standard deviations but the
square root of the average of the three sample variances. (Why does this estimate make
sense? Is it the maximum likelihood estimate of $\sigma_m$? What if the number of measurements
for each crankshaft were not equal?)
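These estimates can be checked by direct computation. Here is a minimal sketch in Python (the same arithmetic can of course be done in R); the data are the thirty measurements tabulated above, with signs matching the reported summary statistics:

```python
import math

# Repeated measurements (10 per crankshaft) from Study 1.
data = {
    1: [-10, -8, -12, -12, -8, -10, -11, -10, -12, -10],
    2: [2, 1, -2, 2, 0, 1, 1, 1, 0, 0],
    3: [9, 11, 8, 12, 10, 9, 12, 10, 10, 12],
}
known = {1: -10, 2: 0, 3: 10}  # known ("true") diameters

variances = []
for i, y in data.items():
    n = len(y)
    ybar = sum(y) / n                               # MLE of mu_i
    s2 = sum((v - ybar) ** 2 for v in y) / (n - 1)  # sample variance
    variances.append(s2)
    print(f"crankshaft {i}: ybar = {ybar:.1f}, s = {math.sqrt(s2):.2f}, "
          f"bias = {ybar - known[i]:.1f}")

# Pooled estimate of sigma_m: the root of the average variance,
# not the average of the standard deviations.
s_m = math.sqrt(sum(variances) / len(variances))
print(f"s_m = {s_m:.2f}")
```

Running this reproduces the table: means −10.3, 0.6, 10.3; standard deviations 1.49, 1.17, 1.42; biases −0.3, 0.6, 0.3; and the pooled estimate $s_m = 1.37$.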

Conclusion

The observed biases $-0.3$, $0.6$, $0.3$ appear to be small, especially when measured against
the estimate of $\sigma_m$, and there is no apparent dependence of bias on crankshaft diameter.
To interpret the variability, we can use the model (3.1). Recall that if $Y_{ij} \sim G(\mu_i, \sigma_m)$
then

$$P(\mu_i - 2\sigma_m \le Y_{ij} \le \mu_i + 2\sigma_m) \approx 0.95$$

Therefore if we repeatedly measure the same journal diameter, then about 95% of the time
we would expect the observations to fall within about $2(1.37) = 2.74$ of their long-run average.
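The 0.95 here is the usual two-standard-deviation coverage of a Gaussian distribution, which can be verified with the standard library alone; a quick Python check:

```python
import math

def gauss_cdf(z):
    # Standard Gaussian cumulative distribution function via the error function
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

coverage = gauss_cdf(2) - gauss_cdf(-2)  # P(-2 <= Z <= 2)
print(round(coverage, 4))                # 0.9545

# Half-width of the +/- 2 sigma_m band for the gauge, using s_m = 1.37
print(2 * 1.37)                          # 2.74
```

So slightly more than 95% of repeated measurements of the same journal are expected to land within 2.74 units of the long-run average.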
There are several limitations to these conclusions. Because we have carried out the
study on one day only and used only three crankshafts, the conclusion may not apply to
all future measurements (study error). The fact that the measurements were taken within
a few minutes on one day might be misleading if something special was happening at that
time (sample error). Since the measurements were not taken in random order, another
source of sample error is the possible drift of the gauge over time.
We could recommend that, if the study were to be repeated, more than three known-value
crankshafts could be used, that the time frame for taking the measurements could be
extended, and that more measurements be taken on each crankshaft. Of course, we would
also note that these recommendations would add to the cost and complexity of the study.
We would also insist that the operator be better informed about the Plan.

Study 2
The second study is designed to estimate the overall population standard deviation of the
diameters of current and future crankshafts (the target population). We need to estimate
this attribute to determine what variation is due to the process and what is due to the
measurement system. A cause-and-effect or fishbone diagram listing some possible explanatory
variates for the variability in journal diameter is given in Figure 3.3. Note that there are
many explanatory variates other than the measurement system. Variability in the response
variate is induced by changes in the explanatory variates, including those associated with
the measurement system.

Figure 3.3: Fishbone diagram for cause-and-effect (major bones: Measurements, Method,
Machine, Environment, Material, Operator)

Plan

The study population is de…ned as those crankshafts available over the next week, about
7500 parts (500 per shift times 15 shifts). No other explanatory variates were measured.
Initially it was proposed to select a sample of 150 parts over the week (ten from each
shift). However, when it was learned that the gauge software stores the measurements for
the most recent 2000 crankshafts measured, it was decided to select a point in time near the
end of the week and use the 2000 measured values from the gauge memory to be the sample.
One could easily criticize this choice (sample error), but the data were easily available and
inexpensive.

Data

The individual observed measurements are too numerous to list but a histogram of the data
is shown in Figure 3.4. From this, we can see that the measured diameters vary from −14
to +16.

Figure 3.4: Histogram of 2000 measured values from the gauge memory

Analysis

A model for these data is given by

$$Y_i = \mu + R_i, \qquad R_i \sim G(0, \sigma) \text{ independently for } i = 1, 2, \ldots, 2000$$

where $Y_i$ represents the distribution of the measurement of the $i$th diameter, $\mu$ represents
the study population mean diameter, and the residual $R_i$ represents the variability due to
sampling and the measurement system. We let $\sigma$ quantify this variability. We have not
included a bias term in the model because we assume, based on our results from Study 1,
that the measurement system bias is small. As well, we assume that the sampling protocol
does not contribute substantial bias.
The histogram of the 2000 measured diameters shows that there is considerable spread in
the measured diameters. About 4.2% of the parts require reworking and 1.8% are scrapped.
The shape of the histogram is approximately symmetrical and centred close to zero. The
sample mean is

$$\bar{y} = \frac{1}{2000} \sum_{i=1}^{2000} y_i = 0.82$$

which gives us an estimate of $\mu$ (the maximum likelihood estimate), and the sample standard
deviation is

$$s = \sqrt{\frac{1}{1999} \sum_{i=1}^{2000} (y_i - \bar{y})^2} = 5.17$$

which gives us an estimate of $\sigma$ (not quite the maximum likelihood estimate).
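The parenthetical remark refers to the divisor: the maximum likelihood estimate of $\sigma$ divides the sum of squares by $n$ rather than $n - 1$. With $n = 2000$ the two estimates are practically identical, as the sketch below shows (the actual 2000 gauge readings are not listed, so simulated values with roughly the reported mean and spread stand in):

```python
import math
import random

random.seed(1)
# Illustrative sample only -- the actual 2000 gauge readings are not published.
y = [random.gauss(0.82, 5.17) for _ in range(2000)]

n = len(y)
ybar = sum(y) / n
ss = sum((v - ybar) ** 2 for v in y)
s = math.sqrt(ss / (n - 1))       # sample standard deviation (divisor n - 1)
sigma_hat = math.sqrt(ss / n)     # maximum likelihood estimate (divisor n)
print(s, sigma_hat, s / sigma_hat)  # ratio = sqrt(n/(n-1)), about 1.00025
```

The ratio of the two estimates is $\sqrt{n/(n-1)} \approx 1.00025$, so for a sample this large the distinction has no practical importance.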

Conclusion

The overall process variation is estimated by $s$. Since the sample contained 2000 parts
measured consecutively, many of the explanatory variates did not have time to change as
they would in the study population. Thus, there is a danger of sample error producing an
estimate of the variation that is too small.
The variability due to the measurement system, estimated to be 1.37 in Study 1, is much
less than the overall variability, which is estimated to be 5.17. One way to compare the two
standard deviations $\sigma_m$ and $\sigma$ is to separate the total variability into the variability due
to the measurement system $\sigma_m$ and that due to all other sources. In other words, we are
interested in estimating the variability that would be present if there were no variability
in the measurement system ($\sigma_m = 0$). If we assume that the total variability arises from
two independent sources, the measurement system and all other sources, then we have
$\sigma^2 = \sigma_m^2 + \sigma_p^2$ or $\sigma_p = \sqrt{\sigma^2 - \sigma_m^2}$, where $\sigma_p$ quantifies the variability due to all other
uncontrollable variates (sampling variability). An estimate of $\sigma_p$ is given by

$$s_p = \sqrt{s^2 - s_m^2} = \sqrt{(5.17)^2 - (1.37)^2} = 4.99$$

Hence, eliminating all of the variability due to the measurement system would produce an
estimated variability of 4.99, which is a small reduction from 5.17. The measurement system
seems to be performing well and is not contributing substantially to the overall variation.
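Because independent sources add on the variance scale, this decomposition is one line of arithmetic; a quick Python check:

```python
import math

s = 5.17    # estimated overall standard deviation (Study 2)
s_m = 1.37  # estimated measurement-system standard deviation (Study 1)

# Independent sources add on the variance scale: sigma^2 = sigma_m^2 + sigma_p^2
s_p = math.sqrt(s ** 2 - s_m ** 2)
print(round(s_p, 2))  # 4.99
```

Note how insensitive the result is: even though $s_m$ is about a quarter of $s$, removing it entirely shrinks the overall standard deviation by less than 4%, because the components combine as squares.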

Study 3: A Brief Description


A limitation of Study 1 was that it was conducted over a very short time period. To
address this concern, a third study was recommended to study the measurement system
over a longer period during normal production use. In Study 3, a master crankshaft of
known diameter equal to zero was measured every half hour until 30 measurements were
collected. A plot of the measurements versus the times at which the measurements were
taken is given in the run chart in Figure 3.5.
In the first study the standard deviation was estimated to be 1.37. In a sample of
observations from a $G(0, 1.37)$ distribution we would expect approximately 95% of the
observations to lie in the interval $[0 - 2(1.37), 0 + 2(1.37)] = [-2.74, 2.74]$, which is
obviously not true for the data displayed in the run chart. These data have a much larger
variability. This was a shocking result for the people in charge of the process.
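How can a gauge whose short-term variability is only 1.37 produce a run chart with far greater spread? One plausible mechanism is a slow drift in the gauge zero between calibrations. The simulation below is purely illustrative (the linear drift rate of 0.3 units per half hour is an assumption, not the actual gauge behaviour), but it shows how a modest drift inflates the long-term standard deviation well beyond the short-term value:

```python
import math
import random

random.seed(231)
sigma_m = 1.37  # short-term measurement variability estimated in Study 1
n = 30          # half-hourly measurements of the master part (true diameter 0)

# Assumed slow linear drift of the gauge zero: 0.3 units per half hour.
readings = [0.3 * t + random.gauss(0, sigma_m) for t in range(n)]

ybar = sum(readings) / n
s_long = math.sqrt(sum((y - ybar) ** 2 for y in readings) / (n - 1))
print(round(s_long, 2))  # noticeably larger than sigma_m = 1.37
```

Each individual reading is still close to its neighbours, so the drift is invisible in a handful of quick repeat measurements, which is exactly why Study 1, conducted within a few minutes, could not detect it.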

Comments
Study 3 revealed that the measurement system had a serious long term problem. At first,
it was suspected that the cause of the variability was the fact that the gauge was not
calibrated over the course of the study. Study 3 was repeated with a calibration before
each measurement. A pattern similar to that for Study 3 was seen. A detailed examination
of the gauge by a repairperson from the manufacturer revealed that one of the electronic
components was not working properly.

Figure 3.5: Scatter plot of diameter versus time

This was repaired and Study 3 was repeated. This
study showed variation similar to the variation of the short term study (Study 1) so that
the overall project could continue. When Study 2 was repeated, the overall variation and
the number of scrap and reworked crankshafts were substantially reduced. The project was
considered complete and long term monitoring showed that the scrap rate was reduced to
about 0.7%, which produced an annual savings of more than $100,000.
As well, three similar gauges that were used in the factory were put through the “long
term” test. All were working well.

Summary
An important part of any Plan is the choice and assessment of the measurement
system.

The measurement system may contribute substantial error that can result in poor
decisions (e.g. scrapping good parts, accepting bad parts).

We represent systematic measurement error by bias in the model. The bias can be
assessed only by measuring units with known values, taken from another reference
measurement system. The bias may be constant or depend on the size of the unit
being measured, the person making the measurements, and so on.

Variability can be assessed by repeatedly measuring the same unit. The variability
may depend on the unit being measured or any other explanatory variates.

Both bias and variability may be a function of time. This can be assessed by examining
these attributes over a sufficiently long time span as in Study 3.

3.4 Chapter 3 Problems


1. Answer the questions below based on the following:
A Waterloo-based public opinion research firm was hired by the Ontario Ministry of
Education to investigate whether the financial worries of Ontario university students
varied by sex. To reduce costs, the research firm decided to study only university
students living in the Kitchener-Waterloo region in September 2012. An associate
with the research firm randomly selected 250 university students attending a Laurier-
Waterloo football game. The students were asked whether they agreed/disagreed
with the statement “I have significant trouble paying my bills.” Their sex was also
recorded. The results are given below:

Agreed Disagreed Total


Male 68 77 145
Female 42 63 105
Total 110 140 250

(a) What are the units?


(b) Define the target population.
(c) Define the study population.
(d) What are two variates in this problem and what is their type?
(e) What is the sampling protocol?
(f) What is a possible source of study error?
(g) What is a possible source of sample error?
(h) Describe an attribute of interest for the target population and provide an estimate
based on the given data.

2. Four weeks before a national election, a political party conducts a poll to assess what
proportion of eligible voters plan to vote and, of those, what proportion support the
party. This will determine how they run the rest of the campaign. They are able to
obtain a list of eligible voters and their telephone numbers in the 20 most populated
areas. They select 3000 names from the list and call them. Of these, 1104 eligible
voters agree to participate in the survey with the results summarized in the table
below. Answer the questions below based on this information.

Support Party
Plan to Vote YES NO
YES 351 381
NO 107 265

(a) Define the Problem for this study. What type of Problem is this and why?

(b) What is the target population?


(c) Identify the variates and their types for this study.
(d) What are the attributes of interest in the target population?
(e) What is the study population?
(f) What is the sample?
(g) What is a possible source of study error?
(h) Describe one possible source of sample error.
(i) Estimate the attributes of interest for the study population based on the given
data.

3. Online brain-training: does it really work? Brain training, or the goal of


improved cognitive function through the regular use of computerized tests, is a multi-
million dollar industry. Lumosity, which is one of the most popular cognitive training
programs, is made up of more than 40 games designed to improve cognitive abilities,
including memory, attention and problem solving. Members pay a monthly membership
and are supposed to play the games for 15 minutes, three to five times a week. A
2007 press release from the company calls the games “a scientifically developed online
brain fitness program demonstrated to improve memory and attention with fun and
effective brain workouts.”
To investigate whether regular brain training leads to any improvement in cognitive
function, researchers in Britain, led by neuroscientist Adrian Owen, conducted a
study in 2010. Viewers of the BBC popular science programme ‘Bang Goes The Theory’
were invited to participate in a six-week online study of brain training. Of 52,617
participants aged 18–60 who initially registered, 11,430 completed both benchmarking
assessments and at least two full training sessions during the six-week period.
An initial ‘benchmarking’ assessment included a broad neuropsychological battery of
four tests that are sensitive to changes in cognitive function in health and disease.
Specifically, baseline scores for reasoning, verbal short-term memory, spatial working
memory and paired-associates learning were acquired. Participants were then
randomly assigned to one of two experimental groups or a third control group and
logged on to the BBC Lab UK website to practise six training tasks for a minimum
of 10 minutes a day, three times a week. In experimental group 1, the six training
tasks emphasized reasoning, planning and problem-solving abilities. In experimental
group 2, a broader range of cognitive functions was trained using tests of short-term
memory, attention, visuospatial processing and mathematics similar to those
commonly found in commercially available brain-training devices. The difficulty of the
training tasks increased as the participants improved, to continuously challenge their
cognitive performance and maximize any benefits of training. The control group did
not formally practise any specific cognitive tasks during their ‘training’ sessions, but
answered obscure questions from six different categories using any available online
resource. At six weeks, the benchmarking assessment was repeated and the pre- and
post-training scores were compared. The difference in benchmarking scores provided
the measure of generalized cognitive improvement resulting from training. Similarly,
for each training task, the first and last scores were compared to give a measure of
specific improvement on that task.
The relationship between the number of training sessions and changes in benchmark
performance was negligible in all groups for all tests. These results provide no evidence
for any generalized improvements in cognitive function following brain training
in a large sample of healthy adults.
Answer the questions below based on the information provided.

(a) What type of study is this? Why?


(b) Define the Problem for this study.
(c) What type of Problem is it? Why?
(d) Define a suitable target population for this study.
(e) Give two variates of interest in this problem and specify the type of variate for
each.
(f) Define a suitable study population for this study.
(g) What is the sampling protocol?
(h) What is a possible source of study error?
(i) What is a possible source of sample error?
(j) What is a possible source of measurement error?
(k) Why was it important for the researchers to randomly assign the participants to
the three different groups?
(l) What is the importance of the control group?
(m) Use the article and your answers to the above questions to construct a PPDAC
for this empirical study in as much detail as possible.

4. U.S. to fund study of Ontario math curriculum, Globe & Mail, January 17,
2014, Caroline Alphonso - Education Reporter (article has been condensed)
The U.S. Department of Education has funded a $2.7-million (U.S.) project, led by
a team of Canadian researchers at Toronto’s Hospital for Sick Children. The study
will look at how elementary students at several Ontario public schools fare in math
using the current provincial curriculum as compared to the JUMP math program,
which combines the conventional way of learning the subject with so-called discovery
learning. Math teaching has come under scrutiny since OECD results that measured
the scholastic abilities of 15-year-olds in 65 countries showed an increasing percentage
of Canadian students failing the math test in nearly all provinces. Dr. Tracy Solomon
and her team are collecting and analyzing two years of data on students in primary
and junior grades from one school board, which she declined to name. The students
were in Grades 2 and 5 when the study began, and are now in Grades 3 and 6, which
means they will participate in Ontario’s standardized testing program this year. The
research team randomly assigned some schools to teach math according to the Ontario
curriculum, which allows open-ended student investigations and problem-solving. The
other schools are using the JUMP program. Dr. Solomon said the research team is
using classroom testing data, lab tests on how children learn and other measures to
study the impact of the two programs on student learning.
Answer the questions below based on this article.

(a) What type of study is this? Why?


(b) Define the Problem for this study.
(c) What type of Problem is it? Why?
(d) Define a suitable target population for this study.
(e) Give two variates of interest in this problem and specify the type of variate for
each.
(f) Define a suitable study population for this study.
(g) What is the sampling protocol?
(h) What is a possible source of study error?
(i) What is a possible source of sample error?
(j) What is a possible source of measurement error?
(k) Why was it important for the researchers to randomly assign some schools to
teach math according to the Ontario curriculum and some other schools to teach
math using the JUMP program?
(l) Use the information in the article and your answers to the above questions to
construct a PPDAC for this empirical study in as much detail as possible.

5. Playing racing games may encourage risky driving, study finds, Globe &
Mail, January 8, 2015 (article has been condensed)
Playing an intense racing game makes players more likely to take risks such as speed-
ing, passing on the wrong side, running red lights or using a cellphone in a simulated
driving task shortly afterwards, according to a new study. Young adults with more
adventurous personalities were more inclined to take risks, and more intense games
led to greater risk-taking, the authors write in the journal Injury Prevention. Other
research has found a connection between racing games and inclination to risk-taking
while driving, so the new results broaden that evidence base, said lead author of the
new study, Mingming Deng of the School of Management at Xi’an Jiaotong University
in Xi’an, China. “I think racing gamers should be [paying] more attention in their
real driving,” Deng said.


The researchers recruited 40 student volunteers at Xi’an Jiaotong University, mostly
men, for the study. The students took personality tests at the start and were divided
randomly into two groups. Half of the students played a circuit-racing-type driving
game that included time trials on a race course similar to Formula 1 racing, for about
20 minutes, while the other group played computer solitaire, a neutral game for
comparison. After a five-minute break, all the students took the Vienna Risk-Taking Test,
viewing 24 “risky” videotaped road-traffic situations on a computer screen presented
from the driver’s perspective, including driving up to a railway crossing whose gate
has already started lowering. How long the viewer waits to hit the “stop” key for
the manoeuvre is considered a measure of their willingness to take risks on the road.
Students who had been playing the racing game waited an average of almost 12 seconds
to hit the stop button compared with 10 seconds for the solitaire group. The
participants’ experience playing these types of games outside of the study did not
seem to make a difference.
Answer the questions below based on this article.

(a) What type of study is this? Why?


(b) Define the Problem for this study.
(c) What type of Problem is this? Why?
(d) Define a suitable target population for this study.
(e) What are the two most important variates in this study and what is their type?
(f) What is the attribute of interest in the target population?
(g) Define a suitable study population for this study.
(h) Describe the sampling protocol for this study.
(i) Give a possible source of study error for this study in relation to your answer to
(d).
(j) Give a possible source of sample error for this study.
(k) Estimate the attribute of interest for the study population based on the given
data.
(l) Use the information in the article and your answers to the above questions to
construct a PPDAC for this empirical study in as much detail as possible.

6. Higher coffee consumption associated with lower risk of early death, European Society of Cardiology, August 27, 2017
Higher coffee consumption is associated with a lower risk of death, according to research presented today at ESC Congress. The study in nearly 20,000 participants suggests that coffee can be part of a healthy diet in healthy people. “Coffee is one of the most widely consumed beverages around the world,” said Dr Adela Navarro,
a cardiologist at Hospital de Navarra, Pamplona, Spain. “Previous studies have suggested that drinking coffee might be inversely associated with all-cause mortality but this has not been investigated in a Mediterranean country.”
The purpose of this study was to examine the association between coffee consumption and the risk of mortality (death) in a middle-aged Mediterranean cohort. The study was conducted within the framework of the Seguimiento Universidad de Navarra (SUN) Project, a long-term prospective cohort study of Spanish university graduates which began in 1999 and which has recruited new Spanish university graduates to the study every year since then. This analysis included 19,896 participants of the SUN Project, whose average age at enrollment was 37.7 years old. On entering the study, participants completed a previously validated semi-quantitative food frequency questionnaire to collect information on coffee consumption, lifestyle and sociodemographic characteristics, and previous health conditions.
Patients were followed up for an average of ten years. Information on mortality was obtained from study participants and their families, postal authorities, and the National Death Index. During the ten-year period, 337 participants died. The researchers found that participants who consumed at least four cups of coffee per day had a 64% lower risk of all-cause mortality than those who never or almost never consumed coffee. In those who were at least 45 years old, drinking two additional cups of coffee per day was associated with a 30% lower risk of mortality during follow-up. The association was not significant among younger participants.
Dr Navarro said: “In the SUN project we found an inverse association between drinking coffee and the risk of all-cause mortality, particularly in people aged 45 years and above. This may be due to a stronger protective association among older participants.” She concluded: “Our findings suggest that drinking four cups of coffee each day can be part of a healthy diet in healthy people.”

(a) What type of study is this and why?


(b) Define the Problem for this study.
(c) What are the two most important variates in this study and what is their type?
(d) Define a suitable target population/process for this study.
(e) Define a suitable study population/process for this study.
(f) Define study error and give a possible source of study error for this study in relation to your answers to (d) and (e).
(g) Define measurement error and give a possible source of measurement error for one of the two variates you gave in (c).
(h) Give at least one limitation to this study.
(i) Use the information in the article and your answers to the above questions to
construct a PPDAC for this empirical study in as much detail as possible.

(j) Suppose you are not a coffee drinker. On the basis of this study, do you think it would be a good idea to start drinking four cups of coffee a day? Why or why not?

7. Answer the following questions based on the study given in Chapter 1, Problem 24.

(a) Define the Problem for this study in one or two sentences.
(b) What type of Problem is this? Explain why.
(c) Define a suitable target population for this study.
(d) Define a suitable study population for this study.
(e) Describe possible sources of study error for this study.
(f) Describe the sampling protocol for this study in as much detail as possible.
(g) What is the sample and sample size for this study?
(h) Describe possible sources of sample error for this study.
(i) Describe possible sources of measurement error for this study.
(j) What is the most serious limitation to the conclusion(s) of this study?
(k) Use the information in the article and your answers to the above questions to
construct a PPDAC for this empirical study in as much detail as possible.

8. Suppose you wish to study the smoking habits of teenagers and young adults, in order
to understand what personal factors are related to whether, and how much, a person
smokes. Briefly describe the main components of such a study, using the PPDAC framework. Be specific about the target and study population, the sample, and the variates you would collect.

9. Suppose you wanted to study the relationship between a person’s “resting” pulse rate (heart beats per minute) and the amount and type of exercise they get.

(a) List some factors (including exercise) that might affect resting pulse rate. You may wish to draw a cause and effect (fishbone) diagram to represent potential causal factors.
(b) Describe briefly how you might study the relationship between pulse rate and exercise using (i) an observational study, and (ii) an experimental study.

10. A large company uses photocopiers leased from two suppliers A and B. The lease rates are slightly lower for B’s machines but there is a perception among workers that they break down and cause disruptions in work flow substantially more often. Describe briefly how you might design and carry out a study of this issue, with the ultimate objective being a decision whether to continue the lease with company B. What additional factors might affect this decision?

11. For a study like the one in Example 1.3.2, where heights x and weights y of individuals
are to be recorded, discuss sources of variability due to the measurement of x and y
on any individual.
4. ESTIMATION

4.1 Statistical Models and Estimation


In statistical estimation we use two models:
(1) A model which describes the variability in the variate(s) of interest in the population
or process being studied.
(2) A model which takes into account how the data were collected and which is constructed in conjunction with the model in (1).
We use these two models to estimate the unknown attributes in the population or
process based on the observed data and to determine the uncertainty in these estimates.
The unknown attributes are usually represented by unknown parameters in the models or
by functions of the unknown parameters. We have already seen in Chapter 2 that these
unknown parameters can be estimated using the method of maximum likelihood and the
invariance property of maximum likelihood estimates.
Several issues arise:

(1) Where do we get our probability model? What if it is not a good description of the population or process?
We discussed the first question in Chapters 1 and 2. It is important to check the adequacy (or “fit”) of the model; some ways of doing this were discussed in Chapter 2 and more formal methods will be considered in Chapter 7. If the model used is not satisfactory, it is not wise to use the estimates based on it. For the lifetimes of brake pads data introduced in Example 1.3.4, a Gaussian model did not fit the data well. Sometimes the data can be transformed in such a way that the Gaussian model does fit (see Chapter 2, Problem 18).

(2) The estimation of parameters or population attributes depends on data collected from
the population or process, and the likelihood function is based on the probability of
the observed data. This implies that factors associated with the selection of sample
units or the measurement of variates (e.g. measurement error) must be included in
the model. In many examples it is assumed that the variate of interest is measured
without error for a random sample of units from the population. We will typically

assume that the data come from a random sample of population units, but in any
given application we would need to design the data collection plan to ensure this
assumption is valid.

(3) Suppose in the model chosen the population mean is represented by the parameter $\mu$. The sample mean $\bar{y}$ is an estimate of $\mu$, but not usually equal to it. How far away from $\mu$ is $\bar{y}$ likely to be? If we take a sample of only $n = 50$ units, would we expect the estimate $\bar{y}$ to be as “good” as $\bar{y}$ based on 150 units? What does “good” mean?

We focus on the third point in this chapter and assume that we can deal with the first two points with the methods discussed in Chapters 1 and 2.

4.2 Estimators and Sampling Distributions


Suppose that some attribute of interest for a population or process can be represented by a parameter $\theta$ in a statistical model. We assume that $\theta$ can be estimated using a random sample drawn from the population or process in question. Recall from Chapter 2 that a point estimate of $\theta$, denoted $\hat{\theta}$, was defined as a function of the observed sample $y_1, y_2, \ldots, y_n$, that is, $\hat{\theta} = g(y_1, y_2, \ldots, y_n)$. For example

$$\hat{\theta} = \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$$

is a point estimate of $\theta$ if $y_1, y_2, \ldots, y_n$ is an observed random sample from a Poisson distribution with mean $\theta$.
The method of maximum likelihood provides a general method for obtaining estimates, but other methods exist. For example, if $\theta = E(Y) = \mu$ is the average (mean) value of $y$ in the population, then the sample mean $\hat{\mu} = \bar{y}$ is an intuitively sensible estimate; it is the maximum likelihood estimate of $\mu$ if $Y$ has a $G(\mu, \sigma)$ distribution, but because of the Central Limit Theorem it is a good estimate of $\mu$ more generally. Thus, while we will use maximum likelihood estimation a great deal, you should remember that the discussion below applies to estimates of any type.
The problem facing us in this chapter is how to determine or quantify the uncertainty in an estimate. We do this using sampling distributions, which are based on the following idea. If we select random samples on repeated occasions, then the estimates $\hat{\theta}$ obtained from the different samples will vary. For example, five separate random samples of $n = 50$ persons from the same male population described in Example 1.3.2 gave five different estimates $\hat{\mu} = \bar{y}$ of $E(Y)$ as:

1.723  1.743  1.734  1.752  1.736

Estimates vary as we take repeated samples and therefore we associate a random variable and a distribution with these estimates.
More precisely, we define this idea as follows. Let the random variables $Y_1, Y_2, \ldots, Y_n$ represent potential observations in an empirical study. Associate with the estimate
$\hat{\theta} = g(y_1, y_2, \ldots, y_n)$ a random variable $\tilde{\theta} = g(Y_1, Y_2, \ldots, Y_n)$. The random variable $\tilde{\theta} = g(Y_1, Y_2, \ldots, Y_n)$ is simply a rule that tells us how to process the data to obtain a numerical value $\hat{\theta} = g(y_1, y_2, \ldots, y_n)$, which is an estimate of the unknown parameter $\theta$ for a given data set $y_1, y_2, \ldots, y_n$. For example

$$\tilde{\theta} = \bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i$$

is a random variable and $\hat{\theta} = \bar{y}$ is a numerical value. We call $\tilde{\theta}$ the estimator of $\theta$ corresponding to $\hat{\theta}$. We use $\hat{\theta}$ to denote an estimate, that is, a numerical value, and $\tilde{\theta}$ to denote the corresponding estimator, the random variable.

Definition 23 A (point) estimator $\tilde{\theta}$ is a random variable which is a function $\tilde{\theta} = g(Y_1, Y_2, \ldots, Y_n)$ of the random variables $Y_1, Y_2, \ldots, Y_n$. The distribution of $\tilde{\theta}$ is called the sampling distribution of the estimator.

Since $\tilde{\theta}$ is a function of the random variables $Y_1, Y_2, \ldots, Y_n$, it is also a random variable. If we know the distribution of $Y_1, Y_2, \ldots, Y_n$ then we can find the sampling distribution of $\tilde{\theta}$, at least in principle. In other words, we can find the probability (density) function or cumulative distribution function of $\tilde{\theta}$ and use it to make probability statements about $\tilde{\theta}$. If we know the sampling distribution of the estimator $\tilde{\theta}$ then we can use it to quantify the uncertainty in an estimate $\hat{\theta}$; that is, we can determine the probability that the estimator $\tilde{\theta}$ is “close” to the true but unknown value of $\theta$. In Examples 4.2.1–4.2.3 we examine ways of finding the sampling distribution, at least approximately.

Example 4.2.1
Suppose we have a variate of interest (for example, the height in meters of a male in the population of Example 1.3.2) whose distribution it is reasonable to model as a $G(\mu, \sigma)$ random variable. Suppose also that we plan to take a random sample $Y_1, Y_2, \ldots, Y_n$ to estimate the unknown mean $\mu$, where $Y_i \sim G(\mu, \sigma)$, $i = 1, 2, \ldots, n$. The maximum likelihood estimator of $\mu$ is

$$\tilde{\mu} = \bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i$$

From properties of Gaussian random variables (see Chapter 1, Problem 16) we know that $\tilde{\mu} = \bar{Y} \sim G(\mu, \sigma/\sqrt{n})$, and so the sampling distribution of $\bar{Y}$ is $G(\mu, \sigma/\sqrt{n})$.
If we knew $\sigma$ we could determine how often the estimator $\tilde{\mu} = \bar{Y}$ is within a specified amount of the unknown mean $\mu$. For example, if the variate is height and heights are measured in meters, then we could determine how often the estimator $\tilde{\mu} = \bar{Y}$ is within 0.01 meters of the true mean $\mu$ as follows:

$$P(|\tilde{\mu} - \mu| \le 0.01) = P(\mu - 0.01 \le \bar{Y} \le \mu + 0.01) = P\left(\frac{-0.01}{\sigma/\sqrt{n}} \le \frac{\bar{Y} - \mu}{\sigma/\sqrt{n}} \le \frac{0.01}{\sigma/\sqrt{n}}\right) = P\left(-0.01\sqrt{n}/\sigma \le Z \le 0.01\sqrt{n}/\sigma\right) \quad \text{where } Z \sim G(0, 1)$$
Suppose $\sigma = 0.07$ meters and $n = 50$; then

$$P(|\tilde{\mu} - \mu| \le 0.01) = P\left(-0.01\sqrt{50}/0.07 \le Z \le 0.01\sqrt{50}/0.07\right) = P(-1.01 \le Z \le 1.01) = 0.688$$

and if $n = 100$,

$$P(|\tilde{\mu} - \mu| \le 0.01) = P\left(-0.01\sqrt{100}/0.07 \le Z \le 0.01\sqrt{100}/0.07\right) = P(-1.43 \le Z \le 1.43) = 0.847$$

This illustrates the rather intuitive fact that the larger the sample size, the higher the probability that the estimator $\tilde{\mu} = \bar{Y}$ is within 0.01 meters of the true but unknown mean height $\mu$ in the population. It also allows us to express the uncertainty in an estimate $\hat{\mu} = \bar{y}$ from an observed sample $y_1, y_2, \ldots, y_n$ by indicating the probability that any single random sample will give an estimate within a certain distance of $\mu$.
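These two probabilities can be checked numerically. The following sketch (in Python rather than the course's R, using only the standard library; the function names are ours) evaluates $P(|\tilde{\mu} - \mu| \le 0.01) = 2\Phi(0.01\sqrt{n}/\sigma) - 1$ via the error function:

```python
from math import erf, sqrt

def phi(z):
    """Standard normal (Gaussian) CDF via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def prob_within(d, sigma, n):
    """P(|Ybar - mu| <= d) when Ybar ~ G(mu, sigma/sqrt(n))."""
    z = d * sqrt(n) / sigma
    return 2.0 * phi(z) - 1.0

print(round(prob_within(0.01, 0.07, 50), 3))   # 0.688
print(round(prob_within(0.01, 0.07, 100), 3))  # 0.847
```

Both values agree with the hand calculations above.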

Example 4.2.2
In Example 4.2.1 the distribution of the estimator $\tilde{\mu} = \bar{Y}$ could be determined exactly. Sometimes the distribution of the estimator can only be determined approximately using the Central Limit Theorem. For example, for Binomial data with $n$ trials and $Y$ successes the estimator $\tilde{\theta} = Y/n$ has $E(\tilde{\theta}) = \theta$ and $Var(\tilde{\theta}) = \theta(1-\theta)/n$. By the Normal approximation to the Binomial we have

$$\frac{\tilde{\theta} - \theta}{\sqrt{\theta(1-\theta)/n}} \sim G(0, 1) \quad \text{approximately}$$

This result could be used, for example, to determine how large $n$ should be to ensure that

$$P\left(-0.03 \le \tilde{\theta} - \theta \le 0.03\right) \ge 0.95$$

for all $\theta \in [0, 1]$. See Chapter 4, Problem 10.
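Problem 10 asks for the details; as a sketch of the idea (our own illustration, not the official solution), note that the approximate probability $2\Phi\left(0.03\sqrt{n/[\theta(1-\theta)]}\right) - 1$ is smallest at $\theta = 0.5$, where the variance $\theta(1-\theta)/n$ is largest, so it suffices to choose $n$ with $0.03\sqrt{n}/0.5 \ge 1.96$:

```python
from math import ceil, erf, sqrt

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def coverage(n, theta, d=0.03):
    """Normal-approximation value of P(|theta_tilde - theta| <= d)."""
    return 2.0 * phi(d * sqrt(n / (theta * (1.0 - theta)))) - 1.0

# Worst case is theta = 0.5, so solve 0.03*sqrt(n)/0.5 >= 1.96 there:
n_req = ceil((1.96 * 0.5 / 0.03) ** 2)
print(n_req)                            # 1068
print(coverage(n_req, 0.5) >= 0.95)     # True
```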

In some cases the sampling distribution can be approximated using a simulation study
as illustrated in the next example.

Example 4.2.3
Suppose the population of interest is a finite population consisting of 500 units, and associated with each unit is a number between 1 and 10 which is the variate of interest. If we wanted to estimate the mean $\mu$ of this population we could select a random sample $y_1, y_2, \ldots, y_n$ without replacement and estimate $\mu$ using $\hat{\mu} = \bar{y}$. Let us examine how good the estimator $\tilde{\mu} = \bar{Y}$ is in the case of the population which has the distribution of variate values given in Table 4.1.
Variate value    1    2    3    4    5    6    7    8    9   10   Total
No. of units   210  127   66   39   23   13   11    7    3    1     500

Table 4.1: Distribution of variate values in the finite population

In Figure 4.1 a histogram of the variate values is plotted. We notice that the population of variate values is very positively skewed.

[Figure 4.1: Histogram of the variate values for the finite population of Table 4.1]

The population mean and the population standard deviation are given respectively by

$$\mu = \frac{1}{500}\left[210(1) + 127(2) + \cdots + 1(10)\right] = \frac{1181}{500} = 2.362$$

and

$$\sigma = \sqrt{\frac{1}{500}\left[210(1)^2 + 127(2)^2 + \cdots + 1(10)^2\right] - \left(\frac{1181}{500}\right)^2} = 1.7433$$

Note that the population variance is divided by 500 and not 499. To determine how good an estimator $\tilde{\mu} = \bar{Y}$ is we need the sampling distribution of $\bar{Y}$. This could be determined exactly but would require a great deal of effort. Another way to approximate the sampling distribution is to use a computer simulation. The simulation can be done in two steps. First a random sample $y_1, y_2, \ldots, y_n$ is drawn at random without replacement from the
population. Secondly the sample mean $\bar{y}$ for this sample is determined. These two steps are repeated $k$ times. The $k$ sample means, $\bar{y}_1, \bar{y}_2, \ldots, \bar{y}_k$, can then be considered as a random sample from the distribution of the random variable $\tilde{\mu} = \bar{Y}$, and we can study the distribution by plotting a histogram of the values $\bar{y}_1, \bar{y}_2, \ldots, \bar{y}_k$. The R code for such a simulation is given in Chapter 4, Problem 1.
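The R code itself appears in Problem 1; purely for illustration, an equivalent sketch of the two-step simulation in Python (standard library only; the variable names are ours) is:

```python
import random

random.seed(1)

# Finite population of Table 4.1: value v occurs counts[v] times (500 units).
counts = {1: 210, 2: 127, 3: 66, 4: 39, 5: 23, 6: 13, 7: 11, 8: 7, 9: 3, 10: 1}
population = [v for v, c in counts.items() for _ in range(c)]
mu = sum(population) / len(population)   # population mean, 2.362

k, n = 10000, 15
means = []
for _ in range(k):
    # Step 1: draw a sample of size n without replacement.
    sample = random.sample(population, n)
    # Step 2: record its sample mean.
    means.append(sum(sample) / n)

# Approximate P(|Ybar - mu| <= 0.5) by the proportion of simulated means
# within 0.5 of mu (the text reports about 0.74 for such a simulation).
prop = sum(abs(m - mu) <= 0.5 for m in means) / k
print(round(mu, 3), round(prop, 2))
```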
[Figure 4.2: Relative frequency histogram of sample means from 10,000 samples of size 15 drawn from the population defined by Table 4.1]

The histogram in Figure 4.2 was obtained by drawing $k = 10000$ samples of size $n = 15$ from the population defined by Table 4.1, calculating the sample means $\bar{y}_1, \bar{y}_2, \ldots, \bar{y}_{10000}$, and then plotting the relative frequency histogram. The histogram represents an approximation to the sampling distribution of the estimator $\bar{Y}$. The number of simulations $k$ only affects how good the approximation is. It can be shown¹¹ that the mean and standard deviation of the true sampling distribution of $\bar{Y}$ are

$$E(\bar{Y}) = \mu = 2.362 \quad \text{and} \quad sd(\bar{Y}) \approx \frac{\sigma}{\sqrt{n}} = \frac{1.7433}{\sqrt{15}} = 0.4501$$

Does the histogram, which represents the approximate sampling distribution, agree with these statements? What do you notice about the symmetry of the histogram? Does the histogram look like a Gaussian distribution?
Based on this simulation we can approximate $P(|\bar{Y} - 2.362| \le 0.5)$, the probability that the sample mean $\bar{Y}$ is within 0.5 of the population mean $\mu = 2.362$, by counting the number of sample means in the simulation which are within 0.5 of the value 2.362. For the simulation in Figure 4.2 this proportion was 0.7422.
¹¹For a sample of size $n$ drawn without replacement from a finite population of size $N$, $sd(\bar{Y}) = \frac{\sigma}{\sqrt{n}}\sqrt{\frac{N-n}{N-1}}$.
If samples of size $n = 30$ were drawn, how would the location, variability and symmetry of the histogram of simulated means change? How would the estimate of $P(|\bar{Y} - 2.362| \le 0.5)$ be affected? See Chapter 4, Problem 1.

Regardless of how the sampling distribution of an estimator $\tilde{\theta}$ is determined, the sampling distribution is important because it allows us to compute probabilities of the form $P(|\tilde{\theta} - \theta| \le d)$ for given $d$, so that we can quantify the uncertainty in the estimate $\hat{\theta}$.

The estimates and estimators we have discussed so far are often referred to as point estimates and point estimators respectively. This is because they consist of a single value or “point”. Sampling distributions allow us to address the uncertainty in a point estimate. The uncertainty in a point estimate is usually conveyed by an interval estimate, which takes the form $[L(\mathbf{y}), U(\mathbf{y})]$ where the endpoints $L(\mathbf{y})$ and $U(\mathbf{y})$ are both functions of the observed data $\mathbf{y}$. If we let $L(\mathbf{Y})$ and $U(\mathbf{Y})$ represent the random variables associated with $L(\mathbf{y})$ and $U(\mathbf{y})$, then $[L(\mathbf{Y}), U(\mathbf{Y})]$ is called a random interval since the endpoints are random variables. The probability that the parameter $\theta$ falls in the random interval $[L(\mathbf{Y}), U(\mathbf{Y})]$ is $P(\theta \in [L(\mathbf{Y}), U(\mathbf{Y})]) = P[L(\mathbf{Y}) \le \theta \le U(\mathbf{Y})]$. This probability tells us how good the rule is by which the interval estimate was obtained. It tells us, for example, how often we would expect the statement $\theta \in [L(\mathbf{y}), U(\mathbf{y})]$ to be true if we were to draw many random samples from the same population and each time we constructed the interval $[L(\mathbf{y}), U(\mathbf{y})]$ based on the observed data $\mathbf{y}$.
For example, suppose $P[L(\mathbf{Y}) \le \theta \le U(\mathbf{Y})] = 0.95$. If we drew a large number of random samples and each time we constructed the interval $[L(\mathbf{y}), U(\mathbf{y})]$ from the data $\mathbf{y}$, then we would expect the true value of the parameter $\theta$ to lie in approximately 95% of these constructed intervals. This means we can be reasonably confident that if we construct one interval based on one observed data set $\mathbf{y}$, then the interval $[L(\mathbf{y}), U(\mathbf{y})]$ will contain the true value of the unknown parameter $\theta$. In general, uncertainty in a point estimate is explicitly stated by giving the interval estimate along with the probability $P(\theta \in [L(\mathbf{Y}), U(\mathbf{Y})])$.
We will discuss this idea of confidence related to interval estimates in more detail in Section 4.4. First we show how the likelihood function can be used to construct interval estimates.

4.3 Interval Estimation Using the Likelihood Function


The likelihood function can be used to obtain interval estimates for parameters in a very straightforward way. We do this here for the case in which the probability model involves only a single scalar parameter $\theta$. Individual models often have constraints on the parameters. For example, in the Gaussian distribution the mean can be any real number, $\mu \in \mathbb{R}$, but the standard deviation must be positive, that is, $\sigma > 0$. Similarly, for the Binomial model the probability of success $\theta$ must lie in the interval $[0, 1]$. These constraints are usually identified by requiring that the parameter falls in some set $\Omega$, called the parameter space.
As mentioned in Chapter 2, we often rescale the likelihood function to have a maximum value of one to obtain the relative likelihood function.

Definition 24 Suppose $\theta$ is scalar and that some observed data (say a random sample $y_1, y_2, \ldots, y_n$) have given a likelihood function $L(\theta)$. The relative likelihood function $R(\theta)$ is defined as

$$R(\theta) = \frac{L(\theta)}{L(\hat{\theta})} \quad \text{for } \theta \in \Omega$$

where $\hat{\theta}$ is the maximum likelihood estimate and $\Omega$ is the parameter space.
Note: $0 \le R(\theta) \le 1$ for all $\theta \in \Omega$.

Definition 25 A $100p\%$ likelihood interval for $\theta$ is the set $\{\theta : R(\theta) \ge p\}$.

The set $\{\theta : R(\theta) \ge p\}$ is not necessarily an interval unless $R(\theta)$ is unimodal, but this is the case for all models that we consider here. The justification for a likelihood interval is that the values of $\theta$ that give large values of $L(\theta)$, and hence large values of $R(\theta)$, are the most plausible in light of the data. We will see that the values of $p$ which are often used are $p = 0.01$, $0.1$, $0.15$ and $0.5$. Likelihood intervals cannot usually be found explicitly. They must be found numerically by solving the equation $R(\theta) = p$, or equivalently $R(\theta) - p = 0$, using a function like uniroot in R, or they can be determined approximately from a graph of $R(\theta)$ or $r(\theta) = \log R(\theta)$.

Example 4.3.1 Polls

Let $\theta$ be the proportion of people in a large study population who have a specific characteristic. Suppose $n$ persons are randomly selected for a poll and $y$ people are observed to have the characteristic of interest. If we let $Y$ be the number who have the characteristic in the sample of size $n$, then $Y \sim \text{Binomial}(n, \theta)$ is a reasonable model. As we have seen previously, the likelihood function is

$$L(\theta) = \binom{n}{y}\theta^y(1-\theta)^{n-y} \quad \text{for } 0 \le \theta \le 1$$

and the maximum likelihood estimate of $\theta$ is the sample proportion $\hat{\theta} = y/n$. The relative likelihood function is

$$R(\theta) = \frac{\theta^y(1-\theta)^{n-y}}{\hat{\theta}^y(1-\hat{\theta})^{n-y}} \quad \text{for } 0 \le \theta \le 1$$
Figure 4.3 shows the relative likelihood functions $R(\theta)$ for two polls:

Poll 1: $n = 200$, $y = 80$
Poll 2: $n = 1000$, $y = 400$

In each case $\hat{\theta} = 0.40$, but the relative likelihood function is more “concentrated” around $\hat{\theta}$ for the larger poll (Poll 2). The 10% likelihood intervals also reflect this. From Figure 4.3
we can determine that the 10% likelihood interval is $[0.33, 0.47]$ for Poll 1 and $[0.37, 0.43]$ for Poll 2. The interval for Poll 2 is narrower than for Poll 1, which reflects the fact that the larger poll contains more information about $\theta$.

[Figure 4.3: Relative likelihood functions $R(\theta)$ for two polls ($n = 200$ and $n = 1000$) with different sample sizes; the horizontal line marks $R(\theta) = 0.1$]

Table 4.2 gives guidelines for interpreting likelihood intervals. These are only guidelines for this course.

Table 4.2
Guidelines for Interpreting Likelihood Intervals

Values of $\theta$ inside a 50% likelihood interval are very plausible in light of the observed data.
Values of $\theta$ inside a 10% likelihood interval are plausible in light of the observed data.
Values of $\theta$ outside a 10% likelihood interval are implausible in light of the observed data.
Values of $\theta$ outside a 1% likelihood interval are very implausible in light of the observed data.
The values 1%, 10%, and 50% are typically used because they are nice round numbers and they provide useful summaries. Other values could also be used. In Section 4.6 we will see that 15% likelihood intervals have a connection with 95% confidence intervals. (Values inside a 15% likelihood interval are also plausible in light of the observed data.) A 10% likelihood interval is useful because it excludes parameter values for which the probability of the observed data is less than $\frac{1}{10}$ of the probability when $\theta = \hat{\theta}$. In other words, a 10% likelihood interval summarizes the interval of values for the unknown parameter which are reasonably supported by the observed data in an empirical study. A 50% likelihood interval contains values of the parameter for which the probability of the observed data is at least $\frac{1}{2}$ of the probability when $\theta = \hat{\theta}$. A narrower 50% likelihood interval might be used if decisions made on the basis of the plausible values of the unknown parameter in light of the data had serious consequences in terms of money or lives of people. A 1% likelihood interval, which is wider than a 10% likelihood interval, would be used if the aim of the empirical study was to summarize all the parameter values which are supported in some way by the observed data. Which likelihood interval is used, therefore, depends very much on the goals of the empirical study that is being conducted.
A drawback of likelihood intervals (as well as confidence intervals, as we will see in the next section) is that we never know whether the interval obtained contains the true value of the parameter or not. In Section 4.6 we will see that the construction of a likelihood interval ensures that we can be reasonably confident that it does.

Sometimes it is more convenient to compute the log of the relative likelihood function instead of $R(\theta)$.

Definition 26 The log relative likelihood function is

$$r(\theta) = \log R(\theta) = \log\left[\frac{L(\theta)}{L(\hat{\theta})}\right] = l(\theta) - l(\hat{\theta}) \quad \text{for } \theta \in \Omega$$

where $l(\theta) = \log L(\theta)$ is the log likelihood function.

The plots of the relative likelihood function $R(\theta)$ and the log relative likelihood function $r(\theta)$ are both unimodal. As well, both $R(\theta)$ and $r(\theta)$ attain their maximum value at $\theta = \hat{\theta}$. (Note: $R(\hat{\theta}) = 1$ while $r(\hat{\theta}) = 0$.) The plots of $R(\theta)$ and $r(\theta)$, however, differ in terms of their shape. The plot of the relative likelihood function $R(\theta)$ (see, for example, Figure 4.3) often resembles a Gaussian probability density function in shape, while the plot of the log relative likelihood function $r(\theta)$ resembles a quadratic function of $\theta$ (see, for example, Figure 4.4).
The log relative likelihood function can also be used to compute a $100p\%$ likelihood interval since $R(\theta) \ge p$ if and only if $r(\theta) \ge \log p$. In other words, a $100p\%$ likelihood interval can also be defined as $\{\theta : r(\theta) \ge \log p\}$. For example, $\{\theta : r(\theta) \ge \log(0.1) = -2.30\}$ is a 10% likelihood interval.
The idea of a likelihood interval for a parameter $\theta$ can also be extended to the case of a vector of parameters $\boldsymbol{\theta} = (\theta_1, \theta_2, \ldots, \theta_k)$. In this case $R(\boldsymbol{\theta}) \ge p$ gives likelihood “regions” for $\boldsymbol{\theta}$.

[Figure 4.4: Log relative likelihood functions $r(\theta)$ for two polls ($n = 200$ and $n = 1000$) with different sample sizes; the horizontal line marks $r(\theta) = \log(0.1)$]

4.4 Confidence Intervals and Pivotal Quantities


Suppose we assume that the model chosen for the data $\mathbf{y}$ is correct and that the interval estimate for the parameter $\theta$ is given by $[L(\mathbf{y}), U(\mathbf{y})]$. To quantify the uncertainty in the interval estimate we look at the corresponding interval estimator $[L(\mathbf{Y}), U(\mathbf{Y})]$ and the following probability

$$P[L(\mathbf{Y}) \le \theta \le U(\mathbf{Y})] = P\{\theta \in [L(\mathbf{Y}), U(\mathbf{Y})]\} \tag{4.1}$$

The parameter $\theta$ in (4.1) is an unknown constant associated with the population. It is not a random variable and therefore does not have a distribution. The probability in (4.1) can be interpreted in the following way. Suppose we were about to draw a random sample of the same size from the same population and the true value of the parameter was $\theta$. Suppose also that we knew that we would construct an interval of the form $[L(\mathbf{y}), U(\mathbf{y})]$ once we had collected the data. Then the probability that $\theta$ will be contained in this new interval is given by (4.1). When we use the observed data $\mathbf{y}$ to construct the interval $[L(\mathbf{y}), U(\mathbf{y})]$ we note that $L(\mathbf{y})$ and $U(\mathbf{y})$ are numerical values, not random variables. Since $\theta$ is an unknown constant we do not know whether the statement $\theta \in [L(\mathbf{y}), U(\mathbf{y})]$ is true or false. How then
do we use the probability in (4.1) to construct an interval estimate? In practice, we choose random intervals $[L(\mathbf{Y}), U(\mathbf{Y})]$ for which the probability in (4.1) is fairly close to 1 (values 0.90, 0.95 and 0.99 are often used) while keeping the constructed intervals $[L(\mathbf{y}), U(\mathbf{y})]$ as narrow as possible. Such interval estimates are called confidence intervals.

Definition 27 A $100p\%$ confidence interval for a parameter $\theta$ is an interval estimate $[L(\mathbf{y}), U(\mathbf{y})]$ for which

$$P[L(\mathbf{Y}) \le \theta \le U(\mathbf{Y})] = p \tag{4.2}$$

where $p$ is called the confidence coefficient.

Suppose $p = 0.95$ and we drew a very large number of random samples from the model. Suppose also that each time we observed a random sample, we constructed a 95% confidence interval $[L(\mathbf{y}), U(\mathbf{y})]$ based on the observed data $\mathbf{y}$. Then (4.2) indicates that 95% of these constructed intervals would contain the true value of the parameter $\theta$ (and of course 5% would not). This gives us some confidence that, for a particular sample, the true value of the parameter is contained in the confidence interval constructed from the sample.
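This repeated-sampling interpretation is easy to demonstrate by simulation. As a sketch (Python, standard library only; we take $G(\mu, 1)$ data with $\sigma = 1$ known, anticipating the example below, and all names are ours):

```python
import random
from math import sqrt

random.seed(2)

mu, n, reps = 5.0, 16, 2000
covered = 0
for _ in range(reps):
    # Draw a random sample of size n from G(mu, 1) and compute ybar.
    ybar = sum(random.gauss(mu, 1.0) for _ in range(n)) / n
    # 95% confidence interval for mu when sd = 1 is known.
    lo, hi = ybar - 1.96 / sqrt(n), ybar + 1.96 / sqrt(n)
    covered += lo <= mu <= hi

# The proportion of constructed intervals containing mu should be near 0.95.
print(covered / reps)
```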

The following example illustrates that the confidence coefficient sometimes does not depend on the unknown parameter.

Example 4.4.1 Gaussian distribution with unknown mean and known standard deviation
Suppose $Y_1, Y_2, \ldots, Y_n$ is a random sample from a $G(\mu, 1)$ distribution, that is, $\mu = E(Y_i)$ is unknown but $sd(Y_i) = 1$ is known.
Consider the interval

$$\left[\bar{Y} - 1.96 n^{-1/2},\ \bar{Y} + 1.96 n^{-1/2}\right]$$

where $\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i$ is the sample mean. Since the sampling distribution of $\bar{Y}$ is $\bar{Y} \sim G(\mu, 1/\sqrt{n})$,

$$P\left(\bar{Y} - 1.96/\sqrt{n} \le \mu \le \bar{Y} + 1.96/\sqrt{n}\right) = P\left(-1.96 \le \sqrt{n}\left(\bar{Y} - \mu\right) \le 1.96\right) = P(-1.96 \le Z \le 1.96) = 0.95 \quad \text{where } Z \sim G(0, 1)$$

and the interval $[\bar{y} - 1.96/\sqrt{n},\ \bar{y} + 1.96/\sqrt{n}]$ is a 95% confidence interval for the unknown mean $\mu$. Note that the confidence coefficient did not depend on the value of the unknown parameter $\mu$.
Suppose for a particular sample of size $n = 16$ the observed mean was $\bar{y} = 10.4$; then the 95% confidence interval would be $[\bar{y} - 1.96/4,\ \bar{y} + 1.96/4]$, or $[9.91, 10.89]$. We cannot say that $P(\mu \in [9.91, 10.89]) = 0.95$. We can only say that we are 95% confident that the interval $[9.91, 10.89]$ contains $\mu$.
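The arithmetic for this interval is a one-line computation; as a quick check (Python, our own sketch):

```python
from math import sqrt

ybar, n, sigma = 10.4, 16, 1.0
half_width = 1.96 * sigma / sqrt(n)        # 1.96/4 = 0.49
lo, hi = ybar - half_width, ybar + half_width
print(round(lo, 2), round(hi, 2))          # 9.91 10.89
```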
We repeat the very important interpretation of a 95% confidence interval. Suppose the experiment which was used to estimate $\mu$ was conducted a large number of times and each time a 95% confidence interval for $\mu$ was constructed from the observed data using the interval $[\bar{y} - 1.96/\sqrt{n},\ \bar{y} + 1.96/\sqrt{n}]$. Then approximately 95% of these constructed intervals would contain the true but unknown value of the parameter $\mu$, and of course approximately 5% of them would not. Since we only have one interval $[\bar{y} - 1.96/\sqrt{n},\ \bar{y} + 1.96/\sqrt{n}]$ we do not know whether it contains the true value of $\mu$ or not. We can only say that we are 95% confident that the interval $[\bar{y} - 1.96/\sqrt{n},\ \bar{y} + 1.96/\sqrt{n}]$ contains the true value of $\mu$. In other words, we hope we were one of the “lucky” 95% who constructed an interval containing the true value of $\mu$.

Warning: $P(\mu \in [\bar{y} - 1.96/\sqrt{n},\ \bar{y} + 1.96/\sqrt{n}]) = 0.95$ is an incorrect statement!!!

Recall that the coverage probability for the interval in Example 4.4.1 did not depend on the unknown parameter $\mu$. This is a highly desirable property because we would like to know the coverage probability without knowing the value of the unknown parameter. We next consider a general method for finding confidence intervals which have this property.

Pivotal Quantities

Definition 28 A pivotal quantity $Q = Q(\mathbf{Y}; \theta)$ is a function of the data $\mathbf{Y}$ and the unknown parameter $\theta$ such that the distribution of the random variable $Q$ is fully known. That is, probability statements such as $P(Q \le b)$ and $P(Q \ge a)$ depend on $a$ and $b$ but not on $\theta$ or any other unknown information.

Example 4.4.1 Revisited: Gaussian distribution with unknown mean and known standard deviation
In Example 4.4.1 the parameter $\mu = E(Y_i)$ was unknown but the standard deviation $sd(Y_i) = 1$ was known. Since $Y_1, Y_2, \ldots, Y_n$ is a random sample from a $G(\mu, 1)$ distribution, $E(\bar{Y}) = \mu$ and $sd(\bar{Y}) = 1/\sqrt{n}$, it follows that
$$\frac{\bar{Y} - \mu}{1/\sqrt{n}} = \sqrt{n}\,(\bar{Y} - \mu) \sim G(0,1)$$
In other words, the distribution of the random variable $\sqrt{n}\,(\bar{Y} - \mu)$ is completely known and therefore $\sqrt{n}\,(\bar{Y} - \mu)$ is a pivotal quantity. In particular we know that
$$P\left(\sqrt{n}\,(\bar{Y} - \mu) \ge 1.96\right) = 0.025$$
and
$$P\left(\sqrt{n}\,(\bar{Y} - \mu) \le -1.96\right) = 0.025$$

We now describe how a pivotal quantity can be used to construct a confidence interval. We begin with the statement $P[a \le Q(\mathbf{Y}; \theta) \le b] = p$ where $Q(\mathbf{Y}; \theta)$ is a pivotal quantity whose distribution is completely known. Suppose that we can re-express the inequality $a \le Q(\mathbf{Y}; \theta) \le b$ in the form $L(\mathbf{Y}) \le \theta \le U(\mathbf{Y})$ for some functions $L$ and $U$. Then since
$$\begin{aligned}
p &= P[a \le Q(\mathbf{Y}; \theta) \le b] \\
  &= P[L(\mathbf{Y}) \le \theta \le U(\mathbf{Y})] \\
  &= P\left(\theta \in [L(\mathbf{Y}),\, U(\mathbf{Y})]\right)
\end{aligned}$$
the interval $[L(\mathbf{y}),\, U(\mathbf{y})]$ is a 100p% confidence interval for $\theta$. The confidence coefficient for the interval $[L(\mathbf{y}),\, U(\mathbf{y})]$ is equal to $p$, which does not depend on $\theta$. The confidence coefficient does depend on $a$ and $b$, but these are determined by the known distribution of $Q(\mathbf{Y}; \theta)$.

Example 4.4.2 Confidence interval for the mean of a Gaussian distribution with known standard deviation
Suppose $\mathbf{Y} = (Y_1, Y_2, \ldots, Y_n)$ is a random sample from the $G(\mu, \sigma)$ distribution where $E(Y_i) = \mu$ is unknown but $sd(Y_i) = \sigma$ is known. Since
$$Q = Q(\mathbf{Y}; \mu) = \frac{\bar{Y} - \mu}{\sigma/\sqrt{n}} \sim G(0,1) \tag{4.3}$$
and $G(0,1)$ is a completely known distribution, $Q$ is a pivotal quantity. To obtain a 95% confidence interval for $\mu$ we first note that $0.95 = P(-1.96 \le Z \le 1.96)$ where $Z \sim G(0,1)$. From (4.3) it follows that
$$\begin{aligned}
0.95 &= P\left(-1.96 \le \frac{\bar{Y} - \mu}{\sigma/\sqrt{n}} \le 1.96\right) \\
     &= P\left(\bar{Y} - 1.96\,\sigma/\sqrt{n} \le \mu \le \bar{Y} + 1.96\,\sigma/\sqrt{n}\right)
\end{aligned}$$
so that
$$\left[\bar{y} - 1.96\,\sigma/\sqrt{n},\; \bar{y} + 1.96\,\sigma/\sqrt{n}\right]$$
is a 95% confidence interval for $\mu$ based on the observed data $\mathbf{y} = (y_1, y_2, \ldots, y_n)$.
Note that if $a$ and $b$ are values such that $0.95 = P(a \le Z \le b)$ where $Z \sim G(0,1)$, then the interval $[\bar{y} - b\,\sigma/\sqrt{n},\; \bar{y} - a\,\sigma/\sqrt{n}]$ is also a 95% confidence interval for $\mu$. The interval $[\bar{y} - 1.96\,\sigma/\sqrt{n},\; \bar{y} + 1.96\,\sigma/\sqrt{n}]$ can be shown to be the narrowest possible 95% confidence interval for $\mu$.

The interval $[\bar{y} - 1.96\,\sigma/\sqrt{n},\; \bar{y} + 1.96\,\sigma/\sqrt{n}]$, or $\bar{y} \pm 1.96\,\sigma/\sqrt{n}$, is often referred to as a two-sided confidence interval. Note that this interval takes the form
$$\text{point estimate} \pm a \times \text{standard deviation of the estimator}$$
where $a$ is a value from the $G(0,1)$ table. Many two-sided confidence intervals we will encounter will take a similar form.

Exercise Show that
(a) $\bar{y} \pm 1.645\,\sigma/\sqrt{n}$ is a 90% confidence interval for $\mu$
(b) $\bar{y} \pm 2.576\,\sigma/\sqrt{n}$ is a 99% confidence interval for $\mu$.

Since $P\left(\mu \in [\bar{Y} - 1.645\,\sigma/\sqrt{n},\, \infty)\right) = 0.95$, the interval $[\bar{y} - 1.645\,\sigma/\sqrt{n},\, \infty)$ is also a 95% confidence interval for $\mu$. The interval $[\bar{y} - 1.645\,\sigma/\sqrt{n},\, \infty)$ is usually referred to as a one-sided confidence interval. This type of interval is useful when we are interested in determining a lower bound on the value of $\mu$.

Remark
It is important to understand that confidence intervals vary when we take repeated samples. In Example 4.4.2, suppose $\sigma = 2$ is known and the sample size is $n = 16$. A 95% confidence interval based on one sample with observed sample mean $\bar{y}$ is
$$\left[\bar{y} - 1.96\,(2)/\sqrt{16},\; \bar{y} + 1.96\,(2)/\sqrt{16}\right] = [\bar{y} - 0.98,\; \bar{y} + 0.98]$$
Suppose for $\mu = 8$ we generated 25 random samples of size $n = 16$ and constructed the 95% confidence interval for the mean $\mu$ for each of these 25 data sets. One such simulation gave the following 95% confidence intervals for $\mu$:
[7.021, 8.981]  [7.375, 9.335]  [7.281, 9.241]  [7.059, 9.019]  [6.615, 8.575]
[7.613, 9.573]  [6.767, 8.727]  [6.645, 8.605]  [6.974, 8.934]  [7.026, 8.986]
[6.697, 8.657]  [7.716, 9.676]  [7.696, 9.656]  [7.115, 9.075]  [7.295, 9.255]
[6.772, 8.732]  [7.662, 9.622]  [7.879, 9.839]  [6.911, 8.871]  [7.061, 9.021]
[6.291, 8.251]  [5.962, 7.922]  [7.831, 9.791]  [6.868, 8.828]  [7.271, 9.231]
In 25 generated samples we would expect approximately $(0.95)(25) = 23.75$ of the intervals to contain the true value of $\mu$. We note for this simulation that 24 of the 25, or 96%, of the generated intervals contain the value $\mu = 8$.
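The repeated-sampling behaviour described in this Remark is easy to reproduce by simulation. The sketch below is in Python rather than the R used elsewhere in these notes; the seed and the number of simulated samples are arbitrary choices.

```python
import random
import statistics

random.seed(1)
mu, sigma, n = 8.0, 2.0, 16
half_width = 1.96 * sigma / n ** 0.5   # = 0.98, as computed above

num_samples = 10000
covered = 0
for _ in range(num_samples):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    ybar = statistics.mean(sample)
    # does the interval [ybar - 0.98, ybar + 0.98] contain mu?
    if ybar - half_width <= mu <= ybar + half_width:
        covered += 1

coverage = covered / num_samples   # close to 0.95
```

Running this gives an observed coverage proportion close to 0.95, mirroring the 24-out-of-25 result above.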

Behaviour of confidence intervals as $n \to \infty$

Confidence intervals become narrower as the size of the sample on which they are based increases. For example, note the effect of the sample size $n$ in Example 4.4.2. The width of the 95% confidence interval was $2(1.96)\,\sigma/\sqrt{n}$, which decreases at the rate $1/\sqrt{n}$ as $n$ increases. We noted this behaviour before for likelihood intervals. We will see in Section 4.6 that likelihood intervals are a type of confidence interval.
It turns out that for most models it is not possible to find exact pivotal quantities or confidence intervals for $\theta$ whose coverage probabilities do not depend on the true value of $\theta$. However, in general we can find quantities $Q_n = Q_n(Y_1, Y_2, \ldots, Y_n; \theta)$ such that as $n \to \infty$, the distribution of $Q_n$ ceases to depend on $\theta$ or other unknown information. We then say that $Q_n$ is asymptotically pivotal, and in practice we treat $Q_n$ as a pivotal quantity for sufficiently large values of $n$. We call $Q_n$ an asymptotic pivotal quantity.

Asymptotic Gaussian Pivotal Quantities

Suppose $\tilde{\theta}$ is a point estimator of the unknown parameter $\theta$. Suppose also that the Central Limit Theorem can be used to obtain the result that
$$\frac{\tilde{\theta} - \theta}{g(\theta)/\sqrt{n}}$$
has approximately a $G(0,1)$ distribution for large $n$, where $E(\tilde{\theta}) = \theta$ and $sd(\tilde{\theta}) = g(\theta)/\sqrt{n}$ for some real-valued function $g(\theta)$. If we replace $\theta$ by $\tilde{\theta}$ in the denominator then it can be shown that
$$Q_n(\tilde{\theta}; \theta) = \frac{\tilde{\theta} - \theta}{g(\tilde{\theta})/\sqrt{n}}$$
also has approximately a $G(0,1)$ distribution for large $n$. (This result is proved in STAT 330.) Therefore $Q_n(\tilde{\theta}; \theta)$ is an asymptotic Gaussian pivotal quantity which can be used to construct approximate confidence intervals for $\theta$.

Example 4.4.3 Approximate confidence interval for the Binomial model

Suppose $Y \sim \text{Binomial}(n, \theta)$. The maximum likelihood estimator of $\theta$ is $\tilde{\theta} = Y/n$ with
$$E(\tilde{\theta}) = E\left(\frac{Y}{n}\right) = \theta$$
and
$$sd(\tilde{\theta}) = sd\left(\frac{Y}{n}\right) = \sqrt{\frac{\theta(1-\theta)}{n}}$$
By the Central Limit Theorem the random variable
$$\frac{\tilde{\theta} - \theta}{\sqrt{\dfrac{\theta(1-\theta)}{n}}}$$
has approximately a $G(0,1)$ distribution for large $n$.
If we replace $\theta$ in the denominator by the estimator $\tilde{\theta} = Y/n$ then, based on the previous discussion, the random variable
$$Q_n = Q_n(Y; \theta) = \frac{\tilde{\theta} - \theta}{\sqrt{\dfrac{\tilde{\theta}(1-\tilde{\theta})}{n}}}$$
has approximately a $G(0,1)$ distribution for large $n$. Therefore $Q_n$ is an asymptotic Gaussian pivotal quantity which can be used to construct confidence intervals for $\theta$.
For example, since
$$0.95 \approx P(-1.96 \le Q_n \le 1.96) = P\left(\tilde{\theta} - 1.96\sqrt{\frac{\tilde{\theta}(1-\tilde{\theta})}{n}} \le \theta \le \tilde{\theta} + 1.96\sqrt{\frac{\tilde{\theta}(1-\tilde{\theta})}{n}}\right)$$
therefore
$$\left[\hat{\theta} - 1.96\sqrt{\frac{\hat{\theta}(1-\hat{\theta})}{n}},\; \hat{\theta} + 1.96\sqrt{\frac{\hat{\theta}(1-\hat{\theta})}{n}}\right] \tag{4.4}$$
$$= \hat{\theta} \pm 1.96\sqrt{\frac{\hat{\theta}(1-\hat{\theta})}{n}} \tag{4.5}$$
is an approximate 95% confidence interval for $\theta$, where $\hat{\theta} = y/n$ and $y$ is the observed data.

Note Asymptotic Gaussian pivotal quantities exist for other models. See Problem 15 (Poisson), Problem 24 (Exponential), and Problem 17. See Table 4.3 in Section 4.8 for a summary of the approximate confidence intervals for these models.
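Interval (4.5) is straightforward to compute. A minimal sketch in Python (the notes themselves use R; the function name here is ours):

```python
def binomial_ci(y, n, z=1.96):
    """Approximate confidence interval (4.5) for theta based on the
    asymptotic Gaussian pivotal quantity; z = 1.96 gives 95%."""
    theta_hat = y / n
    half_width = z * (theta_hat * (1 - theta_hat) / n) ** 0.5
    return theta_hat - half_width, theta_hat + half_width

lo, hi = binomial_ci(40, 100)   # approximately [0.304, 0.496]
```

With $y = 40$ successes in $n = 100$ trials this reproduces the interval $[0.304, 0.496]$ used later in Example 4.6.1.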

Choosing the sample size for a Binomial experiment

We have seen that confidence intervals for a parameter tend to get narrower as the sample size $n$ increases. When designing a study we often decide how large a sample to collect on the basis of (i) how narrow we would like confidence intervals to be, and (ii) how much we can afford to spend (it costs time and money to collect data). The following example illustrates the procedure.

Example 4.4.4
Suppose we want to estimate the probability $\theta$ from a Binomial experiment in which $Y \sim \text{Binomial}(n, \theta)$. We use the asymptotic pivotal quantity
$$Q_n = \frac{\tilde{\theta} - \theta}{\sqrt{\dfrac{\tilde{\theta}(1-\tilde{\theta})}{n}}}$$
which was introduced in Example 4.4.3, and which has approximately a $G(0,1)$ distribution for large $n$, to obtain confidence intervals for $\theta$.
Here is a criterion that is widely used for choosing the size of $n$: choose $n$ large enough so that the width of a 95% confidence interval for $\theta$ is no wider than $2(0.03)$. Let us see where this leads and why this rule is used.
From Example 4.4.3, we know that
$$\left[\hat{\theta} - 1.96\sqrt{\frac{\hat{\theta}(1-\hat{\theta})}{n}},\; \hat{\theta} + 1.96\sqrt{\frac{\hat{\theta}(1-\hat{\theta})}{n}}\right] \tag{4.6}$$
is an approximate 0.95 confidence interval for $\theta$ and that the width of this interval is
$$2(1.96)\sqrt{\frac{\hat{\theta}(1-\hat{\theta})}{n}}$$
To make this confidence interval no wider than $2(0.03)$, we need $n$ large enough so that
$$1.96\sqrt{\frac{\hat{\theta}(1-\hat{\theta})}{n}} \le 0.03$$
or
$$n \ge \left(\frac{1.96}{0.03}\right)^2 \hat{\theta}(1-\hat{\theta}) \tag{4.7}$$
Of course we don't know what $\hat{\theta}$ is because we have not taken a sample, but we note that the interval (4.6) is widest when $\hat{\theta} = 0.5$. To ensure that the inequality (4.7) holds for all values of $\hat{\theta}$, we find $n$ such that
$$n \ge \left(\frac{1.96}{0.03}\right)^2 (0.5)(0.5) \approx 1067.1$$
Thus, choosing $n = 1068$ (or larger) will result in an approximate 95% confidence interval of the form $\hat{\theta} \pm c$, where $c \le 0.03$.
If you look or listen carefully when polling results are announced, you will often hear words like "this poll is accurate to within 3 percentage points 19 times out of 20." What this really means is that the estimator $\tilde{\theta}$ (which is usually reported as a percentage) approximately satisfies $P(|\tilde{\theta} - \theta| \le 0.03) = 0.95$, or equivalently, that the actual estimate $\hat{\theta}$ is the centre of an approximate 95% confidence interval $\hat{\theta} \pm c$ for which $c = 0.03$. In practice, many polls are based on 1050-1100 people, giving "accuracy to within 3 percent" with probability 0.95. Of course, one needs to be able to afford to collect a sample of this size. If we were satisfied with an accuracy of 5 percent, then we'd only need $n = 385$ (can you show this?). In many situations, however, this might not be sufficiently accurate for the purpose of the study.

Exercise Show that to ensure that the width of the approximate 95% confidence interval is $2(0.02) = 0.04$ or smaller, you need $n = 2401$. What should $n$ be to ensure the width of a 99% confidence interval is less than $2(0.02) = 0.04$?
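The sample-size calculation in this example and exercise can be packaged in a few lines of Python (a sketch; the small epsilon guards against floating-point round-up at exact-integer boundaries such as $n = 2401$):

```python
import math

def binomial_sample_size(d, z=1.96, theta=0.5):
    """Smallest n so that the half-width z*sqrt(theta(1-theta)/n) is at
    most d; theta = 0.5 gives the worst (widest) case, as in (4.7)."""
    return math.ceil((z / d) ** 2 * theta * (1 - theta) - 1e-9)

n_3pct = binomial_sample_size(0.03)              # 1068
n_5pct = binomial_sample_size(0.05)              # 385
n_2pct = binomial_sample_size(0.02)              # 2401
n_2pct_99 = binomial_sample_size(0.02, z=2.576)  # 4148, for a 99% interval
```

The last line answers the 99% part of the Exercise: with $z = 2.576$ the required sample size roughly quadruples relative to the 3% margin case.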

Remark Very large Binomial polls ($n \ge 2000$) are not done very often. Although we can in theory estimate $\theta$ very precisely with an extremely large poll, there are two problems:

1. It is difficult to pick a sample that is truly random, so $Y \sim \text{Binomial}(n, \theta)$ is only an approximation.

2. In many settings the value of $\theta$ fluctuates over time. A poll is at best a snapshot at one point in time.

As a result, the "real" accuracy of a poll cannot generally be made arbitrarily high.

Census versus a random sample

Conducting a complete census is usually costly and time-consuming. This example illustrates how a random sample, which is less expensive, can be used to obtain "good" information about the attributes of interest for a population.
Suppose interviewers are hired at $20 per hour to conduct door-to-door interviews of adults in a municipality of 50,000 households. There are two choices:

(1) conduct a census using all 50,000 households, or

(2) take a random sample of households in the municipality and then interview a member of each household.

If a random sample is used, it is estimated that each interview will take approximately 20 minutes (travel time plus interview time). If a census is used, it is estimated that each interview will take approximately 10 minutes, since there is less travel time. We can summarize the costs and precision one would obtain for one question on the form, which asks whether a person agrees/disagrees with a statement about the funding levels for higher education. Let $\theta$ be the proportion in the population who agree. Suppose we decide that a "good" estimate of $\theta$ is one that is accurate to within 2% of the true value 95% of the time.
For a census, six interviews can be completed in one hour. At $20 per hour the interviewer cost for the census is approximately
$$\frac{50000}{6} \times \$20 = \$166{,}667$$
since there are 50,000 households.
For a random sample, three interviews can be completed in one hour. An approximate 95% confidence interval for $\theta$ of the form $\hat{\theta} \pm 0.02$ requires $n = 2401$. The cost of the random sample of size $n = 2401$ is
$$\frac{2401}{3} \times \$20 \approx \$16{,}000$$
so the census costs more than ten times as much as the random sample!
Of course, we have also not compared the costs of processing 50,000 versus 2401 surveys, but it is obvious again that the random sample will be less costly and time-consuming.
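The cost comparison above is just arithmetic; a quick check in Python (dollar figures and interview rates as given in the text):

```python
rate = 20           # dollars per interviewer-hour
households = 50000

census_cost = households / 6 * rate   # six 10-minute interviews per hour
sample_cost = 2401 / 3 * rate         # three 20-minute interviews per hour

ratio = census_cost / sample_cost     # census costs over 10 times as much
```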

4.5 The Chi-squared and t Distributions


In this section we introduce two new distributions, the Chi-squared distribution and the Student t distribution. These two distributions play an important role in constructing confidence intervals and the tests of hypotheses to be discussed in Chapter 5.

The $\chi^2$ (Chi-squared) Distribution
To define the Chi-squared distribution we first recall the Gamma function and its properties:
$$\Gamma(\alpha) = \int_0^\infty y^{\alpha - 1} e^{-y}\, dy \quad \text{for } \alpha > 0$$

Properties of the Gamma function:

(1) $\Gamma(\alpha) = (\alpha - 1)\,\Gamma(\alpha - 1)$
(2) $\Gamma(\alpha) = (\alpha - 1)!$ for $\alpha = 1, 2, \ldots$
(3) $\Gamma(1/2) = \sqrt{\pi}$

The $\chi^2(k)$ distribution is a continuous family of distributions on $(0, \infty)$ with probability density function of the form
$$f(x; k) = \frac{1}{2^{k/2}\,\Gamma(k/2)}\, x^{(k/2) - 1} e^{-x/2} \quad \text{for } x > 0 \tag{4.8}$$
where $k \in \{1, 2, \ldots\}$ is a parameter of the distribution. We write $X \sim \chi^2(k)$. The parameter $k$ is referred to as the "degrees of freedom" (d.f.) parameter. In Figure 4.5 you see the characteristic shapes of the Chi-squared probability density functions. For $k = 2$, the probability density function is the Exponential(2) probability density function. For $k > 2$, the probability density function is unimodal with maximum value at $x = k - 2$. For values of $k \ge 30$, the probability density function resembles that of a $N(k, 2k)$ probability density function.
The cumulative distribution function, $F(x; k)$, can be given in closed algebraic form for even values of $k$. Probabilities for the $\chi^2(k)$ distribution are provided in the Chi-squared table at the end of these Course Notes. In R the function dchisq(x,k) gives the probability density function $f(x; k)$, pchisq(x,k) gives the cumulative distribution function $F(x; k) = P(X \le x; k)$, and qchisq(p,k) gives the value $a$ such that $P(X \le a; k) = p$.
If $X \sim \chi^2(k)$ then
$$E(X) = k \quad \text{and} \quad Var(X) = 2k$$
This result follows by first showing that
$$E\left(X^j\right) = 2^j\, \frac{\Gamma\!\left(\frac{k}{2} + j\right)}{\Gamma\!\left(\frac{k}{2}\right)} \quad \text{for } j = 1, 2, \ldots
$$

Figure 4.5: Chi-squared probability density functions for $k = 1, 2, 15, 30$

This is true since
$$\begin{aligned}
E\left(X^j\right) &= \int_0^\infty \frac{1}{2^{k/2}\,\Gamma(k/2)}\, x^{(k/2) + j - 1} e^{-x/2}\, dx \qquad \text{let } y = x/2 \text{ or } x = 2y \\
&= \int_0^\infty \frac{1}{2^{k/2}\,\Gamma(k/2)}\, (2y)^{(k/2) + j - 1} e^{-y}\, 2\, dy \\
&= \frac{2^j}{\Gamma(k/2)} \int_0^\infty y^{(k/2) + j - 1} e^{-y}\, dy \\
&= 2^j\, \frac{\Gamma\!\left(\frac{k}{2} + j\right)}{\Gamma\!\left(\frac{k}{2}\right)}
\end{aligned}$$
Letting $j = 1$ we obtain
$$E(X) = 2\, \frac{\Gamma\!\left(\frac{k}{2} + 1\right)}{\Gamma\!\left(\frac{k}{2}\right)} = 2 \cdot \frac{k}{2} = k$$
Letting $j = 2$ we obtain
$$E\left(X^2\right) = 2^2\, \frac{\Gamma\!\left(\frac{k}{2} + 2\right)}{\Gamma\!\left(\frac{k}{2}\right)} = 4\left(\frac{k}{2} + 1\right)\frac{k}{2} = k(k + 2)$$
and therefore
$$Var(X) = E\left(X^2\right) - [E(X)]^2 = k(k + 2) - k^2 = 2k$$
The following results will also be very useful.

Theorem 29 Let $W_1, W_2, \ldots, W_n$ be independent random variables with $W_i \sim \chi^2(k_i)$. Then $S = \sum\limits_{i=1}^{n} W_i \sim \chi^2\!\left(\sum\limits_{i=1}^{n} k_i\right)$.

Proof. See Problem 20.

Theorem 30 If $Z \sim G(0,1)$ then the distribution of $W = Z^2$ is $\chi^2(1)$.

Proof. Suppose $W = Z^2$ where $Z \sim G(0,1)$. Let $\Phi$ represent the cumulative distribution function of a $G(0,1)$ random variable and let $\phi$ represent the probability density function of a $G(0,1)$ random variable.
For $w > 0$
$$P(W \le w) = P(-\sqrt{w} \le Z \le \sqrt{w}) = \Phi(\sqrt{w}) - \Phi(-\sqrt{w})$$
and therefore the probability density function of $W$ is
$$\begin{aligned}
\frac{d}{dw}\, P(W \le w) &= \frac{d}{dw}\left[\Phi(\sqrt{w}) - \Phi(-\sqrt{w})\right] \\
&= \left[\phi(\sqrt{w}) + \phi(-\sqrt{w})\right] \frac{1}{2}\, w^{-1/2} \\
&= \frac{1}{\sqrt{2\pi}}\, w^{-1/2} e^{-w/2}
\end{aligned}$$
which is the probability density function of a $\chi^2(1)$ random variable, as required.

Corollary 31 If $Z_1, Z_2, \ldots, Z_n$ are mutually independent $G(0,1)$ random variables and $S = \sum\limits_{i=1}^{n} Z_i^2$, then $S \sim \chi^2(n)$.

Proof. Since $Z_i \sim G(0,1)$, by Theorem 30, $Z_i^2 \sim \chi^2(1)$, and the result follows by Theorem 29.

The following results will be useful in Chapter 5.

Useful Results:

1. If $W \sim \chi^2(1)$ then $P(W \ge w) = 2\left[1 - P(Z \le \sqrt{w})\right]$ where $Z \sim G(0,1)$.

2. If $W \sim \chi^2(2)$ then $W \sim \text{Exponential}(2)$ and $P(W \ge w) = e^{-w/2}$.


Student's t Distribution
Student's t distribution (or more simply the t distribution) has probability density function
$$f(t; k) = c_k \left(1 + \frac{t^2}{k}\right)^{-(k+1)/2} \quad \text{for } t \in \mathbb{R} \text{ and } k = 1, 2, \ldots$$
where the constant $c_k$ is given by
$$c_k = \frac{\Gamma\!\left(\frac{k+1}{2}\right)}{\sqrt{k\pi}\; \Gamma\!\left(\frac{k}{2}\right)}$$
The parameter $k$ is called the degrees of freedom. We write $T \sim t(k)$ to indicate that the random variable $T$ has a t distribution with $k$ degrees of freedom. In Figure 4.6 the probability density function $f(t; k)$ for $k = 2$ and $k = 25$ is plotted together with the $G(0,1)$ probability density function.

Figure 4.6: Probability density functions for $t(k)$ (solid blue) and $G(0,1)$ (dashed red)

The t probability density function is similar to that of the $G(0,1)$ distribution in several respects: it is symmetric about the origin, it is unimodal, and indeed for large values of $k$ the graph of the probability density function $f(t; k)$ is indistinguishable from that of the $G(0,1)$ probability density function. The primary difference, for small $k$ such as the values plotted, is in the tails of the distribution. The t probability density function has fatter "tails", that is, more area in the extreme left and right tails. Problem 21 at the end of this chapter considers some properties of $f(t; k)$.

Probabilities for the t distribution are provided in the t table at the end of these Course Notes. In R the function dt(t,k) gives the probability density function $f(t; k)$, pt(t,k) gives the cumulative distribution function $F(t; k) = P(T \le t; k)$, and qt(p,k) gives the value $a$ such that $P(T \le a; k) = p$.

The t distribution arises as a result of the following theorem. The proof of this theorem is beyond the scope of this course.

Theorem 32 Suppose $Z \sim G(0,1)$ and $U \sim \chi^2(k)$ independently. Let
$$T = \frac{Z}{\sqrt{U/k}}$$
Then $T$ has a Student's t distribution with $k$ degrees of freedom.
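Theorem 32 can be illustrated by simulation: build $U$ as a sum of $k$ squared $G(0,1)$ variables (Corollary 31) and form $T = Z/\sqrt{U/k}$. A Python sketch follows (sample size and seed are arbitrary; we also use the known fact, not proved in these notes, that $t(k)$ has mean 0 and variance $k/(k-2)$ for $k > 2$):

```python
import random
import statistics

random.seed(2)
k = 10
num = 50000
t_samples = []
for _ in range(num):
    z = random.gauss(0, 1)
    u = sum(random.gauss(0, 1) ** 2 for _ in range(k))  # chi-squared(k), Corollary 31
    t_samples.append(z / (u / k) ** 0.5)

mean_t = statistics.fmean(t_samples)     # close to 0 (symmetry about the origin)
var_t = statistics.variance(t_samples)   # close to k/(k-2) = 1.25, exceeding 1
```

The simulated variance exceeding 1 reflects the fatter tails of $t(k)$ relative to $G(0,1)$.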

4.6 Likelihood-Based Confidence Intervals

We will now show that likelihood intervals are also confidence intervals. Recall that the relative likelihood
$$R(\theta) = \frac{L(\theta)}{L(\hat{\theta})} \quad \text{for } \theta \in \Omega$$
is a function of the maximum likelihood estimate $\hat{\theta}$. Replace the estimate $\hat{\theta}$ by the random variable (the estimator) $\tilde{\theta}$ and define the random variable
$$\Lambda(\theta) = -2 \log \left[\frac{L(\theta)}{L(\tilde{\theta})}\right]$$
where $\tilde{\theta}$ is the maximum likelihood estimator. The random variable $\Lambda(\theta)$ is called the likelihood ratio statistic. The following theorem implies that $\Lambda(\theta)$ is an asymptotic pivotal quantity.

Theorem 33 If $L(\theta)$ is based on $\mathbf{Y} = (Y_1, Y_2, \ldots, Y_n)$, a random sample of size $n$, and if $\theta$ is the true value of the scalar parameter, then (under mild mathematical conditions) the distribution of $\Lambda(\theta)$ converges to a $\chi^2(1)$ distribution as $n \to \infty$.

This theorem means that $\Lambda(\theta)$ can be used as a pivotal quantity for sufficiently large $n$ in order to obtain approximate confidence intervals for $\theta$. More importantly, we can use this result to show that the likelihood intervals discussed in Section 4.3 are also approximate confidence intervals.

Theorem 34 A 100p% likelihood interval is an approximate 100q% confidence interval where $q = 2P\left(Z \le \sqrt{-2 \log p}\right) - 1$ and $Z \sim N(0,1)$.

Proof. A 100p% likelihood interval is de…ned by f ; R( ) pg which can be rewritten as


( " # )
L( )
f ; R( ) pg = : 2 log 2 log p
L(^)

By Theorem 33 the con…dence coe¢ cient for this interval can be approximated by
L( )
P[ ( ) 2 log p] = P 2 log 2 log p
L(~)
2
t P (W 2 log p) where W (1)
p
= P jZj 2 log p where Z N (0; 1)
p
= 2P Z 2 log p 1

as required.

Example If $p = 0.1$ then
$$q = 2P\left(Z \le \sqrt{-2 \log(0.1)}\right) - 1 = 2P(Z \le 2.15) - 1 = 0.96844 \quad \text{where } Z \sim G(0,1)$$
and therefore a 10% likelihood interval is an approximate 97% confidence interval.

Exercise
(a) Show that a 1% likelihood interval is an approximate 99.8% confidence interval.
(b) Show that a 50% likelihood interval is an approximate 76% confidence interval.
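Theorems 34 and 35 give a two-way conversion between likelihood levels and confidence coefficients. A short Python sketch (the notes use R; here the $G(0,1)$ cumulative distribution function is obtained from the error function):

```python
import math

def gaussian_cdf(z):
    """G(0,1) cumulative distribution function via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def confidence_of_likelihood(p):
    """Theorem 34: approximate confidence coefficient q of a
    100p% likelihood interval."""
    return 2 * gaussian_cdf(math.sqrt(-2 * math.log(p))) - 1

def likelihood_of_confidence(a):
    """Theorem 35: likelihood level whose interval is an approximate
    100p% confidence interval, where p = 2*P(Z <= a) - 1."""
    return math.exp(-a ** 2 / 2)

q10 = confidence_of_likelihood(0.10)   # about 0.968
q01 = confidence_of_likelihood(0.01)   # about 0.998
q50 = confidence_of_likelihood(0.50)   # about 0.76
p95 = likelihood_of_confidence(1.96)   # about 0.1465
```

This reproduces the Example above and both parts of the Exercise, as well as the 15% level used in the next Example.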

Theorem 33 can also be used to find a likelihood interval which is also an approximate 100p% confidence interval.

Theorem 35 If $a$ is a value such that $p = 2P(Z \le a) - 1$ where $Z \sim N(0,1)$, then the likelihood interval $\left\{\theta : R(\theta) \ge e^{-a^2/2}\right\}$ is an approximate 100p% confidence interval.

Proof. The confidence coefficient corresponding to the interval $\left\{\theta : R(\theta) \ge e^{-a^2/2}\right\}$ is
$$\begin{aligned}
P\left(\frac{L(\theta)}{L(\tilde{\theta})} \ge e^{-a^2/2}\right) &= P\left(-2 \log \left[\frac{L(\theta)}{L(\tilde{\theta})}\right] \le a^2\right) \\
&\approx P\left(W \le a^2\right) \quad \text{where } W \sim \chi^2(1), \text{ by Theorem 33} \\
&= 2P(Z \le a) - 1 \quad \text{where } Z \sim N(0,1) \\
&= p
\end{aligned}$$
as required.

Example
Since
$$0.95 = 2P(Z \le 1.96) - 1 \quad \text{where } Z \sim N(0,1)$$
and
$$e^{-(1.96)^2/2} = e^{-1.9208} \approx 0.1465 \approx 0.15$$
a 15% likelihood interval for $\theta$ is also an approximate 95% confidence interval for $\theta$.

Exercise
(a) Show that a 26% likelihood interval is an approximate 90% confidence interval.
(b) Show that a 4% likelihood interval is an approximate 99% confidence interval.

Example 4.6.1 Approximate confidence intervals for the Binomial model

For Binomial data with $n$ trials and $y$ successes the relative likelihood function is (see Example 4.3.1)
$$R(\theta) = \frac{\theta^y (1-\theta)^{n-y}}{\hat{\theta}^y (1-\hat{\theta})^{n-y}} \quad \text{for } 0 \le \theta \le 1$$
Suppose $n = 100$ and $y = 40$ so that $\hat{\theta} = 40/100 = 0.4$. From the graph of the relative likelihood function given in Figure 4.7 we can read off the 15% likelihood interval, which is $[0.31, 0.495]$; this is also an approximate 95% confidence interval.

Figure 4.7: Relative likelihood function for Binomial with $n = 100$ and $y = 40$

The approximate 95% confidence interval
$$\hat{\theta} \pm 1.96\sqrt{\frac{\hat{\theta}(1-\hat{\theta})}{n}} \tag{4.9}$$
is $[0.304, 0.496]$. The two intervals differ slightly but are very close.
If $n = 30$ and $\hat{\theta} = 0.1$ then from Figure 4.8 the 15% likelihood interval is $[0.03, 0.24]$, which is also an approximate 95% confidence interval.

Figure 4.8: Relative likelihood function for Binomial with $n = 30$ and $y = 3$

The approximate 95% confidence interval based on (4.9) is $[-0.0074, 0.2074]$, which is quite different from the likelihood-based approximate confidence interval and which also contains negative values for $\theta$. Of course $\theta$ can only take on values between 0 and 1. This happens because the confidence interval in (4.9) is always symmetric about $\hat{\theta}$, and if $\hat{\theta}$ is close to 0 or 1 and $n$ is not very large then the interval can contain values less than 0 or bigger than 1. The relative likelihood function in Figure 4.8 is not symmetric about $\hat{\theta}$. In this case the 15% likelihood interval is a better summary of the values of $\theta$ which are supported by the data.

More generally, if $\hat{\theta}$ is close to 0.5 or $n$ is large, then the likelihood interval will be fairly symmetric about $\hat{\theta}$ and there will be little difference between the two approximate confidence intervals. If $\hat{\theta}$ is close to 0 or 1 and $n$ is not large, then the likelihood interval will not be symmetric about $\hat{\theta}$ and the two approximate confidence intervals will not be similar. In this case the 15% likelihood interval will be a better summary of the values of $\theta$ which are supported by the data.
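Instead of reading the 15% likelihood interval off the graph, it can be computed numerically by solving $R(\theta) = 0.15$ on each side of $\hat{\theta}$. A Python sketch using simple bisection (function names are ours); for $n = 100$, $y = 40$ it returns endpoints near the graphical values $[0.31, 0.495]$:

```python
def rel_lik(theta, y, n):
    """Binomial relative likelihood R(theta)."""
    theta_hat = y / n
    return (theta / theta_hat) ** y * ((1 - theta) / (1 - theta_hat)) ** (n - y)

def bisect(f, lo, hi, tol=1e-10):
    """Bisection for a root of f on [lo, hi], assuming a sign change."""
    flo = f(lo)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if (f(mid) > 0) == (flo > 0):
            lo, flo = mid, f(mid)
        else:
            hi = mid
    return (lo + hi) / 2

y, n, p = 40, 100, 0.15
g = lambda theta: rel_lik(theta, y, n) - p
left = bisect(g, 1e-6, y / n)        # R rises to 1 at theta_hat = 0.4
right = bisect(g, y / n, 1 - 1e-6)   # and falls back below p beyond it
```

The same two calls with $y = 3$, $n = 30$ would give the asymmetric interval near $[0.03, 0.24]$ discussed above.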

4.7 Confidence Intervals for Parameters in the $G(\mu, \sigma)$ Model

Suppose we have a variate of interest (for example, the weight in kilograms of a female in the population of Example 1.3.2) whose distribution it is reasonable to model as a $G(\mu, \sigma)$ random variable. Suppose also that we plan to take a random sample $Y_1, Y_2, \ldots, Y_n$ to estimate the unknown mean $\mu$, where $Y_i \sim G(\mu, \sigma)$, $i = 1, 2, \ldots, n$, and that the standard deviation $\sigma$ is also unknown. Recall that the maximum likelihood estimator of $\mu$ is
$$\tilde{\mu} = \bar{Y} = \frac{1}{n} \sum_{i=1}^{n} Y_i$$
and the maximum likelihood estimator of $\sigma^2$ is
$$\tilde{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (Y_i - \bar{Y})^2$$
The sample variance
$$S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (Y_i - \bar{Y})^2$$
is also an estimator of $\sigma^2$. The two estimators of $\sigma^2$ differ only in the denominator, and indeed if $n$ is large there is very little difference between $\tilde{\sigma}^2$ and $S^2$. Note that the sample variance has the advantage that it is an unbiased estimator, that is, $E(S^2) = \sigma^2$ (see Chapter 1, Problem 18).

Confidence intervals for $\mu$

If $\sigma$ were known, then we have seen in Section 4.4 that
$$Z = \frac{\bar{Y} - \mu}{\sigma/\sqrt{n}} \sim G(0,1) \tag{4.10}$$
is a pivotal quantity that can be used to obtain confidence intervals for $\mu$. However, $\sigma$ is generally unknown. Fortunately, it turns out that if we simply replace $\sigma$ with either the maximum likelihood estimator $\tilde{\sigma}$ or the sample standard deviation $S$ in $Z$, then we still have a pivotal quantity. We will write the pivotal quantity in terms of $S$:
$$T = \frac{\bar{Y} - \mu}{S/\sqrt{n}} \tag{4.11}$$
Since $S$, unlike $\sigma$, is a random variable in (4.11), the distribution of $T$ is no longer $G(0,1)$. The random variable $T$ actually has a t distribution, which was introduced in Section 4.5.

Theorem 36 Suppose $Y_1, Y_2, \ldots, Y_n$ is a random sample from the $G(\mu, \sigma)$ distribution with sample mean $\bar{Y}$ and sample variance $S^2$. Then
$$T = \frac{\bar{Y} - \mu}{S/\sqrt{n}} \sim t(n-1) \tag{4.12}$$

Note: The random variable $T$ is a pivotal quantity since it is a function of the data $Y_1, Y_2, \ldots, Y_n$ and the unknown parameter $\mu$, and its distribution $t(n-1)$ is completely known.

To see how
$$T = \frac{\bar{Y} - \mu}{S/\sqrt{n}} \sim t(n-1)$$
follows from Theorem 32, let
$$Z = \frac{\bar{Y} - \mu}{\sigma/\sqrt{n}} \sim G(0,1)$$
and
$$U = \frac{(n-1)S^2}{\sigma^2}$$
We choose this function of $S^2$ since it can be shown that $U \sim \chi^2(n-1)$. It can also be shown that $Z$ and $U$ are independent random variables. The proofs of these very important results are beyond the scope of this course and are covered in a third year mathematical statistics course.
By Theorem 32 with $k = n - 1$, we have
$$\frac{Z}{\sqrt{U/k}} = \frac{\dfrac{\bar{Y} - \mu}{\sigma/\sqrt{n}}}{\sqrt{S^2/\sigma^2}} = \frac{\bar{Y} - \mu}{S/\sqrt{n}} \sim t(n-1)$$
In other words, if we replace $\sigma$ in the pivotal quantity (4.10) by its estimator $S$, the distribution of the resulting pivotal quantity has a $t(n-1)$ distribution rather than a $G(0,1)$ distribution. The degrees of freedom of the t distribution are determined by the degrees of freedom of the Chi-squared random variable $U$.

We now show how to use the pivotal quantity (4.12) to obtain a confidence interval for $\mu$ when $\sigma$ is unknown. Since the t distribution is symmetric, we determine the constant $a$ such that $P(-a \le T \le a) = p$ using the t table provided in these Course Notes or R. Note that, due to symmetry, $P(-a \le T \le a) = p$ is equivalent to $P(T \le a) = (1+p)/2$ (you should verify this), and since the t table tabulates the cumulative distribution function $P(T \le t)$, it is easier to find $a$ such that $P(T \le a) = (1+p)/2$. Then since
$$\begin{aligned}
p &= P(-a \le T \le a) \\
&= P\left(-a \le \frac{\bar{Y} - \mu}{S/\sqrt{n}} \le a\right) \\
&= P\left(\bar{Y} - a\,S/\sqrt{n} \le \mu \le \bar{Y} + a\,S/\sqrt{n}\right)
\end{aligned}$$
a 100p% confidence interval for $\mu$ is given by
$$\bar{y} \pm a\,s/\sqrt{n} = \left[\bar{y} - a\,s/\sqrt{n},\; \bar{y} + a\,s/\sqrt{n}\right] \tag{4.13}$$

(Note that if we attempted to use (4.10) to build a confidence interval we would have two unknowns in the inequality, since both $\mu$ and $\sigma$ are unknown.) As usual, the method used to construct this interval implies that 100p% of the confidence intervals constructed from samples drawn from this population contain the true value of $\mu$.

We note that this interval is of the form $\bar{y} \pm a\,s/\sqrt{n}$, or
$$\text{estimate} \pm a \times \text{estimated standard deviation of estimator}$$
Recall that a confidence interval for $\mu$ in the case of a $G(\mu, \sigma)$ population when $\sigma$ is known has a similar form,
$$\text{estimate} \pm a \times \text{standard deviation of estimator}$$
except that the standard deviation of the estimator is known in this case and the value of $a$ is taken from a $G(0,1)$ distribution rather than the t distribution.

Behaviour of confidence interval as $n \to \infty$

As the sample size $n$ increases, confidence intervals behave in a largely predictable fashion. First, since $E(S) \approx \sigma$ for large $n$, the sample standard deviation $s$ gets closer to the true standard deviation $\sigma$. Secondly, as the degrees of freedom $k = n - 1$ increase, the quantiles of the t distribution approach the quantiles of the $G(0,1)$ distribution. For example, in the column labeled $p = 0.975$ in the t table we notice that as the degrees of freedom increase, the quantiles approach the value 1.96, since $P(Z \le 1.96) = 0.975$. In general, for large $n$ the width of the confidence interval gets narrower as $n$ increases (at the rate $1/\sqrt{n}$) so that, in the limit, the confidence interval shrinks to include only the point $\bar{y}$.

Example 4.7.1 Study on physical activity and academic performance

Researchers at a university in a large city who were interested in studying the relationship between physical activity and academic performance were given permission to randomly select 51 Grade 7 girls attending a very large senior public school to participate in the study. Parental consent for each student was also obtained. Data on age, height, weight, IQ score, and scores on a fitness test were collected for each participant. To analyse the data on heights (in centimetres) the model $Y_i \sim G(\mu, \sigma)$, $i = 1, 2, \ldots, 51$ was assumed. The study population is all Grade 7 girls attending the large senior public school. The sample size is $n = 51$. The parameter $\mu$ represents the mean height in centimetres of the girls in the study population and the parameter $\sigma$ represents the standard deviation of the heights in centimetres of the girls in the study population. (We assume the heights are measured without error.)
For these data
$$\bar{y} = 150.1412 \qquad s = 5.3302$$

Figure 4.9: qqplot of heights for Example 4.7.1

and a qqplot of the data is given in Figure 4.9. Since the points in the qqplot lie reasonably along a straight line, with more variability at both ends (which is expected), we would conclude that a Gaussian model is reasonable for these data.
Since
$$P(T \le 2.0086) = \frac{1 + 0.95}{2} = 0.975 \quad \text{for } T \sim t(50)$$
a 95% confidence interval for $\mu$ based on (4.13) is
$$\begin{aligned}
\bar{y} \pm 2.0086\, s/\sqrt{51}
&= 150.1412 \pm (2.0086)(5.3302)/\sqrt{51} \\
&= 150.1412 \pm 1.4992 \\
&= [148.6420,\; 151.6404]
\end{aligned}$$
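The interval computation of Example 4.7.1 takes only a few lines of Python (the quantile 2.0086 is the value read from the t table as above; in R it would come from qt(0.975, 50)):

```python
import math

ybar, s, n = 150.1412, 5.3302, 51
a = 2.0086   # P(T <= a) = 0.975 for T ~ t(50), from the t table

half_width = a * s / math.sqrt(n)            # about 1.4992
ci = (ybar - half_width, ybar + half_width)  # about [148.642, 151.640]
```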

Sample size required for a given width of confidence interval for $\mu$

If we know the value of $\sigma$ approximately (possibly from previous studies), we can determine the value of $n$ needed to make a 95% confidence interval a given length. This is used in deciding how large a sample to take in a future study. A 95% confidence interval using the $G(0,1)$ quantiles takes the form $\bar{y} \pm 1.96\,\sigma/\sqrt{n}$. If we wish a 95% confidence interval of the form $\bar{y} \pm d$ (the width of the confidence interval is then $2d$), we should choose
$$1.96\,\sigma/\sqrt{n} \approx d \quad \text{or} \quad n \approx (1.96\,\sigma/d)^2$$
We would usually choose $n$ a little larger than this formula gives, to accommodate the fact that we used $G(0,1)$ quantiles rather than the quantiles of the t distribution (which are larger in value) and that we only know $\sigma$ approximately.

Confidence intervals for $\sigma^2$ and $\sigma$

Suppose that $Y_1, Y_2, \ldots, Y_n$ is a random sample from the $G(\mu, \sigma)$ distribution. We have seen that there are two closely related estimators for the population variance, $\tilde{\sigma}^2$ and the sample variance $S^2$. We use $S^2$ to build a confidence interval for the parameter $\sigma^2$. Such a construction depends on the following result.

Theorem 37 Suppose $Y_1, Y_2, \ldots, Y_n$ is a random sample from the $G(\mu, \sigma)$ distribution with sample variance $S^2$. Then
$$U = \frac{(n-1)S^2}{\sigma^2} = \frac{1}{\sigma^2} \sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} \left(\frac{Y_i - \bar{Y}}{\sigma}\right)^2 \sim \chi^2(n-1) \tag{4.14}$$

Note: The random variable $U$ is a pivotal quantity since it is a function of the data $Y_1, Y_2, \ldots, Y_n$ and the unknown parameter $\sigma^2$, and its distribution $\chi^2(n-1)$ is completely known.

While the proof of this result is beyond the scope of this course, we will try to explain
the puzzling number of degrees of freedom, n − 1, which at first glance seems wrong
since Σᵢ₌₁ⁿ (Yᵢ − Ȳ)² is the sum of n squared Normal random variables. Does this contradict
Corollary 31? It is true that each Wᵢ = Yᵢ − Ȳ is a Normally distributed random variable.
However Wᵢ does not have a N(0, 1) distribution and, more importantly, the Wᵢ's are not
independent! (See Problem 23.) One way to see that W₁, W₂, …, Wₙ are not independent
random variables is to note that Σᵢ₌₁ⁿ Wᵢ = 0, which implies Wₙ = −Σᵢ₌₁ⁿ⁻¹ Wᵢ, so the last
term can be determined from the sum of the first n − 1 terms. Therefore in the sum
Σᵢ₌₁ⁿ (Yᵢ − Ȳ)² = Σᵢ₌₁ⁿ Wᵢ² there are really only n − 1 terms that are linearly independent or
"free". This is an intuitive explanation for the n − 1 degrees of freedom for the pivotal
quantities (4.14) and (4.12). In both cases, the degrees of freedom are determined by S²
and are related to the dimension of the subspace inhabited by the terms in the sum for S²,
that is, the terms Wᵢ = Yᵢ − Ȳ, i = 1, 2, …, n.
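Theorem 37 can be checked empirically. The sketch below simulates U = (n − 1)S²/σ² many times and compares its sample mean and variance with the χ²(n − 1) values n − 1 and 2(n − 1); all settings are illustrative.

```r
# Simulate U = (n-1)S^2/sigma^2 for Gaussian samples and compare with
# the chisq(n-1) distribution, which has mean n-1 and variance 2(n-1).
set.seed(123)                      # illustrative settings throughout
n <- 15; mu <- 25; sigma <- 0.02
u <- replicate(10000, (n - 1) * var(rnorm(n, mu, sigma)) / sigma^2)
mean(u)                            # should be close to n - 1 = 14
var(u)                             # should be close to 2(n - 1) = 28
```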

We will now show how to use the pivotal quantity (4.14) to construct a 100p% confidence
interval for the parameter σ² or σ. Using the Chi-squared table or R we can find constants
a and b such that

    P(a ≤ U ≤ b) = p

where U ~ χ²(n − 1). Since

    p = P(a ≤ U ≤ b)
      = P(a ≤ (n − 1)S²/σ² ≤ b)
      = P((n − 1)S²/b ≤ σ² ≤ (n − 1)S²/a)
      = P(√[(n − 1)S²/b] ≤ σ ≤ √[(n − 1)S²/a])

a 100p% confidence interval for σ² is

    [(n − 1)s²/b, (n − 1)s²/a]    (4.15)

and a 100p% confidence interval for σ is

    [s√((n − 1)/b), s√((n − 1)/a)]    (4.16)

The choice of a and b is not unique. For convenience, a and b are usually chosen such that

    P(U ≤ a) = P(U > b) = (1 − p)/2    (4.17)

where U ~ χ²(n − 1). Note that since the Chi-squared table provided in these Course
Notes tabulates the cumulative distribution function P(U ≤ u), this means using the table
to find a and b such that

    P(U ≤ a) = (1 − p)/2   and   P(U ≤ b) = p + (1 − p)/2 = (1 + p)/2
The intervals (4.15) and (4.16) are called equal-tailed confidence intervals. The choice
(4.17) for a and b does not give the narrowest confidence interval; the narrowest interval
must be found numerically. For large n the equal-tailed interval and the narrowest interval
are nearly the same.

Note that, unlike confidence intervals for μ, the confidence interval for σ² is not
symmetric about s², the estimate of σ². This happens, of course, because the χ²(n − 1)
distribution is not a symmetric distribution.
In some applications we are interested in an upper bound on σ (because small σ is
"good" in some sense). In this case we take b = ∞ and find a such that P(a ≤ U) = p, or
P(U ≤ a) = 1 − p, so that a one-sided 100p% confidence interval for σ is

    [0, s√((n − 1)/a)]

Example 4.7.2 Optical glass

At the Clear Eye Optical Lab Company a manufacturing process produces wafer-shaped
pieces of optical glass for lenses. Pieces must be very close to 25 millimeters thick, and only
a small amount of variability around this can be tolerated. From past experience it is known
that if Y represents the thickness of a randomly selected piece of glass then it is reasonable
to assume the model Y ~ G(μ, σ). (Thicknesses are assumed to be measured without
error.) The parameter μ represents the mean and σ represents the standard deviation of
the thicknesses in millimeters of the pieces of optical glass produced by the manufacturing
process at the Clear Eye Optical Lab (the study process). For quality control purposes a
random sample of size n = 15 is drawn every eight hours to check if the process is working
properly. The values of μ and σ are estimated based on the sample to see if they are
consistent with μ = 25 and with σ being under 0.02 millimeters. On one such occasion the
observed data were

    ȳ = 25.009   and   s = 0.013

To obtain a 95% confidence interval for σ we determine a and b such that

    P(U ≤ a) = (1 − 0.95)/2 = 0.025   and   P(U ≤ b) = (1 + 0.95)/2 = 0.975

where U ~ χ²(14). From the Chi-squared table or R we obtain

    P(U ≤ 5.629) = 0.025   and   P(U ≤ 26.119) = 0.975

so a = 5.629 and b = 26.119. Substituting these values along with n − 1 = 14 and
s = 0.013 into (4.16) we obtain

    [0.013√(14/26.119), 0.013√(14/5.629)] = [0.00952, 0.0205]

as the 95% confidence interval for σ.

It seems plausible that σ ≤ 0.02, though the right endpoint of the 95% confidence
interval is very slightly over 0.02. Using P(U ≤ 6.571) = 0.05 we can obtain a one-sided
95% confidence interval for σ which is given by

    [0, s√((n − 1)/a₁)] = [0, 0.013√(14/6.571)] = [0, 0.0190]

and the value 0.02 is not in the interval.
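The calculations in Example 4.7.2 can be reproduced in R with qchisq; the sketch below repeats the same steps for the two-sided and one-sided intervals.

```r
# Two-sided and one-sided 95% confidence intervals for sigma
# in Example 4.7.2 (n = 15, s = 0.013).
n <- 15; s <- 0.013; p <- 0.95
a <- qchisq((1 - p) / 2, n - 1)                 # approx. 5.629
b <- qchisq((1 + p) / 2, n - 1)                 # approx. 26.119
c(s * sqrt((n - 1) / b), s * sqrt((n - 1) / a)) # two-sided interval
a1 <- qchisq(1 - p, n - 1)                      # approx. 6.571
c(0, s * sqrt((n - 1) / a1))                    # one-sided interval
```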

Why are the intervals different? As we have already noted, both intervals are 95%
confidence intervals. The two-sided interval is

    [s√((n − 1)/b), s√((n − 1)/a)]   where P(U ≤ a) = 0.025 = P(U ≥ b)

whereas the one-sided interval is

    [0, s√((n − 1)/a₁)]   where P(U ≤ a₁) = 0.05

Now a₁ > a since 0.05 > 0.025, so the upper endpoint of the one-sided interval is smaller
than the upper endpoint of the two-sided interval. Correspondingly, the lower endpoint of
the two-sided interval is larger than the lower endpoint of the one-sided interval. The one-
sided interval summarizes all the small values of σ which are supported by the observed
data. If we are concerned that σ is too large, then it makes sense to look at all the small
values of σ that are supported by the observed data. The two-sided interval excludes
both large and small values of σ that are not supported by the observed data.

Prediction Interval for a Future Observation

In Chapter 3 we mentioned that a common type of statistical problem is a predictive
problem, in which the experimenter wishes to predict the response of a variate for a given
unit. This is often the case in finance or in economics. For example, financial institutions
need to predict the price of a stock or interest rates in a week or a month because this
affects the value of their investments. We will now show how to do this in the case where
the Gaussian model for the data is valid.

Suppose that y₁, y₂, …, yₙ is an observed random sample from a G(μ, σ) population
and that Y is a new observation which is to be drawn at random from the same G(μ, σ)
population. We want to estimate Y and obtain an interval of values for Y. As usual we
estimate the unknown parameters μ and σ using ȳ and s respectively. The best point
estimate of Y based on the data is μ̂ = ȳ, with corresponding estimator μ̃ = Ȳ ~ G(μ, σ/√n).
To obtain an interval of values for Y we note that Y ~ G(μ, σ) independently of
μ̃ = Ȳ ~ G(μ, σ/√n). Since Y − Ȳ is a linear combination of independent Gaussian
random variables, Y − Ȳ also has a Gaussian distribution with mean

    E(Y − Ȳ) = μ − μ = 0

and variance

    Var(Y − Ȳ) = Var(Y) + Var(Ȳ) = σ² + σ²/n

Since

    (Y − Ȳ) / (σ√(1 + 1/n)) ~ G(0, 1)

independently of

    (n − 1)S²/σ² ~ χ²(n − 1)

then by Theorem 32

    [(Y − Ȳ) / (σ√(1 + 1/n))] / √(S²/σ²) = (Y − Ȳ) / (S√(1 + 1/n)) ~ t(n − 1)

is a pivotal quantity which can be used to obtain an interval of values for Y.

Let a be the value such that

    P(−a ≤ T ≤ a) = p,   or   P(T ≤ a) = (1 + p)/2,   where T ~ t(n − 1)

which is obtained from the t table or by using R. Since

    p = P(−a ≤ T ≤ a)
      = P(−a ≤ (Y − Ȳ)/(S√(1 + 1/n)) ≤ a)
      = P(Ȳ − aS√(1 + 1/n) ≤ Y ≤ Ȳ + aS√(1 + 1/n))

therefore

    [ȳ − as√(1 + 1/n), ȳ + as√(1 + 1/n)]    (4.18)

is an interval of values for the future observation Y with confidence coefficient p. The
interval (4.18) is called a 100p% prediction interval instead of a confidence interval since
Y is not a parameter but a random variable. Note that the interval (4.18) is wider than a
100p% confidence interval for the mean μ. This makes sense since μ is an unknown constant
with no variability, while Y is a random variable with its own variability Var(Y) = σ².
A 100p% prediction interval summarizes a set of values for an unknown future observation
(a random variable) based on the observed data. Confidence intervals are for unknown
but fixed parameters (not random variables). The procedure for constructing the prediction
interval is based on the probability statement

    P[L(Y₁, …, Yₙ) ≤ Y ≤ U(Y₁, …, Yₙ)] = p    (4.19)

where Y (a random variable) is the future observation and Y₁, …, Yₙ are the data from the
experiment. Suppose you conduct the experiment once and observe the data
y = (y₁, …, yₙ). The constructed interval based on the probability statement (4.19) and
the observed data y is [L(y), U(y)].

To interpret a prediction interval, suppose you conducted the same experiment inde-
pendently a large number of times and each time you constructed the interval [L(y), U(y)]
based on your observed data y. (Of course y won't be the same every time you conduct
the experiment.) Approximately 100p% of these constructed intervals would contain the
future unknown observation. Of course, you usually only conduct the experiment once and
you only have one interval [L(y), U(y)]. You would then say that you are 100p% confident
that your constructed interval contains the value of the future observation.
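This repeated-sampling interpretation is easy to check by simulation; the sketch below uses illustrative parameter values.

```r
# Coverage check for the 95% prediction interval (4.18): draw a sample,
# build the interval, then see how often it contains an independent new
# observation from the same Gaussian population.
set.seed(1)                               # illustrative settings
n <- 15; mu <- 25; sigma <- 0.02; p <- 0.95
a <- qt((1 + p) / 2, n - 1)
covered <- replicate(10000, {
  y <- rnorm(n, mu, sigma)                # observed sample
  half <- a * sd(y) * sqrt(1 + 1 / n)     # half-width of (4.18)
  ynew <- rnorm(1, mu, sigma)             # future observation
  abs(ynew - mean(y)) <= half
})
mean(covered)                             # should be close to p = 0.95
```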

Example 4.7.2 Revisited Optical glass

Suppose in Example 4.7.2 a 95% prediction interval is required for a piece of optical
glass drawn at random from the study process. Since ȳ = 25.009, s = 0.013, and

    P(T ≤ 2.1448) = (1 + 0.95)/2 = 0.975   for T ~ t(14)

a 95% prediction interval for this new piece of optical glass is given by

    25.009 ± 2.1448(0.013)√(1 + 1/15) = 25.009 ± 0.0288 = [24.9802, 25.0378]

Note that this interval is much wider than a 95% confidence interval for μ, the mean of
the population of lens thicknesses produced by this manufacturing process, which is given
by

    25.009 ± 2.1448(0.013)/√15 = 25.009 ± 0.0072 = [25.0018, 25.0162]
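The same two intervals can be computed directly in R; this sketch just repeats the arithmetic above.

```r
# 95% prediction interval for a new observation and 95% confidence
# interval for mu, using ybar = 25.009, s = 0.013, n = 15.
n <- 15; ybar <- 25.009; s <- 0.013
a <- qt(0.975, n - 1)                       # approx. 2.1448
ybar + c(-1, 1) * a * s * sqrt(1 + 1 / n)   # prediction interval for Y
ybar + c(-1, 1) * a * s / sqrt(n)           # confidence interval for mu
```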

4.8 Chapter 4 Summary

Approximate Confidence Intervals based on Likelihood Intervals

A 100p% likelihood interval is defined as {θ : R(θ) ≥ p}, where R(θ) = R(θ; y) is the
relative likelihood function for θ based on observed data y (possibly a vector). Likelihood
intervals must usually be found using a numerical method such as the uniroot function in
R.

A 100p% likelihood interval is an approximate 100q% confidence interval, where
q = P(W ≤ −2 log p) and W ~ χ²(1). (Note: q = pchisq(-2*log(p), 1) in R.)

An approximate 100p% confidence interval is given by a 100e^(−b/2)% likelihood interval,
where b is the value such that p = P(W ≤ b) and W ~ χ²(1). (Note: b = qchisq(p, 1) in
R.)

These results are derived from the fact that −2 log R(θ; Y) is an asymptotic pivotal
quantity with approximately a χ²(1) distribution.
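For example, the two conversions can be evaluated in R for a 15% likelihood interval and a 95% confidence interval:

```r
# A 15% likelihood interval is an approximate 100q% confidence interval:
q <- pchisq(-2 * log(0.15), 1)   # roughly 0.95
q
# Conversely, an approximate 95% confidence interval corresponds to a
# 100*exp(-b/2)% (roughly 15%) likelihood interval:
b <- qchisq(0.95, 1)
exp(-b / 2)
```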

Table 4.3
Approximate Confidence Intervals for Named Distributions based on Asymptotic Gaussian Pivotal Quantities

- Binomial(n, θ): observed data y; point estimate θ̂ = y/n; point estimator θ̃ = Y/n;
  asymptotic Gaussian pivotal quantity (θ̃ − θ)/√(θ̃(1 − θ̃)/n);
  approximate 100p% confidence interval θ̂ ± a√(θ̂(1 − θ̂)/n).

- Poisson(θ): observed data y₁, y₂, …, yₙ; point estimate θ̂ = ȳ; point estimator θ̃ = Ȳ;
  asymptotic Gaussian pivotal quantity (θ̃ − θ)/√(θ̃/n);
  approximate 100p% confidence interval θ̂ ± a√(θ̂/n).

- Exponential(θ): observed data y₁, y₂, …, yₙ; point estimate θ̂ = ȳ; point estimator θ̃ = Ȳ;
  asymptotic Gaussian pivotal quantity (θ̃ − θ)/(θ̃/√n);
  approximate 100p% confidence interval θ̂ ± a θ̂/√n.

Note: The value a is given by P(Z ≤ a) = (1 + p)/2 where Z ~ G(0, 1). In R, a = qnorm((1+p)/2).
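As a sketch of how a row of Table 4.3 is used, the Binomial interval can be computed for the hypothetical data y = 15 successes in n = 40 trials with p = 0.95:

```r
# Binomial row of Table 4.3: thetahat +/- a*sqrt(thetahat*(1-thetahat)/n).
# y and n are illustrative values only.
y <- 15; n <- 40; p <- 0.95
thetahat <- y / n
a <- qnorm((1 + p) / 2)                                  # approx. 1.96
thetahat + c(-1, 1) * a * sqrt(thetahat * (1 - thetahat) / n)
```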

Table 4.4
Confidence/Prediction Intervals for Gaussian and Exponential Models

- G(μ, σ), σ known; unknown quantity μ; pivotal quantity (Ȳ − μ)/(σ/√n) ~ G(0, 1);
  100p% confidence interval ȳ ± aσ/√n.

- G(μ, σ), σ unknown; unknown quantity μ; pivotal quantity (Ȳ − μ)/(S/√n) ~ t(n − 1);
  100p% confidence interval ȳ ± bs/√n.

- G(μ, σ), μ and σ unknown; future observation Y; pivotal quantity
  (Y − Ȳ)/(S√(1 + 1/n)) ~ t(n − 1); 100p% prediction interval ȳ ± bs√(1 + 1/n).

- G(μ, σ), μ unknown; unknown quantity σ²; pivotal quantity (n − 1)S²/σ² ~ χ²(n − 1);
  100p% confidence interval [(n − 1)s²/d, (n − 1)s²/c].

- G(μ, σ), μ unknown; unknown quantity σ; pivotal quantity (n − 1)S²/σ² ~ χ²(n − 1);
  100p% confidence interval [√((n − 1)s²/d), √((n − 1)s²/c)].

- Exponential(θ); unknown quantity θ; pivotal quantity 2nȲ/θ ~ χ²(2n);
  100p% confidence interval [2nȳ/d₁, 2nȳ/c₁].

Notes: (1) The value a is given by P(Z ≤ a) = (1 + p)/2 where Z ~ G(0, 1).
In R, a = qnorm((1+p)/2).
(2) The value b is given by P(T ≤ b) = (1 + p)/2 where T ~ t(n − 1). In R, b = qt((1+p)/2, n-1).
(3) The values c and d are given by P(W ≤ c) = (1 − p)/2 = P(W > d) where W ~ χ²(n − 1).
In R, c = qchisq((1-p)/2, n-1) and d = qchisq((1+p)/2, n-1).
(4) The values c₁ and d₁ are given by P(W ≤ c₁) = (1 − p)/2 = P(W > d₁) where W ~ χ²(2n).
In R, c₁ = qchisq((1-p)/2, 2*n) and d₁ = qchisq((1+p)/2, 2*n).
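Similarly, the Exponential row of Table 4.4 gives an exact interval; the sketch below uses the hypothetical values n = 20 and ȳ = 50 with p = 0.90.

```r
# Exponential row of Table 4.4: exact 90% confidence interval for theta
# based on 2n*Ybar/theta ~ chisq(2n).  n and ybar are illustrative.
n <- 20; ybar <- 50; p <- 0.90
c1 <- qchisq((1 - p) / 2, 2 * n)          # approx. 26.509
d1 <- qchisq((1 + p) / 2, 2 * n)          # approx. 55.758
c(2 * n * ybar / d1, 2 * n * ybar / c1)   # exact 90% interval for theta
```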

4.9 Chapter 4 Problems
1. The following R code produces a histogram similar to Figure 4.2.
# pop = vector of variate values for the population given in Table 4.1
pop<-c(rep(1,times=210),rep(2,times=127),rep(3,times=66),rep(4,times=39),
rep(5,times=23),rep(6,times=13),rep(7,times=11),rep(8,times=7),
rep(9,times=3),rep(10,times=1))
hist(pop,breaks=seq(1,10,1),col="cyan",main="",xlab="Variate Value")
mu<-mean(pop) # population mean
mu
(499*var(pop)/500)^0.5 # population standard deviation
k<-10000 # number of simulations
n<-15 # sample size
sim<-rep(0,k) # vector to store sample means
# Calculate k sample means for samples of size n drawn from population pop
for (i in 1:k)
sim[i]=mean(sample(pop,n,replace=F))
hist(sim,freq=F,col="cyan",xlab="Sample Mean",main="")
# percentage of times sample mean is within 0.5 of true mean mu
mean(abs(sim-mu)<0.5)

(a) Run the R code and compare the output with the answers in Example 4.2.3.
(b) Run the R code replacing n<-15 with n<-30 and compare the results with
those for n = 15.
(c) Explain how the mean, standard deviation and symmetry of the original popu-
lation affect the histogram of simulated means.
(d) Explain how the sample size n affects the histogram of simulated means.

2. R code for plotting a Binomial relative likelihood

Suppose for a Binomial experiment we observe y = 15 successes in n = 40 trials. The
following R code will plot the relative likelihood function of θ and the line R(θ) = 0.15,
which can be used to determine a 15% likelihood interval.
y<-15
n<-40
thetahat<-y/n
theta<-seq(0.15,0.65,0.001)
# points between 0.15 and 0.65 spaced 0.001 apart
Rtheta<-exp(y*log(theta/thetahat)+(n-y)*log((1-theta)/(1-thetahat)))
plot(theta,Rtheta,type="l") # plot relative likelihood function
# draw a horizontal line at 0.15

abline(a=0.15,b=0,col="red",lwd=2)
title(main="Binomial Likelihood for y=15 and n=40")

Modify this code for y = 75 successes in n = 200 trials and y = 150 successes in
n = 400 trials and observe what happens to the width of the 15% likelihood interval.

3. R code for plotting a Poisson relative likelihood

Suppose we have a sample y₁, y₂, …, yₙ from a Poisson distribution with n = 25 and
ȳ = 5. The following R code will plot the relative likelihood function of θ and the line
R(θ) = 0.15, which can be used to determine a 15% likelihood interval.

thetahat<-5
n<-25
theta<-seq(3.7,6.5,0.001)
Rtheta<-exp(n*thetahat*log(theta/thetahat)+n*(thetahat-theta))
plot(theta,Rtheta,type="l")
# draw a horizontal line at 0.15
abline(a=0.15,b=0,col="red",lwd=2)
title(main="Poisson Likelihood for ybar=5 and n=25")

Modify this code for larger sample sizes n = 100 and n = 400, and observe what
happens to the width of the 15% likelihood interval.

4. For Chapter 2, Problem 4(b) determine a 15% likelihood interval for θ. The likelihood
interval can be found from the graph of R(θ) or by using the function uniroot in R.

5. For Chapter 2, Problem 6(b) determine a 15% likelihood interval for θ. The likelihood
interval can be found from the graph of R(θ) or by using the function uniroot in R.

6. For Chapter 2, Problem 8(b) determine a 15% likelihood interval for θ. The likelihood
interval can be found from the graph of r(θ) or by using the function uniroot in R.

7. (a) For Chapter 2, Problem 9 plot the relative likelihood function R(θ) and de-
termine a 10% likelihood interval. The likelihood interval can be found from
the graph of R(θ) or by using the function uniroot in R. How well can θ be
determined based on these data?
(b) Suppose that we can find out whether each pair of twins is identical or not, and
that it is determined that of 50 pairs, 17 were identical. Obtain the likelihood
function, the maximum likelihood estimate and a 10% likelihood interval for θ
in this case. Plot the relative likelihood function on the same graph as the one
in (a), and compare how well θ can be determined based on the two data sets.

8. For Chapter 2, Problem 12(c) determine a 15% likelihood interval for θ. The likelihood
interval can be found from the graph of R(θ) or by using the function uniroot in R.

9. Suppose that a fraction θ of a large population of persons are infected with a certain
virus. Let n and k be integers. Suppose that blood samples for nk people are to be
tested to obtain information about θ. In order to save time and money, pooled testing
is used; that is, samples are mixed together k at a time to give a total of n pooled
samples. A pooled sample will test negative if all k individuals in that sample are not
infected.

(a) Find the probability that y out of n samples will be negative, if the nk people
are a random sample from the population. State any assumptions you make.
(b) Obtain a general expression for the maximum likelihood estimate of θ in terms
of n, k and y.
(c) Suppose n = 100, k = 10 and y = 89. Find the maximum likelihood estimate of
θ, and a 10% likelihood interval for θ.

10. Suppose that a fraction θ of a large population of persons over 18 years of age never
drink alcohol. In order to estimate θ, a random sample of n persons is to be selected
and the number y who do not drink determined; the maximum likelihood estimate of
θ is then θ̂ = y/n. We want our estimate θ̂ to have a high probability of being close
to θ, and want to know how large n should be to achieve this. Consider the random
variable Y and the estimator θ̃ = Y/n.

(a) Determine P(−0.03 ≤ θ̃ − θ ≤ 0.03) if n = 1000 and θ = 0.5, using the Normal
approximation to the Binomial. You do not need to use a continuity correction.
(b) If θ = 0.50, determine how large n should be to ensure that

    P(−0.03 ≤ θ̃ − θ ≤ 0.03) = P(|θ̃ − θ| ≤ 0.03) ≥ 0.95

(c) If θ is unknown, determine how large n should be to ensure that

    P(−0.03 ≤ θ̃ − θ ≤ 0.03) = P(|θ̃ − θ| ≤ 0.03) ≥ 0.95

for all θ ∈ [0, 1].

11. The following R code generates k approximate Binomial(n, θ) confidence intervals
based on the Gaussian asymptotic pivotal quantity and determines the proportion
which contain the true value of θ:
n<-30 # n = number of trials
theta<-0.25 # value of theta
k<-1000 # number of confidence intervals generated
a<-1.96 # a = 1.96 for approximate 95% confidence interval
that<-rbinom(k,n,theta)/n # vector of thetahat values for k simulations

pm<-a*(that*(1-that)/n)^0.5 # used to get confidence interval
# each confidence interval is stored in a row of matrix int
int<-matrix(c(that-pm,that+pm),nrow=k,byrow=F)
# Look at first 25 intervals to see how variable intervals are
int[1:25,1:2]
# proportion of intervals which contain the true value theta
mean(abs(theta-that)<pm)

(a) Run this code to determine the proportion of approximate 95% confidence
intervals which contain the true value.
(b) Run this code for n<-100 and n<-1000 and observe what happens to the
proportion.
(c) Run this code for theta<-0.1 and observe what happens to the proportion.

12. The following excerpt is from a March 2, 2012 cbc.ca news article:
"Canadians lead in time spent online: Canadians are spending more time online
than users in 10 other countries, a new report has found. The report, 2012 Canada
Digital Future in Focus, by the internet marketing research company comScore, found
Canadians spent an average of 45.3 hours on the internet in the fourth quarter of 2011.
The report also states that smartphones now account for 45% of all mobile phone use
by Canadians."
Assume that these results are based on a random sample of 1000 Canadians.

(a) Suppose a 95% confidence interval for μ, the mean time Canadians spent on the
internet in this quarter, is reported to be [42.8, 47.8]. How should this interval
be interpreted?
(b) Construct an approximate 95% confidence interval for the proportion of Cana-
dians whose mobile phone is a smartphone.
(c) Since this study was conducted in March 2012, the research company has been
asked to conduct a new survey to determine if the proportion of Canadians whose
mobile phone is a smartphone has changed. What size sample should be used
to ensure that the width of an approximate 95% confidence interval is less than
2(0.02)?

13. Two hundred adults are chosen at random from a population and each adult is asked
whether information about abortions should be included in high school public health
sessions. Suppose that 70% say they should.

(a) Obtain an approximate 95% confidence interval for the proportion of the pop-
ulation who support abortion information being included in high school public
health sessions.

(b) Suppose you found out that the 200 persons interviewed consisted of 50 married
couples and 100 other persons. The 50 couples were randomly selected, as were
the other 100 persons. Discuss the validity (or non-validity) of the analysis in
(a).

14. In the United States, the prevalence of HIV (Human Immunodeficiency Virus) infec-
tions in the population of child-bearing women has been estimated by doing blood
tests (anonymously) on all women giving birth in a hospital. One study tested 29,000
women and found that 64 were HIV positive (had the virus). Give an approximate
99% confidence interval for θ, the fraction of the population that is HIV positive.
State any concerns you have about the accuracy of this estimate.

15. If Y₁, Y₂, …, Yₙ is a random sample from the Poisson(θ) distribution then, by the
Central Limit Theorem and other limit theorems, the random variable

    (Ȳ − θ) / √(Ȳ/n)

has approximately a G(0, 1) distribution.

(a) Show how this asymptotic pivotal quantity leads to an approximate 95% confi-
dence interval for θ given by

    ȳ ± 1.96√(ȳ/n)

(b) Use the result from (a) to construct an approximate 95% confidence interval for
θ in Chapter 2, Problem 10.
(c) Compare the approximate 95% confidence interval for θ with a 15% likelihood
interval. What do you notice?

16. Company A leased photocopiers to the federal government, but at the end of their
recent contract the government declined to renew the arrangement and decided to
lease from a new vendor, Company B. One of the main reasons for this decision was
a perception that the reliability of Company A's machines was poor.

(a) Over the preceding year the monthly numbers of failures requiring a service call
from Company A were

    12 14 15 16 18 19 19 22 23 25 28 29

Assuming that the number of service calls needed in a one month period has
a Poisson distribution with mean θ, obtain and graph the relative likelihood
function R(θ) based on the data above.
(b) In the first year using Company B's photocopiers, the monthly numbers of service
calls were

    7 8 9 10 10 12 12 13 13 14 15 17

Under the same assumption as in part (a), obtain R(θ) for these data and graph
it on the same graph as used in (a).
(c) Determine the 15% likelihood interval for θ, which is also an approximate 95%
confidence interval for θ, for each company. The intervals can be obtained from
the graphs of the relative likelihood functions or by using the function uniroot
in R. Do you think the government's decision was a good one, as far as the
reliability of the machines is concerned?
(d) What conditions would need to be satisfied to make the assumptions and analysis
in (a) to (c) valid?
(e) Use the result from Problem 15 to determine approximate 95% confidence inter-
vals for θ for each company. Compare these intervals with the intervals obtained
in (c).

17. A manufacturing process produces fibers of varying lengths. The length of a fiber Y
is a continuous random variable with probability density function

    f(y; θ) = (y/θ²) e^(−y/θ)   for y ≥ 0 and θ > 0

where θ is an unknown parameter.

(a) If Y has probability density function f(y; θ), show that E(Y) = 2θ and
Var(Y) = 2θ². Hint: Use the Gamma function.
(b) Let y₁, y₂, …, yₙ be the lengths of n fibers selected at random. Find the maxi-
mum likelihood estimate of θ based on these data.
(c) Suppose Y₁, Y₂, …, Yₙ are independent and identically distributed random vari-
ables with probability density function f(y; θ) given above. Find E(Ȳ) and
Var(Ȳ) using the result in (a).
(d) Justify the statement

    P(−1.96 ≤ (Ȳ − 2θ)/(θ√(2/n)) ≤ 1.96) ≈ 0.95

(e) Show how you would use the statement in (d) to construct an approximate 95%
confidence interval for θ.

(f) Suppose n = 18 fibers were selected at random and the lengths were:

    6.19 7.92 1.23 8.13 4.29 1.04 3.67 9.87 10.34
    1.41 10.76 3.69 1.34 6.80 4.21 3.44 2.51 2.08

For these data Σᵢ₌₁¹⁸ yᵢ = 88.92. Give the maximum likelihood estimate of θ and
an approximate 95% confidence interval for θ using your result from (e).

18. The lifetime T (in days) of a particular type of light bulb is assumed to have a
distribution with probability density function

    f(t; θ) = (1/2) θ³ t² e^(−θt)   for t > 0 and θ > 0

(a) Suppose t₁, t₂, …, tₙ is a random sample from this distribution. Find the maxi-
mum likelihood estimate θ̂ and the relative likelihood function R(θ).
(b) If n = 20 and Σᵢ₌₁²⁰ tᵢ = 996, graph R(θ) and determine the 15% likelihood interval
for θ, which is also an approximate 95% confidence interval for θ. The interval
can be obtained from the graph of R(θ) or by using the function uniroot in R.
(c) Suppose we wish to estimate the mean lifetime of a light bulb. Show E(T) = 3/θ.
Hint: Use the Gamma function. Find an approximate 95% confidence interval
for the mean.
(d) Show that the probability p that a light bulb lasts less than 50 days is

    p = p(θ) = P(T ≤ 50; θ) = 1 − e^(−50θ) (1250θ² + 50θ + 1)

Determine the maximum likelihood estimate of p. Find an approximate 95%
confidence interval for p from the approximate 95% confidence interval for θ.
For the data referred to in (b), the number of light bulbs which lasted less than
50 days was 11 (out of 20). Using a Binomial model, obtain an approximate 95%
confidence interval for p. What are the pros and cons of the second interval over
the first one?

19. The Chi-squared distribution

(a) Use the Chi-squared table provided at the end of these Course Notes to answer
the following:
(i) If X ~ χ²(10), find P(X ≤ 2.6) and P(X > 16).
(ii) If X ~ χ²(4), find P(X > 15).
(iii) If X ~ χ²(40), find P(X ≤ 24.4) and P(X ≤ 55.8). Compare these values
with P(Y ≤ 24.4) and P(Y ≤ 55.8) if Y ~ N(40, 80).
(iv) If X ~ χ²(25), find a and b such that P(X ≤ a) = 0.025 and
P(X > b) = 0.025.
(v) If X ~ χ²(12), find a and b such that P(X ≤ a) = 0.05 and P(X > b) = 0.05.
(b) Use the R functions pchisq(x,k) and qchisq(p,k) to check the values in (a).
(c) Determine the following WITHOUT using the Chi-squared table:
(i) If X ~ χ²(1), find P(X ≤ 2) and P(X > 1.4).
(ii) If X ~ χ²(2), find P(X ≤ 2) and P(X > 3).
(d) If X ~ G(3, 2) and Yᵢ ~ Exponential(2), i = 1, 2, …, 5, all independently, then
what is the distribution of W = Σᵢ₌₁⁵ Yᵢ + ((X − 3)/2)²?
(e) If Xᵢ ~ χ²(i), i = 1, 2, …, 10, independently, then what is the distribution of
Σᵢ₌₁¹⁰ Xᵢ?

20. Properties of the Chi-squared distribution. Suppose X ~ χ²(k) with probability
density function given by

    f(x; k) = (1 / (2^(k/2) Γ(k/2))) x^((k/2)−1) e^(−x/2)   for x > 0

(a) Show that this probability density function integrates to one for k = 1, 2, …
using the properties of the Gamma function.
(b) Plot the probability density function for k = 5, k = 10 and k = 25 on the same
graph. What do you notice?
(c) Show that the moment generating function of X is given by

    M(t) = E(e^(tX)) = (1 − 2t)^(−k/2)   for t < 1/2

and use this to show that E(X) = k and Var(X) = 2k.
(d) Prove Theorem 29 using moment generating functions.

21. Student's t distribution. Suppose T ~ t(k).

(a) Plot the probability density function for k = 1, 5, 25. Plot the N(0, 1) probability
density function on the same graph. What do you notice?
(b) Show that f(t; k) is unimodal.
(c) Use Theorem 32 to show that E(T) = 0. Hint: If X and Y are independent
random variables then E[g(X)h(Y)] = E[g(X)]E[h(Y)].
(d) Use the t table provided at the end of these Course Notes to answer the following:
(i) If T ~ t(10), find P(T ≤ 0.88), P(T ≥ 0.88) and P(|T| ≤ 0.88).
(ii) If T ~ t(17), find P(|T| > 2.90).
(iii) If T ~ t(30), find P(T ≤ 2.04) and P(T ≤ 0.26). Compare these values
with P(Z ≤ 2.04) and P(Z ≤ 0.26) if Z ~ N(0, 1).
(iv) If T ~ t(18), find a and b such that P(T ≤ a) = 0.025 and P(T > b) = 0.025.
(v) If T ~ t(13), find a and b such that P(T ≤ a) = 0.05 and P(T > b) = 0.05.

22. Limiting t distribution. Suppose T ~ t(k) with probability density function

    f(t; k) = c_k (1 + t²/k)^(−(k+1)/2)   for t ∈ ℝ and k = 1, 2, …

where

    c_k = Γ((k+1)/2) / (√(kπ) Γ(k/2))

Show that

    lim_{k→∞} f(t; k) = (1/√(2π)) exp(−t²/2)   for t ∈ ℝ

which is the probability density function of the G(0, 1) distribution. Hint: You may
use the fact that lim_{k→∞} c_k = 1/√(2π), which is a property of the Gamma function.

23. Suppose Yᵢ ~ G(μ, σ), i = 1, 2, …, n, independently, and let Wᵢ = Yᵢ − Ȳ,
i = 1, 2, …, n.

(a) Show that Wᵢ, i = 1, 2, …, n, can be written as a linear combination of indepen-
dent Normal random variables.
(b) Show that E(Wᵢ) = 0 and Var(Wᵢ) = σ²(1 − 1/n), i = 1, 2, …, n. Hint: Show
Cov(Yᵢ, Ȳ) = σ²/n, i = 1, 2, …, n. Note that this result along with the result in
(a) implies that

    Wᵢ = Yᵢ − Ȳ ~ G(0, σ√(1 − 1/n)),   i = 1, 2, …, n

(c) Show that Cov(Wᵢ, Wⱼ) = −σ²/n for all i ≠ j, which implies that the Wᵢ's are
correlated random variables and therefore not independent random variables.

24. If Y₁, Y₂, …, Yₙ is a random sample from the Exponential(θ) distribution then
E(Ȳ) = θ and Var(Ȳ) = θ²/n. By the Central Limit Theorem the random variable

    (Ȳ − θ) / (θ/√n)

has approximately a G(0, 1) distribution. It also follows that

    Q = (Ȳ − θ) / (Ȳ/√n)

has approximately a G(0, 1) distribution. Show how the asymptotic pivotal quantity
Q leads to an approximate 100p% confidence interval for θ given by

    ȳ ± a ȳ/√n

where P(Z ≤ a) = (1 + p)/2 and Z ~ G(0, 1).

25. In an early study concerning survival time for patients diagnosed with Acquired Im-
mune Deficiency Syndrome (AIDS), the survival times (i.e. times between diagnosis
of AIDS and death) of 30 male patients were such that Σᵢ₌₁³⁰ yᵢ = 11,400 days. Assume
that the survival times are Exponentially distributed with mean θ days.

(a) Use the result in Problem 24 to obtain an approximate 90% confidence interval
for θ.
(b) Graph the relative likelihood function for these data and obtain an approximate
90% likelihood-based confidence interval for θ. Compare this with the interval
obtained in (a).
(c) Show that m = θ ln 2 is the median survival time. Give an approximate 90%
confidence interval for m based on your interval from (b).

26. Exact confidence intervals for θ for Exponential data

(a) If Y ~ Exponential(θ), show that W = 2Y/θ has a χ²(2) distribution.
(Hint: compare the probability density function of W with (4.8).)
(b) Suppose Y₁, Y₂, …, Yₙ is a random sample from the Exponential(θ) distribution.
Use the results of Section 4.5 to prove that

    U = (2/θ) Σᵢ₌₁ⁿ Yᵢ ~ χ²(2n)

This result implies that U is a pivotal quantity.
(c) Show how the pivotal quantity U can be used to construct an exact confidence
interval for θ.
(d) Refer to the data in Problem 25. Obtain an exact 90% confidence interval for θ
based on the pivotal quantity U. Compare this with the approximate confidence
intervals for θ obtained in Problem 25.

27. Suppose the model Yᵢ ~ G(μ, σ), i = 1, 2, …, n, independently is assumed, where μ is
a known value. Show that

    U = (1/σ²) Σᵢ₌₁ⁿ (Yᵢ − μ)²

is a pivotal quantity. Show how this pivotal quantity can be used to construct a
100p% confidence interval for σ² and σ.

28. A study on the common octopus (Octopus Vulgaris) was conducted by researchers
at the University of Vigo in Vigo, Spain. Nineteen octopi were caught in July 2008
in the Ria de Vigo (a large estuary on the northwestern coast of Spain). Several
measurements were made on each octopus including their weight in grams. These
weights are given in the table below.

680 1030 1340 1330 1260 770 830 1470 1380 1220
920 880 1020 1050 1140 960 1060 1140 860

Let yi = weight of the i0 th octopus, i = 1; 2; : : : ; 19. For these data

P
19 P
19
yi = 20340 and (yi y)2 = 884095
i=1 i=1

To analyze these data the model Yi G ( ; ) ; i = 1; 2; : : : ; 19 independently is


assumed where and are unknown parameters.

(a) Use a qqplot to determine how reasonable the Gaussian model is for these data.
(b) Describe a suitable study population for this study. The parameters μ and σ
correspond to what attributes of interest in the study population?
(c) The researchers at the University of Vigo were interested in determining whether
the octopi in the Ria de Vigo are healthy. For common octopi, a population mean
weight of 1100 grams is considered to indicate a healthy population. Determine a 95%
confidence interval for μ. What should the researchers conclude about the health
of the octopi, in terms of weight, in the Ria de Vigo?
(d) Determine a 90% confidence interval for σ based on these data.
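The summary statistics above are enough for a numerical check of parts (c) and (d). The following sketch (Python with scipy, as an alternative to the R computations used elsewhere in the notes; the variable names are ours) computes both intervals.

```python
# Numerical check of parts (c) and (d) from the summary statistics above.
from math import sqrt
from scipy.stats import t, chi2

n, total, ss = 19, 20340, 884095          # ss = sum of (y_i - ybar)^2
ybar = total / n                           # about 1070.5 g
s = sqrt(ss / (n - 1))                     # sample standard deviation

# (c) 95% confidence interval for mu: ybar +/- t_{0.975, n-1} * s / sqrt(n)
a = t.ppf(0.975, n - 1)
ci_mu = (ybar - a * s / sqrt(n), ybar + a * s / sqrt(n))

# (d) 90% confidence interval for sigma from (n-1)S^2/sigma^2 ~ chi-squared(n-1)
lo_q, hi_q = chi2.ppf(0.95, n - 1), chi2.ppf(0.05, n - 1)
ci_sigma = (sqrt(ss / lo_q), sqrt(ss / hi_q))

print(ci_mu, ci_sigma)  # 1100 g lies inside ci_mu, consistent with a healthy mean
```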

29. Consider the data on weights of adult males and females from Chapter 1. The data
are available in the file bmidata.txt posted on the course website.

(a) Determine whether it is reasonable to assume a Gaussian model for the female
weights and a different Gaussian model for the male weights.
(b) Obtain a 95% confidence interval for the mean for the females and males sepa-
rately. Does there appear to be a difference in the means for females and males?
(We will see how to test this formally in Chapter 6.)
(c) Obtain a 95% confidence interval for the standard deviation for the females and
males separately. Does there appear to be a difference in the standard deviations?

30. Sixteen packages are randomly selected from the production of a detergent packaging
machine. Let yi = weight in grams of the i'th package, i = 1, 2, ..., 16.

287 293 295 295 297 298 299 300
300 302 302 303 306 307 308 311

For these data

   Σ_{i=1}^{16} yi = 4803   and   Σ_{i=1}^{16} yi² = 1442369

To analyze these data the model Yi ~ G(μ, σ), i = 1, 2, ..., 16 independently is
assumed where μ and σ are unknown parameters.

(a) Describe a suitable study population for this study. The parameters μ and σ
correspond to what attributes of interest in the study population?
(b) Obtain 95% confidence intervals for μ and σ.
(c) Let Y represent the weight of a future, independent, randomly selected package.
Obtain a 95% prediction interval for Y.
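A quick numerical sketch of parts (b) and (c), using only the summary statistics above (Python with scipy; the variable names are ours). The prediction interval uses the standard t-based form with the extra "1 +" term for the variability of the single future observation.

```python
# Numerical check of parts (b) and (c) from the summary statistics.
from math import sqrt
from scipy.stats import t

n, total, sumsq = 16, 4803, 1442369
ybar = total / n                               # about 300.19 g
s2 = (sumsq - n * ybar**2) / (n - 1)           # sample variance
s = sqrt(s2)
a = t.ppf(0.975, n - 1)

# (b) 95% confidence interval for mu
ci_mu = (ybar - a * s / sqrt(n), ybar + a * s / sqrt(n))
# (c) 95% prediction interval for a new package Y; the "1 +" term accounts
# for the variability of the future observation itself
pi = (ybar - a * s * sqrt(1 + 1/n), ybar + a * s * sqrt(1 + 1/n))

print(ci_mu, pi)  # the prediction interval is much wider than the CI for mu
```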

31. Radon is a colourless, odourless gas that is naturally released by rocks and soils and
may concentrate in highly insulated houses. Because radon is slightly radioactive,
there is some concern that it may be a health hazard. Radon detectors are sold to
homeowners worried about this risk, but the detectors may be inaccurate. University
researchers in Waterloo purchased 12 detectors of the same type at Home Depot. The
detectors were placed in a chamber where they were exposed to 105 picocuries per
liter of radon over 3 days. The readings given by the detectors were:

91.9 97.8 111.4 122.3 105.4 95.0 103.8 99.6 96.6 119.3 104.8 101.7

Let yi = reading for the i'th detector, i = 1, 2, ..., 12. For these data

   Σ_{i=1}^{12} yi = 1249.6   and   Σ_{i=1}^{12} (yi − ȳ)² = 971.4267

To analyze these data assume the model Yi ~ G(μ, σ), i = 1, 2, ..., 12 independently
where μ and σ are unknown parameters.

(a) Describe a suitable study population for this study. The parameters μ and σ
correspond to what attributes of interest in the study population?
(b) Obtain a 95% confidence interval for μ. Does it contain the value μ = 105?
(c) Obtain a 95% confidence interval for σ.
(d) As a statistician what would you say to the university researchers about the
accuracy and precision of the detectors?
(e) University researchers purchased one more radon detector. It is to be exposed to
105 picocuries per liter of radon over 3 days. Calculate a 95% prediction interval
for the reading for this new radon detector.
(f) Suppose the researchers wanted to determine the mean level of radon detected by
the radon detectors to "within 3 picocuries per liter". As a statistician we would
interpret this as requiring that the 95% confidence interval for μ should have
width at most 6. How many detectors in total would you advise the researchers
to test?
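One way to approach part (f) is to treat s from the 12 detectors as a planning value for σ and increase n until the half-width t_{0.975, n−1} s/√n drops to 3 or below. A sketch in Python with scipy (using the pilot s as the planning value is our assumption, since the true σ is unknown):

```python
# Sketch for part (f): smallest n with 95% CI half-width at most 3,
# using s from the 12 pilot detectors as a planning value for sigma.
from math import sqrt
from scipy.stats import t

s = sqrt(971.4267 / 11)       # planning estimate of sigma from the pilot data
n = 2
while t.ppf(0.975, n - 1) * s / sqrt(n) > 3:
    n += 1
print(n)
```

A Normal-quantile shortcut, (1.96 s / 3)², suggests roughly 38 detectors; accounting for the t quantile pushes the answer slightly higher.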

32. A chemist has two ways of measuring a particular quantity; one has more random
error than the other. For method I, measurements X1, X2, ..., Xm follow a Normal
distribution with mean μ and variance σ1², whereas for method II, measurements
Y1, Y2, ..., Yn have a Normal distribution with mean μ and variance σ2².

(a) Assuming that σ1² and σ2² are known, find the combined likelihood function for
μ based on observed data x1, x2, ..., xm and y1, y2, ..., yn and show that the
maximum likelihood estimate of μ is

   μ̂ = (w1 x̄ + w2 ȳ)/(w1 + w2)

where w1 = m/σ1² and w2 = n/σ2². Why does this estimate make sense?

(b) Suppose that σ1 = 1, σ2 = 0.5 and n = m = 10. How would you rationalize
to a non-statistician why you were using the estimate (x̄ + 4ȳ)/5 instead of
(x̄ + ȳ)/2?
(c) Suppose that σ1 = 1, σ2 = 0.5 and n = m = 10. Determine the standard
deviation of the maximum likelihood estimator

   μ̃ = (w1 X̄ + w2 Ȳ)/(w1 + w2)

and of the estimator (X̄ + Ȳ)/2. Why is μ̃ a better estimator?
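For part (c), the variances follow from Var(X̄) = σ1²/m and Var(Ȳ) = σ2²/n, which give Var(μ̃) = 1/(w1 + w2). A small numerical check (Python; plain arithmetic, no simulation, with variable names of our choosing):

```python
# Part (c): compare sd of the weighted MLE mu_tilde with sd of the simple
# average (Xbar + Ybar)/2 when sigma1 = 1, sigma2 = 0.5 and m = n = 10.
from math import sqrt

sigma1, sigma2, m, n = 1.0, 0.5, 10, 10
w1, w2 = m / sigma1**2, n / sigma2**2          # w1 = 10, w2 = 40

# Var(mu_tilde) = 1/(w1 + w2) since Var(Xbar) = sigma1^2/m, Var(Ybar) = sigma2^2/n
sd_mle = 1 / sqrt(w1 + w2)
# Var((Xbar + Ybar)/2) = (sigma1^2/m + sigma2^2/n)/4
sd_avg = 0.5 * sqrt(sigma1**2 / m + sigma2**2 / n)

print(round(sd_mle, 4), round(sd_avg, 4))  # 0.1414 0.1768
```

The weighted estimator has the smaller standard deviation, which is the sense in which μ̃ is better.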



33. Challenge Problem For "two-sided" intervals based on the t distribution, we usually
pick the interval which is symmetric about ȳ. Show that this choice provides
the shortest 100p% confidence interval.

34. Challenge Problem A sequence of random variables {Xn} is said to converge in
probability to the constant c if for all ε > 0,

   lim_{n→∞} P(|Xn − c| ≥ ε) = 0

We denote this by writing Xn →p c.

(a) If {Xn} and {Yn} are two sequences of random variables with Xn →p c1 and
Yn →p c2, show that Xn + Yn →p c1 + c2 and Xn Yn →p c1 c2.
(b) Let X1, X2, ... be independent and identically distributed random variables with
probability density function f(x; θ). A point estimator θ̃n based on a random
sample X1, X2, ..., Xn is said to be consistent for θ if θ̃n →p θ as n → ∞.
(i) Let X1, X2, ..., Xn be independent and identically distributed Uniform(0, θ)
random variables. Show that θ̃n = max(X1, X2, ..., Xn) is consistent for θ.
(ii) Let X ~ Binomial(n, θ). Show that θ̃n = X/n is consistent for θ.

35. Challenge Problem Refer to the definition of consistency in Problem 34(b). Difficulties
can arise when the number of parameters increases with the amount of data.
Suppose that two independent measurements of blood sugar are taken on each of n
individuals and consider the model

   Xi1, Xi2 ~ N(μi, σ²) for i = 1, 2, ..., n

where Xi1 and Xi2 are the independent measurements. The variance σ² is to be
estimated, but the μi's are also unknown.

(a) Find the maximum likelihood estimator σ̃² and show that it is not consistent.
(b) Suggest an alternative way to estimate σ² by considering the differences
Wi = Xi1 − Xi2.
(c) What does σ represent physically if the measurements are taken very close
together in time?

36. Challenge Problem Proof of Central Limit Theorem (Special Case) Suppose
Y1, Y2, ... are independent random variables with E(Yi) = μ, Var(Yi) = σ² and that
they have the same distribution, whose moment generating function exists.

(a) Show that (Yi − μ)/σ has moment generating function of the form
1 + t²/2 + (terms in t³, t⁴, ...) and thus that (Yi − μ)/(σ√n) has moment
generating function of the form

   [1 + t²/(2n) + Rn]

where the remainder term Rn satisfies nRn → 0 as n → ∞.

(b) Let

   Zn = Σ_{i=1}^{n} (Yi − μ)/(σ√n) = √n(Ȳ − μ)/σ

and note that its moment generating function is of the form [1 + t²/(2n) + Rn]ⁿ.
Show that as n → ∞ this approaches the limit e^{t²/2}, which is the moment
generating function for a G(0, 1) random variable. (Hint: For any real number
a, (1 + a/n)ⁿ → eᵃ as n → ∞.)
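The limit in part (b) can also be seen empirically. The minimal simulation below (Python standard library only; the Exponential(1) choice, with μ = σ = 1, and all variable names are our illustrative assumptions) standardizes sample means and checks that they behave approximately like a G(0, 1) variable.

```python
# Simulation sketch of the CLT: for Exponential(1) variables (mu = 1, sigma = 1),
# Z_n = sqrt(n)*(Ybar - mu)/sigma should be approximately G(0, 1) for large n.
import math
import random
import statistics

random.seed(12345)
n, reps = 400, 4000
z = []
for _ in range(reps):
    ybar = statistics.fmean(random.expovariate(1.0) for _ in range(n))
    z.append(math.sqrt(n) * (ybar - 1.0) / 1.0)

# The sample mean of the z's should be near 0 and the sample sd near 1
print(round(statistics.fmean(z), 2), round(statistics.stdev(z), 2))
```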
5. HYPOTHESIS TESTING

5.1 Introduction
What does it mean to test a hypothesis in the light of observed data or information?
Suppose a statement has been formulated such as “I have extrasensory perception.” or
“This drug that I developed reduces pain better than those currently available.” and an
experiment is conducted to determine how credible the statement is in light of observed
data. How do we measure credibility? If there are two alternatives: “I have ESP.” and
“I do not have ESP.” should they both be considered a priori as equally plausible? If I
correctly guess the outcome on 53 of 100 tosses of a fair coin, would you conclude that
my gift is real since I was correct more than 50% of the time? If I develop a treatment
for pain in my basement laboratory using a mixture of seaweed and tofu, would you treat
the claims "this product is superior to aspirin" and "this product is no better than aspirin"
symmetrically?
To understand a test of hypothesis it is helpful to draw an analogy with the criminal
court system used in many places in the world, where the two hypotheses "the defendant is
innocent" and "the defendant is guilty" are not treated symmetrically. In these courts, the
court assumes a priori that the first hypothesis, "the defendant is innocent", is true, and
then the prosecution attempts to find sufficient evidence to show that this hypothesis of
innocence is not plausible. There is no requirement that the defendant be proved innocent.
At the end of the trial the judge or jury may conclude that there was insufficient evidence
for a finding of guilty and the defendant is then exonerated. Of course there are two types
of errors that this system can (and inevitably does) make: convict an innocent defendant or
fail to convict a guilty defendant. The two hypotheses are usually not given equal weight a
priori because these two errors have very different consequences.
A test of hypothesis is analogous to this legal example. We often begin by specifying
a single "default" hypothesis ("the defendant is innocent" in the legal context) and then
check whether the data collected are unlikely under this hypothesis. This default hypothesis
is often referred to as the "null" hypothesis and is denoted by H0 ("null" is used because it
often means a new treatment has no effect). Of course, there is an alternative hypothesis,
which may not always be specified. In many cases the alternative hypothesis is simply that
H0 is not true.
We will outline the logic of a test of hypothesis in the first example, the claim that I
have ESP. In an effort to prove or disprove this claim, an unbiased observer tosses a fair coin
100 times and before each toss I guess the outcome of the toss. We count Y, the number
of correct guesses, which we can assume has a Binomial distribution with n = 100. The
probability that I guess the outcome correctly on a given toss is an unknown parameter θ.
If I have no unusual ESP capacity at all, then we would assume θ = 0.5, whereas if I have
some form of ESP, either a positive attraction or an aversion to the correct answer, then
we expect θ ≠ 0.5. We begin by asking the following questions in this context:

(1) Which of the two possibilities, θ = 0.5 or θ ≠ 0.5, should be assigned to H0, the null
hypothesis?

(2) What observed values of Y are highly inconsistent with H0 and what observed values
of Y are compatible with H0 ?

(3) What observed values of Y would lead us to conclude that the data provide no
evidence against H0 and what observed values of Y would lead us to conclude that
the data provide strong evidence against H0?

In answer to question (1), hopefully you observed that the two hypotheses ESP and
NO ESP are not equally credible and decided that the null hypothesis should be H0: θ = 0.5,
or H0: I do not have ESP.
To answer question (2), we note that observed values of Y that are very small (e.g.
0 to 10) or very large (e.g. 90 to 100) would clearly lead us to believe that H0 is false,
whereas values near 50 are perfectly consistent with H0. This leads naturally to the concept
of a test statistic or discrepancy measure.
of a test statistic or discrepancy measure.

Definition 38 A test statistic or discrepancy measure D is a function of the data Y that is
constructed to measure the degree of "agreement" between the data Y and the null hypothesis
H0.

Usually we define D so that D = 0 represents the best possible agreement between the
data and H0, and values of D not close to 0 indicate poor agreement. A general method for
constructing test statistics will be described in Section 5.3, but in this example, it seems
natural to use D(Y) = |Y − 50|.
Question (3) could be resolved easily if we could specify a threshold value for D, or
equivalently some function of D. In the given example, the observed value of Y was y = 52
and so the observed value of D is d = |52 − 50| = 2. One might ask what is the probability,
when H0 is true, that the discrepancy measure results in a value less than d. Equivalently,
what is the probability, assuming H0 is true, that the discrepancy measure is greater than
or equal to d? In other words we want to determine P(D ≥ d; H0) where the notation
"; H0" means "assuming that H0 is true". We can compute this probability easily for this

example. If H0 is true then Y ~ Binomial(100, 0.5) and

P(D ≥ d; H0) = P(|Y − 50| ≥ |52 − 50|; H0)
             = P(|Y − 50| ≥ 2) where Y ~ Binomial(100, 0.5)
             = 1 − P(49 ≤ Y ≤ 51)
             = 1 − [C(100, 49) + C(100, 50) + C(100, 51)](0.5)^100
             ≈ 0.76

How can we interpret this value in terms of the test of H0? Roughly 76% of claimants
similarly tested for ESP, who have no abilities at all but simply guess at random, will
perform as well as or better than I did (that is, result in a value of D at least as large as the
observed value of 2). This does not prove I do not have ESP but it does indicate that we
have failed to find any evidence in these data to support rejecting H0. There is no evidence
against H0 in the observed value d = 2, and this was indicated by the high probability that,
when H0 is true, we obtain at least this much measured disagreement with H0.
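The probability above is easy to reproduce exactly; here is a minimal sketch in Python using math.comb (the notes would use R for the same sum):

```python
# Exact p-value for the ESP example: D = |Y - 50| with observed d = 2,
# and Y ~ Binomial(100, 0.5) under H0.
from math import comb

n = 100
pmf = lambda y: comb(n, y) * 0.5**n
p_value = 1 - sum(pmf(y) for y in (49, 50, 51))   # 1 - P(49 <= Y <= 51)
print(round(p_value, 2))  # 0.76
```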
We now proceed to a more formal treatment of a test of hypothesis. We will concentrate
on two types of hypotheses:

(1) the hypothesis H0: θ = θ0 where it is assumed that the data Y have arisen from a
family of distributions with probability (density) function f(y; θ) with parameter θ

(2) the hypothesis H0: Y ~ f0(y) where it is assumed that the data Y have a specified
probability (density) function f0(y).

The ESP example is an example of a type (1) hypothesis. If we wish to determine
if it is reasonable to assume a given data set is a random sample from an Exponential(1)
distribution then this is an example of a type (2) hypothesis. We will see more examples
of type (2) hypotheses in Chapter 7.
A statistical test of hypothesis proceeds as follows. First, assume that the hypothesis
H0 will be tested using some random data Y. We then adopt a test statistic or discrepancy
measure D(Y) for which, normally, large values of D are less consistent with H0. Let
d = D(y) be the corresponding observed value of D. We then calculate the p-value or
observed significance level of the test.

Definition 39 Suppose we use the test statistic D = D(Y) to test the hypothesis H0.
Suppose also that d = D(y) is the observed value of D. The p-value or observed significance
level of the test of hypothesis H0 using test statistic D is

   p-value = P(D ≥ d; H0)

In other words, the p-value is the probability (calculated assuming H0 is true) of
observing a value of the test statistic greater than or equal to the observed value of the test
statistic. If d (the observed value of D) is large, and consequently the p-value is small,
then one of the following two statements is correct:
(1) H0 is true but by chance we have observed an outcome that does not happen very often
when H0 is true, or
(2) H0 is false.
If the p-value is close to 0.05, then the event of observing a D value as unusual as or more
unusual than the one we observed happens only 5 times out of 100, that is, not very often.
Therefore we interpret a p-value close to 0.05 as indicating that the observed data are
providing evidence against H0. If the p-value is very small, for example less than 0.001,
then the event of observing a D value as unusual as or more unusual than the one we observed
happens only 1 time out of 1000, that is, very rarely. Therefore we interpret a p-value
close to 0.001 as indicating that the observed data are providing strong evidence against
H0. If the p-value is greater than 0.1, then the event of observing a D value as unusual
as or more unusual than the one we observed happens more than 1 time out of 10, that is, fairly
often, and therefore the observed data are consistent with H0.

Remarks
(1) Note that the p-value is defined as P(D ≥ d; H0) and not P(D = d; H0) even though
the event that has been observed is D = d. If D is a continuous random variable then
P(D = d; H0) is always equal to zero, which is not very useful. If D is a discrete random
variable with many possible values then P(D = d; H0) will be small, which is also not very
useful. Therefore to determine how unusual the observed result is we compare it to all the
other results which are as unusual as or more unusual than what has been observed.
(2) The p-value is NOT the probability that H0 is true. This is a common misinterpretation.

The following table gives a rough guideline for interpreting p-values. These are only
guidelines for this course. The interpretation of p-values must always be made in the
context of a given study.

p-value                      Interpretation
p-value > 0.10               No evidence against H0 based on the observed data.
0.05 < p-value ≤ 0.10        Weak evidence against H0 based on the observed data.
0.01 < p-value ≤ 0.05        Evidence against H0 based on the observed data.
0.001 < p-value ≤ 0.01       Strong evidence against H0 based on the observed data.
p-value ≤ 0.001              Very strong evidence against H0 based on the observed data.
Table 5.1: Guidelines for interpreting p-values

Example 5.1.1 Test of hypothesis for Binomial for large n

Suppose that in the ESP experiment the coin was tossed n = 200 times and I correctly
guessed 110 of the outcomes. In this case we use the test statistic D = |Y − 100| with
observed value d = |110 − 100| = 10. The p-value is

p-value = P(|Y − 100| ≥ 10) where Y ~ Binomial(200, 0.5)

which can be calculated using R or using the Normal approximation to the Binomial since
n = 200 is large. Using the Normal approximation (without a continuity correction since
it is not essential to have an exact p-value) we obtain

p-value = P(|Y − 100| ≥ 10) where Y ~ Binomial(200, 0.5)
        = P( |Y − 100| / √(200(0.5)(0.5)) ≥ 10 / √(200(0.5)(0.5)) )
        ≈ P(|Z| ≥ 1.41) where Z ~ N(0, 1)
        = 2[1 − P(Z ≤ 1.41)]
        = 2(1 − 0.92073)
        = 0.15854

so there is no evidence against the hypothesis that I was guessing.
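Both the exact Binomial p-value and the Normal approximation above can be checked with a few lines of code. The sketch below (Python standard library; the variable names are ours) shows that the exact value is slightly larger than 0.159, but the conclusion is unchanged.

```python
# Compare the exact Binomial p-value with the Normal approximation used above.
from math import comb, sqrt
from statistics import NormalDist

n = 200
# exact: P(|Y - 100| >= 10) = P(Y <= 90) + P(Y >= 110) = 2*P(Y <= 90) by symmetry
exact = 2 * sum(comb(n, y) * 0.5**n for y in range(0, 91))
# Normal approximation without continuity correction, as in the notes
approx = 2 * (1 - NormalDist().cdf(10 / sqrt(n * 0.5 * 0.5)))
print(round(exact, 3), round(approx, 3))
```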

Example 5.1.2 One-sided test of hypothesis for Binomial model

Suppose that it is suspected that a 6-sided die has been "doctored" so that the number
one turns up more often than if the die were fair. Let θ = P(die turns up one) on a single
toss and consider the hypothesis H0: θ = 1/6. To test H0, we toss the die n times and
observe the number of times Y that a one occurs. Assuming H0: θ = 1/6 is true,
Y has a Binomial(n, 1/6) distribution. If we only wanted to focus on the alternative hypothesis
θ > 1/6 then a reasonable test statistic would be D = max[(Y − n/6), 0].
Suppose that n = 180 tosses gave y = 44. Then the observed value of D is
d = max[(44 − 180/6), 0] = 14 and the p-value (calculated using R) is

p-value = P(D ≥ 14; H0)
        = P(Y ≥ 44) where Y ~ Binomial(180, 1/6)
        = Σ_{y=44}^{180} C(180, y) (1/6)^y (5/6)^{180−y}
        = 0.005

which provides strong evidence against H0, and suggests that θ is bigger than 1/6. This is
an example of a one-sided test.

Example 5.1.2 Revisited

Suppose that in the experiment in Example 5.1.2 we observed y = 35 ones in n = 180
tosses. The p-value (calculated using R) is now

p-value = P(Y ≥ 35; θ = 1/6)
        = Σ_{y=35}^{180} C(180, y) (1/6)^y (5/6)^{180−y}
        = 0.18

and this probability is not especially small. Indeed almost one die in five, though fair, would
show this level of discrepancy with H0. We conclude that there is no evidence against H0
in light of the observed data.
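Both p-values in this example can be reproduced exactly from the Binomial sum; a sketch in Python (the helper function name is ours, and the notes compute the same tail probabilities in R):

```python
# Exact one-sided p-values for both versions of Example 5.1.2,
# computed directly from the Binomial(180, 1/6) distribution.
from math import comb

def upper_tail(k, n=180, p=1/6):
    """P(Y >= k) for Y ~ Binomial(n, p)."""
    return sum(comb(n, y) * p**y * (1 - p)**(n - y) for y in range(k, n + 1))

print(round(upper_tail(44), 3))  # y = 44: strong evidence against H0
print(round(upper_tail(35), 2))  # y = 35: no evidence against H0
```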

Note that we do not claim that H0 is true, only that there is no evidence in light of the
data that it is not true. Similarly in the legal example, if we do not find evidence against
H0: "the defendant is innocent", this does not mean we have proven he or she is innocent, only
that, for the given data, the amount of evidence against H0 was insufficient to conclude
otherwise.

The approach to testing a hypothesis described above is very general and straightforward,
but a few points should be stressed:

(1) If the p-value is very small then we would conclude that there is strong evidence
against H0 in light of the observed data; this is often termed "statistically
significant" evidence against H0. We believe that statistical evidence is best measured
when we interpret p-values as in Table 5.1. However, it is still common in some areas
of research to adopt a threshold p-value such as 0.05 and "reject H0" whenever
the p-value is below this threshold. This may be necessary when there are only
two possible decisions from which to choose. For example in a trial, a person is either
convicted or acquitted of a crime. In the examples in these Course Notes we report
the p-value and use the guidelines in Table 5.1 to make a conclusion about whether
there is evidence against H0 or not. We emphasize the point that any decisions which
are made after determining the p-value for a given hypothesis H0 must be made in
the context of the empirical study.

(2) If the p-value is not small, we do not conclude that H0 is true. We simply say
there is no evidence against H0 in light of the observed data. The reason for
this "hedging" is that in most settings a hypothesis may never be strictly "true". For
example, one might argue when testing H0: θ = 1/6 in Example 5.1.2 that no real
die ever has a probability of exactly 1/6 for side 1. Hypotheses can be "disproved"
(with a small degree of possible error) but not proved.

(3) Just because there is strong evidence against a hypothesis H0 , there is no implication
about how “wrong” H0 is. A test of hypothesis should always be supplemented with
an interval estimate that indicates the magnitude of the departure from H0 .

(4) It is important to keep in mind that although we might be able to find statistically
significant evidence against a given hypothesis, this does not mean that the
differences found are of practical significance. For example, suppose an insurance
company randomly selects a large number of policies in two different years and finds a
statistically significant difference in the mean value of policies sold in those two years
of $5.21. This difference would probably not be of practical significance if the average
value of policies sold in a year was greater than $1000. Similarly, if we collect large
amounts of financial data, it is quite easy to find evidence against the hypothesis that
stock or stock index returns are Normally distributed. Nevertheless for small amounts
of data and for the pricing of options, such an assumption is usually made and
considered useful. Finally suppose we compared two cryptographic algorithms using the
number of cycles per byte as the unit of measurement. A mean difference of two
cycles per byte might be found to be statistically significant but the decision about
whether this difference is of practical importance or not is best left to a computer
scientist who studies algorithms.

(5) When the observed data provide strong evidence against the null hypothesis,
researchers often have an "alternative" hypothesis in mind. For example, suppose a
standard pain reliever provides relief in about 50% of cases and researchers at a
pharmaceutical company have developed a new pain reliever that they wish to test. The
null hypothesis is H0: P(relief) = 0.5. Suppose there is strong evidence against
H0 based on the data. The researchers will want to know in which direction that
evidence lies. If the probability of relief is greater than 0.5 the researchers might
consider adopting the drug or doing further testing, but if the probability of relief is
less than 0.5, then the pain reliever would probably be abandoned. The choice of the
discrepancy measure D is often made with a particular alternative in mind.

A drawback with the approach to testing described so far is that we do not have a
general method for choosing the test statistic or discrepancy measure D. Often there are
"intuitively obvious" test statistics that can be used; this was the case in the examples in
this section. In Section 5.3 we will see how to use the likelihood function to construct a
test statistic in more complicated situations where it is not always easy to come up with
an intuitive test statistic.
For the Gaussian model with unknown mean and standard deviation we use test statistics
based on the pivotal quantities that were used in Chapter 4 for constructing confidence
intervals.

5.2 Hypothesis Testing for Parameters in the G(μ, σ) Model

Suppose that Y ~ G(μ, σ) models a variate y in some population or process. A random
sample Y1, Y2, ..., Yn is selected, and we want to test hypotheses concerning one of the two
parameters (μ, σ). The maximum likelihood estimators of μ and σ² are

   μ̃ = Ȳ = (1/n) Σ_{i=1}^{n} Yi   and   σ̃² = (1/n) Σ_{i=1}^{n} (Yi − Ȳ)²

As usual we prefer to use the sample variance estimator

   S² = (1/(n − 1)) Σ_{i=1}^{n} (Yi − Ȳ)²

to estimate σ².
Recall from Chapter 4 that

   T = (Ȳ − μ)/(S/√n) ~ t(n − 1)

We use this pivotal quantity to construct a test of hypothesis for the parameter μ when the
standard deviation σ is unknown.

Test of Hypothesis for μ

Suppose we wish to test the hypothesis H0: μ = μ0, where μ0 is some specified value,
against the alternative hypothesis that μ ≠ μ0. Values of Ȳ which are either larger than
μ0 or smaller than μ0 provide evidence against the null hypothesis H0: μ = μ0. The test
statistic

   D = |Ȳ − μ0| / (S/√n)     (5.1)

makes intuitive sense. We obtain the p-value using the fact that

   (Ȳ − μ0)/(S/√n) ~ t(n − 1)

if H0: μ = μ0 is true. Let

   d = |ȳ − μ0| / (s/√n)     (5.2)

be the observed value of D in a sample with mean ȳ and standard deviation s. Then

p-value = P(D ≥ d; H0 is true)
        = P(|T| ≥ d) where T ~ t(n − 1)
        = 2[1 − P(T ≤ d)]

Since values of Ȳ which are larger or smaller than μ0 provide evidence against the null
hypothesis this test is called a two-sided test of hypothesis.

Example 5.2.1 Testing for bias in a measurement system

Two cheap scales A and B for measuring weight are tested by taking 10 weighings of a
one kg weight on each of the scales. The measurements on A and B are

A: 1.026 0.998 1.017 1.045 0.978 1.004 1.018 0.965 1.010 1.000
B: 1.011 0.966 0.965 0.999 0.988 0.987 0.956 0.969 0.980 0.988

Let Y represent a single measurement on one of the scales, and let μ represent the
average measurement E(Y) in repeated weighings of a single 1 kg weight. If an experiment
involving n weighings is conducted then a test of H0: μ = 1 can be based on the test
statistic (5.1) with observed value (5.2) and μ0 = 1.
The samples from scales A and B above give us

A: ȳ = 1.0061, s = 0.0230, d = 0.839
B: ȳ = 0.9810, s = 0.0170, d = 3.534

The p-value for A is

p-value = P(D ≥ 0.839; μ = 1)
        = P(|T| ≥ 0.839) where T ~ t(9)
        = 2[1 − P(T ≤ 0.839)]
        = 2(1 − 0.7884)
        ≈ 0.42

where the probability is obtained using R. Alternatively if we use the t table provided in
these notes we obtain P(T ≤ 0.5435) = 0.7 and P(T ≤ 0.8834) = 0.8 so

   0.4 = 2(1 − 0.8) < p-value < 2(1 − 0.7) = 0.6

In either case we have that the p-value > 0.1 and thus there is no evidence of bias, that
is, there is no evidence against H0: μ = 1 for scale A based on the observed data.
For scale B, however, we obtain

p-value = P(D ≥ 3.534; μ = 1)
        = P(|T| ≥ 3.534) where T ~ t(9)
        = 2[1 − P(T ≤ 3.534)]
        = 0.0064

where the probability is obtained using R. Alternatively if we use the t table we obtain
P(T ≤ 3.2498) = 0.995 and P(T ≤ 4.2968) = 0.999 so

   0.002 = 2(1 − 0.999) < p-value < 2(1 − 0.995) = 0.01

In either case we have that the p-value < 0.01 and thus there is strong evidence against
H0: μ = 1. The observed data suggest strongly that scale B is biased.

Finally, note that although there is strong evidence against H0 for scale B, the
degree of bias in its measurements is not necessarily large enough to be of practical concern.
In fact, we can obtain a 95% confidence interval for μ for scale B by using the pivotal
quantity

   T = (Ȳ − μ)/(S/√10) ~ t(9)

From the t table we have P(T ≤ 2.2622) = 0.975 and a 95% confidence interval for μ is given
by

   ȳ ± 2.2622 s/√10 = 0.981 ± 0.012

or

   [0.969, 0.993]

Evidently scale B consistently understates the weight but the bias in measuring the one kg
weight is likely fairly small (about 1% to 3%).
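As a check, the same two-sided tests of H0: μ = 1 can be run with a one-sample t-test. Here is a sketch in Python using scipy.stats, as an alternative to the R computations used in this example:

```python
# Two-sided one-sample t-tests of H0: mu = 1 for the two scales.
from scipy.stats import ttest_1samp

A = [1.026, 0.998, 1.017, 1.045, 0.978, 1.004, 1.018, 0.965, 1.010, 1.000]
B = [1.011, 0.966, 0.965, 0.999, 0.988, 0.987, 0.956, 0.969, 0.980, 0.988]

res_A = ttest_1samp(A, popmean=1)   # |t| about 0.84, p-value about 0.42
res_B = ttest_1samp(B, popmean=1)   # |t| about 3.5, p-value below 0.01
print(res_A.pvalue, res_B.pvalue)
```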

Remark The function t.test in R will give confidence intervals and test hypotheses about
μ. See Problem 3.

One-sided test of hypothesis for μ

Suppose data on the effects of a new treatment follow a G(μ, σ) distribution and that
the new treatment can either have no effect, represented by μ = μ0, or a beneficial effect,
represented by μ > μ0. In this case the null hypothesis is H0: μ = μ0 and the alternative
hypothesis is μ > μ0.
To test H0: μ = μ0 we would use the test statistic

   D = (Ȳ − μ0)/(S/√n)

so that large values of D provide evidence against H0 in the direction of the alternative
μ > μ0.
Let the observed value of D be

   d = (ȳ − μ0)/(s/√n)

Then

p-value = P(D ≥ d; H0 is true)
        = P(T ≥ d)
        = 1 − P(T ≤ d) where T ~ t(n − 1)

This is another example of a one-sided test of hypothesis.



Relationship between Hypothesis Testing and Interval Estimation

Suppose y1, y2, ..., yn is an observed random sample from the G(μ, σ) distribution. Suppose
we test H0: μ = μ0.
Now

p-value ≥ 0.05

if and only if P( |Ȳ − μ0|/(S/√n) ≥ |ȳ − μ0|/(s/√n); H0: μ = μ0 is true ) ≥ 0.05

if and only if P( |T| ≥ |ȳ − μ0|/(s/√n) ) ≥ 0.05 where T ~ t(n − 1)

if and only if P( |T| ≤ |ȳ − μ0|/(s/√n) ) ≤ 0.95

if and only if |ȳ − μ0|/(s/√n) ≤ a where P(|T| ≤ a) = 0.95

if and only if μ0 ∈ [ȳ − as/√n, ȳ + as/√n]

which is a 95% confidence interval for μ. In other words, the p-value for testing H0: μ = μ0
is greater than or equal to 0.05 if and only if the value μ = μ0 is an element of a 95% confidence
interval for μ (assuming we use the same pivotal quantity). Note that both endpoints
of the 95% confidence interval correspond to a p-value equal to 0.05 while the values inside
the 95% confidence interval will have p-values greater than 0.05.

More generally, suppose we have data y and a model f(y; θ). Suppose also that we
use the same pivotal quantity to construct the (approximate) confidence interval for θ and
to test the hypothesis H0: θ = θ0. Then the parameter value θ = θ0 is an element of
the 100q% (approximate) confidence interval for θ if and only if the p-value for testing
H0: θ = θ0 is greater than or equal to 1 − q.

Example 5.2.1 Revisited

For the weigh scale example the p-value for testing H0: μ = 1 for scale A was greater
than 0.4 and thus greater than 0.05. Therefore we know that the value μ = 1 is in a 95%
confidence interval for the mean μ. In fact the 95% confidence interval for the mean is

   ȳ ± 2.2622 s/√10 = 1.0061 ± 0.01645 = [0.9897, 1.0226]

which does indeed contain the value μ = 1.

For scale B, a 95% confidence interval for the mean was [0.969, 0.993]. Since μ = 1 is
not in this interval we know that the p-value for testing H0: μ = 1 would be less than
0.05. In fact we showed the p-value equals 0.0064 which is indeed less than 0.05.

Test of Hypothesis for σ

Suppose that we have a sample Y1, Y2, ..., Yn of independent random variables each from
the same G(μ, σ) distribution. Recall that we used the pivotal quantity

   (n − 1)S²/σ² = (1/σ²) Σ_{i=1}^{n} (Yi − Ȳ)² ~ χ²(n − 1)

to construct confidence intervals for the parameter σ. We may also wish to test a hypothesis
such as H0: σ = σ0 or equivalently H0: σ² = σ0². One approach is to use a likelihood
ratio test statistic, which is described in the next section. Alternatively we could use the
test statistic

   U = (n − 1)S²/σ0²

for testing H0: σ = σ0. Large values of U and small values of U provide evidence against
H0. (Why is this?) Now U has a Chi-squared distribution when H0 is true, and the
Chi-squared distribution is not symmetric, which makes the determination of "large" and
"small" values somewhat problematic. The following simpler calculation approximates the
p-value:

1. Let u = (n − 1)s²/σ0² denote the observed value of U from the data.

2. If u is large (that is, if P(U ≤ u) > 1/2) compute the p-value as

   p-value = 2P(U ≥ u)

   where U ~ χ²(n − 1).

3. If u is small (that is, if P(U ≤ u) < 1/2) compute the p-value as

   p-value = 2P(U ≤ u)

   where U ~ χ²(n − 1).

Figure 5.1 shows a picture for a large observed value of u. In this case P(U ≤ u) > 1/2
and the p-value = 2P(U ≥ u).

Note:
Only one of the two values 2P(U ≥ u) and 2P(U ≤ u) will be less than one, and this
value is the desired p-value.

[Figure 5.1: Picture of a large observed value u, showing the p.d.f. of U with the
areas P(U ≤ u) and P(U > u) marked]

Example 5.2.2
Suppose for the manufacturing process in Example 4.7.2, we wish to test the hypothesis
H0: σ = 0.008 (0.008 is the desired or target value of σ which the manufacturer would like
to achieve). Since the 95% confidence interval for σ was found to be [0.0095, 0.0204] which
does not contain the value σ = 0.008 we already know that the p value for a test of H0
based on the test statistic U = (n - 1)S²/σ0² will be less than 0.05.
To find the p value, we use the procedure given above:

1. u = (n - 1)s²/σ0² = (14)s²/(0.008)² = 0.002347/(0.008)² = 36.67

2. The p value is

       p value = 2P(U ≥ u) = 2P(U ≥ 36.67) = 0.0017 where U ~ χ²(14)

   where the probability was obtained using R.

Alternatively if we use the Chi-squared table provided in these Course Notes we obtain
P(U ≤ 31.319) = 0.995 so

    p value < 2(1 - 0.995) = 0.01

In either case we have that the p value < 0.01 and thus there is strong evidence based
on the observed data against H0: σ = 0.008. Both the observed value of
s = √(0.002347/14) = 0.0129 and the 95% confidence interval for σ suggest that σ is bigger
than 0.008.
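The calculation above can be reproduced without R or tables. The sketch below is in Python (the notes use R); it relies on the identity that a Chi-squared tail probability with an even number of degrees of freedom equals a Poisson tail sum.

```python
from math import exp, factorial

def chi2_sf_even(x, df):
    """P(W > x) for W ~ Chi-squared(df) with df even, using the identity
    P(Chi-squared(2k) > x) = P(Poisson(x/2) <= k - 1)."""
    lam = x / 2
    return exp(-lam) * sum(lam**j / factorial(j) for j in range(df // 2))

# Example 5.2.2: n = 15, (n - 1)s^2 = 0.002347, sigma0 = 0.008
u = 0.002347 / 0.008**2        # observed U = (n - 1)s^2 / sigma0^2
sf = chi2_sf_even(u, 14)       # P(U >= u)
p_value = 2 * min(sf, 1 - sf)  # two-sided p value
print(round(u, 2), p_value)    # 36.67 and a p value of roughly 0.0017
```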

5.3 Likelihood Ratio Test of Hypothesis - One Parameter

When a pivotal quantity exists then it is usually straightforward to construct a test of
hypothesis as we have seen in Section 5.2 for the Gaussian distribution parameters. When
a pivotal quantity does not exist then a general method for finding a test statistic with
good properties can be based on the likelihood function. In Chapter 2 we used likelihood
functions to gauge the plausibility of parameter values in the light of the observed data. It
should seem natural, then, to base a test of hypothesis on a likelihood value or, in comparing
the plausibility of two values, a ratio of the likelihood values. Let us suppose, for example,
that we are engaged in an argument over the value of a parameter in a given model (we
agree on the model but disagree on the parameter value). I claim that the parameter value
is θ0 whereas you claim it is θ1. Having some data y at hand, it would seem reasonable to
attempt to settle this argument using the ratio of the likelihood values at these two values,
that is,

    L(θ0)/L(θ1)                                                    (5.3)

As usual we define the likelihood function L(θ) = L(θ; y) = f(y; θ) where f(y; θ) is the
probability (density) function of the random variable Y representing the data. If the value
of the ratio L(θ0)/L(θ1) is much greater than one then the data support the value θ0 more
than θ1.
Let us now consider testing the plausibility of my hypothesized value θ0 against an
unspecified alternative. In this case it is natural to replace θ1 in (5.3) by the value which
appears most plausible given the data, that is, the maximum likelihood estimate θ̂. The
resulting ratio is just the value of the relative likelihood function at θ0:

    R(θ0) = L(θ0)/L(θ̂)

If R(θ0) is close to one, then θ0 is plausible in light of the observed data, but if R(θ0) is
very small and close to zero, then θ0 is not plausible in light of the observed data and this
suggests evidence against H0. Therefore the corresponding random variable, L(θ0)/L(θ̃)¹²,
appears to be a natural statistic for testing H0: θ = θ0. To determine p values we need
the sampling distribution of L(θ0)/L(θ̃) under H0. It is actually easier to use the likelihood
ratio statistic which was introduced in Chapter 4

    Λ(θ0) = -2 log[L(θ0)/L(θ̃)]                                    (5.4)

(remember log = ln) which is a one-to-one function of L(θ0)/L(θ̃). We choose this particular
function because, if H0: θ = θ0 is true, then Λ(θ0) has approximately a χ²(1) distribution.

¹² Recall that L(θ) = L(θ; y) is a function of the observed data y. Replacing y by the correspond-
ing random variable Y means that L(θ; Y) is a random variable. The random variable L(θ0)/L(θ̃) =
L(θ0; Y)/L(θ̃; Y) is a function of Y in several places including θ̃ = g(Y).

Note that small values of R(θ0) correspond to large observed values of λ(θ0) and therefore
large observed values of λ(θ0) indicate evidence against the hypothesis H0: θ = θ0. We
illustrate this in Figure 5.2. Notice that the more plausible values of the parameter θ
correspond to larger values of R(θ) or equivalently, in the bottom panel, to small values of
λ(θ) = -2 log[R(θ)]. The particular value displayed, θ0, is around 0.3 and it appears that
λ(θ0) = -2 log[R(θ0)] is quite large, in this case around 9. To know whether this is too
large to be consistent with H0, we need to compute the p value.

[Figure 5.2: Top panel: graph of the relative likelihood function R(θ), with the more
plausible values of θ where R(θ) is large and the less plausible values in the tails.
Bottom panel: λ(θ) = -2 log R(θ), with the value θ0 = 0.3 marked.
Note that λ(θ0) is relatively large when R(θ0) is small.]

To determine the p value we first calculate the observed value of Λ(θ0), denoted by
λ(θ0) and given by

    λ(θ0) = -2 log[L(θ0)/L(θ̂)] = -2 log R(θ0)

where R(θ0) is the relative likelihood function evaluated at θ = θ0. The approximate
p value is then

    p value ≈ P[W ≥ λ(θ0)]          where W ~ χ²(1)               (5.5)
            = P(|Z| ≥ √λ(θ0))       where Z ~ G(0, 1)
            = 2[1 - P(Z ≤ √λ(θ0))]

Let us summarize the construction of a test from the likelihood function. Let the random
variable (or vector of random variables) Y represent data generated from a distribution
with probability function or probability density function f(y; θ) which depends on the
scalar parameter θ. Let Ω be the parameter space (set of possible values) for θ. Consider
a hypothesis of the form

    H0: θ = θ0

where θ0 is a single point (hence of dimension 0). We can test H0 using as our test statis-
tic the likelihood ratio test statistic Λ(θ0), defined by (5.4). Then large observed values of
λ(θ0) correspond to a disagreement between the hypothesis H0: θ = θ0 and the data and
so provide evidence against H0. Moreover if H0: θ = θ0 is true, Λ(θ0) has approximately
a χ²(1) distribution so that an approximate p value is obtained from (5.5). The theory
behind the approximation is based on a result which shows that under H0, the distribution
of Λ(θ0) approaches χ²(1) as the size of the data set becomes large.
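The conversion from an observed λ(θ0) to an approximate p value in (5.5) is mechanical. Here is a minimal sketch in Python (the notes use R, where Section 5.5 gives the same value as 1-pchisq(lambda,1)); the function name lrt_p_value is ours:

```python
from math import sqrt
from statistics import NormalDist

def lrt_p_value(lam):
    """Approximate p value for an observed likelihood ratio statistic lam:
    P(W >= lam) for W ~ Chi-squared(1), computed as 2*[1 - P(Z <= sqrt(lam))]."""
    return 2 * (1 - NormalDist().cdf(sqrt(lam)))

print(lrt_p_value(9.0))    # lambda around 9, as in Figure 5.2: p value about 0.003
print(lrt_p_value(3.841))  # p value about 0.05, the usual 5% cut-off
```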

Example 5.3.1 Likelihood ratio test statistic for Binomial model

Since the relative likelihood function for the Binomial model is

    R(θ) = L(θ)/L(θ̂)
         = θ^y (1 - θ)^(n-y) / [θ̂^y (1 - θ̂)^(n-y)]
         = (θ/θ̂)^y [(1 - θ)/(1 - θ̂)]^(n-y)   for 0 ≤ θ ≤ 1

the likelihood ratio test statistic for testing the hypothesis H0: θ = θ0 is

    Λ(θ0) = -2 log[L(θ0)/L(θ̃)]
          = -2 log{(θ0/θ̃)^Y [(1 - θ0)/(1 - θ̃)]^(n-Y)}

where θ̃ = Y/n is the maximum likelihood estimator of θ. The observed value of Λ(θ0) is

    λ(θ0) = -2 log R(θ0)
          = -2 log{(θ0/θ̂)^y [(1 - θ0)/(1 - θ̂)]^(n-y)}

where θ̂ = y/n. If θ̂ is close in value to θ0 then R(θ0) will be close in value to 1 and λ(θ0)
will be close in value to 0.
Suppose we use the likelihood ratio test statistic to test H0: θ = 0.5 for the ESP
example and the data in Example 5.1.1. Since n = 200, y = 110 and θ̂ = 0.55, the observed
value of the likelihood ratio statistic for testing H0: θ = 0.5 is

    λ(0.5) = -2 log R(0.5) = -2 log[(0.5/0.55)^110 (0.5/0.45)^90]
           = -2 log(0.367)
           = 2.003

(Note that since R(0.5) = 0.367 > 0.1 then we already know that θ = 0.5 is a plausible
value of θ.) The approximate p value for testing H0: θ = 0.5 is

    p value ≈ P(W ≥ 2.003)         where W ~ χ²(1)
            = 2[1 - P(Z ≤ √2.003)] where Z ~ G(0, 1)
            = 2[1 - P(Z ≤ 1.42)] = 2(1 - 0.9222)
            = 0.1556

and there is no evidence against H0: θ = 0.5 based on the data. Note that the test statistic
D = |Y - 100| used in Example 5.1.1 and the likelihood ratio test statistic λ(0.5) give
nearly identical results. This is because n = 200 is large.
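The whole ESP calculation can be scripted. A sketch in Python (the notes use R); the helper name binomial_lrt is ours, not from the notes:

```python
from math import log, sqrt
from statistics import NormalDist

def binomial_lrt(y, n, theta0):
    """Observed likelihood ratio statistic and approximate p value for
    H0: theta = theta0 in the Binomial(n, theta) model."""
    theta_hat = y / n
    lam = -2 * (y * log(theta0 / theta_hat)
                + (n - y) * log((1 - theta0) / (1 - theta_hat)))
    p_value = 2 * (1 - NormalDist().cdf(sqrt(lam)))  # Chi-squared(1) approximation
    return lam, p_value

lam, p = binomial_lrt(110, 200, 0.5)  # ESP data of Example 5.1.1
print(round(lam, 3), round(p, 3))     # about 2.003 and 0.157
```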

Example 5.3.2 Likelihood ratio test statistic for Exponential model

Suppose y1, y2, ..., yn is an observed random sample from the Exponential(θ) distribu-
tion. The likelihood function (see Example 2.3.1) is

    L(θ) = θ^(-n) exp(-(1/θ) Σ_{i=1}^{n} yi) = θ^(-n) e^(-nȳ/θ)   for θ > 0

Since the maximum likelihood estimate is θ̂ = ȳ, the relative likelihood function can be
written as

    R(θ) = L(θ)/L(θ̂)
         = θ^(-n) e^(-nȳ/θ) / [θ̂^(-n) e^(-nȳ/θ̂)]
         = (θ̂/θ)^n e^(n(1 - θ̂/θ))   for θ > 0

The likelihood ratio test statistic for testing H0: θ = θ0 is

    Λ(θ0) = -2 log[L(θ0)/L(θ̃)]
          = -2 log[(θ̃/θ0)^n e^(n(1 - θ̃/θ0))]

where θ̃ = Ȳ and the observed value of Λ(θ0) is

    λ(θ0) = -2 log R(θ0)
          = -2 log[(θ̂/θ0)^n e^(n(1 - θ̂/θ0))]

If θ̂ is close in value to θ0 then R(θ0) will be close in value to 1 and λ(θ0) will be close in
value to 0.
The variability in lifetimes of light bulbs (in hours, say, of operation before failure) is
often well described by an Exponential(θ) distribution where θ = E(Y) > 0 is the average
(mean) lifetime. A manufacturer claims that the mean lifetime of a particular brand of
bulbs is 2000 hours. We can examine this claim by testing the hypothesis H0: θ = 2000.
Suppose a random sample of n = 50 light bulbs was tested over a long period and that the
observed lifetimes were:

     572  2732  1363   716   231    83  1206  3952  3804  2713
     347  2739   411  2825   147  2100  3253  2764   969  1496
    2090   371  1071  1197   173  2505   556   565  1933  1132
    5158  5839  1267   499   137  4082  1128  1513  8862  2175
    3638   461  2335  1275  3596  1015  2671   849   744   580

with Σ_{i=1}^{50} yi = 93840. For these data the maximum likelihood estimate of θ is
θ̂ = ȳ = 93840/50 = 1876.8. To check whether the Exponential model is reasonable
for these data we plot the empirical cumulative distribution function for these data and
then superimpose the cumulative distribution function for an Exponential(1876.8) random
variable. See Figure 5.3. Since the agreement between the empirical cumulative distribution
function and the Exponential(1876.8) cumulative distribution function is quite good we
assume the Exponential model to test the hypothesis that the mean lifetime of the light
bulbs is 2000 hours. The observed value of the likelihood ratio test statistic for testing
H0: θ = 2000 is

    λ(2000) = -2 log R(2000)
            = -2 log[(1876.8/2000)^50 e^(50(1 - 1876.8/2000))]
            = -2 log(0.9058)
            = 0.1979

[Figure 5.3: Empirical c.d.f. of the light bulb lifetimes with the Exponential(1876.8)
c.d.f. superimposed]

(Note that since R(2000) = 0.9058 > 0.1 then we already know that θ = 2000 is a plausible
value of θ.) The approximate p value for testing H0: θ = 2000 is

    p value ≈ P(W ≥ 0.1979)         where W ~ χ²(1)
            = 2[1 - P(Z ≤ √0.1979)] where Z ~ G(0, 1)
            = 2[1 - P(Z ≤ 0.44)] = 2(1 - 0.67003)
            = 0.65994

and there is no evidence against H0: θ = 2000 based on the data. Therefore there is no
evidence against the manufacturer's claim that θ is 2000 hours based on the data. Although
the maximum likelihood estimate θ̂ was under 2000 hours (1876.8) it was not sufficiently
under to give evidence against H0: θ = 2000.
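The light bulb calculation can likewise be scripted. A sketch in Python (the notes use R); the helper name exponential_lrt is ours:

```python
from math import log, sqrt
from statistics import NormalDist

def exponential_lrt(ybar, n, theta0):
    """Observed likelihood ratio statistic and approximate p value for
    H0: theta = theta0 with a sample of size n from Exponential(theta)."""
    lam = -2 * (n * log(ybar / theta0) + n * (1 - ybar / theta0))
    p_value = 2 * (1 - NormalDist().cdf(sqrt(lam)))  # Chi-squared(1) approximation
    return lam, p_value

lam, p = exponential_lrt(93840 / 50, 50, 2000)  # light bulb data
print(round(lam, 3), round(p, 2))               # about 0.198 and 0.66
```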

Example 5.3.3 Likelihood ratio test of hypothesis for μ for G(μ, σ), σ known

Suppose Y ~ G(μ, σ) with probability density function

    f(y; μ, σ) = (1/(σ√(2π))) exp[-(1/(2σ²))(y - μ)²]   for y ∈ ℝ

Suppose the standard deviation σ has a known value and the only unknown parameter is
μ. From the results in Example 2.3.2, we have that the likelihood function based on the
observed sample y1, y2, ..., yn is

    L(μ) = σ^(-n) exp[-(1/(2σ²)) Σ_{i=1}^{n} (yi - ȳ)²] exp[-n(ȳ - μ)²/(2σ²)]

or more simply (ignoring constants with respect to μ)

    L(μ) = exp[-n(ȳ - μ)²/(2σ²)]   for μ ∈ ℝ

The corresponding log likelihood function is

    l(μ) = -n(ȳ - μ)²/(2σ²)   for μ ∈ ℝ

To find the maximum likelihood estimate of μ we solve the equation

    l′(μ) = n(ȳ - μ)/σ² = 0

which gives μ̂ = ȳ. The corresponding maximum likelihood estimator of μ is

    μ̃ = Ȳ = (1/n) Σ_{i=1}^{n} Yi

The relative likelihood function can be written as

    R(μ) = L(μ)/L(μ̂)
         = exp[-n(ȳ - μ)²/(2σ²)]   for μ ∈ ℝ

since μ̂ = ȳ gives L(μ̂) = 1.

To test the hypothesis H0: μ = μ0 we use the likelihood ratio statistic

    Λ(μ0) = -2 log[L(μ0)/L(μ̃)]
          = -2 log exp[-n(Ȳ - μ0)²/(2σ²)]   since μ̃ = Ȳ
          = n(Ȳ - μ0)²/σ²
          = [(Ȳ - μ0)/(σ/√n)]²                                    (5.6)

The purpose of writing the likelihood ratio statistic in the form (5.6) is to draw attention
to the fact that, in this special case, Λ(μ0) has exactly a χ²(1) distribution for all values
of n since (Ȳ - μ0)/(σ/√n) ~ G(0, 1).

More generally the likelihood ratio test statistic has only an approximate χ²(1) distribution.
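The exactness claim is easy to see in code: λ(μ0) is literally the square of the usual Z statistic. A sketch in Python, with the numbers (ȳ, μ0, σ, n) made up purely for illustration:

```python
from math import sqrt

def gaussian_known_sigma_lrt(ybar, mu0, sigma, n):
    """lambda(mu0) = n*(ybar - mu0)^2/sigma^2, which is the square of the usual
    Z statistic and hence exactly Chi-squared(1) under H0 for every n."""
    z = (ybar - mu0) / (sigma / sqrt(n))
    return z * z

# Hypothetical numbers chosen only for illustration (not from the notes):
lam = gaussian_known_sigma_lrt(ybar=10.3, mu0=10, sigma=1.5, n=25)
print(lam)  # z = 0.3/0.3 = 1, so lambda = 1
```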

5.4 Likelihood Ratio Test of Hypothesis - Multiparameter

Let the random vector Y represent data generated from a distribution with probability or
probability density function f(y; θ) which depends on the k-dimensional parameter θ. Let
Ω be the parameter space (set of possible values) for θ.
Consider a hypothesis of the form

    H0: θ ∈ Ω0

where Ω0 ⊂ Ω and Ω0 is of dimension p < k. For example H0 might specify particular
values for k - p of the components of θ but leave the remaining parameters unspecified.
The dimensions of Ω and Ω0 refer to the minimum number of parameters (or "coordinates")
needed to specify points in them. Again we test H0 using as our test statistic the like-
lihood ratio test statistic Λ, defined as follows. Let θ̂ denote the maximum likelihood
estimate of θ over Ω so that, as before,

    L(θ̂) = max_{θ ∈ Ω} L(θ)

Similarly we let θ̂0 denote the maximum likelihood estimate of θ over Ω0 (i.e. we maximize
the likelihood with the parameter constrained to lie in the set Ω0) so that

    L(θ̂0) = max_{θ ∈ Ω0} L(θ)

Now consider the corresponding statistic (random variable)

    Λ = -2 log[L(θ̃0)/L(θ̃)] = 2[l(θ̃) - l(θ̃0)]

and let

    λ = -2 log[L(θ̂0)/L(θ̂)] = 2[l(θ̂) - l(θ̂0)]                    (5.7)

denote the observed value of Λ. If λ is very large, then there is evidence against H0
(confirm that this means L(θ̂) is much larger than L(θ̂0)). It can be shown that under H0,
the distribution of Λ is approximately χ²(k - p) as the size of the data set becomes large.
Large values of λ indicate evidence against H0 so the p value is given approximately by

    p value = P(Λ ≥ λ; H0) ≈ P(W ≥ λ)                             (5.8)

where W ~ χ²(k - p).
The likelihood ratio test covers a great many different types of examples, but we only
provide a few examples here.

Example 5.4.3 Comparison of two Poisson means

In Chapter 4, Problem 16 data were given on the numbers of failures per month for each
of two companies' photocopiers. We assume that in a given month the number of failures
Y follows a Poisson distribution with probability function

    f(y; θ) = P(Y = y) = θ^y e^(-θ)/y!   for y = 0, 1, ...

where θ = E(Y) is the mean number of failures per month. (This ignores that the number
of days that the copiers are used varies a little across months. Adjustments could be made
to the analysis to deal with this.) Denote the value of θ for Company A's copiers as θA and
the value for Company B's as θB. Let us test the hypothesis that the two photocopiers
have the same mean number of failures

    H0: θA = θB

Essentially we have data from two Poisson distributions with possibly different parameters.
For convenience let x1, ..., xn denote the observations for Company A's photocopier which
are assumed to be a random sample from the model

    P(X = x; θA) = θA^x e^(-θA)/x!   for x = 0, 1, ... and θA ≥ 0

Similarly let y1, y2, ..., ym denote the observations for Company B's photocopier which are
assumed to be a random sample from the model

    P(Y = y; θB) = θB^y e^(-θB)/y!   for y = 0, 1, ... and θB ≥ 0

independently of the observations for Company A's photocopier. In this case the parameter
vector is the two dimensional vector θ = (θA, θB) and Ω = {(θA, θB) : θA ≥ 0, θB ≥ 0}.
Note that the dimension of Ω is k = 2. Since the null hypothesis specifies that the
two parameters θA and θB are equal but does not otherwise specify their values, we have
Ω0 = {(θ, θ) : θ ≥ 0} which is a space of dimension p = 1.
To construct the likelihood ratio test of H0: θA = θB we need the likelihood function
for the parameter vector θ = (θA, θB). We first note that the likelihood function for θA
only based on the data x1, x2, ..., xn is

    L1(θA) = Π_{i=1}^{n} f(xi; θA) = Π_{i=1}^{n} θA^{xi} e^(-θA)/xi!   for θA ≥ 0

or more simply

    L1(θA) = θA^{nx̄} e^(-nθA)   for θA ≥ 0

Similarly the likelihood function for θB only based on y1, y2, ..., ym is given by

    L2(θB) = θB^{mȳ} e^(-mθB)   for θB ≥ 0

Since the datasets are independent, the likelihood function for θ = (θA, θB) is obtained as
a product of the individual likelihoods

    L(θ) = L(θA, θB) = L1(θA) L2(θB)
         = θA^{nx̄} e^(-nθA) θB^{mȳ} e^(-mθB)   for (θA, θB) ∈ Ω

with corresponding log likelihood function

    l(θ) = -nθA - mθB + (nx̄) log θA + (mȳ) log θB   for (θA, θB) ∈ Ω      (5.9)

The number of photocopy failures in twelve consecutive months for company A and
company B are given below:

    Month       1   2   3   4   5   6   7   8   9  10  11  12  Total
    Company A  16  14  25  19  23  12  22  28  19  15  18  29  Σxi = 240
    Company B  13   7  12   9  15  17  10  13   8  10  12  14  Σyj = 140

The log likelihood function is

    l(θ) = l(θA, θB) = -12θA + 240 log θA - 12θB + 140 log θB   for (θA, θB) ∈ Ω

The values of θA and θB which maximize l(θA, θB) are obtained by solving the two equa-
tions

    ∂l/∂θA = 0   and   ∂l/∂θB = 0

which gives two equations in two unknowns:

    -12 + 240/θA = 0
    -12 + 140/θB = 0

The maximum likelihood estimates of θA and θB (unconstrained) are θ̂A = x̄ = 240/12 = 20.0
and θ̂B = ȳ = 140/12 = 11.667 and θ̂ = (x̄, ȳ) = (20.0, 11.667).
To determine

    L(θ̂0) = max_{θ ∈ Ω0} L(θ)

we need to find the (constrained) maximum likelihood estimate θ̂0, which is the value of
θ = (θA, θB) which maximizes l(θA, θB) under the constraint θA = θB. To do this we
merely let θ = θA = θB in (5.9) to obtain

    l(θ, θ) = -12θ + 240 log θ - 12θ + 140 log θ
            = -24θ + 380 log θ   for θ ≥ 0

Solving ∂l(θ, θ)/∂θ = 0, we find θ̂ = (nx̄ + mȳ)/(n + m) = 380/24 = 15.833 (= θ̂A = θ̂B) so
θ̂0 = (15.833, 15.833).
The observed value of the likelihood ratio statistic using (5.7) is

    λ = -2 log[L(θ̂0)/L(θ̂)] = 2[l(θ̂) - l(θ̂0)]
      = 2[l(20.0, 11.667) - l(15.833, 15.833)]
      = 2(682.92 - 669.60)
      = 26.64

and the approximate p value (5.8) is

    p value = P(Λ ≥ 26.64; H0)
            ≈ P(W ≥ 26.64)          where W ~ χ²(1)
            = 2[1 - P(Z ≤ √26.64)]  where Z ~ G(0, 1)
            ≈ 0

Based on the data there is very strong evidence against the hypothesis H0: θA = θB. The
data suggest that Company B's photocopiers have a lower rate of failure than Company
A's photocopiers.
Note that we could also follow up this conclusion by giving a confidence interval for the
mean difference θA - θB since this would indicate the magnitude of the difference in the
two failure rates. The maximum likelihood estimates θ̂A = 20.0 average failures per month
and θ̂B = 11.67 failures per month differ by quite a bit, but we could also give a confidence
interval in order to express the uncertainty in such estimates.
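The constrained and unconstrained maximizations above can be checked numerically. A sketch in Python (the notes use R); the helper name poisson_two_sample_lrt is ours:

```python
from math import log, sqrt
from statistics import NormalDist

def poisson_two_sample_lrt(sum_x, n, sum_y, m):
    """Observed lambda for H0: thetaA = thetaB with independent Poisson samples,
    using l(thetaA, thetaB) = -n*thetaA - m*thetaB
                              + sum_x*log(thetaA) + sum_y*log(thetaB)."""
    def l(a, b):
        return -n * a - m * b + sum_x * log(a) + sum_y * log(b)
    tA, tB = sum_x / n, sum_y / m    # unconstrained MLEs
    t0 = (sum_x + sum_y) / (n + m)   # constrained MLE under thetaA = thetaB
    lam = 2 * (l(tA, tB) - l(t0, t0))
    p_value = 2 * (1 - NormalDist().cdf(sqrt(lam)))  # Chi-squared(1) approximation
    return lam, p_value

lam, p = poisson_two_sample_lrt(240, 12, 140, 12)  # photocopier data
print(round(lam, 1), p)  # about 26.6 and a p value near zero
```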

Example 5.4.4 Likelihood ratio test of hypothesis for σ for G(μ, σ), μ unknown

Consider a test of H0: σ = σ0 based on a random sample y1, y2, ..., yn. In this case
the unconstrained parameter space is Ω = {(μ, σ) : -∞ < μ < ∞, σ > 0}, obviously a
2-dimensional space, but under the constraint imposed by H0, the parameter must lie in
the space Ω0 = {(μ, σ0) : -∞ < μ < ∞}, a space of dimension 1. Thus k = 2, and p = 1.
The likelihood function is

    L(θ) = L(μ, σ) = Π_{i=1}^{n} f(yi; μ, σ) = Π_{i=1}^{n} (1/(σ√(2π))) exp[-(1/(2σ²))(yi - μ)²]

and the log likelihood function is

    l(μ, σ) = -n log(σ) - (1/(2σ²)) Σ_{i=1}^{n} (yi - μ)² + c

where

    c = log[(2π)^(-n/2)]

does not depend on μ or σ. The maximum likelihood estimators of (μ, σ) in the uncon-
strained case are

    μ̃ = Ȳ
    σ̃² = (1/n) Σ_{i=1}^{n} (Yi - Ȳ)²

Under the constraint imposed by H0: σ = σ0 the maximum likelihood estimator of the
parameter μ is also Ȳ so the likelihood ratio statistic is

    Λ(σ0) = 2l(Ȳ, σ̃) - 2l(Ȳ, σ0)
          = -2n log(σ̃) - (1/σ̃²) Σ_{i=1}^{n} (Yi - Ȳ)² + 2n log(σ0) + (1/σ0²) Σ_{i=1}^{n} (Yi - Ȳ)²
          = 2n log(σ0/σ̃) + (1/σ0² - 1/σ̃²) n σ̃²
          = n[σ̃²/σ0² - 1 - log(σ̃²/σ0²)]

This is not as obviously a Chi-squared random variable. It is, as one might expect, a
function of σ̃²/σ0² which is the ratio of the maximum likelihood estimator of the variance
divided by the value of σ² under H0. In fact the value of Λ(σ0) increases as the quantity
σ̃²/σ0² gets further away from the value 1 in either direction.
The test proceeds by determining the observed value of Λ(σ0)

    λ(σ0) = n[σ̂²/σ0² - 1 - log(σ̂²/σ0²)]

and then obtaining and interpreting the p value

    p value ≈ P(W > λ(σ0))          where W ~ χ²(1)
            = 2[1 - P(Z ≤ √λ(σ0))]  where Z ~ G(0, 1)

Remark It can be shown that the likelihood ratio statistic Λ(σ0) is a function of
U = (n - 1)S²/σ0², in fact Λ(σ0) = U - n log(U/n) - n. See Problem 17(b). This is not
a one-to-one function of U but Λ(σ0) is zero when U = n and Λ(σ0) is large when U/n is
much bigger than or much less than one (that is, when S²/σ0² is much bigger than one or
much less than one). Since U has a Chi-squared distribution with n - 1 degrees of freedom
when H0 is true, we can use U as the test statistic for testing H0: σ = σ0 and compute
exact p values instead of using the Chi-squared approximation for the distribution of
Λ(σ0).
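The Remark can be checked numerically with the summaries of Example 5.2.2 (n = 15, (n - 1)s² = 0.002347, σ0 = 0.008): compute λ(σ0) from U and compare its approximate Chi-squared(1) p value with the exact p value based on U ~ Chi-squared(14). A sketch in Python (the notes use R), using a Poisson-tail identity in place of a Chi-squared table:

```python
from math import exp, factorial, log, sqrt
from statistics import NormalDist

def chi2_sf_even(x, df):
    """P(W > x) for W ~ Chi-squared(df) with df even (Poisson tail identity)."""
    lam = x / 2
    return exp(-lam) * sum(lam**j / factorial(j) for j in range(df // 2))

# Example 5.2.2 summaries: n = 15, (n - 1)s^2 = 0.002347, sigma0 = 0.008
n, sigma0 = 15, 0.008
u = 0.002347 / sigma0**2      # U = (n - 1)s^2 / sigma0^2
lam = u - n * log(u / n) - n  # likelihood ratio statistic lambda(sigma0)
p_approx = 2 * (1 - NormalDist().cdf(sqrt(lam)))  # Chi-squared(1) approximation
p_exact = 2 * min(chi2_sf_even(u, n - 1), 1 - chi2_sf_even(u, n - 1))
print(round(lam, 2), p_approx, p_exact)  # both p values are well below 0.01
```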

Example 5.4.5 Tests of hypotheses for Multinomial model

Consider a random vector Y = (Y1, Y2, ..., Yk) with Multinomial probability function

    f(y1, y2, ..., yk; θ1, ..., θk) = [n!/(y1! y2! ··· yk!)] θ1^{y1} θ2^{y2} ··· θk^{yk}
        for yj = 0, 1, ..., n and Σ_{j=1}^{k} yj = n

Suppose we wish to test a hypothesis of the form H0: θj = θj(α), j = 1, ..., k, where the
probabilities θj(α) are all functions of an unknown parameter α (possibly a vector) with
dimension dim(α) = p < k - 1. The parameter in the original model is θ = (θ1, θ2, ..., θk)
and the parameter space Ω = {(θ1, θ2, ..., θk) : 0 ≤ θj ≤ 1, where Σ_{j=1}^{k} θj = 1}
has dimension k - 1. The parameter in the model assuming H0 is θ0 = (θ1(α), θ2(α), ..., θk(α))
and the parameter space Ω0 = {(θ1(α), θ2(α), ..., θk(α)) : for all α} has dimension p. The
likelihood function is

    L(θ) = [n!/(y1! y2! ··· yk!)] θ1^{y1} θ2^{y2} ··· θk^{yk}

or more simply

    L(θ) = Π_{j=1}^{k} θj^{yj}

L(θ) is maximized over Ω (of dimension k - 1) by the vector θ̂ with θ̂j = yj/n, j = 1, 2, ..., k.
The likelihood ratio test statistic for testing H0: θj = θj(α) is

    Λ = -2 log[L(θ̃0)/L(θ̃)]

where L(θ) is maximized over Ω0 by the vector θ̂0 with θ̂j = θj(α̂). If H0 is true and
n is large the distribution of Λ is approximately χ²(k - 1 - p) and the p value can be
calculated approximately as

    p value = P(Λ ≥ λ; H0) ≈ P(W ≥ λ)   where W ~ χ²(k - 1 - p)

where

    λ = 2l(θ̂) - 2l(θ̂0)

is the observed value of Λ.
We will give specific examples of the Multinomial model in Chapter 7.
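As a concrete special case, if H0 fully specifies the probabilities (so p = 0 and the approximating distribution is Chi-squared(k - 1)), then since L(θ) = Π θj^{yj} the observed statistic reduces to λ = 2 Σ yj log(θ̂j/θj0). A sketch in Python with hypothetical counts and null probabilities, chosen only for illustration:

```python
from math import log

def multinomial_lrt(counts, probs0):
    """Observed lambda for the special case where H0 fully specifies the
    probabilities: lambda = 2 * sum_j y_j * log(theta_hat_j / theta0_j)
    with theta_hat_j = y_j / n. Approximating distribution: Chi-squared(k - 1)."""
    n = sum(counts)
    return 2 * sum(y * log((y / n) / p0) for y, p0 in zip(counts, probs0) if y > 0)

# Hypothetical counts and null probabilities, for illustration only:
lam0 = multinomial_lrt([30, 20, 50], [0.3, 0.2, 0.5])  # perfect agreement
lam1 = multinomial_lrt([40, 20, 40], [0.3, 0.2, 0.5])
print(lam0, round(lam1, 3))  # 0.0 and about 5.163
```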

5.5 Chapter 5 Summary

Test of Hypothesis based on the Likelihood Ratio Statistic
Suppose R(θ) = R(θ; y) is the relative likelihood function for θ based on observed data
y (possibly a vector). To test the hypothesis H0: θ = θ0 we can use the likelihood ratio
statistic -2 log R(θ0; Y) as the test statistic. Let λ = -2 log R(θ0; y) be the observed value
of the likelihood ratio statistic for the data y. The corresponding p value is approximately
equal to P(W ≥ λ) where W ~ χ²(1). In R this can be calculated as 1-pchisq(lambda,1).

This result is based on the fact that -2 log R(θ0; Y) has approximately a χ²(1) distri-
bution assuming H0: θ = θ0 is true.

Table 5.2
Hypothesis Tests for Named Distributions
based on Asymptotic Gaussian Pivotal Quantities

Named            Point       Point       Test statistic               Approximate p value
Distribution     estimate    estimator   for H0: θ = θ0               based on Gaussian approximation

Binomial(n, θ)   θ̂ = y/n     θ̃ = Y/n     |θ̃ - θ0| / √(θ0(1 - θ0)/n)   2P(Z ≥ |θ̂ - θ0| / √(θ0(1 - θ0)/n)),
                                                                      Z ~ G(0, 1)

Poisson(θ)       θ̂ = ȳ       θ̃ = Ȳ       |θ̃ - θ0| / √(θ0/n)           2P(Z ≥ |θ̂ - θ0| / √(θ0/n)),
                                                                      Z ~ G(0, 1)

Exponential(θ)   θ̂ = ȳ       θ̃ = Ȳ       |θ̃ - θ0| / (θ0/√n)           2P(Z ≥ |θ̂ - θ0| / (θ0/√n)),
                                                                      Z ~ G(0, 1)

Note: To find 2P(Z ≥ d) where Z ~ G(0, 1) in R, use 2*(1 - pnorm(d))



Table 5.3
Hypothesis Tests for Gaussian
and Exponential Models

Model            Hypothesis     Test statistic       Exact p value

G(μ, σ)          H0: μ = μ0     |Ȳ - μ0| / (σ/√n)    2P(Z ≥ |ȳ - μ0| / (σ/√n)),
σ known                                              Z ~ G(0, 1)

G(μ, σ)          H0: μ = μ0     |Ȳ - μ0| / (S/√n)    2P(T ≥ |ȳ - μ0| / (s/√n)),
σ unknown                                            T ~ t(n - 1)

G(μ, σ)          H0: σ = σ0     (n - 1)S²/σ0²        min[2P(W ≥ (n - 1)s²/σ0²),
μ unknown                                            2P(W ≤ (n - 1)s²/σ0²)],
                                                     W ~ χ²(n - 1)

Exponential(θ)   H0: θ = θ0     2nȲ/θ0               min[2P(W ≥ 2nȳ/θ0),
                                                     2P(W ≤ 2nȳ/θ0)],
                                                     W ~ χ²(2n)

Notes:
(1) To find P(Z ≥ d) where Z ~ G(0, 1) in R, use 1 - pnorm(d)
(2) To find P(T ≥ d) where T ~ t(n - 1) in R, use 1 - pt(d, n - 1)
(3) To find P(W ≤ d) where W ~ χ²(n - 1) in R, use pchisq(d, n - 1)
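The exact Exponential test in the last row of Table 5.3 can be applied to the light bulb data of Example 5.3.2. A sketch in Python (the notes use R, where pchisq gives the Chi-squared c.d.f.); the even-degrees-of-freedom tail identity stands in for a Chi-squared table:

```python
from math import exp, factorial

def chi2_sf_even(x, df):
    """P(W > x) for W ~ Chi-squared(df) with df even (Poisson tail identity)."""
    lam = x / 2
    return exp(-lam) * sum(lam**j / factorial(j) for j in range(df // 2))

# Exact test of H0: theta = 2000 for the light bulb data of Example 5.3.2:
# n = 50, sum of lifetimes = 93840, and 2n*Ybar/theta0 ~ Chi-squared(2n) under H0.
n, total, theta0 = 50, 93840, 2000
w = 2 * total / theta0       # observed value of 2n*ybar/theta0
sf = chi2_sf_even(w, 2 * n)  # P(W >= w) with W ~ Chi-squared(100)
p_exact = min(2 * sf, 2 * (1 - sf))
print(w, p_exact)  # 93.84; close to the 0.66 from the Chi-squared(1) approximation
```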

5.6 Chapter 5 Problems


1. A woman who claims to have special guessing abilities is given a test, as follows: a
   deck which contains five cards with the numbers 1 to 5 is shuffled and a card drawn
   out of sight of the woman. The woman then guesses the card, the deck is reshuffled
   with the card replaced, and the procedure is repeated several times.

   (a) Let θ be the probability the woman guesses the card correctly and let Y be
       the number of correct guesses in n repetitions of the procedure. Discuss why
       Y ~ Binomial(n, θ) would be an appropriate model. If you wanted to test the
       hypothesis that the woman is guessing at random what is the appropriate null
       hypothesis H0 in terms of the parameter θ?
   (b) Suppose the woman guessed correctly 8 times in 20 repetitions. Using the test
       statistic D = |Y - E(Y)|, calculate the p value for your hypothesis H0 in
       (a) and give a conclusion about whether you think the woman has any special
       guessing ability.
   (c) In a longer sequence of 100 repetitions over two days, the woman guessed cor-
       rectly 32 times. Using the test statistic D = |Y - E(Y)|, calculate the p value
       for these data. What would you conclude now?

2. The accident rate over a certain stretch of highway was about θ = 10 per year for a
   period of several years. In the most recent year, however, the number of accidents was
   25. We want to know whether this many accidents is very probable if θ = 10; if not,
   we might conclude that the accident rate has increased for some reason. Investigate
   this question by assuming that the number of accidents in the current year follows a
   Poisson distribution with mean θ and then testing H0: θ = 10. Use the test statistic
   D = max(0, Y - 10) where Y represents the number of accidents in the most recent
   year.

3. A hospital lab has just purchased a new instrument for measuring levels of dioxin
   (in parts per billion). To calibrate the new instrument, 20 samples of a "standard"
   water solution known to contain 45 parts per billion dioxin are measured by the new
   instrument. The observed data are given below:

       44.1  46.0  46.6  41.3  44.8  47.8  44.5  45.1  42.9  44.5
       42.5  41.5  39.6  42.0  45.8  48.9  46.6  42.9  47.0  43.7

   For these data

       Σ_{i=1}^{20} yi = 888.1  and  Σ_{i=1}^{20} yi² = 39545.03

   (a) Use a qqplot to check whether a G(μ, σ) model is reasonable for these data.
   (b) Describe a suitable study population for this study. The parameters μ and σ
       correspond to what attributes of interest in the study population?
   (c) Assuming a G(μ, σ) model for these data test the hypothesis H0: μ = 45.
       Determine a 95% confidence interval for μ. What would you conclude about
       how well the new instrument is working?
   (d) The manufacturer of these instruments claims that the variability in measure-
       ments is less than two parts per billion. Test the hypothesis that H0: σ = 2 and
       determine a 95% confidence interval for σ. What would you conclude about the
       manufacturer's claim?
   (e) Suppose the hospital lab rechecks the new instrument one week later by taking
       25 new measurements on a standard solution of 45 parts per billion dioxin. If
       the new data give
           ȳ = 44.1 and s = 2.1
       what would you conclude about how well the instrument is working now? Explain
       the difference between a result which is statistically significant and a result which
       is of practical significance in the context of this study.
   (f) Run the following R code which does the calculations for (c) and (d)

       y<-c(44.1,46,46.6,41.3,44.8,47.8,44.5,45.1,42.9,44.5,
            42.5,41.5,39.6,42,45.8,48.9,46.6,42.9,47,43.7)
       t.test(y,mu=45,conf.level=0.95) # test hypothesis mu=45
       # and gives a 95% confidence interval
       df<-length(y)-1 # degrees of freedom
       s2<-var(y) # sample variance
       p<-0.95 # p=0.95 for 95% confidence interval
       a<-qchisq((1-p)/2,df) # lower value from Chi-squared dist'n
       b<-qchisq((1+p)/2,df) # upper value from Chi-squared dist'n
       c(s2*df/b,s2*df/a) # confidence interval for sigma squared
       c(sqrt(s2*df/b),sqrt(s2*df/a)) # confidence interval for sigma
       sigma0sq<-2^2 # test hypothesis sigma=2 or sigmasq=4
       chitest<-s2*df/sigma0sq
       q<-pchisq(chitest,df)
       min(2*q,2*(1-q)) # p-value for testing sigma=2

4. In Problem 3 suppose we accept the manufacturer's claim and assume we know σ = 2.
   Test the hypothesis H0: μ = 45 and determine a 95% confidence interval for μ for
   the original data with ȳ = 44.405.
   Hint: Use the pivotal quantity

       Z = (Ȳ - μ)/(σ/√n) ~ G(0, 1)

   with σ = 2.

5. For Chapter 4, Problem 31 test the hypothesis H0: μ = 105.



6. Suppose in Problem 5 we assume that μ = 105. Test the hypothesis H0: σ² = 100
   and determine a 95% confidence interval for σ.
   Hint: Use the pivotal quantity

       (1/σ²) Σ_{i=1}^{n} (Yi - μ)² ~ χ²(n)

   with μ = 105.

7. Between 10 a.m. on November 4, 2014 and 10 p.m. on November 6, 2014 a referendum
   on the question "Should classes start on the first Thursday after Labour Day to allow
   for two additional days off in the Fall term?" was conducted by the Federation of
   Students at the University of Waterloo. All undergraduates were able to cast their
   ballot online. Six thousand of the 30,990 eligible voters voted. Of the 6000 who
   voted, 4440 answered yes to this question.

   (a) The Federation of Students used an empirical study to determine whether or
       not students support a fall term break. The Plan step of the empirical study
       involved using an online referendum. Give at least one advantage and at least
       one disadvantage of using the online referendum in this context.
   (b) Describe a suitable target population and study population for this study.
   (c) Assume the model Y ~ Binomial(6000, θ) where Y = number of people who
       responded yes to the question "Should classes start on the first Thursday after
       Labour Day to allow for two additional days off in the Fall term?" The parameter
       θ corresponds to what attribute of interest in the study population? How valid
       do you think the Binomial model is and why?
   (d) Give the maximum likelihood estimate of θ. How valid do you think this estimate
       is?
   (e) Determine an approximate 95% confidence interval for θ.
   (f) By reference to the approximate confidence interval, indicate what you know
       about the approximate p value for a test of the hypothesis H0: θ = 0.7.

8. Data on the number of accidents at a busy intersection in Waterloo over the last 5
   years indicated that the average number of accidents at the intersection was 3
   accidents per week. After the installation of new traffic signals the number of accidents
   per week for a 25 week period were recorded as follows:

       4 5 0 4 2 0 1 4 1 3 1 1 2
       2 2 1 1 3 2 3 2 0 2 2 3

   Let yi = the number of accidents in week i, i = 1, 2, ..., 25. To analyse these data we
   assume Yi has a Poisson distribution with mean θ, i = 1, 2, ..., 25 independently.

   (a) To decide whether the mean number of accidents at this intersection has changed
       after the installation of the new traffic signals we wish to test the hypothesis
       H0: θ = 3. Why is the discrepancy measure D = |Σ_{i=1}^{25} Yi - 75| reasonable?
       Calculate the exact p value for testing H0: θ = 3. What would you conclude?
   (b) Justify the following statement:

           P((Ȳ - θ)/√(θ/n) ≤ c) ≈ P(Z ≤ c)   where Z ~ N(0, 1)

   (c) Why is the discrepancy measure D = |Ȳ - 3| reasonable for testing H0: θ = 3?
       Calculate the approximate p value using the approximation in (b). Compare
       this to the value in (a).

9. Use the likelihood ratio test statistic to test H0 : = 3 for the data in Problem 8.
Compare this answer to the answers in 8 (a) and 8 (c).

10. For Chapter 2, Problem 6 (b) test the hypothesis H0: θ = 5 using the likelihood ratio
test statistic. Is this result consistent with the approximate 95% confidence interval
for θ that you found in Chapter 4, Problem 5?

11. For Chapter 2, Problem 8 (b) test the hypothesis H0: θ = 0.1 using the likelihood
ratio test statistic. Is this result consistent with the approximate 95% confidence
interval for θ that you found in Chapter 4, Problem 6?

12. Data from the 2011 Canadian census indicate that 18% of all families in Canada have
one child. Suppose the data in Chapter 2, Problem 12 (d) represented 33 children
chosen at random from the Waterloo Region. Based on these data, test the hypothesis
that the percentage of families with one child in Waterloo Region is the same as the
national percentage using the likelihood ratio test statistic. Is this result consistent
with the approximate 95% confidence interval for θ that you found in Chapter 4,
Problem 8?

13. A company that produces power systems for personal computers has to demonstrate
a high degree of reliability for its systems. Because the systems are very reliable
under normal use conditions, it is customary to ‘stress’ the systems by running them
at a considerably higher temperature than they would normally encounter, and to
measure the time until the system fails. According to a contract with one personal
computer manufacturer, the average time to failure for systems run at 70°C should
be no less than 1,000 hours. From one production lot, 20 power systems were put on
test and observed until failure at 70°C. The 20 failure times y1, y2, ..., y20 were (in
hours):

374.2  544.0  509.4  1113.9  1244.3  551.9  853.2  3391.2  297.0  1501.4
250.2  678.1  379.6  1818.9  1191.1  162.8  332.2  1060.1   63.1  2382.0

Note: Σ_{i=1}^{20} yi = 18,698.6. Failure times are assumed to have an Exponential(θ) distribution.

(a) Check whether the Exponential model is reasonable for these data. (See Example
5.3.2.)
(b) Use a likelihood ratio test to test H0: θ = 1000 hours. Is there any evidence
that the company’s power systems do not meet the contracted standard?

14. The R function runif() generates pseudo-random Uniform(0, 1) random variables.
The command y<-runif(n) will produce a vector of n values y1, y2, ..., yn.

(a) Suggest a test statistic which could be used to test that the yi's, i = 1, 2, ..., n
are consistent with a random sample from Uniform(0, 1).
(See: www.dtic.mil/cgi-bin/GetTRDoc?AD=ADA393366)
(b) Generate 1000 yi's and carry out the test in (a).

15. The Poisson model is often used to compare rates of occurrence for certain types of
events in different geographic regions. For example, consider K regions with popu-
lations P1, P2, ..., PK and let λj, j = 1, 2, ..., K be the annual expected number of
events per person for region j. By assuming that the number of events Yj for region j
in a given t-year period has a Poisson distribution with mean Pj λj t, we can estimate
and compare the λj's or test that they are equal.

(a) Under what conditions might the stated Poisson model be reasonable?
(b) Suppose you observe values y1, y2, ..., yK for a given t-year period. Describe
how to test the hypothesis that λ1 = λ2 = ··· = λK.
(c) The data below show the numbers of children yj born with “birth defects” for 5
regions over a given five-year period, along with the total numbers of births Pj
for each region. Test the hypothesis that the five rates of birth defects are equal.

Pj   2025   1116   3210   1687   2840
yj     27     18     41     29     31

16. Using the data from Chapter 2, Problems 10 and 11 and assuming the Poisson model
holds for each dataset, test the hypothesis that the mean number of points per game is
the same for Wayne Gretzky and Sidney Crosby. Hint: See Example 5.4.3. Comment
on whether you think this is a reasonable way to compare these two great hockey
players.

17. Challenge Problem: Likelihood ratio test statistic for Gaussian model,
μ and σ unknown. Suppose that Y1, Y2, ..., Yn are independent G(μ, σ) observations.

(a) Show that the likelihood ratio test statistic for testing H0: μ = μ0 (σ unknown)
is given by

Λ(μ0) = n log(1 + T²/(n − 1))

where T = √n (Ȳ − μ0)/S and S is the sample standard deviation. Note: you
will want to use the identity

Σ_{i=1}^n (Yi − μ0)² = Σ_{i=1}^n (Yi − Ȳ)² + n(Ȳ − μ0)²

(b) Show that the likelihood ratio test statistic for testing H0: σ = σ0 (μ unknown)
can be written as Λ(σ0) = U − n log(U/n) − n where

U = (n − 1)S²/σ0²

See Example 5.4.4.

18. Challenge Problem: Likelihood ratio test statistic for comparing two
Exponential means. Suppose that X1, X2, ..., Xm is a random sample from the
Exponential(θ1) distribution and, independently, Y1, Y2, ..., Yn is a random sample
from the Exponential(θ2) distribution. Determine the likelihood ratio test statistic
for testing H0: θ1 = θ2.
6. GAUSSIAN RESPONSE MODELS

6.1 Introduction

A response variate Y is one whose distribution has parameters which depend on the value
of other variates. For the Gaussian models we have studied so far, we assumed that we had
a random sample Y1, Y2, ..., Yn from the same Gaussian distribution G(μ, σ). A Gaussian
response model generalizes this to permit the parameters of the Gaussian distribution for
Yi to depend on a vector xi of covariates (explanatory variates which are measured for
the response variate Yi). Gaussian models are by far the most common models used in
statistics.

Definition 40 A Gaussian response model is one for which the distribution of the response
variate Y, given the associated vector of covariates x = (x1, x2, ..., xk) for an individual
unit, is of the form

Y ~ G(μ(x), σ(x))

If observations are made on n randomly selected units we write the model as

Yi ~ G(μ(xi), σ(xi)) for i = 1, 2, ..., n independently

In most examples we will assume σ(xi) = σ is constant. This assumption is not necessary
but it does make the models easier to analyze. The choice of μ(x) is guided by past
information and by current data from the population or process. The difference between
various Gaussian response models lies in the choice of the function μ(x) and the covariates.
We often assume μ(xi) is a linear function of the covariates. These models are called
Gaussian linear models and can be written as

Yi ~ G(μ(xi), σ) for i = 1, 2, ..., n independently    (6.1)

with μ(xi) = β0 + Σ_{j=1}^k βj xij

where xi = (xi1, xi2, ..., xik) is the vector of known covariates associated with unit i and
β0, β1, ..., βk are unknown parameters. These models are also referred to as linear regression
models¹³, and the βj's are called the regression coefficients. Linear regression models
are used in both machine learning and data science.

Here are some examples of settings where Gaussian response models can be used.

Example 6.1.1 Can filler study

The soft drink bottle filling process of Example 1.5.2 involved two machines (Old and
New). For a given machine it is reasonable to represent the distribution for the amount of
liquid Y deposited in a single bottle by a Gaussian distribution.
In this case we can think of the machines as acting like a covariate, with μ and σ differing
for the two machines. We could write

Y ~ G(μO, σO) for observations from the old machine
Y ~ G(μN, σN) for observations from the new machine.

In this case there is no formula relating μ and σ to the machines; they are simply different.
Notice that an important feature of a machine is the variability of its production so we
have, in this case, permitted the two variance parameters to be different.

Example 6.1.2 Price versus size of commercial building

Ontario property taxes are based on “market value”, which is determined by comparing
a property to the price of those which have recently been sold. The value of a property is
separated into components for land and for buildings. Here we deal with the value of the
buildings only but a similar analysis could be conducted for the value of the property.
A manufacturing company was appealing the assessed market value of its property,
which included a large building. Sales records were collected on the 30 largest buildings
sold in the previous three years in the area. The data, which are available in the file
sizepricedata.txt posted on the course website, are plotted in Figure 6.1. The size of the
building x is measured in m²/10⁵ and the selling price y is in $ per m². The purpose of
the analysis is to determine whether and to what extent we can determine the value of a
property from the single covariate x so that we know whether the assessed value appears
to be too high. The size of the building in question was 4.47 × 10⁵ m², with an assessed
market value of $75 per m².
The scatterplot shows that the price y decreases linearly with size x but there is ob-
viously variability in the price of buildings having the same area (size). In this case we
might consider a model where the price of a building of size xi is represented by a random
variable Yi, with

Yi ~ G(β0 + β1 xi, σ) for i = 1, 2, ..., n independently

¹³ The word regression is a historical term introduced in the 19th century in connection with these models.

where β0, β1 and σ are unknown parameters and x1, x2, ..., xn are known constants. Note
that, although this model assumes that the mean of the response variate Y depends on the
explanatory variate x, the model assumes that the standard deviation σ does not depend
on x.


Figure 6.1: Scatterplot of price versus building size

Example 6.1.3 Exam mark versus midterm mark

An instructor of an online course was interested in the relationship between midterm
marks and final exam marks. The data are (xi, yi), i = 1, 2, ..., 65 where yi = final
exam mark and xi = midterm mark for 65 students enrolled in the online course during a
particular winter term. The data, which are available in the file midexamdata.txt posted
on the course website, are plotted in Figure 6.2. The scatterplot shows the final exam
mark y increases linearly with midterm mark x. The variability in final exam marks seems
greater for midterm marks between 55 and 70; however, we notice that this is mostly due to
the fact that there are many more observations in this range as compared to the number
of observations for midterm marks below 45 and above 90. For these data it also seems
reasonable to use the model

Yi ~ G(β0 + β1 xi, σ) for i = 1, 2, ..., n independently

where β0, β1 and σ are unknown parameters and x1, x2, ..., xn are known constants.
The standard deviation σ is assumed to be the same for all x.

Figure 6.2: Scatterplot of exam mark versus midterm mark

Example 6.1.4 Breaking strength versus diameter of steel bolt

The “breaking strength” of steel bolts is measured by subjecting a bolt to an increasing
(lateral) force and determining the force at which the bolt breaks. This force is called
the breaking strength; it depends on the diameter of the bolt and the material the bolt is
composed of. There is variability in breaking strengths since two bolts of the same dimension
and material will generally break at different forces. Understanding the distribution of
breaking strengths is very important in manufacturing and construction.
In a quality control experiment the breaking strengths y of six steel bolts at each of
five different bolt diameters x were measured. The data, which are available in the file
diameterstrengthdata.txt, are plotted in Figure 6.3. The scatterplot gives a clear picture of
the relationship between y and x. A reasonable model for the breaking strength Y of a
randomly selected bolt of diameter x would appear to be Y ~ G(μ(x), σ). The variability in
y values appears to be about the same for bolts of different diameters, which again provides
some justification for assuming σ to be constant. It is not obvious what the best choice for
μ(x) would be, although the relationship looks slightly nonlinear, so we might try a quadratic
function

μ(x) = β0 + β1 x + β2 x²

where β0, β1, β2 are unknown parameters.

Figure 6.3: Scatterplot of strength versus bolt diameter

Remark Sometimes the model (6.1) is written as

Yi = μ(xi) + Ri where Ri ~ G(0, σ)

In this form we can see that Yi is the sum of a deterministic component, μ(xi) (a constant),
and a stochastic component, Ri (a random variable).
We now consider estimation and testing procedures for these Gaussian response models.
We begin with models which have no covariates so that the observations are all from the
same Gaussian distribution.

G(μ, σ) Model

In Chapters 4 and 5 we discussed estimation and testing hypotheses for samples from a
Gaussian distribution. Suppose that Y ~ G(μ, σ) models a response variate y in some
population or process. A random sample Y1, Y2, ..., Yn is selected, and we want to estimate
the model parameters and possibly to test hypotheses about them. We can write this model
in the form

Yi = μ + Ri where Ri ~ G(0, σ)    (6.2)

so this is a special case of the Gaussian response model in which the mean function is con-
stant. The estimator of the parameter μ that we used is the maximum likelihood estimator
Ȳ = (1/n) Σ_{i=1}^n Yi. This estimator is also a “least squares estimator”: Ȳ has the property that
it is closer to the data than any other constant, or

min over μ of Σ_{i=1}^n (Yi − μ)² = Σ_{i=1}^n (Yi − Ȳ)²

You should be able to verify this. It will turn out that the methods for estimation, construct-
ing confidence intervals, and tests of hypothesis discussed earlier for the single Gaussian
G(μ, σ) model are all special cases of the more general methods derived in Section 6.5.
In the next section we begin with a simple generalization of (6.2) to the case in which
the mean is a linear function of a single covariate.
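The least squares property of Ȳ is easy to check numerically. The following sketch (in Python, for illustration; the data values are made up) evaluates Σ(yi − μ)² over a grid of candidate constants μ and confirms that the minimizer coincides with the sample mean:

```python
import numpy as np

# Hypothetical data, for illustration only.
y = np.array([4.0, 7.0, 5.0, 9.0, 5.0])
ybar = y.mean()  # the sample mean, here 6.0

def ss(mu):
    """Sum of squared deviations of the data from the constant mu."""
    return np.sum((y - mu) ** 2)

# Evaluate the criterion on a grid of candidate values for mu;
# the grid minimizer coincides with the sample mean.
grid = np.linspace(ybar - 3.0, ybar + 3.0, 601)
best = grid[np.argmin([ss(m) for m in grid])]
print(ybar, best)
```

Any constant other than ȳ gives a strictly larger sum of squares, which is exactly the identity displayed above.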

6.2 Simple Linear Regression

Many studies involve covariates x, as described in Section 6.1. In this section we consider
the case in which there is a single covariate x. Consider the model with independent Yi's
such that

Yi ~ G(μ(xi), σ) where μ(xi) = α + βxi    (6.3)

This is of the form (6.1) with (β0, β1) replaced by (α, β). The xi's are assumed to be known
constants. The unknown parameters are α, β, and σ.
The likelihood function for (α, β, σ) is

L(α, β, σ) = ∏_{i=1}^n (1/(√(2π) σ)) exp[−(1/(2σ²)) (yi − α − βxi)²]

or more simply

L(α, β, σ) = (1/σⁿ) exp[−(1/(2σ²)) Σ_{i=1}^n (yi − α − βxi)²] for α ∈ ℝ, β ∈ ℝ, σ > 0

The log likelihood function is

ℓ(α, β, σ) = −n log σ − (1/(2σ²)) Σ_{i=1}^n (yi − α − βxi)² for α ∈ ℝ, β ∈ ℝ, σ > 0

To obtain the maximum likelihood estimates we solve the three equations

∂ℓ/∂α = (1/σ²) Σ_{i=1}^n (yi − α − βxi) = 0    (6.4)

∂ℓ/∂β = (1/σ²) Σ_{i=1}^n (yi − α − βxi) xi = 0    (6.5)

∂ℓ/∂σ = −n/σ + (1/σ³) Σ_{i=1}^n (yi − α − βxi)² = 0

simultaneously to obtain the maximum likelihood estimates

β̂ = Σ_{i=1}^n xi(yi − ȳ) / Σ_{i=1}^n xi(xi − x̄) = Sxy/Sxx

α̂ = ȳ − β̂x̄

σ̂² = (1/n) Σ_{i=1}^n (yi − α̂ − β̂xi)² = (1/n)(Syy − β̂Sxy)

where

Sxx = Σ_{i=1}^n (xi − x̄)², Syy = Σ_{i=1}^n (yi − ȳ)², and Sxy = Σ_{i=1}^n (xi − x̄)(yi − ȳ)
See Chapter 6, Problems 1 and 2.
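The closed-form estimates above are simple to compute directly. A sketch in Python (the course's own examples use R; the small dataset here is made up for illustration) evaluates β̂ = Sxy/Sxx, α̂ = ȳ − β̂x̄ and σ̂², and checks the slope and intercept against numpy's built-in least squares fit:

```python
import numpy as np

# Hypothetical small dataset, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

n = len(x)
xbar, ybar = x.mean(), y.mean()
Sxx = np.sum((x - xbar) ** 2)
Sxy = np.sum((x - xbar) * (y - ybar))
Syy = np.sum((y - ybar) ** 2)

beta_hat = Sxy / Sxx                      # slope estimate Sxy/Sxx
alpha_hat = ybar - beta_hat * xbar        # intercept estimate ybar - beta_hat*xbar
sigma2_hat = (Syy - beta_hat * Sxy) / n   # ML estimate of sigma^2

# np.polyfit performs the same least squares fit (degree-1 polynomial)
slope, intercept = np.polyfit(x, y, 1)
print(beta_hat, alpha_hat, sigma2_hat)
```

The hand-computed estimates and the `np.polyfit` coefficients agree, which is the least squares / maximum likelihood equivalence discussed in the next subsection.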

Least squares estimation

If we are given data (xi, yi), i = 1, 2, ..., n, then one criterion which could be used to obtain
a line of “best fit” to these data is to fit the line which minimizes the sum of the squares
of the distances between the observed points (xi, yi), i = 1, 2, ..., n, and the fitted line
y = α + βx. Mathematically this means we want to find the values of α and β which
minimize the function

g(α, β) = Σ_{i=1}^n [yi − (α + βxi)]²

Such estimates are called least squares estimates. To find the least squares estimates we
need to solve the two equations

∂g/∂α = −2 Σ_{i=1}^n (yi − α − βxi) = 0

∂g/∂β = −2 Σ_{i=1}^n (yi − α − βxi) xi = 0

simultaneously. We note that this is equivalent to solving the maximum likelihood equations
(6.4) and (6.5).
In summary, the least squares estimates and the maximum likelihood estimates obtained
assuming the model (6.3) are the same estimates. Of course, the method of least squares
only provides point estimates of the unknown parameters α and β, while assuming the
model (6.3) allows us to obtain both estimates and confidence intervals for the unknown
parameters.
Note that the line y = α̂ + β̂x is often called the fitted regression line for y on x or
more simply the fitted line.
We now show how to obtain confidence intervals based on the model (6.3).

Distribution of the estimator β̃

The maximum likelihood estimator corresponding to β̂ is

β̃ = (1/Sxx) Σ_{i=1}^n xi(Yi − Ȳ)

Since

Σ_{i=1}^n xi(Yi − Ȳ) = Σ_{i=1}^n (xi − x̄)Yi

(see Chapter 6, Problem 1) we have

β̃ = (1/Sxx) Σ_{i=1}^n (xi − x̄)Yi = Σ_{i=1}^n ai Yi where ai = (xi − x̄)/Sxx

which shows that β̃ is a linear combination of the Gaussian random variables Yi and there-
fore has a Gaussian distribution. To find the mean and variance of β̃ we use the identities

Σ_{i=1}^n ai = 0,  Σ_{i=1}^n ai xi = 1,  Σ_{i=1}^n ai² = 1/Sxx

(see Chapter 6, Problem 1) to obtain

E(β̃) = Σ_{i=1}^n ai E(Yi) = Σ_{i=1}^n ai (α + βxi)
     = β Σ_{i=1}^n ai xi   since Σ_{i=1}^n ai = 0
     = β                   since Σ_{i=1}^n ai xi = 1

and

Var(β̃) = Σ_{i=1}^n ai² Var(Yi)   since the Yi are independent random variables
       = σ² Σ_{i=1}^n ai²
       = σ²/Sxx               since Σ_{i=1}^n ai² = 1/Sxx

In summary

β̃ ~ G(β, σ/√Sxx)

Confidence intervals for β and test of hypothesis of no relationship

Although the maximum likelihood estimate of σ² is

σ̂² = (1/n) Σ_{i=1}^n (yi − α̂ − β̂xi)² = (1/n)(Syy − β̂Sxy)

we will estimate σ² using

se² = (1/(n − 2)) Σ_{i=1}^n (yi − α̂ − β̂xi)² = (1/(n − 2))(Syy − β̂Sxy)

since E(Se²) = σ² where

Se² = (1/(n − 2)) Σ_{i=1}^n (Yi − α̃ − β̃xi)²

Confidence intervals for β are important because the parameter β represents the increase
in the mean value of Y resulting from an increase of one unit in the value of x. As well, if
β = 0 then x has no effect on Y (within this model). Since

(β̃ − β)/(σ/√Sxx) ~ G(0, 1)

holds independently of

(n − 2)Se²/σ² ~ χ²(n − 2)    (6.6)

then by Theorem 32 it follows that

(β̃ − β)/(Se/√Sxx) ~ t(n − 2)

This pivotal quantity can be used to obtain confidence intervals for β and to construct tests
of hypotheses about β.
Using the t table or R we find the constant a such that P(−a ≤ T ≤ a) = p where
T ~ t(n − 2). Since

p = P(−a ≤ T ≤ a)
  = P(−a ≤ (β̃ − β)/(Se/√Sxx) ≤ a)
  = P(β̃ − aSe/√Sxx ≤ β ≤ β̃ + aSe/√Sxx)

therefore a 100p% confidence interval for β is given by

[β̂ − a se/√Sxx, β̂ + a se/√Sxx] = β̂ ± a se/√Sxx    (6.7)

To test the hypothesis of no relationship, H0: β = 0, we use the test statistic

|β̃ − 0| / (Se/√Sxx)

with observed value

|β̂ − 0| / (se/√Sxx)

and p-value given by

p-value = P(|T| ≥ |β̂ − 0|/(se/√Sxx)) = 2[1 − P(T ≤ |β̂ − 0|/(se/√Sxx))] where T ~ t(n − 2)

Note that (6.6) can be used to obtain confidence intervals for σ² or σ. A 100p% confi-
dence interval for σ² is

[(n − 2)se²/b, (n − 2)se²/a]

and a 100p% confidence interval for σ is

[se √((n − 2)/b), se √((n − 2)/a)]

where

P(U ≤ a) = P(U > b) = (1 − p)/2

and U ~ χ²(n − 2).
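As a numerical illustration of this chi-square interval (the values of n and se² below are made up, not from the course data), in Python:

```python
from scipy.stats import chi2

# Hypothetical values, for illustration: n = 30 observations, se^2 = 68.3.
n, se2, p = 30, 68.3, 0.95
df = n - 2

# Quantiles a and b with P(U <= a) = P(U > b) = (1 - p)/2, U ~ chi-square(n-2)
a = chi2.ppf((1 - p) / 2, df)
b = chi2.ppf((1 + p) / 2, df)

# 95% confidence interval for sigma^2: [(n-2) se^2 / b, (n-2) se^2 / a]
lower, upper = df * se2 / b, df * se2 / a
print(lower, upper)
```

Because the chi-square distribution is not symmetric, the interval is not centred at se², unlike the t-based intervals for α and β.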

Remark In regression models we often “redefine” a covariate xi as x′i = xi − c, where c is
a constant value that makes Σ_{i=1}^n x′i close to zero. (Often we take c = x̄, which makes Σ_{i=1}^n x′i
exactly zero.) The reasons for doing this are that it reduces round-off errors in calculations,
and that it makes the intercept parameter more interpretable. Note that β does not change
if we “centre” xi this way, because

E(Y|x) = α + βx = α + β(x′ + c) = (α + βc) + βx′

Thus, the intercept changes if we redefine x, but not β. In the examples we consider here
we have kept the given definition of xi, for simplicity.

Confidence intervals for the mean response μ(x) = α + βx

We are often interested in estimating the quantity μ(x) = α + βx since it represents the
mean response at a specified value of the covariate x. We can obtain a pivotal quantity for
doing this. The maximum likelihood estimator of μ(x) is obtained by replacing the unknown
values α, β by their maximum likelihood estimators,

μ̃(x) = α̃ + β̃x = Ȳ + β̃(x − x̄)

since α̃ = Ȳ − β̃x̄. Since

β̃ = Σ_{i=1}^n (xi − x̄)Yi / Sxx

we can rewrite μ̃(x) as

μ̃(x) = Ȳ + β̃(x − x̄) = Σ_{i=1}^n bi Yi where bi = 1/n + (xi − x̄)(x − x̄)/Sxx

Since μ̃(x) is a linear combination of Gaussian random variables it has a Gaussian distrib-
ution. To find the mean and variance of μ̃(x) we use the identities

Σ_{i=1}^n bi = 1,  Σ_{i=1}^n bi xi = x  and  Σ_{i=1}^n bi² = 1/n + (x − x̄)²/Sxx

(see Chapter 6, Problem 1) to obtain

E[μ̃(x)] = Σ_{i=1}^n bi E(Yi) = Σ_{i=1}^n bi (α + βxi)
       = α Σ_{i=1}^n bi + β Σ_{i=1}^n bi xi
       = α + βx   since Σ_{i=1}^n bi = 1 and Σ_{i=1}^n bi xi = x
       = μ(x)

and

Var[μ̃(x)] = Σ_{i=1}^n bi² Var(Yi)   since the Yi are independent random variables
          = σ² Σ_{i=1}^n bi²
          = σ² [1/n + (x − x̄)²/Sxx]

Note that the variance of μ̃(x) is smallest when x is close to x̄ (the center of the data) and
much larger when (x − x̄)² is large. In summary, we have shown that

μ̃(x) ~ G(μ(x), σ √(1/n + (x − x̄)²/Sxx))

Since

(μ̃(x) − μ(x)) / (σ √(1/n + (x − x̄)²/Sxx)) ~ G(0, 1)

holds independently of (6.6), then by Theorem 32 we obtain the pivotal quantity

(μ̃(x) − μ(x)) / (Se √(1/n + (x − x̄)²/Sxx)) ~ t(n − 2)

which can be used to obtain confidence intervals for μ(x) in the usual manner. Using the t
table or R, find the constant a such that P(−a ≤ T ≤ a) = p where T ~ t(n − 2). Since

p = P(−a ≤ T ≤ a)
  = P(−a ≤ (μ̃(x) − μ(x))/(Se √(1/n + (x − x̄)²/Sxx)) ≤ a)
  = P(μ̃(x) − aSe √(1/n + (x − x̄)²/Sxx) ≤ μ(x) ≤ μ̃(x) + aSe √(1/n + (x − x̄)²/Sxx))

a 100p% confidence interval for μ(x) is given by

μ̂(x) ± a se √(1/n + (x − x̄)²/Sxx)    (6.8)

where μ̂(x) = α̂ + β̂x.

Remark Note that since α = μ(0), a confidence interval for α is given by (6.8) with
x = 0, which gives

α̂ ± a se √(1/n + x̄²/Sxx)    (6.9)

In fact one can see from (6.9) that if x̄ is large in magnitude (which means the average xi
is large), then the confidence interval for α will be very wide. This would be disturbing if
the value x = 0 is a value of interest, but often it is not.

Prediction Interval for a Future Response

In Section 4.7 we constructed a 100p% prediction interval for a future observation when
the data were assumed to arise from a G(μ, σ) distribution. In the case of simple linear
regression we would also like to estimate or predict the Y value for a random unit, not part of
the sample, which has a specific value x for its covariate. To obtain a pivotal quantity
that can be used to construct a prediction interval for the future response Y, we note that
Y ~ G(μ(x), σ) from (6.3), or alternatively

Y = μ(x) + R, where R ~ G(0, σ)

is independent of Y1, Y2, ..., Yn. For a point estimator of Y it is natural to use the maximum
likelihood estimator μ̃(x) of μ(x), which has distribution

μ̃(x) ~ G(μ(x), σ √(1/n + (x − x̄)²/Sxx))

The error in the point estimator of Y is given by

Y − μ̃(x) = Y − μ(x) + μ(x) − μ̃(x) = R + [μ(x) − μ̃(x)]    (6.10)

Since R is independent of μ̃(x) (it is not connected to the existing sample), (6.10) is the
sum of independent Normally distributed random variables and is consequently Normally
distributed. Since

E[Y − μ̃(x)] = E{R + [μ(x) − μ̃(x)]}
            = E(R) + μ(x) − E[μ̃(x)]
            = 0 + μ(x) − μ(x) = 0

and

Var[Y − μ̃(x)] = Var(Y) + Var[μ̃(x)]
              = σ² + σ² [1/n + (x − x̄)²/Sxx]
              = σ² [1 + 1/n + (x − x̄)²/Sxx]

we have

Y − μ̃(x) ~ G(0, σ [1 + 1/n + (x − x̄)²/Sxx]^(1/2))

or

(Y − μ̃(x)) / (σ √(1 + 1/n + (x − x̄)²/Sxx)) ~ G(0, 1)    (6.11)

Since (6.11) holds independently of (6.6), then by Theorem 32 we obtain the pivotal
quantity

(Y − μ̃(x)) / (Se √(1 + 1/n + (x − x̄)²/Sxx)) ~ t(n − 2)

For an interval estimate with confidence coefficient p we choose a such that
p = P(−a ≤ T ≤ a) where T ~ t(n − 2). Since

p = P(−a ≤ (Y − μ̃(x))/(Se √(1 + 1/n + (x − x̄)²/Sxx)) ≤ a)
  = P(μ̃(x) − aSe √(1 + 1/n + (x − x̄)²/Sxx) ≤ Y ≤ μ̃(x) + aSe √(1 + 1/n + (x − x̄)²/Sxx))

we obtain the 100p% prediction interval

μ̂(x) ± a se √(1 + 1/n + (x − x̄)²/Sxx)    (6.12)

If we compare (6.8) and (6.12), we observe that the prediction interval will be wider than
the confidence interval, particularly if n is large. The prediction interval is an interval for a
future observation Y, which is a random variable, whereas the confidence interval is an in-
terval for the unknown mean μ(x) = α + βx. The width of the confidence interval depends
on the uncertainty in the estimation of the parameters α and β, that is, it depends on the
variances of the estimators α̃ and β̃. The width of the prediction interval depends on the un-
certainty in the estimation of the parameters α and β as well as the variance σ² of the random
variable Y. In other words, the uncertainty in determining an interval for a random variable
Y is greater than the uncertainty in determining an interval for the constant μ(x) = α + βx.

Remark When we construct a confidence interval or a prediction interval for a value of x
which lies outside the interval of observed xi's we are assuming that the linear relationship
holds beyond the observed data. This is dangerous since there are no data to support the
model assumptions.
These results are summarized in Tables 6.1 and 6.2.

Example 6.1.3 Revisited Exam mark versus midterm mark

In Example 6.1.3, Figure 6.2 suggested that a linear regression model of the form
E(Y|x) = α + βx would be reasonable for the data on final exam mark y and midterm
mark x. For the given data

n = 65
x̄ = 65.06154  ȳ = 75.38462
Sxx = 10813.75
Sxy = 6869.462
Syy = 8665.385

so we find

β̂ = Sxy/Sxx = 6869.462/10813.75 = 0.6352523
α̂ = ȳ − β̂x̄ = 75.38462 − (0.6352523)(65.06154) = 34.05413
se² = (Syy − β̂Sxy)/(n − 2) = [8665.385 − (0.6352523)(6869.462)]/63 = 68.27847
se = 8.263079

Note that when calculating these values using a calculator you should use as many decimal
places as possible, otherwise the values are affected by round-off error. The estimate β̂ =
0.6352523 indicates an increase in average exam mark of 0.6352523 for each one-mark
increase in midterm mark x.
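These point estimates can be reproduced from the summary statistics alone. A quick check in Python (the course output itself comes from R):

```python
import numpy as np

# Summary statistics quoted above for the exam-mark data.
n = 65
xbar, ybar = 65.06154, 75.38462
Sxx, Sxy, Syy = 10813.75, 6869.462, 8665.385

beta_hat = Sxy / Sxx                      # slope estimate
alpha_hat = ybar - beta_hat * xbar        # intercept estimate
se2 = (Syy - beta_hat * Sxy) / (n - 2)    # estimate of sigma^2 with n - 2 divisor
se = np.sqrt(se2)

print(beta_hat, alpha_hat, se)
```

The printed values match β̂ = 0.6352523, α̂ = 34.05413 and se = 8.263079 up to rounding.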

Table 6.1: Confidence/Prediction Intervals for the Simple Linear Regression Model

Unknown quantity: β
  Estimate: β̂ = Sxy/Sxx
  Estimator: β̃ = Σ_{i=1}^n (xi − x̄)Yi / Sxx
  Pivotal quantity: (β̃ − β)/(Se/√Sxx) ~ t(n − 2)
  100p% confidence interval: β̂ ± a se/√Sxx

Unknown quantity: α
  Estimate: α̂ = ȳ − β̂x̄
  Estimator: α̃ = Ȳ − β̃x̄
  Pivotal quantity: (α̃ − α)/(Se √(1/n + x̄²/Sxx)) ~ t(n − 2)
  100p% confidence interval: α̂ ± a se √(1/n + x̄²/Sxx)

Unknown quantity: μ(x) = α + βx
  Estimate: μ̂(x) = α̂ + β̂x
  Estimator: μ̃(x) = α̃ + β̃x
  Pivotal quantity: (μ̃(x) − μ(x))/(Se √(1/n + (x − x̄)²/Sxx)) ~ t(n − 2)
  100p% confidence interval: μ̂(x) ± a se √(1/n + (x − x̄)²/Sxx)

Unknown quantity: σ²
  Estimate: se² = (Syy − β̂Sxy)/(n − 2)
  Estimator: Se² = Σ_{i=1}^n (Yi − α̃ − β̃xi)²/(n − 2)
  Pivotal quantity: (n − 2)Se²/σ² ~ χ²(n − 2)
  100p% confidence interval: [(n − 2)se²/c, (n − 2)se²/b]

Future response Y
  Pivotal quantity: (Y − μ̃(x))/(Se √(1 + 1/n + (x − x̄)²/Sxx)) ~ t(n − 2)
  100p% prediction interval: μ̂(x) ± a se √(1 + 1/n + (x − x̄)²/Sxx)

Notes: The value a is given by P(T ≤ a) = (1 + p)/2 where T ~ t(n − 2).
The values b and c are given by P(W ≤ b) = (1 − p)/2 = P(W > c) where W ~ χ²(n − 2).

Table 6.2: Hypothesis Tests for the Simple Linear Regression Model

H0: β = β0
  Test statistic: |β̃ − β0|/(Se/√Sxx)
  p-value: 2P(T ≥ |β̂ − β0|/(se/√Sxx)) where T ~ t(n − 2)

H0: α = α0
  Test statistic: |α̃ − α0|/(Se √(1/n + x̄²/Sxx))
  p-value: 2P(T ≥ |α̂ − α0|/(se √(1/n + x̄²/Sxx))) where T ~ t(n − 2)

H0: σ = σ0
  Test statistic: (n − 2)Se²/σ0²
  p-value: min[2P(W ≤ (n − 2)se²/σ0²), 2P(W ≥ (n − 2)se²/σ0²)] where W ~ χ²(n − 2)

Figure 6.4 shows the scatterplot of the data together with the fitted line, y = α̂ + β̂x =
34.05413 + 0.6352523x. The fitted line passes through the points but we notice that there
is quite a bit of variability about the fitted line.
The p-value for testing H0: β = 0 is

2P(T ≥ |β̂ − 0|/(se/√Sxx))
= 2P(T ≥ |0.6352523 − 0|/(8.263079/√10813.75))
= 2P(T ≥ 7.994522) ≈ 0

where T ~ t(63). Therefore there is very strong evidence against the hypothesis H0: β = 0,
that is, the hypothesis of no relationship between exam mark and midterm mark, which is
consistent with what we see in Figure 6.4.

Figure 6.4: Scatterplot and …tted line for exam mark versus midterm mark

The p-value for testing H0: α = 0 is

2P(T ≥ |α̂ − 0|/(se √(1/n + x̄²/Sxx)))
= 2P(T ≥ |34.05413 − 0|/((8.263079) √(1/65 + (65.06154)²/10813.75)))
= 2P(T ≥ 6.461314) ≈ 0

where T ~ t(63). Therefore there is very strong evidence against the hypothesis H0: α = 0.
Note that α = 0 corresponds to a midterm mark of x = 0, which is well outside the
range of observed midterm marks. In other words we are assuming the linear relationship
holds outside the range of observed x values, which might not be valid. In this example the
hypothesis H0: α = 0 is not of particular interest.
These results can also be obtained more easily by using the command summary(lm(y~x))
in R. The table below gives the parts of the output which are of interest to us for this
course.

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 34.05413    5.27046   6.461 1.72e-08 ***
x            0.63525    0.07946   7.995 3.65e-11 ***

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 8.263 on 63 degrees of freedom

The values which are given in this table are:

(Intercept) row: Estimate = α̂, Std. Error = se √(1/n + x̄²/Sxx),
t value = α̂/(se √(1/n + x̄²/Sxx)), Pr(>|t|) = 2P(T ≥ |α̂ − 0|/(se √(1/n + x̄²/Sxx)))

x row: Estimate = β̂, Std. Error = se/√Sxx,
t value = β̂/(se/√Sxx), Pr(>|t|) = 2P(T ≥ |β̂ − 0|/(se/√Sxx))

where T ~ t(n − 2). The Residual standard error is equal to se, the estimate of σ. The
entry 3.65e-11 *** in the row labeled x in the table indicates that the p-value for testing
H0: β = 0 is equal to 3.65 × 10⁻¹¹, which is less than 0.001.

Since P(T ≤ 1.998341) = 0.975 where T ~ t(63), a 95% confidence interval for β is

β̂ ± 1.998341 se/√Sxx
= 0.6352523 ± 1.998341 (8.263079)/√10813.75
= [0.4764622, 0.7940424]

This interval does not contain any values of β close to zero, which is consistent with the fact
that the p-value for testing H0: β = 0 was approximately zero.
A 95% confidence interval for α is

α̂ ± 1.998341 se √(1/n + x̄²/Sxx)
= 34.05413 ± 1.998341 (8.263079) √(1/65 + (65.06154)²/10813.75)
= [23.52194, 44.58632]

This interval does not contain any values of α close to zero, which is consistent with the
fact that the p-value for testing H0: α = 0 was approximately zero.

A 95% confidence interval for μ(50) = α + β(50) = the mean exam grade for students
with a midterm mark of x = 50 is

α̂ + β̂(50) ± 1.998341 se √(1/n + (50 − x̄)²/Sxx)
= 65.81674 ± 1.998341 (8.263079) √(1/65 + (50 − 65.06154)²/10813.75)
= [62.66799, 68.96549]

Note that this is a confidence interval for the mean or average exam mark for students who
obtain a midterm mark of x = 50. If we want to give an interval of values for an individual
student who obtained a midterm mark of x = 50 then we should use a prediction interval.
A 95% prediction interval is

α̂ + β̂(50) ± 1.998341 se √(1 + 1/n + (50 − x̄)²/Sxx)
= 65.81674 ± 1.998341 (8.263079) √(1 + 1/65 + (50 − 65.06154)²/10813.75)
= [49.00675, 82.62673]

As we have indicated before, this interval is much wider than the confidence interval for the mean
exam mark. Based on this interval, what advice would you give to a student who obtained
a mark of 50 on the midterm?
These intervals can also be easily obtained using R. For example, the R commands
confint(lm(y~x),level=0.95)
predict(lm(y~x),data.frame("x"=50),interval="confidence",lev=0.95)
predict(lm(y~x),data.frame("x"=50),interval="prediction",lev=0.95)
give the output

                 2.5 %     97.5 %
(Intercept) 23.5219441 44.5863083
x            0.4764623  0.7940423

       fit      lwr      upr
1 65.81674 62.66799 68.96549

       fit      lwr      upr
1 65.81674 49.00676 82.62672

The values in these tables can be compared to the intervals obtained above.
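The confidence and prediction intervals at x = 50 can likewise be reproduced from the summary statistics. A Python sketch using the t(63) quantile (mirroring the R `predict` output above):

```python
import numpy as np
from scipy.stats import t

# Summary statistics for the exam-mark data, as quoted in the example.
n = 65
xbar = 65.06154
Sxx = 10813.75
alpha_hat, beta_hat, se = 34.05413, 0.6352523, 8.263079

a = t.ppf(0.975, n - 2)                  # 0.975 quantile of t(63), approx 1.998341
x0 = 50.0
mu_hat = alpha_hat + beta_hat * x0       # fitted mean response at x = 50

# Half-widths of the 95% confidence interval for mu(50) and the
# 95% prediction interval for a single new response at x = 50.
ci_half = a * se * np.sqrt(1/n + (x0 - xbar)**2 / Sxx)
pi_half = a * se * np.sqrt(1 + 1/n + (x0 - xbar)**2 / Sxx)

print(mu_hat - ci_half, mu_hat + ci_half)
print(mu_hat - pi_half, mu_hat + pi_half)
```

The extra "1 +" under the square root is the only difference between the two intervals, and it is what makes the prediction interval so much wider.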

Example 6.1.4 Revisited Breaking strength versus diameter of steel bolt

Recall the data given in Example 6.1.4, where Y represented the breaking strength of a
randomly selected steel bolt and x was the bolt's diameter. A scatterplot of points (xi, yi)
for 30 bolts suggested a nonlinear relationship between Y and x. A bolt's strength might be
expected to be proportional to its cross-sectional area, which is proportional to x². Figure
6.5 shows a plot of the points (xi², yi), which looks quite linear. Because of this let us assign a
new variate name to x², say x1 = x². We then fit the linear model

Yi ~ G(α + βx1i, σ) where x1i = xi²

to the data.
For these data we obtain α̂ = 1.6668, β̂ = 2.8378, se² = 0.002656 and Sx1x1 = 0.2244.
The fitted regression line is shown on the scatterplot in Figure 6.5. The model appears to
fit the data well.


Figure 6.5: Scatterplot plus fitted line for strength versus diameter squared

The parameter β represents the increase in average strength μ(x1) from increasing
x1 = x² by one unit. Using (6.7) and the fact that P(T ≤ 2.0484) = 0.975 for T ~ t(28),
a 95% confidence interval for β is given by

β̂ ± 2.0484 se/√Sx1x1 = 2.8378 ± 0.2228
= [2.6149, 3.0606]
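The half-width of this interval can be verified from the reported summary values (a quick numerical check in Python rather than R; the values are the ones quoted for this example):

```python
import math

beta_hat = 2.8378
se2 = 0.002656       # residual variance s_e^2
Sx1x1 = 0.2244
t975 = 2.0484        # P(T <= 2.0484) = 0.975 for T ~ t(28)

half = t975 * math.sqrt(se2 / Sx1x1)
ci = (beta_hat - half, beta_hat + half)
print(ci)  # approximately (2.615, 3.061)
```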

Checking the Model Assumptions for Simple Linear Regression

There are two main components in Gaussian linear response models:

(1) The assumption that Yi (given any covariates xi) is Gaussian with constant standard
deviation σ.

(2) The assumption that E(Yi) = μ(xi) is a linear combination of observed covariates
with unknown coefficients.

Models should always be checked. In problems with only one x covariate, a plot of
the fitted line superimposed on the scatterplot of the data (as in Figures 6.4 and 6.5)
shows pretty clearly how well the model fits. If there are two or more covariates in the
model, residual plots, which are described below, are very useful for checking the model
assumptions.
Consider the simple linear regression model for which Yi ~ G(μi, σ) where μi = α + βxi
and Ri = Yi − μi ~ G(0, σ), i = 1, 2, ..., n independently. Residuals are defined as the
difference between the observed response and the fitted response, that is, r̂i = yi − μ̂i,
i = 1, 2, ..., n, where yi is the observed response and μ̂i = α̂ + β̂xi is the fitted response.
The idea behind the r̂i's is that they can be thought of as "observed" Ri's. This isn't exactly
correct since we are using μ̂i instead of μi, but if the model is correct, then the r̂i's should
behave roughly like a random sample from the G(0, σ) distribution. Another reason why
the r̂i's only behave roughly like a random sample from the G(0, σ) distribution is because
Σ r̂i = 0. To see this, recall that the maximum likelihood estimate of α is α̂ = ȳ − β̂x̄,
which implies

0 = ȳ − α̂ − β̂x̄ = (1/n) Σ (yi − α̂ − β̂xi) = (1/n) Σ r̂i

or

Σ r̂i = 0

(sums over i = 1, 2, ..., n). A random sample Ri ~ G(0, σ), i = 1, 2, ..., n, does not satisfy
such a restriction.


Residual plots can be used to check the model assumptions. Here are three residual
plots which can be used:

(1) Plot the points (xi, r̂i), i = 1, 2, ..., n. If the model is satisfactory the points should lie
more or less horizontally within a constant band around the line r̂i = 0.

(2) Plot the points (μ̂i, r̂i), i = 1, 2, ..., n. If the model is satisfactory the points should lie
more or less horizontally within a constant band around the line r̂i = 0.

(3) Plot a Normal qqplot of the residuals r̂i. If the model is satisfactory the points should
lie more or less along a straight line. (Note that since the yi's do not all have the same
mean, it does not make sense to do a qqplot of the yi's.)
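These diagnostics are easy to compute by hand. The sketch below (Python with numpy; the data are simulated, not taken from the course examples) fits a simple linear regression and forms the residuals; the three plots described above would then be produced with a plotting library such as matplotlib or with scipy's probplot:

```python
import numpy as np

def fit_simple_linear(x, y):
    """Least squares (= maximum likelihood) estimates of alpha and beta,
    plus fitted values and residuals."""
    xbar, ybar = x.mean(), y.mean()
    Sxx = ((x - xbar) ** 2).sum()
    Sxy = ((x - xbar) * (y - ybar)).sum()
    beta = Sxy / Sxx
    alpha = ybar - beta * xbar
    fitted = alpha + beta * x
    resid = y - fitted
    return alpha, beta, fitted, resid

rng = np.random.default_rng(1)
x = np.linspace(0, 50, 40)
y = 2.0 + 0.5 * x + rng.normal(0, 1.0, size=x.size)

alpha, beta, fitted, resid = fit_simple_linear(x, y)

# The identity sum(r_i) = 0 from the text holds for the fitted residuals:
print(abs(resid.sum()) < 1e-10)  # True
# Plots (1)-(3) would be: scatter(x, resid), scatter(fitted, resid),
# and a Normal qqplot of resid, e.g. scipy.stats.probplot(resid).
```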

Figure 6.6: Residual plot for example in which model assumptions hold

Figure 6.6 shows a residual plot in which the points lie reasonably within a horizontal
constant band around the line r̂i = 0, which suggests that the model assumptions are
reasonable.
Systematic departures from the "expected" pattern suggest problems with the model
assumptions. In Figure 6.7, the points do not lie within a constant band around the line
r̂i = 0. As x increases the points lie above the line r̂i = 0, then below, and then above. This
pattern of points suggests that the mean function μi = μ(xi) is not correctly specified. A
quadratic form for the mean such as μ(xi) = α + βxi + γxi² might provide a better fit to
the data.


Figure 6.7: Example of residual plot which indicates that assumption E(Yi) = α + βxi does not hold

The model assumption, Yi ~ G(α + βxi, σ), i = 1, 2, ..., n independently, assumes that
the standard deviation σ does not vary with x. The pattern of points in Figure 6.8 suggests
that this assumption of constant variance does not hold since the spread of the points about
the line r̂i = 0 increases as x increases. Sometimes transforming the response variate can
solve this problem. Transformations such as log y and √y are frequently used.


Figure 6.8: Example of residual plot which indicates that assumption Var(Yi) = σ² does not hold

Reading these plots requires practice. You should try not to read too much into plots,
particularly if the plots are based on a small number of points.
Often we prefer to use standardized residuals

r̂i* = r̂i/se = (yi − μ̂i)/se = (yi − α̂ − β̂xi)/se for i = 1, 2, ..., n

Standardized residuals were used in Figures 6.7 and 6.8. The patterns in the plots are
unchanged whether we use r̂i or r̂i*, however the r̂i* values tend to lie in the interval [−3, 3].
The reason for this is that, since the r̂i's behave roughly like a random sample from the
G(0, σ) distribution, the r̂i*'s should behave roughly like a random sample from the G(0, 1)
distribution. Since P(−3 ≤ Z ≤ 3) = 0.9973 where Z ~ G(0, 1), roughly 99.73% of
the observations should lie in the interval [−3, 3].
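The 99.73% figure is easy to confirm by simulation (a Python sketch; simulated standard Normal values stand in for the standardized residuals):

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(0.0, 1.0, size=100_000)   # plays the role of standardized residuals
frac_within_3 = np.mean(np.abs(z) <= 3)
print(frac_within_3)  # close to 0.9973
```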

Example 6.1.4 Revisited Breaking strength versus diameter of steel bolts

Figure 6.9 shows a standardized residual plot for the steel bolt data, where the explanatory
variate is diameter squared. No deviation from the expected pattern is observed. This
is of course also evident from Figure 6.5.


Figure 6.9: Standardized residuals versus diameter squared for bolt data

A qqplot of the standardized residuals is given in Figure 6.10. There are only 30 points.
The points lie reasonably along a straight line with more variability in the tails which is
expected. The Gaussian assumption seems reasonable based on this small number of points.


Figure 6.10: Qqplot of standardized residuals for bolt data



6.3 Comparison of Two Population Means


Two Gaussian Populations with Common Variance
Suppose Y11, Y12, ..., Y1n1 is a random sample from the G(μ1, σ) distribution and independently
Y21, Y22, ..., Y2n2 is a random sample from the G(μ2, σ) distribution. Notice that we
have assumed that both populations have the same variance σ². We use double subscripts
for the Y's here, the first index to indicate the population from which the sample was drawn,
the second to indicate which draw from that population. We could easily conform with the
notation of (6.1) by stacking these two sets of observations in a vector of n = n1 + n2
observations:
(Y11, Y12, ..., Y1n1, Y21, Y22, ..., Y2n2)ᵀ

and obtain the conclusions below as a special case of the linear model. Below we derive the
estimates from the likelihood directly.
The likelihood function for μ1, μ2, σ is

L(μ1, μ2, σ) = ∏_{j=1}^{2} ∏_{i=1}^{nj} (1/(√(2π) σ)) exp[−(yji − μj)²/(2σ²)] for μ1 ∈ ℝ, μ2 ∈ ℝ, σ > 0

Maximization of the likelihood function gives the maximum likelihood estimates

μ̂1 = (1/n1) Σ_{i=1}^{n1} y1i = ȳ1
μ̂2 = (1/n2) Σ_{i=1}^{n2} y2i = ȳ2
and σ̂² = (1/(n1 + n2)) [Σ_{i=1}^{n1} (y1i − ȳ1)² + Σ_{i=1}^{n2} (y2i − ȳ2)²]

An estimate of the variance σ², called the pooled estimate of variance, is

sp² = (1/(n1 + n2 − 2)) [Σ_{i=1}^{n1} (y1i − ȳ1)² + Σ_{i=1}^{n2} (y2i − ȳ2)²]
    = [(n1 − 1)s1² + (n2 − 1)s2²] / (n1 + n2 − 2)
    = [(n1 + n2)/(n1 + n2 − 2)] σ̂²

where

s1² = (1/(n1 − 1)) Σ_{i=1}^{n1} (y1i − ȳ1)² and s2² = (1/(n2 − 1)) Σ_{i=1}^{n2} (y2i − ȳ2)²
are the sample variances obtained from the individual samples. The estimate sp² can be
written as

sp² = (w1 s1² + w2 s2²)/(w1 + w2)

to show that sp² is a weighted average of the sample variances sj² with weights equal to
wj = nj − 1. With these weights the sample variance from the larger sample is weighted
more. Why does this make sense?
We will use the estimate sp² for σ² rather than σ̂² since

E(Sp²) = E{ [Σ_{i=1}^{n1} (Y1i − Ȳ1)² + Σ_{i=1}^{n2} (Y2i − Ȳ2)²] / (n1 + n2 − 2) } = σ²
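A small helper makes the pooled calculation concrete (a Python sketch; the function name is ours, and the test values come from the paint example of Example 6.3.1 below):

```python
import math

def pooled_sd(n1, s1, n2, s2):
    """Pooled estimate s_p: weighted average of the sample variances,
    with weights n_j - 1."""
    sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    return math.sqrt(sp2)

# with the paint data of Example 6.3.1 (n1 = n2 = 12, s1 = 1.1314, s2 = 1.8742):
print(round(pooled_sd(12, 1.1314, 12, 1.8742), 4))  # 1.548
```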

Confidence intervals for μ1 − μ2

To determine whether the two populations differ and by how much, we will need to generate
confidence intervals for the difference μ1 − μ2. First note that the maximum likelihood
estimator of this difference is Ȳ1 − Ȳ2, which has expected value

E(Ȳ1 − Ȳ2) = μ1 − μ2

and variance

Var(Ȳ1 − Ȳ2) = Var(Ȳ1) + Var(Ȳ2) = σ²/n1 + σ²/n2 = σ²(1/n1 + 1/n2)

It naturally follows that an estimator of Var(Ȳ1 − Ȳ2) from the pooled data is

Sp²(1/n1 + 1/n2)

and that this has n1 − 1 + n2 − 1 = n1 + n2 − 2 degrees of freedom. This provides at least
an intuitive justification for the following:

Theorem 41 If Y11, Y12, ..., Y1n1 is a random sample from the G(μ1, σ) distribution and
independently Y21, Y22, ..., Y2n2 is a random sample from the G(μ2, σ) distribution then

[Ȳ1 − Ȳ2 − (μ1 − μ2)] / [Sp √(1/n1 + 1/n2)] ~ t(n1 + n2 − 2)

and

(n1 + n2 − 2)Sp²/σ² = (1/σ²) Σ_{j=1}^{2} Σ_{i=1}^{nj} (Yji − Ȳj)² ~ χ²(n1 + n2 − 2)

Confidence intervals or tests of hypothesis for μ1 − μ2 and σ can be obtained using these
pivotal quantities.
In particular a 100p% confidence interval for μ1 − μ2 is

ȳ1 − ȳ2 ± a sp √(1/n1 + 1/n2)   (6.13)

where P(T ≤ a) = (1 + p)/2 and T ~ t(n1 + n2 − 2).


To test H0: μ1 − μ2 = 0 we use the test statistic

D = (Ȳ1 − Ȳ2 − 0) / [Sp √(1/n1 + 1/n2)] = (Ȳ1 − Ȳ2) / [Sp √(1/n1 + 1/n2)]   (6.14)

with

p-value = P(|T| ≥ |ȳ1 − ȳ2 − 0| / [sp √(1/n1 + 1/n2)]) = 2[1 − P(T ≤ |ȳ1 − ȳ2 − 0| / (sp √(1/n1 + 1/n2)))]

where T ~ t(n1 + n2 − 2).

A 100p% confidence interval for σ is

[√((n1 + n2 − 2)sp²/b), √((n1 + n2 − 2)sp²/a)]

where

P(U ≤ a) = (1 − p)/2, P(U ≤ b) = (1 + p)/2, and U ~ χ²(n1 + n2 − 2)

Two Gaussian Populations with Unequal Variances

The procedures derived above assume that the two Gaussian distributions have the same
standard deviation. Sometimes this is not a reasonable assumption (it can be tested using
a likelihood ratio test) and we must assume that Y11, Y12, ..., Y1n1 is a random sample from
the G(μ1, σ1) distribution and independently Y21, Y22, ..., Y2n2 is a random sample from
the G(μ2, σ2) distribution, but σ1 ≠ σ2. If σ1 and σ2 are known then we could use the
pivotal quantity

[Ȳ1 − Ȳ2 − (μ1 − μ2)] / √(σ1²/n1 + σ2²/n2) ~ G(0, 1)   (6.15)

A 100p% confidence interval for μ1 − μ2 is

ȳ1 − ȳ2 ± a √(σ1²/n1 + σ2²/n2)

where P(Z ≤ a) = (1 + p)/2 and Z ~ G(0, 1). To test H0: μ1 − μ2 = 0 we use the test statistic

D = (Ȳ1 − Ȳ2 − 0) / √(σ1²/n1 + σ2²/n2) = (Ȳ1 − Ȳ2) / √(σ1²/n1 + σ2²/n2)

with

p-value = P(|Z| ≥ |ȳ1 − ȳ2 − 0| / √(σ1²/n1 + σ2²/n2)) = 2[1 − P(Z ≤ |ȳ1 − ȳ2 − 0| / √(σ1²/n1 + σ2²/n2))]

where Z ~ G(0, 1).



Table 6.3
Confidence Intervals for Two Sample Gaussian Model

Model: G(μ1, σ1) and G(μ2, σ2), σ1 and σ2 known
Parameter: μ1 − μ2
Pivotal quantity: [Ȳ1 − Ȳ2 − (μ1 − μ2)] / √(σ1²/n1 + σ2²/n2) ~ G(0, 1)
100p% confidence interval: ȳ1 − ȳ2 ± a √(σ1²/n1 + σ2²/n2)

Model: G(μ1, σ1) and G(μ2, σ2), σ1 = σ2 = σ unknown
Parameter: μ1 − μ2
Pivotal quantity: [Ȳ1 − Ȳ2 − (μ1 − μ2)] / [Sp √(1/n1 + 1/n2)] ~ t(n1 + n2 − 2)
100p% confidence interval: ȳ1 − ȳ2 ± b sp √(1/n1 + 1/n2)

Model: G(μ1, σ) and G(μ2, σ), μ1 and μ2 unknown
Parameter: σ
Pivotal quantity: (n1 + n2 − 2)Sp²/σ² ~ χ²(n1 + n2 − 2)
100p% confidence interval: [√((n1 + n2 − 2)sp²/d), √((n1 + n2 − 2)sp²/c)]

Model: G(μ1, σ1) and G(μ2, σ2), σ1 ≠ σ2, σ1 and σ2 unknown
Parameter: μ1 − μ2
Asymptotic Gaussian pivotal quantity: [Ȳ1 − Ȳ2 − (μ1 − μ2)] / √(S1²/n1 + S2²/n2), approximately G(0, 1) for large n1, n2
Approximate 100p% confidence interval: ȳ1 − ȳ2 ± a √(s1²/n1 + s2²/n2)

Notes:
The value a is given by P(Z ≤ a) = (1 + p)/2 where Z ~ G(0, 1).
The value b is given by P(T ≤ b) = (1 + p)/2 where T ~ t(n1 + n2 − 2).
The values c and d are given by P(W ≤ c) = (1 − p)/2 = P(W > d) where W ~ χ²(n1 + n2 − 2).

Table 6.4
Hypothesis Tests for Two Sample Gaussian Model

Model: G(μ1, σ1) and G(μ2, σ2), σ1 and σ2 known
Hypothesis: H0: μ1 = μ2
Test statistic: |Ȳ1 − Ȳ2| / √(σ1²/n1 + σ2²/n2)
p-value: 2P(Z ≥ |ȳ1 − ȳ2| / √(σ1²/n1 + σ2²/n2)) where Z ~ G(0, 1)

Model: G(μ1, σ) and G(μ2, σ), σ unknown
Hypothesis: H0: μ1 = μ2
Test statistic: |Ȳ1 − Ȳ2| / [Sp √(1/n1 + 1/n2)]
p-value: 2P(T ≥ |ȳ1 − ȳ2| / [sp √(1/n1 + 1/n2)]) where T ~ t(n1 + n2 − 2)

Model: G(μ1, σ) and G(μ2, σ), μ1 and μ2 unknown
Hypothesis: H0: σ = σ0
Test statistic: W = (n1 + n2 − 2)Sp²/σ0²
p-value: min[2P(W ≤ (n1 + n2 − 2)sp²/σ0²), 2P(W ≥ (n1 + n2 − 2)sp²/σ0²)] where W ~ χ²(n1 + n2 − 2)

Model: G(μ1, σ1) and G(μ2, σ2), σ1 ≠ σ2, σ1 and σ2 unknown
Hypothesis: H0: μ1 = μ2
Test statistic: |Ȳ1 − Ȳ2| / √(S1²/n1 + S2²/n2)
Approximate p-value: 2P(Z ≥ |ȳ1 − ȳ2| / √(s1²/n1 + s2²/n2)) where Z ~ G(0, 1), for large n1, n2

In the case in which σ1 and σ2 are unknown there is no exact pivotal quantity
which can be used. However if we replace the quantities σ1² and σ2² in the pivotal quantity
(6.15) by their respective estimators S1² and S2², we obtain the random variable

[Ȳ1 − Ȳ2 − (μ1 − μ2)] / √(S1²/n1 + S2²/n2)   (6.16)

It can be shown that this asymptotic pivotal quantity has approximately a G(0, 1)
distribution if n1 and n2 are both large. An approximate 100p% confidence interval for
μ1 − μ2 based on this pivotal quantity is

ȳ1 − ȳ2 ± a √(s1²/n1 + s2²/n2)   (6.17)

where P(Z ≤ a) = (1 + p)/2 and Z ~ G(0, 1).
These results are summarized in Tables 6.3 and 6.4.

Example 6.3.1 Durability of paint

In an experiment to assess the durability of two types of white paint used on asphalt
highways, 12 lines (each 4 inches wide) of each paint were laid across a heavily traveled
section of highway, in random order. After a period of time, reflectometer readings were
taken for each line of paint; the higher the readings the greater the reflectivity and the
visibility of the paint. The measurements of reflectivity were as follows:

Paint A: 12.5  11.7  9.9  9.6  10.3  9.6  9.4  11.3  8.7  11.5  10.6  9.7
Paint B:  9.4  11.6  9.7  10.4  6.9  7.3  8.4  7.2  7.0  8.2  12.7  9.2

The objectives of the experiment were to test whether the average reflectivities for paints A
and B are the same, and if there is evidence of a difference, to obtain a confidence interval
for their difference. (In many problems where two attributes are to be compared we start
by testing the hypothesis that they are equal, even if we feel there may be a difference. If
there is no statistical evidence of a difference then we stop there.)
To do this it is assumed that, to a close approximation, the reflectivity measurements
Y1i, i = 1, 2, ..., 12 for paint A are independent G(μ1, σ1) random variables, and independently
the measurements Y2i, i = 1, 2, ..., 12 for paint B are independent G(μ2, σ2) random
variables. We can test H0: μ1 − μ2 = 0 and get confidence intervals for μ1 − μ2 by using
the pivotal quantity

[Ȳ1 − Ȳ2 − (μ1 − μ2)] / [Sp √(1/12 + 1/12)] ~ t(22)   (6.18)

Using this pivotal quantity means we have assumed that the two population variances are
equal, σ1 = σ2 = σ, and that we are using the estimator Sp for σ. If the observed sample
variances differed by a great deal we would not make this assumption. Unfortunately if the

variances are not assumed equal the problem becomes more difficult. The case of unequal
variances is discussed in the next section.
From these data we have

n1 = 12,  ȳ1 = 10.4,  Σ_{i=1}^{12} (y1i − ȳ1)² = 14.08,  s1 = 1.1314
n2 = 12,  ȳ2 = 9.0,   Σ_{i=1}^{12} (y2i − ȳ2)² = 38.64,  s2 = 1.8742

sp = √{ (1/(12 + 12 − 2)) [Σ_{i=1}^{12} (y1i − ȳ1)² + Σ_{i=1}^{12} (y2i − ȳ2)²] } = 1.5480

The observed value of the test statistic (6.14) is

d = |ȳ1 − ȳ2 − 0| / [sp √(1/12 + 1/12)] = 1.4 / [1.5480 √(1/6)] = 2.215

with

p-value = P(|T| ≥ 2.215) = 2[1 − P(T ≤ 2.215)] = 0.038

where T ~ t(22). Since 0.01 < p-value < 0.05, there is evidence based on the data against
H0: μ1 = μ2.
Since ȳ1 > ȳ2, the indication is that paint A keeps its visibility better. Since
P(T ≤ 2.074) = 0.975 where T ~ t(22), a 95% confidence interval for μ1 − μ2 based on
(6.13) is

10.4 − 9.0 ± 2.074 (1.5480) √(1/12 + 1/12)
= 1.4 ± 1.3107
= [0.089, 2.711]

This suggests that although the difference in reflectivity (and durability) of the paint is
statistically significant, the size of the difference is not really large relative to the sizes of
μ1 and μ2. This can be seen by noting that μ̂1 = ȳ1 = 10.4 and μ̂2 = ȳ2 = 9.0, whereas
μ̂1 − μ̂2 = 1.4, so the relative difference is of the order of 10%.

Remark The R command t.test(y1, y2, var.equal=T, conf.level=p), where y1 and y2
are the data vectors, will carry out the test above and give a 100p% confidence interval for
μ1 − μ2.
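The calculation can also be reproduced from first principles. The sketch below redoes the test in Python using only the formulas above (rather than R's t.test):

```python
import math
import statistics

paintA = [12.5, 11.7, 9.9, 9.6, 10.3, 9.6, 9.4, 11.3, 8.7, 11.5, 10.6, 9.7]
paintB = [9.4, 11.6, 9.7, 10.4, 6.9, 7.3, 8.4, 7.2, 7.0, 8.2, 12.7, 9.2]

n1, n2 = len(paintA), len(paintB)
ybar1, ybar2 = statistics.mean(paintA), statistics.mean(paintB)
s1, s2 = statistics.stdev(paintA), statistics.stdev(paintB)

# pooled standard deviation and observed test statistic (6.14)
sp = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
d = abs(ybar1 - ybar2) / (sp * math.sqrt(1/n1 + 1/n2))

# 95% confidence interval (6.13); 2.074 = t_{0.975}(22)
half = 2.074 * sp * math.sqrt(1/n1 + 1/n2)
ci = (ybar1 - ybar2 - half, ybar1 - ybar2 + half)

print(round(d, 3))  # 2.215
print(ci)           # approximately (0.089, 2.711)
```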

Example 6.3.2 Scholastic achievement test scores

Tests that are designed to measure the achievement of students are often given in various
subjects. Educators and parents often compare results for different schools or districts. We
consider here the scores on a mathematics test given to Canadian students in the 5th grade.
Summary statistics (sample sizes, means, and standard deviations) of the scores y for the
students in two small school districts in Ontario are as follows:

District 1: n1 = 278  ȳ1 = 60.2  s1 = 10.16
District 2: n2 = 345  ȳ2 = 58.1  s2 = 9.02

If a likelihood ratio test of the hypothesis σ1 = σ2 is conducted there is strong evidence
against the hypothesis based on the data, so we cannot assume equal variances in this
example.
The average score is somewhat higher in District 1, but is this difference statistically
significant? We will give a confidence interval for the difference in average scores in a model
representing this setting. This is done by thinking of the students in each district as a
random sample from a conceptual large population of "similar" students writing "similar"
tests. We assume that the scores in District 1 have a G(μ1, σ1) distribution and that
the scores in District 2 have a G(μ2, σ2) distribution. We can then test the hypothesis
H0: μ1 = μ2 or alternatively construct a confidence interval for the difference μ1 − μ2.
(Achievement tests are usually designed so that the scores are approximately Gaussian, so
this is a sensible procedure.)
Since n1 = 278 and n2 = 345 we use (6.17) to construct an approximate 95% confidence
interval for μ1 − μ2. We obtain

60.2 − 58.1 ± 1.96 √((10.16)²/278 + (9.02)²/345)
= 2.1 ± (1.96)(0.779)
= [0.57, 3.63]
Since μ1 − μ2 = 0 is outside the approximate 95% confidence interval (can you show that
it is also outside the approximate 99% confidence interval?) we can conclude there is fairly
strong evidence against the hypothesis H0: μ1 = μ2, suggesting that μ1 > μ2. We should
not rely only on a comparison of their means. It is a good idea to look carefully at the data
and the distributions suggested for the two groups using histograms or boxplots.
The mean is a little higher for District 1 and, because the sample sizes are so large,
this gives a "statistically significant" difference in a test of H0: μ1 = μ2. Unfortunately,
"significant" tests like this are often used to make claims that one group or class or school
is "superior" to another. Recall that the validity of this method depends on the assumption
that the students in each district are a random sample from a conceptual large population of
"similar" students writing "similar" tests. How reasonable is this assumption? How likely
is it that marks in a class are independent of one another and no more alike than marks
between two classes in two different years?
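The interval above is a direct application of (6.17) and can be reproduced from the summary statistics alone (a Python sketch):

```python
import math

n1, ybar1, s1 = 278, 60.2, 10.16
n2, ybar2, s2 = 345, 58.1, 9.02

# approximate 95% CI (6.17): 1.96 = z_{0.975}
half = 1.96 * math.sqrt(s1**2 / n1 + s2**2 / n2)
ci = (ybar1 - ybar2 - half, ybar1 - ybar2 + half)
print(ci)  # approximately (0.57, 3.63)
```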

Comparison of Means Using Paired Data

Often experimental studies designed to compare means are conducted with pairs of units,
where the responses within a pair are not independent. The following examples illustrate
this.

Example 6.3.3 Heights of males versus females

In a study in England, the heights of 1401 (brother, sister) pairs of adults were determined.
One objective of the study was to compare the heights of adult males and females;
another was to examine the relationship between the heights of male and female siblings.
Let Y1i and Y2i be the heights of the male and female, respectively, in the i'th (brother,
sister) pair (i = 1, 2, ..., 1401). Assuming that the pairs are sampled randomly from the
population, we can use them to estimate

μ1 = E(Y1i) and μ2 = E(Y2i)

and the difference μ1 − μ2. However, the heights of related persons are not independent.
If we know that one sibling from a family is tall (short) then on average we would expect
other siblings in the family to also be tall (short), so heights of siblings are correlated and
therefore not independent. The method in the preceding section should not be used to
estimate μ1 − μ2 since it would require independent random samples of males and females.
In fact, the primary reason for collecting these data was to consider the joint distribution
of (Y1i, Y2i) and to examine their relationship. A clear picture of the relationship could be
obtained by plotting the observed points (y1i, y2i) in a scatterplot.

Example 6.3.4 Comparison of car fuels

In a study to compare standard gasoline with gas containing an additive designed to
improve mileage (i.e. reduce fuel consumption), the following experiment was conducted.
Fifty cars of a variety of makes and engine sizes were chosen. Each car was driven in a
standard way on a test track for 1000 km, with the standard fuel (S) and also with the
enhanced fuel (E). The order in which the S and E fuels were used was randomized for each
car (you can think of a coin being tossed for each car, with fuel S being used first if a Head
occurred) and the same driver was used for both fuels in a given car. Drivers were different
across the 50 cars.
Suppose we let Y1i and Y2i be the amount of fuel consumed (in liters) for the i'th car with
the S and E fuels, respectively. We want to estimate E(Y1i − Y2i). The fuel consumptions
Y1i, Y2i for the i'th car are related, because factors such as size, weight and engine size
(and perhaps the driver) affect consumption. The assumption that the Y1i's are a random
sample from a large population with mean μ1, and independently the Y2i's are a random
sample from a large population with mean μ2, would not be appropriate in this example.
The observations have been paired deliberately to eliminate some factors (like driver/car
size) which might otherwise affect the conclusion. Note that in this example it may not

be of much interest to consider E(Y1i ) and E(Y2i ) separately, since there is only a single
observation on each car type for each fuel.

There are two types of Gaussian models which can be used to model paired data. The
first involves what is called a Bivariate Normal distribution for (Y1i, Y2i), and it could be
used in the fuel consumption example. The Bivariate Normal distribution is a continuous
bivariate model for which each component has a Normal distribution and the components
may be dependent. We will not describe this model here (it is studied in third year courses),
except to note one fundamental property: if (Y1i, Y2i) has a Bivariate Normal distribution
then the difference Y1i − Y2i is also Normally distributed, with variance σ² = Var(Y1i) +
Var(Y2i) − 2Cov(Y1i, Y2i). Thus, if we are interested in making inferences about μ1 − μ2
then we can do this by analyzing the within-pair differences Yi = Y1i − Y2i and using the
model

Yi = Y1i − Y2i ~ N(μ1 − μ2, σ²), i = 1, 2, ..., n independently

or equivalently

Yi ~ G(μ, σ), i = 1, 2, ..., n independently   (6.19)

where μ = μ1 − μ2. The methods for a G(μ, σ) model discussed in Sections 4.7 and
5.2 can then be used to estimate and test hypotheses about the parameters μ and σ.
The second Gaussian model used with paired data assumes

Y1i ~ G(μ1 + αi, σ1) and Y2i ~ G(μ2 + αi, σ2) independently

where the αi's are unknown constants. The αi's represent factors specific to the different
pairs, so that some pairs can have larger (smaller) expected values than others. This model
also gives a Gaussian distribution like (6.19), since Y1i − Y2i has a Gaussian distribution
with

E(Y1i − Y2i) = μ1 − μ2 = μ

(note that the αi's cancel) and

Var(Y1i − Y2i) = σ1² + σ2² = σ²

Such a model might be reasonable for Example 6.3.4, where i refers to the i'th car type.
Thus, whenever we encounter paired data in which the random variables Y1i and Y2i
are adequately modeled by Gaussian distributions, we will make inferences about μ1 − μ2
by working with the model (6.19).

Example 6.3.3 Revisited Heights of males versus females

The data on 1401 (brother, sister) pairs gave differences Yi = Y1i − Y2i, i = 1, 2, ..., 1401,
for which the sample mean and variance were

ȳ = 4.895 inches

and

s² = (1/1400) Σ_{i=1}^{1401} (yi − ȳ)² = 6.5480 (inches)²

Using the pivotal quantity

(Ȳ − μ) / (S/√n) ~ t(1400)

a 95% confidence interval for μ = E(Yi) is given by

ȳ ± 1.96 s/√n = 4.895 ± 1.96 √(6.5480/1401)
= 4.895 ± 0.134
= [4.76, 5.03]

Note that t(1400) is indistinguishable from G(0, 1) so we use the value 1.96 from the G(0, 1)
distribution.

Remark The method above assumes that the (brother, sister) pairs are a random sample
from the population of families with a living adult brother and sister. The question arises
as to whether E(Yi) also represents the difference in the average heights of all adult males
and all adult females (call them μ1′ and μ2′) in the population. If μ1′ = μ1 (that is, the
average height of all adult males equals the average height of all adult males who also have
an adult sister) and similarly μ2′ = μ2, then E(Yi) does represent this difference.

Pairing and Experimental Design

In settings where the population can be arranged in pairs, the estimation of a difference
in means, μ1 − μ2, can often be made more precise (shorter confidence intervals) by using
pairing in the study. The condition for this is that the association or correlation between
Y1i and Y2i be positive. In Examples 6.3.3 and 6.3.4 a positive correlation seems to be a
reasonable assumption and the pairing in these studies is a good idea.
To see why the pairing is helpful in estimating the mean difference μ1 − μ2, suppose that
Y1i ~ G(μ1, σ1) and Y2i ~ G(μ2, σ2), but that Y1i and Y2i are not necessarily independent
(i = 1, 2, ..., n). The estimator of μ1 − μ2 is

Ȳ1 − Ȳ2

and we have that E(Ȳ1 − Ȳ2) = μ1 − μ2 and

Var(Ȳ1 − Ȳ2) = Var(Ȳ1) + Var(Ȳ2) − 2Cov(Ȳ1, Ȳ2) = σ1²/n + σ2²/n − 2σ12/n

where σ12 = Cov(Y1i, Y2i). If σ12 > 0, then Var(Ȳ1 − Ȳ2) is smaller than when σ12 = 0
(that is, when Y1i and Y2i are independent). We would expect the covariance between

the heights of siblings in the same family to be positive since they share parents.
Therefore if we can collect a sample of pairs (Y1i, Y2i), this is better than two independent
random samples (one of Y1i's and one of Y2i's) for estimating μ1 − μ2. Note on the other
hand that if σ12 < 0, then pairing is a bad idea since it increases the value of Var(Ȳ1 − Ȳ2).

The following example involves an experimental study with pairing.

Example 6.3.5 Fibre in diet and cholesterol level

In a study, 20 subjects, volunteers from workers in a Boston hospital with ordinary
cholesterol levels, were given a low-fibre diet for 6 weeks and a high-fibre diet for another 6
week period. The order in which the two diets were given was randomized for each subject
(person), and there was a two-week gap between the two 6 week periods, in which no dietary
fibre supplements were given. A primary objective of the study was to see if cholesterol
levels are lower with the high-fibre diet.
Details of the study are given in the New England Journal of Medicine, volume 322
(January 18, 1990), pages 147-152. Here we will simply present the data from the study
and estimate the effect of the amount of dietary fibre.

Subject  High F (y1i)  Low F (y2i)  yi = y1i − y2i
 1        5.55          5.42          0.13
 2        2.91          2.85          0.06
 3        4.77          4.25          0.52
 4        5.63          5.43          0.20
 5        3.58          4.38         −0.80
 6        5.11          5.05          0.06
 7        4.29          4.44         −0.15
 8        3.40          3.36          0.04
 9        4.18          4.38         −0.20
10        5.41          4.55          0.86
11        4.44          4.43          0.01
12        5.22          5.27         −0.05
13        4.22          3.61          0.61
14        4.29          4.65         −0.36
15        4.03          4.33         −0.30
16        4.55          4.61         −0.06
17        4.56          4.45          0.11
18        4.67          4.95         −0.28
19        3.55          4.41         −0.86
20        4.44          4.38          0.06

Table 6.5: Cholesterol levels on two diets

Table 6.5 shows the cholesterol levels (in millimoles per liter) for each subject, measured
at the end of each 6 week period. We let the random variables Y1i, Y2i represent the
cholesterol levels for subject i on the high fibre and low fibre diets, respectively. We'll also
assume that the differences can be modeled using

Yi = Y1i − Y2i ~ G(μ1 − μ2, σ) for i = 1, 2, ..., 20

The observed differences yi, shown in Table 6.5, give ȳ = −0.020 and s = 0.411. Since

P(T ≤ 2.093) = 0.975 where T ~ t(19), a 95% confidence interval for μ1 − μ2 is

ȳ ± 2.093 s/√n = −0.020 ± 2.093 (0.411)/√20
= −0.020 ± 0.192 or [−0.212, 0.172]

This confidence interval includes μ1 − μ2 = 0, and there is clearly no evidence that the high
fibre diet gives a lower cholesterol level, at least in the time frame represented in this study.

Remark The results here can be obtained using the R function t.test.

Exercise Compute the p-value for the test of hypothesis H0: μ1 − μ2 = 0, using the test
statistic (5.1).
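For reference, the whole analysis can be redone from the differences in Table 6.5 in a few lines (a Python sketch; t.test in R does the same):

```python
import math
import statistics

high = [5.55, 2.91, 4.77, 5.63, 3.58, 5.11, 4.29, 3.40, 4.18, 5.41,
        4.44, 5.22, 4.22, 4.29, 4.03, 4.55, 4.56, 4.67, 3.55, 4.44]
low  = [5.42, 2.85, 4.25, 5.43, 4.38, 5.05, 4.44, 3.36, 4.38, 4.55,
        4.43, 5.27, 3.61, 4.65, 4.33, 4.61, 4.45, 4.95, 4.41, 4.38]

diffs = [h - l for h, l in zip(high, low)]   # within-pair differences y_i
n = len(diffs)
ybar, s = statistics.mean(diffs), statistics.stdev(diffs)

half = 2.093 * s / math.sqrt(n)   # 2.093 = t_{0.975}(19)
ci = (ybar - half, ybar + half)
print(round(ybar, 3), round(s, 3))  # -0.02 0.411
print(ci)                           # approximately (-0.212, 0.172)
```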

Final Remarks When you see data from a comparative study (that is, one whose
objective is to compare two distributions, often through their means), you have to determine
whether it involves paired data or not. Of course, a sample of Y1i's and Y2i's cannot be from
a paired study unless there are equal numbers of each, but if there are equal numbers the
study might be either "paired" or "unpaired". Note also that there is a subtle difference in
the study populations in paired and unpaired studies. In the former it is pairs of individual
units that form the population, whereas in the latter there are (conceptually at least)
separate individual units for Y1 and Y2 measurements.

6.4 General Gaussian Response Models

We now consider general models of the form (6.1):

Yi ~ G(μi, σ) with μi = μ(xi) = Σ_{j=1}^{k} βj xij, for i = 1, 2, ..., n independently

(Note: To facilitate the matrix proof below we have taken β0 = 0 in (6.1). The estimator of
β0 can be obtained from the result below by letting xi1 = 1 for i = 1, 2, ..., n, so that β1
plays the role of β0.)
For convenience we define the n × k (where n > k) matrix X of covariate values as

X = (xij) for i = 1, 2, ..., n and j = 1, 2, ..., k

and the n × 1 vector of responses Y = (Y1, Y2, ..., Yn)ᵀ. We assume that the values xij
are non-random quantities which we observe. We now summarize some results about the
maximum likelihood estimators of the parameters β = (β1, β2, ..., βk)ᵀ and σ.

Maximum Likelihood Estimators of β = (β1, β2, ..., βk)ᵀ and of σ

Theorem 42 The maximum likelihood estimators for β = (β1, β2, ..., βk)ᵀ and σ are:

β̃ = (XᵀX)⁻¹XᵀY   (6.20)

σ̃² = (1/n) Σ_{i=1}^{n} (Yi − μ̃i)² where μ̃i = Σ_{j=1}^{k} β̃j xij   (6.21)

Proof. The likelihood function is

L(β, σ) = ∏_{i=1}^{n} (1/(√(2π) σ)) exp[−(yi − μi)²/(2σ²)] where μi = Σ_{j=1}^{k} βj xij

and the log-likelihood function is

l(β, σ) = log L(β, σ) = −n log σ − (1/(2σ²)) Σ_{i=1}^{n} (yi − μi)²

Note that if we take the derivative with respect to a particular βj and set this derivative
equal to 0, we obtain

∂l/∂βj = (1/σ²) Σ_{i=1}^{n} (yi − μi) ∂μi/∂βj = 0

or

Σ_{i=1}^{n} (yi − μi) xij = 0

for each j = 1, 2, ..., k. In terms of the matrix X and the vector y = (y1, y2, ..., yn)ᵀ we
can rewrite this system of equations more compactly as

Xᵀ(y − Xβ) = 0 or Xᵀy = XᵀXβ

Assuming that the k × k matrix XᵀX has an inverse, we can solve these equations to obtain
the maximum likelihood estimate of β, in matrix notation, as

β̂ = (XᵀX)⁻¹Xᵀy

with corresponding maximum likelihood estimator

β̃ = (XᵀX)⁻¹XᵀY

In order to find the maximum likelihood estimator of σ, we take the derivative with respect
to σ and set the derivative equal to zero, obtaining

∂l/∂σ = ∂/∂σ [−n log σ − (1/(2σ²)) Σ_{i=1}^{n} (yi − μi)²] = 0

or

−n/σ + (1/σ³) Σ_{i=1}^{n} (yi − μi)² = 0

from which we obtain the maximum likelihood estimate of σ² as

σ̂² = (1/n) Σ_{i=1}^{n} (yi − μ̂i)² where μ̂i = Σ_{j=1}^{k} β̂j xij

The corresponding maximum likelihood estimator of σ² is

σ̃² = (1/n) Σ_{i=1}^{n} (Yi − μ̃i)² where μ̃i = Σ_{j=1}^{k} β̃j xij

Recall that when we estimated the variance for a single sample from the Gaussian
distribution we considered a minor adjustment to the denominator, and with this in mind
we also define the following estimator of the variance σ²:

Se² = (1/(n − k)) Σ_{i=1}^{n} (Yi − μ̃i)² = (n/(n − k)) σ̃²

Note that for large n there will be small differences between the observed values of σ̃² and
Se².
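In matrix form these estimates are one line of linear algebra. The sketch below (Python/numpy with simulated data; the variable names are ours) computes β̂ from the normal equations along with both variance estimates:

```python
import numpy as np

rng = np.random.default_rng(7)
n, k = 50, 3

# design matrix with an intercept column (x_{i1} = 1, so beta_1 plays the role of beta_0)
X = np.column_stack([np.ones(n), rng.uniform(0, 10, n), rng.uniform(0, 5, n)])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(0.0, 1.5, n)

# beta_hat = (X^T X)^{-1} X^T y, solved without forming the inverse explicitly
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

mu_hat = X @ beta_hat
sigma2_ml = np.sum((y - mu_hat) ** 2) / n        # ML estimate of sigma^2
s_e2 = np.sum((y - mu_hat) ** 2) / (n - k)       # S_e^2, the adjusted estimate
cov_beta = s_e2 * np.linalg.inv(X.T @ X)         # estimated covariance of beta_hat
```

The diagonal of `cov_beta` estimates the variances of the β̃j given in Theorem 43 below, with σ² replaced by its estimate.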

Theorem 43 1. The estimators $\tilde{\beta}_j$ are all Normally distributed random variables with expected value $\beta_j$ and with variance given by the $j$'th diagonal element of the matrix $\sigma^2(X^T X)^{-1}$, $j = 1, 2, \ldots, k$.

2. The random variable
$$W = \frac{n\tilde{\sigma}^2}{\sigma^2} = \frac{(n-k)S_e^2}{\sigma^2} \qquad (6.22)$$
has a Chi-squared distribution with $n - k$ degrees of freedom.

3. The random variable $W$ is independent of the random vector $(\tilde{\beta}_1, \ldots, \tilde{\beta}_k)$.

Proof. The estimator $\tilde{\beta}_j$ can be written using (6.20) as a linear combination of the Normal random variables $Y_i$,
$$\tilde{\beta}_j = \sum_{i=1}^{n} b_{ji} Y_i$$
$^{14}$It is clear why we needed to assume $k < n$: otherwise $n - k \le 0$ and we have no "degrees of freedom" left for estimating the variance.
262 6. GAUSSIAN RESPONSE MODELS

where the matrix $B = (b_{ji})_{k \times n} = (X^T X)^{-1} X^T$. Note that $BX = (X^T X)^{-1} X^T X$ equals the identity matrix $I$. Because $\tilde{\beta}_j$ is a linear combination of independent Normal random variables $Y_i$, it follows that $\tilde{\beta}_j$ is Normally distributed. Moreover
$$E(\tilde{\beta}_j) = \sum_{i=1}^{n} b_{ji} E(Y_i) = \sum_{i=1}^{n} b_{ji}\,\mu_i \quad \text{where } \mu_i = \sum_{l=1}^{k}\beta_l x_{il}$$
Note that $\mu_i = \sum_{l=1}^{k}\beta_l x_{il}$ is the $i$'th component of the vector $X\beta$, which implies that $E(\tilde{\beta}_j)$ is the $j$'th component of the vector $BX\beta$. But since $BX$ is the identity matrix, this is the $j$'th component of the vector $\beta$, or $\beta_j$. Thus $E(\tilde{\beta}_j) = \beta_j$ for all $j$. The calculation of the variance is similar:
$$Var(\tilde{\beta}_j) = \sum_{i=1}^{n} b_{ji}^2\, Var(Y_i) = \sigma^2\sum_{i=1}^{n} b_{ji}^2$$
and an easy matrix calculation shows, since $BB^T = (X^T X)^{-1}$, that $\sum_{i=1}^{n} b_{ji}^2$ is the $j$'th diagonal element of the matrix $(X^T X)^{-1}$. We will not attempt to prove part (3) here; it is usually proved in a subsequent statistics course.

Remark The maximum likelihood estimate $\hat{\beta}$ is also called a least squares estimate of $\beta$, in that it is obtained by taking the sum of squared vertical distances between the observations $y_i$ and the corresponding fitted values $\hat{\mu}_i$ and then adjusting the values of the estimated $\beta_j$ until this sum is minimized. Least squares is a method of estimation in linear models that predates the method of maximum likelihood. Problem 26 describes the method of least squares.

Remark$^{15}$ From Theorem 32 we can obtain confidence intervals and test hypotheses for the regression coefficients using the pivotal quantity
$$\frac{\tilde{\beta}_j - \beta_j}{S_e\sqrt{c_j}} \sim t(n-k) \qquad (6.23)$$
where $c_j$ is the $j$'th diagonal element of the matrix $(X^T X)^{-1}$.

$^{15}$Recall: If $Z \sim G(0,1)$ and $W \sim \chi^2(m)$ then the random variable $T = Z/\sqrt{W/m} \sim t(m)$. Let $Z = \frac{\tilde{\beta}_j - \beta_j}{\sigma\sqrt{c_j}}$, $W = \frac{(n-k)S_e^2}{\sigma^2}$ and $m = n - k$ to obtain this result.

Confidence intervals for $\beta_j$

In a manner similar to the construction of confidence intervals for the parameter $\mu$ for observations from the $G(\mu, \sigma)$ distribution, we can use (6.23) to construct confidence intervals for the parameter $\beta_j$. For example, for a 95% confidence interval we begin by using the $t$ distribution with $n - k$ degrees of freedom to find a constant $a$ such that
$$P(-a < T < a) = 0.95 \quad \text{where } T \sim t(n-k)$$
We then obtain the confidence interval by solving the inequality
$$-a \le \frac{\hat{\beta}_j - \beta_j}{s_e\sqrt{c_j}} \le a$$
to obtain
$$\hat{\beta}_j - a s_e\sqrt{c_j} \le \beta_j \le \hat{\beta}_j + a s_e\sqrt{c_j}$$
where
$$s_e^2 = \frac{1}{n-k}\sum_{i=1}^{n}(y_i - \hat{\mu}_i)^2 \quad \text{and} \quad \hat{\mu}_i = \sum_{j=1}^{k}\hat{\beta}_j x_{ij}$$
Thus a 95% confidence interval for $\beta_j$ is
$$\left[\hat{\beta}_j - a s_e\sqrt{c_j},\; \hat{\beta}_j + a s_e\sqrt{c_j}\right]$$
which takes the familiar form

estimate $\pm$ $a$ $\times$ estimated standard deviation of estimator.
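As a quick numerical illustration of this recipe (a Python sketch rather than the course's R; the summary statistics and sample size below are made up): for the slope in a simple linear regression, $c_j = 1/S_{xx}$, so the interval reduces to $\hat{\beta} \pm a\, s_e/\sqrt{S_{xx}}$.

```python
import math

# Hypothetical summary statistics for a simple linear regression (k = 2).
# For the slope, c_j = 1/Sxx, so the interval is betahat +/- a*s_e/sqrt(Sxx).
n = 20
Sxx, Sxy, Syy = 400.0, 320.0, 300.0
betahat = Sxy / Sxx                       # slope estimate S_xy/S_xx
se2 = (Syy - betahat * Sxy) / (n - 2)     # unbiased estimate s_e^2 of sigma^2
se = math.sqrt(se2)
a = 2.101                                 # t_{0.975}(18), from t tables
half_width = a * se / math.sqrt(Sxx)
ci = (betahat - half_width, betahat + half_width)
print(ci)
```

The same pattern (estimate plus or minus a quantile times an estimated standard deviation) recurs for every interval in this chapter.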

We now consider a special case of the Gaussian response models. We have already
seen this case in Chapter 4, but it provides a simple example to validate the more general
formulae.

Single Gaussian distribution

Here $Y_i \sim G(\mu, \sigma)$, $i = 1, 2, \ldots, n$; that is, $\mu(x_i) = \mu$ and $x_i = x_{1i} = 1$ for all $i = 1, 2, \ldots, n$, with $k = 1$, and we use the parameter $\mu$ instead of $\beta = (\beta_1)$. Notice that $X_{n \times 1} = (1, 1, \ldots, 1)^T$ in this case. This special case was also mentioned in Section 6.1. The pivotal quantity (6.23) becomes
$$\frac{\tilde{\mu} - \mu}{S_e\sqrt{c_1}} = \frac{\tilde{\mu} - \mu}{S/\sqrt{n}}$$
since $(X^T X)^{-1} = 1/n$. This pivotal quantity has the $t$ distribution with $n - k = n - 1$ degrees of freedom. You can also verify using (6.22) that
$$\frac{(n-1)S^2}{\sigma^2}$$
has a Chi-squared$(n-1)$ distribution.



6.5 Chapter 6 Problems


1. Prove the following identities which are used in this chapter:
$$\sum_{i=1}^{n} a_i = 0, \qquad \sum_{i=1}^{n} a_i x_i = 1, \qquad \sum_{i=1}^{n} a_i^2 = \frac{1}{S_{xx}} \qquad \text{where } a_i = \frac{(x_i - \bar{x})}{S_{xx}}$$
$$\sum_{i=1}^{n} b_i = 1, \qquad \sum_{i=1}^{n} b_i x_i = x \qquad \text{and} \qquad \sum_{i=1}^{n} b_i^2 = \frac{1}{n} + \frac{(x - \bar{x})^2}{S_{xx}} \qquad \text{where } b_i = \frac{1}{n} + (x - \bar{x})\frac{(x_i - \bar{x})}{S_{xx}}$$
2. Solve the three equations
$$\frac{\partial l}{\partial \alpha} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(y_i - \alpha - \beta x_i) = 0$$
$$\frac{\partial l}{\partial \beta} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(y_i - \alpha - \beta x_i)\, x_i = 0$$
$$\frac{\partial l}{\partial \sigma} = -\frac{n}{\sigma} + \frac{1}{\sigma^3}\sum_{i=1}^{n}(y_i - \alpha - \beta x_i)^2 = 0$$
simultaneously to obtain the maximum likelihood estimates
$$\hat{\beta} = \frac{\sum_{i=1}^{n} x_i(y_i - \bar{y})}{\sum_{i=1}^{n} x_i(x_i - \bar{x})} = \frac{S_{xy}}{S_{xx}}$$
$$\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}$$
$$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{\alpha} - \hat{\beta}x_i)^2 = \frac{1}{n}\left(S_{yy} - \hat{\beta}S_{xy}\right)$$

3. Twenty-five female nurses working at a large hospital were selected at random and their age (x) and systolic blood pressure (y) were recorded. The data are:
x y x y x y x y x y
46 136 37 115 58 139 48 134 59 142
36 132 45 129 50 156 35 120 54 135
62 138 39 127 41 132 42 137 57 150
26 115 28 134 31 115 27 120 60 159
53 143 32 133 51 143 34 128 38 127

$\bar{x} = 43.20$  $\bar{y} = 133.56$
$S_{xx} = 2802.00$  $S_{yy} = 3284.16$  $S_{xy} = 2325.20$
To analyze these data assume the simple linear regression model $Y_i \sim G(\alpha + \beta x_i, \sigma)$, $i = 1, 2, \ldots, 25$ independently, where the $x_i$'s are known constants.

(a) Determine the maximum likelihood (least squares) estimates of $\alpha$ and $\beta$ and an unbiased estimate of $\sigma^2$.
(b) Use the plots discussed in Section 6.2 to check the adequacy of the model.
(c) Construct a 95% confidence interval for $\beta$.
(d) Construct a 90% confidence interval for the mean systolic blood pressure of nurses aged x = 35.
(e) Construct a 99% prediction interval for the systolic blood pressure Y of a nurse aged x = 50.

4. This problem is designed to cover concepts in this chapter as well as previous chapters. The data below are the STAT 230 final grades (x) and STAT 231 final grades (y) for 30 students chosen at random from the group of students enrolled in STAT 231 in Winter 2013. The data are available in the file statgradedata.txt posted on the course website.
x y x y x y x y x y x y
76 76 60 60 87 76 65 69 83 83 94 94
77 79 81 85 71 50 71 43 88 88 83 83
57 54 86 82 63 75 66 60 52 52 51 37
75 64 96 88 77 72 90 96 75 75 77 90
74 64 79 72 96 84 50 50 99 99 77 67

$\bar{x} = 76.73$  $\bar{y} = 72.23$
$S_{xx} = 5135.86$  $S_{yy} = 7585.36$  $S_{xy} = 5106.86$
(a) What type of study is this? Why?
(b) Define a possible Problem for this study. What type of Problem is it? Why?
(c) What is a unit in this study? Define a suitable target population for this study.
(d) What are the variates? What type are they?
(e) Why would it make sense to define x = STAT 230 final grade as the explanatory variate and y = STAT 231 final grade as the response variate?
(f) Define a suitable study population for this study. What is a possible source of study error?
(g) What is the sampling protocol?
(h) Why was it important for the students to be chosen at random from the group of students taking STAT 231? Why would it not be a good idea to choose the first 30 students in an alphabetized list of all students?
(i) How are the variates measured? What is a possible source of measurement error?
(j) Determine the sample correlation.

(k) Plot a scatterplot of the data. What do you notice?
(l) Fit the simple linear regression model to these data: $Y_i \sim G(\alpha + \beta x_i, \sigma)$, $i = 1, 2, \ldots, 30$ independently, where the $x_i$'s are known constants. What is the least squares estimate of $\beta$? What is the maximum likelihood estimate of $\sigma$? What is an unbiased estimate of $\sigma^2$?
(m) Use the plots discussed in Section 6.2 to check the adequacy of the model.
(n) The parameter $\mu(x) = \alpha + \beta x$ corresponds to what attribute of interest in the study population? The parameter $\alpha$ corresponds to what attribute of interest in the study population?
(o) The parameter $\beta$ corresponds to what attribute of interest in the study population? Test the hypothesis that there is no relationship ($H_0: \beta = 0$) between STAT 231 final grades and STAT 230 final grades.
(p) Test the hypothesis $H_0: \beta = 1$. Why is this hypothesis of interest?
(q) Construct a 95% confidence interval for $\beta$. Is your confidence interval consistent with the p-values determined in (o) and (p)? What is the interpretation of this interval?
(r) Construct a 95% confidence interval for the mean STAT 231 final grade for students with a STAT 230 final grade of x = 75.
(s) Construct a 95% prediction interval for the STAT 231 final grade for a student with a STAT 230 final grade of x = 75. Compare this with the interval in (r). Why is this interval so wide? How could the width of the interval be reduced?

5. Consider the data in Chapter 1 on the variates x = "value of an actor" and y = "amount grossed by a movie". The data are available in the file actordata.txt posted on the course website.

(a) Fit the simple linear regression model to these data: $Y_i \sim G(\alpha + \beta x_i, \sigma)$, $i = 1, 2, \ldots, 20$ independently, where the $x_i$'s are known constants.
(b) Use the plots discussed in Section 6.2 to check the adequacy of the model.
(c) What is the relationship between the maximum likelihood estimate of $\beta$ and the sample correlation?
(d) Construct a 95% confidence interval for $\beta$. The parameter $\beta$ corresponds to what attribute of interest in the study population?
(e) Test the hypothesis that there is no relationship between the "value of an actor" and the "amount grossed by a movie". Are there any limitations to your conclusion? (Hint: How were the data collected?)
(f) Construct a 95% confidence interval for the mean amount grossed by movies for actors whose value is x = 50. Construct a 95% confidence interval for the mean amount grossed by movies for actors whose value is x = 100. What assumption is being made in constructing the interval for x = 100?

6. Consider the price versus size of commercial buildings in Example 6.1.2. For these data
$n = 30$  $\bar{x} = 0.9543$  $\bar{y} = 548.9700$
$S_{xx} = 22.9453$  $S_{xy} = 3316.6771$  $S_{yy} = 489{,}624.723$

(a) Fit the simple linear regression model to these data: $Y_i \sim G(\alpha + \beta x_i, \sigma)$, $i = 1, 2, \ldots, 30$ independently, where the $x_i$'s are known constants.
(b) Plot the fitted line on the scatterplot of the data.
(c) How would you interpret the value of $\hat{\beta}$?
(d) These data were used to decide a fair assessment value for a large building of size $x = 4.47\ (\times 10^5)\ \mathrm{m}^2$. Determine a 95% confidence interval for the mean price of a building of this size.
(e) Determine a 95% prediction interval for a building of size $x = 4.47\ (\times 10^5)\ \mathrm{m}^2$.
(f) If you were an assessor deciding the fair assessment for a building of size $x = 4.47\ (\times 10^5)\ \mathrm{m}^2$, would you use the interval in (d) or (e)?

7. Consider the steel bolt experiment in Example 6.1.4.

(a) Construct a 95% confidence interval for the mean breaking strength of bolts of diameter $x = 0.35$, that is, $x_1 = (0.35)^2 = 0.1225$.
(b) Construct a 95% prediction interval for the breaking strength Y of a single bolt of diameter $x = 0.35$. Compare this with the interval in (a).
(c) Suppose that a bolt of diameter 0.35 is exposed to a large force V that could potentially break it. In structural reliability and safety calculations, V is treated as a random variable, and if Y represents the breaking strength of the bolt (or some other part of a structure), then the probability of a "failure" of the bolt is $P(V > Y)$. Give a point estimate of this value if $V \sim G(1.60, 0.10)$, where V and Y are independent.

8. There are often both expensive (and highly accurate) and cheaper (and less accurate) ways of measuring concentrations of various substances (e.g. glucose in human blood, salt in a can of soup). The table below gives the actual concentration x (determined by an expensive but very accurate procedure) and the measured concentration y obtained by a cheap procedure, for each of 20 units.

x y x y x y x y
4.01 3.7 13.81 13.02 24.85 24.69 36.9 37.54
6.24 6.26 15.9 16 28.51 27.88 37.26 37.2
8.12 7.8 17.23 17.27 30.92 30.8 38.94 38.4
9.43 9.78 20.24 19.9 31.44 31.03 39.62 40.03
12.53 12.4 24.81 24.9 33.22 33.01 40.15 39.4

$\bar{x} = 23.7065$  $\bar{y} = 23.5505$
$S_{xx} = 2818.946855$  $S_{yy} = 2820.862295$  $S_{xy} = 2818.556835$
The data are available in the file expensivevscheapdata.txt posted on the course website. To analyze these data assume the regression model $Y_i \sim G(\alpha + \beta x_i, \sigma)$, $i = 1, 2, \ldots, 20$ independently.

(a) Fit the model to these data. Use the plots discussed in Section 6.2 to check the adequacy of the model.
(b) Construct a 95% confidence interval for the slope $\beta$ and test the hypothesis $\beta = 1$. Construct a 95% confidence interval for the intercept $\alpha$ and test the hypothesis $\alpha = 0$. Why are these hypotheses of interest?
(c) Describe briefly how you would characterize the cheap measurement process's accuracy to a lay person.
(d) If the units to be measured have true concentrations in the range 0 to 40, do you think that the cheap method tends to produce a value that is lower than the true concentration? Support your answer based on the data and the assumed model.

9. Regression through the origin: Consider the model $Y_i \sim G(\beta x_i, \sigma)$, $i = 1, 2, \ldots, n$ independently.

(a) Assuming that $\sigma$ is known, show that
$$\hat{\beta} = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2}$$
is the maximum likelihood estimate of $\beta$ and also the least squares estimate of $\beta$.
(b) Show that
$$\tilde{\beta} = \frac{\sum_{i=1}^{n} x_i Y_i}{\sum_{i=1}^{n} x_i^2} \sim N\!\left(\beta,\ \frac{\sigma^2}{\sum_{i=1}^{n} x_i^2}\right)$$
Hint: Write $\tilde{\beta}$ in the form $\sum_{i=1}^{n} a_i Y_i$.
(c) Prove the identity
$$\sum_{i=1}^{n}\left(y_i - \hat{\beta}x_i\right)^2 = \sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} x_i y_i\right)^2 \Big/ \sum_{i=1}^{n} x_i^2$$
This identity can be used to calculate
$$s_e^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(y_i - \hat{\beta}x_i\right)^2$$
which is an unbiased estimate of $\sigma^2$.
(d) Show how to use the pivotal quantity
$$\frac{\tilde{\beta} - \beta}{S_e\Big/\sqrt{\sum_{i=1}^{n} x_i^2}} \sim t(n-1)$$
to construct a 95% confidence interval for $\beta$.
(e) Explain how to test the hypothesis $\beta = 0$ using the test statistic
$$\frac{\tilde{\beta} - 0}{S_e\Big/\sqrt{\sum_{i=1}^{n} x_i^2}}$$

10. For the data in Problem 8
$$\sum_{i=1}^{20} x_i y_i = 13984.5554 \qquad \sum_{i=1}^{20} x_i^2 = 14058.9097 \qquad \sum_{i=1}^{20} y_i^2 = 13913.3833$$
Use the results from Problem 9 to do the following.

(a) Fit the model $Y_i \sim G(\beta x_i, \sigma)$, $i = 1, 2, \ldots, 20$ independently to these data.
(b) Let $\hat{\mu}_i = \hat{\beta}x_i$ and $\hat{r}_i = (y_i - \hat{\mu}_i)/s_e$. Plot the following:
(i) a scatterplot of the data with the fitted line
(ii) the residual plot $(x_i, \hat{r}_i)$, $i = 1, 2, \ldots, 20$
(iii) the residual plot $(\hat{\mu}_i, \hat{r}_i)$, $i = 1, 2, \ldots, 20$
(iv) a qqplot of the standardized residuals $\hat{r}_i$.
For each plot indicate what you would expect to see if the model is correct. Based on these plots, comment on how well the model fits the data.
(c) Construct a 95% confidence interval for the slope $\beta$ and test the hypothesis $\beta = 1$.
(d) Using the results of this analysis as well as the analysis in Problem 8, what would you conclude about using the model $Y_i \sim G(\alpha + \beta x_i, \sigma)$ versus the simpler model $Y_i \sim G(\beta x_i, \sigma)$ for these data?

11. The following data were recorded concerning the relationship between drinking
(x = per capita wine consumption) and y = death rate from cirrhosis of the liver in
n = 46 states of the U.S.A. (for simplicity the data have been rounded).

x y x y x y x y x y x y
5 41 12 77 7 67 4 52 7 41 16 91
4 32 7 57 18 57 16 87 13 67 2 30
3 39 14 81 6 38 9 67 8 48 6 28
7 58 12 34 31 130 6 40 28 123 3 52
11 75 10 53 13 70 6 56 23 92 8 56
9 60 10 55 20 104 21 58 22 76 13 56
6 54 14 58 19 84 15 74 23 98
3 48 9 63 10 66 17 98 7 34

$\bar{x} = 11.5870$  $\bar{y} = 63.5870$
$S_{xx} = 2155.1522$  $S_{yy} = 24801.1521$  $S_{xy} = 6175.1522$
The data are available in the file liverdata.txt posted on the course website.

(a) Fit the simple linear regression model to these data: $Y_i \sim G(\alpha + \beta x_i, \sigma)$, $i = 1, 2, \ldots, 46$ independently, where the $x_i$'s are known constants.
(b) Use the plots discussed in Section 6.2 to check the adequacy of the model.
(c) Test the hypothesis that there is no relationship between wine consumption per capita and the death rate from cirrhosis of the liver.
(d) Construct a 95% confidence interval for $\beta$.

12. Skinfold body measurements are used to approximate the body density of individuals. The data on n = 92 men, aged 20 to 25, where x = skinfold measurement and Y = body density, are available in the file SkinfoldData.txt posted on the course website.
Note: The R function lm, with the command lm(y~x), gives the calculations for linear regression. The command summary(lm(y~x)) gives a summary of the calculations.

# Import dataset skinfolddata.txt from course website using RStudio


# relabel Skinfold variate as x
x<-SkinfoldData$Skinfold
# relabel Body Density variate as y
y<-SkinfoldData$BodyDensity
# run regression y = alpha+beta*x
RegModel<-lm(y~x)
# parameter estimates and p-value for test of no relationship
summary(RegModel)$coefficients
alphahat<-RegModel$coefficients[1] # estimate of intercept
betahat<-RegModel$coefficients[2] # estimate of slope
muhat<-RegModel$fitted.values # fitted responses
r<- RegModel$residuals # residuals
se<-summary(RegModel)$sigma # estimate of sigma
# Scatterplot of data with fitted line
par(mfrow=c(2,2))
plot(x,y,xlab="Skinfold",ylab="Body Density")
title(main="Scatterplot with Fitted Line")
abline(a=alphahat,b=betahat,col="red",lwd=2)
# Residual plots
rstar <- r/se # standardized residuals
plot(x,rstar,xlab="Skinfold",ylab="Standardized Residual")
title(main="Residual vs Skinfold")
abline(0,0,col="red",lwd=1.5)
plot(muhat,rstar,xlab="Muhat",ylab="Standardized Residual")
abline(0,0,col="red",lwd=1.5)
title(main="Residual vs Muhat")
qqnorm(rstar,main="")
title(main="Qqplot of Residuals")
# 95% Confidence interval for slope
confint(RegModel,level=0.95)
# 90% confidence interval for mean response at x=2
predict(RegModel,data.frame("x"=2),interval="confidence",level=0.90)
# 99% prediction interval for response at x=1.8
predict(RegModel,data.frame("x"=1.8),interval="prediction",level=0.99)
# 95% confidence interval for sigma
df<-length(y)-2
a<-qchisq(0.025,df)
b<-qchisq(0.975,df)
int<-c(se*sqrt(df/b),se*sqrt(df/a))
cat("95% confidence interval for sigma: ",int)

(a) Run the given R code. What is the equation of the fitted line?
(b) What is the value of the test statistic and the p-value for the hypothesis of no relationship? What would you conclude?
(c) Give an estimate of $\sigma$.
(d) What do the plots indicate about the fit of the model?
(e) What is a 95% confidence interval for $\beta$?
(f) What is a 90% confidence interval for the mean body density of males with a skinfold measurement of 2?
(g) What is a 99% prediction interval for the body density of a male with skinfold measurement of 1.8?
(h) What is a 95% confidence interval for $\sigma$?
(i) Do you think that the skinfold measurements provide a reasonable approximation to body density measurements?

13. The following data, collected by the British botanist Joseph Hooker in the Himalaya
Mountains between 1848 and 1850, relate atmospheric pressure to the boiling point
of water. Hooker wanted to estimate altitude above sea level from measurements of
the boiling point of water. He knew that the altitude could be determined from the
atmospheric pressure, measured with a barometer, with lower pressures correspond-
ing to higher altitudes. His interest in the above modelling problem was motivated
by the difficulty of transporting the fragile barometers of the 1840's. Measuring the
boiling point would give travelers a quick way to estimate elevation, using both the
known relationship between elevation and atmospheric pressure, and the model relat-
ing atmospheric pressure to the boiling point of water. The data in the table below
are also available in the file boilingpointdata.txt on the course website.

(a) Let y = atmospheric pressure (in Hg) and x = boiling point of water (in °F). Fit a simple linear regression model to the data $(x_i, y_i)$, $i = 1, 2, \ldots, 31$. Prepare a scatterplot of y versus x and draw on the fitted line. Plot the standardized residuals versus x. How well does the model fit these data?
(b) Let $z = \log y$. Fit a simple linear regression model to the data $(x_i, z_i)$, $i = 1, 2, \ldots, 31$. Prepare a scatterplot of z versus x and draw on the fitted line. Plot the standardized residuals versus x. How well does the model fit these data?
(c) Based on the results in (a) and (b), which data are best fit by a linear model? Does this confirm the theory's model?
(d) Obtain a 95% confidence interval for the mean atmospheric pressure if the boiling point of water is 195°F.


Boiling Point of Water (°F)  Atmospheric Pressure (Hg)  Boiling Point of Water (°F)  Atmospheric Pressure (Hg)
210.8 29.211  189.5 18.869
210.2 28.559  188.8 18.356
208.4 27.972  188.5 18.507
202.5 24.697  185.7 17.267
200.6 23.726  186.0 17.221
200.1 23.369  185.6 17.062
199.5 23.030  184.1 16.959
197.0 21.892  184.6 16.881
196.4 21.928  184.1 16.817
196.3 21.654  183.2 16.385
195.6 21.605  182.4 16.235
193.4 20.480  181.9 16.106
193.6 20.212  181.9 15.928
191.4 19.758  181.0 15.919
191.1 19.490  180.6 15.376
190.6 19.386
14. An educator believes that the new directed readings activities in the classroom will
help elementary school students improve some aspects of their reading ability. She
arranges for a Grade 3 class of 21 students to take part in the activities for an 8-
week period. A control classroom of 23 Grade 3 students follows the same curriculum
without the activities. At the end of the 8-week period, all students are given a Degree
of Reading Power (DRP) test, which measures the aspects of reading ability that the
treatment is designed to improve. The data are:
24 43 58 71 43 49 61 44 67 49 53
Treatment Group:
56 59 52 62 54 57 33 46 43 57

42 43 55 26 62 37 33 41 19 54 20 85
Control Group:
46 10 17 60 53 42 37 42 55 28 48
The data are available in the file treatmentvscontroldata.txt posted on the course website.
Let $y_{1j}$ = the DRP test score for the treatment group, $j = 1, 2, \ldots, 21$.
Let $y_{2j}$ = the DRP test score for the control group, $j = 1, 2, \ldots, 23$. For these data
$$\bar{y}_1 = 51.4762 \qquad \sum_{j=1}^{21}(y_{1j} - \bar{y}_1)^2 = 2423.2381$$
$$\bar{y}_2 = 41.5217 \qquad \sum_{j=1}^{23}(y_{2j} - \bar{y}_2)^2 = 6469.7391$$

To analyze these data assume
$$Y_{1j} \sim G(\mu_1, \sigma), \quad j = 1, 2, \ldots, 21 \text{ independently}$$
for the treatment group and independently
$$Y_{2j} \sim G(\mu_2, \sigma), \quad j = 1, 2, \ldots, 23 \text{ independently}$$
for the control group, where $\mu_1$, $\mu_2$ and $\sigma$ are unknown parameters.

(a) The parameters $\mu_1$, $\mu_2$ and $\sigma$ correspond to what attributes of interest in the study population?
(b) Plot a qqplot of the responses for the treatment group and a qqplot of the responses for the control group. How reasonable are the Normality assumptions stated in the assumed model?
(c) Calculate a 95% confidence interval for the difference in the means $\mu_1 - \mu_2$.
(d) Test the hypothesis of no difference between the means, that is, test the hypothesis $H_0: \mu_1 = \mu_2$. What conclusion should the educator make based on these data? Be sure to indicate any limitations to these conclusions.
(e) Here is the R code for doing this analysis:
# Import dataset treatmentvscontroldata.txt in folder S231Datasets
y<-treatmentvscontroldata$DRP
y1<-y[seq(1,21,1)] # data for Treatment Group
y2<-y[seq(22,44,1)] # data for Control Group
# qqplots
qqnorm(y1,main="Qqplot for Treatment Group")
qqnorm(y2,main="Qqplot for Control Group")
# t test for hypothesis of no difference in means
# and 95% confidence interval for mean difference mu
# note that R uses mu = mu_control - mu_treatment
t.test(DRP~Group,data=treatmentvscontroldata,var.equal=T,
conf.level=0.95)

15. A study was done to compare the durability of diesel engine bearings made of two different compounds. Ten bearings of each type were tested. The following table gives the "times" until failure (in units of millions of cycles):

Type I: y1i 3.03 5.53 5.60 9.30 9.92 12.51 12.95 15.21 16.04 16.84
Type II: y2i 3.19 4.26 4.47 4.53 4.67 4.69 12.78 6.79 9.37 12.75

$$\bar{y}_1 = 10.693 \qquad \sum_{i=1}^{10}(y_{1i} - \bar{y}_1)^2 = 209.02961 \qquad \bar{y}_2 = 6.75 \qquad \sum_{i=1}^{10}(y_{2i} - \bar{y}_2)^2 = 116.7974$$

To analyze these data assume
$$Y_{1j} \sim G(\mu_1, \sigma), \quad j = 1, 2, \ldots, 10 \text{ independently}$$
for the Type I bearings and independently
$$Y_{2j} \sim G(\mu_2, \sigma), \quad j = 1, 2, \ldots, 10 \text{ independently}$$
for the Type II bearings, where $\mu_1$, $\mu_2$ and $\sigma$ are unknown parameters.

(a) Obtain a 90% confidence interval for the difference in the means $\mu_1 - \mu_2$.
(b) Test the hypothesis $H_0: \mu_1 = \mu_2$.
(c) It has been suggested that log failure times are approximately Normally distributed, but not failure times. Assuming that the $\log Y$'s for the two types of bearing are Normally distributed with the same variance, test the hypothesis that the two distributions have the same mean. How does the answer compare with that in part (b)?
(d) How might you check whether $Y$ or $\log Y$ is closer to Normally distributed?
(e) Give a plot of the data which could be used to describe the data and your analysis.
analysis.

16. To compare the mathematical abilities of incoming first year students in Mathematics and Engineering, 30 Math students and 30 Engineering students were selected randomly from their first year classes and given a mathematics aptitude test. A summary of the resulting marks $y_{1i}$ (for the math students) and $y_{2i}$ (for the engineering students), $i = 1, 2, \ldots, 30$, is as follows:

Math students: $n = 30$  $\bar{y}_1 = 120$  $\sum_{i=1}^{30}(y_{1i} - \bar{y}_1)^2 = 3050$
Engineering students: $n = 30$  $\bar{y}_2 = 114$  $\sum_{i=1}^{30}(y_{2i} - \bar{y}_2)^2 = 2937$

To analyze these data assume
$$Y_{1j} \sim G(\mu_1, \sigma), \quad j = 1, 2, \ldots, 30 \text{ independently}$$
for the Math students and independently
$$Y_{2j} \sim G(\mu_2, \sigma), \quad j = 1, 2, \ldots, 30 \text{ independently}$$
for Engineering students, where $\mu_1$, $\mu_2$ and $\sigma$ are unknown parameters.

(a) Obtain a 95% confidence interval for the difference in mean scores for first year Math and Engineering students.
(b) Test the hypothesis that the difference is zero.

17. Fourteen welded girders were cyclically stressed at 1900 pounds per square inch and the numbers of cycles to failure were observed. The sample mean and variance of the log failure times were $\bar{y}_1 = 14.564$ and $s_1^2 = 0.0914$. Similar tests on ten additional girders with repaired welds gave $\bar{y}_2 = 14.291$ and $s_2^2 = 0.0422$. Log failure times are assumed to be independent with a Gaussian distribution. Assuming equal variances for the two types of girders, obtain a 95% confidence interval for the difference in mean log failure times and test the hypothesis of no difference.
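As an aside, the pooled two-sample calculation used in problems of this kind takes $s_p^2 = [(n_1-1)s_1^2 + (n_2-1)s_2^2]/(n_1+n_2-2)$ and the interval $\bar{y}_1 - \bar{y}_2 \pm a\, s_p\sqrt{1/n_1 + 1/n_2}$ with $a$ from $t(n_1+n_2-2)$. A Python sketch with made-up summary statistics (not the girder data) and a t quantile taken from tables:

```python
import math

# Made-up summary statistics for two independent Gaussian samples.
n1, ybar1, s1sq = 10, 12.0, 4.0
n2, ybar2, s2sq = 12, 10.5, 5.0
df = n1 + n2 - 2
sp2 = ((n1 - 1) * s1sq + (n2 - 1) * s2sq) / df   # pooled variance estimate
sp = math.sqrt(sp2)
a = 2.086                                        # t_{0.975}(20), from t tables
half = a * sp * math.sqrt(1 / n1 + 1 / n2)
ci = (ybar1 - ybar2 - half, ybar1 - ybar2 + half)
print(ci)
```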

18. Consider the data in Chapter 1 on the lengths of male and female coyotes. The data are available in the file coyotedata.txt posted on the course website.

(a) Construct a 95% confidence interval for the difference in mean lengths for the two sexes. State your assumptions.
(b) Estimate $P(Y_1 > Y_2)$ (give the maximum likelihood estimate), where $Y_1$ is the length of a randomly selected female and $Y_2$ is the length of a randomly selected male. Can you suggest how you might get a confidence interval?
(c) Give separate confidence intervals for the average length of males and females.

19. To assess the effect of a low dose of alcohol on reaction time, a sample of 24 student volunteers took part in a study. Twelve of the students (randomly chosen from the 24) were given a fixed dose of alcohol (adjusted for body weight) and the other twelve got a nonalcoholic drink which looked and tasted the same as the alcoholic drink. Each student was then tested using software that flashes a coloured rectangle randomly placed on a screen; the student has to move the cursor into the rectangle and double click the mouse. As soon as the double click occurs, the process is repeated, up to a total of 20 times. The response variate is the total reaction time (i.e. time to complete the experiment) over the 20 trials. The data are given below.
"Alcohol" Group:

1.33 1.55 1.43 1.35 1.17 1.35 1.17 1.80 1.68 1.19 0.96 1.46

$$\bar{y}_1 = \tfrac{16.44}{12} = 1.370 \qquad \sum_{i=1}^{12}(y_{1i} - \bar{y}_1)^2 = 0.608$$
"Non-Alcohol" Group:

1.68 1.30 1.85 1.64 1.62 1.69 1.57 1.82 1.41 1.78 1.40 1.43

$$\bar{y}_2 = \tfrac{19.19}{12} = 1.599 \qquad \sum_{i=1}^{12}(y_{2i} - \bar{y}_2)^2 = 0.35569$$
To analyze these data assume
$$Y_{1j} \sim G(\mu_1, \sigma), \quad j = 1, 2, \ldots, 12 \text{ independently}$$
for the Alcohol Group and independently
$$Y_{2j} \sim G(\mu_2, \sigma), \quad j = 1, 2, \ldots, 12 \text{ independently}$$
for the Non-Alcohol Group, where $\mu_1$, $\mu_2$ and $\sigma$ are unknown parameters. Determine a 95% confidence interval for the difference in the means $\mu_1 - \mu_2$. What can the researchers conclude on the basis of this study?

20. An experiment was conducted to compare gas mileages of cars using a synthetic oil and a conventional oil. Eight cars were chosen as representative of the cars in general use. Each car was run twice under as similar conditions as possible (same drivers, routes, etc.), once with the synthetic oil and once with the conventional oil, the order of use of the two oils being randomized.
The gas mileages were as follows:

Car 1 2 3 4 5 6 7 8
Synthetic: y1i 21.2 21.4 15.9 37.0 12.1 21.1 24.5 35.7
Conventional: y2i 18.0 20.6 14.2 37.8 10.6 18.5 25.9 34.7
yi = y1i − y2i 3.2 0.8 1.7 −0.8 1.5 2.6 −1.4 1.0

$$\bar{y}_1 = 23.6125 \qquad \sum_{i=1}^{8}(y_{1i} - \bar{y}_1)^2 = 535.16875$$
$$\bar{y}_2 = 22.5375 \qquad \sum_{i=1}^{8}(y_{2i} - \bar{y}_2)^2 = 644.83875$$
$$\bar{y} = 1.075 \qquad \sum_{i=1}^{8}(y_i - \bar{y})^2 = 17.135$$

(a) Obtain a 95% confidence interval for the difference in mean gas mileage, and state the assumptions on which your analysis depends.
(b) Repeat (a) if the natural pairing of the data is (improperly) ignored.
(c) Why is it better to take pairs of measurements on eight cars rather than taking only one measurement on each of 16 cars?
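The paired analysis works on the differences $y_i$ alone: a one-sample interval $\bar{y} \pm a\, s/\sqrt{n}$ with $a$ from $t(n-1)$. A Python sketch of that computation (made-up differences, t quantile hard-coded from tables):

```python
import math

d = [1.2, -0.3, 0.8, 0.5, -0.1, 0.9, 0.4, 0.6]   # made-up paired differences
n = len(d)
dbar = sum(d) / n
s2 = sum((di - dbar) ** 2 for di in d) / (n - 1)  # sample variance of differences
a = 2.365                                         # t_{0.975}(7), from t tables
half = a * math.sqrt(s2 / n)
ci = (dbar - half, dbar + half)
print(ci)
```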

21. The following table gives the number of staff hours per month lost due to accidents in eight factories of similar size over a period of one year, before and after the introduction of an industrial safety program.

Factory i 1 2 3 4 5 6 7 8
After: y1i 28.7 62.2 28.9 0.0 93.5 49.6 86.3 40.2
Before: y2i 48.5 79.2 25.3 19.7 130.9 57.6 88.8 62.1
yi = y1i − y2i −19.8 −17.0 3.6 −19.7 −37.4 −8.0 −2.5 −21.9

$$\bar{y} = -15.3375 \qquad \sum_{i=1}^{8}(y_i - \bar{y})^2 = 1148.79875$$

There is a natural pairing of the data by factory. Factories with the best safety records before the safety program tend to have the best records after the safety program as well. The analysis of the data must take this pairing into account and therefore the model
$$Y_i \sim G(\mu, \sigma), \quad i = 1, 2, \ldots, 8 \text{ independently}$$
is assumed, where $\mu$ and $\sigma$ are unknown parameters.

(a) The parameters $\mu$ and $\sigma$ correspond to what attributes of interest in the study population?
(b) Calculate a 95% confidence interval for $\mu$.
(c) Test the hypothesis of no difference due to the safety program, that is, test the hypothesis $H_0: \mu = 0$.

22. Comparing sorting algorithms: Suppose you want to compare two algorithms
A and B that will sort a set of numbers into an increasing sequence. (The R function,
sort(x), will, for example, sort the elements of the numeric vector x.) To compare
the speed of algorithms A and B, you decide to “present” A and B with random
permutations of n numbers, for several values of n. Explain exactly how you would
set up such a study, and discuss what pairing would mean in this context.

23. Sorting algorithms continued: Two sort algorithms as in the preceding problem were each run on (the same) 20 sets of numbers (there were 500 numbers in each set). The times taken by the two algorithms to sort the sets of numbers are shown below.

Set: 1 2 3 4 5 6 7 8 9 10
A: 3.85 2.81 6.47 7.59 4.58 5.47 4.72 3.56 3.22 5.58
B: 2.66 2.98 5.35 6.43 4.28 5.06 4.36 3.91 3.28 5.19
yi 1.19 −0.17 1.12 1.16 0.30 0.41 0.36 −0.35 −0.06 0.39

Set: 11 12 13 14 15 16 17 18 19 20
A: 4.58 5.46 3.31 4.33 4.26 6.29 5.04 5.08 5.08 3.47
B: 4.05 4.78 3.77 3.81 3.17 6.02 4.84 4.81 4.34 3.48
yi 0.53 0.68 −0.46 0.52 1.09 0.27 0.20 0.27 0.74 −0.01

$$\bar{y} = 0.409 \qquad s^2 = \frac{1}{19}\sum_{i=1}^{20}(y_i - \bar{y})^2 = 0.237483$$

Data are available in the file sortdata.txt available on the course website.

(a) Since the two algorithms are each run on the same 20 sets of numbers we analyse
the di¤erences yi = yAi yBi , i = 1; 2; : : : ; 20. Construct a 99% con…dence
interval for the di¤erence in the average time to sort with algorithms A and B,
assuming the di¤erence have a Gaussian distribution.
(b) Use a Normal qqplot to determine if a Gaussian model is reasonable for the
di¤erences.
(c) Give a point estimate of the probability that algorithm B will sort a randomly
selected list faster than A.
(d) Another way to estimate the probability p in part (c) is to notice that of the 20
sets of numbers in the study, B sorted faster on 15 sets of numbers. Obtain an
approximate 95% con…dence interval for p. (It is also possible to get a con…dence
interval using the Gaussian model.)
(e) Suppose the study had actually been conducted using two independent samples of
size 20 each. Using the two-sample Normal analysis determine a 99% confidence
interval for the difference in the average time to sort with algorithms A and B.
Note:

ȳ1 = 4.7375   s1² = 1.4697   ȳ2 = 4.3285   s2² = 0.9945

How much better is the paired study compared to the two-sample study?
(f) Here is the R code for doing the t tests and confidence intervals for the paired
analysis and the unpaired analysis:
# Import dataset sortdata.txt in folder S231Datasets
t.test(Time~Alg,data=sortdata,paired=T,conf.level=0.99)
t.test(Time~Alg,data=sortdata,paired=F,var.equal=T,conf.level=0.99)

24. Challenge Problem Let Y1, Y2, ..., Yn be a random sample from the G(μ1, σ1)
distribution and let X1, ..., Xn be a random sample from the G(μ2, σ2) distribution.
Obtain the likelihood ratio test statistic for testing the hypothesis H0: σ1 = σ2 and
show that it is a function of F = S1²/S2², where S1² and S2² are the sample variances
from the y and x samples respectively.

25. Challenge Problem Readings produced by a set of scales are independent and
Normally distributed about the true weight of the item being measured. A study
is carried out to assess whether the standard deviation of the measurements varies
according to the weight of the item.

(a) Ten weighings of a 10 kilogram weight yielded ȳ = 10.004 and s = 0.013 as the
sample mean and standard deviation. Ten weighings of a 40 kilogram weight
yielded ȳ = 39.989 and s = 0.034. Is there any evidence of a difference in the
standard deviations for the measurements of the two weights?
(b) Suppose you had a further set of weighings of a 20 kilogram item. How could
you study the question of interest further?
(b) Suppose you had a further set of weighings of a 20 kilogram item. How could
you study the question of interest further?

26. Challenge Problem Suppose you have a model where the mean of the response
variable Yi given the covariates xi = (xi1, ..., xik) has the form

    μi = E(Yi | xi) = μ(xi; β)

where β is a k × 1 vector of unknown parameters. Then the least squares estimate
of β based on the data (xi, yi), i = 1, 2, ..., n is the value that minimizes the objective
function

    S(β) = Σ_{i=1}^n [yi − μ(xi; β)]²

Show that the least squares estimate of β is the same as the maximum likelihood
estimate of β in the Gaussian model Yi ~ G(μi, σ), when μi is of the form

    μi = μ(xi; β) = Σ_{j=1}^k βj xij

27. Challenge Problem Optimal Prediction In many settings we want to use co-
variates x to predict a future value Y. (For example, we use economic factors x to
predict the price Y of a commodity a month from now.) The value Y is random, but
suppose we know μ(x) = E(Y | x) and σ(x)² = Var(Y | x).

(a) Predictions take the form Ŷ = g(x), where g(·) is our "prediction" function.
Show that E[(Ŷ − Y)²] is minimized by choosing g(x) = μ(x).
(b) Show that the minimum achievable value of E[(Ŷ − Y)²], that is, its value when
g(x) = μ(x), is σ(x)².
This shows that if we can determine or estimate μ(x), then "optimal" prediction
(in terms of Euclidean distance) is possible. Part (b) shows that we should try
to find covariates x for which σ(x)² = Var(Y | x) is as small as possible.
(c) What happens when σ(x)² is close to zero? (Explain this in ordinary English.)
7. MULTINOMIAL MODELS AND GOODNESS OF FIT TESTS

7.1 Likelihood Ratio Test for the Multinomial Model


Many important hypothesis testing problems can be addressed using Multinomial models.
Suppose the data arise from a Multinomial distribution with joint probability function

    f(y1, y2, ..., yk; θ1, θ2, ..., θk) = [n! / (y1! y2! ... yk!)] θ1^y1 θ2^y2 ... θk^yk    (7.1)

where yj = 0, 1, ... and Σ_{j=1}^k yj = n. The Multinomial probabilities θj satisfy 0 ≤ θj ≤ 1 and
Σ_{j=1}^k θj = 1. The likelihood function based on (7.1) is

    L(θ1, θ2, ..., θk) = [n! / (y1! y2! ... yk!)] θ1^y1 θ2^y2 ... θk^yk

or more simply

    L(θ) = Π_{j=1}^k θj^yj    (7.2)

where θ = (θ1, θ2, ..., θk). It can be shown that L(θ) is maximized by θ̂ = (θ̂1, θ̂2, ..., θ̂k)
where θ̂j = yj/n, j = 1, 2, ..., k. Note that although θ = (θ1, θ2, ..., θk) there are actually
only k − 1 parameters to be estimated since Σ_{j=1}^k θj = 1.

Suppose that we wish to test the hypothesis that the probabilities θ1, θ2, ..., θk are
related in some way, for example, that they are all functions of a parameter α, such that

    H0: θj = θj(α) for j = 1, 2, ..., k    (7.3)

where α = (α1, α2, ..., αp) and p < k − 1. In other words, p is equal to the number of
parameters that need to be estimated in the model assuming the null hypothesis (7.3). For
example, suppose θ = (θ1, θ2, θ3, θ4) so k = 4 and the null hypothesis is

    H0: θ1 = α1, θ2 = α1 + α2, θ3 = α2, θ4 = 1 − 2(α1 + α2)

then α = (α1, α2) and p = 2.

A likelihood ratio test of (7.3) is based on the likelihood ratio statistic

    Λ = −2 log[ L(θ̃0) / L(θ̃) ]    (7.4)

where θ̃0 maximizes L(θ) assuming the null hypothesis (7.3) is true.

The test statistic (7.4) can be written in a simple form. Let θ̃0 = (θ1(α̃), ..., θk(α̃))
denote the maximum likelihood estimator of θ under the null hypothesis (7.3). Then

    Λ = 2 Σ_{j=1}^k Yj log[ θ̃j / θj(α̃) ]

Noting that θ̃j = Yj/n and defining the expected frequencies under H0 as

    Ej = n θj(α̃) for j = 1, 2, ..., k

we can rewrite Λ as

    Λ = 2 Σ_{j=1}^k Yj log(Yj / Ej)    (7.5)

Let

    λ = 2 Σ_{j=1}^k yj log(yj / ej)

be the observed value of Λ, where ej = n θj(α̂), j = 1, 2, ..., k. (Remember log = ln.) Note
that the value of λ will be close to 0 if the observed values y1, y2, ..., yk are close to the
expected values e1, e2, ..., ek, and that the value of λ will be large if the yj's and ej's differ
greatly.

If n is large and H0 is true then the distribution of Λ is approximately χ²(k − 1 − p).
This enables us to compute p-values from observed data by using the approximation

    p-value = P(Λ ≥ λ; H0) ≈ P(W ≥ λ) where W ~ χ²(k − 1 − p)

This approximation is accurate when n is large and none of the θj's is too small. In
particular, the expected frequencies determined assuming H0 is true should all be at least
5 to use the Chi-squared approximation.

An alternative test statistic that was developed historically before the likelihood ratio
test statistic is the Pearson goodness of fit statistic

    D = Σ_{j=1}^k (Yj − Ej)² / Ej    (7.6)

with observed value

    d = Σ_{j=1}^k (yj − ej)² / ej

The Pearson goodness of fit statistic has similar properties to Λ, that is, d takes on small
values if the yj's and ej's are close in value and d takes on large values if the yj's and ej's
differ greatly. It also turns out that, like Λ, the statistic D has a limiting χ²(k − 1 − p)
distribution when H0 is true.
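Both statistics are simple to compute for any set of observed and expected frequencies. The course software is R; purely as an illustration, here is a short Python sketch of (7.5) and (7.6) (the function names are ours, not part of any course software):

```python
import math

def lr_statistic(y, e):
    """Observed likelihood ratio statistic (7.5): lambda = 2 sum y_j log(y_j/e_j).
    A category with y_j = 0 contributes 0, the limit of y log y as y -> 0."""
    return 2 * sum(yj * math.log(yj / ej) for yj, ej in zip(y, e) if yj > 0)

def pearson_statistic(y, e):
    """Observed Pearson goodness of fit statistic (7.6): d = sum (y_j - e_j)^2 / e_j."""
    return sum((yj - ej) ** 2 / ej for yj, ej in zip(y, e))

# Sanity checks: perfect agreement gives 0; for y = (30, 70) against
# e = (50, 50) the two statistics are close but not identical.
y = [25.0, 25.0, 25.0, 25.0]
print(lr_statistic(y, y), pearson_statistic(y, y))       # 0.0 0.0
print(round(lr_statistic([30, 70], [50, 50]), 2))        # 16.46
print(round(pearson_statistic([30, 70], [50, 50]), 2))   # 16.0
```

The p-value then comes from the χ²(k − 1 − p) approximation, e.g. 1 - pchisq(lambda, df) in R, as in the code shown with the Chapter 7 problems.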

The remainder of this chapter consists of the application of the general methods above
to some important testing problems.

7.2 Goodness of Fit Tests


Recall from Section 2.4 that one way to check the fit of a probability distribution is by
comparing the observed frequencies fj and the expected frequencies ej = n p̂j. As indicated
there, we did not know how close the observed and expected frequencies needed to be to
conclude that the model was adequate. It is possible to test the appropriateness of a model
by using the Multinomial model. We illustrate this test through several examples.

Example 7.2.1 MM, MN, NN blood types

In Example 2.4.2, n people were selected from a population and classified as being one
of three blood types MM, MN, NN. Let Y1 = number of MM types observed, Y2 = number
of MN types observed and Y3 = number of NN types observed. If the proportions of the
population that are these three types are θ1, θ2, θ3 respectively, with θ1 + θ2 + θ3 = 1 then
the joint probability function of Y1, Y2, Y3 is Multinomial(n; θ1, θ2, θ3) and k = 3.
Genetic theory indicates that the θj's can be expressed in terms of a single parameter
α. The null hypothesis corresponding to this is

    H0: θ1 = α², θ2 = 2α(1 − α), θ3 = (1 − α)²    (7.7)

There is only one unknown parameter α under (7.7), so p = 1.

Data collected on 100 persons gave y1 = 17, y2 = 46, y3 = 37, and we can use this to
test the hypothesis H0. The likelihood function under (7.7) is

    L1(α) = L(θ1(α), θ2(α), θ3(α))
          ∝ (α²)^17 [2α(1 − α)]^46 [(1 − α)²]^37
          = c α^80 (1 − α)^120 for 0 ≤ α ≤ 1

where c is a constant with respect to α. We easily find that α̂ = 0.40. The expected
frequencies under (7.7) are e1 = 100α̂² = 16, e2 = 100[2α̂(1 − α̂)] = 48, e3 =
100[(1 − α̂)²] = 36. The observed value of the likelihood ratio statistic (7.5) is

    2 Σ_{j=1}^3 yj log(yj/ej) = 2[17 log(17/16) + 46 log(46/48) + 37 log(37/36)] = 0.17

The degrees of freedom for the Chi-squared approximation equal k − 1 − p = 3 − 1 − 1 = 1.
The approximate p-value is

    p-value ≈ P(Λ ≥ 0.17; H0) ≈ P(W ≥ 0.17) where W ~ χ²(1)
            = 2[1 − P(Z ≤ √0.17)] where Z ~ N(0, 1)
            = 2[1 − P(Z ≤ 0.41)]
            = 2(1 − 0.6591) = 0.6818

so there is no evidence against the model (7.7).

The observed values of the Pearson goodness of fit statistic (7.6) and the likelihood
ratio statistic Λ are usually close when n is large and so it does not matter which test
statistic is used. In this case we find that the observed value of (7.6) for these data is also 0.17.
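The numbers in Example 7.2.1 are easy to verify. The sketch below (in Python for illustration; the course software is R) recomputes α̂, the expected frequencies and λ:

```python
import math

y = [17, 46, 37]   # observed MM, MN, NN counts
n = sum(y)         # 100

# L1(alpha) is proportional to alpha^80 (1-alpha)^120, maximized at
# alpha_hat = 80/200: each MM person contributes two "M" alleles, each MN one.
alpha = (2 * y[0] + y[1]) / (2 * n)

e = [n * alpha ** 2, n * 2 * alpha * (1 - alpha), n * (1 - alpha) ** 2]
lam = 2 * sum(yj * math.log(yj / ej) for yj, ej in zip(y, e))

print(alpha)                        # 0.4
print([round(ej, 2) for ej in e])   # [16.0, 48.0, 36.0]
print(round(lam, 2))                # 0.17
```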

Example 7.2.2 Goodness of fit and the Poisson model

The number of service interruptions in a communications system over 200 separate days
is summarized in the following frequency table:

Number of interruptions:   0    1    2    3    4    5    >5   Total
Frequency observed yj:     64   71   42   18   4    1    0    200

Let Yj = number of times j interruptions are observed. The joint model for the Yj's is
Multinomial.
We wish to test whether a Poisson model for Y = the number of interruptions on a
single day is consistent with these data. The null hypothesis is

    H0: θj = μ^j e^{−μ} / j! for j = 0, 1, ...

(Note that we are using μ rather than α as the parameter of interest.) The maximum
likelihood estimate of μ based on the observed data in the table is

    μ̂ = (1/200)[0(64) + 1(71) + 2(42) + 3(18) + 4(4) + 5(1)] = 230/200 = 1.15

The observed and expected frequencies assuming a Poisson(1.15) distribution are given in
the table below

No. of interruptions:   0       1       2       3       4      ≥5     Total
yj:                     64      71      42      18      4      1      200
ej:                     63.33   72.83   41.88   16.05   4.61   1.30   200

where

    ej = 200 (1.15)^j e^{−1.15} / j! for j = 0, 1, ..., 4

and the expected frequency for the last category is obtained by subtraction. Since the
expected frequency in the last category is less than 5 we combine the last two categories
to obtain

No. of interruptions:   0            1            2            3            ≥4          Total
yj (ej):                64 (63.33)   71 (72.83)   42 (41.88)   18 (16.05)   5 (5.91)    200

The observed value of the likelihood ratio statistic is

    2[64 log(64/63.33) + 71 log(71/72.83) + 42 log(42/41.88) + 18 log(18/16.05) + 5 log(5/5.91)]
    = 0.43

The collapsed table has five categories so k = 5 and only one parameter has been estimated
under H0 so p = 1. The degrees of freedom for the Chi-squared approximation equal
k − 1 − p = 5 − 1 − 1 = 3. The approximate p-value is

    p-value ≈ P(W ≥ 0.43) = 0.93 where W ~ χ²(3)

Since the p-value > 0.1 there is no evidence based on the data against the hypothesis that
the Poisson model fits these data.
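The fitted frequencies and the statistic in Example 7.2.2 can be reproduced numerically; a Python sketch (illustration only, the course software is R):

```python
import math

# Observed frequencies (the ">5" class had frequency 0)
counts = {0: 64, 1: 71, 2: 42, 3: 18, 4: 4, 5: 1}
n = sum(counts.values())                         # 200 days

# MLE of the Poisson mean: total number of interruptions / number of days
mu = sum(j * f for j, f in counts.items()) / n   # 230/200 = 1.15

# Expected frequencies for j = 0,...,3; the last class (>= 4) by subtraction
e = [n * math.exp(-mu) * mu ** j / math.factorial(j) for j in range(4)]
e.append(n - sum(e))

# Collapse the observed table the same way: classes 0, 1, 2, 3 and >= 4
y = [counts[0], counts[1], counts[2], counts[3], counts[4] + counts[5]]

lam = 2 * sum(yj * math.log(yj / ej) for yj, ej in zip(y, e))
print(round(mu, 2), round(lam, 2))   # 1.15 0.43
```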

Example 7.2.3 Goodness of fit and the Exponential model

Continuous distributions can also be tested by grouping the data into intervals and then
using the Multinomial model. Example 2.6.2 previously did this in an informal way for an
Exponential distribution and the lifetimes of brake pads data.
Suppose a random sample t1, t2, ..., t100 is collected and we wish to test the hypothesis
that the data come from an Exponential(θ) distribution. We partition the range of T
into intervals Ij, j = 1, 2, ..., k, and count the number of observations yj that fall into each
interval. Assuming an Exponential(θ) model, the probability that an observation lies in the
j'th interval Ij = (a_{j−1}, aj) is

    pj(θ) = ∫_{a_{j−1}}^{aj} f(t; θ) dt = e^{−a_{j−1}/θ} − e^{−aj/θ} for j = 1, 2, ..., k    (7.8)

and if yj is the number of observations (t's) that lie in Ij, then Y1, Y2, ..., Yk follow a
Multinomial(100; p1(θ), p2(θ), ..., pk(θ)) distribution.
Suppose the observed data are

Interval:   0-100   100-200   200-300   300-400   400-600   600-800   >800
yj:         29      22        12        10        10        9         8
ej:         27.6    20.0      14.4      10.5      13.1      6.9       7.6

so k = 7. To calculate the expected frequencies under the null hypothesis (7.8) we need an
estimate of θ which is obtained by maximizing the likelihood function

    L(θ) = Π_{j=1}^7 [pj(θ)]^yj

Since there is only one unknown parameter θ under (7.8), p = 1. It is possible to maximize
L(θ) to obtain θ̂ = 310.0. The expected frequencies, ej = 100 pj(θ̂), j = 1, 2, ..., 7, are
given in the table.

The observed value of the likelihood ratio statistic (7.5) is

    2 Σ_{j=1}^7 yj log(yj/ej) = 2[29 log(29/27.6) + 22 log(22/20) + ... + 8 log(8/7.6)] = 1.91

The degrees of freedom for the Chi-squared approximation equal k − 1 − p = 7 − 1 − 1 = 5.
The approximate p-value is

    p-value = P(Λ ≥ 1.91; H0) ≈ P(W ≥ 1.91) = 0.86 where W ~ χ²(5)

so there is no evidence against the model (7.8).
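The statistic in Example 7.2.3 can be checked from the tabled frequencies. The sketch below (Python, for illustration) takes the expected frequencies ej = 100 pj(θ̂) from the table as given, since θ̂ = 310.0 itself comes from a numerical maximization of L(θ):

```python
import math

y = [29, 22, 12, 10, 10, 9, 8]                 # observed counts, n = 100
e = [27.6, 20.0, 14.4, 10.5, 13.1, 6.9, 7.6]   # e_j = 100 p_j(theta_hat), from the table

lam = 2 * sum(yj * math.log(yj / ej) for yj, ej in zip(y, e))
print(round(lam, 2))   # 1.91
```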

A goodness of fit test has some arbitrary elements, since we could have used different
intervals and a different number of intervals. Theory has been developed on how best to
choose the intervals. For this course we only give rough guidelines: choose 4 to 10
intervals, such that the expected frequencies under H0 are all at least 5.

Example 7.2.4 Lifetime of brake pads and the Exponential model

Recall the data in Example 2.6.2 on the lifetimes of brake pads. The expected frequencies
are calculated using an Exponential model with mean θ estimated by the sample mean
θ̂ = 49.0275. The data are given in Table 7.1.

Interval      Observed frequency fj   Expected frequency ej
[0, 15)             21                      52.72
[15, 30)            45                      38.82
[30, 45)            50                      28.59
[45, 60)            27                      21.05
[60, 75)            21                      15.50
[75, 90)            9                       11.42
[90, 105)           12                      8.41
[105, 120)          7                       6.19
[120, ∞)            8                       17.30
Total               200                     200

Table 7.1: Frequency table for brake pad data

The observed value of the likelihood ratio statistic (7.5) is

    2 Σ_{j=1}^9 fj log(fj/ej) = 2[21 log(21/52.72) + 45 log(45/38.82) + ... + 8 log(8/17.3)] = 50.36

The expected frequencies are all at least five and so k = 9. There is only one parameter
to be estimated under the hypothesized Exponential(θ) model so p = 1. The degrees of
freedom for the Chi-squared approximation equal k − 1 − p = 9 − 1 − 1 = 7. The approximate
p-value is

    p-value = P(Λ ≥ 50.36; H0) ≈ P(W ≥ 50.36) ≈ 0 where W ~ χ²(7)

and there is very strong evidence based on the data against the hypothesis that an Ex-
ponential model fits these data. This conclusion is not unexpected since, as we noted in
Example 2.6.2, the observed and expected frequencies are not in close agreement. We could
have chosen a different set of intervals for these continuous data but the same conclusion
of a lack of fit would be obtained for any reasonable choice of intervals.
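The value λ = 50.36 in Example 7.2.4 can be checked directly from Table 7.1 (a Python sketch, for illustration):

```python
import math

# Observed and expected frequencies from Table 7.1
f = [21, 45, 50, 27, 21, 9, 12, 7, 8]
e = [52.72, 38.82, 28.59, 21.05, 15.50, 11.42, 8.41, 6.19, 17.30]

lam = 2 * sum(fj * math.log(fj / ej) for fj, ej in zip(f, e))
print(round(lam, 2))   # 50.36
```

The huge value relative to a χ²(7) distribution (whose mean is 7) is what produces the p-value of essentially zero.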

7.3 Two-Way (Contingency) Tables


Often we want to assess whether two factors or variates appear to be related. One tool for
doing this is to test the hypothesis that the factors are independent and thus statistically
unrelated. We will consider this in the case where both variates are discrete, and take on
a fairly small number of possible values. This turns out to cover a great many important
settings.
Two types of studies give rise to data that can be used to test independence, and in
both cases the data can be arranged as frequencies in a two-way table. These tables are
also called contingency tables.

Cross-Classification of a Random Sample of Individuals


Suppose that individuals or items in a population can be classified according to each of
two factors A and B. For A, an individual can be any one of a mutually exclusive types
A1, A2, ..., Aa and for B an individual can be any one of b mutually exclusive types B1, B2, ..., Bb,
where a ≥ 2 and b ≥ 2.
If a random sample of n individuals is selected, let yij denote the number that have
A-type Ai and B-type Bj. The observed data may be arranged in a two-way table as seen
below:

A\B     B1    B2    ...   Bb    Total
A1      y11   y12   ...   y1b   r1
A2      y21   y22   ...   y2b   r2
...     ...   ...   ...   ...   ...
Aa      ya1   ya2   ...   yab   ra
Total   c1    c2    ...   cb    n

where ri = Σ_{j=1}^b yij are the row totals, cj = Σ_{i=1}^a yij are the column totals, and
Σ_{i=1}^a Σ_{j=1}^b yij = n.
Let θij be the probability a randomly selected individual is of combined type (Ai, Bj) and
note that Σ_{i=1}^a Σ_{j=1}^b θij = 1. The ab frequencies (Y11, Y12, ..., Yab) follow a Multinomial
distribution with k = ab classes.
To test independence of the A and B classifications, we test the hypothesis

    H0: θij = αi βj for i = 1, 2, ..., a; j = 1, 2, ..., b    (7.9)

where 0 < αi < 1, 0 < βj < 1, Σ_{i=1}^a αi = 1, Σ_{j=1}^b βj = 1. Note that

    αi = P(an individual is type Ai) and βj = P(an individual is type Bj)

and that (7.9) is the standard definition for independent events: P(Ai ∩ Bj) = P(Ai)P(Bj).
We note that testing (7.9) falls into the general framework of Section 7.1, where k = ab,
and the number of parameters estimated under (7.9) is p = (a − 1) + (b − 1) = a + b − 2.
All that needs to be done in order to use the statistics (7.5) or (7.6) to test H0 is to obtain
the maximum likelihood estimates α̂i, β̂j under the model (7.9), and then to calculate the
expected frequencies eij.
Under the model (7.9), the likelihood function for the yij's is

    L1(α, β) = Π_{i=1}^a Π_{j=1}^b [θij(α, β)]^yij
             = Π_{i=1}^a Π_{j=1}^b (αi βj)^yij

The log likelihood function ℓ(α, β) = log L1(α, β) must be maximized subject to the linear
constraints Σ_{i=1}^a αi = 1, Σ_{j=1}^b βj = 1. The maximum likelihood estimates can be shown to be

    α̂i = ri/n,  β̂j = cj/n for i = 1, 2, ..., a; j = 1, 2, ..., b

and the expected frequencies are given by

    eij = n α̂i β̂j = ri cj / n for i = 1, 2, ..., a; j = 1, 2, ..., b    (7.10)

The observed value of the likelihood ratio statistic for testing H0 is

    λ = 2 Σ_{i=1}^a Σ_{j=1}^b yij log(yij / eij)

The degrees of freedom for the Chi-squared approximation are

    k − 1 − p = (ab − 1) − (a − 1 + b − 1) = (a − 1)(b − 1)

and the approximate p-value is

    p-value = P(Λ ≥ λ; H0) ≈ P(W ≥ λ) where W ~ χ²((a − 1)(b − 1))

Example 7.3.1 Blood classifications

Human blood is classified according to several systems. Two systems are the OAB
system and the Rh system. In the former a person is one of four types O, A, B, AB and in
the latter system a person is Rh+ or Rh−. To determine whether these two classification
systems are genetically independent, a random sample of 300 persons was chosen. Their
blood was classified according to the two systems and the observed frequencies are given in
the table below.

        O     A     B    AB   Total
Rh+     82    89    54   19   244
Rh−     13    27    7    9    56
Total   95    116   61   28   300

We can think of the Rh types as the A-type classification and the OAB types as the B-type
classification in the general theory above. The row and column totals are also shown in the
table, since they are the values needed to compute the eij's in (7.10).
To carry out the test that a person's Rh and OAB blood types are statistically inde-
pendent, we merely need to compute the eij's by (7.10). For example,

    e11 = (244)(95)/300 = 77.27,  e12 = (244)(116)/300 = 94.35  and  e13 = (244)(61)/300 = 49.61

The remaining expected frequencies can be obtained by subtraction and these are given in
the table below in brackets below the observed frequencies.

        O            A            B            AB          Total
Rh+     82 (77.27)   89 (94.35)   54 (49.61)   19 (22.77)  244
Rh−     13 (17.73)   27 (21.65)   7 (11.39)    9 (5.23)    56
Total   95           116          61           28          300

The degrees of freedom for the Chi-squared approximation are (a − 1)(b − 1) = (1)(3) = 3,
which is consistent with the fact that, once we had calculated three of the expected
frequencies, the remaining expected frequencies could be obtained by subtraction.
The observed value of the likelihood ratio test statistic is λ = 8.447, and the p-value
is approximately P(W ≥ 8.447) = 0.0376 where W ~ χ²(3), so there is evidence against
the hypothesis of independence based on the data. Note that by comparing the eij's and
the yij's we see that the degree of dependence does not appear large.
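The computations in Example 7.3.1 take only a few lines of code. The notes use R; the sketch below uses Python for illustration, building the expected frequencies eij = ri cj / n from the margins as in (7.10):

```python
import math

# Observed frequencies from Example 7.3.1: rows Rh+, Rh-; columns O, A, B, AB
y = [[82, 89, 54, 19],
     [13, 27,  7,  9]]

r = [sum(row) for row in y]         # row totals r_i: [244, 56]
c = [sum(col) for col in zip(*y)]   # column totals c_j: [95, 116, 61, 28]
n = sum(r)                          # 300

# Expected frequencies under independence, e_ij = r_i c_j / n, as in (7.10)
e = [[ri * cj / n for cj in c] for ri in r]

lam = 2 * sum(y[i][j] * math.log(y[i][j] / e[i][j])
              for i in range(2) for j in range(4))
df = (2 - 1) * (4 - 1)   # (a-1)(b-1) = 3 degrees of freedom

print(round(lam, 3), df)   # lam agrees with the 8.447 reported above
```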

Testing equality of Multinomial parameters for two or more groups


A similar problem arises when individuals in a population can be one of b types
B1, B2, ..., Bb, but where the population is sub-divided into a groups A1, A2, ..., Aa. In this
case, we might be interested in whether the proportions of individuals of types B1, B2, ..., Bb
are the same for each group. This is essentially the same as the question of independence
in the preceding section: we want to know whether the probability θij that a person in
population group i is B-type Bj is the same for all i = 1, 2, ..., a. That is, θij = P(Bj | Ai)
and we want to know if this depends on Ai or not.
Although the framework is superficially the same as the preceding section, the details
are a little different. In particular, the probabilities θij satisfy

    θi1 + θi2 + ... + θib = 1 for each i = 1, 2, ..., a

and the hypothesis we are interested in testing is

    H0: θ1 = θ2 = ... = θa    (7.11)

where θi = (θi1, θi2, ..., θib). Furthermore, the data in this case arise by selecting specified
numbers of individuals ni from groups i = 1, 2, ..., a and so there are actually a different
Multinomial distributions, Multinomial(ni; θi1, θi2, ..., θib), i = 1, 2, ..., a.
If we denote the observed frequency of Bj-type individuals in the sample from the i'th
group as yij (where yi1 + yi2 + ... + yib = ni), then it can be shown that the likelihood ratio
statistic for testing (7.11) is exactly the same as the statistic used to test independence
above, where now the expected frequencies eij are given by

    eij = ni y+j / n for i = 1, 2, ..., a; j = 1, 2, ..., b

where n = n1 + n2 + ... + na and y+j = Σ_{i=1}^a yij. Since ni = yi+ = Σ_{j=1}^b yij the expected
frequencies have exactly the same form as in the preceding section, when we lay out the
data in a two-way table with a rows and b columns.

Example 7.3.1 Revisited Blood classifications

The study in Example 7.3.1 could have been conducted differently, by selecting a fixed
number of Rh+ persons and a fixed number of Rh− persons, and then determining their
OAB blood type. Then the proper framework would be to test that the probabilities for
the four types O, A, B, AB were the same for Rh+ and for Rh− persons, and so the
methods of the present section apply. This study gives exactly the same testing procedure
as one where the numbers of Rh+ and Rh− persons in the sample are random, as discussed.

Example 7.3.2 Aspirin and strokes

In a randomized clinical trial to assess the effectiveness of a small daily dose of aspirin
in preventing strokes among high-risk persons, a group of patients were randomly assigned
to receive either aspirin or a placebo. A total of 240 patients were assigned to the aspirin group
and 236 were assigned to the placebo group. (There were actually an equal number in each
group but four patients withdrew from the placebo group during the study.) The patients
were followed for three years, and it was determined for each person whether they had a
stroke during that period or not. The data were as follows (expected frequencies are given
in brackets).

                Stroke       No Stroke      Total
Aspirin Group   64 (75.6)    176 (164.4)    240
Placebo Group   86 (74.4)    150 (161.6)    236
Total           150          326            476
We can think of the persons receiving aspirin and those receiving the placebo as two groups,
and test the hypothesis

    H0: θ11 = θ21

where θ11 = P(stroke) for a person in the aspirin group and θ21 = P(stroke) for a person
in the placebo group. The expected frequencies under H0: θ11 = θ21 are

    eij = (yi+)(y+j) / 476 for i = 1, 2; j = 1, 2

This gives the expected frequencies shown in the table in brackets. The observed value of
the likelihood ratio statistic is

    2 Σ_{i=1}^2 Σ_{j=1}^2 yij log(yij / eij) = 5.25

and the approximate p-value is

    p-value = P(Λ ≥ 5.25; H0) ≈ P(W ≥ 5.25) where W ~ χ²(1)
            = 2[1 − P(Z ≤ √5.25)] where Z ~ N(0, 1)
            = 2[1 − P(Z ≤ 2.29)]
            = 2(1 − 0.98899) = 0.02202

so there is evidence against H0 based on the data. A look at the yij's and the eij's indicates
that persons receiving aspirin have had fewer strokes than expected under H0, suggesting
that θ11 < θ21.
This test can be followed up with estimates for θ11 and θ21. Because each row of the
table follows a Binomial distribution, we have

    θ̂11 = y11/n1 = 64/240 = 0.267 and θ̂21 = y21/n2 = 86/236 = 0.364

We can also give individual confidence intervals for θ11 and θ21. Based on methods derived
earlier we have an approximate 95% confidence interval for θ11 given by

    0.267 ± 1.96 √[(0.267)(0.733)/240] or [0.211, 0.323]

and an approximate 95% confidence interval for θ21 given by

    0.364 ± 1.96 √[(0.364)(0.636)/236] or [0.303, 0.425]

Confidence intervals for the difference in proportions θ11 − θ21 can also be obtained from
the approximate G(0, 1) pivotal quantity

    [(θ̃11 − θ̃21) − (θ11 − θ21)] / √[ θ̃11(1 − θ̃11)/n1 + θ̃21(1 − θ̃21)/n2 ]
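The point estimates and confidence intervals above can be checked numerically. In the sketch below (Python, for illustration) the intervals are computed from the unrounded estimates, so the upper endpoint for θ21 comes out as 0.426 rather than the 0.425 obtained when θ̂21 is first rounded to 0.364:

```python
import math

y11, n1 = 64, 240   # strokes / total, aspirin group
y21, n2 = 86, 236   # strokes / total, placebo group

t1 = y11 / n1       # about 0.267
t2 = y21 / n2       # about 0.364

def ci(p, n, z=1.96):
    """Approximate 95% confidence interval for a Binomial probability."""
    half = z * math.sqrt(p * (1 - p) / n)
    return (round(p - half, 3), round(p + half, 3))

print(ci(t1, n1))   # (0.211, 0.323)
print(ci(t2, n2))   # (0.303, 0.426)

# Approximate 95% interval for the difference theta11 - theta21,
# from the pivotal quantity above
se = math.sqrt(t1 * (1 - t1) / n1 + t2 * (1 - t2) / n2)
print(round(t1 - t2 - 1.96 * se, 3), round(t1 - t2 + 1.96 * se, 3))  # -0.181 -0.015
```

The interval for the difference excludes 0, consistent with the evidence against H0 found above.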

Remark This and other tests involving Binomial probabilities and contingency tables
can be carried out using the R function prop.test which uses the Pearson goodness of fit
statistic.

7.4 Chapter 7 Problems


1. In a large STAT 231 class, each student was given a box of Smarties and then asked
to count the number of each colour: red, green, yellow, blue, purple, brown, orange,
pink. The observed frequencies were:
Colour: Red Green Yellow Blue Purple Brown Orange Pink
Frequency (yi ): 556 678 739 653 725 714 566 797

Test the hypothesis that each of the colours has the same probability, H0: θi = 1/8,
i = 1, 2, ..., 8.
The following R code calculates the observed values of the likelihood ratio test statistic
Λ and the Pearson goodness of fit statistic D and the corresponding p-values.
y<-c(556,678,739,653,725,714,566,797) # observed frequencies
e<-sum(y)/8 # expected frequencies
lambda<-2*sum(y*log(y/e)) # observed value of LR statistic
df<-7 # degrees of freedom for this example equal 7
pvalue<-1-pchisq(lambda,df) # p-value for LR test
c(lambda,df,pvalue)
d<-sum((y-e)^2/e) # observed value of Pearson goodness of fit statistic
pvalue<-1-pchisq(d,df) # p-value for Pearson goodness of fit test
c(d,df,pvalue)
What would you conclude about the distribution of colours in boxes of Smarties?

2. Test whether a Poisson model for Y = the number of alpha particles emitted in a time
interval of 1/8 minute is consistent with the Rutherford and Geiger data of Example
2.6.1.

3. Test whether a Poisson model for Y = the number of points per game is consistent
with the data for Wayne Gretzky given in Chapter 2, Problem 10.

4. Test whether a Poisson model for Y = the number of points per game is consistent
with the data for Sidney Crosby given in Chapter 2, Problem 11.

5. In the Wintario lottery draw, six digit numbers were produced by six machines that
operate independently and which each simulate a random selection from the digits
0, 1, ..., 9. Of 736 numbers drawn over the period 1980-82, the following frequencies
were observed for position 1 in the six digit numbers:

Digit (i):        0    1    2    3    4    5    6    7    8    9    Total
Frequency (fi):   70   75   63   59   81   92   75   100  63   58   736

Consider the 736 draws as trials in a Multinomial experiment and let

    θj = P(digit j is drawn on any trial), j = 0, 1, ..., 9


If the machines operate in a truly "random" fashion, then we should have θj = 0.1,
j = 0, 1, ..., 9.

(a) Test this hypothesis using the likelihood ratio test. What do you conclude?
(b) The data above were for digits in the first position of the six digit Wintario
numbers. Suppose you were told that similar likelihood ratio tests had in fact
been carried out for each of the six positions, and that position one had been
singled out for presentation above because it gave the largest observed value of
the likelihood ratio statistic λ. How would you test the hypothesis θj = 0.1,
j = 0, 1, 2, ..., 9 for all six positions simultaneously?

6. A long sequence of digits (0, 1, ..., 9) produced by a pseudo random number generator
was examined. There were 51 zeros in the sequence, and for each successive pair of
zeros, the number of (non-zero) digits between them was counted. The results were
as follows:

1 1 6 8 10 22 12 15 0 0
2 26 1 20 4 2 0 10 4 19
2 3 0 5 2 8 1 6 14 2
2 2 21 4 3 0 0 7 2 4
4 7 16 18 2 13 22 7 3 5

(a) Give an appropriate probability model for the number of digits between two
successive zeros, if the pseudo random number generator is truly producing digits
for which P(any digit = j) = 0.1, j = 0, 1, ..., 9, independently of any other digit.
(b) Construct a frequency table and test the goodness of fit of your model.

7. Mass-produced items are packed in cartons of 12 as they come off an assembly line.
The items from 250 cartons are inspected for defects, with the following results:

Number defective:     0     1    2    3    4    5    6    >6   Total
Frequency observed:   103   80   31   19   11   5    1    0    250

(a) Test the hypothesis that the number of defective items Y in a single carton has
a Binomial(12, θ) distribution.
(b) Why might the Binomial not be a suitable model?

8. The table below records data on 292 litters of mice classified according to litter size
and number of females in the litter. Note that yn+ = Σ_j ynj.

                         Number of females = j       Total number
    ynj                  0    1    2    3    4       of litters = yn+
    Litter size n = 1    8    12                     20
    n = 2                23   44   13                80
    n = 3                10   25   48   13           96
    n = 4                5    30   34   22   5       96

(a) For litters of size n (n = 1, 2, 3, 4) assume that the number of females in a litter
of size n has a Binomial distribution with parameters n and θn = P(female). Test
the Binomial model separately for each of the litter sizes n = 2, n = 3 and n = 4.
(Why is it of scientific interest to do this?)
(b) Assuming that the Binomial model is appropriate for each litter size, test the
hypothesis that θ1 = θ2 = θ3 = θ4.

9. The following data on heights of 210 married couples were presented by Yule in 1900.

Tall wife Medium wife Short wife Total


Tall husband 18 28 19 65
Medium husband 20 51 28 99
Short husband 12 25 9 46
Total 50 104 56 210

Test the hypothesis that the heights of husbands and wives are independent.
The following R code determines the p-value for testing the hypothesis of independence.
# matrix of observed frequencies
f<-matrix(c(18,28,19,20,51,28,12,25,9),ncol=3,byrow=TRUE)
row<-margin.table(f,1) # row totals
col<-margin.table(f,2) # column totals
e<-outer(row,col)/sum(f) # matrix of expected frequencies
lambda<-2*sum(f*log(f/e)) # observed value of likelihood ratio statistic
df<-(length(row)-1)*(length(col)-1) # degrees of freedom
pvalue<-1-pchisq(lambda,df)
c(lambda,df,pvalue)

10. A study was undertaken to determine whether there is an association between the
birth weights of infants and the smoking habits of their parents. Out of 50 infants of
above average weight, 9 had parents who both smoked, 6 had mothers who smoked
but fathers who did not, 12 had fathers who smoked but mothers who did not, and
23 had parents of whom neither smoked. The corresponding results for 50 infants of
below average weight were 21, 10, 6, and 13, respectively.

(a) Test whether these results are consistent with the hypothesis that birth weight
is independent of parental smoking habits.
(b) Are these data consistent with the hypothesis that, given the smoking habits of
the mother, the smoking habits of the father are not related to birth weight?

11. School children with tonsils were classified according to tonsil size and absence or
presence of the carrier for streptococcus pyogenes. The results were as follows:

Normal Enlarged Much enlarged Total


Carrier present 19 29 24 72
Carrier absent 497 560 269 1326
Total 516 589 293 1398

Is there evidence of an association between the two classifications?

12. A random sample of 1000 Canadians aged 25-34 were classified according to their
highest level of education and whether they were employed or not (data based on
2011 Canadian census data).

Employed Unemployed Total


No certificate,
diploma or degree 66 10 76

High school
diploma or equivalent 185 16 201

Postsecondary
certificate, diploma or degree 683 40 723

Total 934 66 1000

Test the hypothesis that level of education is independent of whether or not a Cana-
dian aged 25-34 is employed.
7.4. CHAPTER 7 PROBLEMS 297

13. In the following table, 64 sets of triplets are classified according to the age of their
mother at their birth and their sex distribution:

                        3 boys   2 boys   2 girls   3 girls   Total
    Mother under 30          5        8         9         7      29
    Mother over 30           6       10        13         6      35
    Total                   11       18        22        13      64

(a) Is there any evidence of an association between the sex distribution and the age
of the mother?
(b) Suppose that the probability of a male birth is 0.5, and that the sexes of triplets
are determined independently. Find the probability that there are y boys in a
set of triplets, y = 0, 1, 2, 3, and test whether the column totals are consistent with
this distribution.
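For part (b), the Binomial(3, 0.5) model gives probabilities 1/8, 3/8, 3/8, 1/8 and hence expected column totals 8, 24, 24, 8. A hedged Python sketch of the resulting goodness of fit computation (variable names are ours):

```python
import math

obs = [11, 18, 22, 13]                      # column totals: 3 boys, 2 boys, 2 girls, 3 girls
n = sum(obs)                                # 64 sets of triplets

# Binomial(3, 0.5) probabilities for y = 3, 2, 1, 0 boys
p = [math.comb(3, y) * 0.5**3 for y in (3, 2, 1, 0)]
e = [n * pi for pi in p]                    # expected totals: 8, 24, 24, 8

lam = 2 * sum(o * math.log(o / ei) for o, ei in zip(obs, e))
df = 4 - 1                                  # no parameters estimated from the data
```

Here lam is about 5.44 on 3 degrees of freedom, a p-value of roughly 0.14, so the column totals appear consistent with the hypothesized distribution.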

14. To investigate the effectiveness of a rust-proofing procedure, 50 cars that had been
rust-proofed and 50 cars that had not were examined for rust five years after purchase. For each car it was noted whether rust was present (actually defined as having
moderate or heavy rust) or absent (light or no rust). The data are as follows:

                     Rust-Proofed   Not Rust-Proofed
    Rust present               14                 28
    Rust absent                36                 22
    Total                      50                 50

(a) Test the hypothesis that the probability of rust occurring is the same for the
rust-proofed cars as for those not rust-proofed. What do you conclude?
(b) Do you have any concerns about inferring that the rust-proofing prevents rust?
How might a better study be designed?

15. Two hundred volunteers participated in an experiment to examine the effectiveness
of vitamin C in preventing colds. One hundred were selected at random to receive
daily doses of vitamin C and the others received a placebo. (None of the volunteers
knew which group they were in.) During the study period, 20 of those taking vitamin
C and 30 of those receiving the placebo caught colds. Test the hypothesis that the
probability of catching a cold during the study period was the same for each group.
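This is a 2x2 comparison of two binomial proportions. A sketch of the likelihood ratio computation in Python (for illustration only; the notes use R, and for one degree of freedom the chi-squared p-value can be obtained from the standard normal distribution function):

```python
import math

# Colds / no colds in each group
f = [[20, 80],     # vitamin C: 20 colds out of 100
     [30, 70]]     # placebo:   30 colds out of 100

row = [sum(r) for r in f]
col = [sum(c) for c in zip(*f)]
n = sum(row)
e = [[ri * cj / n for cj in col] for ri in row]   # expected frequencies

lam = 2 * sum(f[i][j] * math.log(f[i][j] / e[i][j])
              for i in range(2) for j in range(2))

def phi(z):
    # Standard normal c.d.f. via the error function
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# For df = 1, P(chi-squared(1) > lam) = 2 * (1 - phi(sqrt(lam)))
pvalue = 2 * (1 - phi(math.sqrt(lam)))
```

Here lam is about 2.68 and the p-value about 0.10, so the data do not provide strong evidence against equal probabilities of catching a cold.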

16. Comparing speech recognition algorithms: To compare the performance of two
algorithms, A and B, for speech recognition, researchers presented each algorithm
with a set of labeled utterances for recognition. The utterances, which are syllables,
words, or short phrases, are assumed to be a random sample from a large population
of utterances. An error is made when an algorithm does not correctly identify the
utterance. On a set of 1400 utterances algorithm A made 72 errors. On a different,
independent set of 1400 utterances algorithm B made 62 errors. Test the hypothesis
that the probability of an error is the same for both algorithms.

17. Challenge Problem: Comparing speech recognition algorithms - paired data:
Suppose in Problem 16 that the same set of utterances is presented to both
algorithms A and B. The analysis in Problem 16 cannot be used. (Both algorithms
are more likely to identify simple utterances correctly and make errors more often
when utterances are more complicated.) The data for this type of paired experiment
can be summarized in general as

                              B
                    Correct   Incorrect
    A   Correct       y11        y12
        Incorrect     y21        y22
                                           n

where, for example, y11 = the number of utterances correctly identified by both algorithms A and B. Since (Y11, Y12, Y21, Y22) ~ Multinomial(n; θ11, θ12, θ21, θ22), the
hypothesis that the probability of an error is the same for both algorithms is

    H0: P(A identifies utterance correctly) = θ11 + θ12
      = P(B identifies utterance correctly) = θ11 + θ21,

or equivalently H0: θ12 = θ21.

(a) Show that, under H0, the maximum likelihood estimates are

    θ̂11 = y11/n,   θ̂12 = θ̂21 = (y12 + y21)/(2n),   θ̂22 = y22/n

(b) Use the likelihood ratio test to test the hypothesis H0: θ12 = θ21 for the data

                              B
                    Correct   Incorrect
    A   Correct       1325         3
        Incorrect       13        59
                                           n = 1400

Compare this result with the result in Problem 16.
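Under the maximum likelihood estimates in part (a), the terms for y11 and y22 cancel in the likelihood ratio statistic (the restricted and unrestricted estimates agree in those cells), leaving only the discordant counts. A sketch of part (b)'s computation (illustrative Python, with our variable names):

```python
import math

y12, y21 = 3, 13            # discordant pairs: A correct/B wrong, A wrong/B correct
e = (y12 + y21) / 2         # common expected count under H0: theta12 = theta21

# Likelihood ratio statistic; y11 and y22 terms cancel under H0
lam = 2 * (y12 * math.log(y12 / e) + y21 * math.log(y21 / e))

def phi(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

pvalue = 2 * (1 - phi(math.sqrt(lam)))    # df = 1
```

Here lam is about 6.74 and the p-value roughly 0.009, so the paired analysis finds evidence of a difference between the algorithms, in contrast with the result in Problem 16.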


8. CAUSAL RELATIONSHIPS

8.1 Establishing Causation


As mentioned in Chapters 1 and 3, many studies are carried out with causal objectives in
mind. That is, we would like to be able to establish or investigate a possible cause and
effect relationship between variates x and y.

We use the word "causes" often; for example we might say that "gravity causes dropped
objects to fall to the ground", or that "smoking causes lung cancer". The concept of
causation (as in "x causes y") is nevertheless difficult to define. One reason is that the
"strengths" of causal relationships vary a lot. For example, on earth gravity may always
lead to a dropped object falling to the ground; however, not everyone who smokes gets lung
cancer.

Idealized definitions of causation are often of the following form. Let y be a response
variate associated with units in a population or process, and let x be an explanatory variate
associated with some factor that may affect y. Then, if all other factors that affect y
are held constant, let us change x (or observe different values of x) and see if y
changes. If y changes then we say that x has a causal effect on y.

In fact, this definition is not broad enough, because in many settings a change in x may
only lead to a change in y in some probabilistic sense. For example, giving an individual
person at risk of stroke a small daily dose of aspirin instead of a placebo may not necessarily
lower that individual's risk. Not everyone will necessarily be helped by the daily dose of
aspirin. However, on average, the effect is to lower the risk of stroke in the population.

Therefore, a better idealized definition of causation is to say that changing x should
result in a change in some attribute of the variate y, for example, the proportion of the
population who develop a stroke within 3 years. Thus we revise the definition above to say:
if all other factors that affect y are held constant, let us change x (or observe
different values of x) and see if some specified attribute of y changes. If the
specified attribute of y changes then we say x has a causal effect on y.

These definitions are unfortunately unusable in most settings since we cannot hold all
other factors that affect y constant; often we don't even know what all the factors are.
However, the definition serves as a useful ideal for how we should carry out studies in order
to show that a causal relationship exists. We try to design studies so that alternative (to
the variate x) explanations of what causes changes in attributes of y can be ruled out,
leaving x as the causal agent. This is much easier to do in experimental studies, where
explanatory variates may be controlled, than in observational studies. The following are
brief examples.

Example 8.1.1 Strength of steel bolts


Recall Example 6.1.4 concerning the (breaking) strength y of a steel bolt and the diameter x of the bolt. It is clear that bolts with larger diameters tend to have higher strength,
and it seems clear on physical and theoretical grounds that increasing the diameter "causes"
an increase in strength. This can be investigated in experimental studies like that in Example 6.1.4, in which random samples of bolts of different diameters are tested and their
strengths y determined.

Clearly, the value of x does not determine y exactly (different bolts with the same
diameter don't have the same strength), but we can consider attributes such as the average
value of y. In the experiment we can hold other factors more or less constant (e.g. the
ambient temperature, the way the force is applied, the metallurgical properties of the bolts),
so we feel that the observed larger average values of y for bolts of larger diameter x are due
to a causal relationship.

Note that even here we have to depart slightly from the idealized definition of cause
and effect. In particular, a bolt cannot have its diameter x changed so that we can see
if y changes. All we can do is consider two bolts that are as similar as possible, and are
subject to the same explanatory variates (aside from diameter). This difficulty arises in
many experimental studies.

Example 8.1.2 Smoking and lung cancer


Suppose that data have been collected on 10,000 persons aged 40-80 who have smoked
for at least 20 years, and 10,000 persons in the same age range who have not. There is
roughly the same distribution of ages in the two groups. The (hypothetical) data concerning
the numbers with lung cancer are as follows:

                     Lung Cancer   No Lung Cancer    Total
    Smokers                  500             9500   10,000
    Non-Smokers              100             9900   10,000

There are many more lung cancer cases among the smokers, but without further information
or assumptions we cannot conclude that a causal relationship (smoking causes lung cancer)
exists. Alternative explanations might explain some or all of the observed difference. (This
is an observational study and other possible explanatory variates are not controlled.) For
example, family history is an important factor in many cancers; maybe smoking is also
related to family history. Moreover, smoking tends to be connected with other factors such
as diet and alcohol consumption; these may explain some of the effect seen.
The last example illustrates that association (statistical dependence) between
two variates x and y does not imply that a causal relationship exists. Suppose for
example that we observe a positive correlation between x and y: higher values of x tend to
go with higher values of y in a unit. Then there are at least three "explanations" for this
correlation:

(1) x causes y (meaning x has a causative effect on y)

(2) y causes x

(3) some other variate(s) z cause both x and y.

We'll now consider the question of cause and effect in experimental and observational
studies in a little more detail.

8.2 Experimental Studies


Suppose we want to investigate whether a variate x has a causal effect on a response variate
y. In an experimental setting we can control the values of x that a unit "sees". In addition,
we can use one or both of the following devices for ruling out alternative explanations for
any observed changes in y that might be caused by x:

(1) Hold other possible explanatory variates fixed.

(2) Use randomization to control for other variates.

These devices are most simply explained via examples.

Example 8.2.1 Aspirin and the risk of stroke


Suppose 500 persons that are at high risk of stroke have agreed to take part in a clinical
trial to assess whether aspirin lowers the risk of stroke. These persons are representative
of a population of high risk individuals. The study is conducted by giving some persons
aspirin and some a placebo, then comparing the two groups in terms of the number of
strokes observed.

Other factors such as age, sex, weight, existence of high blood pressure, and diet also
may affect the risk of stroke. These variates obviously vary substantially across persons and
cannot be held constant or otherwise controlled. However, such studies use randomization
in the following way: among the study subjects, who gets aspirin and who gets a placebo
is determined by a random mechanism. For example, we might flip a coin (or draw a
random number from {0, 1}), with one outcome (say Heads) indicating a person is to be
given aspirin, and the other indicating that they get the placebo.

The effect of this randomization is to balance the other possible explanatory variates
in the two "treatment" groups (aspirin and placebo). Thus, if at the end of the study we
observe that 20% of the placebo subjects have had a stroke but only 9% of the aspirin
subjects have, then we can attribute the difference to the causative effect of the aspirin.
Here's how we rule out alternative explanations: suppose you claim that it's not the aspirin
but dietary factors and blood pressure that cause this observed effect. I respond that the
randomization procedure has led to those factors being balanced in the two treatment
groups. That is, the aspirin group and the placebo group both have similar variations in
dietary and blood pressure values across the subjects in the group. Thus, a difference in
the two groups should not be due to these factors.
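The balancing effect of randomization can be illustrated with a small simulation. The blood pressure values below are entirely hypothetical (not data from any actual trial), and the sketch is in Python for illustration:

```python
import random

random.seed(2021)

# Hypothetical baseline systolic blood pressures for 500 high-risk subjects
subjects = [random.gauss(140, 15) for _ in range(500)]

aspirin, placebo = [], []
for bp in subjects:
    # "Flip a coin" for each subject to assign a treatment group
    (aspirin if random.random() < 0.5 else placebo).append(bp)

mean_a = sum(aspirin) / len(aspirin)
mean_p = sum(placebo) / len(placebo)
diff = abs(mean_a - mean_p)   # typically small: the groups are balanced
```

The two group means come out close to each other even though blood pressure was never controlled; randomization, not matching, produced the balance.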

Example 8.2.2 Driving speed and fuel consumption


It is thought that fuel consumption in automobiles is greater at speeds in excess of 100
km per hour. (Some years ago during oil shortages, many U.S. states reduced speed limits
on freeways because of this.) A study is planned that will focus on freeway-type driving,
because fuel consumption is also affected by the amount of stopping and starting in town
driving, in addition to other factors.

In this case a decision was made to carry out an experimental study at a special paved
track owned by a car company. Obviously a lot of factors besides speed affect fuel consumption: for example, the type of car and engine, tire condition, fuel grade and the driver.
As a result, these factors were controlled in the study by balancing them across different
driving speeds. An experimental plan of the following type was employed.

8 cars of eight different types were used; each car was used for 8 test drives.

the cars were each driven twice for 600 km on the track at each of four speeds: 80,
100, 120, and 140 km/hr.

8 drivers were involved, each driving each of the 8 cars for one test, and each driving
two tests at each of the four speeds.

the cars had similar initial mileages and were carefully checked and serviced so as to
make them as comparable as possible; they used comparable fuels.

the drivers were instructed to drive steadily for the 600 km. Each was allowed a 30
minute rest stop after 300 km.

the order in which each driver did their 8 test drives was randomized. The track
was large enough that all 8 drivers could be on it at the same time. (The tests were
conducted over 8 days.)

The response variate was the amount of fuel consumed for each test drive. Obviously
in the analysis we must deal with the fact that the cars differ in size and engine type, and
their fuel consumption will depend on that as well as on driving speed. A simple approach
would be to add the fuel amounts consumed for the 16 test drives at each speed, and to
compare them (other methods are also possible). Then, for example, we might find that
the average consumption (across the 8 cars) at 80, 100, 120, and 140 km/hr were 43.0, 44.1,
45.8, and 47.2 liters respectively. Statistical methods of testing and estimation could then
be used to test or estimate the differences in average fuel consumption at each of the four
speeds. (Can you think of a way to do this?)

Exercise Suppose that statistical tests demonstrated a significant difference in consumption across the four driving speeds, with lower speeds giving lower consumption. What (if
any) qualifications would you have about concluding there is a causal relationship?

8.3 Observational Studies


In observational studies there are often unmeasured factors that affect the response variate
y. If these factors are also related to the explanatory variate x whose (potential) causal
effect we are trying to assess, then we cannot easily make any inferences about causation.
For this reason, we try in observational studies to measure other important factors besides
x.

For example, Problem 14 at the end of Chapter 7 discusses an observational study on
whether rust-proofing prevents rust. It is clear that an unmeasured factor is the care a car
owner takes in looking after a vehicle; this could quite likely be related to whether a person
decides to have their car rust-proofed.

The following example shows how we must take note of other variates that may affect y.

Example 8.3.1 Graduate studies admissions


Suppose that over a five year period, the applications and admissions to graduate studies
in Engineering and Arts faculties in a university are as follows:

                             No. Applied   No. Admitted   % Admitted
    Engineering   Men               1000            600          60%
                  Women              200            150          75%
    Arts          Men               1000            400          40%
                  Women             1800            800          44%
    Total         Men               2000           1000          50%
                  Women             2000            950        47.5%

We want to see if females have a lower probability of admission than males. If we looked
only at the totals for Engineering plus Arts, then it would appear that the probability a
male applicant is admitted is a little higher than the probability for a female applicant.
However, if we look separately at Arts and Engineering, we see the probability of females
being admitted appears higher in each case! The reason for the reverse direction in the
totals is that Engineering has a higher admission rate than Arts, but the fraction of women
applying to Engineering is much lower than for Arts.

In cause and effect language, we would say that the faculty one applies to (i.e. Engineering or Arts) is a causative factor with respect to probability of admission. Furthermore,
it is related to the sex (male or female) of an applicant, so we cannot ignore it in trying to
see if sex is also a causative factor.

Remark The feature illustrated in the example above is sometimes called Simpson's Paradox. In probabilistic terms, it says that for events A, B1, B2 and C1, ..., Ck, we can have

    P(A | B1 ∩ Ci) > P(A | B2 ∩ Ci)   for each i = 1, 2, ..., k

but have

    P(A | B1) < P(A | B2).

(Note that P(A | B1) = Σ_{i=1}^{k} P(A | B1 ∩ Ci) P(Ci | B1), and similarly for P(A | B2), so they depend
on what P(Ci | B1) and P(Ci | B2) are.) In the example above we can take B1 = {person
is female}, B2 = {person is male}, C1 = {person applies to Engineering}, C2 = {person
applies to Arts}, and A = {person is admitted}.

Exercise Write down estimated probabilities for the various events based on Example
8.3.1, and so illustrate Simpson’s paradox.
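A quick numerical check of the exercise, using the counts from Example 8.3.1 (illustrative Python; the estimated probabilities are simply admitted/applied):

```python
# Admission counts from Example 8.3.1: (applied, admitted)
eng = {"men": (1000, 600), "women": (200, 150)}
arts = {"men": (1000, 400), "women": (1800, 800)}

def rate(applied, admitted):
    return admitted / applied

# Within each faculty, the estimated admission probability is higher for women...
assert rate(*eng["women"]) > rate(*eng["men"])      # 0.75 > 0.60
assert rate(*arts["women"]) > rate(*arts["men"])    # 0.444... > 0.40

# ...but in the combined totals it is lower for women: Simpson's paradox.
tot_m = (eng["men"][0] + arts["men"][0], eng["men"][1] + arts["men"][1])
tot_w = (eng["women"][0] + arts["women"][0], eng["women"][1] + arts["women"][1])
overall_men, overall_women = rate(*tot_m), rate(*tot_w)
```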

Epidemiologists (specialists in the study of disease) have developed guidelines or criteria
which should be met in order to argue that a causal association exists between a risk factor x
and a disease (represented by a response variate y = I(person has the disease), for example)
in the case in which an experimental study cannot be conducted. These include:

The association between x and y must be observed in many studies of different types
among different groups. This reduces the chance that an observed association is due
to a defect in one type of study or a peculiarity in one group of subjects.

The association between x and y must continue to hold when the effects of plausible
confounding variates are taken into account.

There must be a plausible scientific explanation for the direct influence of x on y, so
that a causal link does not depend on the observed association alone.

There must be a consistent response, that is, y always increases (decreases) when x
increases.

Example 8.3.2 Smoking and lung cancer


The claim that cigarette smoking causes lung cancer meets these four criteria. A strong
association has been observed in numerous studies in many countries. Many possible sources
of confounding variates have been examined in these studies and have not been found to
explain the association. For example, data about nonsmokers who are exposed to second-hand smoke contradict the genetic hypothesis. Animal experiments have demonstrated
conclusively that tobacco smoke contains substances that cause cancerous tumors. Therefore there is a known pathway by which smoking causes lung cancer. The lung cancer rates
for ex-smokers decrease over time since smoking cessation. The evidence for causation here
is about as strong as non-experimental evidence can be.

Similar criteria apply to other scientific areas of research.

8.4 Clofibrate Study


In the early seventies, the Coronary Drug Research Group implemented a large medical
trial [16] in order to evaluate an experimental drug, clofibrate, for its effect on the risk of
heart attacks in middle-aged people with heart trouble. Clofibrate operates by reducing
the cholesterol level in the blood and thereby potentially reducing the risk of heart disease.

Study I: An Experimental Plan

Problem

Investigate the effect of clofibrate on the risk of fatal heart attack for patients with a
history of a previous heart attack.

The target population consists of all individuals with a previous non-fatal heart attack
who are at risk for a subsequent heart attack. The response of interest is the occurrence/non-occurrence of a fatal heart attack. This is primarily a causative problem in that the investigators are interested in determining whether the prescription of clofibrate causes a reduction
in the risk of subsequent heart attack. The fishbone diagram (Figure 8.1) indicates a broad
variety of factors affecting the occurrence (or not) of a heart attack.

Plan

The study population consists of men aged 30 to 64 who had a previous heart attack not
more than three months prior to initial contact. The sample consists of subjects from the
study population who were contacted by participating physicians, asked to participate in
the study, and provided informed consent. (All patients eligible to participate had to sign a
consent form to participate in the study. The consent form usually describes the current state
of knowledge regarding the best available relevant treatments, the potential advantages and
disadvantages of the new treatment, and the overall purpose of the study.)

The following treatment protocol was developed:

[16] The Coronary Drug Research Group, New England Journal of Medicine (1980), p. 1038.

[Figure 8.1: Fishbone diagram for the clofibrate example. Factors affecting the occurrence of a fatal heart attack are grouped under Measurement (follow-up time, follow-up method, definition of heart attack), Material (drug, dose), Personnel (age, stress, mental health, diet, personality type, gender, exercise, smoking status, drinking status, medications, family history, physical traits, personal history, doctor), Environment (weather, location, work environment, home environment), and Methods (method of administration, dose, when taken, frequency of drug).]

Randomly assign eligible men to either clofibrate or placebo treatment groups. (This
is an attempt to make the clofibrate and placebo groups alike with respect to most explanatory variates other than the focal explanatory variate. See the fishbone diagram
above.)

Administer treatments in identical capsules in a double-blinded fashion. (In this context, double-blind means that neither the patient nor the individual administering the
treatment knows if it is clofibrate or placebo; only the person heading the investigation knows. This is to avoid differential reporting rates from physicians enthusiastic
about the new drug - a form of measurement error.)

Follow patients for 5 years and record the occurrence of any fatal heart attacks experienced in either treatment group.

Determination of whether a fatality was attributable to a heart attack or not is based
on electrocardiograms and physical examinations by physicians.

Data

1,103 patients were assigned to clofibrate and 2,789 were assigned to the placebo
group.

221 of the patients in the clofibrate group died and 586 of the patients in the placebo
group died.
Analysis

The proportions of patients in the two groups having subsequent fatal heart attacks
(clofibrate: 221/1103 = 0.20 and placebo: 586/2789 = 0.21) are comparable.
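The informal comparison of these two proportions can be backed by the likelihood ratio test of Chapter 7; the hypothesis is that the probability of a fatal heart attack is the same in both treatment groups. A sketch (illustrative Python, our variable names):

```python
import math

# Fatal heart attack / survived, by treatment group
f = [[221, 1103 - 221],    # clofibrate
     [586, 2789 - 586]]    # placebo

row = [sum(r) for r in f]
col = [sum(c) for c in zip(*f)]
n = sum(row)
e = [[ri * cj / n for cj in col] for ri in row]   # expected frequencies

lam = 2 * sum(f[i][j] * math.log(f[i][j] / e[i][j])
              for i in range(2) for j in range(2))

def phi(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

pvalue = 2 * (1 - phi(math.sqrt(lam)))   # df = 1
```

The statistic is about 0.46 with a p-value near 0.5, so there is no evidence of a difference between the two groups.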

Conclusions

Based on these data we would conclude that clofibrate does not reduce mortality due
to heart attacks in high risk patients.

This conclusion has several limitations. For example, study error has been introduced
by restricting the study population to male subjects alone. While clofibrate might be
discarded as a beneficial treatment for the target population, there is no information
in this study regarding its effects on female patients at risk for secondary heart attacks.

Study II: An Observational Plan

Supplementary analyses indicate that one reason clofibrate did not appear to save
lives might be that the patients in the clofibrate group did not take their medicine. It
was therefore of interest to investigate the potential benefit of clofibrate for patients who
adhered to their medication program.

Subjects who took more than 80% of their prescribed treatment were called "adherers"
to the protocol.

Problem

Investigate the occurrence of fatal heart attacks in the group of patients assigned to
clofibrate who were adherers.

The remaining parts of the problem stage are as before.

Plan

Compare the occurrence of heart attacks in patients assigned to clofibrate who maintained the designated treatment schedule with the patients assigned to clofibrate who
abandoned their assigned treatment schedule.

Note that this is a further reduction of the study population.

Data

In the clofibrate group, 708 patients were adherers and 357 were non-adherers. The
remaining 38 patients could not be classified as adherers or non-adherers and so were
excluded from this analysis. Of the 708 adherers, 106 had a fatal heart attack during
the five years of follow up. Of the 357 non-adherers, 88 had a fatal heart attack during
the five years of follow up.

Analysis

The proportion of adherers suffering a subsequent heart attack is 106/708 = 0.15,
while this proportion for the non-adherers is 88/357 = 0.25.

Conclusions

It would appear based on these data that clofibrate does reduce mortality due to
heart attack for high risk patients if properly administered.

However, great care must be taken in interpreting the above results since they are
based on an observational plan. While the data were collected based on an experimental plan, only the treatment was controlled. The comparison of the mortality
rates between the adherers and non-adherers is based on an explanatory variate (adherence) that was not controlled in the original experiment. The investigators did not
decide who would adhere to the protocol and who would not; the subjects decided
themselves.

Now the possibility of confounding is substantial. Perhaps adherers are more health
conscious and exercised more or ate a healthier diet. Detailed measurements of these
variates are needed to control for them and reduce the possibility of confounding.
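For contrast with Study I, the adherer versus non-adherer comparison is strongly significant by the same likelihood ratio test, which is precisely why the confounding concern matters: a large observed difference can still have a non-causal explanation. A sketch (illustrative Python):

```python
import math

# Fatal heart attack / survived, within the clofibrate group
f = [[106, 708 - 106],     # adherers
     [88, 357 - 88]]       # non-adherers

row = [sum(r) for r in f]
col = [sum(c) for c in zip(*f)]
n = sum(row)
e = [[ri * cj / n for cj in col] for ri in row]

lam = 2 * sum(f[i][j] * math.log(f[i][j] / e[i][j])
              for i in range(2) for j in range(2))

def phi(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

pvalue = 2 * (1 - phi(math.sqrt(lam)))   # df = 1
```

The statistic is around 14 with a p-value well below 0.001, yet, as discussed above, adherence was not randomized, so this difference cannot be given a causal interpretation.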

8.5 Chapter 8 Problems


1. In an Ontario study, 50267 live births were classified according to the baby's weight
(less than or greater than 2.5 kg) and according to the mother's smoking habits (non-smoker, 1-20 cigarettes per day, or more than 20 cigarettes per day). The results were
as follows:

    No. of cigarettes       0    1-20   > 20
    Weight <= 2.5        1322    1186    793
    Weight > 2.5        27036   14142   5788

(a) Test the hypothesis that birth weight is independent of the mother's smoking
habits.
(b) Explain why it is that these results do not prove that birth weights would increase
if mothers stopped smoking during pregnancy. How should a study to obtain
such proof be designed?
(c) A similar, though weaker, association exists between birth weight and the amount
smoked by the father. Explain why this is to be expected even if the father's
smoking habits are irrelevant.
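For part (a), the test of independence for this 2x3 table has (2-1)(3-1) = 2 degrees of freedom, and for two degrees of freedom the chi-squared survival function has the closed form P(X > x) = e^{-x/2}. A sketch (illustrative Python, our variable names):

```python
import math

# Birth weight (rows) by mother's smoking habits (columns: 0, 1-20, >20 per day)
f = [[1322, 1186, 793],
     [27036, 14142, 5788]]

row = [sum(r) for r in f]
col = [sum(c) for c in zip(*f)]
n = sum(row)
e = [[ri * cj / n for cj in col] for ri in row]   # expected frequencies

lam = 2 * sum(f[i][j] * math.log(f[i][j] / e[i][j])
              for i in range(2) for j in range(3))
df = (2 - 1) * (3 - 1)

# Closed-form chi-squared(2) p-value
pvalue = math.exp(-lam / 2)
```

The statistic is enormous (several hundred), so the hypothesis of independence is emphatically rejected; parts (b) and (c) address why this alone does not establish causation.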

2. One hundred and fifty Statistics students took part in a study to evaluate computer-assisted instruction (CAI). Seventy-five received the standard lecture course while
the other 75 received some CAI. All 150 students then wrote the same examination.
Fifteen students in the standard course and 29 of those in the CAI group received a
mark over 80%.

(a) Are these results consistent with the hypothesis that the probability of achieving
a mark over 80% is the same for both groups?
(b) Based on these results, the instructor concluded that CAI increases the chances
of a mark over 80%. How should the study have been carried out in order for
this conclusion to be valid?

3. (a) The following data were collected some years ago in a study of possible sex bias
in graduate admissions at a large university:
in graduate admissions at a large university:

Admitted Not admitted


Male applicants 3738 4704
Female applicants 1494 2827

Test the hypothesis that admission status is independent of sex. Do these data
indicate a lower admission rate for females?

(b) The following table shows the numbers of male and female applicants and the
percentages admitted for the six largest graduate programs in (a):

Men Women
Program Applicants % Admitted Applicants % Admitted
A 825 62 108 82
B 560 63 25 68
C 325 37 593 34
D 417 33 375 35
E 191 28 393 24
F 373 6 341 7

Test the independence of admission status and sex for each program. Do any of
the programs show evidence of a bias against female applicants?
(c) Why is it that the totals in (a) seem to indicate a bias against women, but the
results for individual programs in (b) do not?

4. To assess the (presumed) beneficial effects of rust-proofing cars, a manufacturer randomly selected 200 cars that were sold 5 years earlier and were still used by the original
buyers. One hundred cars were selected from purchases where the rust-proofing option package was included, and one hundred from purchases where it was not (and
where the buyer did not subsequently get the car rust-proofed by a third party).
The amount of rust on the vehicles was measured on a scale in which the responses Y
were assumed to have a Gaussian distribution. For the rust-proofed cars the responses
were assumed to be G(μ1, σ1) and for the non-rust-proofed cars the responses were
assumed to be G(μ2, σ2). Sample means and standard deviations for the two sets of
cars were (higher y means more rust):

    Rust-proofed cars         ȳ1 = 11.7   s1 = 2.1
    Non-rust-proofed cars     ȳ2 = 12.0   s2 = 2.4

(a) Test the hypothesis that there is no difference between the mean amount of rust
for rust-proofed cars as compared to non-rust-proofed cars.
(b) The manufacturer was surprised to find that the data did not show a beneficial
effect of rust-proofing. Describe problems with their study and outline how you
might carry out a study designed to demonstrate a causal effect of rust-proofing.

5. In Chapter 6, Problem 11 there was strong evidence against the hypothesis of no
relationship between death rate from cirrhosis of the liver and wine consumption per
capita in 46 states in the United States. Based on this study is it possible to conclude
a causal relationship between wine consumption and cirrhosis of the liver?

6. Chapter 6, Problem 13 contained data, collected by the British botanist Joseph
Hooker in the Himalaya Mountains between 1848 and 1850, on atmospheric pressure
and the boiling point of water. Was this an experimental study or an observational
study? Based on these data can you conclude that the boiling point of water affects
atmospheric pressure?

7. In randomized clinical trials that compare two (or more) medical treatments it is
customary not to let either the subject or their physician know which treatment they
have been randomly assigned. These are referred to as double blind studies.
Discuss why doing a double blind study is a good idea in an experimental study.

8. Public health researchers want to study whether specifically designed educational
programs about the effects of cigarette smoking have the effect of discouraging people
from smoking. One particular program is delivered to students in grade 9, with follow-up in grade 11 to determine each student's smoking "history". Briefly discuss some
factors you would want to consider in designing such a study, and how you might
address them.
9. REFERENCES AND SUPPLEMENTARY RESOURCES

9.1 References

R.J. Mackay and R.W. Oldford (2001). Statistics 231: Empirical Problem Solving (Stat
231 Course Notes).

C.J. Wild and G.A.F. Seber (1999). Chance Encounters: A First Course in Data Analysis
and Inference. John Wiley and Sons, New York.

J. Utts (2003). What Educated Citizens Should Know About Statistics and Probability.
The American Statistician 57, 74-79.

9.2 Departmental Web Resources


See www.watstat.ca

10. DISTRIBUTIONS AND STATISTICAL TABLES
Summary of Discrete Distributions

Each entry lists the probability function f(y), the mean E(Y), the variance Var(Y), and
the moment generating function M(t). Here C(n, y) denotes the binomial coefficient
n!/[y!(n − y)!].

Discrete Uniform(a, b); a, b integers, b ≥ a:
  f(y) = 1/(b − a + 1),  y = a, a + 1, …, b
  E(Y) = (a + b)/2
  Var(Y) = [(b − a + 1)² − 1]/12
  M(t) = [1/(b − a + 1)] Σ_{x=a}^{b} e^{tx},  t ∈ ℝ

Hypergeometric(N, r, n); N = 1, 2, …; n = 0, 1, …, N; r = 0, 1, …, N:
  f(y) = C(r, y) C(N − r, n − y) / C(N, n),  y = max(0, n − N + r), …, min(r, n)
  E(Y) = nr/N
  Var(Y) = n(r/N)(1 − r/N)(N − n)/(N − 1)
  M(t): not tractable

Binomial(n, p); 0 ≤ p ≤ 1, q = 1 − p, n = 1, 2, …:
  f(y) = C(n, y) p^y q^{n−y},  y = 0, 1, …, n
  E(Y) = np
  Var(Y) = npq
  M(t) = (p e^t + q)^n,  t ∈ ℝ

Bernoulli(p); 0 ≤ p ≤ 1, q = 1 − p:
  f(y) = p^y q^{1−y},  y = 0, 1
  E(Y) = p
  Var(Y) = pq
  M(t) = p e^t + q,  t ∈ ℝ

Negative Binomial(k, p); 0 < p ≤ 1, q = 1 − p, k = 1, 2, …:
  f(y) = C(y + k − 1, y) p^k q^y = C(−k, y) p^k (−q)^y,  y = 0, 1, …
  E(Y) = kq/p
  Var(Y) = kq/p²
  M(t) = [p/(1 − q e^t)]^k,  t < −ln(q)

Geometric(p); 0 < p ≤ 1, q = 1 − p:
  f(y) = p q^y,  y = 0, 1, …
  E(Y) = q/p
  Var(Y) = q/p²
  M(t) = p/(1 − q e^t),  t < −ln(q)

Poisson(μ); μ ≥ 0:
  f(y) = e^{−μ} μ^y / y!,  y = 0, 1, …
  E(Y) = μ
  Var(Y) = μ
  M(t) = e^{μ(e^t − 1)},  t ∈ ℝ

Multinomial(n; p₁, p₂, …, p_k); 0 ≤ p_i ≤ 1 for i = 1, 2, …, k and Σ_{i=1}^{k} p_i = 1:
  f(y₁, y₂, …, y_k) = [n!/(y₁! y₂! ⋯ y_k!)] p₁^{y₁} p₂^{y₂} ⋯ p_k^{y_k},
    y_i = 0, 1, …, n for each i, with Σ_{i=1}^{k} y_i = n
  E(Y_i) = n p_i,  Var(Y_i) = n p_i(1 − p_i),  i = 1, 2, …, k
  M(t₁, t₂, …, t_{k−1}) = (p₁ e^{t₁} + p₂ e^{t₂} + ⋯ + p_{k−1} e^{t_{k−1}} + p_k)^n,  t_i ∈ ℝ
Summary of Continuous Distributions

Each entry lists the probability density function f(y), the mean E(Y), the variance
Var(Y), and the moment generating function M(t).

Uniform(a, b); b > a:
  f(y) = 1/(b − a),  a ≤ y ≤ b
  E(Y) = (a + b)/2
  Var(Y) = (b − a)²/12
  M(t) = (e^{bt} − e^{at}) / [(b − a)t] for t ≠ 0;  M(0) = 1

Exponential(θ); θ > 0:
  f(y) = (1/θ) e^{−y/θ},  y ≥ 0
  E(Y) = θ
  Var(Y) = θ²
  M(t) = 1/(1 − θt),  t < 1/θ

N(μ, σ²) = G(μ, σ); μ ∈ ℝ, σ² > 0:
  f(y) = [1/(σ√(2π))] e^{−(y − μ)²/(2σ²)},  y ∈ ℝ
  E(Y) = μ
  Var(Y) = σ²
  M(t) = e^{μt + σ²t²/2},  t ∈ ℝ

χ²(k); k = 1, 2, …:
  f(y) = y^{k/2 − 1} e^{−y/2} / [2^{k/2} Γ(k/2)],  y > 0,
    where Γ(a) = ∫₀^∞ x^{a−1} e^{−x} dx
  E(Y) = k
  Var(Y) = 2k
  M(t) = (1 − 2t)^{−k/2},  t < 1/2

Student t(k); k = 1, 2, …:
  f(y) = c_k [1 + y²/k]^{−(k+1)/2},  y ∈ ℝ,  where c_k = Γ((k + 1)/2) / [√(kπ) Γ(k/2)]
  E(Y) = 0 for k = 2, 3, … (does not exist for k = 1)
  Var(Y) = k/(k − 2) for k = 3, 4, … (does not exist for k = 1, 2)
  M(t): does not exist
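For the continuous distributions, the mean and variance formulas can be checked by numerical integration of the density. The sketch below (illustrative only; the value θ = 2 is an arbitrary choice) uses a midpoint Riemann sum to confirm the Exponential(θ) entries:

```python
import math

def exp_pdf(y, theta):
    """Exponential density f(y) = (1/theta) * exp(-y/theta), y >= 0."""
    return math.exp(-y / theta) / theta

theta = 2.0  # arbitrary illustrative value
dy = 1e-3
# Midpoint rule on [0, 20*theta]; the truncated tail mass is about e^{-20}.
ys = [(i + 0.5) * dy for i in range(int(20 * theta / dy))]
mean = sum(y * exp_pdf(y, theta) * dy for y in ys)
var = sum((y - mean) ** 2 * exp_pdf(y, theta) * dy for y in ys)

assert abs(mean - theta) < 1e-3        # table: E(Y) = theta
assert abs(var - theta ** 2) < 1e-2    # table: Var(Y) = theta^2
```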
N(0,1) Cumulative Distribution Function

This table gives values of F(x) = P(X ≤ x) for X ~ N(0,1) and x ≥ 0. Read x as the
row value plus the column value; for example, F(1.65) is found in row 1.6, column 0.05.


x  0.00  0.01  0.02  0.03  0.04  0.05  0.06  0.07  0.08  0.09 
0.0  0.50000  0.50399  0.50798  0.51197  0.51595  0.51994  0.52392  0.52790  0.53188  0.53586 
0.1  0.53983  0.54380  0.54776  0.55172  0.55567  0.55962  0.56356  0.56749  0.57142  0.57535 
0.2  0.57926  0.58317  0.58706  0.59095  0.59483  0.59871  0.60257  0.60642  0.61026  0.61409 
0.3  0.61791  0.62172  0.62552  0.62930  0.63307  0.63683  0.64058  0.64431  0.64803  0.65173 
0.4  0.65542  0.65910  0.66276  0.66640  0.67003  0.67364  0.67724  0.68082  0.68439  0.68793 
0.5  0.69146  0.69497  0.69847  0.70194  0.70540  0.70884  0.71226  0.71566  0.71904  0.72240 
0.6  0.72575  0.72907  0.73237  0.73565  0.73891  0.74215  0.74537  0.74857  0.75175  0.75490 
0.7  0.75804  0.76115  0.76424  0.76730  0.77035  0.77337  0.77637  0.77935  0.78230  0.78524 
0.8  0.78814  0.79103  0.79389  0.79673  0.79955  0.80234  0.80511  0.80785  0.81057  0.81327 
0.9  0.81594  0.81859  0.82121  0.82381  0.82639  0.82894  0.83147  0.83398  0.83646  0.83891 
1.0  0.84134  0.84375  0.84614  0.84849  0.85083  0.85314  0.85543  0.85769  0.85993  0.86214 
1.1  0.86433  0.86650  0.86864  0.87076  0.87286  0.87493  0.87698  0.87900  0.88100  0.88298 
1.2  0.88493  0.88686  0.88877  0.89065  0.89251  0.89435  0.89617  0.89796  0.89973  0.90147 
1.3  0.90320  0.90490  0.90658  0.90824  0.90988  0.91149  0.91309  0.91466  0.91621  0.91774 
1.4  0.91924  0.92073  0.92220  0.92364  0.92507  0.92647  0.92785  0.92922  0.93056  0.93189 
1.5  0.93319  0.93448  0.93574  0.93699  0.93822  0.93943  0.94062  0.94179  0.94295  0.94408 
1.6  0.94520  0.94630  0.94738  0.94845  0.94950  0.95053  0.95154  0.95254  0.95352  0.95449 
1.7  0.95543  0.95637  0.95728  0.95818  0.95907  0.95994  0.96080  0.96164  0.96246  0.96327 
1.8  0.96407  0.96485  0.96562  0.96638  0.96712  0.96784  0.96856  0.96926  0.96995  0.97062 
1.9  0.97128  0.97193  0.97257  0.97320  0.97381  0.97441  0.97500  0.97558  0.97615  0.97670 
2.0  0.97725  0.97778  0.97831  0.97882  0.97932  0.97982  0.98030  0.98077  0.98124  0.98169 
2.1  0.98214  0.98257  0.98300  0.98341  0.98382  0.98422  0.98461  0.98500  0.98537  0.98574 
2.2  0.98610  0.98645  0.98679  0.98713  0.98745  0.98778  0.98809  0.98840  0.98870  0.98899 
2.3  0.98928  0.98956  0.98983  0.99010  0.99036  0.99061  0.99086  0.99111  0.99134  0.99158 
2.4  0.99180  0.99202  0.99224  0.99245  0.99266  0.99286  0.99305  0.99324  0.99343  0.99361 
2.5  0.99379  0.99396  0.99413  0.99430  0.99446  0.99461  0.99477  0.99492  0.99506  0.99520 
2.6  0.99534  0.99547  0.99560  0.99573  0.99585  0.99598  0.99609  0.99621  0.99632  0.99643 
2.7  0.99653  0.99664  0.99674  0.99683  0.99693  0.99702  0.99711  0.99720  0.99728  0.99736 
2.8  0.99744  0.99752  0.99760  0.99767  0.99774  0.99781  0.99788  0.99795  0.99801  0.99807 
2.9  0.99813  0.99819  0.99825  0.99831  0.99836  0.99841  0.99846  0.99851  0.99856  0.99861 
3.0  0.99865  0.99869  0.99874  0.99878  0.99882  0.99886  0.99889  0.99893  0.99896  0.99900 
3.1  0.99903  0.99906  0.99910  0.99913  0.99916  0.99918  0.99921  0.99924  0.99926  0.99929 
3.2  0.99931  0.99934  0.99936  0.99938  0.99940  0.99942  0.99944  0.99946  0.99948  0.99950 
3.3  0.99952  0.99953  0.99955  0.99957  0.99958  0.99960  0.99961  0.99962  0.99964  0.99965 
3.4  0.99966  0.99968  0.99969  0.99970  0.99971  0.99972  0.99973  0.99974  0.99975  0.99976 
3.5  0.99977  0.99978  0.99978  0.99979  0.99980  0.99981  0.99981  0.99982  0.99983  0.99983 
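Entries of this table can be reproduced with any statistical software; in R the function is pnorm. For a quick cross-check without any packages, Python's standard library (3.8+) provides the N(0,1) CDF via statistics.NormalDist:

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mu = 0, sigma = 1

# Row 1.6, column 0.05 of the table is F(1.65) = 0.95053
assert abs(Z.cdf(1.65) - 0.95053) < 1e-5
# The table covers x >= 0 only; symmetry gives F(-x) = 1 - F(x)
assert abs(Z.cdf(-1.0) - (1 - 0.84134)) < 1e-5
```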
 

N(0,1) Quantiles: This table gives values of F⁻¹(p) for p ≥ 0.5. Read p as the row value plus the column value; for example, the p = 0.975 quantile is found in row 0.9, column 0.075.


p  0.00  0.01  0.02  0.03  0.04  0.05  0.06  0.07  0.075  0.08  0.09  0.095 
0.5  0.0000  0.0251  0.0502  0.0753  0.1004  0.1257  0.1510  0.1764  0.1891  0.2019  0.2275  0.2404 
0.6  0.2533  0.2793  0.3055  0.3319  0.3585  0.3853  0.4125  0.4399  0.4538  0.4677  0.4959  0.5101 
0.7  0.5244  0.5534  0.5828  0.6128  0.6433  0.6745  0.7063  0.7388  0.7554  0.7722  0.8064  0.8239 
0.8  0.8416  0.8779  0.9154  0.9542  0.9945  1.0364  1.0803  1.1264  1.1503  1.1750  1.2265  1.2536 
0.9  1.2816  1.3408  1.4051  1.4758  1.5548  1.6449  1.7507  1.8808  1.9600  2.0537  2.3263  2.5758 
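The quantiles can likewise be cross-checked against software (qnorm in R, or the standard-library inverse CDF in Python shown below):

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mu = 0, sigma = 1

# Row 0.9: column 0.05 is the 0.95 quantile, column 0.075 the 0.975 quantile
assert abs(Z.inv_cdf(0.95) - 1.6449) < 5e-4
assert abs(Z.inv_cdf(0.975) - 1.9600) < 5e-4
```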
Chi‐Squared Quantiles
This table gives values of x for which p = P(X ≤ x) = F(x), where X has a chi-squared distribution with df degrees of freedom.
df\p 0.005 0.01 0.025 0.05 0.1 0.9 0.95 0.975 0.99 0.995
1 0.000 0.000 0.001 0.004 0.016 2.706 3.842 5.024 6.635 7.879
2 0.010 0.020 0.051 0.103 0.211 4.605 5.992 7.378 9.210 10.597
3 0.072 0.115 0.216 0.352 0.584 6.251 7.815 9.348 11.345 12.838
4 0.207 0.297 0.484 0.711 1.064 7.779 9.488 11.143 13.277 14.860
5 0.412 0.554 0.831 1.146 1.610 9.236 11.070 12.833 15.086 16.750
6 0.676 0.872 1.237 1.635 2.204 10.645 12.592 14.449 16.812 18.548
7 0.989 1.239 1.690 2.167 2.833 12.017 14.067 16.013 18.475 20.278
8 1.344 1.647 2.180 2.733 3.490 13.362 15.507 17.535 20.090 21.955
9 1.735 2.088 2.700 3.325 4.168 14.684 16.919 19.023 21.666 23.589
10 2.156 2.558 3.247 3.940 4.865 15.987 18.307 20.483 23.209 25.188
11 2.603 3.054 3.816 4.575 5.578 17.275 19.675 21.920 24.725 26.757
12 3.074 3.571 4.404 5.226 6.304 18.549 21.026 23.337 26.217 28.300
13 3.565 4.107 5.009 5.892 7.042 19.812 22.362 24.736 27.688 29.819
14 4.075 4.660 5.629 6.571 7.790 21.064 23.685 26.119 29.141 31.319
15 4.601 5.229 6.262 7.261 8.547 22.307 24.996 27.488 30.578 32.801
16 5.142 5.812 6.908 7.962 9.312 23.542 26.296 28.845 32.000 34.267
17 5.697 6.408 7.564 8.672 10.085 24.769 27.587 30.191 33.409 35.718
18 6.265 7.015 8.231 9.391 10.865 25.989 28.869 31.526 34.805 37.156
19 6.844 7.633 8.907 10.117 11.651 27.204 30.144 32.852 36.191 38.582
20 7.434 8.260 9.591 10.851 12.443 28.412 31.410 34.170 37.566 39.997
21 8.034 8.897 10.283 11.591 13.240 29.615 32.671 35.479 38.932 41.401
22 8.643 9.542 10.982 12.338 14.041 30.813 33.924 36.781 40.289 42.796
23 9.260 10.196 11.689 13.091 14.848 32.007 35.172 38.076 41.638 44.181
24 9.886 10.856 12.401 13.848 15.659 33.196 36.415 39.364 42.980 45.559
25 10.520 11.524 13.120 14.611 16.473 34.382 37.652 40.646 44.314 46.928
26 11.160 12.198 13.844 15.379 17.292 35.563 38.885 41.923 45.642 48.290
27 11.808 12.879 14.573 16.151 18.114 36.741 40.113 43.195 46.963 49.645
28 12.461 13.565 15.308 16.928 18.939 37.916 41.337 44.461 48.278 50.993
29 13.121 14.256 16.047 17.708 19.768 39.087 42.557 45.722 49.588 52.336
30 13.787 14.953 16.791 18.493 20.599 40.256 43.773 46.979 50.892 53.672
40 20.707 22.164 24.433 26.509 29.051 51.805 55.758 59.342 63.691 66.766
50 27.991 29.707 32.357 34.764 37.689 63.167 67.505 71.420 76.154 79.490
60 35.534 37.485 40.482 43.188 46.459 74.397 79.082 83.298 88.379 91.952
70 43.275 45.442 48.758 51.739 55.329 85.527 90.531 95.023 100.430 104.210
80 51.172 53.540 57.153 60.391 64.278 96.578 101.880 106.630 112.330 116.320
90 59.196 61.754 65.647 69.126 73.291 107.570 113.150 118.140 124.120 128.300
100 67.328 70.065 74.222 77.929 82.358 118.500 124.340 129.560 135.810 140.170
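The chi-squared CDF has no closed form in general, but it can be evaluated with a standard series for the regularized incomplete gamma function. The sketch below is a standard numerical recipe, not a method from the notes; in R the equivalent check is pchisq(x, df). It reproduces two entries of the table:

```python
import math

def chi2_cdf(x, df):
    """P(X <= x) for X ~ chi-squared(df), via the series
    P(s, u) = u^s e^{-u} * sum_{n>=0} u^n / Gamma(s + n + 1),
    with s = df/2 and u = x/2."""
    s, u = df / 2.0, x / 2.0
    term = math.exp(s * math.log(u) - u - math.lgamma(s + 1))
    total, n = term, 0
    while term > 1e-15 * total:
        n += 1
        term *= u / (s + n)
        total += term
    return total

# df = 10 row, p = 0.95 column of the table gives x = 18.307
assert abs(chi2_cdf(18.307, 10) - 0.95) < 5e-4
# For df = 2 the CDF is exactly 1 - e^{-x/2}; table gives x = 4.605 for p = 0.9
assert abs(chi2_cdf(4.605, 2) - (1 - math.exp(-4.605 / 2))) < 1e-9
```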

 
Student t Quantiles
This table gives values of x for which p = P(X ≤ x) = F(x), where X has a Student t distribution with df degrees of freedom, for p ≥ 0.6.

df \ p 0.6 0.7 0.8 0.9 0.95 0.975 0.99 0.995 0.999 0.9995
1 0.3249 0.7265 1.3764 3.0777 6.3138 12.7062 31.8205 63.6567 318.3088 636.6192
2 0.2887 0.6172 1.0607 1.8856 2.9200 4.3027 6.9646 9.9248 22.3271 31.5991
3 0.2767 0.5844 0.9785 1.6377 2.3534 3.1824 4.5407 5.8409 10.2145 12.9240
4 0.2707 0.5686 0.9410 1.5332 2.1318 2.7764 3.7469 4.6041 7.1732 8.6103
5 0.2672 0.5594 0.9195 1.4759 2.0150 2.5706 3.3649 4.0321 5.8934 6.8688
6 0.2648 0.5534 0.9057 1.4398 1.9432 2.4469 3.1427 3.7074 5.2076 5.9588
7 0.2632 0.5491 0.8960 1.4149 1.8946 2.3646 2.9980 3.4995 4.7853 5.4079
8 0.2619 0.5459 0.8889 1.3968 1.8595 2.3060 2.8965 3.3554 4.5008 5.0413
9 0.2610 0.5435 0.8834 1.3830 1.8331 2.2622 2.8214 3.2498 4.2968 4.7809
10 0.2602 0.5415 0.8791 1.3722 1.8125 2.2281 2.7638 3.1693 4.1437 4.5869
11 0.2596 0.5399 0.8755 1.3634 1.7959 2.2010 2.7181 3.1058 4.0247 4.4370
12 0.2590 0.5386 0.8726 1.3562 1.7823 2.1788 2.6810 3.0545 3.9296 4.3178
13 0.2586 0.5375 0.8702 1.3502 1.7709 2.1604 2.6503 3.0123 3.8520 4.2208
14 0.2582 0.5366 0.8681 1.3450 1.7613 2.1448 2.6245 2.9768 3.7874 4.1405
15 0.2579 0.5357 0.8662 1.3406 1.7531 2.1314 2.6025 2.9467 3.7328 4.0728
16 0.2576 0.5350 0.8647 1.3368 1.7459 2.1199 2.5835 2.9208 3.6862 4.0150
17 0.2573 0.5344 0.8633 1.3334 1.7396 2.1098 2.5669 2.8982 3.6458 3.9651
18 0.2571 0.5338 0.8620 1.3304 1.7341 2.1009 2.5524 2.8784 3.6105 3.9216
19 0.2569 0.5333 0.8610 1.3277 1.7291 2.0930 2.5395 2.8609 3.5794 3.8834
20 0.2567 0.5329 0.8600 1.3253 1.7247 2.0860 2.5280 2.8453 3.5518 3.8495
21 0.2566 0.5325 0.8591 1.3232 1.7207 2.0796 2.5176 2.8314 3.5272 3.8193
22 0.2564 0.5321 0.8583 1.3212 1.7171 2.0739 2.5083 2.8188 3.5050 3.7921
23 0.2563 0.5317 0.8575 1.3195 1.7139 2.0687 2.4999 2.8073 3.4850 3.7676
24 0.2562 0.5314 0.8569 1.3178 1.7109 2.0639 2.4922 2.7969 3.4668 3.7454
25 0.2561 0.5312 0.8562 1.3163 1.7081 2.0595 2.4851 2.7874 3.4502 3.7251
26 0.2560 0.5309 0.8557 1.3150 1.7056 2.0555 2.4786 2.7787 3.4350 3.7066
27 0.2559 0.5306 0.8551 1.3137 1.7033 2.0518 2.4727 2.7707 3.4210 3.6896
28 0.2558 0.5304 0.8546 1.3125 1.7011 2.0484 2.4671 2.7633 3.4082 3.6739
29 0.2557 0.5302 0.8542 1.3114 1.6991 2.0452 2.4620 2.7564 3.3962 3.6594
30 0.2556 0.5300 0.8538 1.3104 1.6973 2.0423 2.4573 2.7500 3.3852 3.6460
40 0.2550 0.5286 0.8507 1.3031 1.6839 2.0211 2.4233 2.7045 3.3069 3.5510
50 0.2547 0.5278 0.8489 1.2987 1.6759 2.0086 2.4033 2.6778 3.2614 3.4960
60 0.2545 0.5272 0.8477 1.2958 1.6706 2.0003 2.3901 2.6603 3.2317 3.4602
70 0.2543 0.5268 0.8468 1.2938 1.6669 1.9944 2.3808 2.6479 3.2108 3.4350
80 0.2542 0.5265 0.8461 1.2922 1.6641 1.9901 2.3739 2.6387 3.1953 3.4163
90 0.2541 0.5263 0.8456 1.2910 1.6620 1.9867 2.3685 2.6316 3.1833 3.4019
100 0.2540 0.5261 0.8452 1.2901 1.6602 1.9840 2.3642 2.6259 3.1737 3.3905
>100 0.2535 0.5247 0.8423 1.2832 1.6479 1.9647 2.3338 2.5857 3.1066 3.3101
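As the degrees of freedom grow, Student t quantiles decrease toward the corresponding N(0,1) quantiles, which is why the table ends with a single ">100" row. A quick check of this ordering for the p = 0.95 column (the three t values are copied from the table above; Python is used only as a convenient calculator):

```python
from statistics import NormalDist

z95 = NormalDist().inv_cdf(0.95)  # N(0,1) 0.95 quantile, about 1.6449

# 0.95 quantiles from the table rows df = 30, df = 100, and df > 100
t30, t100, t_large = 1.6973, 1.6602, 1.6479

# t quantiles shrink toward the normal quantile as df increases
assert z95 < t_large < t100 < t30
```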
