Christensen 2016 Analysis of Variance Design and Regression Linear Modeling For Unbalanced Data
Analysis of Variance, Design,
and Regression
Linear Modeling for
Unbalanced Data
Second Edition
Ronald Christensen
University of New Mexico
Albuquerque, USA
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
Preface
Computing
1 Introduction
1.1 Probability
1.2 Random variables and expectations
1.2.1 Expected values and variances
1.2.2 Chebyshev’s inequality
1.2.3 Covariances and correlations
1.2.4 Rules for expected values and variances
1.3 Continuous distributions
1.4 The binomial distribution
1.4.1 Poisson sampling
1.5 The multinomial distribution
1.5.1 Independent Poissons and multinomials
1.6 Exercises
2 One Sample
2.1 Example and introduction
2.2 Parametric inference about μ
2.2.1 Significance tests
2.2.2 Confidence intervals
2.2.3 P values
2.3 Prediction intervals
2.4 Model testing
2.5 Checking normality
2.6 Transformations
2.7 Inference about σ²
2.7.1 Theory
2.8 Exercises
4 Two Samples
4.1 Two correlated samples: Paired comparisons
4.2 Two independent samples with equal variances
4.2.1 Model testing
4.3 Two independent samples with unequal variances
4.4 Testing equality of the variances
4.5 Exercises
References
Background
Big Data are the future of Statistics. The electronic revolution has increased exponentially our abil-
ity to measure things. A century ago, data were hard to come by. Statisticians put a premium on
extracting every bit of information that the data contained. Now data are easy to collect; the prob-
lem is sorting through them to find meaning. To a large extent, this happens in two ways: doing a
crude analysis on a massive amount of data or doing a careful analysis on the moderate amount of
data that were isolated from the massive data as being meaningful. It is quite literally impossible to
analyze a million data points as carefully as one can analyze a hundred data points, so “crude” is
not a pejorative term but rather a fact of life.
The fundamental tools used in analyzing data have been around a long time. It is the emphases
and the opportunities that have changed. With thousands of observations, we don’t need a per-
fect statistical analysis to detect a large effect. But with thousands of observations, we might look
for subtle effects that we never bothered looking for before, and such an analysis must be done
carefully—as must any analysis in which only a small part of the massive data are relevant to the
problem at hand. The electronic revolution has also provided us with the opportunity to perform
data analysis procedures that were not practical before, but in my experience, the new procedures
(often called machine learning) are sophisticated applications of fundamental tools.
This book explains some of the fundamental tools and the ideas needed to adapt them to big
data. It is not a book that analyzes big data. The book analyzes small data sets carefully but by using
tools that 1) can easily be scaled to large data sets or 2) apply to the haphazard way in which small
relevant data sets are now constructed. Personally, I believe that it is not safe to apply models to
large data sets until you understand their implications for small data. There is also a major emphasis
on tools that look for subtle effects (interactions, homologous effects) that are hard to identify.
The fundamental tools examined here are linear structures for modeling data; specifically, how
to incorporate specific ideas about the structure of the data into the model for the data. Most of the
book is devoted to adapting linear structures (regression, analysis of variance, analysis of covari-
ance) to examine measurement (continuous) data. But the exact same methods apply to either-or
(Yes/No, binomial) data, count (Poisson, multinomial) data, and time-to-event (survival analysis,
reliability) data. The book also places strong emphasis on foundational issues, e.g., the meaning of
significance tests and the interval estimates associated with them; the difference between prediction
and causation; and the role of randomization.
The platform for this presentation is the revision of a book I published in 1996, Analysis of
Variance, Design, and Regression: Applied Statistical Methods. Within a year, I knew that the book
was not what I thought needed to be taught in the 21st century, cf. Christensen (2000). This book,
Analysis of Variance, Design, and Regression: Linear Modeling of Unbalanced Data, shares with
the earlier book lots of the title, much of the data, and even some of the text, but the book is radically
different. The original book focused greatly on balanced analysis of variance. This book focuses on
modeling unbalanced data. As such, it generalizes much of the work in the previous book. The more
general methods presented here agree with the earlier methods for balanced data. Another advantage
of taking a modeling approach to unbalanced data is that by making the effort to treat unbalanced
analysis of variance, one can easily handle a wide range of models for nonnormal data, because the
same fundamental methods apply. To that end, I have included new chapters on logistic regression,
log-linear models, and time-to-event data. These are placed near the end of the book, not because
they are less important, but because the real subject of the book is modeling with linear structures
and the methods for measurement data carry over almost immediately.
In early versions of this edition I made extensive comparisons between the methods used here
and the balanced ANOVA methods used in the 1996 book. In particular, I emphasized how the
newer methods continue to give the same results as the earlier methods when applied to balanced
data. While I have toned that down, comparisons still exist. In such comparisons, I do not repeat the
details of the balanced analysis given in the earlier book. CRC Press/Chapman & Hall have been
kind enough to let me place a version of the 1996 book on my website so that readers can explore the
comparisons in detail. Another good thing about having the old book up is that it contains a chapter
on confounding and fractional replications in 2^n factorials. I regret having to drop that chapter, but
the discussion is based on contrasts for balanced ANOVA and did not really fit the theme of the
current edition.
When I was in high school, my two favorite subjects were math and history. On a whim, I made
the good choice to major in Math for my BA. I mention my interest in history to apologize (primarily
in the same sense that C.S. Lewis was a Christian “apologist”) for using so much old data. Unless
you are trying to convince 18-year-olds that Statistics is sexy, I don’t think the age of the data should
matter.
I need to thank Adam Branscum, my coauthor on Christensen et al. (2010). Adam wrote the first
drafts of Chapter 7 and Appendix C of that book. Adam’s work on Chapter 7 definitely influenced
this work and Adam’s work on Appendix C is what got me programming in R. This is also a
good time to thank the people who have most influenced my career: Wes Johnson, Ed Bedrick,
Don Berry, Frank Martin, and the late, great Seymour Geisser. My colleague Yan Lu taught out
of a prepublication version of the book, and, with her students, pointed out a number of issues.
Generally, the first person whose opinions and help I sought was my son Fletcher.
After the effort to complete this book, I’m feeling as unbalanced as the data being analyzed.
Specifics
I think of the book as something to use in the traditional Master’s level year-long course on regres-
sion and analysis of variance. If one needed to actually separate the material into a regression course
and an ANOVA course, the regression material is in Chapters 6–11 and 20–23. Chapters 12–19 are
traditionally viewed as ANOVA. But I much prefer to use both regression and ANOVA ideas when
examining the generalized linear models of Chapters 20–22. Well-prepared students could begin
with Chapter 3 and skip to Chapter 6. By well-prepared, I tautologically mean students who are
already familiar with Chapters 1, 2, 4, and 5.
For less well-prepared students, obviously I would start at the beginning and deemphasize the
more difficult topics. This is what I have done when teaching data analysis to upper division Statis-
tics students and graduate students from other fields. I have tried to isolate more difficult material
into clearly delineated (sub)sections. In the first semester of such a course, I would skip the end of
Chapter 8, include the beginning of Chapter 12, and let time and student interest determine how
much of Chapters 9, 10, and 13 to cover. But the book wasn’t written to be a text for such a course;
it is written to address unbalanced multi-factor ANOVA.
The book requires very little pre-knowledge of math, just algebra, but does require that one
not be afraid of math. It does not perform calculus, but it discusses how integrals provide areas
under curves and, in an appendix, gives the integral formulae for means and variances. It largely
avoids matrix algebra but presents enough of it to enable the matrix approach to linear models to
be introduced. For a regression-ANOVA course, I would supplement the material after Chapter 11
with occasional matrix arguments. Any material described as a regression approach to an ANOVA
problem lends itself to matrix discussion.
Although the book starts at the beginning mathematically, it is not for the intellectually unso-
phisticated. By Chapter 2 it discusses the impreciseness of our concepts of populations and how
the deletion of outliers must change those concepts. Chapter 2 also discusses the “murky” transfor-
mation from a probability interval to a confidence interval and the differences between significance
testing, Neyman–Pearson hypothesis testing, and Bayesian methods. Because a lot of these ideas
are subtle, and because people learn best from specifics to generalities rather than the other way
around, Chapter 3 reiterates much of Chapter 2 but for general linear models. Most of the remainder
of the book can be viewed as the application of Chapter 3 to specific data structures. Well-prepared
students could start with Chapter 3 despite occasional references made to results in the first two
chapters.
Chapter 4 considers two-sample data. Perhaps its most unique feature is, contrary to what seems
popular in introductory Statistics these days, the argument that testing equality of means for two
independent samples provides much less information when the variances are different than when
they are the same.
Chapter 5 exists because I believe that if you teach one- and two-sample continuous data prob-
lems, you have a duty to present their discrete data analogs. Having gone that far, it seemed silly to
avoid analogs to one-way ANOVA. I do not find the one-way ANOVA F test for equal group means
to be all that useful. Contrasts contain more interesting information. The last two sections of Chap-
ter 5 contain, respectively, discrete data analogs to one-way ANOVA and a method of extracting
information similar to contrasts.
Chapters 6, 7, and 8 provide tools for exploring the relationship between a single dependent
variable and a single measurement (continuous) predictor. A key aspect of the discussion is that
the methods in Chapters 7 and 8 extend readily to more general linear models, i.e., those involving
categorical and/or multiple predictors. The title of Chapter 8 arises from my personal research
interest in testing lack of fit for linear models and the recognition of its relationship to nonparametric
regression.
Chapters 9, 10, and 11 examine features associated with multiple regression. Of particular note
are new sections on modeling interaction through generalized additive models and on lasso regres-
sion. I consider these important concepts for serious students of Statistics. The last of these chapters
is where the book’s use of matrices is focused. The discussion of principal component regression is
located here, not because the discussion uses matrices, but because the discussion requires matrix
knowledge to understand.
The rest of the book involves categorical predictor variables. In particular, the material after
Chapter 13 is the primary reason for writing this edition. The first edition focused on multifactor
balanced data and looking at contrasts, not only in main effects but contrasts within two- and three-
factor interactions. This edition covers the same material for unbalanced data.
Chapters 12 and 13 cover one-way analysis of variance (ANOVA) models and multiple com-
parisons but with an emphasis on the ideas needed when examining multiple categorical predictors.
Chapter 12 involves one categorical predictor much like Chapter 6 involved one continuous predic-
tor.
Chapter 14 examines the use of two categorical predictors, i.e., two-way ANOVA. It also
introduces the concept of homologous factors. Chapter 15 looks at models with one continuous and
one categorical factor, analysis of covariance. Chapter 16 considers models with three categorical
predictors.
Chapters 17 and 18 introduce the main ideas of experimental design. Chapter 17 introduces
a wide variety of standard designs and concepts of design. Chapter 18 introduces the key idea of
defining treatments with factorial structure. The unusual aspect of these chapters is that the analyses
presented apply when data are missing from the original design.
Chapter 19 introduces the analysis of dependent data. The primary emphasis is on the analysis
of split-plot models. A short discussion is also given of multivariate analysis. Both of these methods
require groups of observations that are independent of other groups but that are dependent within
the groups. Both methods require balance within the groups but the groups themselves can be un-
balanced. Subsection 19.2.1 even introduces a method for dealing with unbalance within groups.
It seems to have become popular to treat fixed and random effects models as merely two options
for analyzing data. I think these are very different models with very different properties; random
effects being far more sophisticated. As a result, I have chosen to introduce random effects as a
special case of split-plot models in Subsection 19.4.2. Subsampling models can also be viewed as
special cases of split-plot models and are treated in Subsection 19.4.1.
Chapters 20, 21, and 22 illustrate that the modeling ideas from the previous chapters continue
to apply to generalized linear models. In addition, Chapter 20 spends a lot of time pointing out
potholes that I see in standard programs for performing logistic regression.
Chapter 23 is a brief introduction to nonlinear regression. It is the only chapter, other than Chap-
ter 11, that makes extensive use of matrices and the only one that requires knowledge of calculus.
Nonlinear regression is a subject that I think deserves more attention than it gets. I think it is the
form of regression that we should aspire to, in the sense that we should aspire to having science that
is sophisticated enough to posit such models.
Ronald Christensen
Albuquerque, New Mexico
February 2015
Edited Preface to First Edition
This book examines the application of basic statistical methods: primarily analysis of variance and
regression but with some discussion of count data. It is directed primarily towards Master’s degree
students in Statistics studying analysis of variance, design of experiments, and regression analysis.
I have found that the Master’s level regression course is often popular with students outside of
Statistics. These students are often weaker mathematically and the book caters to that fact while
continuing to give a complete matrix formulation of regression.
The book is complete enough to be used as a second course for upper division and beginning
graduate students in Statistics and for graduate students in other disciplines. To do this, one must
be selective in the material covered, but the more theoretical material appropriate only for Statistics
Master’s students is generally isolated in separate subsections and, less often, in separate sections.
I think the book is reasonably encyclopedic. It really contains everything I would like my stu-
dents to know about Applied Statistics prior to them taking courses in linear model theory or log-
linear models.
I believe that beginning students (even Statistics Master’s students) often find statistical proce-
dures to be a morass of vaguely related special techniques. As a result, this book focuses on four
connecting themes.
1. Most inferential procedures are based on identifying a (scalar) parameter of interest, estimat-
ing that parameter, obtaining the standard error of the estimate, and identifying the appropriate
reference distribution. Given these items, the inferential procedures are identical for various pa-
rameters.
2. Balanced one-way analysis of variance has a simple, intuitive interpretation in terms of compar-
ing the sample variance of the group means with the mean of the sample variances for each group.
All balanced analysis of variance problems can be considered in terms of computing sample vari-
ances for various group means. These concepts exist in the new edition but are de-emphasized as
are balanced data.
3. Comparing different models provides a structure for examining both balanced and unbalanced
analysis of variance problems and for examining regression problems. In some problems the
most reasonable analysis is simply to find a succinct model that fits the data well. This is the core
of the new edition.
4. Checking assumptions is a crucial part of every statistical analysis.
The object of statistical data analysis is to reveal useful structure within the data. In a model-
based setting, I know of two ways to do this. One way is to find a succinct model for the data. In
such a case, the structure revealed is simply the model. The model selection approach is particu-
larly appropriate when the ultimate goal of the analysis is making predictions. This book uses the
model selection approach for multiple regression and for general unbalanced multifactor analysis
of variance. The other approach to revealing structure is to start with a general model, identify in-
teresting one-dimensional parameters, and perform statistical inferences on these parameters. This
parametric approach requires that the general model involve parameters that are easily interpretable.
We exploit the parametric approach for one-way analysis of variance and simple linear regression.
All statistical models involve assumptions. Checking the validity of these assumptions is crucial
because the models we use are never correct. We hope that our models are good approximations
of the true condition of the data and experience indicates that our models often work very well.
Nonetheless, to have faith in our analyses, we need to check the modeling assumptions as best we
can. Some assumptions are very difficult to evaluate, e.g., the assumption that observations are statis-
tically independent. For checking other assumptions, a variety of standard tools has been developed.
Using these tools is as integral to a proper statistical analysis as is performing an appropriate confi-
dence interval or test. For the most part, using model-checking tools without the aid of a computer
is more trouble than most people are willing to tolerate.
My experience indicates that students gain a great deal of insight into balanced analysis of
variance by actually doing the computations. The computation of the mean square for treatments in
a balanced one-way analysis of variance is trivial on any hand calculator with a variance or standard
deviation key. More importantly, the calculation reinforces the fundamental and intuitive idea behind
the balanced analysis of variance test, i.e., that a mean square for treatments is just a multiple of
the sample variance of the corresponding treatment means. I believe that as long as students find
the balanced analysis of variance computations challenging, they should continue to do them by
hand (calculator). I think that automated computation should be motivated by boredom rather than
bafflement. While I still believe this is true, it too is deemphasized in this edition.
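The identity mentioned above, that the mean square for treatments in a balanced one-way analysis of variance is a multiple of the sample variance of the treatment means, can be checked numerically. The sketch below uses made-up data (three hypothetical treatments with four observations each) and assumes the multiple is the common group size, which is the usual form of the balanced result. It also confirms the value against the familiar sum-of-squares formula.

```python
import statistics

# Hypothetical balanced data: 3 treatments, 4 observations each.
groups = {
    "A": [10.0, 12.0, 11.0, 13.0],
    "B": [14.0, 15.0, 13.0, 16.0],
    "C": [9.0, 8.0, 10.0, 9.0],
}
n_per_group = 4

group_means = [statistics.mean(obs) for obs in groups.values()]
group_vars = [statistics.variance(obs) for obs in groups.values()]

# Mean square for treatments: group size times the sample variance of the means.
ms_treatments = n_per_group * statistics.variance(group_means)

# Mean square error: the mean of the within-group sample variances.
ms_error = statistics.mean(group_vars)

f_stat = ms_treatments / ms_error

# Cross-check against the usual sum-of-squares definition.
grand_mean = statistics.mean(group_means)  # valid because the groups are balanced
ss_treatments = sum(n_per_group * (m - grand_mean) ** 2 for m in group_means)
ms_treatments_check = ss_treatments / (len(groups) - 1)

print(f"MS treatments = {ms_treatments:.4f} (check: {ms_treatments_check:.4f})")
print(f"MS error = {ms_error:.4f}, F = {f_stat:.4f}")
```

Both steps need nothing beyond a variance key, which is the sense in which the computation is within reach of a hand calculator.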
In addition to the four primary themes discussed above, there are several other characteristics
that I have tried to incorporate into this book.
I have tried to use examples to motivate theory rather than to illustrate theory. Most chapters
begin with data and an initial analysis of that data. After illustrating results for the particular data,
we go back and examine general models and procedures. I have done this to make the book more
palatable to two groups of people: those who only care about theory after seeing that it is useful and
those unfortunates who can never bring themselves to care about theory. (The older I get, the more I
identify with the first group. As for the other group, I find myself agreeing with W. Edwards Deming
that experience without theory teaches nothing.) As mentioned earlier, the theoretical material is
generally confined to separate subsections or, less often, separate sections, so it is easy to ignore.
I believe that the ultimate goal of all statistical analysis is prediction of observable quantities. I
have incorporated predictive inferential procedures where they seemed natural.
The object of most Statistics books is to illustrate techniques rather than to analyze data; this
book is no exception. Nonetheless, I think we do students a disservice by not showing them a
substantial portion of the work necessary to analyze even ‘nice’ data. To this end, I have tried to
consistently examine residual plots, to present alternative analyses using different transformations
and case deletions, and to give some final answers in plain English. I have also tried to introduce
such material as early as possible. I have included reasonably detailed examinations of a three-factor
analysis of variance and of a split-plot design with four factors. I have included some examples in
which, like real life, the final answers are not ‘neat.’ While I have tried to introduce statistical ideas
as soon as possible, I have tried to keep the mathematics as simple as possible for as long as possible.
For example, matrix formulations are postponed to the last chapter on multiple regression.
I never use side conditions or normal equations in analysis of variance. But computer programs
use side conditions and I discuss how they affect model interpretations.
In multiple comparison methods, (weakly) controlling the experimentwise error rate is discussed
in terms of first performing an omnibus test for no treatment effects and then choosing a criterion for
evaluating individual hypotheses. Most methods considered divide into those that use the omnibus
F test, those that use the Studentized range test, and the Bonferroni method, which does not use any
omnibus test. In the current edition I have focused primarily on multiple comparison methods that
work for unbalanced data.
I have tried to be very clear about the fact that experimental designs are set up for arbitrary
groups of treatments and that factorial treatment structures are simply an efficient way of defining
the treatments in some problems. Thus, the nature of a randomized complete block design does not
depend on how the treatments happen to be defined. The analysis always begins with a breakdown
of the sum of squares into blocks, treatments, and error. Further analysis of the treatments then
focuses on whatever structure happens to be present.
The analysis of covariance chapter no longer includes an extensive discussion of how the covari-
ates must be chosen to maintain a valid experiment. That discussion has been moved to the chapter
Basic Experimental Designs. Tukey’s one degree of freedom test for nonadditivity is presented as
a test for the need to perform a power transformation rather than as a test for a particular type of
interaction. Tukey’s test is now part of the Model Checking chapter, not the ACOVA chapter.
The chapter on confounding and fractional replication has more discussion of analyzing such
data than many other books contain.
Acknowledgements
Many people provided comments that helped in writing this book. My colleagues Ed Bedrick,
Aparna Huzurbazar, Wes Johnson, Bert Koopmans, Frank Martin, Tim O’Brien, and Cliff Qualls
helped a lot. I got numerous valuable comments from my students at the University of New Mex-
ico. Marjorie Bond, Matt Cooney, Jeff S. Davis, Barbara Evans, Mike Fugate, Jan Mines, and Jim
Shields stand out in this regard. The book had several anonymous reviewers, some of whom made
excellent suggestions.
I would like to thank Martin Gilchrist and Springer-Verlag for permission to reproduce Exam-
ple 7.6.1 from Plane Answers to Complex Questions: The Theory of Linear Models. I also thank
the Biometrika Trustees for permission to use the tables in Appendix B.5. Professor John Deely and
the University of Canterbury in New Zealand were kind enough to support completion of the book
during my sabbatical there.
Now my only question is what to do with the chapters on quality control, p^n factorials, and
response surfaces that ended up on the cutting room floor. I have pretty much given up on publishing
the quality control material. Response surfaces got into Advanced Linear Modeling (ALM) and I’m
hoping to get p^n factorials into a new edition of ALM.
Ronald Christensen
Albuquerque, New Mexico
February 1996
Edited, October 2014
Computing
There are two aspects to computing: generating output and interpreting output. We cannot always
control the generation of output, so we need to be able to interpret a variety of outputs. The book
places great emphasis on interpreting the range of output that one might encounter when dealing
with the data structures in the book. This comes up most forcefully when dealing with multiple
categorical predictors because arbitrary choices must be made by computer programmers to produce
some output, e.g., parameter estimates. The book deals with the arbitrary choices that are most
commonly made. Methods for generating output have, for the most part, been removed from the
book and placed on my website.
R has taken over the Statistics computing world. While R code is in the book, illustrations
of all the analyses and all of the graphics have been performed in R and are available on
my website: www.stat.unm.edu/~fletcher. Also, substantial bodies of Minitab and SAS code
(particularly for SAS’s GENMOD and LOGISTIC procedures) are available on my website. While
Minitab and many versions of SAS are now menu driven, the menus essentially write the code for
running a procedure. Presenting the code provides the information needed by the programs and,
implicitly, the information needed in the menus. That information is largely the same regardless of
the program. The choices of R, Minitab, and SAS are not meant to denigrate any other software.
They are merely what I am most familiar with.
The online computing aids are chapter for chapter (and for the most part, section for section)
images of the book. Thus, if you want help computing something from Section 2.5 of the book, look
in Section 2.5 of the online material.
My strong personal preference is for doing whatever I can in Minitab. That is largely because
Minitab forces me to remember fewer arcane commands than any other system (that I am familiar
with). Data analysis output from Minitab is discussed in the book because it differs from the output
provided by R and SAS. For fitting large tables of counts, as discussed in Chapter 21, I highly
recommend the program BMDP 4F. Fortunately, this can now be accessed through some batch
versions of SAS. My website contains files for virtually all the data. But you need to compare
each file to the tabled data and not just assume that the file looks exactly like the table.
Finally, I would like to point out a notational issue. In both Minitab and SAS, “glm” refers
to fitting general linear models. In R, “glm” refers to fitting generalized linear models, which are
something different. Generalized linear models contain general linear models as a special case. The
models in Chapters 20, 21, and 22 are different special cases of generalized linear models. (I am not
convinced that generalized linear models are anything more than a series of special cases connected
by a remarkable computing trick, cf. Christensen, 1997, Chapter 9.)
BMDP Statistical Software was located at 1440 Sepulveda Boulevard, Los Angeles, CA 90025.
MINITAB is a registered trademark of Minitab, Inc., 3081 Enterprise Drive, State College, PA
16801, telephone: (814) 238-3280, telex: 881612.
Chapter 1
Introduction
Statistics has two roles in society. First, Statistics is in the business of creating stereotypes. Think of
any stereotype you like, but to keep me out of trouble let’s consider something innocuous, like the
hypothesis that Italians talk with their hands more than Scandinavians. To establish the stereotype,
you need to collect data and use it to draw a conclusion. Often the conclusion is that either the
data suggest a difference or that they do not. The conclusion is (almost) never whether a difference
actually exists, only whether or not the data suggest a difference and how strongly they suggest it.
Statistics has been filling this role in society for at least 100 years.
Statistics’ less recognized second role in society is debunking stereotypes. Statistics is about
appreciating variability. It is about understanding variability, explaining it, and controlling it. I ex-
pect that with enough data, one could show that, on average, Italians really do talk with their hands
more than Scandinavians. Collecting a lot of data helps control the relevant variability and allows
us to draw a conclusion. But I also expect that we will never be able to predict accurately whether a
random Italian will talk with their hands more than a random Scandinavian. There is too much vari-
ability among humans. Even when differences among groups exist, those differences often pale in
comparison to the variability displayed by individuals within the groups—to the point where group
differences are often meaningless when dealing with individuals. For statements about individuals,
collecting a lot of data only helps us to more accurately state the limits of our (very considerable)
uncertainty.
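A small calculation may make the contrast between groups and individuals concrete. Everything in it is an assumption made for illustration: individuals in each group are taken to be independent and normally distributed with a common standard deviation, and the group means are taken to differ by 0.2 standard deviations. None of these numbers comes from data.

```python
import math
from statistics import NormalDist


def prob_individual_exceeds(mean_diff, sd):
    """P(X_A > X_B) for independent normal individuals sharing a standard deviation."""
    return NormalDist().cdf(mean_diff / (sd * math.sqrt(2)))


def z_for_group_difference(mean_diff, sd, n_per_group):
    """True mean difference divided by the standard error of the difference in sample means."""
    std_error = sd * math.sqrt(2 / n_per_group)
    return mean_diff / std_error


# Hypothetical values: group A averages 0.2 standard deviations above group B.
mean_diff = 0.2
sd = 1.0

p_individual = prob_individual_exceeds(mean_diff, sd)
z_100 = z_for_group_difference(mean_diff, sd, 100)
z_10000 = z_for_group_difference(mean_diff, sd, 10_000)

print(f"P(random A exceeds random B) = {p_individual:.3f}")
print(f"mean difference in standard errors, n = 100 per group: {z_100:.2f}")
print(f"mean difference in standard errors, n = 10000 per group: {z_10000:.2f}")
```

Under these assumptions the chance that a random member of group A exceeds a random member of group B is about 0.56, barely better than a coin flip, and it does not change with the sample size. The group difference measured in standard errors grows with the square root of the sample size, so more data makes the average difference unmistakable without improving predictions about individuals.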
Ultimately, Statistics is about what you can conclude and, equally, what you cannot conclude
from analyzing data that are subject to variability, as all data are. Statisticians use ideas from prob-
ability to quantify variability. They typically analyze data by creating probability models for the
data.
In this chapter we introduce basic ideas of probability and some related mathematical concepts
that are used in Statistics. Values to be analyzed statistically are generally thought of as random
variables; these are numbers that result from random events. The mean (average) value of a pop-
ulation is defined in terms of the expected value of a random variable. The variance is introduced
as a measure of the variability in a random variable (population). We also introduce some special
distributions (populations) that are useful in modeling statistical data. The purpose of this chapter is
to introduce these ideas, so they can be used in analyzing data and in discussing statistical models.
In writing statistical models, we often use symbols from the Greek alphabet. A table of these
symbols is provided in Appendix B.6.
Rumor has it that there are some students studying Statistics who have an aversion to mathemat-
ics. Such people might be wise to focus on the concepts of this chapter and not let themselves get
bogged down in the details. The details are given to provide a more complete introduction for those
students who are not math averse.
1.1 Probability
Probabilities are numbers between zero and one that are used to explain random phenomena. We are
all familiar with simple probability models. Flip a standard coin; the probability of heads is 1/2. Roll
a die; the probability of getting a three is 1/6. Select a card from a well-shuffled deck; the probability
of getting the queen of spades is 1/52 (assuming there are no jokers). One way to view probability
models that many people find intuitive is in terms of random sampling from a fixed population.
For example, the 52 cards form a fixed population and picking a card from a well-shuffled deck is
a means of randomly selecting one element of the population. While we will exploit this idea of
sampling from fixed populations, we should also note its limitations. For example, blood pressure is
a very useful medical indicator, but even with a fixed population of people it would be very difficult
to define a useful population of blood pressures. Blood pressure depends on the time of day, recent
diet, current emotional state, the technique of the person taking the reading, and many other factors.
Thinking about populations is very useful, but the concept can be very limiting both practically and
mathematically. For measurements such as blood pressures and heights, there are difficulties in even
specifying populations mathematically.
For mathematical reasons, probabilities are defined not on particular outcomes but on sets of
outcomes (events). This is done so that continuous measurements can be dealt with. It seems much
more natural to define probabilities on outcomes as we did in the previous paragraph, but consider
some of the problems with doing that. For example, consider the problem of measuring the height of
a corpse being kept in a morgue under controlled conditions. The only reason for getting morbid here
is to have some hope of defining what the height is. Living people, to some extent, stretch and con-
tract, so a height is a nebulous thing. But even given that someone has a fixed height, we can never
know what it is. When someone’s height is measured as 177.8 centimeters (5 feet 10 inches), their
height is not really 177.8 centimeters, but (hopefully) somewhere between 177.75 and 177.85 centimeters.
There is really no chance that anyone’s height is exactly 177.8 cm, or exactly 177.8001 cm,
or exactly 177.800000001 cm, or exactly 56.5955π cm, or exactly (76√5 + 4.5√3) cm. In any
neighborhood of 177.8, there are more numerical values than one could even imagine counting. The
height should be somewhere in the neighborhood, but it won’t be the particular value 177.8. The
point is simply that trying to specify all the possible heights and their probabilities is a hopeless
exercise. It simply cannot be done.
Even though individual heights cannot be measured exactly, when looking at a population of
heights they follow certain patterns. There are not too many people over 8 feet (244 cm) tall. There
are lots of males between 175.3 cm and 177.8 cm (5′ 9′′ and 5′ 10′′ ). With continuous values, each
possible outcome has no chance of occurring, but outcomes do occur and occur with regularity. If
probabilities are defined for sets instead of outcomes, these regularities can be reproduced mathe-
matically. Nonetheless, initially the best way to learn about probabilities is to think about outcomes
and their probabilities.
There are five key facts about probabilities:
1. Probabilities are between 0 and 1.
2. Something that happens with probability 1 is a sure thing.
3. If something has no chance of occurring, it has probability 0.
4. If something occurs with probability, say, .25, the probability that it will not occur is 1 − .25 =
.75.
5. If two events are mutually exclusive, i.e., if they cannot possibly happen at the same time, then
the probability that either of them occurs is just the sum of their individual probabilities.
Individual outcomes are always mutually exclusive, e.g., you cannot flip a coin and get both heads
and tails, so probabilities for outcomes can always be added together. Just to be totally correct, I
should mention one other point. It may sound silly, but we need to assume that something occurring
is always a sure thing. If we flip a coin, we must get either heads or tails with probability 1. We
could even allow for the coin landing on its edge as long as the probabilities for all the outcomes
add up to 1.
EXAMPLE 1.1.1. Consider the nine outcomes that are all combinations of three heights, tall (T),
medium (M), short (S), and three eye colors, blue (Bl), brown (Br) and green (G). The combinations
are displayed below.
    (T, Bl)   (T, Br)   (T, G)
    (M, Bl)   (M, Br)   (M, G)
    (S, Bl)   (S, Br)   (S, G)
The set of all outcomes is
{(T, Bl), (T, Br), (T, G), (M, Bl), (M, Br), (M, G), (S, Bl), (S, Br), (S, G)} .
The event that someone is tall consists of the three pairs in the first row of the table, i.e.,
{T} = {(T, Bl), (T, Br), (T, G)} .
This is the union of the three outcomes (T, Bl), (T, Br), and (T, G). Similarly, the set of people with
blue eyes is obtained from the first column of the table; it is the union of (T, Bl), (M, Bl), and (S, Bl)
and can be written
{Bl} = {(T, Bl), (M, Bl), (S, Bl)} .
If we know that {T} and {Bl} both occur, there is only one possible outcome, (T, Bl).
The event that {T} or {Bl} occurs consists of all outcomes in either the first row or the first
column of the table, i.e.,
{(T, Bl), (T, Br), (T, G), (M, Bl), (S, Bl)} . ✷
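The set manipulations in Example 1.1.1 can be mimicked on a computer. The following is only a minimal sketch in Python; the variable names are my own and are not part of the example. Events are represented as sets of outcomes, "and" corresponds to set intersection, and "or" corresponds to set union.

```python
from itertools import product

heights = ["T", "M", "S"]
eyes = ["Bl", "Br", "G"]

# The nine outcomes: every (height, eye color) pair.
outcomes = set(product(heights, eyes))

# Events are sets of outcomes.
tall = {(h, e) for (h, e) in outcomes if h == "T"}
blue = {(h, e) for (h, e) in outcomes if e == "Bl"}

tall_and_blue = tall & blue  # both {T} and {Bl} occur
tall_or_blue = tall | blue   # {T} or {Bl} (or both) occurs

print(sorted(tall_and_blue))
print(sorted(tall_or_blue))
```

The intersection contains the single outcome (T, Bl), and the union contains the five outcomes listed at the end of the example.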
EXAMPLE 1.1.2. Table 1.1 contains probabilities for the nine outcomes that are combinations of
height and eye color from Example 1.1.1. Note that each of the nine numbers is between 0 and 1
and that the sum of all nine equals 1. The probability of blue eyes is
Pr(Bl) = Pr[(T, Bl)] + Pr[(M, Bl)] + Pr[(S, Bl)] = .4 .
Similarly, Pr(Br) = .5 and Pr(G) = .1. The probability of not having blue eyes is
Pr(not Bl) = 1 − Pr(Bl) = 1 − .4 = .6 .
Even if there are a countable (but infinite) number of possible outcomes, one can still define a
probability by defining the probabilities for each outcome. It is only for measurement data that one
really needs to define probabilities on sets.
Two random events are said to be independent if knowing that one of them occurs provides no
information about the probability that the other event will occur. Formally, two events A and B are
independent if
Pr(A and B) = Pr(A)Pr(B).
Thus the probability that both events A and B occur is just the product of the individual probabilities
that A occurs and that B occurs. As we will begin to see in the next section, independence plays an
important role in Statistics.
EXAMPLE 1.1.3. Using the probabilities of Table 1.1 and the computations of Example 1.1.2,
the events tall and brown eyes are independent because
Pr(T and Br) = Pr(T)Pr(Br).
On the other hand, medium height and blue eyes are not independent because
Pr(M and Bl) ≠ Pr(M)Pr(Bl).
[Figure: the six faces of a die.]
Without even thinking about it, we define a random variable that transforms these six faces into the
numbers 1, 2, 3, 4, 5, 6.
In Statistics we think of observations as random variables. These are often some number asso-
ciated with a randomly selected member of a population. For example, one random variable is the
height of a person who is to be randomly selected from among University of New Mexico students.
(A random selection gives the same probability to every individual in the population. This random
variable presumes that we have well-defined methods for measuring height and defining UNM stu-
dents.) Rather than measuring height, we could define a different random variable by giving the
person a score of 1 if that person is female and 0 if the person is male. We can also perform math-
ematical operations on random variables to yield new random variables. Suppose we plan to select
a random sample of 10 students; then we would have 10 random variables with female and male
scores. The sum of these random variables is another random variable that tells us the (random)
number of females in the sample. Similarly, we would have 10 random variables for heights and
we can define a new random variable consisting of the average of the 10 individual height random
variables. Some random variables are related in obvious ways. In our example we measure both a
height and a sex score on each person. If the sex score variable is a 1 (telling us that the person is fe-
male), it suggests that the height may be smaller than we would otherwise suspect. Obviously some
female students are taller than some male students, but knowing a person’s sex definitely changes
our knowledge about their probable height.
We do similar things in tossing a coin.
EXAMPLE 1.2.1. Consider tossing a coin twice. The four outcomes are ordered pairs of heads
(H) and tails (T). The outcomes can be denoted as
(H, H) (H, T) (T, H) (T, T)
where the outcome of the first toss is the first element of the ordered pair.
The standard probability model has the four outcomes equally probable, i.e., 1/4 = Pr(H, H) =
Pr(H, T) = Pr(T, H) = Pr(T, T). Equivalently
                        Second toss
                     Heads   Tails   Total
First toss   Heads    1/4     1/4     1/2
             Tails    1/4     1/4     1/2
             Total    1/2     1/2     1
The probability of heads on each toss is 1/2. The probability of tails is 1/2. We will define two
random variables:
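\[
y_1(r, s) = \begin{cases} 1 & \text{if } r = H \\ 0 & \text{if } r = T \end{cases}
\qquad
y_2(r, s) = \begin{cases} 1 & \text{if } s = H \\ 0 & \text{if } s = T . \end{cases}
\]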
Thus, y1 is 1 if the first toss is heads and 0 otherwise. Similarly, y2 is 1 if the second toss is heads
and 0 otherwise.
The event y1 = 1 occurs if and only if we get heads on the first toss. We get heads on the first toss
by getting either of the outcome pairs (H, H) or (H, T ). In other words, the event y1 = 1 is equivalent
to the event {(H, H), (H, T )}. The probability of y1 = 1 is just the sum of the probabilities of the
outcomes in {(H, H), (H, T )}.
Pr(y1 = 1) = Pr(H, H) + Pr(H, T )
= 1/4 + 1/4 = 1/2.
Similarly,
Pr(y1 = 0) = Pr(T, H) + Pr(T, T )
= 1/2
Pr(y2 = 1) = 1/2
Pr(y2 = 0) = 1/2 .
Now define another random variable,
W (r, s) = y1 (r, s) + y2 (r, s) .
                     y2
                 1       0     y1 totals
y1        1     1/4     1/4      1/2
          0     1/4     1/4      1/2
y2 totals       1/2     1/2      1
Note that, for example, Pr [(y1 , y2 ) = (1, 0)] = 1/4 and Pr(y1 = 1) = 1/2. This table shows the
distribution of the probabilities for y1 and y2 both separately (marginally) and jointly. ✷
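As a check on Example 1.2.1, here is a small Python sketch, with function names of my own choosing, that builds the distribution of a random variable by adding up the probabilities of the outcomes that produce each value. It assumes the standard model in which the four outcomes are equally probable.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

# Four equally probable outcomes (first toss, second toss).
prob = {outcome: Fraction(1, 4) for outcome in product("HT", repeat=2)}

def y1(r, s):
    return 1 if r == "H" else 0

def y2(r, s):
    return 1 if s == "H" else 0

def W(r, s):
    return y1(r, s) + y2(r, s)

def distribution(rv):
    """Add the outcome probabilities for each value of the random variable."""
    dist = defaultdict(Fraction)
    for (r, s), p in prob.items():
        dist[rv(r, s)] += p
    return dict(dist)

dist_y1 = distribution(y1)                               # marginal of y1
dist_W = distribution(W)                                 # number of heads
joint = distribution(lambda r, s: (y1(r, s), y2(r, s)))  # joint of (y1, y2)

print(dist_y1)
print(dist_W)
print(joint)
```

Exact fractions are used so that the results can be compared directly with the table entries.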
For any random variable, a statement of the possible outcomes and their associated probabilities
is referred to as the (marginal) probability distribution of the random variable. For two or more
random variables, a table or other statement of the possible joint outcomes and their associated
probabilities is referred to as the joint probability distribution of the random variables.
All of the entries in the center of the distribution table given above for y1 and y2 are independent.
For example,
Pr[(y1 , y2 ) = (1, 0)] ≡ Pr(y1 = 1 and y2 = 0) = Pr(y1 = 1)Pr(y2 = 0).
We therefore say that y1 and y2 are independent. In general, two random variables y1 and y2 are
independent if any event involving only y1 is independent of any event involving only y2 .
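For discrete random variables, one way to check independence is to verify that every entry of the joint table is the product of the corresponding marginal probabilities. The sketch below does this for (y1, y2) and, for contrast, for the pair (y1, W), which should fail the check because W includes y1. The helper functions are hypothetical names introduced only for this illustration.

```python
from fractions import Fraction
from itertools import product

quarter = Fraction(1, 4)

# Joint distribution of (y1, y2): all four pairs have probability 1/4.
joint_y1_y2 = {pair: quarter for pair in product((0, 1), repeat=2)}

# Joint distribution of (y1, W) with W = y1 + y2.
joint_y1_W = {(a, a + b): quarter for a, b in product((0, 1), repeat=2)}

def marginal(joint, index):
    """Marginal distribution of one coordinate of a joint table."""
    m = {}
    for pair, p in joint.items():
        m[pair[index]] = m.get(pair[index], 0) + p
    return m

def is_independent(joint):
    """True if every joint probability is the product of its marginals."""
    m1, m2 = marginal(joint, 0), marginal(joint, 1)
    return all(joint.get((a, b), 0) == m1[a] * m2[b] for a in m1 for b in m2)

print(is_independent(joint_y1_y2))
print(is_independent(joint_y1_W))
```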
Independence is an extremely important concept in Statistics. Observations to be analyzed are
commonly assumed to be independent. This means that the random aspect of one observation con-
tains no information about the random aspect of any other observation. (However, every observation
tells us about fixed aspects of the underlying population such as the population center.) For most
purposes in Applied Statistics, this intuitive understanding of independence is sufficient.
EXAMPLE 1.2.3. Five pieces of paper are placed in a hat. The papers have the numbers 2, 4, 6,
6, and 8 written on them. A piece of paper is picked at random. The expected value of the number
drawn is the mean of the numbers on the five pieces of paper. Let y be the random variable that
relates a piece of paper to the number on that paper. Each piece of paper has the same probability of
being chosen, so, because the number 6 appears twice, the distribution of the random variable y is
\[
\frac{1}{5} = \Pr(y = 2) = \Pr(y = 4) = \Pr(y = 8), \qquad \frac{2}{5} = \Pr(y = 6) .
\]
The expected value is
\[
\begin{aligned}
E(y) &= 2\left(\frac{1}{5}\right) + 4\left(\frac{1}{5}\right) + 6\left(\frac{2}{5}\right) + 8\left(\frac{1}{5}\right) \\
     &= (2 + 4 + 6 + 6 + 8)/5 \\
     &= 5.2 .
\end{aligned}
\]
✷
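The computation in Example 1.2.3 is easy to reproduce. The following Python sketch (variable names are mine) builds the distribution of y from the five pieces of paper and then applies the definition of the expected value as a probability-weighted sum.

```python
from collections import Counter
from fractions import Fraction

papers = [2, 4, 6, 6, 8]

# Each paper is equally likely, so Pr(y = r) is the share of papers showing r.
dist = {r: Fraction(n, len(papers)) for r, n in Counter(papers).items()}

# Expected value: sum of r * Pr(y = r) over all possible values r.
expected = sum(r * p for r, p in dist.items())

print(dist[6], expected, float(expected))
```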
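EXAMPLE 1.2.4. Consider the coin tossing random variables y1 , y2 , and W from Example 1.2.1.
Recalling that y1 and y2 have the same distribution,
\[
\begin{aligned}
E(y_1) &= 1\left(\frac{1}{2}\right) + 0\left(\frac{1}{2}\right) = \frac{1}{2} \\
E(y_2) &= \frac{1}{2} \\
E(W) &= 2\left(\frac{1}{4}\right) + 1\left(\frac{1}{2}\right) + 0\left(\frac{1}{4}\right) = 1 .
\end{aligned}
\]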
The variable y1 is the number of heads in the first toss of the coin. The two possible values 0 and
1 are equally probable, so the middle of the distribution is 1/2. W is the number of heads in two
tosses; the expected number of heads in two tosses is 1. ✷
The expected value indicates the middle of a distribution, but does not indicate how spread out
(dispersed) a distribution is.
EXAMPLE 1.2.5. Consider three gambles that I will allow you to take. In game z1 you have equal
chances of winning 12, 14, 16, or 18 dollars. In game z2 you can again win 12, 14, 16, or 18 dollars,
but now the probabilities are .1 that you will win either $14 or $16 and .4 that you will win $12 or
$18. The third game I call z3 and you can win 5, 10, 20, or 25 dollars with equal chances. Being no
fool, I require you to pay me $16 for the privilege of playing any of these games. We can write each
game as a random variable.
z1    outcome        12     14     16     18
      probability    .25    .25    .25    .25
z2    outcome        12     14     16     18
      probability    .4     .1     .1     .4
z3    outcome         5     10     20     25
      probability    .25    .25    .25    .25
I try to be a good casino operator, so none of these games is fair. You have to pay $16 to play, but
you only expect to win $15. It is easy to see that
E(z1) = E(z2) = E(z3) = 15 .
But don’t forget that I’m taking a loss on the ice-water I serve to players and I also have to pay for
the pictures of my extended family that I’ve decorated my office with.
Although the games z1 , z2 , and z3 have the same expected value, the games (random variables)
are very different. Game z2 has the same outcomes as z1 , but much more of its probability is placed
farther from the middle value, 15. The extreme observations 12 and 18 are much more probable
under z2 than z1 . If you currently have $16, need $18 for your grandmother’s bunion removal, and
anything less than $18 has no value to you, then z2 is obviously a better game for you than z1 .
Both z1 and z2 are much more tightly packed around 15 than is z3 . If you needed $25 for the
bunion removal, z3 is the game to play because you can win it all in one play with probability .25.
In either of the other games you would have to win at least five times to get $25, a much less likely
occurrence. Of course you should realize that the most probable result is that Grandma will have
to live with her bunion. You are unlikely to win either $18 or $25. While the ethical moral of this
example is that a fool and his money are soon parted, the statistical point is that there is more to a
random variable than its mean. The variability of random variables is also important. ✷
The (population) variance is a measure of how spread out a distribution is from its expected
value. Let y be a random variable having a discrete distribution with E(y) = μ , then the variance of
y is
\[
\mathrm{Var}(y) \equiv \sum_{\text{all } r} (r - \mu)^2 \Pr(y = r) .
\]
This is the average squared distance of the outcomes from the center of the population. More tech-
nically, it is the expected squared distance between the outcomes and the mean of the distribution.
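EXAMPLE 1.2.7. Consider the coin tossing random variables of Examples 1.2.1 and 1.2.4.
\[
\begin{aligned}
\mathrm{Var}(y_1) &= \left(1 - \frac{1}{2}\right)^2 \frac{1}{2} + \left(0 - \frac{1}{2}\right)^2 \frac{1}{2} = \frac{1}{4} \\
\mathrm{Var}(y_2) &= \frac{1}{4} \\
\mathrm{Var}(W) &= (2 - 1)^2 \frac{1}{4} + (1 - 1)^2 \frac{1}{2} + (0 - 1)^2 \frac{1}{4} = \frac{1}{2} .
\end{aligned}
\]
✷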
A problem with the variance is that it is measured on the wrong scale. If y is measured in meters,
Var(y) involves the terms (r − μ)²; hence it is measured in meters squared. To get things back on
the original scale, we consider the standard deviation of y
\[
\text{Std. dev.}(y) \equiv \sqrt{\mathrm{Var}(y)} .
\]
EXAMPLE 1.2.8. Consider the random variables of Examples 1.2.5 and 1.2.6. The variances of
the games are measured in dollars squared while the standard deviations are measured in dollars.
\[
\begin{aligned}
\text{Std. dev.}(z_1) &= \sqrt{5} \doteq 2.236 \\
\text{Std. dev.}(z_2) &= \sqrt{7.4} \doteq 2.720 \\
\text{Std. dev.}(z_3) &= \sqrt{62.5} \doteq 7.906
\end{aligned}
\]
The standard deviation of z3 is 3 to 4 times larger than the others. From examining the distribu-
tions, the standard deviations seem to be more intuitive measures of relative variability than the
variances. The variance of z3 is 8.5 to 12.5 times larger than the other variances; these values seem
unreasonably inflated. ✷
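The means, variances, and standard deviations of the three games can be verified directly from the definitions. This is a sketch only; the dictionary layout and function names are my own. Exact fractions are used for the probabilities so that the variances come out exactly.

```python
from fractions import Fraction as F
from math import sqrt

# Distributions of the three games from Example 1.2.5: outcome -> probability.
games = {
    "z1": {12: F(1, 4), 14: F(1, 4), 16: F(1, 4), 18: F(1, 4)},
    "z2": {12: F(2, 5), 14: F(1, 10), 16: F(1, 10), 18: F(2, 5)},
    "z3": {5: F(1, 4), 10: F(1, 4), 20: F(1, 4), 25: F(1, 4)},
}

def mean(dist):
    """Expected value: sum of r * Pr(r)."""
    return sum(r * p for r, p in dist.items())

def variance(dist):
    """Expected squared distance from the mean."""
    mu = mean(dist)
    return sum((r - mu) ** 2 * p for r, p in dist.items())

for name, dist in games.items():
    var = variance(dist)
    print(name, mean(dist), float(var), round(sqrt(var), 3))
```

All three means equal 15, while the variances are 5, 7.4, and 62.5, in agreement with the standard deviations displayed in Example 1.2.8.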
Standard deviations and variances are useful as measures of the relative dispersions of different
random variables. The actual numbers themselves do not mean much. Moreover, there are other
equally good measures of dispersion that can give results that are somewhat inconsistent with these.
One reason standard deviations and variances are so widely used is because they are convenient
mathematically. In addition, normal (Gaussian) distributions are widely used in Applied Statistics
and are completely characterized by their expected values (means) and variances (or standard devi-
ations). Knowing these two numbers, the mean and variance, one knows everything about a normal
distribution.
EXAMPLE 1.2.9. Let z1 and z2 be two random variables defined by the following probability
table:
                       z2
                 0      1      2     z1 totals
          6      0     1/3     0       1/3
z1        4     1/3     0      0       1/3
          2      0      0     1/3      1/3
z2 totals       1/3    1/3    1/3      1
Then
\[
\begin{aligned}
E(z_1) &= 6\left(\frac{1}{3}\right) + 4\left(\frac{1}{3}\right) + 2\left(\frac{1}{3}\right) = 4, \\
E(z_2) &= 0\left(\frac{1}{3}\right) + 1\left(\frac{1}{3}\right) + 2\left(\frac{1}{3}\right) = 1, \\
\mathrm{Var}(z_1) &= (2 - 4)^2 \frac{1}{3} + (4 - 4)^2 \frac{1}{3} + (6 - 4)^2 \frac{1}{3} = 8/3, \\
\mathrm{Var}(z_2) &= (0 - 1)^2 \frac{1}{3} + (1 - 1)^2 \frac{1}{3} + (2 - 1)^2 \frac{1}{3} = 2/3, \\
\mathrm{Cov}(z_1, z_2) &= (2 - 4)(0 - 1)(0) + (2 - 4)(1 - 1)(0) + (2 - 4)(2 - 1)\frac{1}{3} \\
  &\quad + (4 - 4)(0 - 1)\frac{1}{3} + (4 - 4)(1 - 1)(0) + (4 - 4)(2 - 1)(0) \\
  &\quad + (6 - 4)(0 - 1)(0) + (6 - 4)(1 - 1)\frac{1}{3} + (6 - 4)(2 - 1)(0) \\
  &= -2/3, \\
\mathrm{Corr}(z_1, z_2) &= (-2/3) \Big/ \sqrt{(8/3)(2/3)} = -1/2 .
\end{aligned}
\]
This correlation indicates that relatively large z1 values tend to occur with relatively small z2 values.
However, the correlation is considerably greater than −1, so the linear relationship is less than
perfect. Moreover, the correlation measures the linear relationship and fails to identify the perfect
nonlinear relationship between z1 and z2 . If z1 = 2, then z2 = 2. If z1 = 4, then z2 = 0. If z1 = 6,
then z2 = 1. If you know one random variable, you know the other, but because the relationship is
nonlinear, the correlation is not ±1. ✷
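The arithmetic of Example 1.2.9 can be checked with a few lines of Python. In this sketch (names are mine), the covariance is computed as the expected product of deviations from the means, which is the form used in the computation above; only the three pairs with positive probability need to be listed because the remaining terms are multiplied by 0.

```python
from fractions import Fraction
from math import sqrt

third = Fraction(1, 3)

# Joint distribution of (z1, z2): only three pairs have positive probability.
joint = {(6, 1): third, (4, 0): third, (2, 2): third}

def expect(g):
    """Expected value of g(z1, z2) under the joint distribution."""
    return sum(g(a, b) * p for (a, b), p in joint.items())

mu1 = expect(lambda a, b: a)
mu2 = expect(lambda a, b: b)
var1 = expect(lambda a, b: (a - mu1) ** 2)
var2 = expect(lambda a, b: (b - mu2) ** 2)
cov = expect(lambda a, b: (a - mu1) * (b - mu2))
corr = float(cov) / sqrt(var1 * var2)

print(mu1, mu2, var1, var2, cov, corr)
```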
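EXAMPLE 1.2.10. Consider the coin toss random variables y1 and y2 from Example 1.2.1. We
earlier observed that these two random variables are independent. If so, there should be no relationship
between them (linear or otherwise). We now show that their covariance is 0.
\[
\begin{aligned}
\mathrm{Cov}(y_1, y_2) &= \left(0 - \frac{1}{2}\right)\left(0 - \frac{1}{2}\right)\frac{1}{4} + \left(0 - \frac{1}{2}\right)\left(1 - \frac{1}{2}\right)\frac{1}{4} \\
  &\quad + \left(1 - \frac{1}{2}\right)\left(0 - \frac{1}{2}\right)\frac{1}{4} + \left(1 - \frac{1}{2}\right)\left(1 - \frac{1}{2}\right)\frac{1}{4} \\
  &= \frac{1}{16} - \frac{1}{16} - \frac{1}{16} + \frac{1}{16} = 0 .
\end{aligned}
\]
✷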
In general, whenever two random variables are independent, their covariance (and thus their
correlation) is 0. However, just because two random variables have 0 covariance does not imply that
they are independent. Independence has to do with not having any kind of relationship; covariance
examines only linear relationships. Random variables with nonlinear relationships can have zero
covariance but not be independent.
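To make the last point concrete, here is a small illustration that is not taken from the examples above: a hypothetical random variable x that takes the values −1, 0, 1 with equal probability, together with y = x². Knowing x determines y completely, yet the covariance is 0. The Python sketch below verifies both claims.

```python
from fractions import Fraction

third = Fraction(1, 3)

# Hypothetical joint distribution: x is -1, 0, or 1 with equal chances, y = x**2.
joint = {(x, x * x): third for x in (-1, 0, 1)}

def expect(g):
    """Expected value of g(x, y) under the joint distribution."""
    return sum(g(x, y) * p for (x, y), p in joint.items())

mu_x = expect(lambda x, y: x)
mu_y = expect(lambda x, y: y)
cov = expect(lambda x, y: (x - mu_x) * (y - mu_y))

# Independence would require Pr(x = 0 and y = 0) = Pr(x = 0) Pr(y = 0).
p_x0 = sum(p for (x, y), p in joint.items() if x == 0)
p_y0 = sum(p for (x, y), p in joint.items() if y == 0)
p_both = joint[(0, 0)]

print(cov, p_both, p_x0 * p_y0)
```

The covariance is 0, but Pr(x = 0 and y = 0) = 1/3 while Pr(x = 0)Pr(y = 0) = 1/9, so the two variables are not independent.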
1.2.4 Rules for expected values and variances
We now present some extremely useful results that allow us to show that statistical estimates are
reasonable and to establish the variability associated with statistical estimates. These results relate to
the expected values, variances, and covariances of linear combinations of random variables. A linear
combination of random variables is something that only involves multiplying random variables by
fixed constants, adding such terms together, and adding a constant.
Proposition 1.2.11. Let y1 , y2 , y3 , and y4 be random variables and let a1 , a2 , a3 , and a4 be real
numbers.
1. $E(a_1 y_1 + a_2 y_2 + a_3) = a_1 E(y_1) + a_2 E(y_2) + a_3$.
2. If y1 and y2 are independent, $\mathrm{Var}(a_1 y_1 + a_2 y_2 + a_3) = a_1^2 \mathrm{Var}(y_1) + a_2^2 \mathrm{Var}(y_2)$.
3. $\mathrm{Var}(a_1 y_1 + a_2 y_2 + a_3) = a_1^2 \mathrm{Var}(y_1) + 2 a_1 a_2 \mathrm{Cov}(y_1, y_2) + a_2^2 \mathrm{Var}(y_2)$.
4. $\mathrm{Cov}(a_1 y_1 + a_2 y_2, a_3 y_3 + a_4 y_4) = a_1 a_3 \mathrm{Cov}(y_1, y_3) + a_1 a_4 \mathrm{Cov}(y_1, y_4) + a_2 a_3 \mathrm{Cov}(y_2, y_3) + a_2 a_4 \mathrm{Cov}(y_2, y_4)$.
All of these results generalize to linear combinations involving more than two random variables.
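Items 1 and 3 of the proposition can be checked numerically on the dependent random variables z1 and z2 of Example 1.2.9. The sketch below is only an illustration: the constants a1 = 2, a2 = −3, a3 = 5 are arbitrary values chosen by me, and the function names are hypothetical. It computes the mean and variance of the linear combination directly from its distribution and again from the rules.

```python
from fractions import Fraction

third = Fraction(1, 3)

# Joint distribution of (z1, z2) from Example 1.2.9.
joint = {(6, 1): third, (4, 0): third, (2, 2): third}
a1, a2, a3 = 2, -3, 5  # arbitrary constants chosen for illustration

def expect(g):
    """Expected value of g(z1, z2)."""
    return sum(g(u, v) * p for (u, v), p in joint.items())

def cov(g, h):
    """Covariance of g(z1, z2) and h(z1, z2); cov(g, g) is the variance of g."""
    mg, mh = expect(g), expect(h)
    return expect(lambda u, v: (g(u, v) - mg) * (h(u, v) - mh))

def z1(u, v):
    return u

def z2(u, v):
    return v

def combo(u, v):
    return a1 * u + a2 * v + a3

# Item 1: expected value of the linear combination.
mean_direct = expect(combo)
mean_rule = a1 * expect(z1) + a2 * expect(z2) + a3

# Item 3: variance of the linear combination, allowing for covariance.
var_direct = cov(combo, combo)
var_rule = a1**2 * cov(z1, z1) + 2 * a1 * a2 * cov(z1, z2) + a2**2 * cov(z2, z2)

print(mean_direct, mean_rule)
print(var_direct, var_rule)
```

Because z1 and z2 are not independent, dropping the covariance term (as in item 2) would give the wrong variance here.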
EXAMPLE 1.2.12. Recall that when independently tossing a coin twice, the total number of
heads, W , is the sum of y1 and y2 , the number of heads on the first and second tosses, respectively.
We have already seen that E(y1 ) = E(y2 ) = .5 and that E(W ) = 1. We now illustrate item 1 of the
proposition by finding E(W ) again. Since W = y1 + y2 ,
E(W ) = E(y1 + y2 ) = E(y1 ) + E(y2 ) = .5 + .5 = 1.
We have also seen that Var(y1 ) = Var(y2 ) = .25 and that Var(W ) = .5. Since the coin tosses are
independent, item 2 above gives
Var(W ) = Var(y1 + y2) = Var(y1 ) + Var(y2 ) = .25 + .25 = .5 .
The key point is that this is an easier way of finding the expected value and variance of W than using
the original definitions. ✷
EXAMPLE 1.2.13. Let y1 , y2 , y3 , and y4 be four random variables each with the same (population)
mean μ , i.e., E(yi ) = μ for i = 1, 2, 3, 4. We can compute the sample mean (average) of these,
defining
\[
\bar{y}_\cdot \equiv \frac{y_1 + y_2 + y_3 + y_4}{4}
  = \frac{1}{4} y_1 + \frac{1}{4} y_2 + \frac{1}{4} y_3 + \frac{1}{4} y_4 .
\]
The · in the subscript of ȳ· indicates that the sample mean is obtained by summing over the subscripts
of the yi s. The · notation is not necessary for this problem but becomes useful in dealing with the
analysis of variance problems treated later in the book.
Using item 1 of Proposition 1.2.11 we find that
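\[
\begin{aligned}
E(\bar{y}_\cdot) &= E\left(\frac{1}{4} y_1 + \frac{1}{4} y_2 + \frac{1}{4} y_3 + \frac{1}{4} y_4\right) \\
  &= \frac{1}{4} E(y_1) + \frac{1}{4} E(y_2) + \frac{1}{4} E(y_3) + \frac{1}{4} E(y_4) \\
  &= \frac{1}{4}\mu + \frac{1}{4}\mu + \frac{1}{4}\mu + \frac{1}{4}\mu \\
  &= \mu .
\end{aligned}
\]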
Thus one observation on ȳ· would make a reasonable estimate of μ .
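If we also assume that the yi s are independent with the same variance, say, σ², then from item 2
of Proposition 1.2.11
\[
\begin{aligned}
\mathrm{Var}(\bar{y}_\cdot) &= \mathrm{Var}\left(\frac{1}{4} y_1 + \frac{1}{4} y_2 + \frac{1}{4} y_3 + \frac{1}{4} y_4\right) \\
  &= \left(\frac{1}{4}\right)^2 \mathrm{Var}(y_1) + \left(\frac{1}{4}\right)^2 \mathrm{Var}(y_2)
     + \left(\frac{1}{4}\right)^2 \mathrm{Var}(y_3) + \left(\frac{1}{4}\right)^2 \mathrm{Var}(y_4) \\
  &= \left(\frac{1}{4}\right)^2 \sigma^2 + \left(\frac{1}{4}\right)^2 \sigma^2
     + \left(\frac{1}{4}\right)^2 \sigma^2 + \left(\frac{1}{4}\right)^2 \sigma^2 \\
  &= \frac{\sigma^2}{4} .
\end{aligned}
\]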
The variance of ȳ· is only one fourth of the variance of an individual observation. Thus the ȳ·
observations are more tightly packed around their mean μ than the yi s are. This indicates that one
observation on ȳ· is more likely to be close to μ than an individual yi . ✷
These results for ȳ· hold quite generally; they are not restricted to the average of four random
variables. If $\bar{y}_\cdot = (1/n)(y_1 + \cdots + y_n) = \sum_{i=1}^{n} y_i / n$ is the sample mean of n independent random
variables all with the same population mean μ and population variance σ², then
\[
E(\bar{y}_\cdot) = \mu
\]
and
\[
\mathrm{Var}(\bar{y}_\cdot) = \frac{\sigma^2}{n} .
\]
Proving these general results uses exactly the same ideas as the proofs for a sample of size 4.
As with a sample of size 4, the general results on ȳ· are very important in statistical inference. If
we are interested in determining the population mean μ from future data, the obvious estimate is the
average of the individual observations, ȳ· . The observations are random, so the estimate ȳ· is also a
random variable and the middle of its distribution is E(ȳ· ) = μ , the original population mean. Thus
ȳ· is a reasonable estimate of μ . Moreover, ȳ· is a better estimate than any particular observation
yi because ȳ· has a smaller variance, σ²/n, as opposed to σ² for yi . With less variability in the
estimate, any one observation of ȳ· is more likely to be near its mean μ than a single observation
yi . In practice, we obtain data and compute a sample mean. This constitutes one observation on the
random variable ȳ· . If our sample mean is to be a good estimate of μ , our one look at ȳ· had better
have a good chance of being close to μ . This occurs when the variance of ȳ· is small. Note that the
larger the sample size n, the smaller is σ²/n, the variance of ȳ· . We will return to these ideas later.
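The effect of the sample size on the variance of the sample mean can be seen exactly in a small case. As an illustration of my own choosing, take the yi s to be independent rolls of a fair die, so that μ = 7/2 and σ² = 35/12. The Python sketch below enumerates every possible sample of size n and computes the mean and variance of ȳ· from the definitions.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)

def mean_and_var_of_average(n):
    """Exact mean and variance of the average of n independent fair-die rolls."""
    p = Fraction(1, 6 ** n)  # every ordered sample of n rolls is equally likely
    averages = [Fraction(sum(rolls), n) for rolls in product(faces, repeat=n)]
    mu = sum(a * p for a in averages)
    var = sum((a - mu) ** 2 * p for a in averages)
    return mu, var

for n in (1, 2, 4):
    mu, var = mean_and_var_of_average(n)
    print(n, mu, var)
```

The mean stays at 7/2 for every n, while the variance is 35/12 divided by n, as the general results state.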
Generally, we will use item 1 of Proposition 1.2.11 to show that estimates are unbiased. In other
words, we will show that the expected value of an estimate is what we are trying to estimate. In
estimating μ , we have E(ȳ· ) = μ , so ȳ· is an unbiased estimate of μ . All this really does is show that
ȳ· is a reasonable estimate of μ . More important than showing unbiasedness is using item 2 to find
variances of estimates. Statistical inference depends crucially on having some idea of the variability
of an estimate. Item 2 is the primary tool in finding the appropriate variance for different estimators.
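[Figure 1.1: A continuous probability density. The point K(1 − α) on the horizontal axis divides the area under the curve into 1 − α (to the left) and α (to the right).]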
accurately. This theoretical inability to measure things exactly has little impact on our practical
world, but it has a substantial impact on the theory of Statistics.
The data in most statistical applications can be viewed either as counts of how often some event
has occurred or as measurements. Probabilities associated with count data are easy to describe. We
discuss some probability models for count data in Sections 1.4 and 1.5. With measurement data, we
can never obtain an exact value, so we don’t even try. With measurement data, we assign probabil-
ities to intervals. Thus we do not discuss the probability that a person has the height 177.8 cm or
177.8001 cm or 56.5955π cm, but we do discuss the probability that someone has a height between
177.75 cm and 177.85 cm. Typically, we think of doing this in terms of pictures. We associate prob-
abilities with areas under curves. (Mathematically, this involves integral calculus and is discussed
in a brief appendix at the end of the chapter.) Figure 1.1 contains a picture of a continuous proba-
bility distribution (a density). Probabilities must be between 0 and 1, so the curve must always be
nonnegative (to make all areas nonnegative) and the area under the entire curve must be 1.
Figure 1.1 also shows a point K(1 − α ). This point divides the area under the curve into two
parts. The probability of obtaining a number less than K(1 − α ) is 1 − α , i.e., the area under the
curve to the left of K(1 − α ) is 1 − α . The probability of obtaining a number greater than K(1 − α )
is α , i.e., the area under the curve to the right of K(1 − α ) is α . K(1 − α ) is a particular number, so the
probability is 0 that K(1 − α ) will actually occur. There is no area under a curve associated with any
particular point.
Pictures such as Figure 1.1 are often used as models for populations of measurements. With a
fixed population of measurements, it is natural to form a histogram, i.e., a bar chart that plots in-
tervals for the measurement against the proportion of individuals that fall into a particular interval.
Pictures such as Figure 1.1 can be viewed as approximations to such histograms. The probabilities
described by pictures such as Figure 1.1 are those associated with randomly picking an individ-
ual from the population. Thus, randomly picking an individual from the population modeled by
Figure 1.1 yields a measurement less than K(1 − α ) with probability 1 − α .
Ideas similar to those discussed in Section 1.2 can be used to define expected values, variances,
and covariances for continuous distributions. These extensions involve integral calculus and are
discussed in the appendix. In any case, Proposition 1.2.11 continues to apply.
The most commonly used distributional model for measurement data is the normal distribution
(also called the Gaussian distribution). The bell-shaped curve in Figure 1.1 is referred to as the
standard normal curve. The formula for writing the curve is not too ugly; it is
\[
f(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2} .
\]
Here e is the base of natural logarithms. Unfortunately, even with calculus it is very difficult to
compute areas under this curve. Finding standard normal probabilities requires a table or a computer
routine.
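To give some idea of what such a computer routine involves, the following Python sketch approximates the area under the standard normal curve between −1 and 1 by adding up the areas of many thin rectangles (the midpoint rule). This is merely one simple numerical method that I have chosen for illustration, not necessarily the method used by statistical software. For comparison, the same probability is computed from the error function in Python's math module, using the standard identity that the area between −z and z equals erf(z/√2).

```python
from math import erf, exp, pi, sqrt

def f(x):
    """Standard normal density."""
    return exp(-x * x / 2) / sqrt(2 * pi)

def area(a, b, steps=10_000):
    """Midpoint-rule approximation to the area under f between a and b."""
    width = (b - a) / steps
    return width * sum(f(a + (i + 0.5) * width) for i in range(steps))

approx = area(-1.0, 1.0)
exact = erf(1 / sqrt(2))  # area between -1 and 1 via the error function

print(round(approx, 4), round(exact, 4))
```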
By itself, the standard normal curve has little value in modeling measurements. For one thing,
the curve is centered about 0. I don’t take many measurements where I think the central value should
be 0. To make the normal distribution a useful model, we need to expand the standard normal into
a family of distributions with different centers (expected values) μ and different spreads (standard
deviations) σ . By appropriate recentering and rescaling of the plot, all of these curves will have
the same shape as Figure 1.1. Another important fact that allows us to combine data into estimates
is that linear combinations of independent normally distributed observations are again normally
distributed.
The standard normal distribution is the special case of a normal with μ = 0 and σ = 1. The
standard normal plays an important role because it is the only normal distribution for which we
actually compute probabilities. (Areas under the curve are hard to compute so we rely on computers
or, heaven forbid, tables.) Suppose a measurement y has a normal distribution with mean μ , standard
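deviation σ , and variance σ². We write this as
y ∼ N(μ , σ²).
Standardizing y, i.e., subtracting its mean and dividing by its standard deviation, gives a standard
normal random variable,
\[
\frac{y - \mu}{\sigma} \sim N(0, 1),
\]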
cf. Exercise 1.6.2. This standardization process allows us to find probabilities for all normal distri-
butions using only one difficult computational routine.
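Standardization can be illustrated with Python's statistics.NormalDist class. The mean of 175 cm and standard deviation of 7 cm below are hypothetical values used only for illustration; they are not estimates of anything. The sketch computes the probability of a height between 177.75 cm and 177.85 cm in two ways: directly from the N(μ, σ²) distribution, and from the standard normal after subtracting μ and dividing by σ.

```python
from statistics import NormalDist

mu, sigma = 175.0, 7.0  # hypothetical mean and standard deviation in cm

# Note: NormalDist takes the standard deviation, not the variance.
y = NormalDist(mu, sigma)
z = NormalDist()  # standard normal, N(0, 1)

lo, hi = 177.75, 177.85

# Probability computed directly from the distribution of y.
direct = y.cdf(hi) - y.cdf(lo)

# Same probability computed from the standard normal after standardizing.
standardized = z.cdf((hi - mu) / sigma) - z.cdf((lo - mu) / sigma)

print(direct, standardized)
```

The two answers agree up to rounding error, which is why a single routine for the standard normal suffices for every normal distribution.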
The standard normal distribution is sometimes used in constructing statistical inferences but
more often a similar distribution is used. When data are normally distributed, statistical inferences
often require something called Student’s t distribution. (Student was the pen name of the Guinness
brewmaster W. S. Gosset.) The t distribution is a family of distributions all of which look roughly
like Figure 1.1. They are all symmetric about 0, but they have slightly different amounts of dis-
persion (spread). The amount of variability in each distribution is determined by a positive integer
parameter called the degrees of freedom. With only 1 degree of freedom, the mathematical proper-
ties of a t distribution are fairly bizarre. (This special case is called a Cauchy distribution.) As the
number of degrees of freedom gets larger, the t distributions get better behaved and have less variability.
As the degrees of freedom get arbitrarily large, the t distribution approximates the standard
normal distribution; see Figure 1.2.
Two other distributions that come up later are the chi-squared distribution (χ²) and the F distribution.
These arise naturally when drawing conclusions about the population variance from data
that are normally distributed. Both distributions differ from those just discussed in that both are
asymmetric and both are restricted to positive numbers. However, the basic idea of probabilities
being areas under curves remains unchanged. The shape of a chi-squared distribution depends on
one parameter called its degrees of freedom. An F depends on two parameters, its numerator and
denominator degrees of freedom. Figure 1.3 illustrates a χ²(8) and an F(3, 18) distribution along
with illustrating the notation for an α percentile. With three or more degrees of freedom for a χ² and
three or more numerator degrees of freedom for an F, the distributions are shaped roughly like those
in Figure 1.3, i.e., they are positively skewed distributions with densities that start at 0, increase, and
then decrease. With fewer than three degrees of freedom, the densities take on their largest values
near 0.
[Figure 1.2: Densities of the N(0, 1), t(3), and t(8) distributions.]
Figure 1.3 Top: A χ²(8) distribution with the α percentile. Bottom: An F(3, 18) distribution with the α percentile.
In Section 1.2, we introduced Chebyshev’s inequality. Shewhart (1931, p. 177) discusses work
by Camp and Meidell that allows us to improve on Chebyshev’s inequality for continuous distributions.
Once again let E(y) = μ and Var(y) = σ². If the density, i.e., the function that defines the
curve, is symmetric, unimodal (has only one peak), and always decreases as one moves farther away
from the mode, then the inequality can be sharpened to
\[
\Pr[\mu - k\sigma < y < \mu + k\sigma] \ge 1 - \frac{1}{(2.25)k^2} .
\]
As discussed in the previous section, with y normal and k = 3, the true probability is .997, Cheby-
shev’s inequality gives a lower bound of .889, and the new improved Chebyshev inequality gives
a lower bound of .951. By making some relatively innocuous assumptions, we get a substantial
improvement in the lower bound.
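The three numbers quoted above can be checked with a few lines of Python. This is only a sketch using the standard library; the function names are illustrative and are not part of the text.

```python
from statistics import NormalDist

def chebyshev_bound(k):
    """Chebyshev lower bound on Pr[mu - k*sigma < y < mu + k*sigma]."""
    return 1 - 1 / k**2

def sharpened_bound(k):
    """Sharpened lower bound 1 - 1/(2.25 k^2) for symmetric, unimodal densities."""
    return 1 - 1 / (2.25 * k**2)

def normal_probability(k):
    """Exact Pr[-k < z < k] when z is standard normal."""
    z = NormalDist()
    return z.cdf(k) - z.cdf(-k)

for k in (2, 3):
    print(k, round(chebyshev_bound(k), 3), round(sharpened_bound(k), 3),
          round(normal_probability(k), 3))
```

For k = 3 the three columns reproduce the values .889, .951, and .997 discussed in the text.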
EXAMPLE 1.4.1. Being somewhat lonely in my misspent youth, I decided to use the computer
dating service aTonal.com. The service was to provide me with five matches. Being a very open-
minded soul, I convinced myself that the results of one match would not influence my opinion about
other matches. From my limited experience with the opposite sex, I have found that I enjoy about
40% of such brief encounters. I decided that my money would be well spent if I enjoyed two or more
of the five matches. Unfortunately, my loan shark repossessed my 1954 Studebaker before I could
indulge in this taste of nirvana. Back in those days, we chauvinists believed: no wheels—no women.
Nevertheless, let us compute the probability that I would have been satisfied with the dating service.
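Let W be the number of matches I would have enjoyed. Satisfaction means W ≥ 2, so the probability
of satisfaction is Pr(W ≥ 2) = Pr(W = 2) + Pr(W = 3) + Pr(W = 4) + Pr(W = 5). Write the outcome of
the five matches as an ordered list of Ls (matches I like) and Ds (matches I dislike). Liking all
five matches means
$$\Pr(W = 5) = \Pr(L, L, L, L, L).$$
Remember, I assumed that the matches were independent and that the probability of my liking any
one is .4. Thus,
$$\Pr(W = 5) = \Pr(L)\Pr(L)\Pr(L)\Pr(L)\Pr(L) = (.4)^5.$$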
The probability of liking four matches is a bit more complicated. I could only dislike one match,
but there are five different choices for the match that I could dislike. It could be the fifth, the fourth,
the third, the second, or the first. Any pattern of 4 Ls and a D excludes the other patterns from
occurring, e.g., if the only match I dislike is the fourth, then the only match I dislike cannot be the
second. Since the patterns are mutually exclusive (disjoint), the probability of disliking one match
is the sum of the probabilities of the individual patterns.
By assumption Pr(L) = .4, so Pr(D) = 1 − Pr(L) = 1 − .4 = .6. The matches are independent, so
$$\Pr(L, L, L, L, D) = \Pr(L)\Pr(L)\Pr(L)\Pr(L)\Pr(D) = (.4)^4(.6).$$
Similarly,
$$\Pr(L, L, L, D, L) = \Pr(L, L, D, L, L) = \Pr(L, D, L, L, L) = \Pr(D, L, L, L, L) = (.4)^4(.6).$$
Pr(W = 3) = Pr(L, L, L, D, D)
+ Pr(L, L, D, L, D)
+ Pr(L, D, L, L, D)
+ Pr(D, L, L, L, D)
+ Pr(L, L, D, D, L)
+ Pr(L, D, L, D, L)
+ Pr(D, L, L, D, L)
+ Pr(L, D, D, L, L)
+ Pr(D, L, D, L, L)
+ Pr(D, D, L, L, L).
Again all of these patterns have exactly the same probability. For example, using independence
Binomial random variables can also be generated by sampling from a fixed population. If we
were going to make 20 random selections from the UNM student body, the number of females would
have a binomial distribution. Given a set of procedures for defining and sampling the student body,
there would be some fixed number of students of which a given number would be females. Under
random sampling, the probability of selecting a female on any of the 20 trials would be simply the
proportion of females in the population. Although it is very unlikely to occur in this example, the
sampling scheme must allow the possibility of students being selected more than once in the sample.
If people were not allowed to be chosen more than once, each successive selection would change the
proportion of females available for the subsequent selection. Of course, when making 20 selections
out of a population of over 20,000 UNM students, even if you did not allow people to be reselected,
the changes in the proportions of females are insubstantial and the binomial distribution makes a
good approximation to the true distribution. On the other hand, if the entire student population was
40 rather than 20,000+, it might not be wise to use the binomial approximation when people are
not allowed to be reselected.
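A rough numerical comparison illustrates the point about population size. The sketch below compares the binomial probabilities with the exact probabilities for sampling without reselection (the hypergeometric distribution, a name not used in the text). The assumption that exactly half of the students are female is purely hypothetical.

```python
from math import comb

def binomial_pmf(r, n, p):
    """Pr(r females in n draws) when students may be reselected."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

def no_reselection_pmf(r, n, females, population):
    """Pr(r females in n draws) when nobody can be chosen twice."""
    return (comb(females, r) * comb(population - females, n - r)
            / comb(population, n))

def max_pmf_gap(n, females, population):
    """Largest difference between the two sets of probabilities."""
    p = females / population
    return max(abs(binomial_pmf(r, n, p)
                   - no_reselection_pmf(r, n, females, population))
               for r in range(n + 1))

# Hypothetical populations in which half of the students are female.
gap_large = max_pmf_gap(20, 10_000, 20_000)
gap_small = max_pmf_gap(20, 20, 40)
print(gap_large, gap_small)
```

With 20,000 students the two sets of probabilities are nearly indistinguishable; with 40 students they differ noticeably.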
Typically, the outcome of interest in a binomial is referred to as a success. If the probability
of a success is p for each of N independent identical trials, then the number of successes y has a
binomial distribution with parameters N and p. Write
y ∼ Bin(N, p) .
The distribution of y is
$$\Pr(y = r) = \binom{N}{r} p^r (1 - p)^{N - r}$$
for r = 0, 1, . . . , N. Here
$$\binom{N}{r} \equiv \frac{N!}{r!(N - r)!}$$
where for any positive integer m, m! ≡ m(m − 1)(m − 2) · · · (2)(1) and 0! ≡ 1. The notation $\binom{N}{r}$
is read “N choose r” because it is the number of distinct ways of choosing r individuals out of a
collection containing N individuals.
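As a small illustration of the formula, the sketch below evaluates binomial probabilities in Python and applies them to the dating service of Example 1.4.1, where W ∼ Bin(5, .4) and satisfaction means two or more enjoyable matches. The function name is illustrative.

```python
from math import comb

def binomial_pmf(r, n, p):
    """Pr(y = r) for y ~ Bin(n, p)."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

n, p = 5, 0.4

# Add up Pr(W = 2), ..., Pr(W = 5) directly.
pr_satisfied = sum(binomial_pmf(r, n, p) for r in range(2, n + 1))

# The same probability via the complement, 1 - Pr(W = 0) - Pr(W = 1).
pr_complement = 1 - binomial_pmf(0, n, p) - binomial_pmf(1, n, p)

print(round(pr_satisfied, 5), round(pr_complement, 5))
```

Both routes give the same answer, roughly .663.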
EXAMPLE 1.4.2. The random variables in Example 1.2.1 were y1, the number of heads on the
first toss of a coin, y2, the number of heads on the second toss of a coin, and W, the combined
number of heads from the two tosses. These have the following distributions:
$$y_1 \sim \mathrm{Bin}\left(1, \tfrac{1}{2}\right), \qquad y_2 \sim \mathrm{Bin}\left(1, \tfrac{1}{2}\right), \qquad W \sim \mathrm{Bin}\left(2, \tfrac{1}{2}\right).$$
Note that W, the Bin(2, 1/2), was obtained by adding together the two independent Bin(1, 1/2) random
variables y1 and y2. This result is quite general. Any Bin(N, p) random variable can be written as
the sum of N independent Bin(1, p) random variables. ✷
Given the probability distribution of a binomial, we can find the mean (expected value) and
variance. By definition, if y ∼ Bin(N, p), the mean is
$$E(y) = \sum_{r=0}^{N} r \binom{N}{r} p^r (1 - p)^{N - r}.$$
This is difficult to evaluate directly, but by writing y as the sum of N independent Bin(1, p) random
variables and using Exercise 1.6.1 and Proposition 1.2.11, it is easily seen that
$$E(y) = Np.$$
Similarly, the definition of the variance gives
$$\mathrm{Var}(y) = \sum_{r=0}^{N} (r - Np)^2 \binom{N}{r} p^r (1 - p)^{N - r},$$
but by again writing y as the sum of N independent Bin(1, p) random variables and using
Exercise 1.6.1 and Proposition 1.2.11, it is easily seen that
$$\mathrm{Var}(y) = Np(1 - p).$$
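The sums in the definitions are tedious by hand but easy for a computer, so the formulas can be checked numerically. The following is a sketch only; N = 20 and p = .3 are arbitrary illustrative values.

```python
from math import comb

def binomial_pmf(r, n, p):
    """Pr(y = r) for y ~ Bin(n, p)."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

def binomial_moments(n, p):
    """Mean and variance of Bin(n, p), summing directly over r = 0, ..., n."""
    mean = sum(r * binomial_pmf(r, n, p) for r in range(n + 1))
    var = sum((r - mean)**2 * binomial_pmf(r, n, p) for r in range(n + 1))
    return mean, var

mean, var = binomial_moments(20, 0.3)
print(mean, 20 * 0.3)        # direct sum versus Np
print(var, 20 * 0.3 * 0.7)   # direct sum versus Np(1 - p)
```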
If y1 is the number of successes and y2 is the number of failures in the N trials, then
$$y_2 = N - y_1,$$
$$y_1 \sim \mathrm{Bin}(N, p),$$
and
$$y_2 \sim \mathrm{Bin}(N, 1 - p).$$
The last result holds because, with independent identical trials, the number of outcomes that we call
failures must also have a binomial distribution. If p is the probability of success, the probability of
failure is 1 − p. Of course,
$$E(y_2) = N(1 - p), \qquad \mathrm{Var}(y_2) = N(1 - p)p.$$
Moreover,
$$\mathrm{Cov}(y_1, y_2) = -Np(1 - p)$$
and
$$\mathrm{Corr}(y_1, y_2) = -1.$$
There is a perfect linear relationship between y1 and y2. If y1 goes up one count, y2 goes down one
count. When we look at both successes and failures, we write
$$(y_1, y_2) \sim \mathrm{Bin}(N, p, (1 - p)).$$
This is the simplest case of the multinomial distribution discussed in the next section. But first we
look at a special case of binomial sampling.
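The covariance and correlation of the success and failure counts can be confirmed by summing over the binomial distribution of the successes. This is a minimal sketch; N = 10 and p = .25 are arbitrary illustrative values.

```python
from math import comb, sqrt

def binomial_pmf(r, n, p):
    """Pr(y = r) for y ~ Bin(n, p)."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

def success_failure_cov_corr(n, p):
    """Covariance and correlation of successes y1 and failures y2 = n - y1."""
    pmf = [binomial_pmf(r, n, p) for r in range(n + 1)]
    e1 = sum(r * w for r, w in enumerate(pmf))
    e2 = sum((n - r) * w for r, w in enumerate(pmf))
    cov = sum((r - e1) * ((n - r) - e2) * w for r, w in enumerate(pmf))
    v1 = sum((r - e1)**2 * w for r, w in enumerate(pmf))
    v2 = sum(((n - r) - e2)**2 * w for r, w in enumerate(pmf))
    return cov, cov / sqrt(v1 * v2)

cov, corr = success_failure_cov_corr(10, 0.25)
print(cov, -10 * 0.25 * 0.75)  # direct sum versus -Np(1 - p)
print(corr)                    # should be -1 up to rounding error
```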
1.5 THE MULTINOMIAL DISTRIBUTION 21
1.4.1 Poisson sampling
The Poisson distribution might be used to model the number of flaws on a dvd. There is no obvious
upper bound on the number of flaws. If we put a grid over the (square?) dvd, we could count
whether every grid square contains a flaw. The number of grid squares with a flaw has a binomial
distribution. As we make the grid finer and finer, the number of grid squares that contain flaws will
become the actual number of flaws. Also, for finer grids, the probability of a flaw decreases as the
size of each square decreases but the number of grid squares increases correspondingly while the
expected number of squares with flaws remains the same. After all, the number of flaws we expect
on the dvd has nothing to do with the grid that we decide to put over it. If we let λ be the expected
number of flaws, λ = N p where N is the number of grid squares and p is the probability of a flaw
in the square.
The Poisson distribution is an approximation used for binomials with a very large number of
trials, each having a very small probability of success. Under these conditions, if N p ≐ λ (that
is, N p is approximately λ) we write
$$y \sim \mathrm{Pois}(\lambda).$$
The distribution of y is
$$\Pr(y = r) = \lambda^r e^{-\lambda} / r!,$$
r = 0, 1, 2, . . .. These probabilities are just the limits of the binomial probabilities under the
conditions described. The mean and variance of a Pois(λ) are
$$E(y) = \lambda$$
and
$$\mathrm{Var}(y) = \lambda.$$
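The limiting behavior can be seen numerically by holding N p = λ fixed while the number of trials N grows, as with the finer and finer grid. This is a sketch only; λ = 2 and the function names are illustrative assumptions.

```python
from math import comb, exp, factorial

def binomial_pmf(r, n, p):
    """Pr(y = r) for y ~ Bin(n, p)."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

def poisson_pmf(r, lam):
    """Pr(y = r) for y ~ Pois(lam)."""
    return lam**r * exp(-lam) / factorial(r)

def max_gap(n, lam, r_max=10):
    """Largest gap between Bin(n, lam/n) and Pois(lam) probabilities for r <= r_max."""
    return max(abs(binomial_pmf(r, n, lam / n) - poisson_pmf(r, lam))
               for r in range(r_max + 1))

lam = 2.0  # hypothetical expected number of flaws
for n in (10, 100, 10_000):
    print(n, max_gap(n, lam))
```

The gap shrinks as the grid gets finer, in line with the description above.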
EXAMPLE 1.5.1. Consider the probabilities for the nine height and eye color categories given in
Example 1.1.2. The probabilities are repeated below.
Suppose a random sample of 50 individuals was obtained with these probabilities. For example,
one might have a population of 100 people in which 12 were tall with blue eyes, 15 were tall with
brown eyes, 3 were short with green eyes, etc. We could randomly select one of the 100 people
as the first individual in the sample. Then, returning that individual to the population, take another
random selection from the 100 to be the second individual. We are to proceed in this way until 50
people are selected. Note that with a population of 100 and a sample of 50 there is a substantial
chance that some people would be selected more than once. The numbers of selections falling into
each of the nine categories has a multinomial distribution with N = 50 and these probabilities.
It is unlikely that one would actually perform sampling from a population of 100 people as
described above. Typically, one would not allow the same person to be chosen more than once.
However, if we had a population of 10,000 people where 1200 were tall with blue eyes, 1500 were
tall with brown eyes, 300 were short with green eyes, etc., with a sample size of 50 we might be
willing to allow the possibility of selecting the same person more than once simply because it is
extremely unlikely to happen. Technically, to obtain the multinomial distribution with N = 50 and
these probabilities, when sampling from a fixed population we need to allow individuals to appear
more than once. However, when taking a small sample from a large population, it does not matter
much whether or not you allow people to be chosen more than once, so the multinomial often
provides a good approximation even when individuals are excluded from reappearing in the sample.
✷
Consider a group of N independent identical trials in which each trial results in the occurrence
of one of q events. Let yi, i = 1, . . . , q, be the number of times that the ith event occurs and let pi be
the probability that the ith event occurs on any trial. The probabilities pi must satisfy
p1 + p2 + · · · + pq = 1. We say that (y1, . . . , yq) has a multinomial distribution with parameters
N, p1, . . . , pq. Write
$$(y_1, \ldots, y_q) \sim \mathrm{Mult}(N, p_1, \ldots, p_q).$$
The distribution is given by the probabilities
$$\Pr(y_1 = r_1, \ldots, y_q = r_q) = \frac{N!}{r_1! \cdots r_q!}\, p_1^{r_1} \cdots p_q^{r_q} = \left( N! \Big/ \prod_{i=1}^{q} r_i! \right) \prod_{i=1}^{q} p_i^{r_i}.$$
Here the ri are allowed to be any whole numbers with each ri ≥ 0 and r1 + · · · + rq = N. Note that if
q = 2, this is just a binomial distribution. In general, each individual component yi of a multinomial
consists of N trials in which category i either occurs or does not occur, so individual components
have the marginal distributions
$$y_i \sim \mathrm{Bin}(N, p_i).$$
It follows that
$$E(y_i) = N p_i$$
and
$$\mathrm{Var}(y_i) = N p_i (1 - p_i).$$
It can also be shown that
$$\mathrm{Cov}(y_i, y_j) = -N p_i p_j \quad \text{for } i \ne j.$$
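EXAMPLE 1.5.2. Suppose that the 50 individuals from Example 1.5.1 fall into the categories as
listed below.
The probability of obtaining this particular set of counts is
$$\frac{50!}{5!\,8!\,2!\,10!\,18!\,2!\,3!\,1!\,1!}\,(.12)^5 (.15)^8 (.03)^2 (.22)^{10} (.34)^{18} (.04)^2 (.06)^3 (.01)^1 (.03)^1.$$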
This number is zero to over 5 decimal places. The fact that this is a very small number is not
surprising. There are a lot of possible tables, so the probability of getting any particular table is very
small. In fact, many of the possible tables are much less likely to occur than this table.
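The size of this probability can be checked directly. The sketch below evaluates the multinomial formula for the counts and probabilities that appear in the displayed expression; the function name is illustrative.

```python
from math import factorial, prod

def multinomial_pmf(counts, probs):
    """Pr(y_1 = r_1, ..., y_q = r_q) for (y_1, ..., y_q) ~ Mult(N, p_1, ..., p_q)."""
    n = sum(counts)
    coefficient = factorial(n)
    for r in counts:
        coefficient //= factorial(r)  # exact integer division at every step
    return coefficient * prod(p**r for p, r in zip(probs, counts))

# Probabilities and counts as they appear in the displayed expression.
probs = [.12, .15, .03, .22, .34, .04, .06, .01, .03]
counts = [5, 8, 2, 10, 18, 2, 3, 1, 1]
table_probability = multinomial_pmf(counts, probs)
print(table_probability)
```

With q = 2 the same function returns binomial probabilities, e.g., `multinomial_pmf([2, 3], [.4, .6])` is Pr(W = 2) for W ∼ Bin(5, .4).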
Let’s return to thinking about the observations as random. The expected number of observations
for each category is given by N pi . It is easily seen that the expected counts for the cells are as
follows.
Height-eye color expected values
                    Eye color
Height      Blue    Brown    Green
Tall         6.0      7.5      1.5
Medium      11.0     17.0      2.0
Short        3.0      0.5      1.5
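The expected counts are just N pi for each cell, and the variance and covariance formulas apply in the same way. The sketch below uses the nine probabilities in the order they appear in Example 1.5.2; the dictionary layout and function names are illustrative.

```python
N = 50
probs = {
    ("Tall", "Blue"): .12, ("Tall", "Brown"): .15, ("Tall", "Green"): .03,
    ("Medium", "Blue"): .22, ("Medium", "Brown"): .34, ("Medium", "Green"): .04,
    ("Short", "Blue"): .06, ("Short", "Brown"): .01, ("Short", "Green"): .03,
}

def expected_count(cell):
    """E(y_i) = N p_i."""
    return N * probs[cell]

def count_variance(cell):
    """Var(y_i) = N p_i (1 - p_i)."""
    p = probs[cell]
    return N * p * (1 - p)

def count_covariance(cell_a, cell_b):
    """Cov(y_i, y_j) = -N p_i p_j for two different cells."""
    return -N * probs[cell_a] * probs[cell_b]

for height in ("Tall", "Medium", "Short"):
    row = [round(expected_count((height, eye)), 1)
           for eye in ("Blue", "Brown", "Green")]
    print(height, row)
```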
To define the covariance between two random variables, say y1 and y2, we need a joint density
f(y1, y2). We can find the density for y1 alone as
$$f_1(y_1) = \int_{-\infty}^{\infty} f(y_1, y_2)\, dy_2.$$
Writing E(y1) = μ1 and E(y2) = μ2, we can now define the covariance between y1 and y2 as
$$\mathrm{Cov}(y_1, y_2) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} (y_1 - \mu_1)(y_2 - \mu_2) f(y_1, y_2)\, dy_1\, dy_2.$$
1.6 Exercises
EXERCISE 1.6.1. Use the definitions to find the expected value and variance of a Bin(1, p)
distribution.
EXERCISE 1.6.2. Let y be a random variable with E(y) = μ and Var(y) = σ². Show that
$$E\left(\frac{y - \mu}{\sigma}\right) = 0$$
and
$$\mathrm{Var}\left(\frac{y - \mu}{\sigma}\right) = 1.$$
Let ȳ· be the sample mean of n independent observations yi with E(yi) = μ and Var(yi) = σ².
What is the expected value and variance of
$$\frac{\bar{y}_\cdot - \mu}{\sigma / \sqrt{n}}\,?$$
Hint: For the first part, write
$$\frac{y - \mu}{\sigma} \quad \text{as} \quad \frac{1}{\sigma}\, y - \frac{\mu}{\sigma}$$
and use Proposition 1.2.11.
EXERCISE 1.6.3. Let y be the random variable consisting of the number of spots that face up upon
rolling a die. Give the distribution of y. Find the expected value, variance, and standard deviation of
y.
EXERCISE 1.6.4. Consider your letter grade for this course. Obviously, it is a random phenomenon.
Define the ‘grade point’ random variable: y(A) = 4, y(B) = 3, y(C) = 2, y(D) = 1,
y(F) = 0. If you were lucky enough to be taking the course from me, you would find that I am
an easy grader. I give 5% As, 10% Bs, 35% Cs, 30% Ds, and 20% Fs. I also assign grades at ran-
dom, that is to say, my tests generate random scores. Give the distribution of y. Find the expected
value, variance, and standard deviation of the grade points a student would earn in my class. (Just
in case you hadn’t noticed, I’m being sarcastic.)
EXERCISE 1.6.5. Referring to Exercise 1.6.4, supposing I have a class of 40 students, what is the
joint distribution for the numbers of students who get each of the five grades? Note that we are no
longer looking at how many grade points an individual student might get, we are now counting how
many occurrences we observe of various events. What is the distribution for the number of students
who get Bs? What is the expected value of the number of students who get Cs? What is the variance
and standard deviation of the number of students who get Cs? What is the probability that in a class
of 5 students, 1 gets an A, 2 get Cs, 1 gets a D, and 1 fails?
EXERCISE 1.6.6. Graph the function f(x) = 1 if 0 < x < 1 and f(x) = 0 otherwise. This is known
as the uniform density on (0, 1). If we use this curve to define a probability function, what is the
probability of getting an observation larger than 1/4? Smaller than 2/3? Between 1/3 and 7/9?
EXERCISE 1.6.7. Arthritic ex-football players prefer their laudanum made with Old Pain-Killer
Scotch by two to one. If we take a random sample of 5 arthritic ex-football players, what is the
distribution of the number who will prefer Old Pain-Killer? What is the probability that only 2 of
the ex-players will prefer Old Pain-Killer? What is the expected number who will prefer Old Pain-
Killer? What are the variance and standard deviation of the number who will prefer Old Pain-Killer?
EXERCISE 1.6.8. Let W ∼ Bin(N, p) and for i = 1, . . . , N take independent yi s that are Bin(1, p).
Argue that W has the same distribution as y1 + · · · + yN . Use this fact, along with Exercise 1.6.1 and
Proposition 1.2.11, to find E(W ) and Var(W ).
EXERCISE 1.6.9. Appendix B.1 gives probabilities for a family of distributions that all look
roughly like Figure 1.1. All members of the family are symmetric about zero and the members are
distinguished by having different numbers of degrees of freedom (df ). They are called t distribu-
tions. For 0 ≤ α ≤ 1, the α percentile of a t distribution with df degrees of freedom is the point x
such that Pr[t(df ) ≤ x] = α . For example, from Table B.1 the row corresponding to df = 10 and the
column for the .90 percentile tells us that Pr[t(10) ≤ 1.372] = .90.
(a) Find the .99 percentile of a t(7) distribution.
(b) Find the .975 percentile of a t(50) distribution.
(c) Find the probability that a t(25) is less than or equal to 3.450.
(d) Find the probability that a t(100) is less than or equal to 2.626.
(e) Find the probability that a t(16) is greater than 2.92.
(f) Find the probability that a t(40) is greater than 1.684.
(g) Recalling that t distributions are symmetric about zero, what is the probability that a t(40) dis-
tribution is less than −1.684?
(h) What is the probability that a t(40) distribution is between −1.684 and 1.684?
(i) What is the probability that a t(25) distribution is less than −3.450?
(j) What is the probability that a t(25) distribution is between −3.450 and 3.450?
EXERCISE 1.6.10. Consider a random variable that takes on the values 25, 30, 45, and 50 with
probabilities .15, .25, .35, and .25, respectively. Find the expected value, variance, and standard
deviation of this random variable.
EXERCISE 1.6.11. Consider three independent random variables X, Y, and Z. Suppose E(X) =
25, E(Y) = 40, and E(Z) = 55 with Var(X) = 4, Var(Y) = 9, and Var(Z) = 25.
(a) Find E(2X + 3Y + 10) and Var(2X + 3Y + 10).
(b) Find E(2X + 3Y + Z + 10) and Var(2X + 3Y + Z + 10).
EXERCISE 1.6.12. As of 1994, Duke University had been in the final four of the NCAA’s national
basketball championship tournament seven times in nine years. Suppose their appearances were
independent and that they had a probability of .25 for winning the tournament in each of those
years.
(a) What is the probability that Duke would win two national championships in those seven appear-
ances?
(b) What is the probability that Duke would win three national championships in those seven ap-
pearances?
(c) What is the expected number of Duke championships in those seven appearances?
(d) What is the variance of the number of Duke championships in those seven appearances?
EXERCISE 1.6.13. Graph the function f(x) = 2x if 0 < x < 1 and f(x) = 0 otherwise. If we use
this curve to define a probability function, what is the probability of getting an observation larger
than 1/4? Smaller than 2/3? Between 1/3 and 7/9?
EXERCISE 1.6.14. A pizza parlor makes small, medium, and large pizzas. Over the years they
make 20% small pizzas, 35% medium pizzas, and 45% large pizzas. On a given Tuesday night they
were asked to make only 10 pizzas. If the orders were independent and representative of the long-
term percentages, what is the probability that the orders would be for four small, three medium, and
three large pizzas? On such a night, what is the expected number of large pizzas to be ordered and
what is the expected number of small pizzas to be ordered? What is the variance of the number of
large pizzas to be ordered and what is the variance of the number of medium pizzas to be ordered?
EXERCISE 1.6.15. When I order a limo, 65% of the time the driver is male. Assuming independence,
what is the probability that 6 of my next 8 drivers are male? What is the expected number of
male drivers among my next eight? What is the variance of the number of male drivers among my
next eight?
EXERCISE 1.6.16. When I order a limo, 65% of the time the driver is clearly male, 30% of
the time the driver is clearly female, and 5% of the time the gender of the driver is indeterminate.
Assuming independence, what is the probability that among my next 8 drivers 5 are clearly male
and 3 are clearly female? What is the expected number of indeterminate drivers among my next
eight? What is the variance of the number of clearly female drivers among my next eight?
Chapter 2
One Sample
In this chapter we examine the analysis of a single random sample consisting of n independent
observations from some population.
EXAMPLE 2.1.1. Consider the dropout rate from a sample of math classes at the University of
New Mexico as reported by Koopmans (1987). The data are
5, 22, 10, 12, 8, 17, 2, 25, 10, 10, 7, 7, 40, 7, 9, 17, 12, 12, 1,
13, 10, 13, 16, 3, 14, 17, 10, 10, 13, 59, 11, 13, 5, 12, 14, 3, 14, 15.
This list of n = 38 observations is not very illuminating. A graphical display of the numbers is
more informative. Figure 2.1 plots the data above a single axis. This is often called a dot plot. From
Figure 2.1, we see that most of the observations are between 0 and 18. There are two conspicuously
large observations. Going back to the original data we identify these as the values 40 and 59. In
particular, these two outlying values strongly suggest that the data do not follow a bell-shaped curve
and thus that the data do not follow a normal distribution. ✷
Typically, for one sample of data we assume that the n observations are
Data                     Distribution
y1, y2, . . . , yn       independent N(μ, σ²)
The key assumptions are that the observations are independent and have the same distribution. In
particular, we assume they have the same (unknown) mean μ and the same (unknown) variance σ².
These assumptions of independence and a constant distribution should be viewed as only useful
approximations to actual conditions. Often the most valuable approach to evaluating these assump-
tions is simply to think hard about whether they are reasonable. In any case, the conclusions we
reach are only as good as the assumptions we have made. The only way to be positive that these
assumptions are true is if we arrange for them to be true. If we have a fixed finite population and take
a random sample from the population allowing elements of the population to be observed more than
once, then the assumptions (other than normality) are true. In Example 2.1.1, if we had the dropout
rates for all math classes in the year and randomly selected these 38 while allowing for classes to
appear more than once in the sample, the assumptions of independence with the same distribution
are satisfied.
[Figure 2.1: Dot plot of the dropout rate data, with axis ticks at 0, 12, 24, 36, 48, and 60.]
The ideal conditions of independent sampling from a fixed population are difficult to achieve.
Many populations refuse to hold still while we sample them. For example, the population of students
at a large university changes almost continuously (during working hours). To my way of thinking,
the populations associated with most interesting data are virtually impossible to define unambigu-
ously. Who really cares about the dropout rates? As such, they can only be used to fix blame. Our
real interest is in what the data can tell us about current and future dropout rates. If the data are
representative of current or future conditions, the data can be used to fix problems. For example,
one might find out whether certain instructors generate huge dropout rates, and avoid taking classes
from them. Perhaps the large dropout rates are because the instructor is more demanding. You might
want to seek out such a class. It is difficult to decide whether these or any data are representative of
current or future conditions because we cannot possibly know the future population and we cannot
practically know the current population. As mentioned earlier, often our best hope is to think hard
about whether these data approximate independent observations from the population of interest.
Even when sampling from a fixed population, we use approximations. In practice we rarely
allow elements of a fixed population to be observed more than once in a sample. This invalidates
the assumptions. If the first sampled element is eliminated, the second element is actually being
sampled from a different population than the first. (One element has been eliminated.) Fortunately,
when the sample contains a small proportion of the fixed population, the standard assumptions make
a good approximation. Moreover, the normal distribution is never more than an approximation to a
fixed population. The normal distribution has an infinite number of possible outcomes, while fixed
populations are finite. Often, the normal distribution makes a good approximation, especially if we
do our best to validate it. In addition, the assumption of a normal distribution is only used when
drawing conclusions from small samples. For large samples we can get by without the assumption
of normality.
Our primary objective is to draw conclusions about the mean μ. We condense the data into
summary statistics. These are the sample mean, the sample variance, and the sample standard deviation.
The sample mean has the algebraic formula
$$\bar{y}_\cdot \equiv \frac{1}{n} \sum_{i=1}^{n} y_i = \frac{1}{n} [y_1 + y_2 + \cdots + y_n]$$
where the · in ȳ· indicates that the mean is obtained by averaging the yi s over the subscript i. The
sample mean ȳ· estimates the population mean μ. The sample variance is an estimate of the
population variance σ². The sample variance is essentially the average squared distance of the observations
from the sample mean,
$$s^2 \equiv \frac{1}{n - 1} \sum_{i=1}^{n} (y_i - \bar{y}_\cdot)^2 = \frac{1}{n - 1} \left[ (y_1 - \bar{y}_\cdot)^2 + (y_2 - \bar{y}_\cdot)^2 + \cdots + (y_n - \bar{y}_\cdot)^2 \right]. \tag{2.1.1}$$
The sample standard deviation is just the square root of the sample variance,
$$s \equiv \sqrt{s^2}.$$
For the dropout rate data of Example 2.1.1, the sample mean is
$$\bar{y}_\cdot = \frac{5 + 22 + 10 + 12 + 8 + \cdots + 3 + 14 + 15}{38} = 13.105.$$
2.1 EXAMPLE AND INTRODUCTION 29
If we think of these data as a sample from the fixed population of math dropout rates, ȳ· is obviously
an estimate of the simple average of all the dropout rates of all the classes in that academic year.
Equivalently, ȳ· is an estimate of the expected value for the random variable defined as the dropout
rate obtained when we randomly select one class from the fixed population. Alternatively, we may
interpret ȳ· as an estimate of the mean of some population that is more interesting but less well
defined than the fixed population of math dropout rates.
The sample variance is
$$s^2 = \frac{(5 - 13.105)^2 + (22 - 13.105)^2 + \cdots + (14 - 13.105)^2 + (15 - 13.105)^2}{38 - 1} = 106.42.$$
This estimates the variance of the random variable obtained when randomly selecting one class from
the fixed population. The sample standard deviation is
$$s = \sqrt{106.42} = 10.32.$$
✷
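These summary statistics are easy to reproduce by computer. The sketch below uses Python's standard `statistics` module, whose `variance` and `stdev` functions divide by n − 1 as in (2.1.1); the variable names are illustrative.

```python
from statistics import mean, stdev, variance

# Dropout rate data from Example 2.1.1.
dropout = [5, 22, 10, 12, 8, 17, 2, 25, 10, 10, 7, 7, 40, 7, 9, 17, 12, 12, 1,
           13, 10, 13, 16, 3, 14, 17, 10, 10, 13, 59, 11, 13, 5, 12, 14, 3, 14, 15]

n = len(dropout)
ybar = mean(dropout)      # sample mean
s2 = variance(dropout)    # sample variance with n - 1 in the denominator
s = stdev(dropout)        # sample standard deviation

# Formula (2.1.1) written out directly, for comparison.
s2_direct = sum((y - ybar)**2 for y in dropout) / (n - 1)

print(n, round(ybar, 3), round(s2, 2), round(s, 2))
```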
The only reason s² is not the average squared distance of the observations from the sample mean
is that the denominator in (2.1.1) is n − 1 instead of n. If μ were known, a better estimate of the
population variance σ² would be
$$\hat{\sigma}^2 \equiv \sum_{i=1}^{n} (y_i - \mu)^2 / n. \tag{2.1.2}$$
In s², we have used ȳ· to estimate μ. Not knowing μ, we know less about the population, so s² cannot
be as good an estimate as σ̂². The quality of a variance estimate can be measured by the number of
observations on which it is based; σ̂² makes full use of all n observations for estimating σ². In using
s², we lose the functional equivalent of one observation for having estimated the parameter μ. Thus
s² has n − 1 in the denominator of (2.1.1) and is said to have n − 1 degrees of freedom. In nearly all
problems that we will discuss, there is one degree of freedom available for every observation. The
degrees of freedom are assigned to various estimates and we will need to keep track of them.
The statistics ȳ· and s² are estimates of μ and σ², respectively. The Law of Large Numbers is a
mathematical result implying that for large sample sizes n, ȳ· gets arbitrarily close to μ and s² gets
arbitrarily close to σ².
Both ȳ· and s² are computed from the random observations yi. The summary statistics are
functions of random variables, so they must also be random. Each has a distribution and to draw
conclusions about the unknown parameters μ and σ² we need to know the distributions. In particular, if
the original data are normally distributed, the sample mean has the distribution
$$\bar{y}_\cdot \sim N\left(\mu, \frac{\sigma^2}{n}\right)$$
or equivalently,
$$\frac{\bar{y}_\cdot - \mu}{\sqrt{\sigma^2/n}} \sim N(0, 1); \tag{2.1.3}$$
see Exercise 1.6.2. In Subsection 1.2.4 we established that E(ȳ·) = μ and Var(ȳ·) = σ²/n, so the
only new claim made here is that the sample mean computed from independent, identically
distributed (iid) normal random variables is again normally distributed. Actually, this is a special case
of the earlier claim that any linear combination of independent normals is again normal.
Moreover, the Central Limit Theorem is a mathematical result stating that the normal distribution for ȳ·
is approximately true for ‘large’ sample sizes n, regardless of whether the original data are normally
distributed.
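A small simulation can make the Central Limit Theorem claim concrete. The sketch below is an illustration under stated assumptions, not a proof: it draws clearly non-normal data (an exponential distribution with μ = 1 and σ² = 1, chosen here purely for illustration), standardizes each sample mean as in (2.1.3), and checks how often the result falls within ±1.96.

```python
import random
from math import sqrt

def standardized_means(n, reps, seed=0):
    """Simulate (ybar - mu) / sqrt(sigma^2 / n) for exponential data (mu = sigma^2 = 1)."""
    rng = random.Random(seed)
    results = []
    for _ in range(reps):
        ybar = sum(rng.expovariate(1.0) for _ in range(n)) / n
        results.append((ybar - 1.0) / sqrt(1.0 / n))
    return results

z = standardized_means(n=30, reps=5000)
avg = sum(z) / len(z)
var = sum((v - avg)**2 for v in z) / (len(z) - 1)
inside = sum(abs(v) < 1.96 for v in z) / len(z)
print(avg, var, inside)  # compare with 0, 1, and about .95 for a N(0, 1)
```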
Figure 2.2: Three distributions: solid, N(0, 1); long dashes, t(1); short dashes, t(3).
As we will see below, the distributions given earlier are only useful in drawing conclusions
about data when σ² is known. Generally, we will need to estimate σ² with s² and proceed as best
we can. By the law of large numbers, s² becomes arbitrarily close to σ², so for large samples we can
substitute s² for σ² in the distributions above. In other words, for large samples the approximation
$$\frac{\bar{y}_\cdot - \mu}{\sqrt{s^2/n}} \sim N(0, 1) \tag{2.1.4}$$
holds regardless of whether the data were originally normal.
For small samples we cannot rely on s² being close to σ², so we fall back on the assumption that
the original data are normally distributed. For normally distributed data, the appropriate distribution
is called a t distribution with n − 1 degrees of freedom. In particular,
$$\frac{\bar{y}_\cdot - \mu}{\sqrt{s^2/n}} \sim t(n - 1). \tag{2.1.5}$$
The t distribution is similar to the standard normal but more spread out; see Figure 2.2. It only makes
sense that if we need to estimate σ² rather than knowing it, our conclusions will be less exact. This
is reflected in the fact that the t distribution is more spread out than the N(0, 1). In the previous
paragraph we argued that for large n the appropriate distribution is
$$\frac{\bar{y}_\cdot - \mu}{\sqrt{s^2/n}} \sim N(0, 1).$$
We are now arguing that for normal data the appropriate distribution is t(n − 1). It had better be
the case (and is) that for large n the N(0, 1) distribution is approximately the same as the t(n − 1)
distribution. In fact, we define t(∞) to be a N(0, 1) distribution where ∞ indicates an infinitely
large number.
Definition 2.1.3. A t distribution is the distribution obtained when a random variable with a
N(0, 1) distribution is divided by an independent random variable that is the square root of a χ²
random variable over its degrees of freedom. The t distribution has the same degrees of freedom as
the chi-squared.
In particular, [ȳ· − μ]/√(σ²/n) is N(0, 1), √([(n − 1)s²/σ²]/(n − 1)) is the square root of a
chi-squared random variable over its degrees of freedom, and the two are independent because ȳ· and
s² are independent, so
$$\frac{\bar{y}_\cdot - \mu}{\sqrt{s^2/n}} = \frac{[\bar{y}_\cdot - \mu] / \sqrt{\sigma^2/n}}{\sqrt{[(n - 1)s^2/\sigma^2] / (n - 1)}} \sim t(n - 1).$$
The t distribution has the same degrees of freedom as the estimate of σ²; this is typically the case
in other applications.
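[Figure 2.3: A density symmetric about 0 with the point t(1 − α, df) marked; the area to its left is 1 − α and the area to its right is α.]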
Recall that for large samples from a normal population, it is largely irrelevant whether we use the
standard normal or the t distribution because they are essentially the same. In the unrealistic case
where σ² is known we do not need to estimate it, so we use √(σ²/n) instead of √(s²/n) for the
standard error. In this case, the appropriate distribution is N(0, 1) as in (2.1.3) if either the original
data are normal or the sample size is large.
We need notation for the percentage points of the known distribution and we need a name for
the point that cuts off the top α of the distribution. Typically, we need to find points that cut off the
top 5%, 2.5%, 1%, or 0.5% of the distribution, so α is 0.05, 0.025, 0.01, or 0.005. As discussed
in the previous paragraph, the appropriate distribution depends on various circumstances of the
problem, so we begin by discussing percentage points with a generic notation. We use the notation
t(1 − α , df ) for the point that cuts off the top α of the distribution. Figure 2.3 displays this idea
graphically for a value of α between 0 and 0.5. The distribution is described by the curve, which is
symmetric about 0. t(1 − α , df ) is indicated along with the fact that the area under the curve to the
right of t(1 − α, df) is α. Formally the point that cuts off the top α of the distribution is t(1 − α, df)
where
$$\Pr\left[\frac{\bar{y}_\cdot - \mu}{\mathrm{SE}(\bar{y}_\cdot)} > t(1 - \alpha, df)\right] = \alpha.$$
Note that the same point t(1 − α, df) also cuts off the bottom 1 − α of the distribution, i.e.,
$$\Pr\left[\frac{\bar{y}_\cdot - \mu}{\mathrm{SE}(\bar{y}_\cdot)} < t(1 - \alpha, df)\right] = 1 - \alpha.$$
This is illustrated in Figure 2.3 by the fact that the area under the curve to the left of t(1 − α , df )
is 1 − α . The reason the point is labeled t(1 − α , df ) is because it cuts off the bottom 1 − α of the
distribution. The labeling depends on the percentage to the left even though our interest is in the
percentage to the right.
There are at least three different ways to label these percentage points; I have simply used the
one I feel is most consistent with general usage in Probability and Statistics. The key point however
is to be familiar with Figure 2.3. We need to find points that cut off a fixed percentage of the area
under the curve. As long as we can find such points, what we call them is irrelevant. Ultimately,
anyone doing Statistics will need to be familiar with all three methods of labeling. One method of
labeling is in terms of the area to the left of the point; this is the one we will use. A second method
is labeling in terms of the area to the right of the point; thus the point we call t(1 − α , df ) could be
labeled, say, Q(α , df ). The third method is to call this number, say, W (2α , df ), where the area to
the right of the point is doubled in the label. For example, if the distribution is a N(0, 1) = t(∞),
the point that cuts off the bottom 97.5% of the distribution is 1.96. This point also cuts off the top
2.5% of the area. It makes no difference if we refer to 1.96 as the number that cuts off the bottom
97.5%, t(0.975, ∞), or as the number that cuts off the top 2.5%, Q(0.025, ∞), or as the number
W (0.05, ∞) where the label involves 2 × 0.025; the important point is being able to identify 1.96
as the appropriate number. Henceforth, we will always refer to points in terms of t(1 − α , df ), the
point that cuts off the bottom 1 − α of the distributions. No further reference to the alternative
labelings will be made but all three labels are used in Appendix B.1. There t(1 − α , df )s are labeled
as percentiles and, for reasons related to statistical tests, Q(α , df )s and W (2α , df )s are labeled as
one-sided and two-sided α levels, respectively.
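As a small illustration of the three labelings, the following Python sketch uses the standard normal, t(∞) = N(0, 1), from the standard library. The function names `percentile_label`, `upper_tail_label`, and `two_sided_label` are hypothetical; they are introduced here only to mirror t(1 − α, ∞), Q(α, ∞), and W(2α, ∞).

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal, i.e., t(infinity)

def percentile_label(p):
    """t(p, inf): the point with area p to its left."""
    return Z.inv_cdf(p)

def upper_tail_label(alpha):
    """Q(alpha, inf): the point with area alpha to its right."""
    return Z.inv_cdf(1 - alpha)

def two_sided_label(two_alpha):
    """W(2*alpha, inf): the label doubles the area to the right of the point."""
    return Z.inv_cdf(1 - two_alpha / 2)

# All three labels should identify the same number, roughly 1.96.
points = (percentile_label(0.975), upper_tail_label(0.025), two_sided_label(0.05))
print([round(x, 2) for x in points])
```

Whatever the label, the quantity of interest is the single cutoff point; the three functions differ only in how the area is described.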
A fundamental assumption in our inference about μ is that the distribution of [ȳ· − μ ]/SE(ȳ· )
is symmetric about 0. By the symmetry around zero, if t(1 − α , df ) cuts off the top α of the distri-
bution, −t(1 − α , df ) must cut off the bottom α of the distribution. Thus for distributions that are
symmetric about 0 we have t(α , df ), the point that cuts off the bottom α of the distribution, equal
to −t(1 − α , df ). This fact is illustrated in Figure 2.4. Algebraically, we write
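\[ \Pr\left[ \frac{\bar{y}_\cdot - \mu}{\mathrm{SE}(\bar{y}_\cdot)} < -t(1-\alpha, df) \right] = \Pr\left[ \frac{\bar{y}_\cdot - \mu}{\mathrm{SE}(\bar{y}_\cdot)} < t(\alpha, df) \right] = \alpha . \]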
Frequently, we want to create a central interval that contains a specified probability, say 1 − α .
Figure 2.5 illustrates the construction of such an interval. Algebraically, the middle interval with
probability 1 − α is obtained by
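\[ \Pr\left[ -t\left(1-\frac{\alpha}{2}, df\right) < \frac{\bar{y}_\cdot - \mu}{\mathrm{SE}(\bar{y}_\cdot)} < t\left(1-\frac{\alpha}{2}, df\right) \right] = 1 - \alpha . \]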
[Figure 2.5: Curve symmetric about 0, with area α/2 in each tail and area 1 − α in the middle.]
Percentiles of the t distribution are given in Appendix B.1 with the ∞ row giving percentiles of
the N(0, 1) distribution.
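The symmetry relation t(α, df) = −t(1 − α, df) and the central interval can be checked numerically. The sketch below is only an illustration for the t(∞) = N(0, 1) case, using Python's standard library; the variable names are hypothetical.

```python
from statistics import NormalDist

Z = NormalDist()  # t(infinity) = N(0, 1)
alpha = 0.05

upper = Z.inv_cdf(1 - alpha)   # t(1 - alpha, inf), cuts off the top alpha
lower = Z.inv_cdf(alpha)       # t(alpha, inf), cuts off the bottom alpha
symmetric = abs(lower + upper) < 1e-9   # t(alpha, inf) = -t(1 - alpha, inf)

# Central interval: cut off alpha/2 in each tail.
c = Z.inv_cdf(1 - alpha / 2)   # t(1 - alpha/2, inf)
central_prob = Z.cdf(c) - Z.cdf(-c)

print(symmetric, round(c, 3), round(central_prob, 3))
```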
EXAMPLE 2.2.1. For the dropout rate data, we might be interested in the hypothesis that the
true dropout rate is 10%. Thus the null hypothesis is H0 : μ = 10. The other assumptions were
discussed at the beginning of the chapter. They include such things as independence, normality, and
all observations having the same mean and variance. While we can never confirm that these other
assumptions are absolutely valid, it is a key aspect of modern statistical practice to validate the
assumptions as far as is reasonably possible. When we are convinced that the other assumptions are
reasonably valid, data that contradict the assumptions can be reasonably interpreted as contradicting
the specific assumption H0 . ✷
The test is based on all the assumptions including H0 being true and we check to see if the data
are inconsistent with those assumptions. The idea is much like the idea of a proof by contradiction.
We assume a model that includes the assumption H0 . If the data contradict that model, we can con-
clude that something is wrong with the model. If we can satisfy ourselves that all of the assumptions
other than the assumption H0 are true, and we have data that are inconsistent with the model, then
H0 must be false. If the data do not contradict the H0 model, we can only conclude that the data
are consistent with the assumptions. We can never conclude that the assumptions are true. Unfortu-
nately, data almost never yield an absolute contradiction to the null model. We need to quantify the
extent to which the data are inconsistent with the null model.
We need to be able to identify data that are inconsistent with the null model. Under the assumptions
that the data are independent with common mean and variance, with either normal distributions
or a large sample and with μ = m0 , the statistic (ȳ· − m0 )/√(s²/n) has an approximate t(n − 1)
distribution with density as illustrated in Figures 2.2–2.5. From those illustrations, the least likely
observations to occur under a t(n − 1) distribution are those that are far from 0. Thus, values of ȳ· far
from m0 make us question the validity of the null model.
We reject the null model if the test statistic is too far from zero, that is, if
\[ \frac{\bar{y}_\cdot - m_0}{\mathrm{SE}(\bar{y}_\cdot)} \]
is greater than some positive cutoff value or less than some negative cutoff value. Very large and
very small (large negative) values of the test statistic are those that are most inconsistent with the
model that includes μ = m0 .
The problem is in specifying the cutoff values. For example, we do not want to reject μ = 10
if the data are consistent with μ = 10. One of our basic assumptions is that we know the distribu-
tion of [ȳ· − μ ]/SE(ȳ· ). Thus if H0 : μ = 10 is true, we know the distribution of the test statistic
[ȳ· − 10]/SE(ȳ· ), so we know what kind of data are consistent with the μ = 10 model. For instance,
when μ = 10, 95% of the possible values of [ȳ· − 10]/SE(ȳ· ) are between −t(0.975, n − 1) and
t(0.975, n − 1). Any values of [ȳ· − 10]/SE(ȳ· ) that fall between these numbers are reasonably con-
sistent with μ = 10 and values outside the interval are defined as being inconsistent with μ = 10.
Thus values of [ȳ· − 10]/SE(ȳ· ) greater than t(0.975, n − 1) or less than −t(0.975, n − 1) cause us to
reject the null model. Note that we arbitrarily specified the central 95% of the distribution as being
consistent with the μ = 10 model, as opposed to the central 99% or central 90%. We get to pick our
criterion for what is consistent with the null model.
EXAMPLE 2.2.2. For the dropout rate data, consider the null hypothesis H0 : μ = 10, i.e., that
the mean dropout rate is 10%. These data are not normal, so we must hope that the sample size is
large enough to justify use of the t distribution. Mathematically, large n suggests a t(∞) = N(0, 1)
distribution, but we consider the t(n − 1) to be a better approximate distribution. If we choose a
central 90% interval, then the probability of being outside the central interval is α = 0.10, and the
upper cutoff value is t(1 − α/2, 37) = t(0.95, 37) = 1.687.
The α = 0.10 level test for the model incorporating H0 : μ = 10 is to reject the null model if
\[ \frac{\bar{y}_\cdot - 10}{s/\sqrt{38}} > 1.687, \]
or if
\[ \frac{\bar{y}_\cdot - 10}{s/\sqrt{38}} < -1.687. \]
The estimate of μ is ȳ· = 13.105 and the observed standard error is s/√n = 10.32/√38 = 1.673,
so the observed value of the test statistic is
\[ t_{obs} \equiv \frac{13.105 - 10}{1.673} = 1.856 . \]
Comparing this to the cutoff value of 1.687 we have 1.856 > 1.687, so the null model is rejected.
There is evidence at the α = 0.10 level that the model with mean dropout rate of 10% is incorrect.
In fact, since ȳ· = 13.105 > 10, if the assumptions other than H0 are correct, there is the suggestion
that the dropout rate is greater than 10%.
This conclusion depends on the choice of the α level. If we choose α = 0.05, then the appropri-
ate cutoff value is t(0.975, 37) = 2.026. Since the observed value of the test statistic is 1.856, which
is neither greater than 2.026 nor less than −2.026, we do not reject the null model. When we do not
reject the H0 model, we cannot say that the true mean dropout rate is 10%, but we can say that, at
the α = 0.05 level, the data are consistent with the (null) model that has a true mean dropout rate of
10%. ✷
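The arithmetic of Example 2.2.2 can be sketched in Python. This is an illustration under stated assumptions, not the book's computing method: the helper names `t_pdf`, `t_cdf`, and `t_point` are hypothetical, and the percentage points are obtained by numerically integrating the t density (Simpson's rule) and bisecting, using only the standard library. A statistics package would normally supply these values directly.

```python
import math

def t_pdf(x, df):
    """Density of the t(df) distribution."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    return math.exp(log_c) / math.sqrt(df * math.pi) * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df, steps=1000):
    """Area to the left of x: Simpson's rule on [0, |x|] plus symmetry about 0."""
    a = abs(x)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area

def t_point(p, df):
    """t(p, df): the point with area p to its left, found by bisection."""
    lo, hi = -100.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Summary statistics quoted in Example 2.2.2.
ybar, se, n, m0 = 13.105, 1.673, 38, 10.0
t_obs = (ybar - m0) / se

def rejects(alpha):
    """Two-sided alpha-level test of the null model with H0: mu = m0."""
    return abs(t_obs) > t_point(1 - alpha / 2, n - 1)

print(round(t_obs, 3), rejects(0.10), rejects(0.05))
```

The same data reject at α = 0.10 but not at α = 0.05, which is the dependence on the α level described in the example.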
Generally, a test of significance is based on an α level that indicates how unusual the data
are relative to the assumptions of the null model. The α -level test for the model that incorporates
H0 : μ = m0 is to reject the null model if
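\[ \frac{\bar{y}_\cdot - m_0}{\mathrm{SE}(\bar{y}_\cdot)} > t\left(1-\frac{\alpha}{2}, n-1\right) \]
or if
\[ \frac{\bar{y}_\cdot - m_0}{\mathrm{SE}(\bar{y}_\cdot)} < -t\left(1-\frac{\alpha}{2}, n-1\right) . \]
This is equivalent to saying, reject H0 if
\[ \frac{|\bar{y}_\cdot - m_0|}{\mathrm{SE}(\bar{y}_\cdot)} > t\left(1-\frac{\alpha}{2}, n-1\right) . \]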
Also note that we are rejecting the H0 model for those values of [ȳ· − m0 ]/SE(ȳ· ) that are most
inconsistent with the t(n − 1) distribution, those being the values of the test statistic with large
absolute values.
In significance testing, a null model should never be accepted; it is either rejected or not rejected.
A better way to think of a significance test is that one concludes that the data are either consistent or
inconsistent with the null model. The statement that the data are inconsistent with the H0 model is a
strong statement. It suggests in some specified degree that something is wrong with the H0 model.
The statement that the data are consistent with H0 is not a strong statement; it does not suggest the
H0 model is true. For example, the dropout data happen to be consistent with H0 : μ = 12; the test
statistic
\[ \frac{\bar{y}_\cdot - 12}{\mathrm{SE}(\bar{y}_\cdot)} = \frac{13.105 - 12}{1.673} = 0.66 \]
is quite small. However, the data are equally consistent with μ = 12.00001. These data cannot
possibly indicate that μ = 12 rather than μ = 12.00001. In fact, we established earlier that based on
an α = 0.05 test, these data are even consistent with μ = 10. Data that are consistent with the H0
model do not imply that the null model is correct.
With these data there is very little hope of distinguishing between μ = 12 and μ = 12.00001. The
probability of getting data that lead to rejecting H0 : μ = 12 when μ = 12.00001 is only just slightly
more than the probability of getting data that lead to rejecting H0 when μ = 12. The probability of
getting data that lead to rejecting H0 : μ = 12 when μ = 12.00001 is called the power of the test
when μ = 12.00001. The power is the probability of appropriately rejecting H0 and depends on the
particular value of μ (≠ 12). The fact that the power is very small for detecting μ = 12.00001 is not
much of a problem because no one would really care about the difference between a dropout rate of
12 and a dropout rate of 12.00001. However, a small power for a difference that one cares about is a
major concern. The power is directly related to the standard error and can be increased by reducing
the standard error. One natural way to reduce the standard error s/√n is by increasing the sample
size n. Of course this discussion of power presupposes that all assumptions in the model other than
H0 are correct.
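To get a feel for how power depends on μ and on n, here is a rough Python sketch. It is an approximation under assumptions that are not in the text: σ is treated as known and equal to the observed s = 10.32, so that N(0, 1) replaces t(n − 1), and the function name `approx_power` is hypothetical.

```python
import math
from statistics import NormalDist

Z = NormalDist()

def approx_power(mu, m0, sigma, n, alpha=0.05):
    """Approximate power of the two-sided alpha-level test of H0: mu = m0,
    treating sigma as known (an assumption of this sketch)."""
    se = sigma / math.sqrt(n)
    shift = (mu - m0) / se          # mean of the test statistic when mu is true
    cut = Z.inv_cdf(1 - alpha / 2)
    return (1 - Z.cdf(cut - shift)) + Z.cdf(-cut - shift)

p_null = approx_power(12.0, 12.0, 10.32, 38)        # equals alpha when mu = m0
p_tiny = approx_power(12.00001, 12.0, 10.32, 38)    # barely different from alpha
p_n38 = approx_power(15.0, 12.0, 10.32, 38)
p_n152 = approx_power(15.0, 12.0, 10.32, 152)       # four times the sample size

print(round(p_null, 4), round(p_tiny, 4), round(p_n38, 2), round(p_n152, 2))
```

Quadrupling n halves the standard error, which is why the power for detecting μ = 15 rises in this sketch while the power for μ = 12.00001 stays essentially at α.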
One of the difficulties in a general discussion of significance testing is that the actual null hy-
pothesis is always context specific. You cannot give general rules for what to use as a null hypothesis
because the null hypothesis needs to be some interesting claim about the population mean μ . When
you sample different populations, the population mean differs, and interesting claims about the pop-
ulation mean depend on the exact nature of the population. The best practice for setting up null
hypotheses is simply to look at lots of problems and ask yourself what claims about the population
mean are of interest to you. As we examine more sophisticated data structures, some interesting
hypotheses will arise from the structures themselves. For example, if we have two samples of sim-
ilar measurements we might be interested in testing the null hypothesis that they have the same
population means. Note that there are lots of ways in which the means could be different, but only
one way in which they can be the same. Of course if the specific context suggests that one mean
should be, say, 25 units greater than the other, we can use that as the null hypothesis. Similarly, if
we have a sample of objects and two different measurements on each object, we might be interested
in whether or not the measurements are related. In that case, an interesting null hypothesis is that the
measurements are not related. Again, there is only one way in which measurements can be unrelated
(independent), but there are many ways for measurements to display a relationship.
In practice, nobody actually uses the procedures just presented. These procedures require us to
pick specific values for m0 in H0 : μ = m0 and for α . In practice, one either picks an α level and
presents results for all values of m0 by giving a confidence interval, or one picks a value m0 and
presents results for all α levels by giving a P value.
2.2.2 Confidence intervals
For example, the value m0 is not rejected by an α = 0.05 level test if and only if m0 is within the
interval having endpoints ȳ· ± t(0.975, n − 1)SE(ȳ· ).
More generally, a (1 − α )100% confidence interval for μ is based on observing that an α -level
test of H0 : μ = m0 does not reject when
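\[ -t\left(1-\frac{\alpha}{2}, n-1\right) < \frac{\bar{y}_\cdot - m_0}{\mathrm{SE}(\bar{y}_\cdot)} < t\left(1-\frac{\alpha}{2}, n-1\right) \]
which is algebraically equivalent to
\[ \bar{y}_\cdot - t\left(1-\frac{\alpha}{2}, n-1\right)\mathrm{SE}(\bar{y}_\cdot) < m_0 < \bar{y}_\cdot + t\left(1-\frac{\alpha}{2}, n-1\right)\mathrm{SE}(\bar{y}_\cdot). \]
A proof of the algebraic equivalence is given in the appendix to the next chapter. The endpoints of
the interval can be written
\[ \bar{y}_\cdot \pm t\left(1-\frac{\alpha}{2}, n-1\right)\mathrm{SE}(\bar{y}_\cdot), \]
or, substituting the form of the standard error,
\[ \bar{y}_\cdot \pm t\left(1-\frac{\alpha}{2}, n-1\right)\frac{s}{\sqrt{n}} . \]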
The 1 − α confidence interval contains all the values of μ that are consistent with both the data and
the model as determined by an α -level test. Note that increasing the sample size n decreases the
standard error and thus makes the confidence interval narrower. Narrower confidence intervals give
more precise information about μ . In fact, by taking n large enough, we can make the confidence
interval arbitrarily narrow.
EXAMPLE 2.2.3. For the dropout rate data presented at the beginning of the chapter, the parameter
is the mean dropout rate for math classes, the estimate is ȳ· = 13.105, and the standard error is
s/√n = 10.32/√38 = 1.673. As seen in the dot plot, the original data are not normally distributed.
The plot looks nothing at all like the bell-shaped curve in Figure 1.1, which is a picture of a normal
distribution. Thus we hope that a sample of size 38 is sufficiently large to justify use of the central
limit theorem and the law of large numbers. We use the t(37) distribution as a small sample approx-
imation to the t(∞) = N(0, 1) distribution that is suggested by the mathematical results. For a 95%
confidence interval, 95 = (1 − α)100, 0.95 = 1 − α, α = 1 − 0.95 = 0.05, and 1 − α/2 = 0.975,
so the number we need from the t table is t(0.975, 37) = 2.026. The endpoints of the confidence
interval are
13.105 ± 2.026(1.673)
giving an interval of
(9.71, 16.50).
Rounding to simple numbers, we are 95% confident that the true dropout rate is between 10% and
16.5%, but only in the sense that these are the parameter values that are consistent with the data and
the model based on an α = 0.05 test. ✷
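The interval of Example 2.2.3, and its connection to testing, can be sketched in Python. As before, this is an illustration rather than the book's method: the helpers `t_pdf`, `t_cdf`, and `t_point` are hypothetical standard-library stand-ins for a t table, and the inputs are the rounded summary statistics quoted in the example, so the last digit of the endpoints may differ slightly from the published interval.

```python
import math

def t_pdf(x, df):
    """Density of the t(df) distribution."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    return math.exp(log_c) / math.sqrt(df * math.pi) * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df, steps=1000):
    """Area to the left of x: Simpson's rule on [0, |x|] plus symmetry about 0."""
    a = abs(x)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area

def t_point(p, df):
    """t(p, df): the point with area p to its left, found by bisection."""
    lo, hi = -100.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

ybar, se, n = 13.105, 1.673, 38   # rounded values quoted in Example 2.2.3

def confidence_interval(level):
    """Endpoints ybar -/+ t(1 - alpha/2, n - 1) * SE(ybar)."""
    alpha = 1 - level
    margin = t_point(1 - alpha / 2, n - 1) * se
    return ybar - margin, ybar + margin

def not_rejected(m0, alpha=0.05):
    """True when the alpha-level test of H0: mu = m0 does not reject."""
    return abs((ybar - m0) / se) < t_point(1 - alpha / 2, n - 1)

lower, upper = confidence_interval(0.95)
print(round(lower, 2), round(upper, 2), not_rejected(10.0), not_rejected(9.0))
```

Values of m0 inside the interval, such as 10, are not rejected by the α = 0.05 test, while values outside it, such as 9, are.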
Many people think that a 95% confidence interval for μ has a 95% probability of containing
the parameter μ . The definition of the confidence interval just given does not lend itself towards
that misinterpretation. There is another method of developing confidence intervals, one that has
never made any sense to me. This alternative development does lend itself to being misinterpreted
as a statement about the probability that the parameter is contained in the interval. Traditionally,
statisticians have worked very hard to correct this misinterpretation. Personally, I do not think the
misinterpretation does any real harm since it can be justified using arguments from Bayesian Statis-
tics.
2.2.3 P values
Rather than having formal rules for when to reject the null model, one can report the evidence
against the null model. This is done by reporting the significance level of the test, also known as
the P value. The P value is computed assuming that the null model including μ = m0 is true and
the P value is the probability of seeing data that are as weird or more weird than those that were
actually observed. In other words, it is the α level at which the null model would just barely not be rejected.
Remember, based on Figures 2.2 through 2.5, weird data are those that lead to t values that are far
from 0.
EXAMPLE 2.2.4. For H0 : μ = 10 the observed value of the test statistic is 1.856. Clearly, data that
give values of the test statistic that are greater than 1.856 are more weird than the actual data. Also,
by symmetry, data that give a test statistic of −1.856 are just as weird as data that yield a 1.856.
Finally, data that give values smaller than −1.856 are more weird than data yielding a statistic of
1.856. As before, we use the t(37) distribution. From an appropriate computer program,
Thus the approximate P value is 0.07. The P value is approximate because the use of the t(37)
distribution is an approximation based on large samples. Algebraically,
We can see from this that the P value corresponds to the α level of a test where H0 : μ = 10 would
just barely not be rejected. Thus, with a P value of 0.07, any test of H0 : μ = 10 with α > 0.07 will
reject the null model while any test with α ≤ 0.07 will not. In this case, 0.07 is less than 0.10, so
an α = 0.10 level test of the null model with H0 : μ = 10 will reject H0 . On the other hand, 0.07 is
greater than 0.05, so an α = 0.05 test does not reject the null model.
If you do not have access to a computer, rough P values can be determined from a t table.
Comparing |1.856| to the t tables of Appendix B.1, we see that
\[ 1.687 = t(0.95, 37) < |1.856| < t(0.975, 37) = 2.026. \]
In other words, t(0.95, 37) is the cutoff value for an α = 0.10 test and t(0.975, 37) is the cutoff
value for an α = 0.05 test; |1.856| falls between these values, so the P value is between 0.10 and
0.05. When only a t table is available, P values are most simply specified in terms of bounds such
as these. ✷
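A two-sided P value of this kind can be computed with a short Python sketch. The helpers `t_pdf` and `t_cdf` are hypothetical; they approximate the t(37) distribution by numerical integration with the standard library and merely stand in for the "appropriate computer program" mentioned in the example.

```python
import math

def t_pdf(x, df):
    """Density of the t(df) distribution."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    return math.exp(log_c) / math.sqrt(df * math.pi) * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df, steps=1000):
    """Area to the left of x: Simpson's rule on [0, |x|] plus symmetry about 0."""
    a = abs(x)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area

t_obs, df = 1.856, 37

# Data at least as weird as those observed give |t| >= |t_obs|;
# by symmetry the two tails have equal area.
upper_tail = 1 - t_cdf(abs(t_obs), df)
p_value = 2 * upper_tail

print(round(p_value, 2))
```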
The P value is a measure of the evidence against the null hypothesis in which the smaller the P
value the more evidence against H0 . The P value can be used to perform various α -level tests.
2.3 Prediction intervals
Theory
In this chapter we assume that the observations yi are independent from a population with mean
μ and variance σ 2 . We have assumed that all our previous observations on the process have been
independent, so it is reasonable to assume that the future observation y0 is independent of the previ-
ous observations with the same mean and variance. The prediction interval is actually based on the
difference y0 − ȳ· , i.e., we examine how far a new observation may reasonably be from our point
predictor. Note that
E(y0 − ȳ· ) = μ − μ = 0.
To proceed we need a standard error for y0 − ȳ· and a distribution that is symmetric about 0. The
standard error of y0 − ȳ· is just the standard deviation of y0 − ȳ· when available or, more often, an
estimate of the standard deviation. First we need to find the variance. As ȳ· is computed from the
previous observations, it is independent of y0 and, using Proposition 1.2.11,
\[ \mathrm{Var}(y_0 - \bar{y}_\cdot) = \mathrm{Var}(y_0) + \mathrm{Var}(\bar{y}_\cdot) = \sigma^2 + \frac{\sigma^2}{n} = \sigma^2\left(1 + \frac{1}{n}\right). \]
The standard deviation is the square root of the variance. Typically, σ 2 is unknown, so we estimate
it with s2 and our standard error becomes
\[ \mathrm{SE}(y_0 - \bar{y}_\cdot) = \sqrt{s^2 + \frac{s^2}{n}} = \sqrt{s^2\left(1 + \frac{1}{n}\right)} = s\sqrt{1 + \frac{1}{n}} . \]
For future reference, note that the first equality in this equation can be rewritten as
\[ \mathrm{SE}(y_0 - \bar{y}_\cdot) = \sqrt{s^2 + \mathrm{SE}(\bar{y}_\cdot)^2} . \]
To get an appropriate distribution, we assume that all the observations are normally distributed.
In this case,
\[ \frac{y_0 - \bar{y}_\cdot}{\mathrm{SE}(y_0 - \bar{y}_\cdot)} \sim t(n-1). \]
The validity of the t(n − 1) distribution is established in Exercise 2.8.10.
Using the distribution based on normal observations, a 99% prediction interval is obtained from
the following inequalities:
\[ -t(0.995, n-1) < \frac{y_0 - \bar{y}_\cdot}{\mathrm{SE}(y_0 - \bar{y}_\cdot)} < t(0.995, n-1) \]
which occurs if and only if
ȳ· − t(0.995, n − 1)SE(y0 − ȳ· ) < y0 < ȳ· + t(0.995, n − 1)SE(y0 − ȳ· ).
The key point is that the two sets of inequalities are algebraically equivalent. A 99% prediction
interval has endpoints
ȳ· ± t(0.995, n − 1)SE(y0 − ȳ· ).
This looks similar to a 99% confidence interval for μ but the standard error is very different. In the
prediction interval, the endpoints are
\[ \bar{y}_\cdot \pm t(0.995, n-1)\, s\sqrt{1 + \frac{1}{n}} , \]
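To see how different the two standard errors are, the following Python sketch compares a 99% confidence interval for μ with a 99% prediction interval. It is only an illustration: it reuses the summary values ȳ· = 13.105, s = 10.32, n = 38 from the dropout rate examples purely as convenient numbers (the t(n − 1) result assumes normal observations), and the helpers `t_pdf`, `t_cdf`, and `t_point` are hypothetical standard-library stand-ins for a t table.

```python
import math

def t_pdf(x, df):
    """Density of the t(df) distribution."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    return math.exp(log_c) / math.sqrt(df * math.pi) * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df, steps=1000):
    """Area to the left of x: Simpson's rule on [0, |x|] plus symmetry about 0."""
    a = abs(x)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area

def t_point(p, df):
    """t(p, df): the point with area p to its left, found by bisection."""
    lo, hi = -100.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

ybar, s, n = 13.105, 10.32, 38            # illustrative summary values

se_mean = s / math.sqrt(n)                # SE(ybar)
se_pred = s * math.sqrt(1 + 1 / n)        # SE(y0 - ybar)
t99 = t_point(0.995, n - 1)               # t(0.995, n - 1)

conf_int = (ybar - t99 * se_mean, ybar + t99 * se_mean)
pred_int = (ybar - t99 * se_pred, ybar + t99 * se_pred)

print([round(v, 2) for v in conf_int], [round(v, 2) for v in pred_int])
```

The prediction interval is much wider because SE(y0 − ȳ·) is never smaller than s, no matter how large n becomes.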
2.4 Model testing
We return to the subject of testing hypotheses about μ but now we use model-based tests. If we were
only going to perform tests on the mean of one sample, there would be little point in introducing
this alternative test procedure, but testing models works in many situations where testing a single
parameter is difficult. Moreover, model testing can provide tests of more than one parameter. The
focus of this section is to introduce model-based tests and to show the relationship between para-
metric tests and model-based tests for hypotheses about the mean μ . Throughout, we have assumed
that the process of generating the data yields independent observations from some population. In
quality control circles this is referred to as having a process that is under statistical control.
Model-based tests depend on measures of how well different models explain the data. For many
problems, we use variance estimates to quantify how well a model explains the data. A better
explanation will lead to a smaller variance estimate. For one-sample problems, the variance estimates
we will use are s2 as defined in Equation (2.1.1) and σ̂ 2 as defined in (2.1.2). Recall that σ̂ 2 is the
variance estimate used when μ is known.
Under the one-sample model with μ unknown, our variance estimate is s2 . Under the one-sample
model with the null hypothesis H0 : μ = m0 assumed to be true, the variance estimate is
\[ \hat{\sigma}_0^2 \equiv \frac{1}{n} \sum_{i=1}^{n} (y_i - m_0)^2 . \]
If the null model is true, the two variance estimates should be about the same. If the two variance
estimates are different, it suggests that something is wrong with the null model. One way to evaluate
whether the estimates are about the same is to evaluate whether σ̂₀²/s² is about 1.
Actually, it is not common practice to compare the two variance estimates σ̂02 and s2 directly.
Typically, one rewrites the variance estimate from the null model σ̂02 as a weighted average of the
more general estimate s2 and something else. This something else will also be an estimate of σ 2
when the null model is true; see Chapter 3. It turns out that in a one-sample problem, the something
else has a particularly nice form. The formula for the weighted average turns out to be
\[ \hat{\sigma}_0^2 = \frac{n-1}{n}\, s^2 + \frac{1}{n}\, n(\bar{y}_\cdot - m_0)^2 . \]
This estimate has weight (n − 1)/n on s2 and weight 1/n on the something else, n(ȳ· − m0 )2 . When
the null model is true, n(ȳ· − m0 )2 is an estimate of σ 2 with 1 degree of freedom. The estimate σ̂02
has n degrees of freedom and it is being split into s2 with n − 1 degrees of freedom and n(ȳ· − m0 )2 ,
so there is only 1 degree of freedom left for n(ȳ· − m0 )2 .
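The weighted average formula follows from a standard sum-of-squares decomposition. The short derivation below is a sketch that assumes s² in Equation (2.1.1) is the usual sample variance with divisor n − 1; the cross-product term vanishes because the deviations y_i − ȳ· sum to zero.

```latex
\begin{align*}
\sum_{i=1}^{n} (y_i - m_0)^2
  &= \sum_{i=1}^{n} \left[ (y_i - \bar{y}_\cdot) + (\bar{y}_\cdot - m_0) \right]^2 \\
  &= \sum_{i=1}^{n} (y_i - \bar{y}_\cdot)^2
     + 2(\bar{y}_\cdot - m_0) \sum_{i=1}^{n} (y_i - \bar{y}_\cdot)
     + n(\bar{y}_\cdot - m_0)^2 \\
  &= (n-1)\, s^2 + n(\bar{y}_\cdot - m_0)^2 .
\end{align*}
```

Dividing both sides by n gives σ̂₀² as the weighted average of s² and n(ȳ· − m0)² with weights (n − 1)/n and 1/n.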
The test is based on looking at whether n(ȳ· − m0 )2 /s2 is close to 1 or not. Under the null model
including normality, this ratio has a distribution called an F distribution. The variance estimate in the
numerator has 1 degree of freedom and the variance estimate in the denominator has n − 1 degrees
of freedom. The degrees of freedom identify a particular member of the family of F distributions.
Thus we write,
\[ \frac{n(\bar{y}_\cdot - m_0)^2}{s^2} \sim F(1, n-1). \]
To compare this test to the parameter-based test, note that
\[ \frac{n(\bar{y}_\cdot - m_0)^2}{s^2} = \left[ \frac{|\bar{y}_\cdot - m_0|}{\sqrt{s^2/n}} \right]^2 , \]
with the right-hand side being the square of the t statistic for testing the null model with H0 : μ = m0 .
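Both identities, the weighted average form of σ̂₀² and the fact that the F statistic is the square of the t statistic, are easy to verify numerically. The Python sketch below uses a small made-up data set; the numbers and variable names are hypothetical and serve only as a check on the algebra.

```python
import math
from statistics import mean, variance

y = [4.0, 7.5, 9.0, 12.5, 15.0, 21.0]   # hypothetical observations
m0 = 10.0                                # hypothesized mean under H0
n = len(y)

ybar = mean(y)
s2 = variance(y)                                  # s^2, divisor n - 1
sigma0_sq = sum((yi - m0) ** 2 for yi in y) / n   # null-model estimate, divisor n
something_else = n * (ybar - m0) ** 2             # 1 degree of freedom

# Weighted average of s^2 and the "something else".
weighted = (n - 1) / n * s2 + something_else / n

f_stat = something_else / s2                      # compare with F(1, n - 1)
t_stat = (ybar - m0) / math.sqrt(s2 / n)          # compare with t(n - 1)

print(round(sigma0_sq, 4), round(weighted, 4), round(f_stat, 4), round(t_stat ** 2, 4))
```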
2.5 Checking normality
[Figure 2.6: Dot plot for drop rate percentage data: outliers deleted. Horizontal axis marked at 0, 5, 10, 15, 20, 25.]
EXAMPLE 2.5.1. Consider the dropout rate data. Figure 2.7 contains the normal plot for the complete
data. The two outliers cause the plot to be severely nonlinear. Figure 2.8 contains the normal
plot for the dropout rate data with the two outliers deleted. It is certainly not horribly nonlinear.
There is a little shoulder at the bottom end and some wiggling in the middle.
We can eliminate the shoulder in this plot by transforming the original data. Figure 2.9 contains
a normal plot for the square roots of the data with the outliers deleted. While the plot no longer has
a shoulder on the lower end, it seems to be a bit less well behaved in the middle.
We might now repeat our tests and confidence intervals for the 36 observations left when the
outliers are deleted. We can do this for either the original data or the square roots of the original
data. In either case, it now seems reasonable to treat the data as normal, so we can more confidently
use a t(36 − 1) distribution instead of hoping that the sample is large enough to justify use of the
t(37) distribution. We will consider these tests and confidence intervals in the next chapter.
[Figure 2.7: Normal plot for drop rate percentage data: full data. Normal Q−Q plot; vertical axis: drop rates (0 to 60); horizontal axis: theoretical quantiles (−2 to 2).]
[Figure 2.8: Normal plot for drop rate percentage data: outliers deleted. Horizontal axis: theoretical quantiles (−2 to 2).]
[Figure 2.9: Normal plot for square roots of drop rate percentage data: outliers deleted. Normal Q−Q plot; vertical axis: drop rates (1 to 5); horizontal axis: theoretical quantiles (−2 to 2).]
It is important to remember that if outliers are deleted, the conclusions reached are not valid
for data containing outliers. For example, a confidence interval will be for the mean dropout rate
excluding the occasional classes with extremely large dropout rates. If we are confident that any
deleted outliers are not really part of the population of interest, this causes no problem. Thus, if
we were sure that the large dropout rates were the result of clerical errors and did not provide any
information about true dropout rates, our conclusions about the population should be based on the
data excluding the outliers. More often, though, we do not know that outliers are simple mistakes.
Often, outliers are true observations and often they are the most interesting and useful observations
in the data. If the outliers are true observations, systematically deleting them changes both the
sample and the population of interest. In this case, the confidence interval is for the mean of a
population implicitly defined by the process of deleting outliers. Admittedly, the idea of the mean
dropout rate excluding the occasional outliers is not very clearly defined, but remember that the
real population of interest is not too clearly defined either. We do not really want to learn about the
clearly defined population of dropout rates; we really want to treat the dropout rate data as a sample
from a population that allows us to draw useful inferences about current and future dropout rates.
If we really cared about the fixed population, we could specify exactly what kinds of observations
we would exclude and what we meant by the population mean of the observations that would be
included. Given the nature of the true population of interest, I think that such technicalities are more
trouble than they are worth at this point. ✷
Normal plots are subject to random variation because the data used in them are subject to random
variation. Typically, normal plots are not perfectly straight. Figures 2.10 through 2.13 each present
nine normal plots for which the data are in fact normally distributed. The figures differ by the
number of observations in each plot, which are 10, 25, 50, 100, respectively. By comparison to these,
Figures 2.8 and 2.9, the normal plots for the dropout rate data and the square root of the dropout
rates both with outliers deleted, look reasonably normal. Of course, if the dropout rate data are truly
normal, the square root of these data cannot be truly normal and vice versa. However, both are
reasonably close to normal distributions.
From Figures 2.10 through 2.13 we see that as the sample size n gets bigger, the plots get
straighter. Normal plots based on even larger normal samples tend to appear straighter than these.
Normal plots based on smaller normal samples can look much more crooked.
Testing normality
In an attempt to quantify the straightness of a normal plot, Shapiro and Francia (1972) proposed
the summary statistic W ′ , which is the squared sample correlation between the pairs of points in
the plots. The population correlation coefficient was introduced in Subsection 1.2.3. The sample
correlation coefficient is introduced in Chapter 6. At this point, it is sufficient to know that sam-
ple correlation coefficients near 0 indicate very little linear relationship between two variables and
sample correlation coefficients near 1 or −1 indicate a very strong linear relationship. Since you
need a computer to get the normal scores (rankits) anyway, just rely on the computer to give you the
squared sample correlation coefficient.
A sample correlation coefficient near 1 indicates a strong tendency of one variable to increase
(linearly) as the other variable increases, and sample correlation coefficients near −1 indicate a
strong tendency for one variable to decrease (linearly) as the other variable increases. In normal
plots we are looking for a strong tendency for one variable, the ordered data, to increase as the
other variable, the rankits, increases, so normal data should display a sample correlation coefficient
near 1 and thus the square of the sample correlation, W ′ , should be near 1. If W ′ is too small,
it indicates that the data are inconsistent with the assumption of normality. If W ′ is smaller than,
say, 95% of the values one would see from normally distributed data, it is substantial evidence
that the data are not normally distributed. If W ′ is smaller than, say, 99% of the values one would
see from normally distributed data, it is strong evidence that the data are not normally distributed.
Appendix B.3 presents tables of the values W ′ (0.05, n) and W ′ (0.01, n). These are the points above
which fall, respectively, 95% and 99% of the W ′ values one would see from normally distributed
data. Of course the W ′ percentiles are computed using not only the assumption of normality, but also
the assumptions that the observations are independent with the same mean and variance. Note also
that the values of these percentiles depend on the sample size n. The tabled values are consistent
with our earlier observation that the plots are more crooked for smaller numbers of observations
and straighter for larger numbers of observations in that the tabled values get larger with n. For
comparison, we give the observed W ′ values for the data used in Figure 2.11.
Shapiro–Francia statistics for Figure 2.11
Plot W′ Plot W′ Plot W′
(a) 0.940 (d) 0.982 (g) 0.977
(b) 0.976 (e) 0.931 (h) 0.965
(c) 0.915 (f) 0.915 (i) 0.987
These should be compared to W′(0.05, 25) ≐ 0.918 and W′(0.01, 25) ≐ 0.88 from Appendix B.3.
Two of these nine values are below the 5% point, which is quite strange.
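As a rough illustration of how a statistic like W′ can be computed, the sketch below takes the squared sample correlation between the ordered data and a set of rankits. It is only a sketch under stated assumptions: the rankits are approximated with Blom's formula rather than taken as exact expected normal order statistics, so the results need not agree with tabled values to the last digit. The function names and the data are hypothetical.

```python
from statistics import NormalDist, mean


def approximate_rankits(n):
    """Approximate expected standard normal order statistics (Blom's formula).

    Assumption: this approximation stands in for exact rankits.
    """
    std_normal = NormalDist()
    return [std_normal.inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]


def squared_correlation(x, y):
    """Squared sample correlation coefficient between two equal-length lists."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)


def w_prime(data):
    """Squared correlation between the ordered data and the approximate rankits."""
    ordered = sorted(data)
    return squared_correlation(ordered, approximate_rankits(len(ordered)))


# Hypothetical illustration data (not from the book).
roughly_symmetric = [4.1, 5.0, 5.2, 5.9, 6.1, 6.4, 7.0, 7.3, 8.2, 9.0]
with_outlier = roughly_symmetric[:-1] + [30.0]

print(round(w_prime(roughly_symmetric), 3))
print(round(w_prime(with_outlier), 3))
```

A single wild observation bends the normal plot, so the second value should come out well below the first.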
EXAMPLE 2.5.2. For the dropout rate data we have three normal plots. The complete, untransformed
data yield a W′ value of 0.697. This value is inconsistent with the assumption that the
dropout rate data have a normal distribution. Deleting the two outliers, W′ is 0.978 for the
untransformed data and 0.960 for the square roots of the data. The tabled percentiles are
W′(0.05, 36) = 0.940 and W′(0.01, 36) = 0.91, so the untransformed data and the square root data
look all right. In addition, W′ was computed for the square roots of the complete data. Its value,
0.887, is still significantly low, but is a vast improvement over the untransformed complete data.
The outliers are not nearly as strange when the square roots of the data are considered. Sometimes
it is possible to find a transformation that eliminates outliers. ✷
2.6 Transformations
In analyzing a collection of numbers, we assume that the observations are a random sample from
some population. Often, the population from which the observations come is not as well defined as
we might like. For example, if our observations are the yields of corn grown on 30 one-acre plots of
ground in the summer of 2013, what is the larger population from which this is a sample? Typically,
we do not have a large number of one-acre plots from which we randomly select 30. Even if we
had a large collection of plots, these plots are subject to different weather conditions, have different
fertilities, etc. Most importantly, we are rarely interested in corn grown in 2013 for its own sake. If
we are studying corn grown in 2013, we are probably interested in predicting how that same type
of corn would behave if we planted it at some time in the future. No population that currently exists
could be completely appropriate for drawing conclusions about plant growths in a future year. Thus
the assumption that the observations are a random sample from some population is often only a
useful approximation.
When making approximations, it is often necessary to adjust things to make the approximations
more accurate. In Statistics, two approximations we frequently make are that all the data have the
same variance and that the data are normally distributed. Making numerical transformations of
the data is a primary tool for improving the accuracy of these approximations. When sampling
from a fixed population, we are typically interested in transformations that improve the normality
assumption because having different variances is not a problem associated with sampling from a
fixed population. With a fixed population, the variance of an object is the variance of randomly
choosing an object from the population. This is a constant regardless of which object we end up
choosing. But data are rarely as simple as random samples from a fixed population. Once we have
an object from the population, we have to obtain an observation (measurement or count) from the
object. These observations on a given object are also subject to random error and the error may well
depend on the specific object being observed.
We now examine the fact that observations often have different variances, depending on the
object being observed. First consider taking length measurements using a 30-centimeter ruler that
has millimeters marked on it. For measuring objects that are less than 30 centimeters long, like
this page, we can make very accurate measurements. We should be able to measure things within
half a millimeter. Now consider trying to measure the height of a doghouse that is approximately
3.5 feet tall. Using the 30-cm ruler, we measure up from the base, mark 30 cm, measure from the
mark up another 30 cm, make another mark, measure from the new mark up another 30 cm, mark
again, and finally we measure from the last mark to the top of the house. With all the marking
and moving of the ruler, we have much more opportunity for error than we have in measuring the
length of the book. Obviously, if we try to measure the height of a house containing two stories,
we will have much more error. If we try to measure the height of the Burj Khalifa in Dubai using a
30 cm ruler, we will not only have a lot of error, but large psychiatric expenses as well. The moral
of this tale is that, when making measurements, larger objects tend to have more variability. If the
objects are about the same size, this causes little or no problem. One can probably measure female
heights with approximately the same accuracy for all women in a sample. One probably cannot
measure the weights of a large sample of marine animals with constant variability, especially if the
sample includes both shrimp and blue whales. When the observations are the measured amounts
of something, often the standard deviation of an observation is proportional to its mean. When the
standard deviation is proportional to the mean, analyzing the logarithms of the observations is more
appropriate than analyzing the original data.
Now consider the problem of counting up the net financial worth of a sample of people. For
simplicity, let’s think of just three people, me, my 10-year-old grandson (the one my son has yet
to provide), and my rich uncle, Scrooge. In fact, let’s just think of having a stack of one dollar
bills in front of each person. My pile is of a decent size, my grandson’s is small, and my uncle’s
is huge. When I count my pile, it is large enough that I could miscount somewhere and make a
significant, but not major, error. When I count my grandson’s pile, it is small enough that I should get
it about right. When I count my uncle’s pile, it is large enough that I will, almost inevitably, make
several significant errors. As with measuring amounts of things, the larger the observation, the larger
the potential error. However, the process of making these errors is very different than that described
for measuring amounts. In such cases, the variance of the observations is often proportional to the
mean of the observations. The standard corrective measure for counts is different from the standard
corrective measure for amounts. When the observations are counts of something, often the variance
of the count is proportional to its mean. In this case, analyzing the square roots of the observations
is more appropriate than analyzing the original data.
Suppose we are looking at yearly sales for a sample of corporations. The sample may include
both the corner gas (petrol) station and Exxon. It is difficult to argue that one can really count sales
for a huge company such as Exxon. In fact, it may be difficult to count even yearly sales for a gas
station. Although in theory one should be able to count sales, it may be better to think of yearly
sales as measured amounts. It is not clear how to transform such data. Another example is age. We
usually think of counting the years a person has been alive, but one could also argue that we are
measuring the amount of time a person has been alive. In practice, we often try both logarithmic
and square root transformations and use the transformation that seems to work best, even when the
type of observation (count or amount) seems clear.
Finally, consider the proportion of times people drink a particular brand of soda pop, say,
Dr. Pepper. The idea is simply that we ask a group of people what proportion of the time they
drink Dr. Pepper. People who always drink Dr. Pepper are aware of that fact and should give a quite
accurate proportion. Similarly, people who never drink Dr. Pepper should be able to give an accurate
proportion. Moreover, people who drink Dr. Pepper about 90% of the time or about 10% of the time,
can probably give a fairly accurate proportion. The people who will have a lot of variability in their
replies are those who drink Dr. Pepper about half the time. They will have little idea whether they
drink it 50% of the time, or 60%, or 40%, or just what. With observations that are counts or amounts,
larger observations have larger variances. With observations that are proportions, observations near
0 and 1 have small variability and observations near 0.5 have large variability. Proportion data call
for a completely different type of transformation. The standard transformation for proportion data
is the inverse sine (arcsine) of the square root of the proportion. When the observations are propor-
tions, often the variance of the proportion is a constant times μ (1 − μ )/N, where μ is the mean and
N is the number of trials. In this case, analyzing the inverse sine (arcsine) of the square root of the
proportion is more appropriate than analyzing the original data.
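The three standard transformations are simple enough to write down directly. The following is a minimal sketch; the function names and the sample rates are hypothetical. It also illustrates one practical point: rates reported as percentages must be divided by 100 before the inverse sine of the square root is applied.

```python
import math


def log_transform(amounts):
    """Natural logs, for amounts whose standard deviation is proportional to the mean."""
    return [math.log(y) for y in amounts]


def sqrt_transform(counts):
    """Square roots, for counts whose variance is proportional to the mean."""
    return [math.sqrt(y) for y in counts]


def arcsine_sqrt_transform(proportions):
    """Inverse sine of the square root, for proportions between 0 and 1."""
    for p in proportions:
        if not 0.0 <= p <= 1.0:
            raise ValueError("proportions must lie between 0 and 1")
    return [math.asin(math.sqrt(p)) for p in proportions]


# Hypothetical rates reported as percentages, rescaled to proportions first.
percent_rates = [5.0, 12.5, 25.0, 50.0]
proportions = [rate / 100 for rate in percent_rates]
transformed = arcsine_sqrt_transform(proportions)

print([round(t, 4) for t in transformed])
```

The range check is a design choice: passing unscaled percentages would otherwise fail inside `math.asin` with a less helpful message.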
In practice, the square root transformation is sometimes used with proportion data. After all,
many proportions are obtained as a count divided by the total number of trials. For example, the
best data we could get in the Dr. Pepper drinking example would be the count of the number of
Dr. Peppers consumed divided by the total number of sodas imbibed.
There is a subtle but important point that was glossed over in the previous paragraphs. If we take
multiple measurements on a house, the variance depends on the true height, but the true height is the
same for all observations. Such a dependence of the variance on the mean causes no problems. The
problem arises when we measure a random sample of buildings, each with a variance depending on
its true height.
EXAMPLE 2.6.1. For the dropout rate data, we earlier considered the complete, untransformed
data and after deleting two outliers, we looked at the untransformed data and the square roots of the
data. In Examples 2.5.1 and 2.5.2 we saw that the untransformed data with the outliers deleted and
the square roots of the data with the outliers deleted had approximate normal distributions. Based on
the W ′ statistic, the untransformed data seemed to be more nearly normal. The data are proportions
of people who drop from a class, so our discussion in this section suggests transforming by the
inverse sine of the square roots of the proportions. Recall that proportions are values between 0 and
1, while the dropout rates were reported as values between 0 and 100, so the reported rates need to
be divided by 100. For the complete data, this transformation yields a W ′ value of 0.85, which is
much better than the untransformed value of 0.70, but worse than the value 0.89 obtained with the
square root transformation. With the two outliers deleted, the inverse sine of the square roots of the
proportions yields the respectable value W ′ = 0.96, but the square root transformation is simpler
and gives almost the same value, while the untransformed data give a much better value of 0.98.
Examination of the six normal plots (only three of which have been presented here) reinforces the
conclusions given above.
With the outliers deleted, it seems reasonable to analyze the untransformed data and, to a lesser
extent, the data after either transformation. Other things being equal, we prefer using the simplest
transformation that seems to work. Simple transformations are easier to explain, justify, and inter-
pret. The square root transformation is simpler, and thus better, than the inverse sine of the square
roots of the proportions. Of course, not making a transformation seems to work best and not trans-
forming is always the simplest transformation. Actually some people would point out, and it is
undeniably true, that the act of deleting outliers is really a transformation of the data. However, we
will not refer to it as such. ✷
Theory
The standard transformations given above are referred to as variance-stabilizing transformations.
The idea is that each observation is a look at something with a different mean and variance, where
the variance depends on the mean. For example, when we measure the height of a house, the house
has some ‘true’ height and we simply take a measurement of it. The variability of the measurement
depends on the true height of the house. Variance-stabilizing transformations are designed to elimi-
nate the dependence of the variance on the mean. Although variance-stabilizing transformations are
used quite generally for counts, amounts, and proportions, they are derived for certain assumptions
about the relationship between the mean and the variance. These relationships are tied to theoretical
distributions that are appropriate for some counts, amounts, and proportions. Rao (1973, Section 6g)
gives a nice discussion of the mathematical theory behind variance-stabilizing transformations.
Proportions are related to the binomial distribution for numbers of successes. We have a fixed
number of trials; the proportion is the number of successes divided by the number of trials. The
mean of a Bin(N, p) distribution is N p and the variance is N p(1 − p). This relationship between the
mean and variance of a binomial leads to the inverse sine of the square root transformation.
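The effect of the inverse sine of the square root can be checked numerically. The sketch below computes, by summing over the binomial probabilities, the exact variance of a proportion Y/N and of its transformed value for several success probabilities. The number of trials N = 100, the probabilities, and the function names are assumptions chosen only for illustration.

```python
import math


def binomial_pmf(k, n_trials, p):
    """Pr[Y = k] for Y ~ Bin(n_trials, p)."""
    return math.comb(n_trials, k) * p ** k * (1 - p) ** (n_trials - k)


def variance_of_transformed_proportion(n_trials, p, transform):
    """Exact variance of transform(Y / N) when Y ~ Bin(N, p), by summing over the pmf."""
    probs = [binomial_pmf(k, n_trials, p) for k in range(n_trials + 1)]
    values = [transform(k / n_trials) for k in range(n_trials + 1)]
    center = sum(q * v for q, v in zip(probs, values))
    return sum(q * (v - center) ** 2 for q, v in zip(probs, values))


def identity(x):
    return x


def arcsine_sqrt(x):
    return math.asin(math.sqrt(x))


n_trials = 100                    # hypothetical number of trials
success_probs = (0.2, 0.35, 0.5)  # hypothetical success probabilities

raw = [variance_of_transformed_proportion(n_trials, p, identity) for p in success_probs]
stabilized = [variance_of_transformed_proportion(n_trials, p, arcsine_sqrt) for p in success_probs]

for p, r, s in zip(success_probs, raw, stabilized):
    print(p, round(r, 5), round(s, 5))
```

The untransformed variances are p(1 − p)/N and so change with p, while the transformed variances should all sit near 1/(4N).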
Counts are related to the Poisson distribution. Poisson data have the property that the variance
equals the mean of the observation. This relationship leads to the square root as the variance-
stabilizing transformation.
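A similar numerical check works for counts. The sketch below sums over Poisson probabilities to get the variance of a count and of its square root for several means. The means and function names are hypothetical, and the infinite Poisson sum is truncated far out in the upper tail, which is an approximation.

```python
import math


def poisson_pmf(k, mu):
    """Pr[Y = k] for Y ~ Poisson(mu), computed on the log scale for stability."""
    return math.exp(k * math.log(mu) - mu - math.lgamma(k + 1))


def variance_of_transformed_count(mu, transform):
    """Variance of transform(Y) for Y ~ Poisson(mu), truncating the sum far in the tail."""
    upper = int(mu + 20 * math.sqrt(mu) + 20)  # assumption: mass beyond this is negligible
    probs = [poisson_pmf(k, mu) for k in range(upper + 1)]
    values = [transform(float(k)) for k in range(upper + 1)]
    total = sum(probs)  # very close to 1
    center = sum(q * v for q, v in zip(probs, values)) / total
    return sum(q * (v - center) ** 2 for q, v in zip(probs, values)) / total


def identity(y):
    return y


means = (10.0, 40.0, 160.0)  # hypothetical Poisson means
raw = [variance_of_transformed_count(mu, identity) for mu in means]
stabilized = [variance_of_transformed_count(mu, math.sqrt) for mu in means]

for mu, r, s in zip(means, raw, stabilized):
    print(mu, round(r, 3), round(s, 3))
```

The raw variances track the means, while the variances of the square roots should all be close to 1/4.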
For amounts, the log transformation comes from having the standard deviation proportional
to the mean. The standard deviation divided by the mean is called the coefficient of variation, so
the log transformation is appropriate for observations that have a constant coefficient of variation.
(The square root transformation comes from having the variance, rather than the standard deviation,
proportional to the mean.) A family of continuous distributions called the gamma distributions has
a constant coefficient of variation; see Section 22.2.
The variance-stabilizing transformations are given below. In each case we assume E(yᵢ) = μᵢ
and Var(yᵢ) = σᵢ². The symbol ∝ means “proportional to.”
Variance-stabilizing transformations
Data         Distribution   Mean, variance relationship   Transformation
Count        Poisson        μᵢ ∝ σᵢ²                       √yᵢ
Amount       Gamma          μᵢ ∝ σᵢ                        log(yᵢ)
Proportion   Binomial/N     μᵢ(1 − μᵢ)/N ∝ σᵢ²             sin⁻¹(√yᵢ)
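One way to see where the entries in this table come from is a first-order Taylor (delta method) argument. The sketch below is informal; the function g and the proportionality constant c are notation introduced here for illustration and do not come from the text. For a smooth transformation g,

```latex
\[
\mathrm{Var}[g(y_i)] \approx [g'(\mu_i)]^2 \sigma_i^2 ,
\]
so the variance is roughly constant when $g'(\mu_i) \propto 1/\sigma_i$.
\begin{align*}
\text{Count: } & \sigma_i^2 = c\,\mu_i, & g(y) &= \sqrt{y}, & g'(\mu_i) &= \frac{1}{2\sqrt{\mu_i}}, & \mathrm{Var}[g(y_i)] &\approx \frac{c}{4}, \\
\text{Amount: } & \sigma_i = c\,\mu_i, & g(y) &= \log(y), & g'(\mu_i) &= \frac{1}{\mu_i}, & \mathrm{Var}[g(y_i)] &\approx c^2, \\
\text{Proportion: } & \sigma_i^2 = c\,\frac{\mu_i(1-\mu_i)}{N}, & g(y) &= \sin^{-1}\!\sqrt{y}, & g'(\mu_i) &= \frac{1}{2\sqrt{\mu_i(1-\mu_i)}}, & \mathrm{Var}[g(y_i)] &\approx \frac{c}{4N}.
\end{align*}
```

In each case the approximate variance of the transformed observation no longer involves μᵢ, which is the sense in which the transformation stabilizes the variance.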
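2.7 Inference about σ²
[Figure 2.14: χ² density showing a central interval with probability 1 − α and tail areas α/2.]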
They also tend to be sensitive to the assumption of normality. The procedures do not follow the
same pattern used for most inferences that involve 1) a parameter of interest, 2) an estimate of the
parameter, 3) the standard error of the estimate, and 4) a known distribution symmetric about zero;
however, there are similarities. Procedures for variances typically require a parameter, an estimate,
and a known distribution.
The procedures discussed in this section actually apply to all the problems in this book that
involve a single variance parameter σ 2 . One need only substitute the relevant estimate of σ 2 and
use its degrees of freedom. Applications to the data and models considered in Chapter 19 are not
quite as straightforward because there the models involve more than one variance.
In the one-sample problem, the parameter is σ², the estimate is s², and the distribution, as
discussed in Equation (2.1.6), is
$$\frac{(n-1)s^2}{\sigma^2} \sim \chi^2(n-1).$$
The notation χ 2 (1 − α , n − 1) is used to denote the point that cuts off the bottom 1 − α (top α ) of
the χ 2 distribution with n − 1 degrees of freedom. Note that (n − 1)s2 /σ 2 is nonnegative, so the
curve in Figure 2.14 illustrating the χ 2 distribution is also nonnegative. Figure 2.14 shows a central
interval with probability 1 − α for a χ 2 distribution.
To test H₀: σ² = σ₀² the value σ₀² must be known. As usual, we assume that the null hypothesis
is true, i.e., σ² = σ₀², so under this assumption an α-level test is based on
$$1 - \alpha = \Pr\left[\chi^2\left(\frac{\alpha}{2}, n-1\right) < \frac{(n-1)s^2}{\sigma_0^2} < \chi^2\left(1-\frac{\alpha}{2}, n-1\right)\right];$$
see Figure 2.14. If we observe data yielding an s² such that (n − 1)s²/σ₀² is between the values
χ²(α/2, n − 1) and χ²(1 − α/2, n − 1), the data are consistent with the assumption that σ² = σ₀² at
level α. Conversely, we reject H₀: σ² = σ₀² with a two-sided α-level test if
$$\frac{(n-1)s^2}{\sigma_0^2} > \chi^2\left(1-\frac{\alpha}{2}, n-1\right)$$
or if
$$\frac{(n-1)s^2}{\sigma_0^2} < \chi^2\left(\frac{\alpha}{2}, n-1\right).$$
More specifically, we reject the null model that the data are independent, normally distributed with
a constant variance σ², that we have the correct model for the mean structure, and that σ² = σ₀².
EXAMPLE 2.7.1. For the dropout rate data consider testing H₀: σ² = 50 with α = 0.01. Again,
we use the data with the two outliers deleted, because they are more nearly normal. Thus, our
concept of the population variance σ² must account for our deletion of weird cases. The data with
the outliers deleted contain 36 observations and s² for these data is 27.45. The test statistic is
$$\frac{(n-1)s^2}{\sigma_0^2} = \frac{35(27.45)}{50} = 19.215.$$
The critical region, the region for which we reject H₀, contains all values greater than
χ²(0.995, 35) = 60.275 and all values less than χ²(0.005, 35) = 17.19. The test statistic is
certainly not greater than 60.275 and it is also not less than 17.19, so we have no basis for rejecting the
null hypothesis at the α = 0.01 level. At the 0.01 level, the data are consistent with the claim that
σ² = 50. ✷
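The arithmetic of this test is easy to script. The following is a minimal sketch, assuming the χ² cutoffs are looked up in a table and passed in by hand; the function name and argument names are hypothetical. The numbers are those of the dropout rate example.

```python
def chi_square_variance_test(n, s2, sigma0_sq, lower_cutoff, upper_cutoff):
    """Two-sided test of H0: sigma^2 = sigma0_sq using tabled chi-squared cutoffs.

    lower_cutoff and upper_cutoff are chi2(alpha/2, n-1) and chi2(1-alpha/2, n-1),
    supplied by the user (for example from a table).
    """
    statistic = (n - 1) * s2 / sigma0_sq
    reject = statistic < lower_cutoff or statistic > upper_cutoff
    return statistic, reject


# alpha = 0.01: cutoffs chi2(0.005, 35) = 17.19 and chi2(0.995, 35) = 60.275.
statistic, reject = chi_square_variance_test(
    n=36, s2=27.45, sigma0_sq=50, lower_cutoff=17.19, upper_cutoff=60.275
)
print(round(statistic, 3), reject)  # 19.215 False
```

Passing the cutoffs as arguments keeps the sketch free of any particular statistics library; with such a library available, the cutoffs could be computed rather than looked up.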
Confidence intervals are defined in terms of testing the hypothesis H₀: σ² = σ₀². A (1 − α)100%
confidence interval for σ² is based on the following inequalities:
$$\chi^2\left(\frac{\alpha}{2}, n-1\right) < \frac{(n-1)s^2}{\sigma_0^2} < \chi^2\left(1-\frac{\alpha}{2}, n-1\right)$$
which occurs if and only if
$$\frac{(n-1)s^2}{\chi^2\left(1-\frac{\alpha}{2}, n-1\right)} < \sigma_0^2 < \frac{(n-1)s^2}{\chi^2\left(\frac{\alpha}{2}, n-1\right)}.$$
The first inequality corresponds to Figure 2.14 and just reflects the definition of the percentage
points χ²(α/2, n − 1) and χ²(1 − α/2, n − 1). These are defined to be the points that cut out the middle
1 − α of the chi-squared distribution and are tabled in Appendix B.2. The second inequality is based
on algebraic manipulation of the terms in the first inequality. The actual derivation is given later in
this section. The second inequality gives an interval that contains σ₀² values that are consistent with
the data and the model.
$$\left(\frac{(n-1)s^2}{\chi^2\left(1-\frac{\alpha}{2}, n-1\right)},\ \frac{(n-1)s^2}{\chi^2\left(\frac{\alpha}{2}, n-1\right)}\right). \qquad (2.7.1)$$
The confidence interval for σ 2 requires the data to be normally distributed. This assumption is
more vital for inferences about σ 2 than it is for inferences about μ . For inferences about μ , the
central limit theorem indicates that the sample means are approximately normal even when the data
are not normal. There is no similar result indicating that the sample variance is approximately χ 2
even when the data are not normal. (For large n, both s2 and χ 2 (n−1) approach normal distributions,
but in general the approximate normal distribution for (n − 1)s2 /σ 2 is not the approximately normal
χ 2 (n − 1) distribution.)
EXAMPLE 2.7.2. Consider again the dropout rate data. We have seen that the complete data are
not normal, but that after deleting the two outliers, the remaining data are reasonably normal. We
find a 95% confidence interval for σ² from the data with the outliers deleted. The percentage points
for the χ²(36 − 1) distribution are χ²(0.025, 35) = 20.57 and χ²(0.975, 35) = 53.20. The 95%
confidence interval is
$$\left(\frac{35(27.45)}{53.20},\ \frac{35(27.45)}{20.57}\right),$$
or equivalently (18.1, 46.7). The interval contains all values of σ² that are consistent with the data
and the model as determined by a two-sided α = 0.05 level test. The interval does not contain 50,
so we do have evidence against H0 : σ 2 = 50 at the α = 0.05 level. Remember that this is the true
variance after the deletion of outliers. Again, when we delete outliers we are a little fuzzy about the
exact definition of our parameter, but we are also being fuzzy about the exact population of interest.
The exception to this is when we believe that the only outliers that exist are observations that are
not really part of the population. ✷
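The interval in this example can be reproduced with a few lines of code. This is a minimal sketch, again assuming the χ² percentage points are taken from a table and supplied by hand; the function and argument names are hypothetical.

```python
def variance_confidence_interval(n, s2, chi2_lower, chi2_upper):
    """Interval (2.7.1) for sigma^2 given tabled chi-squared percentage points.

    chi2_lower is chi2(alpha/2, n-1) and chi2_upper is chi2(1-alpha/2, n-1).
    Dividing by the larger percentage point gives the lower endpoint.
    """
    df = n - 1
    return df * s2 / chi2_upper, df * s2 / chi2_lower


# 95% interval: chi2(0.025, 35) = 20.57 and chi2(0.975, 35) = 53.20.
lower, upper = variance_confidence_interval(n=36, s2=27.45, chi2_lower=20.57, chi2_upper=53.20)
print(round(lower, 1), round(upper, 1))  # 18.1 46.7
```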
2.7.1 Theory
Alas, these procedures for σ 2 are merely ad hoc. They are neither appropriate significance testing
results nor appropriate Neyman–Pearson theory results. Neither of those tests would use the central
χ 2 interval illustrated in Figure 2.14. However, the confidence interval has a Bayesian justification.
These methods are valid Neyman–Pearson procedures but not the optimal procedures. As
Neyman–Pearson procedures, the endpoints of the confidence interval (2.7.1) are random. To use
the interval, we replace the random variable s² with the observed value of s² and replace the term
“probability (1 − α)” with “(1 − α)100% confidence.” Once the observed value of s² is substituted
into the interval, nothing about the interval is random any longer; the fixed unknown value of σ²
is either in the interval or it is not; there is no probability associated with it. The probability state-
ment about random variables is mystically transformed into a ‘confidence’ statement. This is not
unreasonable, but the rationale is, to say the least, murky.
A significance test would not use the cutoff values χ²(α/2, n − 1) and χ²(1 − α/2, n − 1). Let
the vertical axis in Figure 2.14 be z and the horizontal axis be w. The density function being plotted
is f(w). Any positive value of z below the maximum of the density corresponds to two points w₁ and w₂ with
$$z = f(w_1) = f(w_2).$$
For an α-level significance test, one would find z₀ so that the corresponding points w₀₁ and w₀₂ have
the property that
$$\alpha = \Pr[\chi^2(n-1) \le w_{01}] + \Pr[\chi^2(n-1) \ge w_{02}].$$
The significance test then uses w₀₁ and w₀₂ as the cutoff values for an α-level test. As you can see,
the method presented earlier is much simpler than this. (The optimal Neyman–Pearson test is even
more complicated than the significance test.) But the significance test uses an appropriate measure
of how weird any particular data value s² is relative to a null model based on σ₀². Given the cutoff
values w₀₁ and w₀₂, finding a confidence interval works pretty much as for the ad hoc method.
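The equal-density cutoffs can be found numerically. The sketch below is one possible approach, not a procedure given in the text: it uses bisection on the density level z and Simpson's rule for the χ² probabilities, relying only on the standard library. All function names are hypothetical, and the code assumes more than 2 degrees of freedom so that the density is zero at the origin and has a single interior mode at df − 2.

```python
import math


def chi2_pdf(w, df):
    """Density of the chi-squared distribution with df degrees of freedom."""
    if w <= 0:
        return 0.0
    half = df / 2
    log_density = (half - 1) * math.log(w) - w / 2 - half * math.log(2) - math.lgamma(half)
    return math.exp(log_density)


def chi2_cdf(w, df, steps=2000):
    """Pr[chi2(df) <= w] by Simpson's rule (steps must be even; assumes df > 2)."""
    if w <= 0:
        return 0.0
    h = w / steps
    total = chi2_pdf(0.0, df) + chi2_pdf(w, df)
    for j in range(1, steps):
        total += (4 if j % 2 else 2) * chi2_pdf(j * h, df)
    return total * h / 3


def point_with_density(z, df, lo, hi, increasing):
    """Bisect for w in [lo, hi] with f(w) = z, where f is monotone on [lo, hi]."""
    for _ in range(100):
        mid = (lo + hi) / 2
        if (chi2_pdf(mid, df) < z) == increasing:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def tail_points(z, df):
    """The two points w1 < mode < w2 at which the density equals z."""
    mode = df - 2
    far_right = df + 20 * math.sqrt(2 * df) + 50  # the density is negligible out here
    w1 = point_with_density(z, df, 0.0, mode, increasing=True)
    w2 = point_with_density(z, df, mode, far_right, increasing=False)
    return w1, w2


def equal_density_cutoffs(alpha, df):
    """Find w01 and w02 with f(w01) = f(w02) and total tail probability alpha."""
    z_low, z_high = 0.0, chi2_pdf(df - 2, df)
    for _ in range(60):
        z = (z_low + z_high) / 2
        w1, w2 = tail_points(z, df)
        tails = chi2_cdf(w1, df) + 1 - chi2_cdf(w2, df)
        if tails < alpha:
            z_low = z   # cutoffs too far apart: raise the density level
        else:
            z_high = z  # cutoffs too close together: lower the density level
    return tail_points((z_low + z_high) / 2, df)


w01, w02 = equal_density_cutoffs(alpha=0.05, df=35)
print(round(w01, 2), round(w02, 2))
```

Because the χ² density is skewed to the right, both cutoffs should fall somewhat to the left of the equal-tail values χ²(0.025, 35) = 20.57 and χ²(0.975, 35) = 53.20 used in the ad hoc method.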
While methods for drawing inferences about variances do not fit our standard pattern for a single
parameter of interest based on 1) a parameter of interest, 2) an estimate of the parameter, 3) the
standard error of the estimate, and 4) a known distribution symmetric about zero, it should be noted
that the basic logic behind these confidence intervals and tests is the same. The correspondence
to model testing is strong since we are comparing the variance estimate of the original model s2
to the variance under the null model σ02 . The only real difference is that the appropriate reference
distribution turns out to be a χ 2 (n − 1) rather than an F. In any case, significance tests are based on
evaluating whether the data are consistent with the null model. Consistency is defined in terms of a
known distribution that applies when the null model is true. If the data are inconsistent with the null
model, the null model is rejected as being inconsistent with the observed data.
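Below is a series of equivalent inequalities that justify the confidence interval.
$$\chi^2\left(\frac{\alpha}{2}, n-1\right) < \frac{(n-1)s^2}{\sigma^2} < \chi^2\left(1-\frac{\alpha}{2}, n-1\right)$$
$$\frac{1}{\chi^2\left(\frac{\alpha}{2}, n-1\right)} > \frac{\sigma^2}{(n-1)s^2} > \frac{1}{\chi^2\left(1-\frac{\alpha}{2}, n-1\right)}$$
$$\frac{1}{\chi^2\left(1-\frac{\alpha}{2}, n-1\right)} < \frac{\sigma^2}{(n-1)s^2} < \frac{1}{\chi^2\left(\frac{\alpha}{2}, n-1\right)}$$
$$\frac{(n-1)s^2}{\chi^2\left(1-\frac{\alpha}{2}, n-1\right)} < \sigma^2 < \frac{(n-1)s^2}{\chi^2\left(\frac{\alpha}{2}, n-1\right)}.$$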
2.8 Exercises
EXERCISE 2.8.1. Mulrow et al. (1988) presented data on the melting temperature of biphenyl as
measured on a differential scanning calorimeter. The data are given below; they are the observed
melting temperatures in Kelvin less 340.
3.02, 2.36, 3.35, 3.13, 3.33, 3.67, 3.54, 3.11, 3.31, 3.41, 3.84, 3.27, 3.28, 3.30
Compute the sample mean, variance, and standard deviation. Give a 99% confidence interval for the
population mean melting temperature of biphenyl as measured by this machine. (Note that we don’t
know whether the calorimeter is accurately calibrated.)
EXERCISE 2.8.2. Box (1950) gave data on the weights of rats that were about to be used in an
experiment. The data are repeated in Table 2.1. Assuming that these are a random sample from a
broader population of rats, give a 95% confidence interval for the population mean weight. Test the
null hypothesis that the population mean weight is 60 using a 0.01 level test.
EXERCISE 2.8.3. Fuchs and Kenett (1987) presented data on citrus juice for fruits grown during
a specific season at a specific location. The sample size was 80 but many variables were measured
on each sample. Sample statistics for some of these variables are given below.
The variables are BX (total soluble solids produced at 20°C), AC (acidity as citric acid unhydrons),
SUG (total sugars after inversion), K (potassium), FORM (formol number), PECT (total pectin).
Give a 99% confidence interval for the population mean of each variable. Give a 99% prediction
interval for each variable. Test whether the mean of BX equals 10. Test whether the mean of SUG
is equal to 7.5. Use α = 0.01 for each test.
EXERCISE 2.8.4. Jolicoeur and Mosimann (1960) gave data on female painted turtle shell
lengths. The data are presented in Table 2.2. Give a 95% confidence interval for the population
mean length. Give a 99% prediction interval for the shell length of a new female.
EXERCISE 2.8.5. Mosteller and Tukey (1977) extracted data from the Coleman Report. Among
the variables considered was the percentage of sixth-graders whose fathers were employed in white-
collar jobs. Data for 20 New England schools are given in Table 2.3. Are the data reasonably normal?
Do any of the standard transformations improve the normality? After finding an appropriate trans-
formation (if necessary), test the null hypothesis that the percentage of white-collar fathers is 50%.
Use a 0.05 level test. Give a 99% confidence interval for the percentage of fathers with white-collar
jobs. If a transformation was needed, relate your conclusions back to the original measurement
scale.
EXERCISE 2.8.6. Give a 95% confidence interval for the population variance associated with
the data of Exercise 2.8.5. Remember that inferences about variances require the assumption of
normality. Could the variance reasonably be 10?
EXERCISE 2.8.7. Give a 95% confidence interval for the population variance associated with
the data of Exercise 2.8.4. Remember that the inferences about variances require the assumption of
normality.
EXERCISE 2.8.8. Give 99% confidence intervals for the population variances of all the variables
in Exercise 2.8.3. Assume that the original data were normally distributed. Using α = 0.01, test
whether the potassium variance could reasonably be 45,000. Could the formol number variance be
8?
EXERCISE 2.8.9. Shewhart (1931, p. 62) reproduces Millikan’s data on the charge of an electron.
These are repeated in Table 2.4. Check for outliers and nonnormality. Adjust the data appropriately
if there are any problems. Give a 98% confidence interval for the population mean value. Give
a 98% prediction interval for a new measurement. (Millikan argued that some adjustments were
needed before these data could be used in an optimal fashion but we will ignore his suggestions.)
EXERCISE 2.8.10. Let y₀, y₁, . . . , yₙ be independent N(μ, σ²) random variables and compute ȳ·
and s² from observations 1 through n. Show that (y₀ − ȳ·)/√(σ² + σ²/n) ∼ N(0, 1) using results
from Chapter 1 and the fact that linear combinations of independent normals are normal. Recalling
that y₀, ȳ·, and s² are independent and that (n − 1)s²/σ² ∼ χ²(n − 1), use Definition 2.1.3 to show
that (y₀ − ȳ·)/√(s² + s²/n) ∼ t(n − 1).
Chapter 3
General Statistical Inference
Before we can perform a statistical analysis on data, we need to make assumptions about the data.
A model for the data is simply a statement of those assumptions. Typical assumptions are that
the observations are independent, have equal variances, and that either the observations are nor-
mally distributed or involve large sample sizes. (We don’t really know what “large” means, so large
samples is an assumption.) Typically, models also say something about the expected values of the
observations. In fact, it is the expected values that generally receive most of the attention when dis-
cussing models. Most statistical procedures, e.g., confidence intervals, prediction intervals, and tests
of a null hypothesis, rely on the validity of the model for the validity of the procedure. As such, it is
vitally important that we do what we can to establish the validity of the model. Sections 2.5 and 2.6
contained our first steps in that direction.
This chapter focuses on significance testing as a fundamental procedure in statistical inference.
Confidence intervals and P values are presented as extensions of a basic testing procedure. The
approach is very much in the spirit of the traditional approach used by R.A. Fisher as opposed to a
later approach to testing and confidence intervals introduced by Jerzy Neyman and E.S. Pearson. As
such, we do our best to avoid the artifacts of the Neyman–Pearson approach including alternative
hypotheses, one-sided testing, and the concept of the probability of Type I error. Although I am a
strong proponent of the use of Bayesian procedures—see Christensen et al. (2010)—they receive
little attention in this book.
The basic idea of significance testing is that one has a model for the data and seeks to determine
whether the data are consistent with that model or whether they are inconsistent with the model.
Determining that the data are inconsistent with the model is a strong statement. It suggests that
the model is wrong. It is a characteristic of statistical analysis that data rarely give an absolute
contradiction to a model, so we need to measure the extent to which the data are inconsistent with
the model. On the other hand, observing that the data are consistent with the model is a weak
statement. Although the data may be consistent with the current model, we could always construct
other models for which the data would also be consistent.
Frequently, when constructing tests, we have an underlying model for the data to which we add
some additional assumption, and then we want to test whether this new model is consistent with
the data. There are two terminologies for this procedure. First, the additional assumption is often
referred to as a null hypothesis, so the original model along with the additional assumption is called
the null model. Alternatively, the original model is often called the full model and the null model is
called the reduced model. The null model is a reduced model in the sense that it is a special case of
the full model, that is, it consists of the full model with the added restriction of the null hypothesis.
When discussing full and reduced models, we might not bother to specify the null hypothesis, but
every reduced model corresponds to some null hypothesis.
The most commonly used statistical tests and confidence intervals derive from a theory based
on a single parameter of interest, i.e., the null hypothesis is a specific assumption about a single
parameter. While we use this single parameter theory when convenient, the focus of this book is on
models rather than parameters. We begin with a general statement of our model-based approach to
testing and then turn to an examination of the single parameter approach. A key aspect of the model-
based approach is that it easily allows for testing many parameters at once. The basic ideas of both
theories were illustrated in Chapter 2. The point of the current chapter is to present the theories in
general form and to reemphasize fundamental techniques. The general theories will then be used
throughout the book. Because the theories are stated in quite general terms, some prior familiarity
with the ideas as discussed in Chapter 2 is highly recommended.
The values ε̂i ≡ yi − ŷi are called residuals and are also used to evaluate model assumptions like
independence, equal variances, and normality.
Typically the model involves parameters that describe the interrelations between the μi s. If there
are r (functionally distinct) parameters for the mean values, we define the degrees of freedom for
error (dfE) as
dfE = n − r.
The degrees of freedom can be thought of as the effective number of observations that are available
for estimating the variance σ² after using the model to estimate the means. Finally, our estimate of
σ² is the mean squared error (MSE) defined as
$$\text{MSE} = \frac{\text{SSE}}{\text{dfE}}.$$
To test two models, identify the full model (Full) and compute SSE(Full), dfE(Full),
and MSE(Full). Similarly, for the reduced model (Red.), compute SSE(Red.), dfE(Red.), and
MSE(Red.). The identification of a full model and a reduced model serves to suggest a test statistic,
i.e., something on which to base a test. The test is a test of whether the reduced model is correct.
We establish the behavior (distribution) of the test statistic when the reduced model is correct, and
if the observed value of the test statistic looks unusual relative to this reference distribution, we
conclude that something is wrong with the reduced model. (Or that we got unlucky in collecting our
data—always a possibility.)
Although the test assumes that the reduced model is correct and checks whether the data tend to
3.1 MODEL-BASED TESTING 59
contradict that assumption, for the purpose of developing a test we often act as if the full model is
true, regardless of whether the reduced model is true. This focuses the search for abnormal behavior
in a certain direction. Nonetheless, concluding that the reduced model is wrong does not imply that
the full model is correct.
Since the reduced model is a special case of the full model, the full model must always explain
the data at least as well as the reduced model. In other words, the error from Model (Red.) must
be at least as large as the error from Model (Full), i.e., SSE(Red.) ≥ SSE(Full). The reduced model being
smaller than the full, it also has more degrees of freedom for error, i.e., dfE(Red.) ≥ dfE(Full).
If the reduced model is true, we will show later that the statistic
$$\text{MSTest} \equiv \frac{\text{SSE(Red.)} - \text{SSE(Full)}}{\text{dfE(Red.)} - \text{dfE(Full)}}$$
is an estimate of the variance, σ², with degrees of freedom dfE(Red.) − dfE(Full). Since the reduced
model is a special case of the full model, whenever the reduced model is true, the full model
is also true. Thus, if the reduced model is true, MSE(Full) is also an estimate of σ², and the ratio
$$F \equiv \frac{\text{MSTest}}{\text{MSE(Full)}}$$
should be about 1, since it is the ratio of two estimates of σ². This ratio is called the F statistic in
honor of R.A. Fisher.
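As a rough illustration of these definitions, the following Python sketch computes SSE, dfE, MSTest, and the F statistic for one pair of nested models. The data and the null value `m0` are hypothetical and the function names are my own; the full model fits a single unknown mean and the reduced model fixes that mean at `m0`.

```python
def sse(y, fitted):
    """Sum of squared errors: the sum of squared residuals y_i - yhat_i."""
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))


def f_statistic(sse_red, dfe_red, sse_full, dfe_full):
    """F = MSTest / MSE(Full) for a reduced model that is a special case of the full model."""
    ms_test = (sse_red - sse_full) / (dfe_red - dfe_full)
    mse_full = sse_full / dfe_full
    return ms_test / mse_full


# Hypothetical data and null value, not taken from the text.
y = [4.0, 5.0, 6.0, 7.0, 8.0]
n = len(y)
m0 = 5.0

# Full model: E(y_i) = mu, with one mean parameter (r = 1) estimated by the sample mean.
ybar = sum(y) / n
sse_full = sse(y, [ybar] * n)
dfe_full = n - 1

# Reduced model: E(y_i) = m0, with no free mean parameters (r = 0).
sse_red = sse(y, [m0] * n)
dfe_red = n

f_obs = f_statistic(sse_red, dfe_red, sse_full, dfe_full)
print(sse_full, sse_red, f_obs)
```

The reduced model can never fit better than the full model, so `sse_red` is at least `sse_full` and the numerator of the F statistic is nonnegative.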
Everybody’s favorite reduced model takes
E(yi ) = μ
so that every observation has the same mean. This is the reduced model being tested in nearly all of
the three-line ANOVA tables given by computer programs, but we have much more flexibility than
that.
The F statistic is an actual number that we can compute from the data, so we eventually have
an actual observed value for the F statistic, say Fobs . If Fobs is far from 1, it suggests that something
may be wrong with the assumptions in the reduced model, i.e., either the full model is wrong or the
null hypothesis is wrong. The question becomes, “What constitutes an Fobs far from 1?” Even when
Model (Red.) is absolutely correct, the variability in the data causes variability in the F statistic.
Since MSTest and MSE(Full) are always nonnegative, the F statistic is nonnegative. Huge values
of Fobs are clearly far from 1. But we will see that sometimes values of Fobs very near 0 are also
far from 1. By quantifying the variability in the F statistic when Model (Red.) is correct, we get an
idea of what F statistics are consistent with Model (Red.) and what F values are inconsistent with
Model (Red.).
When, in addition to the assumption of independent observations with common variance σ² and
the assumption that the reduced model for the means is correct, we also assume that the data are
normally distributed and that both the full and reduced models are “linear” so that they have nice
mathematical properties, the randomness in the F statistic is described by an F distribution. Properties
of the F distribution can be tabled, or more commonly, determined by computer programs. The
F distribution depends on two parameters, the degrees of freedom for MSTest and the degrees of
freedom for MSE(Full); thus we write
$$F = \frac{\text{MSTest}}{\text{MSE(Full)}} \sim F[\text{dfE(Red.)} - \text{dfE(Full)}, \text{dfE(Full)}].$$
The shape (density) of the F[dfE(Red.) − dfE(Full), dfE(Full)] distribution determines which
values of the F statistic are inconsistent with the null model. A typical F density is shown in Fig-
ure 3.1. F values for which the curve takes on small values are F values that are unusual under the
null model. Thus, in Figure 3.1, unusual values of F occur when F is either very much larger than 1
or very close to 0. Generally, when dfE(Red.) − dfE(Full) ≥ 3 both large values of F and values of
F near 0 are inconsistent with the null model. As shown in Figure 3.2, when dfE(Red.) − dfE(Full)
is one or two, only very large values of the F statistic are inconsistent with the null model because
in those cases the density is large for F values near 0.
[Figure 3.1: Density of an F(df1, df2) distribution, with area 1 − α below the percentile F(1 − α, df1, df2).]
[Figure 3.2: Densities of the F(1, df) and F(2, df) distributions.]
It can be shown that when the full model is wrong (which implies that the reduced model is also
wrong), it is possible for the F statistic to get either much larger than 1 or much smaller than 1.
Either case calls the reduced model in question.
Traditionally, people and computer programs have concerned themselves only with values of
the F statistic that are much larger than 1. If the full model is true but the reduced model is not true,
for linear models it can be shown that MSTest estimates σ² + δ where δ is some positive number.
Since the full model is true, MSE(Full) still estimates σ², so MSTest/MSE(Full) estimates
(σ² + δ)/σ² = 1 + (δ/σ²) > 1. Thus, if the full model is true but the reduced model is not true, the F
statistic tends to get larger than 1 (and not close to 0). Commonly, an α-level test of whether Model
(Red.) is an adequate substitute for Model (Full) has been rejected when
$$\frac{[\text{SSE(Red.)} - \text{SSE(Full)}]\big/[\text{dfE(Red.)} - \text{dfE(Full)}]}{\text{MSE(Full)}} > F[1 - \alpha, \text{dfE(Red.)} - \text{dfE(Full)}, \text{dfE(Full)}]. \tag{3.1.1}$$
The argument that the F statistic tends to be larger than 1 when the reduced model is false de-
pends on the validity of the assumptions made about the full model. These include the data being
independent with equal variances and involve some structure on the mean values. As discussed ear-
lier, comparing the F statistic to the F distribution additionally presumes that the full and reduced
models are linear and that the data are normally distributed.
Usually, the probability that an F[dfE(Red.) − dfE(Full), dfE(Full)] distribution is larger than
the observed value of the F statistic is reported as something called a P value. For three or more
numerator degrees of freedom, I do not think this usual computation of a P value is really a P value
at all. It is slightly too small. A P value is supposed to be the probability of seeing data as weird or
weirder than you actually observed. With three or more numerator degrees of freedom, F statistics
near 0 can be just as weird as F statistics much larger than 1. Weird values should be determined
as values that have a low probability of occurring or, in continuous cases like these, have a low
probability density function, i.e., the curve plotted in Figure 3.1. The probability density for an F
distribution with three or more degrees of freedom in the numerator gets small both for values much
larger than 1 and for values near 0. To illustrate, consider an F(5, 20) distribution and an observed
test statistic of Fobs = 2.8. The usual reported P value would be 0.0448, the probability of being
at least 2.8. By our two-sided definition, the actual P value should be 0.0456. The computations
depend on the fact that, from the shape of the F(5, 20) distribution, seeing an F statistic of 0.036 is
just as weird as seeing the 2.8, and the probability of seeing something smaller than 0.036 is 0.0008.
Technically, Fobs = 2.8 and Fobs = 0.036 have the same density, and the densities get smaller as the
F values get closer to infinity and to zero, respectively. The P value should be the probability of
being below 0.036 and above 2.8, not just the probability of being above 2.8.
For this little example, the difference between our two-sided and the usual one-sided P values
is 0.0008, so as commonly reported, a one-sided P value of 0.9992 = 1 − 0.0008, which would be
reported for Fobs = 0.036, would make us just as suspicious of the null model as the P value 0.0448,
which would be reported when seeing Fobs = 2.8.
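The two-sided computation for the F(5, 20) illustration can be sketched numerically. The Python code below is only one way to do it, using the standard library: it integrates the F density with Simpson's rule and finds the equal-density point by bisection. The function names, step counts, and the restriction to three or more numerator degrees of freedom are my own assumptions. It should reproduce, approximately, the values 0.0448, 0.036, 0.0008, and 0.0456 quoted above.

```python
import math


def f_pdf(x, d1, d2):
    """Density of the F(d1, d2) distribution at x > 0."""
    log_const = (math.lgamma((d1 + d2) / 2) - math.lgamma(d1 / 2)
                 - math.lgamma(d2 / 2) + (d1 / 2) * math.log(d1 / d2))
    log_kernel = ((d1 / 2 - 1) * math.log(x)
                  - (d1 + d2) / 2 * math.log1p(d1 * x / d2))
    return math.exp(log_const + log_kernel)


def f_cdf(x, d1, d2, steps=20000):
    """P(F <= x) by Simpson's rule on [0, x]; assumes d1 >= 3 so the density is 0 at 0."""
    h = x / steps
    total = f_pdf(x, d1, d2)  # the endpoint at 0 contributes nothing
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * f_pdf(k * h, d1, d2)
    return total * h / 3


def equal_density_point(f_obs, d1, d2):
    """Point below the mode whose density equals that of f_obs (f_obs above the mode)."""
    mode = (d1 - 2) / d1 * d2 / (d2 + 2)
    if f_obs <= mode:
        raise ValueError("this sketch only handles f_obs above the mode")
    target = f_pdf(f_obs, d1, d2)
    lo, hi = 1e-12, mode  # the density increases from 0 up to the mode
    for _ in range(100):
        mid = (lo + hi) / 2
        if f_pdf(mid, d1, d2) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


f_obs, d1, d2 = 2.8, 5, 20
p_one_sided = 1 - f_cdf(f_obs, d1, d2)      # probability of being at least 2.8
f_low = equal_density_point(f_obs, d1, d2)  # value near 0 that is "just as weird"
p_lower = f_cdf(f_low, d1, d2)              # probability of being below that value
p_two_sided = p_one_sided + p_lower
print(round(p_one_sided, 4), round(f_low, 3), round(p_lower, 4), round(p_two_sided, 4))
```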
Alas, I suspect that the “one-sided” P values will be with us for quite a while. I doubt that many
software packages are going to change how they compute the P values for F tests simply because I
disapprove of their current practice. Besides, as a practical matter, checking whether the one-sided
P values are very close to 1 works reasonably well.
We now establish that MSTest is a reasonable estimate of σ² when the reduced model (Red.)
holds. The basic idea is this: If we have three items where the first is an average of the other two,
and if the first item and one of the other two both estimate σ², then the third item must also be an
estimate of σ²; see Exercise 3.10.10. Write
$$\begin{aligned}
\text{MSE(Red.)} &= \frac{1}{\text{dfE(Red.)}}\,[\text{SSE(Red.)} - \text{SSE(Full)} + \text{SSE(Full)}] \\
&= \frac{\text{dfE(Red.)} - \text{dfE(Full)}}{\text{dfE(Red.)}} \cdot \frac{\text{SSE(Red.)} - \text{SSE(Full)}}{\text{dfE(Red.)} - \text{dfE(Full)}} + \frac{\text{dfE(Full)}}{\text{dfE(Red.)}}\,\text{MSE(Full)} \\
&= \frac{\text{dfE(Red.)} - \text{dfE(Full)}}{\text{dfE(Red.)}}\,\text{MSTest} + \frac{\text{dfE(Full)}}{\text{dfE(Red.)}}\,\text{MSE(Full)}.
\end{aligned}$$
This displays MSE(Red.) as a weighted average of MSTest and MSE(Full) because the multipliers
$$\frac{\text{dfE(Red.)} - \text{dfE(Full)}}{\text{dfE(Red.)}} \quad\text{and}\quad \frac{\text{dfE(Full)}}{\text{dfE(Red.)}}$$
are both between 0 and 1 and they add to 1. Since the reduced model is a special case of the full
model, when the reduced model is true, both MSE(Red.) and MSE(Full) are reasonable estimates
of σ². Since one estimate of σ², the MSE(Red.), has been written as a weighted average of another
estimate of σ², the MSE(Full), and something else, MSTest, it follows that the something else must
also be an estimate of σ².
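The weighted-average identity is easy to check numerically. The short Python sketch below uses hypothetical values of SSE and dfE (they are not from any data set in the text) and a function name of my own choosing.

```python
def weighted_average_check(sse_red, dfe_red, sse_full, dfe_full):
    """Return MSE(Red.), the weighted average of MSTest and MSE(Full), and the weight total."""
    ms_test = (sse_red - sse_full) / (dfe_red - dfe_full)
    mse_full = sse_full / dfe_full
    mse_red = sse_red / dfe_red
    w_test = (dfe_red - dfe_full) / dfe_red   # multiplier on MSTest
    w_full = dfe_full / dfe_red               # multiplier on MSE(Full)
    return mse_red, w_test * ms_test + w_full * mse_full, w_test + w_full


# Hypothetical inputs: SSE(Red.) = 15 on 5 df, SSE(Full) = 10 on 4 df.
mse_red, combination, weight_total = weighted_average_check(15.0, 5, 10.0, 4)
print(mse_red, combination, weight_total)
```

The first two returned values agree and the weights total 1, which is all the argument above requires.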
In data analysis, we are looking for a (relatively) succinct way of summarizing the data. The
smaller the model, the more succinct the summarization. However, we do not want to eliminate
useful aspects of a model, so we test the smaller (more succinct) model against the larger model to
see if the smaller model gives up significant explanatory power. Note that the larger model always
has at least as much explanatory power as the smaller model because the larger model includes
everything in the smaller model plus more. Although a reduced model may be an adequate substitute
for a full model on a particular set of data, it does not follow that the reduced model will be an
adequate substitute for the full model with any data collected on the variables in the full model. Our
models are really approximations and a good approximate model for some data might not be a good
approximation for data on the same variables collected differently.
Finally, we mention an alternative way of specifying models. Here we supposed that the model
involves independent data yi , i = 1, . . . , n, with E(yi ) = μi and some common variance, Var(yi ) = σ².
We generally impose some structure on the μi s and sometimes we assume that the yi s are normally
distributed. An equivalent way of specifying the model is to write
yi = μi + εi
and make the assumptions that the εi s are independent with E(εi ) = 0, Var(εi ) = σ², and are normally
distributed. Using the rules for means and variances, it is easy to see that once again,
E(yi ) = E(μi + εi ) = μi + E(εi ) = μi + 0 = μi
and
Var(yi ) = Var(μi + εi ) = Var(εi ) = σ².
It also follows that if the εi s are independent, the yi s are independent, and if the εi s are normally
distributed, the yi s are normally distributed. The εi s are called errors and the residuals ε̂i = yi − ŷi
are estimates (actually predictors) of the errors.
Typically, the full model specifies a relationship among the μi s that depends on some parameters,
say, θ1 , . . . , θr . Typically, a reduced model specifies some additional relationship among the θ j s that
is called a null hypothesis (H0 ), for example, θ1 = θ2 . As indicated earlier, everybody’s favorite
reduced model has a common mean for all observations, hence
yi = μ + εi .
We now apply this theory to the one-sample problems of Chapter 2. The full model is simply the
one-sample model, thus the variance estimate is MSE(Full) = s², which we know has dfE(Full) =
n − 1. A little algebra gives SSE(Full) = dfE(Full) × MSE(Full) = (n − 1)s². For testing the null
model with H0 : μ = m0 , the variance estimate for the reduced model is
$$\text{MSE(Red.)} = \hat\sigma_0^2 \equiv \frac{1}{n}\sum_{i=1}^{n} (y_i - m_0)^2$$
and
$$F = \frac{\text{MSTest}}{\text{MSE(Full)}} = \frac{n(\bar y_\cdot - m_0)^2}{s^2} = \left[\frac{\bar y_\cdot - m_0}{s/\sqrt{n}}\right]^2.$$
The F statistic should be close to 1 if the null model is correct. If the data are normally distributed
under the null model, the F statistic should be one observation from an F(1, n − 1) distribution,
which allows us more precise determinations of the extent to which an F statistic far from 1 contra-
dicts the null model. Recall that with one or two degrees of freedom in the numerator of the F test,
values close to 0 are the values most consistent with the reduced model, cf. Figure 3.2.
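The step from MSTest to n(ȳ· − m0)² in the one-sample F statistic can be written out as follows. This is a sketch of the standard algebra, taking dfE(Red.) = n from the divisor n in the definition of MSE(Red.), so that SSE(Red.) = Σ(yi − m0)².

```latex
\begin{align*}
\text{SSE(Red.)} &= \sum_{i=1}^{n} (y_i - m_0)^2
  = \sum_{i=1}^{n} \left[(y_i - \bar y_\cdot) + (\bar y_\cdot - m_0)\right]^2 \\
&= \sum_{i=1}^{n} (y_i - \bar y_\cdot)^2
  + 2(\bar y_\cdot - m_0)\sum_{i=1}^{n} (y_i - \bar y_\cdot)
  + n(\bar y_\cdot - m_0)^2 \\
&= \text{SSE(Full)} + n(\bar y_\cdot - m_0)^2
  \qquad \text{because } \sum_{i=1}^{n} (y_i - \bar y_\cdot) = 0, \\[4pt]
\text{MSTest} &= \frac{\text{SSE(Red.)} - \text{SSE(Full)}}{\text{dfE(Red.)} - \text{dfE(Full)}}
  = \frac{n(\bar y_\cdot - m_0)^2}{n - (n - 1)}
  = n(\bar y_\cdot - m_0)^2 .
\end{align*}
```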
EXAMPLE 3.1.1. Years ago, 16 people were independently abducted by S.P.E.C.T.R.E. after a
Bretagne Swords concert and forced to submit to psychological testing. Among the tests was a
measure of audio acuity. From many past abductions in other circumstances, S.P.E.C.T.R.E. knows
that such observations form a normal population. The observed values of ȳ· and s² were 22 and
0.25, respectively, for the audio acuity scores. Now the purpose of all this is that S.P.E.C.T.R.E.
had a long-standing plot that required the use of a loud rock band. They had been planning to
use the famous oriental singer Perry Cathay but Bretagne Swords’ fans offered certain properties
they preferred, provided that those fans’ audio acuity scores were satisfactory. From extremely long
experience with abducting Perry Cathay fans, S.P.E.C.T.R.E. knows that they have a population
mean of 20 on the audio acuity test. S.P.E.C.T.R.E. wishes to know whether Bretagne Swords fans
differ from this value. Naturally, they tested H0 : μ = 20.
The test is to reject the null model if
$$F = \frac{16(\bar y_\cdot - 20)^2}{s^2}$$
is far from 1 or, if the data are normally distributed, if the F statistic looks unusual relative to an
F(1, 15) distribution. Using the observed data,
$$F_{obs} = \frac{16(22 - 20)^2}{0.25} = 256$$
which is very far from 1. ✷
EXAMPLE 3.1.2. The National Association for the Abuse of Student Yahoos (also known as
NAASTY) has established guidelines indicating that university dropout rates for math classes should
be 15%. In Chapter 2 we considered data from the University of New Mexico’s 1984–85 academic
year on dropout rates for math classes. We found that the 38 observations on dropout rates were
not normally distributed; they contained two outliers. Based on an α = .05 test, we wish to know if
the University of New Mexico (UNM) meets the NAASTY guidelines when treating the 1984–85
academic year data as a random sample. As is typical in such cases, NAASTY has specified that
the central value of the distribution of dropout rates should be 15% but it has not stated a specific
definition of the central value. We interpret the central value to be the population mean of the dropout
rates and test the null hypothesis H0 : μ = 15%.
The complete data consist of 38 observations from which we compute ȳ· = 13.11 and s² =
106.421. The data are nonnormal so, although the F statistic is reasonable, we have little to justify
comparing the F statistic to the F(1, 37) distribution. Substituting the observed values for ȳ· and s²
into the F statistic gives the observed value of the test statistic
$$F_{obs} = \frac{38(13.11 - 15)^2}{106.421} = 1.275,$$
which is not far from 1. The 1984–85 data provide no evidence that UNM violates the NAASTY
guidelines.
If we delete the two outliers, the analysis changes. The summary statistics become ȳd = 11.083
and s_d² = 27.45. Here the subscript d is used as a reminder that the outliers have been deleted. Without
the outliers, the data are approximately normal and we can more confidently use the F(1, 35)
reference distribution,
$$F_{obs,d} = \frac{36(11.083 - 15)^2}{27.45} \approx 20.2.$$
This is far from 1. In fact, the 0.999 percentile of an F(1, 35) is F(0.999, 1, 35) = 12.9, so an
observed Fd of 20.2 constitutes very unusual data relative to the null model. Now we have evidence
that dropout rates differ from 15% (or that something else is wrong with the model) but only for a
population that no longer includes “outliers.” ✷
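The F statistics in Examples 3.1.1 and 3.1.2 depend on the data only through n, ȳ·, and s², so they can be recomputed from the summary statistics. The Python sketch below does this with a helper whose name is my own. Because it uses the rounded summary statistics printed in the text, the outlier-deleted value comes out near 20.1 rather than exactly the reported 20.2.

```python
def one_sample_f(n, ybar, s2, m0):
    """F = n (ybar - m0)^2 / s^2 for testing H0: mu = m0 in the one-sample model."""
    return n * (ybar - m0) ** 2 / s2


# Example 3.1.1: 16 audio acuity scores, H0: mu = 20.
f_acuity = one_sample_f(16, 22.0, 0.25, 20.0)

# Example 3.1.2: all 38 dropout rates, H0: mu = 15.
f_dropout = one_sample_f(38, 13.11, 106.421, 15.0)

# Example 3.1.2 with the two outliers deleted (36 observations).
f_dropout_deleted = one_sample_f(36, 11.083, 27.45, 15.0)

print(f_acuity, round(f_dropout, 3), round(f_dropout_deleted, 1))
```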
has a distribution that is some member of the family of t distributions, say t(df ), where df specifies
the degrees of freedom. The estimate Est is taken to be a random variable. The standard error,
SE(Est), is the standard deviation of the estimate if that is known, but more commonly it is an
estimate of the standard deviation. If the SE(Est) is estimated, it typically involves an estimate of
σ² and the estimate of σ² determines the degrees of freedom for the t distribution. If the SE(Est) is
known, then typically σ² is known, and the distribution is usually the standard normal distribution,
3.2 INFERENCE ON SINGLE PARAMETERS: ASSUMPTIONS 65
[Figure 3.3: Density of a t(df) distribution, with area 1 − α below and area α above the percentile t(1 − α, df).]
i.e., t(∞). In some problems, e.g., problems involving the binomial distribution, the central limit
theorem is used to get an approximate distribution and inferences proceed as if that distribution
were correct. Although the central limit theorem leads to a standard normal reference distribution,
we generally use a t with finite degrees of freedom, hoping that it provides a better
approximation to the true reference distribution than a standard normal.
Identifying a parameter of interest and an estimate of that parameter is relatively easy. The more
complicated part of the procedure is obtaining the standard error. To do that, one typically derives the
variance of Est, estimates it (if necessary), and takes the square root. Obviously, rules for deriving
variances play an important role in finding standard errors.
These four items—Par, Est, SE(Est), reference distribution—depend crucially on the assump-
tions made in modeling the data. They depend on assumptions made about the expected values of
the observations but also on assumptions of independence, equal variances (homoscedasticity), and
normality or large sample sizes. For the purposes of this discussion, we refer to the assumptions
made to obtain the four items as the (full) model.
We need notation for the percentage points of the t distribution. In particular, we need a name for
the point that cuts off the top α of the distribution. The point that cuts off the top α of the distribution
also cuts off the bottom 1 − α of the distribution. These ideas are illustrated in Figure 3.3. The
notation t(1 − α , df ) is used for the point that cuts off the top α .
The illustration in Figure 3.3 is written formally as
$$\Pr\left[\frac{Est - Par}{\text{SE}(Est)} > t(1 - \alpha, df)\right] = \alpha.$$
The value t(1 − α, df ) is called a percentile or percentage point. It is most often found from a
computer program but can also be found from a t table or, in the case of t(∞), from a standard
normal table. One can get a feeling for how similar a t(df ) distribution is to a standard normal simply
by examining the t tables in Appendix B.1 and noting how quickly the t percentiles approach the
values given for infinite degrees of freedom. Alternatively, Figure 3.4 shows that t(df ) distributions
are centered around 0 and that a t(1) distribution is more spread out than a t(3) distribution, which
is more spread out than a N(0, 1) ≡ t(∞) distribution.
[Figure 3.4: Densities of the N(0, 1), t(3), and t(1) distributions, plotted from −4 to 4.]
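For readers without tables at hand, t percentiles can be approximated directly from the density. The Python sketch below is a rough numerical approach of my own (Simpson's rule for the distribution function, bisection for the percentile) using only the standard library; it is not how statistical software actually computes them. It illustrates how the 0.975 percentiles shrink toward the standard normal value 1.96 as the degrees of freedom grow.

```python
import math


def t_pdf(x, df):
    """Density of the t(df) distribution at x."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))


def t_cdf(x, df, steps=1000):
    """P(T <= x) for x >= 0, using symmetry about 0 and Simpson's rule on [0, x]."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    return 0.5 + total * h / 3


def t_percentile(p, df):
    """Approximate t(p, df) for 0.5 < p < 1 by bisection on the distribution function."""
    lo, hi = 0.0, 1.0
    while t_cdf(hi, df) < p:  # widen the bracket until it contains the percentile
        hi *= 2
    for _ in range(50):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


for df in (1, 3, 15, 35, 37):
    print(df, round(t_percentile(0.975, df), 3))
```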
Although we have advertised the methods to be developed in the next sections as being based
on parameters rather than models, our discussion of parametric testing will continue to be based
on the models assumed for the data and the more specific null models determined by specifying a
particular value for the parameter.
H0 : Par = m0 .
In this context, the null (reduced) model consists of the assumptions made to obtain the four ele-
ments discussed in the previous section together with H0 .
The number m0 must be known; it is some number that is of interest for the specific data being
analyzed. It is impossible to give general rules for picking m0 because the choice must depend on
the context of the data. As mentioned in the previous chapter, the structure of the data (but not the
actual values of the data) sometimes suggests interesting hypotheses such as testing whether two
populations have the same mean or testing whether there is a relationship between two variables.
Ultimately the researcher must determine what hypotheses are of interest and these hypotheses
determine both Par and m0 . In any case, m0 is never just an unspecified symbol; it must have
meaning within the context of the problem.
The test of the null model involving H0 : Par = m0 is based on the four elements discussed in
the previous section and therefore relies on all of the assumptions of the basic model for the data.
In addition, the test assumes H0 is true, so the test is performed assuming the validity of the null
model. The idea of the test is to check whether the data seem to be consistent with the null model.
When the (full) model is true, Est provides an estimate of Par, regardless of the value of
3.3 PARAMETRIC TESTS 67
Par. Under the null model, Par = m0 , so Est should be close to m0 , and thus the t statistic
[Est − m0 ]/SE(Est) should be close to 0. Large positive and large negative values of the t statis-
tic indicate data that are inconsistent with the null model. The problem is in specifying what we
mean by “large.” We will conclude that the data contradict the null model if we observe a value of
[Est − m0 ]/SE(Est) that is farther from 0 than some cutoff values.
The problem is then to make intelligent choices for the cutoff values. The solution is based on
the fact that if the null model is true,
$$\frac{Est - m_0}{\text{SE}(Est)} \sim t(df).$$
In other words, the t statistic, computed from the data and H0 , has a t(df ) distribution. From Fig-
ure 3.3, values of the t(df ) distribution close to 0 are common and values far from 0 are unusual.
We use the t(df ) distribution to quantify how unusual values of the t statistic are.
When we substitute the observed values of Est and SE(Est) into the t statistic we get one
observation on the random t statistic, say tobs . When the null model is true, this observation comes
from the reference distribution t(df ). The question is whether it is reasonable to believe that this one
observation came from the t(df ) distribution. If so, the data are consistent with the null model. If the
observation could not reasonably have come from the reference distribution, the data contradict the
null model. Contradicting the null model is a strong inference; it implies that something about the
null model is false. (Either there is something wrong with the basic model or with the assumption
that Par = m0 .) On the other hand, inferring that the data are consistent with the null model does
not suggest that it is true. Such data can also be consistent with models other than the null model.
The cutoff values for testing are determined by choosing an α level. The α-level test for H0 :
Par = m0 is to reject the null model if
$$\frac{Est - m_0}{\text{SE}(Est)} > t\left(1 - \frac{\alpha}{2}, df\right)$$
or if
$$\frac{Est - m_0}{\text{SE}(Est)} < -t\left(1 - \frac{\alpha}{2}, df\right).$$
This is equivalent to rejecting H0 if
$$\frac{|Est - m_0|}{\text{SE}(Est)} > t\left(1 - \frac{\alpha}{2}, df\right).$$
We are rejecting H0 for those values of [Est − m0 ]/SE(Est) that are most inconsistent with the t(df )
distribution, those being the values far from zero. The α level is just a measure of how weird we
require the data to be before we reject the null model.
EXAMPLE 3.3.1. Consider again the 16 people who were independently abducted by
S.P.E.C.T.R.E. after a Bretagne Swords concert and forced to submit to audio acuity testing.
S.P.E.C.T.R.E. knows that the observations are normal and observed ȳ· = 22 and s² = 0.25.
S.P.E.C.T.R.E. wishes to know whether Bretagne Swords fans differ from the population mean of
20 that Perry Cathay fans display. Naturally, they tested H0 : μ = 20. They chose an α level of 0.01.
1) Par = μ
2) Est = ȳ·
3) SE(Est) = s/√16. In this case the SE(Est) is estimated.
4) [Est − Par]/SE(Est) = [ȳ· − μ]/[s/√16] has a t(15) distribution. This follows because the data
are normally distributed and the standard error is estimated using s.
The α = 0.01 test is to reject H0 if
$$\frac{|\bar y_\cdot - 20|}{s/\sqrt{16}} > 2.947 = t(0.995, 15).$$
To find the appropriate cutoff value, note that 1 − α/2 = 1 − 0.01/2 = 0.995, so t(1 − α/2, 15) =
t(0.995, 15). With ȳ· = 22 and s² = 0.25, we reject H0 if
$$|t_{obs}| \equiv \frac{|22 - 20|}{\sqrt{0.25/16}} > 2.947.$$
Since |22 − 20|/√(0.25/16) = 16 is greater than 2.947, we reject the null model at the α = 0.01 level.
If the assumptions of the basic model are correct, there is clear (indeed, overwhelming) evidence
that the Bretagne Swords fans have higher scores. (Unfortunately, my masters will not let me inform
you whether high scores mean better hearing or worse.) ✷
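The arithmetic of Example 3.3.1 can be sketched in a few lines of Python. The function name is hypothetical, and the cutoff t(0.995, 15) = 2.947 is taken from the text rather than computed.

```python
import math


def two_sided_t_test(est, m0, se, cutoff):
    """Return the observed t statistic and whether |t_obs| exceeds the cutoff."""
    t_obs = (est - m0) / se
    return t_obs, abs(t_obs) > cutoff


n, ybar, s2, m0 = 16, 22.0, 0.25, 20.0
se = math.sqrt(s2 / n)  # estimated standard error s / sqrt(n)
t_obs, reject = two_sided_t_test(ybar, m0, se, cutoff=2.947)
print(t_obs, reject)
```

Squaring the observed t statistic gives 256, the F statistic found for the same data in Example 3.1.1.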
EXAMPLE 3.3.2. We again consider data from the University of New Mexico’s 1984–85 academic
year on dropout rates for math classes and compare them to the NAASTY guidelines of 15%
dropout rates. Based on an α = .05 test, we wish to know if the University of New Mexico meets
the NAASTY guidelines of 15% dropout rates when treating the 1984–85 academic year data as a
random sample. We test the null hypothesis H0 : μ = 15%. The 38 observations on dropout rates
were not normally distributed; they contained two outliers.
From the complete data of 38 observations we compute ȳ· = 13.11 and s² = 106.421. The data
are nonnormal, so we have little choice but to hope that 38 observations constitute a sufficiently
large sample to justify the use of a t approximation, i.e.,
$$\frac{\bar y_\cdot - \mu}{\sqrt{s^2/38}} \sim t(37).$$
With an α level of 0.05 and the t(37) distribution, the test rejects H0 if
$$\frac{\bar y_\cdot - 15}{\sqrt{s^2/38}} > 2.026 = t(0.975, 37) = t\left(1 - \frac{\alpha}{2}, 37\right)$$
or if
$$\frac{\bar y_\cdot - 15}{\sqrt{s^2/38}} < -2.026.$$
Substituting the observed values for ȳ· and s² gives the observed value of the test statistic
$$t_{obs} = \frac{13.11 - 15}{\sqrt{106.421/38}} = -1.13.$$
The value of −1.13 is neither greater than 2.026 nor less than −2.026, so the null hypothesis cannot
be rejected at the 0.05 level. The 1984–85 data provide no evidence that UNM violates the NAASTY
guidelines (or that anything else is wrong with the null model). Many people would use a t(∞)
distribution in this example based on the hope that n = 38 qualifies as a large sample size, but the
t(∞) seems too optimistic to me.
If we delete the two outliers, the analysis changes. Again, the subscript d is used as a reminder
that the outliers have been deleted. Without the outliers, the data are approximately normal and we
can more confidently use the reference distribution
$$\frac{\bar y_d - \mu_d}{\sqrt{s_d^2/36}} \sim t(35).$$
For this reference distribution the α = 0.05 test rejects H0 : μd = 15 if
$$\frac{\bar y_d - 15}{\sqrt{s_d^2/36}} > 2.030 = t(0.975, 35)$$
or if
$$\frac{\bar y_d - 15}{\sqrt{s_d^2/36}} < -2.030 = -t(0.975, 35).$$
With ȳd = 11.083 and s_d² = 27.45 from the data without the outliers, the observed value of the t
statistic is
$$t_{obs,d} = \frac{11.083 - 15}{\sqrt{27.45/36}} = -4.49.$$
The absolute value of −4.49 is greater than 2.030, i.e., −4.49 < −2.030, so we reject the null model
with H0 : μd = 15% at the 0.05 level. When we exclude the two extremely high observations, we
have evidence that the typical dropout rate was different from 15%, provided the other assumptions
are true. In particular, since the test statistic is negative, we have evidence that the population mean
dropout rate with outliers deleted was actually less than 15%. Obviously, most of the UNM math
faculty during 1984–85 were not sufficiently nasty.
Finally, we consider the role of transformations in testing. As in Chapter 2, we again consider
the square roots of the dropout rates with the two outliers deleted. As discussed earlier, NAASTY
has specified that the central value of the distribution of dropout rates should be 15% but has not
stated a specific definition of the central value. We are reasonably free to interpret their guideline and
we now interpret it as though the population mean of the square roots of the dropout rates should
be √15. This interpretation leads us to the null hypothesis H0 : μrd = √15. Here the subscript r
reminds us that square roots have been taken and the subscript d reminds us that outliers have been
deleted. As discussed earlier, a reasonably appropriate reference distribution is
$$\frac{\bar y_{rd} - \mu_{rd}}{\sqrt{s_{rd}^2/36}} \sim t(35).$$
The sample mean and variance of the transformed, deleted data are ȳrd = 3.218 and s_rd² = 0.749574,
so the observed value of the t statistic is
$$t_{obs,rd} = \frac{3.218 - 3.873}{\sqrt{0.749574/36}} = -4.54.$$
The test statistic is similar to that in the previous paragraph. The null hypothesis is again rejected and
all conclusions drawn from the rejection are essentially the same. I believe that when two analyses
both appear to be valid, either the practical conclusions agree or neither analysis should be trusted.
✷
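The three analyses in Example 3.3.2 differ only in their summary statistics, null value, and cutoff, so they can be tabulated together. The Python sketch below uses the summary statistics and t(0.975, df) cutoffs quoted in the text; the labels and the layout are my own.

```python
import math

# Each entry: label, n, sample mean, sample variance, null value, cutoff from the text.
analyses = [
    ("all 38 observations", 38, 13.11, 106.421, 15.0, 2.026),
    ("outliers deleted", 36, 11.083, 27.45, 15.0, 2.030),
    ("square roots, outliers deleted", 36, 3.218, 0.749574, math.sqrt(15.0), 2.030),
]

results = {}
for label, n, ybar, s2, m0, cutoff in analyses:
    t_obs = (ybar - m0) / math.sqrt(s2 / n)  # [Est - m0] / SE(Est)
    results[label] = (round(t_obs, 2), abs(t_obs) > cutoff)
    print(label, results[label])
```

Only the analysis of all 38 observations fails to reject; the two outlier-deleted analyses give nearly identical t statistics, matching the discussion above.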
In practice, people rarely use the procedures presented in this section. These procedures require
one to pick specific values for m0 in H0 : Par = m0 and for α . In practice, one either picks an α level
and presents results for all values of m0 or one picks a value m0 and presents results for all α levels.
The first of these options is discussed in the next section.
3.4 Confidence intervals
Confidence intervals are commonly viewed as the single most useful procedure in statistical infer-
ence. (I don’t think I agree with that view.) A (1 − α ) confidence interval for Par consists of all
the values m0 that would not be rejected by an α -level test of H0 : Par = m0 . In other words, the
confidence interval consists of all the parameter values that are consistent with both the data and the
model as determined by an α -level test. (Since the parameter is part of the model, it seems a little
redundant to specify that these are parameter values that are consistent with the model. One might
take that to be understood.)
A 95% confidence interval for Par is based on the fact that an α = 0.05 level test of H0 : Par = m0
will not be rejected when
$$-t(0.975, df) < \frac{Est - m_0}{\text{SE}(Est)} < t(0.975, df).$$
Some algebra (given in the appendix to the chapter) shows that the test will not be rejected when
$$Est - t(0.975, df)\,\text{SE}(Est) < m_0 < Est + t(0.975, df)\,\text{SE}(Est).$$
Thus, the value m0 is not rejected by a 0.05 level test if and only if m0 is within the interval having
endpoints Est ± t(0.975, df )SE(Est).
EXAMPLE 3.4.1. In Example 3.3.1 we considered past data on audio acuity in a post-rock environment.
Those data were collected on fans of Bretagne Swords from her days of playing Statler
Brothers Solitaire. The nefarious organization responsible for this study found it necessary to up-
date their findings after she found her missing card. This time they abducted for themselves 10
independent observations and they were positive that the data would follow a normal distribution
with variance 6. (Such arrogance is probably responsible for the failure of S.P.E.C.T.R.E.’s plans
of world domination. In any case, their resident statistician was in no position to question these
assumptions.) S.P.E.C.T.R.E. found that ȳ· was 17. They seek a 95% confidence interval for μ , the
mean of the population.
1) Par = μ ,
2) Est = ȳ· ,
3) SE(Est) = √(6/10); in this case SE(Est) is known and not estimated.
4) [Est − Par]/SE(Est) = [ȳ· − μ]/√(6/10) has a t(∞) distribution.
For a 95% confidence interval, observe that 1 − α = 95% = 0.95 and α = 0.05. It follows that
t(1 − α/2, ∞) = t(0.975, ∞) = 1.96. The limits of the 95% confidence interval are
\[ \bar{y}_\cdot \pm 1.96\sqrt{6/10} \]
or, since ȳ· = 17,
\[ 17 \pm 1.96\sqrt{6/10}. \]
S.P.E.C.T.R.E. concluded that for this model the data were consistent with a mean hearing score
between 15.5 and 18.5 for people at this concert (or at least for the population they were considering
for abduction) based on a 0.05 level test. ✷
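The arithmetic in Example 3.4.1 can be checked with a short Python sketch. The function name `confidence_interval` is my own label rather than anything defined in the text, and the t(∞) percentile is taken from the standard normal distribution in Python's standard library.

```python
from math import sqrt
from statistics import NormalDist


def confidence_interval(est, se, quantile):
    """Return the endpoints Est ± quantile * SE(Est)."""
    half_width = quantile * se
    return est - half_width, est + half_width


# Example 3.4.1: n = 10, known variance 6, observed mean 17.
se = sqrt(6 / 10)                      # SE(Est) is known here, not estimated
t_inf = NormalDist().inv_cdf(0.975)    # t(0.975, ∞) is the N(0, 1) percentile
lower, upper = confidence_interval(17, se, t_inf)
print(round(lower, 1), round(upper, 1))
```

Rounded to one decimal place, the endpoints agree with the interval from 15.5 to 18.5 reported in the example.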
EXAMPLE 3.4.2. In Chapter 2 we considered data on dropout rates for math classes. The 38
observations contained two outliers. Our parameter for these data is μ, the population mean dropout
rate for math classes, the estimate is the sample mean ȳ·, and the standard error is √(s²/38), where
s² is the sample variance. Based on the central limit theorem and the law of large numbers, we used
the approximate reference distribution
\[ \frac{\bar{y}_\cdot - \mu}{\sqrt{s^2/38}} \sim t(37). \]
From the 38 observations, we computed ȳ· = 13.11 and s² = 106.421 and found a 95% confidence
interval for the dropout rate of (9.7, 16.5). The endpoints of the confidence interval are computed as
\[ 13.11 \pm 2.026\sqrt{106.421/38}. \]
If we drop the two outliers, the remaining data seem to be normally distributed. Recomputing
the sample mean and sample variance with the outliers deleted, we get ȳ_d = 11.083 and s_d² = 27.45.
Without the outliers, we can use the reference distribution
\[ \frac{\bar{y}_d - \mu_d}{\sqrt{s_d^2/36}} \sim t(35). \]
This t(35) distribution relies on the assumption of normality (which we have validated) rather than
relying on the unvalidated large sample approximations from the central limit theorem and law of
large numbers. Philosophically, the t(35) distribution should give more accurate results, but we have
no way to establish whether that is actually true for these data. To compute a 95% confidence interval
based on the data without the outliers, we need to find the appropriate tabled values. Observe once
again that 1 − α = 95% = 0.95 and α = 0.05. It follows that t(1 − α/2, df) = t(0.975, 35) = 2.030,
and, substituting the observed values of ȳ_d and s_d², the confidence interval has endpoints
\[ 11.083 \pm 2.030\sqrt{27.45/36}. \]
The actual interval is (9.3, 12.9). Excluding the extremely high values that occasionally occur, the
model and data are consistent with a mean dropout rate between 9.3 and 12.9 percent based on a
0.05 test. Remember, this is a confidence interval for the mean of math classes; it does not indicate
that you can be 95% confident that your next math class will have a dropout rate between 9.3 and
12.9 percent. Such an inference requires a prediction interval, cf. Section 3.7.
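As a check on the two dropout-rate intervals, here is a minimal Python sketch. The helper `t_interval` is a name I have made up for illustration, and the t percentiles 2.026 and 2.030 are simply the tabled values quoted in the text rather than something the code computes.

```python
from math import sqrt


def t_interval(mean, variance, n, t_quantile):
    """Endpoints mean ± t_quantile * sqrt(variance / n)."""
    half_width = t_quantile * sqrt(variance / n)
    return mean - half_width, mean + half_width


# All 38 observations, tabled value t(0.975, 37) = 2.026.
all_data = t_interval(13.11, 106.421, 38, 2.026)
# Two outliers deleted, tabled value t(0.975, 35) = 2.030.
deleted = t_interval(11.083, 27.45, 36, 2.030)

print([round(x, 1) for x in all_data])
print([round(x, 1) for x in deleted])
```

The second interval is narrower mostly because the variance estimate drops from 106.421 to 27.45; the change in sample size and in the t percentile is minor by comparison.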
The interval (9.3, 12.9) is much narrower than the one based on all 38 observations, largely
because our estimate of the variance is much smaller when the outliers have been deleted. Note also
that with the outliers deleted, we are drawing inferences about a different parameter than when they
are present. With the outliers deleted, our conclusions are only valid for the bulk of the observations.
While occasional weird observations can be eliminated from our analysis, we cannot stop them from
occurring.
We have also looked at the square roots of the dropout rate data. We now consider the effect on
confidence intervals of transforming the data. With the two outliers deleted and taking square roots
of the observations, we found earlier that the data are reasonably normal. The sample mean and
variance of the transformed, deleted data are ȳ_rd = 3.218 and s_rd² = 0.749574. Using the reference
distribution
\[ \frac{\bar{y}_{rd} - \mu_{rd}}{\sqrt{s_{rd}^2/36}} \sim t(35), \]
we obtain a 95% confidence interval with endpoints
\[ 3.218 \pm 2.030\sqrt{\frac{0.749574}{36}}. \]
The confidence interval reduces to (2.925, 3.511). This is a 95% confidence interval for the pop-
ulation mean of the square roots of the dropout rate percentages with ‘outliers’ removed from the
population.
The confidence interval (2.925, 3.511) does not really address the issue that we set out to inves-
tigate. We wanted some idea of the value of the population mean dropout rate. We have obtained
a 95% confidence interval for the population mean of the square roots of the dropout rate percent-
ages (with outliers removed from the population). There is no simple, direct relationship between
the population mean dropout rate and the population mean of the square roots of the dropout rate
percentages, but a simple device can be used to draw conclusions about typical values for dropout
rates when the analysis is performed on the square roots of the dropout rates.
If the square root data are normal, the mean is the same as the median. The median is a value
with 50% of observations falling at or below it and 50% falling at or above it. Although the mean
on the square root scale does not transform back to the mean on the original scale, the median does.
Since (2.925, 3.511) provides a 95% confidence interval for the median from the square roots of
the dropout rate percentages, we simply square all the values in the interval to draw conclusions
about the median dropout rate percentages. Squaring the endpoints of the interval gives the new
interval (2.925², 3.511²) = (8.6, 12.3). We are now 95% confident that the median of the population
of dropout rates is between 8.6 and 12.3. Interestingly, we will see in Section 3.7 that prediction
intervals do not share these difficulties in interpretation associated with transforming the data.
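The back transformation can be sketched in a few lines of Python. The variable names are mine, and the percentile 2.030 is the tabled t(0.975, 35) from the text. The sketch squares the unrounded endpoints, whereas the text squares the rounded ones; to the accuracy reported, the results agree.

```python
from math import sqrt

# Square-root scale, outliers deleted (values from the text).
mean_rd, var_rd, n = 3.218, 0.749574, 36
t_quantile = 2.030                      # tabled t(0.975, 35)

half_width = t_quantile * sqrt(var_rd / n)
root_interval = (mean_rd - half_width, mean_rd + half_width)

# Squaring is monotone for positive endpoints, so it carries the interval for
# the median on the square-root scale to an interval for the median dropout rate.
median_interval = tuple(x ** 2 for x in root_interval)

print([round(x, 3) for x in root_interval])
print([round(x, 1) for x in median_interval])
```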
Note that the back transformed interval (8.6, 12.3) for the median obtained from the trans-
formed, deleted data is similar to the interval (9.3, 12.9) for the mean (which is also the median of
the assumed model) obtained earlier from the untransformed data with the outliers deleted. Again,
when two distinct analyses both seem reasonably valid, I would be very hesitant about drawing
practical conclusions that could not be justified from both analyses. ✷
The confidence intervals obtained from this theory can frequently be obtained by another ap-
proach to statistical inference using ‘Bayesian’ arguments; see Berry (1996). In the Bayesian justi-
fication, the correct interpretation of a 95% confidence interval is that the probability is 95% that
the parameter is in the interval.
Rather than the testing interpretation or the Bayesian interpretation, most statisticians seem to
favor the Neyman–Pearson definition for confidence intervals based on the idea that in a long run
of performing 95% confidence intervals, about 95% will contain the true parameter. Of course this
does not actually tell you anything about the confidence interval at hand. It also assumes that all the
models are correct in the long run of confidence intervals. It is difficult to get students to accept this
definition as anything other than a memorized fact. Students frequently misinterpret this definition
as the Bayesian interpretation.
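The long run idea is easy to simulate. The following Python sketch is only an illustration under assumptions I have chosen: normal data with a known variance of 6, samples of size 10, and a true mean of 17 (borrowed from Example 3.4.1 as if it were the truth). The function `coverage` and the seed are hypothetical choices, not part of the text.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def coverage(mu, sigma, n, reps, seed=0):
    """Fraction of simulated 95% intervals (known sigma) that contain mu."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.975)     # t(0.975, ∞)
    half_width = z * sigma / sqrt(n)
    hits = 0
    for _ in range(reps):
        ybar = fmean(rng.gauss(mu, sigma) for _ in range(n))
        if ybar - half_width < mu < ybar + half_width:
            hits += 1
    return hits / reps


rate = coverage(mu=17, sigma=sqrt(6), n=10, reps=10_000)
print(rate)   # a long-run fraction; it says nothing about any one interval
```

The simulated fraction should land near 0.95, but note that the simulation generates every data set from a correct model, which is exactly the assumption the long run interpretation needs.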
The long run interpretation of confidence intervals tempts people to make a mistake in inter-
pretation. If I am about to flip a coin, we can agree that the physical mechanism involved gives
probability 1/2 to both heads and tails. If I flip the coin but don’t show it to you, you still feel
like the probabilities are both 1/2. But I know the result! Therefore, the probabilities based on the
physical mechanism no longer apply, and your feeling that probability 1/2 is appropriate is entirely
in your head. It feels good, but what is the justification? Bayesian Statistics involves developing
justifications for such probabilities.
The long run interpretation of confidence intervals is exactly the same as flipping a coin that
turns up heads, say, 95% of the time. The parameter being in the interval is analogous to the coin
being heads. Maybe it is; maybe it isn’t. How the number 0.95 applies to a particular interval or
flip, after it has been determined, is a mystery. Of course many statisticians simply recite the correct
probability statement and ignore its uselessness. The significance testing and Bayesian interpreta-
tions of the intervals both seem reasonable to me.
Confidence intervals give all the possible parameter values that seem to be consistent with the
data and the model. In particular, they give the results of testing H0 : Par = m0 for a fixed α but every
choice of m0 . In the next section we discuss P values that give the results of testing H0 : Par = m0
for a fixed m0 but every choice of α .
3.5 P values
Rather than having formal rules for when to reject the null model, one can report the evidence
against the null model. That is done by reporting the P value. The P value is computed under the
null model. It is the probability of seeing data that are as weird or more weird than those that were
actually observed. Formally, with H0 : Par = m0 we write tobs for the observed value of the t statistic
as computed from the observed values of Est and SE(Est). Thus tobs is our summary of the data
that were actually observed. Recalling our earlier discussion that the most unusual values of tobs are
those far from 0, the probability under the null model of seeing something as or more weird than
we actually saw is the probability that a t(df ) distribution is farther from 0 than |tobs |. Formally, we
can write this as
\[ P = \Pr\left[ \left| \frac{\mathrm{Est} - m_0}{\mathrm{SE}(\mathrm{Est})} \right| \ge |t_{obs}| \right]. \]
Here Est (and usually SE(Est)) are viewed as random and it is assumed that Par = m0 so that
(Est − m0 )/SE(Est) has the known reference distribution t(df ). The value of tobs is a fixed known
number, so we can actually compute P. Using the symmetry of the t(df ) distribution, the basic
idea is that for, say, tobs positive, any value of (Est − m0 )/SE(Est) greater than tobs is more weird
than tobs . Any data that yield (Est − m0 )/SE(Est) = −tobs are just as weird as tobs and values of
(Est − m0 )/SE(Est) less than −tobs are more weird than observing tobs .
EXAMPLE 3.5.1. Again consider the Bretagne Swords data. We have 16 observations taken
from a normal population and we wish to test H0 : μ = 20. As before, 1) Par = μ, 2) Est = ȳ·,
3) SE(Est) = s/√16, and 4) [Est − Par]/SE(Est) = [ȳ· − μ]/[s/√16] has a t(15) distribution. This
time we take ȳ· = 19.78 and s² = 0.25, so the observed test statistic is
\[ t_{obs} = \frac{19.78 - 20}{\sqrt{0.25/16}} = -1.76. \]
The P value is the smallest α level for which the test would be rejected. Thus, if we perform an
α -level test where α is less than the P value, we can conclude immediately that the null model is not
rejected. If we perform an α -level test where α is greater than the P value, we know immediately
that the null model is rejected. Thus computing a P value eliminates the need to go through the
formal testing procedures described in Section 3.3. Knowing the P value immediately gives the test
results for any choice of α . The P value is a measure of how consistent the data are with the null
model. Large values (near 1) indicate great consistency. Small values (near 0) indicate data that are
inconsistent with the null model.
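To make the definition concrete, here is a rough Python sketch that computes a two-sided P value by numerically integrating the t density with Simpson's rule. The function names are my own, the step count is an arbitrary choice, and in practice one would use a statistical package instead; the point is only to show what is being computed.

```python
from math import exp, lgamma, log, pi


def t_density(x, df):
    """Density of the t(df) distribution at x."""
    log_c = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi)
    return exp(log_c - (df + 1) / 2 * log(1 + x * x / df))


def two_sided_p_value(t_obs, df, steps=2000):
    """Pr[|t(df)| >= |t_obs|], using Simpson's rule on [0, |t_obs|]."""
    b = abs(t_obs)
    h = b / steps                       # steps must be even for Simpson's rule
    total = t_density(0.0, df) + t_density(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    central_half = total * h / 3        # Pr[0 <= t(df) <= |t_obs|]
    return 1 - 2 * central_half


p = two_sided_p_value(-1.76, 15)        # t statistic from Example 3.5.1
print(round(p, 3))
for alpha in (0.01, 0.05, 0.20):
    # An α-level test rejects exactly when the P value is at most α.
    print(alpha, "reject" if p <= alpha else "do not reject")
```

A single P value thus reproduces the outcome of the α-level test for every α at once.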
EXAMPLE 3.5.2. In Example 3.3.2 we considered tests for the drop rate data. Using the complete
untransformed data and the null hypothesis H0 : μ = 15, we observed the test statistic
\[ t_{obs} = \frac{13.11 - 15}{\sqrt{106.421/38}} = -1.13. \]
An α = 0.26 test would be just barely rejected by these data. Any test with an α level smaller than
0.26 is more stringent (the cutoff values are farther from 0 than 1.13) and would not be rejected.
Thus the commonly used α = 0.05 and α = 0.01 tests would not be rejected. Similarly, any test
with an α level greater than 0.26 is less stringent and would be rejected. Of course, it is extremely
rare that one would use a test with an α level greater than 0.26. Recall that the P value of 0.26 is
a highly questionable number because it was based on a highly questionable reference distribution,
the t(37).
Using the untransformed data with outliers deleted and the null hypothesis H0 : μd = 15, we
observed the test statistic
\[ t_{obs,d} = \frac{11.083 - 15}{\sqrt{27.45/36}} = -4.49. \]
We compute
P = Pr [|t(35)| ≥ | − 4.49|] = 0.000.
This P value is not really zero; it is a number that is so small that when we round it off to three
decimal places the number is zero. In any case, the test is rejected for any reasonable choice of α .
In other words, the test is rejected for any choice of α that is greater than 0.000. (Actually for any
α greater than 0.0005 because of the round-off issue.)
Using the square roots of the data with outliers deleted and the null hypothesis H0 : μrd = √15,
the observed value of the test statistic is
\[ t_{obs,rd} = \frac{3.218 - 3.873}{\sqrt{0.749574/36}} = -4.54. \]
We compute
P = Pr [|t(35)| ≥ | − 4.54|] = 0.000.
Once again, the test result is highly significant. But remember, unless you are reasonably sure that
the model is right, you cannot be reasonably sure that H0 is wrong. ✷
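The three test statistics in Example 3.5.2 all come from the same formula, which a small Python sketch makes plain. The function `t_statistic` is a name I have invented for illustration; the inputs are the summary statistics given in the text.

```python
from math import sqrt


def t_statistic(est, m0, variance, n):
    """Observed t statistic (Est - m0) / SE(Est) with SE(Est) = sqrt(variance / n)."""
    return (est - m0) / sqrt(variance / n)


t_all = t_statistic(13.11, 15, 106.421, 38)            # complete data
t_deleted = t_statistic(11.083, 15, 27.45, 36)         # outliers deleted
t_root = t_statistic(3.218, sqrt(15), 0.749574, 36)    # square roots, outliers deleted

print(round(t_all, 2), round(t_deleted, 2), round(t_root, 2))
```

Note that on the square-root scale the hypothesized value is √15, not 15.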
EXAMPLE 3.5.3. In Example 3.3.1 we considered audio acuity data for Bretagne Swords fans
and tested whether their mean score differed from fans of Perry Cathay. In this example we test
whether their mean score differs from that of Tangled Female Sibling fans. Recall that the observed
values of n, ȳ· , and s2 for Bretagne Swords fans were 16, 22, and 0.25, respectively and that the
data were normal. Tangled Female Sibling fans have a population mean score of 22.325, so we test
H0 : μ = 22.325. The test statistic is (22 − 22.325)/√(0.25/16) = −2.6. If we do an α = 0.05 test,
|−2.6| > 2.13 = t(0.975, 15), so we reject H0, but if we do an α = 0.01 test, |−2.6| < 2.95 =
t(0.995, 15), so we do not reject H0. In fact, |−2.6| ≐ t(0.99, 15), so the P value is essentially
0.02 = 2(1 − 0.99). The P value is larger than 0.01, so the 0.01 test does not reject H0; the P value is
less than 0.05, so the test rejects H0 at the 0.05 level.
If we consider confidence intervals, the 99% interval has endpoints 22 ± 2.95√(0.25/16) for an
interval of (21.631, 22.369) and the 95% interval has endpoints 22 ± 2.13√(0.25/16) for an interval
of (21.734, 22.266). Notice that the hypothesized value of 22.325 is inside the 99% interval, so it is
not rejected by a 0.01 level test, but 22.325 is outside the 95% interval, so a 0.05 test rejects H0 :
μ = 22.325. The 98% interval has endpoints 22 ± 2.60√(0.25/16) for an interval of (21.675, 22.325)
and the hypothesized value is on the edge of the interval. ✷
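The correspondence between intervals and tests in Example 3.5.3 can be checked directly. In this Python sketch the helper `interval_and_test` is a hypothetical name, and the percentiles 2.13 and 2.95 are the tabled t(15) values quoted in the example.

```python
from math import sqrt

ybar, s2, n, m0 = 22, 0.25, 16, 22.325
se = sqrt(s2 / n)                       # 0.125
t_obs = (ybar - m0) / se                # about -2.6


def interval_and_test(t_quantile):
    """Return the interval, whether m0 lies inside it, and whether the test rejects."""
    lower, upper = ybar - t_quantile * se, ybar + t_quantile * se
    inside = lower < m0 < upper
    rejected = abs(t_obs) >= t_quantile
    return (lower, upper), inside, rejected


# Tabled percentiles from the text: t(0.975, 15) = 2.13 and t(0.995, 15) = 2.95.
for level, t_quantile in ((0.95, 2.13), (0.99, 2.95)):
    (lower, upper), inside, rejected = interval_and_test(t_quantile)
    print(level, round(lower, 3), round(upper, 3), inside, rejected)
```

In each case the hypothesized value lies inside the interval exactly when the matching test fails to reject.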
In the absence of other assumptions, a large P value does not constitute evidence in support
of the null model. A large P value indicates that the data are consistent with the null model but,
typically, it is easy to find other models even more consistent with the data. In Example 3.5.1, the
data are even more consistent with μ = 19.78.
Philosophically, it would be more proper to define the P value prior to defining an α -level test,
defining an α -level test as one that rejects when the P value is less than or equal to α . One would
then define confidence intervals relative to α -level tests. I changed the order because I caved to my
perception that people are more interested in confidence intervals.
3.6 Validity of tests and confidence intervals
In significance testing, we make an assumption, namely the null model, and check whether the data
are consistent with the null model or inconsistent with it. If the data are consistent with the null
model, that is all that we can say. If the data are inconsistent with the null model, it suggests that
the null model is somehow wrong. (This is very similar to the mathematical idea of a proof by
contradiction.)
Often people want a test of the null hypothesis H0 : Par = m0 rather than the null model. The
null model involves a series of assumptions in addition to H0 : Par = m0 . Typically we assume that
observations are independent and have equal variances. In most tests that we will consider, we as-
sume that the data have normal distributions. As we consider more complicated data structures, we
will need to make more assumptions. The proper conclusion from a test is that either the data are
consistent with our assumptions or the data are inconsistent with our assumptions. If the data are
inconsistent with the assumptions, it suggests that at least one of them is invalid. In particular, if the
data are inconsistent with the assumptions, it does not necessarily imply that the particular assump-
tion embodied in the null hypothesis is the one that is invalid. Before we can reasonably conclude
that the null hypothesis is untrue, we need to ensure that the other assumptions are reasonable. Thus
it is crucial to check our assumptions as fully as we can. Plotting the data, or more often plotting the
residuals, plays a vital role in checking assumptions. Plots are used throughout the book, but special
emphasis on plotting is given in Chapter 7.
In Section 3.2 it is typically quite easy to define parameters Par and estimates Est. The role of
the assumptions is crucial in obtaining a valid SE(Est) and an appropriate reference distribution.
If our assumptions are reasonably valid, our SE(Est) and reference distribution will be reasonably
valid and the procedures outlined here lead to conclusions about Par with reasonable validity. Of
course the assumptions that need to be checked depend on the precise nature of the analysis being
performed, i.e., the precise model that has been assumed.
Typically, the prediction standard error is much larger than the standard error of the estimate, so
prediction intervals are much wider than confidence intervals for E(y0 ). In particular, increasing the
number of observations typically decreases the standard error of the estimate but has a relatively
minor effect on the standard error of prediction. Increasing the sample size is not intended to make
σ̂ 2 smaller, it only makes σ̂ 2 a more accurate estimate of σ 2 .
EXAMPLE 3.7.1. As in Example 3.3.2, we eliminate the two outliers from the dropout rate data.
The 36 remaining observations are approximately normal. A 95% confidence interval for the mean
had endpoints
\[ 11.083 \pm 2.030\sqrt{27.45/36}. \]
A 95% prediction interval has endpoints
\[ 11.083 \pm 2.030\sqrt{27.45 + \frac{27.45}{36}} \]
or
\[ 11.083 \pm 10.782. \]
The prediction interval is (0.301, 21.865), which is much wider than the confidence interval of
(9.3, 12.9). Dropout rates for a new math class between 0.3% and 21.9% are consistent with the data
and the model based on a 0.05 level test. Population mean dropout rates for math classes between
9% and 13% are consistent with the data and the model. Of course the prediction interval assumes
that the new class is from a population similar to the 1984–85 math classes with huge dropout rates
deleted. Such assumptions are almost impossible to validate. Moreover, there is some chance that
the new observation will be one with a huge dropout rate and this interval says nothing about such
observations.
In Example 3.3.2 we also considered the square roots of the dropout rate data with the two
outliers eliminated. To predict the square root of a new observation, we use the 95% interval
\[ 3.218 \pm 2.030\sqrt{0.749574 + \frac{0.749574}{36}}, \]
which reduces to (1.436, 5.000). This is a prediction interval for the square root of a new observation,
so actual values of the new observation between (1.436², 5.000²), i.e., (2.1, 25), are consistent
with the data and model based on a 0.05 level test. Retransforming a prediction interval back into
the original scale causes no problems of interpretation whatsoever. This prediction interval and the
one in the previous paragraph are comparable. Both include values from near 0 up to the low to mid
twenties. ✷
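Both prediction intervals in Example 3.7.1 follow the same recipe, sketched here in Python. The function `prediction_interval` is my own illustrative name, and 2.030 is the tabled t(0.975, 35) used in the text.

```python
from math import sqrt


def prediction_interval(mean, variance, n, t_quantile):
    """Endpoints mean ± t_quantile * sqrt(variance + variance / n)."""
    half_width = t_quantile * sqrt(variance + variance / n)
    return mean - half_width, mean + half_width


# Original scale, outliers deleted.
raw = prediction_interval(11.083, 27.45, 36, 2.030)
# Square-root scale, then squared to return to the original scale.
root = prediction_interval(3.218, 0.749574, 36, 2.030)
back = tuple(x ** 2 for x in root)

print([round(x, 3) for x in raw])
print([round(x, 3) for x in root])
print([round(x, 1) for x in back])
```

The variance term under the square root is dominated by the variance of the new observation, which is why a larger sample would do little to shorten these intervals.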
EXAMPLE 3.7.2. Assuming that a sample of 36 observations is enough to ensure that s² is essentially
equal to σ², the nominal 95% prediction interval given in Example 3.7.1 for dropout rates
has a confidence level, regardless of the distribution of the data, that is at least
\[ 1 - \frac{1}{2.030^2} = 76\% \quad \text{or even} \quad 1 - \frac{1}{2.25(2.030)^2} = 89\%. \]
The true α level for the corresponding test is no more than 0.24, or, if the improved version of
Chebyshev applies, 0.11.
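The two bounds in Example 3.7.2 are quick to reproduce. In this Python sketch the function name and its `improvement` argument are hypothetical; the factor 2.25 is the one that appears in the improved bound quoted in the example.

```python
def chebyshev_coverage(k, improvement=1.0):
    """Lower bound 1 - 1/(improvement * k**2) on the coverage of mean ± k sd."""
    return 1 - 1 / (improvement * k ** 2)


k = 2.030                                   # multiplier in the nominal 95% interval
basic = chebyshev_coverage(k)               # plain Chebyshev inequality
improved = chebyshev_coverage(k, 2.25)      # improved version with the factor 2.25
print(round(basic, 2), round(improved, 2))
```

One minus each bound gives the corresponding ceiling on the true α level, 0.24 and 0.11.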
The tabled value t(1 − α /2, df ) can be approximated by t(1 − α /2, ∞). If we specify that the
confidence interval is to be w units wide, set
\[ w = 2t(1 - \alpha/2, \infty)\,\sigma A \qquad (3.8.1) \]
and solve for the (approximate) appropriate sample size. In Equation (3.8.1), w, t(1 − α /2, ∞), and
σ are all known and A is a known function of the sample size.
Unfortunately it is not possible to take Equation (3.8.1) any further and show directly how it
determines the sample size. The discussion given here is general and thus the ultimate solution
depends on the type of data being examined. In the only case we have examined as yet, there is one
sample, Par = μ, Est = ȳ·, and SE(Est) = σ/√n. Thus, A = 1/√n and Equation (3.8.1) becomes
\[ w = 2t(1 - \alpha/2, \infty)\,\sigma/\sqrt{n}. \]
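In the one-sample case the width equation can be solved for n, giving n = [2t(1 − α/2, ∞)σ/w]². The Python sketch below does this; the function name is my own, rounding up to a whole observation is my choice, and the planning values σ = 3 and w = 2 are hypothetical numbers used only for illustration.

```python
from math import ceil
from statistics import NormalDist


def sample_size_for_width(w, sigma, alpha=0.05):
    """Smallest n with 2 * t(1 - alpha/2, ∞) * sigma / sqrt(n) <= w."""
    t_inf = NormalDist().inv_cdf(1 - alpha / 2)
    return ceil((2 * t_inf * sigma / w) ** 2)


# Hypothetical planning values: sigma = 3 and a desired width of w = 2.
print(sample_size_for_width(w=2, sigma=3))
# Halving the width roughly quadruples the required sample size.
print(sample_size_for_width(w=1, sigma=3))
```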
Cox shows that this procedure gives reasonable powers for common choices of α . Here mA and m0
are known and SE(Est) = σ A, where σ is known and A is a known function of sample size. Also
note that this suggestion does not depend explicitly on the α level of the test. As with Equation
(3.8.1), Equation (3.8.2) can be solved to give n in particular cases, but a general solution for n is
not possible because it depends on the exact nature of the value A.
Consider again the problem of determining the mean height. If my null hypothesis is H0 : μ = 72
and I want a reasonable chance of rejecting H0 when μ = 73, Cox’s rule suggests that I should have
1 = |73 − 72| ≐ 3(3/√n), so that √n ≐ 9 or n ≐ 81.
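The height calculation can be written as a one-line function. This Python sketch assumes, as the arithmetic above suggests, that the rule sets the difference between the alternative and null values equal to three standard errors, with SE(Est) = σ/√n and σ = 3. The function name is hypothetical.

```python
from math import ceil


def cox_sample_size(m0, m_alt, sigma):
    """Solve |m_alt - m0| = 3 * sigma / sqrt(n) for n in the one-sample case."""
    return ceil((3 * sigma / abs(m_alt - m0)) ** 2)


print(cox_sample_size(m0=72, m_alt=73, sigma=3))   # the height example
```

Like the width calculation, this is only a rough guide, since it depends on a guess for σ.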
It is important to remember that these are only rough guides for sample sizes. They involve
several approximations, the most important of which is approximating σ . If there is more than
one parameter of interest in a study, sample size computations can be performed for each and a
compromise sample size can be selected.
In the early years of my career, I was amazed at my own lack of interest in teaching students
about statistical power until Cox (1958, p. 161) finally explained it for me. He points out that power
is very important in planning investigations but it is not very important in analyzing them. I might
even go so far as to say that once the data have been collected, a power analysis can at best tell you
whether you have been wasting your time. In other words, a power analysis will only tell you how
likely you were to find differences given the design of your experiment and the choice of test.
Although the simple act of rejecting a null model does nothing to suggest what models might be
correct, it can still be interesting to see whether we have a reasonable chance of rejecting the null
model when some alternative model is true. Hence our discussion. However, the theory of testing
presented here is not an appropriate theory for making choices between a null model and some
alternative. Our theory is a procedure for (possibly) falsifying a null model.
where m(·) is some fixed function that determines the mean of y for a given x and ε is some unob-
servable error term with mean 0. Thus
yh = m(xh ) + εh , h = 1, . . . , n. (3.9.1)
We typically assume
εh s independent N(0, σ 2 ). (3.9.2)
Here σ 2 is an unknown parameter that we must estimate. Together (3.9.1) and (3.9.2) constitute our
model for the observations. The function m(·) is our model for the mean of y. We make assumptions
about the form of m(·) that typically include unknown (mean) parameters that we must estimate.
Frequently, we find it more convenient to express the model in terms of the observations. These are
independent, normally distributed, and have the same variance σ², i.e.,
\[ y_h\text{s independent } N[m(x_h), \sigma^2]. \]
If x is a single variable that only ever takes on one value, say, x ≡ 1, then we have the model for
a one-sample problem as discussed in Chapter 2. In particular, Model (3.9.1) becomes
yh = m(1) + εh, h = 1, . . . , n.
yh = μ + εh , h = 1, . . . , n,
yh s independent N(μ , σ 2 ).
In Chapter 6 we deal with a model that involves a single measurement predictor. In particular,
we discuss verbal abilities y in a school and relate them to a measurement of socio-economic status
x for the school. In simple linear regression we assume that
m(x) = β0 + β1x,
so our model incorporates a linear relationship between x and the expected value of y. For a set of n
observations, write
yh = β0 + β1 xh + εh , h = 1, . . . , n.
Here x is a known value but β0 and β1 are unknown, uniquely defined mean parameters that we must
estimate.
In Chapter 8 we introduce models with more complicated functions of a single predictor x. These
include polynomial models. A third-degree polynomial regression model has
\[ m(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3. \]
Again the β s are unknown, uniquely defined mean parameters and x is treated as fixed. If the rela-
tionship between x and E(y) is nonlinear, polynomials provide one method of modeling the nonlin-
ear relationship.
In Section 6.9 we introduce, and in Chapters 9, 10, and 11 we consider in detail, models for
measurement variables with a vector of predictors x = (x1 , . . . , x p )′ . With p = 3, a typical multiple
regression model incorporates
m(x) = β0 + β1x1 + β2 x2 + β3 x3 .
Here the β j s are unknown parameters and the xh j values are all treated as fixed.
The predictors xh j used in (3.9.4) are necessarily numerical. Typically, they are either measure-
ments of some sort or 0-1 indicators of group membership. Categorical variables do not have to
be numerical (Sex, Race) but categories are often coded as numbers, e.g., Female = 1, Male = 2. It
would be inappropriate to use a (non-binary) categorical variable taking numerical values in (3.9.4).
A categorical variable with, say, five categories should be incorporated into a multiple regression by
incorporating four predictor variables that take on 0-1 values. More on this in Sections 6.8 and 12.3.
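As a small illustration of the 0-1 coding, the Python sketch below turns a categorical variable with c categories into c − 1 indicator columns. The function name, the choice of the first category (in sorted order) as the one without its own column, and the letter-valued data are all hypothetical choices of mine.

```python
def indicator_columns(values):
    """Code a categorical variable with c categories as c - 1 zero-one columns.

    The first category in sorted order gets no column of its own
    (an arbitrary choice made here for illustration).
    """
    categories = sorted(set(values))
    kept = categories[1:]
    return kept, [[1 if v == c else 0 for c in kept] for v in values]


# Hypothetical five-category variable.
x = ["a", "c", "e", "b", "d", "a"]
names, columns = indicator_columns(x)
print(names)
for row in columns:
    print(row)
```

Each observation becomes a row of four 0-1 values, and an observation in the omitted category is a row of zeros.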
Chapter 4 deals with two-sample problems, so it deals with a single categorical predictor that
only takes on two values. Suppose x takes on just the two values 1 and 2 for, say, females and males.
Then our model for the mean of y reduces to the two-sample model
\[ m(x) = \begin{cases} \mu_1 \equiv m(1), & \text{if } x = 1 \\ \mu_2 \equiv m(2), & \text{if } x = 2. \end{cases} \]
We have only two uniquely defined mean parameters to estimate: μ1 and μ2 . This m(·) gives the
model used in Section 4.2.
Unlike simple, polynomial, and multiple regression, there is no convenient way to write the
specific two-sample model in the general form (3.9.1). Although the two-sample model clearly fits
the general form, to deal with categorical variables it is convenient to play games with our subscripts.
We replace the single subscript h that indicates all n of the observations in the data with a pair of
subscripts: i that identifies the group and j that identifies observations within the group. If we have
Ni observations in group i, the total number of observations n must equal N1 + N2 . Now we can
rewrite Model (3.9.1), when x is a two-group categorical predictor, as
yi j = m(i) + εi j , i = 1, 2, j = 1, . . . , Ni .
yi j = μi + εi j , i = 1, 2, j = 1, . . . , Ni .
A single categorical predictor variable with more than two groups works pretty much the same
way. If there are a groups and the categorical predictor variable takes on the values 1, . . . , a, the
model has
\[ m(x) = \begin{cases} \mu_1 \equiv m(1), & \text{if } x = 1 \\ \mu_2 \equiv m(2), & \text{if } x = 2 \\ \quad \vdots & \quad \vdots \\ \mu_a \equiv m(a), & \text{if } x = a, \end{cases} \]
with a uniquely defined mean parameters to estimate. We can rewrite Model (3.9.1) when x is an a
group categorical predictor as
yi j = μi + εi j , i = 1, . . . , a, j = 1, . . . , Ni .
These one-way analysis of variance (ANOVA) models are examined in Chapter 12.
It really does not matter what values the categorical predictor actually takes as long as there are
only a distinct values. Thus, x can take on any a numbers or it can take on any a letter values or a
symbols of any kind, as long as they constitute distinct group identifiers. If the category is sex, the
values may be the words “male” and “female.”
Sometimes group identifiers can simultaneously be meaningful measurement variables. In Chap-
ter 12 we examine data on the strength of trusses built with metal plates of different lengths. The
metal plates are 4, 6, 8, 10, or 12 inches long. There are 7 observations for each length of plate,
so we create a predictor variable x with n = 35 that takes on these five numerical values. We now
have two options. We can treat x as a categorical variable with five groups, or we can treat x as a
measurement predictor variable and fit a linear or other polynomial regression model. We will see
in Chapter 12 that fitting a polynomial of degree four (one less than the number of categories) is
equivalent to treating the variable as a five-category predictor.
If we have two categorical predictors, say, x1 a type of drug and x2 a racial group, we have
considerable variety in the models we can build. Suppose x1 takes on the values 1, . . . , a, x2 takes
on the values 1, . . . , b, and x = (x1 , x2 ). Perhaps the simplest two-category predictor model to state
is the interaction model
\[ m(x) = \mu_{ij}, \quad \text{if } x_1 = i \text{ and } x_2 = j, \]
with ab uniquely defined mean parameters. Using alternative subscripts we can write this model as
yi jk = μi j + εi jk , i = 1, . . . , a, j = 1, . . . , b, k = 1, . . . , Ni j ,
where Ni j is the number of observations that have both x1 = i and x2 = j. For the interaction model,
we could replace the two categorical variables having a and b groups, respectively, with a single
categorical variable that takes on ab distinct categories.
Two categorical variables naturally allow some useful flexibility. We can write an additive-effects
model, also called a no-interaction model, as
\[ m(x) = \mu + \alpha_i + \eta_j, \quad \text{if } x_1 = i \text{ and } x_2 = j, \]
or
yi jk = μ + αi + η j + εi jk , i = 1, . . . , a, j = 1, . . . , b, k = 1, . . . , Ni j .
Here there are 1 + a + b parameters, μ , the αi s, and the η j s. Two of the parameters (not just any
two) are redundant, so there are (1 + a + b) − 2 functionally distinct parameters. Models with two
categorical predictors are discussed in Chapter 14. Models with one categorical predictor and one
continuous predictor are discussed in Chapter 15 along with instances when the continuous predictor
can also be viewed as a categorical predictor. Models for three categorical predictors are discussed
in Chapter 16.
Models based on two categorical predictors are called two-factor ANOVA models. A model
based on two or more categorical predictors is called a multifactor ANOVA model. Models with
three or more categorical predictors may also be called higher-order ANOVAs. An ANOVA model
is considered balanced if the number of observations on each group or combination of groups is the
same. For a one-way ANOVA that means N1 = · · · = Na ≡ N and for a two-factor ANOVA it means
Ni j ≡ N for all i and j. Computations for ANOVA models are much simpler when they are balanced.
Analysis of covariance (ACOVA or ANCOVA) consists of situations in which we have both types
of predictors (measurement and categorical) in the same model. ACOVA is primarily introduced
in Chapter 15. Some use of it is also made in Section 8.4. When group identifiers are simultane-
ously meaningful measurements, we have the option of performing ACOVA, multifactor ANOVA,
or multiple regression, depending on whether we view the predictors as a mix of categorical and
measurement, all categorical, or all measurement.
The models m(·) all involve some unknown parameters that we must estimate, although some of
the parameters may be redundant. Call the number of nonredundant, i.e., functionally distinct, mean
parameters r. Upon estimating the mean parameters, we get an estimated model m̂(·). Applying
this estimated model to the predictor variables in our data gives the fitted values, also called the
predicted values,
ŷh ≡ m̂(xh ), h = 1, . . . , n.
From the fitted values, we compute the residuals,
ε̂h ≡ yh − ŷh , h = 1, . . . , n.
As discussed in Chapter 7, we use the residuals to check the assumptions made in (3.9.2).
We also use the residuals to estimate the unknown variance, σ², in (3.9.2). The degrees of
freedom for error is defined as the number of observations minus the number of functionally distinct
mean parameters, i.e.,
dfE = n − r.
84 3. GENERAL STATISTICAL INFERENCE
The sum of squares error is defined as the sum of the squared residuals, i.e.,
$$\mathrm{SSE} = \hat\varepsilon_1^2 + \cdots + \hat\varepsilon_n^2 = \sum_{h=1}^{n} \hat\varepsilon_h^2 .$$
Finally, our estimate of the variance of an observation, σ², is the mean squared error defined by
$$\mathrm{MSE} = \frac{\mathrm{SSE}}{\mathrm{dfE}}.$$
Two models are considered to be equivalent if they give exactly the same fitted values for any
set of observations. In such cases the number of functionally distinct mean parameters will be the
same, as will the residuals, SSE, dfE, and MSE.
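As a minimal computational sketch of these definitions, the residuals, SSE, dfE, and MSE might be computed as below. The language (Python), the function name error_summary, and the tiny data set are all assumptions made for illustration. The example fits the one-sample model m(x) = μ, so r = 1 and every fitted value is the sample mean.

```python
def error_summary(y, fitted, r):
    """Return (residuals, SSE, dfE, MSE) given observations, fitted values,
    and r, the number of functionally distinct mean parameters."""
    residuals = [yh - fh for yh, fh in zip(y, fitted)]
    sse = sum(e * e for e in residuals)   # sum of squared residuals
    dfe = len(y) - r                      # dfE = n - r
    return residuals, sse, dfe, sse / dfe

# Hypothetical data; the fitted value for every case is the sample mean.
y = [2.0, 4.0, 6.0, 8.0]
ybar = sum(y) / len(y)
residuals, sse, dfe, mse = error_summary(y, [ybar] * len(y), r=1)
print(residuals, sse, dfe, mse)
```

Any equivalent model would reproduce the same fitted values and therefore the same four quantities.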
It is possible to put all of the models we have discussed here in the form of a multiple regression
model by properly selecting or constructing the predictor variables. Such models are called linear
models. Unless otherwise stated, we will assume that all of our measurement data models are linear
models. Linear models are “linear” in the parameters, not the predictor variables x. For example,
polynomial regression models are linear in the regression parameters βj but they are not linear in
the predictor variable x. The models used for count data at the end of the book differ somewhat from
the linear models for continuous measurements but all use similar mean structures m(x) that allow
us to exploit the tools developed in earlier chapters.
A valuable measure of the predictive ability of a model is R², the squared sample correlation
coefficient between the pairs (ŷh , yh ), cf. Section 6.7. Values near 0 indicate little predictive ability
while values near 1 indicate great predictive ability. (Actually, it is possible to get a high R² with
lousy predictions but it is then easy to turn those lousy predictions into very good predictions.) R²
measures predictive ability, not the correctness of the model. Incorrect models can be very good
predictors and have very high R² values while perfect models can be poor predictors and have very low
R² values. Models with more parameters in them tend to have higher values of R² because the larger
models can do a better job of approximating the yh values in the fitted data. Unfortunately, this can
happen when the bigger models actually do a worse job of predicting y values that are outside the
fitted data.
On occasion, to better satisfy the assumptions (3.9.2), we might transform the original data y into
y∗ , for example y∗ = log(y), cf. Section 7.3. If we then fit a model y∗h = m(xh ) + εh and get fitted
values ŷ∗h , these can be back transformed to the original scale giving ŷh , say, ŷh = exp(ŷ∗h ). R² values
computed in this way between (ŷh , yh ) are comparable regardless of any transformations involved.
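The following sketch, under stated assumptions, computes R² as the squared sample correlation between fitted and observed values after back transforming from the log scale. The function name r_squared and all of the numbers are hypothetical. It also illustrates the parenthetical remark above: multiplying every fitted value by 10 gives lousy predictions but leaves R² unchanged.

```python
import math

def r_squared(fitted, observed):
    """Squared sample correlation between the pairs (fitted, observed)."""
    n = len(observed)
    mean_f = sum(fitted) / n
    mean_o = sum(observed) / n
    s_fo = sum((f - mean_f) * (o - mean_o) for f, o in zip(fitted, observed))
    s_ff = sum((f - mean_f) ** 2 for f in fitted)
    s_oo = sum((o - mean_o) ** 2 for o in observed)
    return s_fo ** 2 / (s_ff * s_oo)

# Hypothetical observations and hypothetical fitted values on the log scale.
y = [1.2, 2.9, 7.1, 21.5]
log_fitted = [0.1, 1.0, 2.0, 3.0]
fitted = [math.exp(v) for v in log_fitted]   # back transform to the original scale

r2 = r_squared(fitted, y)
r2_scaled = r_squared([10 * f for f in fitted], y)  # badly scaled predictions
print(r2, r2_scaled)
```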
and
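$$\mathrm{Est} - t\left(1 - \frac{\alpha}{2}, df\right)\mathrm{SE}(\mathrm{Est}) < \mathrm{Par} < \mathrm{Est} + t\left(1 - \frac{\alpha}{2}, df\right)\mathrm{SE}(\mathrm{Est}).$$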
We do this by establishing a series of equivalences. The justifications for the equivalences are given
at the end:
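$$-t\left(1 - \frac{\alpha}{2}, df\right) < \frac{\mathrm{Est} - \mathrm{Par}}{\mathrm{SE}(\mathrm{Est})} < t\left(1 - \frac{\alpha}{2}, df\right) \tag{1}$$
if and only if
$$-t\left(1 - \frac{\alpha}{2}, df\right)\mathrm{SE}(\mathrm{Est}) < \mathrm{Est} - \mathrm{Par} < t\left(1 - \frac{\alpha}{2}, df\right)\mathrm{SE}(\mathrm{Est}) \tag{2}$$
if and only if
$$t\left(1 - \frac{\alpha}{2}, df\right)\mathrm{SE}(\mathrm{Est}) > -\mathrm{Est} + \mathrm{Par} > -t\left(1 - \frac{\alpha}{2}, df\right)\mathrm{SE}(\mathrm{Est}) \tag{3}$$
if and only if
$$\mathrm{Est} + t\left(1 - \frac{\alpha}{2}, df\right)\mathrm{SE}(\mathrm{Est}) > \mathrm{Par} > \mathrm{Est} - t\left(1 - \frac{\alpha}{2}, df\right)\mathrm{SE}(\mathrm{Est}) \tag{4}$$
if and only if
$$\mathrm{Est} - t\left(1 - \frac{\alpha}{2}, df\right)\mathrm{SE}(\mathrm{Est}) < \mathrm{Par} < \mathrm{Est} + t\left(1 - \frac{\alpha}{2}, df\right)\mathrm{SE}(\mathrm{Est}). \tag{5}$$
JUSTIFICATION OF STEPS.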
For (1) iff (2): if c > 0, then a < b if and only if ac < bc.
For (2) iff (3): a < b if and only if −a > −b.
For (3) iff (4): a < b if and only if a + c < b + c.
For (4) iff (5): a > b if and only if b < a.
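The chain of equivalences can also be checked numerically. The sketch below is only an illustration: the function names and the values of Est, SE(Est), and the t percentile are hypothetical. It confirms, for several candidate parameter values, that statement (1) holds exactly when the parameter lies inside the interval in (5).

```python
def t_interval(est, se, t_quantile):
    """Endpoints Est -/+ t(1 - alpha/2, df) * SE(Est), as in statement (5)."""
    half_width = t_quantile * se
    return est - half_width, est + half_width

def statistic_in_range(est, par, se, t_quantile):
    """Statement (1): the standardized statistic lies strictly between -t and t."""
    return -t_quantile < (est - par) / se < t_quantile

# Hypothetical estimate, standard error, and tabled t percentile.
est, se, t_quantile = 5.0, 1.5, 2.0
lower, upper = t_interval(est, se, t_quantile)

for par in [1.0, 2.5, 5.0, 7.9, 9.0]:
    # (1) is true exactly when par is inside the interval (5).
    assert statistic_in_range(est, par, se, t_quantile) == (lower < par < upper)
print(lower, upper)
```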
3.10 Exercises
EXERCISE 3.10.1. Identify the parameter, estimate, standard error of the estimate, and reference
distribution for Exercise 2.8.1.
EXERCISE 3.10.2. Identify the parameter, estimate, standard error of the estimate, and reference
distribution for Exercise 2.8.2.
EXERCISE 3.10.3. Identify the parameter, estimate, standard error of the estimate, and reference
distribution for Exercise 2.8.4.
EXERCISE 3.10.4. Consider that I am collecting (normally distributed) data with a variance of
4 and I want to test a null hypothesis of H0 : μ = 10. What sample size should I take according to
Cox’s rule if I want a reasonable chance of rejecting H0 when μ = 13? What if I want a reasonable
chance of rejecting H0 when μ = 12? What sample size should I take if I want a 95% confidence
interval that is no more than 2 units long? What if I want a 99% confidence interval that is no more
than 2 units long?
EXERCISE 3.10.5. The turtle shell data of Jolicoeur and Mosimann (1960) given in Exercise 2.7.4
has a standard deviation of about 21.25. If we were to collect a new sample, how large should the
sample size be in order to have a 95% confidence interval with a length of (about) four units?
According to Cox’s rule, what sample size should I take if I want a reasonable chance of rejecting
H0 : μ = 130 when μ = 140?
EXERCISE 3.10.6. With reference to Exercise 2.8.3, give the approximate number of observations
necessary to estimate the mean of BX to within 0.01 units with 99% confidence. How large a
sample is needed to get a reasonable test of H0 : μ = 10 when μ = 11 using Cox’s rule?
EXERCISE 3.10.7. With reference to Exercise 2.8.3, give the approximate number of observations
necessary to get a 99% confidence interval for the mean of K that has a length of 60. How large a
sample is needed to get a reasonable test of H0 : μ = 1200 when μ = 1190 using Cox’s rule? What
is the number when μ = 1150?
EXERCISE 3.10.8. With reference to Exercise 2.8.3, give the approximate number of observations
necessary to estimate the mean of FORM to within 0.5 units with 95% confidence. How large
a sample is needed to get a reasonable test of H0 : μ = 20 when μ = 20.2 using Cox’s rule?
EXERCISE 3.10.9. With reference to Exercise 2.8.2, give the approximate number of observations
necessary to estimate the mean rat weight to within 1 unit with 95% confidence. How large a
sample is needed to get a reasonable test of H0 : μ = 55 when μ = 54 using Cox’s rule?
EXERCISE 3.10.10. Suppose we have three random variables y, y1 , and y2 and let α be a number
between 0 and 1. Show that if y = αy1 + (1 − α)y2 and if E(y) = E(y2 ) = θ then E(y1 ) = θ .
EXERCISE 3.10.11. Given that y1 , . . . , yn are independent with E(yi ) = μi and σ² = Var(yi ) =
E[yi − μi ]², give intuitive justifications for why both $\hat\sigma^2 \equiv \sum_{i=1}^{n} (y_i - \hat y_i)^2 / n$ and
$\mathrm{MSE} \equiv \sum_{i=1}^{n} (y_i - \hat y_i)^2 / \mathrm{dfE}$ are reasonable estimates of σ². Recall that ŷi is an estimate of μi .
Chapter 4
Two Samples
In this chapter we consider several situations where it is of interest to compare two samples. First
we consider two samples of correlated data. These are data that consist of pairs of observations
measuring comparable quantities. Next we consider two independent samples from populations with
the same variance. This data form is generalized to several independent samples with a common
variance in Chapter 12, a problem that is known as analysis of variance or more commonly as
ANOVA. We then examine two independent samples from populations with different variances.
Finally we consider the problem of testing whether the variances of two populations are equal.
EXAMPLE 4.1.1. Shewhart (1931, p. 324) presents data on the hardness of an item produced by
welding two parts together. Table 4.1 gives the hardness measurements for each of the two parts.
The hardness of part 1 is denoted y1 and the hardness of part 2 is denoted y2 . For i = 1, 2, the data
for part i are denoted yij , j = 1, . . . , 27. These data are actually a subset of the data presented by
Shewhart.
We are interested in the difference between μ1 , the population mean for part one, and μ2 , the
population mean for part two. In other words, the parameter of interest is Par = μ1 − μ2 . Note that
if there is no difference between the population means, μ1 − μ2 = 0. The natural estimate of this
parameter is the difference between the sample means, i.e., Est = ȳ1· − ȳ2· . Here we use the bar
and the dot (·) in place of the second subscript to indicate averaging over the second subscript, i.e.,
ȳi· = (yi1 + · · · + yi27)/27.
[Figure 4.1: Dot plot of the differences d, with a horizontal axis running from 7.0 to 24.5.]
To perform parametric statistical inferences, we need the standard error of the estimate, i.e.,
SE (ȳ1· − ȳ2· ). As indicated earlier, finding an appropriate standard error is often the most difficult
aspect of statistical inference. In problems such as this, where the data are paired, finding the stan-
dard error is complicated by the fact that the two observations in each pair are not independent. In
data such as these, different pairs are often independent but observations within a pair are not.
In paired comparisons, we use a trick to reduce the problem to consideration of only one sample.
It is a simple algebraic fact that the difference of the sample means, ȳ1· − ȳ2· is the same as the
sample mean of the differences dj = y1j − y2j , i.e., d̄ = ȳ1· − ȳ2· . Thus d̄ is an estimate of the
parameter of interest μ1 − μ2 . The differences are given in Table 4.1 along with the data. Summary
statistics are listed below for each variable and the differences. Note that for the hardness data,
d̄ = 12.663 = 47.552 − 34.889 = ȳ1· − ȳ2· . In particular, if the positive value for d̄ means anything
(other than random variation), it indicates that part one is harder than part two.
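The algebraic fact used here can be written out in one line. The following short derivation uses only the definitions above, with 27 pairs as in the hardness data.

```latex
\bar d = \frac{1}{27}\sum_{j=1}^{27} d_j
       = \frac{1}{27}\sum_{j=1}^{27} (y_{1j} - y_{2j})
       = \frac{1}{27}\sum_{j=1}^{27} y_{1j} - \frac{1}{27}\sum_{j=1}^{27} y_{2j}
       = \bar y_{1\cdot} - \bar y_{2\cdot} .
```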
Sample statistics
Variable       Ni    Mean      Variance    Std. dev.
y1             27    47.552     6.79028    2.606
y2             27    34.889    26.51641    5.149
d = y1 − y2    27    12.663    17.77165    4.216
Given that d̄ is an estimate of μ1 − μ2 , we can base the entire analysis on the differences. The
differences constitute a single sample of data, so the standard error of d̄ is simply the usual
one-sample standard error,
$$\mathrm{SE}(\bar d) = s_d \big/ \sqrt{27},$$
where sd is the sample standard deviation as computed from the 27 differences. The differences are
plotted in Figure 4.1. Note that there is one potential outlier. We leave it as an exercise to reanalyze
the data with the possible outlier removed.
We now have Par, Est, and SE(Est); it remains to find the appropriate distribution. Figure 4.2
gives a normal plot for the differences. While there is an upward curve at the top due to the possible
outlier, the curve is otherwise reasonably straight. The Wilk–Francia statistic of W ′ = 0.955 is above
the fifth percentile of the null distribution. With normal data we use the reference distribution
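$$\frac{\bar d - (\mu_1 - \mu_2)}{s_d \big/ \sqrt{27}} \sim t(27 - 1).$$
With SE(d̄) = 4.216/√27 = 0.811 and t(0.995, 26) = 2.78, a 99% confidence interval for μ1 − μ2 has endpoints
12.663 ± 2.78(0.811),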
which gives an interval of, roughly, (10.41, 14.92). Based on a .01 level test, the data and the model
are consistent with the population mean hardness for part 1 being between 10.41 and 14.92 units
harder than that for part 2.
4.1 TWO CORRELATED SAMPLES: PAIRED COMPARISONS 89
[Figure 4.2: Normal plot of the hardness differences against theoretical quantiles.]
We can also get a 99% prediction interval for the difference in hardness to be observed on a new
welded piece. The prediction interval has endpoints of
$$12.663 \pm 2.78\sqrt{4.216^2 + 0.811^2}.$$
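The arithmetic for the paired comparison can be sketched as follows. This is an illustration only: it uses the summary statistics for the differences given above, takes 2.78 as the tabled t percentile quoted in the text rather than computing it, and the variable names are our own.

```python
import math

n = 27            # number of pairs
d_bar = 12.663    # sample mean of the differences
s_d = 4.216       # sample standard deviation of the differences
t_995_26 = 2.78   # tabled value t(0.995, 26), as quoted in the text

se = s_d / math.sqrt(n)                       # one-sample standard error
ci = (d_bar - t_995_26 * se, d_bar + t_995_26 * se)

# Prediction standard error for a new difference: sqrt(s_d^2 + SE^2)
pred_se = math.sqrt(s_d ** 2 + se ** 2)
pi = (d_bar - t_995_26 * pred_se, d_bar + t_995_26 * pred_se)

print(round(se, 3), [round(v, 2) for v in ci], [round(v, 2) for v in pi])
```

The prediction interval is necessarily wider than the confidence interval because it adds the variability of a single new observation to the uncertainty in d̄.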
The trick of looking at differences between pairs is necessary because the two observations in a
pair are not independent. While different pairs of welded parts are assumed to behave independently,
it seems unreasonable to assume that two hardness measurements on a single item that has been
welded together would behave independently. This lack of independence makes it difficult to find a
standard error for comparing the sample means unless we look at the differences. In the remainder
of this chapter, we consider two-sample problems in which all of the observations are assumed to
be independent. The observations in each sample are independent of each other and independent of
all the observations in the other sample. Paired comparison problems almost fit those assumptions
but they break down at one key point. In a paired comparison, we assume that every observation is
independent of the other observations in the same sample and that each observation is independent
of all the observations in the other sample except for the observation in the other sample that it is
paired with. When analyzing two samples, if we can find any reason to identify individuals as being
part of a pair, that fact is sufficient to make us treat the data as a paired comparison.
Since paired comparisons reduce to one-sample procedures, the model-based procedures of
Chapter 2 apply.
The method of paired comparisons is also the name of a totally different statistical procedure.
Suppose one wishes to compare five brands of chocolate chip cookies: A, B, C, D, E. It would be
difficult to taste all five and order them appropriately. As an alternative, one can taste test pairs of
cookies, e.g., (A, B), (A,C), (A, D), (A, E), (B,C), (B, D), etc. and identify the better of the two. The
benefit of this procedure is that it is much easier to rate two cookies than to rate five. See David
(1988) for a survey and discussion of procedures developed to analyze such data.
EXAMPLE 4.2.1. The data in Table 4.2 are final point totals for an introductory Statistics class.
The data are divided by the sex of the student. We investigate whether the data display sex differ-
ences. The data are plotted in Figure 4.3. Figures 4.4 and 4.5 contain normal plots for the two sets
of data. Figure 4.4 is quite straight but Figure 4.5 looks curved. Our analysis is not particularly
sensitive to nonnormality and the W ′ statistic for Figure 4.5 is 0.937, which is well above the fifth
percentile, so we proceed under the assumption that both samples are normal. We also assume that
all of the observations are independent. This assumption may be questionable because some students
probably studied together; nonetheless, independence seems like a reasonable working assumption.
✷
The methods in this section rely on the assumption that the two populations are normally dis-
tributed and have the same variance. In particular, we assume two independent samples
Sample    Data                         Distribution
1         y11 , y12 , . . . , y1N1     iid N(μ1 , σ²)
2         y21 , y22 , . . . , y2N2     iid N(μ2 , σ²)
4.2 TWO INDEPENDENT SAMPLES WITH EQUAL VARIANCES 91
[Figure 4.3: Dot plots of final point totals for females and males.]
[Figures 4.4 and 4.5: Normal plots for the two sets of final point totals, plotted against theoretical quantiles.]
and compute summary statistics from the samples. The summary statistics are just the sample mean
and the sample variance for each individual sample.
Sample statistics
Sample    Size    Mean    Variance
1         N1      ȳ1·     s₁²
2         N2      ȳ2·     s₂²
Except for checking the validity of our assumptions, these summary statistics are more than suffi-
cient for the entire analysis. Algebraically, the sample mean for population i, i = 1, 2, is
$$\bar y_{i\cdot} \equiv \frac{1}{N_i}\sum_{j=1}^{N_i} y_{ij} = \frac{1}{N_i}\left[y_{i1} + y_{i2} + \cdots + y_{iN_i}\right]$$
where the · in ȳi· indicates that the mean is obtained by averaging over j, the second subscript in the
yij's. The sample means, ȳ1· and ȳ2· , are estimates of μ1 and μ2 .
The sample variance for population i, i = 1, 2, is
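$$s_i^2 = \frac{1}{N_i - 1}\sum_{j=1}^{N_i} (y_{ij} - \bar y_{i\cdot})^2
= \frac{1}{N_i - 1}\left[(y_{i1} - \bar y_{i\cdot})^2 + (y_{i2} - \bar y_{i\cdot})^2 + \cdots + (y_{iN_i} - \bar y_{i\cdot})^2\right].$$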
The sᵢ²'s both estimate σ². Combining the sᵢ²'s can yield a better estimate of σ² than either individual
estimate. We form a pooled estimate of the variance, say sₚ², by averaging s₁² and s₂². With unequal
sample sizes an efficient pooled estimate of σ² must be a weighted average of the sᵢ²'s. Obviously,
if we have N1 = 100,000 observations in the first sample and only N2 = 10 observations in the
second sample, the variance estimate s₁² is much better than s₂² and we want to give it more weight.
The weights are the degrees of freedom associated with the estimates. The pooled estimate of the
variance is
$$\begin{aligned}
s_p^2 &\equiv \frac{(N_1 - 1)s_1^2 + (N_2 - 1)s_2^2}{(N_1 - 1) + (N_2 - 1)} \\
&= \frac{1}{N_1 + N_2 - 2}\left[\sum_{j=1}^{N_1} (y_{1j} - \bar y_{1\cdot})^2 + \sum_{j=1}^{N_2} (y_{2j} - \bar y_{2\cdot})^2\right] \\
&= \frac{1}{N_1 + N_2 - 2}\sum_{i=1}^{2}\sum_{j=1}^{N_i} (y_{ij} - \bar y_{i\cdot})^2 .
\end{aligned}$$
The degrees of freedom for sₚ² are N1 + N2 − 2 = (N1 − 1) + (N2 − 1), i.e., the sum of the degrees
of freedom for the individual estimates sᵢ².
EXAMPLE 4.2.2. For the data on final point totals, the sample statistics follow.
Sample Statistics
Sample     Ni    ȳi·           sᵢ²            si
females    22    127.954545    487.2835498    22.07
males      15    139.000000    979.2857143    31.29
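As a hedged illustration of the pooled estimate, the sketch below weights each sample variance by its degrees of freedom. The function name pooled_variance is hypothetical; the inputs are the summary statistics for the final point totals listed above.

```python
def pooled_variance(n1, s1_sq, n2, s2_sq):
    """Weighted average of two sample variances, weights = degrees of freedom.
    Returns (pooled variance, pooled degrees of freedom)."""
    df1, df2 = n1 - 1, n2 - 1
    return (df1 * s1_sq + df2 * s2_sq) / (df1 + df2), df1 + df2

# Summary statistics for the final point totals (females, then males).
sp_sq, df_pooled = pooled_variance(22, 487.2835498, 15, 979.2857143)
print(round(sp_sq, 2), df_pooled)
```

With equal sample sizes the weights are equal and the pooled estimate reduces to the simple average of the two sample variances.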
We are now in a position to draw statistical inferences about the μi s. The main problem in
obtaining tests and confidence intervals is in finding appropriate standard errors. The crucial fact is
that the samples are independent so that the ȳi· s are independent.
For inferences about the difference between the two means, say, μ1 − μ2 , use the general proce-
dure of Chapter 3 with
Par = μ1 − μ2
and
Est = ȳ1· − ȳ2· .
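Note that ȳ1· − ȳ2· is unbiased for estimating μ1 − μ2 because
$$E(\bar y_{1\cdot} - \bar y_{2\cdot}) = E(\bar y_{1\cdot}) - E(\bar y_{2\cdot}) = \mu_1 - \mu_2 .$$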
The two means are independent, so the variance of ȳ1· − ȳ2· is the variance of ȳ1· plus the variance
of ȳ2· , i.e.,
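$$\mathrm{Var}(\bar y_{1\cdot} - \bar y_{2\cdot}) = \mathrm{Var}(\bar y_{1\cdot}) + \mathrm{Var}(\bar y_{2\cdot}) = \frac{\sigma^2}{N_1} + \frac{\sigma^2}{N_2} = \sigma^2\left[\frac{1}{N_1} + \frac{1}{N_2}\right].$$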
The standard error of ȳ1· − ȳ2· is the estimated standard deviation of ȳ1· − ȳ2·,
$$\mathrm{SE}(\bar y_{1\cdot} - \bar y_{2\cdot}) = \sqrt{s_p^2\left[\frac{1}{N_1} + \frac{1}{N_2}\right]}.$$
Under our assumption that the original data are normal, the reference distribution is
$$\frac{(\bar y_{1\cdot} - \bar y_{2\cdot}) - (\mu_1 - \mu_2)}{\sqrt{s_p^2\left[\frac{1}{N_1} + \frac{1}{N_2}\right]}} \sim t(N_1 + N_2 - 2).$$
The degrees of freedom for the t distribution are the degrees of freedom for sₚ².
Having identified the parameter, estimate, standard error, and distribution, inferences follow the
usual pattern. A 95% confidence interval for μ1 − μ2 is
$$(\bar y_{1\cdot} - \bar y_{2\cdot}) \pm t(0.975, N_1 + N_2 - 2)\sqrt{s_p^2\left[\frac{1}{N_1} + \frac{1}{N_2}\right]}.$$
A test of hypothesis that the means are equal, H0 : μ1 = μ2 , can be converted into the equivalent
hypothesis involving Par = μ1 − μ2 , namely H0 : μ1 − μ2 = 0. The test is handled in the usual way.
An α = 0.01 test rejects H0 if
$$\frac{|(\bar y_{1\cdot} - \bar y_{2\cdot}) - 0|}{\sqrt{s_p^2\left[\frac{1}{N_1} + \frac{1}{N_2}\right]}} > t(0.995, N_1 + N_2 - 2).$$
As discussed in Chapter 3, what we are really doing is testing the validity of the null model that
incorporates all of the assumptions, including the assumption of the null hypothesis H0 : μ1 − μ2 = 0.
Fisher (1925) quite rightly argues that the appropriate test is often a test of whether the two samples
come from the same normal population, rather than a test of whether the means are equal given that
the variances are (or are not) equal.
In our discussion of comparing differences, we have defined the parameter as μ1 − μ2 . We could
just as well have defined the parameter as μ2 − μ1 . This would have given an entirely equivalent
analysis.
Inferences about a single mean, say, μ2 , use the general procedures with Par = μ2 and Est = ȳ2· .
The variance of ȳ2· is σ²/N2 , so SE(ȳ2· ) = √(sₚ²/N2 ). Note the use of sₚ² rather than s₂². The reference
distribution is [ȳ2· − μ2 ]/SE(ȳ2· ) ∼ t(N1 + N2 − 2). A 95% confidence interval for μ2 is
$$\bar y_{2\cdot} \pm t(0.975, N_1 + N_2 - 2)\sqrt{s_p^2 / N_2}.$$
An α = 0.01 test of H0 : μ2 = 5 rejects H0 if
$$\frac{|\bar y_{2\cdot} - 5|}{\sqrt{s_p^2 / N_2}} > t(0.995, N_1 + N_2 - 2).$$
EXAMPLE 4.2.3. For comparing females and males on final point totals, the parameter of interest
is
Par = μ1 − μ2
where μ1 indicates the population mean final point total for females and μ2 indicates the population
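mean final point total for males. The estimate of the parameter is
$$\mathrm{Est} = \bar y_{1\cdot} - \bar y_{2\cdot} = 127.95 - 139.00 = -11.05 .$$
The pooled estimate of the variance is sₚ² = 684.08, so the standard error is
$$\mathrm{SE}(\bar y_{1\cdot} - \bar y_{2\cdot}) = \sqrt{s_p^2\left[\frac{1}{N_1} + \frac{1}{N_2}\right]} = \sqrt{684.08\left[\frac{1}{22} + \frac{1}{15}\right]} = 8.7578 .$$
The data have reasonably normal distributions and the variances are not too different (more on this
later), so the reference distribution is taken as
$$\frac{(\bar y_{1\cdot} - \bar y_{2\cdot}) - (\mu_1 - \mu_2)}{\sqrt{s_p^2\left[\frac{1}{22} + \frac{1}{15}\right]}} \sim t(35),$$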
where 35 = N1 + N2 − 2. The tabled value for finding 95% confidence intervals and α = 0.05 tests
is
t(0.975, 35) = 2.030 .
A 95% confidence interval for μ1 − μ2 has endpoints
−11.05 ± (2.030)8.7578
which yields an interval (−28.8, 6.7). Population mean scores between, roughly, 29 points less for
females and 7 points more for females are consistent with the data and the model based on a 0.05
test.
An α = 0.05 test of H0 : μ1 − μ2 = 0 is not rejected because 0, the hypothesized value of
μ1 − μ2 , is contained in the 95% confidence interval for μ1 − μ2 . The P value for the test is based
on the observed value of the test statistic
$$t_{obs} = \frac{(\bar y_{1\cdot} - \bar y_{2\cdot}) - 0}{\sqrt{s_p^2\left[\frac{1}{22} + \frac{1}{15}\right]}} = \frac{-11.05 - 0}{8.7578} = -1.26 .$$
The probability of obtaining an observation from a t(35) distribution that is as extreme or more
extreme than |−1.26| is 0.216. There is very little evidence that the population mean final point
total for females is different (smaller) than the population mean final point total for males. The P
value is greater than 0.2, so, as we established earlier, neither an α = 0.05 nor an α = 0.01 test is
rejected. If we were silly enough to do an α = 0.25 test, we would then reject the null hypothesis.
A 95% confidence interval for μ1 , the mean of the females, has endpoints
$$127.95 \pm (2.030)\sqrt{684.08/22},$$
which gives the interval (116.6, 139.3). A mean final point total for females between 117 and 139 is
consistent with the data and the model. A 95% prediction interval for a new observation on a female
has endpoints
$$127.95 \pm (2.030)\sqrt{684.08 + \frac{684.08}{22}},$$
which gives the interval (73.7, 182.2). A new observation on a female between 74 and 182 is con-
sistent with the data and the model. This assumes that the new observation is randomly sampled
from the same population as the previous data.
A test of the assumption of equal variances is left for the final section but we will see in the next
section that the results for these data do not depend substantially on the equality of the variances.
✷
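The calculations of Example 4.2.3 can be retraced with a short sketch. It is an illustration under stated assumptions: the variable names are ours, and the percentile 2.030 is the tabled value t(0.975, 35) quoted in the example rather than something computed here.

```python
import math

# Summary statistics for the final point totals (sample 1 = females, 2 = males).
n1, ybar1, s1_sq = 22, 127.954545, 487.2835498
n2, ybar2, s2_sq = 15, 139.000000, 979.2857143
t_975_35 = 2.030   # tabled value t(0.975, 35) from the text

df = n1 + n2 - 2
sp_sq = ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / df   # pooled variance
est = ybar1 - ybar2                                   # Est = ybar1 - ybar2
se = math.sqrt(sp_sq * (1 / n1 + 1 / n2))             # SE(Est)

t_obs = (est - 0) / se                                # test of H0: mu1 - mu2 = 0
ci = (est - t_975_35 * se, est + t_975_35 * se)       # 95% confidence interval
reject_05 = abs(t_obs) > t_975_35

print(df, round(se, 3), round(t_obs, 2), [round(v, 1) for v in ci], reject_05)
```

Small differences in the last digit relative to the text can arise because the text rounds sₚ² and the estimate before computing.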
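$$\mathrm{SSE(Full)} = \sum_{i=1}^{2}\sum_{j=1}^{N_i} (y_{ij} - \bar y_{i\cdot})^2 ,$$
and
$$\mathrm{SSE(Red.)} = \sum_{i=1}^{2}\sum_{j=1}^{N_i} (y_{ij} - \bar y_{\cdot\cdot})^2 ,$$
where
$$\bar y_{i\cdot} = \frac{1}{N_i}\sum_{j=1}^{N_i} y_{ij} \quad\text{and}\quad \bar y_{\cdot\cdot} = \frac{1}{N_1 + N_2}\sum_{i=1}^{2}\sum_{j=1}^{N_i} y_{ij} .$$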
From the last relationship, it is easy to see that the model-based F statistic is simply the square of
the parameter-based t statistic.
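A small numerical check of this relationship is sketched below. The two samples are hypothetical, and the form of the F statistic used, [SSE(Red.) − SSE(Full)] divided by the difference in error degrees of freedom and then by MSE(Full), is the standard one and is assumed here rather than quoted from the text.

```python
import math

def sse(sample, center):
    """Sum of squared deviations of a sample about a given center."""
    return sum((y - center) ** 2 for y in sample)

# Hypothetical data for two independent samples.
sample1 = [3.0, 5.0, 7.0]
sample2 = [6.0, 8.0, 10.0, 12.0]
n1, n2 = len(sample1), len(sample2)
ybar1 = sum(sample1) / n1
ybar2 = sum(sample2) / n2
ybar_all = (sum(sample1) + sum(sample2)) / (n1 + n2)

sse_full = sse(sample1, ybar1) + sse(sample2, ybar2)        # separate means
sse_red = sse(sample1, ybar_all) + sse(sample2, ybar_all)   # one common mean
mse_full = sse_full / (n1 + n2 - 2)                         # equals the pooled variance

# The full model has one more mean parameter than the reduced model.
f_stat = ((sse_red - sse_full) / 1) / mse_full
t_stat = (ybar1 - ybar2) / math.sqrt(mse_full * (1 / n1 + 1 / n2))

print(f_stat, t_stat ** 2)
```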
EXAMPLE 4.3.1. Jolicoeur and Mosimann (1960) present data on the sizes of turtle shells (carapaces).
Table 4.3 presents data on the shell heights for 24 females and 24 males. These data are not
paired; it is simply a caprice that 24 carapaces were measured for each sex. Our interest centers on
estimating the population means for female and male heights, estimating the difference between the
heights, and testing whether the difference is zero.
4.3 TWO INDEPENDENT SAMPLES WITH UNEQUAL VARIANCES 97
[Figure: Plot of the N(89, 4) and N(90, 16) distributions; horizontal axis: octane ratings.]
Following Christensen (2001) and others, we take natural logarithms of the data, i.e.,
(All logarithms in this book are natural logarithms.) The log data are plotted in Figure 4.7. The
female heights give the impression of being both larger and more spread out. Figures 4.8 and 4.9
contain normal plots for the females and males, respectively. Neither is exceptionally straight but
they do not seem too bad. Summary statistics follow; they are consistent with the visual impressions
given by Figure 4.7. The summary statistics will be used later to illustrate our statistical inferences.
[Figure 4.8: Normal plot for female turtle shell log heights.]
[Figure 4.9: Normal plot for male turtle shell log heights.]
Sample    Size    Mean    Variance
1         N1      ȳ1·     s₁²
2         N2      ȳ2·     s₂²
Again, the sample means, ȳ1· and ȳ2· , are estimates of μ1 and μ2 , but now s₁² and s₂² estimate σ₁²
and σ₂². We have two different variances, so it is inappropriate to pool the variance estimates. Once
again, the crucial fact in obtaining a standard error is that the samples are independent.
For inferences about the difference between the two means, say, μ1 − μ2 , again use the general
procedure with
Par = μ1 − μ2
and
Est = ȳ1· − ȳ2· .
Just as before, ȳ1· − ȳ2· is unbiased for estimating μ1 − μ2 . The two sample means are independent,
so
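$$\mathrm{Var}(\bar y_{1\cdot} - \bar y_{2\cdot}) = \mathrm{Var}(\bar y_{1\cdot}) + \mathrm{Var}(\bar y_{2\cdot}) = \frac{\sigma_1^2}{N_1} + \frac{\sigma_2^2}{N_2} .$$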
The standard error of ȳ1· − ȳ2· is
$$\mathrm{SE}(\bar y_{1\cdot} - \bar y_{2\cdot}) = \sqrt{\frac{s_1^2}{N_1} + \frac{s_2^2}{N_2}} .$$
Even when the original data are normal, the appropriate reference distribution is not a t distribution.
As a matter of fact, the appropriate reference distribution is not known. However, a good approxi-
mate distribution is
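$$\frac{(\bar y_{1\cdot} - \bar y_{2\cdot}) - (\mu_1 - \mu_2)}{\sqrt{s_1^2/N_1 + s_2^2/N_2}} \sim t(\nu)$$
where
$$\nu \equiv \frac{\left(s_1^2/N_1 + s_2^2/N_2\right)^2}{\left(s_1^2/N_1\right)^2/(N_1 - 1) + \left(s_2^2/N_2\right)^2/(N_2 - 1)} \tag{4.3.1}$$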
is an approximate number of degrees of freedom. This approximate distribution was proposed by
Satterthwaite (1946) and was discussed by Snedecor and Cochran (1980). Having identified the
parameter, estimate, standard error, and reference distribution, inferences follow the usual pattern.
Using s₁²/N1 = 0.02493979/24 = 0.001039158 and s₂²/N2 = 0.00677276/24 = 0.000282198 in
Equation (4.3.1), the approximate degrees of freedom are
$$\nu = \frac{(0.001039158 + 0.000282198)^2}{(0.001039158)^2/23 + (0.000282198)^2/23} = 34.6 .$$
An α = 0.01 test is rejected if the observed value of the test statistic is farther from zero than the
cutoff value t(0.995, 34.6) ≐ t(0.995, 35) = 2.72. The observed value of the test statistic is
$$t_{obs} = \frac{0.2371 - 0}{0.03635} = 6.523,$$
which is greater than the cutoff value, so the test is rejected. There is evidence at the 0.01 level
that the mean shell height for females is different from the mean shell height for males. Obviously,
since ȳ1· − ȳ2· = 0.2371 is positive, there is evidence that the females have shells of greater height.
As always, such a conclusion relies on the other assumptions being true. With these sample sizes,
the main thing that could invalidate the conclusion would be a lack of independence among the
observations. Actually, the conclusion is that the means of the log(heights) are different, but if these
are different we conclude that the mean heights are different, or more particularly that the median
heights are different.
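The approximate degrees of freedom in Equation (4.3.1) and the resulting test statistic can be sketched as follows. The function name satterthwaite_df is hypothetical; the inputs are the sample variances of the log heights and the difference in sample means, 0.2371, quoted in the text.

```python
import math

def satterthwaite_df(s1_sq, n1, s2_sq, n2):
    """Approximate degrees of freedom from Equation (4.3.1)."""
    v1, v2 = s1_sq / n1, s2_sq / n2
    return (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))

# Log turtle shell heights: sample variances and sizes for females and males.
s1_sq, n1 = 0.02493979, 24
s2_sq, n2 = 0.00677276, 24

nu = satterthwaite_df(s1_sq, n1, s2_sq, n2)
se = math.sqrt(s1_sq / n1 + s2_sq / n2)   # unequal-variance standard error
t_obs = (0.2371 - 0) / se                 # estimate 0.2371 taken from the text

print(round(nu, 1), round(se, 5), round(t_obs, 2))
```

Because ν is generally not an integer, tabled percentiles are looked up at the nearest integer degrees of freedom, as the text does with 35.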
The 95% confidence interval for the difference between mean log shell heights for females and
males, i.e., μ1 − μ2 , uses t(0.975, 34.6) ≐ t(0.975, 35) = 2.03. The endpoints are
0.2371 ± 2.03 (0.03635),
and the interval is (0.163, 0.311). We took logs of the data, so for transformed data that are normal
(or any other symmetric distribution), exp(μi ) is the median height for group i even though exp(μi ) is not
the mean height for group i. Thus, exp(μ1 − μ2 ) is the ratio of the median of the female heights to the
median of the male heights. If we transform back to the original scale the interval is (exp(0.163), exp(0.311))
or (1.18, 1.36). The data are consistent with the population median for females being, roughly,
between one and a sixth and one and a third times the median shell heights for males. Note that a
difference between 0.163 and 0.311 on the log scale transforms into a multiplicative effect between
1.18 and 1.36 on the original scale. This idea is discussed in more detail in Example 12.1.1.
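A brief sketch of the back transformation follows. It assumes the estimate, standard error, and tabled value quoted above (0.2371, 0.03635, and 2.03); the variable names are ours.

```python
import math

est, se, t_975_35 = 0.2371, 0.03635, 2.03   # values quoted in the text

# 95% interval for mu1 - mu2 on the log scale.
ci_log = (est - t_975_35 * se, est + t_975_35 * se)

# Exponentiating the endpoints gives an interval for the ratio of medians.
ci_ratio = tuple(math.exp(v) for v in ci_log)

print([round(v, 3) for v in ci_log], [round(v, 2) for v in ci_ratio])
```

An additive difference on the log scale becomes a multiplicative factor on the original scale because exp(a − b) = exp(a)/exp(b).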
It is inappropriate to pool the variance estimates, so inferences about μ1 and μ2 are performed
just as for one sample. The 95% confidence interval for the mean log shell height for females, μ1 ,
uses the estimate ȳ1· , the standard error s1 /√24, and the tabled value t(0.975, 24 − 1) = 2.069. It
has endpoints
$$3.9403 \pm 2.069\left(0.1579 \big/ \sqrt{24}\right)$$
which gives the interval (3.87, 4.01). Transforming to the original scale gives the interval
(47.9, 55.1). The data are consistent with a median shell height for females between, roughly, 48
and 55 millimeters based on a 0.05 level test. Males also have 24 observations, so the interval for
μ2 also uses t(0.975, 24 − 1), has endpoints
$$3.7032 \pm 2.069\left(0.0823 \big/ \sqrt{24}\right),$$
and an interval (3.67, 3.74). Transforming the interval back to the original scale gives (39.3, 42.1).
The data are consistent with a median shell height for males between, roughly, 39 and 42 millime-
ters. The 95% prediction interval for the transformed shell height of a future male has endpoints
$$3.7032 \pm 2.069\,(0.0823)\sqrt{1 + \frac{1}{24}},$$
which gives the interval (3.529, 3.877). Transforming the prediction interval back to the original
scale gives (34.1, 48.3). Transforming a prediction interval back to the original scale creates no
problems of interpretation. ✷
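The one-sample intervals for the males can be retraced as below. This is a sketch using the mean, standard deviation, and tabled value quoted in the text; the variable names are assumptions.

```python
import math

n = 24
ybar = 3.7032      # mean log shell height for males
s = 0.0823         # standard deviation of log shell heights for males
t_975_23 = 2.069   # tabled value t(0.975, 23) from the text

# 95% confidence interval for mu2 on the log scale.
se = s / math.sqrt(n)
ci_log = (ybar - t_975_23 * se, ybar + t_975_23 * se)

# 95% prediction interval for a future male on the log scale.
pred_se = s * math.sqrt(1 + 1 / n)
pi_log = (ybar - t_975_23 * pred_se, ybar + t_975_23 * pred_se)

# Back transform the prediction interval to the original scale.
pi_original = tuple(math.exp(v) for v in pi_log)

print([round(v, 2) for v in ci_log],
      [round(v, 3) for v in pi_log],
      [round(v, 1) for v in pi_original])
```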
EXAMPLE 4.3.3. Reconsider the final point totals data of Section 4.2. Without the assumption of
equal variances, the standard error is
$$\mathrm{SE}(\bar y_{1\cdot} - \bar y_{2\cdot}) = \sqrt{\frac{487.28}{22} + \frac{979.29}{15}} = 9.3507 .$$
4.4 TESTING EQUALITY OF THE VARIANCES 101
From Equation (4.3.1), the degrees of freedom for the approximate t distribution are 23. A 95%
confidence interval for the difference is (−30.4, 8.3) and the observed value of the statistic for
testing equal means is tobs = −1.18. This gives a P value of 0.22. These values are all quite close to
those obtained using the equal variance assumption. ✷
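For comparison with the pooled analysis, the unequal-variance quantities in this example can be sketched as follows. The variable names are ours, the estimate −11.05 and the sample variances are those quoted in the text, and 2.069 is the tabled value t(0.975, 23) that appears earlier in this section.

```python
import math

n1, s1_sq = 22, 487.28   # females
n2, s2_sq = 15, 979.29   # males
est = -11.05             # ybar1 - ybar2, as quoted in the text
t_975_23 = 2.069         # tabled value t(0.975, 23)

v1, v2 = s1_sq / n1, s2_sq / n2
se = math.sqrt(v1 + v2)                                        # no pooling
nu = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))  # Equation (4.3.1)
t_obs = est / se
ci = (est - t_975_23 * se, est + t_975_23 * se)

print(round(se, 4), round(nu, 1), round(t_obs, 2), [round(v, 1) for v in ci])
```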
It is an algebraic fact that if N1 = N2 , the observed value of the test statistic for H0 : μ1 = μ2
based on unequal variances is the same as that based on equal variances. In the turtle example, the
sample sizes are both 24 and the test statistic of 6.523 is the same as the equal variances test statistic.
The algebraic equivalence occurs because with equal sample sizes, the standard errors from the two
procedures are the same. With equal sample sizes, the only practical difference between the two
procedures for examining Par = μ1 − μ2 is in the choice of degrees of freedom for the t distribu-
tion. In the turtle example above, the unequal variances procedure had approximately 35 degrees of
freedom, while the equal variance procedure has 46 degrees of freedom. The degrees of freedom are
sufficiently close that the substantive results of the turtle analysis are essentially the same, regardless
of method. The other fact that should be recalled is that the reference distribution associated with
μ1 − μ2 for the equal variance method is exactly correct for data that satisfy the model assumptions.
Even for data that satisfy the unequal variance method assumptions, the reference distribution is just
an approximation.
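The equality of the two standard errors when N1 = N2 can be seen directly. Writing N for the common sample size, the pooled estimate is the simple average of the two sample variances, and the following short derivation uses only the formulas given earlier.

```latex
s_p^2 = \frac{(N-1)s_1^2 + (N-1)s_2^2}{2(N-1)} = \frac{s_1^2 + s_2^2}{2},
\qquad
s_p^2\left[\frac{1}{N} + \frac{1}{N}\right]
  = \frac{s_1^2 + s_2^2}{2}\cdot\frac{2}{N}
  = \frac{s_1^2}{N} + \frac{s_2^2}{N}.
```

Taking square roots shows that the pooled and unpooled standard errors of ȳ1· − ȳ2· agree, so the two test statistics are identical and only the degrees of freedom differ.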
$$H_0 : \frac{\sigma_2^2}{\sigma_1^2} = 1 .$$
EXAMPLE 4.4.1. We again consider the log turtle height data. The sample variance of log female
heights is s₁² = 0.02493979 and the sample variance of log male heights is s₂² = 0.00677276. The
α = 0.01 level test is rejected, i.e., we conclude that the null model with σ₂² = σ₁² is wrong if
$$0.2716 = \frac{0.00677276}{0.02493979} = \frac{s_2^2}{s_1^2} > F(0.995, 23, 23) = 3.04$$
or if
$$0.2716 < F(0.005, 23, 23) = \frac{1}{F(0.995, 23, 23)} = \frac{1}{3.04} = 0.33 .$$
The second of these inequalities is true, so the null model with equal variances is rejected at the 0.01
level. We have evidence that σ₂² ≠ σ₁² if the model is true and, since the statistic is less than one,
evidence that σ₂² < σ₁². ✷
EXAMPLE 4.4.2. Consider again the final point total data. The sample variance for females is
s₁² = 487.28 and the sample variance for males is s₂² = 979.29. The test statistic is
$$\frac{s_1^2}{s_2^2} = \frac{487.28}{979.29} = 0.498 .$$
For the tests being used, it does not matter which variance estimate we put in the numerator
as long as we keep the degrees of freedom straight. The observed test statistic is not less than
1/F(0.95, 14, 21) = 1/2.197 = 0.455 nor greater than F(0.95, 21, 14) = 2.377, so the null model is
not rejected at the α = 0.10 level. ✷
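The two variance-ratio tests above follow the same pattern, sketched below. The function name variance_ratio_test is hypothetical, and the F percentiles are the tabled values quoted in the examples rather than values computed here.

```python
def variance_ratio_test(ratio, lower_cutoff, upper_cutoff):
    """Two-sided test: reject equal variances when the ratio of sample
    variances falls below the lower cutoff or above the upper cutoff."""
    return ratio < lower_cutoff or ratio > upper_cutoff

# Example 4.4.1: log turtle heights, s2^2 / s1^2, alpha = 0.01.
turtle_ratio = 0.00677276 / 0.02493979
turtle_reject = variance_ratio_test(turtle_ratio, 1 / 3.04, 3.04)

# Example 4.4.2: final point totals, s1^2 / s2^2, alpha = 0.10.
points_ratio = 487.28 / 979.29
points_reject = variance_ratio_test(points_ratio, 1 / 2.197, 2.377)

print(round(turtle_ratio, 4), turtle_reject, round(points_ratio, 3), points_reject)
```

Note that the lower cutoff is the reciprocal of an upper percentile with the numerator and denominator degrees of freedom interchanged, which matters only when the two sample sizes differ.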
In practice, tests for the equality of variances are rarely performed. As misguided as it may be,
typically, the main emphasis is on drawing conclusions about the μi s. The motivation for testing
equality of variances is frequently to justify the use of the pooled estimate of the variance. The
test assumes that the null hypothesis of equal variances is true and data that are inconsistent with
the assumptions indicate that the assumptions are false. We generally hope that this indicates that
the assumption about the null hypothesis is false, but, in fact, unusual data may be obtained if
any of the assumptions are invalid. The equal variances test assumes that the data are independent
and normal and that the variances are equal. Minor deviations from normality may cause the test
to be rejected. While procedures for comparing μi s based on the pooled estimate of the variance
are sensitive to unequal variances, they are not particularly sensitive to nonnormality. The test for
equality of variances is so sensitive to nonnormality that when rejecting this test one has little idea if
the problem is really unequal variances or if it is nonnormality. Thus one has little idea whether there
is a problem with the pooled estimate procedures or not. Since the test is not very informative, it is
rarely performed. Moreover, as discussed at the beginning of this section, if the variances of the two
groups are substantially different, inferences about the means may be irrelevant to the underlying
practical issues.
Theory
The F distribution used here is related to the fact that for independent random samples of normal
data,
$$\frac{(N_i - 1)s_i^2}{\sigma_i^2} \sim \chi^2(N_i - 1).$$
Definition 4.4.3. An F distribution is the ratio of two independent chi-squared random variables
divided by their degrees of freedom. The numerator and denominator degrees of freedom for the F
distribution are the degrees of freedom for the respective chi-squareds.
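In this problem, the two chi-squared random variables divided by their degrees of freedom are
$$\frac{(N_i - 1)s_i^2/\sigma_i^2}{N_i - 1} = \frac{s_i^2}{\sigma_i^2},$$
i = 1, 2. They are independent because they are taken from independent samples and their ratio is
$$\frac{s_2^2/\sigma_2^2}{s_1^2/\sigma_1^2} = \frac{s_2^2}{s_1^2}\,\frac{\sigma_1^2}{\sigma_2^2}.$$
When the null hypothesis is true, i.e., $\sigma_2^2/\sigma_1^2 = 1$, by definition, we get
$$\frac{s_2^2}{s_1^2} \sim F(N_2 - 1, N_1 - 1),$$
so the test statistic has an F distribution under the null hypothesis and the normal sampling model.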
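Note that we could equally well have reversed the roles of the two groups and set the test up as
$$H_0: \frac{\sigma_1^2}{\sigma_2^2} = 1.$$
An α level test is then rejected if
$$\frac{s_1^2}{s_2^2} > F\left(1 - \frac{\alpha}{2}, N_1 - 1, N_2 - 1\right)$$
or if
$$\frac{s_1^2}{s_2^2} < F\left(\frac{\alpha}{2}, N_1 - 1, N_2 - 1\right).$$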
Using the fact that for any α between zero and one and any degrees of freedom r and s,
$$F(\alpha, r, s) = \frac{1}{F(1 - \alpha, s, r)}, \tag{4.4.1}$$
it is easily seen that this test is equivalent to the one we constructed. Relation (4.4.1) is a result of
the fact that with equal variances both $s_2^2/s_1^2$ and $s_1^2/s_2^2$ have F distributions. Clearly, the smallest,
say, 5% of values of $s_2^2/s_1^2$ correspond to the largest 5% of the values of $s_1^2/s_2^2$.
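The claim that the variance ratio follows an F distribution under the null model can be checked by simulation. The following is a rough sketch, not part of the original development. It draws pairs of normal samples of size 24 (matching the 23 degrees of freedom of Example 4.4.1), uses the cutoffs 1/3.04 and 3.04 from that example, and counts how often the α = 0.01 test rejects. The function names, the seed, and the number of replications are arbitrary choices.

```python
import random


def sample_variance(values):
    """Unbiased sample variance (divisor N - 1)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)


def simulate_rejection_rate(n1, n2, sigma, lower, upper, reps, seed):
    """Fraction of simulated equal-variance data sets for which s2^2/s1^2
    falls outside the interval (lower, upper)."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        y1 = [rng.gauss(0.0, sigma) for _ in range(n1)]
        y2 = [rng.gauss(0.0, sigma) for _ in range(n2)]
        ratio = sample_variance(y2) / sample_variance(y1)
        if ratio < lower or ratio > upper:
            rejections += 1
    return rejections / reps


# Cutoffs F(0.005, 23, 23) = 1/3.04 and F(0.995, 23, 23) = 3.04 from Example 4.4.1.
rate = simulate_rejection_rate(
    n1=24, n2=24, sigma=1.0, lower=1 / 3.04, upper=3.04, reps=20000, seed=1
)
print(rate)  # should be in the neighborhood of 0.01
```

With normal data and equal variances the rejection rate should land near the nominal 0.01. Replacing `rng.gauss` with a heavy-tailed generator is one way to see the sensitivity to nonnormality discussed earlier in this section.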
4.5 Exercises
EXERCISE 4.5.1. Box (1950) gave data on the weights of rats that were given the drug Thiouracil.
The rats were measured at the start of the experiment and at the end of the experiment. The data are
given in Table 4.4. Give a 99% confidence interval for the difference in weights between the finish
and the start. Test the null hypothesis that the population mean weight gain was less than or equal
to 50 with α = 0.02.
EXERCISE 4.5.2. Box (1950) also considered data on rats given Thyroxin and a control group
of rats. The weight gains are given in Table 4.5. Give a 95% confidence interval for the difference
in weight gains between the Thyroxin group and the control group. Give the P value for a test of
whether the control group has weight gains different than the Thyroxin group.
EXERCISE 4.5.3. Conover (1971, p. 226) considered data on the physical fitness of male seniors
in a particular high school. The seniors were divided into two groups based on whether they lived
on a farm or in town. The results in Table 4.6 are from a physical fitness test administered to the
students. High scores indicate that an individual is physically fit. Give a 95% confidence interval for
the difference in mean fitness scores between the town and farm students. Test the hypothesis of no
difference at the α = 0.10 level. Give a 99% confidence interval for the mean fitness of town boys.
Give a 99% prediction interval for a future fitness score for a farm boy.
EXERCISE 4.5.4. Use the data of Exercise 4.5.3 to test whether the fitness scores for farm boys
are more or less variable than fitness scores for town boys.
EXERCISE 4.5.5. Jolicoeur and Mosimann (1960) gave data on turtle shell lengths. The data
for females and males are given in Table 4.7. Explore the need for a transformation. Test whether
there is a difference in lengths using α = 0.01. Give a 95% confidence interval for the difference in
lengths.
EXERCISE 4.5.6. Koopmans (1987) gave the data in Table 4.8 on verbal ability test scores for
8 year-olds and 10 year-olds. Test whether the two groups have the same mean with α = 0.01 and
give a 95% confidence interval for the difference in means. Give a 95% prediction interval for a new
10 year old. Check your assumptions.
EXERCISE 4.5.7. Burt (1966) and Weisberg (1985) presented data on IQ scores for identical
twins that were raised apart, one by foster parents and one by the genetic parents. Variable y1 is
the IQ score for a twin raised by foster parents, while y2 is the corresponding IQ score for the twin
raised by the genetic parents. The data are given in Table 4.9.
We are interested in the difference between μ1 , the population mean for twins raised by foster
parents, and μ2 , the population mean for twins raised by genetic parents. Analyze the data. Check
your assumptions.
EXERCISE 4.5.8. Table 4.10 presents data given by Shewhart (1939, p. 118) on various atomic
weights as reported in 1931 and again in 1936. Analyze the data. Check your assumptions.
EXERCISE 4.5.9. Reanalyze the data of Example 4.1.1 after deleting the one possible outlier.
Does the analysis change much? If so, how?
EXERCISE 4.5.10. Let $y_1 \sim N(89, 4)$ and $y_2 \sim N(90, 16)$. Show that $\Pr[y_1 \geq 87] > \Pr[y_2 \geq 87]$,
so that the population with the lower mean has a higher probability of exceeding 87. Recall that
$(y_1 - 89)/\sqrt{4} \sim N(0, 1)$ with a similar result for $y_2$ so that both probabilities can be rewritten in
terms of a N(0, 1).
EXERCISE 4.5.11. Mandel (1972) reported stress test data on elongation for a certain type of
rubber. Four pieces of rubber sent to one laboratory yielded a sample mean and variance of 56.50
and 5.66, respectively. Four different pieces of rubber sent to another laboratory yielded a sample
mean and variance of 52.50 and 6.33, respectively. Are the data two independent samples or a paired
comparison? Is the assumption of equal variances reasonable? Give a 99% confidence interval for
the difference in population means and give an approximate P value for testing that there is no
difference between population means.
EXERCISE 4.5.12. Bethea et al. (1985) reported data on the peel-strengths of adhesives. Some of
the data are presented in Table 4.11. Give an approximate P value for testing no difference between
adhesives, a 95% confidence interval for the difference between mean peel-strengths, and a 95%
prediction interval for a new observation on Adhesive A.
EXERCISE 4.5.13. Garner (1956) presented data on the tensile strength of fabrics. Here we consider
a subset of the data. The complete data and a more extensive discussion of the experimental
procedure are given in Exercise 11.5.2. The experiment involved testing fabric strengths on different
machines. Eight homogeneous strips of cloth were divided into samples and each machine was used
on a sample from each strip. The data are given in Table 4.12. Are the data two independent sam-
ples or a paired comparison? Give a 98% confidence interval for the difference in population means.
Give an approximate P value for testing that there is no difference between population means. What
is the result of an α = 0.05 test?
EXERCISE 4.5.14. Snedecor and Cochran (1967) presented data on the number of acres planted
in corn for two sizes of farms. Size was measured in acres. Some of the data are given in Table 4.13.
Are the data two independent samples or a paired comparison? Is the assumption of equal variances
reasonable? Test for differences between the farms of different sizes. Clearly state your α level.
Give a 98% confidence interval for the mean difference between different farms.
EXERCISE 4.5.15. Snedecor and Haber (1946) presented data on cutting dates of asparagus.
On two plots of land, asparagus was grown every year from 1929 to 1938. On the first plot the
asparagus was cut on June 1, while on the second plot the asparagus was cut on June 15. Note
that growing conditions will vary considerably from year to year. Also note that the data presented
have cutting dates confounded with the plots of land. If one plot of land is intrinsically better for
growing asparagus than the other, there will be no way of separating that effect from the effect of
cutting dates. Are the data two independent samples or a paired comparison? Give a 95% confidence
interval for the difference in population means and give an approximate P value for testing that there
is no difference between population means. Give a 95% prediction interval for the difference in a
new year. The data are given in Table 4.14.
EXERCISE 4.5.16. Snedecor (1945b) presented data on a pesticide spray. The treatments were
the number of units of active ingredient contained in the spray. Several different sources for breed-
ing mediums were used and each spray was applied on each distinct breeding medium. The data
consisted of numbers of dead adult flies found in cages that were set over the breeding medium
containers. Some of the data are presented in Table 4.15. Give a 95% confidence interval for the
difference in population means. Give an approximate P value for testing that there is no difference
between population means and an α = 0.05 test. Give a 95% prediction interval for a new obser-
vation with 8 units. Give a 95% prediction interval for a new observation with 8 units when the
corresponding 0 unit value is 300.
EXERCISE 4.5.17. Using the data of Example 4.2.1 give a 95% prediction interval for the
difference in total points between a new female and a new male. This was not discussed earlier so it
requires a deeper understanding of Section 3.5.
Contingency Tables
In this chapter we consider data that consist of counts. We begin in Section 5.1 by examining a
set of data on the number of females admitted into graduate school at the University of California,
Berkeley. A key feature of these data is that only two outcomes are possible: admittance or rejec-
tion. Data with only two outcomes are referred to as binary (or dichotomous) data. Often the two
outcomes are referred to generically as success and failure. In Section 5.2, we expand our discus-
sion by comparing two sets of dichotomous data; we compare Berkeley graduate admission rates for
females and males. Section 5.3 examines polytomous data, i.e., count data in which there are more
than two possible outcomes. For example, numbers of Swedish females born in the various months
of the year involve counts for 12 possible outcomes. Section 5.4 examines comparisons between
two samples of polytomous data, e.g., comparing the numbers of females and males that are born in
the different months of the year. Section 5.5 looks at comparisons among more than two samples of
polytomous data. The last section considers a method of reducing large tables of counts that involve
several samples of polytomous data into smaller more interpretable tables.
Sections 5.1 and 5.2 involve analogues of Chapters 2 and 4 that are appropriate for dichotomous
data. The basic analyses in these sections simply involve new applications of the ideas in Chapter 3.
Sections 5.3, 5.4, and 5.5 are polytomous data analogues of Chapters 2, 4, and 12. Everitt (1977)
and Fienberg (1980) give more detailed introductions to the analysis of count data. Sophisticated
analyses of count data frequently use analogues of ANOVA and regression called logistic regression
and log-linear models. These are discussed in Chapters 20 and 21, respectively.
It seems reasonable to view the 1835 females as a random sample from a population of potential
female applicants. We are interested in the probability p that a female applicant is admitted to
graduate school. A natural estimate of the parameter p is the proportion of females that were actually
admitted, thus our estimate of the parameter is
$$\hat{p} = \frac{557}{1835} = 0.30354.$$
We have a parameter of interest, p, and an estimate of that parameter, p̂; if we can identify a standard
error and an appropriate distribution, we can use methods from Chapter 3 to perform statistical
inferences.
The key to finding a standard error is to find the variance of the estimate. As we will see later,
$$\mathrm{Var}(\hat{p}) = \frac{p(1 - p)}{n}. \tag{5.1.1}$$
To estimate the standard deviation of p̂, we simply use p̂ to estimate p in (5.1.1) and take the square
root. Thus the standard error is
$$\mathrm{SE}(\hat{p}) = \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} = \sqrt{\frac{0.30354(1 - 0.30354)}{1835}} = 0.01073.$$
The final requirement for using the results of Chapter 3 is to find an appropriate reference
distribution for
$$\frac{\hat{p} - p}{\mathrm{SE}(\hat{p})}.$$
We can think of each trial as scoring either a 1, if the trial is a success, or a 0, if the trial is a failure.
With this convention p̂, the proportion of successes, is really the average of the 0-1 scores and since
p̂ is an average we can apply the central limit theorem. (In fact, SE(p̂) is very nearly $s/\sqrt{n}$, where s
is computed from the 0-1 scores.) The central limit theorem simply states that for a large number of
trials n, the distribution of p̂ is approximately normal with a population mean that is the population
mean of p̂ and a population variance that is the population variance of p̂. We have already given the
variance of p̂ and we will see later that E(p̂) = p, thus for large n we have the approximation
$$\hat{p} \sim N\left(p, \frac{p(1 - p)}{n}\right).$$
The variance is unknown but by the law of large numbers it is approximately equal to our estimate
of it, p̂(1 − p̂)/n. Standardizing the normal distribution (cf. Exercise 1.6.2) gives the approximation
$$\frac{\hat{p} - p}{\mathrm{SE}(\hat{p})} \sim N(0, 1) \equiv t(\infty). \tag{5.1.2}$$
This distribution requires a sample size that is large enough for both the central limit theorem ap-
proximation and the law of large numbers approximation to be reasonably valid. For values of p
that are not too close to 0 or 1, the approximation works reasonably well with sample sizes as small
as 20. However, the normal distribution is unrealistically precise, since it is based on both a normal
approximation and a law of large numbers approximation. We use the t(n − 1) distribution instead,
hoping that it provides a more realistic view of the reference distribution.
We now have Par = p, Est = p̂, $\mathrm{SE}(\hat{p}) = \sqrt{\hat{p}(1 - \hat{p})/n}$, and the distribution in (5.1.2) or t(n − 1).
As in Chapter 3, a 95% confidence interval for p has limits
$$\hat{p} \pm 1.96\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}.$$
Here 1.96 = t(0.975, ∞) ≈ t(0.975, 1834). Recall that a (1 − α)100% confidence interval requires
the (1 − α /2) percentile of the distribution. For the female admissions data, the limits are
0.30354 ± 1.96(0.01073),
which gives the interval (0.28, 0.32). The data are consistent with a population proportion of females
admitted to Berkeley’s graduate school between 0.28 and 0.32. (As is often the case, it is not exactly
clear what population these data relate to.) Agresti and Coull (1998) discuss alternative methods of
constructing confidence intervals, some of which have better Neyman–Pearson coverage rates.
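The interval computation is short enough to sketch in code. The following Python fragment is an illustration only: the function name `proportion_ci` is hypothetical, and it uses the standard normal percentile from the standard library, i.e., the t(∞) reference distribution of (5.1.2) rather than t(n − 1).

```python
from math import sqrt
from statistics import NormalDist


def proportion_ci(successes, n, level=0.95):
    """Large-sample confidence interval for a binomial proportion."""
    p_hat = successes / n
    se = sqrt(p_hat * (1 - p_hat) / n)
    # (1 - alpha/2) percentile of N(0, 1), e.g. about 1.96 for a 95% interval.
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return p_hat, se, (p_hat - z * se, p_hat + z * se)


# Berkeley female admissions: 557 admitted out of 1835 applicants.
p_hat, se, (lower, upper) = proportion_ci(557, 1835)
print(round(p_hat, 5), round(se, 5), round(lower, 2), round(upper, 2))
```

Changing `level` to 0.99 gives the interval that corresponds to an α = 0.01 two-sided test.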
We can also perform, say, an α = 0.01 test of the null hypothesis H0 : p = 1/3. The test rejects
H0 if
$$\frac{\hat{p} - 1/3}{\mathrm{SE}(\hat{p})} > 2.58$$
or if
$$\frac{\hat{p} - 1/3}{\mathrm{SE}(\hat{p})} < -2.58.$$
Here 2.58 = t(0.995, ∞) ≈ t(0.995, 1834). An α-level test requires the (1 − α/2)100% point of the
distribution. The Berkeley data yield the test statistic
$$\frac{0.30354 - 0.33333}{0.01073} = -2.78,$$
which is smaller than −2.58, so we reject the null model with p = 1/3 at α = 0.01. In other words,
we can reject, with strong assurance, the claim that one third of female applicants are admitted to
graduate school at Berkeley, provided the data really are binomial. Since the test statistic is negative,
we have evidence that the true proportion is less than one third. The test as constructed here is
equivalent to checking whether p = 1/3 is within a 99% confidence interval.
There is an alternative, slightly different, way of performing tests such as H0 : p = 1/3. The
difference involves using a different standard error. The variance of the estimate p̂ is p(1 − p)/n.
In obtaining a standard error, we estimated p with p̂ and took the square root of the estimated
variance. Recalling that tests are performed assuming that the null hypothesis is true, it makes sense
in the testing problem to use the assumption p = 1/3 in computing a standard error for p̂. Thus an
alternative standard error for p̂ in this testing problem is
$$\sqrt{\frac{1}{3}\left(1 - \frac{1}{3}\right)\Big/ 1835} = 0.01100.$$
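The two standard errors can be compared side by side. The sketch below is an illustration with hypothetical variable names. It computes the test statistic for H0 : p = 1/3 once with the estimated standard error and once with the standard error computed under the null hypothesis, and compares both to the two-sided α = 0.01 normal cutoff.

```python
from math import sqrt
from statistics import NormalDist

successes, n = 557, 1835      # Berkeley female admissions
p0, alpha = 1 / 3, 0.01       # null value and test level

p_hat = successes / n

# Standard error with p estimated by p_hat.
se_estimated = sqrt(p_hat * (1 - p_hat) / n)
# Alternative standard error computed assuming the null value p0.
se_null = sqrt(p0 * (1 - p0) / n)

z_estimated = (p_hat - p0) / se_estimated
z_null = (p_hat - p0) / se_null

critical = NormalDist().inv_cdf(1 - alpha / 2)   # about 2.58
reject_estimated = abs(z_estimated) > critical
reject_null = abs(z_null) > critical

print(round(se_null, 5), round(z_null, 2), reject_estimated, reject_null)
```

For these data the two statistics differ only slightly and both lead to rejection at the 0.01 level.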
$$\frac{0.481 - 0.5}{\sqrt{0.5(1 - 0.5)/27}} = -0.20,$$
$$\hat{p}_1 = \frac{557}{1835} = 0.30354.$$
Treating the males as a binomial sample, the sample size is $n_2 = 2691$ and the probability of success
is, say, $p_2$. The observed proportion of male successes is
$$\hat{p}_2 = \frac{1198}{2691} = 0.44519.$$
Our interest is in comparing the success rate of females and males. The appropriate parameter
is the difference in proportions,
Par = p1 − p2 .
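The natural estimate of this parameter is
$$\mathrm{Est} = \hat{p}_1 - \hat{p}_2 = 0.30354 - 0.44519 = -0.14165.$$
With independent samples, we can find the variance of the estimate and thus the standard error.
Since the females are independent of the males,
$$\mathrm{Var}(\hat{p}_1 - \hat{p}_2) = \mathrm{Var}(\hat{p}_1) + \mathrm{Var}(\hat{p}_2) = \frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}. \tag{5.2.1}$$
Estimating $p_1$ and $p_2$ and taking the square root gives the standard error
$$\mathrm{SE}(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}} = 0.01439.$$
For large sample sizes $n_1$ and $n_2$, both $\hat{p}_1$ and $\hat{p}_2$ have approximate normal distributions and they
are independent, so $\hat{p}_1 - \hat{p}_2$ has an approximate normal distribution and the appropriate reference
distribution is approximately
$$\frac{(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)}{\mathrm{SE}(\hat{p}_1 - \hat{p}_2)} \sim N(0, 1).$$
A 95% confidence interval for $p_1 - p_2$ has endpoints
$$(\hat{p}_1 - \hat{p}_2) \pm 1.96\,\mathrm{SE}(\hat{p}_1 - \hat{p}_2),$$
where the value 1.96 = t(0.975, ∞) seems reasonable given the large sample sizes involved. For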
comparing the female and male admissions, the 95% confidence interval for the population differ-
ence in proportions has endpoints
−0.14165 ± 1.96(0.01439).
The interval is (−0.17, −0.11). Proportions of women being admitted to graduate school at Berkeley
between 0.11 and 0.17 less than that for men are consistent with the data and the model at α = 0.05.
To test H0 : p1 = p2, or equivalently H0 : p1 − p2 = 0, reject an α = 0.10 test if
$$\frac{(\hat{p}_1 - \hat{p}_2) - 0}{\mathrm{SE}(\hat{p}_1 - \hat{p}_2)} > 1.645$$
or if
$$\frac{(\hat{p}_1 - \hat{p}_2) - 0}{\mathrm{SE}(\hat{p}_1 - \hat{p}_2)} < -1.645.$$
Again, the value 1.645 is obtained from the t(∞) distribution and presumes very large samples.
With the Berkeley data, the observed value of the test statistic is
$$\frac{-0.14165 - 0}{0.01439} = -9.84.$$
This is far smaller than −1.645, so the test rejects the null hypothesis of equal proportions at the
0.10 level pretty much regardless of how we determine degrees of freedom. The test statistic is
negative, so there is evidence that the proportion of women admitted to graduate school is lower
than the proportion of men.
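The two-sample calculations can be reproduced with a short script. This is a sketch with hypothetical variable names, using the admission counts quoted in this section and normal percentiles from the standard library. Because it carries full precision rather than the rounded values 0.14165 and 0.01439, the test statistic it prints can differ from −9.84 in the second decimal place.

```python
from math import sqrt
from statistics import NormalDist

x1, n1 = 557, 1835     # females admitted, female applicants
x2, n2 = 1198, 2691    # males admitted, male applicants

p1_hat, p2_hat = x1 / n1, x2 / n2
diff = p1_hat - p2_hat

# Standard error of the difference for independent binomial samples.
se_diff = sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)

# 95% confidence interval for p1 - p2.
z975 = NormalDist().inv_cdf(0.975)
ci = (diff - z975 * se_diff, diff + z975 * se_diff)

# Two-sided alpha = 0.10 test of H0: p1 - p2 = 0.
z_stat = diff / se_diff
reject_at_10 = abs(z_stat) > NormalDist().inv_cdf(0.95)

print(round(diff, 5), round(se_diff, 5), [round(c, 2) for c in ci], reject_at_10)
```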
Once again, an alternative standard error is often used in testing problems. The test assumes that
the null hypothesis is true, i.e. p1 = p2 , so in constructing a standard error for the test statistic it
makes sense to pool the data into one estimate of this common proportion. The pooled estimate is a
weighted average of the individual estimates,
$$\begin{aligned}
\hat{p}_* &= \frac{n_1\hat{p}_1 + n_2\hat{p}_2}{n_1 + n_2} \\
&= \frac{1835(0.30354) + 2691(0.44519)}{1835 + 2691} \\
&= \frac{557 + 1198}{1835 + 2691} \\
&= 0.38776.
\end{aligned}$$
Using p̂∗ to estimate both p1 and p2 in Equation (5.2.1) and taking the square root gives the alter-
native standard error
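$$\begin{aligned}
\mathrm{SE}(\hat{p}_1 - \hat{p}_2) &= \sqrt{\frac{\hat{p}_*(1 - \hat{p}_*)}{n_1} + \frac{\hat{p}_*(1 - \hat{p}_*)}{n_2}} \\
&= \sqrt{\hat{p}_*(1 - \hat{p}_*)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} \\
&= \sqrt{0.38776(1 - 0.38776)\left(\frac{1}{1835} + \frac{1}{2691}\right)} \\
&= 0.01475.
\end{aligned}$$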
Again, the two test statistics are slightly different but the difference should be minor compared to
the level of approximation involved in using the normal distribution.
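As a check on the arithmetic, the pooled estimate and the alternative standard error can be computed directly from the counts. The variable names below are hypothetical. The last line forms the corresponding test statistic by dividing the observed difference in proportions by the pooled standard error.

```python
from math import sqrt

x1, n1 = 557, 1835     # females admitted, female applicants
x2, n2 = 1198, 2691    # males admitted, male applicants

p1_hat, p2_hat = x1 / n1, x2 / n2

# Pooled estimate of the common proportion under H0: p1 = p2.
p_pooled = (x1 + x2) / (n1 + n2)

# Alternative standard error using the pooled estimate for both groups.
se_pooled = sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))

# Test statistic based on the pooled standard error.
z_pooled = (p1_hat - p2_hat) / se_pooled

print(round(p_pooled, 5), round(se_pooled, 5), round(z_pooled, 2))
```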
A final note. Before you conclude that the data in Table 5.1 provide evidence of sex discrimi-
nation, you should realize that females tend to apply to different graduate programs than males. A
more careful examination of the complete Berkeley data shows that the difference observed here
largely results from females applying more frequently than males to highly restrictive programs,
cf. Christensen (1997, p. 114). Rejecting the test suggests that something is wrong with the null
model. In this case, the assumption of binomial sampling is wrong. Some people have different
probabilities of being admitted than other people, depending on what department they applied to.
$$X^2 = \sum_{\text{all cells}} \frac{(O - E)^2}{E}.$$
For the female Swedish births,
$$X^2 = 121.24.$$
Note that small values of $X^2$ indicate observed values that are similar to the expected values, so
small values of $X^2$ are consistent with the null model. (However, with 3 or more degrees of freedom,
values that are too small can indicate that the multinomial sampling assumption is suspect.) Large
values of $X^2$ occur whenever one or more observed values are far from the expected values. To
perform a test, we need some idea of how large $X^2$ could reasonably be when the null model is true.
It can be shown that for a problem such as this with 1) a fixed number of cells q, here q = 12, with
2) a null model consisting of known probabilities such as those given in column 4 of Table 5.2, and
with 3) large sample sizes for each cell, the null distribution of $X^2$ is approximately
$$X^2 \sim \chi^2(q - 1).$$
The degrees of freedom are only q − 1 because the p̂s must add up to 1. Thus, if we know q − 1 = 11
of the proportions, we can figure out the last one. Only q − 1 of the cells are really free to vary. From
Appendix B.2, the 99.5th percentile of a $\chi^2(11)$ distribution is $\chi^2(0.995, 11) = 26.76$. The observed
$X^2$ value of 121.24 is much larger than this, so the observed value of $X^2$ could not reasonably come
Table 5.3: Swedish births: monthly observations ($O_{ij}$s) and monthly proportions by sex.
             Observations                Proportions
Month        Female    Male     Total    Female   Male
January       3537     3743      7280    0.083    0.082
February      3407     3550      6957    0.080    0.078
March         3866     4017      7883    0.091    0.088
April         3711     4173      7884    0.087    0.091
May           3775     4117      7892    0.089    0.090
June          3665     3944      7609    0.086    0.086
July          3621     3964      7585    0.085    0.087
August        3596     3797      7393    0.084    0.083
September     3491     3712      7203    0.082    0.081
October       3391     3512      6903    0.080    0.077
November      3160     3392      6552    0.074    0.074
December      3371     3761      7132    0.079    0.082
Total        42591    45682     88273    1.000    1.000
from a $\chi^2(11)$ distribution. Tests based on $X^2$, like F tests, are commonly viewed as being rejected
only for large values of the test statistics and P values are computed correspondingly. However, $X^2$
values that are too small also suggest that something is awry with the null model. In any case, there
is overwhelming evidence that monthly female Swedish births are not multinomial with constant
probabilities.
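The statistic is easy to recompute from the female column of Table 5.3. The sketch below assumes the null probabilities are 1/12 for every month (the constant-probability null model described in the text). The function name `pearson_chi_square` is hypothetical, and the critical value 26.76 is the one quoted from Appendix B.2 rather than computed.

```python
# Monthly female births, January through December, from Table 5.3.
female_births = [3537, 3407, 3866, 3711, 3775, 3665,
                 3621, 3596, 3491, 3391, 3160, 3371]


def pearson_chi_square(observed, probabilities):
    """Pearson's X^2 for one multinomial sample with known null probabilities."""
    n = sum(observed)
    expected = [n * p for p in probabilities]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


q = len(female_births)
# Assumed null model: each month has probability 1/12.
x2 = pearson_chi_square(female_births, [1 / q] * q)
df = q - 1

critical_995 = 26.76   # chi-square(0.995, 11), quoted in the text
reject = x2 > critical_995

print(round(x2, 2), df, reject)
```

Passing probabilities proportional to the number of days in each month instead of `[1 / q] * q` would give the statistic for the daily-rate null model mentioned in the next paragraph.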
In this example, our null hypothesis was that the probability of a female birth was the same
in every month. A more reasonable hypothesis might be that the probability of a female birth is
the same on every day. The months have different numbers of days so under this null model they
have different probabilities. For example, assuming a 365-day year, the probability of a female
birth in January is 31/365, which is somewhat larger than 1/12. Exercise 5.8.4 involves testing the
corresponding null model.
We can use results from Section 5.1 to help in the analysis of multinomial data. If we consider
only the month of December, we can view each trial as a success if the birth is in December and a
failure otherwise. Writing the probability of a birth in December as p12 , from Table 5.2 the estimate
of p12 is
$$\hat{p}_{12} = \frac{3371}{42591} = 0.07915$$
with standard error
$$\mathrm{SE}(\hat{p}_{12}) = \sqrt{\frac{0.07915(1 - 0.07915)}{42591}} = 0.00131$$
and a 95% confidence interval has endpoints
$$0.07915 \pm 1.96(0.00131).$$
The interval reduces to (0.077, 0.082). Tests for monthly proportions can be performed in a similar
fashion.
In general,
$$\hat{E}_{ij} = O_{\cdot j}\,\hat{p}_i^0 = O_{\cdot j}\frac{O_{i\cdot}}{O_{\cdot\cdot}} = \frac{O_{i\cdot}O_{\cdot j}}{O_{\cdot\cdot}}.$$
Again, the estimated expected values are computed assuming that the proportions of births are the
same for females and males in every month, i.e., assuming that the null model is true. The estimated
expected values under the null model are given in Table 5.4. Note that the totals for each month and
for each sex remain unchanged.
The estimated expected values are compared to the observations using Pearson residuals, just as
in Section 5.3. The Pearson residuals are
$$\tilde{r}_{ij} \equiv \frac{O_{ij} - \hat{E}_{ij}}{\sqrt{\hat{E}_{ij}}}.$$
A more apt name for the Pearson residuals in this context may be crude standardized residuals. It
is the standardization here that is crude and not the residuals. The standardization in the Pearson
residuals ignores the fact that Ê is itself an estimate. Better, but considerably more complicated,
Table 5.4: Estimated expected Swedish births by month ($\hat{E}_{ij}$s) and pooled proportions.
             Expectations                       Pooled
Month        Female      Male        Total      proportions
January       3512.54     3767.46     7280      0.082
February      3356.70     3600.30     6957      0.079
March         3803.48     4079.52     7883      0.089
April         3803.97     4080.03     7884      0.089
May           3807.83     4084.17     7892      0.089
June          3671.28     3937.72     7609      0.086
July          3659.70     3925.30     7585      0.086
August        3567.06     3825.94     7393      0.084
September     3475.39     3727.61     7203      0.082
October       3330.64     3572.36     6903      0.078
November      3161.29     3390.71     6552      0.074
December      3441.13     3690.87     7132      0.081
Total        42591.00    45682.00    88273      1.000
Table 5.5: Pearson residuals for Swedish birth months ($\tilde{r}_{ij}$s).
standardized residuals can be defined for count data, cf. Christensen (1997, Section 6.7) and Chap-
ters 20 and 21. For the Swedish birth data, the Pearson residuals are given in Table 5.5. Note that
when compared to a N(0, 1) distribution, none of the residuals is very large; all are smaller than 1.51
in absolute value.
As in Section 5.3, the sum of the squared Pearson residuals gives Pearson’s $\chi^2$ statistic for
testing the null model of no differences between females and males. Pearson’s test statistic is
$$X^2 = \sum_{ij} \frac{(O_{ij} - \hat{E}_{ij})^2}{\hat{E}_{ij}}.$$
For the Swedish birth data, computing the statistic from the 24 cells in Table 5.5 gives
$$X^2 = 14.9858.$$
[Figure 5.1: Observed monthly birth proportions by sex (Female, Male). Vertical axis: Proportion; horizontal axis: Month.]
and 0.10, and the more appropriate two-sided P value would be even bigger. There is no evidence
of any differences in the monthly birth rates for males and females.
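The expected counts, Pearson residuals, and test statistic for the two-sample comparison can be recomputed from the observations in Table 5.3. The following Python sketch is illustrative only, and the helper name `expected_counts` is hypothetical. It treats the 12 months as rows and the two sexes as columns, so the degrees of freedom are (12 − 1)(2 − 1) = 11.

```python
from math import sqrt

# Observed births by month (January through December) from Table 5.3.
female = [3537, 3407, 3866, 3711, 3775, 3665, 3621, 3596, 3491, 3391, 3160, 3371]
male = [3743, 3550, 4017, 4173, 4117, 3944, 3964, 3797, 3712, 3512, 3392, 3761]


def expected_counts(table):
    """Estimated expected counts (row total)(column total)/(grand total)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    return [[r * c / grand for c in col_totals] for r in row_totals]


observed = [list(pair) for pair in zip(female, male)]   # rows = months, columns = sexes
expected = expected_counts(observed)

# Pearson residuals (O - E)/sqrt(E) and their sum of squares.
residuals = [[(o - e) / sqrt(e) for o, e in zip(orow, erow)]
             for orow, erow in zip(observed, expected)]
x2 = sum(r * r for row in residuals for r in row)
df = (len(observed) - 1) * (len(observed[0]) - 1)
max_abs_residual = max(abs(r) for row in residuals for r in row)

print(round(expected[0][0], 2), round(x2, 4), df, round(max_abs_residual, 2))
```

The first expected count should agree with the January female entry of Table 5.4, and the largest residual in absolute value belongs to April females.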
Another way to evaluate the null model is by comparing the observed monthly birth proportions
by sex. These observed proportions are given in Table 5.3. If the populations of females and males
have the same proportions of births in each month, the observed proportions of births in each month
should be similar (except for sampling variation). One can compare the numbers directly in Table 5.3
or one can make a visual display of the observed proportions as in Figure 5.1.
The methods just discussed apply equally well to the binomial data of Table 5.1. Applying the
$X^2$ test given here to the data of Table 5.1 gives
$$X^2 = 92.2.$$
The statistic $X^2$ is equivalent to the test statistic given in Section 5.2 using the pooled estimate $\hat{p}_*$
to compute the alternative standard error. The test statistic in Section 5.2 is −9.60, and if we square
that we get
$$(-9.60)^2 = 92.2 = X^2.$$
The −9.60 is compared to a N(0, 1), while the 92.2 is compared to a $\chi^2(1)$ because Table 5.1 has
2 rows and 2 columns. A $\chi^2(1)$ distribution is obtained by squaring a N(0, 1) distribution; P values
are identical and critical values are equivalent.
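The equivalence can be verified numerically. The sketch below rebuilds the 2 × 2 table from the admission counts quoted in Section 5.2 (the rejected counts are obtained by subtraction), computes Pearson's statistic, and compares it with the square of the pooled test statistic. All names are hypothetical.

```python
from math import sqrt

# Rows = (female, male); columns = (admitted, not admitted).
x1, n1 = 557, 1835
x2_count, n2 = 1198, 2691
observed = [[x1, n1 - x1], [x2_count, n2 - x2_count]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand = sum(row_totals)

# Pearson's X^2 with expected counts (row total)(column total)/(grand total).
x2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / grand
        x2 += (o - e) ** 2 / e

# Pooled two-proportion statistic from Section 5.2.
p_pooled = (x1 + x2_count) / (n1 + n2)
se_pooled = sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
z_pooled = (x1 / n1 - x2_count / n2) / se_pooled

print(round(x2, 1), round(z_pooled ** 2, 1))
```

The two printed values agree because, for a 2 × 2 table, Pearson's statistic is algebraically the square of the pooled statistic.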
R and Minitab code for fitting contingency tables is given on my website.
one sample of 1926 individuals cross-classified by religion and occupation, or four independent
samples of sizes 348, 477, 411, and 690 taken from the occupational groups, or three independent
samples of sizes 1135, 648, and 143 taken from the religious groups. We choose to view the data as
independent samples from the three religious groups. The data in Table 5.6 constitute a 3 × 4 table
because, excluding the totals, the table has 3 rows and 4 columns.
We again test whether the populations are the same. In other words, the null hypothesis is that the
probability of falling into any occupational group is identical for members of the various religions.
Under this null hypothesis, it makes sense to pool the data from the three religions to obtain estimates
of the common probabilities. For example, under the null hypothesis of identical populations,
the estimate of the probability that a person is a professional is
$$\hat{p}_1^0 = \frac{348}{1926} = 0.180685.$$
For skilled workers the estimated probability is
$$\hat{p}_4^0 = \frac{690}{1926} = 0.358255.$$
Denote the observations as $O_{ij}$ with i identifying a religious group and j indicating occupation.
We use a dot to signify summing over a subscript. Thus the total for religious group i is
$$O_{i\cdot} = \sum_j O_{ij},$$
and
$$O_{\cdot\cdot} = \sum_{ij} O_{ij}$$
is the grand total. Recall that the null hypothesis is that the probability of being in an occupation
group is the same for each of the three populations. Pooling information over religions, we have
$$\hat{p}_j^0 = \frac{O_{\cdot j}}{O_{\cdot\cdot}}$$
as the estimate of the probability that someone in the study is in occupational group j. This estimate
is only appropriate when the null model is true.
The estimated expected count under the null model for a particular occupation and religion is
obtained by multiplying the number of people sampled in that religion by the probability of the
occupation. For example, the estimated expected count under the null model for Jewish professionals
is
$$\hat{E}_{31} = 143(0.180685) = 25.84.$$
Table 5.7: Estimated expected counts ($\hat{E}_{ij}$s).
Religion            A         B         C         D        Total
Protestant        205.08    281.10    242.20    406.62     1135
Roman Catholic    117.08    160.49    138.28    232.15      648
Jewish             25.84     35.42     30.52     51.23      143
Total             348.00    477.00    411.00    690.00     1926

Table 5.8: Pearson residuals ($\tilde{r}_{ij}$s).
Religion            A         B         C         D
Protestant         0.34     −0.24      0.76     −0.63
Roman Catholic    −1.39     −1.62     −0.96      3.07
Jewish             2.00      4.13     −0.09     −4.78
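Similarly, the estimated expected count for Roman Catholic skilled workers is
$$\hat{E}_{24} = 648(0.358255) = 232.15.$$
In general,
$$\hat{E}_{ij} = O_{i\cdot}\,\hat{p}_j^0 = O_{i\cdot}\frac{O_{\cdot j}}{O_{\cdot\cdot}} = \frac{O_{i\cdot}O_{\cdot j}}{O_{\cdot\cdot}}.$$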
Again, the estimated expected values are computed assuming that the null model is true. The ex-
pected values for all occupations and religions are given in Table 5.7.
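Because the estimated expected counts depend on the data only through the row and column totals, they can be reproduced from the margins alone. The sketch below uses the religion and occupation totals quoted in this section. The dictionary layout and variable names are illustrative choices.

```python
# Marginal totals quoted in the text for the religion-by-occupation table.
row_totals = {"Protestant": 1135, "Roman Catholic": 648, "Jewish": 143}
col_totals = {"A": 348, "B": 477, "C": 411, "D": 690}

grand_total = sum(row_totals.values())

# Estimated expected count = (row total)(column total)/(grand total).
expected = {
    religion: {occ: r * c / grand_total for occ, c in col_totals.items()}
    for religion, r in row_totals.items()
}

for religion, row in expected.items():
    print(religion, [round(value, 2) for value in row.values()])
```

The printed rows should agree with the entries of Table 5.7.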
The estimated expected values are compared to the observations using Pearson residuals. The
Pearson residuals are
$$\tilde{r}_{ij} = \frac{O_{ij} - \hat{E}_{ij}}{\sqrt{\hat{E}_{ij}}}.$$
These crude standardized residuals are given in Table 5.8 for all occupations and religions. The
largest negative residual is −4.78 for Jewish people with occupation D. This indicates that Jew-
ish people were substantially underrepresented among skilled workers relative to the other two
religious groups. On the other hand, Roman Catholics were substantially overrepresented among
skilled workers, with a positive residual of 3.07. The other large residual in the table is 4.13 for
Jewish people in group B. Thus Jewish people were more highly represented among owners, man-
agers, and officials than the other religious groups. Only one other residual is even moderately large,
the 2.00 indicating a high level of Jewish people in the professions. The main feature of these data
seems to be that the Jewish group was different from the other two. A substantial difference appears
in every occupational group except clerical and sales.
As in Sections 5.3 and 5.4, the sum of the squared Pearson residuals gives Pearson’s $\chi^2$ statistic
for testing the null model that the three populations are the same and the samples are independent
multinomials. Pearson’s test statistic is
$$X^2 = \sum_{ij} \frac{(O_{ij} - \hat{E}_{ij})^2}{\hat{E}_{ij}}.$$
$$X^2 = 60.0.$$
The appropriate number of degrees of freedom for the $\chi^2$ test is the number of data rows in
Table 5.6 minus 1 times the number of data columns in Table 5.6 minus 1. Thus the appropriate
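[Figure 5.2: Observed proportions by occupation (A, B, C, D) for each religious group (Protest, RomCath, Jewish). Vertical axis: Proportion; horizontal axis: Occupation.]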
reference distribution is χ 2 ((3 − 1)(4 − 1)) = χ 2 (6). The 99.5th percentile of a χ 2 (6) distribution is
χ 2 (0.995, 6) = 18.55 so the observed statistic X 2 = 60.0 could not reasonably come from a χ 2 (6)
distribution. In particular, for multinomial sampling the test clearly indicates that the proportions of
people in the different occupation groups differ with religious category.
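The percentile comparison above can be checked numerically. The following is a minimal sketch in R that uses only the statistic and degrees of freedom quoted in the text; the variable names are illustrative and not part of the original analysis.

```r
# Values taken from the text: X^2 = 60.0 on (3 - 1)(4 - 1) = 6 degrees of freedom
x2_obs <- 60.0
df <- (3 - 1) * (4 - 1)

# 99.5th percentile of the chi-squared(6) reference distribution
crit_995 <- qchisq(0.995, df = df)

# Upper-tail probability of a value at least as large as the observed statistic
p_value <- pchisq(x2_obs, df = df, lower.tail = FALSE)

print(round(crit_995, 2))
print(p_value)
```

Because the observed statistic is far beyond the 99.5th percentile, the upper-tail probability is extremely small, which is the sense in which 60.0 "could not reasonably come from" the reference distribution.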
As in the previous section, we can informally evaluate the null model by examining the observed
proportions for each religious group. The observed proportions are given in Table 5.9. Under the
null model, the observed proportions in each occupation category should be the same for all the
religions (up to sampling variability). Figure 5.2 displays the observed proportions graphically. The
Jewish group is obviously very different from the other two groups in occupations B and D and
is very similar in occupation C. The Jewish proportion seems somewhat different for occupation
A. The Protestant and Roman Catholic groups seem similar except that the Protestants are a bit
underrepresented in occupation D and therefore are overrepresented in the other three categories.
(Remember that the four proportions for each religion must add up to one, so being underrepresented
in one category forces an overrepresentation in one or more other categories.)
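5.6 Lancaster–Irwin partitioning
Table 5.10 partitions Table 5.6 into a reduced table and a collapsed table. In the reduced table, the
Jewish group has been eliminated, leaving a subset of the original table. In the collapsed table, the
two rows in the reduced table, Protestant and Roman Catholic, have been collapsed into a single
row.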
In Lancaster–Irwin partitioning, we pick a group of either rows or columns, say rows. The
reduced table involves all of the columns but only the chosen subgroup of rows. The collapsed table
involves all of the columns and all of the rows not in the chosen subgroup, along with a row that
combines (collapses) all of the subgroup rows into a single row. In Table 5.10 the chosen subgroup
of rows contains the Protestants and Roman Catholics. The reduced table involves all occupational
groups but only the Protestants and Roman Catholics. In the collapsed table the occupational groups
are unaffected but the Protestants and Roman Catholics are combined into a single row. The other
rows remain the same; in this case the other rows consist only of the Jewish row. As alluded to
above, rather than picking a group of rows to form the partitioning, we could select a group of
columns.
Lancaster–Irwin partitioning is by no means a unique process. There are as many ways to parti-
tion a table as there are ways to pick a group of rows or columns. In Table 5.10 we made a particular
selection based on the residual analysis of these data from the previous section. The main feature we
discovered in the residual analysis was that the Jewish group seemed to be different from the other
two groups. Thus it seemed to be of interest to compare the Jewish group with a combination of the
others and then to investigate what differences there might be among the other religious groups. The
partitioning of Table 5.10 addresses precisely these questions. In the remainder of our discussion
we assume that multinomial sampling is valid so that we have tests of the null hypothesis and not
just the null model.
Tables 5.11 and 5.12 provide statistics for the analysis of the reduced table and collapsed table.
The reduced table simply reconfirms our previous conclusions. The X² value of 12.3 indicates
substantial evidence of a difference between Protestants and Roman Catholics. The percentage point
χ²(0.995, 3) = 12.84 indicates that the one-sided P value for the test is a bit greater than 0.005. The
residuals indicate that the difference was due almost entirely to the fact that Roman Catholics have
relatively higher representation among skilled workers. (Or equivalently, that Protestants have
relatively lower representation among skilled workers.) Overrepresentation of Roman Catholics among
skilled workers forces their underrepresentation among other occupational groups, but the level of
underrepresentation in the other groups was approximately constant, as indicated by the approximately
equal residuals for Roman Catholics in the other three occupation groups. We will see later
that for Roman Catholics in the other three occupation groups, their distribution among those groups
was almost the same as that for Protestants. This reinforces the interpretation that the difference
was due almost entirely to the difference in the skilled group.
The conclusions that can be reached from the collapsed table are also similar to those drawn
in the previous section. The X² value of 47.5 on 3 degrees of freedom indicates overwhelming
evidence that the Jewish group was different from the combined Protestant–Roman Catholic group.
The residuals can be used to isolate the sources of the differences. The two groups differed in
proportions of skilled workers and proportions of owners, managers, and officials. There was a
substantial difference in the proportions of professionals. There was almost no difference in the
proportion of clerical and sales workers between the Jewish group and the others.
The X² value computed for Table 5.6 was 60.0. The X² value for the collapsed table is 47.5 and
the X² value for the reduced table is 12.3. Note that 60.0 ≐ 59.8 = 47.5 + 12.3. It is not by chance
that the sum of the X² values for the collapsed and reduced tables is approximately equal to the X²
value for the original table. In fact, this relationship is a primary reason for using the Lancaster–
Irwin partitioning method. The approximate equality 60.0 ≐ 59.8 = 47.5 + 12.3 indicates that the
vast bulk of the differences between the three religious groups is due to the collapsed table, i.e., the
difference between the Jewish group and the other two. Roughly 80% (47.5/60) of the original X²
value is due to the difference between the Jewish group and the others. Of course the X² value of 12.3
for the reduced table is still large enough to strongly suggest differences between Protestants and
Roman Catholics.
Not all data will yield an approximation as close as 60.0 ≐ 59.8 = 47.5 + 12.3 for the partitioning.
The fact that we have an approximate equality rather than an exact equality is due to our choice
of the test statistic X². Pearson’s statistic is simple and intuitive; it compares observed values with
expected values and standardizes by the size of the expected value. An alternative test statistic also
exists, the likelihood ratio test statistic,
\[ G^2 = 2 \sum_{ij} O_{ij} \log\left( O_{ij} / \hat{E}_{ij} \right). \]
The motivation behind the likelihood ratio test statistic is not as transparent as that behind Pearson’s
statistic, so we will not discuss the likelihood ratio test statistic in any detail until later. However,
one advantage of the likelihood ratio test statistic is that the sum of its values for the reduced table
and collapsed table gives exactly the likelihood ratio test statistic for the original table. Likelihood
ratio statistics will be used extensively in Chapters 20 and 21, and in Chapter 21, Lancaster-Irwin
partitioning will be revisited.
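The difference between approximate and exact additivity can be seen numerically. The sketch below is illustrative only: the 3 × 4 counts are made up for the demonstration (they are NOT the data of Table 5.6), the function names are hypothetical, and the likelihood ratio statistic is taken to be the usual G² = 2 ∑ O log(O/Ê), which assumes no zero counts.

```r
# Hypothetical 3 x 4 table of counts (made up for illustration; NOT Table 5.6)
tab <- matrix(c(20, 30, 25, 25,
                15, 25, 20, 40,
                30, 10, 20, 40),
              nrow = 3, byrow = TRUE,
              dimnames = list(group = c("g1", "g2", "g3"),
                              category = c("A", "B", "C", "D")))

# Estimated expected counts under the null model: (row total)(column total)/n
expected <- function(m) outer(rowSums(m), colSums(m)) / sum(m)

# Pearson statistic and likelihood ratio statistic (assumes no zero counts)
pearson_X2 <- function(m) sum((m - expected(m))^2 / expected(m))
lr_G2 <- function(m) 2 * sum(m * log(m / expected(m)))

# Partition on the subgroup of rows g1 and g2
reduced <- tab[c("g1", "g2"), ]
collapsed <- rbind(g1_g2 = colSums(reduced), g3 = tab["g3", ])

# Pearson: the pieces add up only approximately
print(c(full = pearson_X2(tab),
        sum_of_parts = pearson_X2(reduced) + pearson_X2(collapsed)))

# Likelihood ratio: the pieces add up exactly (up to rounding error)
print(c(full = lr_G2(tab),
        sum_of_parts = lr_G2(reduced) + lr_G2(collapsed)))
```

Swapping in any other table of positive counts and any other subgroup of rows should leave the likelihood ratio identity intact, while the size of the Pearson discrepancy will vary from table to table.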
Further partitioning
We began this section with the 3 × 4 data of Table 5.6 that has 6 degrees of freedom for its X² test.
We partitioned the data into two 2 × 4 tables, each with 3 degrees of freedom. We can continue to
use the Lancaster–Irwin method to partition the reduced and collapsed tables given in Table 5.10.
The process of partitioning previously partitioned tables can be continued until the original table
is broken into a collection of 2 × 2 tables. Each 2 × 2 table has one degree of freedom for its
chi-squared test, so partitioning provides a way of breaking a large table into one degree of freedom
components.
What we have been calling the reduced table involves all four occupational groups along with
the two religious groups Protestant and Roman Catholic. The table was given in both Table 5.10 and
Table 5.11. We now consider this table further. It was discussed earlier that the difference between
Protestants and Roman Catholics can be ascribed almost entirely to the difference in the proportion
of skilled workers in the two groups. To explore this we choose a new partition based on a group of
columns that includes all occupations other than the skilled workers. Thus we get the ‘reduced’ table
in Table 5.13 with occupations A, B, and C and the ‘collapsed’ table in Table 5.14 with occupation
D compared to the accumulation of the other three.
Table 5.13 allows us to examine the proportions of Protestants and Catholics in the occupational
groups A, B, and C. We are not investigating whether Catholics were more or less likely than
Protestants to enter these occupational groups; we are examining their distribution within the groups.
The analysis is based only on those individuals that were in this collection of three occupational
groups. The X² value is exceptionally small, only 0.065. There is no evidence of any difference
between Protestants and Catholics for these three occupational groups.
Table 5.13 is a 2 × 3 table. We could partition it again into two 2 × 2 tables but there is little
point in doing so. We have already established that there is no evidence of differences.
Table 5.14 has the three occupational groups A, B, and C collapsed into a single group. This
table allows us to investigate whether Catholics were more or less likely than Protestants to enter
this group of three occupations. The X² value is a substantial 12.2 on one degree of freedom, so we
can tentatively conclude that there was a difference between Protestants and Catholics. From the
residuals, we see that among people in the four occupational groups, Catholics were more likely
than Protestants to be in the skilled group and less likely to be in the other three.
Table 5.14 is a 2 × 2 table so no further partitioning is possible. Note again that the X² of 12.3
from Table 5.11 is approximately equal to the sum of the 0.065 from Table 5.13 and the 12.2 from
Table 5.14.
Finally, we consider additional partitioning of the collapsed table given in Tables 5.10 and 5.12.
It was noticed earlier that the Jewish group seemed to differ from Protestants and Catholics in every
occupational group except C, clerical and sales. Thus we choose a partitioning that isolates group
C. Table 5.15 gives a collapsed table that compares C to the combination of groups A, B, and D.
Table 5.16 gives a reduced table that involves only occupational groups A, B, and D.
Table 5.15 demonstrates no difference between the Jewish group and the combined Protestant–
Catholic group. Thus the proportion of people in clerical and sales was about the same for the
Jewish group as for the combined Protestant and Roman Catholic group. Any differences between
the Jewish and Protestant–Catholic groups must be in the proportions of people within the three
occupational groups A, B, and D.
Table 5.16 demonstrates major differences between occupations A, B, and D for the Jewish
group and the combined Protestant–Catholic group. As seen earlier and reconfirmed here, skilled
workers had much lower representation among the Jewish group, while professionals and especially
owners, managers, and officials had much higher representation among the Jewish group.
Table 5.16 can be further partitioned into Tables 5.17 and 5.18. Table 5.17 is a reduced 2 × 2
table that considers the difference between the Jewish group and others with respect to occupational
groups B and D. Table 5.18 is a 2 × 2 collapsed table that compares occupational group A with the
combination of groups B and D.
Table 5.17 shows a major difference between occupational groups B and D. Table 5.18 may or
may not show a difference between group A and the combination of groups B and D. The X² values
are 46.8 and 5.45, respectively. The question is whether an X² value of 5.45 is suggestive of a
difference between religious groups when we have examined the data in order to choose the partitions
of Table 5.6. Note that the two X² values sum to 52.25, whereas the X² value for Table 5.16, from
which they were constructed, is only 47.2. Here the approximate equality is very rough.
Nonetheless, we see from the relative sizes of the two X² values that the majority of the difference
between the Jewish group and the other religious groups was in the proportion of owners, managers,
and officials as compared to the proportion of skilled workers.
Ultimately, we have partitioned Table 5.6 into Tables 5.13, 5.14, 5.15, 5.17, and 5.18. These
are all 2 × 2 tables except for Table 5.13. We could also have partitioned Table 5.13 into two 2 ×
2 tables but we chose to leave it because it showed so little evidence of any difference between
Protestants and Roman Catholics for the three occupational groups considered. The X² value of
60.0 for Table 5.6 was approximately partitioned into X² values of 0.065, 12.2, 0.01, 46.8, and 5.45,
respectively. Except for the 0.065 from Table 5.13, each of these values is computed from a 2 × 2
table, so each has 1 degree of freedom. The 0.065 is computed from a 2 × 3 table, so it has 2 degrees
of freedom. The sum of the five X² values is 64.5, which is roughly equal to the 60.0 from Table 5.6.
The five X² values can all be used in testing, but because we let the data suggest the partitions, it is
inappropriate to compare these X² values to their usual χ² percentage points to obtain tests. A simple
way to adjust for both the multiple testing and the data dredging (letting the data suggest partitions)
is to compare all X² values to the percentage points appropriate for Table 5.6. For example, the
α = 0.05 test for Table 5.6 uses the critical value χ²(0.95, 6) = 12.59. By this standard, Table 5.17
with X² = 46.8 shows a significant difference between religious groups and Table 5.14 with X² =
12.2 nearly shows a significant difference between religious groups. The value of X² = 5.45 for
Table 5.18 gives no evidence of a difference based on this criterion even though such a value would
be highly suggestive if we could compare it to a χ²(1) distribution. This method is similar in spirit to
Scheffé’s method to be considered in Section 13.4 and suffers from the same extreme conservatism.
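The conservative comparison described above is easy to reproduce. This is a minimal sketch in R using only the five X² values quoted in the text; the vector and variable names are illustrative.

```r
# X^2 values quoted in the text, labeled by the table they come from
x2 <- c("5.13" = 0.065, "5.14" = 12.2, "5.15" = 0.01, "5.17" = 46.8, "5.18" = 5.45)

# Conservative critical value: the 0.05 level test for the full Table 5.6 (6 df)
crit_full <- qchisq(0.95, df = 6)

# Which partition tables exceed the conservative critical value?
exceeds <- x2 > crit_full
print(round(crit_full, 2))
print(exceeds)

# For contrast only: the naive P value for Table 5.18 against chi-squared(1),
# which ignores that the data suggested the partition
p_naive_518 <- pchisq(5.45, df = 1, lower.tail = FALSE)
print(p_naive_518)
```

Only the value from Table 5.17 exceeds the conservative critical value, while the naive one degree of freedom P value for Table 5.18 is roughly 0.02, which illustrates how much evidence the adjustment discounts.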
5.7 Exercises
EXERCISE 5.7.1. Reiss et al. (1975) and Fienberg (1980) reported that 29 of 52 virgin female
undergraduate university students who used a birth control clinic thought that extramarital sex is not
always wrong. Give a 99% confidence interval for the population proportion of virgin undergraduate
university females who use a birth control clinic and think that extramarital sex is not always wrong.
In addition, 67 of 90 virgin females who did not use the clinic thought that extramarital sex is
not always wrong. Give a 99% confidence interval for the difference in proportions between the two
groups and give a 0.05 level test that there is no difference.
EXERCISE 5.7.2. In France in 1827, 6929 people were accused in the courts of assize and 4236
were convicted. In 1828, 7396 people were accused and 4551 were convicted. Give a 95% confidence
interval for the proportion of people convicted in 1827. At the 0.01 level, test whether
the conviction rate in 1827 was different from 2/3. Does the result of the test depend on the
choice of standard error? Give a 95% confidence interval for the difference in conviction rates
between the two years. Test the hypothesis of no difference in conviction rates using α = 0.05 and
both standard errors.
EXERCISE 5.7.3. Pauling (1971) reports data on the incidence of colds among French skiers
who were given either ascorbic acid or a placebo. Of 139 people given ascorbic acid, 17 developed
colds. Of 140 people given the placebo, 31 developed colds. Do these data suggest that the
proportion of people who get colds differs depending on whether they are given ascorbic acid?
EXERCISE 5.7.4. Use the data in Table 5.2 to test whether the probability of a birth in each month
is the number of days in the month divided by 365. Thus the null probability for January is 31/365
and the null probability for February is 28/365.
EXERCISE 5.7.5. Snedecor and Cochran (1967) report data from an unpublished report by E.
W. Lindstrom. The data concern the results of cross-breeding two types of corn (maize). In 1301
crosses of two types of plants, 773 green, 231 golden, 238 green-golden, and 59 golden-green-
striped plants were obtained. If the inheritance of these properties is particularly simple, Mendelian
genetics suggests that the probabilities for the four types of corn may be 9/16, 3/16, 3/16, and 1/16,
respectively. Test whether these probabilities are appropriate. If they are inappropriate, identify the
problem.
EXERCISE 5.7.6. Quetelet (1842) and Stigler (1986, p. 175) report data on conviction rates in
the French Courts of Assize (Law Courts) from 1825 to 1830. The data are given in Table 5.19. Test
whether the conviction rate is the same for each year. Use α = 0.05. (Hint: Table 5.19 is written in
a nonstandard form. You need to modify it before applying the methods of this chapter.) If there are
differences in conviction rates, use residuals to explore these differences.
EXERCISE 5.7.7. Table 5.20 contains additional data from Lazerwitz (1961). These consist of a
breakdown of the Protestants in Table 5.6 but with the addition of four more occupational categories.
The additional categories are E, semiskilled; F, unskilled; G, farmers; H, no occupation. Analyze
the data with an emphasis on partitioning the table.
EXERCISE 5.7.8. Stigler (1986, p. 208) reports data from the Edinburgh Medical and Surgical
Journal (1817) on the relationship between heights and chest circumferences for Scottish militia
men. Measurements were made in inches. We concern ourselves with two groups of men, those
with 39-inch chests and those with 40-inch chests. The data are given in Table 5.21. Test whether
the distribution of heights is the same for these two groups.
Chapter 6
Simple Linear Regression
This chapter examines data that come as pairs of numbers, say (x, y), and the problem of fitting a
line to them. More generally, it examines the problem of predicting one variable (y) from values of
another variable (x). Consider for the moment the popular wisdom that people who read a lot tend
to have large vocabularies and poor eyes. Thus, reading causes both conditions: large vocabularies
and poor eyes. If this is true, it may be possible to predict the size of someone’s vocabulary from the
condition of their eyes. Of course this does not mean that having poor eyes causes large vocabularies.
Quite the contrary, if anything, poor eyes probably keep people from reading and thus cause small
vocabularies. Regression analysis is concerned with predictive ability, not with causation.
Section 6.1 of this chapter introduces an example along with many of the basic ideas and meth-
ods of simple linear regression (SLR). The rest of the chapter goes into the details of simple linear
regression. Section 6.7 deals with an idea closely related to simple linear regression: the correla-
tion between two variables. Section 6.9 provides an initial introduction to multiple regression, i.e.,
regression with more than one predictor variable.
6.1 An example
Data from The Coleman Report were reproduced in Mosteller and Tukey (1977). The data were col-
lected from schools in the New England and Mid-Atlantic states of the USA. For now we consider
only two variables: y—the mean verbal test score for sixth graders, and x—a composite measure of
socioeconomic status. The data are presented in Table 6.1.
Figure 6.1 contains a scatter plot of the data. Note the rough linear relationship. The higher the
composite socioeconomic status variable, the higher the mean verbal test score. However, there is
considerable error in the relationship. By no means do the points lie exactly on a straight line.
[Figure 6.1: Scatter plot of y (vertical axis) versus x (horizontal axis) for the Coleman Report data.]
We assume a basic linear relationship between the ys and xs, something like y = β0 + β1 x. Here
β1 is the slope of the line and β0 is the intercept. Unfortunately, the observed y values do not fit
exactly on a line, so y = β0 + β1 x is only an approximation. We need to modify this equation to
allow for the variability of the observations about the line. We do this by building a random error
term into the linear relationship. Write the relationship as y = β0 + β1 x + ε, where ε indicates the
random error. In this model for the behavior of the data, ε accounts for the deviations between the
y values we actually observe and the line β0 + β1 x where we expect to observe any y value that
corresponds to x. As we are interested in predicting y from known x values, we treat x as a known
(nonrandom) variable.
We assume that the relationship y = β0 + β1 x + ε applies to all of our observations. For the
current data, that means we assume the relationship holds for all of the 20 pairs of values in Table 6.1.
This assumption is stated as the simple linear regression model for these data,
\[ y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \tag{6.1.1} \]
i = 1, . . . , 20. For this model to be useful, we need to make some assumptions about the errors, the
εi s. The standard assumptions are that the εi s are independent N(0, σ²) random variables.
Given data for which these assumptions are reasonable, we can estimate the unknown parameters.
Although we assume a linear relationship between the ys and xs, the model does not assume that
we know the slope β1 or the intercept β0 . Together, these unknown parameters would tell us the
exact nature of the linear relationship but both need to be estimated. We use the notation β̂1 and
β̂0 to denote estimates of β1 and β0 , respectively. To perform statistical inferences we also need
to estimate the variance of the errors, σ 2 . Note that σ 2 is also the variance of the y observations
because none of β0 , β1 , and x are random.
The simple linear regression model involves many assumptions. It assumes that the relationship
between y and x is linear, it assumes that the errors are normally distributed, it assumes that the
errors all have the same variance, it assumes that the errors are all independent, and it assumes that
the errors all have mean 0. This last assumption is redundant. It turns out that the errors all have
mean 0 if and only if the relationship between y and x is linear. As far as possible, we will want
to verify (validate) that these assumptions are reasonable before we put much faith in the estimates
and statistical inferences that can be obtained from simple linear regression. Chapters 7 and 8 deal
with checking these assumptions.
Before getting into a detailed discussion of simple linear regression, we illustrate some high-
lights using the Coleman Report data. We need to fit Model (6.1.1) to the data. A computer program
typically yields parameter estimates, standard errors for the estimates, t ratios for testing whether
the parameters are zero, P values for the tests, and an analysis of variance table. These results are
often displayed as illustrated below. We refer to them as the table of coefficients and the analysis of
variance (ANOVA) table, respectively.
Table of Coefficients
Predictor β̂k SE(β̂k ) t P
Constant 33.3228 0.5280 63.11 0.000
x 0.56033 0.05337 10.50 0.000
Analysis of Variance
Source df SS MS F P
Regression 1 552.68 552.68 110.23 0.000
Error 18 90.25 5.01
Total 19 642.92
Much can be learned from these two tables of statistics. The estimated regression equation is
ŷ = 33.3 + 0.560x.
This equation allows us to predict a value for y when the value of x is given. In particular, for these
data an increase of one unit in socioeconomic status increases our prediction of mean verbal test
scores by about 0.56 units. This is not to say that some program to increase socioeconomic statuses
by one unit will increase mean verbal test scores by about 0.56 units. The 0.56 describes the current
data; it does not imply a causal relationship. If we want to predict the mean verbal test score for a
school that is very similar to the ones in this study, this equation should give good predictions. If
we want to predict the mean verbal test score for a school that is very different from the ones in this
study, this equation is likely to give poor predictions. In fact, if we collect new data from schools
with very different socioeconomic statuses, the data are not similar to these, so this fitted model
would be highly questionable if applied in the new situation. Nevertheless, a simple linear regression
model with a different intercept and slope might fit the new data well. Similarly, data collected after
a successful program to raise socioeconomic statuses are unlikely to be similar to the data collected
before such a program. The relationship between socioeconomic status and mean verbal test scores
may be changed by such a program. In particular, the things causing both socioeconomic status
and mean verbal test score may be changed in unknown ways by such a program. These are crucial
points and bear repeating. The regression equation describes an observed relationship between mean
verbal test scores and socioeconomic status. It can be used to predict mean verbal test scores from
socioeconomic status in similar situations. It does not imply that changing the socioeconomic status
a fixed amount will cause the mean verbal test scores to change by a proportional amount.
The value 0.000 indicates a large amount of evidence against the null model. If we are convinced that
all the assumptions of the simple linear regression model are correct, then the only thing that could
be wrong with the null model is that β1 = 0. Note that if β1 = 0, the linear relationship becomes
y = β0 + ε , so there is no relationship between y and x, i.e., y does not depend on x. The small P
value indicates that the slope is not zero and thus the variable x helps to explain the variable y.
The table of coefficients also allows us to compute a variety of other t statistics. For example, if
we wanted to test H0 : β1 = 1, the test statistic is
\[ t_{obs} = \frac{0.56033 - 1}{0.05337} = -8.24. \]
The significance level of the test is the P value,
\[ P = \Pr\left[ |t(18)| \geq 8.24 \right] \doteq 0.000. \]
Alternatively, we could compute the 95% confidence interval for β1 , which has endpoints
\[ \hat{\beta}_1 \pm t(0.975, dfE)\, \mathrm{SE}(\hat{\beta}_1). \]
From a t table, t(0.975, 18) = 2.101, so, using the tabled statistics, the endpoints are
\[ 0.56033 \pm 2.101(0.05337). \]
The confidence interval is (0.448, 0.672), so values of the slope β1 between 0.448 and 0.672 are
consistent with the data and the model based on a 0.05 level test.
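The t statistic and confidence interval above can be reproduced from the table of coefficients alone. The following is a minimal sketch in R; the numbers are copied from the table in the text and the variable names are illustrative.

```r
# Values copied from the table of coefficients in the text
b1 <- 0.56033      # estimated slope
se_b1 <- 0.05337   # standard error of the slope
dfE <- 18          # degrees of freedom for error

# t statistic for H0: beta1 = 1 and its two-sided P value
t_obs <- (b1 - 1) / se_b1
p_value <- 2 * pt(abs(t_obs), df = dfE, lower.tail = FALSE)

# 95% confidence interval for the slope
t_crit <- qt(0.975, df = dfE)
ci <- b1 + c(-1, 1) * t_crit * se_b1

print(round(t_obs, 2))
print(p_value)
print(round(ci, 3))
```

The same pattern, (estimate − hypothesized value)/SE for a test and estimate ± percentile × SE for an interval, applies to the intercept by substituting its row of the table.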
Consider the problem of estimating the value of the line at x = −16.04. This value of x is the
minimum observed value for socioeconomic status, so it is somewhat dissimilar to the other x values
in the data. Its dissimilarity causes there to be substantial variability in estimating the regression line
(expected value of y) at this point. The point on the line is β0 + β1(−16.04) and the estimator is
\[ \hat{\beta}_0 + \hat{\beta}_1(-16.04) = 33.3228 + 0.56033(-16.04) = 24.34. \]
For constructing 95% t intervals, the percentile needed is t(0.975, 18) = 2.101. The standard error
for the estimate of the point on the line is usually available from computer programs (cf.
Section 6.10); in this example it is SE(Line) = 1.140. The 95% confidence interval for the point on the
line β0 + β1 (−16.04) has endpoints
\[ 24.34 \pm 2.101(1.140), \]
which gives the interval (21.9, 26.7). Values of the population mean of the schoolwise mean verbal
test scores for New England and Mid-Atlantic sixth graders with a school socioeconomic measure
of −16.04 that are consistent with the data are those between 21.9 and 26.7.
The prediction ŷ for a new observation with x = −16.04 is simply the estimated point on the
line
ŷ = β̂0 + β̂1(−16.04) = 24.34.
Prediction of a new observation is subject to more error than estimation of a point on the line. A new
observation has the same variance as all other observations, so the prediction interval must account
for this variance as well as for the variance of estimating the point on the line. The standard error
for the prediction interval is computed as
\[ \mathrm{SE}(\mathrm{Prediction}) = \sqrt{MSE + \mathrm{SE}(\mathrm{Line})^2}. \tag{6.1.2} \]
In this example,
\[ \mathrm{SE}(\mathrm{Prediction}) = \sqrt{5.01 + (1.140)^2} = 2.512. \]
The prediction interval endpoints are
\[ 24.34 \pm 2.101(2.512), \]
and the 95% prediction interval is (19.1, 29.6). Future values of sixth graders’ mean verbal test
scores that are consistent with the data and model are those between 19.1 and 29.6. These are for a
different New England or Mid-Atlantic school with a socioeconomic measure of −16.04. Note that
the prediction interval is considerably wider than the corresponding confidence interval. Note also
that this is just another special case of the prediction theory in Section 3.7. As such, these results
are analogous to those obtained for the one-sample and two-sample data structures.
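The interval calculations at x = −16.04 can be verified by hand from the reported summary statistics. This is a minimal sketch in R using only numbers quoted in the text; the variable names are illustrative, and the square root follows equation (6.1.2).

```r
# Values copied from the text
b0 <- 33.3228      # estimated intercept
b1 <- 0.56033      # estimated slope
mse <- 5.01        # mean squared error
se_line <- 1.140   # standard error of the estimated point on the line
x_new <- -16.04
t_crit <- qt(0.975, df = 18)

# Estimated point on the line, which is also the prediction
y_hat <- b0 + b1 * x_new

# Standard error for prediction, equation (6.1.2)
se_pred <- sqrt(mse + se_line^2)

# 95% confidence interval for the point on the line
conf_int <- y_hat + c(-1, 1) * t_crit * se_line

# 95% prediction interval for a new observation
pred_int <- y_hat + c(-1, 1) * t_crit * se_pred

print(round(c(y_hat = y_hat, se_pred = se_pred), 3))
print(round(conf_int, 1))
print(round(pred_int, 1))
```

The prediction interval is wider than the confidence interval because se_pred adds the variance of a new observation (estimated by MSE) to the variance of estimating the line.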
# coleman.slr is assumed to have been created with the read.table command,
# with columns named School, x, and y.
attach(coleman.slr)
coleman.slr

# Summary tables
cr <- lm(y ~ x)
crp <- summary(cr)
crp
anova(cr)

# Prediction: confidence and prediction intervals at x = -16.04
new <- data.frame(x = c(-16.04))
predict(cr, new, se.fit = TRUE, interval = "confidence")
predict(cr, new, interval = "prediction")

# Residual plots
plot(fitted(cr), rstandard(cr), xlab = "Fitted",
     ylab = "Standardized residuals", main = "Residual-Fitted plot")
plot(x, rstandard(cr), xlab = "x", ylab = "Standardized residuals",
     main = "Residual-Socio plot")

# Leverage plot
Leverage <- hatvalues(cr)
plot(School, Leverage, main = "School-Leverage plot")

# Wilk-Francia statistic
rankit <- qnorm(ppoints(rstandard(cr), a = 3/8))
ys <- sort(rstandard(cr))
Wprime <- (cor(rankit, ys))^2
Wprime
The vast majority of the analyses we will run can be computed by changing the (two lines of the)
read.table command to enter the appropriate data, and changing the cr <- lm(y ~ x) com-
mand to allow for fitting an appropriate model.
Formulae for the estimates are given in Section 6.10. These estimates provide fitted (predicted)
values ŷi = β̂0 + β̂1 xi and residuals ε̂i = yi − ŷi , i = 1, . . . , n. The sum of squares error is
\[ SSE = \sum_{i=1}^{n} \hat{\varepsilon}_i^2. \]
The model involves two parameters for the mean values E(yi ), namely β0 and β1 , so the degrees of
freedom for error are
\[ dfE = n - 2. \]
Our estimate of the variance σ² is the mean squared error defined as
\[ MSE = \frac{SSE}{dfE}. \]
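As a quick check of these definitions, the mean squared error for the Coleman Report data can be recomputed from the sum of squares error reported in the analysis of variance table of Section 6.1. This is a minimal sketch in R; the numbers are copied from that table and the variable names are illustrative.

```r
# Values copied from the analysis of variance table for the Coleman Report data
n <- 20        # number of schools
sse <- 90.25   # sum of squares error

# Two parameters (beta0 and beta1) are used to model the mean values
dfE <- n - 2

# Mean squared error: the estimate of sigma^2
mse <- sse / dfE

print(dfE)
print(round(mse, 2))
```

With a fitted model object from lm(), the same quantity should be available as sum(residuals(fit)^2) / df.residual(fit), where fit is a hypothetical name for the fitted object.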
Formulae for computing the least squares estimates are given in Section 6.10. The least squares
estimates and the mean squared error are unbiased in the sense that their expected values are the
parameters they estimate.
Proposition 6.2.1. E β̂1 = β1 , E β̂0 = β0 , and E(MSE) = σ 2 .
Proofs of the unbiasedness of the slope and intercept estimates are given in the appendix that appears
at the end of Section 6.10.
We briefly mention the standard optimality properties of least squares estimates but for a detailed
discussion see Christensen (2011, Chapter 2). Assuming that the errors have independent normal
distributions, the estimates β̂0 , β̂1 , and MSE have the smallest variance of any unbiased estimates.
The least squares estimates β̂0 and β̂1 are also maximum likelihood estimates. Maximum likelihood
estimates are those values of the parameters that are most likely to generate the data that were
actually observed. Without assuming that the errors are normally distributed, β̂0 and β̂1 have the
smallest variance of any unbiased estimates that are linear functions of the y observations. (Linear
functions allow multiplying the yi s by constants and adding terms together. Remember, the xi s are
constants, as are any functions of the xi s.) Note that with this weaker assumption, i.e., giving up
normality, we get a weaker result, minimum variance among only linear unbiased estimates instead
of all unbiased estimates. To summarize, under the standard assumptions, least squares estimates of
the regression parameters are best (minimum variance) linear unbiased estimates (BLUEs), and for
normally distributed data they are minimum variance unbiased estimates and maximum likelihood
estimates.
The coefficient of determination, R2 , measures the predictive ability of the model. When we
discuss sample correlations in Section 6.7, we will define R2 as the squared correlation between the
pairs (ŷi , yi ). Alternatively, R2 can be computed from the ANOVA table as
$$R^2 \equiv \frac{SSReg}{SSTot}.$$
As such, it measures the percentage of the total variability in y that is explained by the x variable. In
our example,
$$R^2 = \frac{552.68}{642.92} = 86.0\%,$$
so 86.0% of the total variability is explained by the regression model. This is a large percentage, so
it appears that the x variable has substantial predictive power. However, a large R2 does not imply
that the model is good in absolute terms. It may be possible to show that this model does not fit
the data adequately. In other words, while this model is explaining much of the variability, we may
be able to establish that it is not explaining as much of the variability as it ought. (Example 7.2.2
involves a model with a high R2 that is demonstrably inadequate.) Conversely, a model with a
low R2 may be the perfect model but the data may simply have a great deal of variability. For
example, if you have temperature measurements obtained by having someone walk outdoors and
guess the Celsius temperature and then use the true Fahrenheit temperatures as a predictor, the exact
linear relationship between Celsius and Fahrenheit temperatures may make a line the ideal model.
Nonetheless, the obvious inaccuracy involved in people guessing Celsius temperatures may cause a
low R². Moreover, even a high R² of 86% may provide inadequate predictions for the purposes of
the study, while in other situations an R² of, say, 14% may be perfectly adequate. It depends on the
purpose of the study. Finally, it must be recognized that a large R² may be an unrepeatable artifact
of a particular data set. The coefficient of determination is a useful tool but it must be used with care.
In particular, it is a much better measure of the predictive ability of a model than of the correctness
of a model.
6.3 The analysis of variance table
A standard tool in regression analysis is the construction of an analysis of variance table. Tables 6.2
and 6.3 give alternative forms, both based on the sample mean, the fitted values and the data.
Source            df       SS                        MS            F
Regression(β1)    1        ∑_{i=1}^n (ŷi − ȳ·)²      SSReg         MSReg/MSE
Error             n − 2    ∑_{i=1}^n (yi − ŷi)²      SSE/(n − 2)
Total             n − 1    ∑_{i=1}^n (yi − ȳ·)²
The best form is given in Table 6.2. In this form there is one degree of freedom for every obser-
vation, cf. the total line, and the sum of squares total is the sum of all of the squared observations.
The degrees of freedom and sums of squares for intercept, regression, and error can be added to
obtain the degrees of freedom and sums of squares total. We see that one degree of freedom is used
to estimate the intercept, one is used for the slope, and the rest are used to estimate the variance.
The more commonly used form for the analysis of variance table is given as Table 6.3. It elimi-
nates the line for the intercept and corrects the total line so that the degrees of freedom and sums of
squares still add up.
Note that
$$\sum_{i=1}^{n} (y_i - \bar{y}_\cdot)^2 = \sum_{i=1}^{n} y_i^2 - C = SSTot - C.$$
We now repeat the testing procedures of Section 6.1 using the model-based approach of Section 3.1.
The simple linear regression model is
yi = β0 + β1xi + εi . (6.4.1)
142 6. SIMPLE LINEAR REGRESSION
This will be the full model for our tests, thus SSE(Full) = 90.25, dfE(Full) = 18, and
MSE(Full) = 5.01 are all as reported in the “Error” line of the simple linear regression ANOVA
table.
The model-based test of H0 : β1 = 0 is precisely the F test provided by the ANOVA table. To
find the reduced model, we need to incorporate β1 = 0 into Model (6.4.1). The reduced model is
yi = β0 + 0xi + εi or
yi = β0 + εi .
This is just the model for a one-sample problem, so MSE(Red.) = $s_y^2$, the sample variance of
the yi s, dfE(Red.) = n − 1 = dfTot − C, and SSE(Red.) = (n − 1)$s_y^2$ = SSTot − C from the ANOVA
table. To obtain the F statistic, compute
$$MSTest \equiv \frac{SSE(Red.) - SSE(Full)}{dfE(Red.) - dfE(Full)} = \frac{642.92 - 90.25}{19 - 18} = 552.68$$
and
$$F = \frac{MSTest}{MSE(Full)} = \frac{552.68}{5.01} = 110.23$$
as discussed earlier. The value 110.23 is compared to an F(1, 18) distribution.
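The model-based test reduces to two lines of arithmetic on the error sums of squares and degrees of freedom. A minimal sketch, in which the function name `model_test_f` is a hypothetical helper and the inputs are the rounded values quoted above:

```python
def model_test_f(sse_red, dfe_red, sse_full, dfe_full):
    """Return (MSTest, F) for testing a reduced model against a full model."""
    ms_test = (sse_red - sse_full) / (dfe_red - dfe_full)
    mse_full = sse_full / dfe_full
    return ms_test, ms_test / mse_full

# Test of H0: beta1 = 0 using the rounded Coleman Report values from the text.
ms_test, f_stat = model_test_f(sse_red=642.92, dfe_red=19, sse_full=90.25, dfe_full=18)
print(round(ms_test, 2), round(f_stat, 2))
```

Because the inputs are rounded to two decimals, MSTest can differ from the tabled 552.68 in the last digit; the F statistic still agrees with 110.23 to two decimals.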
To test H0 : β1 = 1, we incorporate the null hypothesis into Model (6.4.1) to obtain yi = β0 +
1xi + εi . We now move the completely known term 1xi , known as an offset, to the left side of the
equality and write the reduced model as
yi − xi = β0 + εi . (6.4.2)
Again, this is just the model for a one-sample problem, but now the dependent variable is yi − xi. It
follows that MSE(Red.) = $s_{y-x}^2$ = 22.66, the sample variance of the numbers yi − xi, dfE(Red.) =
n − 1 = 19, and SSE(Red.) = 19$s_{y-x}^2$ = 430.54. To obtain the F statistic, compute
yi = β1 xi + εi . (6.4.3)
This is known as a simple linear regression through the origin. Most computer programs have fitting
an intercept as the default option but one can choose to fit the model without an intercept. Fitting
this model gives a new table of coefficients and ANOVA table.
Table of Coefficients
Predictor β̂k SE(β̂k ) t P
x 1.6295 0.7344 2.22 0.039
6.5 PARAMETRIC INFERENTIAL PROCEDURES 143
Analysis of Variance
Source df SS MS F P
Regression 1 5198 5198 4.92 0.039
Error 19 20061 1056
Total 20 25259
Note that the ANOVA table for regression through the origin is similar to Table 6.2 in that it does
not correct the total line for the intercept. As before, the table of coefficients and the ANOVA table
provide equivalent tests of H0 : β1 = 0, e.g., 4.92 = (2.22)², but the tests are now based on the new
model, (6.4.3), so the tests are quite different from those discussed earlier.
For our purpose of testing H0 : β0 = 0, we only need dfE(Red.) = 19 and SSE(Red.) = 20061
from the Error line of this ANOVA table as well as the results from the full model. To obtain the F
statistic, compute
$$MSTest \equiv \frac{SSE(Red.) - SSE(Full)}{dfE(Red.) - dfE(Full)} = \frac{20061 - 90.25}{19 - 18} = 19970.75$$
and
$$F = \frac{19970.75}{5.01} = 3983 = (63.11)^2.$$
If you check these computations, you will notice some round-off error. MSE(Full) is closer to
5.0139. The value 3983 could be compared to an F(1, 18) distribution but that is hardly necessary
since it is huge. The value 63.11 was reported as the t statistic in Section 6.1.
$$\frac{\hat\beta_1 - \beta_1}{\mathrm{SE}(\hat\beta_1)} \sim t(n - 2).$$
Using standard methods, the 99% confidence interval for β1 has endpoints
$$\hat\beta_1 \pm t(0.995, n - 2)\,\mathrm{SE}(\hat\beta_1).$$
$$\frac{|\hat\beta_1 - 0|}{\mathrm{SE}(\hat\beta_1)} > t(0.975, n - 2).$$
$$\frac{|\hat\beta_1 - 1|}{\mathrm{SE}(\hat\beta_1)} > t(0.975, n - 2).$$
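These procedures need only an estimate, its standard error, and a t percentile. The sketch below is illustrative: the helper names are hypothetical, the slope estimate and standard error are the Coleman Report values from Section 6.10, and the percentiles t(0.995, 18) ≈ 2.878 and t(0.975, 18) ≈ 2.101 are assumed from a standard t table rather than computed.

```python
def t_interval(estimate, std_error, t_quantile):
    """Endpoints estimate -/+ t_quantile * std_error."""
    half_width = t_quantile * std_error
    return estimate - half_width, estimate + half_width

def t_statistic(estimate, hypothesized, std_error):
    """Standardized distance between the estimate and the hypothesized value."""
    return (estimate - hypothesized) / std_error

b1, se_b1 = 0.560325468, 0.05337   # slope and its standard error
T995_18 = 2.878                    # assumed table value of t(0.995, 18)
T975_18 = 2.101                    # assumed table value of t(0.975, 18)

lo, hi = t_interval(b1, se_b1, T995_18)   # 99% confidence interval for beta1
t0 = t_statistic(b1, 0.0, se_b1)          # test of H0: beta1 = 0
t1 = t_statistic(b1, 1.0, se_b1)          # test of H0: beta1 = 1
print(round(lo, 4), round(hi, 4), round(t0, 2), round(t1, 2))
print(abs(t0) > T975_18, abs(t1) > T975_18)  # reject at alpha = 0.05?
```

Squaring the t statistic for β1 = 0 gives, up to round-off, the F statistic of the ANOVA table.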
For inferences about the intercept parameter β0 , formulae for the estimate β̂0 and the standard
error of β̂0 are as given in Section 6.10. The appropriate reference distribution is
$$\frac{\hat\beta_0 - \beta_0}{\mathrm{SE}(\hat\beta_0)} \sim t(n - 2).$$
$$\frac{|\hat\beta_0 - 0|}{\mathrm{SE}(\hat\beta_0)} > t(0.995, n - 2).$$
Typically, inferences about β0 are not of substantial interest. β0 is the intercept; it is the value of
the line when x = 0. Typically, the line is only an approximation to the behavior of the (x, y) pairs
in the neighborhood of the observed data. This approximation is only valid in the neighborhood of
the observed data. If we have not collected data near x = 0, the intercept is describing behavior of
the line outside the range of valid approximation.
We can also draw inferences about a point on the line y = β0 + β1 x. For any fixed point x,
β0 + β1x has an estimate
ŷ ≡ β̂0 + β̂1 x.
To get a standard error for ŷ, we first need its variance. As shown in the appendix at the end of
Section 6.10, the variance of ŷ is
$$\mathrm{Var}(\hat\beta_0 + \hat\beta_1 x) = \sigma^2\left[\frac{1}{n} + \frac{(x - \bar{x}_\cdot)^2}{(n - 1)s_x^2}\right], \qquad (6.5.1)$$
Using standard methods, the 99% confidence interval for (β0 + β1 x) has endpoints
$$\hat\beta_0 + \hat\beta_1 x \pm t(0.995, n - 2)\,\mathrm{SE}(\hat\beta_0 + \hat\beta_1 x).$$
We typically prefer to have small standard errors. Even when σ², and thus MSE, is large, using
Equation (6.5.2) we see that the standard error of ŷ will be small when the number of observations
n is large, when the xi values are well spread out, i.e., $s_x^2$ is large, and when x is close to x̄·. In
other words, the line can be estimated efficiently in the neighborhood of x̄· by collecting a lot
of data. Unfortunately, if we try to estimate the line far from where we collected the data, the
standard error of the estimate gets large. The standard error gets larger as x gets farther away from
the center of the data, x̄·, because the term (x − x̄·)² gets larger. This effect is standardized by the
original observations; the term in question is $(x - \bar{x}_\cdot)^2 / [(n - 1)s_x^2]$, so (x − x̄·)² must be large relative
to $(n - 1)s_x^2$ before a problem develops. In other words, the distance between x and x̄· must be several
times the standard deviation sx before a problem develops. Nonetheless, large standard errors occur
when we try to estimate the line far from where we collected the data. Moreover, the regression line
is typically just an approximation that holds in the neighborhood of where the data were collected.
This approximation is likely to break down for data points far from the original data. So, in addition
to the problem of having large standard errors, estimates far from the neighborhood of the original
data may be totally invalid.
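The growth of the standard error away from x̄· is easy to tabulate. A minimal sketch, assuming the standard error is obtained from (6.5.1) by replacing σ² with MSE and taking the square root; the function name `se_line` and the x values chosen for display are hypothetical, while MSE, n, x̄·, and $s_x^2$ are the Coleman Report values from Section 6.10.

```python
import math

def se_line(mse, n, xbar, sx2, x):
    """Standard error of the estimated point on the line at x."""
    return math.sqrt(mse * (1.0 / n + (x - xbar) ** 2 / ((n - 1) * sx2)))

MSE, N, XBAR, SX2 = 5.014, 20, 3.1405, 92.64798395

# The standard error is smallest at x = xbar and grows with |x - xbar|.
for x in (XBAR, 0.0, 17.0, 40.0):
    print(x, round(se_line(MSE, N, XBAR, SX2, x), 4))
```

At x = 0 the function returns the standard error of the intercept, and at x = x̄· it returns the square root of MSE/n.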
Estimating a point on the line is distinct from prediction of a new observation for a given x
value. Ideally, the prediction would be the true point on the line for the value x. However, the true
line is an unknown quantity, so our prediction is the estimated point on the line at x. The distinction
between prediction and estimating a point on the line arises because a new observation is subject
to variability about the line. In making a prediction we must account for the variability of the new
observation even when the line is known, as well as account for the variability associated with our
need to estimate the line. The new observation is assumed to be independent of the past data, so
the variance of the prediction is σ 2 (the variance of the new observation) plus the variance of the
estimate of the line as given in (6.5.1). The standard error replaces σ² with MSE and takes the
square root, i.e.,
$$\mathrm{SE}(\text{Prediction}) = \sqrt{\mathrm{MSE}\left[1 + \frac{1}{n} + \frac{(x - \bar{x}_\cdot)^2}{(n - 1)s_x^2}\right]}.$$
Note that this is the same as the formula given in Equation (6.1.2). Prediction intervals follow in the
usual way. For example, the 99% prediction interval associated with x has endpoints
ŷ ± t(0.995, n − 2) SE(Prediction).
As discussed earlier, estimation of points on the line should be restricted to x values in the
neighborhood of the original data. For similar reasons, predictions should also be made only in the
neighborhood of the original data. While it is possible, by collecting a lot of data, to estimate the line
well even when the variance σ 2 is large, it is not always possible to get good prediction intervals.
Prediction intervals are subject to the variability of both the observations and the estimate of the
line. The variability of the observations cannot be eliminated or reduced. If this variability is too
large, we may get prediction intervals that are too large to be useful. If the simple linear regression
model is the “truth,” there is nothing to be done, i.e., no way to improve the prediction intervals. If
the simple linear regression model is only an approximation to the true process, a more sophisticated
model may give a better approximation and produce better prediction intervals.
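The limit on prediction accuracy can be seen by letting n grow in the prediction standard error. This is only a sketch under an artificial assumption: MSE, x̄·, and $s_x^2$ are held fixed at the Coleman Report values while n is varied hypothetically, which isolates the effect of the sample size.

```python
import math

def se_prediction(mse, n, xbar, sx2, x):
    """Standard error for predicting a new observation at x."""
    return math.sqrt(mse * (1.0 + 1.0 / n + (x - xbar) ** 2 / ((n - 1) * sx2)))

MSE, XBAR, SX2 = 5.014, 3.1405, 92.64798395
floor = math.sqrt(MSE)  # contribution of the new observation alone

# More data shrinks the part due to estimating the line, never the floor.
for n in (20, 200, 2000, 20000):
    print(n, round(se_prediction(MSE, n, XBAR, SX2, 0.0), 4), round(floor, 4))
```

However large n becomes, the prediction standard error stays above the square root of MSE, which is the numerical counterpart of the statement that the variability of the observations cannot be eliminated.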
are independent. Independence simplifies the computation of variances for regression line estimates.
We will not go further into these claims at this point but the results follow trivially from the matrix
approach to regression that will be treated in Chapter 11.
The key point about the alternative model is that it is equivalent to the original model. The
β1 parameters are the same, as are their estimates and standard errors. The models give the same
predictions, the same ANOVA table F test, and the same R2 . Even the intercept parameters are
equivalent, i.e., they are related in a precise fashion so that knowing about the intercept in either
model yields equivalent information about the intercept in the other model.
6.7 Correlation
The correlation coefficient is a measure of the linear relationship between two variables. The popula-
tion correlation coefficient, usually denoted ρ , was discussed in Chapter 1 along with the population
covariance. The sample covariance between x and y is defined as
$$s_{xy} = \frac{1}{n - 1}\sum_{i=1}^{n} (x_i - \bar{x}_\cdot)(y_i - \bar{y}_\cdot).$$
As with the population correlation, a sample correlation of 0 indicates no linear relationship between
the (xi , yi ) pairs. A sample correlation of 1 indicates a perfect increasing linear relationship. A
sample correlation of −1 indicates a perfect decreasing linear relationship.
The sample correlation coefficient is related to the estimated slope. From Equation (6.10.1) it
will be easily seen that
$$r = \hat\beta_1 \frac{s_x}{s_y}.$$
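The relation between the sample correlation and the slope can be checked on a small data set. In this sketch the five (x, y) pairs are made up for illustration, the sample correlation is taken to be the usual r = s_xy/(s_x s_y), and the slope is computed as s_xy divided by the sample variance of x, which is the least squares slope written in terms of the sample covariance.

```python
import math

def sample_cov(x, y):
    """Sample covariance with divisor n - 1."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    return sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / (n - 1)

# Hypothetical data, not from the text.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]

s_xy = sample_cov(x, y)
s_x = math.sqrt(sample_cov(x, x))   # sample standard deviation of x
s_y = math.sqrt(sample_cov(y, y))   # sample standard deviation of y
r = s_xy / (s_x * s_y)              # sample correlation
slope = s_xy / sample_cov(x, x)     # least squares slope
print(r, slope * s_x / s_y)         # the two numbers agree
```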
Figure 6.2: Correlated data (panels: y2, y3, y4, and y5 versus y1; lower panels titled ρ = −0.5 and ρ = −0.9). The actual sample correlations are r = 0.889, 0.522, −0.388, −0.868, respectively.
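$$y_{ij} = \beta_0 + \beta_1 x_{ij} + \varepsilon_{ij}$$
or equivalently
$$y_k = \beta_0 + \beta_1 x_k + \varepsilon_k,$$
then for males
$$\mu_2 \equiv \mathrm{E}(y_{2j}) = \beta_0 + \beta_1 x_{2j} = \beta_0 + \beta_1 \times 0 = \beta_0$$
and for females
$$\mu_1 \equiv \mathrm{E}(y_{1j}) = \beta_0 + \beta_1 x_{1j} = \beta_0 + \beta_1 \times 1 = \beta_0 + \beta_1.$$
It follows that
$$\mu_1 - \mu_2 = \beta_1.$$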
Fitting the simple linear regression model gives
Table of Coefficients
Predictor β̂k SE(β̂k ) t P
Constant 139.000 6.753 20.58 0.000
x −11.045 8.758 −1.26 0.216
Note that β̂0 is the sample mean for the males and the standard error is the same as that reported
in Section 4.2. The test of H0 : β0 = 0 is not terribly interesting.
School y x1 x2 x3 x4 x5
1 37.01 3.83 28.87 7.20 26.60 6.19
2 26.51 2.89 20.10 −11.71 24.40 5.17
3 36.51 2.86 69.05 12.32 25.70 7.04
4 40.70 2.92 65.40 14.28 25.70 7.10
5 37.10 3.06 29.59 6.31 25.40 6.15
6 33.90 2.07 44.82 6.16 21.60 6.41
7 41.80 2.52 77.37 12.70 24.90 6.86
8 33.40 2.45 24.67 −0.17 25.01 5.78
9 41.01 3.13 65.01 9.85 26.60 6.51
10 37.20 2.44 9.99 −0.05 28.01 5.57
11 23.30 2.09 12.20 −12.86 23.51 5.62
12 35.20 2.52 22.55 0.92 23.60 5.34
13 34.90 2.22 14.30 4.77 24.51 5.80
14 33.10 2.67 31.79 −0.96 25.80 6.19
15 22.70 2.71 11.60 −16.04 25.20 5.62
16 39.70 3.14 68.47 10.62 25.01 6.94
17 31.80 3.54 42.64 2.66 25.01 6.33
18 31.70 2.52 16.70 −10.99 24.80 6.01
19 43.10 2.68 86.27 15.03 25.51 7.51
20 41.01 2.37 76.73 12.77 24.51 6.96
The estimate of β1 is just the difference in the means between the females and the males; up
to round-off error, the standard error and the t statistic are exactly as reported in Section 4.2 for
inferences related to Par = μ1 − μ2 . Moreover, the MSE as reported in the ANOVA table is precisely
the pooled estimate of the variance with the appropriate degrees of freedom.
Analysis of Variance
Source df SS MS F P
Regression 1 1088.1 1088.1 1.59 0.216
Error 35 23943.0 684.1
Total 36 25031.1
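The equivalence between the two-sample problem and regression on a 0-1 indicator can be verified directly. The sketch below uses two small made-up groups rather than the data of Section 4.2; `fit_slr` is a hypothetical helper implementing the least squares formulae of Section 6.10.

```python
def fit_slr(x, y):
    """Least squares intercept, slope, and MSE for y = b0 + b1*x + error."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * yi for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    sse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    return b0, b1, sse / (n - 2)

def mean(v):
    return sum(v) / len(v)

def sum_sq_dev(v):
    m = mean(v)
    return sum((a - m) ** 2 for a in v)

# Hypothetical groups: x = 1 for the first group, x = 0 for the second.
group1 = [4.0, 6.0, 8.0]
group0 = [1.0, 2.0, 3.0, 6.0]
x = [1.0] * len(group1) + [0.0] * len(group0)
y = group1 + group0

b0, b1, mse = fit_slr(x, y)
pooled = (sum_sq_dev(group1) + sum_sq_dev(group0)) / (len(y) - 2)
print(b0, mean(group0))                  # intercept = mean of the x = 0 group
print(b1, mean(group1) - mean(group0))   # slope = difference in group means
print(mse, pooled)                       # MSE = pooled variance estimate
```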
i = 1, . . . , 20, where the εi s are unobservable independent N(0, σ 2 ) random variables and the β s are
fixed unknown parameters. Fitting Model (6.9.1) with a computer program typically yields a table
of coefficients and an analysis of variance table.
6.10 ESTIMATION FORMULAE FOR SIMPLE LINEAR REGRESSION 149
Table of Coefficients
Predictor β̂k SE(β̂k ) t P
Constant 19.95 13.63 1.46 0.165
x1 −1.793 1.233 −1.45 0.168
x2 0.04360 0.05326 0.82 0.427
x3 0.55576 0.09296 5.98 0.000
x4 1.1102 0.4338 2.56 0.023
x5 −1.811 2.027 −0.89 0.387
Analysis of Variance
Source df SS MS F P
Regression 5 582.69 116.54 27.08 0.000
Error 14 60.24 4.30
Total 19 642.92
From just these two tables of statistics much can be learned. As will be illustrated in Chapter 9,
using the single parameter methods of Chapter 3 we can produce a variety of inferential results for
the β j coefficients from the table of coefficients. Moreover, the estimated regression equation is
$$\hat{y} = 19.95 - 1.793x_1 + 0.04360x_2 + 0.55576x_3 + 1.1102x_4 - 1.811x_5,$$
which provides us with both our fitted values and our residuals. The analysis of variance table F test
is a test of the full model (6.9.1) versus the reduced model yi = β0 + εi . In Chapter 9 we will use the
Error line from the ANOVA table to construct other model tests. Similar ideas will be exploited in
the next two chapters for special cases of multiple regression.
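As a small illustration of how the two tables are used, the sketch below computes the fitted value and residual for School 1 from the tabled coefficients and recomputes the mean squares and F statistic from the tabled sums of squares. The coefficients are rounded as printed, so the results are approximate; the variable names are hypothetical.

```python
# Coefficients as printed in the table of coefficients (rounded).
coef = {"const": 19.95, "x1": -1.793, "x2": 0.04360,
        "x3": 0.55576, "x4": 1.1102, "x5": -1.811}

# School 1 from the Coleman Report data table.
school1 = {"x1": 3.83, "x2": 28.87, "x3": 7.20, "x4": 26.60, "x5": 6.19}
y1 = 37.01

fitted = coef["const"] + sum(coef[name] * value for name, value in school1.items())
residual = y1 - fitted
print(round(fitted, 3), round(residual, 3))

# Mean squares are sums of squares divided by their degrees of freedom.
ms_reg = 582.69 / 5
mse = 60.24 / 14
f_stat = ms_reg / mse
print(round(ms_reg, 2), round(mse, 2), round(f_stat, 2))
```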
EXAMPLE 6.10.1. For the simple linear regression on the Coleman Report data,
$$\begin{aligned}
\mathrm{MSE} &= \frac{1}{20 - 2}\left[(20 - 1)33.838125 - (0.560325468)^2 (20 - 1)92.64798395\right] \\
&= \frac{1}{18}\left[642.924375 - 552.6756109\right] \qquad (6.10.2) \\
&= \frac{90.2487641}{18} = 5.01382.
\end{aligned}$$
Up to round-off error, these are the same results as tabled in Section 6.1. ✷
It is not clear that these estimates of β0 , β1 , and σ 2 are even reasonable. The estimate of the slope
β1 seems particularly unintuitive. However, from Proposition 6.2.1, the estimates are unbiased, so
they are at least estimating what we claim that they estimate.
The parameter estimates are unbiased but that alone does not ensure that they are good estimates.
These estimates are the best estimates available in several senses as discussed in Section 6.2.
To draw statistical inferences about the regression parameters, we need standard errors for the
estimates. To find the standard errors we need to know the variance of each estimate.
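Proposition 6.10.1.
$$\mathrm{Var}(\hat\beta_1) = \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar{x}_\cdot)^2} = \frac{\sigma^2}{(n - 1)s_x^2}$$
and
$$\mathrm{Var}(\hat\beta_0) = \sigma^2\left[\frac{1}{n} + \frac{\bar{x}_\cdot^2}{\sum_{i=1}^{n} (x_i - \bar{x}_\cdot)^2}\right] = \sigma^2\left[\frac{1}{n} + \frac{\bar{x}_\cdot^2}{(n - 1)s_x^2}\right].$$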
The proof is given in the appendix at the end of the section. Note that, except for the unknown
parameter σ², the variances can be computed using the same six numbers we used to compute β̂0,
β̂1, and MSE. Using MSE to estimate σ² and taking square roots, we get the standard errors,
$$\mathrm{SE}(\hat\beta_1) = \sqrt{\frac{\mathrm{MSE}}{(n - 1)s_x^2}}$$
and
$$\mathrm{SE}(\hat\beta_0) = \sqrt{\mathrm{MSE}\left[\frac{1}{n} + \frac{\bar{x}_\cdot^2}{(n - 1)s_x^2}\right]}.$$
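EXAMPLE 6.10.2. For the Coleman Report data, using the numbers n, x̄·, and $s_x^2$,
$$\mathrm{Var}(\hat\beta_1) = \frac{\sigma^2}{(20 - 1)92.64798395} = \frac{\sigma^2}{1760.311695}$$
and
$$\mathrm{Var}(\hat\beta_0) = \sigma^2\left[\frac{1}{20} + \frac{3.1405^2}{(20 - 1)92.64798395}\right] = \sigma^2[0.055602837].$$
The MSE is 5.014, so the standard errors are
$$\mathrm{SE}(\hat\beta_1) = \sqrt{\frac{5.014}{1760.311695}} = 0.05337$$
and
$$\mathrm{SE}(\hat\beta_0) = \sqrt{5.014\,[0.055602837]} = 0.5280.$$
✷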
We always like to have estimates with small variances. The forms of the variances show how
to achieve this. For example, the variance of β̂1 gets smaller when n or $s_x^2$ gets larger. Thus, more
observations (larger n) result in a smaller slope variance and more dispersed xi values (larger $s_x^2$)
also result in a smaller slope variance. Of course all of this assumes that the simple linear regression
model is correct.
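The standard error formulae and the effect of the spread in the xi s can be sketched in a few lines. The function names are hypothetical; the first two calls use the Coleman Report numbers, and the last call is a hypothetical design in which $s_x^2$ is quadrupled while MSE and n are held fixed.

```python
import math

def se_slope(mse, n, sx2):
    """SE of the slope: sqrt(MSE / ((n - 1) * s_x^2))."""
    return math.sqrt(mse / ((n - 1) * sx2))

def se_intercept(mse, n, xbar, sx2):
    """SE of the intercept: sqrt(MSE * (1/n + xbar^2 / ((n - 1) * s_x^2)))."""
    return math.sqrt(mse * (1.0 / n + xbar ** 2 / ((n - 1) * sx2)))

MSE, N, XBAR, SX2 = 5.014, 20, 3.1405, 92.64798395

se_b1 = se_slope(MSE, N, SX2)
se_b0 = se_intercept(MSE, N, XBAR, SX2)
se_b1_spread = se_slope(MSE, N, 4 * SX2)   # hypothetical: x values twice as spread out
print(round(se_b1, 5), round(se_b0, 4), round(se_b1_spread, 5))
```

Doubling the standard deviation of the xi s quadruples $s_x^2$ and therefore halves the standard error of the slope, provided the model remains correct over the wider range.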
The ANOVA tables Table 6.2 and Table 6.3 can be rewritten now in terms of the parameter
estimates as Tables 6.5 and 6.6, respectively. The more commonly used form for the analysis of
variance table is given as Table 6.6. It eliminates the line for the intercept and corrects the total line
so that the degrees of freedom and sums of squares still add up.
EXAMPLE 6.10.3. Consider again simple linear regression on the Coleman Report data. The
analysis of variance table was given in Section 6.1; Table 6.7 illustrates the necessary computations.
Source            df        SS                            MS               F
Regression(β1)    1         0.560325²(20 − 1)92.64798     552.6756109      552.68/5.014
Error             20 − 2    90.2487641                    90.2487641/18
Total             20 − 1    (20 − 1)33.838125
Most of the computations were made earlier in Equation (6.10.2) during the process of obtaining
the MSE and all are based on the usual six numbers, n, x̄·, $s_x^2$, ȳ·, $s_y^2$, and ∑ xi yi. More directly, the
computations depend on n, β̂1, $s_x^2$, and $s_y^2$. Note that the SSE is obtained as SSTot − SSReg. The
correction factor C in Tables 6.2 and 6.5 is 20(35.0825)² but it is not used in these computations for
Table 6.7. ✷
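The same computations can be written as a short script. This is a sketch using the summary numbers quoted in the example; the variable names are hypothetical.

```python
# Summary numbers for the Coleman Report simple linear regression.
n = 20
b1 = 0.560325468      # estimated slope
sx2 = 92.64798395     # sample variance of x
sy2 = 33.838125       # sample variance of y

ss_tot = (n - 1) * sy2              # total sum of squares, corrected for the mean
ss_reg = b1 ** 2 * (n - 1) * sx2    # regression sum of squares
sse = ss_tot - ss_reg               # error sum of squares by subtraction
mse = sse / (n - 2)
f_stat = ss_reg / mse               # MSReg = SSReg since it has 1 df

print("Regression", 1, round(ss_reg, 4), round(ss_reg, 4), round(f_stat, 2))
print("Error     ", n - 2, round(sse, 4), round(mse, 5))
print("Total     ", n - 1, round(ss_tot, 6))
```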
$$R^2 = \frac{\hat\beta_1^2 (n - 1)s_x^2}{(n - 1)s_y^2} = \hat\beta_1^2 \frac{s_x^2}{s_y^2} = r^2.$$
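A quick numerical check of this identity with the Coleman Report summary numbers; this is a sketch, and the variable names are hypothetical.

```python
import math

b1, sx2, sy2 = 0.560325468, 92.64798395, 33.838125

r = b1 * math.sqrt(sx2 / sy2)   # sample correlation from the slope
r_squared = b1 ** 2 * sx2 / sy2 # coefficient of determination
print(round(r, 4), round(r_squared, 4), round(r ** 2, 4))
```

The value of R² obtained this way agrees with the 86.0% computed from the ANOVA table in Section 6.2.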
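Also note that $\sum_{i=1}^{n} (x_i - \bar{x}_\cdot) = 0$, so $\sum_{i=1}^{n} (x_i - \bar{x}_\cdot)\bar{x}_\cdot = 0$. It follows that
$$\sum_{i=1}^{n} (x_i - \bar{x}_\cdot)^2 = \sum_{i=1}^{n} (x_i - \bar{x}_\cdot)x_i - \sum_{i=1}^{n} (x_i - \bar{x}_\cdot)\bar{x}_\cdot = \sum_{i=1}^{n} (x_i - \bar{x}_\cdot)x_i.$$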
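$$\begin{aligned}
\mathrm{E}(\hat\beta_1) &= \mathrm{E}\left[\frac{\sum_{i=1}^{n} (x_i - \bar{x}_\cdot) y_i}{\sum_{i=1}^{n} (x_i - \bar{x}_\cdot)^2}\right] \\
&= \frac{\sum_{i=1}^{n} (x_i - \bar{x}_\cdot)\,\mathrm{E}(y_i)}{\sum_{i=1}^{n} (x_i - \bar{x}_\cdot)^2} \\
&= \frac{\sum_{i=1}^{n} (x_i - \bar{x}_\cdot)(\beta_0 + \beta_1 x_i)}{\sum_{i=1}^{n} (x_i - \bar{x}_\cdot)^2} \\
&= \beta_0 \frac{\sum_{i=1}^{n} (x_i - \bar{x}_\cdot)}{\sum_{i=1}^{n} (x_i - \bar{x}_\cdot)^2} + \beta_1 \frac{\sum_{i=1}^{n} (x_i - \bar{x}_\cdot) x_i}{\sum_{i=1}^{n} (x_i - \bar{x}_\cdot)^2} \\
&= \beta_0 \frac{0}{\sum_{i=1}^{n} (x_i - \bar{x}_\cdot)^2} + \beta_1 \frac{\sum_{i=1}^{n} (x_i - \bar{x}_\cdot)^2}{\sum_{i=1}^{n} (x_i - \bar{x}_\cdot)^2} \\
&= \beta_1.
\end{aligned}$$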
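$$\begin{aligned}
\mathrm{E}(\hat\beta_0) &= \mathrm{E}\left(\frac{1}{n}\sum_{i=1}^{n} y_i\right) - \mathrm{E}(\hat\beta_1)\,\bar{x}_\cdot \\
&= \frac{1}{n}\sum_{i=1}^{n} \mathrm{E}(y_i) - \beta_1 \bar{x}_\cdot \\
&= \frac{1}{n}\sum_{i=1}^{n} (\beta_0 + \beta_1 x_i) - \beta_1 \bar{x}_\cdot \\
&= \beta_0 + \beta_1 \frac{1}{n}\sum_{i=1}^{n} x_i - \beta_1 \bar{x}_\cdot \\
&= \beta_0 + \beta_1 \bar{x}_\cdot - \beta_1 \bar{x}_\cdot \\
&= \beta_0.
\end{aligned}$$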
Now consider the slope estimate. Recall that the yi s are independent.
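$$\begin{aligned}
\mathrm{Var}(\hat\beta_1) &= \mathrm{Var}\left[\frac{\sum_{i=1}^{n} (x_i - \bar{x}_\cdot) y_i}{\sum_{i=1}^{n} (x_i - \bar{x}_\cdot)^2}\right] \\
&= \left[\frac{1}{\sum_{i=1}^{n} (x_i - \bar{x}_\cdot)^2}\right]^2 \mathrm{Var}\left[\sum_{i=1}^{n} (x_i - \bar{x}_\cdot) y_i\right] \\
&= \left[\frac{1}{\sum_{i=1}^{n} (x_i - \bar{x}_\cdot)^2}\right]^2 \sum_{i=1}^{n} (x_i - \bar{x}_\cdot)^2\,\mathrm{Var}(y_i) \\
&= \left[\frac{1}{\sum_{i=1}^{n} (x_i - \bar{x}_\cdot)^2}\right]^2 \sum_{i=1}^{n} (x_i - \bar{x}_\cdot)^2 \sigma^2 \\
&= \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar{x}_\cdot)^2} \\
&= \frac{\sigma^2}{(n - 1)s_x^2}.
\end{aligned}$$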
Rather than establishing the variance of β̂0 directly, we find Var(β̂0 + β̂1 x) for an arbitrary value
x. The variance of β̂0 is the special case with x = 0. A key result is that ȳ· and β̂1 are independent.
This was discussed in relation to the alternative regression model of Section 6.6. The independence
of these estimates is based on the errors having independent normal distributions with the same
variance. More generally, if the errors have the same variance and zero covariance, we still get
Cov(ȳ· , β̂1 ) = 0; see Exercise 6.11.6.
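$$\begin{aligned}
\mathrm{Var}(\hat\beta_0 + \hat\beta_1 x) &= \mathrm{Var}(\bar{y}_\cdot - \hat\beta_1 \bar{x}_\cdot + \hat\beta_1 x) \\
&= \mathrm{Var}\left(\bar{y}_\cdot + \hat\beta_1 (x - \bar{x}_\cdot)\right) \\
&= \mathrm{Var}(\bar{y}_\cdot) + \mathrm{Var}(\hat\beta_1)(x - \bar{x}_\cdot)^2 + 2(x - \bar{x}_\cdot)\,\mathrm{Cov}(\bar{y}_\cdot, \hat\beta_1) \\
&= \frac{1}{n^2}\sum_{i=1}^{n} \mathrm{Var}(y_i) + \mathrm{Var}(\hat\beta_1)(x - \bar{x}_\cdot)^2 \\
&= \frac{1}{n^2}\sum_{i=1}^{n} \sigma^2 + \frac{\sigma^2 (x - \bar{x}_\cdot)^2}{(n - 1)s_x^2} \\
&= \sigma^2\left[\frac{1}{n} + \frac{(x - \bar{x}_\cdot)^2}{(n - 1)s_x^2}\right].
\end{aligned}$$
In particular, when x = 0 we get
$$\mathrm{Var}(\hat\beta_0) = \sigma^2\left[\frac{1}{n} + \frac{\bar{x}_\cdot^2}{(n - 1)s_x^2}\right].$$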
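The variance arguments rest on writing each estimate as a linear combination of the yi s: ȳ· = ∑ ai yi with ai = 1/n and β̂1 = ∑ bi yi with bi = (xi − x̄·)/∑(xi − x̄·)². With uncorrelated, equal-variance observations, variances and covariances are σ² times sums of products of these coefficients. The sketch below checks this numerically; it assumes the predictor is the x3 column of the Coleman Report data table, whose mean 3.1405 matches the x̄· used in the examples.

```python
# x3 column of the Coleman Report data table (assumed to be the SLR predictor).
x = [7.20, -11.71, 12.32, 14.28, 6.31, 6.16, 12.70, -0.17, 9.85, -0.05,
     -12.86, 0.92, 4.77, -0.96, -16.04, 10.62, 2.66, -10.99, 15.03, 12.77]

n = len(x)
xbar = sum(x) / n
sxx = sum((xi - xbar) ** 2 for xi in x)   # equals (n - 1) * s_x^2

a = [1.0 / n] * n                         # ybar  = sum a_i y_i
b = [(xi - xbar) / sxx for xi in x]       # b1hat = sum b_i y_i
c = [ai - bi * xbar for ai, bi in zip(a, b)]  # b0hat = ybar - b1hat * xbar

cov_factor = sum(ai * bi for ai, bi in zip(a, b))   # Cov(ybar, b1hat) / sigma^2
var_b1_factor = sum(bi ** 2 for bi in b)            # Var(b1hat) / sigma^2
var_b0_factor = sum(ci ** 2 for ci in c)            # Var(b0hat) / sigma^2

print(cov_factor)                # zero up to floating point error
print(1.0 / var_b1_factor, sxx)  # both about 1760.31
print(var_b0_factor)             # about 0.0556
```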
6.11 Exercises
EXERCISE 6.11.1. Draper and Smith (1966, p. 41) considered data on the relationship between
the age of truck tractors (in years) and the cost (in dollars) of maintaining them over a six-month
period. The data are given in Table 6.8. Plot cost versus age and fit a regression of cost on age. Give
95% confidence intervals for the slope and intercept. Give a 99% confidence interval for the mean
cost of maintaining tractors that are 2.5 years old. Give a 99% prediction interval for the cost of
maintaining a particular tractor that is 2.5 years old.
Reviewing the plot of the data, how much faith should be placed in these estimates for tractors
that are 2.5 years old?
EXERCISE 6.11.2. Stigler (1986, p. 6) reported data from Cassini (1740) on the angle between
the plane of the equator and the plane of the Earth’s revolution about the Sun. The data are given in
Table 6.9. The years −229 and −139 indicate 230 B.C. and 140 B.C. respectively. The angles are
listed as the minutes above 23 degrees.
Plot the data. Are there any obvious outliers (weird data)? If outliers exist, compare the fit of the
line with and without the outliers. In particular, compare the different 95% confidence intervals for
the slope and intercept.
Table 6.9: Angle between the plane of the equator and the plane of rotation about the Sun.
EXERCISE 6.11.3. Mulrow et al. (1988) presented data on the calibration of a differential scanning
calorimeter. The melting temperatures of mercury and naphthalene are known to be 234.16
and 353.24 Kelvin, respectively. The data are given in Table 6.10. Plot the data. Fit a simple linear
regression y = β0 + β1 x + ε to the data. Under ideal conditions, the simple linear regression should
have β0 = 0 and β1 = 1; test whether these hypotheses are true using α = 0.05. Give a 95% confidence
interval for the population mean of observations taken on this calorimeter for which the true
melting point is 250. Give a 95% prediction interval for a new observation taken on this calorimeter
for which the true melting point is 250.
Is there any way to check whether it is appropriate to use a line in modeling the relationship
between x and y? If so, do so.
Chemical       x         y
Naphthalene    353.24    354.62
               353.24    354.26
               353.24    354.29
               353.24    354.38
Mercury        234.16    234.45
               234.16    234.06
               234.16    234.61
               234.16    234.48
EXERCISE 6.11.4. Exercise 6.11.3 involves the calibration of a measuring instrument. Often,
calibration curves are used in reverse, i.e., we would use the calorimeter to measure a melting point
y and use the regression equation to give a point estimate of x. If a new substance has a measured
melting point of 300 Kelvin, using the simple linear regression model, what is the estimate of the true
melting point? Use a prediction interval to determine whether the measured melting point of y = 300
is consistent with the true melting point being x = 300. Is an observed value of 300 consistent with
a true value of 310?
EXERCISE 6.11.5. Working–Hotelling confidence bands are a method for getting confidence
intervals for every point on a line with a guaranteed simultaneous coverage. The method is essentially
the same as Scheffé’s method for simultaneous confidence intervals discussed in Section 13.3.
For estimating the point on the line at a value x, the endpoints of the (1 − α)100% simultaneous
confidence intervals are
$$(\hat\beta_0 + \hat\beta_1 x) \pm \sqrt{2F(1 - \alpha, 2, dfE)}\;\mathrm{SE}(\hat\beta_0 + \hat\beta_1 x).$$
Using the Coleman Report data of Table 6.1, find 95% simultaneous confidence intervals for the
values x = −17, −6, 0, 6, 17. Plot the estimated regression line and the Working–Hotelling confi-
dence bands. We are 95% confident that the entire line β0 + β1 x lies between the confidence bands.
Compute the regular confidence intervals for x = −17, −6, 0, 6, 17 and compare them to the results
of the Working–Hotelling procedure.
EXERCISE 6.11.6. Use part (4) of Proposition 1.2.11 to show that Cov(ȳ·, β̂1) = 0 whenever
Var(εi) = σ² for all i and Cov(εi, εj) = 0 for all i ≠ j. Hint: write out ȳ· and β̂1 in terms of the yi s.
Chapter 7
Model Checking
In this chapter we consider methods for checking model assumptions and the use of transformations
to correct problems with the assumptions. The primary method for checking model assumptions
is the use of residual plots. If the model assumptions are valid, residual plots should display no
detectable patterns. We begin in Section 7.1 by familiarizing readers with the look of plots that
display no detectable patterns. Section 7.2 deals with methods for checking the assumptions made
in simple linear regression (SLR). If the assumptions are violated, we need alternative methods of
analysis. Section 7.3 presents methods for transforming the original data so that the assumptions
become reasonable on the transformed data. Chapter 8 deals with tests for lack of fit. These are
methods of constructing more general models that may fit the data better. Chapters 7 and 8 apply
quite generally to regression, analysis of variance, and analysis of covariance models. They are not
restricted to simple linear regression.
Sample correlations
y1 y2 y3 y4 y5
y1 1.000
y2 −0.248 1.000
y3 −0.178 0.367 1.000
y4 −0.163 0.130 0.373 1.000
y5 0.071 0.279 0.293 0.054 1.000
✷
158 7. MODEL CHECKING
[Figures: scatter plots of y1, y2, y3, and y4 versus y5; of y1 and y3 versus y2; and of y4 versus y2 and versus y3.]
yh = m(xh ) + εh , h = 1, . . . , n,
εh s independent N(0, σ 2 ).
For example, the simple linear regression model posits
m(xh ) = β0 + β1xh .
The assumptions involved can all be thought of in terms of the errors. The assumptions are that
1. the εh s are independent,
2. E(εh ) = 0 for all h,
3. Var(εh ) = σ 2 for all h,
4. the εh s are normally distributed.
To have faith in our analysis, we need to validate these assumptions as far as possible. These are
assumptions and cannot be validated completely, but we can try to detect gross violations of the
assumptions.
The first assumption, that the εh s are independent, is the most difficult to validate. If the ob-
servations are taken at regular time intervals, they may lack independence and standard time series
methods may be useful in the analysis. We will not consider this further; the interested reader can
consult the time series literature, e.g., Shumway and Stoffer (2000). More general methods for
checking independence were developed by Christensen and Bedrick (1997) and are reviewed in
Christensen (2011). In general, we rely on the data analyst to think hard about whether there are
reasons for the data to lack independence.
The second assumption is that E(εh ) = 0. This is violated when we have the wrong model. The
simple linear regression model with E(εh ) = 0 specifies that
E(yh ) = β0 + β1 xh .
If we fit this model when it is incorrect, we will not have errors with E(εh ) = 0. More generally, if
we fit a mean model m(xh ), then
E(yh ) = m(xh ),
and if the model is incorrect, we will not have errors with E(εh ) = 0. Having the wrong model for
the means is called lack of fit.
The last two assumptions are that the errors all have some common variance σ 2 and that they
are normally distributed. The term homoscedasticity refers to having a constant (homogeneous)
variance. The term heteroscedasticity refers to having nonconstant (heterogeneous) variances.
In checking the error assumptions, we are hampered by the fact that the errors are not observable;
we must estimate them. The SLR model involves
yh = β0 + β1xh + εh
or equivalently,
yh − β0 − β1 xh = εh .
Given the fitted values ŷh = β̂0 + β̂1xh , we estimate εh with the residual
ε̂h = yh − ŷh .
Similarly, in Chapter 3 we defined fitted values and residuals for general models. I actually prefer
referring to predicting the error rather than estimating it. One estimates fixed unknown parameters
and predicts unobserved random variables. Our discussion depends only on having fitted values
and residuals; it does not depend specifically on the SLR model.
Two of the error assumptions are independence and homoscedasticity of the variances. Unfor-
tunately, even when these assumptions are true, the residuals are neither independent nor do they
have the same variance. For example, the SLR residuals all involve the random variables β̂0 and β̂1 ,
so they are not independent. Moreover, the hth residual involves β̂0 + β̂1xh , the variance of which
depends on (xh − x̄· ). Thus the variance of ε̂h depends on xh . There is little we can do about the
lack of independence except hope that it does not cause severe problems. On the other hand, we can
adjust for the differences in variances. In linear models the variance of a residual is
$$\mathrm{Var}(\hat\varepsilon_i) = \sigma^2 (1 - h_i)$$
where hi is the leverage of the ith case. Leverages are discussed a bit later in this section and more
extensively in relation to multiple regression. (In discussing leverages I have temporarily switched
from using the meaningless subscript h to identify individual cases to using the equally meaningless
subscript i. There are two reasons. First, many people use the notation hi for leverages, to the point
of writing it as “HI.” Second, hh looks funny. My preference would be to denote the leverages mhh ,
cf. Chapter 11.)
Given the variance of a residual, we can obtain a standard error for it,
$$\mathrm{SE}(\hat\varepsilon_i) = \sqrt{\mathrm{MSE}(1 - h_i)}.$$
We can now adjust the residuals so they all have a variance of about 1; these standardized residuals
are
$$r_i = \frac{\hat\varepsilon_i}{\sqrt{\mathrm{MSE}(1 - h_i)}}.$$
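Standardized residuals are simple to compute once the leverages are available. The sketch below makes two assumptions that go beyond this section: that the Coleman Report simple linear regression uses the x3 column of the data table as the predictor, and that in simple linear regression the leverage is hi = 1/n + (xi − x̄·)²/∑(xi − x̄·)², which is the factor multiplying σ² in (6.5.1) evaluated at x = xi.

```python
import math

# y and x3 columns of the Coleman Report data table.
y = [37.01, 26.51, 36.51, 40.70, 37.10, 33.90, 41.80, 33.40, 41.01, 37.20,
     23.30, 35.20, 34.90, 33.10, 22.70, 39.70, 31.80, 31.70, 43.10, 41.01]
x = [7.20, -11.71, 12.32, 14.28, 6.31, 6.16, 12.70, -0.17, 9.85, -0.05,
     -12.86, 0.92, 4.77, -0.96, -16.04, 10.62, 2.66, -10.99, 15.03, 12.77]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * yi for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar

resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
mse = sum(e ** 2 for e in resid) / (n - 2)

# Assumed SLR leverage formula; leverages sum to the number of mean parameters.
lev = [1.0 / n + (xi - xbar) ** 2 / sxx for xi in x]
std_resid = [e / math.sqrt(mse * (1.0 - h)) for e, h in zip(resid, lev)]

for case, (h, r) in enumerate(zip(lev, std_resid), start=1):
    print(case, round(h, 3), round(r, 2))
```

Cases with x values far from x̄· have larger leverages, so their raw residuals are divided by a smaller standard error.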
The main tool used in checking assumptions is plotting the residuals or, more commonly, the
standardized residuals. Normality is checked by performing a normal plot on the standardized resid-
uals. If the assumptions (other than normality) are correct, plots of the standardized residuals versus
any variable should look random. If the variable plotted against the ri s is continuous with no major
gaps, the plots should look similar to the plots given in the previous section. In problems where
the predictor variables are just group indicators (e.g., two-sample problems like Section 6.8 or the
analysis of variance problems of Chapter 12), we often plot the residuals against identifiers of the
groups, so the discrete nature of the number of groups keeps the plots from looking like those of the
previous section. The single most popular diagnostic plot is probably the plot of the standardized
residuals against the predicted (fitted) values ŷi ; however, the ri s can be plotted against any variable
that provides a value associated with each case.
Violations of the error assumptions are indicated by any systematic pattern in the residuals. This
could be, for example, a pattern of increased variability as the predicted values increase, or some
curved pattern in the residuals, or any change in the variability of the residuals.
A residual plot that displays an increasing variance looks roughly like a horn opening to the
right.
[Sketch: a horn-shaped band of residuals opening to the right.]
[Figure 7.3: Plot of the standardized residuals r versus ŷ, Coleman Report. Horizontal axis: Fitted; vertical axis: Standardized residuals.]
[Sketch: a second horn-shaped band of residuals.]
Plots that display curved shapes typically indicate lack of fit. One example of a curve is given below.
[Sketch: a curved band of residuals.]
[Residual−Socio plot: standardized residuals on the vertical axis.]
[Normal plot of standardized residuals; horizontal axis: Theoretical Quantiles.]
$$R^2 = \frac{444.17}{447.85} = 99.2\%.$$
The plot of residuals versus predicted values is given in Figure 7.7. A pattern is very clear; the
residuals form something like a parabola. In spite of a very large R2 and a scatter plot that looks
quite linear, the residual plot shows that a lack of fit obviously exists. After seeing the residual
plot, you can go back to the scatter plot and detect suggestions of nonlinearity. The simple linear
regression model is clearly inadequate, so we do not bother presenting a normal plot. In the next
two sections, we will examine ways to deal with this lack of fit. ✷
164 7. MODEL CHECKING
[Plot of Pres (vertical axis) versus Temp (horizontal axis).]
[Residual−Fitted plot. Horizontal axis: Fitted; vertical axis: Standardized residuals.]
Figure 7.7: Standardized residuals versus predicted values for Hooker data.
7.2 CHECKING ASSUMPTIONS: RESIDUAL ANALYSIS 165
7.2.2 Outliers
Outliers are bizarre data points. They are points that do not seem to fit with the other observations
in a data set. We can characterize bizarre points as having either bizarre x values or bizarre y values.
There are two valuable tools for identifying outliers.
Leverages are values between 0 and 1 that measure how bizarre an x value is relative to the other
x values in the data. A leverage near 1 is a very bizarre point. Leverages that are small are similar
to the other data. The sum of all the leverages in a simple linear regression is always 2, thus the
average leverage is 2/n. Points with leverages larger than 4/n cause concern and leverages above
6/n cause considerable concern.
For the more general models of Section 3.9 we used r to denote the number of uniquely defined
parameters in a model m(·). For general linear models the average leverage is r/n. Points with lever-
ages larger than 2r/n or 3r/n are often considered high-leverage points. The concept of leverage
will be discussed in more detail when we discuss multiple regression.
Outliers in the y values can be detected from the standardized deleted residuals. Standardized
deleted residuals are also called t residuals (in Minitab) and studentized residuals (in the R lan-
guage). Standardized deleted residuals are just standardized residuals, but the residual for the hth
case is computed from a regression that does not include the hth case. For example, in SLR the third
deleted residual is
$$\hat{\varepsilon}_{[3]} = y_3 - \hat{\beta}_{0[3]} - \hat{\beta}_{1[3]} x_3$$
where the estimates β̂0[3] and β̂1[3] are computed from a regression in which case 3 has been dropped
from the data. More generally, we could write this as
$$\hat{\varepsilon}_{[3]} = y_3 - \hat{m}_{[3]}(x_3)$$
where m̂[3] (·) is the estimate of the mean model based on all data except the third observation.
The third standardized deleted residual is simply the third deleted residual divided by its standard
error. The standardized deleted residuals, denoted th , really contain the same information as the
standardized residuals rh ; the largest standardized deleted residuals are also the largest standardized
residuals. The main virtue of the standardized deleted residuals is that they can be compared to a
t(dfE − 1) distribution to test whether they could reasonably have occurred when the model is true.
The degrees of freedom in the SLR test are n − 3 because the simple linear regression model was
fitted without the hth case, so there are only n − 1 data points in the fit and (n − 1) − 2 degrees of
freedom for error. If we had reason to suspect that, say, case 3 might be an outlier, we would reject
it being consistent with the model and the other data if for h = 3
$$|t_h| \ge t\!\left(1 - \frac{\alpha}{2},\; dfE - 1\right).$$
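A minimal Python sketch of the definition follows, assuming a simple linear regression and made-up data (case 5, index 4, is deliberately unusual). Each standardized deleted residual is obtained by actually refitting without the case. The closing line uses a standard identity relating t_h to r_h that is not stated in this section; it is included only as a cross-check.

```python
import math

def fit_slr(x, y):
    """Least squares fit of y = b0 + b1*x; returns b0, b1, xbar, Sxx, MSE."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    sse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    return b0, b1, xbar, sxx, sse / (n - 2)

def standardized_residual(x, y, h):
    """Ordinary standardized residual r_h from the fit to all cases."""
    b0, b1, xbar, sxx, mse = fit_slr(x, y)
    lev = 1 / len(x) + (x[h] - xbar) ** 2 / sxx
    return (y[h] - b0 - b1 * x[h]) / math.sqrt(mse * (1 - lev))

def standardized_deleted_residual(x, y, h):
    """Refit without case h, then standardize the prediction error for case h."""
    xd = x[:h] + x[h + 1:]
    yd = y[:h] + y[h + 1:]
    b0, b1, xbar, sxx, mse = fit_slr(xd, yd)
    deleted = y[h] - b0 - b1 * x[h]
    # Variance factor for predicting a new case from the n - 1 remaining cases.
    factor = 1 + 1 / len(xd) + (x[h] - xbar) ** 2 / sxx
    return deleted / math.sqrt(mse * factor)

# Hypothetical data for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.2, 1.9, 3.1, 4.0, 7.5, 6.1, 6.9, 8.2]
dfe = len(x) - 2
t = [standardized_deleted_residual(x, y, h) for h in range(len(x))]
r = [standardized_residual(x, y, h) for h in range(len(x))]
# Cross-check with the identity t_h = r_h * sqrt((dfE - 1) / (dfE - r_h^2)).
t_from_r = [ri * math.sqrt((dfe - 1) / (dfe - ri ** 2)) for ri in r]
```

Because the identity is increasing in |r_h|, the case with the largest |r_h| also has the largest |t_h|. Comparing t_h to a t(dfE − 1) quantile would need a t table or a statistics library, which this sketch leaves out.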
If one examines the largest absolute standardized deleted residual, the appropriate α -level test
rejects if
$$\max_h |t_h| \ge t\!\left(1 - \frac{\alpha}{2n},\; dfE - 1\right).$$
An unadjusted t(dfE − 1) test is no longer appropriate. The distribution of the maximum of a group
of identically distributed random variables is not the same as the original distribution. For n vari-
ables, the true P value is no greater than nP∗ where P∗ is the “P value” computed by comparing the
maximum to the original distribution. This is known as a Bonferroni adjustment and is discussed in
more detail in Chapter 13.
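The bound behind this cutoff can be written out in one line. This is a sketch of the usual argument, writing c for the cutoff t(1 − α/(2n), dfE − 1); the symbol c is introduced here only for the illustration. If the model is true, each t_h has a t(dfE − 1) distribution, so

$$P\Big(\max_h |t_h| \ge c\Big) = P\Big(\bigcup_{h=1}^{n} \{|t_h| \ge c\}\Big) \le \sum_{h=1}^{n} P(|t_h| \ge c) = n \cdot \frac{\alpha}{n} = \alpha .$$

The inequality holds whether or not the t_h s are independent, which is why the adjustment can be used even though residuals are correlated.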
EXAMPLE 7.2.3. The leverages and standardized deleted residuals are given in Table 7.2 for the
Coleman Report data with one predictor. Compared to the leverage rule of thumb 4/n = 4/20 = 0.2,
only case 15 has a noticeably large leverage. None of the cases is above the 6/n rule of thumb. In
simple linear regression, one does not really need to evaluate the leverages directly because the
necessary information about bizarre x values is readily available from the x, y plot of the data. In
multiple regression with three or more predictor variables, leverages are vital because no one scatter
plot can give the entire information on bizarre x values. In the scatter plot of the Coleman Report
data, Figure 6.1, there are no outrageous x values, although there is a noticeable gap between the
smallest four values and the rest. From Table 6.1 we see that the cases with the smallest x values are
2, 11, 15, and 18. These cases also have the highest leverages reported in Table 7.2. The next two
highest leverages are for cases 4 and 19; these have the largest x values.
For an overall α = 0.05 level test of the deleted residuals, the tabled value needed is
$$t\!\left(1 - \frac{0.05}{2(20)},\; 17\right) = 3.54 .$$
None of the standardized deleted residuals (th s) approach this, so there is no evidence of any unac-
countably bizarre y values.
A handy way to identify cases with large leverages, residuals, standardized residuals, or stan-
dardized deleted residuals is with an index plot. This is simply a plot of the value against the case
number as in Figure 7.8 for leverages. ✷
Case         1     2     3     4     5
y            1     2     3     4    −3
x            1     2     3     4    20
Leverage  0.30  0.26  0.24  0.22  0.98
The case with x = 20 is an extremely high-leverage point; it has a leverage of nearly 1. The estimated
regression line is forced to go very nearly through this high-leverage point. In fact, this plot has two
clusters of points that are very far apart, so a rough approximation to the estimated line is the line
that goes through the mean x and y values for each of the two clusters. This example has one cluster
of four cases on the left of the plot and another cluster consisting solely of the one case on the right.
The average values for the four cases on the left give the point (x̄, ȳ) = (2.5, 2.5). The one case on the
right is (20, −3). A little algebra shows the line through these two points to be ŷ = 3.286 − 0.314x.
The estimated line using least squares turns out to be ŷ = 3.128 − 0.288x, which is not too different.
The least squares line goes through the two points (2.5, 2.408) and (20, −2.632), so the least squares
line is a little lower at x = 2.5 and a little higher at x = 20.
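The numbers quoted for this five-case example can be checked directly. The sketch below uses the data from the table above and the standard simple linear regression formulas; the variable names are illustrative only.

```python
# Five-case example: four clustered cases and one high-leverage case.
x = [1, 2, 3, 4, 20]
y = [1, 2, 3, 4, -3]
n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

# Least squares line and leverages (standard SLR formulas).
slope = sxy / sxx
intercept = ybar - slope * xbar
leverages = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]

# Line through the mean of the left cluster and the single point on the right.
left_x = sum(x[:4]) / 4
left_y = sum(y[:4]) / 4
two_point_slope = (y[4] - left_y) / (x[4] - left_x)
two_point_intercept = left_y - two_point_slope * left_x

print(round(intercept, 3), round(slope, 3))        # 3.128 -0.288
print([round(h, 2) for h in leverages])            # [0.3, 0.26, 0.24, 0.22, 0.98]
print(round(two_point_intercept, 3), round(two_point_slope, 3))  # 3.286 -0.314
```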
[Figure 7.8: School−Leverage plot, an index plot of leverage against school (case) number.]
7.3 Transformations
If the residuals show a problem with lack of fit, heteroscedasticity, or nonnormality, one way to deal
with the problem is to try transforming the yh s. Typically, this only works well when ymax /ymin is
reasonably large. The use of transformations is often a matter of trial and error. Various transforma-
tions are tried and the one that gives the best-fitting model is used. In this context, the best-fitting
model should have residual plots indicating that the model assumptions are reasonably valid. The
first approach to transforming the data should be to consider transformations that are suggested by
any theory associated with the data collection. Another approach to choosing a transformation is
to try a variance-stabilizing transformation. These were discussed in Section 2.5 and are repeated
below for data yh with E(yh ) = μh and Var(yh ) = σh2 .
Variance-stabilizing transformations
                            Mean, variance
Data        Distribution    relationship            Transformation
Count       Poisson         μh ∝ σh²                √yh
Amount      Gamma           μh ∝ σh                 log(yh)
Proportion  Binomial/N      μh(1 − μh)/N ∝ σh²      sin⁻¹(√yh)
Whenever the data have the indicated mean-variance relationship, the corresponding variance-
stabilizing transformation is supposed to work reasonably well.
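Why these particular transformations stabilize the variance can be seen from a first-order Taylor (delta method) approximation. The following is a sketch of the standard argument, not the text's own derivation; g denotes the transformation and c an unknown proportionality constant, both introduced here for illustration.

$$\mathrm{Var}[g(y_h)] \approx [g'(\mu_h)]^2 \sigma_h^2 .$$

$$\text{Count: } \sigma_h^2 = c\mu_h,\; g(y) = \sqrt{y},\; g'(\mu_h) = \frac{1}{2\sqrt{\mu_h}} \;\Rightarrow\; \mathrm{Var}[g(y_h)] \approx \frac{c}{4} .$$

$$\text{Amount: } \sigma_h = c\mu_h,\; g(y) = \log(y),\; g'(\mu_h) = \frac{1}{\mu_h} \;\Rightarrow\; \mathrm{Var}[g(y_h)] \approx c^2 .$$

$$\text{Proportion: } \sigma_h^2 = \frac{\mu_h(1-\mu_h)}{N},\; g(y) = \sin^{-1}(\sqrt{y}),\; g'(\mu_h) = \frac{1}{2\sqrt{\mu_h(1-\mu_h)}} \;\Rightarrow\; \mathrm{Var}[g(y_h)] \approx \frac{1}{4N} .$$

In each case the approximate variance no longer depends on μh.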
Personally, I usually start by trying the log or square root transformations and, if they do not
work, then I worry about how to find something better.
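[Figure 7.10: The circle of x, y transformations. Quadrants II and I lie above, and quadrants III and IV below, the center point (1, 1); the horizontal axis is labeled x, γ and the vertical axis y, λ.]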
Figure 7.10 indicates the kinds of transformations appropriate for some different shapes of x, y
curves. For example, if the x, y curve is similar to that in quadrant I, i.e., the y values decrease as x
increases and the curve opens to the lower left, appropriate transformations involve increasing λ or
increasing γ or both. Here we refer to increasing λ and γ relative to the no transformation values of
λ = 1 and γ = 1. In particular, Figure 7.11 gives an x, y plot for part of a cosine curve that is shaped
like the curve in quadrant I. Figure 7.12 is a plot of the numbers after x has been transformed into
$x^{1.5}$ and y has been transformed into $y^{1.5}$. Note that the curve in Figure 7.12 is much straighter than
the curve in Figure 7.11. If the x, y curve increases and opens to the lower right, such as those in
quadrant II, appropriate transformations involve increasing λ or decreasing γ or both. An x, y curve
similar to that in quadrant III suggests decreasing λ or decreasing γ or both. The graph given in
Figure 7.10 is often referred to as the circle of x, y transformations.
[Figure 7.12: plot of the transformed values; vertical axis $y^{1.5}$.]
We established in the previous section that the Hooker data do not fit a straight line and that
the scatter plot in Figure 7.6 increases with a slight tendency to open to the upper left. This is the
same shaped curve as in quadrant IV of Figure 7.10. The circle of x, y transformations suggests
that to straighten the curve, we should try transformations with decreased values of λ or increased
values of γ or both. Thus we might try transforming y into $y^{1/2}$, $y^{1/4}$, log(y), or $y^{-1}$. Similarly, we
might try transforming x into $x^{1.5}$ or $x^{2}$.
To get a preliminary idea of how well various transformations work, we should do a series of
plots. We might begin by examining the four plots in which $y^{1/2}$, $y^{1/4}$, log(y), and $y^{-1}$ are plotted
against x. We might then plot y against both $x^{1.5}$ and $x^{2}$. We should also plot all possibilities involving
one of $y^{1/2}$, $y^{1/4}$, log(y), and $y^{-1}$ plotted against one of $x^{1.5}$ and $x^{2}$, and we may need to consider
other choices of λ and γ . For the Hooker data, looking at these plots would probably only allow
us to eliminate the worst transformations. Recall that Figure 7.6 looks remarkably straight and it is
only after fitting a simple linear regression model and examining residuals that the lack of fit (the
curvature of the x, y plot) becomes apparent. Evaluating the transformations would require fitting a
simple linear regression for every pair of transformed variables that has a plot that looks reasonably
straight.
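As a rough numerical stand-in for looking at many plots, one can screen candidate pairs (λ, γ) by the squared correlation between the transformed variables. The sketch below uses hypothetical data, y = x² on x = 1, ..., 10, which increase and open to the upper left like quadrant IV; the function names and the candidate grid are assumptions made for the illustration.

```python
import math

def r_squared(u, v):
    """Squared sample correlation between two equal-length sequences."""
    n = len(u)
    ubar = sum(u) / n
    vbar = sum(v) / n
    suv = sum((a - ubar) * (b - vbar) for a, b in zip(u, v))
    suu = sum((a - ubar) ** 2 for a in u)
    svv = sum((b - vbar) ** 2 for b in v)
    return suv ** 2 / (suu * svv)

def power(values, p):
    """Power transformation of positive values, with p = 0 standing for the log."""
    return [math.log(v) if p == 0 else v ** p for v in values]

# Hypothetical data: increasing and opening to the upper left (quadrant IV).
x = [float(k) for k in range(1, 11)]
y = [xi ** 2 for xi in x]

# Candidate powers: lam for y, gam for x.
candidates = [(lam, gam) for lam in (1, 0.5, 0.25, 0, -1) for gam in (1, 1.5, 2)]
straightness = {(lam, gam): r_squared(power(x, gam), power(y, lam))
                for lam, gam in candidates}

for pair in sorted(straightness, key=straightness.get, reverse=True)[:3]:
    print(pair, straightness[pair])
```

For these data both decreasing λ to 1/2 and increasing γ to 2 straighten the curve exactly, as the circle suggests. A high squared correlation alone is only a screen; as the Hooker data show, residual plots from the fitted regression are still needed.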
Observe that many of the power transformations considered here break down with values of x
or y that are negative. For example, it is difficult to take square roots and logs of negative numbers.
Fortunately, data are often positive or at least nonnegative. Measured amounts, counts, and propor-
tions are almost always nonnegative. When problems arise, a small constant is often added to all
cases so that they all become positive. Of course, it is unclear what constant should be added.
Obviously, the circle of transformations, just like the variance-stabilizing transformations, pro-
vides only suggestions on how to transform the data. The process of choosing a particular transfor-
mation remains one of trial and error. We begin with reasonable candidates and examine how well
these transformations agree with the simple linear regression model. When we find a transformation
that agrees well with the assumptions of simple linear regression, we proceed to analyze the data.
Obviously, an alternative to transforming the data is to change the model. In Chapter 8 we consider
a new class of models that incorporate transformations of the x variable. In the remainder of this
section, we focus on a systematic method for choosing a transformation of y.
7.3.2 Box–Cox transformations
We now consider a systematic method, introduced by Box and Cox (1964), for choosing a power
transformation for general models. Consider the family of power transformations, say, $y_h^{\lambda}$. This
includes the square root transformation as the special case λ = 1/2 and other interesting transformations
such as the reciprocal transformation $y_h^{-1}$. By making a minor adjustment, we can bring log
transformations into the power family. Consider the transformations
$$y_h^{(\lambda)} = \begin{cases} \left(y_h^{\lambda} - 1\right)/\lambda & \lambda \ne 0 \\ \log(y_h) & \lambda = 0 . \end{cases}$$
For any fixed λ ≠ 0, the transformation $y_h^{(\lambda)}$ is equivalent to $y_h^{\lambda}$, because the difference between the
two transformations consists of subtracting a constant and dividing by a constant. In other words,
for λ ≠ 0, fitting the model
$$y_h^{\lambda} = m(x_h) + \varepsilon_h$$
is equivalent to fitting the model
$$y_h^{(\lambda)} = m(x_h) + \varepsilon_h ,$$
in that the fitted values $\hat{y}_h^{\lambda}$ satisfy $\hat{y}_h^{(\lambda)} = (\hat{y}_h^{\lambda} - 1)/\lambda$. (This happens whenever m(x) is a linear model
that includes an intercept or terms equivalent to fitting an intercept.) Parameters in the two models
have slightly different meanings. While the transformation $(y_h^{\lambda} - 1)/\lambda$ is undefined for λ = 0, as λ
approaches 0, $(y_h^{\lambda} - 1)/\lambda$ approaches log(yh ), so the log transformation fits in naturally.
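The family and its limiting behavior can be sketched in a few lines. The function name `box_cox` is chosen for the illustration and is not tied to any particular package.

```python
import math

def box_cox(y, lam):
    """Box-Cox power transformation of a positive number y."""
    if y <= 0:
        raise ValueError("y must be positive")
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1) / lam

# As lam shrinks toward 0 the transformed value approaches log(2).
for lam in (1.0, 0.5, 0.1, 0.01, 0.001, 0.0):
    print(lam, box_cox(2.0, lam))
```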
Unfortunately, the results of fitting models to $y_h^{(\lambda)}$ with different values of λ are not directly
comparable. Thus it is difficult to decide which transformation in the family to use. This problem is
easily evaded (cf. Cook and Weisberg, 1982) by further modifying the family of transformations so
that the results of fitting with different λ s are comparable. Let ỹ be the geometric mean of the yh s,
i.e.,
$$\tilde{y} = \left[\prod_{h=1}^{n} y_h\right]^{1/n} = \exp\left[\frac{1}{n}\sum_{h=1}^{n} \log(y_h)\right]$$
and define the family of transformations
$$z_h^{(\lambda)} = \begin{cases} \left(y_h^{\lambda} - 1\right)\big/\left(\lambda\, \tilde{y}^{\lambda - 1}\right) & \lambda \ne 0 \\ \tilde{y}\log(y_h) & \lambda = 0 . \end{cases}$$
The results of fitting the model
$$z_h^{(\lambda)} = m(x_h) + \varepsilon_h$$
can be summarized via SSE(λ ). These values are directly comparable for different values of λ . The
choice of λ that yields the smallest SSE(λ ) is the best-fitting model. (It maximizes the likelihood
with respect to λ .)
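A small grid search over λ can be sketched as follows. The sketch assumes m(·) is a simple linear regression and uses hypothetical data constructed so that log(y) is exactly linear in x; the helper names and the grid are illustrative only.

```python
import math

def sse_slr(x, z):
    """Error sum of squares from a simple linear regression of z on x."""
    n = len(x)
    xbar = sum(x) / n
    zbar = sum(z) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (zi - zbar) for xi, zi in zip(x, z)) / sxx
    b0 = zbar - b1 * xbar
    return sum((zi - b0 - b1 * xi) ** 2 for xi, zi in zip(x, z))

def z_transform(y, lam):
    """Geometric-mean-adjusted power transformation of positive data."""
    gm = math.exp(sum(math.log(v) for v in y) / len(y))   # geometric mean
    if lam == 0:
        return [gm * math.log(v) for v in y]
    return [(v ** lam - 1) / (lam * gm ** (lam - 1)) for v in y]

# Hypothetical data: log(y) is exactly linear in x, so lambda = 0 should win.
x = [float(k) for k in range(10)]
y = [math.exp(0.5 + 0.2 * xi) for xi in x]

grid = [-1, -0.5, 0, 0.25, 0.5, 1]
sse = {lam: sse_slr(x, z_transform(y, lam)) for lam in grid}
best = min(sse, key=sse.get)
print(best, sse[best])
```

With real data SSE(λ) is never exactly zero, and a finer grid near the apparent minimum may be worthwhile.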
Box and Draper (1987, p. 290) discuss finding a confidence interval for the transformation pa-
rameter λ . An approximate (1 − α )100% confidence interval consists of all λ values that satisfy
where λ̂ is the value of λ that minimizes SSE(λ ). When ymax /ymin is not large, the interval tends to
be wide.
[Figure 7.13: plot of LPres (vertical axis) versus Temp (horizontal axis).]
Table 7.3 contains SSE(λ ) values for some reasonable choices of λ . Assuming that SSE(λ ) is
a very smooth (convex) function of λ , the best λ value is probably between 0 and 1/4. If the curve
being minimized is very flat between 0 and 1/4, there is a possibility that the minimizing value is
between 1/4 and 1/3. One could pick more λ values and compute more SSE(λ )s but I have a bias
towards simple transformations. (They are easier to sell to clients.)
The log transformation of λ = 0 is simple (certainly simpler than the fourth root) and λ = 0 is
near the optimum, so we will consider it further. We now use the simple log transformation, rather
than adjusting for the geometric mean. The data are displayed in Figure 7.13. The usual summary
tables follow.
[Residual−Fitted plot. Horizontal axis: Fitted; vertical axis: Standardized residuals.]
Figure 7.14: Standardized residuals versus predicted values, logs of Hooker data.
[Normal plot of standardized residuals; horizontal axis: Theoretical Quantiles.]
$$R^2 = \frac{0.99798}{1.00002} = 99.8\%,$$
although because of the transformation this number is not directly comparable to the R2 of 0.992
for the original SLR. As discussed in Section 3.9, to measure the predictive ability of this model
on the original scale, we back transform the fitted values to the original scale and compute the
squared sample correlation between the original data and these predictors on the original scale. For
the Hooker data this also gives R2 = 0.998, which is larger than the original SLR R2 of 0.992. (It
is a mere quirk that the R2 on the log scale and the back transformed R2 happen to agree to three
decimal places.) ✷
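The back-transformed R² described here can be sketched as follows, assuming a simple linear regression on the log scale. The data are hypothetical (they are not the Hooker data) and the function names are illustrative.

```python
import math

def slr_fitted(x, y):
    """Fitted values from a simple linear regression of y on x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
          / sum((xi - xbar) ** 2 for xi in x))
    b0 = ybar - b1 * xbar
    return [b0 + b1 * xi for xi in x]

def squared_correlation(u, v):
    """Squared sample correlation between two sequences."""
    n = len(u)
    ubar = sum(u) / n
    vbar = sum(v) / n
    suv = sum((a - ubar) * (b - vbar) for a, b in zip(u, v))
    return suv ** 2 / (sum((a - ubar) ** 2 for a in u)
                       * sum((b - vbar) ** 2 for b in v))

def back_transformed_r2(x, y):
    """R^2 on the original scale for a regression fitted to log(y)."""
    log_fit = slr_fitted(x, [math.log(v) for v in y])
    predictions = [math.exp(f) for f in log_fit]   # back transform to original scale
    return squared_correlation(y, predictions)

# Hypothetical positive data.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.7, 2.0, 2.4, 3.1, 3.6, 4.6, 5.4]
log_y = [math.log(v) for v in y]
r2_log_scale = squared_correlation(log_y, slr_fitted(x, log_y))
r2_original_scale = back_transformed_r2(x, y)
print(r2_log_scale, r2_original_scale)
```

The two numbers generally differ; only the second is comparable with the R² from a regression fitted to the untransformed y.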
$$y_h = \beta_0 + \beta_1 x_h + \beta_2 w_h + \varepsilon_h .$$
As illustrated in Section 6.9, multiple regression gives results similar to those for simple linear
regression; typical output includes a table of coefficients and an ANOVA table. A test of H0 : β2 = 0
from the table of coefficients gives an approximate test that no transformation is needed. The test is
performed using the standard methods of Chapter 3. Details are illustrated in the following example.
In addition, the estimate β̂2 provides, indirectly, an estimate of λ ,
λ̂ = 1 − β̂2.
Frequently, this is not a very good estimate of λ but it gives an idea of where to begin a search for
good λ s.
The t statistic is 10.65 = 0.80252/.07534 for testing that the regression coefficient of the constructed
variable is 0. The P value of 0.000 strongly indicates the need for a transformation. The estimate of
λ is
λ̂ = 1 − β̂2 = 1 − 0.80 = 0.2,
which is consistent with what we learned from Table 7.3. From Table 7.3 we suspected that the best
transformation would be between 0 and 0.25. Of course this estimate of λ is quite crude; finding
the ‘best’ transformation requires a more extensive version of Table 7.3. I limited the choices of λ
in Table 7.3 because I was unwilling to consider transformations that I did not consider simple. ✷
yh = m(xh ) + εh , (7.3.2)
yh = m(xh ) + γ wh + εh , (7.3.3)
and test H0 : γ = 0. This gives only an approximate test of whether a power transformation is needed.
The usual t distribution is not really appropriate. The problem is that the constructed variable w
involves the ys, so the ys appear on both sides of the equality in Model (7.3.3). This is enough to
invalidate the theory behind the usual test.
It turns out that this difficulty can be avoided by using the predicted values from Model (7.3.2).
We write these as ŷh(2) s, where the subscript (2) is a reminder that the predicted values come from
Model (7.3.2). We can now define a new constructed variable,
and fit
yh = m(xh ) + γ w̃h + εh . (7.3.4)
The new constructed variable w̃h simply replaces yh with ŷh(2) in the definition of wh and deletes
some terms made redundant by using the ŷh(2) s. If Model (7.3.2) is valid, the usual test of H0 : γ = 0
from Model (7.3.4) has the standard t distribution in spite of the fact that the w̃h s depend on the
yh s. By basing the constructed variable on the ŷh(2) s, we are able to get an exact t test for γ = 0 and
restrict the weird behavior of the test statistic to situations in which γ ≠ 0.
Tukey (1949) uses neither the constructed variable wh nor w̃h but a third constructed variable
that is an approximation to w̃h . Using a method from calculus known as Taylor’s approximation
(expanding about ȳ· ) and simplifying the approximation by eliminating terms that have no effect on
the test of H0 : γ = 0, we get $\hat{y}_{h(2)}^2$ as a new constructed variable. This leads to fitting the model
$$y_h = m(x_h) + \gamma\, \hat{y}_{h(2)}^2 + \varepsilon_h \qquad (7.3.5)$$
and testing the need for a transformation by testing H0 : γ = 0. When applied to an additive
two-way model as discussed in Chapter 14 (without replication), this is Tukey’s famous one degree of
freedom test for nonadditivity. Recall that t tests are equivalent to F tests with one degree of freedom
in the numerator, hence the reference to one degree of freedom in the name of Tukey’s test.
Models (7.3.3), (7.3.4), and (7.3.5) all provide rough estimates of the appropriate power transfor-
mation. From models (7.3.3) and (7.3.4), the appropriate power is estimated by λ̂ = 1 − γ̂ . In Model
(7.3.5), because of the simplification employed after the approximation, the estimate is λ̂ = 1 − 2ȳ·γ̂ .
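A minimal sketch of the test based on Model (7.3.5), assuming m(·) is a simple linear regression. Rather than solving a three-parameter least squares problem, the sketch adjusts both y and the constructed variable for the intercept and x and then regresses residuals on residuals, which gives the same γ̂ and error sum of squares as the full fit. The data and function names are hypothetical.

```python
import math

def slr(x, y):
    """Intercept and slope from a least squares fit of y on x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
          / sum((xi - xbar) ** 2 for xi in x))
    return ybar - b1 * xbar, b1

def residuals(x, y):
    b0, b1 = slr(x, y)
    return [yi - b0 - b1 * xi for xi, yi in zip(x, y)]

def tukey_test(x, y):
    """Return (gamma_hat, t statistic, rough lambda estimate) for Model (7.3.5)."""
    n = len(x)
    b0, b1 = slr(x, y)
    fitted = [b0 + b1 * xi for xi in x]
    e = [yi - fi for yi, fi in zip(y, fitted)]     # residuals from Model (7.3.2)
    w = [fi ** 2 for fi in fitted]                 # constructed variable
    u = residuals(x, w)                            # w adjusted for intercept and x
    suu = sum(ui ** 2 for ui in u)
    gamma = sum(ei * ui for ei, ui in zip(e, u)) / suu
    sse_full = sum((ei - gamma * ui) ** 2 for ei, ui in zip(e, u))
    mse_full = sse_full / (n - 3)                  # three mean parameters fitted
    t_stat = gamma / math.sqrt(mse_full / suu)
    lam = 1 - 2 * (sum(y) / n) * gamma
    return gamma, t_stat, lam

# Hypothetical data: nearly y = x^2, with a tiny alternating perturbation.
x = [float(k) for k in range(1, 11)]
y = [xi ** 2 + 0.01 * (-1) ** k for k, xi in enumerate(x, start=1)]
gamma_hat, t_stat, lam_hat = tukey_test(x, y)
print(gamma_hat, t_stat, lam_hat)
```

For these data the square root (λ = 1/2) straightens the relationship almost exactly, while the estimate from the constructed variable comes out near 0.36, which illustrates how rough the estimate can be.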
Atkinson (1985, Section 8.1) gives an extensive discussion of various constructed variables for
testing power transformations. In particular, he suggests (on page 158) that while the tests based on
w̃h and $\hat{y}_{h(2)}^2$ have the advantage of giving exact t tests and being easier to compute, the test using
wh may be more sensitive in detecting the need for a transformation, i.e., may be more powerful.
The tests used with models (7.3.4) and (7.3.5) are special cases of a general procedure intro-
duced by Rao (1965) and Milliken and Graybill (1970); see also Christensen (2011, Section 9.5). In
addition, Cook and Weisberg (1982), and Emerson (1983) contain useful discussions of constructed
variables and methods related to Tukey’s test.
7.4 Exercises
EXERCISE 7.4.1. Using the complete data of Exercise 6.11.2, test the need for a transformation
of the simple linear regression model. Repeat the test after eliminating any outliers. Compare the
results.
EXERCISE 7.4.2. Snedecor and Cochran (1967, Section 6.18) presented data obtained in 1942
from South Dakota on the relationship between the size of farms (in acres) and the number of acres
planted in corn. The data are given in Table 7.4.
Plot the data. Fit a simple linear regression to the data. Examine the residuals and discuss what
you find. Test the need for a power transformation. Is it reasonable to examine the square root or log
transformations? If so, do so.
EXERCISE 7.4.3. Repeat Exercise 7.4.2 but instead of using the number of acres of corn as the
dependent variable, use the proportion of acreage in corn as the dependent variable. Compare the
results to those given earlier.
Chapter 8
In analyzing data we often start with an initial model that is relatively complicated, that we hope fits
reasonably well, and look for simpler versions that still fit the data adequately. Lack of fit involves
an initial model that does not fit the data adequately. Most often, we start with a full model and look
at reduced models. When dealing with lack of fit, our initial model is the reduced model, and we
look for models that fit significantly better than the reduced model. In this chapter, we introduce
methods for testing lack of fit for a simple linear regression model. As with the chapter on model
checking, these ideas translate with (relatively) minor modifications to testing lack of fit for other
models. The issue of testing lack of fit will arise again in later chapters.
The full models that we create in order to test lack of fit are all models that involve fitting more
than one predictor variable. These are multiple regression models. Multiple regression was intro-
duced in Section 6.9 and special cases were applied in Section 7.3. This chapter makes extensive
use of special cases of multiple regression. The general topic, however, is considered in Chapter 9.
We illustrate lack-of-fit testing methods by testing for lack of fit in the simple linear regression
on the Hooker data of Table 7.1 and Example 7.2.2. Figure 8.1 displays the data with the fitted
line and we again provide the ANOVA table for this (reduced) model.
180 8. LACK OF FIT AND NONPARAMETRIC REGRESSION
[Figure 8.1: Hooker data, Pres (vertical axis) versus Temp (horizontal axis), with the fitted line.]
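model
$$y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \varepsilon_i .$$
We could also try a cubic model
$$y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + \varepsilon_i .$$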
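EXAMPLE 8.1.1. Computer programs give output for polynomial regression that is very similar
to that for simple linear regression. We fit a fifth-degree (quintic) polynomial to Hooker’s data,
$$y_i = \gamma_0 + \gamma_1 x_i + \gamma_2 x_i^2 + \gamma_3 x_i^3 + \gamma_4 x_i^4 + \gamma_5 x_i^5 + \varepsilon_i . \qquad (8.1.1)$$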
Actually, we tried fitting a cubic model to these data and encountered numerical instability. (Some
computer programs object to fitting it.) This may be related to the fact that the R2 is so high. To help
with the numerical instability of the procedure, before computing the powers of the x variable we
subtracted the mean x̄· = 191.787. Thus, we actually fit,
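$$y_i = \beta_0 + \beta_1 (x_i - \bar{x}_\cdot) + \beta_2 (x_i - \bar{x}_\cdot)^2 + \beta_3 (x_i - \bar{x}_\cdot)^3 + \beta_4 (x_i - \bar{x}_\cdot)^4 + \beta_5 (x_i - \bar{x}_\cdot)^5 + \varepsilon_i . \qquad (8.1.2)$$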
These two models are equivalent in that they always give the same fitted values, residuals, and
degrees of freedom. Moreover, γ5 ≡ β5 but none of the other γj s are equivalent to the corresponding
βj s. (The equivalences are obtained by the rather ugly process of actually multiplying out the powers
of (xi − x̄· ) in Model (8.1.2) so that the model can be rewritten in the form of Model (8.1.1).) The
fitted model, (8.1.2), is summarized by the table of coefficients and the ANOVA table.
8.1 POLYNOMIAL REGRESSION 181
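The multiplying out is easiest to see in the quadratic case, shown here purely as an illustration of the pattern (the example in the text uses a quintic):

$$\beta_0 + \beta_1 (x - \bar{x}_\cdot) + \beta_2 (x - \bar{x}_\cdot)^2 = \left(\beta_0 - \beta_1 \bar{x}_\cdot + \beta_2 \bar{x}_\cdot^2\right) + \left(\beta_1 - 2\beta_2 \bar{x}_\cdot\right) x + \beta_2 x^2 .$$

The coefficient of the highest power is unchanged by centering, while every lower-order coefficient in the uncentered form is a mixture of several centered coefficients. The same thing happens with the fifth-degree term in the quintic.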
                 Model
Source           Comparison         df    Seq SS          F
(x − x̄· )       SSE(3) − SSE(4)     1   444.167    16465.9
(x − x̄· )²      SSE(4) − SSE(5)     1     2.986      110.7
(x − x̄· )³      SSE(5) − SSE(6)     1     0.000        0.0
(x − x̄· )⁴      SSE(6) − SSE(7)     1     0.003        0.1
(x − x̄· )⁵      SSE(7) − SSE(8)     1     0.019        0.7
Using these and statistics reported in Example 8.1.1, the F statistic for dropping the fifth-degree
term from the polynomial is
$$F_{obs} = \frac{SSE(7) - SSE(8)}{MSE(8)} = \frac{0.019}{0.027} \approx 0.7 .$$
The corresponding t statistic reported earlier for testing H0 : β5 = 0 in Model (8.1.2) was −0.84.
The data are consistent with a fourth-degree polynomial.
The F test for dropping to a third-degree polynomial from a fourth-degree polynomial is
$$F_{obs} = \frac{SSE(6) - SSE(7)}{MSE(8)} = \frac{0.003}{0.027} = 0.1161 .$$
In the denominator of the test we again use the MSE from the fifth-degree polynomial. When doing
a series of tests on related models one generally uses the MSE from the largest model in the
denominator of all tests, cf. Subsection 3.1.1. The t statistic corresponding to this F statistic is
$\sqrt{0.1161} \doteq 0.341$, not the value 0.90 reported earlier for the fourth-degree term in the table of
coefficients for the fifth-degree model, (8.1.2). The t value of 0.341 is a statistic for testing β4 = 0 in the
fourth-degree model. The value tobs = 0.341 is not quite the t statistic (0.343) you would get in the
table of coefficients for fitting the fourth-degree polynomial (8.1.7) because the table of coefficients
would use the MSE from Model (8.1.7) whereas this statistic is using the MSE from Model (8.1.8).
Nonetheless, tobs provides a test for β4 = 0 in a model that has already specified that β5 = 0 whereas
t = 0.90 from the table of coefficients for the fifth-degree model, (8.1.2), is testing β4 = 0 without
specifying that β5 = 0.
[Plot of Pres (vertical axis) versus Temp (horizontal axis).]
The other F statistics listed are also computed as Seq SS/MSE(8). From the list of F statistics,
we can clearly drop any of the polynomial terms down to the quadratic term.
The MSE, regression parameter estimates, and standard errors are used in the usual way. The t
statistics and P values are for the tests of whether the corresponding β parameters are 0. The t
statistics for β0 and β1 are of little interest. The t statistic for β2 is 10.95, which is highly significant,
so the quadratic model accounts for a significant amount of the lack of fit displayed by the simple
linear regression model. Figure 8.2 gives the data with the fitted parabola.
[Residual−Fitted plot: standardized residuals (vertical axis) versus fitted values (horizontal axis).]
We will not discuss the ANOVA table in detail, but note that with two predictors, x and x2 , there
are 2 degrees of freedom for regression. In general, if we fit a polynomial of degree a, there will be
a degrees of freedom for regression, one degree of freedom for every term other than the intercept.
Correspondingly, when fitting a polynomial of degree a, there are n − a − 1 degrees of freedom for
error. The ANOVA table F statistic provides a test of whether the polynomial (in this case quadratic)
model explains the data better than the model with only an intercept.
The fitted values are obtained by substituting the xi values into
[Normal plot of standardized residuals; horizontal axis: Theoretical Quantiles.]
The standard error (as reported by the computer program) is 0.0528 and a 95% confidence interval
is (25.84, 26.06). This compares to a point estimate of 25.95 and a 95% confidence interval of
(25.80, 26.10) obtained in Section 7.3 from regressing the log of pressure on temperature and back
transforming. The quadratic model prediction for a new observation at 205°F is again 25.95 with a
95% prediction interval of (25.61, 26.29). The corresponding back transformed prediction interval
from the log transformed data is (25.49, 26.42). In this example, the results of the two methods for
dealing with lack of fit are qualitatively very similar, at least at 205°F.
Finally, consider testing the quadratic model for lack of fit by comparing it to the quintic model
(8.1.2). The F statistic is
EXAMPLE 8.2.1. The data for the example follow. They were constructed to have most observations
far from the middle.
Case      1      2      3       4      5      6       7
y     0.445  1.206  0.100  −2.198  0.536  0.329  −0.689
x       0.0    0.5    1.0    10.0   19.0   19.5    20.0
[Figure 8.5: plot of y (vertical axis) versus x (horizontal axis).]
I selected the x values. The y values are a sample of size 7 from a N(0, 1) distribution. Note that
with seven distinct x values, we can fit a polynomial of degree 6.
The data are plotted in Figure 8.5. Just by chance (honest, folks), I observed a very small y value
at x = 10, so the data appear to follow a parabola that opens up. The small y value at x = 10 totally
dominates the impression given by Figure 8.5. If the y value at x = 10 had been near 3 rather than
near −2, the data would appear to be a parabola that opens down. If the y value had been between 0
and 1, the data would appear to fit a line with a slightly negative slope. When thinking about fitting
a parabola, the case with x = 10 is an extremely high-leverage point.
Depending on the y value at x = 10, the data suggest a parabola opening up, a parabola opening
down, or that we do not need a parabola to explain the data. Regardless of the y value observed at
x = 10, the fitted parabola must go nearly through the point (10, y). On the other hand, if we think
only about fitting a line to these data, the small y value at x = 10 has much less effect. In fitting
a line, the value y = −2.198 will look unusually small (it will have a very noticeable standardized
residual), but it will not force the fitted line to go nearly through the point (10, −2.198).
Table 8.1 gives the leverages for all of the polynomial models that can be fitted to these data.
Note that there are no large leverages for the simple linear regression model (the linear polynomial).
For the quadratic (parabolic) model, all of the leverages are reasonably small except the leverage
of 0.96 at x = 10 that very nearly equals 1. Thus, in the quadratic model, the value of y at x = 10
dominates the fitted polynomial. The cubic model has extremely high leverage at x = 10, but the
leverages are also beginning to get large at x = 0, 1, 19, 20. For the quartic model, the leverage at
x = 10 is 1, to two decimal places; the leverages for x = 0, 1, 19, 20 are also nearly 1. The same
pattern continues with the quintic model but the leverages at x = 0.5, 19.5 are also becoming large.
Finally, with the sixth-degree (hexic) polynomial, all of the leverages are exactly one. This indicates
that the sixth-degree polynomial has to go through every data point exactly and thus every data point
is extremely influential on the estimate of the sixth-degree polynomial. (It is fortunate that there are
only seven distinct x values. This discussion would really tank if we had to fit a seventh-degree
polynomial. [Think about it: quartic, quintic, hexic, ... tank.])
As we fit larger polynomials, we get more high-leverage cases (and more numerical instability).
Actually, as in our example, this occurs when the size of the polynomial nears one less than the number
of distinct x values and nearly all data points have distinct x values. The estimated polynomials
must go very nearly through all high-leverage cases. To accomplish this the estimated polynomials
may get very strange. We now give all of the fitted polynomials for these data.
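The leverage pattern described above can be checked numerically. The following R sketch assumes that the seven distinct x values mentioned in the text (0, 0.5, 1, 10, 19, 19.5, 20) make up the whole design, with one observation at each. Leverages depend only on the x values, so no y is needed. The names x and lev are illustrative only.

```r
# Leverages for polynomial fits of degree 1 through 6.
# Assumption: the design consists of the seven x values named in the text.
x <- c(0, 0.5, 1, 10, 19, 19.5, 20)

lev <- sapply(1:6, function(d) {
  X <- cbind(1, poly(x, degree = d))   # intercept plus orthogonal polynomial columns
  Q <- qr.Q(qr(X))                     # orthonormal basis for the column space of X
  rowSums(Q^2)                         # diagonal of the hat (projection) matrix
})
dimnames(lev) <- list(paste0("x=", x), paste0("deg", 1:6))
round(lev, 2)
```

Each column sums to the number of fitted coefficients, so with seven coefficients and seven points every leverage is forced to equal 1.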
Figure 8.6: Plots of linear (solid), quadratic (dashes), and cubic (dots) regression curves.
Figure 8.7: Plots of quartic (solid), quintic (dashes), and hexic (dots) regression curves.
perfectly. By fitting models that are too large it seems that one can often make the MSE artificially
small. For example, the quartic model has a MSE of 0.306 and an F statistic of 5.51; if it were not
for the small value of dfE, such an F value would be highly significant. If you find a large model
that has an unnaturally small MSE with a reasonable number of degrees of freedom, everything can
appear to be significant even though nothing you look at is really significant.
Quadratic model
Source        df     SS      MS       F      P
Regression     2   5.185   2.593    4.78   0.09
Error          4   2.168   0.542
Total          6   7.353

Cubic model
Source        df     SS      MS       F      P
Regression     3   5.735   1.912    3.55   0.16
Error          3   1.618   0.539
Total          6   7.353

Quartic model
Source        df     SS      MS       F      P
Regression     4   6.741   1.685    5.51   0.16
Error          2   0.612   0.306
Total          6   7.353

Quintic model
Source        df     SS      MS       F      P
Regression     5   6.856   1.371    2.76   0.43
Error          1   0.497   0.497
Total          6   7.353

Hexic model
Source        df     SS      MS       F      P
Regression     6   7.353   1.2255    —      —
Error          0   0.000     —
Total          6   7.353

Just as the mean squared error often gets unnaturally small when fitting large models, R² gets
unnaturally large. As we have seen, there can be no possible reason to use a larger model than the
quadratic with its R² of 0.71 for these 7 data points, but the cubic, quartic, quintic, and hexic models
have R² values of 0.78, 0.92, 0.93, and 1, respectively. ✷
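The R² values quoted here follow directly from the ANOVA tables above as SSReg/SSTot. A minimal R sketch of that arithmetic, with illustrative variable names:

```r
# R^2 = SSReg / SSTot for each polynomial model, using the ANOVA tables above.
ss_reg <- c(quadratic = 5.185, cubic = 5.735, quartic = 6.741,
            quintic = 6.856, hexic = 7.353)
ss_tot <- 7.353
r2 <- ss_reg / ss_tot
round(r2, 2)
```

Because SSReg can never decrease when a term is added, R² is guaranteed to climb toward 1 as the degree grows, whether or not the extra terms mean anything.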
EXAMPLE 8.3.1. We illustrate the methods on the Hooker data. With x the temperature, we
defined x̃ = (x − 180.5)/30.5. Fitting Model (8.3.1) gives
yi = β0 + β1 cos(π x̃i) + β2 cos(2π x̃i) + β3 cos(3π x̃i) + β4 cos(4π x̃i) + β5 cos(5π x̃i) + εi.
The fit away from the data is even worse than for fifth- and sixth-order polynomials.
[Figure: Pres plotted against Temp.]
model on a series of smaller subsets of the data. Of course if the original model is correct, it should
work about the same on each subset as it does on the complete data. The statistician partitions the
data into disjoint subsets, fits the original model on each subset, and compares the overall fit of
the subsets to the fit of the original model on the entire data. The statistician is free to select the
partitions, including the number of distinct sets, but the subsets need to be chosen based on the
predictor variable(s) alone.
EXAMPLE 8.4.1. We illustrate the partitioning method by splitting the Hooker data into two
parts. Our partition sets are the data with the 16 smallest temperatures and the data with the 15
largest temperatures. We then fit a separate regression line to each partition. The two fitted lines are
given in Figure 8.9. The ANOVA table is
Analysis of Variance: Partitioned Hooker data.
Source df SS MS F P
Regression 3 446.66 148.89 3385.73 0.000
Error 27 1.19 0.04
Total 30 447.85
A test of whether this partitioning fits significantly better than SLR has statistic
\[ F_{obs} = \frac{(3.68 - 1.19)/(29 - 27)}{0.04} = 31.125. \]
Clearly the reduced model of a simple linear regression fits worse than the model with two SLRs.
Note that this is a simultaneous test of whether the slopes and intercepts are the same in each
partition. ✷
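The arithmetic of this test can be reproduced in a few lines of R. This is only a sketch using the rounded values printed in the two ANOVA tables; the variable names are illustrative.

```r
# F test of the SLR (reduced) model against two separate lines (full model).
sse_red  <- 3.68; dfe_red  <- 29
sse_full <- 1.19; dfe_full <- 27
mse_full <- 0.04                       # rounded value reported in the ANOVA table

f_obs <- ((sse_red - sse_full) / (dfe_red - dfe_full)) / mse_full
p_val <- pf(f_obs, dfe_red - dfe_full, dfe_full, lower.tail = FALSE)
c(F = f_obs, P = p_val)
```

Using the unrounded denominator 1.19/27 instead of 0.04 would give a somewhat smaller F of about 28, which leads to the same conclusion.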
gives
[Figure: Pres plotted against Temp.]
EXAMPLE 8.4.1. We consider first the use of 15 central points with leverages below 0.05; about
half the data. We then consider a group of 6 central points; about a fifth of the data.
The ANOVA table when fitting a simple linear regression to 15 central points is
My experience is that Utt’s test tends to work better with relatively small groups of central
points. (Even though the F statistic here was smaller for the smaller group.) Minitab’s regression
program incorporates a version of Utt’s test that defines the central region as those points with
leverages less than 1.1p/n where p is the number of regression coefficients in the model, so for a
simple linear regression p = 2. For these data, their central region consists of the 22 observations
with temperature between 183.2 and 200.6.
[Figure: Pres plotted against Temp.]
8.5 Splines
When fitting a polynomial to a single predictor variable, the partitioning method is extremely similar
to the nonparametric regression method known as fitting splines. When using partitioning to test for
lack of fit, our fitting of the model on each subset was merely a device to see whether the original
fitted model gave better approximations on smaller subsets of the data than it did overall. The only
difference when fitting splines is that we take the results obtained from fitting on the partition sets
seriously as a model for the regression function m(x). As such, we typically do not want to allow
discontinuities in m(x) at the partition points (known as “knots” in spline theory), so we include
conditions that force continuity. Typically when fitting splines one uses a large number of partition
sets, so there are a large number of conditions to force continuity. We illustrate the ideas on the
Hooker data with only two partition sets. Generalizations are available for more than one predictor
variable; see Wahba (1990).
β1 + β2 191 = β3 + β4 191.
β3 = β1 + β2 191 − β4 191.
You can see from Figure 8.9 that the two separate fitted lines are already pretty close to matching
up at the knot.
In Subsection 8.4.1 we fitted the partitioned model as a single linear model in two ways. The
first was more transparent but the second had advantages. The same is true about the modifications
needed to generate linear spline models. To begin, we constructed a variable h that identifies the 15
high values of x. In other words, h is 1 for the 15 highest temperature values and 0 for the 16 lowest
values. We might now write
h(x) = I_(191,∞)(x),
where we again use the indicator function introduced in Section 8.3. With slightly different notation
for the predictor variables, we first fitted the two separate lines model as
or
or
yi = β1 + β2 {xi [1 − h(xi)] + 191h(xi)} + β4 (xi − 191)h(xi ) + εi , (8.5.1)
where now β1 is an overall intercept for the model.
As mentioned earlier, the two-lines model was originally fitted (with different symbols for the
unknown parameters) as
yi = β1 + β2 xi + γ1 h(xi ) + γ2 xi h(xi ) + εi .
This is a model that has the low group of temperature values as a baseline and for the high group
incorporates deviations from the baseline, e.g., the slope above 191 is β2 + γ2 . For this model the
continuity condition is that
β1 + β2 191 = (β1 + γ1) + (β2 + γ2) 191
or that
0 = γ1 + γ2 191
or that
γ1 = −γ2 191.
Imposing this continuity condition, the model becomes
yi = β1 + β2 xi − γ2 191h(xi) + γ2 xi h(xi ) + εi
or
yi = β1 + β2xi + γ2 (xi − 191)h(xi ) + εi . (8.5.2)
In discussions of splines, the function (xi − 191)h(xi) is typically written (xi − 191)+ where, for any
scalar a,
\[ (x - a)_+ \equiv \begin{cases} x - a & \text{if } x > a \\ 0 & \text{if } x \le a. \end{cases} \]
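The positive-part function is easy to compute with pmax(), and the two parameterizations (8.5.1) and (8.5.2) can be seen to describe the same fitted curve. The R sketch below uses simulated data, not the Hooker data; the data-generating values and all variable names are assumptions made for illustration.

```r
# Simulated data (hypothetical, not the Hooker data) with a kink at the knot.
set.seed(1)
knot <- 191
x <- seq(180, 212, length.out = 31)
y <- 15 + 0.4 * (x - 180) + 0.1 * pmax(x - knot, 0) + rnorm(31, sd = 0.2)

h    <- as.numeric(x > knot)           # indicator of the high group
plus <- pmax(x - knot, 0)              # (x - 191)_+, the same as (x - knot) * h

fit1 <- lm(y ~ I(x * (1 - h) + knot * h) + plus)   # form of (8.5.1)
fit2 <- lm(y ~ x + plus)                           # form of (8.5.2)

all.equal(fitted(fit1), fitted(fit2))  # same column space, so same fitted values
```

The two predictor sets span the same space because x[1 − h(x)] + 191h(x) = x − (x − 191)h(x), so only the interpretation of the coefficients differs.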
Fitting models (8.5.1) and (8.5.2) to the Hooker data gives
In general, to fit a linear spline model, you need to decide on a group of knots at which the slope
will change. Call these x̃j, j = 1, . . . , r. The linear spline model then becomes
\[ y_i = \beta_0 + \beta_1 x_i + \sum_{j=1}^{r} \gamma_j (x_i - \tilde{x}_j)_+ + \varepsilon_i . \]
Similar ideas work with higher-degree polynomials. The most popular polynomial to use is cubic;
see Exercise 8.7.8. The general cubic spline model is
\[ y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + \sum_{j=1}^{r} \gamma_j \left[(x_i - \tilde{x}_j)_+\right]^3 + \varepsilon_i . \]
See Christensen (2001, Section 7.6) for more discussion in a similar vein.
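The general cubic spline model can be fitted with ordinary least squares once the truncated cubic terms are built as predictors. The following R sketch uses simulated data and three arbitrarily chosen knots; spline_basis is a hypothetical helper written for this illustration, not a function from any package.

```r
# Truncated power basis: one column [(x - knot)_+]^degree per knot.
# spline_basis is a hypothetical helper, not part of any package.
spline_basis <- function(x, knots, degree = 3) {
  B <- sapply(knots, function(k) pmax(x - k, 0)^degree)
  colnames(B) <- paste0("k", seq_along(knots))
  B
}

set.seed(2)
x <- seq(0, 10, length.out = 50)
y <- sin(x) + rnorm(50, sd = 0.1)      # simulated data for illustration only
fit <- lm(y ~ x + I(x^2) + I(x^3) + spline_basis(x, knots = c(2.5, 5, 7.5)))
length(coef(fit))                      # 4 polynomial terms + 3 knot terms
```

This basis mirrors the formula in the text. With many knots it can become numerically unstable, which is one reason software often uses an equivalent but better-conditioned basis.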
yi s at each replicated x value. There are 2 observations at each replicated x, so the sample variance
computed at each x has 1 degree of freedom. Since there are two replicated xs each with one degree
of freedom for the variance estimate, the pure error has 1+1 = 2 degrees of freedom. To compute the
sum of squares for pure error, observe that when x = 181.9, the mean y is 15.517. The contribution
to the sum of squares pure error from this x value is (15.106 − 15.517)² + (15.928 − 15.517)². A
similar contribution is computed for x = 184.1 and they are added to get the sum of squares pure
error. The degrees of freedom and sum of squares for lack of fit are found by taking the values from
the original error and subtracting the values for the pure error. The F test for lack of fit examines
the mean square lack of fit divided by the mean square pure error.
Analysis of Variance.
Source df SS MS F P
Regression 1 444.17 444.17 3497.89 0.000
Error 29 3.68 0.13
(Lack of Fit) 27 3.66 0.14 10.45 0.091
(Pure Error) 2 0.03 0.01
Total 30 447.85
The F statistic for lack of fit, 10.45, seems substantially larger than 1, but because there are
only 2 degrees of freedom in the denominator, the P value is a relatively large 0.09. This method is
closely related to one-way analysis of variance as discussed in Chapter 12.
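The pure error computation amounts to comparing the simple linear regression with a model that fits a separate mean at every distinct x value. The R sketch below illustrates this on simulated data with replicated x values; the data and variable names are hypothetical and are not the Hooker data.

```r
# Simulated data (hypothetical) with replicated x values.
set.seed(3)
x <- c(1, 1, 2, 3, 3, 4, 5, 6, 6, 7)
y <- 2 + 0.5 * x + rnorm(length(x), sd = 0.3)

slr    <- lm(y ~ x)                    # reduced model: simple linear regression
oneway <- lm(y ~ factor(x))            # full model: a separate mean at each distinct x

sse  <- deviance(slr);    dfe  <- df.residual(slr)
sspe <- deviance(oneway); dfpe <- df.residual(oneway)   # pure error
f_lof <- ((sse - sspe) / (dfe - dfpe)) / (sspe / dfpe)  # MS lack of fit / MS pure error
c(F = f_lof, P = pf(f_lof, dfe - dfpe, dfpe, lower.tail = FALSE))
anova(slr, oneway)                     # reports the same F test
```

Here three x values are replicated twice, so the pure error has 3 degrees of freedom, in the same way that two replicated x values gave 2 degrees of freedom in the text.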
8.7 Exercises
EXERCISE 8.7.1. Dixon and Massey (1969) presented data on the relationship between IQ scores
and results on an achievement test in a general science course. Table 8.3 contains a subset of the data.
Fit the simple linear regression model of achievement on IQ and the quadratic model of achievement
on IQ and IQ squared. Evaluate both models and decide which is the best.
EXERCISE 8.7.2. In Exercise 7.4.2 we considered data on the relationship between farm sizes
and the acreage in corn. Fit the linear, quadratic, cubic, and quartic polynomial models to the logs
of the acreages in corn. Find the model that fits best. Check the assumptions for this model.
EXERCISE 8.7.3. Use two methods other than fitting polynomial models to test for lack of fit in
Exercise 8.7.1.
EXERCISE 8.7.4. Based on the height and weight data given in Table 8.4, fit a simple linear
regression of weight on height for these data and check the assumptions. Give a 99% confidence
interval for the mean weight of people with a 72-inch height. Test for lack of fit of the simple linear
regression model.
EXERCISE 8.7.5. Jensen (1977) and Weisberg (1985, p. 101) considered data on the outside
diameter of crank pins that were produced in an industrial process. The diameters of batches of
crank pins were measured on various days; if the industrial process is “under control” the diameters
should not depend on the day they were measured. A subset of the data is given in Table 8.5 in a
format consistent with performing a regression analysis on the data. The diameters of the crank pins
are actually 0.742 + yij × 10^−5 inches, where the yij s are reported in Table 8.5. Perform polynomial
regressions on the data. Give two lack-of-fit tests for the simple linear regression not based on
polynomial regression.
EXERCISE 8.7.6. Beineke and Suddarth (1979) and Devore (1991, p. 380) consider data on roof
supports involving trusses that use light-gauge metal connector plates. Their dependent variable is
an axial stiffness index (ASI) measured in kips per inch. The predictor variable is the length of the
light-gauge metal connector plates. The data are given in Table 8.6.
Fit linear, quadratic, cubic, and quartic polynomial regression models using powers of x, the
plate length, and using powers of x − x̄·, the plate length minus the average plate length. Compare
the results of the two procedures. If your computer program will not fit some of the models, report
on that in addition to comparing results for the models you could fit.
where the polynomial coefficients below the knot are the β j s and above the knot are the (β j + γ j )s.
Define the change polynomial as
C(x) ≡ γ0 + γ1 x + γ2 x² + γ3 x³.
To turn the two polynomials into cubic splines, we require that the two cubic polynomials be equal
at the knot but also that their first and second derivatives be equal at the knot. It is not hard to see
that this is equivalent to requiring that the change polynomial have
\[ 0 = C(191) = \left.\frac{dC(x)}{dx}\right|_{x=191} = \left.\frac{d^2C(x)}{dx^2}\right|_{x=191}, \]
where our one knot for the Hooker data is at x = 191. Show that imposing these three conditions
leads to the model
(It is easy to show that C(x) = γ3 (x − 191)³ satisfies the three conditions. It is a little harder to show
that satisfying the three conditions implies that C(x) = γ3 (x − 191)³.)
Chapter 9
Multiple (linear) regression involves predicting values of a dependent variable from the values on a
collection of other (predictor) variables. In particular, linear combinations of the predictor variables
are used in modeling the dependent variable. For the most part, the use of categorical predictors in
multiple regression is inappropriate. To incorporate categorical predictors, they need to be replaced
by 0-1 indicators for the various categories.
[Figure: scatterplot matrix of y, x1, and x2.]
206 9. MULTIPLE REGRESSION: INTRODUCTION
[Figures: scatterplot matrices involving y, x1, x2, x3, x4, and x5.]
Of the five variables, x3 , the one used in the simple linear regression, has the highest correlation.
Thus it explains more of the y variability than any other single variable. Variables x2 and x5 also
have reasonably high correlations with y. Low correlations exist between y and both x1 and x4 .
Interestingly, x1 and x4 turn out to be more important in explaining y than either x2 or x5 . However,
the explanatory power of x1 and x4 only manifests itself after x3 has been fitted to the data.
The model is
yi = β0 + β1xi1 + β2xi2 + β3 xi3 + β4 xi4 + β5 xi5 + εi , (9.1.1)
i = 1, . . . , 20, where the εi s are unobservable independent N(0, σ 2 ) random variables and the β s
are fixed unknown parameters. Fitting Model (9.1.1) with a computer program typically yields a
table of coefficients with parameter estimates, standard errors for the estimates, t ratios for testing
whether the parameters are zero, P values, and an analysis of variance table.
Table of Coefficients: Model (9.1.1)
Predictor β̂k SE(β̂k ) t P
Constant 19.95 13.63 1.46 0.165
x1 −1.793 1.233 −1.45 0.168
x2 0.04360 0.05326 0.82 0.427
x3 0.55576 0.09296 5.98 0.000
x4 1.1102 0.4338 2.56 0.023
x5 −1.811 2.027 −0.89 0.387
Analysis of Variance: Model (9.1.1)
Source df SS MS F P
Regression 5 582.69 116.54 27.08 0.000
Error 14 60.24 4.30
Total 19 642.92
From just these two tables of statistics much can be learned. In particular, the estimated regression
equation is
ŷ = 19.9 − 1.79x1 + 0.0436x2 + 0.556x3 + 1.11x4 − 1.81x5.
Substituting the observed values xi j , j = 1, . . . , 5 gives the fitted (predicted) values ŷi and the resid-
uals ε̂i = yi − ŷi .
As discussed in simple linear regression, this equation describes the relationship between y and
the predictor variables for the current data; it does not imply a causal relationship. If we go out and
increase the percentage of sixth graders whose fathers have white-collar jobs by 1%, i.e., increase
x2 by one unit, we cannot infer that mean verbal test scores will tend to increase by 0.0436 units.
In fact, we cannot think about any of the variables in a vacuum. No variable has an effect in the
equation apart from the observed values of all the other variables. If we conclude that some variable
can be eliminated from the model, we cannot conclude that the variable has no effect on y, we can
only conclude that the variable is not necessary to explain these data. The same variable may be very
important in explaining other, rather different, data collected on the same variables. All too often,
people choose to interpret the estimated regression coefficients as if the predictor variables cause
the value of y but the estimated regression coefficients simply describe an observed relationship.
Frankly, since the coefficients do not describe a causal relationship, many people, including the
author, find regression coefficients to be remarkably uninteresting quantities. What this model is
good at is predicting values of y for new cases that are similar to those in the current data. In
particular, such new cases should have predictor variables with values similar to those in the current
data.
The t statistics for testing H0 : βk = 0 were reported in the table of coefficients. For example, the
test of H0 : β4 = 0 has
\[ t_{obs} = \frac{1.1102}{0.4338} = 2.56. \]
The P value is
P = Pr[|t(dfE)| ≥ 2.56] = 0.023.
The value 0.023 indicates a reasonable amount of evidence that variable x4 is needed in the model.
We can be reasonably sure that dropping x4 from the model harms the explanatory (predictive)
power of the model. In particular, with a P value of 0.023, the test of the null model with H0 : β4 = 0
is rejected at the α = 0.05 level (because 0.05 > 0.023), but the test is not rejected at the α = 0.01
level (because 0.023 > 0.01).
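The t statistic and its two-sided P value can be reproduced from the table of coefficients alone. A minimal R sketch, with illustrative variable names:

```r
# t test of H0: beta_4 = 0 using the estimate and standard error from the table.
b4 <- 1.1102; se4 <- 0.4338; dfe <- 14
t_obs <- b4 / se4
p_val <- 2 * pt(-abs(t_obs), dfe)      # two-sided P value from the t(14) distribution
round(c(t = t_obs, P = p_val), 3)
```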
A 95% confidence interval for β3 has endpoints β̂3 ± t(0.975, dfE) SE(β̂3 ). From a t table,
t(0.975, 14) = 2.145 and from the table of coefficients the endpoints are
0.55576 ± 2.145(0.09296).
The confidence interval is (0.356, 0.755), so the data are consistent with β3 between 0.356 and
0.755.
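The interval can be checked with qt() in place of a t table. A short R sketch, with illustrative variable names:

```r
# 95% confidence interval for beta_3 from the table of coefficients.
b3 <- 0.55576; se3 <- 0.09296; dfe <- 14
ci <- b3 + c(-1, 1) * qt(0.975, dfe) * se3
round(ci, 3)
```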
The primary value of the analysis of variance table is that it gives the degrees of freedom, the
sum of squares, and the mean square for error. The mean squared error is the estimate of σ 2 , and
the sum of squares error and degrees of freedom for error are vital for comparing various regression
models. The degrees of freedom for error are n − 1 − (the number of predictor variables). The minus
1 is an adjustment for fitting the intercept β0 .
The analysis of variance table also gives the test for whether any of the x variables help to
explain y, i.e., of whether yi = β0 + εi is an adequate model. This test is rarely of interest because it
is almost always highly significant. It is a poor scholar who cannot find any predictor variables that
are related to the measurement of primary interest. (Ok, I admit to being a little judgmental here.)
The test of
H0 : β1 = · · · = β5 = 0
is based on
\[ F_{obs} = \frac{MSReg}{MSE} = \frac{116.5}{4.303} = 27.08 \]
and (typically) is rejected for large values of F. The numerator and denominator degrees of freedom
come from the ANOVA table. As suggested, the corresponding P value in the ANOVA table is
infinitesimal, zero to three decimal places. Thus these x variables, as a group, help to explain the
variation in the y variable. In other words, it is possible to predict the mean verbal test scores for a
school’s sixth grade class from the five x variables measured. Of course, the fact that some predictive
ability exists does not mean that the predictive ability is sufficient to be useful.
The coefficient of determination, R2 , measures the predictive ability of the model. It is the
squared correlation between the (ŷi , yi ) pairs and also is the percentage of the total variability in
y that is explained by the x variables. If this number is large, it suggests a substantial predictive
ability. In this example
\[ R^2 \equiv \frac{SSReg}{SSTot} = \frac{582.69}{642.92} = 0.906, \]
so 90.6% of the total variability is explained by the regression model. This large percentage suggests
that the five x variables have substantial predictive power. However, we saw in Section 7.1 that a
large R2 does not imply that the model is good in absolute terms. It may be possible to show that
this model does not fit the data adequately. In other words, this model is explaining much of the
variability but we may be able to establish that it is not explaining as much of the variability as it
ought. Conversely, a model with a low R2 value may be the perfect model but the data may simply
have a great deal of variability. Moreover, even an R2 of 0.906 may be inadequate for the predictive
purposes of the researcher, while in some situations an R2 of 0.3 may be perfectly adequate. It
depends on the purpose of the research. Finally, a large R2 may be just an unrepeatable artifact of
a particular data set. The coefficient of determination is a useful tool but it must be used with care.
Recall from Section 7.1 that the R2 was 0.86 when using just x3 to predict y.
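Both the overall F test and R² come straight from the ANOVA table for Model (9.1.1). The following R sketch reproduces them from the printed sums of squares; the variable names are illustrative.

```r
# Overall F test and R^2 from the ANOVA table for Model (9.1.1).
ss_reg <- 582.69; df_reg <- 5
sse    <- 60.24;  dfe    <- 14
ss_tot <- 642.92

f_obs <- (ss_reg / df_reg) / (sse / dfe)
p_val <- pf(f_obs, df_reg, dfe, lower.tail = FALSE)
r2    <- ss_reg / ss_tot
round(c(F = f_obs, P = p_val, R2 = r2), 3)
```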
# Assumes (from code not shown here) that co is the fitted lm object for
# Model (9.1.1), cop is summary(co), and y is the response vector.
# Confidence intervals
confint(co, level=0.95)
# Predictions
new = data.frame(x1=2.07, x2=9.99, x3=-16.04, x4=21.6, x5=5.17)
predict(co, new, se.fit=TRUE, interval="confidence")
predict(co, new, interval="prediction")
# Diagnostics table: n rows (n = cop$df[1] + cop$df[2]) by 6 columns
infv = c(y, co$fit, hatvalues(co), rstandard(co), rstudent(co), cooks.distance(co))
inf = matrix(infv, I(cop$df[1]+cop$df[2]), 6, dimnames = list(NULL,
    c("y", "yhat", "lev", "r", "t", "C")))
inf
# Wilk-Francia statistic
rankit = qnorm(ppoints(rstandard(co), a=I(3/8)))
ys = sort(rstandard(co))
Wprime = (cor(rankit, ys))^2
Wprime
yi = β0 + β1 xi1 + · · · + βp−1 xi,p−1 + εi ,
where the subscript i = 1, . . . , n indicates different observations and the εi s are independent N(0, σ 2 )
random variables. The β j s and σ 2 are unknown constants and are the fundamental parameters of
the regression model.
Estimates of the β j s are obtained by the method of least squares. The least squares estimates are
those that minimize
\[ \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_{i1} - \beta_2 x_{i2} - \cdots - \beta_{p-1} x_{i,p-1} \right)^2 . \]
In this function the yi s and the xi j s are all known quantities. Least squares estimates have a number
of interesting statistical properties. If the errors are independent with mean zero, constant variance,
and are normally distributed, the least squares estimates are maximum likelihood estimates (MLEs)
and minimum variance unbiased estimates (MVUEs). If we keep the assumptions of mean zero
and constant variance but weaken the independence assumption to that of the errors being merely
uncorrelated and stop assuming normal distributions, the least squares estimates are best (minimum
variance) linear unbiased estimates (BLUEs).
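As a small illustration of least squares, the minimizing coefficients can be obtained by solving the normal equations X'Xb = X'y, a standard result that is not derived in this section. The R sketch below uses simulated data and compares the result with lm(); the data and variable names are assumptions for illustration.

```r
# Least squares estimates from the normal equations, compared with lm().
set.seed(4)
n  <- 30
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n)       # simulated data for illustration only

X <- cbind(1, x1, x2)                  # design matrix with an intercept column
beta_hat <- solve(crossprod(X), crossprod(X, y))   # solves X'X b = X'y

fit <- lm(y ~ x1 + x2)
cbind(normal_eq = drop(beta_hat), lm = coef(fit))
```

In practice lm() uses a QR decomposition rather than the normal equations because it is numerically more stable, but the two agree on well-conditioned data like these.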
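In checking assumptions we often use the predictions (fitted values) ŷ corresponding to the
observed values of the predictor variables, i.e.,
\[ \hat{y}_i = \hat\beta_0 + \hat\beta_1 x_{i1} + \cdots + \hat\beta_{p-1} x_{i,p-1}, \qquad i = 1, \ldots, n. \]
The residuals are ε̂i = yi − ŷi,
and the estimate of σ² is the mean squared error (residual mean square)
\[ MSE = \frac{1}{n-p} \sum_{i=1}^{n} \hat\varepsilon_i^{\,2} . \]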
The MSE is an unbiased estimate of σ² in that E(MSE) = σ². Under the standard normality assumptions,
MSE is the minimum variance unbiased estimate of σ². However, the maximum likelihood
estimate of σ² is σ̂² = SSE/n. Unless discussing SAS’s PROC GENMOD, we will never use the
MLE of σ².
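The two estimates of σ² differ only in their divisor: the MSE divides the error sum of squares by n − p, while the maximum likelihood estimate divides it by n. A minimal R sketch using the values from Model (9.1.1), where n = 20 and p = 6; the variable names are illustrative.

```r
# Unbiased estimate (MSE) versus maximum likelihood estimate of sigma^2.
sse <- 60.24; n <- 20; p <- 6
mse <- sse / (n - p)                   # divides by the error degrees of freedom
mle <- sse / n                         # divides by the sample size
round(c(MSE = mse, MLE = mle), 3)
```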
Details of the estimation procedures are given in Chapter 11.
z = β0 + β1 x1 + β2 x2 + β3 x3 + β4 x4 + β5 x5
for some values of the predictor variables. The estimated regression surface is
There are two problems of interest. The first is estimating the value z on the regression surface for
a fixed set of predictor variables. The second is predicting the value of a new observation to be
obtained with a fixed set of predictor variables. For any set of predictor variables, the estimate of the
regression surface and the prediction are identical. What differs are the standard errors associated
with the different problems.
Consider estimation and prediction at
(x1, x2, x3, x4, x5) = (2.07, 9.99, −16.04, 21.6, 5.17).
These are the minimum values for each of the variables, so there will be substantial variability in
estimating the regression surface at this point. The estimator (predictor) is
\[ \hat{y} = \hat\beta_0 + \sum_{j=1}^{5} \hat\beta_j x_j = 19.9 - 1.79(2.07) + 0.0436(9.99) + 0.556(-16.04) + 1.11(21.6) - 1.81(5.17) = 22.375. \]
For constructing 95% t intervals, the percentile needed is t(0.975, 14) = 2.145.
The 95% confidence interval for the point β0 + ∑_{j=1}^{5} βj xj on the regression surface uses the
standard error for the regression surface, which is
SE(Surface) = 1.577.
The standard error is obtained from the regression program and depends on the specific value of
(x1 , x2 , x3 , x4 , x5 ). The formula for the standard error is given in Section 11.4. This interval has
endpoints
22.375 ± 2.145(1.577),
which gives the interval
(18.992, 25.757).
In this example,
\[ SE(Prediction) = \sqrt{4.303 + (1.577)^2} = 2.606, \]
and the prediction interval endpoints are
22.375 ± 2.145(2.606).
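The two intervals share a center and differ only in their standard errors. The following R sketch reproduces the arithmetic from the values quoted in the text; the variable names are illustrative, and the t percentile is taken as the tabled 2.145 rather than recomputed.

```r
# Confidence interval for the surface and prediction interval for a new case.
y_hat   <- 22.375
se_surf <- 1.577
mse     <- 4.303
t_crit  <- 2.145                       # t(0.975, 14), as quoted from the t table

se_pred  <- sqrt(mse + se_surf^2)      # adds the variance of a new observation
conf_int <- y_hat + c(-1, 1) * t_crit * se_surf
pred_int <- y_hat + c(-1, 1) * t_crit * se_pred
round(se_pred, 3)
round(rbind(confidence = conf_int, prediction = pred_int), 3)
```

The prediction interval is always the wider of the two, since SE(Prediction) includes the MSE in addition to the squared standard error of the surface.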
We mentioned earlier that even if the regression model is true, the variance of predictions is
large when the x j values for the prediction are far from the original data. We can use this fact to
identify situations in which the predictions are unreliable because the locations are too far away. Let
p − 1 be the number of predictor variables so that, including the intercept, there are p regression
parameters. Let n be the number of observations. A sensible rule of thumb is that we should start
worrying about the validity of the prediction whenever
\[ \frac{SE(Surface)}{\sqrt{MSE}} \ge \sqrt{\frac{2p}{n}} \]
and we should be very concerned about the validity of the prediction whenever
\[ \frac{SE(Surface)}{\sqrt{MSE}} \ge \sqrt{\frac{3p}{n}} . \]
Recall that for simple linear regression we suggested that leverages greater than 4/n cause concern
and those greater than 6/n cause considerable concern. In general, leverages greater than 2p/n and
3p/n cause these levels of concern. The simple linear regression guidelines are based on having p = 2.
We are comparing SE(Surface)/√MSE to the square roots of these guidelines. In our example,
p = 6 and n = 20, so
\[ \frac{SE(Surface)}{\sqrt{MSE}} = \frac{1.577}{\sqrt{4.303}} = 0.760 < 0.775 = \sqrt{\frac{2p}{n}} . \]
The location of this prediction is near the boundary of those locations for which we feel comfortable
making predictions.
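The rule of thumb is simple to evaluate once SE(Surface) and the MSE are available. A minimal R sketch with the values from this example; the variable names are illustrative.

```r
# Rule-of-thumb check on how far the prediction point is from the data.
se_surf <- 1.577; mse <- 4.303; p <- 6; n <- 20
ratio <- se_surf / sqrt(mse)
c(ratio = ratio, worry = sqrt(2 * p / n), very_concerned = sqrt(3 * p / n))
```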
9.3 Comparing regression models
A frequent goal in regression analysis is to find the simplest model that provides an adequate expla-
nation of the data. In examining the full model with all five x variables, there is little evidence that
any of x1 , x2 , or x5 are needed in the regression model. The t tests reported in Section 9.1 for the
corresponding regression parameters gave P values of 0.168, 0.427, and 0.387. We could drop any
one of the three variables without significantly harming the model. While this does not imply that
all three variables can be dropped without harming the model, dropping the three variables makes
an interesting point of departure.
Fitting the reduced model
yi = β0 + β3xi3 + β4 xi4 + εi
gives
Table of Coefficients
Predictor β̂k SE(β̂k ) t P
Constant 14.583 9.175 1.59 0.130
x3 0.54156 0.05004 10.82 0.000
x4 0.7499 0.3666 2.05 0.057
Analysis of Variance
Source df SS MS F P
Regression 2 570.50 285.25 66.95 0.000
Error 17 72.43 4.26
Total 19 642.92
We can test whether this reduced model is an adequate explanation of the data as compared
to the full model. The sum of squares for error from the full model was reported in Section 9.1 as
SSE(Full) = 60.24 with degrees of freedom dfE(Full) = 14 and mean squared error MSE(Full) =
4.30. For the reduced model we have SSE(Red.) = 72.43 and dfE(Red.) = 17. The test statistic for
the adequacy of the reduced model is
\[ F_{obs} = \frac{[SSE(Red.) - SSE(Full)] \big/ [dfE(Red.) - dfE(Full)]}{MSE(Full)} = \frac{[72.43 - 60.24]/[17 - 14]}{4.30} = 0.94. \]
F has [dfE(Red.) − dfE(Full)] and dfE(Full) degrees of freedom in the numerator and denom-
inator, respectively. Here F is about 1, so it is not significant. In particular, 0.94 is less than
F(0.95, 3, 14), so a formal α = .05 level one-sided F test does not reject the adequacy of the reduced
model. In other words, the .05 level one-sided test of the null model with H0 : β1 = β2 = β5 = 0 is
not rejected.
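The comparison with F(0.95, 3, 14) can be made explicit with qf() and pf(). The R sketch below uses the sums of squares quoted above; the variable names are illustrative.

```r
# Test of the reduced model (x3, x4) against the full model (x1, ..., x5).
sse_red  <- 72.43; dfe_red  <- 17
sse_full <- 60.24; dfe_full <- 14
mse_full <- 4.30

df_num <- dfe_red - dfe_full
f_obs  <- ((sse_red - sse_full) / df_num) / mse_full
f_crit <- qf(0.95, df_num, dfe_full)   # F(0.95, 3, 14)
c(F = f_obs, critical = f_crit,
  P = pf(f_obs, df_num, dfe_full, lower.tail = FALSE))
```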
This test lumps the three variables x1 , x2 , and x5 together into one big test. It is possible that the
uselessness of two of these variables could hide the fact that one of them is (marginally) significant
when added to the model with x3 and x4 . To fully examine this possibility, we need to fit three
additional models. Each variable should be added, in turn, to the model with x3 and x4 . We consider
in detail only one of these three models, the model with x1 , x3 , and x4 . From fitting this model, the
t statistic for testing whether x1 is needed in the model turns out to be −1.47. This has a P value of
0.162, so there is little indication that x1 is useful. We could also construct an F statistic as illustrated
previously. The sum of squares for error in the model with x1 , x3 , and x4 is 63.84 on 16 degrees of
freedom, so
\[ F_{obs} = \frac{[72.43 - 63.84]/[17 - 16]}{63.84/16} = 2.16 . \]
Note that, up to round-off error, F = t 2 . The tests are equivalent and the P value for the F statistic is
also 0.162. F tests are only equivalent to a corresponding t test when the numerator of the F statistic
has one degree of freedom. Methods similar to these establish that neither x2 nor x5 are important
when added to the model that contains x3 and x4 .
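The agreement between F and t² can be checked numerically from the rounded values in the text. A short R sketch, with illustrative variable names:

```r
# F statistic for adding x1 to the model with x3 and x4, compared with t^2.
f_obs <- ((72.43 - 63.84) / (17 - 16)) / (63.84 / 16)
t_obs <- -1.47
c(F = f_obs, t_squared = t_obs^2)
c(P_F = pf(f_obs, 1, 16, lower.tail = FALSE),
  P_t = 2 * pt(-abs(t_obs), 16))
```

The small discrepancy between the two statistics comes from rounding in the reported sums of squares and t value; computed from the same fitted models they would agree exactly.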
Here we are testing two models: the full model with x1 , x3 , and x4 against a reduced model with
only x3 and x4 . Both of these models are special cases of a biggest model that contains all of x1 , x2 ,
x3 , x4 , and x5 . In Subsection 3.1.1, for cases like this, we recommended an alternative F statistic,
and
yi = β0 + β1 xi1 + · · · + βq−1xi,q−1 + εi . (9.3.2)
For convenience, in this subsection we refer to equations such as (9.3.1) and (9.3.2) simply as
(1) and (2). The key fact here is that all of the variables in Model (2) are also in Model (1). In
this comparison, we dropped the last variables xi,q , . . . , xi,p−1 for notational convenience only; the
discussion applies to dropping any group of variables from Model (1). Throughout, we assume that
Model (1) gives an adequate fit to the data and then compare how well Model (2) fits the data with
how well Model (1) fits. Before applying the results of this subsection, the validity of the model (1)
assumptions should be evaluated.
We want to know if the variables xi,q , . . . , xi,p−1 are needed in the model, i.e., whether they are
useful predictors. In other words, we want to know if Model (2) is an adequate model; whether it
gives an adequate explanation of the data. The variables xq , . . . , x p−1 are extraneous if and only if
βq = · · · = β p−1 = 0. The test we develop can be considered as a test of
H0 : βq = · · · = β p−1 = 0.
Parameters are very tricky things; you never get to see the value of a parameter. I strongly
prefer the interpretation of testing one model against another model rather than the interpretation of
testing whether βq = · · · = β p−1 = 0. In practice, useful regression models are rarely correct models,
although they can be very good approximations. Typically, we do not really care whether Model (1)
is true, only whether it is useful, but dealing with parameters in an incorrect model becomes tricky.
In practice, we are looking for a (relatively) succinct way of summarizing the data. The smaller
the model, the more succinct the summarization. However, we do not want to eliminate useful
explanatory variables, so we test the smaller (more succinct) model against the larger model to see
if the smaller model gives up significant explanatory power. Note that the larger model always has
at least as much explanatory power as the smaller model because the larger model includes all the
variables in the smaller model plus some more.
Applying our model testing procedures to this problem yields the following test: Reject the
hypothesis
H0 : βq = · · · = β p−1 = 0
at the α level if
\[
F \equiv \frac{[\mathrm{SSE}(\mathrm{Red.}) - \mathrm{SSE}(\mathrm{Full})]/(p - q)}{\mathrm{MSE}(\mathrm{Full})} > F(1 - \alpha, p - q, n - p).
\]
For p − q ≥ 3, this one-sided test is not a significance test, cf. Chapter 3.
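As a rough illustration of the rejection rule, the following Python sketch computes the F statistic from the two error sums of squares. The function name and the input numbers are hypothetical, chosen only to show the arithmetic; the percentile F(1 − α, p − q, n − p) would still have to come from an F table or statistical software.

```python
def nested_f_statistic(sse_red, sse_full, n, p, q):
    """F statistic for testing a reduced model (q parameters)
    against a full model (p parameters) fitted to n observations."""
    if not (0 < q < p < n):
        raise ValueError("need 0 < q < p < n")
    numerator = (sse_red - sse_full) / (p - q)   # mean square for the dropped terms
    mse_full = sse_full / (n - p)                # MSE(Full) on n - p degrees of freedom
    return numerator / mse_full


# Hypothetical inputs, not taken from any data set in the text.
f_obs = nested_f_statistic(sse_red=150.0, sse_full=100.0, n=26, p=6, q=4)
print(f_obs)  # to be compared with F(1 - alpha, 2, 20)
```

The denominator always comes from the full model, which is why the adequacy of Model (1) has to be checked before the test is trusted.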
The notation SSE(Red.) − SSE(Full) focuses on the ideas of full and reduced models. Other
notations that focus on variables and parameters are also commonly used. One can view the model
comparison procedure as fitting Model (2) first and then seeing how much better Model (1) fits.
The notation based on this refers to the (extra) sum of squares for regressing on xq , . . . , x p−1 after
regressing on x1 , . . . , xq−1 and is written
SSR(xq, . . . , x p−1 |x1 , . . . , xq−1 ) ≡ SSE(Red.) − SSE(Full).
This notation assumes that the model contains an intercept. Alternatively, one can think of fitting the
parameters βq , . . . , β p−1 after fitting the parameters β0 , . . . , βq−1 . The relevant notation refers to the
reduction in sum of squares (for error) due to fitting βq , . . . , β p−1 after β0 , . . . , βq−1 and is written
R(βq , . . . , β p−1 |β0 , . . . , βq−1 ) ≡ SSE(Red.) − SSE(Full).
Note that it makes perfect sense to refer to SSR(xq, . . . , x p−1 |x1 , . . . , xq−1 ) as the reduction in sum of
squares for fitting xq , . . . , x p−1 after x1 , . . . , xq−1 .
It was mentioned earlier that the degrees of freedom for SSE(Red.) − SSE(Full) is p − q. Note
that p − q is the number of variables to the left of the vertical bar in SSR(xq, . . . , x p−1 |x1 , . . . , xq−1 )
and the number of parameters to the left of the vertical bar in R(βq , . . . , β p−1 |β0 , . . . , βq−1 ).
A point that is quite clear when thinking of model comparisons is that if you change either
model, (1) or (2), the test statistic and thus the test changes. This point continues to be clear when
dealing with the notations SSR(xq, . . . , x p−1 |x1 , . . . , xq−1 ) and R(βq , . . . , β p−1 |β0 , . . . , βq−1 ). If you
change any variable on either side of the vertical bar, you change SSR(xq, . . . , x p−1 |x1 , . . . , xq−1 ).
Similarly, the parametric notation R(βq , . . . , β p−1|β0 , . . . , βq−1 ) is also perfectly precise, but confu-
sion can easily arise when dealing with parameters if one is not careful. For example, when testing,
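say, H0 : β1 = β3 = 0, the tests are completely different in the three models
yi = β0 + β1xi1 + β3xi3 + εi , (9.3.3)
yi = β0 + β1xi1 + β2xi2 + β3xi3 + εi , (9.3.4)
yi = β0 + β1xi1 + β2xi2 + β3xi3 + β4 xi4 + εi . (9.3.5)
In Model (3) the test is based on SSR(x1, x3 ) ≡ R(β1 , β3 |β0 ). In Model (4) the test uses
SSR(x1, x3 |x2 ) ≡ R(β1 , β3 |β0 , β2 ).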
Model (5) uses SSR(x1, x3 |x2 , x4 ) ≡ R(β1 , β3 |β0 , β2 , β4 ). In all cases we are testing β1 = β3 = 0 after
fitting all the other parameters in the model. In general, we think of testing H0 : βq = · · · = β p−1 = 0
after fitting β0 , . . . , βq−1 .
If the reduced model is obtained by dropping out only one variable, e.g., if q − 1 = p − 2, the
parametric hypothesis is H0 : β p−1 = 0. We have just developed an F test for this and we have
earlier used a t test for the hypothesis. In multiple regression, just as in simple linear regression,
the F test is equivalent to the t test. It follows that the t test must be considered as a test for the
parameter after fitting all of the other parameters in the model. In particular, the t tests reported in
the table of coefficients when fitting a regression tell you only whether a variable can be dropped
relative to the model that contains all the other variables. These t tests cannot tell you whether more
than one variable can be dropped from the fitted model. If you drop any variable from a regression
model, all of the t tests change. It is only for notational convenience that we are discussing testing
β p−1 = 0; the results hold for any βk .
The SSR notation can also be used to find SSEs. Consider models (3), (4), and (5) and suppose
we know SSR(x2|x1 , x3 ), SSR(x4|x1 , x2 , x3 ), and the SSE from Model (5). We can easily find the
SSEs for models (3) and (4). By definition,
SSE(4) = SSR(x4|x1 , x2 , x3 ) + SSE(5).
Also
SSE(3) = SSR(x2|x1 , x3 ) + SSE(4).
yi = β0 + β1xi1 + εi ,
yi = β0 + β1xi1 + β2xi2 + εi ,
yi = β0 + β1xi1 + β2xi2 + β3xi3 + εi ,
yi = β0 + β1xi1 + β2xi2 + β3xi3 + β4 xi4 + εi .
The sequence is determined by the order in which the variables are specified. If the identical model
is specified in the form
yi = β0 + β3xi3 + εi ,
yi = β0 + β3xi3 + β1xi1 + εi ,
yi = β0 + β3xi3 + β1xi1 + β4 xi4 + εi ,
yi = β0 + β3xi3 + β1xi1 + β4 xi4 + β2 xi2 + εi .
Frequently, programs that fit sequences of models also provide sequences of sums of squares.
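Thus the first sequence of models yields SSR(x1), SSR(x2|x1 ), SSR(x3|x1 , x2 ), and SSR(x4|x1 , x2 , x3 ),
while the second sequence yields SSR(x3), SSR(x1|x3 ), SSR(x4|x3 , x1 ), and SSR(x2|x3 , x1 , x4 ).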
These can be used in a variety of ways. For example, as shown at the end of the previous section, to
test
yi = β0 + β1xi1 + β3 xi3 + εi
against
yi = β0 + β1xi1 + β2xi2 + β3 xi3 + β4 xi4 + εi
we need SSR(x2, x4 |x3 , x1 ). This is easily obtained from the second sequence as
SSR(x2, x4 |x3 , x1 ) = SSR(x4|x3 , x1 ) + SSR(x2|x3 , x1 , x4 ).
to the Coleman Report data, we get the sequential sums of squares listed below.
Recall that the MSE for the five-variable model is 4.30 on 14 degrees of freedom.
From the sequential sums of squares we can test a variety of hypotheses related to the full model.
For example, we can test whether variable x5 can be dropped from the five-variable model. The F
statistic is 3.43/4.30, which is less than 1, so the effect of x5 is insignificant. This test is equivalent
to the t test for x5 given in Section 9.1 when fitting the five-variable model. We can also test whether
we can drop both x4 and x5 from the full model. The F statistic is
\[
F_{obs} = \frac{(25.91 + 3.43)/2}{4.30} = 3.41.
\]
F(0.95, 2, 14) = 3.74, so this F statistic provides little evidence that the pair of variables is needed.
(The relative importance of x4 is somewhat hidden by combining it in a test with the unimportant
x5 .) Similar tests can be constructed for dropping x3 , x4 , and x5 , for dropping x2 , x3 , x4 , and x5 , and
for dropping x1 , x2 , x3 , x4 , and x5 from the full model. The last of these is just the ANOVA table F
test.
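The arithmetic behind these tests can be sketched in a few lines of Python. The sketch uses only the numbers quoted in this example (the sequential sums of squares for x4 and x5 and the five-variable MSE); the helper name is hypothetical. It works only because x4 and x5 are the last terms in the fitted sequence, so their sequential sums of squares add up to the sum of squares for dropping them together.

```python
# Values quoted in the example: sequential sums of squares for x4 and x5,
# and the MSE of the five-variable model (14 degrees of freedom).
seq_ss = {"x4": 25.91, "x5": 3.43}
mse_full = 4.30


def f_for_dropping(last_terms):
    """F statistic for dropping the listed trailing terms of the sequence."""
    ss = sum(seq_ss[name] for name in last_terms)
    return (ss / len(last_terms)) / mse_full


f_x5 = f_for_dropping(["x5"])            # 3.43 / 4.30
f_x4_x5 = f_for_dropping(["x4", "x5"])   # [(25.91 + 3.43) / 2] / 4.30
print(round(f_x5, 2), round(f_x4_x5, 2))
```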
We can also make a variety of tests related to ‘full’ models that do not include all five variables.
In the previous paragraph, we found little evidence that the pair x4 and x5 help explain the data in the
five-variable model. We now test whether x4 can be dropped when we have already dropped x5 . In
other words, we test whether x4 adds explanatory power to the model that contains x1 , x2 , and x3 . The
numerator has one degree of freedom and is SSR(x4|x1 , x2 , x3 ) = 25.91. The usual denominator mean
square for this test is the MSE from the model with x1 , x2 , x3 , and x4 , i.e., {14(4.303) + 3.43}/15.
(For numerical accuracy we have added another significant digit to the MSE from the five-variable
model. The SSE from the model without x5 is just the SSE from the five-variable model plus the
sequential sum of squares SSR(x5|x1 , x2 , x3 , x4 ).) Our best practice would be to construct the test
using the same numerator mean square but the MSE from the five-variable model in the denominator
of the test. Using this second denominator, the F statistic is 25.91/4.30 = 6.03. Corresponding F
percentiles are F(0.95, 1, 14) = 4.60 and F(0.99, 1, 14) = 8.86, so x4 may be contributing to the
model. If we had used the MSE from the model with x1 , x2 , x3 , and x4 , the F statistic would be
equivalent to the t statistic for dropping x4 that is obtained when fitting this four-variable model.
If we wanted to test whether x2 and x3 can be dropped from the model that contains x1 , x2 , and
x3 , the usual denominator is [14(4.303) + 25.91 + 3.43]/16 = 5.60. (The SSE for the model without
x4 or x5 is just the SSE from the five-variable model plus the sequential sum of squares for x4 and
x5 .) Again, we would alternatively use the MSE from the five-variable model in the denominator.
Using the first denominator, the test is
\[
F_{obs} = \frac{(343.23 + 186.34)/2}{5.60} = 47.28,
\]
which is much larger than F(0.999, 2, 16) = 10.97, so there is overwhelming evidence that variables
x2 and x3 cannot be dropped from the x1 , x2 , x3 model.
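The two candidate denominators can be compared with a short Python sketch. It uses only the numbers quoted in this example; the variable names are hypothetical. Because the sketch does not round the pooled MSE to 5.60 before dividing, its F statistic differs from 47.28 in the second decimal place.

```python
df_full, mse_full = 14, 4.303          # five-variable model (extra digit from the text)
ss_x2, ss_x3, ss_x4, ss_x5 = 343.23, 186.34, 25.91, 3.43

sse_full = df_full * mse_full          # SSE of the five-variable model
# Pooling x4 and x5 into the error gives the MSE of the x1, x2, x3 model.
mse_123 = (sse_full + ss_x4 + ss_x5) / (df_full + 2)

numerator = (ss_x2 + ss_x3) / 2        # mean square for x2 and x3 after x1
f_pooled = numerator / mse_123         # the "usual" denominator
f_full = numerator / mse_full          # five-variable MSE in the denominator
print(round(mse_123, 2), round(f_pooled, 1), round(f_full, 1))
```

Pooling the sums of squares for x4 and x5 into the error raises the denominator from about 4.30 to about 5.60, so the pooled version of the statistic is the smaller of the two.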
The argument for basing tests on the MSE from the five-variable model is that it is less subject to
bias than the other MSEs. In the test given in the previous paragraph, the MSE from the usual ‘full’
model incorporates the sequential sums of squares for x4 and x5 . A reason for doing this is that we
have tested x4 and x5 and are not convinced that they are important. As a result, their sums of squares
are incorporated into the error. Even though we may not have established an overwhelming case for
the importance of either variable, there is some evidence that x4 is a useful predictor when added to
the first three variables. The sum of squares for x4 may or may not be large enough to convince us
of its importance but it is large enough to change the MSE from 4.30 in the five-variable model to
5.60 in the x1 , x2 , x3 model. In general, if you test terms and pool them with the Error whenever the
test is insignificant, you are biasing the MSE that results from this pooling. ✷
In general, when given the ANOVA table and the sequential sums of squares, we can test any
model in the sequence against any reduced model that is part of the sequence. We cannot use these
statistics to obtain a test involving a model that is not part of the sequence.
\[
r^2_{y1\cdot 234} = \frac{\mathrm{SSR}(x_1 | x_2, x_3, x_4)}{\mathrm{SSE}(x_2, x_3, x_4)},
\]
where SSE(x2, x3 , x4 ) is the sum of squares error from a model with an intercept and the three
predictors x2 , x3 , x4 . The squared sample partial correlation coefficient between y and x2 given x1 ,
x3 , and x4 is
\[
r^2_{y2\cdot 134} = \frac{\mathrm{SSR}(x_2 | x_1, x_3, x_4)}{\mathrm{SSE}(x_1, x_3, x_4)}.
\]
Alternatively, the sample partial correlation ry2·134 is precisely the ordinary sample correlation
computed between the residuals from fitting
yi = β0 + β1 xi1 + β3 xi3 + β4xi4 + εi (9.6.1)
and the residuals from fitting
xi2 = γ0 + γ1 xi1 + γ3 xi3 + γ4 xi4 + εi . (9.6.2)
Actually, the residuals from models (9.6.1) and (9.6.2) give the basis for the perfect plot to
evaluate whether adding variable x2 will improve Model (9.6.1). Simply plot the residuals yi −
ŷi from Model (9.6.1) against the residuals xi2 − x̂i2 from Model (9.6.2). If there seems to be no
relationship between the yi − ŷi s and the xi2 − x̂i2 s, x2 will not be important in Model (9.6.3). If the
plot looks clearly linear, x2 will be important in Model (9.6.3). When a linear relationship exists in
the plot but is due to the existence of a few points, those points are the dominant cause for x2 being
important in Model (9.6.3). These added variable plots work because the least squares
estimate of β2 from Model (9.6.3) is identical to the least squares estimate of β2 from the regression
through the origin
(yi − ŷi ) = β2 (xi2 − x̂i2 ) + εi .
See Christensen (2011, Exercise 9.2).
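The identity behind added variable plots can be checked numerically. The Python sketch below is only an illustration under stated assumptions: the data are made up, there is a single "other" predictor x1 instead of three, and the helper functions solve and ols are hypothetical names written here with the standard library. It checks that the slope of the residual-on-residual regression through the origin matches the coefficient of x2 in the larger model, and that the squared correlation between the two sets of residuals matches SSR(x2|x1)/SSE(x1).

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * w for v, w in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(columns, y):
    """Least squares with an intercept; returns (coefficients, residuals)."""
    rows = [[1.0] + list(vals) for vals in zip(*columns)]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    resid = [yi - sum(b * v for b, v in zip(beta, r)) for r, yi in zip(rows, y)]
    return beta, resid


# Small made-up data set (hypothetical, for illustration only).
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x2 = [2.0, 1.0, 4.0, 3.0, 7.0, 5.0, 8.0, 6.0]
y = [3.1, 3.9, 7.2, 6.8, 11.9, 10.1, 14.2, 12.8]

beta_full, full_resid = ols([x1, x2], y)   # y on x1 and x2
_, y_resid = ols([x1], y)                  # y on x1 only
_, x2_resid = ols([x1], x2)                # x2 on x1 only

# Regression through the origin of the y residuals on the x2 residuals.
sab = sum(a * b for a, b in zip(y_resid, x2_resid))
slope = sab / sum(b * b for b in x2_resid)

# Squared partial correlation two ways: residual correlation and SSR/SSE.
sse_red = sum(e * e for e in y_resid)
sse_full = sum(e * e for e in full_resid)
r = sab / (sse_red * sum(b * b for b in x2_resid)) ** 0.5
r2_partial = (sse_red - sse_full) / sse_red
print(abs(slope - beta_full[2]) < 1e-8, abs(r * r - r2_partial) < 1e-8)
```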
E XAMPLE 9.6.2. For the school data, Figure 9.5 gives the added variable plot to determine
whether the variable x3 adds to the model that already contains x1 , x2 , x4 , and x5 . A clear linear
relationship exists, so x3 will improve the model. Here the entire data support the linear relation-
ship, but there are a couple of unusual cases. The second smallest x3 residual has an awfully large y
residual and the largest x3 residual has a somewhat surprisingly small y residual. ✷
Figure 9.5: Added variable plot: y residuals (yresid, vertical axis) versus x3 residuals (x3resid, horizontal axis); Coleman Report data.
9.7 Collinearity
Collinearity exists when the predictor variables x1 , . . . , x p−1 are correlated. We have n observations
on each of these variables, so we can compute the sample correlations between them. Typically, the
x variables are assumed to be fixed and not random. For data like the Coleman Report, we have
a sample of schools so the predictor variables really are random. But for the purpose of fitting the
regression we treat them as fixed. (Probabilistically, we look at the conditional distribution of y given
the predictor variables.) In some applications, the person collecting the data actually has control
over the predictor variables so they truly are fixed. If the x variables are fixed and not random,
there is some question as to what a correlation between two x variables means. Actually, we are
concerned with whether the observed variables are orthogonal, but that turns out to be equivalent
to having sample correlations of zero between the x variables. Nonzero sample correlations indicate
nonorthogonality, thus we need not concern ourselves with the interpretation of sample correlations
between nonrandom samples.
In regression, it is almost unheard of to have x variables that display no collinearity (correlation)
[unless the variables are constructed to have no correlation]. In other words, observed x variables are
almost never orthogonal. The key ideas in dealing with collinearity were previously incorporated
into the discussion of comparing regression models. In fact, the methods discussed earlier were built
around dealing with the collinearity of the x variables. This section merely reviews a few of the main
ideas.
1. The estimate of any parameter, say β̂2 , depends on all the variables that are included in the model.
2. The sum of squares for any variable, say x2 , depends on all the other variables that are included
in the model. For example, none of SSR(x2), SSR(x2|x1 ), and SSR(x2|x3 , x4 ) would typically be
equal.
3. Suppose the model
yi = β0 + β1xi1 + β2xi2 + β3xi3 + εi
is fitted and we obtain t statistics for each parameter. If the t statistic for testing H0 : β1 = 0 is
small, we are led to the model
yi = β0 + β2xi2 + β3xi3 + εi .
If the t statistic for testing H0 : β2 = 0 is small, we are led to the model
yi = β0 + β1xi1 + β3xi3 + εi .
However, if the t statistics for both tests are small, we are not led to the model
yi = β0 + β3 xi3 + εi .
To arrive at the model containing only the intercept and x3 , one must at some point use the model
containing only the intercept and x3 as a reduced model.
4. A moderate amount of collinearity has little effect on predictions and therefore little effect on
SSE, R2 , and the explanatory power of the model. Collinearity increases the variance of the
β̂k s, making the estimates of the parameters less reliable. (I told you not to rely on parameters
anyway.) Depending on circumstances, sometimes a large amount of collinearity can have an
effect on predictions. Just by chance, one may get a better fit to the data than can be justified
scientifically.
The complications associated with points 1 through 4 all vanish if the sample correlations between
the x variables are all zero.
Many computer programs will print out a matrix of correlations between the variables. One
would like to think that if all the correlations between the x variables are reasonably small, say less
than 0.3 or 0.4, then the problems of collinearity would not be serious. Unfortunately, that is simply
not true. To avoid difficulties with collinearity, not only do all the correlations need to be small but
all of the partial correlations among the x variables must be small. Thus, small correlations alone
do not ensure small collinearity.
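A constructed example may make this concrete. The Python sketch below is a hypothetical construction, not data from the text: it builds 16 mutually orthogonal, mean-zero predictors from ±1 Walsh patterns and adds a seventeenth predictor equal to their sum. Every pairwise sample correlation is then at most 0.25, yet the last predictor is an exact linear combination of the others, which is the most extreme collinearity possible.

```python
from math import sqrt

n, m = 32, 16


def walsh(i, j):
    """Entry i of the j-th +/-1 Walsh column: (-1) ** popcount(i & j)."""
    return -1.0 if bin(i & j).count("1") % 2 else 1.0


# m mutually orthogonal, mean-zero predictors, plus one more predictor
# that is exactly their sum (hypothetical construction).
xs = [[walsh(i, j) for i in range(n)] for j in range(1, m + 1)]
xs.append([sum(col[i] for col in xs) for i in range(n)])


def corr(u, v):
    """Ordinary sample correlation between two lists of equal length."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / sqrt(suu * svv)


largest = max(abs(corr(xs[a], xs[b]))
              for a in range(len(xs)) for b in range(a))
print(largest)  # largest pairwise correlation among the 17 predictors
```

With all 17 predictors in a model, the least squares estimates would not even be unique, although no entry in the correlation matrix would raise an alarm.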
E XAMPLE 9.7.1. The correlations among predictors for the Coleman data are given below.
x1 x2 x3 x4 x5
x1 1.000 0.181 0.230 0.503 0.197
x2 0.181 1.000 0.827 0.051 0.927
x3 0.230 0.827 1.000 0.183 0.819
x4 0.503 0.051 0.183 1.000 0.124
x5 0.197 0.927 0.819 0.124 1.000
A visual display of these relationships was provided in Figures 9.1–9.4.
Note that x3 is highly correlated with x2 and x5 . Since x3 is highly correlated with y, the fact that
x2 and x5 are also quite highly correlated with y is not surprising. Recall that the correlations with y
were given at the beginning of Section 9.1. Moreover, since x3 is highly correlated with x2 and x5 ,
it is also not surprising that x2 and x5 have little to add to a model that already contains x3 . We have
seen that it is the two variables x1 and x4 , i.e., the variables that do not have high correlations with
either x3 or y, that have the greater impact on the regression equation.
Having regressed y on x3 , the sample correlations between y and any of the other variables are no
longer important. Having done this regression, it is more germane to examine the partial correlations
between y and the other variables after adjusting for x3 . However, as we will see in our discussion
of variable selection in Chapter 10, even this has its drawbacks. ✷
As long as points 1 through 4 are kept in mind, a moderate amount of collinearity is not a big
problem. For severe collinearity, there are four common approaches: a) classical ridge regression,
b) generalized inverse regression, c) principal components regression, and d) canonical regression.
Classical ridge regression is probably the best known of these methods. The other three methods
are closely related and seem quite reasonable. Principal components regression is discussed in Sec-
tion 11.6. Another procedure, lasso regression, is becoming increasingly popular but it is consider-
ably more difficult to understand how it works, cf. Section 10.5.
9.8 MORE ON MODEL TESTING 223
E XAMPLE 9.8.1. Dixon and Massey (1983) report data from the Los Angeles Heart Study su-
pervised by J. M. Chapman. The variables are y, weight in pounds; x1 , age in years; x2 , systolic
blood pressure in millimeters of mercury; x3 , diastolic blood pressure in millimeters of mercury; x4 ,
cholesterol in milligrams per dl; x5 , height in inches. The data from 60 men are given in Table 9.1.
For now, our interest is not in analyzing the data but in illustrating modeling techniques. We
fitted the basic multiple regression model
which involves regressing y on the four variables x1 , x2 + x3, x4 , x5 . The fitted equation is
With
Analysis of Variance for Model (9.8.3)
Source df SS MS F P
Regression 4 4575.5 1143.9 2.35 0.065
Residual Error 55 26764.5 486.6
Total 59 31340.0
the test statistic for whether the reduced model fits is
\[
F_{obs} = \frac{(26764.5 - 24009.6)/(55 - 54)}{444.6} = 6.20.
\]
The one-sided P value is 0.016, i.e., 6.20 = F(1 − .016, 1, 54). Clearly the reduced model fits inad-
equately. Replacing the blood pressures by their difference does not predict as well as having the
blood pressures in the model.
It would have worked equally well to have written β3 = −β2 and fitted the reduced model
Tests for proportional coefficients are similar to the previous illustrations. For example, we could
test if the coefficient for x2 (sbp) is 40 times smaller than for x3 (dbp). To test H0 : 40β2 = β3 , the
reduced model becomes
The term 0.5xi3 is a known constant for each observation i, often called an offset. Such terms are
easy to handle in linear models: just take them to the other side of the equation,
or
ŷ = −113 + 0.026x1 + 0.097x2 + 0.597x3 − 0.0201x4 + 3.27x5.
The ANOVA table for the reduced model (9.8.5) is
or
ŷ = −130 + 0.045x1 + 0.019x2 + 0.719x3 − 0.0203x4 + 3.5x5.
The ANOVA table is
Analysis of Variance for Model (9.8.6)
Source df SS MS F P
Regression 4 3583.3 895.8 2.05 0.100
Residual Error 55 24027.9 436.9
Total 59 27611.2
and testing the models in the usual way gives
\[
F_{obs} = \frac{(24027.9 - 24009.6)/(55 - 54)}{444.6} = 0.041
\]
for a one-sided P value of 0.84. The reduced model (9.8.6) is consistent with the data.
Alternatively, we could test H0 : β5 = 3.5 from the original table of coefficients for Model (9.8.1)
by computing
\[
t_{obs} = \frac{3.248 - 3.5}{1.241} = -0.203
\]
and comparing the result to a t(54) distribution. The square of the t statistic equals the F statistic.
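The agreement between the two statistics is easy to check with the numbers quoted above. This short Python sketch only repeats the arithmetic; the small discrepancy in the later decimal places comes from rounding in the reported estimate, standard error, and sums of squares.

```python
# Numbers quoted in the text for Model (9.8.1) and reduced Model (9.8.6).
t_obs = (3.248 - 3.5) / 1.241
f_obs = ((24027.9 - 24009.6) / (55 - 54)) / 444.6
print(round(t_obs, 3), round(t_obs ** 2, 3), round(f_obs, 3))
```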
Finally, we illustrate a simultaneous test of the last two hypotheses, i.e., we test H0 : β2 + 0.5 =
β3 ; β5 = 3.5. The reduced model is
or
yi − 0.5xi3 − 3.5xi5 = β0 + β1xi1 + β2 (xi2 + xi3 ) + β4 xi4 + εi . (9.8.7)
The fitted regression equation is
or
ŷ = −129 + 0.040x1 + 0.094x2 + 0.594x3 − 0.0195x4 + 3.5x5.
The ANOVA table is
9.9 ADDITIVE EFFECTS AND INTERACTION 227
Analysis of Variance for Model (9.8.7)
Source df SS MS F P
Regression 3 420.4 140.1 0.33 0.806
Residual Error 56 24058.8 429.6
Total 59 24479.2
and testing Model (9.8.7) against Model (9.8.1) in the usual way gives
\[
F_{obs} = \frac{(24058.8 - 24009.6)/(56 - 54)}{444.6} = 0.055
\]
for a one-sided P value of 0.95. In this case, the high one-sided P value is probably due less to
any problems with Model (9.8.7) and due more to me looking at the table of coefficients for Model
(9.8.1) and choosing a null hypothesis that seemed consistent with the data. Typically, hypotheses
should be suggested by previous theory or data, not inspection of the current data.
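The bookkeeping for the offset terms in Model (9.8.7) can be sketched as follows. The function names are hypothetical, and the single observation passed to offset_response is made up. The first function builds the left-hand side of (9.8.7); the second maps coefficients fitted on that scale back to the scale of Model (9.8.1), using the constraints β3 = β2 + 0.5 and β5 = 3.5. The coefficient values are read off the fitted equation reported above.

```python
def offset_response(y, x3, x5):
    """Left-hand side of (9.8.7): move the known terms to the other side."""
    return [yi - 0.5 * a - 3.5 * b for yi, a, b in zip(y, x3, x5)]


def implied_full_coefficients(b0, b1, b2, b4):
    """Coefficients on the scale of Model (9.8.1) implied by a fit of (9.8.7),
    under the constraints beta3 = beta2 + 0.5 and beta5 = 3.5."""
    return {"b0": b0, "b1": b1, "b2": b2, "b3": b2 + 0.5, "b4": b4, "b5": 3.5}


# One made-up observation, only to show the transformation of the response.
print(offset_response([200.0], [80.0], [70.0]))

# Coefficients taken from the fitted equation quoted in the text.
coef = implied_full_coefficients(-129.0, 0.040, 0.094, -0.0195)
print(round(coef["b3"], 3), coef["b5"])
```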
m(x) = β0 + β1 x1 + β2 x2 . (9.9.1)
This model displays additive effects. The relative effect of changing x1 into, say, x̃1 is the same for
any value of x2 . Specifically,
[β0 + β1 x̃1 + β2 x2 ] − [β0 + β1 x1 + β2 x2 ] = β1 (x̃1 − x1 ).
This effect does not depend on x2 , which allows us to speak about an effect for x1 . If the effect of
x1 depends on x2 , no single effect for x1 exists and we would always need to specify the value of
x2 before discussing the effect of x1 . An exactly similar argument shows that in Model (9.9.1) the
effect of changing x2 does not depend on the value of x1 .
Generally, for any two predictors x1 and x2 , an additive effects (no-interaction) model takes the
form
m(x) = h1 (x1 ) + h2(x2 ) (9.9.2)
where x = (x1 , x2 ) and h1 (·) and h2 (·) are arbitrary functions. In this case, the relative effect of
changing x1 to x̃1 is the same for any value of x2 because
[h1 (x̃1 ) + h2 (x2 )] − [h1 (x1 ) + h2 (x2 )] = h1 (x̃1 ) − h1 (x1 ),
which does not depend on x2 . An exactly similar argument shows that the effect of changing x2 does
not depend on the value of x1 . In an additive model, the effect as x1 changes can be anything at all;
it can be any function h1 , and similarly for x2 . However, the combined effect must be the sum of the
two individual effects. Other than Model (9.9.1), the most common no-interaction models for two
measurement predictors are probably a polynomial in x1 plus a polynomial in x2 , say,
\[
m(x) = \beta_0 + \sum_{r=1}^{R} \beta_{r0} x_1^r + \sum_{s=1}^{S} \beta_{0s} x_2^s. \qquad (9.9.3)
\]
An interaction model is literally any model that does not display the additive effects structure of
(9.9.2). When generalizing no-interaction polynomial models, cross-product terms are often added
to model interaction. For example, Model (9.9.1) might be expanded to
m(x) = β0 + β1x1 + β2 x2 + β3 x1 x2 .
This is an interaction model because the relative effect of changing x1 to x̃1 depends on the value of
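x2 . Specifically,
[β0 + β1 x̃1 + β2 x2 + β3 x̃1 x2 ] − [β0 + β1 x1 + β2 x2 + β3 x1 x2 ] = β1 (x̃1 − x1 ) + β3 (x̃1 − x1 )x2 ,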
where the second term depends on the value of x2 . To include interaction, the no-interaction poly-
nomial model (9.9.3) might be extended to an interaction polynomial model
\[
m(x) = \sum_{r=0}^{R} \sum_{s=0}^{S} \beta_{rs} x_1^r x_2^s. \qquad (9.9.4)
\]
These devices are easily extended to more than two predictor variables, cf. Section 10.
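The difference between additive effects and interaction can be seen numerically. In the Python sketch below, the coefficient values are hypothetical and the function names are chosen for illustration. It evaluates the change in the mean function when x1 moves from 1 to 3 at several values of x2, once for the additive model (9.9.1) and once for the cross-product model.

```python
def additive(x1, x2, b=(1.0, 2.0, 3.0)):
    """Mean function of the form (9.9.1) with hypothetical coefficients."""
    return b[0] + b[1] * x1 + b[2] * x2


def interaction(x1, x2, b=(1.0, 2.0, 3.0, 0.5)):
    """Same model with an added cross-product term b[3] * x1 * x2."""
    return b[0] + b[1] * x1 + b[2] * x2 + b[3] * x1 * x2


def effect(m, x1, x1_new, x2):
    """Change in the mean function when x1 moves to x1_new with x2 held fixed."""
    return m(x1_new, x2) - m(x1, x2)


add_effects = [effect(additive, 1.0, 3.0, x2) for x2 in (0.0, 1.0, 2.0)]
int_effects = [effect(interaction, 1.0, 3.0, x2) for x2 in (0.0, 1.0, 2.0)]
print(add_effects)   # the same at every x2
print(int_effects)   # changes with x2
```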
yh = β0 + β3 xh3 + β4xh4 + εh
which was earlier examined in Section 9.3. First we fit a simple quadratic additive model
and includes 5^4 = 625 mean parameters βrstuv . We might want to think twice about trying to estimate
625 parameters from just 20 schools.
This is a common problem with fitting polynomial interaction models. When we have even a
moderate number of predictor variables, the number of parameters quickly becomes completely un-
wieldy. And it is not only a problem for polynomial interaction models. In Section 8.3 we discussed
replacing polynomials with other basis functions φr (x). The polynomial models happen to have
φr (x) = x^r . Other choices of φr include cosines, or both cosines and sines, or indicator functions,
or wavelets. Typically, φ0 (x) ≡ 1. In the basis function approach, the additive polynomial model
(9.9.3) generalizes to
\[
m(x) = \beta_0 + \sum_{r=1}^{R} \beta_{r0} \phi_r(x_1) + \sum_{s=1}^{S} \beta_{0s} \phi_s(x_2) \qquad (9.10.2)
\]
and the polynomial interaction model (9.9.4) generalizes to
\[
m(x) = \sum_{r=0}^{R} \sum_{s=0}^{S} \beta_{rs} \phi_r(x_1)\phi_s(x_2). \qquad (9.10.3)
\]
When expanding Model (9.10.3) to include more predictors, the generalized interaction model has
exactly the same problem as the polynomial interaction model (9.10.1) in that it requires fitting too
many parameters.
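For two predictors, the columns implied by (9.10.2) and (9.10.3) can be generated as in the following Python sketch. The function names are hypothetical, and the cosine basis cos(πrx) is just one possible choice, assumed here to be applied to predictors rescaled to the unit interval. Counting the entries of a row gives the number of mean parameters in each model.

```python
from math import cos, pi


def poly_basis(r, x):
    """Polynomial basis: phi_r(x) = x ** r, so phi_0(x) = 1."""
    return x ** r


def cosine_basis(r, x):
    """One possible cosine basis (an assumption): phi_r(x) = cos(pi * r * x)."""
    return cos(pi * r * x)


def additive_row(phi, x1, x2, R, S):
    """Columns of (9.10.2): intercept, phi_r(x1) for r >= 1, phi_s(x2) for s >= 1."""
    return ([1.0]
            + [phi(r, x1) for r in range(1, R + 1)]
            + [phi(s, x2) for s in range(1, S + 1)])


def interaction_row(phi, x1, x2, R, S):
    """Columns of (9.10.3): all products phi_r(x1) * phi_s(x2) with r, s >= 0."""
    return [phi(r, x1) * phi(s, x2) for r in range(R + 1) for s in range(S + 1)]


n_add = len(additive_row(poly_basis, 0.5, 2.0, 3, 3))        # 1 + R + S columns
n_int = len(interaction_row(cosine_basis, 0.5, 0.2, 3, 3))   # (R + 1)(S + 1) columns
print(n_add, n_int)
```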
Generalized additive models provide a means for circumventing the problem. They do so by
restricting the orders of the interactions. In Model (9.10.1) we have five variables, all of which can
interact with one another. Instead, suppose variables x1 and x4 can interact with one another but with
no other variables and that variables x2 , x3 , and x5 can interact with one another but with no other
variables. We can then write a generalized additive model
m(x) = h1 (x1 , x4 ) + h2 (x2 , x3 , x5 ). (9.10.4)
Using the basis function approach to model each of the two terms on the right gives
\[
m(x) = \sum_{r=0}^{R} \sum_{u=0}^{U} \beta_{ru} \phi_r(x_1)\phi_u(x_4) + \sum_{s=0}^{S} \sum_{t=0}^{T} \sum_{v=0}^{V} \gamma_{stv} \phi_s(x_2)\phi_t(x_3)\phi_v(x_5) - \gamma_{000}.
\]
We subtracted γ000 from the model because both β00 and γ000 serve as intercept terms, hence they
are redundant parameters. This section started by considering the cubic interaction model (9.10.1)
for the Coleman Report data. The model has 3 = R = S = T = U = V and involves 625 mean
parameters. Using similar cubic polynomials to model the generalized additive model (9.10.4) we
need only 2^4 + 3^4 − 1 = 96 parameters. While that is still far too many parameters to fit to the
Coleman Report data, you can see that fitting generalized additive models is much more feasible
than fitting full interaction models.
Another generalized additive model that we could propose for five variables is
m(x) = h1 (x1 , x2 ) + h2 (x2 , x3 ) + h3 (x4 , x5 ). (9.10.5)
In this case, not only are β00 , γ00 , and δ00 all redundant intercept parameters, but
$\sum_{s=0}^{S} \beta_{0s} x_1^0 x_2^s$ and $\sum_{s=0}^{S} \gamma_{s0} x_2^s x_3^0$ are redundant simple polynomials in x2 . In this case it is
more convenient to write Model (9.10.5) as
\[
m(x) = \sum_{r=0}^{R} \sum_{s=0}^{S} \beta_{rs} x_1^r x_2^s + \sum_{s=0}^{S} \sum_{t=1}^{T} \gamma_{st} x_2^s x_3^t + \sum_{u=0}^{U} \sum_{v=0}^{V} \delta_{uv} x_4^u x_5^v - \delta_{00}.
\]
Of course, the catch with generalized additive models is that you need to have some idea of
what variables may interact with one another. And the only obvious way to check that assumption
is to test the assumed generalized additive model against the full interaction model. But this whole
discussion started with the fact that fitting the full interaction model is frequently infeasible.
9.12 Exercises
E XERCISE 9.12.1. Younger (1979, p. 533) presents data from a sample of 12 discount department
stores that advertise on television, radio, and in the newspapers. The variables x1 , x2 , and x3 represent
the respective amounts of money spent on these advertising activities during a certain month while y
gives the store’s revenues during that month. The data are given in Table 9.2. Complete the following
tasks using multiple regression.
(a) Give the theoretical model along with the relevant assumptions.
(b) Give the fitted model, i.e., repeat (a) substituting the estimates for the unknown parameters.
(c) Test H0 : β2 = 0 versus HA : β2 ≠ 0 at α = 0.05.
(d) Test the hypothesis H0 : β1 = β2 = β3 = 0.
(e) Give a 99% confidence interval for β2 .
(f) Test whether the reduced model yi = β0 + β1 xi1 + εi is an adequate explanation of the data as
compared to the full model.
(g) Test whether the reduced model yi = β0 + β1 xi1 + εi is an adequate explanation of the data as
compared to the model yi = β0 + β1xi1 + β2xi2 + εi .
(h) Write down the ANOVA table for the ‘full’ model used in (g).
(i) Construct an added variable plot for adding variable x3 to a model that already contains variables
x1 and x2 . Interpret the plot.
(j) Compute the sample partial correlation ry3·12 . What does this value tell you?
E XERCISE 9.12.2. The information below relates y, a second measurement on wood volume, to
x1 , a first measurement on wood volume, x2 , the number of trees, x3 , the average age of trees, and x4 ,
the average volume per tree. Note that x4 = x1 /x2 . Some of the information has not been reported,
so that you can figure it out on your own.
Table of Coefficients
Predictor    β̂k         SE(β̂k)     t      P
Constant     23.45       14.90              0.122
x1           0.93209     0.08602            0.000
x2           0.4721      1.5554             0.126
x3           −0.4982     0.1520             0.002
x4           3.486       2.274              0.132
Analysis of Variance
Source        df    SS        MS    F    P
Regression    4     887994               0.000
Error
Total         54    902773
Sequential
Source    df    SS
x1        1     883880
x2        1     183
x3        1     3237
x4        1     694
(a) How many observations are in the data?
(b) What is R2 for this model?
(c) What is the mean squared error?
(d) Give a 95% confidence interval for β2 .
(e) Test the null hypothesis β3 = 0 with α = 0.05.
(f) Test the null hypothesis β1 = 1 with α = 0.05.
E XERCISE 9.12.3. Atkinson (1985) and Hader and Grandage (1958) have presented Prater’s data
on gasoline. The variables are y, the percentage of gasoline obtained from crude oil; x1 , the crude
oil gravity °API; x2 , crude oil vapor pressure measured in lbs/in² ; x3 , the temperature, in °F, at
which 10% of the crude oil is vaporized; and x4 , the temperature, in °F, at which all of the crude oil
is vaporized. The data are given in Table 9.3. Find a good model for predicting gasoline yield from
the other four variables.
E XERCISE 9.12.4. Analyze the Chapman data of Example 9.8.1. Find a good model for predict-
ing weight from the other variables.
E XERCISE 9.12.5. Table 9.4 contains a subset of the pollution data analyzed by McDonald and
Schwing (1973). The data are from various years in the early 1960s. They relate air pollution to mor-
tality rates for various standard metropolitan statistical areas in the United States. The dependent
variable y is the total age-adjusted mortality rate per 100,000 as computed for different metropoli-
tan areas. The predictor variables are, in order, mean annual precipitation in inches, mean January
temperature in degrees F, mean July temperature in degrees F, population per household, median
school years completed by those over 25, percent of housing units that are sound and with all facil-
ities, population per sq. mile in urbanized areas, percent non-white population in urbanized areas,
relative pollution potential of sulphur dioxide, annual average of percent relative humidity at 1 pm.
Find a good predictive model for mortality.
Alternatively, you can obtain the complete data from the Internet statistical service STATLIB by
going to http://lib.stat.cmu.edu/datasets/ and clicking on “pollution.” The data consist
of 16 variables on 60 cases.
In this chapter we continue our discussion of multiple regression. In particular, we focus on check-
ing the assumptions of regression models by looking at diagnostic statistics. If problems with as-
sumptions become apparent, one way to deal with them is to try transformations. The discussion
of transformations in Section 7.3 continues to apply. Among the methods discussed there, only the
circle of transformations depends on having a simple linear regression model. The other methods
apply with multiple regression as well as the analysis of variance models introduced in Chapter 12
and later. In particular, the discussion of transforming x at the end of Section 7.3 takes on new impor-
tance in multiple regression because multiple regression involves several predictor variables, each
of which is a candidate for transformation. Incidentally, the modified Box–Tidwell procedure evaluates
each predictor variable separately, so it involves adding only one predictor variable xij log(xij )
to the multiple regression model at a time.
This chapter also examines methods for choosing good reduced models. Variable selection meth-
ods fall into two categories: best subset selection methods and stepwise regression methods. Both
are discussed. In Section 4 we examine the interplay between influential cases and model selection
techniques. Finally, Section 5 gives a brief introduction to lasso regression. We continue to illustrate
techniques on the data from the Coleman Report given in Section 6.9 (Table 6.4) and discussed in
Chapter 9.
10.1 Diagnostics
Table 10.1 contains a variety of measures for checking the assumptions of the multiple regression
model with five predictor variables that was fitted in Section 6.9 and Chapter 9 to the Coleman
Report data. The table includes case indicators, the data y, the predicted values ŷ, the leverages,
the standardized residuals r, the standardized deleted residuals t, and Cook’s distances C. All of
these, except for Cook’s distance, were introduced in Section 7.2. Recall that leverages measure the
distance between the predictor variables of a particular case and the overall center of those data.
Cases with leverages near 1 dominate any fitted regression. As a rule of thumb, leverages greater
than 2p/n cause concern and leverages greater than 3p/n cause (at least mild) consternation. Here
n is the number of observations in the data and p is the number of regression coefficients, including
the intercept. The standardized deleted residuals t contain essentially the same information as the
standardized residuals r but t values can be compared to a t(dfE − 1) distribution to obtain a formal
test of whether a case is consistent with the other data. (A formal test based on the r values requires
a more exotic distribution than the t(dfE − 1).) Cook’s distance for case i is defined as
$$C_i = \frac{\sum_{h=1}^{n} \left(\hat{y}_h - \hat{y}_{h[i]}\right)^2}{p\,MSE}, \qquad (10.1.1)$$
where $\hat{y}_h$ is the predicted value for the $h$th case and $\hat{y}_{h[i]}$ is the predicted value for the $h$th case
when case $i$ has been removed from the data. Cook’s distance measures the effect of deleting case $i$
on the prediction of all of the original observations.
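Equation (10.1.1) can be evaluated directly by refitting the model with each case deleted in turn. The following Python sketch does this for a simple linear regression (so p = 2) on a small made-up data set. The data and the function names `fit_line` and `cooks_distances` are illustrative assumptions, not part of the Coleman Report analysis.

```python
def fit_line(x, y):
    """Least squares intercept and slope for a simple linear regression."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope

def cooks_distances(x, y):
    """Cook's distance for every case, computed from definition (10.1.1)."""
    n, p = len(x), 2  # p counts the intercept and the slope
    b0, b1 = fit_line(x, y)
    fitted = [b0 + b1 * xi for xi in x]
    mse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted)) / (n - p)
    distances = []
    for i in range(n):
        # Refit without case i, then predict all n original cases.
        a0, a1 = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        shift = sum((fi - (a0 + a1 * xi)) ** 2 for fi, xi in zip(fitted, x))
        distances.append(shift / (p * mse))
    return distances

# Hypothetical toy data: five cases on the line y = 2x and one discordant case.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 10.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0, 0.0]
cooks = cooks_distances(x, y)
most_influential = max(range(len(cooks)), key=lambda i: cooks[i])
print(most_influential + 1, [round(c, 2) for c in cooks])
```

In this toy example the discordant sixth case should dominate, in the same way that case 18 dominates Table 10.1. For real multiple regression problems, statistical software computes the same quantity without explicit refitting.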
Figures 10.1 and 10.2 are plots of the standardized residuals versus normal scores and against
the predicted values. The largest standardized residual, that for case 18, appears to be somewhat
unusually large. To test whether the data from case 18 are consistent with the other data, we can
compare the standardized deleted residual to a t(dfE − 1) distribution. From Table 10.1, the t resid-
ual is 4.56. The corresponding P value is 0.0006. Actually, we chose to perform the test on the t
residual for case 18 only because it was the largest of the 20 t residuals. Because the test is based on
the largest of the t values, it is appropriate to multiply the P value by the number of t statistics con-
sidered. This gives 20 × 0.0006 = 0.012, which is still a very small P value. There is considerable
evidence that the data of case 18 are inconsistent, for whatever reason, with the other data. This fact
cannot be discovered from a casual inspection of the raw data.
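The adjustment for having picked the largest of several t statistics is a simple multiplication. A minimal Python sketch follows; the function name `bonferroni` and the cap at 1 are conventions assumed here rather than taken from the text.

```python
def bonferroni(p_value, number_of_tests):
    """Multiply a P value by the number of tests considered, capping at 1."""
    return min(1.0, p_value * number_of_tests)

# Case 18 had the largest of the 20 standardized deleted residuals.
adjusted_case_18 = bonferroni(0.0006, 20)
print(round(adjusted_case_18, 4))
```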
The only point of any concern with respect to the leverages is case 10. Its leverage is 0.618,
while 2p/n = 0.6. This is only a mildly high leverage and case 10 seems well behaved in all other
respects; in particular, C10 is small, so deleting case 10 has very little effect on predictions.
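The 2p/n and 3p/n rules of thumb are easy to apply mechanically. The sketch below is one hypothetical way to code them in Python; the function name `leverage_flag` and the labels it returns are assumptions made for illustration.

```python
def leverage_flag(leverage, n, p):
    """Classify a leverage using the 2p/n and 3p/n rules of thumb."""
    if leverage > 3 * p / n:
        return "consternation"
    if leverage > 2 * p / n:
        return "concern"
    return "unremarkable"

# Case 10 of the Coleman Report data: n = 20 cases, p = 6 coefficients.
flag_case_10 = leverage_flag(0.618, n=20, p=6)
print(flag_case_10)
```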
We now reconsider the analysis with case 18 deleted. The regression equation is
$$\hat{y} = 34.287 - 1.6173x_1 + 0.08544x_2 + 0.67393x_3 + 1.1098x_4 - 4.571x_5$$
and $R^2 = 0.963$. Table 10.2 contains the table of coefficients. Table 10.3 contains the analysis of
variance. Table 10.4 contains diagnostics. Note that the MSE is less than half of its previous value
when case 18 was included in the analysis. It is no surprise that the MSE is smaller, since the case
being deleted is often the single largest contributor to the SSE. Correspondingly, the regression
parameter t statistics in Table 10.2 are all much more significant. The actual regression coefficient
estimates have changed a bit but not greatly. Predictions have not changed radically either, as can be
seen by comparing the predictions given in Tables 10.1 and 10.4. Although the predictions have not
changed radically, they have changed more than they would have if we deleted any observation other
than case 18. From the definition of Cook’s distance given in Equation (10.1.1), C18 is precisely the
sum of the squared differences between the predictions in Tables 10.1 and 10.4 divided by 6 times
the MSE from the full data. From Table 10.1, Cook’s distance when dropping case 18 is much larger
than Cook’s distance from dropping any other case.
[Figure 10.1: normal Q−Q plot, standardized residuals versus theoretical quantiles.]
[Figure 10.2: residual−fitted plot, standardized residuals versus fitted values.]

Table 10.2: Table of coefficients, case 18 deleted.
Predictor    β̂          SE(β̂)      t        P
Constant     34.287      9.312       3.68     0.003
x1           −1.6173     0.7943     −2.04     0.063
x2           0.08544     0.03546     2.41     0.032
x3           0.67393     0.06516    10.34     0.000
x4           1.1098      0.2790      3.98     0.002
x5           −4.571      1.437      −3.18     0.007

Consider again Table 10.4 containing the diagnostic statistics when case 18 has been deleted.
Case 10 has moderately high leverage but seems to be no real problem. Figures 10.3 and 10.4 give
the normal plot and the standardized residual versus predicted value plot, respectively, with case
18 deleted. Figure 10.4 is particularly interesting. At first glance, it appears to have a horn shape
opening to the right. But there are only three observations on the left of the plot and many on the
right, so one would expect a horn shape because of the data distribution. Looking at the right of
the plot, we see that in spite of the data distribution, much of the horn shape is due to a single
very small residual. If we mentally delete that residual, the remaining residuals contain a hint of an
upward opening parabola. The potential outlier is case 3. From Table 10.4, the standardized deleted
residual for case 3 is −5.08, which yields a raw P value of 0.0001, and if we adjust for having 19
t statistics, the P value is 0.0019, still an extremely small value. Note also that in Table 10.1, when
case 18 was included in the data, the standardized deleted residual for case 3 was somewhat large
but not nearly so extreme.
With cases 3 and 18 deleted, the regression equation becomes
$$\hat{y} = 29.758 - 1.6985x_1 + 0.08512x_2 + 0.66617x_3 + 1.1840x_4 - 4.0668x_5.$$
The $R^2$ for these data is 0.988. The table of coefficients is in Table 10.5, the analysis of variance is
in Table 10.6, and the diagnostics are in Table 10.7.
Deleting the outlier, case 3, again causes a drop in the MSE, from 1.78 with only case 18 deleted
to 0.61 with both cases 3 and 18 deleted. This creates a corresponding drop in the standard errors for
all regression coefficients and makes them all appear to be more significant. The actual estimates of
the regression coefficients do not change much from Table 10.2 to Table 10.5. The largest changes
seem to be in the constant and in the coefficient for x5 .
From Table 10.7, the leverages, t statistics, and Cook’s distances seem reasonable. Figures 10.5
and 10.6 contain a normal plot and a plot of standardized residuals versus predicted values. Both
plots look good. In particular, the suggestion of lack of fit in Figure 10.4 appears to be unfounded.
Once again, Figure 10.6 could be misinterpreted as a horn shape but the ‘horn’ is due to the
distribution of the predicted values.

[Figure 10.3: normal Q−Q plot, standardized residuals versus theoretical quantiles, case 18 deleted.]
[Figure 10.4: residual−fitted plot, standardized residuals versus fitted values, case 18 deleted.]

Table 10.5: Table of coefficients, cases 3 and 18 deleted.
Predictor    β̂          SE(β̂)      t        P
Constant     29.758      5.532       5.38     0.000
x1           −1.6985     0.4660     −3.64     0.003
x2           0.08512     0.02079     4.09     0.001
x3           0.66617     0.03824    17.42     0.000
x4           1.1840      0.1643      7.21     0.000
x5           −4.0668     0.8487     −4.79     0.000

Ultimately, someone must decide whether or not to delete unusual cases based on subject matter
considerations. There is only moderate statistical evidence that case 18 is unusual and case 3 does
not look severely unusual unless one previously deletes case 18. Are there subject matter reasons
for these schools to be unusual? Will the data be more or less representative of the appropriate
population if these data are deleted?
[Figure 10.5: normal plot, standardized residuals versus theoretical quantiles, cases 3 and 18 deleted.]
Figure 10.6: Standardized residuals versus predicted values, cases 3 and 18 deleted.
are based on defining a criterion for a best model and then finding the models that are best by this
criterion. Section 10.3 considers three methods of making sequential selections of variables. Obvi-
ously, it is better to consider all reduced models whenever feasible rather than making sequential
selections. Sequential methods are flawed but they are cheap and easy.
10.2.1 R2 statistic
Table 10.8: Two best models based on $R^2$ for each number of predictor variables, Coleman Report data.
                             Included variables
Vars.   R²      √MSE       x1   x2   x3   x4   x5
1       86.0    2.2392     -    -    X    -    -
1       56.8    3.9299     -    X    -    -    -
2       88.7    2.0641     -    -    X    X    -
2       86.2    2.2866     -    -    X    -    X
3       90.1    1.9974     X    -    X    X    -
3       88.9    2.1137     -    -    X    X    X
4       90.2    2.0514     X    -    X    X    X
4       90.1    2.0603     X    X    X    X    -
5       90.6    2.0743     X    X    X    X    X

The fundamental statistic in comparing all possible reduced models is the $R^2$ statistic. This is
appropriate but we should recall some of the weaknesses of $R^2$. The numerical size of $R^2$ is more related
to predictive ability than to model adequacy. The perfect model can have small predictive ability
and thus a small R2 , while demonstrably inadequate models can still have substantial predictive
ability and thus a high R2 . Fortunately, we are typically more interested in prediction than in finding
the perfect model, especially since our models are typically empirical approximations for which no
perfect model exists. In addition, when considering transformations of the dependent variable, the
R2 values for different models are not comparable (unless predictions are back transformed to the
original scale and correlated with the original data to obtain R2 ).
In the present context, the most serious drawback of R2 is that it typically goes up when more
predictor variables are added to a model. (It cannot go down.) Thus it is not really appropriate to
compare the R2 values of two models with different numbers of predictors. However, we can use R2
to compare models with the same number of predictor variables. In fact, for models with the same
number of predictors, we can use R2 to order them from best to worst; the largest R2 value then
corresponds to the best model. R2 is the fundamental model comparison statistic for best subset
methods in that, for comparing models with the same number of predictors, the other methods
considered give the same relative orderings for models as R2 . The essence of the other methods is to
develop a criterion for comparing models that have different numbers of predictors, i.e., the methods
incorporate penalties for adding more regression parameters.
Table 10.8 contains the two best models for the Coleman Report data based on the R2 statistic
for each number of predictor variables. The best single variable is x3 ; the second best is x2 . This
information could be obtained from the correlations between y and the predictor variables given in
Section 9.1. Note the drastic difference between the R2 for using x3 and that for x2 . The best pair
of variables for predicting y is x3 and x4 , while the second best pair is x3 and x5 . The best
three-variable model contains x1 , x3 , and x4 . Note that the largest R2 values go up very little when a fourth
or fifth variable is added. Moreover, all the models in Table 10.8 that contain three or more variables
include x3 and x4 . We could conduct F tests to compare models with different numbers of predictor
variables, as long as the smaller models are contained in the larger ones.
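As an illustration of such an F test, the Python sketch below compares the model with x3 and x4 to the larger model with x1, x3, and x4, using only the √MSE values in Table 10.8. It assumes n = 20 cases and recovers each SSE as dfE × MSE; the function name `nested_f` is hypothetical.

```python
import math

def nested_f(sse_reduced, df_reduced, sse_full, df_full):
    """F statistic for testing a reduced model against a larger model that contains it."""
    numerator = (sse_reduced - sse_full) / (df_reduced - df_full)
    return numerator / (sse_full / df_full)

n = 20
# SSE = dfE * MSE, with sqrt(MSE) read from Table 10.8.
sse_small, df_small = (n - 3) * 2.0641 ** 2, n - 3   # intercept, x3, x4
sse_large, df_large = (n - 4) * 1.9974 ** 2, n - 4   # intercept, x1, x3, x4
f_stat = nested_f(sse_small, df_small, sse_large, df_large)
print(round(f_stat, 2), round(math.sqrt(f_stat), 2))
```

Because only one variable is added, the square root of this F statistic should agree, up to rounding, with the absolute t statistic for x1 in the three-variable model.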
Any models that we think are good candidates should be examined for influential and outlying
observations, consistency with assumptions, and subject matter implications. Any model that makes
particularly good sense to a subject matter specialist warrants special consideration. Models that
make particularly poor sense to subject matter specialists may be dumb luck but they may also
be the springboard for new insights into the process generating the data. We also need to concern
ourselves with the role of observations that are influential or outlying in the original (full) model. We
will examine this in more detail later. Finally, recall that when making predictions based on reduced
models, the point at which we are making the prediction generally needs to be consistent with the
original data on all variables, not just the variables in the reduced model. When we drop a variable,
we do not conclude that the variable is not important, we conclude that it is not important for this set
of data. For different data, a dropped variable may become important. We cannot presume to make
predictions from a reduced model for new cases that are substantially different from the original
data.
where $s_y^2$ is the sample variance of the $y_i$s, i.e., $s_y^2 = SSTot/(n - 1)$. This is a much simpler statement
than the defining relationship. For the Coleman Report example with all predictor variables, this is
$$0.873 = 1 - \frac{4.30}{642.92/19}.$$
Note that when comparing two models, the model with the smaller MSE has the larger adjusted $R^2$.
$R^2$ is always between 0 and 1, but while the adjusted $R^2$ cannot get above 1, it can get below 0.
It is possible to find models that have $MSE > s_y^2$. In these cases, the adjusted $R^2$ is actually less than
0.
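The computation above is easy to reproduce. A minimal Python sketch, with the hypothetical function name `adjusted_r2`, using the MSE and SSTot quoted for the full Coleman Report model:

```python
def adjusted_r2(mse, ss_total, n):
    """Adjusted R^2 = 1 - MSE / s_y^2, where s_y^2 = SSTot / (n - 1)."""
    s2_y = ss_total / (n - 1)
    return 1 - mse / s2_y

full_model = adjusted_r2(mse=4.30, ss_total=642.92, n=20)
print(round(full_model, 3))
```

Supplying any MSE larger than 642.92/19 to the same function returns a negative value, which is the situation described in the last two sentences above.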
Models with large adjusted R2 s are precisely models with small mean squared errors. At first
glance, this seems like a reasonable way to choose models, but upon closer inspection the idea seems
flawed. The problem is that when comparing some model with a reduced model, the adjusted R2 is
greater for the larger model whenever the mean squared error of the larger model is less than the
numerator mean square for testing the adequacy of the smaller model. In other words, the adjusted
R2 is greater for the larger model whenever the F statistic for comparing the models is greater than
1. Typically, we want the F statistic to be substantially larger than 1 before concluding that the extra
variables in the larger model are important.
To see that the adjusted R2 is larger for the larger model whenever F > 1, consider the simplest
example, that of comparing the full model to the model that contains just an intercept. For the
Coleman Report data, the mean squared error for the intercept model is
In general, the mean squared error for the smaller model is a weighted average of the mean
square for the variables being added and the mean squared error of the larger model. If the mean
square for the variables being added is greater than the mean squared error of the larger model, i.e.,
if F > 1, the mean squared error for the smaller model must be greater than that for the larger model.
If we add variables to a model whenever the F statistic is greater than 1, we will include a lot of
unnecessary variables.
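The weighted average can be written out explicitly. In the sketch below, dfE(Red.) and dfE(Full) denote the error degrees of freedom of the two models and MS(Added) denotes the mean square for the variables being added; this notation is assumed here for illustration.

```latex
\begin{aligned}
SSE(\text{Red.}) &= SS(\text{Added}) + SSE(\text{Full}), \\
MSE(\text{Red.}) &= \frac{dfE(\text{Red.}) - dfE(\text{Full})}{dfE(\text{Red.})}\, MS(\text{Added})
                  + \frac{dfE(\text{Full})}{dfE(\text{Red.})}\, MSE(\text{Full}).
\end{aligned}
```

The two weights are positive and sum to 1, so MSE(Red.) exceeds MSE(Full) exactly when MS(Added) exceeds MSE(Full), that is, when F = MS(Added)/MSE(Full) > 1.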
Table 10.9 contains the six best-fitting models as judged by the adjusted $R^2$ criterion. As advertised,
the ordering of the models from best to worst is consistent whether one maximizes the
adjusted $R^2$ or minimizes the MSE (or equivalently, $\sqrt{MSE}$). The best model based on the adjusted
R2 is the model with variables x1 , x3 , and x4 , but a number of the best models are given. Presenting
a number of the best models reinforces the idea that selection of one or more final models should be
based on many more considerations than just the value of one model selection statistic. Moreover,
the best model as determined by the adjusted R2 often contains too many variables.
Note also that the two models in Table 10.9 with three variables are precisely the two three-
variable models with the highest R2 values from Table 10.8. The same is true about the two four-
variable models that made this list. As indicated earlier, when the number of variables is fixed,
ordering models by their R2 s is equivalent to ordering models by their adjusted R2 s. The comments
about model checking and prediction made in the previous subsection continue to apply.
The term $\sigma^2$ serves only to provide some standardization. Small values of $\kappa$ indicate good reduced
models. Note that $\kappa$ is not directly useful because it is unknown. It depends on the $z_i$ values and
they depend on the unknown full model regression parameters. However, if we think of the $\hat{y}_{iR}$s as
functions of the random variables $y_i$, the comparison value $\kappa$ is a function of the $y_i$s and thus is a
random variable with an expected value. Mallows’s $C_p$ statistic is an estimate of the expected value
of $\kappa$. In particular, Mallows’s $C_p$ statistic is
$$C_p = \frac{SSE(\text{Red.})}{MSE(\text{Full})} - (n - 2r).$$
Table 10.10: Best six models based on the $C_p$ statistic, Coleman Report data.
                             Included variables
Vars    Cp     √MSE        x1   x2   x3   x4   x5
2       2.8    2.0641      -    -    X    X    -
3       2.8    1.9974      X    -    X    X    -
3       4.6    2.1137      -    -    X    X    X
4       4.7    2.0514      X    -    X    X    X
3       4.8    2.1272      -    X    X    X    -
4       4.8    2.0603      X    X    X    X    -
For a derivation of this statistic see Christensen (2011, Section 14.1). The smaller the $C_p$ value, the
better the model (up to the variability of the estimation). If the $C_p$ statistic is computed for the full
model, the result is always $p$, the number of predictor variables including the intercept. For general
linear models $r$ is the number of functionally distinct mean parameters in the reduced model.
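The formula is simple enough to check by hand or in a few lines of code. The Python sketch below assumes n = 20 cases and recovers SSE and MSE from the √MSE values reported for the Coleman Report models; the function name `mallows_cp` is hypothetical.

```python
def mallows_cp(sse_reduced, mse_full, n, r):
    """Mallows's C_p = SSE(Red.) / MSE(Full) - (n - 2r)."""
    return sse_reduced / mse_full - (n - 2 * r)

n = 20
mse_full = 2.0743 ** 2                  # full model: intercept and x1, ..., x5

# Reduced model with intercept, x3, and x4: r = 3 mean parameters.
sse_x3_x4 = (n - 3) * 2.0641 ** 2
cp_x3_x4 = mallows_cp(sse_x3_x4, mse_full, n, r=3)

# Full model: SSE = (n - p) * MSE with p = 6, so C_p should equal p.
cp_full = mallows_cp((n - 6) * mse_full, mse_full, n, r=6)
print(round(cp_x3_x4, 1), round(cp_full, 1))
```

The second call illustrates why the full model always gives $C_p = p$: the ratio SSE(Full)/MSE(Full) is $n - p$, and $(n - p) - (n - 2p) = p$.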
In multiple regression, estimated regression surfaces are identical to prediction surfaces, so
models with Mallows’s $C_p$ statistics that are substantially less than $p$ can be viewed as reduced models
that are estimated to be better at prediction than the full model. Of course this comparison between
predictions from the full and reduced models is restricted to the actual combinations of predictor
variables in the observed data.
For the Coleman Report data, Table 10.10 contains the best six models based on the $C_p$ statistic.
The best model is the one with variables $x_3$ and $x_4$, but the model including $x_1$, $x_3$, and $x_4$ has
essentially the same value of $C_p$. There is a substantial increase in $C_p$ for any of the other four
models. Clearly, we would focus attention on the two best models to see if they are adequate in terms
of outliers, influential observations, agreement with assumptions, and subject matter implications.
As always, predictions can only be made with safety from the reduced models when the new cases
are to be obtained in a similar fashion to the original data. In particular, new cases must have similar
values to those in the original data for all of the predictor variables, not just those in the reduced
model. Note that the ranking of the best models is different here than for the adjusted $R^2$. The full
model is not included here, while it was in the adjusted $R^2$ table. Conversely, the model with $x_2$, $x_3$,
and $x_4$ is included here but was not included in the adjusted $R^2$ table. Note also that among models
with three variables, the $C_p$ rankings agree with the $R^2$ rankings and the same holds for four-variable
models.
It is my impression that Mallows’s $C_p$ statistic is the most popular method for selecting a best
subset of the predictor variables. It is certainly my favorite. Mallows’s $C_p$ statistic is closely related
to Akaike’s information criterion (AIC), which is a general criterion for model selection. AIC and
the relationship between $C_p$ and AIC are examined in Christensen (1997, Section 4.8).
Table 10.11 lists the three best models based on $R^2$ for each number of predictor variables. In
addition, the adjusted $R^2$ and $C_p$ values for each model are listed in the table. It is easy to identify
the best models based on any of the model selection criteria. The output is extensive enough to
include a few notably bad models. Rather than asking for the best 3, one might ask for the best 4,
or 5, or 6 models for each number of predictor variables but it is difficult to imagine a need for any
more extensive summary of the models when beginning a search for good reduced models.
Note that the model with $x_1$, $x_3$, and $x_4$ is the best model as judged by adjusted $R^2$ and is nearly
the best model as judged by the $C_p$ statistic. (The model with $x_3$ and $x_4$ has a slightly smaller $C_p$
value.) The model with $x_2$, $x_3$, $x_4$ has essentially the same $C_p$ statistic as the model with $x_1$, $x_2$, $x_3$,
$x_4$ but the latter model has a larger adjusted $R^2$.
or it can change depending on the step. When allowing it to change depending on the step, we could
set up the process so that it stops when all of the P values are below a fixed level.
Table 10.12 illustrates backwards elimination for the Coleman Report data. In this example, the
predetermined level for stopping the procedure is 2. If all |t| statistics are greater than 2, elimination
of variables halts. Step 1 includes all 5 predictor variables. The table gives estimated regression
coefficients, t statistics, the R2 value, and the square root of the MSE. In step 1, the smallest absolute
t statistic is 0.82, so variable x2 is eliminated from the model. The statistics in step 2 are similar to
those in step 1 but now the model includes only variables x1 , x3 , x4 , and x5 . In step 2, the smallest
absolute t statistic is | − 0.41|, so variable x5 is eliminated from the model. Step 3 is based on the
model with x1 , x3 , and x4 . The smallest absolute t statistic is the | − 1.47| for variable x1 , so x1 is
dropped. Step 4 uses the model with only x3 and x4 . At this step, the t statistics are both greater than
2, so the process halts. Note that the intercept is not considered for elimination.
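The elimination loop just described can be sketched in Python. The code below is an illustrative implementation under stated assumptions, not the software used to produce Table 10.12: it fits least squares through the normal equations in pure Python, never considers the intercept for removal, and runs on hypothetical toy data in which predictor `a` drives the response and predictor `b` is unrelated to it.

```python
import math

def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [u - f * v for u, v in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def t_statistics(columns, y):
    """Least squares t statistics for the named predictors (an intercept is always fitted)."""
    names, n = list(columns), len(y)
    X = [[1.0] + [columns[v][i] for v in names] for i in range(n)]
    p = len(names) + 1
    xtx = [[sum(X[i][j] * X[i][k] for i in range(n)) for k in range(p)] for j in range(p)]
    xty = [sum(X[i][j] * y[i] for i in range(n)) for j in range(p)]
    beta = solve(xtx, xty)
    sse = sum((y[i] - sum(b * x for b, x in zip(beta, X[i]))) ** 2 for i in range(n))
    mse = sse / (n - p)
    t = {}
    for j, v in enumerate(names, start=1):
        unit = [1.0 if k == j else 0.0 for k in range(p)]
        variance = mse * solve(xtx, unit)[j]  # jth diagonal element of MSE (X'X)^{-1}
        t[v] = beta[j] / math.sqrt(variance)
    return t

def backward_elimination(columns, y, cutoff=2.0):
    """Drop the predictor with the smallest |t| until every |t| exceeds the cutoff."""
    current, dropped = dict(columns), []
    while current:
        t = t_statistics(current, y)
        weakest = min(t, key=lambda v: abs(t[v]))
        if abs(t[weakest]) > cutoff:
            break
        dropped.append(weakest)
        del current[weakest]
    return list(current), dropped

# Hypothetical toy data: y = 3a plus a small disturbance unrelated to b.
a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
b = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
noise = [0.5, -0.5, -0.5, 0.5, 0.5, -0.5, -0.5, 0.5]
y = [3 * ai + e for ai, e in zip(a, noise)]
kept, dropped = backward_elimination({"a": a, "b": b}, y)
print(kept, dropped)
```

The cutoff of 2 mirrors the predetermined level used in the text. A production analysis would rely on a statistical package rather than hand-rolled linear algebra.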
The final model given in Table 10.12 happens to be the best model as determined by the $C_p$
statistic and the model at stage 3 is the second-best model as determined by the $C_p$ statistic. This is
a fortuitous event; there is no reason that this should happen other than these data being particularly
clear about the most important variables.
indicating that when the three models $y_i = \eta_{0j} + \eta_{3j} x_{i3} + \eta_{4j} x_{i4} + \eta_j x_{ij} + \varepsilon_i$, $j = 1, 2, 5$, were fitted
to the data, none of the absolute t statistics for testing $\eta_j = 0$ were greater than 2.
The final model selected is the model with predictor variables $x_3$ and $x_4$. This is the same model
obtained from backwards elimination and the model that has the smallest $C_p$ statistic. Again, this is
a fortuitous circumstance. There is no assurance that such agreement between methods will occur.
Rather than using t statistics, the decisions could be made using the equivalent F statistics. The
stopping value of 2 for t statistics corresponds to a stopping value of 4 for F statistics. In addition,
this same procedure can be based on sample correlations and partial correlations. The decision in
step 1 is equivalent to adding the variable that has the largest absolute sample correlation with y.
The decision in step 2 is equivalent to adding the variable that has the largest absolute sample partial
correlation with y after adjusting for x3 . Step 3 is not shown in the table, but the computations for
step 3 must be made in order to know that the procedure stops after step 2. The decision in step 3 is
equivalent to adding the variable that has the largest absolute sample partial correlation with y after
adjusting for x3 and x4 , provided this value is large enough.
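The correspondence between the two stopping values follows from the standard fact that, for testing a single regression coefficient, the F statistic with one numerator degree of freedom is the square of the t statistic:

```latex
F = t^2, \qquad |t| > 2 \iff F > 4 .
```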
The author has a hard time imagining any situation where forward selection from the intercept-
only model is a reasonable thing to do, except possibly as a screening device when there are more
predictor variables than there are observations. In such a case, the full model cannot be fitted mean-
ingfully, so best subset methods and backwards elimination do not work.
the model is correct. Systematically eliminating these large residuals makes the estimate of the vari-
ance too small. Variable selection methods tend to identify as good reduced models those with small
MSEs. The most extreme case is that of using the adjusted R2 criterion, which identifies as the best
model the one with the smallest MSE. Confidence and prediction intervals based on models that
are arrived at after variable selection or outlier deletion should be viewed as the smallest reasonable
intervals available, with the understanding that more appropriate intervals would probably be wider.
Tests performed after variable selection or outlier deletion should be viewed as giving the greatest
reasonable evidence against the null hypothesis, with the understanding that more appropriate tests
would probably display a lower level of significance.
Recall that in Section 10.1, case 18 was identified as an influential point in the Coleman Report
data and then case 3 was identified as highly influential. Table 10.14 gives the results of a best subset
selection when case 18 has been eliminated. The full model is the best model as measured by either
the $C_p$ statistic or the adjusted $R^2$ value. This is a far cry from the full data analysis in which the
models with $x_3$, $x_4$ and with $x_1$, $x_3$, $x_4$ had the smallest $C_p$ statistics. These two models are only
the seventh and fifth best models in Table 10.14. The two closest competitors to the full model in
Table 10.14 involve dropping one of variables $x_1$ and $x_2$. The fourth and fifth best models involve
dropping $x_2$ and one of variables $x_1$ and $x_5$. In this case, the adjusted $R^2$ ordering of the five best
models agrees with the $C_p$ ordering.
Table 10.15 gives the best subset summary when cases 3 and 18 have both been eliminated.
Once again, the best model as judged by either $C_p$ or adjusted $R^2$ is the full model. The second
best model drops $x_1$ and the third best model drops $x_2$. However, the subsequent ordering changes
substantially.
Now consider backwards elimination and forward selection with influential observations
deleted. In both cases, we continue to use the |t| value 2 as the cutoff to stop addition and removal
of variables.
Table 10.16 gives the results of a backwards elimination when case 18 is deleted and when cases
3 and 18 are deleted. In both situations, all five of the variables remain in the model. The regression
coefficients are similar in the two models with the largest difference being in the coefficients for
x5 . Recall that when all of the cases were included, the backwards elimination model included only
variables x3 and x4 , so we see a substantial difference due to the deletion of one or two cases.
The results of forward selection are given in Table 10.17. With case 18 deleted, the process stops
with a model that includes x3 and x4 . With case 3 also deleted, the model includes x1 , x3 , and x4 .
While these happen to agree quite well with the results from the complete data, they agree poorly
with the results from best subset selection and from backwards elimination, both of which indicate
that all variables are important. Forward selection gets hung up after a few variables and cannot deal
with the fact that adding several variables (rather than one at a time) improves the fit of the model
substantially.
An alternative to least squares estimation that has become quite popular is lasso regression, which
was proposed by Tibshirani (1996). “Lasso” stands for least absolute shrinkage and selection
operator. The interesting thing about lasso, and the reason for its inclusion in this chapter, is that it
automatically performs variable selection while it is estimating the regression parameters.

Table 10.18: Lasso and least squares estimates: The Coleman Report data.
                                  Lasso λ                                   Reduced model least squares
Predictor    1          0.6         0.56348     0.5         0
Constant     19.95      18.79306    20.39486    26.51564    35.0825      12.1195     14.58327
x1           −1.793     −0.33591    0.00000     0.00000     0.0000       −1.7358     0.00000
x2           0.04360    0.00000     0.00000     0.00000     0.0000       0.00000     0.00000
x3           0.55576    0.51872     0.51045     0.47768     0.0000       0.5532      0.54156
x4           1.1102     0.62140     0.52194     0.28189     0.0000       1.0358      0.74989
x5           −1.811     0.00000     0.00000     0.00000     0.0000       0.00000     0.00000

As discussed in Subsection 9.1.2, the least squares estimates $\hat{\beta}_j$ satisfy
$$\sum_{i=1}^{n} \left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \hat{\beta}_2 x_{i2} - \cdots - \hat{\beta}_{p-1} x_{i,p-1}\right)^2 = \min_{\beta_0,\ldots,\beta_{p-1}} \sum_{i=1}^{n} \left(y_i - \beta_0 - \beta_1 x_{i1} - \beta_2 x_{i2} - \cdots - \beta_{p-1} x_{i,p-1}\right)^2.$$
There are various ways that one can present the lasso criterion for estimation. One of them is to
minimize the least squares criterion
$$\sum_{i=1}^{n} \left(y_i - \beta_0 - \beta_1 x_{i1} - \beta_2 x_{i2} - \cdots - \beta_{p-1} x_{i,p-1}\right)^2$$
subject to an upper bound on the sum of the absolute values of the regression coefficients. We define
the upper bound in terms of the least squares estimates so that the lasso estimates must satisfy
$$\sum_{j=1}^{p-1} |\beta_j| \le \lambda \sum_{j=1}^{p-1} |\hat{\beta}_j| \qquad (10.5.1)$$
for some λ with 0 ≤ λ ≤ 1. The lasso estimates depend on the choice of λ . The least squares
estimates obviously satisfy the inequality when λ = 1, so λ = 1 gives least squares estimates. When
λ = 0, all the regression coefficients in the inequality must be zero, but notice that the intercept
is not subject to the upper bound in (10.5.1). Thus, λ = 0 gives the least squares estimates for the
intercept-only model, i.e., it zeros out all the regression coefficients except the intercept, which it
estimates with ȳ· .
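Another common way to present the criterion, equivalent to the constrained version, is as a penalized least squares problem. In the sketch below the penalty parameter $\delta \ge 0$ is notation assumed here, not the text's:

```latex
\min_{\beta_0,\ldots,\beta_{p-1}} \;
\sum_{i=1}^{n} \left(y_i - \beta_0 - \beta_1 x_{i1} - \cdots - \beta_{p-1} x_{i,p-1}\right)^2
+ \delta \sum_{j=1}^{p-1} |\beta_j|
```

Taking $\delta = 0$ corresponds to $\lambda = 1$ (least squares), while sufficiently large $\delta$ corresponds to $\lambda = 0$ (the intercept-only fit). As in (10.5.1), the intercept is not penalized.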
EXAMPLE 10.5.1. We examine the effect of lasso regression on the Coleman Report data.
Table 10.18 contains results for five values of $\lambda$ and least squares estimates for two reduced models.
For $\lambda = 1$, the estimates are identical to the least squares estimates for the full model.
R’s lasso2 package has a default value of λ = 0.5, which zeros out the coefficients for x1 , x2 ,
and x5 . The reduced model that only includes x3 and x4 is the model that we liked in Section 9.3.
The lasso estimates of β3 and β4 are noticeably smaller than the least squares estimates from the
reduced model given in the last column of Table 10.18. I also found the largest value of λ that zeros
out the coefficients for x1 , x2 , and x5 . That value is λ = 0.56348. With this larger value of λ , the
lasso estimates are closer to the reduced model least squares estimates but still noticeably different.
For λ ≥ 0.56349, lasso produces a nonzero coefficient for x1 . From Section 9.3, if we were
going to add another variable to the model containing only x3 and x4 , the best choice is to add x1 .
Table 10.18 includes results for λ = 0.6 and least squares on the three-variable model. λ = 0.6 still
has the coefficients for x2 and x5 zeroed out. Again, the nonzero lasso estimates for β1 , β3 , and β4
are all closer to zero than the least squares estimates from the model with just x1 , x3 , and x4 . ✷
Lasso seems to do a good job of identifying the important variables and it does it pretty automat-
ically. That can be both a blessing and a curse. It is far less obvious how well lasso is estimating the
regression coefficients. The least squares estimates seem more stable across reduced models than
do the lasso estimates. And there is the whole issue of choosing λ .
Notice that the inequality (10.5.1) uses the same weight λ on all of the regression coefficients.
That is not an obviously reasonable thing to do when the predictor variables are measured in differ-
ent units, so lasso is often applied to standardized predictor variables, i.e., variables that have their
sample mean subtracted and are then divided by their standard deviation. (This is the default in R’s
lasso2 package.) The regression estimates can then be transformed back to their original scales to be
comparable to the least squares estimates. Section 11.6 illustrates this standardization procedure for
another regression technique, principal components regression. Lasso applied to the unstandardized
Coleman Report data gives very different, and less appealing, results.
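The standardization and back-transformation steps can be sketched in a few lines of Python. The data, the coefficient values, and the function names `standardize` and `back_transform` are hypothetical, and the sketch assumes the sample standard deviation uses divisor n − 1. It only illustrates the change of scale and does not fit a lasso.

```python
import math

def standardize(column):
    """Subtract the sample mean and divide by the sample standard deviation."""
    n = len(column)
    mean = sum(column) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in column) / (n - 1))
    return [(x - mean) / sd for x in column], mean, sd

def back_transform(intercept_std, slopes_std, means, sds):
    """Convert coefficients on standardized predictors to the original scales."""
    slopes = [g / s for g, s in zip(slopes_std, sds)]
    intercept = intercept_std - sum(b * m for b, m in zip(slopes, means))
    return intercept, slopes

# Hypothetical predictors and hypothetical coefficients on the standardized scale.
columns = [[1.0, 2.0, 3.0, 4.0], [10.0, 30.0, 20.0, 40.0]]
scaled, means, sds = zip(*(standardize(c) for c in columns))
gamma0, gammas = 5.0, [2.0, -1.0]
beta0, betas = back_transform(gamma0, gammas, means, sds)

# Both parameterizations should give the same predicted value for every case.
max_gap = max(
    abs((gamma0 + sum(g * z[i] for g, z in zip(gammas, scaled)))
        - (beta0 + sum(b * c[i] for b, c in zip(betas, columns))))
    for i in range(4)
)
print(max_gap < 1e-9)
```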
10.6 Exercises
EXERCISE 10.6.1. Reconsider the advertising data of Exercise 9.12.1.
(a) Are there any high-leverage points? Why or why not?
(b) Test whether each case is an outlier using an overall significance level no greater than α = 0.05.
Completely state the appropriate reference distribution.
(c) Discuss the importance of Cook’s distances in regard to these data.
(d) Using only analysis of variance tables, compute $R^2$, the adjusted $R^2$, and the $C_p$ statistic for
$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \varepsilon_i$. Show your work.
(e) In the three-variable model, which if any variable would be deleted by a backwards elimination
method? Why?
EXERCISE 10.6.2. Consider the information given in Table 10.19 on diagnostic statistics for the
wood data of Exercise 9.12.2.
(a) Are there any outliers in the predictor variables? Why are these considered outliers?
(b) Are there any outliers in the dependent variable? If so, why are these considered outliers?
(c) What are the most influential observations in terms of the predictive ability of the model?
EXERCISE 10.6.3. Consider the information in Table 10.20 on best subset regression for the
wood data of Exercise 9.12.2.
(a) In order, what are the three best models as measured by the C_p criterion?
(b) What is the mean squared error for the model with variables x1, x3, and x4?
(c) In order, what are the three best models as measured by the adjusted R² criterion? (Yes, it is
possible to distinguish between the best four!)
(d) What do you think are the best models and what would you do next?
EXERCISE 10.6.4. Consider the information in Table 10.21 on stepwise regression for the wood
data of Exercise 9.12.2.
(a) What is being given in the rows labeled x1 , x2 , x3 , and x4 ? What is being given in the rows labeled
t?
(b) Is this table for forward selection, backwards elimination, stepwise regression, or some other
procedure?
(c) Describe the results of the procedure.
EXERCISE 10.6.6. Reanalyze the Chapman data of Exercise 9.12.4. Examine residuals and
influential observations. Explore the use of the various model selection methods.
EXERCISE 10.6.7. Reanalyze the pollution data of Exercise 9.12.5. Examine residuals and
influential observations. Explore the use of various model selection methods.
EXERCISE 10.6.8. Repeat Exercise 9.12.6 on the body fat data with special emphasis on
diagnostics and model selection.
Chapter 11
Multiple Regression: Matrix Formulation
In this chapter we use matrices to write regression models. Properties of matrices are reviewed in
Appendix A. The economy of notation achieved through using matrices allows us to arrive at some
interesting new insights and to derive several of the important properties of regression analysis.
The expected value of the random vector is just the vector of expected values of the random vari-
ables. For the random variables write E(yi ) = μi , then
\[
E(Y) \equiv \begin{bmatrix} E(y_1) \\ E(y_2) \\ E(y_3) \end{bmatrix}
= \begin{bmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \end{bmatrix} \equiv \mu .
\]
In other words, expectation of a random vector is performed elementwise. In fact, the expected
value of any random matrix (a matrix consisting of random variables) is the matrix made up of the
expected values of the elements in the random matrix. Thus if w_ij, i = 1, 2, 3, j = 1, 2 is a collection
of random variables and we write
\[
W = \begin{bmatrix} w_{11} & w_{12} \\ w_{21} & w_{22} \\ w_{31} & w_{32} \end{bmatrix},
\]
then
\[
E(W) \equiv \begin{bmatrix} E(w_{11}) & E(w_{12}) \\ E(w_{21}) & E(w_{22}) \\ E(w_{31}) & E(w_{32}) \end{bmatrix}.
\]
We also need a concept for random vectors that is analogous to the variance of a random variable.
This is the covariance matrix, sometimes called the dispersion matrix, the variance matrix, or the
variance-covariance matrix. The covariance matrix is simply a matrix consisting of all the variances
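and covariances associated with the vector Y. Write
\[
\mathrm{Var}(y_i) = E\left[(y_i - \mu_i)^2\right] \equiv \sigma_{ii}
\]
and
\[
\mathrm{Cov}(y_i, y_j) = E\left[(y_i - \mu_i)(y_j - \mu_j)\right] \equiv \sigma_{ij}.
\]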
256 11. MULTIPLE REGRESSION: MATRIX FORMULATION
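Two subscripts are used on σ_ii to indicate that it is the variance of y_i rather than writing Var(y_i) = σ_i².
The covariance matrix of our 3 × 1 vector Y is
\[
\mathrm{Cov}(Y) = \begin{bmatrix}
\sigma_{11} & \sigma_{12} & \sigma_{13} \\
\sigma_{21} & \sigma_{22} & \sigma_{23} \\
\sigma_{31} & \sigma_{32} & \sigma_{33}
\end{bmatrix}.
\]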
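\[
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i = 1, \ldots, n, \tag{11.2.1}
\]
E(ε_i) = 0, Var(ε_i) = σ², and Cov(ε_i, ε_j) = 0 for i ≠ j. In matrix terms this can be written as
\[
\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}
= \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}
\begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}
+ \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}.
\]
\[
Y_{n \times 1} = X_{n \times 2}\, \beta_{2 \times 1} + e_{n \times 1}
\]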
Multiplying and adding the matrices on the right-hand side gives
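\[
\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}
= \begin{bmatrix}
\beta_0 + \beta_1 x_1 + \varepsilon_1 \\
\beta_0 + \beta_1 x_2 + \varepsilon_2 \\
\vdots \\
\beta_0 + \beta_1 x_n + \varepsilon_n
\end{bmatrix}.
\]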
These two vectors are equal if and only if the corresponding elements are equal, which occurs if and
only if Model (11.2.1) holds. The conditions on the εi s translate into matrix terms as
\[
E(e) = 0, \qquad \mathrm{Cov}(e) = \sigma^2 I
\]
where I is the n × n identity matrix. By definition, the covariance matrix Cov(e) has the variances
of the ε_i s down the diagonal. The variance of each individual ε_i is σ², so all the diagonal elements
of Cov(e) are σ², just as in σ²I. The covariance matrix Cov(e) has the covariances of distinct ε_i s as
its off-diagonal elements. The covariances of distinct ε_i s are all 0, so all the off-diagonal elements
of Cov(e) are zero, just as in σ²I.
11.2 MATRIX FORMULATION OF REGRESSION MODELS 257
EXAMPLE 11.2.1. Height and weight data are given in Table 11.1 for 12 individuals. In matrix
terms, the SLR model for regressing weights (y) on heights (x) is
\[
\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \\ y_6 \\ y_7 \\ y_8 \\ y_9 \\ y_{10} \\ y_{11} \\ y_{12} \end{bmatrix}
= \begin{bmatrix} 1 & 65 \\ 1 & 65 \\ 1 & 65 \\ 1 & 65 \\ 1 & 66 \\ 1 & 66 \\ 1 & 63 \\ 1 & 63 \\ 1 & 63 \\ 1 & 72 \\ 1 & 72 \\ 1 & 72 \end{bmatrix}
\begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}
+ \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \varepsilon_4 \\ \varepsilon_5 \\ \varepsilon_6 \\ \varepsilon_7 \\ \varepsilon_8 \\ \varepsilon_9 \\ \varepsilon_{10} \\ \varepsilon_{11} \\ \varepsilon_{12} \end{bmatrix}.
\]
The observed data for this example are
\[
\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \\ y_6 \\ y_7 \\ y_8 \\ y_9 \\ y_{10} \\ y_{11} \\ y_{12} \end{bmatrix}
= \begin{bmatrix} 120 \\ 140 \\ 130 \\ 135 \\ 150 \\ 135 \\ 110 \\ 135 \\ 120 \\ 170 \\ 185 \\ 160 \end{bmatrix}.
\]
We could equally well rearrange the order of the observations to write
\[
\begin{bmatrix} y_7 \\ y_8 \\ y_9 \\ y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \\ y_6 \\ y_{10} \\ y_{11} \\ y_{12} \end{bmatrix}
= \begin{bmatrix} 1 & 63 \\ 1 & 63 \\ 1 & 63 \\ 1 & 65 \\ 1 & 65 \\ 1 & 65 \\ 1 & 65 \\ 1 & 66 \\ 1 & 66 \\ 1 & 72 \\ 1 & 72 \\ 1 & 72 \end{bmatrix}
\begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}
+ \begin{bmatrix} \varepsilon_7 \\ \varepsilon_8 \\ \varepsilon_9 \\ \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \varepsilon_4 \\ \varepsilon_5 \\ \varepsilon_6 \\ \varepsilon_{10} \\ \varepsilon_{11} \\ \varepsilon_{12} \end{bmatrix}
\]
in which the x_i values are ordered from smallest to largest. ✷
Y = X β + e, E(e) = 0, Cov(e) = σ 2 I.
where
E(ε_i) = 0, Var(ε_i) = σ², Cov(ε_i, ε_j) = 0, i ≠ j.
In matrix terms this can be written as
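\[
\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}
= \begin{bmatrix}
1 & x_{11} & x_{12} & \cdots & x_{1,p-1} \\
1 & x_{21} & x_{22} & \cdots & x_{2,p-1} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_{n1} & x_{n2} & \cdots & x_{n,p-1}
\end{bmatrix}
\begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \vdots \\ \beta_{p-1} \end{bmatrix}
+ \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}.
\]
\[
Y_{n \times 1} = X_{n \times p}\, \beta_{p \times 1} + e_{n \times 1}
\]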
Multiplying and adding the right-hand side gives
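\[
\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}
= \begin{bmatrix}
\beta_0 + \beta_1 x_{11} + \beta_2 x_{12} + \cdots + \beta_{p-1} x_{1,p-1} + \varepsilon_1 \\
\beta_0 + \beta_1 x_{21} + \beta_2 x_{22} + \cdots + \beta_{p-1} x_{2,p-1} + \varepsilon_2 \\
\vdots \\
\beta_0 + \beta_1 x_{n1} + \beta_2 x_{n2} + \cdots + \beta_{p-1} x_{n,p-1} + \varepsilon_n
\end{bmatrix},
\]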
which holds if and only if (11.2.2) holds. The conditions on the εi s translate into
E(e) = 0,
Cov(e) = σ 2 I,
EXAMPLE 11.2.3. In Example 11.2.1 we illustrated the matrix form of a SLR using the data on
heights and weights. We now illustrate some of the models from Chapter 8 applied to these data.
The cubic model
\[
y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + \varepsilon_i \tag{11.2.3}
\]
is
\[
\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \\ y_6 \\ y_7 \\ y_8 \\ y_9 \\ y_{10} \\ y_{11} \\ y_{12} \end{bmatrix}
= \begin{bmatrix}
1 & 65 & 65^2 & 65^3 \\
1 & 65 & 65^2 & 65^3 \\
1 & 65 & 65^2 & 65^3 \\
1 & 65 & 65^2 & 65^3 \\
1 & 66 & 66^2 & 66^3 \\
1 & 66 & 66^2 & 66^3 \\
1 & 63 & 63^2 & 63^3 \\
1 & 63 & 63^2 & 63^3 \\
1 & 63 & 63^2 & 63^3 \\
1 & 72 & 72^2 & 72^3 \\
1 & 72 & 72^2 & 72^3 \\
1 & 72 & 72^2 & 72^3
\end{bmatrix}
\begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \end{bmatrix}
+ \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \varepsilon_4 \\ \varepsilon_5 \\ \varepsilon_6 \\ \varepsilon_7 \\ \varepsilon_8 \\ \varepsilon_9 \\ \varepsilon_{10} \\ \varepsilon_{11} \\ \varepsilon_{12} \end{bmatrix}.
\]
Some of the numbers in X are getting quite large, e.g., 65³ = 274,625. The model has better
numerical properties if we compute x̄· = 66.41666̄ and replace Model (11.2.3) with the equivalent
model written in powers of (x_i − x̄·) rather than powers of x_i.
This third-degree polynomial is the largest polynomial that we can fit to these data. Two points
determine a line, three points determine a quadratic, and with only four distinct x values in the data,
we cannot fit a model greater than a cubic.
Define x̃ = (x − 63)/9 so that
(x1 , . . . , x12 ) = (65, 65, 65, 65, 66, 66, 63, 63, 63, 72, 72, 72)
transforms to
becomes
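\[
\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \\ y_6 \\ y_7 \\ y_8 \\ y_9 \\ y_{10} \\ y_{11} \\ y_{12} \end{bmatrix}
= \begin{bmatrix}
1 & 65 & 1 & 0 \\
1 & 65 & 1 & 0 \\
1 & 65 & 1 & 0 \\
1 & 65 & 1 & 0 \\
1 & 66 & 1 & 0 \\
1 & 66 & 1 & 0 \\
1 & 63 & 1 & 0 \\
1 & 63 & 1 & 0 \\
1 & 63 & 1 & 0 \\
1 & 72 & 0 & 1 \\
1 & 72 & 0 & 1 \\
1 & 72 & 0 & 1
\end{bmatrix}
\begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \end{bmatrix}
+ \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \varepsilon_4 \\ \varepsilon_5 \\ \varepsilon_6 \\ \varepsilon_7 \\ \varepsilon_8 \\ \varepsilon_9 \\ \varepsilon_{10} \\ \varepsilon_{11} \\ \varepsilon_{12} \end{bmatrix}.
\]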
Notice that the last two columns of the X matrix add up to a column of 1s, like the first column.
This causes the rank of the 12 × 4 model matrix X to be only 3, so the model is not a regression
model. Dropping either of the last two columns (or the first column) does not change the model in
any meaningful way but makes the model a regression.
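The rank deficiency described above can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the variable names are illustrative and do not come from the text.

```python
import numpy as np

# Heights from Example 11.2.1.
x = np.array([65, 65, 65, 65, 66, 66, 63, 63, 63, 72, 72, 72], dtype=float)

# The last two columns are indicators that separate the x = 72 cases from the
# rest, matching the pattern of 0s and 1s in the matrix displayed above.
low = (x < 72).astype(float)
high = (x >= 72).astype(float)
X = np.column_stack([np.ones_like(x), x, low, high])

rank_full = np.linalg.matrix_rank(X)            # 3, although X has 4 columns
rank_reduced = np.linalg.matrix_rank(X[:, :3])  # 3 columns, still rank 3

print(X.shape, rank_full, rank_reduced)
```

Because the two indicator columns sum to the column of 1s, one column is redundant; dropping any one of the three columns involved leaves a full-rank model matrix.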
If we partition the SLR model into points below 65.5 and above 65.5, the matrix model becomes
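\[
\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \\ y_6 \\ y_7 \\ y_8 \\ y_9 \\ y_{10} \\ y_{11} \\ y_{12} \end{bmatrix}
= \begin{bmatrix}
1 & 65 & 0 & 0 \\
1 & 65 & 0 & 0 \\
1 & 65 & 0 & 0 \\
1 & 65 & 0 & 0 \\
0 & 0 & 1 & 66 \\
0 & 0 & 1 & 66 \\
1 & 63 & 0 & 0 \\
1 & 63 & 0 & 0 \\
1 & 63 & 0 & 0 \\
0 & 0 & 1 & 72 \\
0 & 0 & 1 & 72 \\
0 & 0 & 1 & 72
\end{bmatrix}
\begin{bmatrix} \beta_1 \\ \beta_2 \\ \beta_3 \\ \beta_4 \end{bmatrix}
+ \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \varepsilon_4 \\ \varepsilon_5 \\ \varepsilon_6 \\ \varepsilon_7 \\ \varepsilon_8 \\ \varepsilon_9 \\ \varepsilon_{10} \\ \varepsilon_{11} \\ \varepsilon_{12} \end{bmatrix}.
\]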
Alternatively, we could rewrite the model as
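\[
\begin{bmatrix} y_7 \\ y_8 \\ y_9 \\ y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \\ y_6 \\ y_{10} \\ y_{11} \\ y_{12} \end{bmatrix}
= \begin{bmatrix}
1 & 63 & 0 & 0 \\
1 & 63 & 0 & 0 \\
1 & 63 & 0 & 0 \\
1 & 65 & 0 & 0 \\
1 & 65 & 0 & 0 \\
1 & 65 & 0 & 0 \\
1 & 65 & 0 & 0 \\
0 & 0 & 1 & 66 \\
0 & 0 & 1 & 66 \\
0 & 0 & 1 & 72 \\
0 & 0 & 1 & 72 \\
0 & 0 & 1 & 72
\end{bmatrix}
\begin{bmatrix} \beta_1 \\ \beta_2 \\ \beta_3 \\ \beta_4 \end{bmatrix}
+ \begin{bmatrix} \varepsilon_7 \\ \varepsilon_8 \\ \varepsilon_9 \\ \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \varepsilon_4 \\ \varepsilon_5 \\ \varepsilon_6 \\ \varepsilon_{10} \\ \varepsilon_{11} \\ \varepsilon_{12} \end{bmatrix}.
\]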
This makes it a bit clearer that we are fitting a SLR to the points with small x values and a separate
SLR to cases with large x values. The pattern of 0s in the X matrix ensures that the small x values
only involve the intercept and slope parameters β1 and β2 for the line on the first partition set and
that the large x values only involve the intercept and slope parameters β3 and β4 for the line on the
second partition set.
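\[
\begin{bmatrix} y_7 \\ y_8 \\ y_9 \\ y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \\ y_6 \\ y_{10} \\ y_{11} \\ y_{12} \end{bmatrix}
= \begin{bmatrix}
1 & 63 & 0 & 0 \\
1 & 63 & 0 & 0 \\
1 & 63 & 0 & 0 \\
1 & 65 & 0 & 0 \\
1 & 65 & 0 & 0 \\
1 & 65 & 0 & 0 \\
1 & 65 & 0 & 0 \\
1 & 66 & 1 & 66 \\
1 & 66 & 1 & 66 \\
1 & 72 & 1 & 72 \\
1 & 72 & 1 & 72 \\
1 & 72 & 1 & 72
\end{bmatrix}
\begin{bmatrix} \beta_0 \\ \beta_1 \\ \gamma_0 \\ \gamma_1 \end{bmatrix}
+ \begin{bmatrix} \varepsilon_7 \\ \varepsilon_8 \\ \varepsilon_9 \\ \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \varepsilon_4 \\ \varepsilon_5 \\ \varepsilon_6 \\ \varepsilon_{10} \\ \varepsilon_{11} \\ \varepsilon_{12} \end{bmatrix}.
\]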
Here we have changed the first two columns to make them agree with the SLR of Example 11.2.1.
However, notice that if we subtract the third column from the first column we get the first column
of the previous version. Similarly, if we subtract the fourth column from the second column we get
the second column of the previous version. This model has intercept and slope parameters β0 and
β1 for the first partition and intercept and slope parameters (β0 + γ0 ) and (β1 + γ1 ) for the second
partition.
Because of the particular structure of these data with 12 observations but only four distinct
values of x, except for the Haar wavelet model, all of these models are equivalent to one another and
all of them are equivalent to a model with the matrix formulation
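\[
\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \\ y_6 \\ y_7 \\ y_8 \\ y_9 \\ y_{10} \\ y_{11} \\ y_{12} \end{bmatrix}
= \begin{bmatrix}
1 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 \\
0 & 0 & 0 & 1 \\
0 & 0 & 0 & 1
\end{bmatrix}
\begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \end{bmatrix}
+ \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \varepsilon_4 \\ \varepsilon_5 \\ \varepsilon_6 \\ \varepsilon_7 \\ \varepsilon_8 \\ \varepsilon_9 \\ \varepsilon_{10} \\ \varepsilon_{11} \\ \varepsilon_{12} \end{bmatrix}.
\]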
The models are equivalent in that they all give the same fitted values, residuals, and degrees of
freedom for error. We will see in the next chapter that this last matrix model has the form of a
one-way analysis of variance model. ✷
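The claimed equivalence can be illustrated by fitting two of these models and comparing fitted values. This is a sketch under stated assumptions: NumPy is assumed, the cubic is fit in the rescaled predictor (x − 63)/9 purely for numerical convenience, and the indicator columns are ordered by increasing height rather than in the order displayed above, which does not affect the fit.

```python
import numpy as np

x = np.array([65, 65, 65, 65, 66, 66, 63, 63, 63, 72, 72, 72], dtype=float)
y = np.array([120, 140, 130, 135, 150, 135, 110, 135, 120, 170, 185, 160],
             dtype=float)

def fitted_values(X, y):
    """Least squares fitted values X @ beta_hat, computed with lstsq."""
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ beta_hat

# Cubic model in the rescaled predictor; rescaling x leaves the column space
# of the model matrix, and hence the fitted values, unchanged.
xt = (x - 63.0) / 9.0
X_cubic = np.column_stack([np.ones_like(xt), xt, xt**2, xt**3])

# One indicator column per distinct height (the one-way ANOVA form).
levels = np.unique(x)
X_anova = (x[:, None] == levels[None, :]).astype(float)

fit_cubic = fitted_values(X_cubic, y)
fit_anova = fitted_values(X_anova, y)

# With four distinct x values, both fits reproduce the mean weight at each height.
group_means = np.array([y[x == v].mean() for v in x])
df_error = len(y) - np.linalg.matrix_rank(X_anova)   # 12 - 4 = 8

print(np.allclose(fit_cubic, fit_anova), df_error)
```

Identical fitted values imply identical residuals, and both model matrices have rank 4, so the degrees of freedom for error agree as well.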
Other models to be discussed later such as analysis of variance and analysis of covariance mod-
els can also be written as general linear models. However, they are frequently not regression models
in that they frequently have the rank of X less than the number of columns p.
Proposition 11.3.1. If r(X) = p, then $\hat\beta = (X'X)^{-1}X'Y$ is the least squares estimate of β.
but
\[
X'\left[I - X(X'X)^{-1}X'\right] = X' - (X'X)(X'X)^{-1}X' = X' - X' = 0.
\]
Thus
\[
(X\hat\beta - X\beta)'(Y - X\hat\beta) = 0
\]
and similarly
\[
(Y - X\hat\beta)'(X\hat\beta - X\beta) = 0.
\]
Eliminating the two cross-product terms in (11.3.3) gives
\[
(Y - X\beta)'(Y - X\beta) = (Y - X\hat\beta)'(Y - X\hat\beta) + (X\hat\beta - X\beta)'(X\hat\beta - X\beta).
\]
This form is easily minimized. The first of the terms on the right-hand side does not depend
on β, so the β that minimizes $(Y - X\beta)'(Y - X\beta)$ is the β that minimizes the second term
$(X\hat\beta - X\beta)'(X\hat\beta - X\beta)$. The second term is non-negative because it is the sum of squares of
the elements in the vector $X\hat\beta - X\beta$ and it is minimized by making it zero. This is accomplished by
choosing $\beta = \hat\beta$. ✷
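The result can be tried on the height and weight data of Example 11.2.1. The sketch below assumes NumPy; it solves the normal equations rather than forming the inverse explicitly, and the alternative coefficient vector `other` is an arbitrary choice made only for comparison.

```python
import numpy as np

x = np.array([65, 65, 65, 65, 66, 66, 63, 63, 63, 72, 72, 72], dtype=float)
Y = np.array([120, 140, 130, 135, 150, 135, 110, 135, 120, 170, 185, 160],
             dtype=float)
X = np.column_stack([np.ones_like(x), x])

# beta_hat = (X'X)^{-1} X'Y, obtained by solving (X'X) beta = X'Y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# The key step of the proof: X'(Y - X beta_hat) = 0.
residuals = Y - X @ beta_hat
cross_products = X.T @ residuals

def rss(beta):
    """Residual sum of squares (Y - X beta)'(Y - X beta)."""
    r = Y - X @ beta
    return r @ r

other = beta_hat + np.array([1.0, -0.02])   # arbitrary alternative beta
print(beta_hat, rss(beta_hat) <= rss(other))
```

Any choice of `other` gives a residual sum of squares at least as large as that of `beta_hat`, which is what the decomposition in the proof guarantees.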
Y = Z β∗ + e
Proposition 11.3.3. Let A be a fixed r × n matrix, let c be a fixed r × 1 vector, and let Y be an
n × 1 random vector, then
1. E(AY + c) = A E(Y) + c
2. Cov(AY + c) = A Cov(Y) A′.
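Sample means and sample covariance matrices obey the same algebra as E(·) and Cov(·), so the rules can be checked exactly on simulated data. This sketch assumes NumPy and uses hypothetical values for A, c, and the draws of Y; the covariance rule Cov(AY + c) = A Cov(Y) A′ is the standard companion to the expectation rule and is taken here as an assumption about the proposition's second part.

```python
import numpy as np

rng = np.random.default_rng(0)

# 200 hypothetical draws of a 3 x 1 vector Y, stored as the rows of an array.
mixing = np.array([[1.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 2.0]])
draws = rng.normal(size=(200, 3)) @ mixing

A = np.array([[1.0, -1.0, 0.0],
              [0.5, 0.5, 2.0]])   # fixed r x n matrix with r = 2, n = 3
c = np.array([10.0, -3.0])        # fixed r x 1 vector

transformed = draws @ A.T + c     # each row is A y + c

mean_Y = draws.mean(axis=0)
cov_Y = np.cov(draws, rowvar=False)

mean_lhs = transformed.mean(axis=0)           # mean of AY + c
mean_rhs = A @ mean_Y + c                     # A mean(Y) + c
cov_lhs = np.cov(transformed, rowvar=False)   # covariance of AY + c
cov_rhs = A @ cov_Y @ A.T                     # A Cov(Y) A'

print(np.allclose(mean_lhs, mean_rhs), np.allclose(cov_lhs, cov_rhs))
```

Note that the constant vector c shifts the mean but drops out of the covariance matrix.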
Applying these results allows us to find the expected value and covariance matrix for Y in a linear
model. The linear model has Y = X β + e where X β is a fixed vector (even though β is unknown),
E(e) = 0, and Cov(e) = σ 2 I. Applying the proposition gives
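EXAMPLE 11.3.2 CONTINUED. For simple linear regression the covariance matrix becomes
\[
\begin{aligned}
\mathrm{Cov}(\hat\beta) &= \sigma^2 (X'X)^{-1} \\
&= \sigma^2 \frac{1}{n\sum_{i=1}^n (x_i - \bar x_\cdot)^2}
\begin{bmatrix} \sum_{i=1}^n x_i^2 & -\sum_{i=1}^n x_i \\ -\sum_{i=1}^n x_i & n \end{bmatrix} \\
&= \sigma^2 \frac{1}{n\sum_{i=1}^n (x_i - \bar x_\cdot)^2}
\begin{bmatrix} \sum_{i=1}^n x_i^2 - n\bar x_\cdot^2 + n\bar x_\cdot^2 & -n\bar x_\cdot \\ -n\bar x_\cdot & n \end{bmatrix} \\
&= \sigma^2 \frac{1}{n\sum_{i=1}^n (x_i - \bar x_\cdot)^2}
\begin{bmatrix} \sum_{i=1}^n (x_i - \bar x_\cdot)^2 + n\bar x_\cdot^2 & -n\bar x_\cdot \\ -n\bar x_\cdot & n \end{bmatrix} \\
&= \sigma^2
\begin{bmatrix}
\dfrac{1}{n} + \dfrac{\bar x_\cdot^2}{\sum_{i=1}^n (x_i - \bar x_\cdot)^2} & \dfrac{-\bar x_\cdot}{\sum_{i=1}^n (x_i - \bar x_\cdot)^2} \\[2ex]
\dfrac{-\bar x_\cdot}{\sum_{i=1}^n (x_i - \bar x_\cdot)^2} & \dfrac{1}{\sum_{i=1}^n (x_i - \bar x_\cdot)^2}
\end{bmatrix},
\end{aligned}
\]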
which agrees with results given earlier for simple linear regression.
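As a numerical check on this algebra, the closed form can be compared with a direct inversion of X′X. The sketch assumes NumPy and uses the heights from Example 11.2.1; the factor σ² is omitted from both sides.

```python
import numpy as np

x = np.array([65, 65, 65, 65, 66, 66, 63, 63, 63, 72, 72, 72], dtype=float)
n = len(x)
X = np.column_stack([np.ones(n), x])

xbar = x.mean()
sxx = np.sum((x - xbar) ** 2)     # sum of (x_i - xbar)^2

# Final matrix of the display above, without the factor sigma^2.
closed_form = np.array([[1.0 / n + xbar**2 / sxx, -xbar / sxx],
                        [-xbar / sxx,             1.0 / sxx]])

numeric = np.linalg.inv(X.T @ X)

print(np.allclose(numeric, closed_form))
```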
\[
R^2 = \frac{SSReg}{Y'Y - C}.
\]
This is the ratio of the variability explained by the predictor variables to the total variability of the
data. Note that (Y′Y − C)/(n − 1) = s_y², the sample variance of the ys without adjusting for any
structure except the existence of a possibly nonzero mean.
The same result can be obtained from β̂ ′ X ′ X β̂ − C but the algebra is more tedious. ✷
To obtain tests and confidence regions we need to make additional distributional assumptions.
In particular, we assume that the yi s have independent normal distributions. Equivalently, we take
\[
\varepsilon_1, \ldots, \varepsilon_n \ \text{indep. } N(0, \sigma^2).
\]
This shows that $\hat\beta_k$ is an unbiased estimate of $\beta_k$. Before obtaining the standard error of $\hat\beta_k$, it is
necessary to identify its variance. The covariance matrix of $\hat\beta$ is $\sigma^2 (X'X)^{-1}$, so the variance of
$\hat\beta_k$ is the (k + 1)st diagonal element of $\sigma^2 (X'X)^{-1}$. The (k + 1)st diagonal element is appropriate
because the first diagonal element is the variance of $\hat\beta_0$ not $\hat\beta_1$. If we let $a_k$ be the (k + 1)st diagonal
element of $(X'X)^{-1}$ and estimate $\sigma^2$ with MSE, we get a standard error for $\hat\beta_k$ of
\[
SE(\hat\beta_k) = \sqrt{MSE}\,\sqrt{a_k}.
\]
\[
\frac{\hat\beta_k - \beta_k}{SE(\hat\beta_k)} \sim t(n - p).
\]
Standard techniques now provide tests and confidence intervals. For example, a 95% confidence
interval for βk has endpoints
β̂k ± t(.975, n − p) SE(β̂k )
where t(.975, n − p) is the 97.5th percentile of a t distribution with n − p degrees of freedom.
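For the height and weight data of Example 11.2.1, these quantities can be computed directly. This is a sketch assuming NumPy; the percentile t(.975, 10) ≈ 2.228 is entered by hand from a t table rather than computed, and the variable names are illustrative.

```python
import numpy as np

x = np.array([65, 65, 65, 65, 66, 66, 63, 63, 63, 72, 72, 72], dtype=float)
Y = np.array([120, 140, 130, 135, 150, 135, 110, 135, 120, 170, 185, 160],
             dtype=float)
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape                      # n = 12, p = 2

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y

residuals = Y - X @ beta_hat
mse = residuals @ residuals / (n - p)   # estimate of sigma^2 on n - p df

# SE(beta_hat_k) = sqrt(MSE) * sqrt(a_k); a_k is the (k + 1)st diagonal element
# of (X'X)^{-1}, which is element [k, k] with 0-based indexing.
se = np.sqrt(mse * np.diag(XtX_inv))

t_975 = 2.228                       # t(.975, 10), taken from a t table
lower = beta_hat - t_975 * se
upper = beta_hat + t_975 * se

for k in range(p):
    print(f"beta_{k}: estimate {beta_hat[k]:.3f}, SE {se[k]:.3f}, "
          f"95% CI ({lower[k]:.3f}, {upper[k]:.3f})")
```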
A (1 − α )100% simultaneous confidence region for β0 , β1 , . . . , β p−1 consists of all the β vectors
that satisfy
\[
\frac{(\hat\beta - \beta)' X'X (\hat\beta - \beta)/p}{MSE} \le F(1 - \alpha, p, n - p).
\]
This region also determines joint (1 − α)100% confidence intervals for the individual β_k s with limits
\[
\hat\beta_k \pm \sqrt{p\,F(1 - \alpha, p, n - p)}\; SE(\hat\beta_k).
\]
These intervals are an application of Scheffé’s method of multiple comparisons, cf. Section 13.3.
We can also use the Bonferroni method to obtain joint (1 − α)100% confidence intervals with
limits
\[
\hat\beta_k \pm t\!\left(1 - \frac{\alpha}{2p},\, n - p\right) SE(\hat\beta_k).
\]
Finally, we consider estimation of the point on the surface that corresponds to a given set of
predictor variables and the prediction of a new observation with a given set of predictor variables.
Let the predictor variables be x1 , x2 , . . . , x p−1 . Combine these into the row vector
\[
x' = (1, x_1, x_2, \ldots, x_{p-1}).
\]
The point on the surface that we are trying to estimate is the parameter $x'\beta = \beta_0 + \sum_{j=1}^{p-1} \beta_j x_j$. The
least squares estimate is $x'\hat\beta$, which can be thought of as a 1 × 1 matrix. The variance of the estimate
is
\[
\mathrm{Var}(x'\hat\beta) = \mathrm{Cov}(x'\hat\beta) = x'\,\mathrm{Cov}(\hat\beta)\,x = \sigma^2 x'(X'X)^{-1}x,
\]
This is the standard error of the estimated regression surface. The appropriate reference distribution
is
\[
\frac{x'\hat\beta - x'\beta}{SE(x'\hat\beta)} \sim t(n - p)
\]
\[
\begin{aligned}
\hat e &= Y - \hat Y \\
&= Y - X\hat\beta \\
&= Y - X(X'X)^{-1}X'Y \\
&= \left[I - X(X'X)^{-1}X'\right]Y \\
&= (I - M)Y
\end{aligned}
\]
where
\[
M \equiv X(X'X)^{-1}X'.
\]
M is called the perpendicular projection operator (matrix) onto C(X ), the column space of X. M
is the key item in the analysis of the general linear model, cf. Christensen (2011). Note that M is
symmetric, i.e., M = M ′ , and idempotent, i.e., MM = M, so it is a perpendicular projection operator
as discussed in Appendix A. Using these facts, observe that
\[
\begin{aligned}
SSE &= \sum_{i=1}^n \hat\varepsilon_i^2 \\
&= \hat e'\hat e \\
&= [(I - M)Y]'[(I - M)Y] \\
&= Y'(I - M' - M + M'M)Y \\
&= Y'(I - M)Y.
\end{aligned}
\]
The last equality follows from M = M ′ and MM = M. Typically, the covariance matrix is not diag-
onal, so the residuals are not uncorrelated.
The variance of a particular residual ε̂i is σ 2 times the ith diagonal element of (I − M). The ith
diagonal element of (I − M) is the ith diagonal element of I, 1, minus the ith diagonal element of
M, say, mii . Thus
Var(ε̂i ) = σ 2 (1 − mii )
and the standard error of ε̂i is
\[
SE(\hat\varepsilon_i) = \sqrt{MSE(1 - m_{ii})}.
\]
The ith standardized residual is defined as
\[
r_i \equiv \frac{\hat\varepsilon_i}{\sqrt{MSE(1 - m_{ii})}}.
\]
The leverage of the ith case is defined to be mii , the ith diagonal element of M. Some people
like to think of M as the ‘hat’ matrix because it transforms Y into Ŷ , i.e., Ŷ = X β̂ = MY . More
common than the name ‘hat matrix’ is the consequent use of the notation hi for the ith leverage.
This notation was used in Chapter 7 but the reader should realize that hi ≡ mii . In any case, the
leverage can be interpreted as a measure of how unusual x′i is relative to the other rows of the X
matrix, cf. Christensen (2011, Section 13.1).
Christensen (2011, Chapter 13) discusses the computation of standardized deleted residuals and
Cook’s distance.
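The projection operator, leverages, and standardized residuals can be computed directly from these definitions. The sketch below assumes NumPy and uses the height and weight data of Example 11.2.1; forming M explicitly is fine for 12 observations, though it would be wasteful for large n.

```python
import numpy as np

x = np.array([65, 65, 65, 65, 66, 66, 63, 63, 63, 72, 72, 72], dtype=float)
Y = np.array([120, 140, 130, 135, 150, 135, 110, 135, 120, 170, 185, 160],
             dtype=float)
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape

M = X @ np.linalg.inv(X.T @ X) @ X.T   # perpendicular projection onto C(X)
I = np.eye(n)

fitted = M @ Y                          # Y_hat = M Y
residuals = (I - M) @ Y                 # e_hat = (I - M) Y
sse = Y @ (I - M) @ Y                   # SSE = Y'(I - M) Y
mse = sse / (n - p)

leverage = np.diag(M)                   # m_ii, also written h_i
std_resid = residuals / np.sqrt(mse * (1.0 - leverage))

print(np.round(leverage, 3))
print(np.round(std_resid, 2))
```

The cases with x = 72 lie farthest from the mean height, so they carry the largest leverages, and the leverages sum to p because the trace of M equals the rank of X.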
Xd = 0. When this happens the rank of X is not p, so we cannot find $(X'X)^{-1}$ and we cannot find
the estimates of β in Proposition 11.3.1. Near redundancies occur when we can find a vector d
that is not too small, say with d′d = 1, having $Xd \doteq 0$. Principal components (PC) regression is
a method designed to identify near redundancies among the predictor variables. Having identified
near redundancies, they can be eliminated if we so choose. In Section 10.7 we mentioned that having
small collinearity requires more than having small correlations among all the predictor variables, it
requires all partial correlations among the predictor variables to be small as well. For this reason,
eliminating near redundancies cannot always be accomplished by simply dropping well-chosen pre-
dictor variables from the model.
The basic idea of principal components is to find new variables that are linear combinations
of the x j s and that are best able to (linearly) predict the entire set of x j s; see Christensen (2001,
Chapter 3). Thus the first principal component variable is the one linear combination of the x j s
that is best able to predict all of the x j s. The second principal component variable is the linear
combination of the x j s that is best able to predict all the x j s among those linear combinations having
a sample correlation of 0 with the first principal component variable. The third principal component
variable is the best predictor that has sample correlations of 0 with the first two principal component
variables. The remaining principal components are defined similarly. With p − 1 predictor variables,
there are p − 1 principal component variables. The full collection of principal component variables
always predicts the full collection of x j s perfectly. The last few principal component variables are
least able to predict the original x j variables, so they are the least useful. They are also the aspects
of the predictor variables that are most redundant; see Christensen (2011, Section 15.1). The best
(linear) predictors used in defining principal components can be based on either the covariances
between the x j s or the correlations between the x j s. Unless the x j s are measured on the same scale
(with similarly sized measurements), it is generally best to use principal components defined using
the correlations.
For the Coleman Report data, a matrix of sample correlations between the x j s was given in
Example 9.7.1. Principal components are derived from the eigenvalues and eigenvectors of this
matrix, cf. Section A.8. (Alternatively, one could use eigenvalues and eigenvectors of the matrix
of sample covariances.) An eigenvector corresponding to the largest eigenvalue determines the first
principal component variable.
The eigenvalues are given in Table 11.2 along with proportions and cumulative proportions. The
proportions in Table 11.2 are simply the eigenvalues divided by the sum of the eigenvalues. The
cumulative proportions are the sum of the first group of eigenvalues divided by the sum of all the
eigenvalues. In this example, the sum of the eigenvalues is
5 = 2.8368 + 1.3951 + 0.4966 + 0.2025 + 0.0689.
The sum of the eigenvalues must equal the sum of the diagonal elements of the original matrix.
The sum of the diagonal elements of a correlation matrix is the number of variables in the matrix.
The third eigenvalue in Table 11.2 is .4966. The proportion is .4966/5 = .099. The cumulative
proportion is (2.8368 + 1.3951 + 0.4966)/5 = 0.946. With an eigenvalue proportion of 9.9%, the
third principal component variable accounts for 9.9% of the variance associated with predicting the
x j s. Taken together, the first three principal components account for 94.6% of the variance associated
with predicting the x j s because the third cumulative eigenvalue proportion is 0.946.
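The proportions and cumulative proportions are simple arithmetic on the eigenvalues quoted above. A small sketch in plain Python, using only the five eigenvalues given in the text:

```python
from itertools import accumulate

# Eigenvalues quoted in the text for the Coleman Report correlation matrix.
eigenvalues = [2.8368, 1.3951, 0.4966, 0.2025, 0.0689]

# The rounded eigenvalues sum to 4.9999; the exact sum is 5, the number of
# predictor variables.
total = sum(eigenvalues)

proportions = [ev / total for ev in eigenvalues]
cumulative = [running / total for running in accumulate(eigenvalues)]

for k, (ev, prop, cum) in enumerate(zip(eigenvalues, proportions, cumulative), 1):
    print(f"PC{k}: eigenvalue {ev:.4f}, proportion {prop:.3f}, cumulative {cum:.3f}")
```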
For the school data, the principal component (PC) variables are determined by the coefficients
in Table 11.3. The first principal component variable is
for i = 1, . . . , 20 where s1 is the sample standard deviation of the xi1 s, etc. The columns of coeffi-
cients given in Table 11.3 are actually eigenvectors for the correlation matrix of the x j s. The PC1
coefficients are an eigenvector corresponding to the largest eigenvalue, the PC2 coefficients are an
eigenvector corresponding to the second largest eigenvalue, etc.
We can now perform a regression on the new principal component variables. The table of coef-
ficients is given in Table 11.4. The analysis of variance is given in Table 11.5. The value of R2 is
0.906. The analysis of variance table and R2 are identical to those for the original predictor variables
given in Section 9.1. The plot of standardized residuals versus predicted values from the principal
component regression is given in Figure 11.1. This is identical to the plot given in Figure 10.2 for
the original variables. All of the predicted values and all of the standardized residuals are identical.
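The fact that regression on the full set of principal component variables reproduces the original fit can be illustrated on simulated data. This sketch assumes NumPy; the predictors and response are hypothetical (they are not the Coleman Report data), and the principal components are built from the correlation matrix of the predictors.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 20, 5

# Hypothetical correlated predictors and response.
latent = rng.normal(size=(n, 2))
X = latent @ rng.normal(size=(2, k)) + 0.3 * rng.normal(size=(n, k))
y = 35 + X @ np.array([1.0, -0.5, 0.0, 2.0, 0.3]) + rng.normal(size=n)

# Standardize each predictor: subtract its mean, divide by its sample sd.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Eigenvalues and eigenvectors of the correlation matrix, largest first.
R = np.corrcoef(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)      # eigh returns ascending order
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

PC = Z @ eigvecs                          # column j holds the values of PC(j+1)

def fitted(design, response):
    """Fitted values from least squares with an intercept added."""
    D = np.column_stack([np.ones(len(response)), design])
    coef, *_ = np.linalg.lstsq(D, response, rcond=None)
    return D @ coef

fit_x = fitted(X, y)      # regression on the original predictors
fit_pc = fitted(PC, y)    # regression on all k principal components

print(np.allclose(fit_x, fit_pc))
print(np.round(np.corrcoef(PC, rowvar=False), 6))
```

The principal component variables span the same space as the original predictors, so the fitted values agree, while the sample correlations among the principal components are zero.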
Since Table 11.5 and Figure 11.1 are unchanged, any usefulness associated with principal
component regression must come from Table 11.4. The principal component variables display no
collinearity. Thus, contrary to the warnings given earlier about the effects of collinearity, we can
make final conclusions about the importance of variables directly from Table 11.4. We do not have
to worry about fitting one model after another or about which variables are included in which mod-
els. From examining Table 11.4, it is clear that the important variables are PC1, PC3, and PC4. We
can construct a reduced model with these three; the estimated regression surface is simply
\[
\hat y = 35.0825 - 2.9419\,PC1 - 2.0457\,PC3 + 4.380\,PC4, \tag{11.6.2}
\]
where we merely used the estimated regression coefficients from Table 11.4. Refitting the reduced
model is unnecessary because there is no collinearity.
Figure 11.1: Standardized residuals versus predicted values for principal component regression. (Plot omitted; vertical axis: standardized residuals, horizontal axis: fitted values.)
To get predictions for a new set of x j s, just compute the corresponding PC1, PC3, and PC4
variables using formulae similar to those in Equation (11.6.1) and make the predictions using the
fitted model in Equation (11.6.2). When using equations like (11.6.1) to obtain new values of the
principal component variables, continue to use the x̄· j s and s j s computed from only the original
observations.
As an alternative to this prediction procedure, we could use the definitions of the principal
component variables, e.g., Equation (11.6.1), and substitute for PC1, PC3, and PC4 in Equation
(11.6.2) to obtain estimated coefficients on the original x j variables.
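\[
\begin{aligned}
\hat y &= 35.0825 + [-2.9419,\ -2.0457,\ 4.380]
\begin{bmatrix} PC1 \\ PC3 \\ PC4 \end{bmatrix} \\
&= 35.0825 + [-2.9419,\ -2.0457,\ 4.380] \times
\begin{bmatrix}
-0.229 & -0.555 & -0.545 & -0.170 & -0.559 \\
0.723 & 0.051 & -0.106 & -0.680 & -0.037 \\
0.018 & -0.334 & 0.823 & -0.110 & -0.445
\end{bmatrix}
\begin{bmatrix}
(x_1 - \bar x_{\cdot 1})/s_1 \\ (x_2 - \bar x_{\cdot 2})/s_2 \\ (x_3 - \bar x_{\cdot 3})/s_3 \\ (x_4 - \bar x_{\cdot 4})/s_4 \\ (x_5 - \bar x_{\cdot 5})/s_5
\end{bmatrix} \\
&= 35.0825 + [-0.72651,\ 0.06550,\ 5.42492,\ 1.40940,\ -0.22889] \times
\begin{bmatrix}
(x_1 - 2.731)/0.454 \\ (x_2 - 40.91)/25.90 \\ (x_3 - 3.14)/9.63 \\ (x_4 - 25.069)/1.314 \\ (x_5 - 6.255)/0.654
\end{bmatrix}.
\end{aligned}
\]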
Obviously this can be simplified into a form ŷ = β̃0 + β̃1 x1 + β̃2 x2 + β̃3 x3 + β̃4 x4 + β̃5 x5 , which
in turn simplifies the process of making predictions and provides new estimated regression coeffi-
cients for the x j s that correspond to the fitted principal component model. In this case they become
ŷ = 12.866 − 1.598x1 + 0.002588x2 + 0.5639x3 + 1.0724x4 − 0.3484x5. These PC regression esti-
mates of the original β j s can be compared to the least squares estimates. Many computer programs
for performing PC regression report these estimates of the β j s and their corresponding standard
errors. A similar method is used to obtain lasso estimates when the lasso procedure is performed on
standardized predictor variables, cf. Section 10.5.
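The back-transformation to the original x scales can be reproduced from the numbers displayed above. This sketch assumes NumPy; all inputs are the rounded values printed in the text, so the results agree with the reported coefficients only to about two or three significant digits.

```python
import numpy as np

intercept_pc = 35.0825
gamma = np.array([-2.9419, -2.0457, 4.380])   # coefficients on PC1, PC3, PC4

# Rows: PC1, PC3, PC4 coefficients on the standardized x1, ..., x5.
V = np.array([[-0.229, -0.555, -0.545, -0.170, -0.559],
              [0.723, 0.051, -0.106, -0.680, -0.037],
              [0.018, -0.334, 0.823, -0.110, -0.445]])

means = np.array([2.731, 40.91, 3.14, 25.069, 6.255])   # sample means of the x_j
sds = np.array([0.454, 25.90, 9.63, 1.314, 0.654])      # sample sds of the x_j

# Coefficients on the standardized predictors (x_j - mean_j) / s_j.
coef_std = gamma @ V

# Coefficients on the original predictors:
# slope_j = coef_std_j / s_j, intercept = 35.0825 - sum_j coef_std_j * mean_j / s_j.
slopes = coef_std / sds
intercept = intercept_pc - np.sum(coef_std * means / sds)

print(np.round(coef_std, 5))
print(round(float(intercept), 3), np.round(slopes, 5))
```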
It was mentioned earlier that collinearity tends to increase the variance of regression coeffi-
cients. The fact that the later principal component variables are more nearly redundant is reflected
in Table 11.4 by the fact that the standard errors for their estimated regression coefficients increase
(excluding the intercept).
One rationale for using PC regression is that you just don’t believe in using nearly redundant
variables. The exact nature of such variables can be changed radically by small errors in the x j s. For
this reason, one might choose to ignore PC5 because of its small eigenvalue proportion, regardless
of any importance it may display in Table 11.4. If the t statistic for PC5 appeared to be significant,
it could be written off as a chance occurrence or, perhaps more to the point, as something that is un-
likely to be reproducible. If you don’t believe redundant variables, i.e., if you don’t believe that they
are themselves reproducible, any predictive ability due to such variables will not be reproducible
either.
When considering PC5, the case is pretty clear. PC5 accounts for only about 1.5% of the vari-
ability involved in predicting the x j s. It is a very poorly defined aspect of the predictor variables
x j and, anyway, it is not a significant predictor of y. The case is less clear when considering PC4.
This variable has a significant effect for explaining y, but it accounts for only 4% of the variability
in predicting the x j s, so PC4 is reasonably redundant within the x j s. If this variable is measuring
some reproducible aspect of the original x j data, it should be included in the regression. If it is not
reproducible, it should not be included. From examining the PC4 coefficients in Table 11.3, we see
that PC4 is roughly the average of the percent white-collar fathers x2 and the mothers’ education
x5 contrasted with the socioeconomic variable x3 . (Actually, this comparison is between the
variables after they have been adjusted for their means and standard deviations as in Equation (11.6.1).)
If PC4 strikes the investigator as a meaningful, reproducible variable, it should be included in the
regression.
In our discussion, we have used PC regression both to eliminate questionable aspects of the
predictor variables and as a method for selecting a reduced model. We dropped PC5 primarily
because it was poorly defined. We dropped PC2 solely because it was not a significant predictor.
Some people might argue against this second use of PC regression and choose to take a model based
on PC1, PC2, PC3, and possibly PC4.
On occasion, PC regression is based on the sample covariance matrix of the x j s rather than the
sample correlation matrix. Again, eigenvalues and eigenvectors are used, but in using relationships
like Equation (11.6.1), the s j s are deleted. The eigenvalues and eigenvectors for the covariance ma-
trix typically differ from those for the correlation matrix. The relationship between estimated prin-
cipal component regression coefficients and original least squares regression coefficient estimates
is somewhat simpler when using the covariance matrix.
It should be noted that PC regression is just as sensitive to violations of the assumptions as reg-
ular multiple regression. Outliers and high-leverage points can be very influential in determining
the results of the procedure. Tests and confidence intervals rely on the independence, homoscedas-
ticity, and normality assumptions. Recall that in the full principal components regression model,
the residuals and predicted values are identical to those from the regression on the original predic-
tor variables. Moreover, highly influential points in the original predictor variables typically have a
large influence on the coefficients in the principal component variables.
11.7 Exercises
EXERCISE 11.7.1. Show that the form (11.3.2) simplifies to the form (11.3.1) for simple linear
regression.
EXERCISE 11.7.5. Do a principal components regression on the Younger data from
Exercise 9.12.1.
EXERCISE 11.7.6. Do a principal components regression on the Prater data from Exercise 9.12.3.
EXERCISE 11.7.8. Do a principal components regression on the pollution data of Exercise 9.12.5.
EXERCISE 11.7.9. Do a principal components regression on the body fat data of Exercise 9.12.6.
Chapter 12
One-Way ANOVA
Analysis of variance (ANOVA) involves comparing random samples from several populations
(groups). Often the samples arise from observing experimental units with different treatments ap-
plied to them and we refer to the populations as treatment groups. The sample sizes for the groups
are possibly different, say, Ni and we assume that the samples are all independent. Moreover, we
assume that each population has the same variance and is normally distributed. Assuming different
means for each group we have a model
\[
y_{ij} = \mu_i + \varepsilon_{ij}, \qquad \varepsilon_{ij}\text{s independent } N(0, \sigma^2)
\]
or, equivalently,
\[
y_{ij}\text{s independent } N(\mu_i, \sigma^2),
\]
where with a groups, i = 1, . . . , a, and with Ni observations in the ith group, j = 1, . . . , Ni . There is
one mean parameter μi for each group and it is estimated by the sample mean of the group, say, ȳi· .
Relating this model to the general models of Section 3.9, we have replaced the single subscript h that
identifies all observations with a double subscript i j in which i identifies a group and j identifies an
observation within the group. The group identifier i is our (categorical) predictor variable. The fitted
values are ŷh ≡ ŷi j = ȳi· , i.e., the point prediction we make for any observation is just the sample
mean from the observation’s group. The residuals are ε̂h ≡ ε̂i j = yi j − ȳi· . The total sample size is
n = N1 + · · · + Na . The model involves estimating a mean values, one for each group, so dfE = n − a.
The SSE is
\[ SSE = \sum_{h=1}^{n} \hat{\varepsilon}_h^2 = \sum_{i=1}^{a} \sum_{j=1}^{N_i} \hat{\varepsilon}_{ij}^2 . \]
12.1 Example
EXAMPLE 12.1.1. Table 12.1 gives data from Koopmans (1987, p. 409) on the ages at which
suicides were committed in Albuquerque during 1978. Ages are listed by ethnic group. The data
are plotted in Figure 12.1. The assumption is that the observations in each group are a random
sample from some population. While it is not clear what these populations would be, we proceed to
examine the data. Note that there are fewer Native Americans in the study than either Hispanics or
non-Hispanic Caucasians (Anglos); moreover the ages for Native Americans seem to be both lower
and less variable than for the other groups. The ages for Hispanics seem to be a bit lower than for
non-Hispanic Caucasians. Summary statistics follow for the three groups.
Sample statistics: Suicide ages
Group          N_i    ȳ_i·     s_i²     s_i
Caucasians      44   41.66    282.9   16.82
Hispanics       34   35.06    268.3   16.38
Native Am.      15   25.07     74.4    8.51
[Figure 12.1: Dot plots of suicide ages for Caucasians, Hispanics, and Native Americans on a common horizontal axis running from 15 to 90.]
The sample standard deviation for the Native Americans is about half the size of the others.
To evaluate the combined normality of the data, we did a normal plot of the standardized residu-
als. One normal plot for all of the yi j s would not be appropriate because they have different means,
μi . The residuals adjust for the different means. Of course with the reasonably large samples avail-
able here for each group, it would be permissible to do three separate normal plots, but in other
situations with small samples for each group, individual normal plots would not contain enough
observations to be of much value. The normal plot for the standardized residuals is given as Fig-
ure 12.2. The plot is based on n = 44 + 34 + 15 = 93 observations. This is quite a large number, so
if the data are normal the plot should be quite straight. In fact, the plot seems reasonably curved.
In order to improve the quality of the assumptions of equal variances and normality, we consider
transformations of the data. In particular, consider taking the log of each observation. Figure 12.3
contains the plot of the transformed data. The variability in the groups seems more nearly the same.
This is confirmed by the following sample statistics.
Sample statistics: Log of suicide ages
Group          N_i     ȳ_i·      s_i²      s_i
Caucasians      44   3.6521    0.1590   0.3987
Hispanics       34   3.4538    0.2127   0.4612
Native Am.      15   3.1770    0.0879   0.2965
The largest sample standard deviation is only about 1.5 times the smallest. The normal plot of stan-
dardized residuals for the transformed data is given in Figure 12.4; it seems considerably straighter
than the normal plot for the untransformed data.
[Figure 12.2: Normal Q-Q plot of the standardized residuals for the suicide ages. Horizontal axis: Theoretical Quantiles; vertical axis: Standardized residuals.]
[Figure 12.3: Dot plots of the log suicide ages for Caucasians, Hispanics, and Native Americans on a common horizontal axis running from 2.80 to 4.55.]
All in all, the logs of the original data seem to satisfy the assumptions reasonably well and
considerably better than the untransformed data. The square roots of the data were also examined
as a possible transformation. While the square roots seem to be an improvement over the original
scale, they do not seem to satisfy the assumptions nearly as well as the log transformed data.
A basic assumption in analysis of variance is that the variance is the same for all groups. Al-
though we can find the MSE as the sum of the squared residuals divided by the degrees of freedom
for error, equivalently, as we did for two independent samples with the same variance, we can also
compute it as a pooled estimate of the variance. This is a weighted average of the variance estimates
from the individual groups with weights that are the individual degrees of freedom. For the logs of
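the suicide age data, the mean squared error is
\[ MSE = \frac{(44 - 1)(0.1590) + (34 - 1)(0.2127) + (15 - 1)(0.0879)}{(44 - 1) + (34 - 1) + (15 - 1)} = 0.168. \]
[Figure 12.4: Normal Q-Q plot of the standardized residuals for the log suicide ages. Horizontal axis: Theoretical Quantiles; vertical axis: Standardized residuals.]
[Figure 12.5: Residual-Fitted plot for the log suicide ages. Horizontal axis: Fitted; vertical axis: Standardized residuals.]
The degrees of freedom for this estimate are the sum of the degrees of freedom for the individual
variance estimates, $s_i^2$, so the degrees of freedom for error are
\[ dfE = (44 - 1) + (34 - 1) + (15 - 1) = 90. \]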
This is also the total number of observations, n = 93, minus the number of mean parameters we
have to estimate, a = 3. The data have an approximate normal distribution, so we can use t(90) as
the reference distribution for statistical inferences on a single parameter. The sum of squares error
is SSE ≡ dfE × MSE.
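The pooled estimate can be checked numerically. The following Python sketch is an illustration added here, not part of the original analysis; the helper name `pooled_mse` is hypothetical, and the inputs are the rounded summary statistics for the log suicide ages, so the results agree with the text only up to round-off.

```python
# Summary statistics for the log suicide ages, copied from the table in the text.
sizes = [44, 34, 15]                    # N_i
variances = [0.1590, 0.2127, 0.0879]    # s_i^2

def pooled_mse(sizes, variances):
    """Pool group variances, weighting each by its degrees of freedom N_i - 1.

    Hypothetical helper: returns (MSE, dfE, SSE).
    """
    dfe = sum(n - 1 for n in sizes)
    sse = sum((n - 1) * v for n, v in zip(sizes, variances))
    return sse / dfe, dfe, sse

mse, dfe, sse = pooled_mse(sizes, variances)
print(round(mse, 3), dfe, round(sse, 3))
```

Because SSE is recovered as dfE × MSE, the same function returns all three quantities of the Error line.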
For completeness, we also include the residual-fitted value plot as Figure 12.5.
We can now perform statistical inferences for a variety of parameters using our standard pro-
cedure involving a Par, an Est, a SE(Est), and a t(dfE) distribution for [Est − Par]/SE(Est). In
this example, perhaps the most useful things to look at are whether there is evidence of any age
differences in the three groups. Let μC ≡ μ1 , μH ≡ μ2 , and μN ≡ μ3 denote the population means
for the log ages of the non-Hispanic Caucasian (Anglo), Hispanic, and Native American groups,
respectively. First, we briefly consider inferences for one of the group means. Our most lengthy dis-
cussion is for differences between group means. We then discuss more complicated linear functions
of the group means. Finally, we discuss testing μC = μH = μN .
The estimates and variances are obtained exactly as in Section 4.2. The standard errors of the esti-
mates are obtained by substituting MSE for σ 2 in the variance formula and taking the square root.
Below are given the estimates, standard errors, the tobs values for testing H0 : Par = 0, the P values,
and the 99% confidence intervals for Par. Computing the confidence intervals requires the value
t(0.995, 90) = 2.632.
Table of Coefficients
Par          Est      SE(Est)   tobs     P       99% CI
μC − μH    0.1983    0.0936     2.12   0.037   (−0.04796, 0.44456)
μC − μN    0.4751    0.1225     3.88   0.000   (0.15280, 0.79740)
μH − μN    0.2768    0.1270     2.18   0.032   (−0.05734, 0.61094)
While the estimated difference between Hispanics and Native Americans is half again as large as
the difference between non-Hispanic Caucasians and Hispanics, the tobs values, and thus the sig-
nificance levels of the differences, are almost identical. This occurs because the standard errors are
substantially different. The standard error for the estimate of μC − μH involves only the reasonably
large samples for non-Hispanic Caucasians and Hispanics; the standard error for the estimate of
μH − μN involves the comparatively small sample of Native Americans, which is why this standard
error is larger. On the other hand, the standard errors for the estimates of μC − μN and μH − μN are
very similar. The difference in the standard error between having a sample of 34 or 44 is minor by
comparison to the effect on the standard error of having a sample size of only 15.
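The effect of sample size on these standard errors is easy to reproduce from the summary statistics. The sketch below is an added illustration under stated assumptions: it uses the rounded values MSE = 0.168 and t(0.995, 90) = 2.632 quoted in the text, and the function name `compare` is hypothetical. Its results match the Table of Coefficients only up to round-off.

```python
import math

# Rounded summary values quoted in the text for the log suicide ages.
MEANS = {"C": 3.6521, "H": 3.4538, "N": 3.1770}
SIZES = {"C": 44, "H": 34, "N": 15}
MSE = 0.168
T_MULT = 2.632  # t(0.995, 90), as quoted in the text

def compare(g1, g2):
    """Estimate, standard error, t statistic and 99% interval for mu_g1 - mu_g2."""
    est = MEANS[g1] - MEANS[g2]
    se = math.sqrt(MSE * (1 / SIZES[g1] + 1 / SIZES[g2]))
    t_obs = est / se
    return est, se, t_obs, (est - T_MULT * se, est + T_MULT * se)

for pair in [("C", "H"), ("C", "N"), ("H", "N")]:
    est, se, t_obs, (low, high) = compare(*pair)
    print(pair, round(est, 4), round(se, 4), round(t_obs, 2), round(low, 3), round(high, 3))
```

Both comparisons involving the Native American group carry the 1/15 term, which dominates the sum inside the square root.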
The hypothesis H0 : μC − μN = 0, or equivalently H0 : μC = μN , is the only one rejected at the
0.01 level. Summarizing the results of the tests at the 0.01 level, we have no strong evidence of a
difference between the ages at which non-Hispanic Caucasians and Hispanics commit suicide, we
have no strong evidence of a difference between the ages at which Hispanics and Native Americans
commit suicide, but we do have strong evidence that there is a difference in the ages at which non-
Hispanic Caucasians and Native Americans commit suicide. Of course, all of these statements about
null hypotheses presume that the underlying model is correct.
Establishing a difference between non-Hispanic Caucasians and Native Americans does little to
explain why that difference exists. The reason that Native Americans committed suicide at younger
ages could be some complicated function of socio-economic factors or it could be simply that there
were many more young Native Americans than old ones in Albuquerque at the time. The test only
indicates that the two groups were different; it says nothing about why the groups were different.
The confidence interval for the difference between non-Hispanic Caucasians and Native Americans
was constructed on the log scale. Back transforming the interval gives $(e^{0.1528}, e^{0.7974})$ or
(1.2, 2.2). We are 99% confident that the median age of suicides is between 1.2 and 2.2 times higher
for non-Hispanic Caucasians than for Native Americans. Note that examining differences in log
ages transforms to the original scale as a multiplicative factor between groups. The parameters $\mu_C$
and $\mu_N$ are both means and medians for the logs of the suicide ages. When we transform the interval
(0.1528, 0.7974) for $\mu_C - \mu_N$ into the interval $(e^{0.1528}, e^{0.7974})$, we obtain a confidence interval for
$e^{\mu_C - \mu_N}$ or equivalently for $e^{\mu_C}/e^{\mu_N}$. The values $e^{\mu_C}$ and $e^{\mu_N}$ are median values for the age
distributions of the non-Hispanic Caucasians and Native Americans although they are not the expected
values (population means) of the distributions. Obviously, $e^{\mu_C} = (e^{\mu_C}/e^{\mu_N})\, e^{\mu_N}$, so $e^{\mu_C}/e^{\mu_N}$ is the
number of times greater the median suicide age is for non-Hispanic Caucasians. That is the basis
for the interpretation of the interval $(e^{0.1528}, e^{0.7974})$.
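The back transformation is a one-line computation. This added Python sketch exponentiates the log-scale endpoints quoted in the text; the variable names are illustrative only.

```python
import math

# 99% interval for mu_C - mu_N on the log scale, as quoted in the text.
log_ci = (0.1528, 0.7974)

# Exponentiating each endpoint gives an interval for the ratio of medians.
ratio_ci = tuple(math.exp(endpoint) for endpoint in log_ci)
print(tuple(round(r, 1) for r in ratio_ci))
```

Because the exponential function is increasing, the ordering of the endpoints is preserved and the coverage statement carries over to the ratio of medians.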
With these data, the tests for differences in means do not depend crucially on the log trans-
formation but interpretations of the confidence intervals do. For the untransformed data, the mean
squared error is MSEu = 245 and the observed value of the test statistic for comparing non-Hispanic
Caucasians and Native Americans is
\[ t_u = 3.54 = \frac{41.66 - 25.07}{\sqrt{245\left[\frac{1}{44} + \frac{1}{15}\right]}}, \]
which is not far from the transformed value 3.88. However, the untransformed 99% confidence in-
terval is (4.3, 28.9), indicating a 4-to-29-year-higher age for the mean non-Hispanic Caucasian sui-
cide, rather than the transformed interval (1.2, 2.2), indicating that typical non-Hispanic Caucasian
suicide ages are 1.2 to 2.2 times greater than those for Native Americans.
12.1.3 Inference on linear functions of means
The data do not strongly suggest that the means for Hispanics and Native Americans are different,
so we might wish to compare the mean of the non-Hispanic Caucasians with the average of these
groups. Typically, averaging means will only be of interest if we feel comfortable treating those
means as the same. The parameter of interest is Par = μC − (μH + μN )/2 or
\[ Par = \mu_C - \frac{1}{2}\mu_H - \frac{1}{2}\mu_N \]
with
\[ Est = \bar{y}_C - \frac{1}{2}\bar{y}_H - \frac{1}{2}\bar{y}_N = 3.6521 - \frac{1}{2}(3.4538) - \frac{1}{2}(3.1770) = 0.3367. \]
It is not really appropriate to use our standard methods to test this contrast between the means
because the contrast was suggested by the data. Nonetheless, we will illustrate the standard methods.
From the independence of the data in the three groups and Proposition 1.2.11, the variance of the
estimate is
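\begin{align*}
Var\left(\bar{y}_C - \frac{1}{2}\bar{y}_H - \frac{1}{2}\bar{y}_N\right)
  &= Var(\bar{y}_C) + \left(\frac{-1}{2}\right)^2 Var(\bar{y}_H) + \left(\frac{-1}{2}\right)^2 Var(\bar{y}_N) \\
  &= \frac{\sigma^2}{44} + \left(\frac{-1}{2}\right)^2 \frac{\sigma^2}{34} + \left(\frac{-1}{2}\right)^2 \frac{\sigma^2}{15} \\
  &= \sigma^2 \left[\frac{1}{44} + \left(\frac{-1}{2}\right)^2 \frac{1}{34} + \left(\frac{-1}{2}\right)^2 \frac{1}{15}\right].
\end{align*}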
Substituting the MSE for σ 2 and taking the square root, the standard error is
\[ 0.0886 = \sqrt{0.168\left[\frac{1}{44} + \left(\frac{-1}{2}\right)^2 \frac{1}{34} + \left(\frac{-1}{2}\right)^2 \frac{1}{15}\right]} . \]
Note that the standard error happens to be smaller than any of those we have considered when
comparing pairs of means. To test the null hypothesis that the mean for non-Hispanic Caucasians
equals the average of the other groups, i.e., $H_0: \mu_C - \frac{1}{2}\mu_H - \frac{1}{2}\mu_N = 0$, the test statistic is
\[ t_{obs} = \frac{0.3367 - 0}{0.0886} = 3.80, \]
so the null hypothesis is easily rejected. This is an appropriate test statistic for evaluating H0 , but
when letting the data suggest the parameter, the t(90) distribution is no longer appropriate for quan-
tifying the level of significance. Similarly, we could construct the 99% confidence interval with
endpoints
0.3367 ± 2.631(0.0886)
but again, the confidence coefficient 99% is not really appropriate for a parameter suggested by the
data.
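The arithmetic for this linear function of the means can be reproduced as follows. This is an added sketch, not the author's code: it uses the rounded values MSE = 0.168 and the multiplier 2.631 that appear in the text, and all variable names are illustrative. As the text stresses, the nominal 99% level is not really appropriate for a parameter suggested by the data; the code only reproduces the arithmetic.

```python
import math

# Coefficients of mu_C - (1/2) mu_H - (1/2) mu_N and rounded summary statistics.
lambdas = [1.0, -0.5, -0.5]
means = [3.6521, 3.4538, 3.1770]
sizes = [44, 34, 15]
MSE = 0.168
T_MULT = 2.631  # multiplier used for the interval in the text

est = sum(l * m for l, m in zip(lambdas, means))
se = math.sqrt(MSE * sum(l ** 2 / n for l, n in zip(lambdas, sizes)))
t_obs = est / se
ci = (est - T_MULT * se, est + T_MULT * se)
print(round(est, 4), round(se, 4), round(t_obs, 2), tuple(round(c, 3) for c in ci))
```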
While the parameter $\mu_C - \frac{1}{2}\mu_H - \frac{1}{2}\mu_N$ was suggested by the data, the theory of inference in
Chapter 3 assumes that the parameter of interest does not depend on the data. In particular, the
reference distributions we have used are invalid when the parameters depend on the data. Moreover,
performing numerous inferential procedures complicates the analysis. Our standard tests are set up
to check on one particular hypothesis. In the course of analyzing these data we have performed
several tests. Thus we have had multiple opportunities to commit errors. In fact, the reason we have
been discussing 0.01 level tests rather than 0.05 level tests is to help limit the number of errors made
when all of the null hypotheses are true. In Chapter 13, we discuss methods of dealing with the
problems that arise from making multiple comparisons among the means.
12.1.4 Testing μ1 = μ2 = μ3
To test H0 : μ1 = μ2 = μ3 we test the one-way ANOVA model against the reduced model that fits
only the grand mean (intercept), yi j = μ + εi j . The results are summarized in Table 12.2. Subject to
round-off error, the information for the Error line is as given previously for the one-way ANOVA
model, i.e., dfE = 90, MSE = 0.168, and SSE = 90(0.168) = 15.088. The information in the Total
line is the dfE and SSE for the grand-mean model. For the grand-mean model, dfE = n − 1 =
92, MSE = $s_y^2$ = 0.193, i.e., the sample variance of all n = 93 observations, and the SSE is found
by multiplying the two, SSE = 92(0.193) = 17.743. The df and SS for Groups are found by
subtracting the entries in the Error line from the Total line, so the df and SS are precisely what
we need to compute the numerator of the F statistic, dfGrps = 92 − 90 = 2, SSGrps = 17.743 −
15.088 = 2.655. The reported F statistic
\[ 7.92 = \frac{1.328}{0.168} = \frac{MSGrps}{MSE} = \frac{[SSTot - SSE]/[dfTot - dfE]}{MSE} \]
is the statistic for testing our reduced (null) model.
The extremely small P value for the analysis of variance F test, as reported in Table 12.2, es-
tablishes clear differences among the mean log suicide ages. More detailed comparisons are needed
to identify which particular groups are different. We established earlier that at the 0.01 level, only
non-Hispanic Caucasians and Native Americans display a pairwise difference.
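The Groups line can be rebuilt from the Error and Total lines by subtraction. The following added Python sketch uses the values quoted in the text (variable names are illustrative); small discrepancies from Table 12.2 are round-off.

```python
# Error line (one-way ANOVA model) and Total line (grand-mean model), from the text.
dfe, sse = 90, 15.088
df_tot, ss_tot = 92, 17.743

# The Groups line is the difference between the Total and Error lines.
df_grps = df_tot - dfe
ss_grps = ss_tot - sse
ms_grps = ss_grps / df_grps

mse = sse / dfe
f_obs = ms_grps / mse
print(df_grps, round(ss_grps, 3), round(f_obs, 2))
```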
12.2 Theory
In analysis of variance, we assume that we have independent observations on, say, a different normal
populations with the same variance. In particular, we assume the following data structure.
Sample    Data                              Distribution
1         y_11, y_12, . . . , y_1N_1        iid N(μ_1, σ²)
2         y_21, y_22, . . . , y_2N_2        iid N(μ_2, σ²)
⋮         ⋮                                 ⋮
a         y_a1, y_a2, . . . , y_aN_a        iid N(μ_a, σ²)
Here each sample is independent of the other samples. These assumptions are written more suc-
cinctly as the one-way analysis of variance model
yh = m(xh ) + εh , h = 1, . . . , n, (12.2.3)
where n ≡ N1 + · · · + Na . In this case the predictor variable xh takes on one of a distinct values to
identify the group for each observation. Suppose xh takes on the values 1, 2, . . . , a, then we identify
μ1 ≡ m(1), . . . , μa ≡ m(a).
The model involves a distinct mean parameters, so dfE = n − a. Switching from the h subscripts to
the i j subscripts gives Model (12.2.1) with xh = i.
To analyze the data, we compute summary statistics from each sample. These are the sample
means and sample variances. For the ith group of observations, the sample mean is
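\[ \bar{y}_{i\cdot} \equiv \frac{1}{N_i} \sum_{j=1}^{N_i} y_{ij} \]
and the sample variance is
\[ s_i^2 \equiv \frac{1}{N_i - 1} \sum_{j=1}^{N_i} \left(y_{ij} - \bar{y}_{i\cdot}\right)^2 . \]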
With independent normal errors having the same variance, all of the summary statistics are indepen-
dent of one another. Except for checking the validity of our assumptions, these summary statistics
are more than sufficient for the entire analysis. Typically, we present the summary statistics in tab-
ular form.
Sample statistics
Group    Size    Mean     Variance
1        N_1     ȳ_1·     s_1²
2        N_2     ȳ_2·     s_2²
⋮        ⋮       ⋮        ⋮
a        N_a     ȳ_a·     s_a²
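The sample means, the $\bar{y}_{i\cdot}$s, are estimates of the corresponding $\mu_i$s and the $s_i^2$s all estimate the
common population variance $\sigma^2$. With unequal sample sizes an efficient pooled estimate of $\sigma^2$ must
be a weighted average of the $s_i^2$s. The weights are the degrees of freedom associated with the various
estimates. The pooled estimate of $\sigma^2$ is the mean squared error (MSE),
\[ MSE \equiv \frac{(N_1 - 1)s_1^2 + \cdots + (N_a - 1)s_a^2}{\sum_{i=1}^{a} (N_i - 1)} = \frac{1}{n - a} \sum_{i=1}^{a} \sum_{j=1}^{N_i} \left(y_{ij} - \bar{y}_{i\cdot}\right)^2 . \]
The degrees of freedom for the MSE are the degrees of freedom for error,
\[ dfE \equiv n - a = \sum_{i=1}^{a} (N_i - 1). \]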
This is the sum of the degrees of freedom for the individual variance estimates. Note that the MSE
depends only on the sample variances, so, with independent normal errors having the same variance,
MSE is independent of the ȳi· s.
A simple average of the sample variances $s_i^2$ is not reasonable. If we had $N_1$ = 1,000,000
observations in the first sample and only $N_2$ = 5 observations in the second sample, obviously the
variance estimate from the first sample is much better than that from the second and we want to give
it more weight.
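A small numerical illustration makes the point concrete. In this added sketch the sample sizes are those mentioned in the text, but the two sample variances are hypothetical values chosen only for illustration.

```python
# Sample sizes from the text; the variances are hypothetical illustration values.
sizes = [1_000_000, 5]
variances = [4.0, 9.0]

# Degrees-of-freedom weights (N_i - 1) / (n - a) used by the pooled estimate.
dfe = sum(n - 1 for n in sizes)
weights = [(n - 1) / dfe for n in sizes]

pooled = sum(w * v for w, v in zip(weights, variances))
simple = sum(variances) / len(variances)
print(weights, pooled, simple)
```

The pooled estimate stays essentially at the variance of the large sample, while the simple average is pulled halfway toward the poorly estimated variance from five observations.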
In Model (12.2.3) the fitted values for group i are
\[ \hat{y}_{ij} = \hat{\mu}_i = \bar{y}_{i\cdot} . \]
We need to check the validity of our assumptions. The errors in models (12.2.1) and (12.2.2)
are assumed to be independent normals with mean 0 and variance σ 2 , so we would like to use them
to evaluate the distributional assumptions, e.g., equal variances and normality. Unfortunately, the
errors are unobservable; we only see the yi j s and we do not know the μi s, so we cannot compute
the εi j s. However, since εi j = yi j − μi and we can estimate μi , we can estimate the errors with the
residuals, ε̂i j = yi j − ȳi· . The residuals can be plotted against fitted values ȳi· to check whether the
variance depends in some way on the means μi . They can also be plotted against rankits (normal
scores) to check the normality assumption. More often we use the standardized residuals,
\[ r_{ij} = \frac{\hat{\varepsilon}_{ij}}{\sqrt{MSE\left(1 - \frac{1}{N_i}\right)}} . \]
In general, we are concerned with parameters that are linear combinations of the $\mu_i$s. For known
coefficients $\lambda_1, \ldots, \lambda_a$, interesting parameters are defined by
\[ Par = \lambda_1 \mu_1 + \cdots + \lambda_a \mu_a = \sum_{i=1}^{a} \lambda_i \mu_i . \]
For example, $\mu_2$ has $\lambda_2 = 1$ and all other $\lambda_i$s equal to 0. The difference $\mu_2 - \mu_1$ has $\lambda_1 = -1$, $\lambda_2 = 1$,
and all other $\lambda_i$s equal to 0. The parameter $\mu_1 - \frac{1}{2}\mu_2 - \frac{1}{2}\mu_3$ has $\lambda_1 = 1$, $\lambda_2 = -1/2$, $\lambda_3 = -1/2$, and
all other $\lambda_i$s equal to 0.
The natural estimate of $Par = \sum_{i=1}^{a} \lambda_i \mu_i$ substitutes the sample means for the population means,
i.e., the natural estimate is
\[ Est = \lambda_1 \bar{y}_{1\cdot} + \cdots + \lambda_a \bar{y}_{a\cdot} = \sum_{i=1}^{a} \lambda_i \bar{y}_{i\cdot} . \]
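The estimate has expected value
\[ E\left(\sum_{i=1}^{a} \lambda_i \bar{y}_{i\cdot}\right) = \sum_{i=1}^{a} \lambda_i E(\bar{y}_{i\cdot}) = \sum_{i=1}^{a} \lambda_i \mu_i , \]
and, because the group means are independent, its variance is
\begin{align*}
Var\left(\sum_{i=1}^{a} \lambda_i \bar{y}_{i\cdot}\right)
  &= \sum_{i=1}^{a} \lambda_i^2 \, Var(\bar{y}_{i\cdot}) \\
  &= \sum_{i=1}^{a} \lambda_i^2 \frac{\sigma^2}{N_i} \\
  &= \sigma^2 \sum_{i=1}^{a} \frac{\lambda_i^2}{N_i} .
\end{align*}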
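The standard error is
\[ SE\left(\sum_{i=1}^{a} \lambda_i \bar{y}_{i\cdot}\right) = \sqrt{MSE\left[\frac{\lambda_1^2}{N_1} + \cdots + \frac{\lambda_a^2}{N_a}\right]} = \sqrt{MSE \sum_{i=1}^{a} \frac{\lambda_i^2}{N_i}} . \]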
see Exercise 12.8.14. If the independence and equal variance assumptions hold, then the central
limit theorem and law of large numbers can be used to justify a N(0, 1) reference distribution even
when the data are not normal as long as all the Ni s are large, although I would continue to use the t
distribution since the normal is clearly too optimistic.
In analysis of variance, we are most interested in contrasts (comparisons) among the $\mu_i$s. These
are characterized by having $\sum_{i=1}^{a} \lambda_i = 0$. The difference $\mu_2 - \mu_1$ is a contrast as is the parameter
$\mu_1 - \frac{1}{2}\mu_2 - \frac{1}{2}\mu_3$. If we use Model (12.2.2) rather than Model (12.2.1) we get
\[ \sum_{i=1}^{a} \lambda_i \mu_i = \sum_{i=1}^{a} \lambda_i (\mu + \alpha_i) = \mu \sum_{i=1}^{a} \lambda_i + \sum_{i=1}^{a} \lambda_i \alpha_i = \sum_{i=1}^{a} \lambda_i \alpha_i , \]
thus contrasts in Model (12.2.2) involve only the group effects. This is of some importance later
when dealing with more complicated models.
Having identified a parameter, an estimate, a standard error, and an appropriate reference
distribution, inferences follow the usual pattern. A 95% confidence interval for $\sum_{i=1}^{a} \lambda_i \mu_i$ has endpoints
\[ \sum_{i=1}^{a} \lambda_i \bar{y}_{i\cdot} \pm t(0.975, dfE) \sqrt{MSE \sum_{i=1}^{a} \lambda_i^2 / N_i} . \]
An equivalent procedure to the test in (12.2.4) is often useful. If we square both sides of (12.2.4),
the test rejects if
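\[ \left( \frac{\left|\sum_{i=1}^{a} \lambda_i \bar{y}_{i\cdot} - 0\right|}{\sqrt{MSE \sum_{i=1}^{a} \lambda_i^2 / N_i}} \right)^2 > [t(0.975, dfE)]^2 . \]
The square of the test statistic leads to another common statistic, the sum of squares for the
parameter. Rewrite the test statistic as
\[ \left( \frac{\left|\sum_{i=1}^{a} \lambda_i \bar{y}_{i\cdot} - 0\right|}{\sqrt{MSE \sum_{i=1}^{a} \lambda_i^2 / N_i}} \right)^2 = \frac{\left(\sum_{i=1}^{a} \lambda_i \bar{y}_{i\cdot} - 0\right)^2}{MSE \sum_{i=1}^{a} \lambda_i^2 / N_i} = \frac{\left(\sum_{i=1}^{a} \lambda_i \bar{y}_{i\cdot}\right)^2 \Big/ \sum_{i=1}^{a} \lambda_i^2 / N_i}{MSE} \]
and define the sum of squares for the parameter as
\[ SS\left(\sum_{i=1}^{a} \lambda_i \mu_i\right) \equiv \frac{\left(\sum_{i=1}^{a} \lambda_i \bar{y}_{i\cdot}\right)^2}{\sum_{i=1}^{a} \lambda_i^2 / N_i} . \qquad (12.2.5) \]
The test rejects if
\[ \frac{SS\left(\sum_{i=1}^{a} \lambda_i \mu_i\right)}{MSE} > [t(0.975, dfE)]^2 . \]
It is a mathematical fact that for any $\alpha$ between 0 and 1 and any dfE,
\[ \left[ t\left(1 - \frac{\alpha}{2}, dfE\right) \right]^2 = F(1 - \alpha, 1, dfE). \]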
Thus the test based on the sum of squares is an F test with 1 degree of freedom in the numerator.
Any parameter of this type has 1 degree of freedom associated with it.
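The algebraic equivalence between the squared t statistic and the ratio of the parameter's sum of squares to the MSE can be verified numerically. This added sketch applies it to the contrast between the first and third groups of the log suicide age data, with the rounded MSE = 0.168; the function name `contrast_summary` is hypothetical, and the sum of squares is computed as the squared estimate divided by the sum of the squared coefficients over the sample sizes.

```python
import math

def contrast_summary(lambdas, means, sizes, mse):
    """Return (Est, SE, SS) for the linear parameter with coefficients lambdas."""
    est = sum(l * m for l, m in zip(lambdas, means))
    scale = sum(l * l / n for l, n in zip(lambdas, sizes))  # sum of lambda_i^2 / N_i
    se = math.sqrt(mse * scale)
    ss = est ** 2 / scale                                    # sum of squares, 1 df
    return est, se, ss

means = [3.6521, 3.4538, 3.1770]
sizes = [44, 34, 15]
mse = 0.168

est, se, ss = contrast_summary([1, 0, -1], means, sizes, mse)
t_obs = est / se
f_obs = ss / mse
print(round(t_obs ** 2, 2), round(f_obs, 2))
```

The two printed values agree because SS/MSE is simply the square of Est/SE(Est) rearranged.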
In Section 12.1 we transformed the suicide age data so that they better satisfy the assumptions of
equal variances and normal distributions. In fact, analysis of variance tests and confidence intervals
are frequently useful even when these assumptions are violated. Scheffé (1959, p. 345) concludes
that (a) nonnormality is not a serious problem for inferences about means but it is a serious problem
for inferences about variances, (b) unequal variances are not a serious problem for inferences about
means from samples of the same size but are a serious problem for inferences about means from
samples of unequal sizes, and (c) lack of independence can be a serious problem. Of course, any
such rules depend on just how bad the nonnormality is, how unequal the variances are, and how bad
the lack of independence is. My own interpretation of these rules is that if you check the assumptions
and they do not look too bad, you can probably proceed with a fair amount of assurance.
in which each group has the same mean. Recall that the variance estimate for this model is the
sample variance, i.e., MSE(Red.) = s2y , with dfE(Red.) = n − 1.
The computations are typically summarized in an analysis of variance table. The commonly
used form for the analysis of variance table is given below.
Analysis of Variance
Source    df       SS                                            MS                F
Groups    a − 1    ∑_{i=1}^{a} N_i (ȳ_i· − ȳ··)²                 SSGrps/(a − 1)    MSGrps/MSE
Error     n − a    ∑_{i=1}^{a} ∑_{j=1}^{N_i} (y_ij − ȳ_i·)²      SSE/(n − a)
Total     n − 1    ∑_{i=1}^{a} ∑_{j=1}^{N_i} (y_ij − ȳ··)²
The entries in the Error line are just dfE, SSE, and MSE for Model (12.2.1). The entries for the
Total line are dfE and SSE for Model (12.2.6). These are often referred to as dfTot and SSTot and
sometimes as dfTot −C and SSTot −C. The Groups line is obtained by subtracting the Error df and
SS from the Total df and SS, respectively, so that MSGrps ≡ SSGrps/dfGrps gives precisely the
numerator of the F statistic for testing our hypothesis. It is some work to show that the algebraic
formula given for SSGrps is correct.
The total line is corrected for the grand mean. An obvious meaning for the phrase “sum of
squares total” would be the sum of the squares of all the observations, $\sum_{ij} y_{ij}^2$. The reported sum of
squares total is $SSTot = \sum_{i=1}^{a} \sum_{j=1}^{N_i} y_{ij}^2 - C$, which is the sum of the squares of all the observations
minus the correction factor for fitting the grand mean, $C \equiv n\bar{y}_{\cdot\cdot}^2$. Similarly, an obvious meaning
for the phrase “degrees of freedom total” would be n, the number of observations: one degree of
freedom for each observation. The reported df Tot is n − 1, which is corrected for fitting the grand
mean μ in Model (12.2.6).
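EXAMPLE 12.2.1. We now illustrate direct computation of SSGrps, the only part of the analysis
of variance table computations that we have not illustrated for the logs of the suicide data. The
sample statistics are repeated below.
Sample statistics: Log of suicide ages
Group          N_i     ȳ_i·      s_i²      s_i
Caucasians      44   3.6521    0.1590   0.3987
Hispanics       34   3.4538    0.2127   0.4612
Native Am.      15   3.1770    0.0879   0.2965
The sum of squares for groups is
\[ SSGrps = 2.655 = 44(3.6521 - 3.5030)^2 + 34(3.4538 - 3.5030)^2 + 15(3.1770 - 3.5030)^2 , \]
where
\[ 3.5030 = \bar{y}_{\cdot\cdot} = \frac{44(3.6521) + 34(3.4538) + 15(3.1770)}{44 + 34 + 15} . \]
The ANOVA table was presented as Table 12.2. ✷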
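The direct computation of SSGrps from the group sizes and means can be sketched in a few lines of Python. This is an added illustration using the rounded sample statistics for the log suicide ages; the variable names are not from the text.

```python
# Group sizes and sample means of the log suicide ages.
sizes = [44, 34, 15]
means = [3.6521, 3.4538, 3.1770]

# Grand mean: the size-weighted average of the group means.
n = sum(sizes)
grand_mean = sum(N * m for N, m in zip(sizes, means)) / n

# SSGrps = sum over groups of N_i * (group mean - grand mean)^2.
ss_grps = sum(N * (m - grand_mean) ** 2 for N, m in zip(sizes, means))
ms_grps = ss_grps / (len(sizes) - 1)
print(round(grand_mean, 4), round(ss_grps, 3), round(ms_grps, 3))
```

Up to round-off, the result matches the value obtained earlier by subtracting the Error line from the Total line.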
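\[ MSGrps = N s_{\bar{y}}^2 , \]
where
\[ s_{\bar{y}}^2 \equiv \frac{1}{a - 1} \sum_{i=1}^{a} \left(\bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot}\right)^2 . \]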
This idea can be used as the basis for analyzing virtually any balanced multifactor ANOVA. Re-
call from Section 3.9 that a multifactor ANOVA is simply a model that involves more than one
categorical predictor variable. Christensen (1996) examined this idea in detail.
It does not matter that we are using Greek μ s for the regression coefficients rather than β s. Model
(12.3.1) is precisely the same model as
\[ y_{ij} = \mu_i + \varepsilon_{ij}, \qquad i = 1, 2, 3, \quad j = 1, 2, \ldots, N_i , \]
i.e., Model (12.2.1). They give the same fitted values, residuals, and dfE.
Model (12.3.1) is fitted without an intercept (constant). Fitting the regression model to the log
suicide age data gives a Table of Coefficients and an ANOVA table. The tables are adjusted for
the fact that the model was fitted without an intercept. Obviously, the Table of Coefficients cannot
contain a constant term, since we did not fit one.
Table of Coefficients: Model (12.3.1)
Predictor     μ̂_i       SE(μ̂_i)      t        P
Anglo        3.65213    0.06173    59.17    0.000
Hisp.        3.45377    0.07022    49.19    0.000
N.A.         3.1770     0.1057     30.05    0.000
The estimated regression coefficients are just the group sample means as displayed in Section 12.1.
The reported standard errors are the standard errors appropriate for performing confidence intervals
and tests on a single population mean as discussed in Subsection 12.1.1, i.e., $\hat{\mu}_i = \bar{y}_{i\cdot}$ and
$SE(\hat{\mu}_i) = \sqrt{MSE/N_i}$. The table also provides test statistics and P values for $H_0: \mu_i = 0$ but these are not
typically of much interest. The 99% confidence interval for, say, the Hispanic mean $\mu_2$ has endpoints
3.45377 ± 2.631(0.07022)
for an interval of (3.269, 3.639), just as in Subsection 12.1.1. Prediction intervals are easily obtained
from most software by providing the corresponding 0-1 input for $x_1$, $x_2$, and $x_3$, e.g., to predict a
Native American log suicide age, $(x_1, x_2, x_3) = (0, 0, 1)$.
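To see why a no-intercept regression on group indicators reproduces the group sample means, consider the following added sketch. It uses a small set of hypothetical toy data (not the suicide data) and solves the normal equations directly in plain Python rather than calling any regression routine.

```python
# Hypothetical toy data (not the suicide data): responses with group labels 1, 2, 3.
y = [3.1, 3.9, 3.5, 2.8, 3.4, 2.9, 3.2]
group = [1, 1, 1, 2, 2, 3, 3]

# Indicator (0-1) predictors x1, x2, x3, one column per group; no intercept column.
X = [[1.0 if g == k else 0.0 for k in (1, 2, 3)] for g in group]

# Normal equations (X'X) b = X'y. Each row of X has exactly one 1, so X'X is
# diagonal with the group sizes N_i on the diagonal and X'y holds the group totals.
xtx = [[sum(row[r] * row[c] for row in X) for c in range(3)] for r in range(3)]
xty = [sum(row[r] * yi for row, yi in zip(X, y)) for r in range(3)]
coef = [xty[k] / xtx[k][k] for k in range(3)]

group_means = [
    sum(yi for yi, g in zip(y, group) if g == k) / group.count(k) for k in (1, 2, 3)
]
print(coef, group_means)
```

Dividing each group total by its group size is exactly the sample mean, which is why the regression coefficients and the group means coincide.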
In the ANOVA table
Analysis of Variance: Model (12.3.1)
Source        df        SS        MS         F        P
Regression     3    1143.84    381.28    2274.33    0.000
Error         90      15.09      0.17
Total         93    1158.93
The Error line is the same as that given in Section 12.1, up to round-off error. Without fitting an
intercept (grand mean) in the model, most programs report the Total line in the ANOVA table with-
out correcting for the grand mean. Here the Total line has n = 93 degrees of freedom, rather than
the usual n − 1. Also, the Sum of Squares Total is the sum of the squares of all 93 observations,
rather than the usual corrected number $(n - 1)s_y^2$. Finally, the F test reported in the ANOVA table is
for testing the regression model against the relatively uninteresting model yh = 0 + εh . It provides a
simultaneous test of 0 = μC = μH = μN rather than the usual test of μC = μH = μN .
similar to Model (12.2.2). Remember, the Greek letters we choose to use as regression coefficients
make no difference to the substance of the model. Model (12.3.2) is no longer a regression model
because the parameters are redundant. The data have three groups, so we need no more than three
model parameters to explain them. Model (12.3.2) contains four parameters. To make it into a
regression model, we need to drop one of the predictor variables. In most important ways, which
predictor variable we drop makes no difference. The fitted values, the residuals, the dfE, SSE, and
MSE all remain the same. However, the meaning of the parameters changes depending on which
variable we drop.
At the beginning of this section, we dropped the constant term from Model (12.3.2) to get
Model (12.3.1) and discussed the parameter estimates. Now we leave in the intercept but drop
one of the other variables. Let’s drop x3 , the indicator variable for Native Americans. This makes
the Native Americans into a baseline group with the other two groups getting compared to it. As
mentioned earlier, we will obtain two of the three comparisons from Subsection 12.1.2, specifically
the comparisons between Anglo and N.A., μC − μN , and Hisp. and N.A., μH − μN . Fitting the
regression model with an intercept but without x3 , i.e.,
or
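\[ y_h = \mu_1 (x_{h1} + x_{h3}) + \mu_2 x_{h2} + \varepsilon_h . \]
The Greek letters change their meaning in this process, so we could just as well write the model as
\[ y_h = \gamma_1 (x_{h1} + x_{h3}) + \gamma_2 x_{h2} + \varepsilon_h . \qquad (12.3.4) \]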
This reduced model still only involves indicator variables: x2 is the indicator variable for group 2
(Hisp.) but x1 + x3 is now an indicator variable that is 1 if an individual is either Anglo or N.A. and
0 otherwise. We have reduced our three-group model with Anglos, Hispanics, and Native Amer-
icans to a two-group model that lumps Anglos and Native Americans together but distinguishes
Hispanics. The question is whether this reduced model fits adequately relative to our full model that
distinguishes all three groups. Fitting the model gives a Table of Coefficients and an ANOVA table.
Table of Coefficients: Model (12.3.4)
Predictor     γ̂_k       SE(γ̂_k)      t        P
x1 + x3      3.53132    0.05728    61.65    0.000
Hisp.        3.45377    0.07545    45.78    0.000
The Table of Coefficients is not very interesting. It gives the same mean for Hisp. as Model (12.3.1)
but provides a standard error based on a MSE from Model (12.3.4) that does not distinguish be-
tween Anglos and N.A.s. The other estimate in the table is the average of all the Anglos and N.A.s.
Similarly, the ANOVA table is not terribly interesting except for its Error line.
Analysis of Variance: Model (12.3.4)
Source        df        SS        MS         F        P
Regression     2    1141.31    570.66    2948.27    0.000
Error         91      17.61      0.19
Total         93    1158.93
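From this Error line and the Error for Model (12.3.1), the model testing statistic for the
hypothesis $\mu_1 - \mu_3 \equiv \mu_C - \mu_N = 0$ is
\[ F_{obs} = \frac{[17.61 - 15.09]/[91 - 90]}{15.09/90} = 15.03 \doteq (3.88)^2 . \]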
The last (almost) equality between 15.03 and (3.88)2 demonstrates that this F statistic is the square
of the t statistic reported in Subsection 12.1.2 for testing μ1 − μ3 ≡ μC − μN = 0. Rejecting an
F(1, 90) test for large values of Fobs is equivalent to rejecting a t(90) test for tobs values far from
zero. The lack of equality between 15.03 and (3.88)2 is entirely due to round-off error. To reduce
round-off error, in computing Fobs we used MSE(Full) = 15.09/90 as the full model mean squared
error, rather than the reported value from Model (12.3.1) of MSE(Full) = 0.17. To further reduce
round-off error, we could use even more accurate numbers reported earlier for Model (12.3.3),
Two final points. First, 17.61 − 15.0881 = SS(μ1 − μ3 ), the sum of squares for the contrast as defined
in (12.2.5). Second, to test μ1 − μ3 = 0, rather than manipulating the indicator variables, the next
section discusses how to get the same results by manipulating the group subscript.
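The model-comparison F statistic depends only on the Error lines of the two ANOVA tables. The sketch below is an added illustration; the function name `model_comparison_f` is hypothetical, and the inputs are the rounded Error-line values reported for Models (12.3.4) and (12.3.1).

```python
def model_comparison_f(sse_red, dfe_red, sse_full, dfe_full):
    """F statistic for testing a reduced model against a full model."""
    numerator = (sse_red - sse_full) / (dfe_red - dfe_full)
    denominator = sse_full / dfe_full   # MSE of the full model
    return numerator / denominator

# Reduced model (12.3.4) against full model (12.3.1).
f_obs = model_comparison_f(sse_red=17.61, dfe_red=91, sse_full=15.09, dfe_full=90)
print(round(f_obs, 2), round(3.88 ** 2, 2))
```

Passing the full-model SSE and dfE rather than a rounded MSE mirrors the text's device for limiting round-off error.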
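Similar to testing $\mu_1 - \mu_3 = 0$, we could test $\mu_1 - \frac{1}{2}\mu_2 - \frac{1}{2}\mu_3 = 0$. Rewrite the hypothesis as
$\mu_1 = \frac{1}{2}\mu_2 + \frac{1}{2}\mu_3$ and obtain the reduced model by substituting this relationship into Model (12.3.1)
to get
\[ y_h = \left(\frac{1}{2}\mu_2 + \frac{1}{2}\mu_3\right) x_{h1} + \mu_2 x_{h2} + \mu_3 x_{h3} + \varepsilon_h \]
or
\[ y_h = \mu_2 \left(\frac{1}{2} x_{h1} + x_{h2}\right) + \mu_3 \left(\frac{1}{2} x_{h1} + x_{h3}\right) + \varepsilon_h . \]
This is just a no-intercept regression model with two predictor variables $\tilde{x}_1 = \frac{1}{2}x_1 + x_2$ and
$\tilde{x}_2 = \frac{1}{2}x_1 + x_3$, say,
\[ y_h = \gamma_1 \tilde{x}_{h1} + \gamma_2 \tilde{x}_{h2} + \varepsilon_h . \]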
Fitting the reduced model gives the usual tables. Our interest is in the ANOVA table Error line.
Table of Coefficients
Predictor     γ̂_k       SE(γ̂_k)      t        P
x̃1           3.55971    0.06907    51.54    0.000
x̃2           3.41710    0.09086    37.61    0.000
Analysis of Variance
Source        df        SS        MS         F        P
Regression     2    1141.41    570.71    2965.31    0.000
Error         91      17.51      0.19
Total         93    1158.93
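From this Error line and the Error for Model (12.3.1), the model testing statistic for the
hypothesis $\mu_1 - \frac{1}{2}\mu_2 - \frac{1}{2}\mu_3 = 0$ is
\[ F_{obs} = \frac{[17.51 - 15.09]/[91 - 90]}{15.09/90} = 14.43 \doteq (3.80)^2 . \]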
Again, 3.80 is the tobs that was calculated in Subsection 12.1.3 for testing this hypothesis.
Finally, suppose we wanted to test μ1 − μ3 ≡ μC − μN = 1.5. Using results from the Table of
Coefficients for fitting Model (12.3.3), the t statistic is
\[ t_{obs} = \frac{0.4752 - 1.5}{0.1224} = -8.3725 . \]
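The corresponding model-based test reduces the full model (12.3.1) by incorporating $\mu_1 = \mu_3 + 1.5$
to give the reduced model
\[ y_h = (\mu_3 + 1.5) x_{h1} + \mu_2 x_{h2} + \mu_3 x_{h3} + \varepsilon_h \]
or
\[ y_h = 1.5 x_{h1} + \mu_2 x_{h2} + \mu_3 (x_{h1} + x_{h3}) + \varepsilon_h . \]
The term $1.5 x_{h1}$ is completely known (not multiplied by an unknown parameter) and is called an
offset. To analyze the model, we take the offset to the left-hand side of the model, rewriting it as
\[ y_h - 1.5 x_{h1} = \mu_2 x_{h2} + \mu_3 (x_{h1} + x_{h3}) + \varepsilon_h . \qquad (12.3.5) \]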
This regression model has a different dependent variable than Model (12.3.1), but because the offset
is a known multiple of a variable that is in Model (12.3.1), the offset model can be compared to
Model (12.3.1) in the usual way, cf. Christensen (2011, Subsection 3.2.1). The predictor variables
in the reduced model (12.3.5) are exactly the same as the predictor variables used in Model (12.3.4)
to test μ1 − μ3 ≡ μC − μN = 0.
Fitting Model (12.3.5) gives the usual tables. Our interest is in the Error line.
Table of Coefficients: Model (12.3.5)
Predictor γ̂k SE(γ̂k ) t P
Hisp. 3.45377 0.09313 37.08 0.000
x1 + x3 2.41268 0.07070 34.13 0.000
Analysis of Variance: Model (12.3.5)
Source df SS MS F P
Regression 2 749.01 374.51 1269.87 0.000
Error 91 26.84 0.29
Total 93 775.85
From this Error line and the Error for Model (12.3.1), the F statistic for the hypothesis μ1 − μ3 ≡
μC − μN = 1.5 is
\[ F_{obs} = \frac{[26.84 - 15.09]/[91 - 90]}{15.09/90} = 70.08 \approx (-8.3725)^2 = 70.10 . \]
Again, the lack of equality is entirely due to round-off error and rejecting the F(1, 90) test for large
values of Fobs is equivalent to rejecting the t(90) test for tobs values far from zero.
Instead of writing μ1 = μ3 + 1.5 and substituting for μ1 in Model (12.3.1), we could just as well
have written μ1 − 1.5 = μ3 and substituted for μ3 in Model (12.3.1). It is not obvious, but this leads
to exactly the same test. Try it!
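For readers who want to try it numerically, here is a minimal sketch in Python. The data are hypothetical (they are not the data behind Model (12.3.1)), and the helper `fit_two_predictors` is our own name for a no-intercept least squares fit with two predictors.

```python
def fit_two_predictors(y, u, v):
    """No-intercept least squares of y on predictors u and v.

    Solves the 2x2 normal equations and returns (coef_u, coef_v, SSE).
    """
    suu = sum(a * a for a in u)
    svv = sum(b * b for b in v)
    suv = sum(a * b for a, b in zip(u, v))
    suy = sum(a * c for a, c in zip(u, y))
    svy = sum(b * c for b, c in zip(v, y))
    det = suu * svv - suv * suv
    cu = (svv * suy - suv * svy) / det
    cv = (suu * svy - suv * suy) / det
    sse = sum((c - cu * a - cv * b) ** 2 for a, b, c in zip(u, v, y))
    return cu, cv, sse

# Hypothetical (group, response) pairs; groups 1, 2, 3 as in Model (12.3.1).
data = [(1, 3.9), (1, 3.5), (1, 3.6), (2, 3.4), (2, 3.6),
        (3, 3.1), (3, 3.3), (3, 3.0), (3, 3.2)]
x1 = [1.0 if g == 1 else 0.0 for g, _ in data]
x2 = [1.0 if g == 2 else 0.0 for g, _ in data]
x3 = [1.0 if g == 3 else 0.0 for g, _ in data]
y = [value for _, value in data]
x13 = [a + b for a, b in zip(x1, x3)]  # the predictor x1 + x3

# Substituting mu1 = mu3 + 1.5: the offset 1.5*x1 moves to the left-hand side.
y_a = [value - 1.5 * a for value, a in zip(y, x1)]
mu2_a, mu3_a, sse_a = fit_two_predictors(y_a, x2, x13)

# Substituting mu3 = mu1 - 1.5: the offset -1.5*x3 moves to the left-hand side.
y_b = [value + 1.5 * b for value, b in zip(y, x3)]
mu2_b, mu1_b, sse_b = fit_two_predictors(y_b, x2, x13)

print(sse_a, sse_b)        # Error sums of squares from the two substitutions
print(mu3_a, mu1_b - 1.5)  # the two implied estimates of mu3
```

The two dependent variables differ by 1.5(x1 + x3), a multiple of a predictor in the reduced model, so the residuals, and hence the Error lines, coincide.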
296 12. ONE-WAY ANOVA
12.3.3 Another choice
Another variation on fitting regression models involves subtracting out the last predictor. In some
programs, notably Minitab, the overparameterized model
Up to round-off error, the ANOVA table is the same as Table 12.2. Interpreting the Table of Coeffi-
cients is a bit more complicated.
Table of Coefficients: Model (12.3.6)
Predictor γ̂k SE(γ̂k ) t P
Constant (γ0 ) 3.42762 0.04704 72.86 0.000
Group
C (γ1 ) 0.22450 0.05902 3.80 0.000
H (γ2 ) 0.02615 0.06210 0.42 0.675
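The relationship between this regression model (12.3.6) and the easiest model
\[ y_h = \mu_1 x_{h1} + \mu_2 x_{h2} + \mu_3 x_{h3} + \varepsilon_h \]
is
γ0 + γ1 = μ1 , γ0 + γ2 = μ2 , γ0 − (γ1 + γ2 ) = μ3 ,
so
γ̂0 + γ̂1 = 3.42762 + 0.22450 = 3.6521 = ȳ1· = μ̂1 ,
γ̂0 + γ̂2 = 3.42762 + 0.02615 = 3.4538 = ȳ2· = μ̂2 ,
and
γ̂0 − (γ̂1 + γ̂2 ) = 3.42762 − (0.22450 + 0.02615) = 3.1770 = ȳ3· = μ̂3 .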
Alas, this table of coefficients is not really very useful. We can see that γ0 = (μ1 + μ2 + μ3 )/3.
The table of coefficients provides the information needed to perform inferences on this not very
interesting parameter. Moreover, it allows inference on
\[ \gamma_1 = \mu_1 - \gamma_0 = \mu_1 - \frac{\mu_1 + \mu_2 + \mu_3}{3} = \frac{2}{3}\mu_1 + \frac{-1}{3}\mu_2 + \frac{-1}{3}\mu_3 , \]
another, not tremendously interesting, parameter. The interpretation of γ2 is similar to γ1 .
The relationship between Model (12.3.6) and the overparameterized model (12.3.2) is
γ0 + γ1 = μ + α1 , γ0 + γ2 = μ + α2 , γ0 − (γ1 + γ2 ) = μ + α3 ,
which leads to
γ0 = μ , γ1 = α1 , γ2 = α2 , −(γ1 + γ2 ) = α3 ,
provided that the side condition α1 + α2 + α3 = 0 holds. Under this side condition, the relationship
between the γ s and the parameters of Model (12.3.2) is very simple. That is probably the motivation
for considering Model (12.3.6). But the most meaningful parameters are clearly the μi s, and there
is no simple relationship between the γ s and them.
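The back-transformation from the γs to the μi s is easy to carry out by hand or in code. A minimal sketch in Python, using the estimates from the Table of Coefficients for Model (12.3.6); the variable names are ours.

```python
# Estimates from the Table of Coefficients for Model (12.3.6).
g0, g1, g2 = 3.42762, 0.22450, 0.02615

# gamma0 + gamma1 = mu1, gamma0 + gamma2 = mu2, gamma0 - (gamma1 + gamma2) = mu3
mu_hat = [g0 + g1, g0 + g2, g0 - (g1 + g2)]

# gamma0 is the unweighted average of the three group means.
average_of_means = sum(mu_hat) / 3

print([round(m, 5) for m in mu_hat], round(average_of_means, 5))
```

The difference between the first and third entries recovers, up to round-off, the estimate 0.4752 of μ1 − μ3 used earlier in the section.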
12.4 MODELING CONTRASTS 297
EXAMPLE 12.4.1. Mandel (1972) and Christensen (1996, Chapter 6) presented data on the stress
at 600% elongation for natural rubber with a 40-minute cure at 140°C. Stress was measured in 7
laboratories and each lab measured it four times. The dependent variable was measured in kilograms
per centimeter squared (kg/cm²). Following Christensen (1996) the analysis is conducted on the
logs of the stress values. The data are presented in the first column of Table 12.4 with column 2
indicating the seven laboratories. The other seven columns will be discussed later.
The model for the one-way ANOVA is
\[ y_{ij} = \mu_i + e_{ij} = \mu + \alpha_i + e_{ij} , \qquad (12.4.1) \]
i = 1, 2, 3, 4, 5, 6, 7, j = 1, 2, 3, 4.
The ANOVA table is given as Table 12.5. Clearly, there are some differences among the laboratories.
If these seven laboratories did not have any natural structure to them, about the only thing of
interest would be to compare all of the pairs of labs to see which ones are different. This involves
looking at 7(6)/2 = 21 pairs of means, a process discussed more in Chapter 13.
As in Christensen (1996), suppose that the first two laboratories are in San Francisco, the second
two are in Seattle, the fifth is in New York, and the sixth and seventh are in Boston. This structure
suggests that there are six interesting questions to ask. On the West Coast, is there any difference
between the San Francisco labs and is there any difference between the Seattle labs? If there are
no such differences, it makes sense to discuss the San Francisco and Seattle population means, in
which case, is there any difference between the San Francisco labs and the Seattle labs? On the East
Coast, is there any difference between the Boston labs and if not, do they differ from the New York
lab? Finally, if the West Coast labs have the same mean, and the East Coast labs have the same
mean, is there a difference between labs on the West Coast and labs on the East Coast?
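[Figure: hierarchy of models, with the full model C2 at the top, the reduced models C3 through C9 below it, and the grand-mean model GM at the bottom.]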
While this hierarchy of models was designed in response to the structure of our specific treat-
ments, the hierarchical approach is pretty general. Suppose our groups were five diets: Control, Beef
A, Beef B, Pork, and Beans. With five diets, we might be interested in four comparisons suggested
by the structure of the diets. First, we might compare the two beef diets. Second, compare the beef
diets with the pork diet. (If the beef diets are the same, are they different from pork?) Third, compare
the meat diets with the Bean diet. (If the meat diets are the same, are they different from beans?)
Fourth, is the control different from the rest? These four comparisons suggest a hierarchy of models,
cf. Exercise 12.7.15. Other nonhierarchical options would be to compare the control to each of the
other diets or to compare each diet with every other diet.
Now suppose our five diets are: Control, Beef, Pork, Lima Beans, and Soy Beans. Again, we
could compare the control to each of the other diets or compare all pairs of diets, but the structure
of the treatments suggests the four comparisons 1) beef with pork, 2) lima beans with soy beans,
3) meat with beans, and 4) control with the others, which suggests a hierarchy of models, cf. Exer-
cise 12.7.16.
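A hierarchy of this kind can be built by successively merging group labels into new index columns. The sketch below, in Python, does this for the first set of diets; the responses are hypothetical and the helper names `merge_groups` and `sse` are ours.

```python
def merge_groups(labels, members, new_label):
    """Return an index column in which every group in `members` gets `new_label`."""
    return [new_label if g in members else g for g in labels]

def sse(labels, y):
    """Error sum of squares for the one-way ANOVA defined by `labels`."""
    groups = {}
    for g, value in zip(labels, y):
        groups.setdefault(g, []).append(value)
    total = 0.0
    for values in groups.values():
        mean = sum(values) / len(values)
        total += sum((v - mean) ** 2 for v in values)
    return total

# Hypothetical responses, two per diet, purely for illustration.
diet = ["Control", "Control", "Beef A", "Beef A", "Beef B", "Beef B",
        "Pork", "Pork", "Beans", "Beans"]
y = [5.1, 4.9, 6.0, 6.4, 6.1, 6.5, 6.9, 7.3, 5.6, 5.8]

beef = merge_groups(diet, {"Beef A", "Beef B"}, "Beef")      # beef diets equal
meat = merge_groups(beef, {"Beef", "Pork"}, "Meat")           # beef equals pork
noncontrol = merge_groups(meat, {"Meat", "Beans"}, "Diet")    # meat equals beans
grand_mean = merge_groups(noncontrol, {"Control", "Diet"}, "All")

hierarchy = [diet, beef, meat, noncontrol, grand_mean]
for column in hierarchy:
    print(len(set(column)), round(sse(column, y), 4))
```

Each index column defines a one-way ANOVA that is a special case of the one before it, so the Error sum of squares cannot decrease as we move down the hierarchy; differences between adjacent SSEs supply the numerators of the F tests.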
Analysis of Variance: H0 : μ1 = μ2 , C3
Source df SS MS F P
Groups, 1=2 5 2.26572 0.45314 76.61 0.000
Error 22 0.13012 0.00591
Total 27 2.39584
Comparing the reduced model C3 to the full model C2 [equivalently, Model (12.4.1)] whose
ANOVA table is given as Table 12.5, we get the F statistic
\[ F_{obs} = \frac{[0.13012 - 0.12663]/[22 - 21]}{0.00603} = 0.58 . \]
There is no evidence of differences between the San Francisco labs. Note that the numerator sum of
squares is 0.13012 − 0.12663 = 0.00349 = SS(μ1 − μ2 ) as defined in (12.2.5).
When fitting an intermediate ANOVA model from our hierarchy, like C3, our primary interest
is in using the Error line of the fitted model to construct the F statistic for the model comparison.
But the ANOVA table for model C3 also provides a test of the intermediate model against the
grand-mean model. In the case of fitting model C3: μ1 = μ2 , the F statistic of 76.61 reported in
the ANOVA table with 5 numerator degrees of freedom suggests that, even when the first two labs
are treated as the same, differences exist somewhere among this pair and the other five labs for
which we have made no assumptions. We probably would not want to perform this 5 df test if
we got a significant result in the test of model C3 versus model C2 because the 5 df test would
then be based on an assumption that is demonstrably false. (Ok, “demonstrably false” is a little
strong.) Also, since we are looking at a variety of models, all of which are special cases of model
C2, our best practice uses MSE(C2) in the denominator of all tests, including this 5 df test. In
particular, it would be better to replace F = 0.45314/0.00591 = 76.61 from the C3 ANOVA table
with F = 0.45314/0.00603 = 75.15 having 5 and 21 degrees of freedom.
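The effect of the choice of denominator is easy to check. A minimal sketch in Python, using only numbers quoted in the text for models C3 and C2; the variable names are ours.

```python
# Numbers quoted in the text for model C3 and the full model C2.
ms_groups_c3 = 0.45314           # mean square for "Groups, 1=2" on 5 df
sse_c3, dfe_c3 = 0.13012, 22
sse_c2, dfe_c2 = 0.12663, 21

f_with_mse_c3 = ms_groups_c3 / (sse_c3 / dfe_c3)  # reference distribution F(5, 22)
f_with_mse_c2 = ms_groups_c3 / (sse_c2 / dfe_c2)  # reference distribution F(5, 21)

# Test of C3 against C2, also with MSE(C2) in the denominator.
f_c3_vs_c2 = ((sse_c3 - sse_c2) / (dfe_c3 - dfe_c2)) / (sse_c2 / dfe_c2)

print(round(f_with_mse_c3, 2), round(f_with_mse_c2, 2), round(f_c3_vs_c2, 2))
```

Using the unrounded mean squared errors reproduces the two 5 df statistics given above and keeps every test in the hierarchy on the same estimate of the variance.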
Table 12.6 contains ANOVA tables for the models that impose restrictions on the West Coast
labs. The reduced models have all incorporated some additional conditions on the μi s relative to
model C2 and these are given at the top of each ANOVA table. The first model is the one we have
just examined, the model in which no distinction is made between labs 1 and 2.
The second ANOVA table in Table 12.6 is based on the model in which no distinction is made
between labs 3 and 4, the two in Seattle. (This is model C4.) A formal test for equality takes the
form
\[ F_{obs} = \frac{[SSE(C4) - SSE(C2)]/[dfE(C4) - dfE(C2)]}{MSE(C2)} = \frac{[1.09390 - 0.12663]/[22 - 21]}{0.00603} = 160.41 . \]
There is great evidence for differences between the two Seattle labs. At this point, it does not make
much sense to look further at any models that incorporate μ3 = μ4 , but we plunge forward just to
demonstrate the complete process.
The third ANOVA table in Table 12.6 is based on the model in which no distinction is made
between the two labs within San Francisco and also no distinction is made between the two labs
within Seattle. (This model uses index column 5.) The difference in SSE between this ANOVA
model and the SSE for model C2 is 1.09740 − 0.12663 = 0.97077. The degrees of freedom are
23 − 21 = 2. A formal test of H0 : μ1 = μ2 ; μ3 = μ4 takes the form
level of thrust and yi j is the depth of the hole on the jth trial with xi pounds of downward thrust. Not
only can we fit a simple linear regression, but we can fit polynomials to the data. In this example, we
could fit no polynomial above second-degree (quadratic), because three points determine a parabola
and we only have three distinct x values. If we ran the experiment with 20, 25, 30, 35, and 40 pounds
of thrust, we could fit at most a fourth-degree (quartic) polynomial because five points determine a
fourth-degree polynomial and we would only have five x values.
In general, some number a of distinct x values allows fitting of an a − 1 degree polynomial.
Moreover, fitting the a − 1 degree polynomial is equivalent to fitting the one-way ANOVA with
groups defined by the a different x values. However, as discussed in Section 8.2, fitting high-degree
polynomials is often a very questionable procedure. The problem is not with how the model fits
the observed data but with the suggestions that a high-degree polynomial makes about the behavior
of the process for x values other than those observed. In the example with 20, 25, 30, 35, and 40
pounds of thrust, the quartic polynomial will fit as well as the one-way ANOVA model but the quar-
tic polynomial may have to do some very weird things in the areas between the observed x values.
Of course, the ANOVA model gives no indications of behavior for x values other than those that
were observed. When performing regression, we usually like to have some smooth-fitting model
giving predictions that, in some sense, interpolate between the observed data points. High-degree
polynomials often fail to achieve this goal.
Much of the discussion that follows, other than observing the equivalence of fitting a one-way
ANOVA and an a − 1 degree polynomial, is simply a discussion of fitting a polynomial. It is very
similar to the discussion in Section 8.1 but with fewer possible values for the predictor variable x.
However, we will find the concept of replacing a categorical variable with a polynomial to be a very
useful one in higher-order ANOVA and in modeling count data.
EXAMPLE 12.5.1. Beineke and Suddarth (1979) and Devore (1991, p. 380) consider data on roof
supports involving trusses that use light-gauge metal connector plates. Their dependent variable is
an axial stiffness index (ASI) measured in kips per inch. The predictor variable is the length of the
light-gauge metal connector plates. The data are given in Table 12.9.
Viewed as regression data, we might think of fitting a simple linear regression model
yh = β0 + β1 xh + εh ,
h = 1, . . . , 35. Note that while h varies from 1 to 35, there are only five distinct values of xh that
occur in the data. The data could also be considered as an analysis of variance with plate lengths
being different groups and with seven observations on each treatment. Table 12.10 gives the usual
summary statistics for a one-way ANOVA. As an analysis of variance, we usually use two subscripts
to identify an observation: one to identify the group and one to identify the observation within the
group. The ANOVA model is often written
yi j = μi + εi j , (12.5.1)
where i = 1, 2, 3, 4, 5 and j = 1, . . . , 7. We can also rewrite the regression model using the two
subscripts i and j in place of h,
yi j = β0 + β1xi + εi j ,
where i = 1, 2, 3, 4, 5 and j = 1, . . . , 7. Note that all of these models account for exactly 35 observa-
tions.
Figure 12.5 contains a scatter plot of the data. With multiple observations at each x value, the
regression is really only fitted to the mean of the y values at each x value. The means of the ys
are plotted against the x values in Figure 12.6. The overall trend of the data is easier to evaluate in
this plot than in the full scatter plot. We see an overall increasing trend that is very nearly linear
except for a slight anomaly with 6-inch plates. We need to establish if these visual effects are real
or just random variation, i.e., we would also like to establish whether the simple regression model
is appropriate for the trend that exists.
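A more general model for trend is a polynomial. With only five distinct x values, we can fit at
most a quartic (fourth-degree) polynomial, say,
\[ y_{ij} = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + \beta_4 x_i^4 + \varepsilon_{ij} . \qquad (12.5.2) \]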
We would prefer a simpler model, something smaller than a quartic, i.e., a cubic, quadratic, or a
linear polynomial.
Table 12.11 contains ANOVA tables for fitting the linear, quadratic, cubic, and quartic polyno-
mial regressions and for fitting the one-way ANOVA model. From our earlier discussion, the F test
in the simple linear regression ANOVA table strongly suggests that there is an overall trend in the
data. From Figure 12.6 we see that this trend must be increasing, i.e., as lengths go up, by and large
12.5 POLYNOMIAL REGRESSION AND ONE-WAY ANOVA 307
the ASI readings go up. ANOVA tables for higher-degree polynomial models have been discussed in
Chapter 8 but for now the key point to recognize is that the ANOVA table for the quartic polynomial
is identical to the ANOVA table for the one-way analysis of variance. This occurs because models
(12.5.1) and (12.5.2) are equivalent, i.e. they have the same fitted values, residuals, SSE, dfE, and
MSE. Note however that the ANOVA model provides predictions (fitted values) only for the five
plate lengths in the data whereas the polynomial models provide predictions for any plate length,
although predictions that have dubious value when fitting high-order polynomials at lengths other
than those in the data.
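Because the quartic and the one-way ANOVA have the same fitted values, the fitted quartic is the unique fourth-degree polynomial passing through the five group means. The sketch below, in Python, builds that polynomial by Lagrange interpolation from the ASI group means tabulated in Example 12.6.1. Lagrange interpolation is our choice of device here, not the fitting method used in the text, and `lagrange_through` is our own helper.

```python
def lagrange_through(xs, ys):
    """Return the unique polynomial of degree len(xs) - 1 through the points."""
    def poly(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return poly

# Plate lengths and ASI group means as tabulated in Example 12.6.1.
lengths = [4, 6, 8, 10, 12]
means = [333.2143, 368.0571, 375.1286, 407.3571, 437.1714]

quartic = lagrange_through(lengths, means)
fitted_at_data = [quartic(x) for x in lengths]  # reproduces the group means
between = quartic(5)  # a prediction the ANOVA model does not provide
print(fitted_at_data, between)
```

At the observed plate lengths the polynomial returns the group means, which are the ANOVA fitted values; at other lengths it returns a value whose usefulness is exactly what the text calls into question.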
The first question of interest is whether a quartic polynomial is needed or whether a cubic model
would be adequate. This is easily evaluated from the Table of Coefficients for the quartic fit that
follows. For computational reasons, the results reported are for a polynomial involving powers of
x − x̄· rather than powers of x, cf. Section 8.1. This has no effect on our subsequent discussion.
Table of Coefficients
Predictor β̂k SE(β̂k ) t P
Constant 375.13 12.24 30.64 0.000
(x − x̄·) 8.768 5.816 1.51 0.142
(x − x̄·)2 3.983 4.795 0.83 0.413
(x − x̄·)3 0.2641 0.4033 0.65 0.517
(x − x̄·)4 −0.2096 0.2667 −0.79 0.438
There is little evidence (P = 0.438) that β4 ≠ 0, so a cubic polynomial seems to be an adequate
explanation of the data.
This Table of Coefficients is inappropriate for evaluating β3 in the cubic model (even the cubic
model based on x − x̄· ). To evaluate β3 , we need to fit the cubic model. If we then decide that
a parabola is an adequate model, evaluating β2 in the parabola requires one to fit the quadratic
model. In general, regression estimates are only valid for the model fitted. A new model requires
new estimates and standard errors.
If we fit the sequence of polynomial models: intercept-only, linear, quadratic, cubic, quartic, we
could look at testing whether the coefficient of the highest-order term is zero in each model’s Table
of Coefficients or we could compare the models by comparing SSEs. The latter is often more
convenient. The degrees of freedom and sums of squares for error for the four polynomial regression
models and the model with only an intercept β0 (grand mean) follow. The differences in sums of
squares error for adjacent models are also given as sequential sums of squares (Seq. SS), cf. Sec-
tion 9.4; the differences in degrees of freedom error are just 1.
Model comparisons
(Difference)
Model dfE SSE Seq. SS
Intercept 34 75468 —-
Linear 33 32687 42780
Quadratic 32 32573 114
Cubic 31 32123 450
Quartic 30 31475 648
Note that the dfE and SSE for the intercept model are those from the corrected Total lines in the
ANOVAs of Table 12.11. The dfEs and SSEs for the other models also come from Table 12.11. One
virtue of using this method is that many regression programs will report the Seq. SS when fitting
Model (12.5.2), so we need not fit all four polynomial models, cf. Subsection 8.1.1.
To test the quartic model against the cubic model we take
\[ F_{obs} = \frac{648/1}{31475/30} = 0.62 = (-0.79)^2 . \]
This is just the square of the t statistic for testing β4 = 0 in the quartic model. The reference distri-
bution for the F statistic is F(1, 30) and the P value is 0.438, as it was for the t test.
If we decide that we do not need the quartic term, we can test whether we need the cubic term.
We can test the quadratic model against the cubic model with
\[ F_{obs} = \frac{450/1}{32123/31} = 0.434. \]
The reference distribution is F(1, 31). This test is equivalent to the t test of β3 = 0 in the cubic
model. The t test of β3 = 0 in the quartic model is inappropriate.
Our best practice is an alternative to this F test. The denominator of this test is 32123/31, the
mean squared error from the cubic model. If we accepted the cubic model only after testing the
quartic model, the result of the quartic test is open to question and thus using the MSE from the
cubic model to estimate σ² is open to question. It might be better just to use the estimate of σ²
from the quartic model, which is the mean squared error from the one-way ANOVA. If we do this,
the test statistic for the cubic term becomes
\[ F_{obs} = \frac{450/1}{31475/30} = 0.429. \]
The reference distribution for the alternative test is F(1, 30). In this example the two F tests give
essentially the same answers. This should, by definition, almost always be the case. If, for example,
one test were significant at 0.05 and the other were not, they are both likely to have P values near
0.05 and the fact that one is a bit larger than 0.05 and the other is a bit smaller than 0.05 should not
be a cause for concern. The only time these tests would be very different is if we performed them
when there was considerable evidence that β4 ≠ 0, something that would be silly to do.
As originally discussed in Subsection 3.1.1, when making a series of tests related to Model
(12.5.2), we recommend using its mean squared error, say MSE(2), as the denominator of all the
F statistics. We consider this preferable to actually fitting all four polynomial models and looking
at the Table of Coefficients t statistics for the highest-order term, because the tables of coefficients
from the four models will not all use MSE(2).
If we decide that neither the quartic nor the cubic terms are important, we can test whether we
need the quadratic term. Testing the quadratic model against the simple linear model gives
\[ F_{obs} = \frac{114/1}{32573/32} = 0.112, \]
which is compared to an F(1, 32) distribution. This test is equivalent to the t test of β2 = 0 in the
quadratic model. Again, we prefer the quadratic term test statistic
\[ F_{obs} = \frac{114/1}{31475/30} = 0.109 . \]
Testing the simple linear model against the intercept-only model gives
\[ F_{obs} = \frac{42780/1}{32687/33} = 43.190. \]
This is just the test for zero slope in the simple linear regression model. Again, the preferred test for
the linear term has
\[ F_{obs} = \frac{42780/1}{31475/30} = 40.775. \]
Table 12.12 contains an expanded analysis of variance table for the one-way ANOVA that in-
corporates the information from this sequence of comparisons.
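The whole sequence of comparisons can be reproduced from the Model comparisons table. The following minimal sketch in Python computes, for each polynomial term, the F statistic with the term's own model MSE in the denominator and the preferred statistic with the quartic (one-way ANOVA) MSE; the variable names are ours.

```python
# (model, dfE, SSE, Seq. SS) as listed in the Model comparisons table.
fits = [("Linear", 33, 32687, 42780),
        ("Quadratic", 32, 32573, 114),
        ("Cubic", 31, 32123, 450),
        ("Quartic", 30, 31475, 648)]

dfe_quartic, sse_quartic = fits[-1][1], fits[-1][2]
mse_quartic = sse_quartic / dfe_quartic  # MSE from Model (12.5.2)

results = {}
for name, dfe, sse, seq_ss in fits:
    f_own = seq_ss / (sse / dfe)     # denominator: MSE of the model containing the term
    f_common = seq_ss / mse_quartic  # denominator: MSE of the quartic model
    results[name] = (f_own, f_common)
    print(f"{name:10s} {f_own:8.3f} {f_common:8.3f}")
```

For the quartic term the two statistics are the same by construction; for the lower-order terms they differ only slightly in this example.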
[Figure: residual plot; horizontal axis: Plate Length.]
From Table 12.12, the P value of 0.000 indicates strong evidence that the five groups are differ-
ent, i.e., there is strong evidence for the quartic polynomial helping to explain the data. The results
from the sequential terms are so clear that we did not bother to report P values for them. There is a
huge effect for the linear term. The other three F statistics are all much less than 1, so there is no
evidence of the need for a quartic, cubic, or quadratic polynomial. As far as we can tell, a line fits the
data just fine. For completeness, some residual plots are presented as Figures 12.7 through 12.12.
Note that the normal plot for the simple linear regression in Figure 12.8 is less than admirable, while
the normal plot for the one-way ANOVA in Figure 12.12 is only slightly better. It appears that one
should not put great faith in the normality assumption. ✷
[Figure: normal plot; vertical axis: Standardized residuals, horizontal axis: Theoretical Quantiles.]
[Figure: residual plot; horizontal axis: Plate Length.]
freedom for error in the one-way ANOVA are often referred to as the mean square, sum of squares,
and degrees of freedom for pure error. This lack-of-fit test can be performed whenever the data
contain multiple observations at any x values. Often the appropriate unbalanced one-way ANOVA
includes groups with only one observation on them. These groups do not provide an estimate of σ 2 ,
so they simply play no role in obtaining the mean square for pure error. In Section 8.6 the Hooker
data had only two x values with multiple observations and both groups only had two observations
in them. Thus, the n = 31 cases are divided into a = 29 groups but only four cases were involved in
finding the pure error.
[Figure: Residual−Fitted plot; vertical axis: Standardized residuals, horizontal axis: Fitted.]
[Figure: normal plot; horizontal axis: Theoretical Quantiles.]
For testing lack of fit in the simple linear regression model with the ASI data, the numerator
sum of squares can be obtained by differencing the sums of squares for error in the simple linear
regression model and the one-way ANOVA model. Here the sum of squares for lack of fit is 32687 −
31475 = 1212 and the degrees of freedom for lack of fit are 33 − 30 = 3. The mean square for lack
of fit is 1212/3 = 404. The pure error comes from the one-way ANOVA table. The lack-of-fit test
F statistic for the simple linear regression model is
\[ F_{obs} = \frac{404}{1049} = 0.39 , \]
which is less than 1, so there is no evidence of a lack of fit in the simple linear regression model. If
an α = 0.05 test were desired, the test statistic would be compared to F(0.95, 3, 30).
A similar lack-of-fit test is available for any of the reduced polynomial models relative to the
one-way ANOVA model.
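The pure error computation, including the way singleton groups drop out, can be sketched in a few lines of Python. The small data set below is hypothetical and the helpers `pure_error` and `slr_sse` are ours; the last lines redo the ASI lack-of-fit test from the sums of squares quoted above.

```python
from collections import defaultdict

def pure_error(x, y):
    """Sum of squares and degrees of freedom for pure error.

    Observations are grouped by x value; a group with a single
    observation adds nothing to either quantity.
    """
    groups = defaultdict(list)
    for xi, yi in zip(x, y):
        groups[xi].append(yi)
    ss, df = 0.0, 0
    for values in groups.values():
        mean = sum(values) / len(values)
        ss += sum((v - mean) ** 2 for v in values)
        df += len(values) - 1
    return ss, df

def slr_sse(x, y):
    """Error sum of squares from a simple linear regression of y on x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((a - xbar) ** 2 for a in x)
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    syy = sum((b - ybar) ** 2 for b in y)
    return syy - sxy ** 2 / sxx

# Hypothetical data: x = 1 and x = 3 are replicated, x = 2 and x = 4 are not.
x = [1, 1, 2, 3, 3, 3, 4]
y = [2.1, 1.8, 3.2, 3.9, 4.4, 4.1, 5.6]

ss_pe, df_pe = pure_error(x, y)
sse_line = slr_sse(x, y)
df_lof = (len(x) - 2) - df_pe
f_lof = ((sse_line - ss_pe) / df_lof) / (ss_pe / df_pe)
print(df_pe, df_lof, f_lof)

# The ASI sums of squares quoted in the text.
f_asi = ((32687 - 31475) / (33 - 30)) / (31475 / 30)
print(round(f_asi, 2))
```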
12.5.2 More on R²
Consider a balanced one-way ANOVA,
yi j = μi + εi j , E(εi j ) = 0, Var(εi j ) = σ 2 ,
yi j = β0 + β1 xi + εi j (12.5.3)
The two regression models give the same least squares estimates and predictions.
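Exercise 12.7.17 is to show that R² is always at least as large for Model (12.5.4) as for Model
(12.5.3) by showing that for Model (12.5.3),
\[ R^2 = \frac{\left[N\sum_{i=1}^{a}(x_i - \bar{x}_\cdot)(\bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot})\right]^2}{\left[N\sum_{i=1}^{a}(x_i - \bar{x}_\cdot)^2\right]\left[N\sum_{i=1}^{a}(\bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot})^2 + \sum_{ij}(y_{ij} - \bar{y}_{i\cdot})^2\right]} \qquad (12.5.5) \]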
The two regressions are equivalent in terms of finding a good model but it is easier to predict
averages than individual observations because averages have less variability. R² for a model depends
both on how good the model is relative to a perfect prediction model and how much variability there
is in y when using a perfect model. Remember, perfect models can have low R² when there is much
variability and demonstrably wrong models can have high R² when the variability of a perfect model
is small, but the wrong model captures the more important features of a perfect model in relating x
to y.
Discussing R² in the context of one-way ANOVA is useful because the one-way ANOVA provides
a perfect model for predicting y based on x, whereas the simple linear regression model may
or may not be a perfect model. For a given value of SSTot = ∑ij (yij − ȳ··)², the size of R² for the
one-way ANOVA depends only on the within-group variability, that is, the variability of y in the
perfect model. The size of R² for the simple linear regression depends both on the variability of y in
the perfect model and on how well the simple linear regression model approximates the perfect
model. The term ∑ij (yij − ȳi·)² in the denominator of (12.5.5) is the sum of squares error from the
one-way ANOVA, so it is large precisely when the variability in the perfect model is large. (The
one-way ANOVA’s MSE is an estimate of the variance from a perfect prediction model.) Averaging
observations within a group makes the variability of a perfect model smaller, i.e., the variance is
smaller in
\[ \bar{y}_{i\cdot} = \mu_i + \bar{\varepsilon}_{i\cdot} , \quad E(\bar{\varepsilon}_{i\cdot}) = 0, \quad Var(\bar{\varepsilon}_{i\cdot}) = \sigma^2/N , \]
so the R² of Model (12.5.4) is larger because, while the component of R² due to approximating the
perfect model remains the same, the component due to variability in the perfect model is reduced.
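The ASI data illustrate the point. The sketch below, in Python, uses the regression, lack-of-fit, and pure-error sums of squares from the simple linear regression ANOVA table of Example 12.6.1, and it assumes that Model (12.5.4) is the regression on the group means, for which the pure-error term drops out of the total sum of squares. The variable names are ours.

```python
# Sums of squares for the ASI data, from the SLR ANOVA table in Example 12.6.1.
ss_regression = 42780
ss_lack_of_fit = 1212
ss_pure_error = 31475

# Model (12.5.3): simple linear regression on all 35 observations.
r2_full = ss_regression / (ss_regression + ss_lack_of_fit + ss_pure_error)

# Regression on the five group means (assumed to be Model (12.5.4)).
r2_means = ss_regression / (ss_regression + ss_lack_of_fit)

# One-way ANOVA on the full data, the "perfect" model of the discussion.
r2_anova = (ss_regression + ss_lack_of_fit) / (
    ss_regression + ss_lack_of_fit + ss_pure_error)

print(round(r2_full, 3), round(r2_means, 3), round(r2_anova, 3))
```

The line is nearly as good as the perfect model on the full data, yet its R² is modest there and very high on the means, purely because averaging removes the within-group variability.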
12.6 Weighted least squares
In general, weighted regression is a method for dealing with observations that have nonconstant
variances and nonzero correlations. In this section, we deal with the simplest form of weighted
regression in which we assume zero correlations between observations. Weighted regression has
some interesting connections to fitting polynomials to one-way ANOVA data that we will examine
here, and it has connections to analyzing the count data considered later.
Our standard regression model from Chapter 11 has
Y = Xβ + e, E(e) = 0, Cov(e) = σ²I.
We now consider a model for data that do not all have the same variance. In this model, we assume
that the relative sizes of the variances are known but that the variances themselves are unknown.
In this simplest form of weighted regression, we have a covariance structure that changes from
Cov(e) = σ²I to Cov(e) = σ²D(w)⁻¹. Here D(w) is a diagonal matrix with known weights w =
(w1 , . . . , wn )′ along the diagonal. The covariance matrix involves D(w)⁻¹, which is just a diagonal
matrix having diagonal entries that are 1/w1 , . . . , 1/wn . The variance of an observation yi is σ²/wi .
If wi is large relative to the other weights, the relative variance of yi is small, so it contains more
information than other observations and we should place more weight on it. Conversely, if wi is
relatively small, the variance of yi is large, so it contains little information and we should place little
weight on it. For all cases, wi is a measure of how much relative weight should be placed on case i.
Note that the weights are relative, so we could multiply or divide them all by a constant and obtain
essentially the same analysis. Obviously, in standard regression the weights are all taken to be 1.
In matrix form, our new model is
\[ Y = X\beta + e, \quad E(e) = 0, \quad Cov(e) = \sigma^2 D(w)^{-1} . \qquad (12.6.1) \]
In this model all the observations are uncorrelated because the covariance matrix is diagonal. We
do not know the variance of any observation because σ 2 is unknown. However, we do know the
relative sizes of the variances because we know the weights wi . It should be noted that when Model
(12.6.1) is used to make predictions, it is necessary to specify weights for any future observations.
Before giving a general discussion of weighted regression models, we examine some examples
of their application. A natural application of weighted regression is to data for a one-way analysis
of variance with groups that are quantitative levels of some factor. With a quantitative factor, we can
perform either a one-way ANOVA or a regression on the data. However, if for some reason the full
data are not available, we can still obtain an appropriate simple linear regression by performing a
weighted regression analysis on the treatment means. The next examples explore the relationships
between regression on the full data and weighted regression on the treatment means.
In the weighted regression, the weights turn out to be the treatment group sample sizes from
the ANOVA. In a standard unbalanced ANOVA yi j = μi + εi j , i = 1, . . . , a, j = 1, . . . , Ni , the sample
means have Var(ȳi· ) = σ 2 /Ni . Thus, if we perform a regression on the means, the observations
have different variances. In particular, from our earlier discussion of variances and weights, it is
appropriate to take the sample sizes as the weights, i.e., wi = Ni .
EXAMPLE 12.6.1. In Section 12.5 we considered the axial stiffness data of Table 12.9. A simple
linear regression on the full data gives the following:
Table of Coefficients: SLR
Predictor β̂k SE(β̂k ) t P
Constant 285.30 15.96 17.88 0.000
x (plates) 12.361 1.881 6.57 0.000
The analysis of variance table for the simple linear regression is given below. The usual error
line would have 33 degrees of freedom but, as per Subsection 12.5.1, we have broken this into two
components, one for lack of fit and one for pure error.
Analysis of Variance: SLR
Source df SS MS
Regression 1 42780 42780
Lack of fit 3 1212 404
Pure error 30 31475 1049
Total 34 75468
In Section 12.5 we presented group summary statistics, ȳi· and s²i , for the five plate lengths. The
mean squared pure error is just the pooled estimate of the variance and the sample sizes and sample
means are given below.
Plate 4 6 8 10 12
N 7 7 7 7 7
ȳi· 333.2143 368.0571 375.1286 407.3571 437.1714
As mentioned in Section 12.5, one can get the same estimated line by just fitting a simple linear
regression to the means. For an unbalanced ANOVA, getting the correct regression line from the
means requires a weighted regression. In this balanced case, if we use a weighted regression we get
not only the same fitted line but also some interesting relationships in the ANOVA tables. Below
are given the Table of Coefficients and the ANOVA table for the weighted regression on the means.
The weights are the sample sizes for each mean.
Table of Coefficients: Weighted SLR
Predictor β̂k SE(β̂k ) t P
Constant 285.30 10.19 27.99 0.000
x (plates) 12.361 1.201 10.29 0.002
Analysis of Variance: Weighted SLR
Source df SS MS F P
Regression 1 42780 42780 105.88 0.002
Error 3 1212 404
Total 4 43993
The estimated regression coefficients are identical to those given in Section 12.5. The standard errors
and thus the other entries in the table of coefficients differ. In the ANOVA tables, the regression
lines agree while the error line from the weighted regression is identical to the lack-of-fit line in
the ANOVA table for the full data. In the weighted regression, all standard errors use the lack of fit
as an estimate of the variance. In the regression on the full data, the standard errors use a variance
estimate obtained from pooling the lack of fit and the pure error. The ultimate point is that by using
weighted regression on the summarized data, we can still get most relevant summary statistics for
simple linear regression. Of course, this assumes that the simple linear regression model is correct,
and unfortunately the weighted regression does not allow us to test for lack of fit.
If we had taken all the weights to be one, i.e., if we had performed a standard regression on the
means, the parameter estimate table would be the same but the ANOVA table would not display the
identities discussed above. The sums of squares would all have been off by a factor of 7. ✷
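The weighted regression on the means is simple enough to compute directly. The following is a minimal sketch in Python using the plate lengths, sample sizes, and means tabulated in this example; the function `weighted_slr` is our own helper built from the usual weighted sums, not a call into any statistical package.

```python
from math import sqrt

def weighted_slr(x, y, w):
    """Weighted simple linear regression of y on x with known weights w.

    Returns intercept, slope, regression SS, error SS, and the standard
    errors of the intercept and slope.
    """
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    ss_reg = slope * sxy
    ss_err = sum(wi * (yi - intercept - slope * xi) ** 2
                 for wi, xi, yi in zip(w, x, y))
    mse = ss_err / (len(x) - 2)
    se_slope = sqrt(mse / sxx)
    se_intercept = sqrt(mse * (1 / sw + xbar ** 2 / sxx))
    return intercept, slope, ss_reg, ss_err, se_intercept, se_slope

plate = [4, 6, 8, 10, 12]
mean_asi = [333.2143, 368.0571, 375.1286, 407.3571, 437.1714]

weighted = weighted_slr(plate, mean_asi, [7] * 5)    # weights = sample sizes
unweighted = weighted_slr(plate, mean_asi, [1] * 5)  # standard regression on the means
print(weighted)
print(unweighted)
```

With weights of 7 the sums of squares match the Regression and Lack of fit lines from the full data up to rounding of the means; with weights of 1 the coefficients and standard errors are unchanged while the sums of squares shrink by a factor of 7.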
Unbalanced weights
We now examine an unbalanced one-way ANOVA and again compare a simple linear regression
including identification of pure error and lack of fit to a weighted regression on sample means.
E XAMPLE 12.6.2. Consider the data of Exercise 6.11.1 and Table 6.8. These involve ages of
truck tractors and the costs of maintaining the tractors. A simple linear regression on the full data
yields the tables given below.
Table of Coefficients: SLR
Predictor β̂k SE(β̂k ) t P
Constant 323.6 146.9 2.20 0.044
Age 131.72 35.61 3.70 0.002
Analysis of Variance: SLR
Source df SS MS
Regression 1 1099635 1099635
Lack of fit 5 520655 104131
Pure error 10 684752 68475
Total 16 2305042
The weighted regression analysis is based on the sample means and sample sizes given below.
The means serve as the y variable, the ages are the x variable, and the sample sizes are the weights.
Age 0.5 1.0 4.0 4.5 5.0 5.5 6.0
Ni 2 3 3 3 3 1 2
ȳi· 172.5 664.3 633.0 900.3 1202.0 987.0 1068.5
The Table of Coefficients and ANOVA table for the weighted regression are
Table of Coefficients: Weighted SLR
Predictor β̂k SE(β̂k ) t P
Constant 323.6 167.3 1.93 0.111
Age 131.72 40.53 3.25 0.023
and
Analysis of Variance: Weighted SLR
Source df SS MS F P
Regression 1 1099635 1099635 10.56 0.023
Error 5 520655 104131
Total 6 1620290
Note that, as in the previous example, the regression estimates agree with those from the full data,
that the regression sum of squares from the ANOVA table agrees with the full data, and that the lack
of fit line from the full data ANOVA agrees with the error line from the weighted regression. For an
unbalanced ANOVA, you cannot obtain a correct simple linear regression analysis from the group
means without using weighted regression. ✷
12.6.1 Theory
The analysis of the weighted regression model (12.6.1) is based on changing it into a standard
regression model. The trick is to create a new diagonal matrix that has entries $\sqrt{w_i}$. In a minor
abuse of notation, we write this matrix as $D(\sqrt{w})$. We now multiply Model (12.6.1) by this matrix
to obtain
$$D(\sqrt{w})Y = D(\sqrt{w})X\beta + D(\sqrt{w})e. \tag{12.6.2}$$
It is not difficult to see that
$$E\left[D(\sqrt{w})e\right] = D(\sqrt{w})E(e) = D(\sqrt{w})0 = 0$$
and
$$\mathrm{Cov}\left[D(\sqrt{w})e\right] = D(\sqrt{w})\,\mathrm{Cov}(e)\,D(\sqrt{w})' = D(\sqrt{w})\left[\sigma^2 D(w)^{-1}\right]D(\sqrt{w}) = \sigma^2 I.$$
Thus Equation (12.6.2) defines a standard regression model. For example, by Proposition 11.3.1,
the least squares regression estimates from Model (12.6.2) are
$$\hat\beta = \left\{[D(\sqrt{w})X]'[D(\sqrt{w})X]\right\}^{-1}[D(\sqrt{w})X]'[D(\sqrt{w})Y] = (X'D(w)X)^{-1}X'D(w)Y.$$
The estimate of $\beta$ given above is referred to as a weighted least squares estimate because rather than
minimizing $[Y - X\beta]'[Y - X\beta]$, the estimates are obtained by minimizing
$$\left[D(\sqrt{w})Y - D(\sqrt{w})X\beta\right]'\left[D(\sqrt{w})Y - D(\sqrt{w})X\beta\right] = [Y - X\beta]'D(w)[Y - X\beta].$$
Thus the original minimization problem has been changed into a similar minimization problem that
incorporates the weights. The sum of squares for error from Model (12.6.2) is
$$SSE = \left[D(\sqrt{w})Y - D(\sqrt{w})X\hat\beta\right]'\left[D(\sqrt{w})Y - D(\sqrt{w})X\hat\beta\right] = \left[Y - X\hat\beta\right]'D(w)\left[Y - X\hat\beta\right].$$
The dfE are unchanged from a standard model and MSE is simply SSE divided by dfE. Standard
errors are found in much the same manner as usual except now
$$\mathrm{Cov}(\hat\beta) = \sigma^2 (X'D(w)X)^{-1}.$$
Because the D(w) matrix is diagonal, it is very simple to modify a computer program for stan-
dard regression to allow the analysis of models like (12.6.1). Of course, to make a prediction, a
weight must now be specified for the new observation. Essentially the same idea of rewriting Model
(12.6.1) as the standard regression model (12.6.2) works even when D(w) is not a diagonal matrix,
cf. Christensen (2011, Sections 2.7 and 2.8).
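As a minimal sketch of the weighted least squares formula $\hat\beta = (X'D(w)X)^{-1}X'D(w)Y$, the following Python code refits the weighted regression of Example 12.6.2 from the group means and sample sizes. The function name `weighted_slr` is hypothetical, and the code solves the two-parameter normal equations directly instead of calling a statistics package.

```python
import math

def weighted_slr(x, y, w):
    """Weighted simple linear regression y = b0 + b1*x, solving the
    normal equations (X'D(w)X) b = X'D(w)Y for two parameters."""
    sw = sum(w)
    swx = sum(wi * xi for wi, xi in zip(w, x))
    swy = sum(wi * yi for wi, yi in zip(w, y))
    swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    swxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    det = sw * swxx - swx ** 2              # determinant of X'D(w)X
    b0 = (swxx * swy - swx * swxy) / det
    b1 = (sw * swxy - swx * swy) / det
    # Weighted SSE: [Y - Xb]' D(w) [Y - Xb]
    sse = sum(wi * (yi - b0 - b1 * xi) ** 2 for wi, xi, yi in zip(w, x, y))
    mse = sse / (len(y) - 2)                # dfE is unchanged: n - 2
    # Cov(b) = sigma^2 (X'D(w)X)^{-1}; its diagonal gives the variances
    se_b0 = math.sqrt(mse * swxx / det)
    se_b1 = math.sqrt(mse * sw / det)
    return b0, b1, sse, mse, se_b0, se_b1

# Ages, sample sizes, and sample means from Example 12.6.2
ages = [0.5, 1.0, 4.0, 4.5, 5.0, 5.5, 6.0]
sizes = [2, 3, 3, 3, 3, 1, 2]
means = [172.5, 664.3, 633.0, 900.3, 1202.0, 987.0, 1068.5]

b0, b1, sse, mse, se_b0, se_b1 = weighted_slr(ages, means, sizes)
print(b0, b1, se_b0, se_b1, sse)
```

Because the means are reported to one decimal place, the output should agree with the Weighted SLR tables of Example 12.6.2 only up to rounding.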
12.7 Exercises
E XERCISE 12.7.1. In addition to the data in Table 12.4, Mandel (1972) reported stress test data
from five additional laboratories. Summary statistics are given in Table 12.13. Based on just these
five additional labs, compute the analysis of variance table and test for differences in means between
all pairs of labs. Use α = .01. Is there any reason to worry about the assumptions of the analysis of
variance model?
E XERCISE 12.7.2. Snedecor and Cochran (1967, Section 6.18) presented data obtained in 1942
from South Dakota on the relationship between the size of farms (in acres) and the number of acres
planted in corn. Summary statistics are presented in Table 12.14. Note that the sample standard
deviations rather than the sample variances are given. In addition, the pooled standard deviation is
0.4526.
(a) Give the one-way analysis of variance model with all of its assumptions. Can any problems with
the assumptions be identified?
(b) Give the analysis of variance table for these data. Test whether there are any differences in corn
acreages due to the different sized farms. Use α = 0.01.
(c) Test for differences between all pairs of farm sizes using α = 0.01 tests.
(d) Find the sum of squares for the contrast defined by the following coefficients:
Table 12.15: Weights (in pounds) for various heights (in inches).
Sample Sample Sample
Height size mean variance
63 3 121.66̄ 158.333̄
65 4 131.25 72.913̄
66 2 142.50 112.500
72 3 171.66̄ 158.333̄
E XERCISE 12.7.3. Table 12.15 gives summary statistics on heights and weights of people. Give
the analysis of variance table and test for differences among the four groups. Give a 99% confidence
interval for the mean weight of people in the 72-inch height group.
E XERCISE 12.7.4. In addition to the data discussed earlier, Mandel (1972) reported data from
one laboratory on four different types of rubber. Four observations were taken on each type of
rubber. The means are given below.
Material A B C D
Mean 26.4425 26.0225 23.5325 29.9600
The sample variance of the 16 observations is 14.730793. Compute the analysis of variance table,
the overall F test, and test for differences between each pair of rubber types. Use α = .05.
E XERCISE 12.7.5. In Exercise 12.7.4 on the stress of four types of rubber, the observations on
material B were 22.96, 22.93, 22.49, and 35.71. Redo the analysis, eliminating the outlier. The
sample variance of the 15 remaining observations is 9.3052838.
E XERCISE 12.7.6. Bethea et al. (1985) reported data on an experiment to determine the effec-
tiveness of four adhesive systems for bonding insulation to a chamber. The data are a measure of the
peel-strength of the adhesives and are presented in Table 12.16. A disturbing aspect of these data is
that the values for adhesive system 3 are reported with an extra digit.
(a) Compute the sample means and variances for each group. Give the one-way analysis of variance
model with all of its assumptions. Are there problems with the assumptions? If so, does an
analysis on the square roots or logs of the data reduce these problems?
(b) Give the analysis of variance table for these (possibly transformed) data. Test whether there are
any differences in adhesive systems. Use α = 0.01.
(c) Test for differences between all pairs of adhesive systems using α = 0.01 tests.
(d) Find the sums of squares i) for comparing system 1 with system 4 and ii) for comparing system
2 with system 3.
(e) Assuming that systems 1 and 4 have the same means and that systems 2 and 3 have the same
means, perform a 0.01 level F test for whether the peel-strength of systems 1 and 4 differs from
the peel-strength of systems 2 and 3.
(f) Give a 99% confidence interval for the mean of every adhesive system.
(g) Give a 99% prediction interval for every adhesive system.
(h) Give a 95% confidence interval for the difference between systems 1 and 2.
E XERCISE 12.7.7. Table 12.17 contains weight gains of rats from Box (1950). The rats were
given either Thyroxin or Thiouracil or were in a control group. Do a complete analysis of variance
on the data. Give the model, check assumptions, make residual plots, give the ANOVA table, and
examine appropriate relationships among the means.
E XERCISE 12.7.8. Aitchison and Dunsmore (1975) presented data on Cushing’s syndrome.
Cushing’s syndrome is a condition in which the adrenal cortex overproduces cortisol. Patients are
divided into one of three groups based on the cause of the syndrome: a—adenoma, b— bilateral
hyperplasia, and c—carcinoma. The data are amounts of tetrahydrocortisone in the urine of the
patients. The data are given in Table 12.18. Give a complete analysis.
E XERCISE 12.7.9. Draper and Smith (1966, p. 41) considered data on the relationship between
the age of truck tractors (in years) and the cost (in dollars) of maintaining them over a six-month
period. The data are given in Table 12.19.
Note that there is only one observation at 5.5 years of age. This group does not yield an estimate
of the variance and can be ignored for the purpose of computing the mean squared error. In the
weighted average of variance estimates, the variance of this group is undefined but the variance gets
0 weight, so there is no problem.
Give the analysis of variance table for these data. Does cost differ with age? Is there a significant
difference between the cost at 0.5 years as opposed to 1.0 year? Determine whether there are any
differences between costs at 4, 4.5, 5, 5.5, and 6 years. Are there differences between the first two
ages and the last five? How well do polynomials fit the data?
E XERCISE 12.7.10. Lehmann (1975), citing Heyl (1930) and Brownlee (1960), considered data
on determining the gravitational constant of three elements: gold, platinum, and glass. The data
Lehmann gives are the third and fourth decimal places in five determinations of the gravitational
constant. Analyze the following data.
Gold Platinum Glass
83 61 78
81 61 71
76 67 75
79 67 72
76 64 74
E XERCISE 12.7.11. Shewhart (1939, p. 69) also presented the gravitational constant data of Heyl
(1930) that was considered in the previous problem, but Shewhart reports six observations for gold
instead of five. Shewhart’s data are given below. Analyze these data and compare your results to
those of the previous exercise.
Gold Platinum Glass
83 61 78
81 61 71
76 67 75
79 67 72
78 64 74
72
EXERCISE 12.7.12. Recall that if $Z \sim N(0, 1)$ and $W \sim \chi^2(r)$ with $Z$ and $W$ independent, then
by Definition 2.1.3, $Z\big/\sqrt{W/r}$ has a $t(r)$ distribution. Also recall that in a one-way ANOVA with
independent normal errors, a contrast has
$$\sum_{i=1}^a \lambda_i \bar{y}_{i\cdot} \sim N\left(\sum_{i=1}^a \lambda_i \mu_i,\; \sigma^2 \sum_{i=1}^a \frac{\lambda_i^2}{N_i}\right),$$
$$\frac{SSE}{\sigma^2} \sim \chi^2(dfE),$$
and MSE independent of all the $\bar{y}_{i\cdot}$s. Show that
EXERCISE 12.7.13. Suppose a one-way ANOVA involves five diet treatments: Control, Beef A,
Beef B, Pork, and Beans. As in Subsection 12.4.1, construct a reasonable hierarchy of models to
examine that involves five rows and no semicolons.
EXERCISE 12.7.14. Suppose a one-way ANOVA involves five diet treatments: Control, Beef,
Pork, Lima Beans, and Soy Beans. As in Subsection 12.4.1, construct a reasonable hierarchy of
models that involves four rows, one of which involves a semicolon.
E XERCISE 12.7.15. Conover (1971, p. 326) presented data on the amount of iron found in the
livers of white rats. Fifty rats were randomly divided into five groups of ten and each group was
given a different diet. We analyze the logs of the original data. The total sample variance of the 50
observations is 0.521767 and the means for each diet are given below.
Diet A B C D E
Mean 1.6517 0.87413 0.89390 0.40557 0.025882
Compute the analysis of variance table and test whether there are differences due to diet.
If diets A and B emphasize beef and pork, respectively, diet C emphasizes poultry, and diets D
and E are based on dried beans and oats, the following contrasts may be of interest.
Diet
Contrast A B C D E
Beef vs. pork 1 −1 0 0 0
Mammals vs. poultry 1 1 −2 0 0
Beans vs. oats 0 0 0 1 −1
Animal vs. vegetable 2 2 2 −3 −3
Compute sums of squares for each contrast. Construct a hierarchy of models based on the diet labels
and figure out how to test them using weighted least squares and the mean squared error for pure
error that you found to construct the ANOVA table. What conclusions can you draw about the data?
13 Multiple Comparison Methods
As illustrated in Chapter 12, the most useful information from a one-way ANOVA is obtained
through examining contrasts. That can be done either by estimating contrasts and performing tests
and confidence intervals or by incorporating contrasts directly into reduced models. The first tech-
nique is convenient for one-way ANOVA and also for balanced multifactor ANOVA but it is difficult
to apply to unbalanced multifactor ANOVA or to models for count data. In the latter cases, modeling
contrasts is easier. In either case, the trick is in picking interesting contrasts to consider. Interesting
contrasts are determined by the structure of the groups or are suggested by the data.
The structure of the groups often suggests contrasts that are of interest. We introduced this idea
in Section 12.4. For example, if one of the groups is a standard group or a control, it is of interest to
compare all of the other groups to the standard. With a groups, this leads to a − 1 contrasts. Later
we will consider factorial group structures. These include situations such as four fertilizer groups,
say,
$$n_0p_0 \qquad n_0p_1 \qquad n_1p_0 \qquad n_1p_1$$
where $n_0p_0$ is no fertilizer, $n_0p_1$ consists of no nitrogen fertilizer but application of a phosphorous
fertilizer, $n_1p_0$ consists of a nitrogen fertilizer but no phosphorous fertilizer, and $n_1p_1$ indicates both
types of fertilizer. Again the group structure suggests contrasts to examine. One interesting contrast
compares the two groups having nitrogen fertilizer against the two without nitrogen fertilizer, an-
other compares the two groups having phosphorous fertilizer against the two without phosphorous
fertilizer, and a third contrast compares the effect of nitrogen fertilizer when phosphorous is not
applied with the effect of nitrogen fertilizer when phosphorous is applied. Again, we have a groups
and a − 1 contrasts. Even when there is an apparent lack of structure in the groups, the very lack
of structure suggests a set of contrasts. If there is no apparent structure, the obvious thing to do is
compare all of the groups with all of the other groups. With three groups, there are three distinct
pairs of groups to compare. With four groups, there are six distinct pairs of groups to compare. With
five groups, there are ten pairs. With seven groups, there are 21 pairs. With 13 groups, there are 78
pairs.
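The pair counts quoted above are the binomial coefficients $a(a-1)/2$. A short sketch that reproduces them (the helper name `n_pairs` is hypothetical):

```python
from math import comb

def n_pairs(a):
    """Number of distinct pairs among a groups, a(a - 1)/2."""
    return comb(a, 2)

pair_counts = {a: n_pairs(a) for a in (3, 4, 5, 7, 13)}
print(pair_counts)
```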
One problem is that, with a moderate number of groups, there are many contrasts to examine.
When we do tests or confidence intervals, there is a built-in chance for error. The more statistical
inferences we perform, the more likely we are to commit an error. The purpose of the multiple
comparison methods examined in this chapter is to control the probability of making a specific
type of error. When testing many contrasts, we have many null hypotheses. This chapter considers
multiple comparison methods that control (i.e., limit) the probability of making an error in any
of the tests, when all of the null hypotheses are correct. Limiting this probability is referred to as
weak control of the experimentwise error rate. It is referred to as weak control because the control
only applies under the very stringent assumption that all null hypotheses are correct. Some authors
consider a different approach and define strong control of the experimentwise error rate as control
of the probability of falsely rejecting any null hypothesis. Thus strong control limits the probability
of false rejections even when some of the null hypotheses are false. Not everybody distinguishes
between weak and strong control, so the definition of experimentwise error rate depends on whose
work you are reading. One argument against weak control of the experimentwise error rate is that in
designed experiments, you choose groups that you expect to have different effects. In such cases, it
makes little sense to concentrate on controlling the error under the assumption that all groups have
the same effect. On the other hand, strong control is more difficult to establish.
Our discussion of multiple comparisons focuses on testing whether contrasts are equal to 0. In
all but one of the methods considered in this chapter, the experimentwise error rate is (weakly)
controlled by first doing a test of the hypothesis μ1 = μ2 = · · · = μa . If this test is not rejected, we
do not claim that any individual contrast is different from 0. In particular, if μ1 = μ2 = · · · = μa , any
contrast among the means must equal 0, so all of the null hypotheses are correct. Since the error rate
for the test of μ1 = μ2 = · · · = μa is controlled, the weak experimentwise error rate for the contrasts
is also controlled.
Many multiple testing procedures can be adjusted to provide multiple confidence intervals that
have a guaranteed simultaneous coverage. Several such methods will be presented in this chapter.
Besides the group structure suggesting contrasts, the other source of interesting contrasts is
having the data suggest them. If the data suggest a contrast, then the ‘parameter’ in our standard
theory for statistical inferences is a function of the data and not a parameter in the usual sense of
the word. When the data suggest the parameter, the standard theory for inferences does not apply.
To handle such situations we can often include the contrasts suggested by the data in a broader
class of contrasts and develop a procedure that applies to all contrasts in the class. In such cases
we can ignore the fact that the data suggested particular contrasts of interest because these are still
contrasts in the class and the method applies for all contrasts in the class. Of the methods considered
in the current chapter, only Scheffé’s method (discussed in Section 13.3) is generally considered
appropriate for this kind of data dredging.
A number of books have been published on multiple comparison methods, e.g., Hsu (1996),
Hochberg and Tamhane (1987). A classic discussion is Miller (1981), who also focuses on weak
control of the experimentwise error rate, cf. Miller’s Section 1.2.
We present multiple comparison methods in the context of the one-way ANOVA model (12.2.1)
but the methods extend to many other situations. We will use Mandel’s (1972) data from Sec-
tion 12.4 to illustrate the methods.
13.1 “Fisher’s” least significant difference method
EXAMPLE 13.1.1. For Mandel’s laboratory data, Subsection 12.4.2 discussed six F tests to go
along with our six degrees of freedom for groups. To test H0 : μ1 = μ2 we compared model C3
to model C2. To test H0 : μ3 = μ4 we compared models C4 and C2. To test H0 : μ6 = μ7 we
compared models C7 and C2. To test H0 : μ1 = μ2 = μ3 = μ4 , we assumed μ1 = μ2 and μ3 = μ4 and
compared model C6 to model C5. Normally, to test H0 : μ5 = μ6 = μ7 , we would assume μ6 = μ7
and test model C8 against C7. Finally, to test H0 : μ1 = μ2 = μ3 = μ4 = μ5 = μ6 = μ7 we assumed
μ1 = μ2 = μ3 = μ4 and μ5 = μ6 = μ7 and compared model C9 to the grand-mean model.
Under the least significant difference method with α = 0.05, first check that the P value in
Table 12.5 is no greater than 0.05 and, if so, perform the six tests in the usual way at the 0.05
level. In Subsection 12.4.2 we did not test model C8 against C7, we tested model C8 against C2,
and we also performed a test of model C4 against C2. These changes in what was tested cause no
change in procedure. However, if the P value in Table 12.5 is greater than 0.05, you simply do not
perform any of the other tests. ✷
The name “least significant difference” comes from comparing pairs of means in a balanced
ANOVA. With N observations in each group, there is a number, the least significant difference
(LSD), such that the difference between two means must be greater than the LSD for the corre-
sponding groups to be considered significantly different. Generally, we have a significant difference
between μi and μ j if
$$\frac{|\bar{y}_{i\cdot} - \bar{y}_{j\cdot}|}{\sqrt{MSE\left[\frac{1}{N} + \frac{1}{N}\right]}} > t\left(1 - \frac{\alpha}{2},\, dfE\right).$$
E XAMPLE 13.1.1 C ONTINUED . For Mandel’s laboratory data, the analysis of variance F test is
highly significant, so we can proceed to make individual comparisons among pairs of means. With
α = 0.05,
$$LSD = t(0.975, 39)\sqrt{0.00421\left[\frac{1}{4} + \frac{1}{4}\right]} = 2.023(0.0459) = 0.093.$$
Means that are greater than 0.093 apart are significantly different. Means that are less than 0.093
apart are not significantly different. We display the results visually. Order the sample means from
smallest to largest and indicate groups of means that are not significantly different by underlining
the group. Such a display follows for comparing laboratories 1 through 7.
Lab. 4 7 5 6 3 2 1
Mean 4.0964 4.2871 4.6906 4.7175 4.7919 4.8612 4.9031
Laboratories 4 and 7 are distinct from all other laboratories. All the other consecutive pairs of labs
are insignificantly different. Thus labs 5 and 6 cannot be distinguished. Similarly, labs 6 and 3
cannot be distinguished, 3 and 2 cannot be distinguished, and labs 2 and 1 cannot be distinguished.
However, lab 5 is significantly different from labs 3, 2, and 1. Lab 6 is significantly different from
labs 2 and 1. Also, lab 3 is different from lab 1.
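The LSD comparisons can be checked with a few lines of Python. This is only a sketch: it takes the value t(0.975, 39) = 2.023 from the text as given instead of computing it, and the variable names are illustrative.

```python
from itertools import combinations
from math import sqrt

# Sample means for laboratories 1 through 7, in increasing order of the mean
lab_means = {4: 4.0964, 7: 4.2871, 5: 4.6906, 6: 4.7175,
             3: 4.7919, 2: 4.8612, 1: 4.9031}
mse, n_per_group = 0.00421, 4
t_crit = 2.023                      # t(0.975, 39), quoted in the text

lsd = t_crit * sqrt(mse * (1 / n_per_group + 1 / n_per_group))

# Pairs of labs whose sample means differ by less than the LSD
not_different = [(i, j) for i, j in combinations(lab_means, 2)
                 if abs(lab_means[i] - lab_means[j]) < lsd]
print(round(lsd, 3))
print(not_different)
```

Only the four consecutive pairs (5, 6), (6, 3), (3, 2), and (2, 1) fall within the LSD, which is the pattern described in the example.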
An alternative display is often used by computer programs.
Lab. Mean
4 4.0964 A
7 4.2871 B
5 4.6906 C
6 4.7175 C D
3 4.7919 E D
2 4.8612 E F
1 4.9031 F
Displays such as these may not be possible when dealing with unbalanced data. What makes
them possible is that, with balanced data, the standard error is the same for comparing every pair of
means. To illustrate their impossibility, we modify the log suicide sample means while leaving their
standard errors alone. Suppose that the means are
Sample statistics: log of suicide ages
modified data
Group Ni ȳi·
Caucasians 44 3.6521
Hispanics 34 3.3521
Native Am. 15 3.3321
(The fact that all three sample means have the same last two digits is a clue that the data are made
up.) Now if we test whether all pairs of differences are zero, at α = 0.01 the critical value is 2.632.
Table of Coefficients
Par Est SE(Est) tobs
μC − μH 0.3000 0.0936 3.21
μC − μN 0.3200 0.1225 2.61
μH − μN 0.0200 0.1270 0.16
The Anglo mean is farther from the Native American mean than it is from the Hispanic mean, but
the Anglos and Hispanics are significantly different whereas the Anglos and the Native Americans
are not. ✷
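A small sketch of the same point, using only the estimates and standard errors from the Table of Coefficients and the critical value 2.632 quoted in the text (the dictionary keys are illustrative labels):

```python
# (estimate, standard error) for each pairwise difference, from the table
comparisons = {
    "C - H": (0.3000, 0.0936),
    "C - N": (0.3200, 0.1225),
    "H - N": (0.0200, 0.1270),
}
t_crit = 2.632   # critical value for alpha = 0.01, quoted in the text

# With unbalanced data each pair has its own standard error, so a larger
# difference need not give a larger t statistic.
results = {name: (est / se, abs(est / se) > t_crit)
           for name, (est, se) in comparisons.items()}
for name, (t_obs, significant) in results.items():
    print(f"{name}: t = {t_obs:.2f}, significant = {significant}")
```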
Apparently some people have taken to calling this method the Fisher significant difference
(FSD) method. One suspects that this is a reaction to another meaning commonly associated with
the letters LSD. I, for one, would never suggest that only people who are hallucinating would believe
all differences declared by LSD are real.
The least significant difference method has traditionally been ascribed to R. A. Fisher and is
often called “Fisher’s least significant difference method.” However, from my own reading of Fisher,
I am unconvinced that he either suggested the method or would have approved of it.
E XAMPLE 13.2.1. For Mandel’s laboratory data, using the structure exploited in Section 12.4,
we had six F tests to go along with our six degrees of freedom for groups. To test H0 : μ1 = μ2 we
compared model C3 to model C2. To test H0 : μ3 = μ4 we compared models C4 and C2. To test
H0 : μ6 = μ7 we compared models C7 and C2. To test H0 : μ1 = μ2 = μ3 = μ4 , we assumed μ1 = μ2
and μ3 = μ4 and compared model C6 to model C5. Normally, to test H0 : μ5 = μ6 = μ7 , we would
assume μ6 = μ7 and test model C8 against C7. Finally, to test H0 : μ1 = μ2 = μ3 = μ4 = μ5 =
μ6 = μ7 we assumed μ1 = μ2 = μ3 = μ4 and μ5 = μ6 = μ7 and compared model C9 to the grand-
mean model. Under the Bonferroni method with α = 0.05 and six tests to perform, you simply
perform each one at the α /6 = 0.05/6 = 0.0083 level. Personally, with six tests, I would instead
pick α = 0.06 so that α /6 = 0.06/6 = 0.01. Rather than these six tests, in Subsection 12.4.2 we
actually performed seven tests, so for an α = 0.05 Bonferroni procedure we need to perform each
one at the α /7 = 0.05/7 = 0.0071 level. Again, I would personally just raise the Bonferroni level
to 0.07 and do all the tests at the 0.01 level. If I had nine tests, I would not raise the Bonferroni level
all the way to 0.09, but I might lower it to 0.045 so that I could do the individual tests at the 0.005
level. ✷
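The per-test levels in this example are simple divisions, sketched below; `bonferroni_level` is a hypothetical helper name.

```python
def bonferroni_level(alpha, n_tests):
    """Per-test level so that n_tests tests have overall level at most alpha."""
    return alpha / n_tests

# (overall level, number of tests) combinations discussed in the example
cases = [(0.05, 6), (0.06, 6), (0.05, 7), (0.07, 7), (0.045, 9)]
for alpha, s in cases:
    print(f"alpha = {alpha}, tests = {s}, "
          f"per-test level = {bonferroni_level(alpha, s):.4f}")
```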
To compare pairs of means in a balanced ANOVA, as with the least significant difference
method, there is a single number to which we can compare the differences in means. For a fixed
α , this number is called the Bonferroni significant difference and takes on the value
$$BSD \equiv t\left(1 - \frac{\alpha}{2s},\, dfE\right)\sqrt{MSE\left[\frac{1}{N} + \frac{1}{N}\right]}.$$
Recall, for comparison, that with the least significant difference method, the necessary tabled value
is t(1 − α /2, dfE), which is always smaller than the tabled value used in the BSD. Thus the BSD is
always larger than the LSD and the BSD tends to display fewer differences among the means than
the LSD.
Bonferroni adjustments can also be used to obtain confidence intervals that have a simultaneous
confidence of (1 − α )100% for covering all of the contrasts. The endpoints of these intervals are
$$\sum_{i=1}^a \lambda_i \bar{y}_{i\cdot} \pm t\left(1 - \frac{\alpha}{2s},\, dfE\right) SE\left(\sum_{i=1}^a \lambda_i \bar{y}_{i\cdot}\right).$$
Only the t distribution value distinguishes this interval from a standard confidence interval for
$\sum_{i=1}^a \lambda_i \mu_i$. In the special case of comparing pairs of means in a balanced ANOVA, the Bonferroni
confidence interval for, say, $\mu_i - \mu_j$ reduces to
$$(\bar{y}_{i\cdot} - \bar{y}_{j\cdot}) \pm BSD.$$
For these intervals, we are (1 − α )100% confident that the collection of all such intervals
simultaneously contains all of the corresponding differences between pairs of population means.
EXAMPLE 13.2.1. In comparing Mandel’s 7 laboratories, we have $\binom{7}{2} = 21$ pairs of laboratories
to contrast. The Bonferroni significant difference for α = 0.05 is
$$BSD = t\left(1 - \frac{0.025}{21},\, 39\right)\sqrt{0.00421\left[\frac{1}{4} + \frac{1}{4}\right]} = t(0.99881, 39)\,0.04588 = 3.2499(0.04588) = 0.149.$$
Means that are greater than 0.149 apart are significantly different. Means that are less than 0.149
apart are not significantly different. Once again, we display the results visually. We order the sample
means from smallest to largest and indicate groups of means that are not significantly different by
underlining the group.
Lab. 4 7 5 6 3 2 1
Mean 4.0964 4.2871 4.6906 4.7175 4.7919 4.8612 4.9031
Laboratories 4 and 7 are distinct from all other laboratories. Labs 5, 6, and 3 cannot be distinguished.
Similarly, labs 6, 3, and 2 cannot be distinguished; however, lab 5 is significantly different from lab
2 and also lab 1. Labs 3, 2, and 1 cannot be distinguished, but lab 1 is significantly different from
lab 6. Remember there is no assurance that such a display can be constructed for unbalanced data.
The Bonferroni simultaneous 95% confidence interval for, say, μ2 − μ5 has endpoints
$$(4.8612 - 4.6906) \pm 0.149,$$
which gives the interval (0.021, 0.320). Transforming back to the original scale from the logarithmic
scale, we are 95% confident that the median for lab 2 is between $e^{0.021} = 1.02$ and $e^{0.320} = 1.38$
times greater than the median for lab 5. Similar conclusions are drawn for the other twenty
comparisons between pairs of means. ✷
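The arithmetic of this example can be reproduced as follows. The sketch takes t(0.99881, 39) = 3.2499 from the text as given; the variable names are illustrative.

```python
from math import exp, sqrt

mse, n_per_group = 0.00421, 4
t_crit = 3.2499                 # t(1 - 0.025/21, 39), quoted in the text
bsd = t_crit * sqrt(mse * (1 / n_per_group + 1 / n_per_group))

diff = 4.8612 - 4.6906          # ybar_2 - ybar_5 on the log scale
lower, upper = diff - bsd, diff + bsd
# Back-transform the endpoints to ratios of medians on the original scale
ratio_lower, ratio_upper = exp(lower), exp(upper)

print(round(bsd, 3), round(lower, 3), round(upper, 3))
print(round(ratio_lower, 2), round(ratio_upper, 2))
```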
13.3 Scheffé’s method
with reference distribution F(dfE(Red.) − dfE(Full), dfE(Big.)). Scheffé’s method requires a further
modification of the test statistic.
If the smallest model is true, then all of the other models are also true. The experimentwise error
rate is the probability of rejecting any reduced model Red. (relative to a full model Full) when model
Sml. is true. Scheffé’s method allows us to compare any and all full and reduced models, even those we
pick by looking at the data, and controls the experimentwise error rate at α by rejecting the
reduced model only when
$$\frac{[SSE(\text{Red.}) - SSE(\text{Full})]/[dfE(\text{Sml.}) - dfE(\text{Big.})]}{MSE(\text{Big.})} > F(1 - \alpha,\, dfE(\text{Sml.}) - dfE(\text{Big.}),\, dfE(\text{Big.})).$$
To justify this procedure, note that the test of the smallest model versus the biggest model rejects
when
$$F = \frac{[SSE(\text{Sml.}) - SSE(\text{Big.})]/[dfE(\text{Sml.}) - dfE(\text{Big.})]}{MSE(\text{Big.})} > F(1 - \alpha,\, dfE(\text{Sml.}) - dfE(\text{Big.}),\, dfE(\text{Big.}))$$
and when the smallest model is true, this has only an α chance of occurring. Because
$$SSE(\text{Sml.}) \ge SSE(\text{Red.}) \ge SSE(\text{Full}) \ge SSE(\text{Big.})$$
we have
$$[SSE(\text{Sml.}) - SSE(\text{Big.})] \ge [SSE(\text{Red.}) - SSE(\text{Full})]$$
and
$$\frac{[SSE(\text{Sml.}) - SSE(\text{Big.})]/[dfE(\text{Sml.}) - dfE(\text{Big.})]}{MSE(\text{Big.})} \ge \frac{[SSE(\text{Red.}) - SSE(\text{Full})]/[dfE(\text{Sml.}) - dfE(\text{Big.})]}{MSE(\text{Big.})}.$$
It follows that you cannot reject Red. relative to Full unless you have already rejected Sml. relative
to Big., and rejecting Sml. relative to Big. occurs only with probability α when Sml. is true. In
other words, there is no more than an α chance of rejecting any of the reduced models when they
are true.
Scheffé’s method is valid for examining any and all contrasts simultaneously. This method is
primarily used with contrasts that were suggested by the data. Scheffé’s method should not be used
for comparing pairs of means in a balanced ANOVA because the HSD method presented in the next
section has properties comparable to Scheffé’s but is better for comparing pairs of means.
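In one-way ANOVA, the analysis of variance F test is rejected when
$$\frac{SSGrps/(a - 1)}{MSE} > F(1 - \alpha,\, a - 1,\, dfE). \tag{13.3.1}$$
It turns out that for any contrast $\sum_i \lambda_i\mu_i$,
$$SS\left(\sum_i \lambda_i\mu_i\right) \le SSGrps. \tag{13.3.2}$$
Scheffé’s method rejects the hypothesis $H_0: \sum_i \lambda_i\mu_i = 0$ when
$$\frac{SS\left(\sum_i \lambda_i\mu_i\right)/(a - 1)}{MSE} > F(1 - \alpha,\, a - 1,\, dfE).$$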
From (13.3.1) and (13.3.2), Scheffé’s test cannot possibly be rejected unless the ANOVA test is
rejected. This controls the experimentwise error rate for multiple tests. Moreover, there always
exists a contrast that contains all of the SSGrps, i.e., there is always a contrast that achieves equality
in relation (13.3.2), so if the ANOVA test is rejected, there is always some contrast that can be
rejected using Scheffé’s method. This contrast may not be interesting but it exists.
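Relation (13.3.2) and the existence of a contrast attaining equality can be illustrated numerically. The sketch below assumes the standard balanced-data contrast sum of squares, $(\sum_i \lambda_i \bar{y}_{i\cdot})^2 / (\sum_i \lambda_i^2/N)$, which is not restated in this section, and applies it to the seven lab means with N = 4 as in the LSD computation. The function name `contrast_ss` is hypothetical.

```python
def contrast_ss(lams, means, n):
    """Sum of squares for a contrast with n observations per group:
    (sum lam_i * ybar_i)^2 / (sum lam_i^2 / n)."""
    est = sum(l * m for l, m in zip(lams, means))
    return est ** 2 / sum(l ** 2 / n for l in lams)

# Sample means for labs 1 through 7, N = 4 observations per lab
means = [4.9031, 4.8612, 4.7919, 4.0964, 4.6906, 4.7175, 4.2871]
n = 4
grand = sum(means) / len(means)

# Sum of squares for groups computed from these seven means
ss_grps = n * sum((m - grand) ** 2 for m in means)

# Coefficients ybar_i - grand mean sum to zero and attain equality in (13.3.2)
max_lams = [m - grand for m in means]
# An ordinary pairwise contrast, mu_2 - mu_5, stays below SSGrps
pair_lams = [0, 1, 0, 0, -1, 0, 0]

ss_max = contrast_ss(max_lams, means, n)
ss_pair = contrast_ss(pair_lams, means, n)
print(ss_grps, ss_max, ss_pair)
```

The maximizing contrast depends on the observed means, which is why it need not correspond to any comparison of practical interest.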
Scheffé’s method can be adapted to provide simultaneous (1 − α )100% confidence intervals for
contrasts. These have the endpoints
$$\sum_{i=1}^a \lambda_i \bar{y}_{i\cdot} \pm \sqrt{(a - 1)\,F(1 - \alpha,\, a - 1,\, dfE)}\; SE\left(\sum_{i=1}^a \lambda_i \bar{y}_{i\cdot}\right).$$
13.4 Studentized range methods
If the observed value of this studentized range statistic Q is consistent with its coming from a
Q(a, dfE) distribution, then the data are consistent with the null hypothesis of equal means μi . If the
μi s are not all equal, the studentized range Q tends to be larger than if the means were all equal; the
difference between the largest and smallest observations will involve not only random variation but
also the differences in the μi s. Thus, for an α = 0.05 level test, if the observed value of Q is larger
than Q(0.95, a, dfE), we reject the claim that the means are all equal.
In applying these methods to a higher-order ANOVA, the key ideas are to compare a set of
sample means using the MSE appropriate to the model and taking N as the number of observations
that go into each mean.
The studentized range multiple comparison methods discussed in this section begin with this
studentized range test.
13.4.1 Tukey’s honest significant difference
John Tukey’s honest significant difference method is to reject the equality of a pair of means, say,
$\mu_i$ and $\mu_j$ at the α = 0.05 level, if
$$\frac{|\bar{y}_{i\cdot} - \bar{y}_{j\cdot}|}{\sqrt{MSE/N}} > Q(0.95, a, dfE).$$
Obviously, this test cannot be rejected for any pair of means unless the test based on the maximum
and minimum sample means is also rejected. For an equivalent way of performing the test, reject
equality of $\mu_i$ and $\mu_j$ if
$$|\bar{y}_{i\cdot} - \bar{y}_{j\cdot}| > Q(0.95, a, dfE)\sqrt{MSE/N}.$$
With a fixed α, the honest significant difference is
$$HSD \equiv Q(1 - \alpha, a, dfE)\sqrt{MSE/N}.$$
For any pair of sample means with an absolute difference greater than the HSD, we conclude that the
corresponding population means are significantly different. The HSD is the number that an observed
difference must be greater than in order for the population means to have an ‘honestly’ significant
difference. The use of the word ‘honest’ is a reflection of the view that the LSD method allows ‘too
many’ rejections.
Tukey’s method can be extended to provide simultaneous (1 − α )100% confidence intervals for
all differences between pairs of means. The interval for the difference μi − μ j has end points
$$(\bar{y}_{i\cdot} - \bar{y}_{j\cdot}) \pm HSD$$
where HSD depends on α . For α = 0.05, we are 95% confident that the collection of all such
intervals simultaneously contains all of the corresponding differences between pairs of population
means.
EXAMPLE 13.4.1. For comparing the 7 laboratories in Mandel’s data with α = 0.05, the honest
significant difference is approximately
\[ HSD = Q(0.95, 7, 40)\sqrt{MSE/4} = 4.39\sqrt{0.00421/4} = 0.142. \]
Here we have used Q(0.95, 7, 40) rather than the correct value Q(0.95, 7, 39) because the correct
value was not available in the table used. Group means that are more than 0.142 apart are signif-
icantly different. Means that are less than 0.142 apart are not significantly different. Note that the
HSD value is similar in size to the corresponding BSD value of 0.149; this frequently occurs. Once
again, we display the results visually.
Lab. 4 7 5 6 3 2 1
Mean 4.0964 4.2871 4.6906 4.7175 4.7919 4.8612 4.9031
These results are nearly the same as for the BSD except that labs 6 and 2 are significantly different
by the HSD criterion. Many Statistics packages will either perform Tukey’s procedure or allow you
to find Q(1 − α , a, dfE).
The HSD simultaneous 95% confidence interval for, say, μ2 − μ5 has endpoints
(4.8612 − 4.6906) ± 0.142,
which gives the interval (0.029, 0.313). Transforming back to the original scale from the logarithmic
scale, we are 95% confident that the median for lab 2 is between e^{0.029} = 1.03 and e^{0.313} = 1.37 times
greater than the median for lab 5. Again, there are 20 more intervals to examine. ✷
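The arithmetic of Example 13.4.1 can be sketched in a few lines of Python. This is only an illustration: the tabled value Q(0.95, 7, 40) = 4.39, the MSE, and the lab means are copied from the example, the names `honest_significant_difference` and `significant_pairs` are ours, and the HSD is rounded to three decimals because that is what the example does before comparing means.

```python
from itertools import combinations
from math import exp, sqrt

# Values quoted in Example 13.4.1.
Q_TABLED = 4.39      # Q(0.95, 7, 40), the tabled value used in the example
MSE = 0.00421
N = 4                # observations per laboratory
means = {4: 4.0964, 7: 4.2871, 5: 4.6906, 6: 4.7175,
         3: 4.7919, 2: 4.8612, 1: 4.9031}


def honest_significant_difference(q, mse, n):
    """HSD = Q * sqrt(MSE / N)."""
    return q * sqrt(mse / n)


# Rounded to three decimals, as in the example.
hsd = round(honest_significant_difference(Q_TABLED, MSE, N), 3)

# A pair of labs is declared different when the gap in sample means exceeds the HSD.
significant_pairs = [
    (a, b) for a, b in combinations(means, 2)
    if abs(means[a] - means[b]) > hsd
]

# Simultaneous 95% interval for mu_2 - mu_5 on the log scale,
# then back-transformed to a ratio of medians.
diff = means[2] - means[5]
interval = (diff - hsd, diff + hsd)
ratio_interval = tuple(exp(end) for end in interval)

print(f"HSD = {hsd:.3f}")
print(f"{len(significant_pairs)} of 21 pairs differ")
print(f"mu_2 - mu_5: ({interval[0]:.3f}, {interval[1]:.3f})")
print(f"ratio of medians: ({ratio_interval[0]:.2f}, {ratio_interval[1]:.2f})")
```

Replacing `Q_TABLED` with a value from software for Q(0.95, 7, 39) would give the exact rather than the approximate HSD.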
The Newman–Keuls multiple range method involves repeated use of the honest significant dif-
ference method with some minor adjustments; see Christensen (1996) for an example.
13.6 Exercises
EXERCISE 13.6.1. Exercise 12.7.1 involved measurements from different laboratories on the
stress at 600% elongation for a certain type of rubber. The summary statistics are repeated in Ta-
ble 13.1. Ignoring any reservations you may have about the appropriateness of the analysis of vari-
ance model for these data, compare all pairs of laboratories using α = 0.10 for the LSD, Bonferroni,
Tukey, and Newman–Keuls methods. Give joint 95% confidence intervals using Tukey’s method for
all differences between pairs of labs.
EXERCISE 13.6.2. Use Scheffé’s method with α = 0.01 to test whether the contrast in Exercise 12.7.2d
is zero.
EXERCISE 13.6.3. Use Bonferroni’s method with an α near 0.01 to give simultaneous confidence
intervals for the mean weight in each height group for Exercise 12.7.3.
EXERCISE 13.6.4. Exercise 12.7.4 contained data on stress measurements for four different types
of rubber. Four observations were taken on each type of rubber; the means are repeated below,
Material A B C D
Mean 26.4425 26.0225 23.5325 29.9600
and the sample variance of the 16 observations is 14.730793. Test for differences between all pairs
of materials using α = 0.05 for the LSD, Bonferroni, and Tukey methods. Give 95% confidence
intervals for the differences between all pairs of materials using the BSD method.
EXERCISE 13.6.5. In Exercise 12.7.5 on the stress of four types of rubber, an outlier was noted
in material B. Redo the multiple comparisons of the previous problem eliminating the outlier and
using only the methods that are still applicable.
EXERCISE 13.6.6. In Exercise 12.7.6 on the peel-strength of different adhesive systems, parts
(b) and (c) amount to doing LSD multiple comparisons for all pairs of systems. Compare the LSD
results with the results obtained using Tukey’s methods with α = .01.
EXERCISE 13.6.7. For the weight gain data of Exercise 12.7.7, use the LSD, Bonferroni, and
Scheffé methods to test whether the following contrasts are zero: 1) the contrast that compares the
two drugs and 2) the contrast that compares the control with the average of the two drugs. Pick an
α level but clearly state the level chosen.
EXERCISE 13.6.8. For the Cushing’s syndrome data of Exercise 12.7.8, use all appropriate methods
to compare all pairwise differences among the three groups. Pick an α level but clearly state the
level chosen.
EXERCISE 13.6.9. Use Scheffé’s method with α = 0.05 and the data of Exercise 12.7.9 to test
the significance of the contrast
EXERCISE 13.6.10. Restate the least significant difference method in terms of testing Biggest,
Full, Reduced, and Smallest models.
Chapter 14
Two-Way ANOVA
This chapter involves many model comparisons, so, for simplicity within a given section, say 14.2,
equation numbers such as (14.2.1) that redundantly specify the section number are referred to in the
text without the section number, hence simply as (1). When referring to an equation number outside
the current section, the full equation number is given.
Bailey (1953), Scheffé (1959), and Christensen (2011) examined data on infant female rats that were
given to foster mothers for nursing. The variable of interest was the weight of the rat at 28 days.
Weights were measured in grams. Rats are classified into four genotypes: A, F, I, and J. Specifically,
rats from litters of each genotype were given to a foster mother of each genotype. The data are
presented in Table 14.1.
14.1.1 Initial analysis
One way to view these data is as a one-way ANOVA with 4 × 4 = 16 groups. Specifically,
\[ y_{hk} = \mu_h + \varepsilon_{hk}, \]
with h = 1, ..., 16, k = 1, ..., Nh. It is convenient to replace the subscript h with the pair of subscripts
(i, j) and write
\[ y_{ijk} = \mu_{ij} + \varepsilon_{ijk}, \qquad (14.1.1) \]
\[ \varepsilon_{ijk}\text{s independent } N(0, \sigma^2), \]
where i = 1, ..., 4 indicates the litter genotype and j = 1, ..., 4 indicates the foster mother genotype
so that, together, i and j identify the 16 groups. The index k = 1, ..., Nij indicates the various
observations in each group.
Equivalently, we can write an overparameterized version of Model (1) called the interaction
model,
\[ y_{ijk} = \mu + \alpha_i + \eta_j + \gamma_{ij} + \varepsilon_{ijk}. \qquad (14.1.2) \]
The idea is that μ is an overall effect (grand mean) to which we add αi, an effect for the ith litter
genotype, plus ηj, an effect for the jth foster mother genotype, plus an effect γij for each combination
of a litter genotype and foster mother genotype. Comparing the interaction model (2) with the one-
way ANOVA model (1), we see that the γi j s in (2) play the same role as the μi j s in (1), making
all of the μ , αi and η j parameters completely redundant. There are 16 groups so we only need 16
parameters to explain the group means and there are 16 γi j s. In particular, all of the μ , αi and η j
parameters could be 0 and the interaction model would explain the data exactly as well as Model
(1). In fact, we could set these parameters to be any numbers at all and still have a free γi j parameter
to explain each group mean. It is equally true that any data features that the μ , αi and η j parameters
could explain could already be explained by the γi j s.
So why bother with the interaction model? Simply because dropping the γij s out of the model
gives us a much simpler, more interpretable no-interaction model
\[ y_{ijk} = \mu + \alpha_i + \eta_j + \varepsilon_{ijk}, \qquad (14.1.3) \]
in which we have structured the effects of the litter and foster mother genotypes so that each adds
some fixed amount to our observations. Model (3) is actually a special case of the general additive-
effects model (9.9.2), which did not specify whether predictors were categorical or measurement
variables. In Model (3), the population mean difference between litter genotypes A and F must be
the same, regardless of the foster mother genotype, i.e.,
\[ (\mu + \alpha_1 + \eta_j) - (\mu + \alpha_2 + \eta_j) = \alpha_1 - \alpha_2. \]
Similarly, the difference between foster mother genotypes F and J must be the same regardless of
the litter genotype, i.e.,
\[ (\mu + \alpha_i + \eta_2) - (\mu + \alpha_i + \eta_4) = \eta_2 - \eta_4. \]
Model (3) has additive effects for the two factors: litter genotype and foster mother genotype. The
effect for either factor is consistent across the other factor. This property is also referred to as the
absence of interaction or as the absence of effect modification. Model (3) requires that the effect
of any foster mother genotype be the same for every litter genotype, and also that the effect of any
litter genotype be the same for every foster mother genotype. Without this property, one could not
meaningfully speak about the effect of a litter genotype, because it would change from foster mother
genotype to foster mother genotype. Similarly, foster mother genotype effects would depend on the
litter genotypes.
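To see what additivity buys us, the following sketch evaluates group means under the no-interaction model and then perturbs a single cell. All parameter values here are hypothetical numbers invented for illustration; they are not estimates from Table 14.1, and the function names are ours.

```python
# Hypothetical effects, invented for illustration only.
mu = 50.0
alpha = {"A": 2.0, "F": -1.0, "I": 0.5, "J": -1.5}   # litter genotype effects
eta = {"A": 1.0, "F": 4.0, "I": -0.5, "J": -4.5}     # foster mother genotype effects


def additive_mean(litter, mother):
    """Group mean under the no-interaction model: mu + alpha_i + eta_j."""
    return mu + alpha[litter] + eta[mother]


# Litter A minus Litter F, computed separately for each foster mother genotype.
litter_gaps = {m: additive_mean("A", m) - additive_mean("F", m) for m in eta}
print(litter_gaps)   # the same gap, alpha_A - alpha_F, for every mother

# One nonzero interaction term gamma_ij in a single cell breaks the pattern.
gamma = {("A", "J"): -6.0}   # hypothetical interaction effect


def interaction_mean(litter, mother):
    """Group mean when an interaction term is added to the additive part."""
    return additive_mean(litter, mother) + gamma.get((litter, mother), 0.0)


gaps_with_interaction = {
    m: interaction_mean("A", m) - interaction_mean("F", m) for m in eta
}
print(gaps_with_interaction)   # the gap now depends on the mother
```

With the interaction term present, the sign of the Litter A versus Litter F difference changes for Mother J, which is exactly the situation in which one cannot speak of "the" effect of a litter genotype.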
Table 14.2: Statistics from fitting models to the data of Table 14.1.
Model       Terms             Notation    SSE        dfE    Cp
(14.1.2)    G + L + M + LM    [LM]        2440.82    45     16.0
(14.1.3)    G + L + M         [L][M]      3264.89    54     13.2
(14.1.4)    G + L             [L]         4039.97    57     21.5
(14.1.5)    G + M             [M]         3328.52    57      8.4
(14.1.6)    G                 [G]         4100.13    60     16.6

Model (2) imposes no such restrictions on the factor effects. Model (2) would happily allow the
foster mother genotype that has the highest weight gains for litter type A to be also the foster mother
genotype that corresponds to the smallest weight gains for Litter J, a dramatic interaction. Model
(2) does not require that the effect of a foster mother genotype be consistent for every litter type or
that the effect of a litter genotype be consistent for every foster mother genotype. If the effect of a
litter genotype can change depending on the foster mother genotype, the model is said to display
effect modification or interaction.
The γi j s in Model (2) are somewhat erroneously called interaction effects. Although they can
explain much more than interaction, eliminating the γi j s in Model (2) eliminates any interaction.
(Whereas eliminating the equivalent μi j effects in Model (1) eliminates far more than just interac-
tion; it leads to a model in which every group has mean 0.)
The test for whether interaction exists is simply the test of the full, interaction, model (2) against
the reduced, no-interaction, model (3). Remember that Model (2) is equivalent to the one-way
ANOVA model (1), so models (1) and (2) have the same fitted values ŷi jk and residuals ε̂i jk and
dfE(1) = dfE(2). The analysis for models like (1) was given in Chapter 12. While it may not be
obvious that Model (3) is a reduced model relative to Model (1), Model (3) is obviously a reduced
model relative to the interaction model (2). Computationally, the fitting of Model (3) is much more
complicated than fitting a one-way ANOVA.
If Model (3) does not fit the data, there is often little one can do except go back to analyzing
Model (1) using the one-way ANOVA techniques of Chapters 12 and 13. In later chapters, depending
on the nature of the factors, we will explore ways to model interaction by looking at models that are
intermediate between (2) and (3).
Table 14.2 contains results for fitting models (2) and (3) along with results for fitting other
models to be discussed anon. In our example, a test of whether Model (3) is an adequate substitute
for Model (2) rejects Model (3) if
\[ F = \frac{[SSE(3) - SSE(2)]/[dfE(3) - dfE(2)]}{SSE(2)/dfE(2)} \]
is too large. The F statistic is compared to an F(dfE(3) − dfE(2), dfE(2)) distribution. Specifically,
we get
\[ F_{obs} = \frac{[3264.89 - 2440.82]/[54 - 45]}{2440.82/45} = \frac{91.56}{54.24} = 1.69, \]
with a one-sided P value of 0.120, i.e., 1.69 is the 0.880 percentile of an F(9, 45) distribution,
denoted 1.69 = F(.880, 9, 45).
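The same computation is easy to script. The sketch below is a minimal illustration; the function name `model_comparison_f` is ours, and the SSE and dfE values are those reported for models (3) and (2) in Table 14.2.

```python
def model_comparison_f(sse_red, dfe_red, sse_full, dfe_full):
    """F statistic for testing a reduced model against a full model.

    Returns (F, numerator df, denominator df).
    """
    num_df = dfe_red - dfe_full
    numerator = (sse_red - sse_full) / num_df
    mse_full = sse_full / dfe_full
    return numerator / mse_full, num_df, dfe_full


# Reduced model (3) against full model (2), values from Table 14.2.
f_obs, df1, df2 = model_comparison_f(3264.89, 54, 2440.82, 45)
print(f"F = {f_obs:.2f} on ({df1}, {df2}) degrees of freedom")
```

Turning the statistic into a P value additionally requires an upper-tail F probability, which the standard library does not provide; a statistics package is needed for that step.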
If Model (3) fits the data adequately, we can explore further to see if even simpler models
adequately explain the data. Using Model (3) as a working model, we might be interested in whether
there are really any effects due to Litters, or any effects due to Mothers. Remember that in the
interaction model (2), it makes little sense even to talk about a Litter effect or a Mother effect
without specifying a particular level for the other factor, so this discussion requires that Model (3)
be reasonable.
The effect of Mothers can be measured in two ways. First, by comparing the no-interaction
model (3) with a model that eliminates the effect for Mothers
\[ y_{ijk} = \mu + \alpha_i + \varepsilon_{ijk}. \qquad (14.1.4) \]
This model comparison assumes that there is an effect for Litters because the αi s are included in
both models. Using Table 14.2, the corresponding F statistic is
\[ F_{obs} = \frac{[4039.97 - 3264.89]/[57 - 54]}{3264.89/54} = \frac{258.36}{60.46} = 4.27, \]
with a one-sided P value of 0.009, i.e., 4.27 = F(.991, 3, 54). There is substantial evidence for
differences in Mothers after accounting for any differences due to Litters. We constructed this
F statistic in the usual way for comparing the reduced model (4) to the full model (3) but when
examining a number of models that are all smaller than a largest model, in this case Model (2), our
preferred practice is to use the MSE from the largest model in the denominator of all the F statistics,
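thus we compute
\[ F_{obs} = \frac{[4039.97 - 3264.89]/[57 - 54]}{2440.82/45} = \frac{258.36}{54.24} = 4.76. \]
The second way to measure the effect of Mothers ignores Litters by comparing the model with only
Mother effects,
\[ y_{ijk} = \mu + \eta_j + \varepsilon_{ijk}, \qquad (14.1.5) \]
with the grand-mean model
\[ y_{ijk} = \mu + \varepsilon_{ijk}. \qquad (14.1.6) \]
Using Table 14.2 and MSE(2), the corresponding F statistic is
\[ F_{obs} = \frac{[4100.13 - 3328.52]/[60 - 57]}{2440.82/45} = \frac{257.20}{54.24} = 4.74, \]
so there is substantial evidence for differences in Mothers when ignoring any differences due to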
Litters. The two F statistics for Mothers, 4.74 and 4.76, are very similar in this example, but the
difference is real; it is not round-off error. Special cases exist where the two F statistics will be
identical, cf. Christensen (2011, Chapter 7).
Similarly, the effect of Litters can be measured by comparing the no-interaction model (3) with
Model (5) that eliminates the effect for Litters. Here Mothers are included in both the full and
reduced models, because the η j s are included in both models. Additionally, we could assume that
there are no Mother effects and base our evaluation of Litter effects on comparing Model (4) with
Model (6). Using Table 14.2, both of the corresponding F statistics turn out very small, below 0.4,
so there is no evidence of a Litter effect whether accounting for or ignoring effects due to Mothers.
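All four main-effect tests use the same denominator, MSE(2), so they can be produced together from the entries of Table 14.2. The following sketch does so; the dictionary `fits` and the function `f_with_largest_mse` are our own illustrative names, not part of any package.

```python
# (SSE, dfE) for each model, from Table 14.2.
fits = {
    2: (2440.82, 45),   # [LM]
    3: (3264.89, 54),   # [L][M]
    4: (4039.97, 57),   # [L]
    5: (3328.52, 57),   # [M]
    6: (4100.13, 60),   # [G]
}
mse_largest = fits[2][0] / fits[2][1]   # MSE(2), from the largest model


def f_with_largest_mse(reduced, full):
    """Numerator from reduced vs. full model; denominator is always MSE(2)."""
    sse_r, dfe_r = fits[reduced]
    sse_f, dfe_f = fits[full]
    return ((sse_r - sse_f) / (dfe_r - dfe_f)) / mse_largest


tests = {
    "Mothers after Litters": f_with_largest_mse(4, 3),
    "Mothers ignoring Litters": f_with_largest_mse(6, 5),
    "Litters after Mothers": f_with_largest_mse(5, 3),
    "Litters ignoring Mothers": f_with_largest_mse(6, 4),
}
for label, f_obs in tests.items():
    print(f"{label}: F = {f_obs:.2f}")
```

Each statistic has 3 numerator degrees of freedom and is referred to an F(3, 45) distribution because the denominator comes from Model (2).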
In summary, both of the tests for Mothers show Mother effects and neither test for Litters shows
Litter effects, so the one-way ANOVA model (5), the model with Mother effects but no Litter effects,
seems to be the best-fitting model. Of course the analysis is not finished by identifying Model
(5). Having identified that the Mother effects are the interesting ones, we should explore how the
four foster mother groups behave. Which genotype gives the largest weight gains? Which gives
the smallest? Which genotypes are significantly different? If you accept Model (5) as a working
model, all of these issues can be addressed as in any other one-way ANOVA. However, it would be
good practice to use MSE(2) when constructing any standard errors, in which case the t(dfE(2))
distribution must be used. Moreover, we have done nothing yet to check our assumptions. We should
have checked assumptions on Model (2) before doing any tests. Diagnostics will be considered in
Subsection 14.1.5.
All of the models considered have their SSE, dfE, and C p statistic (cf. Subsection 10.2.3) re-
ported in Table 14.2. Tests of various models constitute the traditional form of analysis. These tests
are further summarized in the next subsection. But all of this testing seems like a lot of work to
identify a model that the C p statistic immediately identifies as the best model. Table 14.2 also in-
corporates some shorthand notations for the models. First, we replace the Greek letters with Roman
letters that remind us of the effects being fitted, i.e., G for the grand mean, L for Litter effects, M
for Mother effects, and LM for interaction effects. Model (2) is thus rewritten as
\[ y_{ijk} = G + L_i + M_j + (LM)_{ij} + \varepsilon_{ijk}. \]
A second form of specifying models eliminates any group of parameters that is completely redun-
dant and assumes that distinct terms in square brackets are added together. Thus, Model (2) is [LM]
because it requires only the (LM)i j terms and Model (3) is written [L][M] because in Model (3) the
G (μ ) term is redundant and the L (α ) and M (η ) terms are added together. Model (3) is the most
difficult to fit of the models in Table 14.2. Model (6) is a one-sample model, and models (1)=(2),
(4), and (5) are all one-way ANOVA models. When dealing with Model (3), you have to be able
to coax a computer program into giving you all the results that you want and need. With the other
models, you could easily get what you need from a hand calculator.
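The Cp column of Table 14.2 can be reproduced from the SSE and dfE columns alone. The sketch below assumes the usual form of the statistic, Cp = SSE/MSE(full) − (n − 2p), with p = n − dfE the number of free mean parameters and MSE(full) taken from Model (2); Subsection 10.2.3 is not reproduced here, so treat that formula as our assumption. The name `mallows_cp` is ours.

```python
N_OBS = 61   # total number of observations (Total df 60, plus 1)

# (SSE, dfE) for each model, from Table 14.2.
fits = {
    "[LM]": (2440.82, 45),
    "[L][M]": (3264.89, 54),
    "[L]": (4039.97, 57),
    "[M]": (3328.52, 57),
    "[G]": (4100.13, 60),
}
mse_full = fits["[LM]"][0] / fits["[LM]"][1]   # MSE from the largest model


def mallows_cp(sse, dfe, mse, n=N_OBS):
    """Assumed definition: Cp = SSE/MSE_full - (n - 2p), with p = n - dfE."""
    p = n - dfe
    return sse / mse - (n - 2 * p)


cp = {name: round(mallows_cp(sse, dfe, mse_full), 1)
      for name, (sse, dfe) in fits.items()}
best = min(cp, key=cp.get)
print(cp)
print("smallest Cp:", best)
```

Under this definition the largest model always has Cp equal to its number of parameters, here 16, and the model with the smallest Cp is the one the testing sequence arrived at.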
          [LM]                    (1) = (2)
         [L][M]                      (3)
       [L]    [M]                 (4)   (5)
          [G]                        (6).
Models (4) and (5) are not directly comparable, but both are reductions of (3) and both contain (6)
as a special case. Any model in a row of this hierarchy can be tested as a full model relative to
any (reduced) model in a lower row or tested as a reduced model relative to any (full) model in a
higher row. However, we typically modify our testing procedure so that in the denominator of the F
statistic we always use MSE(2), the MSE from the model at the top of the hierarchy, i.e., the MSE
from the largest model being considered. In other words,
\[ F = \frac{[SSE(Red.) - SSE(Full)]/[dfE(Red.) - dfE(Full)]}{SSE(2)/dfE(2)}. \]
Source             df     Seq SS       MS       F       P
Litters             3      60.16     20.05    0.37    0.776
Mothers             3     775.08    258.36    4.76    0.006
Litters∗Mothers     9     824.07     91.56    1.69    0.120
Error              45    2440.82     54.24
Total              60    4100.13
Litters), and interaction. In the first ANOVA table, Mothers are fitted to the data before Litters. In
the second table, Litters are fitted before Mothers.
Although models are fitted from smallest to largest and, in ANOVA tables, results are reported
from smallest model to largest, a sequence of models is evaluated from largest model to smallest.
Thus, we begin the analysis of Table 14.3 at the bottom, looking for interaction. The rows for
Mother∗Litter interaction are identical in both tables. The sum of squares and degrees of freedom
for Mother∗Litter interaction in the table is obtained by differencing the error sums of squares and
degrees of freedom for models (3) and (2). If the interaction is significant, there is little point in
looking at the rest of the ANOVA table. One can either analyze the data as a one-way ANOVA or, as
will be discussed in later chapters, try to model the interaction by developing models intermediate
between models (2) and (3), cf. Subsection 15.3.2.
Our interaction F statistic is quite small, so there is little evidence of interaction and we proceed
with an analysis of Model (3). In particular, we now examine the main effects. Table 14.3 shows
clear effects for both Mothers ignoring Litters (F = 4.74) and Mothers after fitting Litters (F = 4.76)
with little evidence for Litters fitted after Mothers (F = 0.39) or Litters ignoring Mothers (F = 0.37).
The difference in the error sums of squares for models (4) [L] and (3) [L][M] is the sum of
squares reported for Mothers in the second of the two ANOVA tables in Table 14.3. The difference
in the error sums of squares for models (6) [G] and (5) [M] is the sum of squares reported for
Mothers in the first of the two ANOVA tables in Table 14.3. The difference in the error sums of
squares for models (5) [M] and (3) [L][M] is the sum of squares reported for Litters in the first of
the two ANOVA tables in Table 14.3. The difference in the error sums of squares for models (6)
[G] and (4) [L] is the sum of squares reported for Litters in the second of the ANOVA tables in
Table 14.3.
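The differencing described above is mechanical, as the following sketch shows. The SSE values are those of Table 14.2; the labels and the function `sequential_ss` are our own illustrative choices.

```python
# Error sums of squares from Table 14.2, keyed by the terms fitted.
sse = {"G": 4100.13, "L": 4039.97, "M": 3328.52, "L+M": 3264.89, "LM": 2440.82}


def sequential_ss(order):
    """Sequential sums of squares for a sequence of nested models.

    Each entry is SSE(smaller model) - SSE(next larger model).
    """
    return [round(sse[smaller] - sse[larger], 2)
            for smaller, larger in zip(order, order[1:])]


# Mothers fitted before Litters: sequence (6), (5), (3), (2).
mothers_first = sequential_ss(["G", "M", "L+M", "LM"])
# Litters fitted before Mothers: sequence (6), (4), (3), (2).
litters_first = sequential_ss(["G", "L", "L+M", "LM"])

print("Mothers, Litters after Mothers, interaction:", mothers_first)
print("Litters, Mothers after Litters, interaction:", litters_first)
```

The last entry, the interaction sum of squares, is the same in both orders, while the two main-effect entries differ because the data are unbalanced.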
Balanced two-way ANOVA is the special case where Ni j = N for all i and j. For balanced
ANOVA the two ANOVA tables (cf. Table 14.3) would be identical.
yi jk = μ + αi + η j + γi j + εi jk
where μ is an overall effect (grand mean), the αi s are effects for litter genotype, the η j s are effects
for foster mother genotype, and the γi j s are effects for each combination of a litter genotype and
foster mother genotype. Just like regression programs, general linear models programs typically fit
a sequence of models where the sequence is determined by the order in which the terms are specified.
Thus, specifying Model (2) causes the sequence (6), (4), (3), (2) to be fitted and the second ANOVA
table in Table 14.3 to be produced. Specifying the equivalent but reordered model
yi jk = μ + η j + αi + γi j + εi jk
causes the sequence (6), (5), (3), (2) to be fitted and the first ANOVA table in Table 14.3 to be
produced.
When obtaining an analysis of Model (2), many computer programs give ANOVA tables with
either the sequential sums of squares or “adjusted” sums of squares. Adjusted sums of squares are
for adding a term to the model last. Thus, in Model (2) the adjusted sums of squares for Litters is
the sum of squares for dropping Litters out of the model
yi jk = μ + η j + γi j + αi + εi jk .
This is idiotic! As we have mentioned, the γi j terms can explain anything the αi or η j terms can
explain, so the model without Litter main effects
yi jk = μ + η j + γi j + εi jk
is equivalent to Model (2).
What do these adjusted sums of squares really mean? Unfortunately, you have to enter the bow-
els of the computer program to find out. Most computer programs build in side conditions that allow
them to give some form of parameter estimates. Only Model (1) really allows all the parameters to
be estimated. In any of the other models, parameters cannot be estimated without imposing some
arbitrary side conditions. In the interaction model (2) the adjusted sums of squares for main effects
depend on these side conditions, so programs that use different side conditions (and programs DO
use different side conditions) give different adjusted sums of squares for main effects after interac-
tion. These values are worthless! Unfortunately, many programs, by default, produce mean squares,
F statistics, and P values using these adjusted sums of squares. The interaction sum of squares and
F test are not affected by this issue.
To be fair, if you are dealing with Model (3) instead of Model (2), the adjusted sums of squares
are perfectly reasonable. In Model (3),
yi jk = μ + αi + η j + εi jk ,
the adjusted sum of squares for Litters just compares Model (3) to Model (5) and the adjusted sum of
squares for Mothers compares Model (3) to Model (4). Adjusted sums of squares are only worthless
when you fit main effects after having already fit an interaction that involves the main effect.
yi jk = μ + αi + εi jk .
However, it is possible that in this model there may be no apparent effect for Mothers (when Litters
are ignored), so dropping Mothers is suggested and we get the model
yi jk = μ + εi jk .
This model contradicts our first conclusion that there is an effect for Mothers, albeit one that only
shows up after adjusting for Litters. These issues are discussed more extensively in Christensen
(2011, Section 7.5).
14.1.5 Diagnostics
It is necessary to consider the validity of our assumptions. Table 14.4 contains many of the standard
diagnostic statistics used in regression analysis. They are computed from the interaction model
(2). Model (2) is equivalent to a one-way ANOVA model, so the leverage associated with yi jk in
Table 14.4 is just 1/Nij.
Figures 14.1 and 14.2 contain diagnostic plots. Figure 14.1 contains a normal plot of the stan-
dardized residuals, a plot of the standardized residuals versus the fitted values, and boxplots of the
residuals versus Litters and Mothers, respectively. Figure 14.2 plots the leverages, the t residuals,
and Cook’s distances against case numbers. The plots identify one potential outlier. From Table 14.4
this is easily identified as the observed value of 68.0 for Litter I and Foster Mother A. This case has
by far the largest standardized residual r, standardized deleted residual t, and Cook’s distance C. We
can test whether this case is consistent with the other data. The t residual of 4.02 has an unadjusted
P value of 0.000225. If we use a Bonferroni adjustment for having made n = 61 tests, the P value is
61 × 0.000225 ≈ 0.014. There is substantial evidence that this case does not belong with the other
data.
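The Bonferroni adjustment used here is a one-line computation. In the sketch below the function name `bonferroni_p` is ours, and the cap at 1 is the usual convention for adjusted P values rather than something stated in the text.

```python
def bonferroni_p(p_unadjusted, n_tests):
    """Bonferroni-adjusted P value: multiply by the number of tests, cap at 1."""
    return min(1.0, n_tests * p_unadjusted)


# One t residual test for each of the n = 61 cases.
p_adj = bonferroni_p(0.000225, 61)
print(f"adjusted P = {p_adj:.3f}")
```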
so
\[ F_{obs} = \frac{1263.48/9}{40.58} = \frac{140.39}{40.58} = 3.46, \]
with a one-sided P value of .003. The interaction is significant, so we could reasonably go back to
treating the data as a one-way ANOVA with 16 groups. Typically, we would print out the 16 group
means and try to figure out what is going on. But in this case, most of the story is determined by the
plot of the standardized residuals versus the fitted values for the deleted data, Figure 14.3.
Case 12 was dropped from the Litter I, Mother A group that contained three observations. After
dropping case 12, that group has two observations and, as can be seen from Figure 14.3, that
group has a far lower sample mean and has far less variability than any other group. In this example,
deleting the one observation that does not seem consistent with the other data makes the entire group
inconsistent with the rest of the data.
[Figure 14.1: Normal plot of the standardized residuals; standardized residuals versus fitted values; boxplots of standardized residuals versus Litters and Mothers.]
[Figure 14.2: Leverages, t residuals (rstudent), and Cook’s distances plotted against case number i.]
The small mean value for the Litter I, Mother A group after deleting case 12 is causing the
interaction. If we delete the entire group, the interaction test becomes
\[ F_{obs} = \frac{578.74/8}{1785.36/43} = 1.74, \qquad (14.1.7) \]
which gives a P value of 0.117. Note that by dropping the Litter I, Mother A group we go from
our original 61 observations to 58 observations, but we also go from 16 groups to 15 groups, so
dfE(2) = 58−15 = 43. On the other hand, the number of free parameters in Model (3) is unchanged,
so dfE(3) = 58 − 7 = 51, which leaves us 8 degrees of freedom in the numerator of the test.
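The degrees-of-freedom bookkeeping is worth checking explicitly, because deleting a whole group changes the interaction model but not the additive model. The sketch below uses our own helper names and the sums of squares quoted in Equation (7).

```python
def dfe_interaction(n_obs, n_groups):
    """Error df for the one-way (interaction) model: one mean per nonempty group."""
    return n_obs - n_groups


def dfe_additive(n_obs, a, b):
    """Error df for the additive model: 1 + (a - 1) + (b - 1) free parameters."""
    return n_obs - (1 + (a - 1) + (b - 1))


# Full data: 61 observations in 16 groups.
full_data = (dfe_interaction(61, 16), dfe_additive(61, 4, 4))

# Litter I, Mother A group deleted: 58 observations in 15 groups.
dfe2 = dfe_interaction(58, 15)
dfe3 = dfe_additive(58, 4, 4)
num_df = dfe3 - dfe2

# Interaction test of Equation (7).
f_obs = (578.74 / num_df) / (1785.36 / dfe2)
print(f"dfE(2) = {dfe2}, dfE(3) = {dfe3}, numerator df = {num_df}")
print(f"F = {f_obs:.2f}")
```

The first pair of values recovers the 45 and 54 error degrees of freedom reported for the full data in Table 14.2.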
[Figure 14.3: Residual-Fitted plot of standardized residuals versus fitted values, case 12 deleted.]
The Litter I, Mother A group is just weird. It contains three cases, the two smallest along with
the third largest case out of 61 total cases. It is weird if we leave case 12 in the data and it is weird
if we take case 12 out of the data. With all the data, the best-fitting model is (5). Deleting the Litter
I, Mother A group, the best-fitting model again turns out to be (5).
For the full data and Model (5), LSD at the 5% level can be summarized as
Mother Mean
F 58.700 A
A 55.400 A
I 53.362 A B
J 48.680 B
The residual plots and diagnostics look reasonably good for this model. The plots and diagnostics
are different from those given earlier for Model (2).
For Model (5) with the Litter I, Mother A group removed, LSD at the 5% level can be summa-
rized as
Mother Mean
F 58.700 A
A 57.315 A B
I 53.362 C B
J 48.680 C
(These are unbalanced one-way ANOVAs, so there is no guarantee that such displays can be con-
structed.) Again, the residual plots and diagnostics look reasonably good but are different from those
for the full data models (2) and (5).
The main difference between these two analyses is that one has Mothers F and I significantly
different and the other does not. Given that the change to the analysis consisted of deleting ob-
servations from Mother A leaving groups F and I alone, that is somewhat strange. The confidence
intervals for μI − μF are (−10.938, 0.263) for the full data and (−10.248, −0.427) for the deleted
data, so one is just barely insignificant for testing H0 : μI − μF = 0 and the other is just barely signif-
icant. The discrepancy comes from using different MSEs. Both are based on Model (5) but they are
based on different data. It would be preferable to base the LSDs on MSEs from Model (2), but the
results would still be different for the different data. For all of the weirdness of the Litter I, Mother
A group, in the end, the results are remarkably consistent whether we delete the group or not.
Finally, we have enough information to test whether the three observations in the Litter I, Mother
A group are collectively a group of outliers. We do that by testing a full model that is Model (2)
defined for the deleted data against a reduced model that is Model (2) for the full data. This may
seem like a backwards definition of a full and reduced model, but the deleted data version of Model
(2) can be obtained from the full data Model (2) by adding a separate parameter for each of the three
points we want to delete. Using information from Equation (7) and either of Table 14.2 or 14.3,
\[ F_{obs} = \frac{[2440.82 - 1785.36]/[45 - 43]}{1785.36/43} = \frac{327.73}{41.52} = 7.89, \]
which is highly significant: statistical evidence that Litter I, Mother A is a weird group. The numerator
degrees of freedom is 2. Model (2) for the full data already has one parameter for the Litter I,
Mother A group, so we need add only two more free parameters to have a separate parameter for
every observation in the group.
The interesting part of any analysis is figuring out how the groups really differ. To do that, you need
to look at contrasts. We examined contrasts for one-way ANOVA models in Chapter 12, and all the
models we have looked at in this chapter, except the additive-effects model, have been essentially
one-way ANOVA models. In particular, our final conclusions about Mothers in the previous section
came from the one-way ANOVA that ignored Litters.
But what if we could not ignore Litters? What if we needed to see how Mothers differed in the
additive-effects model rather than a one-way ANOVA? As mentioned earlier, when dealing with the
additive-effects model you cannot just compute what you want on a hand calculator. You have to be
able to coax whatever information you need out of a computer program. These issues are addressed
in this section and the next. You can generally get everything you need by fitting equivalent models
in a regression program as discussed in the next section. Here we focus on extracting information
from an ANOVA program, i.e., we focus on manipulating the subscripts that are fed into an ANOVA
program.
When the treatments have no structure to exploit, one way to start is by looking for evidence of
differences between all pairs of means.
Parameter     Est        SE(Est)    t         Bonferroni P
ηF − ηA        3.516     2.862       1.229    1.0000
ηI − ηA       −1.832     2.767      −0.662    1.0000
ηJ − ηA       −6.755     2.810      −2.404    0.1182
ηI − ηF       −5.35      2.863      −1.868    0.4029
ηJ − ηF      −10.27      2.945      −3.488    0.0059
ηJ − ηI       −4.923     2.835      −1.736    0.5293
If there are b levels of the second factor, as there are 4 levels of Mother, there are b(b − 1)/2 =
4(3)/2 = 6 pairwise comparisons to make. Of these, we will see in the next section that we can get
b − 1 = 4 − 1 = 3 of them by fitting a regression model. Some programs, like Minitab, will provide
all of these pairwise comparisons.
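Only b − 1 of the estimated differences are free; the rest follow by subtraction. The sketch below counts the pairs and recovers all six estimates from the three differences with Mother A given in the table. The variable names are ours. Note that the standard errors cannot be obtained this way, since they depend on the covariances between the estimates.

```python
from itertools import combinations

levels = ["A", "F", "I", "J"]

# Estimated differences from Mother A, from the table; eta_A - eta_A is 0.
vs_a = {"A": 0.0, "F": 3.516, "I": -1.832, "J": -6.755}

pairs = list(combinations(levels, 2))   # b(b - 1)/2 pairs

# Every pairwise difference is a difference of two "versus A" estimates.
diffs = {
    f"{second} - {first}": round(vs_a[second] - vs_a[first], 3)
    for first, second in pairs
}

print(len(pairs), "pairwise comparisons")
for name, est in diffs.items():
    print(f"eta_{name[0]} - eta_{name[-1]}: {est}")
```

The derived values agree with the tabled estimates up to the rounding used in the table.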
It is tempting to just summarize these results and be done with them. For an LSD procedure with
α = 0.05 (actually specified in Minitab as a Bonferroni procedure with α = 0.3), these results can
be summarized by
Mother Mean
F 58.8 A
A 55.2 A
I 53.4 A B
J 48.5 B
It is by no means clear what the “Mean” values are. (They are explained in the next section.) But
what is important, and is reported correctly, are the relative differences among the “Mean” values.
From the display, we see no differences among Mothers F, A, and I and no difference between
Mothers I and J. We do see differences between F and J and between A and J.
Unfortunately, as discussed in Chapter 13, it is possible that no such display can be generated be-
cause it is possible to have, say, a significant difference between F and A but no significant difference
between F and I. This is possible, for example, if SE(η̂F − η̂A ) is much smaller than SE(η̂F − η̂I ).
Based on the pairwise testing results, one could perform a backwards elimination. The pair of
means with the least evidence for a difference from 0 is ηI − ηA with tobs = −0.662. We could
incorporate the assumption ηI = ηA into the model and look for differences between the remaining
three groups: Mothers F, Mothers J, and the combined group Mothers A or I and continue the process
of finding groups that could be combined. If we followed this procedure, at the next step we would
combine Mothers A, F and I and then finally conclude that J was different from the other three.
Another plausible model might be to combine J with A and I and leave F separate. These additive
models with ηA = ηI , ηA = ηI = ηF , and ηA = ηI = ηJ have respective Cp values of 11.7, 13.2, and
16.0. Only the model with ηA = ηI is a minor improvement over the full two-factor additive-effects
model, which has Cp = 13.2 as reported in Table 14.2.
The other methods of Section 12.4 extend easily to two-factor models but the results depend on
the specific model in which we incorporate the hypotheses.
rather than the 1.68 we got from the additive model without assuming that ηI = ηJ .
In this example, the test statistics are noticeably, but not substantially, different. With other data,
the differences can be much more substantial.
In a balanced ANOVA, the numerators for these three tests would all be identical and the only
differences in the tests would be due to alternative choices of a MSE for the denominator.
yi jk = μ + α1 xi j1 + α2 xi j2 + α3 xi j3 + α4 xi j4 + η1 xi j5 + η2 xi j6 + η3 xi j7 + η4 xi j8 + εi jk .
xi j1 + xi j2 + xi j3 + xi j4 = 1 = xi j5 + xi j6 + xi j7 + xi j8.
Also, associated with the grand mean μ is a predictor variable that always takes the value 1, say, x0 ≡
1. To make a regression model out of the additive-effects model we need to drop one variable from
two of the three sets of variables {x0 }, {x1 , x2 , x3 , x4 }, {x5 , x6 , x7 , x8 }. We illustrate the procedures
by dropping two of the three variables, x0 , x2 (the indicator for Litter F), and x8 (the indicator for
Mother J).
If we drop x2 and x8 the model becomes
yi jk = δ + γ1 xi j1 + γ3 xi j3 + γ4 xi j4 + β1 xi j5 + β2xi j6 + β3 xi j7 + εi jk . (14.3.1)
In this model, the Litter F, Mother J group becomes a baseline group and
δ = μ + α2 + η4 , γ1 = α1 − α2 , γ3 = α3 − α2 , γ4 = α4 − α2 ,
β1 = η1 − η4 , β2 = η2 − η4 , β3 = η3 − η4 .
After fitting Model (1), the Table of Coefficients gives immediate results for testing whether differ-
ences exist between Mother J and each of Mothers A, F, and I. It also gives immediate results for
testing no difference between Litter F and each of Litters A, I, and J.
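The relation between the two parameterizations can be checked numerically. The sketch below uses made-up parameter values (they are assumptions, not estimates from the rat data) and confirms that Model (14.3.1), with δ, γ, and β defined as above, reproduces every cell mean μ + αi + ηj of the additive-effects model.

```python
from itertools import product

# Hypothetical parameter values, chosen only for illustration.
mu = 50.0
alpha = {1: 2.0, 2: -1.0, 3: 0.5, 4: 3.0}  # Litter effects; level 2 is Litter F
eta = {1: 4.0, 2: 7.0, 3: 1.5, 4: -2.0}    # Mother effects; level 4 is Mother J

# Parameters of Model (14.3.1), obtained by dropping x2 and x8.
delta = mu + alpha[2] + eta[4]
gamma = {i: alpha[i] - alpha[2] for i in (1, 3, 4)}
beta = {j: eta[j] - eta[4] for j in (1, 2, 3)}


def additive_mean(i, j):
    """Cell mean under the overparameterized additive-effects model."""
    return mu + alpha[i] + eta[j]


def regression_mean(i, j):
    """Cell mean under Model (14.3.1); dropped indicators contribute 0."""
    return delta + gamma.get(i, 0.0) + beta.get(j, 0.0)


# Largest disagreement over all 16 Litter-by-Mother cells.
max_gap = max(abs(additive_mean(i, j) - regression_mean(i, j))
              for i, j in product(range(1, 5), repeat=2))
print(max_gap)
```

The baseline cell (Litter F, Mother J) has mean δ, and each remaining coefficient is a difference from that baseline, which is why the Table of Coefficients tests comparisons with Litter F and Mother J directly.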
350 14. TWO-WAY ANOVA
If we drop x0 , x8 the model becomes
yi jk = γ1 xi j1 + γ2 xi j2 + γ3 xi j3 + γ4 xi j4 + β1xi j5 + β2xi j6 + β3 xi j7 + εi jk
but now
γ1 = μ + α1 + η4 , γ2 = μ + α2 + η4 , γ3 = μ + α3 + η4 , γ4 = μ + α4 + η4 ,
β1 = η1 − η4 , β2 = η2 − η4 , β3 = η3 − η4 ,
so the Table of Coefficients still gives immediate results for testing whether differences exist be-
tween Mother J and Mothers A, F, and I.
If we drop x0 , x2 the model becomes
yi jk = γ1 xi j1 + γ3 xi j3 + γ4 xi j4 + β1xi j5 + β2 xi j6 + β3xi j7 + β4 xi j8 + εi jk .
Now
β1 = μ + η1 + α2 , β2 = μ + η2 + α2 , β3 = μ + η3 + α2 , β4 = μ + η4 + α2 ,
γ1 = α1 − α2 , γ3 = α3 − α2 , γ4 = α4 − α2 .
The Table of Coefficients still gives immediate results for testing whether differences exist between
Litter F and Litters A, I, and J.
To illustrate these claims, we fit Model (1) to the rat data to obtain the following Table of
Coefficients.
Table of Coefficients
Predictor Est SE(Est) t P
Constant (δ ) 48.129 2.867 16.79 0.000
x1 :L-A (γ1 ) 2.025 2.795 0.72 0.472
x3 :L-I (γ3 ) −0.628 2.912 −0.22 0.830
x4 :L-J (γ4 ) 0.004 2.886 0.00 0.999
x5 :M-A (β1 ) 6.755 2.810 2.40 0.020
x6 :M-F (β2 ) 10.271 2.945 3.49 0.001
x7 :M-I (β3 ) 4.923 2.835 1.74 0.088
If, for example, you ask Minitab’s GLM procedure to test all pairs of Mother effects using a
Bonferroni adjustment, you get the table reported earlier,
Bonferroni
Parameter Est SE(Est) t P
ηF − ηA 3.516 2.862 1.229 1.0000
ηI − ηA −1.832 2.767 −0.662 1.0000
ηJ − ηA −6.755 2.810 −2.404 0.1182
ηI − ηF −5.35 2.863 −1.868 0.4029
ηJ − ηF −10.27 2.945 −3.488 0.0059
ηJ − ηI −4.923 2.835 −1.736 0.5293
Note that the estimate, say, β̂2 = η̂2 − η̂4 = η̂F − η̂J = 10.271, is the negative of the estimate of
ηJ − ηF , that they have the same standard error, that the t statistics are the negatives of each other,
and that the Bonferroni P values are 6 times larger than the Table of Coefficients P values. Similar
results hold for β1 = η1 − η4 = ηA − ηJ and β3 = η3 − η4 = ηI − ηJ .
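A minimal sketch of the Bonferroni relation just described: each unadjusted P value is multiplied by the number of comparisons (6 here) and capped at 1. The function name is hypothetical. Because the Table of Coefficients reports P values to only three decimals, the products agree with Minitab's Bonferroni column only up to rounding.

```python
def bonferroni(p_value, n_comparisons):
    """Bonferroni-adjusted P value, capped at 1."""
    return min(1.0, n_comparisons * p_value)


# Unadjusted P values for the Mother coefficients in the Table of Coefficients.
table_p = {"A - J": 0.020, "F - J": 0.001, "I - J": 0.088}

# Six pairwise comparisons among four Mothers.
adjusted = {name: bonferroni(p, 6) for name, p in table_p.items()}
print(adjusted)
```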
A display of results given earlier was
Mother Mean
F 58.8 A
A 55.2 A
I 53.4 A B
J 48.5 B
14.4 HOMOLOGOUS FACTORS 351
The problems with the display are that the column of “Mean” values has little meaning and that
no meaningful display may be possible because standard errors depend on the difference being
estimated. As for the Mean values, the relative differences among the Mother effects are portrayed
correctly, but the actual numbers are arbitrary. The relative sizes of Mother effects must be the
same for any Litter, but there is nothing one could call an overall Mother effect. You could add any
constant to each of these four Mean values and they would be just as meaningful.
To obtain these “Mean” values as given, fit the model
h = 1, . . . , 61. This model is overparameterized. If we just run the model, most good regression
programs are smart enough to throw out redundant parameters (predictor variables). Performing
this operation ourselves, we fit the model
that eliminates LJ and MJ . Remember, Model (4) is equivalent to (1) and (3). Fitting Model (4), we
have little interest in the Table of Coefficients but the ANOVA table follows.
Analysis of Variance: Model (14.4.4)
Source df SS MS F P
Regression 6 835.24 139.21 2.30 0.047
Error 54 3264.89 60.46
Total 60 4100.13
As advertised, the Error line agrees with the results given for the no-interaction model (14.1.2) in
Section 1.
To fit the symmetric-additive-effects model (2), we incorporate the assumption α1 =
η1 , . . . , α4 = η4 into Model (3) getting
or
yh = μ + α1 (LhA + MhA ) + α2 (LhF + MhF ) + α3 (LhI + MhI ) + α4 (LhJ + MhJ ) + εh .
Fitting this model requires us to construct new regression variables, say,
A = LA + MA
F = LF + MF
I = LI + MI
J = LJ + MJ .
The symmetric-additive-effects model (2) is then written as
yh = μ + α1 Ah + α2 Fh + α3 Ih + α4 Jh + εh ,
or, emphasizing that the parameters mean different things,
yh = γ0 + γ1 Ah + γ2 Fh + γ3 Ih + γ4 Jh + εh .
This model is also overparameterized, so we actually fit
yh = γ0 + γ1 Ah + γ2 Fh + γ3 Ih + εh , (14.4.5)
giving
Table of Coefficients: Model (14.4.5)
Predictor γ̂k SE(γ̂k ) t P
Constant 48.338 2.595 18.63 0.000
A 4.159 1.970 2.11 0.039
F 5.049 1.912 2.64 0.011
I 1.998 1.927 1.04 0.304
and
Analysis of Variance: Model (14.4.5)
Source df SS MS F P
Regression 3 513.64 171.21 2.72 0.053
Error 57 3586.49 62.92
Total 60 4100.13
We need the ANOVA table Error line to test whether the symmetric-additive-effects model (2)
fits the data adequately relative to the additive-effects model (1). The test statistic is
\[
F_{obs} = \frac{[3586.49 - 3264.89]/[57 - 54]}{60.46} = 1.773
\]
with P = 0.164, so the model seems to fit. As discussed in Section 14.1, it would be reasonable to
use the interaction model MSE in the denominator of the F statistic, which makes
\[
F_{obs} = \frac{[3586.49 - 3264.89]/[57 - 54]}{54.24} = 1.976,
\]
but the P value remains a relatively high 0.131.
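The two F statistics differ only in the mean squared error used in the denominator. A small Python sketch of the arithmetic (the function name is illustrative, and P values are not computed here):

```python
def f_statistic(sse_reduced, dfe_reduced, sse_full, dfe_full, mse_denominator):
    """F statistic for a reduced model against a full model.

    The numerator is the drop in error sum of squares per degree of
    freedom; the denominator MSE is supplied separately so that either
    the additive or the interaction model's MSE can be used.
    """
    numerator = (sse_reduced - sse_full) / (dfe_reduced - dfe_full)
    return numerator / mse_denominator


# Symmetric additive model (14.4.5) against the additive-effects model.
f_additive_mse = f_statistic(3586.49, 57, 3264.89, 54, 60.46)
f_interaction_mse = f_statistic(3586.49, 57, 3264.89, 54, 54.24)
print(round(f_additive_mse, 3), round(f_interaction_mse, 3))
```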
Presuming that the symmetric-additive-effects model (2) fits, we can interpret the Table of
Coefficients. We dropped variable J in the model, so the constant term γ̂0 = 48.338 estimates the effect
of having genotype J for both Litters and Mothers, i.e., γ0 = μ + 2α4 . The estimated regression co-
efficient for A, γ̂1 = 4.159, is the estimated effect for the difference between the genotype A effect
and the genotype J effect. The P value of 0.039 indicates weak evidence for a difference between
genotypes A and J. Similarly, there is pretty strong evidence for a difference between genotypes F
and J (P = 0.011) but little evidence for a difference between genotypes I and J (P = 0.304). From
the table of coefficients, the estimated effect for having, say, litter genotype A and mother genotype I
is 48.338 + 4.159 + 1.998 = 54.495, which is the same as for litter genotype I and mother genotype
A.
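The symmetry of the fitted values can be seen directly from the coefficients. In the sketch below, the estimates are those reported for Model (14.4.5); the function name is hypothetical. Because each regression variable is the sum of a Litter indicator and a Mother indicator, each factor contributes its genotype's coefficient once.

```python
# Estimates from the Table of Coefficients for Model (14.4.5).
intercept = 48.338
coef = {"A": 4.159, "F": 5.049, "I": 1.998, "J": 0.0}  # J was dropped, so it adds 0


def fitted(litter, mother):
    """Fitted mean for a litter genotype and a mother genotype."""
    return intercept + coef[litter] + coef[mother]


print(round(fitted("A", "I"), 3), round(fitted("I", "A"), 3))
print(round(fitted("J", "J"), 3))
```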
is actually equivalent to the no-interaction model (1). Thus, our test for whether the symmetric
additive model fits can also be thought of as a test for whether skew symmetric additive effects
exist after fitting symmetric additive effects and our test for whether the skew symmetric additive
model fits can also be thought of as a test for whether symmetric additive effects exist after fitting
skew symmetric additive effects. Neither the symmetric additive model (2) nor the skew symmetric
additive model (6) is comparable to either of the single-effects-only models (14.1.4) and (14.1.5).
14.4.3 Symmetry
The assumption of symmetry is that the two factors are interchangeable. Think again about our
fathers’ and mothers’ education. Under symmetry, there is no difference between having a college
graduate father and postgraduate mother as opposed to having a postgraduate father and college
graduate mother. Symmetric additive models display this symmetry but impose the structure that
there is some consistent effect for, say, being a college graduate and for being a high school graduate.
But symmetry can exist even when no overall effects for educational levels exist. For overall effects
to exist, the effects must be additive.
Table 14.5 gives examples of symmetric additive and symmetric nonadditive effects. The sym-
metric additive effects have a “grand mean” of 10, an effect of 0 for being less than a HS Grad, an
effect of 2 for a HS Grad, an effect of 3 for some college, and an effect of 5 for both college grad
and postgrad. The nonadditive effects were obtained by modifying the symmetric additive effects.
In the nonadditive effects, any pair where both parents have any college is up 3 units and any pair
where both parents are without any college is down 2 units.
In Subsection 14.4.1 we looked carefully at the symmetric-additive-effects model, which is a
special case of the additive-effects (no-interaction) model. Now we impose symmetry on the inter-
action model.
Rather than the interaction model (14.1.2) we focus on the equivalent one-way ANOVA model
(14.1.1), i.e.,
yi jk = μi j + εi jk . (14.4.8)
For rat genotypes, i = 1, . . . , 4 and j = 1, . . . , 4 are used together to indicate the 16 groups. Alterna-
tively, we can replace the pair of subscripts i j with a single subscript r = 1, . . . , 16,
The top half of Table 14.6 shows how the subscripts r identify the 16 groups. The error for this
model should agree with the error for Model (14.1.2), which was given in Section 1. You can see
from the ANOVA table that it does.
Analysis of Variance: Model (14.4.9)
Source df SS MS F P
Rat groups 15 1659.3 110.6 2.04 0.033
Error 45 2440.8 54.2
Total 60 4100.1
μi j = μ ji .
This places no restrictions on the four groups with i = j. Translating the symmetry restriction into
Model (9) with the identifications of Table 14.6, symmetry becomes
Imposing these restrictions on the one-way ANOVA model (9) amounts to constructing a new
one-way ANOVA model with only 10 groups. Symmetry forces the 6 pairs of groups (i, j) and ( j, i)
with i ≠ j to act like 6 single groups, and the 4 groups with i = j are unaffected by symmetry. The
bottom half of Table 14.6 provides subscripts s for the one-way ANOVA model that incorporates
symmetry
ysm = μs + εsm . (14.4.10)
Note that in the nonadditive symmetry model (10), the second subscript for identifying observations
within a group also has to change. There are still 61 observations, so if we use fewer groups, we
must have more observations in some groups.
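The collapse from 16 groups to 10 can be sketched by treating each (Litter, Mother) pair as an unordered set. The group numbers assigned below are hypothetical and need not match the subscripts s in Table 14.6; only the grouping itself is the point.

```python
from itertools import product

genotypes = ["A", "F", "I", "J"]

# Number the symmetric groups in order of first appearance (hypothetical labels).
group_of = {}
for litter, mother in product(genotypes, repeat=2):
    key = frozenset((litter, mother))  # {litter, mother} ignores order
    if key not in group_of:
        group_of[key] = len(group_of) + 1


def symmetric_group(litter, mother):
    """Group label shared by the cells (litter, mother) and (mother, litter)."""
    return group_of[frozenset((litter, mother))]


n_groups = len(group_of)
n_merged_pairs = sum(1 for key in group_of if len(key) == 2)
print(n_groups, n_merged_pairs)
```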
Fitting the nonadditive symmetry model gives
Analysis of Variance: Model (14.4.10)
Source df SS MS F P
Symmetric groups 9 1159.4 128.8 2.23 0.034
Error 51 2940.8 57.7
Total 60 4100.1
Testing this against the interaction model (6) provides
If we used all 16 of the predictor variables in the symmetry and skew symmetry model, we would
have a model equivalent to the interaction model (8), so the test for the adequacy of the symmetry
model is also a test for whether skew symmetry adds anything after fitting symmetry.
It is hard to imagine good applications of skew symmetry. Perhaps less so if we added an inter-
cept to the skew symmetry variables, since that would make it a generalization of the skew symmet-
ric additive model. In fact, to analyze homologous factors without replications, i.e., when Model (8)
has 0 degrees of freedom for error, McCullagh (2000) suggests using the MSE from the symmetry
model as a reasonable estimate of error, i.e., take as error the mean square for skew symmetry after
fitting symmetry.
(14.1.2) = (14.4.9)
(14.4.10) (14.1.3)
(14.4.2)
(14.1.6).
If the additive two-factor model does not fit, we might try the symmetry model. The symmetry
model is a reduced model relative to the interaction model but is not comparable to the additive two-
factor model. As in Subsection 14.1.2, there are two sequences of models that go from the smallest
model to the largest and we could use an ANOVA table similar to Table 14.3.
Rather than doing all this testing, we could look at Cp statistics as given in Table 14.7. The
model with only Mother effects is still looking good.
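The Cp values in Table 14.7 can be reproduced from the SSE and error degrees of freedom alone. The sketch below assumes the usual Mallows form Cp = SSE/MSE(full) − (n − 2p), with the interaction model as the full model and p = n − dfE; this section does not restate that formula, so treat it as an assumption of the sketch.

```python
n = 61  # number of observations in the rat data

# (SSE, error df) for each model, taken from Table 14.7.
models = {
    "[LM]": (2440.82, 45),
    "[L][M]": (3264.89, 54),
    "[L]": (4039.97, 57),
    "[M]": (3328.52, 57),
    "[G]": (4100.13, 60),
    "Symmetric Additive": (3586.49, 57),
    "Skew Symmetric Additive": (3802.39, 57),
    "Symmetric Nonadditive": (2940.8, 51),
}

# The interaction model [LM] supplies the MSE.
sse_full, dfe_full = models["[LM]"]
mse_full = sse_full / dfe_full


def mallows_cp(sse, dfe):
    """Cp = SSE/MSE_full - (n - 2p), where p = n - dfe mean parameters."""
    p = n - dfe
    return sse / mse_full - (n - 2 * p)


cp = {name: round(mallows_cp(sse, dfe), 1) for name, (sse, dfe) in models.items()}
print(cp)
```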
14.5 Exercises
EXERCISE 14.5.1. Cochran and Cox (1957) presented data from Pauline Paul on the effect of
cold storage on roast beef tenderness. Treatments are labeled A through F and consist of 0, 1, 2, 4, 9,
and 18 days of storage, respectively. The data are tenderness scores and are presented in Table 14.8.
Table 14.7: Cp statistics for fitting models to the data of Table 14.1.
Model      Description              SSE      df  Cp
(14.1.2)   [LM]                     2440.82  45  16.0
(14.1.3)   [L][M]                   3264.89  54  13.2
(14.1.4)   [L]                      4039.97  57  21.5
(14.1.5)   [M]                      3328.52  57   8.4
(14.1.6)   [G]                      4100.13  60  16.6
(14.4.2)   Symmetric Additive       3586.49  57  13.1
(14.4.6)   Skew Symmetric Additive  3802.39  57  17.1
(14.4.10)  Symmetric Nonadditive    2940.8   51  13.2
Analyze the data using an additive two-way ANOVA model involving blocks and treatments. Focus
your analysis on identifying differences between treatments.
EXERCISE 14.5.2. Inman et al. (1992) report data on the percentages of Manganese (Mn) in
various samples as determined by a spectrometer. Ten samples were used and the percentage of
Mn in each sample was determined by each of 4 operators. The data are given in Table 14.9. The
operators actually made two readings; the data presented are the averages of the two readings for
each sample–operator combination.
Using an additive two-way ANOVA model, analyze the data. Include in your analysis an eval-
uation of whether any operators are significantly different. Identify a potential outlier, delete that
outlier, reanalyze the data, and compare the results of the two analyses.
EXERCISE 14.5.3. Nelson (1993) presents data on the average access times for various disk
drives. The disk drives are five brands of half-height fixed drives. The performance of disk drives
depends on the computer in which they are installed. The computers could only hold four disk
drives. The data are given in Table 14.10. Analyze the data using an additive two-factor model.
Focus your analysis on identifying differences among brands.
EXERCISE 14.5.4. Garner (1956) presented data on the tensile strength of fabrics. Here we con-
sider a subset of the data. The complete data and a more extensive discussion of the experimental
procedure are given in Exercise 18.7.2. The experiment involved testing fabric strengths on four
different machines. Eight homogeneous strips of cloth were divided into four samples. Each sample
was tested on one of four machines. The data are given in Table 14.11.
(a) Analyze the data using an additive two-way model focusing on machine differences. Give an
appropriate analysis of variance table. Examine appropriate contrasts using Bonferroni’s method
with α = 0.05.
(b) Check the assumptions of the model and adjust the analysis appropriately.
EXERCISE 14.5.5. Repeat the analyses of this chapter after eliminating Litter I and Mother I
from the data in Table 14.1.
EXERCISE 14.5.6. Repeat the analyses of this chapter after eliminating Litter I and Mother J
from the data in Table 14.1.
EXERCISE 14.5.7. Repeat the analyses of this chapter after eliminating Litter I from the data
in Table 14.1. This requires extending the ideas on homologous factors to situations with unequal
numbers of factor levels.
EXERCISE 14.5.8. Explain how dropping the last term out of Model (14.4.7) gives results that
are different from dropping the indicator for the last factor level out of a one-way ANOVA model
with four groups. Focus on the intercept.
Chapter 15
Analysis of covariance (ACOVA) incorporates one or more regression variables into an analysis of
variance. As such, we can think of it as analogous to the two-way ANOVA of Chapter 14 except that
instead of having two different factor variables as predictors, we have one factor variable and one
continuous variable. The regression variables are referred to as covariates (relative to the dependent
variable), hence the name analysis of covariance. Covariates are also known as supplementary or
concomitant observations. Cox (1958, Chapter 4) gives a particularly nice discussion of the experi-
mental design ideas behind analysis of covariance and illustrates various useful plotting techniques;
also see Figure 15.4 below. In 1957 and 1982, Biometrics devoted entire issues to the analysis of
covariance. We begin our discussion with an example that involves a two-group one-way analysis
of variance and one covariate. Section 15.2 looks at an example where the covariate can also be
viewed as a factor variable. Section 15.3 uses ACOVA to look at lack-of-fit testing.
yi j = μi + εi j (15.1.1)
= μ + αi + εi j ,
where the yi j s are the heart weights, i = 1, 2, and j = 1, . . . , 24. This model yields the analysis of
Table 15.1: Body weights (kg) and heart weights (g) of domestic cats.
Females Males
Body Heart Body Heart Body Heart Body Heart
2.3 9.6 2.0 7.4 2.8 10.0 2.9 9.4
3.0 10.6 2.3 7.3 3.1 12.1 2.4 9.3
2.9 9.9 2.2 7.1 3.0 13.8 2.2 7.2
2.4 8.7 2.3 9.0 2.7 12.0 2.9 11.3
2.3 10.1 2.1 7.6 2.8 12.0 2.5 8.8
2.0 7.0 2.0 9.5 2.1 10.1 3.1 9.9
2.2 11.0 2.9 10.1 3.3 11.5 3.0 13.3
2.1 8.2 2.7 10.2 3.4 12.2 2.5 12.7
2.3 9.0 2.6 10.1 2.8 13.5 3.4 14.4
2.1 7.3 2.3 9.5 2.7 10.4 3.0 10.0
2.1 8.5 2.6 8.7 3.2 11.6 2.6 10.5
2.2 9.7 2.1 7.2 3.0 10.6 2.5 8.6
362 15. ACOVA AND INTERACTIONS
variance given in Table 15.2. Note the overwhelming effect due to sexes. We now develop a model
for both sex and weight that is analogous to the additive model (14.1.3).
yi j = μi + γ zi j + εi j (15.1.2)
= μ + αi + γ zi j + εi j .
Here the zs are the body weights and γ is a slope parameter associated with body weights. For this
example the mean model is
\[
m(\text{sex}, z) =
\begin{cases}
\mu_1 + \gamma z, & \text{if sex = female} \\
\mu_2 + \gamma z, & \text{if sex = male.}
\end{cases}
\]
Model (15.1.2) is a special case of the general additive-effects model (9.9.2). It is an extension
of the simple linear regression between the ys and the zs in which we allow a different intercept μi
for each sex but the same slope. In many ways, it is analogous to the two-way additive-effects model
(14.1.3). In Model (15.1.2) the effect of sex on heart weight is always the same for any fixed body
weight, i.e.,
(μ1 + γ z) − (μ2 + γ z) = μ1 − μ2 .
Thus we can talk about μ1 − μ2 being the sex effect regardless of body weight. The means for
females and males are parallel lines with common slope γ and |μ1 − μ2 | the distance between the
lines.
An analysis of variance table for Model (15.1.2) is given as Table 15.3. The interpretation of
this table is different from the ANOVA tables examined earlier. For example, the sums of squares
for body weights, sex, and error do not add up to the sum of squares total. The sums of squares in
Table 15.3 are referred to as adjusted sums of squares (Adj. SS) because the body weight sum of
squares is adjusted for sexes and the sex sum of squares is adjusted for body weights.
The error line in Table 15.3 is simply the error from fitting Model (15.1.2). The body weights
line comes from comparing Model (15.1.2) with the reduced model (15.1.1). Note that the only
difference between models (15.1.1) and (15.1.2) is that (15.1.1) does not involve the regression on
body weights, so by testing the two models we are testing whether there is a significant effect due
15.1 ONE COVARIATE EXAMPLE 363
to the regression on body weights. The standard way of comparing a full and a reduced model is
by comparing their error terms. Model (15.1.2) has one more parameter, γ , than Model (15.1.1), so
there is one more degree of freedom for error in Model (15.1.1) than in Model (15.1.2), hence one
degree of freedom for body weights. The adjusted sum of squares for body weights is the difference
between the sum of squares error in Model (15.1.1) and the sum of squares error in Model (15.1.2).
Given the sum of squares and the mean square, the F statistic for body weights is constructed in
the usual way and is reported in Table 15.3. We see a major effect due to the regression on body
weights.
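The adjusted sum of squares for body weights can be sketched as a difference of error sums of squares. The numbers below are taken from the sequential ANOVA table that appears later in this section (total 166.223, sex 56.117, ACOVA error 72.279 on 45 df); reading the one-way error as total minus the sex sum of squares is an assumption of this sketch, and the result matches the tabled 37.828 only up to rounding.

```python
# Values from the sequential ANOVA table for sex followed by body weights.
ss_total = 166.223
ss_sex = 56.117

# Model (15.1.1), sex only: error is what sex leaves unexplained (46 df).
sse_one_way = ss_total - ss_sex
dfe_one_way = 46

# Model (15.1.2), sex plus the regression on body weight.
sse_acova, dfe_acova = 72.279, 45

# Adjusted SS for body weights: the reduction in error sum of squares.
adj_ss_body = sse_one_way - sse_acova
df_body = dfe_one_way - dfe_acova

mse_acova = sse_acova / dfe_acova
f_body = (adj_ss_body / df_body) / mse_acova
print(round(adj_ss_body, 3), df_body, round(f_body, 2))
```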
The Sex line in Table 15.3 provides a test of whether there are differences in sexes after adjusting
for the regression on body weights. This comes from comparing Model (15.1.2) to a similar model in
which sex differences have been eliminated. In Model (15.1.2), the sex differences are incorporated
as μ1 and μ2 in the first version and as α1 and α2 in the second version. To eliminate sex differences
in Model (15.1.2), we simply eliminate the distinctions between the μ s (the α s). Such a model can
be written as
yi j = μ + γ zi j + εi j . (15.1.3)
The analysis of covariance model without treatment effects is just a simple linear regression of heart
weight on body weight. We have reduced the two sex parameters to one overall parameter, so the
difference in degrees of freedom between Model (15.1.3) and Model (15.1.2) is 1. The difference in
the sums of squares error between Model (15.1.3) and Model (15.1.2) is the adjusted sum of squares
for sex reported in Table 15.3. We see that the evidence for a sex effect over and above the effect
due to the regression on body weights is not great.
While ANOVA table Error terms are always the same for equivalent models, the table of coef-
ficients depends on the particular parameterization of a model. I prefer the ACOVA model parame-
terization
yi j = μi + γ zi j + εi j .
Some computer programs insist on using the equivalent model
yi j = μ + αi + γ zi j + εi j , (15.1.4)
which is overparameterized. To get estimates of the parameters in Model (15.1.4), one must impose
side conditions on them. My choice would be to make μ = 0 and get a model equivalent to the first
one. Other common choices of side conditions are: (a) α1 = 0, (b) α2 = 0, and (c) α1 + α2 = 0. Some
programs are flexible enough to let you specify the side conditions yourself. Minitab, for example,
uses the side conditions (c) and reports
Covariate γ̂ SE(γ̂ ) t P
Constant 2.755 1.498 1.84 0.072
Sex
1 −0.3884 0.2320 −1.67 0.101
Body Wt 2.7948 0.5759 4.85 0.000
Relative to Model (15.1.4) the parameter estimates are μ̂ = 2.755, α̂1 = −0.3884, α̂2 = 0.3884,
γ̂ = 2.7948, so the estimated regression line for females is
E(y) = [2.755 + (−0.3884)] + 2.7948z = 2.3666 + 2.7948z
and for males
E(y) = [2.755 − (−0.3884)] + 2.7948z = 3.1434 + 2.7948z,
e.g., the predicted values for females are
ŷ1 j = [2.755 + (−0.3884)] + 2.7948z1 j = 2.3666 + 2.7948z1 j
and for males are
ŷ2 j = [2.755 − (−0.3884)] + 2.7948z2 j = 3.1434 + 2.7948z2 j.
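A short sketch of how the two fitted lines follow from the reported estimates under side condition (c), where α̂2 = −α̂1. The function name is hypothetical.

```python
# Estimates reported for Model (15.1.4) under side condition (c).
mu_hat = 2.755
alpha1_hat = -0.3884
alpha2_hat = -alpha1_hat  # alpha1 + alpha2 = 0
gamma_hat = 2.7948

# Sex-specific intercepts; the slope is common to both sexes.
intercepts = {"female": mu_hat + alpha1_hat, "male": mu_hat + alpha2_hat}


def predict(sex, body_weight_kg):
    """Predicted heart weight (g) for a cat of the given sex and body weight."""
    return intercepts[sex] + gamma_hat * body_weight_kg


print(round(intercepts["female"], 4), round(intercepts["male"], 4))
```

For any fixed body weight the male prediction exceeds the female prediction by the same amount, 2(0.3884), which is the parallel-lines property of the model.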
Note that, up to sign, the t statistic for sex 1 is the square root of the F statistic for sex in Table 15.3. The
P values are identical. Similarly, the tests for body weights are equivalent. Again, we find clear
evidence for the effect of body weights after fitting sexes.
A 95% confidence interval for γ has end points
2.7948 ± 2.014(0.5759),
which yields the interval (1.6, 4.0). We are 95% confident that, for data comparable to the data in
this study, an increase in body weight of one kilogram corresponds to a mean increase in heart
weight of between 1.6 g and 4.0 g. (An increase in body weight corresponds to an increase in heart
weight. Philosophically, we have no reason to believe that increasing body weights by one kg will
cause an increase in heart weight.)
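The interval arithmetic can be sketched as follows; the multiplier 2.014 is the t percentile used in the text for 45 error degrees of freedom and is simply taken as given here.

```python
# Slope estimate, its standard error, and the t multiplier used in the text.
estimate = 2.7948
std_error = 0.5759
t_multiplier = 2.014

half_width = t_multiplier * std_error
lower = estimate - half_width
upper = estimate + half_width
print(round(lower, 1), round(upper, 1))
```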
In Model (15.1.2), comparing treatments by comparing the treatment means ȳi· is inappropriate
because of the complicating effect of the covariate. Adjusted means are often used to compare
treatments. The formula and the actual values for the adjusted means are given below along with the
raw means for body weights and heart weights.
Adjusted means ≡ ȳi· − γ̂ (z̄i· − z̄·· )
Sex N Body Heart Adj. Heart
Female 24 2.333 8.887 9.580
Male 24 2.829 11.050 10.357
Combined 48 2.581 9.969
Note that the difference in adjusted means is
9.580 − 10.357 = α̂1 − α̂2 = 2(−0.3884).
We have seen previously that there is little evidence of a differential effect on heart weights due to
sexes after adjusting for body weights. Nonetheless, from the adjusted means what evidence exists
suggests that, even after adjusting for body weights, a typical heart weight for males, 10.357, is
larger than a typical heart weight for females, 9.580.
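The adjusted means in the table can be reproduced from the raw means and the slope estimate. This is a sketch of the formula ȳi· − γ̂(z̄i· − z̄··) using the rounded values reported above; the dictionary layout and function name are illustrative.

```python
gamma_hat = 2.7948       # common slope from Model (15.1.2)
grand_body_mean = 2.581  # combined mean body weight

# Raw means (body weight in kg, heart weight in g) from the table above.
raw_means = {
    "Female": {"body": 2.333, "heart": 8.887},
    "Male": {"body": 2.829, "heart": 11.050},
}


def adjusted_mean(heart_mean, body_mean):
    """Heart-weight mean adjusted to the combined mean body weight."""
    return heart_mean - gamma_hat * (body_mean - grand_body_mean)


adjusted = {sex: adjusted_mean(m["heart"], m["body"]) for sex, m in raw_means.items()}
difference = adjusted["Female"] - adjusted["Male"]
print({sex: round(value, 3) for sex, value in adjusted.items()}, round(difference, 3))
```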
Figures 15.1 through 15.3 contain residual plots for Model (15.1.2). The plot of residuals versus
predicted values looks exceptionally good. The plot of residuals versus sexes shows slightly less
variability for females than for males. The difference is probably not enough to worry about. The
normal plot of the residuals is alright with W ′ above the appropriate percentile.
The models that we have fitted form a hierarchy similar to that discussed in Chapter 14. The
ACOVA model is larger than both the one-way and simple linear regression models, which are not
comparable, but both are larger than the intercept-only model.
ACOVA
One-Way ANOVA Simple Linear Regression
Intercept-Only
In terms of numbered models the hierarchy is
(15.1.2)
(15.1.1) (15.1.3)
(14.1.6).
Such a hierarchy leads to two sequential ANOVA tables that are displayed in Table 15.4. All of the
results in Table 15.3 appear in Table 15.4.
[Figure: standardized residuals versus fitted values.]
[Figure: Residual−Socio plot of standardized residuals.]
for these regressions but uses the same slope γ . We should test the assumption of a common slope
by fitting the more general model that allows different slopes for females and males, i.e.,
yi j = μi + γi zi j + εi j (15.1.5)
= μ + αi + γi zi j + εi j .
In Model (15.1.5) the γ s depend on i and thus the slopes are allowed to differ between the sexes.
While Model (15.1.5) may look complicated, it consists of nothing more than fitting a simple linear
regression to each group: one to the female data and a separate simple linear regression to the male
[Figure: Normal Q−Q plot of standardized residuals versus theoretical quantiles.]
Source df Seq SS MS F P
Sex 1 56.117 56.117 34.94 0.000
Body Weights 1 37.828 37.828 23.55 0.000
Error 45 72.279 1.606
Total 47 166.223
Figure 15.4 contains some examples of how Model (15.1.2) and Model (15.1.5) might look when
plotted. In Model (15.1.2) the lines are always parallel. In Model (15.1.5) they can have several
appearances.
The sum of squares error for Model (15.1.5) can be found directly but it also comes from adding
the error sums of squares for the separate female and male simple linear regressions. It is easily seen
that for females the simple linear regression has an error sum of squares of 22.459 on 22 degrees of
freedom and the males have an error sum of squares of 49.614 also on 22 degrees of freedom. Thus
Model (15.1.5) has an error sum of squares of 22.459 + 49.614 = 72.073 on 22 + 22 = 44 degrees
of freedom. The mean squared error for Model (15.1.5) is
\[
MSE(5) = \frac{72.073}{44} = 1.638
\]
[Four-panel plot of Mean against x1: one panel titled "No Interaction" and three titled "Interaction".]
Figure 15.4 Patterns of interaction (effect modification) between a continuous predictor x1 and a binary pre-
dictor x2 .
and, using results from Table 15.3, the test of Model (15.1.5) against the reduced model (15.1.2) has
(15.1.5)
(15.1.2)
(15.1.1) (15.1.3)
(14.1.6).
The hierarchy leads to the two ANOVA tables given in Table 15.5. We could also report Cp statistics
for all five models relative to the interaction model (15.1.5).
Source df Seq SS MS F P
Sex 1 56.117 56.117 34.26 0.000
Body Weights 1 37.828 37.828 23.09 0.000
Sex*Body Wt 1 0.206 0.206 0.13 0.725
Error 44 72.073 1.638
Total 47 166.223
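The Error and Sex*Body Wt lines of this table can be reproduced from quantities quoted in the text: the error sums of squares of the separate female and male regressions, and the error line of the common-slope model (15.1.2). The sketch below only redoes that arithmetic; variable names are illustrative.

```python
# Separate simple linear regressions (error SS, error df) from the text.
sse_female, dfe_female = 22.459, 22
sse_male, dfe_male = 49.614, 22

# Model (15.1.5): the separate-lines error is the sum of the two.
sse_full = sse_female + sse_male
dfe_full = dfe_female + dfe_male
mse_full = sse_full / dfe_full

# Model (15.1.2): common slope, from the earlier sequential table.
sse_reduced, dfe_reduced = 72.279, 45

# Test of equal slopes: reduced model (15.1.2) against full model (15.1.5).
ss_interaction = sse_reduced - sse_full
f_interaction = (ss_interaction / (dfe_reduced - dfe_full)) / mse_full
print(round(mse_full, 3), round(ss_interaction, 3), round(f_interaction, 2))
```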
The table of coefficients depends on the particular parameterization of a model. I prefer the
interaction model parameterization
yi j = μi + γi zi j + εi j ,
in which all of the parameters are uniquely defined. Some computer programs insist on using the
equivalent model
yi j = μ + αi + β zi j + γi zi j + εi j (15.1.6)
which is overparameterized. To get estimates of the parameters, one must impose side conditions
on them. My choice would be to make μ = 0 = β and get a model equivalent to the first one.
Other common choices of side conditions are: (a) α1 = 0 = γ1 , (b) α2 = 0 = γ2 , and (c) α1 + α2 =
0 = γ1 + γ2 . Some programs are flexible enough to let you specify the model yourself. Minitab, for
example, uses the side conditions (c) and reports
Covariate γ̂ SE(γ̂ ) t P
Constant 2.789 1.516 1.84 0.072
Sex
1 0.142 1.516 0.09 0.926
Body Wt 2.7613 0.5892 4.69 0.000
Body Wt*Sex
1 −0.2089 0.5892 −0.35 0.725
Relative to Model (15.1.6) the parameter estimates are μ̂ = 2.789, α̂1 = 0.142, α̂2 = −0.142, β̂ =
2.7613, γ̂1 = −0.2089, γ̂2 = 0.2089, so the estimated regression line for females is
yi j = μi + γ1 xi j1 + γ2 xi j2 + γ3 xi j3 + εi j .
We could even apply this idea to the cat example by considering a polynomial model. Incorporating
into Model (15.1.2) a cubic polynomial for one predictor z gives
\[
y_{ij} = \mu_i + \gamma_1 z_{ij} + \gamma_2 z_{ij}^2 + \gamma_3 z_{ij}^3 + \varepsilon_{ij}.
\]
The key point is that ACOVA models are additive-effects models because none of the γ parameters
depend on sex (i). If we have three covariates x1 , x2 , x3 , an ACOVA model has
yi j = μi + h(xi j1 , xi j2 , xi j3 ) + εi j ,
for some function h(·). In this case μ1 − μ2 is the differential effect for the two groups regardless of
the covariate values.
One possible interaction model allows completely different regressions functions for each group,
Here we allow the slope parameters to depend on i. For the cat example we might consider separate
cubic polynomials for each sex, i.e.,
\[
m(x_1, x_2, x_3) = \mu_1 x_1 + \mu_2 x_2 + \mu_3 x_3
= \begin{cases}
\mu_1, & \text{female} \\
\mu_2, & \text{male} \\
\mu_3, & \text{herm}
\end{cases}
\tag{15.2.1}
\]
and the second form for the means is the overparameterized model
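\[
m(x_1, x_2, x_3) = \mu + \alpha_1 x_1 + \alpha_2 x_2 + \alpha_3 x_3
= \begin{cases}
(\mu + \alpha_1), & \text{female} \\
(\mu + \alpha_2), & \text{male} \\
(\mu + \alpha_3), & \text{herm.}
\end{cases}
\tag{15.2.2}
\]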
The first form for the means of Model (15.1.2) is the parallel lines regression model
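\[
m(x_1, x_2, x_3, z) = \mu_1 x_1 + \mu_2 x_2 + \mu_3 x_3 + \gamma z
= \begin{cases}
\mu_1 + \gamma z, & \text{female} \\
\mu_2 + \gamma z, & \text{male} \\
\mu_3 + \gamma z, & \text{herm}
\end{cases}
\tag{15.2.3}
\]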
and the second form is the overparameterized parallel lines model
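\[
m(x_1, x_2, x_3, z) = \mu + \alpha_1 x_1 + \alpha_2 x_2 + \alpha_3 x_3 + \gamma z
= \begin{cases}
(\mu + \alpha_1) + \gamma z, & \text{female} \\
(\mu + \alpha_2) + \gamma z, & \text{male} \\
(\mu + \alpha_3) + \gamma z, & \text{herm.}
\end{cases}
\tag{15.2.4}
\]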
Similarly, we could have “parallel” polynomials. For quadratics that would be
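\[
m(x_1, x_2, x_3, z) = \mu_1 x_1 + \mu_2 x_2 + \mu_3 x_3 + \gamma_1 z + \gamma_2 z^2
= \begin{cases}
\mu_1 + \gamma_1 z + \gamma_2 z^2, & \text{female} \\
\mu_2 + \gamma_1 z + \gamma_2 z^2, & \text{male} \\
\mu_3 + \gamma_1 z + \gamma_2 z^2, & \text{herm}
\end{cases}
\]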
wherein only the intercepts are different.
The interaction model (15.1.5) gives separate lines for each group and can be written as
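\[
m(x_1, x_2, x_3, z) = \mu_1 x_1 + \mu_2 x_2 + \mu_3 x_3 + \gamma_1 z x_1 + \gamma_2 z x_2 + \gamma_3 z x_3
= \begin{cases}
\mu_1 + \gamma_1 z, & \text{female} \\
\mu_2 + \gamma_2 z, & \text{male} \\
\mu_3 + \gamma_3 z, & \text{herm}
\end{cases}
\]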
and the second form is the overparameterized model
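\[
m(x_1, x_2, x_3, z) = \mu + \alpha_1 x_1 + \alpha_2 x_2 + \alpha_3 x_3 + \beta z + \gamma_1 z x_1 + \gamma_2 z x_2 + \gamma_3 z x_3
= \begin{cases}
(\mu + \alpha_1) + (\beta + \gamma_1) z, & \text{female} \\
(\mu + \alpha_2) + (\beta + \gamma_2) z, & \text{male} \\
(\mu + \alpha_3) + (\beta + \gamma_3) z, & \text{herm.}
\end{cases}
\]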
Every sex category has a completely separate line with different slopes and intercepts. Interaction
parabolas would be completely separate parabolas for each group
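\[
\begin{aligned}
m(x_1, x_2, x_3, z) &= \mu_1 x_1 + \mu_2 x_2 + \mu_3 x_3 + \gamma_{11} z x_1 + \gamma_{21} z^2 x_1 \\
&\quad + \gamma_{12} z x_2 + \gamma_{22} z^2 x_2 + \gamma_{13} z x_3 + \gamma_{23} z^2 x_3 \\
&= \begin{cases}
\mu_1 + \gamma_{11} z + \gamma_{21} z^2, & \text{female} \\
\mu_2 + \gamma_{12} z + \gamma_{22} z^2, & \text{male} \\
\mu_3 + \gamma_{13} z + \gamma_{23} z^2, & \text{herm.}
\end{cases}
\end{aligned}
\]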
If we thought that males and herms had the same mean, since neither male nor herm is the baseline,
we could replace x2 and x3 with a new variable x̃ = x2 + x3 that indicates membership in either group
to get
\[
m(x_1, x_2, x_3) = \mu + \alpha \tilde{x}
= \begin{cases} \mu, & \text{female} \\ \mu + \alpha, & \text{male or herm.} \end{cases}
\]
We could equally well fit the model
\[
m(x_1, x_2, x_3) = \mu_1 x_1 + \mu_3 \tilde{x}
= \begin{cases} \mu_1, & \text{female} \\ \mu_3, & \text{male or herm.} \end{cases}
\]
In these cases, the analysis of covariance (15.2.4) behaves similarly. For example, without both x1
and x2 Model (15.2.4) becomes
\[
m(x_1, x_2, x_3, z) = \mu + \alpha_3 x_3 + \gamma z
= \begin{cases} \mu + \gamma z, & \text{female or male} \\ (\mu + \alpha_3) + \gamma z, & \text{herm} \end{cases}
\tag{15.2.7}
\]
and involves only two parallel lines, one that applies to both females and males, and another one for
herms.
Dropping both x1 and x2 from Model (15.2.2) gives very different results than dropping the
intercept and x2 from Model (15.2.2). That statement may seem obvious, but if you think about
the fact that dropping x1 alone does not actually affect how the model fits the data, it might be
tempting to think that further dropping x2 could have the same effect after dropping x1 as dropping
x2 has in Model (15.2.1). We have already examined dropping both x1 and x2 from Model (15.2.2),
now consider dropping both the intercept and x2 from Model (15.2.2), i.e., dropping x2 from Model
(15.2.1). The model becomes
\[
m(x) = \mu_1 x_1 + \mu_3 x_3
= \begin{cases} \mu_1, & \text{female} \\ 0, & \text{male} \\ \mu_3, & \text{herm.} \end{cases}
\]
This occurs because all of the predictor variables in the model take the value 0 for male. If we
incorporate the covariate age into this model we get
\[
m(x) = \mu_1 x_1 + \mu_3 x_3 + \gamma z
= \begin{cases} \mu_1 + \gamma z, & \text{female} \\ 0 + \gamma z, & \text{male} \\ \mu_3 + \gamma z, & \text{herm,} \end{cases}
\]
which are three parallel lines but male has an intercept of 0.
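The effect of these coding choices is easy to check numerically. The following R sketch uses made-up data (the group means 10, 20, 30 and the object names are assumptions for illustration only, not the cat data) to show that the cell-means form (15.2.1) and the baseline form obtained by dropping x1 from (15.2.2) give the same fitted values, whereas dropping x2 from (15.2.1) forces the male mean to 0.

```r
# Hypothetical toy data: three groups with made-up means 10, 20, 30.
set.seed(1)
sex <- factor(rep(c("female", "male", "herm"), each = 4),
              levels = c("female", "male", "herm"))
y <- c(10, 20, 30)[as.integer(sex)] + rnorm(12)

# 0-1 indicator variables for the three groups.
x1 <- as.numeric(sex == "female")
x2 <- as.numeric(sex == "male")
x3 <- as.numeric(sex == "herm")

# Cell-means form (15.2.1): no intercept, one mean per group.
fit_means <- lm(y ~ 0 + x1 + x2 + x3)
# Form (15.2.2) with x1 dropped, so that female is the baseline.
fit_base <- lm(y ~ x2 + x3)
# Form (15.2.1) with x2 dropped: all predictors are 0 for males.
fit_drop <- lm(y ~ 0 + x1 + x3)

# The first two models are reparameterizations of one another.
same_fit <- all.equal(unname(fitted(fit_means)), unname(fitted(fit_base)))
# The third model fits a mean of 0 to every male.
male_fitted <- unname(fitted(fit_drop)[sex == "male"])
same_fit
male_fitted
```

The male fitted values in the last model are 0 up to rounding error, whatever the male observations happen to be.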
whether the measurement system for the weight of railroad hopper cars was under control. A stan-
dard hopper car weighing about 266,000 pounds was used to obtain the first 3 weighings of the day
on each of 20 days. The process was to move the car onto the scales, weigh the car, move the car
off, move the car on, weigh the car, move it off, move it on, and weigh it a third time. The tabled
values are the weight of the car minus 260,000.
As we did with the cat data, the first thing we might do is treat the three repeat observations as
replications and do a one-way ANOVA on the days,
\[ y_{ij} = \mu_i + \varepsilon_{ij}, \quad i = 1, \ldots, 20, \; j = 1, 2, 3. \]
Summary statistics are given in Table 15.7 and the ANOVA table follows.
Analysis of Variance
Source df SS MS F P
Day 19 296380 15599 18.25 0.000
Error 40 34196 855
Total 59 330576
Obviously, there are differences in days.
\[ y_{ij} = \mu_i + \gamma_1 z_{ij} + \gamma_2 z_{ij}^2 + \varepsilon_{ij}, \quad i = 1, \ldots, 20, \; j = 1, 2, 3. \]
The software I used actually fits
\[ y_{ij} = \mu + \alpha_i + \gamma_1 z_{ij} + \gamma_2 z_{ij}^2 + \varepsilon_{ij}, \quad i = 1, \ldots, 20, \; j = 1, 2, 3 \]
with the additional constraint that α1 + · · · + α20 = 0, so that α̂20 = −(α̂1 + · · · + α̂19). The output
then only presents α̂1, . . . , α̂19.
Table of Coefficients
Predictor Est SE(Est) t P
Constant 6066.10 28.35 213.98 0.000
z −49.50 32.19 −1.54 0.132
z² 11.850 7.965 1.49 0.145
Day
1 −55.73 16.37 −3.41 0.002
2 −123.07 16.37 −7.52 0.000
3 83.93 16.37 5.13 0.000
4 −105.07 16.37 −6.42 0.000
5 −9.07 16.37 −0.55 0.583
6 32.27 16.37 1.97 0.056
7 102.93 16.37 6.29 0.000
8 −53.40 16.37 −3.26 0.002
9 −36.73 16.37 −2.24 0.031
10 19.93 16.37 1.22 0.231
11 −72.40 16.37 −4.42 0.000
12 35.60 16.37 2.18 0.036
13 77.27 16.37 4.72 0.000
14 43.27 16.37 2.64 0.012
15 −37.40 16.37 −2.29 0.028
16 −10.07 16.37 −0.62 0.542
17 125.27 16.37 7.65 0.000
18 8.93 16.37 0.55 0.588
19 −97.40 16.37 −5.95 0.000
The table of coefficients is ugly, especially because there are so many days, but the main point is
that the z² term is not significant (P = 0.145).
The corresponding ANOVA table is a little strange. The only really important thing is that it
gives the Error line. There is also some interest in the fact that the F statistic reported for z² is the
square of the t statistic, having identical P values.
Analysis of Variance
Source df SS MS F P
z 1 176 176 2.36 0.132
Day 19 296380 15599 18.44 0.000
z² 1 1872 1872 2.21 0.145
Error 38 32147 846
Total 59 330576
Similar to Section 12.5, instead of fitting a maximal polynomial (we only have three times so can
fit at most a quadratic in time), we could alternatively treat z as a factor variable and do a two-way
ANOVA as in Chapter 14, i.e., fit
\[ y_{ij} = \mu + \alpha_i + \eta_j + \varepsilon_{ij}, \quad i = 1, \ldots, 20, \; j = 1, 2, 3. \]
The quadratic ACOVA model is equivalent to this two-way ANOVA model, so the two-way ANOVA
model should have an equivalent ANOVA table.
Analysis of Variance
Source df SS MS F P
Day 19 296380 15599 18.44 0.000
Time 2 2049 1024 1.21 0.309
Error 38 32147 846
Total 59 330576
This has the same Error line as the quadratic ACOVA model.
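The equivalence of the two models is easy to confirm numerically. The R sketch below uses simulated data with the same layout (20 days, 3 weighings per day); the response values, standard deviations, and object names are assumptions for illustration and are not the hopper car data. With only three distinct times, a quadratic in z and a three-level time factor span the same space, so the two fits should share their Error degrees of freedom and sum of squares.

```r
# Simulated stand-in for the hopper car layout: 20 days, 3 weighings each.
set.seed(2)
day <- factor(rep(1:20, each = 3))
z <- rep(1:3, times = 20)
# Made-up day effects plus made-up measurement error.
y <- 6000 + rnorm(20, sd = 70)[as.integer(day)] + rnorm(60, sd = 30)

# Quadratic ACOVA: day effects plus a quadratic in time.
quad <- lm(y ~ day + z + I(z^2))
# Two-way ANOVA without interaction: day effects plus time effects.
twoway <- lm(y ~ day + factor(z))

# Error degrees of freedom and error sums of squares for the two fits.
df_both <- c(df.residual(quad), df.residual(twoway))
sse_both <- c(deviance(quad), deviance(twoway))
df_both
sse_both
```

Both models leave 60 − 22 = 38 degrees of freedom for error, matching the Error lines in the two ANOVA tables above.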
With a nonsignificant z² term in the quadratic model, it makes sense to check whether we need
the linear term in z. The model is
\[ y_{ij} = \mu_i + \gamma_1 z_{ij} + \varepsilon_{ij}, \quad i = 1, \ldots, 20, \; j = 1, 2, 3 \]
or
\[ y_{ij} = \mu + \alpha_i + \gamma_1 z_{ij} + \varepsilon_{ij}, \quad i = 1, \ldots, 20, \; j = 1, 2, 3 \]
subject to the constraint that α1 + · · · + α20 = 0. The table of coefficients is
Table of Coefficients
Predictor Est SE(Est) t P
Constant 6026.60 10.09 597.40 0.000
Time −2.100 4.670 −0.45 0.655
Day
1 −55.73 16.62 −3.35 0.002
2 −123.07 16.62 −7.40 0.000
3 83.93 16.62 5.05 0.000
4 −105.07 16.62 −6.32 0.000
5 −9.07 16.62 −0.55 0.588
6 32.27 16.62 1.94 0.059
7 102.93 16.62 6.19 0.000
8 −53.40 16.62 −3.21 0.003
9 −36.73 16.62 −2.21 0.033
10 19.93 16.62 1.20 0.238
11 −72.40 16.62 −4.36 0.000
12 35.60 16.62 2.14 0.038
13 77.27 16.62 4.65 0.000
14 43.27 16.62 2.60 0.013
15 −37.40 16.62 −2.25 0.030
16 −10.07 16.62 −0.61 0.548
17 125.27 16.62 7.54 0.000
18 8.93 16.62 0.54 0.594
19 −97.40 16.62 −5.86 0.000
and we find no evidence that we need the linear term (P = 0.655). For completeness, an ANOVA
table is
Analysis of Variance
Source df SS MS F P
z 1 176 176 0.20 0.655
Day 19 296380 15599 17.88 0.000
Error 39 34020 872
Total 59 330576
It might be tempting to worry about interaction in this model. Resist the temptation! First, there
are not enough observations for us to fit a full interaction model and still estimate σ². If we fit
separate quadratics for each day, we would have 60 mean parameters and 60 observations, so zero
degrees of freedom for error. Exactly the same thing would happen if we fit a standard interaction
model from Chapter 14. But more importantly, it just makes sense to think of interaction as error for
these data. What does it mean for there to be a time trend in these data? Surely we have no interest
in time trends that go up one day and down another day without any rhyme or reason. For a time
trend to be meaningful, it needs to be something that we can spot on a consistent basis. It has to be
something that is strong enough that we can see it over and above the natural day-to-day variation
of the weighing process. Well, the natural day-to-day variation of the weighing process is precisely
the Day-by-Time interaction, so the interaction is precisely what we want to be using as our error
term. In the model
\[ y_{ij} = \mu + \alpha_i + \gamma_1 z_{ij} + \gamma_2 z_{ij}^2 + \varepsilon_{ij}, \]
changes that are inconsistent across days and times, terms that depend on both i and j, are what we
want to use as error. (An exception to this claim is if, say, we noticed that time trends go up one
day, down the next, then up again, etc. That is a form of interaction that we could be interested in,
but its existence requires additional structure for the Days because it involves modeling effects for
alternate days.)
This is sometimes called the artificial means model because it is a regression on the near replicate
cluster means x̄j· but the clusters are artificially constructed. The second model is a one-way analysis
of variance model with groups defined by the near replicate clusters,
\[ y_{jk} = \mu_j + \varepsilon_{jk}. \tag{15.4.2} \]
As a regression model, define the predictor variables δhj for h = 1, . . . , 19, that are equal to 1 if h = j
and 0 otherwise. Then the model can be rewritten as a multiple regression model through the origin
\[ y_{jk} = \mu_1 \delta_{1j} + \mu_2 \delta_{2j} + \cdots + \mu_{19} \delta_{19,j} + \varepsilon_{jk}. \]
The last model is called an analysis of covariance model because it incorporates the original predictor
(covariate) xjk into the analysis of variance model (15.4.2). The model is
\[ y_{jk} = \mu_j + \beta_1 x_{jk} + \varepsilon_{jk}, \tag{15.4.3} \]
\[ y_{jk} = \mu_1 \delta_{1j} + \mu_2 \delta_{2j} + \cdots + \mu_{19} \delta_{19,j} + \beta_1 x_{jk} + \varepsilon_{jk}. \]
This can be compared to an F(17, 11) distribution, which yields a P value of 0.001. This procedure is
known as Shillington’s test, cf. Christensen (2011).
Table 15.9: Compressive strength of hoop pine trees (y) with moisture contents (z).
                                  Temperature
          −20°C          0°C          20°C          40°C          60°C
Tree     z      y      z      y      z      y      z      y      z      y
  1    42.1  13.14   41.1  12.46   43.1   9.43   41.4   7.63   39.1   6.34
  2    41.0  15.90   39.4  14.11   40.3  11.30   38.6   9.56   36.7   7.27
  3    41.1  13.39   40.2  12.32   40.6   9.65   41.7   7.90   39.7   6.41
  4    41.0  15.51   39.8  13.68   40.4  10.33   39.8   8.27   39.3   7.06
  5    41.0  15.53   41.2  13.16   39.7  10.29   39.0   8.67   39.0   6.68
  6    42.0  15.26   40.0  13.64   40.3  10.35   40.9   8.67   41.2   6.62
  7    40.4  15.06   39.0  13.25   34.9  10.56   40.1   8.10   41.4   6.15
  8    39.3  15.21   38.8  13.54   37.5  10.46   40.6   8.30   41.8   6.09
  9    39.2  16.90   38.5  15.23   38.5  11.94   39.4   9.34   41.7   6.26
 10    37.7  15.45   35.7  14.06   36.7  10.74   38.9   7.75   38.2   6.29
15.5 Exercises
EXERCISE 15.5.1. Table 15.9 contains data from Sulzberger (1953) and Williams (1959) on y,
the maximum compressive strength parallel to the grain of wood from ten hoop pine trees. The data
also include the temperature of the evaluation and a covariate z, the moisture content of the wood.
Analyze the data. Examine polynomials in the temperatures.
EXERCISE 15.5.2. Smith, Gnanadesikan, and Hughes (1962) gave data on urine characteristics
of young men. The men were divided into four categories based on obesity. The data contain a
covariate z that measures specific gravity. The dependent variable is y1; it measures pigment
creatinine. These variables are included in Table 15.10. Perform an analysis of covariance on y1. How
do the conclusions about obesity effects change between the ACOVA and the results of the ANOVA
that ignores the covariate?
EXERCISE 15.5.3. Smith, Gnanadesikan, and Hughes (1962) also give data on the variable y2
that measures chloride in the urine of young men. These data are also reported in Table 15.10. As
in the previous problem, the men were divided into four categories based on obesity. Perform an
analysis of covariance on y2 again using the specific gravity as the covariate z. Compare the results
of the ACOVA to the results of the ANOVA that ignores the covariate.
EXERCISE 15.5.4. The data of Exercise 14.5.1 and Table 14.8 involved two factors, one of which
had unequally spaced quantitative levels. Find the smallest polynomial that gives an adequate fit in
place of treating “days in storage” as a factor variable.
EXERCISE 15.5.5. Apply Shillington’s test to the data of Exercise 9.12.3 and Table 9.3. The
challenge is to come up with some method of identifying near replicates. (A clustering algorithm is
a good idea but beyond the scope of this book.)
EXERCISE 15.5.6. Referring back to Subsection 7.3.3, test the need for a power transformation
in each of the following problems from the previous chapter. Use all three constructed variables on
each data set and compare results.
(a) Exercise 14.5.1.
(b) Exercise 14.5.2.
(c) Exercise 14.5.3.
(d) Exercise 14.5.4.
EXERCISE 15.5.7. Write models (15.1.1), (15.1.2), and (15.1.3) in matrix form. For each model
use a regression program on the heart weight data of Table 15.1 to find 95% and 99% prediction
intervals for a male and a female each with body weight of 3.0. Hint: Use models without intercepts
whenever possible.
EXERCISE 15.5.8. Consider the analysis of covariance for a one-way ANOVA with one covariate.
Find the form for a 99% prediction interval for an observation, say, from the first treatment
group with a given covariate value z.
where SSEzz is the sum of squares error from doing a one-way ANOVA on z.
Chapter 16
Multifactor Structures
In this chapter we introduce analysis of variance models that involve more than two factors and
examine interactions between two factors.
Table 16.1 is derived from Scheffé (1959) and gives the moisture content (in grams) for samples
of a food product made with three kinds of salt (A), three amounts of salt (B), and two additives (C).
The amounts of salt, as measured in moles, are equally spaced. The two numbers listed for some
treatment combinations are replications. We wish to analyze these data.
We will consider these data as a three-factor ANOVA. From the structure of the replications the
ANOVA has unequal numbers. The general model for a three-factor ANOVA with replications is
with only the [AB] and [AC] interactions. In [AB][AC], the grand mean and all of the main effects
are redundant; it does not matter whether these terms are included in the model. Similarly, [AB][C]
indicates the model
\[ y_{ijkm} = G + A_i + B_j + C_k + [AB]_{ij} + e_{ijkm} \]
with the [AB] interaction and the C main effect. In [AB][C], the grand mean and the A and B main
effects are redundant. Readers familiar with methods for fitting log-linear models (cf. Christensen,
1997 or Fienberg, 1980) will notice a correspondence between Table 16.2 and similar displays used
in fitting three-dimensional contingency tables. The analogies between selecting log-linear models
and selecting models for unbalanced ANOVA are pervasive.
Table 16.2: Statistics for fitting models to the data of Table 16.1.
Model            SSE     dfE    F*      Cp
[ABC]            32.50   14     —       18.0
[AB][AC][BC]     39.40   18     .743    13.0
[AB][AC]         45.18   20     .910    11.5
[AB][BC]         40.46   20     .572     9.4
[AC][BC]         333.2   22     16.19   131.5
[AB][C]          45.75   22     .713     7.7
[AC][B]          346.8   24     13.54   133.4
[BC][A]          339.8   24     13.24   130.4
[A][B][C]        351.1   26     11.44   131.2
*The F statistics are for testing each model against the model
with a three-factor interaction, i.e., [ABC]. The denominator
of each F statistic is MSE([ABC]) = 32.50/14 = 2.3214.
All of the models have been compared to the full model using F statistics in Table 16.2. It takes
neither a genius nor an F table to see that the only models that fit the data are the models that include
the [AB] interaction. The Cp statistics tell the same story.
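The F and Cp columns of Table 16.2 can be recomputed from the SSE and dfE columns alone. The R sketch below copies those two columns from the table; the sample size n = 32 is not stated in this passage and is inferred here as the 18 treatment combinations plus the 14 error degrees of freedom of [ABC], and the form Cp = SSE/MSE([ABC]) − (n − 2p), with p = n − dfE, is assumed to be the definition in use.

```r
# SSE and dfE copied from Table 16.2.
tab <- data.frame(
  model = c("[ABC]", "[AB][AC][BC]", "[AB][AC]", "[AB][BC]", "[AC][BC]",
            "[AB][C]", "[AC][B]", "[BC][A]", "[A][B][C]"),
  SSE = c(32.50, 39.40, 45.18, 40.46, 333.2, 45.75, 346.8, 339.8, 351.1),
  dfE = c(14, 18, 20, 20, 22, 22, 24, 24, 26)
)

sse_full <- tab$SSE[1]
dfe_full <- tab$dfE[1]
mse_full <- sse_full / dfe_full   # MSE([ABC]) = 2.3214
n <- 32                           # inferred: 18 cells + 14 error df

# F statistic for each reduced model against [ABC]; undefined for [ABC] itself.
tab$F <- NA
tab$F[-1] <- ((tab$SSE[-1] - sse_full) / (tab$dfE[-1] - dfe_full)) / mse_full

# Cp with p = n - dfE mean parameters, so n - 2p = 2 * dfE - n.
tab$Cp <- tab$SSE / mse_full - (2 * tab$dfE - n)

print(transform(tab, F = round(F, 3), Cp = round(Cp, 1)))
```

Under these assumptions the printed values should agree with the table up to rounding, which also serves as a check that n = 32 is the right sample size.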
In addition to testing models against the three-factor interaction model, there are a number of
other comparisons that can be made among models that include [AB]. These are [AB][AC][BC] versus
[AB][AC], [AB][AC][BC] versus [AB][BC], [AB][AC][BC] versus [AB][C], [AB][AC] versus [AB][C], and
[AB][BC] versus [AB][C]. None of the comparisons show any lack of fit. The last two comparisons
are illustrated below.
We need to examine the [AB] interaction. Since the levels of B are quantitative, a model that is
equivalent to [AB][C] is a model that includes the main effects for C, but, instead of fitting an interaction
in A and B, fits a separate regression equation in the levels of B for each level of A. Let xj, j = 1, 2, 3
denote the levels of B. There are three levels of B, so the most general polynomial we can fit is a
second-degree polynomial in xj. Since the amounts of salt were equally spaced, it does not matter
much what we use for the xj s. The computations were performed using x1 = 1, x2 = 2, x3 = 3. In
particular, the model [AB][C] was reparameterized as
\[ y_{ijkm} = A_{i0} + A_{i1} x_j + A_{i2} x_j^2 + C_k + e_{ijkm}. \tag{16.1.1} \]
The nature of this model is that for a fixed additive, the three curves for the three salts can take any
shapes at all. However, if you change to the other additive all three of the curves will shift, either
up or down, exactly the same amount due to the change in additive. The shapes of the curves do not
change.
With a notation similar to that used in Table 16.2, the SSE and the dfE are reported in Table 16.3
for Model (16.1.1) and three reduced models. Note that the SSE and dfE reported in Table 16.3 for
[A0][A1][A2][C] are identical to the values reported in Table 16.2 for [AB][C]. This, of course, must be
true if the models are merely reparameterizations of one another. First we want to establish whether
the quadratic effects are necessary in the regressions. To do this we drop the Ai2 terms from Model
(16.1.1) and test
[A0][A1][A2][C] versus [A0][A1][C]
R(A2 | A1, A0, C) = 59.98 − 45.75 = 14.23
Fobs = (14.23/3)/2.3214 = 2.04.
Since F(.95, 3, 14) = 3.34, there is no evidence of any nonlinear effects.
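For readers without an F table at hand, the test can be reproduced in a few lines of R. This is only a sketch of the arithmetic above: the numbers are the ones quoted in the text, and the object names are made up for illustration.

```r
# Error sums of squares quoted in the text.
sse_reduced <- 59.98        # [A0][A1][C]
sse_full <- 45.75           # [A0][A1][A2][C], equivalently [AB][C]
mse_abc <- 32.50 / 14       # MSE([ABC]), used as the denominator

# Three quadratic coefficients are dropped, so 3 numerator df.
f_obs <- ((sse_reduced - sse_full) / 3) / mse_abc
f_crit <- qf(0.95, df1 = 3, df2 = 14)
p_val <- pf(f_obs, df1 = 3, df2 = 14, lower.tail = FALSE)

round(c(F = f_obs, critical = f_crit, P = p_val), 3)
```

The observed statistic falls well below the 95th percentile of the F(3, 14) distribution, in line with the conclusion above.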
At this point it might be of interest to test whether there are any linear effects. This is done by
testing [A0][A1][C] against [A0][C]. The statistics needed for this test are in Table 16.3. Instead of
actually doing the test, recall that no models in Table 16.2 fit the data unless they included the [AB]
interaction. If we eliminated the linear effects we would have a model that involved none of the
[AB] interaction. (The model [A0][C] is identical to the ANOVA model [A][C].) We already know
that such models do not fit.
Finally, we have never explored the possibility that there is no main effect for C. This can be
done by testing
[A0][A1][C] versus [A0][A1]
R(C | A1, A0) = 262.0 − 59.98 = 202
Fobs = (202/1)/2.3214 = 87.
Obviously, there is a substantial main effect for C, the type of food additive.
Our conclusion is that the model [A0][A1][C] is the smallest model that has been considered that
adequately fits the data. This model indicates that there is an effect for the type of additive and a
linear relationship between amount of salt and moisture content. The slope and intercept of the line
may depend on the type of salt. (The intercept of the line also depends on the type of additive.)
Table 16.4 contains parameter estimates and standard errors for the model. All estimates in the
example use the side condition C1 = 0.
Note that, in lieu of the F test given earlier, the test for the main effect C could be performed
from Table 16.4 by looking at t = −5.067/.5522 = −9.176. Moreover, we should have t² = F. The
t statistic squared is 84, while the F statistic reported earlier is 87. The difference is due to the fact
that the SE reported in Table 16.4 uses the MSE for the model being fitted, while in performing the
F test we used MSE([ABC]).
Table of Coefficients
Parameter   Estimate   SE
A10          3.35      1.375
A11          5.85      .5909
A20         −3.789     1.237
A21         13.24      .5909
A30         −4.967     1.231
A31         14.25      .5476
C1           0.        none
C2          −5.067     .5522
Table 16.5: yijkm = Ai0 + Ai1 xj + Ck + eijkm, A21 = A31, A20 = A30.
Table of Coefficients
Parameter   Estimate   SE
A10          3.395     1.398
A11          5.845     .6008
A20         −4.466     .9030
A21         13.81      .4078
C1           0.        none
C2          −5.130     .5602
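The explanation for the discrepancy between 84 and 87 can be checked directly from numbers already given. In the R sketch below, the error degrees of freedom 25 for [A0][A1][C] is taken from the small table of SSE and dfE values that accompanies the restricted models; the object names are made up for illustration.

```r
# t statistic for C2 from Table 16.4.
t_obs <- -5.067 / 0.5522

# Sum of squares for C after fitting the separate lines.
ss_c <- 262.0 - 59.98

# Two candidate denominators.
mse_abc <- 32.50 / 14    # MSE([ABC]), used in the F test
mse_fit <- 59.98 / 25    # MSE of the fitted model [A0][A1][C]

f_abc <- ss_c / mse_abc  # the F statistic reported earlier
f_fit <- ss_c / mse_fit  # F based on the fitted model's own MSE

round(c(t_squared = t_obs^2, F_with_MSE_ABC = f_abc, F_with_MSE_fit = f_fit), 1)
```

With the fitted model's own MSE in the denominator, the F statistic agrees with the squared t statistic up to rounding, which is exactly the relationship t² = F.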
Are we done yet? No. The parameter estimates suggest some additional questions. Are the slopes
for salts 2 and 3 the same, i.e., is A21 = A31? In fact, are the entire lines for salts 2 and 3 the same,
i.e., are A21 = A31, A20 = A30? We can fit models that incorporate these assumptions.
Model                                   SSE     dfE
[A0][A1][C]                             59.98   25
[A0][A1][C], A21 = A31                  63.73   26
[A0][A1][C], A21 = A31, A20 = A30       66.97   27
It is a small matter to check that there is no lack of fit displayed by any of these models. The
smallest model that fits the data is now [A0][A1][C], A21 = A31, A20 = A30. Thus there seems to be
no difference between salts 2 and 3, but salt 1 has a different regression than the other two salts.
(We did not actually test whether salt 1 is different, but if salt 1 had the same slope as the other two
then there would be no [AB] interaction and we know that interaction exists.) There is also an effect
for the food additives. The parameter estimates and standard errors for the final model are given in
Table 16.5.
Figure 16.1 shows the fitted values as functions of the amount of salt for each combination of
a salt (with salts 2 and 3 treated as the same) and the additive. The fact that the slope for salt 1
is different from the slope for salts 2 and 3 constitutes an AB interaction. The vertical distances
between the two lines for each salt are the same due to the simple main effect for C (additive).
The two lines are shockingly close at B = x1 = 1, which makes one wonder if perhaps B = 1 is an
indication of no salt being used.
If the level B = 1 really consists of not adding salt, then, when B = 1, the means should be
identical for the three salts. The additives can still affect the moisture contents and positive salt
amounts can affect the moisture contents. To incorporate these ideas, we subtract one from the salt
amounts and eliminate the intercepts from the lines in the amount of salt. That makes the effects for
the additive the de facto intercepts, and they are no longer overparameterized,
Figure 16.1: Fitted values for moisture content data treating salts 2 and 3 as the same. (Vertical axis: Fitted; legend (A,C): (1,1), (1,2), (2,1), (2,2).)
Table of Coefficients
Parameter Estimate SE tobs
C1 9.3162 0.5182 17.978
C2 4.1815 0.4995 8.371
A11 5.8007 0.4311 13.456
A21 13.8282 0.3660 37.786
This model has dfE = 28 and SSE = 67.0 so it fits the data almost as well as the previous model
but with one less parameter. The estimated coefficients are given in Table 16.6 and the results are
plotted in Figure 16.2. The figure is almost identical to Figure 16.1. Note that the vertical distances
between the two lines with “the same” salt in Figure 16.2 are 5.1347 = 9.3162 − 4.1815, almost
identical to the 5.130 in Figure 16.1.
Are we done yet? Probably not. We have not even considered the validity of the assumptions.
Are the errors normally distributed? Are the variances the same for every treatment combination?
Technically, we need to ask whether C1 = C2 in this new model. A quick look at the estimates and
standard errors answers the question in the negative.
Exercise 16.4.7 examines the process of fitting the more unusual models found in this section.
16.1.1 Computing
Because it is the easiest program I know, most of the analyses in this book were done in Minitab. We
now present and contrast R and SAS code for fitting [AB][C] and discuss the fitting of other models
from this section. Table 16.7 illustrates the variables needed for a full analysis. The online data file
contains only the y values and indices for the three groups. Creating X and X2 is generally easy.
Creating the variable A2 that does not distinguish between salts 2 and 3 can be trickier. If we had a
huge number of observations, we would want to write a program to modify A into A2. With the data
we have, in Minitab it is easy to make a copy of A and modify it appropriately in the spreadsheet.
Figure 16.2: Fitted values for moisture content data treating salts 2 and 3 as the same and B = 1 as 0 salt. (Horizontal axis: B−1; vertical axis: Fitted; legend (A,C): (1,1), (1,2), (2,1), (2,2).)
Similarly, it is easy to create A2 in R using A2=A followed by A2[(A2 == 3)] <- 2. For SAS, I
would probably modify the data file so that I could read A2 with the rest of the data.
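As a small illustration of the R recoding just described, the sketch below applies it to a made-up vector of salt indices (the values in a are hypothetical, not the data file). It also shows one way to handle the case in which the index has already been converted to a factor, where assigning the number 2 directly would not work as intended.

```r
# Hypothetical salt indices for nine observations.
a <- c(1, 1, 2, 2, 3, 3, 1, 2, 3)

# Numeric index: copy, then relabel salt 3 as salt 2.
A2 <- a
A2[A2 == 3] <- 2

# Cross-tabulate old and new codes to confirm the recoding.
table(a, A2)

# If the index is already a factor, recode through its labels instead.
A <- factor(a)
A2f <- factor(ifelse(A == "3", "2", as.character(A)))
levels(A2f)
```

Checking the cross-tabulation is a cheap safeguard against recoding the wrong level.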
An R script for fitting [AB][C] follows. R needs to locate the data file, which in this case is
located at E:\Books\ANREG2\DATA2\tab16-1.dat.
scheffe <- read.table("E:\\Books\\ANREG2\\DATA2\\tab16-1.dat",
    sep="",col.names=c("y","a","b","c"))
attach(scheffe)
scheffe
summary(scheffe)
# Factor versions of the three group indices
A=factor(a)
B=factor(b)
C=factor(c)
# Quantitative amount of salt and its square
X=b
X2=X*X
# Fit [AB][C]; the summary is stored under a name that does not mask coef()
sabc <- lm(y ~ A:B + C)
sabcsum=summary(sabc)
sabcsum
anova(sabc)
SAS code for fitting [AB][C] follows. The code assumes that the data file is in the same directory
(folder) as the SAS file.
To fit the other models, one needs to modify the part of the code that specifies the model. In R
this involves changes to “sabc <- lm(y ~ A:B + C)” and in SAS it involves changes to “model
y = A*B C;”. Alternative model specifications follow.
We start by creating 0-1 indicator variables for the factor variables A, B, and C. Call these A1,
A2, A3, B1, B2, B3, C1, C2, respectively. The values used to identify groups in factor variable B
are measured quantities, so create a measurement variable x ≡ B and another x². We can construct
all of the models from these 10 predictor variables by multiplying them together judiciously. Of
course there are many equivalent ways of specifying these models; we present only one. None of
the models contain an intercept.
Model               Variables
[ABC]               A1B1C1, A1B1C2, A1B2C1, A1B2C2, A1B3C1, . . . , A3B3C1, A3B3C2
[AB][AC][BC]        A1B1, A1B2, . . . , A3B3, A1C2, A2C2, A3C2, B2C2, B3C2
[AB][BC]            A1B1, A1B2, . . . , A3B3, B1C2, B2C2, B3C2
[AB][C]             A1B1, A1B2, . . . , A3B3, C2
[A][B][C]           A1, A2, A3, B2, B3, C2
[A0][A1][A2][C]     A1, A2, A3, A1x, A2x, A3x, A1x², A2x², A3x², C2
[A0][A1][C]         A1, A2, A3, A1x, A2x, A3x, C2
[A0][A1]            A1, A2, A3, A1x, A2x, A3x
[A0][C]             A1, A2, A3, C2
Constructing the models in which salts 2 and 3 are treated alike requires some additional algebra.
Model                                   Variables
[A0][A1][C], A21 = A31                  A1, A2, A3, A1x, (A2 + A3)x, C2
[A0][A1][C], A21 = A31, A20 = A30       A1, (A2 + A3), A1x, (A2 + A3)x, C2
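The variable lists in these tables translate directly into regression formulas. The R sketch below builds the indicators and fits three of the models to simulated data; the layout (every cell observed once, nine cells observed twice), the response values, and the object names are all assumptions for illustration and are not the data of Table 16.1. It confirms that the factor specification of [AB][C] and the no-intercept polynomial specification [A0][A1][A2][C] are the same model.

```r
# Hypothetical unbalanced 3 x 3 x 2 layout with made-up responses.
set.seed(3)
cells <- expand.grid(a = 1:3, b = 1:3, c = 1:2)
dat <- rbind(cells, cells[seq(1, 18, by = 2), ])   # 27 observations
dat$y <- rnorm(nrow(dat), mean = 10 + 2 * dat$a * dat$b - 3 * dat$c)

# Factor variables, as in the script for fitting [AB][C].
A <- factor(dat$a); B <- factor(dat$b); C <- factor(dat$c)

# 0-1 indicators and the measurement variable x for the amount of salt.
A1 <- as.numeric(dat$a == 1)
A2 <- as.numeric(dat$a == 2)
A3 <- as.numeric(dat$a == 3)
C2 <- as.numeric(dat$c == 2)
x <- dat$b

# [AB][C] specified with factors.
fit_abc <- lm(y ~ A:B + C, data = dat)

# [A0][A1][A2][C]: a separate quadratic in x for each salt, no intercept.
fit_poly <- lm(y ~ 0 + A1 + A2 + A3 + I(A1 * x) + I(A2 * x) + I(A3 * x) +
                 I(A1 * x^2) + I(A2 * x^2) + I(A3 * x^2) + C2, data = dat)

# [A0][A1][C] with A21 = A31: salts 2 and 3 share a slope.
fit_eq <- lm(y ~ 0 + A1 + A2 + A3 + I(A1 * x) + I((A2 + A3) * x) + C2,
             data = dat)

c(df.residual(fit_abc), df.residual(fit_poly), df.residual(fit_eq))
c(deviance(fit_abc), deviance(fit_poly))
```

The first two fits have identical error degrees of freedom and error sums of squares, while the restricted model uses six parameters and therefore leaves four more degrees of freedom for error.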
EXAMPLE 16.2.1. Box (1950) considers data on the abrasion resistance of a fabric. The data
are weight loss of a fabric that occurs during the first 1000 revolutions of a machine designed to
test abrasion resistance. A piece of fabric is weighed, put on the machine for 1000 revolutions, and
weighed again. The measurement is the change in weight. Fabrics of several different types are
compared. They differ by whether a surface treatment was applied, the type of filler used, and the
proportion of filler used. Two pieces of fabric of each type are examined, giving two replications
in the analysis of variance. The data, given in Table 16.8, are balanced because they have the same
number of observations for each group.
The three factors are referred to as “surf,” “fill,” and “prop,” respectively. The factors have
2, 2, and 3 levels, so there are 2 × 2 × 3 = 12 groups. This can also be viewed as just a one-way
ANOVA with 12 groups. Using the three subscripts ijk to indicate a treatment by indicating the
levels of surf, fill, and prop, respectively, the one-way ANOVA model is
Here the S, F, and P effects indicate main effects for surf, fill, and prop, respectively. (We hope no
confusion occurs between the factor F and the use of F statistics or between the factor P and the use
of P values!) The (SF)s are effects that allow for the two-factor interaction between surf and fill;
(SP) and (FP) are defined similarly. The (SFP)s are effects that allow for three-factor interaction.
A three-factor interaction can be thought of as a two-factor interaction that changes depending on
the level of the third factor. The main effects, two-factor interactions, and three-factor interaction
simply provide a structure that allows us to proceed in a systematic fashion.
Figure 16.3: Plot of residuals versus predicted values, 1000-rotation Box data. (Horizontal axis: Fitted; vertical axis: Standardized residuals.)
We begin by considering the one-way analysis of variance.
Analysis of Variance
Source df SS MS F
Treatments 11 48183 4380 16.30
Error 12 3225 269
Total 23 51408
The F statistic is very large. If the standard one-way ANOVA assumptions are reasonably valid,
there is clear evidence that not all of the 12 treatments have the same effect.
Now consider the standard residual checks for a one-way ANOVA. Figure 16.3 contains the
residuals plotted against the predicted values. The symmetry of the plot about a horizontal line at 0
is due to the model fitting, which forces the two residuals in each group to add to 0. Except for one
pair of observations, the variability seems to decrease as the predicted values increase. The residual
pattern is not one that clearly suggests heteroscedastic variances. We simply note the pattern and
would bring it to the attention of the experimenter to see if it suggests something to her. In the
absence of additional information, we proceed with the analysis. Figure 16.4 contains a normal plot
of the residuals. It does not look too bad. Note that with 24 residuals and only 12 dfE, we may want
to use dfE as the sample size should we choose to perform a W′ test.
Table 16.9 results from fitting a variety of models to the data. It is constructed just like
Table 16.2. From the Cp statistics and the tests of each model against the three-factor interaction model,
the obvious candidate models are [SF][SP][FP] and its reduced model [SF][FP]. Using MSE([SFP])
in the denominator, testing them gives
[Figure 16.4: Normal plot of the residuals. Horizontal axis: Theoretical Quantiles; vertical axis: Standardized residuals.]
Table 16.9: Statistics for fitting models to the 1000-rotation abrasion resistance data of Table 16.8.
Model SSE dfE F* Cp
[SFP] 3225.0 12 — 12.0
[SF][SP][FP] 3703.6 14 0.89 9.8
[SF][SP] 7232.7 16 3.73 18.9
[SF][FP] 4889.7 16 1.55 10.2
[SP][FP] 7656.3 15 5.50 22.5
[SF][P] 8418.7 18 3.22 19.3
[SP][F] 11185.3 17 5.92 31.6
[FP][S] 8842.3 17 4.18 22.9
[S][F][P] 12371.4 19 4.86 32.0
which has a P value of 0.153. This is a test for whether we need the [SP] interaction in a model that
already includes [SF][FP]. We will tentatively go with the smaller model,
The test for adding the [SP] interaction to this model was the one test we really needed to per-
form, but there are several tests available for [SP] interaction. In addition to the test we performed,
one could test [SF][SP] versus [SF][P] as well as [SP][FP] versus [S][FP]. Normally, these would
be three distinct tests but with balanced data like the 1000-rotation data, the tests are all identical.
Because of this and similar simplifications due to balanced data, we can present a unique ANOVA
table, in lieu of Table 16.9, that provides a comprehensive summary of all ANOVA model tests. This
is given as Table 16.10. Note that the F statistic and P value for testing surf ∗ prop in Table 16.10
agree with our values from the previous paragraph. For a two-factor model, we presented ANOVA
tables like Table 14.3 that depended on fitting both of the two reasonable sequences of models. In an
unbalanced three-factor ANOVA, there are too many possible model sequences to present them all,
so we use tables like 16.2 and 16.9, except in the balanced case where everything can be summarized
as in Table 16.10.
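Both the P value of 0.153 and the claim that the three tests for [SP] interaction coincide can be checked from the SSE and dfE columns of Table 16.9. The R sketch below copies the relevant rows; the object names are made up for illustration, and the small discrepancies among the three differences reflect rounding in the table.

```r
# SSE and dfE copied from Table 16.9.
sse <- c("[SFP]" = 3225.0, "[SF][SP][FP]" = 3703.6, "[SF][SP]" = 7232.7,
         "[SF][FP]" = 4889.7, "[SP][FP]" = 7656.3, "[SF][P]" = 8418.7,
         "[FP][S]" = 8842.3)
dfe <- c("[SFP]" = 12, "[SF][SP][FP]" = 14, "[SF][SP]" = 16,
         "[SF][FP]" = 16, "[SP][FP]" = 15, "[SF][P]" = 18, "[FP][S]" = 17)

mse_sfp <- sse[["[SFP]"]] / dfe[["[SFP]"]]

# Test of [SF][SP][FP] versus [SF][FP], using MSE([SFP]) in the denominator.
num_df <- dfe[["[SF][FP]"]] - dfe[["[SF][SP][FP]"]]
f_obs <- ((sse[["[SF][FP]"]] - sse[["[SF][SP][FP]"]]) / num_df) / mse_sfp
p_val <- pf(f_obs, num_df, dfe[["[SFP]"]], lower.tail = FALSE)
round(c(F = f_obs, P = p_val), 3)

# Sums of squares for [SP] from the three available model comparisons.
ss_sp <- c(sse[["[SF][FP]"]] - sse[["[SF][SP][FP]"]],
           sse[["[SF][P]"]] - sse[["[SF][SP]"]],
           sse[["[FP][S]"]] - sse[["[SP][FP]"]])
ss_sp
```

With balanced data the three differences agree, so a single line for surf ∗ prop in an ANOVA table summarizes all three tests.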
In the previous section, our best model for the moisture data had only one two-factor term. For
the abrasion data our working model has two two-factor terms: [SF] and [FP]. Both two-factor terms
involve F, so if we fix a level of fill, we will have an additive model in surf and prop. In other
words, for each level of fill there will be some effect for surf that is added to some effect for the
proportions. The interaction comes about because the surf effect can change depending on the fill,
and the prop effects can also change depending on the fill. Moreover, prop is a quantitative factor
with three levels, so an equivalent model will be to fit separately, for each level of fill, the surface
effects as well as a parabola in proportions. Let pk, k = 1, 2, 3 denote the levels of prop. Since the
proportions were equally spaced, it does not matter much what we use for the pk s. We take p1 = 1,
p2 = 2, p3 = 3, although another obvious set of values would be 25, 50, 75. The model, equivalent
to [SF][FP], is
\[ y_{ijkm} = SF_{ij} + F_{j1} p_k + F_{j2} p_k^2 + e_{ijkm}. \]
Denote this model [SF][F1][F2]. An ANOVA table is given as Table 16.11. Note that the Error line
agrees, up to round-off error, with the Error information on [SF][FP] in Table 16.9.
[Figure 16.5: Fitted values plotted against Prop. Vertical axis: Fitted; legend (F,S): (A,Yes), (A,No), (B,Yes), (B,No).]
The table of coefficients for [SF][F1][F2] is given as Table 16.12. It provides our fitted model
\[
\hat{m}(i, j, p) = \begin{cases}
180.50 + 18.50p + 3.75p^2, & \text{Surf = Yes, Fill = A} \\
140.00 + 18.50p + 3.75p^2, & \text{Surf = No, Fill = A} \\
256.67 - 41.38p + 11.38p^2, & \text{Surf = Yes, Fill = B} \\
164.83 - 41.38p + 11.38p^2, & \text{Surf = No, Fill = B,}
\end{cases}
\]
which is graphed in Figure 16.5. The two parabolas for Fill = A are parallel and remarkably straight.
The two parabolas for Fill = B are also parallel and not heavily curved. That the curves are parallel
for a fixed Fill is indicative of there being no [SP] or [SFP] interactions in the model. The fact
that the shapes of the Fill = A parabolas are different from the shapes of the Fill = B parabolas
is indicative of the [FP] interaction. The fact that the distance between the two parallel Fill = A
parabolas is different from the distance between the two parallel Fill = B parabolas is indicative of
the [SF] interaction.
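The geometry described above can be checked numerically. The following is a minimal sketch that only evaluates the fitted model at the coded proportions p = 1, 2, 3; the coefficients are copied from the fitted model, while the names `COEF`, `fitted`, and `surf_gap` are ours and purely illustrative.

```python
# Coefficients (intercept, linear, quadratic) copied from the fitted model
# for [SF][F1][F2]; keys are (Surf, Fill). Names here are illustrative only.
COEF = {
    ("Yes", "A"): (180.50, 18.50, 3.75),
    ("No", "A"): (140.00, 18.50, 3.75),
    ("Yes", "B"): (256.67, -41.38, 11.38),
    ("No", "B"): (164.83, -41.38, 11.38),
}


def fitted(surf, fill, p):
    """Fitted mean for a surface treatment, a fill, and a coded proportion p."""
    b0, b1, b2 = COEF[(surf, fill)]
    return b0 + b1 * p + b2 * p ** 2


def surf_gap(fill, p):
    """Vertical distance between the Surf = Yes and Surf = No curves."""
    return fitted("Yes", fill, p) - fitted("No", fill, p)


for fill in ("A", "B"):
    for p in (1, 2, 3):
        print(fill, p,
              round(fitted("Yes", fill, p), 2),
              round(fitted("No", fill, p), 2),
              round(surf_gap(fill, p), 2))
```

Within a fill the gap between the two curves does not depend on p (the curves are parallel), but the gap for Fill A differs from the gap for Fill B, which is the [SF] interaction.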
Both quadratic terms have large P values in Table 16.12. We might consider fitting a reduced
model that eliminates the curvatures, i.e., fits straight lines. The reduced model is
$$y_{ijkm} = SF_{ij} + F_{j1} p_k + e_{ijkm},$$
denoted [SF][F1]. Table 16.13 gives the ANOVA table which, when compared to Table 16.11, allows
us to test simultaneously whether we need the two quadratic terms. With
This is graphed in Figure 16.6. The difference in the slopes for Fills A and B indicates the [FP]
interaction. The fact that the distance between the two parallel lines for Fill A is different from
the distance between the two parallel lines for Fill B indicates the presence of [SF] interaction. The
nature of this model is that for a fixed Fill the proportion curves will be parallel but when you change
fills both the shape of the curves and the distance between the curves can change.
The slope for Fill B looks to be nearly 0. The P value in Table 16.14 is 0.504. We could incorporate
$F_{21} = 0$ into a model $y_{ijkm} = m(i, j, p_k) + \varepsilon_{ijkm}$ so that
$$m(i, j, p) = \begin{cases}
[SF]_{11} + F_{11} p & \text{Surf = Yes, Fill = A} \\
[SF]_{21} + F_{11} p & \text{Surf = No, Fill = A} \\
[SF]_{12} & \text{Surf = Yes, Fill = B} \\
[SF]_{22} & \text{Surf = No, Fill = B.}
\end{cases}$$
[Figure: Fitted versus Prop, with separate lines for (F,S) = (A,Yes), (A,No), (B,Yes), (B,No).]
Denote this [SF][F11]. The ANOVA table and the Table of Coefficients are given as Tables 16.15
and 16.16. The fitted model is
$$\hat{m}(i, j, p) = \begin{cases}
168.000 + 33.500p & \text{Surf = Yes, Fill = A} \\
127.500 + 33.500p & \text{Surf = No, Fill = A} \\
227.000 & \text{Surf = Yes, Fill = B} \\
135.167 & \text{Surf = No, Fill = B}
\end{cases},$$
[Figure: Fitted versus Prop, with separate lines for (F,S) = (A,Yes), (A,No), (B,Yes), (B,No).]
with
$$m(i, j, p) = \begin{cases}
[SF]_{11} + F_{11} p & \text{Surf = Yes, Fill = A} \\
[SF]_{21} + F_{11} p & \text{Surf = No, Fill = A} \\
[SF]_{12} & \text{Surf = Yes, Fill = B} \\
[SF]_{21} & \text{Surf = No, Fill = B}
\end{cases}$$
fits well but is rather dubious. Extrapolating to 0% fill, the estimated weight losses would be the
same for no surface treatment and both fills. But as the proportion increases, the weight loss remains
flat for Fill B but increases for Fill A. With a surface treatment, the extrapolated weight losses at 0%
fill are different, but for Fill B it remains flat while for Fill A it increases. The ANOVA table and
Table of Coefficients are given as Tables 16.17 and 16.18.
16.4 Exercises
EXERCISE 16.4.1. Baten (1956) presented data on lengths of steel bars. An excessive number
of bars had recently failed to meet specifications and the experiment was conducted to identify the
causes of this problem. The bars were made with one of two heat treatments (W, L) and cut on one
of four screw machines (A, B, C, D) at one of three times of day (8 am, 11 am, 3 pm). The three
times were used to investigate the possibility of worker fatigue during the course of the day. The
bars were intended to be between 4.380 and 4.390 inches long. The data presented in Table 16.19
are thousandths of an inch in excess of 4.380. Treating the data as a 2 × 3 × 4 ANOVA, give an
analysis of the data.
EXERCISE 16.4.2. Bethea et al. (1985) reported data on an experiment to determine the effectiveness
of four adhesive systems for bonding insulation to a chamber. The adhesives were applied
both with and without a primer. Tests of peel-strength were conducted on two different thicknesses
of rubber. Using two thicknesses of rubber was not part of the original experimental design. The
existence of this factor was only discovered by inquiring about a curious pattern of numbers in the
laboratory report. The data are presented in Table 16.20. Another disturbing aspect of these data is
that the values for adhesive system 3 are reported with an extra digit. Presumably, a large number of
rubber pieces were available and the treatments were randomly assigned to these pieces, but, given
the other disturbing elements in these data, I wouldn’t bet the house on it. A subset of these data
was examined earlier in Exercise 12.7.6.
(a) Give an appropriate model. List all the assumptions made in the model.
(b) Check the assumptions of the model and adjust the analysis appropriately.
(c) Analyze the data. Give an appropriate analysis of variance table. Examine appropriate contrasts.
EXERCISE 16.4.3. The data of Table 16.21 were presented in Finney (1964) and Bliss (1947).
The observations are serum calcium values of dogs after they have been injected with a dose of
parathyroid extract. The doses are the treatments and they have factorial structure. One factor in-
volves using either the standard preparation (S) or a test preparation (T). The other factor is the
amount of a dose; it is either low (L) or high (H). Low doses are 0.125 cc and high doses are 0.205 cc.
Each dog is subjected to three injections at about 10 day intervals. Serum calcium is measured on
the day after an injection. Analyze the data using a three-factor model with dogs, preparations, and
amounts but do not include any interactions involving dogs. Should day effects be incorporated?
Can this be done conveniently? If so, do so.
EXERCISE 16.4.4. Using the notation of Section 16.1, write the models [A0][A1][C],
[A0][A1][C] A21 = A31, and [A0][A1][C] A21 = A31, A20 = A30 in matrix form. (Hint: To obtain
[A0][A1][C] A21 = A31 from [A0][A1][C], replace the two columns of X corresponding to A21 and A31
with one column consisting of their sum.) Use a regression program to fit these three models. (Hint:
Eliminate the intercept, and to impose the side condition C1 = 0, drop the column corresponding to
C1.)
Chapter 17
In this chapter we examine basic experimental designs: completely randomized designs (CRDs),
randomized complete block (RCB) designs, Latin square (LS) designs, balanced incomplete block
(BIB) designs, and more. The focus of this chapter is on ideas of experimental design and how they
determine the analysis of data. We have already examined in the text and in the exercises data from
many of these experimental designs.
398 17. BASIC EXPERIMENTAL DESIGNS
chance that one of the treatment groups will randomly have an inordinate number of good units for
some variable, and hence show an effect that is really due to chance. In other words, if we measure
enough variables, just by chance, some of them will display a relationship to the treatment groups,
regardless of how the treatment groups were chosen.
A particularly disturbing problem is that the experimental treatments are often not what we think
they are. An experimental treatment is everything we do differently to a group of experimental units.
If we give a drug to a bunch of rats and then stick them into an asbestos filled attic, the fact that
those rats have unusually high cancer rates does not mean that the drug caused it. The treatment
caused it, but just because we call the treatment by the name of the drug does not make the drug the
treatment.
Alternatively, suppose we want to test whether artificial sweeteners made with a new chemical
cause cancer. We get some rats, randomly divide them into a treatment group and a control. We
inject the treatment rats with a solution of the sweetener combined with another (supposedly benign)
chemical. We leave the control rats alone. For simplicity we keep the treatment rats in one cage and
the control rats in another cage. Eventually, we find an increased risk of cancer among the treatment
rats as compared to the control rats. We can reasonably conclude that the treatments caused the
increased cancer rate. Unfortunately, we do not really know whether the sweetener or the supposedly
benign chemical or the combination of the two caused the cancer. In fact, we do not really know
that it was the chemicals that caused the cancer. Perhaps the process of injecting the rats caused the
cancer or perhaps something about the environment in the treatment rats’ cage caused the cancer.
A treatment consists of all the ways in which a group is treated differently from other groups. It
is crucially important to treat all experimental units as similarly as possible so that (as nearly as
possible) the only differences between the units are the agents that were meant to be investigated.
Random assignment of treatments is fundamental to conducting an experiment but it does not
mean haphazard assignment of treatments to experimental units. Haphazard assignment is subject to
the (unconscious) biases of the person making the assignments. Random assignment uses a reliable
table of random numbers or a reliable computer program to generate random numbers. It then uses
these numbers to assign treatments. For example, suppose we have four experimental units labeled
u1 , u2 , u3 , and u4 and four treatments labeled A, B, C, and D. Given a program or table that provides
random numbers between 0 and 1 (i.e., random samples from a Uniform(0,1) distribution), we
associate numbers between 0 and .25 with treatment A, numbers between .25 and .50 with treatment
B, numbers between .50 and .75 with treatment C, and numbers between .75 and 1 with treatment
D. The first random number selected determines the treatment for u1 . If the first number is .6321,
treatment C is assigned to u1 because .50 < .6321 < .75. If the second random number is .4279,
u2 gets treatment B because .25 < .4279 < .50. If the third random number is .2714, u3 would get
treatment B, but we have already assigned treatment B to u2 , so we throw out the third number. If the
fourth number is .9153, u3 is assigned treatment D. Only one unit and one treatment are left, so u4
gets treatment A. Any reasonable rule (decided ahead of time) can be used to make the assignment
if a random number hits a boundary, e.g., if a random number comes up, say, .2500.
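The assignment rule just described can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `assign_by_random_numbers` is ours, the number of units equals the number of treatments, and the boundary rule (a draw exactly on a boundary goes to the higher interval) is one arbitrary choice of the kind that must be fixed ahead of time.

```python
import random


def assign_by_random_numbers(units, treatments, draws):
    """Assign each treatment to exactly one unit using Uniform(0,1) draws.

    The interval [0, 1) is cut into equal pieces, one per treatment. A draw
    that points to a treatment already in use is thrown out. The last unit
    receives the one treatment that is left. Assumes as many units as
    treatments. Boundary rule (our assumption): a draw on a boundary, such
    as .2500, goes to the higher interval.
    """
    k = len(treatments)
    remaining = list(treatments)
    assignment = {}
    draws = iter(draws)
    for unit in units[:-1]:
        while True:
            u = next(draws)
            trt = treatments[min(int(u * k), k - 1)]
            if trt in remaining:
                break  # usable draw; otherwise discard it and draw again
        assignment[unit] = trt
        remaining.remove(trt)
    assignment[units[-1]] = remaining[0]
    return assignment


def uniform_stream(rng):
    """Endless stream of Uniform(0,1) draws from a seeded generator."""
    while True:
        yield rng.random()


units = ["u1", "u2", "u3", "u4"]
treatments = ["A", "B", "C", "D"]

# The four random numbers used in the text.
text_example = assign_by_random_numbers(
    units, treatments, [0.6321, 0.4279, 0.2714, 0.9153])
print(text_example)

# The same rule driven by a pseudo-random number generator.
print(assign_by_random_numbers(units, treatments,
                               uniform_stream(random.Random(0))))
```

With the numbers from the text, the third draw is discarded because treatment B is already taken, and the result agrees with the assignment worked out above.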
By definition, treatments must be amenable to change. As discussed earlier, things like sex and
race are not capable of change, but in addition many viable treatments cannot be randomly assigned
for social reasons. If we want to know if smoking causes cancer in humans, running an experiment
is difficult. In our society we cannot force some people to smoke a specific amount for a long period
of time and force others not to smoke at all. Nonetheless, we are very interested in whether smoking
causes cancer. What are we to do?
When experiments cannot be run, the other common method for inferring causation is the “What
else could it be?” approach. For smoking, the idea is that we measure everything else that could pos-
sibly be causing cancer and appropriately adjust for those measurements. If, after fitting all of those
variables, smoking still has a significant effect on predicting cancer, then smoking must be caus-
ing the cancer. The catch is that this is extremely difficult to do. How do we even identify, much
less measure, everything else that could be causing cancer? And even if we do measure everything,
how do we know that we have adjusted for those variables appropriately? The key to this argument
is independent replication of the studies! If there are many such observational studies with many
different ideas of what other variables could be causing the effect (cancer) and many ways of ad-
justing for those variables, and if the studies consistently show that smoking remains an important
predictor, at some point it would seem foolish to ignore the possibility that smoking causes cancer.
I have long contended that one cannot infer causation from data analysis. Certainly data analysis
speaks to the relative validity of competing causal models but that is a far cry from actually deter-
mining causation. I believe that causation must be determined by some external argument. I find
randomization to be the most compelling external argument. In “What else can it be?” the external
argument is that all other variables of importance have been measured and appropriately considered.
My contention that data analysis cannot lead to causation may be wrong. I have not devoted
my life to studying causal models. And I know that people study causation by the consideration of
counterfactuals. But for now, I stand by my contention.
Although predictive ability does not imply causation, for many (perhaps most) purposes, pre-
dictive ability is more important. Do we really care why the lights go on when we flip a switch? Or
do we care that our prediction comes true? We probably only care about causation when the lights
stop working. How many people really understand the workings of an automobile? How many can
successfully predict how automobiles will behave?
17.2 Technical design considerations
As a technical matter, the first object in designing an experiment is to construct one that allows for
a valid estimate of σ 2 , the variance of the observations. Without a valid estimate of error, we cannot
know whether the treatment groups are exhibiting any real differences. Obtaining a valid estimate of
error requires appropriate replication of the experiment. Having one observation on each treatment
is not sufficient. All of the basic designs considered in this chapter allow for a valid estimate of the
variance. (In my experience, failure to replicate is the most common sin committed on the television
show Mythbusters.)
The simplest experimental design is the completely randomized design (CRD). With four drug
treatments and observations on eight animals, a valid estimate of the error can be obtained by ran-
domly assigning each of the drugs to two animals. If the treatments are assigned completely at
random to the experimental units (animals), the design is a completely randomized design. The fact
that there are more animals than treatments provides our replication.
It is not crucial that the design be balanced, i.e., it is not crucial that we have the same number
of replications on each treatment. But it is useful to have more than one observation on each treatment to
help check our assumption of equal variances.
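As an illustration of the four-drug, eight-animal CRD, here is a minimal sketch of a completely random assignment. The function name, the animal labels, and the seed are hypothetical; shuffling a list that contains each treatment label twice is one simple way to carry out the randomization, not the only one.

```python
import random


def completely_randomized_design(units, treatments, reps, seed=None):
    """Randomly assign each treatment to `reps` units, ignoring any grouping."""
    labels = [trt for trt in treatments for _ in range(reps)]
    if len(labels) != len(units):
        raise ValueError("need exactly reps * len(treatments) units")
    rng = random.Random(seed)
    rng.shuffle(labels)  # every arrangement of the labels is equally likely
    return dict(zip(units, labels))


# Hypothetical labels for the eight animals.
animals = [f"animal{i}" for i in range(1, 9)]
crd = completely_randomized_design(animals, ["A", "B", "C", "D"], reps=2, seed=1)
for animal, drug in crd.items():
    print(animal, drug)
```

Each drug appears on exactly two animals, which is what provides the replication needed to estimate the error.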
A second important consideration is to construct a design that yields a small variance. A smaller
variance leads to sharper statistical inferences, i.e., narrower confidence intervals and more power-
ful tests. The basic idea is to examine the treatments on homogeneous experimental material. The
people of Bergen, Norway are probably more homogeneous than the people of New York City. It will
be easier to find treatment effects when looking at people from Bergen. Of course the downside is
that we end up with results that apply to the people of Bergen. The results may or may not apply to
the people of New York City.
A fundamental tool for reducing variability is blocking. The people of New York City may be
more variable than the people of Bergen but we might be able to divide New Yorkers into subgroups
that are just as homogeneous as the people of Bergen. With our drugs and animals illustration, a
smaller variance for treatment comparisons is generally obtained when the eight animals consist
of two litters of four siblings and each treatment is applied to one randomly selected animal from
each litter. With each treatment applied in every litter, all comparisons among treatments can be
performed within each litter. Having at least two litters is necessary to get a valid estimate of the
variance of the comparisons. Randomized complete block designs (RCBs): 1) identify blocks of
homogeneous experimental material (units) and 2) randomly assign each treatment to an experimental
unit within each block. The blocks are complete in the sense that each block contains all of
the treatments.
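The two steps of an RCB can be sketched for the litter illustration. This is a hedged example: the function name, the animal labels, and the seed are hypothetical, and the only point is that randomization is carried out separately within each block.

```python
import random


def randomized_complete_block(blocks, treatments, seed=None):
    """Within every block, randomly assign each treatment to one unit."""
    rng = random.Random(seed)
    design = {}
    for block, units in blocks.items():
        if len(units) != len(treatments):
            raise ValueError("a complete block needs one unit per treatment")
        order = list(treatments)
        rng.shuffle(order)  # a fresh randomization for each block
        design[block] = dict(zip(units, order))
    return design


# Two litters of four siblings; the animal labels are made up.
litters = {
    "litter1": ["a1", "a2", "a3", "a4"],
    "litter2": ["b1", "b2", "b3", "b4"],
}
rcb = randomized_complete_block(litters, ["A", "B", "C", "D"], seed=1)
for litter, assignment in rcb.items():
    print(litter, assignment)
```

Because every treatment appears once in every litter, each treatment comparison can be made within a litter.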
The key point in blocking on litters is that, if we randomly assigned treatments to experimental
units without consideration of the litters, our measurements on the treatments would be subject
to all of the litter-to-litter variability. By blocking on litters, we can eliminate the litter-to-litter
variability so that our comparisons of treatments are subject only to the variability within litters
(which, presumably, is smaller). Blocking has completely changed the nature of the variability in
our observations.
The focus of block designs is in isolating groups of experimental units that are homogeneous:
litters, identical twins, plots of ground that are close to one another. If we have three treatments
and four animals to a litter, we can simply not use one animal. If we have five treatments and four
animals to a litter, a randomized complete block experiment becomes impossible.
A balanced incomplete block (BIB) design is one in which every pair of treatments occurs together
in a block the same number of times. For example, if our experimental material consists of
identical twins and we have the drugs A, B, and C, we might give the first set of twins drugs A and
B, the second set B and C, and the third set C and A. Here every pair of treatments occurs together
in one of the three blocks.
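The defining property stated above is easy to check by counting. The sketch below uses hypothetical helper names (`pair_counts`, `pairs_occur_equally`) and tests only the pairwise condition given in the text, applied to the identical twins example.

```python
from itertools import combinations


def pair_counts(blocks):
    """Count how many blocks contain each unordered pair of treatments."""
    counts = {}
    for block in blocks:
        for pair in combinations(sorted(block), 2):
            counts[pair] = counts.get(pair, 0) + 1
    return counts


def pairs_occur_equally(blocks, treatments):
    """True if every pair of treatments shares a block equally often (at least once)."""
    counts = pair_counts(blocks)
    values = {counts.get(pair, 0) for pair in combinations(sorted(treatments), 2)}
    return len(values) == 1 and values != {0}


# Three sets of twins receiving drugs (A, B), (B, C), and (C, A).
twins = [("A", "B"), ("B", "C"), ("C", "A")]
print(pair_counts(twins))
print(pairs_occur_equally(twins, "ABC"))
```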
BIBs do not provide balanced data in our usual sense of the word “balanced” but they do have a
relatively simple analysis. RCBs are balanced in the usual sense. Unfortunately, losing any observa-
tions from either design destroys the balance that they display. Our focus is in analyzing unbalanced
data, so we use techniques for analyzing block designs that do not depend on any form of balance.
The important ideas here are replication and blocking. RCBs and BIBs make very efficient
designs but keeping their balance is not crucial. In olden days, before good computing, the simplicity
of their analyses was important. But simplicity of analysis was never more than a side effect of the
good experimental designs.
Latin squares use two forms of blocking at once. For example, if we suspect that birth order
within the litter might also have an important effect on our results, we continue to take observations
on each treatment within every litter, but we also want to have each treatment observed in every
birth order. This is accomplished by having four litters with treatments arranged in a Latin square
design. Here we are simultaneously blocking on litter and birth order.
Another method for reducing variability is incorporating covariates into the analysis. This topic
is discussed in Section 17.8.
Ideas of blocking can also be useful in observational studies. While one cannot really create
blocks in observational studies, one can adjust for important groupings.
EXAMPLE 17.2.1. If we wish to run an experiment on whether cocaine users are more paranoid
than other people, we may decide that it is important to block on socioeconomic status. This is
appropriate if the underlying level of paranoia in the population differs by socioeconomic status.
Conducting an experiment in this setting is difficult. Given groups of people of various socioeco-
nomic statuses, it is a rare researcher who has the luxury of deciding which subjects will ingest
cocaine and which will not. ✷
The seminal work on experimental design was written by Fisher (1935). It is still well worth
reading. My favorite source on the ideas of experimentation is Cox (1958). The books by Cochran
and Cox (1957) and Kempthorne (1952) are classics. Cochran and Cox is more applied. Kempthorne
is more theoretical. Kempthorne has been supplanted by Hinkelmann and Kempthorne (2008, 2005).
There is a huge literature in both journal articles and books on the general subject of designing ex-
periments. The article by Coleman and Montgomery (1993) is interesting in that it tries to formalize
many aspects of planning experiments that are often poorly specified. Two other useful books are
Cox and Reid (2000) and Casella (2008).
17.3 Completely randomized designs
In a completely randomized design, a group of experimental units is available and the experimenter
randomly assigns treatments to the experimental units. The data consist of a group of observations
on each treatment. Typically, these groups of observations are subjected to a one-way analysis of
variance.
EXAMPLE 17.3.1. In Example 12.4.1, we considered data from Mandel (1972) on the elasticity
measurements of natural rubber made by 7 laboratories. While Mandel did not discuss how the data
were obtained, it could well have been the result of a completely randomized design. For a CRD,
we would need 28 pieces of the type of rubber involved. These should be randomly divided into
7 groups (using a table of random numbers or random numbers generated by a reliable computer
program). The first group of samples is then sent to the first lab, the second group to the second lab,
etc. For a CRD, it is important that a sample is not sent to a lab because the sample somehow seems
appropriate for that particular lab.
Personally, I would also be inclined to send the four samples to a given lab at different times. If
the four samples are sent at the same time, they might be analyzed by the same person, on the same
machines, at the same time. Samples sent at different times might be treated differently. If samples
are treated differently at different times, this additional source of variation should be included in
any predictive conclusions we wish to make about the labs.
When samples sent at different times are treated differently, sending a batch of four samples at
the same time constitutes subsampling. There are two sources of variation to deal with: variation
from time to time and variation within a given time. The values from four samples at a given time
collectively help reduce the effect on treatment comparisons due to variability at a given time, but
samples analyzed at different times are still required if we are to obtain a valid estimate of the
error. In fact, with subsampling, a perfectly valid analysis can be based on the means of the four
subsamples. In our example, such an analysis gives only one ‘observation’ at each time, so the need
for sending samples at more than one time is obvious. If the four samples were sent at the same
time, there would be no replication, hence no estimate of error. Subsection 19.4.1 and Christensen
(2011, Section 9.4) discuss subsampling in more detail. ✷
EXAMPLE 17.3.2. In Chapter 12, we considered suicide age data. A designed experiment would
require that we take a group of people who we know will commit suicide and randomly assign one
of the ethnic groups to the people. Obviously a difficult task. ✷
17.4 Randomized complete block designs
The typical analysis of a randomized complete block design is a two-way ANOVA without
replication or interaction. Except for the experimental design considerations, the analysis is like
that of the Hopper Data from Example 15.3.1. A similar analysis is illustrated below. As with the
Hopper data, block-by-treatment interaction is properly considered to be error. If the treatment
effects are not large enough to be detected above any interaction, then they are not large enough to
be interesting.
EXAMPLE 17.4.1. Inman, Ledolter, Lenth, and Niemi (1992) studied the performance of an
optical emission spectrometer. Table 17.1 gives some of their data on the percentage of manganese
(Mn) in a sample. The data were collected using a sharp counterelectrode tip with the sample to
be analyzed partially covered by a boron nitride disk. Data were collected under three temperature
conditions. Upon fixing a temperature, the sample percentage of Mn was measured using 1) a new
boron nitride disk with light passing through a clean window (new-clean), 2) a new boron nitride
disk with light passing through a soiled window (new-soiled), 3) a used boron nitride disk with light
passing through a clean window (used-clean), and 4) a used boron nitride disk with light passing
through a soiled window (used-soiled). The four conditions, new-clean, new-soiled, used-clean, and
used-soiled are the treatments. The temperature was then changed and data were again collected for
each of the four treatments. A block is always made up of experimental units that are homogeneous.
The temperature conditions were held constant while observations were taken on the four treatments
so the temperature levels identify blocks. Presumably, the treatments were considered in random
order. Christensen (1996) analyzed these data including the data point for Block 3 and used-soiled.
We have dropped that point to illustrate an analysis for unbalanced data.
The two-factor additive-effects model for these data is
$$y_{ij} = \mu + \beta_i + \eta_j + \varepsilon_{ij},$$
[Figure 17.1: Standardized residuals versus Fitted.]
                                               Bonferroni
Parameter        Est        SE(Est)       t         P
η2 − η1       −0.00453     0.005794     −0.78     1.0000
η3 − η1       −0.08253     0.005794    −14.24     0.0002
η4 − η1       −0.07906     0.006691    −11.82     0.0005
η3 − η2       −0.07800     0.005794    −13.46     0.0002
η4 − η2       −0.07452     0.006691    −11.14     0.0006
η4 − η3        0.003478    0.006691      0.5198   1.000
The one missing observation is from treatment 4 so the standard errors that involve treatment 4 are
larger. Although we have different standard errors, the results can be summarized as follows.
New-clean    New-soiled    Used-soiled    Used-clean
   η̂1           η̂2            η̂4            η̂3
    0         −0.00453      −0.07906      −0.08253
The new disk treatments are significantly different from the used disk treatments but the new disk
treatments are not significantly different from each other nor are the used disk treatments signifi-
cantly different from each other. The structure of the treatments suggests an approach to analyzing
the data that will be exploited in the next chapter. Here we used a side condition of η1 = 0 because
it made the estimates readily agree with the table of pairwise comparisons.
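The t statistics and Bonferroni P values in the table of pairwise comparisons can be reproduced from the estimates and standard errors alone. The sketch below is an illustration, not the computation used to produce the table: it takes dfE = 5 and six comparisons as stated in the text, multiplies each two-sided P value by 6, and caps the result at 1. The function name `two_sided_t_pvalue_odd_df` is ours; it relies on the standard closed-form expression for the t distribution with odd degrees of freedom, so no statistical library is assumed.

```python
import math


def two_sided_t_pvalue_odd_df(t, df):
    """Two-sided P value for a t statistic with an odd number of degrees of freedom.

    Uses P(|T| < t) = (2/pi) * [theta + sin(theta) * (cos(theta)
    + (2/3) cos^3(theta) + ... )], with theta = atan(|t| / sqrt(df)) and the
    series ending at the power df - 2.
    """
    if df < 1 or df % 2 == 0:
        raise ValueError("this closed form is for odd df only")
    theta = math.atan(abs(t) / math.sqrt(df))
    s, c = math.sin(theta), math.cos(theta)
    total, term = 0.0, c
    for m in range((df - 1) // 2):
        if m > 0:
            term *= (2 * m) / (2 * m + 1) * c * c
        total += term
    inside = (2.0 / math.pi) * (theta + s * total)
    return max(0.0, 1.0 - inside)


# (parameter, estimate, standard error) copied from the table of comparisons.
rows = [
    ("eta2 - eta1", -0.00453, 0.005794),
    ("eta3 - eta1", -0.08253, 0.005794),
    ("eta4 - eta1", -0.07906, 0.006691),
    ("eta3 - eta2", -0.07800, 0.005794),
    ("eta4 - eta2", -0.07452, 0.006691),
    ("eta4 - eta3", 0.003478, 0.006691),
]
DFE = 5              # error degrees of freedom reported in the text
n_tests = len(rows)  # six pairwise comparisons

results = {}
for name, est, se in rows:
    t = est / se
    p_bonf = min(1.0, n_tests * two_sided_t_pvalue_odd_df(t, DFE))
    results[name] = (t, p_bonf)
    print(f"{name}: t = {t:.2f}, Bonferroni P = {p_bonf:.4f}")
```

The comparisons involving treatment 4 use the larger standard error, which is why their t statistics are somewhat smaller in magnitude for similar estimates.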
Table 17.2 contains an F test for blocks. In a true blocking experiment, there is not much interest
in testing whether block means are different. After all, one chooses the blocks so that they have
different means. Nonetheless, the F statistic MSBlks/MSE is of some interest because it indicates
how effective the blocking was, i.e., it indicates how much the variability was reduced by blocking.
For this example, MSBlks is 63 times larger than MSE, indicating that blocking was definitely
worthwhile. In our model for block designs, there is no reason not to test for blocks, but some
models used for block designs do not allow a test for blocks.
Residual plots for the data are given in Figures 17.1 through 17.4. Figure 17.1 is a plot of
the residuals versus the predicted values. Figure 17.2 plots the residuals versus indicators of the
treatments. While the plot looks something like a bow tie, I am not overly concerned. Figure 17.3
contains a plot of the residuals versus indicators of blocks. The residuals look pretty good. From
Figure 17.4, the residuals look reasonably normal. In the normal plot there are 11 residuals but the
analysis has only 5 degrees of freedom for error. If we want to do a W′ test for normality, we might
use a sample size of 11 and compare the value W′ = 0.966 to W′(α, 11), but it may be appropriate
to use the dfE as the sample size for the test and use W′(α, 5).
[Figure 17.2: Residual−Treatment plot; standardized residuals versus Treatments.]
[Figure 17.3: Residual−Block plot; standardized residuals versus Blocks.]
The leverages (not shown) are all reasonable. The largest t residual is −3.39 for Block 2, Treatment 1,
which gives a Bonferroni adjusted P value of 0.088. ✷
[Figure 17.4: Normal Q−Q Plot; standardized residuals versus Theoretical Quantiles.]
An interesting special case of complete block data is paired comparison data as discussed in Sec-
tion 4.1. In paired comparison data, there are two treatments to contrast and each pair constitutes a
complete block.
This is exactly the same t statistic as obtained in Section 4.1. The reference distribution is t(26),
again exactly the same. The analysis of variance F statistic is just the square of $t_{obs}$ and gives
equivalent results for two-sided tests. Confidence intervals for the difference in means are also
exactly the same in the blocking analysis and the paired comparison analysis. The one real difference
between this analysis and the analysis of Section 4.1 is that this analysis provides an indication of
whether the effort used to account for pairing was worthwhile. In this case, with a P value of 0.006,
it was worthwhile to account for pairing. ✷
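The agreement between the two analyses can be seen numerically. The sketch below uses made-up data (five hypothetical pairs, not the data of Section 4.1) and hypothetical function names; it computes the paired t statistic from the differences and the treatment F statistic from the additive two-way model with pairs as blocks, and confirms that F equals the square of t.

```python
import math


def paired_t(y1, y2):
    """Paired comparison t statistic computed from the within-pair differences."""
    d = [a - b for a, b in zip(y1, y2)]
    n = len(d)
    dbar = sum(d) / n
    s2 = sum((x - dbar) ** 2 for x in d) / (n - 1)
    return dbar / math.sqrt(s2 / n)


def block_anova_f(y1, y2):
    """F statistic for treatments in the additive model with pairs as blocks."""
    n = len(y1)
    grand = (sum(y1) + sum(y2)) / (2 * n)
    trt_means = (sum(y1) / n, sum(y2) / n)
    blk_means = [(a + b) / 2 for a, b in zip(y1, y2)]
    ss_trt = n * sum((m - grand) ** 2 for m in trt_means)
    sse = 0.0
    for (a, b), blk in zip(zip(y1, y2), blk_means):
        for y, trt in ((a, trt_means[0]), (b, trt_means[1])):
            # Balanced additive model: residual = y - trt mean - block mean + grand mean
            sse += (y - trt - blk + grand) ** 2
    mse = sse / (n - 1)  # dfE = (n - 1)(2 - 1)
    return ss_trt / mse  # treatments have 1 degree of freedom


# Hypothetical measurements on five pairs.
y1 = [12.1, 10.4, 13.3, 9.8, 11.0]
y2 = [11.2, 10.1, 12.0, 9.9, 10.3]
t_obs = paired_t(y1, y2)
f_obs = block_anova_f(y1, y2)
print(t_obs ** 2, f_obs)
```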
EXAMPLE 17.5.1. Mercer and Hall (1911) and Fisher (1925, Section 49) consider data on the
weights of mangold roots. They used a Latin square design with 5 rows, columns, and treatments.
The rectangular field on which the experiment was run was divided into five rows and five columns.
This created 25 plots, arranged in a square, on which to apply the treatments A, B, C, D, and E.
Each row of the square was viewed as a block, so every treatment was applied in every row. The
unique feature of Latin square designs is that there is a second set of blocks. Every column was
also considered a block, so every treatment was also applied in every column. The data are given in
Table 17.4, arranged by rows and columns with the treatment given in the appropriate place and the
observed root weight given in parentheses.
Table 17.5 contains the analysis of variance table including the analysis of variance F test for the
null hypothesis that the effects are the same for every treatment. The F statistic MSTrts/MSE is very
small, 0.56, so there is no evidence that the treatments behave differently. Blocking on columns was
not very effective as evidenced by the F statistic of 1.20, but blocking on rows was very effective,
F = 7.25.
Many experimenters are less than thrilled when told that there is no evidence for their treatments
having any differential effects. Inspection of the table of coefficients (not given) leads to an obvious
conclusion that most of the treatment differences are due to the fact that treatment D has a much
larger effect than the others, so we look at this a bit more.
We created a new factor variable called “Contrast” that has the same code for all of treatments
A, B, C, E but a different code for D. Fitting a model with Columns and Rows but Contrast in lieu
of Treatments gives the ANOVA table in Table 17.6. The ANOVA table F statistic for Contrast is
295.8/119.2 = 2.48 with a P value of 0.136. It provides a test of whether treatment D is different
from the other treatments, when the other treatments are taken to have identical effects. Using our
best practice, we would actually compute the F statistic with the MSE from Table 17.5 in the
denominator giving Fobs = 295.8/146.2 = 2.02, which looks even less significant. This contrast
was chosen by looking at the data so as to appear as significant as possible and yet it still has a large
P value. Testing the two models against each other by using Tables 17.5 and 17.6 provides a test
of whether there are any differences among treatments A, B, C, and E. The F statistic of 0.08 is
so small that it would be suspiciously small if it had not been chosen, by looking at the data, to be
small.
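The three F statistics quoted in this discussion can be recovered from the reported mean squares. This is a sketch under an assumption that is ours rather than read from Tables 17.5 and 17.6: in a 5 × 5 Latin square the full model has (5 − 1)(5 − 2) = 12 degrees of freedom for error, and replacing the 4 degrees of freedom for Treatments with 1 degree of freedom for Contrast leaves 15. The error sums of squares are rebuilt from the rounded mean squares, so the last digit is only approximate.

```python
# Mean squares quoted in the text.
MS_CONTRAST = 295.8
MSE_CONTRAST_MODEL = 119.2  # MSE when Contrast replaces Treatments
MSE_FULL = 146.2            # MSE from the model with Treatments

# Error degrees of freedom (our assumption, derived from the 5 x 5 square).
r = 5
DFE_FULL = (r - 1) * (r - 2)       # 12
DFE_CONTRAST = DFE_FULL + (r - 2)  # 15

# Test of D versus the rest, using each model's own MSE and then the full-model MSE.
f_own_mse = MS_CONTRAST / MSE_CONTRAST_MODEL
f_full_mse = MS_CONTRAST / MSE_FULL

# Test of the Contrast model against the full model: differences among A, B, C, E.
sse_full = MSE_FULL * DFE_FULL
sse_contrast = MSE_CONTRAST_MODEL * DFE_CONTRAST
f_rest = ((sse_contrast - sse_full) / (DFE_CONTRAST - DFE_FULL)) / MSE_FULL

print(round(f_own_mse, 2), round(f_full_mse, 2), round(f_rest, 2))
```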
The standard residual plots were given in Christensen (1996). They look quite good.
If these data were unbalanced, i.e., if we lost some observations, it would be important to look
at an ANOVA table that fits Treatments after both Columns and Rows. Fitted in the current order,
the F test for Rows indicates that blocking on rows after blocking on Columns was worthwhile but
the F test for Columns indicates that blocking on Columns alone would have been a waste of time.
In an unbalanced experiment, if we cared enough, we might fit Columns after Rows to see whether
blocking on Columns was a complete waste of time. Because the data are balanced, the two tests
for Columns are the same and we can safely say from Table 17.5 that blocking on Columns was a
waste of time. ✷
The parameter $\mu$ is viewed as a grand mean, $\kappa_i$ is an effect for the ith column, $\rho_j$ is an effect for the
jth row, and $\tau_k$ is an effect for the kth treatment. The subscripting for this model is peculiar. All of
the subscripts run from 1 to r but not freely. If we specify a row and a column, the design tells us
the treatment. Thus, if we know j and i, the design tells us k. If we specify a row and a treatment,
the design tells us the column, so j and k dictate i. In fact, if we know any two of the subscripts,
the design tells us the third.
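The claim that any two subscripts determine the third can be illustrated with a minimal Python sketch. The cyclic 4 × 4 square used here is a hypothetical example of ours, not a design from the text, and the indices start at 0 rather than 1.

```python
r = 4
# Hypothetical cyclic Latin square: the treatment in column i, row j is (i + j) mod r.
cells = [(i, j, (i + j) % r) for i in range(r) for j in range(r)]

# Each lookup table maps a pair of subscripts to the remaining one.
treatment_of = {(i, j): k for i, j, k in cells}   # column, row -> treatment
column_of = {(j, k): i for i, j, k in cells}      # row, treatment -> column
row_of = {(i, k): j for i, j, k in cells}         # column, treatment -> row

# Every pair of subscripts occurs exactly once, so each table has r * r keys.
print(len(treatment_of), len(column_of), len(row_of))
```

If some pair of subscripts occurred twice, a later cell would overwrite an earlier one and the corresponding table would have fewer than r² entries.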
where the numbers 1 through 4 are randomly assigned to the four people who will operate the
machines and the letters A through D are randomly assigned to the machines to be examined. More-
over, the days of the week should actually be randomly assigned to the rows of the Latin square. In
general, the rows, columns, and treatments should all be randomized in a Latin square.
Another distinct Latin square design for this situation is
Operator
Day 1 2 3 4
Mon A B C D
Tue B A D C
Wed C D B A
Thu D C A B
This square cannot be obtained from the first one by any interchange of rows, columns, and treat-
ments. Typically, one would randomly choose a possible Latin square design from a list of such
squares (see, for example, Cochran and Cox, 1957) in addition to randomly assigning the numbers,
letters, and rows to the operators, machines, and days.
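The defining property of a Latin square, that each treatment appears once in every row and once in every column, is easy to check mechanically. The following Python sketch (the function name is ours) applies the check to the second square shown above.

```python
def is_latin_square(square):
    """Return True if every symbol appears exactly once in each row and column."""
    n = len(square)
    symbols = set(square[0])
    if len(symbols) != n:
        return False
    # Every row must have length n and contain all n symbols.
    if not all(len(row) == n and set(row) == symbols for row in square):
        return False
    # Every column must contain all n symbols as well.
    return all({row[j] for row in square} == symbols for j in range(n))

# Rows are Mon through Thu; columns are operators 1 through 4.
square = [list("ABCD"), list("BADC"), list("CDBA"), list("DCAB")]
print(is_latin_square(square))
```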
The use of Latin square designs can be extended in numerous ways. One modification is the
incorporation of a third kind of block; such designs are called Graeco-Latin squares. The use of
Graeco-Latin squares is explored in the exercises for this chapter. A problem with Latin squares is
that small squares give poor variance estimates because they provide few degrees of freedom for
error. For example, a 3 × 3 Latin square gives only 2 degrees of freedom for error. In such cases, the
Latin square experiment is often performed several times, giving additional replications that provide
improved variance estimation. Section 18.6 presents an example in which several Latin squares are
used.
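The count of error degrees of freedom follows from the model: an r × r square has r² − 1 total degrees of freedom, and rows, columns, and treatments each use r − 1. A short Python sketch of that bookkeeping (the function name is ours):

```python
def latin_square_error_df(r):
    """Error df for a single r x r Latin square: total minus rows, columns, treatments."""
    total_df = r * r - 1
    return total_df - 3 * (r - 1)

for r in range(3, 7):
    print(r, latin_square_error_df(r))
```

The result equals (r − 1)(r − 2), which is why very small squares leave so little information for estimating the variance.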
EXAMPLE 17.6.1. A simple balanced incomplete block design is given below for four treatments
A, B, C, D in four blocks of three units each.
17.6 BALANCED INCOMPLETE BLOCK DESIGNS 409
Block Treatments
1 A B C
2 B C D
3 C D A
4 D A B
Note that every pair of treatments occurs together in the same block exactly λ = 2 times. Thus,
for example, the pair A, B occurs in blocks 1 and 4. There are b = 4 blocks each containing k = 3
experimental units. There are t = 4 treatments and each treatment is observed r = 3 times. ✷
There are two relationships that must be satisfied by the numbers of blocks, b, units per block, k,
treatments, t, replications per treatment, r, and λ . Recall that λ is the number of times two treatments
occur together in a block. First, the total number of observations is the number of blocks times the
number of units per block, but the total number of observations is also the number of treatments
times the number of replications per treatment, thus
bk = rt.
The other key relationship in balanced incomplete block designs involves the number of compar-
isons that can be made between a given treatment and the other treatments within the same block.
Again, there are two ways to count this. The number of comparisons is the number of other treat-
ments, t − 1, multiplied by the number of times each other treatment is in the same block as the
given treatment, λ . Alternatively, the number of comparisons within blocks is the number of other
treatments within each block, k − 1, times the number of blocks in which the given treatment occurs,
r. Thus we have
(t − 1)λ = r(k − 1).
In Example 17.6.1, these relationships reduce to
(4)3 = 3(4)
and
(4 − 1)2 = 3(3 − 1).
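Both counting relationships can be verified directly from a list of blocks. The sketch below is a minimal Python illustration for the design of Example 17.6.1. The function name and the way the blocks are stored are our own choices.

```python
from collections import Counter
from itertools import combinations

def bib_parameters(blocks):
    """Return (t, b, k, r, lam) for a balanced incomplete block design.

    Raises ValueError if block sizes, replications, or pair counts are unequal.
    """
    sizes = {len(blk) for blk in blocks}
    reps = Counter(trt for blk in blocks for trt in blk)
    pairs = Counter(p for blk in blocks for p in combinations(sorted(blk), 2))
    t = len(reps)
    if len(sizes) != 1 or len(set(reps.values())) != 1:
        raise ValueError("unequal block sizes or replications")
    if len(pairs) != t * (t - 1) // 2 or len(set(pairs.values())) != 1:
        raise ValueError("pairs of treatments do not occur together equally often")
    return t, len(blocks), sizes.pop(), reps.most_common(1)[0][1], pairs.most_common(1)[0][1]

blocks = ["ABC", "BCD", "CDA", "DAB"]
t, b, k, r, lam = bib_parameters(blocks)
print(t, b, k, r, lam)
print(b * k == r * t, (t - 1) * lam == r * (k - 1))
```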
The nice thing about balanced incomplete block designs is that the theory behind them works
out so simply that the computations can all be done on a hand calculator. I know, I did it once; see
Christensen (2011, Section 9.4). But once was enough for this lifetime! We will rely on a computer
program to provide the computations. We illustrate the techniques with an example.
EXAMPLE 17.6.2. John (1961) reported data on the number of dishes washed prior to losing
the suds in the wash basin. Dishes were soiled in a standard way and washed one at a time. Three
operators and three basins were available for the experiment, so at any one time only three treatments
could be applied. Operators worked at the same speed, so no effect for operators was necessary nor
should there be any effect due to basins. Nine detergent treatments were evaluated in a balanced
incomplete block design. The treatments and numbers of dishes washed are given in Table 17.7.
There were b = 12 blocks with k = 3 units in each block. Each of the t = 9 treatments was replicated
r = 4 times. Each pair of treatments occurred together λ = 1 time. The three treatments assigned
to a block were randomly assigned to basins as were the operators. The blocks were run in random
order.
The analysis of variance is given in Table 17.8. The F test for treatment effects is clearly signif-
icant. We now need to examine contrasts in the treatments.
The treatments were constructed with a structure that leads to interesting effects. Treatments
A, B, C, and D all consisted of detergent I using, respectively, 3, 2, 1, and 0 doses of an additive.
Similarly, treatments E, F, G, and H used detergent II with 3, 2, 1, and 0 doses of the additive.
410 17. BASIC EXPERIMENTAL DESIGNS
Table 17.7: Balanced incomplete block design investigating detergents; data are numbers of dishes washed.
Block Treatment, Observation
1 A, 19 B, 17 C, 11
2 D, 6 E, 26 F, 23
3 G, 21 H, 19 J, 28
4 A, 20 D, 7 G, 20
5 B, 17 E, 26 H, 19
6 C, 15 F, 23 J, 31
7 A, 20 E, 26 J, 31
8 B, 16 F, 23 G, 21
9 C, 13 D, 7 H, 20
10 A, 20 F, 24 H, 19
11 B, 17 D, 6 J, 29
12 C, 14 E, 24 G, 21
Treatment J was a control. We return to this example for a more detailed analysis of the treatments
in the next chapter.
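For readers who want to see the kind of computation the computer program performs, here is a hedged Python sketch for the data of Table 17.7. It uses the standard intrablock formulas for a BIB (adjusted treatment totals $Q_j$ and treatment sum of squares $k\sum_j Q_j^2/(\lambda t)$), which are not spelled out in this section. Treat it as an illustration under that assumption rather than as the book's own algorithm.

```python
from collections import defaultdict
from itertools import combinations

# Table 17.7: each block lists (treatment, dishes washed).
blocks = [
    [("A", 19), ("B", 17), ("C", 11)], [("D", 6), ("E", 26), ("F", 23)],
    [("G", 21), ("H", 19), ("J", 28)], [("A", 20), ("D", 7), ("G", 20)],
    [("B", 17), ("E", 26), ("H", 19)], [("C", 15), ("F", 23), ("J", 31)],
    [("A", 20), ("E", 26), ("J", 31)], [("B", 16), ("F", 23), ("G", 21)],
    [("C", 13), ("D", 7), ("H", 20)], [("A", 20), ("F", 24), ("H", 19)],
    [("B", 17), ("D", 6), ("J", 29)], [("C", 14), ("E", 24), ("G", 21)],
]

b, k = len(blocks), len(blocks[0])
y = [obs for blk in blocks for _, obs in blk]
n = len(y)
correction = sum(y) ** 2 / n

trt_total = defaultdict(float)        # T_j: total of the observations on treatment j
reps = defaultdict(int)               # number of blocks containing treatment j
blk_sum_for_trt = defaultdict(float)  # sum of totals of the blocks containing treatment j
pair_count = defaultdict(int)         # times each pair shares a block
for blk in blocks:
    blk_total = sum(obs for _, obs in blk)
    for trt, obs in blk:
        trt_total[trt] += obs
        reps[trt] += 1
        blk_sum_for_trt[trt] += blk_total
    for pair in combinations(sorted(trt for trt, _ in blk), 2):
        pair_count[pair] += 1

t, r, lam = len(trt_total), reps["A"], pair_count[("A", "B")]

ss_total = sum(v * v for v in y) - correction
ss_blocks = sum(sum(obs for _, obs in blk) ** 2 for blk in blocks) / k - correction
# Adjusted treatment totals: Q_j = T_j - (sum of block totals containing j) / k.
q = {trt: trt_total[trt] - blk_sum_for_trt[trt] / k for trt in trt_total}
ss_trt_adj = k * sum(v * v for v in q.values()) / (lam * t)
ss_error = ss_total - ss_blocks - ss_trt_adj
df_error = (n - 1) - (b - 1) - (t - 1)

print(t, b, k, r, lam)
print(round(ss_blocks, 3), round(ss_trt_adj, 3), round(ss_error, 3), df_error)
print(round(ss_error / df_error, 3))  # MSE
```

The first printed line recovers the design constants stated in the example, and the error degrees of freedom agree with the 16 quoted in the discussion of the normal plot.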
As always, we need to evaluate our assumptions. The normal plot looks less than thrilling but
is not too bad. The fifth percentile of W′ for 36 observations is .940, whereas the observed value is
.953. Alternatively, the residuals have only 16 degrees of freedom and W′(.95, 16) = .886. The data
are counts, so a square root or log transformation might be appropriate, but we continue with the
current analysis. A plot of standardized residuals versus predicted values looks good.
Table 17.9 contains diagnostic statistics for the example. Note that the leverages are all identical
for the BIB design. Some of the standardized deleted residuals (ts) are near 2 but none are so large
as to indicate an outlier. The Cook’s distances bring to one’s attention exactly the same points as the
standardized residuals and the ts. ✷
The data in Exercises 14.5.1, 14.5.3, and 16.4.3 were all balanced incomplete block designs.
Note that in those exercises we specifically indicated that block-by-treatment interactions should
not be entertained.
Balanced lattice designs are BIBs with $t = k^2$, $r = k + 1$, and $b = k(k + 1)$. Table 17.10 gives an
example for k = 3. These designs can be viewed as k + 1 squares in which each treatment occurs
once. Each row of a square is a block, each block contains k units, there are k rows in a square, so
all of the $t = k^2$ treatments can appear in each square. To achieve a BIB, k + 1 squares are required,
so there are r = k + 1 replications of each treatment. With k + 1 squares and k blocks (rows) per
square, there are b = k(k + 1) blocks. The analysis follows the standard form for a BIB. In fact, the
design in Example 17.6.2 is a balanced lattice with k = 3.
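As a check on these counts (a short derivation we add here, using only the two BIB relationships given earlier in this section), substituting the balanced lattice values into $bk = rt$ and $(t-1)\lambda = r(k-1)$ gives

\[
bk = k(k+1)\,k = (k+1)\,k^2 = rt,
\qquad
\lambda = \frac{r(k-1)}{t-1} = \frac{(k+1)(k-1)}{k^2-1} = 1 .
\]

So in a balanced lattice every pair of treatments occurs together in exactly one block. With k = 3 this gives t = 9, r = 4, b = 12, and λ = 1, the values reported in Example 17.6.2.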
Youden squares are a generalization of BIBs that allows a second form of blocking and a very
similar analysis. These designs are discussed in the next section.
k × k squares. Table 17.15 gives an example for k = 3. If k is odd, one can typically get by with
(k + 1)/2 squares. If k is even, k + 1 squares are generally needed.
Data are frequently collected with the intention of evaluating a change in the current system of
doing things. If we really want to know the effect of a change in the system, we have to execute
the change. It is not enough to look at conditions in the past that were similar to the proposed
change because, along with the past similarities, there were dissimilarities. For example, suppose
we think that instituting a good sex education program in schools will decrease teenage pregnancies.
To evaluate this, it is not enough to compare schools that currently have such programs with schools
that do not, because along with the differences in sex education programs there are other differences
in the schools that affect teen pregnancy rates. Such differences may include parents’ average socio-
economic status and education. While adjustments can be made for any such differences that can
be identified, there is no assurance that all important differences can be found. Moreover, initiating
the proposed program involves making a change and the very act of change can affect the results.
For example, current programs may exist and be effective because of the enthusiasm of the school
staff that initiated them. Such enthusiasm is not likely to be duplicated when the new program is
mandated from above.
To establish the effect of instituting a sex education program in a population of schools, we
really need to (randomly) choose schools and actually institute the program. The schools at which
the program is instituted should be chosen randomly, so no (unconscious) bias creeps in due to
the selection of schools. For example, the people conducting the investigation are likely to favor
or oppose the project. They could (perhaps unconsciously) choose the schools in such a way that
makes the evaluation likely to reflect their prior attitudes. Unconscious bias occurs frequently and
should always be assumed. Other schools without the program should be monitored to establish a
base of comparison. These other schools should be treated as similarly as possible to the schools
with the new program. For example, if the district school administration or the news media pay a lot
of attention to the schools with the new program but ignore the other schools, we will be unable to
distinguish the effect of the program from the effect of the attention. In addition, blocking similar
schools together can improve the precision of the experimental results.
One of the great difficulties in learning about human populations is that obtaining the best data
often requires morally unacceptable behavior. We object to having our lives randomly changed for
the benefit of experimental science and typically the more important the issue under study, the more
we object to such changes. Thus we find that in studying humans, the best data available are often
historical. In our example we might have to accept that the best data available will be an historical
record of schools with and without sex education programs. We must then try to identify and adjust
for all differences in the schools that could potentially affect our conclusions. It is the extreme
difficulty of doing this that leads to the relative unreliability of many studies in the social sciences.
On the other hand, it would be foolish to give up the study of interesting and important phenomena
just because they are difficult to study.
In one-sample, two-sample, and one-way ANOVA problems, we assume that we have random sam-
ples from various populations. In more sophisticated models we continue to assume that at least the
errors are a random sample from a $N(0, \sigma^2)$ population. The statistical inferences we draw are valid
for the populations that were sampled. Often it is not clear what the sampled populations are. What
are the populations from which the Albuquerque suicide ages were sampled? Presumably, our data
were all of the suicides reported in 1978 for these ethnic groups.
When we analyze data, we assume that the measurements are subject to errors and that the
errors are consistent with our models. However, the populations from which these samples are taken
may be nothing more than mental constructs. In such cases, it requires extrastatistical reasoning to
justify applying the statistical conclusions to whatever issues we really wish to address. Moreover,
the desire to predict the future underlies virtually all studies and, unfortunately, one can never be
sure that data collected now will apply to the conditions of the future. So what can we do? Only our
best. We can try to make our data as relevant as possible to our anticipation of future conditions. We
can try to collect data for which the assumptions will be reasonably true. We can try to validate our
assumptions. Studies in which it is not clear that the data are random samples from the population
of immediate interest are often called analytic studies.
About the only time one can be really sure that statistical conclusions apply directly to the
population of interest is when one has control of the population of interest. If we have a list of all
the elements in the population, we can choose a random sample from the population. Of course,
choosing a random sample is still very different from obtaining a random sample of observations.
Without control or total cooperation, we may not be able to take measurements on the sample.
(Even when we can find people that we want for a sample, many will not submit to a measurement
process.) Studies in which one can arrange to have the assumptions met are often called enumerative
studies. See Hahn and Meeker (1993) and Deming (1986) for additional discussion of these issues.
17.10 Exercises
EXERCISE 17.10.1. Snedecor (1945b) presented data on a spray for killing adult flies as they
emerged from a breeding medium. The data were numbers of adults found in cages that were set
over the medium containers. The treatments were different levels of the spray’s active ingredient,
namely 0, 4, 8, and 16 units. (Actually, it is not clear whether a spray with 0 units was actually
applied or whether no spray was applied. The former might be preferable.) Seven different sources
for the breeding mediums were used and each spray was applied on each distinct breeding medium.
The data are presented in Table 17.16.
(a) Identify the design for this experiment and give an appropriate model. List all the assumptions
made in the model.
(b) Analyze the data. Give an appropriate analysis of variance table. Compare the treatment with no
active ingredient to the average of the three treatments that contain the active ingredient. Ignoring
the treatment with no active ingredient, the other three treatments are quantitative levels of the
active ingredient. On the log scale, these levels are equally spaced.
(c) Check the assumptions of the model and adjust the analysis appropriately.
EXERCISE 17.10.2. Cornell (1988) considered data on scaled thickness values for five formulations
of vinyl designed for use in automobile seat covers. Eight groups of material were prepared.
The production process was then set up and the five formulations run with the first group. The pro-
duction process was then reset and another group of five was run. In all, the production process was
set eight times and a group of five formulations was run with each setting. The data are displayed in
Table 17.17.
(a) From the information given, identify the design for this experiment and give an appropriate
model. List all the assumptions made in the model.
(b) Analyze the data. Give an appropriate analysis of variance table. Examine appropriate contrasts
using the Bonferroni method with an α of about .05.
(c) Check the assumptions of the model and adjust the analysis appropriately.
EXERCISE 17.10.3. In data related to that of the previous problem, Cornell (1988) has scaled
thickness values for vinyl under four different process conditions. The process conditions were A,
high rate of extrusion, low drying temperature; B, low rate of extrusion, high drying temperature; C,
low rate of extrusion, low drying temperature; D, high rate of extrusion, high drying temperature.
An initial set of data with these conditions was collected and later a second set was obtained. The
data are given below.
Treatments
A B C D
Rep 1 7.8 11.0 7.4 11.0
Rep 2 7.6 8.8 7.0 9.2
Identify the design, give the model, check the assumptions, give the analysis of variance table and
interpret the F test for treatments. The structure of the treatments suggests looking at average rates,
average temperatures, and interaction between rates and temperatures.
EXERCISE 17.10.4. Johnson (1978) and Mandel and Lashof (1987) present data on measurements
of $P_2O_5$ (phosphorus pentoxide) in fertilizers. Table 17.18 presents data for five fertilizers,
each analyzed in five labs. Our interest is in differences among the labs. Analyze the data.
EXERCISE 17.10.5. Table 17.19 presents data on yields of cowpea hay. Four treatments are of
interest, variety I of hay was planted 4 inches apart (I4), variety I of hay was planted 8 inches apart
(I8), variety II of hay was planted 4 inches apart (II4), and variety II of hay was planted 8 inches
apart (II8). Three blocks of land were each divided into four plots and one of the four treatments
was randomly applied to each plot. These data are actually a subset of a larger data set given by
Snedecor and Cochran (1980, p. 309) that involves three varieties and three spacings in four blocks.
Analyze the data. Check your assumptions. Examine appropriate contrasts.
EXERCISE 17.10.6. In the study of the optical emission spectrometer discussed in Example 17.4.1
and Table 17.1, the target value for readings was 0.89. Subtract 0.89 from each observation
and repeat the analysis. What new questions are of interest? Which aspects of the analysis
have changed and which have not?
EXERCISE 17.10.8. Table 17.21 contains data similar to that in the previous exercise except
that in this Latin square differences among four machines: 1, 2, 3, 4, were investigated rather than
differences among operators. Machines 1 and 2 were operated with a hand lever, while machines
3 and 4 were operated with a foot lever. Construct an analysis of variance table. What, if any,
differences can be established among the machines?
EXERCISE 17.10.9. Table 17.21 is incomplete. The data were actually obtained from a Graeco-Latin
square that incorporates four different operators as well as the four different machines. The
correct design is given in Table 17.22. Note that this is a Latin square for machines when we ignore
the operators and a Latin square for operators when we ignore the machines. Moreover, every op-
erator works once with every machine. Give the new analysis of variance table. How do the results
on machines change? What evidence is there for differences among operators? Was the analysis for
machines given earlier incorrect or merely inefficient?
EXERCISE 17.10.10. Table 17.23 presents data from Nelson (1993) on disk drives from
a Graeco-Latin square design (see Exercise 17.10.9). The experiment was planned to investigate
the effect of four different substrates on the drives. The dependent variable is the amplitude of a
signal read from the disk where the signal written onto the disk had a fixed amplitude. Blocks were
constructed from machines, operators, and day of production. (In Table 17.23, days are indicated by
lower case Latin letters.) The substrata consist of A, aluminum; B, nickel-plated aluminum; and two
types of glass, C and D. Analyze the data. In particular, check for differences between aluminum
and glass, between the two types of glass, and between the two types of aluminum. Check your
assumptions.
EXERCISE 17.10.11. George Snedecor (1945a) asked for the appropriate variance estimate in
the following problem. One of six treatments was applied to the 10 hens contained in each of 12
cages. Each treatment was randomly assigned to two cages. The data were the number of eggs laid
by each hen.
(a) What should you tell Snedecor? Were the treatments applied to the hens or to the cages? How
will the analysis differ depending on the answer to this question?
EXERCISE 17.10.12. The data in Exercises 14.5.1, 14.5.3, and 16.4.3 were all balanced incomplete
block designs. Determine the values of t, r, b, k, and λ for each experiment.
Chapter 18
Factorial Treatments
Factorial treatment structures are simply an efficient way of defining the treatments used in an ex-
periment. They can be used with any of the standard experimental designs discussed in Chapter 17.
Factorial treatment structures have two great advantages: they give information that is not readily
available from other methods and they use experimental material very efficiently. Section 18.1 intro-
duces factorial treatment structures with an examination of treatments that involve two factors. Sec-
tion 18.2 illustrates the analysis of factorial structures on the data of Example 17.4.1. Section 18.3
addresses some modeling issues involved with factorial structures. Section 18.4 looks at modeling
interaction in the context of a designed experiment. Section 18.5 looks at a treatment structure that
is slightly more complicated than factorial structure. Section 18.6 examines extensions of the Latin
square designs that were discussed in Section 17.5.
experiments involve the use of 40 people, the factorial experiment involves the use of only 20 peo-
ple, yet the factorial experiment contains just as much information about both alcohol effects and
sleeping pill effects as the two separate experiments. The effect of alcohol can be studied by con-
trasting the 5 a0 s0 people with the 5 a1 s0 people and also by comparing the 5 a0 s1 people with the 5
a1 s1 people. Thus we have a total of 10 no-alcohol people to compare with 10 alcohol people, just
as we had in the separate experiment for alcohol. Recall that with no interaction, the effect of factor
A is the same regardless of the dose of factor S, so we have 10 valid comparisons of the effect of
alcohol. A similar analysis shows that we have 10 no-sleeping-pill people to compare with 10 peo-
ple using sleeping pills, the same as in the separate experiment for sleeping pills. Thus, when there
is no interaction, the 20 people in the factorial experiment are as informative about the effects of
alcohol and sleeping pills as the 40 people in the two separate experiments. Moreover, the factorial
experiment provides information about possible interactions between the factors that is unavailable
from the separate experiments.
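The head count in this argument can be tallied in a few lines of Python. The sketch assumes, as in the text, 5 people on each of the four treatment combinations. The level codes 0 and 1 and the variable names are ours.

```python
from collections import Counter
from itertools import product

people_per_treatment = 5
# Each treatment is a pair (alcohol level, sleeping pill level).
treatments = list(product((0, 1), repeat=2))

alcohol = Counter()
pills = Counter()
for a, s in treatments:
    alcohol[a] += people_per_treatment
    pills[s] += people_per_treatment

total_people = people_per_treatment * len(treatments)
print(total_people, dict(alcohol), dict(pills))
```

Every person is counted once in the alcohol comparison and once in the sleeping pill comparison, which is where the saving over two separate experiments comes from.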
The factorial treatment concept involves only the definition of the treatments. Factorial treat-
ment structure can be used in any design, e.g., completely randomized designs, randomized block
designs, and in Latin square designs. All of these designs allow for arbitrary treatments, so the
treatments can be chosen to have factorial structure.
Experiments involving factorial treatment structures are often referred to as factorial experi-
ments or factorial designs. A useful notation for factorial experiments identifies the number of fac-
tors and the number of levels of each factor. For example, the alcohol–sleeping pill experiment has
4 treatments because there are 2 levels of alcohol times 2 levels of sleeping pills. This is described
as a 2 × 2 factorial experiment. If we had 3 levels of alcohol and 4 doses (levels) of sleeping pills
we would have a 3 × 4 experiment involving 12 treatments.
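The notation can be mirrored in code by listing every combination of factor levels. A minimal Python sketch (the function name and the 0-based level codes are ours):

```python
from itertools import product

def factorial_treatments(*levels):
    """List every treatment as a tuple of factor levels coded 0, 1, 2, ..."""
    return list(product(*(range(n) for n in levels)))

print(len(factorial_treatments(2, 2)))  # treatments in a 2 x 2 experiment
print(len(factorial_treatments(3, 4)))  # treatments in a 3 x 4 experiment
```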
18.2 Analysis
A CRD is analyzed as a one-way ANOVA with the treatments defining the groups. However, if
the CRD has treatments defined by two factors, it can also be analyzed as a two-way ANOVA
with interaction. Similarly, if the CRD has treatments defined by three factors, it can be analyzed
as a three-way ANOVA as illustrated in Chapter 16. Similarly, an RCB design uses a two-way
model with no interaction between treatments and blocks. For treatments based on two factors, an
equivalent model for an RCB is a three-way model but the only interaction is between the two
treatment factors. We now illustrate a two-factor treatment structure in a randomized block design.
\[ y_{ij} = \mu + \beta_i + \tau_j + \varepsilon_{ij}, \qquad \varepsilon_{ij}\text{s independent } N(0, \sigma^2), \]
j 1 2 3 4 5 6
(g, h) (1, 1) (1, 2) (2, 1) (2, 2) (3, 1) (3, 2)
where the $\gamma_g$s are main effects for the first factor, the $\xi_h$s are main effects for the second factor, and
the $(\gamma\xi)_{gh}$s are effects that allow interaction between the factors.
Changing from Model (18.3.1) to Model (18.3.2) is accomplished by making the substitution
There is less going on here than meets the eye. The only difference between the parameters $\tau_{gh}$
and $(\gamma\xi)_{gh}$ is the choice of Greek letters and the presence of parentheses. They accomplish exactly
the same things for the two models. The parameters $\gamma_g$ and $\xi_h$ are completely redundant. Anything
one could explain with these parameters could be explained equally well with the $(\gamma\xi)_{gh}$s. As they
stand, models (18.3.1) and (18.3.2) are equivalent. The point of using Model (18.3.2) is that it lends
itself nicely to an interesting reduced model. If we drop the $\tau_{gh}$s from Model (18.3.1), we drop all
of the treatment effects, so testing Model (18.3.1) against this reduced model is a test of whether
there are any treatment effects. If we drop the $(\gamma\xi)_{gh}$s from Model (18.3.2), we get
\[ y_{ijk} = \mu + \kappa_i + \rho_j + \tau_k + \varepsilon_{ijk}, \qquad \varepsilon_{ijk}\text{s independent } N(0, \sigma^2), \]
where the subscripts i, j, and k indicate columns, rows, and treatments, respectively. With two
factors, we can again replace the treatment subscript k with the pair (g, h) and write
Again, we can expand the treatment effects $\tau_{gh}$ to correspond to the factorial treatment structure as
A          B          C          D          E          F
$n_0p_0$   $n_0p_1$   $n_0p_2$   $n_1p_0$   $n_1p_1$   $n_1p_2$
The data are presented in Table 18.2. The basic ANOVA table is presented as Table 18.3. The
ANOVA F test indicates substantial differences between the treatments. Blocking on rows of the
square was quite effective with an F ratio of 7.10. Blocking on columns was considerably less
effective with an F of only 3.20, but it was still worthwhile. For unbalanced data, the rows and
columns can be fitted in either order (with appropriate interpretations of test results) but the treat-
ments should be fitted last.
We begin by fitting the Latin square model for six treatments k = 1, . . . , 6,
\[ y_{ijk} = \mu + \rho_i + \kappa_j + \tau_k + \varepsilon_{ijk}. \]
g = 0, 1 and h = 0, 1, 2. Adding main effect parameters for nitrogen and phosphorus leads to
and fitting a sequence of models that successively adds each term from left to right in Model
(18.4.1), we get the following ANOVA table.
Analysis of Variance: Model (18.4.1)
Source df SS MS F P
Rows 5 54199 10840 7.10 0.001
Columns 5 24467 4893 3.20 0.028
N 1 77191 77191 50.55 0.000
P 2 164872 82436 53.98 0.000
N∗P 2 6117 3059 2.00 0.161
Error 20 30541 1527
Total 35 357387
The Error line is the same as in Table 18.3, as should be the case for equivalent models. There does
not seem to be much evidence for interaction with a P value of 0.161. (But we may soon change our
minds about that.)
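The MS and F columns of this table follow mechanically from the df and SS columns. As a check, the Python sketch below recomputes them from the entries printed above. P values are omitted because they need an F distribution function, which the standard library does not provide.

```python
# (source, df, sequential SS) copied from the ANOVA table for Model (18.4.1).
terms = [
    ("Rows", 5, 54199),
    ("Columns", 5, 24467),
    ("N", 1, 77191),
    ("P", 2, 164872),
    ("N*P", 2, 6117),
]
error_df, error_ss = 20, 30541
mse = error_ss / error_df

f_stats = {}
for source, df, ss in terms:
    ms = ss / df                 # mean square = SS / df
    f_stats[source] = ms / mse   # F = MS / MSE
    print(f"{source:8s} {df:3d} {ss:7d} {ms:9.0f} {f_stats[source]:6.2f}")

total_df = error_df + sum(df for _, df, _ in terms)
total_ss = error_ss + sum(ss for _, _, ss in terms)
print(f"Total    {total_df:3d} {total_ss:7d}")
```

The degrees of freedom and sums of squares add up to the Total line, as they must for a sequential fit.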
18.4 INTERACTION IN A LATIN SQUARE 427
The three levels of phosphorus are quantitative, so we can fit separate quadratic models in lieu
of fitting interaction. Letting $p_h = 0, 1, 2$ be the known quantitative levels, an equivalent model is
[Figure: standardized residuals plotted in four panels against Fitted, Rows, Columns, and Treatments.]
Of course the predicted lines also depend on the row i and the column j. For no phosphorus, no
nitrogen is estimated to yield 60.70 = 2(30.35) pounds less than a dose of nitrogen. For a single
dose of phosphorus, no nitrogen is estimated to yield 2(30.35 + 15.958) ≐ 93 pounds less than a
dose of nitrogen. For a double dose of phosphorus, no nitrogen is estimated to yield
2(30.35 + 15.958 × 2) ≐ 125 pounds less than a dose of nitrogen. Estimated yields go up as you add more
phosphorus, but estimated yields go up faster (at a rate of about 32 ≐ 2(15.958) pounds per dose)
if you are also applying nitrogen.
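The arithmetic in this paragraph can be reproduced with a short Python sketch. The two numbers 30.35 and 15.958 are taken from the expressions in the text. Their roles here (half the nitrogen difference at zero phosphorus, and half the change in that difference per dose of phosphorus) are inferred from those expressions, and the names are ours.

```python
half_gap_at_zero = 30.35     # half the nitrogen difference with no phosphorus
half_gap_per_dose = 15.958   # half the increase in that difference per phosphorus dose

def nitrogen_gap(p):
    """Estimated yield difference (pounds) between a dose of nitrogen and none, at phosphorus dose p."""
    return 2 * (half_gap_at_zero + half_gap_per_dose * p)

gaps = [nitrogen_gap(p) for p in (0, 1, 2)]
extra_rate = 2 * half_gap_per_dose

print([round(g, 2) for g in gaps])
print(round(extra_rate, 3))
```

The gap grows linearly in the phosphorus dose, which is the sense in which the nitrogen and no-nitrogen lines diverge.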
Figures 18.1 and 18.2 contain residual plots from the full interaction model. They show some
interesting features but nothing so outstanding that I, personally, find them disturbing.
18.5 A BALANCED INCOMPLETE BLOCK DESIGN 429
[Figure: Normal Q−Q plot of standardized residuals against theoretical quantiles.]
Table 18.5: Balanced incomplete block design investigating detergents; data are numbers of dishes washed.
Block Treatment, Observation
1 A, 19 B, 17 C, 11
2 D, 6 E, 26 F, 23
3 G, 21 H, 19 J, 28
4 A, 20 D, 7 G, 20
5 B, 17 E, 26 H, 19
6 C, 15 F, 23 J, 31
7 A, 20 E, 26 J, 31
8 B, 16 F, 23 G, 21
9 C, 13 D, 7 H, 20
10 A, 20 F, 24 H, 19
11 B, 17 D, 6 J, 29
12 C, 14 E, 24 G, 21
EXAMPLE 18.5.1. John (1961) reported data on the number of dishes washed prior to losing
the suds in the wash basin. Dishes were soiled in a standard way and washed one at a time. Three
operators and three basins were available for the experiment, so at any one time only three treatments
could be applied. Operators worked at the same speed, so no effect for operators was necessary nor
should there be any effect due to basins. Nine detergent treatments were evaluated in a balanced
incomplete block design. The treatments and numbers of dishes washed are repeated in Table 18.5.
There were b = 12 blocks with k = 3 units in each block. Each of the t = 9 treatments was replicated
r = 4 times. Each pair of treatments occurred together λ = 1 time. The three treatments assigned
to a block were randomly assigned to basins as were the operators. The blocks were run in random
order. The analysis of variance is given in Table 18.6. The F test for treatment effects is clearly
significant. We now examine the treatments more carefully.
The treatments were constructed with an interesting structure. Treatments A, B, C, and D all
consisted of detergent I using, respectively, 3, 2, 1, and 0 doses of an additive. Similarly, treatments
E, F, G, and H used detergent II with 3, 2, 1, and 0 doses of the additive. Treatment J was a control.
Except for the control, the treatment structure is factorial in detergents and amounts of additive. The
general blocking model is
Treatment j kgh
A 1 113
B 2 112
C 3 111
D 4 110
E 5 123
F 6 122
G 7 121
H 8 120
J 9 230
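The coding in this table can be examined with a small Python sketch. It assumes that the three digits in the last column are k, g, and h in that order, as the column heading suggests. The dictionary and variable names are ours.

```python
from itertools import product

# (k, g, h) codes copied from the table above.
codes = {
    "A": (1, 1, 3), "B": (1, 1, 2), "C": (1, 1, 1), "D": (1, 1, 0),
    "E": (1, 2, 3), "F": (1, 2, 2), "G": (1, 2, 1), "H": (1, 2, 0),
    "J": (2, 3, 0),
}

# Among the non-control treatments (k = 1), every (g, h) combination occurs once.
factorial_part = sorted((g, h) for k, g, h in codes.values() if k == 1)
is_factorial = factorial_part == sorted(product((1, 2), range(4)))

# The control is picked out by k = 2 and, equally, by g = 3.
control_by_k = [trt for trt, (k, g, h) in codes.items() if k == 2]
control_by_g = [trt for trt, (k, g, h) in codes.items() if g == 3]

print(is_factorial, control_by_k, control_by_g)
```

Because k = 2 and g = 3 single out the same treatment, an indicator for the control carries no information beyond what g already provides.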
The three subscripts kgh uniquely identify the treatments so we can refit the original blocking
model (18.5.1) as
Going a step further, we can identify Control effects, Detergent effects, Amount effects, and a
Detergent–Amount interaction,
This is equivalent to Model (18.5.1), but is somewhat awkward to fit for many general linear model
programs because both the variable Control and the variable Detergent uniquely identify the control
treatment. For example, Minitab only gives the following output in the ANOVA table.
Analysis of Variance: Model (18.5.2)
Source       df    Seq. SS    MS    F    P
Blocks       11    412.750
Control       1    345.042
Detergent     1    381.338
Amount        3    311.384
Det∗Amt       3     49.051
Error        16     13.185
Total        35   1512.750
As usual, this is obtained by sequentially fitting the terms in Model (18.5.2) from left to right.
Minitab is also somewhat dubious about the degrees of freedom. But never fear, all is well. The
Error line agrees with that in Table 18.6, which is good evidence for my claim of equivalence. On
the other hand, R just fills out the ANOVA table.
We now exploit the quantitative nature of the amounts by recasting our Amount factor variable
as a quantitative variable a, which we also square and cube. Fitting a separate cubic polynomial for
both detergents other than the control gives
\[ y_{ikgh} = \mu + \beta_i + C_k + \delta_{g0} + \gamma_1 a_h + \delta_{g1} a_h + \gamma_2 a_h^2 + \delta_{g2} a_h^2 + \gamma_3 a_h^3 + \delta_{g3} a_h^3 + \varepsilon_{ikgh}. \qquad (18.5.4) \]
Models (18.5.3) and (18.5.4) should give us a model equivalent to Model (18.5.2). But these models
still involve both the Control (k) and Detergent (g) effects, so they remain computer unfriendly.
Sequentially fitting the terms in Model (18.5.4), gives
Analysis of Variance: Model (18.5.4)
Source       df    Seq. SS    MS    F    P
Blocks       11    412.750
Control       1    345.042
Detergent     1    381.338
a             1    306.134
Det*a         1     41.223
a^2           1      5.042
Det*a^2       1      7.782
a^3           1      0.208
Det*a^3       1      0.045
Error        16     13.185
Total        35   1512.750
Again, as should be the case for equivalent models, Model (18.5.4) gives the same Error term as
models (18.5.1) and (18.5.2). We could easily fill out the blank columns of the ANOVA table with
a hand calculator because all the difficult computations have been made. With MSE = 0.824 from
Table 18.6, and most of the terms having one degree of freedom, it is easy to glance at the ANOVA
table for Model (18.5.4) and see what effects are important. The Det*a^3 term checks whether the
cubic coefficients are different for detergents I and II, and a^3 checks whether the overall cubic
coefficient is different from 0. Both are small, so there is no evidence for any cubic effects. On the
other hand, Det*a^2 is almost 10 times the size of MSE, so we need different parabolas for detergents
I and II.
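The hand-calculator step can be sketched in a few lines. This is only an illustration using the sequential sums of squares printed in the Model (18.5.4) table; because each of these terms has one degree of freedom, its mean square equals its sum of squares. The dictionary keys are labels chosen here for readability.

```python
# Sequential sums of squares (1 df each) from the Model (18.5.4) table.
seq_ss = {
    "a": 306.134,
    "Det*a": 41.223,
    "a^2": 5.042,
    "Det*a^2": 7.782,
    "a^3": 0.208,
    "Det*a^3": 0.045,
}

# Error line of the same table.
sse, dfe = 13.185, 16
mse = sse / dfe  # about 0.824

# With 1 df per term, MS = SS, so F = SS / MSE.
f_stats = {term: ss / mse for term, ss in seq_ss.items()}

for term, f in f_stats.items():
    print(f"{term:8s} F = {f:8.2f}")
```

Both cubic terms give F statistics below 1, while the Det*a^2 ratio falls between 9 and 10, in line with the discussion above.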
Rather than continuing to work with a general linear model program, I refit Model (18.5.3) as a
regression. I used 11 indicator variables for blocks 2 through 12, B2, ..., B12. I created three indicator
variables for detergents, d1, d2, d3, where d3 is the indicator of the control, and I used the quantitative
amount variable a to create variables to fit separate polynomials for detergents I and II by defining
the products d1 a, d1 a^2, d1 a^3 and d2 a, d2 a^2, d2 a^3. Fitting the regression model
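\[ y_r = \beta_0 + \sum_{j=2}^{12} \beta_{0j} B_{rj} + \delta_{30} d_{r3} + \delta_{10} d_{r1} + \delta_{11} d_{r1} a_r + \delta_{21} d_{r2} a_r + \delta_{12} d_{r1} a_r^2 + \delta_{22} d_{r2} a_r^2 + \delta_{13} d_{r1} a_r^3 + \delta_{23} d_{r2} a_r^3 + \varepsilon_r, \qquad (18.5.5) \]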
gives
Analysis of Variance: Model (18.5.5)
Source df SS MS F P
Regression 19 1499.56 78.924 95.774 0.000000
Error 16 13.19 0.824
Total 35 1512.75
from which, with the sequential sums of squares, we can construct
Analysis of Variance: Model (18.5.5)
Source       df    Seq. SS    MS         F          P
Blocks       11    412.75      37.523     45.533    0.000000
d3            1    345.04     345.042    418.702    0.000000
d1            1    381.34     381.338    462.747    0.000000
d1 ∗ a        1    286.02     286.017    347.076    0.000000
d2 ∗ a        1     61.34      61.341     74.436    0.000000
d1 ∗ a^2      1     12.68      12.676     15.382    0.001216
d2 ∗ a^2      1      0.15       0.148      0.180    0.677212
d1 ∗ a^3      1      0.22       0.224      0.272    0.609196
d2 ∗ a^3      1      0.03       0.030      0.036    0.851993
Error        16     13.19       0.824
Total        35   1512.75
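The treatment-related predictors of Model (18.5.5) follow mechanically from the kgh codes in the treatment table. The sketch below reads those codes as k = 2 marking the control, g = 1, 2, 3 for detergent I, detergent II, and the control, and h as the number of doses; that reading, and the function and key names, are assumptions made here for illustration. Block indicators are omitted because the block assignments are not listed in this section.

```python
# (k, g, h) codes copied from the treatment table.
codes = {
    "A": (1, 1, 3), "B": (1, 1, 2), "C": (1, 1, 1), "D": (1, 1, 0),
    "E": (1, 2, 3), "F": (1, 2, 2), "G": (1, 2, 1), "H": (1, 2, 0),
    "J": (2, 3, 0),
}

def predictors(treatment):
    """Return the treatment predictors of Model (18.5.5) for one treatment."""
    _, g, h = codes[treatment]
    d1, d2, d3 = int(g == 1), int(g == 2), int(g == 3)  # detergent indicators
    a = h                                               # doses of additive
    return {
        "d3": d3, "d1": d1,
        "d1*a": d1 * a, "d2*a": d2 * a,
        "d1*a^2": d1 * a**2, "d2*a^2": d2 * a**2,
        "d1*a^3": d1 * a**3, "d2*a^3": d2 * a**3,
    }

for trt in codes:
    print(trt, predictors(trt))
```

These eight columns plus the 11 block indicators account for the 19 regression degrees of freedom in the ANOVA table for Model (18.5.5).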
Both of the cubic terms are small, so we need neither. This agrees with the Model (18.5.4) ANOVA
table, where the sequential sums of squares were 0.208 for a^3 and 0.045 for Det*a^3, which add
to 0.253. The cubic terms here have sums of squares 0.22 and 0.03, which sum to the same thing
(except for round-off error). In the Model (18.5.4) ANOVA table we could only identify that there
was quadratic interaction. From the term d2 ∗ a^2 we see that detergent II needs no quadratic term,
while from d1 ∗ a^2 we see that detergent I does need such a term. This implies a difference in the
quadratic coefficients, hence the significant Det*a^2 quadratic interaction in the ANOVA table of
Model (18.5.4). All of the other effects look important. The table of coefficients for Model (18.5.5)
[not given] actually has a large P value for d1 ∗ a^2, but that provides a test for dropping d1 ∗ a^2 out of
a model that still contains the cubic terms, so the result is irrelevant.
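The agreement noted for the cubic terms can be checked at every polynomial degree, since at each degree the pair of terms in Model (18.5.4) and the pair in Model (18.5.5) are fitted after the same earlier terms. The following sketch simply adds the printed sequential sums of squares; the dictionary layout is an illustrative choice.

```python
# Sequential SS from the Model (18.5.4) table: (a^p, Det*a^p) for p = 1, 2, 3.
model_4 = {1: (306.134, 41.223), 2: (5.042, 7.782), 3: (0.208, 0.045)}

# Sequential SS from the Model (18.5.5) table: (d1*a^p, d2*a^p) for p = 1, 2, 3.
model_5 = {1: (286.017, 61.341), 2: (12.676, 0.148), 3: (0.224, 0.030)}

# Total SS attributed to each degree under the two parameterizations.
totals = {p: (sum(model_4[p]), sum(model_5[p])) for p in (1, 2, 3)}

for p, (t4, t5) in totals.items():
    print(f"degree {p}: {t4:.3f} versus {t5:.3f}")
```

The two parameterizations split each degree's sum of squares differently but agree on the total up to round-off.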
Incorporating these conclusions, the table of coefficients for fitting
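\[ y_r = \beta_0 + \sum_{j=2}^{12} \beta_{0j} B_{rj} + \delta_{30} d_{r3} + \delta_{10} d_{r1} + \delta_{11} (d_{r1} a_r) + \delta_{21} (d_{r2} a_r) + \delta_{12} (d_{r1} a_r^2) + \varepsilon_r \qquad (18.5.6) \]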
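[Figure 18.3 plot: fitted values (vertical axis) against Amount (horizontal axis), with separate curves for detergents (Det) I and II.]
Figure 18.3 Predicted dishes washed for two detergents as a function of the amounts of an additive in Block 1.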
Figure 18.3 displays the results for Block 1 and detergents I and II.
From inspection of Figure 18.3, suds last longer when there is more additive (up to a triple dose).
Detergent II works uniformly better than detergent I. The effect of a dose of the additive is greater
at low levels for detergent I than at high levels but the effect of a dose is steady for detergent II. The
control is easily better than any of the new treatments with m̂(1, 2, 3, a) = 28.5018.
         Cow
Period   7        8        9        10       11       12
1        A 1091   B 1234   C 1300   A 1105   B 891    C 859
2        B 798    C 902    A 1297   C 712    A 830    B 617
3        C 534    A 869    B 962    B 453    C 629    A 597
         Cow
Period   13       14       15       16       17       18
1        A 941    B 794    C 779    A 933    B 724    C 749
2        B 718    C 603    A 718    C 658    A 649    B 594
3        C 548    A 613    B 515    B 576    C 496    A 612
EXAMPLE 18.6.1. Patterson (1950) and John (1971) considered the milk production of cows
that were given three different diets. The three feed regimens were A, good hay; B, poor hay; and
C, straw. Eighteen cows were used and milk production was measured during three time periods
for each cow. Each cow received a different diet during each time period. The data are given in
Table 18.8. The cows were divided into six groups of 3. A 3 × 3 Latin square design was used for
each group of three cows along with the three periods and the three feed treatments. Having eighteen
cows, we get 6 Latin squares. The six squares are clearly marked in Table 18.8 by double vertical
and horizontal lines. We will not do a complete analysis of these data, rather we point out salient
features of the analysis.
The basic model for multiple Latin squares is
\[ y_{hijk} = \mu + S_h + C_{hi} + P_{hj} + \tau_k + \varepsilon_{hijk}, \qquad (18.6.1) \]
where S indicates the 6 Square effects, C and P indicate 3 Cow and 3 Period effects within a Latin
square, but the effects change between Latin squares (2 degrees of freedom per square times 6
squares), and τ indicates 3 treatment effects that do not change between Latin squares. The analysis
of variance table is presented in Table 18.9. In general, all the ANOVA tables should be obtained
by fitting a sequence of hierarchical models where the terms are added to the sequence in the same
order that we have placed them in the model. These data are balanced, which makes the order of
fitting less important.
So far we have acted as though the model presumes that the columns are different in every Latin
square, as are the rows. This is true for the columns, no cow is ever used in more than one square.
It is less clear whether, say, period 1 is the same in the first Latin square as it is in the second and
other squares. We will return to this issue later. It is clear, however, that the treatments are the same
in every Latin square.
18.6 EXTENSIONS OF LATIN SQUARES 435
From Table 18.9, mean squares and F statistics are easily obtained. If this was a classic applica-
tion of multiple Latin squares, the only F test of real interest would be that for treatments, since the
other lines of Table 18.9 denote various forms of blocking. The F statistic for treatments is about
25, so, with 22 degrees of freedom for Error, the test is highly significant. One should then compare
the three treatments using contrasts and check the validity of the assumptions using residual plots.
The basic model (18.6.1) and analysis of variance Table 18.9 can be modified in many ways. We
now present some of those ways.
As a standard practice, John (1971, Section 6.5) includes a square-by-treatment interaction to
examine whether the treatments behave the same in the various Latin squares,
In our example with 6 squares and 3 treatments such a term would typically have (6 − 1) × (3 − 1) =
10 degrees of freedom.
We mentioned earlier that periods might be considered the same from square to square. If so,
we should fit
\[ y_{hijk} = \mu + S_h + C_{hi} + P_j + (SP)_{hj} + \tau_k + \varepsilon_{hijk}. \qquad (18.6.2) \]
We will want to test this against the no-interaction model to examine whether the periods behave the
same from square to square. The analysis of variance table incorporating this change is presented as
Table 18.10. Our current data are balanced but for unbalanced data one could debate whether the ap-
propriate test for square-by-period interaction should be conducted before or after fitting treatments.
I would always fit treatments after everything that involves blocks.
If the Latin squares were constructed using the complete randomization discussed in Sec-
tion 17.5, one could argue that the period-by-squares interaction must really be error and that the 10
degrees of freedom and corresponding sum of squares should be pooled with the current error. Such
an analysis is equivalent to simply thinking of the design as one large rectangle with three terms to
consider: the 3 periods (rows), the 18 cows (columns), and the 3 treatments. For this design,
Such an analysis is illustrated in Table 18.11. The sum of squares for Cows in Table 18.11 equals
the sum of squares for Cows within Squares plus the sum of squares for Squares from the earlier
ANOVA tables. The 17 degrees of freedom for Cows are also the 12 degrees of freedom for cows
within squares plus the 5 degrees of freedom for Squares.
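The degrees of freedom quoted for the three analyses can be reproduced by simple counting. This is a sketch under the assumption that the Error degrees of freedom are whatever remains of the total after the listed terms; the term labels and helper name are chosen here and need not match the table headings.

```python
squares, cows_per_square, periods, treatments = 6, 3, 3, 3
total_df = squares * cows_per_square * periods - 1  # 54 observations

def with_error(terms):
    """Return the terms plus an Error line holding the leftover df."""
    out = dict(terms)
    out["Error"] = total_df - sum(terms.values())
    return out

# Model (18.6.1): cows and periods both change from square to square.
nested = with_error({
    "Squares": squares - 1,
    "Cows(Squares)": squares * (cows_per_square - 1),
    "Periods(Squares)": squares * (periods - 1),
    "Treatments": treatments - 1,
})

# Model (18.6.2): common periods plus a square-by-period interaction.
crossed = with_error({
    "Squares": squares - 1,
    "Cows(Squares)": squares * (cows_per_square - 1),
    "Periods": periods - 1,
    "Squares*Periods": (squares - 1) * (periods - 1),
    "Treatments": treatments - 1,
})

# One large rectangle: 18 cows, 3 periods, 3 treatments.
rectangle = with_error({
    "Cows": squares * cows_per_square - 1,
    "Periods": periods - 1,
    "Treatments": treatments - 1,
})

for name, table in [("nested", nested), ("crossed", crossed), ("rectangle", rectangle)]:
    print(name, table)
```

Pooling the 10 interaction degrees of freedom with the 22 for error gives the 32 error degrees of freedom of the rectangle analysis.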
In this example, choosing between the analyses of Tables 18.10 and 18.11 is easy because of
additional structure in the design that we have not yet considered. This particular design was chosen
because consuming a particular diet during one period might have an effect that carries over into the
next time period. In the three Latin squares on the left of Table 18.8, treatment A is always followed
by treatment B, treatment B is always followed by treatment C, and treatment C is always followed
by treatment A. In the three Latin squares on the right of Table 18.8, treatment A is always followed
by treatment C, treatment B is followed by treatment A, and treatment C is followed by treatment B.
This is referred to as a cross-over or change-over design. Since there are systematic changes in the
squares, it is reasonable to investigate whether the period effects differ from square to square and
so we should use Table 18.10. In particular, we would like to isolate 2 degrees of freedom from the
period-by-square interaction to look at whether the period effects differ between the three squares
on the left as compared to the three squares on the right. To do this, we replace the Squares subscript
h = 1, . . . , 6 with two subscripts: f = 1, 2 and g = 1, 2, 3 where f identifies right and left squares.
We then fit the model
\[ y_{fgijk} = \mu + C_{fgi} + P_j + (SdP)_{fj} + \tau_k + \varepsilon_{fgijk} \]
where (SdP)_{fj} is a side-by-period interaction. When the data are balanced, we don’t need to worry
about whether to fit this interaction before or after treatments. These issues are addressed in Exer-
cise 18.7.6. ✷
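The carry-over structure described above can be verified directly from the diet letters in the milk production table. The sketch below covers only the four squares whose data appear in this section (cows 7 through 18); the dictionary labels and function names are illustrative.

```python
# Diets by period (one string per period, one letter per cow) for each square.
squares = {
    "cows 7-9":   ["ABC", "BCA", "CAB"],
    "cows 10-12": ["ABC", "CAB", "BCA"],
    "cows 13-15": ["ABC", "BCA", "CAB"],
    "cows 16-18": ["ABC", "CAB", "BCA"],
}

def is_latin(rows):
    """Check that every diet occurs once per period (row) and per cow (column)."""
    diets = set(rows[0])
    cols = ["".join(col) for col in zip(*rows)]
    return all(len(line) == len(diets) and set(line) == diets
               for line in rows + cols)

def successors(rows):
    """Map each diet to the set of diets that follow it for the same cow."""
    follow = {}
    for col in zip(*rows):
        for first, second in zip(col, col[1:]):
            follow.setdefault(first, set()).add(second)
    return follow

for name, rows in squares.items():
    print(name, is_latin(rows), successors(rows))
```

Squares printed on the left of the table (cows 7-9 and 13-15) show A followed by B, while those on the right (cows 10-12 and 16-18) show A followed by C, which is the left versus right distinction coded by the subscript f.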
18.7 Exercises
EXERCISE 18.7.1. The process condition treatments in Exercise 17.11.3 on vinyl thickness had
factorial treatment structure. Give the factorial analysis of variance table for the data. The data are
repeated below.
EXERCISE 18.7.2. Garner (1956) presented data on the tensile strength of fabrics as measured
with Scott testing machines. The experimental procedure involved selecting eight 4×100-inch strips
from available stocks of uniform twill, type I. Each strip was divided into sixteen 4 × 6 inch samples
(with some left over). Each of three operators selected four samples at random and, assigning each
sample to one of four machines, tested the samples. The four extra samples from each strip were
held in reserve in case difficulties arose in the examination of any of the original samples. It was
considered that each 4 × 100 inch strip constituted a relatively homogeneous batch of material.
Effects due to operators include differences in the details of preparation of samples for testing and
mannerisms of testing. Machine differences include differences in component parts, calibration, and
speed. The data are presented in Table 18.12. Entries in Table 18.12 are values of the strengths in
excess of 180 pounds.
(a) Identify the design for this experiment and give an appropriate model. List all the assumptions
made in the model.
(b) Analyze the data. Give an appropriate analysis of variance table.
(c) Check the assumptions of the model and adjust the analysis appropriately.
EXERCISE 18.7.3. Consider the milk production data in Table 18.8 and the corresponding analysis
of variance in Table 18.9. Relative to the periods, the squares on the left of Table 18.8 always
have treatment A followed by B, B followed by C, and C followed by A. The squares on the right
always have treatment A followed by C, B followed by A, and C followed by B. Test whether there
is an interaction between periods and left–right square differences.
EXERCISE 18.7.5. Exercises 17.11.7, 17.11.8, 17.11.9, and the previous exercise used subsets
of data reported in Garner (1956). The experiment was designed to examine differences among
operators and machines when using Suter hydrostatic pressure-testing machines. No interaction
between machines and operators was expected.
A one-foot square of cloth was placed in a machine. Water pressure was applied using a lever
until the operator observed three droplets of water penetrating the cloth. The pressure was then
relieved using the same lever. The observation was the amount of water pressure consumed and it
was measured as the number of inches that water rose up a cylindrical tube with radial area of 1
square inch. Operator differences are due largely to differences in their ability to spot the droplets
and their reaction times in relieving the pressure. Machines 1 and 2 were operated with a hand lever.
Machines 3 and 4 were operated with a foot lever.
A 52 × 200-inch strip of water-repellant cotton Oxford was available for the experiment. From
this, four 48 × 48-inch squares were cut successively along the warp (length) of the fabric. It was
decided to adjust for heterogeneity in the application of the water repellant along the warp and fill
(width) of the fabric, so each 48 × 48 square was divided into four equal parts along the warp and
four equal parts along the fill, yielding 16 smaller squares. The design involves four replications
of a Graeco-Latin square. In each 48 × 48 square, every operator worked once with every row and
column of the larger square and once with every machine. Similarly, every row and column of the
48 × 48 square was used only once on each machine. The data are given in Table 18.14.
Analyze the data. Give an appropriate analysis of variance table. Give a model and check your
assumptions. Use the Bonferroni method to determine differences among operators and to determine
differences among machines.
The cuts along the warp of the fabric were apparently the rows. Should the rows be considered
the same from square to square? How would doing this affect the analysis?
Look at the means for each square. Is there any evidence of a trend in the water repellency as
we move along the warp of the fabric? How should this be tested?
EXERCISE 18.7.6. Consider the milk production data in Table 18.8 and the corresponding analysis
of variance in Table 18.10. Relative to the periods, the squares on the left of Table 18.8 always
have treatment A followed by B, B followed by C, and C followed by A. The squares on the right
always have treatment A followed by C, B followed by A, and C followed by B. Test whether there
is an average difference between the squares on the left and those on the right. Test whether there is
an interaction between periods and left–right square differences.
Chapter 19
Dependent Data
In this chapter we examine methods for performing analysis of variance on data that are not com-
pletely independent. The two methods considered are appropriate for similar data but they are based
on different assumptions. All the data involved have independent groups of observations but the
observations within groups are not independent. In terms of analyzing unbalanced data, both of
these procedures easily handle unbalanced groups of observations but the statistical theory breaks
down when the observations within the groups become unbalanced. The first method was developed
for analyzing the results of split-plot designs. The corresponding models involve constant variance
for all observations and the lack of independence consists of a constant correlation between ob-
servations within each group. The second method is multivariate analysis of variance. Multivari-
ate ANOVA allows an arbitrary variance and correlation structure among the observations within
groups but assumes that the same structure applies for every group. These are two extremes in terms
of modeling dependence among observations in a group and many useful models can be fitted that
have other interesting variance-correlation structures, cf. Christensen et al. (2010, Section 10.3).
However, the two variance-correlation structures considered here are the most amenable to further
statistical analysis.
Section 19.1 introduces unbalanced split-plot models and illustrates the highlights of the analy-
sis. Section 19.2 gives a detailed analysis for a complicated balanced split-plot model using methods
that are applicable to unbalanced groups. Subsection 19.2.1 even discusses methods that apply for
unbalanced observations within the groups but such imbalance requires us to abandon comparisons
between groups. Section 19.3 introduces multivariate analysis of variance. Section 19.4 considers
some special cases of the model examined in Sections 19.1 and 19.2; these are subsampling models
and one-way analysis of variance models in which group effects are random.
Split-plot designs involve simultaneous use of different sized experimental units. The corresponding
models involve more than one error term. According to Casella (2008, p.171), “Split-plot experi-
ments are the workhorse of statistical design. There is a saying that if the only tool you own is a
hammer, then everything in the world looks like a nail. It might be fair to say that, from now on,
almost every design that you see will be some sort of split plot.”
Suppose we produce an agricultural commodity and are interested in the effects of two factors:
an insecticide and a fertilizer. The fertilizer is applied using a tractor and the insecticide is applied via
crop dusting. The method of applying the chemicals is part of the treatment. Crop dusting involves
using an airplane to spread the material. Obviously, you need a fairly large piece of land for crop
dusting, so the number of replications on the crop-dusted treatments will be relatively few. On the
other hand, different fertilizers can be applied with a tractor to reasonably small pieces of land, so
we can obtain more replications. If our primary interest is in the main effects of the crop-dusted
insecticides, we are stuck. Accurate results require a substantial number of large fields to obtain
replications on the crop-dusting treatments. However, if our primary interest is in the fertilizers or
the interaction between fertilizers and insecticides, we can design a good experiment using only a
few large fields.
To construct a split-plot design, start with the large fields and design an experiment that is
appropriate for examining just the insecticides. Depending on the information available about the
fields, this can be a CRD, a RCB design, a Latin square design, or pretty much any design you think
is appropriate. Suppose there are three levels of insecticide to be investigated. If we have three fields
in the Gallatin Valley of Montana, three fields near Willmar, Minnesota, and three fields along the
Rio Grande River in New Mexico, it is appropriate to set up a block in each state so that we see
each insecticide in each location. Alternatively, if we have one field near Bozeman, MT, one near
Cedar City, UT, one near Twin Peaks, WA, one near Winters, CA, one near Fields, OR, and one near
Grand Marais, MN, a CRD seems more appropriate. We need a valid design for this experiment on
insecticides, but often it will not have enough replications to yield a very precise analysis. Each of
the large fields used for insecticides is called a whole plot. The insecticides are randomly applied to
the whole plots, so they are referred to as the whole-plot treatments. Any complete blocks used in
the whole-plot design are typically called “Replications” or just “Reps.”
Regardless of the design for the insecticides, the key to a split-plot design is using each whole
plot (large field) as a block for examining the subplot treatments (fertilizers). If we have four fertil-
izer treatments, we divide each whole plot into four subplots. The fertilizers are randomly assigned
to the subplots. The analysis for the subplot treatments is just a modification of the RCB analysis
with each whole plot treated as a block.
We have a much more accurate experiment for fertilizers than for insecticides. If, as alluded
to earlier, the insecticide (whole-plot) experiment was set up with 3 blocks (MT, MN, NM) each
containing 3 whole plots, we have just 3 replications on each insecticide, but each of the 9 whole
plots is a block for the fertilizers, so we have 9 replications of the fertilizers. Moreover, fertilizers
are compared within whole plots, so they are not subject to the whole-plot-to-whole-plot variation.
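The two-stage randomization and the resulting replication counts can be sketched as follows. The treatment labels (I1 to I3, F1 to F4), the function name, and the fixed seed are hypothetical choices for illustration; only the 3 blocks, 3 insecticides, and 4 fertilizers come from the discussion above.

```python
import random

def split_plot_layout(blocks, whole_plot_trts, subplot_trts, seed=0):
    """Randomize whole-plot treatments within blocks, then subplot
    treatments within each whole plot."""
    rng = random.Random(seed)
    layout = {}
    for block in blocks:
        # Stage 1: assign insecticides to the whole plots (fields) of a block.
        wp_order = rng.sample(whole_plot_trts, len(whole_plot_trts))
        for field, insecticide in enumerate(wp_order, start=1):
            # Stage 2: assign fertilizers to the subplots of this field.
            sp_order = rng.sample(subplot_trts, len(subplot_trts))
            layout[(block, field)] = (insecticide, sp_order)
    return layout

layout = split_plot_layout(["MT", "MN", "NM"],
                           ["I1", "I2", "I3"],
                           ["F1", "F2", "F3", "F4"])

# Replications: how many whole plots receive I1, and how many contain F1.
insecticide_reps = sum(1 for trt, _ in layout.values() if trt == "I1")
fertilizer_reps = sum(1 for _, subs in layout.values() if "F1" in subs)
print(len(layout), insecticide_reps, fertilizer_reps)
```

Each insecticide appears in 3 whole plots, whereas each fertilizer appears in all 9, which is the source of the extra precision for fertilizers.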
Perhaps the most important aspect of the design is the interaction. It is easy to set up a mediocre
design for insecticides and a good experiment for fertilizers; the difficulty is in getting to look at
them together and the primary point in looking at them together is to investigate interaction. The
most important single fact in the analysis is that the interaction between insecticides and fertilizers
is subject to exactly the same variability as fertilizer comparisons. Thus we have eliminated a major
source of variation, the whole-plot-to-whole-plot variability. Interaction effects are only subject to
the subplot variability, i.e., the variability within whole plots.
The basic idea behind split-plot designs is very general. The key idea is that an observational
unit (whole plot, large field) is broken up to allow several distinct measurements on the unit. These
are often called repeated measures. In an example in the next section, the weight loss due to abrasion
of one piece of fabric is measured after 1000, 2000, and 3000 revolutions of a machine designed to
cause abrasion. Another possibility is giving drugs to people and measuring their heart rates after 10,
20, and 30 minutes. When repeated measurements are made on the same observational unit, these
measurements are more likely to be similar than measurements taken on different observational
units. Thus the measurements on the same unit are correlated. This correlation needs to be modeled
in the analysis. Note, however, that with the weight loss and heart rate examples, the “treatments”
(rotations, minutes) cannot be randomly assigned to the units. In such cases the variance-correlation
structure of a split-plot model may be less appropriate than that of the multivariate ANOVA model
or various other models. In terms of balance, the methods of analysis presented hold if we lose all
data on an observational unit (piece of fabric, person) but break down if we lose some but not all of
the information on a unit.
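One common way to picture the split-plot correlation structure is through two variance components, one shared by all measurements on a unit and one specific to each measurement. The sketch below assumes exactly that, with hypothetical values for the two components; it is an illustration of constant variance with constant within-unit correlation, not a fitted result.

```python
def whole_plot_covariance(var_whole, var_sub, n_sub):
    """Covariance matrix of the n_sub measurements taken on one unit.

    Every pair of measurements shares the unit-level component var_whole;
    each measurement adds its own independent component var_sub.
    """
    return [[var_whole + (var_sub if i == j else 0.0) for j in range(n_sub)]
            for i in range(n_sub)]

var_whole, var_sub = 2.0, 1.0  # hypothetical variance components
cov = whole_plot_covariance(var_whole, var_sub, 4)

# Constant variance on the diagonal, constant covariance off the diagonal.
correlation = cov[0][1] / cov[0][0]

for row in cov:
    print(row)
print(correlation)
```

Under this assumption the within-unit correlation is var_whole / (var_whole + var_sub), the same for every pair of measurements on a unit, whereas multivariate ANOVA would let each entry of the matrix differ.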
We now consider an example of a simple split-plot design. Section 19.2 presents a second ex-
ample that considers the detailed analysis of a study with four factors.
EXAMPLE 19.1.1. Garner (1956) and Christensen (1996, Section 12.1) present data on the
amount of moisture absorbed by water-repellant cotton Oxford material. Two 24-yard strips of cloth
were obtained. Each strip is a replication and was divided into four 6-yard strips. The 6-yard strips
19.1 THE ANALYSIS OF SPLIT-PLOT DESIGNS 441
were randomly assigned to one of four laundries. After laundering and drying, the 6-yard strips
were further divided into four 1.5-yard strips and randomly assigned to one of four laboratories for
determination of dynamic water absorption. The data presented in Table 19.1 are actually the means
of two determinations of dynamic absorption made for each 1.5-yard strip. The label “Test” is used
to identify different laboratories (out of fear that the words laundry and laboratory might get con-
fused). To illustrate the analysis of unbalanced data we have removed the data for Laundry 3 from
the second replication.
First consider how the experimental design deals with laundries. There are two blocks (Reps) of
material available, the 24-yard strips. These are subdivided into four sections and randomly assigned
to laundries. Thus we have a randomized complete block (RCB) design for laundries with two
blocks and four treatments from which we are missing the information on the third treatment in
the second block. The 6-yard strips are the whole-plot experimental units, laundries are whole-plot
treatments, and the 24-yard strips are whole-plot blocks.
After the 6-yard strips have been laundered, they are further subdivided into 1.5-yard strips
and these are randomly assigned to laboratories for testing. In other words, each experimental unit
in the whole-plot design for laundries is split into subunits for further treatment. The whole-plot
experimental units (6-yard strips) serve as blocks for the subplot treatments. The 1.5-yard strips are
subplot experimental units and the tests are subplot treatments.
The peculiar structure of the design leads us to analyze the data almost as two separate experi-
ments. There is a whole-plot analysis focusing on laundries and a subplot analysis focusing on tests.
The subplot analysis also allows us to investigate interaction.
Consider the effects of the laundries. The analysis for laundries is called the whole-plot analysis.
We have a block design for laundries but a block analysis requires just one number for each laundry
observation (whole plot). The one number used for each whole plot is the mean absorption averaged
over the four subplots (tests) contained in the whole plot. These 7 mean values are reported in
Table 19.1. With two reps, four treatments, and a missing whole plot we get
Whole-plot ANOVA for laundries using subplot means.
Source                 df   Seq. SS   MS       F         P
Reps                    1    1.489     1.488    274.83   0.004
Laundry (after Reps)    3   65.379    21.793   4023.34   0.000
Error 1                 2    0.011     0.005
Total                   6   66.879
As usual, we fit (whole-plot) treatments after reps (whole-plot blocks). With one minor exception,
this provides the whole-plot analysis section of a split-plot model ANOVA table. The degrees of
freedom are Reps, 1; Laundry, 3; and the whole-plot error, Error 1, with 2 df . The minor exception
is that when we present the combined split-plot model ANOVA in Table 19.2, the sums of squares
and mean squares presented here are all multiplied by the number of subplot treatments, four. This
multiplication has no effect on significance tests, e.g., in an F test the numerator mean square and
the denominator mean square are both multiplied by the same number, so the multiplications cancel.
Multiplying these mean squares and sums of squares by the number of subplot treatments maintains
consistency with the subplot model computations.
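The cancellation can be seen numerically with the sums of squares from the whole-plot table. This sketch works from the rounded printed values, so its F statistics differ slightly from the printed ones, which were presumably computed from unrounded figures; the helper name is illustrative.

```python
n_subplot_trts = 4

# (df, SS) from the whole-plot ANOVA on subplot means.
means_anova = {"Reps": (1, 1.489), "Laundry": (3, 65.379), "Error 1": (2, 0.011)}

# Multiply every sum of squares by the number of subplot treatments.
scaled_anova = {src: (df, n_subplot_trts * ss)
                for src, (df, ss) in means_anova.items()}

def f_stat(table, source, error="Error 1"):
    """F statistic for a source: its mean square over the error mean square."""
    df, ss = table[source]
    df_e, ss_e = table[error]
    return (ss / df) / (ss_e / df_e)

for src in ("Reps", "Laundry"):
    print(src, round(scaled_anova[src][1], 3),
          round(f_stat(means_anova, src), 1),
          round(f_stat(scaled_anova, src), 1))
```

The scaled sums of squares land within round-off of the Reps and Laundry lines of Table 19.2, and the F statistics are the same before and after scaling.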
Now consider the analysis of the subplot treatments, i.e., the absorption tests. The subplot anal-
ysis is largely produced by treating each whole plot as a block. Note that we observe every subplot
treatment within each whole plot, so the blocks are complete. There will be, however, one notable
exception to treating the subplot analysis as an RCB analysis, i.e., the identification of interaction
effects.
RCB ANOVA for tests: Whole plots as subplot blocks.
Source df Seq. SS MS F P
Whole plots 6 267.515 44.586 31.79 0.000
Test 3 105.290 35.097 25.02 0.000
Error 18 25.246 1.403
Total 27 398.050
In a blocking analysis with whole plots taken as subplot blocks there are 7 whole plots, so there are
6 degrees of freedom for subplot blocks. In addition there are 3 degrees of freedom for tests, so the
degrees of freedom for error are 28 − 6 − 3 − 1 = (6)(3) = 18.
The subplot analysis differs from the standard blocking analysis in the handling of the 18 degrees
of freedom for error. A standard blocking analysis takes the block-by-treatment interaction as error.
This is appropriate because the extent to which treatment effects vary from block to block is an
appropriate measure of error for treatment effects. However, in a split-plot design the subplot blocks
are not obtained haphazardly, they have consistencies due to the whole-plot treatments. We can
identify structure within the subplot-block-by-subplot-treatment interaction. Some of the block-by-
treatment interaction can be ascribed to whole-plot-treatment-by-subplot-treatment interaction. In
this experiment, the laundry-by-test interaction has 3 × 3 = 9 degrees of freedom. This is extracted
from the 18 degrees of freedom for error in the subplot RCB analysis to give a subplot error term
(Error 2) with only 18 − 9 = 9 degrees of freedom. Finally, it is of interest to note that the 6 degrees
of freedom for subplot blocks correspond to the 6 degrees of freedom in the whole-plot analysis: 1
for Reps, 3 for Laundries, and 2 for Whole Plot Error. In addition, up to round-off error, the sum of
squares for subplot blocks is also the total of the sums of squares for Reps, Laundries (after fitting
Reps), and Whole-plot Error (Error 1) reported earlier after multiplying those sums of squares by
the number of subplot treatments.
Table 19.2 combines the whole-plot analysis and the subplot analysis into a common analysis
of variance table. Error 1 indicates the whole-plot error term and its mean square is used for in-
ferences about laundries and Reps (if you think it is appropriate to draw inferences about Reps).
Error 2 indicates the subplot error term and its mean square is used for inferences about tests and
laundry-by-test interaction. A subplot blocks line does not appear in the table; the whole-plot anal-
ysis replaces it. Note that for the given whole-plot design, Error 1 is computationally equivalent
to a Rep ∗ Laundry interaction. Again, the sums of squares and mean squares for Reps, Laundry,
and Error 1 in Table 19.2 are, up to round-off error, equal to 4 times the values given earlier in the
analysis based on the 7 Rep–Laundry means.
As for comparing the “whole plots as subplot blocks” ANOVA table given earlier to Table 19.2,
in the first row the whole plots degrees of freedom and sums of squares are the sums of the Reps,
Laundry, and Error 1 degrees of freedom and sum of squares in Table 19.2. The Test lines in
the second row are identical. The “whole plots as subplot blocks” Error term is broken into the
Laundry∗Test interaction and Error 2 degrees of freedom and sum of squares of Table 19.2. The
Total lines are identical.
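The line-by-line correspondence between the two tables amounts to a few additions, sketched here with the printed values. The dictionary names and the helper are illustrative.

```python
# (df, SS) from Table 19.2.
table_19_2 = {
    "Reps": (1, 5.955), "Laundry": (3, 261.517), "Error 1": (2, 0.043),
    "Test": (3, 105.290), "Laundry*Test": (9, 21.845), "Error 2": (9, 3.401),
}

# (df, SS) from the "whole plots as subplot blocks" RCB table.
rcb = {"Whole plots": (6, 267.515), "Test": (3, 105.290), "Error": (18, 25.246)}

def combine(sources):
    """Add the df and SS of several Table 19.2 lines."""
    df = sum(table_19_2[s][0] for s in sources)
    ss = sum(table_19_2[s][1] for s in sources)
    return df, round(ss, 3)

whole_plots = combine(["Reps", "Laundry", "Error 1"])
rcb_error = combine(["Laundry*Test", "Error 2"])
total = combine(list(table_19_2))

print(whole_plots, rcb["Whole plots"])
print(rcb_error, rcb["Error"])
print(total)
```

The whole-plot lines rebuild the 6 degrees of freedom for subplot blocks, and the interaction plus Error 2 rebuild the 18 degrees of freedom of the RCB error.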
From Table 19.2 the Laundry ∗ Test interaction is clearly significant, so the analysis would typ-
ically focus there. On the other hand, while the interaction is statistically important, its F statistic is
an order of magnitude smaller than the F statistic for tests, so the person responsible for the exper-
iment might decide that interaction is not of practical importance. The analysis might then ignore
the interaction and focus on the main effects for tests and laundries. Since I am not responsible for
the experiment (only for its inclusion in this book), I will not presume to declare a highly significant
interaction unimportant. Modeling the interaction will be considered in the next subsection. ✷
19.1 THE ANALYSIS OF SPLIT-PLOT DESIGNS 443
Source            df    Seq. SS        MS          F        P
Reps               1      5.955     5.955     275.60    0.004
Laundry            3    261.517    87.172    4036.04    0.000
Error 1            2      0.043     0.021
Test               3    105.290    35.097     92.876    0.000
Laundry ∗ Test     9     21.845     2.427      6.423    0.005
Error 2            9      3.401     0.378
Total             27    398.050
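As a quick check on how the two error terms are used, the following Python sketch recomputes the F ratios from the entries of Table 19.2. The dictionary layout and variable names are illustrative assumptions, not part of the original analysis. Because the printed mean squares are rounded, the whole-plot ratios reproduce the tabled F statistics only approximately.

```python
# Mean squares from Table 19.2 (Error 1 is recomputed as SS/df from the printed values).
ms = {"Reps": 5.955, "Laundry": 87.172, "Error 1": 0.043 / 2,
      "Test": 35.097, "Laundry*Test": 2.427, "Error 2": 0.378}

# Whole-plot terms are tested against Error 1, subplot terms against Error 2.
error_for = {"Reps": "Error 1", "Laundry": "Error 1",
             "Test": "Error 2", "Laundry*Test": "Error 2"}

f_ratio = {term: ms[term] / ms[err] for term, err in error_for.items()}
for term, f in f_ratio.items():
    print(f"{term:13s} F = {f:9.2f}  (denominator: {error_for[term]})")
```

The point of the sketch is only the choice of denominator: using Error 2 for Laundry would overstate nothing here numerically, but it would be the wrong error term for a whole-plot effect.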
We now examine the assumptions behind this analysis. The basic split-plot model for a whole-
plot design with Reps is
$$y_{ijk} = \mu + r_i + w_j + \eta_{ij} + s_k + (ws)_{jk} + \varepsilon_{ijk} \tag{19.1.1}$$
In the laundries example we get 0.011/0.378 = 0.029 on 2 and 9 degrees of freedom and a P value
of 0.97, which may be suspiciously large. This is rather like testing for blocks in a randomized
complete block design. Both tests merely tell you if you wasted your time. An insignificant test for
blocks indicates that blocking was a waste of time. Similarly, an insignificant test for whole-plot
variability indicates that forming a split-plot design was a waste of time. In each case, it is too late
to do anything about it. The analysis should follow the design that was actually used. However, the
information may be of value in designing future studies.
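The P value quoted for the whole-plot variability test can be checked without tables. An F distribution with 2 numerator degrees of freedom has a closed-form upper tail, $P(F > f) = (1 + 2f/n)^{-n/2}$ for n denominator degrees of freedom. The sketch below uses that standard identity; the function name is hypothetical.

```python
def f_upper_tail_2df(f, dfe):
    """P(F > f) for an F(2, dfe) distribution (closed form when the numerator df is 2)."""
    return (1.0 + 2.0 * f / dfe) ** (-dfe / 2.0)

f_obs = 0.011 / 0.378              # ratio reported in the text
p_value = f_upper_tail_2df(f_obs, 9)
print(f"F = {f_obs:.3f}, P = {p_value:.2f}")
```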
where we drop $\bar{s}_{\cdot}$ and $(\overline{ws})_{j\cdot}$ because they are indistinguishable from the $\mu$ and $w_j$ effects. The
top left panel of Figure 19.1 contains a normal plot for the Error 1 residuals; it looks reasonably
straight. The top right panel has the Error 1 residuals versus predicted values. Note the wider spread
for predicted values near 3. The bottom left panel plots the Error 1 residuals against Reps and
shows nothing startling. The bottom right panel is a plot against Laundries in which we see that
the spread for Laundry 4 is much wider than the spread for the other laundries. This seems to be
worth discussing with the experimenter. Of course there are only 6 residuals (with only 2 degrees of
freedom), so it is difficult to draw any firm conclusions.
[Figure 19.1: four panels of standardized whole-plot residuals. Panels: Normal Q-Q plot; Whole-Plot Residual-Fitted plot; residuals versus Replications; residuals versus Laundries.]
Figure 19.1: Normal plot of whole-plot residuals. Whole-plot residuals versus predicted values, replications,
and laundries. Absorption data.
[Figure 19.2: normal plot of the Error 2 residuals. Horizontal axis: Theoretical Quantiles.]
There are seven whole plots, so why are there only six whole-plot residuals in the graphs? Rep
2, Laundry 3 is missing, so the only observation on Laundry 3 is that from Rep 1, Laundry 3. It
follows that Rep 1, Laundry 3 has a leverage of one and the fitted value always equals the data
point. There is little value to a residual that the model forces to be zero.
Figures 19.2 and 19.3 contain a series of Error 2 residual plots obtained from Model (19.1.1)
fitted with η fixed. Figure 19.2 contains the normal plot; it looks all right. The top left panel of
Figure 19.3 plots Error 2 residuals versus predicted values (treating the ηs as fixed). The other
panels are plots against Reps, Laundries, and Tests. There is nothing startling. ✷
446 19. DEPENDENT DATA
[Figure 19.3: four panels of standardized subplot residuals. Panels: Subplot Residual-Fitted plot; Subplot Residual-Replication plot; residuals versus Laundries; residuals versus Tests.]
Figure 19.3: Subplot residuals versus predicted values, replications, laundries, and tests. Absorption data.
Neither the fitted (predicted) values in Figure 19.1 nor the fitted values in Figure 19.3 are the
fitted values from the split-plot model.
Table 19.3 contains the usual diagnostics (treating the ηs as fixed). All the observations on
Laundry 3 have leverage 1 because Rep 2, Laundry 3 is missing.
$$y_{ijk} = \mu + r_i + w_j + \eta_{ij} + s_k + (ws)_{jk} + \varepsilon_{ijk},$$
(The $\mathrm{mod}_a(b)$ function for integers a and b is the remainder when b is divided by a, thus $\mathrm{mod}_4(7) = 3$.)
The data are presented again in Table 19.4 in a form suitable for modeling. Rep, Laundry, and
Test correspond to i, j, k while “inter” corresponds to h.
We fit two hierarchies of four models for each Laundry. First, we successively fit models that
incorporate A = D; A = D and B = C; A = D = B = C for one laundry at a time. The other hierarchy
successively incorporates B = C; A = D and B = C; A = D = B = C. In these two hierarchies, only
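[Figure 19.4: interaction plot of fitted values (vertical axis: Fitted) against Test (A, B, C, D), with one profile for each of Laundries 1, 2, 3, and 4.]
Figure 19.4: Interaction plot for dynamic absorption data. Plot of $\hat{y}_{1jk}$ versus k for each j.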
 C1       C2    C3     C4     C5    C6    C7    C8    C9    C10
  i        j     k             h    L1    L1    L1    L1  L:2-4
Rep  Laundry  Test      y  inter   A=D   B=C  both  ABCD  T:B-C
  1        1     1   7.20      1     1     1     1     1      1
  1        1     2  11.70      2     2     2     2     1      2
  1        1     3  15.12      3     3     2     2     1      3
  1        1     4   8.10      4     1     4     1     1      4
  2        1     1   9.06      1     1     1     1     1      1
  2        1     2  11.79      2     2     2     2     1      2
  2        1     3  14.38      3     3     2     2     1      3
  2        1     4   8.12      4     1     4     1     1      4
  1        2     1   2.40      5     5     5     5     5      5
  1        2     2   7.76      6     6     6     6     6      6
  1        2     3   6.13      7     7     7     7     7      7
  1        2     4   2.64      8     8     8     8     8      8
  2        2     1   2.14      5     5     5     5     5      5
  2        2     2   7.76      6     6     6     6     6      6
  2        2     3   6.89      7     7     7     7     7      7
  2        2     4   3.17      8     8     8     8     8      8
  1        3     1   2.19      9     9     9     9     9      9
  1        3     2   4.92     10    10    10    10    10     10
  1        3     3   5.34     11    11    11    11    11     11
  1        3     4   2.47     12    12    12    12    12     12
  1        4     1   1.22     13    13    13    13    13     13
  1        4     2   2.62     14    14    14    14    14      6
  1        4     3   5.50     15    15    15    15    15      7
  1        4     4   2.74     16    16    16    16    16     16
  2        4     1   2.43     13    13    13    13    13     13
  2        4     2   3.90     14    14    14    14    14      6
  2        4     3   5.27     15    15    15    15    15      7
  2        4     4   2.31     16    16    16    16    16     16
Table 19.5: SSE and dfE for dynamic absorption hierarchical models.
the first model changes. Fitting these four models for Laundry 1 involves fitting a Reps-Laundry
effect along with an effect for one of the other columns from Table 19.4. The first models in
the two hierarchies replace h in Model (19.1.3) with C6 and C7, respectively. To incorporate
A = D in Laundry 1 (column C6), we have replaced the index for Laundry 1, Test D by the index
for Laundry 1, Test A; to get B = C in Laundry 1 (column C7), we have replaced the index for
Laundry 1, Test C by the index for Laundry 1, Test B. The second model in each hierarchy uses C8.
The last uses C9.
We refer to Model (19.1.4) as the C5 model and the models that incorporate A = D; B = C; A = D
and B = C; and A = D = B = C for Laundry 1 as the C6, C7, C8, and C9 models, respectively.
Similar data columns must be constructed if we want to fit the hierarchies to each of the other three
Laundries. Table 19.5 contains the results of fitting the 16 models that incorporate A = D; B = C;
A = D and B = C; A = D = B = C for each laundry (leaving the other laundries unmodeled).
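The index columns of Table 19.4 can be generated mechanically. The sketch below is one way to do it in Python; the function name is hypothetical, and the rule h = 4(j − 1) + k for the interaction index is inferred from the values printed in Table 19.4 rather than quoted from the text. Tests A, B, C, D are coded k = 1, 2, 3, 4.

```python
def index_columns(j, k):
    """Return (C5, C6, C7, C8, C9, C10) for Laundry j and Test k (A..D coded 1..4)."""
    c5 = 4 * (j - 1) + k          # one index per Laundry-Test combination
    a_eq_d = {4: 1}               # Laundry 1: Test D takes the index of Test A
    b_eq_c = {3: 2}               # Laundry 1: Test C takes the index of Test B
    if j == 1:
        c6 = a_eq_d.get(c5, c5)   # A = D
        c7 = b_eq_c.get(c5, c5)   # B = C
        c8 = a_eq_d.get(c7, c7)   # A = D and B = C
        c9 = 1                    # A = D = B = C
    else:
        c6 = c7 = c8 = c9 = c5    # other laundries are left unmodeled
    # C10: Tests B and C in Laundry 4 reuse the Laundry 2 indices for B and C.
    c10 = {14: 6, 15: 7}.get(c5, c5)
    return c5, c6, c7, c8, c9, c10

for j in (1, 2, 3, 4):
    for k in (1, 2, 3, 4):
        print(j, k, index_columns(j, k))
```

Columns for modeling one of the other laundries would follow the same pattern with the `j == 1` branch changed to the laundry of interest.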
The sums of squares for various model tests are obtained by comparing the error sums of squares
with each other and with that from the interaction model (19.1.1) given in Table 19.2. Consider
modeling the Laundry 1 behavior. Note that, to three decimal places, the SSE is the same regardless
of whether we incorporate A = D for Laundry 1. The sum of squares for testing A = D in Laundry
1 can be obtained by comparing the full model (19.1.1), i.e., model C5, to the C6 model. With the
subplots balanced in each laundry, the same sum of squares for testing A = D is obtained after fitting
B = C by comparing the C7 model to the C8 model. Similarly, B = C can be tested by comparing
models C5 and C7 or models C6 and C8. We can also test A = D = B = C after fitting A = D and
B = C by comparing the full model C8 to the reduced model C9. Again, all of these models also
include a Reps-Laundry effect. These comparisons give the Laundry 1 sums of squares that follow.
From these sums of squares there is little evidence against A = D for any laundry. For every laundry
there is considerable evidence against B = C except for Laundry 3, where we have half as much
data. There is very considerable evidence against all four tests being equal (given that A = D and
B = C, the latter of which is unlikely to be true). Except for the Laundry 3 results, these sums of
squares agree, up to round-off error, with the balanced analysis presented in Christensen (1996,
Section 12.2). These sums of squares can be compared with MSE(2) = 0.378 from Table 19.2 to
obtain tests with a null F(1, 9) distribution. Unfortunately, similar techniques for comparing the
Laundries for each fixed Test will not lead to a test statistic that can be compared to a known F
distribution. [Actually, such a test is possible by refitting the whole-plot model but eliminating the
data from all Tests except the one of interest. This leads to different error terms for each fixed Test.
This possibility is again mentioned in the next section.]
Finally, we can formally examine interactions. Since there is no evidence against A = D for any
laundry, we will not find evidence that the differential effect of A versus D changes from laundry
to laundry. Let’s examine the interaction effect of whether the difference between tests B and C is
the same for Laundry 2 as it is for Laundry 4. We can test this by fitting a model with main effects
for Laundries 2 and 4 and main effects for Tests B and C, but that allows every other combination
of a laundry and test to have its own effect. Comparing this reduced model to the full model gives
the desired test for interaction. However, since the model we are fitting already (implicitly) contains
main effects for Laundries, the first step of fitting main effects for Laundries 2 and 4 is redundant.
Fitting the model using column C10 after Rep-Laundry effects accomplishes our goal of having
main effects only for Tests B and C in Laundries 2 and 4 but possible interaction for any other factor
combination and gives dfE = 10 and SSE = 9.096. Comparing this to the full model (19.1.1) with
dfE = 9 and SSE = 3.401, the test mean square is 9.096 − 3.401 = 5.695 on 1 degree of freedom,
which is much larger than MSE(2) = 0.378, so there is substantial evidence that the B-C effect
changes from Laundry 2 to Laundry 4. Note that C10 is similar to the full interaction model C5
except that for Laundry 4, instead of having distinct indices for Tests B and C, C10 uses the same
indices for Tests B and C as were used for them with Laundry 2.
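The arithmetic of this model comparison can be sketched in Python as follows. The helper name is hypothetical. The value 5.12 is the standard tabled 95th percentile of an F(1, 9) distribution and is not taken from the text.

```python
def model_comparison(sse_reduced, dfe_reduced, sse_full, dfe_full):
    """Test sum of squares and degrees of freedom from differencing two error lines."""
    return sse_reduced - sse_full, dfe_reduced - dfe_full

# C10 model (reduced) versus the interaction model (19.1.1) (full).
ss_test, df_test = model_comparison(9.096, 10, 3.401, 9)
mse2 = 0.378                       # Error 2 mean square from Table 19.2
f_obs = (ss_test / df_test) / mse2
f_crit = 5.12                      # F(0.95, 1, 9), standard table value (assumption)
print(f"SS = {ss_test:.3f} on {df_test} df, F = {f_obs:.2f}, exceeds {f_crit}: {f_obs > f_crit}")
```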
One last note. Because the whole-plot design is allowed to be unbalanced, rather than just having
one whole-plot ANOVA table as in Table 19.2, we might need to consider fitting different sequences
of models for the whole plots, similar to Chapters 14 and 16. However, because the subplots are
balanced within whole plots, typically there will be only one form for the subplot entries in ANOVA
tables like Table 19.2.
EXAMPLE 19.2.1. In Section 16.2 we considered data from Box (1950) on fabric abrasion. The
data consisted of three factors: Surface treatment (yes, no), Fill (A, B), and Proportion of fill (25%,
50%, 75%). These are referred to as S, F, and P, respectively. (Again, we hope no confusion occurs
between the factor F and the use of F statistics or between the factor P and the use of P values!)
In Section 16.2 we restricted our attention to the weight loss that occurred during the first 1000
revolutions of a machine designed for evaluating abrasion resistance, but data are also available
on each piece of cloth for weight loss between 1000 and 2000 rotations and weight loss occurring
between 2000 and 3000 rotations. The full data are given in Table 19.6. In analyzing the full data,
many aspects are just simple extensions of the analysis given earlier in Section 16.2. There are now
four factors, S, F, P, and one for rotations, say, R. With four factors, there are many more effects
to deal with. There is one more main effect, R, three more two-factor interactions, S ∗ R, F ∗ R,
and P ∗ R, three more three-factor interactions, S ∗ F ∗ R, S ∗ P ∗ R, and F ∗ P ∗ R, and a four-factor
interaction, S ∗ F ∗ P ∗ R.
In addition to having more factors than we have considered before, what makes these data wor-
thy of our further attention is the fact that not all of the observations are independent. Observations
on different pieces of fabric may be independent, but the three observations on the same piece of
fabric, one after 1000, one after 2000, and one after 3000 revolutions, should behave similarly as
compared to observations on different pieces of fabric. In other words, the three observations on one
piece of fabric should display positive correlations. The analysis considered in this section assumes
19.2 A FOUR-FACTOR EXAMPLE 451
that the correlation is the same between any two of the three observations on a piece of fabric. To
achieve this, we consider a model that includes two error terms,
$h = 1, 2$, $i = 1, 2$, $j = 1, 2, 3$, $k = 1, 2$, $m = 1, 2, 3$. The error terms are the $\eta_{hijk}$s and the $\varepsilon_{hijkm}$s. These
are all assumed to be independent of each other with
The $\eta_{hijk}$s are error terms due to the use of a particular piece of fabric and the $\varepsilon_{hijkm}$s are error
terms due to taking the observations after 1000, 2000, and 3000 rotations. While we have two error
terms, and thus two variances, the variances are assumed to be constant for each error term, so
that all observations have the same variance, $\sigma_w^2 + \sigma_s^2$. Observations on the same piece of fabric
are identically correlated because they all involve the same fabric error term $\eta_{hijk}$. Note that Model
(19.2.1) could also be written in a form more similar to the previous section as
Source       df          SS          MS        F       P
S             1     24494.2     24494.2    78.58   0.000
F             1    107802.7    107802.7   345.86   0.000
P             2     13570.4      6785.2    21.77   0.000
S∗F           1      1682.0      1682.0     5.40   0.039
S∗P           2       795.0       397.5     1.28   0.315
F∗P           2      9884.7      4942.3    15.86   0.000
S∗F∗P         2       299.3       149.6     0.48   0.630
Error 1      12      3740.3       311.7
R             2     60958.5     30479.3   160.68   0.000
S∗R           2      8248.0      4124.0    21.74   0.000
F∗R           2     18287.7      9143.8    48.20   0.000
P∗R           4      1762.8       440.7     2.32   0.086
S∗F∗R         2      2328.1      1164.0     6.14   0.007
S∗P∗R         4       686.0       171.5     0.90   0.477
F∗P∗R         4      1415.6       353.9     1.87   0.149
S∗F∗P∗R       4       465.9       116.5     0.61   0.657
Error 2      24      4552.7       189.7
Total        71    260973.9
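The two error mean squares also give rough estimates of the two variances and of the common correlation between observations on the same piece of fabric. The Python sketch below assumes the usual balanced split-plot expected mean squares, E[MSE(2)] = $\sigma_s^2$ and E[MSE(1)] = $\sigma_s^2 + 3\sigma_w^2$ with three observations per piece of fabric. That relationship is a standard result and is not stated in this passage, so treat the output as a method-of-moments illustration only.

```python
mse1, mse2 = 311.7, 189.7    # Error 1 and Error 2 mean squares from the ANOVA table
m = 3                        # observations (rotation levels) per piece of fabric

sigma_s2 = mse2                      # subplot variance estimate
sigma_w2 = (mse1 - mse2) / m         # whole-plot (fabric) variance estimate
rho = sigma_w2 / (sigma_w2 + sigma_s2)   # correlation between two observations on one fabric

print(f"sigma_w^2 = {sigma_w2:.2f}, sigma_s^2 = {sigma_s2:.1f}, correlation = {rho:.3f}")
```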
examining the highest-order interactions down to the two-factor interactions and main effects. From
Table 19.7 we see that the four-factor interaction has a test statistic of 0.61 and a very large P value,
0.657. We will see that the same results arise from methods for unbalanced data. Even if all of the
four-factor interaction sum of squares was ascribed to one degree of freedom, an unadjusted F test
would not be significant. There is no evidence for a four-factor interaction.
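The worst-case argument can be verified directly. In this Python sketch the critical value 4.26 is F(0.95, 1, 24), the value quoted later in this section; the variable names are illustrative.

```python
ss_sfpr = 465.9      # S*F*P*R sum of squares from Table 19.7 (4 df)
mse2 = 189.7         # Error 2 mean square from Table 19.7
f_crit = 4.26        # F(0.95, 1, 24)

# Worst case: pretend the entire 4-df sum of squares sits on a single degree of freedom.
f_worst = (ss_sfpr / 1) / mse2
print(f"worst-case F = {f_worst:.2f}, significant at 0.05: {f_worst > f_crit}")
```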
The next step is to consider three-factor interactions. There is one three-factor interaction in the
whole plots and three of them in the subplots. We need different methods to evaluate these. Nonethe-
less, in this balanced case we easily see that the only clearly important three-factor interaction is
S ∗ F ∗ R whereas the F ∗ P ∗ R interaction, with 4 degrees of freedom, has a P value that is small
enough that we might want to investigate whether some interesting, interpretable interaction effect
is being hidden by the overall test. In the absence of F ∗ P ∗ R interaction, we would want to explore
all of the corresponding two-factor effects, in particular the F ∗ P and P ∗ R interactions, which Ta-
ble 19.7 tells us are clearly significant and marginally significant, respectively. The other two-factor
effect subsumed by F ∗ P ∗ R is F ∗ R, but that is also subsumed by the significant S ∗ F ∗ R effects,
so F ∗ R does not warrant separate consideration. However, our goal is to illustrate techniques that
can be used for unbalanced observations, and we examine these interactions using such methods (as
opposed to using contrasts, which is what one would traditionally do for a balanced analysis).
Table 19.8: Subplot high-order interaction models for data of Table 19.6.
we examine models that include a separate fixed effect for each whole plot, i.e., we treat the η terms
as fixed effects. We label these effects as SFPW. Rotations are the only effect not included in this
term, so any interesting additional effects must include rotations. The model with all the subplot
effects and subplot–whole-plot interactions is
$$\begin{aligned} y_{hijkm} &= (sfpw)_{hijk} + r_m + (sr)_{hm} + (fr)_{im} + (pr)_{jm} \\ &\quad + (sfr)_{him} + (spr)_{hjm} + (fpr)_{ijm} + (sfpr)_{hijm} + \varepsilon_{hijkm}. \end{aligned} \tag{19.2.2}$$
Table 19.8 gives results for fitting this model and various reduced models using shorthand notation
to denote models. Note the similarity between the models considered in Table 19.8 and the models
considered in Table 16.2. The models in Table 19.8 all include [SFPW] and the other terms all
include an R but otherwise the nine models are similar. For balanced subplots, the information
in Table 19.8 can be obtained from Table 19.7, but Table 19.8 is also appropriate for unbalanced
subplots. Table 19.8 also includes the $C_p$ statistics for these models. The $C_p$ statistics can be treated
in the usual way but would not be appropriate for comparing models that do not have a separate fixed
effect for each whole plot. The best-fitting models are [SFPW][SFR][FPR] and [SFPW][SFR][PR],
both of which include the S ∗ F ∗ R interaction between surface treatments, fills, and rotations.
The two best $C_p$ models are hierarchical and the test of them, that is, of [SFPW][SFR][FPR]
versus [SFPW][SFR][PR], provides a test of F ∗ P ∗ R interaction with statistic
$$F_{obs} = \frac{[7120.2 - 5704.6]/[36 - 32]}{189.696} = 1.87,$$
which, when compared to an F(4, 24) distribution, gives a one-sided P value of 0.149 as reported in
Table 19.7 for these balanced data. There is no strong evidence for an F ∗ P ∗ R interaction, but that
is not proof that it does not exist.
In our analysis from Section 16.2 of the 1000-rotation data, we found an S ∗ F interaction but
similar analyses for the 2000 and 3000 rotation data show no S ∗ F interaction. (All three ANOVA
tables are given in Section 19.3.) Tables 19.7 and 19.8 confirm the importance of fitting an S ∗ F ∗ R
interaction. Assuming no four-factor interaction, the three different tests for S ∗ F ∗ R available from
Table 19.8 give
$$F_{obs} = \frac{[7346.7 - 5018.6]/[30 - 28]}{189.696} = \frac{[8762.3 - 6434.2]/[34 - 32]}{189.696} = \frac{[9448.3 - 7120.2]/[38 - 36]}{189.696} = \frac{[2328.1]/2}{189.696} = 6.14,$$
which all agree because of subplot balance. Thus the S ∗ F interaction depends on the number of
rotations. It might be of interest if we could find a natural interpretation for this interaction. We
now proceed to examine what is driving the S ∗ F ∗ R interaction and the more dubious F ∗ P ∗ R
interaction.
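The agreement of the three S ∗ F ∗ R tests can be checked numerically. The Python sketch below takes the error sums of squares from the text and uses dfE = 36 for [SFPW][SFR][PR], the value that appears in the F ∗ P ∗ R test earlier. The P value uses the closed-form upper tail of an F distribution with 2 numerator degrees of freedom, $(1 + 2f/n)^{-n/2}$, a standard identity not taken from the text.

```python
# (SSE reduced, dfE reduced, SSE full, dfE full) for the three S*F*R comparisons.
comparisons = [
    (7346.7, 30, 5018.6, 28),   # within the model with all three-factor interactions
    (8762.3, 34, 6434.2, 32),   # within the model without F*P*R
    (9448.3, 38, 7120.2, 36),   # within [SFPW][SFR][PR]
]
mse2, dfe = 189.696, 24

f_values = [((sse_r - sse_f) / (df_r - df_f)) / mse2
            for sse_r, df_r, sse_f, df_f in comparisons]
p_values = [(1 + 2 * f / dfe) ** (-dfe / 2) for f in f_values]

for f, p in zip(f_values, p_values):
    print(f"F = {f:.2f}, P = {p:.3f}")
```

With unbalanced subplots the three rows would generally give different statistics, which is why the model in which the test is carried out would then need to be reported.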
We begin by looking at the S ∗ F ∗ R interaction. Normally, with rotations at quantitative levels,
we would use linear and quadratic models in rotations to examine interaction. However, we previ-
ously analyzed the data from each number of rotations separately and discovered no S ∗ F interaction
at 2000 and 3000 rotations, so we will use a model that does not distinguish between these levels of
rotations, i.e., in a new categorical variable that we will call $R_{2=3}$, 2000 and 3000 rotations have the
same index. These models continue to include terms $(sr)_{hm} + (fr)_{im} + (pr)_{jm}$ or their equivalents
but replace $(sfr)_{him}$ with $(sfr)_{hi\tilde{m}}$, which uses a new index variable $\tilde{m}$ that does not distinguish
between rotations 2000 and 3000. These models do not incorporate the idea that there is no S ∗ F
interaction at 2000 or 3000 rotations, but they do incorporate the idea that the S ∗ F interaction is
the same at 2000 and 3000 rotations yet is possibly different from the S ∗ F interaction at 1000 ro-
tations. We can investigate these terms in any model that includes the S ∗ F ∗ R interaction but not
the four-factor interaction. The most reasonable choices for evaluating S ∗ F ∗ R with unbalanced
subplot data are in the model with all of the three-factor interactions or in the two good models
identified by the $C_p$ statistic.
Table 19.9 gives the model fitting information. In particular, Table 19.9 gives three sets of three
models, one that includes the S ∗ F ∗ R interaction, one that posits no change in the S ∗ F interac-
tion for rotations 2000 and 3000, and one that eliminates the S ∗ F ∗ R interaction. It also gives the
differences in sums of squares for the three models. The models that posit no change in the S ∗ F
interaction for rotations 2000 and 3000 fit the data well with a difference in sums of squares of
24.1. Because of subplot balance, these numbers do not depend on which of the three particular
sets of model comparisons are being made. (The value of 24.0 rather than 24.1 is round-off er-
ror.) However, models that posit no difference between the S ∗ F interaction at 1000 rotations and
the common S ∗ F interaction at 2000 and 3000 rotations have a substantial difference in sums of
squares of 2304.0, which leads to a significant F test. Using Scheffé’s multiple comparison method
is appropriate because the data suggested the model. (Previous analysis showed no S ∗ F interaction
for 2000 or 3000 rotations but some for 1000.) The test statistic that compares the S ∗ F interaction
at 1000 rotations with the others is
$$F_{obs} = \frac{2304/2}{189.7} = 6.07,$$
which is significant at the 0.01 level because F(0.99, 2, 24) = 5.61. Again, for unbalanced data the
three model comparisons could differ. This model-based analysis that is applicable to unbalanced
subplot data reproduces the results in Christensen (1996, Section 12.2) that examine orthogonal
contrasts in the S ∗ F ∗ R interaction for balanced data.
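The Scheffé adjustment here amounts to dividing the one-degree-of-freedom sum of squares by the 2 degrees of freedom of the whole S ∗ F ∗ R interaction before comparing with an F(2, 24) distribution. The Python sketch below contrasts that with the unadjusted ratio; variable names are illustrative, and the P value uses the standard closed-form tail $(1 + 2f/n)^{-n/2}$ for 2 numerator degrees of freedom.

```python
ss_contrast = 2304.0    # S*F at 1000 rotations versus the common S*F at 2000 and 3000
df_effect = 2           # degrees of freedom of the full S*F*R interaction
mse2, dfe = 189.7, 24
f_crit_99 = 5.61        # F(0.99, 2, 24), as quoted in the text

f_unadjusted = ss_contrast / mse2                 # appropriate only for a preplanned contrast
f_scheffe = (ss_contrast / df_effect) / mse2      # Scheffe: divide by the df of the whole effect
p_scheffe = (1 + 2 * f_scheffe / dfe) ** (-dfe / 2)

print(f"unadjusted F = {f_unadjusted:.2f}")
print(f"Scheffe F = {f_scheffe:.2f}, P = {p_scheffe:.4f}, exceeds {f_crit_99}: {f_scheffe > f_crit_99}")
```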
Recall that in our earlier analysis from Section 16.2 based on just the 1000-rotation data, we
also found an F ∗ P interaction. An F ∗ P ∗ R interaction indicates that the F ∗ P interaction changes
with the number of rotations. If we conclude that no F ∗ P ∗ R interaction exists, we need to consider
the corresponding two-factor interactions involving P. We need to focus on P because it is the only
factor that is not included in the significant S ∗ F ∗ R interaction. The possible two-factor interactions
are F ∗ P and P ∗ R.
It is not clear that an F ∗ P ∗ R interaction exists but, to be safe, we will examine some reason-
able reduced interaction models. If some interpretable interaction effect has a large sum of squares,
it suggests that an important interaction may be hidden within the 4-degree-of-freedom interac-
tion test. To examine F ∗ P ∗ R interaction, we consider polynomial models in both proportions and
rotations,
$$y_{hijkm} = (sfpw)_{hijk} + (sfr)_{him} + (pr)_{jm} + \beta_{11i}\,jm + \beta_{12i}\,jm^2 + \beta_{21i}\,j^2m + \beta_{22i}\,j^2m^2 + \varepsilon_{hijkm}. \tag{19.2.3}$$
Relative to this quadratic-by-quadratic interaction model, the first reduced model fitted drops the
term $\beta_{22i}j^2m^2$. From the model without $\beta_{22i}j^2m^2$, we can drop either $\beta_{21i}j^2m$ or $\beta_{12i}jm^2$,
determining two hierarchies of models. The last reduced model in the hierarchy will include only the
linear-by-linear interaction term $\beta_{11i}jm$. Dropping this term leads to a model without F ∗ P ∗ R
interaction.
The results of fitting the two hierarchies are given in Table 19.10. Because these subplot data are
balanced, the differential effects for the intermediate regression terms are identical (up to round-off
error) in the two hierarchies, i.e., 660 for fitting a quadratic-by-quadratic term after fitting the others,
495 for fitting a proportion-linear-by-rotation-quadratic term, regardless of whether a proportion-
quadratic-by-rotation-linear term has already been fitted, 40 for fitting a proportion-quadratic-by-
rotation-linear term, regardless of whether a proportion-linear-by-rotation-quadratic term has al-
ready been fitted, and 220.5 for fitting a proportion-linear-by-rotation-linear term. These results
provide a model-based reproduction of results obtained using orthogonal interaction contrasts for
balanced data in Christensen (1996, Section 12.2). Note also that these hierarchies involve dropping
pairs of regression coefficients, e.g., $\beta_{22i}$, $i = 1, 2$, but dropping these pairs only increases the error
degrees of freedom by 1. This is a result of having $(sfpw)_{hijk}$ in every model.
These models were not chosen by looking at the data, so less stringent multiple comparison
methods than Scheffé’s can be used on them. On the other hand, the models are not particularly
informative. None of these models suggests a particularly strong source of interaction. F tests are
constructed by dividing each of the four sums of squares by MSE(2). None of the F ratios is signif-
icant when compared to F(0.95, 1, 24) = 4.26. This analysis seems consistent with the hypothesis
of no F ∗ P ∗ R interaction.
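The four one-degree-of-freedom tests can be tabulated in a few lines. This Python sketch uses the sums of squares quoted above and MSE(2) = 189.7 from Table 19.7; the term labels are our own shorthand.

```python
ss_terms = {
    "P-quadratic x R-quadratic": 660.0,
    "P-linear x R-quadratic":    495.0,
    "P-quadratic x R-linear":     40.0,
    "P-linear x R-linear":       220.5,
}
mse2 = 189.7
f_crit = 4.26    # F(0.95, 1, 24), as quoted in the text

f_ratios = {term: ss / mse2 for term, ss in ss_terms.items()}
for term, f in f_ratios.items():
    print(f"{term:27s} F = {f:5.2f}  significant: {f > f_crit}")

# The four pieces should rebuild the 4-df F*P*R sum of squares (1415.6) up to round-off.
total_ss = sum(ss_terms.values())
print(f"total = {total_ss:.1f}")
```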
If we accept the working assumption of no F ∗ P ∗ R interaction, we need to examine the two-
factor interactions that can be constructed from the three factors. These are F ∗ P, F ∗ R, and P ∗ R.
The F ∗ R effects are, however, not worth further consideration because they are subsumed within
the S ∗ F ∗ R effects that have already been established as important. Another way of looking at this
is that in Model (19.2.1), the $(fr)_{im}$ effects are unnecessary in a model that already has $(sfr)_{him}$
effects. Thus we focus our attention on F ∗ P and P ∗ R. The F ∗ P interaction is a whole-plot effect,
so it will be considered in the next subsection.
We now examine the P ∗ R interaction. Information for testing whether [PR] can be dropped
from [SFPW][SFR][PR] is given at the top and bottom of Table 19.11. The F statistic becomes,
$$y_{hijkm} = (sfpw)_{hijk} + (sfr)_{him} + \beta_{11}\,jm + \beta_{12}\,jm^2 + \beta_{21}\,j^2m + \beta_{22}\,j^2m^2 + \varepsilon_{hijkm}. \tag{19.2.4}$$
Results for fitting reduced models are given in Table 19.11. There are two hierarchies but due to
subplot balance they give the same results. We find that the sequential sum of squares for dropping
$\beta_{12}$ is 726.0 and for dropping $\beta_{11}$ is 741.1. Comparing them to MSE(2), these sums of squares are
not small but neither are they clearly significant. The interaction plot in Figure 19.5 of $\hat{y}_{11j1m}$ values
from Model (19.2.4) seems to confirm that there is no obvious interaction being overlooked by the
four degrees of freedom test. We remain unconvinced that there is any substantial P ∗ R interaction.
These are exact analogues to results in Christensen (1996, Section 12.2).
[Figure 19.5: interaction plot of fitted values (vertical axis: Fitted) against Rotations, with one profile for each Percent level: 25, 50, and 75.]
Figure 19.5: Proportion-Rotation interaction plot for abrasion data. Plot of $\hat{y}_{11j1m}$ versus m for each j.
We begin by dropping all the subplot effects out of Model (19.2.2) and fitting
$$y_{hijkm} = (sfpw)_{hijk} + \varepsilon_{hijkm},$$
cf. Table 19.12. To obtain the whole-plot Error, we compare this to a model with all whole-plot
effects but no subplot effects,
$$y_{hijkm} = \mu + s_h + f_i + p_j + (sf)_{hi} + (sp)_{hj} + (fp)_{ij} + (sfp)_{hij} + \varepsilon_{hijkm}. \tag{19.2.5}$$
From Table 19.12, MSE(1) = [102446 − 98705]/[60 − 48] = 3741/12 = 311.7, which agrees with
Table 19.7.
As we create reduced models relative to Model (19.2.5) in the whole plots we can get test
degrees of freedom and sums of squares by differencing the errors of various reduced models in the
usual way. Table 19.12 includes the usual 9 models for 3 factors and modified $C_p$ statistics. From
Table 19.12 we can reproduce the whole-plot tests of Table 19.7. For example, SS(S ∗ F ∗ P) =
102745 − 102446 = 299 with df (S ∗ F ∗ P) = 62 − 60 = 2. Moreover, one sum of squares for S ∗ P
is
$$R([SP] \mid [FP][SF]) \equiv SSE([FP][SF]) - SSE([SP][FP][SF]) = 103540 - 102745 = 795$$
with
$$dfE([FP][SF]) - dfE([SP][FP][SF]) = 64 - 62 = 2.$$
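The whole-plot differencing can be sketched in Python as follows. Each test is the error line of the reduced model minus the error line of the fuller model. The (SSE, dfE) pairs are those quoted from Table 19.12; the label "[SFP]" for Model (19.2.5) and the helper name are our own shorthand.

```python
# (SSE, dfE) pairs quoted in the text from Table 19.12.
fits = {
    "[SFPW]":       (98705, 48),    # separate fixed effect for every whole plot
    "[SFP]":        (102446, 60),   # Model (19.2.5): all whole-plot effects
    "[SP][FP][SF]": (102745, 62),
    "[FP][SF]":     (103540, 64),
}

def reduction(reduced, full):
    """Sum of squares and df for testing the reduced model against the fuller one."""
    (sse_r, dfe_r), (sse_f, dfe_f) = fits[reduced], fits[full]
    return sse_r - sse_f, dfe_r - dfe_f

ss_e1, df_e1 = reduction("[SFP]", "[SFPW]")           # whole-plot error
mse1 = ss_e1 / df_e1
ss_sfp, df_sfp = reduction("[SP][FP][SF]", "[SFP]")   # S*F*P
ss_sp, df_sp = reduction("[FP][SF]", "[SP][FP][SF]")  # S*P after F*P and S*F

print(f"Error 1: SS = {ss_e1}, df = {df_e1}, MSE(1) = {mse1:.2f}")
print(f"S*F*P:   SS = {ss_sfp}, df = {df_sfp}")
print(f"S*P:     SS = {ss_sp}, df = {df_sp}")
```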
The best among the usual 9 models appears to be [SF][FP], which is equivalent to a quadratic
model in proportions
$$y_{hijkm} = (sf)_{hi} + p_j + \beta_{1i}\,j + \beta_{2i}\,j^2 + \varepsilon_{hijkm},$$
a model that we denote [SF][P][FP$_1$][FP$_2$]. Dropping the quadratic terms in proportions gives
[SF][P][FP$_1$] and dropping the linear term reduces us to [SF][P]. The sum of squares for the quadratic
F ∗ P interaction term is
$$R([FP_2] \mid [SF][P][FP_1]) \equiv SSE([SF][P][FP_1]) - SSE([SF][P][FP_1][FP_2]) = 103563 - 103540 = 23$$
each on 1 degree of freedom. The F ∗ P interaction is a whole-plot effect, so the appropriate error is
MSE(1) = 311.7 and the F ratios are 0.075 and 31.64, respectively. There is no evidence that the
curvatures in proportions are different for Fills A and B. However, there is evidence that the slopes
are different for Fills A and B.
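The two F ratios can be approximately reproduced from quantities that do appear in this section. In the Python sketch below the quadratic sum of squares is 103563 − 103540 = 23, and the linear sum of squares is obtained by subtraction from the 2-degree-of-freedom F ∗ P line of Table 19.7. That subtraction is an assumption on our part: it relies on the two one-degree-of-freedom pieces adding up to the F ∗ P sum of squares in this balanced whole-plot analysis.

```python
ss_fp_total = 9884.7             # F*P line of Table 19.7 (2 df)
ss_quadratic = 103563 - 103540   # quadratic F*P term, from the text
ss_linear = ss_fp_total - ss_quadratic   # assumption: linear piece obtained by subtraction
mse1 = 311.7                     # whole-plot error mean square

f_quadratic = ss_quadratic / mse1
f_linear = ss_linear / mse1
print(f"quadratic: F = {f_quadratic:.3f}")
print(f"linear:    F = {f_linear:.2f}")
```

Because the inputs are rounded, the quadratic ratio matches the reported 0.075 only to about two significant figures.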
In fact, we can take this further. The data are consistent with there being not only no change
in curvature but no curvature at all and, although the slopes are different, there is no evidence of a
nonzero slope for Fill B.
To fit separate quadratic models in j for each fill, we need to manipulate the indices. A 0-1
indicator variable for Fill B is $f_B \equiv i - 1$ and an indicator variable for Fill A is $f_A \equiv 2 - i$. Define
a linear term in proportions for Fill A only as $p_A \equiv j \cdot f_A$ and the quadratic term is $p_A^2 \equiv j^2 \cdot f_A$.
Similarly, the linear and quadratic terms for Fill B are $p_B \equiv j \cdot f_B$ and $p_B^2 \equiv j^2 \cdot f_B$. The following
model is equivalent to [SF][FP],
$$y_{hijkm} = (sf)_{hi} + \beta_{A1}\,p_A + \beta_{A2}\,p_A^2 + \beta_{B1}\,p_B + \beta_{B2}\,p_B^2 + \varepsilon_{hijkm}.$$
with dfE = 65 and SSE = 103652 for a difference in sums of squares of 103652 − 103540 = 112
for the Fill B quadratic term. Further dropping the linear term for Fill B gives
with dfE = 66 and SSE = 103793 for a difference in sums of squares of 103793 − 103652 = 141
due to the Fill B linear term. As far as we can tell, the weight loss does not change as a function of
proportion of filler when using Fill B.
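The index manipulation for fill-specific polynomial terms is easy to mechanize. The Python sketch below builds the four regression columns from the fill index i (1 = A, 2 = B) and the proportion index j, then forms the Fill B ratios against MSE(1) = 311.7 using the error sums of squares quoted above. The function name and dictionary keys are hypothetical.

```python
def fill_specific_terms(i, j):
    """Regression columns for fill index i (1 = A, 2 = B) and proportion index j (1, 2, 3)."""
    f_b = i - 1     # 1 for Fill B, 0 for Fill A
    f_a = 2 - i     # 1 for Fill A, 0 for Fill B
    return {"pA": j * f_a, "pA2": j**2 * f_a, "pB": j * f_b, "pB2": j**2 * f_b}

for i in (1, 2):
    for j in (1, 2, 3):
        print(i, j, fill_specific_terms(i, j))

# Fill B terms compared with the whole-plot error.
mse1 = 311.7
f_b_quadratic = (103652 - 103540) / mse1   # Fill B quadratic term, SS = 112
f_b_linear = (103793 - 103652) / mse1      # Fill B linear term, SS = 141
print(f"Fill B quadratic F = {f_b_quadratic:.2f}, linear F = {f_b_linear:.2f}")
```

Both ratios are well below 1, which is consistent with the conclusion that weight loss does not change with proportion for Fill B.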
A similar analysis for Fill A shows that weight loss increases with proportion and there is again
no evidence of curvature. In particular, for Fill A, the quadratic term has a sum of squares of 14. For
Fill A, the linear term has sum of squares 23202.
The other significant whole-plot effect is the S ∗ F interaction but those effects are subsumed by
incorporating the S ∗ F ∗ R subplot effects.
or equivalently,
which is a perfectly reasonable model to fit and one that allows exploration of S ∗ R interaction for
Fill A. Any inferences we choose to make can continue to be based on MSE(2) as our estimate of
variance. Similarly, there is no problem with fixing the level of a whole-plot effect in the whole-plot
analysis, similar to what we did in the previous subsection.
On the other hand, if we fix m = 1 in Model (19.2.2), the subplot model becomes
or equivalently,
$$y_{hijk1} = (sfpw)_{hijk} + \varepsilon_{hijk1},$$
which is not a model that allows us to examine S ∗ F interaction.
Fortunately, we can examine S ∗ F interaction for a fixed level of m; we just cannot do it in the
split-plot model context. If we go back to the original split-plot model (19.2.1) and fix m = 1 we get
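$$\begin{aligned} y_{hijk1} &= \mu + s_h + f_i + p_j \\ &\quad + (sf)_{hi} + (sp)_{hj} + (fp)_{ij} + (sfp)_{hij} \\ &\quad + \eta_{hijk} \\ &\quad + r_1 + (sr)_{h1} + (fr)_{i1} + (pr)_{j1} \\ &\quad + (sfr)_{hi1} + (spr)_{hj1} + (fpr)_{ij1} + (sfpr)_{hij1} \\ &\quad + \varepsilon_{hijk1} \end{aligned}$$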
or equivalently
which is really just the model that we analyzed in Section 16.2 where the two independent error
terms $\eta_{hijk}$ and $\varepsilon_{hijk1}$ are added and treated as a single error term. Our test in Section 16.2 of
the S ∗ F interaction using just the 1000-rotation data is perfectly appropriate and similar tests using
just the 2000-rotation data and just the 3000-rotation data would also be appropriate. ANOVA tables
for the separate analyses of the 1000-, 2000-, and 3000-rotation data are given in Section 19.3 as
Tables 19.14, 19.15, and 19.16. Note, though, that the separate analyses are not independent, because
the observations at 2000 rotations are not independent of the observations at 3000 rotations, etc.
On occasion, when examining models for a fixed subplot treatment, the degrees of freedom and sums
of squares for Error 1 and Error 2 are pooled and used in place of the MSEs from the separate
analyses. This is precisely the error estimate obtained by pooling the error estimates from the
three separate ANOVAs. Such a pooled estimate should be better than the estimates from the separate
analyses, but it is difficult to quantify the effect of pooling. The three separate ANOVAs are not
independent, so pooling the variance estimates does not have the nice properties of the pooled
estimate of the variance used in, say, one-way ANOVA. As alluded to above, we cannot get exact F
tests based on the pooled variance estimate. If the three ANOVAs were independent, the pooled error
would have 12 + 12 + 12 = 36 degrees of freedom, but we do not have independence, so we do not even
know an appropriate number of degrees of freedom to use with the pooled estimate, much less the
appropriate distribution.
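The pooling arithmetic can be sketched in a few lines. This is only an illustration, assuming the Error lines of Tables 19.14, 19.15, and 19.16 (sums of squares 3225.0, 2405.5, and 2662.5, each on 12 degrees of freedom); the variable names are made up for the example. The 36 degrees of freedom are nominal only, because the three analyses are not independent.

```python
# Error lines from the three separate ANOVAs (Tables 19.14, 19.15, 19.16).
error_ss = {"1000 rotations": 3225.0, "2000 rotations": 2405.5, "3000 rotations": 2662.5}
error_df = {name: 12 for name in error_ss}

# Pool by adding sums of squares and degrees of freedom.
pooled_ss = sum(error_ss.values())
pooled_df = sum(error_df.values())   # nominal only: the analyses are dependent
pooled_mse = pooled_ss / pooled_df

# Separate MSEs for comparison with the pooled value.
separate_mse = {name: error_ss[name] / error_df[name] for name in error_ss}

print("pooled SS:", pooled_ss, "nominal df:", pooled_df, "pooled MSE:", round(pooled_mse, 2))
for name, mse in separate_mse.items():
    print(name, "MSE:", round(mse, 2))
```

The pooled mean square is a weighted average of the three separate MSEs; what is not available is a reference distribution with known degrees of freedom.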
as
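$$y_{h1jkm} = p_j + \eta_{h1jk} + (sr)_{hm} + \varepsilon_{h1jkm} \tag{19.2.6}$$
and
$$y_{h2jkm} = p_j + \eta_{h2jk} + (sr)_{hm} + \varepsilon_{h2jkm}. \tag{19.2.7}$$
Models (19.2.6) and (19.2.7) can be written in split plot form as
$$y_{h1jkm} = s_h + p_j + \eta_{h1jk} + r_m + (sr)_{hm} + \varepsilon_{h1jkm}$$
and
$$y_{h2jkm} = s_h + p_j + \eta_{h2jk} + r_m + (sr)_{hm} + \varepsilon_{h2jkm}.$$
In the S ∗ F ∗ R interaction, rotations 2000 and 3000 are similar, so, as alternates to models
(19.2.6) and (19.2.7), we could fit models that do not distinguish between them using the index
$\tilde{m}$,
$$y_{h1jkm} = s_h + p_j + \eta_{h1jk} + r_m + (sr)_{h\tilde{m}} + \varepsilon_{h1jkm} \tag{19.2.8}$$
and
$$y_{h2jkm} = s_h + p_j + \eta_{h2jk} + r_m + (sr)_{h\tilde{m}} + \varepsilon_{h2jkm}. \tag{19.2.9}$$
And we can also fit the no-interaction models
$$y_{h1jkm} = s_h + p_j + \eta_{h1jk} + r_m + \varepsilon_{h1jkm} \tag{19.2.10}$$
and
$$y_{h2jkm} = s_h + p_j + \eta_{h2jk} + r_m + \varepsilon_{h2jkm}. \tag{19.2.11}$$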
The sum of squares for comparing models (19.2.6) and (19.2.8) is
All of these are compared to MSE(2) = 189.7. There is no evidence of interactions involving 2000
and 3000 rotations with surface treatments, regardless of fill type. With Fill A, there is marginal
evidence of an interaction in which the effect of S is different at 1000 rotations than at 2000 and
3000 rotations. With Fill B, there is clear evidence of an interaction where the effect of S is different
at 1000 rotations than at 2000 and 3000 rotations.
We earlier established that there is no quadratic effect in proportions for fill A, so Model (19.2.6)
can be replaced by
$$y_{h1jkm} = \gamma j + \eta_{h1jk} + (sr)_{hm} + \varepsilon_{h1jkm}.$$
We earlier showed that there are no linear or quadratic effects in proportions for fill B, so Model
(19.2.7) can be replaced by
$$y_{h2jkm} = \eta_{h2jk} + (sr)_{hm} + \varepsilon_{h2jkm}.$$
Incorporating the earlier subplot models gives us the split-plot models
$$y_{h1jkm} = s_h + \gamma j + \eta_{h1jk} + r_m + (sr)_{h\tilde{m}} + \varepsilon_{h1jkm} \tag{19.2.12}$$
and
$$y_{h2jkm} = s_h + \eta_{h2jk} + r_m + (sr)_{h\tilde{m}} + \varepsilon_{h2jkm}. \tag{19.2.13}$$
Parameter estimates can be obtained by least squares, i.e., by fitting the models ignoring the η errors.
The fitted values are given in Table 19.13. Note that the rows and columns have been rearranged
from those used for the data in Table 19.6.
For Fill A, either surface treatment and any level of rotation, estimated weight loss increases by
31.08 as the proportion goes up.
For Fill A, the effect of going from 1000 to 2000 rotations is an estimated decrease in weight
loss of 12.0 units with a surface treatment but an estimated increase in weight loss of 8.0 units
without the surface treatment. The estimated effect of going from 2000 to 3000 rotations is a drop
of 41.1 units in weight loss regardless of the surface treatment.
For Fill A, the estimated effect of the surface treatment is an additional 20.6 units in weight loss
at either 2000 or 3000 rotations but it is an additional 40.5 units at 1000 rotations.
For Fill B, the estimated weight loss does not depend on proportions.
For Fill B, the effect of going from 1000 to 2000 rotations is an estimated decrease in weight
loss of 111.0 units with a surface treatment but only an estimated decrease in weight loss of 43.2
without the surface treatment. The estimated effect of going from 2000 to 3000 rotations is a drop
of 22.1 units in weight loss regardless of the surface treatment.
For Fill B, the estimated effect of the surface treatment is an additional 24.0 units in weight loss
at either 2000 or 3000 rotations but it is an additional 91.8 units at 1000 rotations.
Most of these estimates are identical to estimates based on the balanced analysis presented in
Christensen (1996, Section 12.2). The exceptions are the estimates that compare results for 1000
and 2000 rotations for fixed levels of surface treatment, and fill (proportions being irrelevant). The
estimates in Christensen (1996, Section 12.2) were somewhat more naive in that they did not incor-
porate the lack of S ∗ F interaction at 2000 and 3000 rotations.
The same information can be obtained from the tables of coefficients for models (19.2.12) and
(19.2.13) but it is much more straightforward to get the estimates from the table of fitted values. In
particular, fitting the models (with intercepts but without the whole-plot error terms) in R gives
Table of Coefficients
Fill A: Model (19.2.12)          Fill B: Model (19.2.13)
Predictor          Est           Predictor          Est
(Intercept)      172.833         (Intercept)      227.000
Sa2              −40.500         Sb2              −91.833
pa                31.083
RTa2               7.958         RTb2             −43.125
RTa3             −33.125         RTb3             −65.208
Sa1:Mtildea2     −19.917         Sb1:Mtildeb2     −67.917
Sa2:Mtildea2          NA         Sb2:Mtildeb2          NA
The whole-plot errors were not included in the fitted models so the standard errors, t statistics, and
P values are all invalid and not reported.
The estimates for Fill A that we obtained from Table 19.13 can now be found as
   31.083 = pa
  −11.959 = −19.917 + 7.958 = Sa1:Mtildea2 + RTa2
    7.958 = RTa2
  −41.083 = −33.125 − 7.958 = RTa3 − RTa2
   20.583 = 40.500 − 19.917 = −Sa2 + Sa1:Mtildea2
   40.500 = −Sa2
and the corresponding estimates for Fill B as
  −43.125 = RTb2
  −22.083 = −65.208 + 43.125 = RTb3 − RTb2
   23.916 = 91.833 − 67.917 = −Sb2 + Sb1:Mtildeb2
   91.833 = −Sb2.

[Figure: normal plot of standardized residuals against theoretical quantiles.]
The downside of looking at the coefficients is that it is by no means clear which parameter
estimates, and which linear combinations of parameter estimates, one wants to look at. It is much
easier to look at the table of fitted values to isolate features of interest corresponding to
the fitted model.
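As a rough check on the linear combinations listed above, the following sketch recomputes them from the reported table of coefficients. The helper name `effects` and the dictionary labels are invented for this illustration; the pairing of each combination with its interpretation follows the prose descriptions of the fitted values, and the "1000 to 2000 with surface treatment" line for Fill B is computed by analogy with Fill A.

```python
# Coefficients as reported in the Table of Coefficients (R output in the text).
coef_a = {"S2": -40.500, "p": 31.083, "RT2": 7.958, "RT3": -33.125, "S1:Mtilde2": -19.917}
coef_b = {"S2": -91.833, "RT2": -43.125, "RT3": -65.208, "S1:Mtilde2": -67.917}

def effects(c):
    """Linear combinations of coefficients that reproduce the fitted-value summaries."""
    return {
        "1000 to 2000, with surface treatment": c["S1:Mtilde2"] + c["RT2"],
        "1000 to 2000, without surface treatment": c["RT2"],
        "2000 to 3000": c["RT3"] - c["RT2"],
        "surface treatment effect at 2000 or 3000": -c["S2"] + c["S1:Mtilde2"],
        "surface treatment effect at 1000": -c["S2"],
    }

fill_a = effects(coef_a)
fill_b = effects(coef_b)

for label in fill_a:
    print(f"{label}: Fill A {fill_a[label]:.3f}, Fill B {fill_b[label]:.3f}")
```

Even with the combinations written out, one still has to know which combinations to form, which is the point made above in favor of the table of fitted values.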
If you were to include fixed η effects when fitting models (19.2.12) and (19.2.13), you would get
the same estimates of any terms that involve the subplot treatments. In this example those would be
RTa2, RTb2, RTa3, RTb3, Sa1:Mtildea2, and Sb1:Mtildeb2. Moreover, the reported standard errors
for these parameter estimates would be appropriate. (Although even better standard errors could
be constructed by pooling the error estimates from models (19.2.12) and (19.2.13).) With fixed η
effects in the models, estimates of any whole-plot terms (any terms not previously listed) depend
entirely on the side conditions used to fit the model.
Finally, we examine residual plots for Model (19.2.1). The Error 1 plots are based on a model
for the whole plots that averages observations in subplots. Figures 19.6 and 19.7 contain residual
plots for the Error 1 residuals. The Error 2 plots are based on Model (19.2.2). Figures 19.8 and 19.9
contain residual plots for the Error 2 residuals. We see no serious problems in any of the plots. ✷
[Figure: plots of standardized residuals, including panels against Fills and Proportions.]

[Figure 19.8 panels: normal plot of standardized residuals against theoretical quantiles;
residual-fitted plot of standardized residuals against fitted values.]
Figure 19.8 Normal plot of subplot residuals, W′ = 0.98, and subplot residuals versus predicted
values, Box data.
[Figure: standardized residuals plotted against Surface Treatments, Fills, Proportions, and
Rotations.]

19.3 Multivariate analysis of variance
requires balance except that there be no missing observations among the multiple measures on
a subject. Entirely missing a subject causes few problems. The discussion in Christensen (2001,
Section 1.5) is quite general and particularly relevant in that it makes extensive comparisons to
split-plot analyses. Unfortunately, the mathematical level of Christensen (2001) is much higher than
the level of this book. Almost all Statistics books on multivariate analysis deal with MANOVA.
Johnson and Wichern (2007) or Johnson (1998) are reasonable places to look for more information
on the subject.
The discussion in this section makes some use of matrices. Matrices are reviewed in Ap-
pendix A.
EXAMPLE 19.3.1. Consider again the Box (1950) data on the abrasion resistance of a fabric. We
began in Section 16.2 by analyzing the weight losses obtained after 1000 revolutions of the testing
machine. In the split-plot analysis we combined these data for 1000 rotations with the data for 2000
and 3000 rotations. In the multivariate approach, we revert to the earlier analysis and fit separate
ANOVA models for the data from 1000 rotations, 2000 rotations, and 3000 rotations. Again, the
three factors are referred to as S, F, and P, respectively. The variables yhi jk,1 , yhi jk,2 , and yhi jk,3
denote the data from 1000, 2000, and 3000 rotations, respectively. We fit the models
$$\begin{aligned}
y_{hijk,1} &= \mu_{hijk,1} + \varepsilon_{hijk,1} \\
 &= \mu_1 + s_{h,1} + f_{i,1} + p_{j,1} \\
 &\quad + (sf)_{hi,1} + (sp)_{hj,1} + (fp)_{ij,1} + (sfp)_{hij,1} + \varepsilon_{hijk,1},
\end{aligned}$$
with models of the same form for $y_{hijk,2}$ and $y_{hijk,3}$, where
h = 1, 2, i = 1, 2, j = 1, 2, 3, k = 1, 2.

Table 19.14 Analysis of variance for y1 (1000 rotations)
Source    df        SS        MS        F       P
S          1   26268.2   26268.2    97.74   0.000
F          1    6800.7    6800.7    25.30   0.000
P          2    5967.6    2983.8    11.10   0.002
S∗F        1    3952.7    3952.7    14.71   0.002
S∗P        2    1186.1     593.0     2.21   0.153
F∗P        2    3529.1    1764.5     6.57   0.012
S∗F∗P      2     478.6     239.3     0.89   0.436
Error     12    3225.0     268.8
Total     23   51407.8

Table 19.15 Analysis of variance for y2 (2000 rotations)
Source    df        SS        MS        F       P
S          1    5017.0    5017.0    25.03   0.000
F          1   70959.4   70959.4   353.99   0.000
P          2    7969.0    3984.5    19.88   0.000
S∗F        1      57.0      57.0     0.28   0.603
S∗P        2      44.3      22.2     0.11   0.896
F∗P        2    6031.0    3015.5    15.04   0.001
S∗F∗P      2      14.3       7.2     0.04   0.965
Error     12    2405.5     200.5
Total     23   92497.6
As in standard ANOVA models, we assume that the individuals (on which the repeated measures
were taken) are independent. Thus, for fixed m = 1, 2, 3, the $\varepsilon_{hijk,m}$s are
independent $N(0, \sigma_{mm})$ random variables. We are now using a double subscript in
$\sigma_{mm}$ to denote a variance rather than writing $\sigma_m^2$. As usual, the errors on a
common dependent variable, say $\varepsilon_{hijk,m}$ and $\varepsilon_{h'i'j'k',m}$, are
independent when $(h, i, j, k) \neq (h', i', j', k')$, but we also assume that the errors on
different dependent variables, say $\varepsilon_{hijk,m}$ and $\varepsilon_{h'i'j'k',m'}$, are
independent when $(h, i, j, k) \neq (h', i', j', k')$. However, not all of the errors for all the
variables are assumed independent. Two observations (or errors) on the same subject are not assumed
to be independent. For fixed h, i, j, k the errors for any two variables are possibly correlated
with, say, $\mathrm{Cov}(\varepsilon_{hijk,m}, \varepsilon_{hijk,m'}) = \sigma_{mm'}$.
The models for each variable are of the same form but the parameters differ for the different
dependent variables $y_{hijk,m}$. All the parameters have an additional subscript to indicate which
dependent variable they belong to. The essence of the procedure is simply to fit each of the models
individually and then to combine results. Fitting individually gives three separate sets of
residuals, $\hat{\varepsilon}_{hijk,m} = y_{hijk,m} - \bar{y}_{hij\cdot,m}$ for m = 1, 2, 3, so
three separate sets of residual plots and three separate ANOVA tables. The three ANOVA tables are
given as Tables 19.14, 19.15, and 19.16. (Table 19.14 reproduces Table 16.10.) Each variable can be
analyzed in detail using the ordinary methods for multifactor models illustrated in Section 16.2.
Residual plots for y1 were previously given in Section 16.2 as Figures 16.3 and 16.4 with additional
plots given here. The top left residual plot for y1 in Figure 19.10 was given as Figure 16.3.
Residual plots for the analyses on y2 and y3 are given in Figures 19.11 through 19.14.
The key to multivariate analysis of variance is to combine results across the three variables y1,
y2, and y3. Recall that the mean squared errors are just the sums of the squared residuals divided by
the error degrees of freedom,
$$\mathrm{MSE}_{mm} \equiv s_{mm} = \frac{1}{dfE} \sum_{hijk} \hat{\varepsilon}_{hijk,m}^{\,2}.$$
This provides an estimate of $\sigma_{mm}$. We can also use the residuals to estimate covariances
between the three variables. The estimate of $\sigma_{mm'}$ is
$$\mathrm{MSE}_{mm'} \equiv s_{mm'} = \frac{1}{dfE} \sum_{hijk} \hat{\varepsilon}_{hijk,m}\, \hat{\varepsilon}_{hijk,m'}.$$

Table 19.16 Analysis of variance for y3 (3000 rotations)
Source    df        SS        MS        F       P
S          1    1457.0    1457.0     6.57   0.025
F          1   48330.4   48330.4   217.83   0.000
P          2    1396.6     698.3     3.15   0.080
S∗F        1       0.4       0.4     0.00   0.968
S∗P        2     250.6     125.3     0.56   0.583
F∗P        2    1740.3     870.1     3.92   0.049
S∗F∗P      2     272.2     136.1     0.61   0.558
Error     12    2662.5     221.9
Total     23   56110.0

[Figure: plots of standardized residuals, including panels against Fills and Proportions.]
[Figures: normal plot of standardized residuals against theoretical quantiles; plots of
standardized residuals, including panels against Fills and Proportions.]
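Note that $s_{mm'} = s_{m'm}$, e.g., $s_{12} = s_{21}$. The matrix S provides an estimate of the
covariance matrix
$$\Sigma \equiv \begin{bmatrix} \sigma_{11} & \sigma_{12} & \sigma_{13} \\ \sigma_{21} & \sigma_{22} & \sigma_{23} \\ \sigma_{31} & \sigma_{32} & \sigma_{33} \end{bmatrix}.$$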
The key difference between this analysis and the split-plot analysis is that this analysis makes no
assumptions about the variances and covariances in Σ. The split-plot analysis assumes that
$$\sigma_{11} = \sigma_{22} = \sigma_{33} = \sigma_w^2 + \sigma_s^2$$
and that, for $m \neq m'$,
$$\sigma_{mm'} = \sigma_w^2.$$

[Figures: normal plot of standardized residuals against theoretical quantiles; plots of
standardized residuals, including panels against Fills and Proportions.]
Similarly, we can construct a matrix that contains sums of squares error and sums of cross
products error. Write
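$$e_{mm'} \equiv \sum_{hijk} \hat{\varepsilon}_{hijk,m}\, \hat{\varepsilon}_{hijk,m'}$$
where $e_{mm} = SSE_{mm}$ and
$$E \equiv \begin{bmatrix} e_{11} & e_{12} & e_{13} \\ e_{21} & e_{22} & e_{23} \\ e_{31} & e_{32} & e_{33} \end{bmatrix}.$$
Obviously, E = (dfE)S. For Box’s fabric data,
$$E = \begin{bmatrix} 3225.00 & -80.50 & 1656.50 \\ -80.50 & 2405.50 & -112.00 \\ 1656.50 & -112.00 & 2662.50 \end{bmatrix}.$$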
The diagonal elements of this matrix are the error sums of squares from Tables 19.14, 19.15, and
19.16.
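Because E = (dfE)S, the estimated covariance matrix can be recovered from the E matrix displayed above. The sketch below does that division with dfE = 12 (the Error degrees of freedom in Tables 19.14 through 19.16) and also converts S to correlations; the correlation step is an added illustration, not part of the text's analysis.

```python
# Error sums of squares and cross products for Box's fabric data (from the text).
E = [
    [3225.00, -80.50, 1656.50],
    [-80.50, 2405.50, -112.00],
    [1656.50, -112.00, 2662.50],
]
DF_E = 12  # error degrees of freedom in each of the three ANOVA tables

# S = E / dfE estimates the covariance matrix Sigma.
S = [[e / DF_E for e in row] for row in E]

# Illustrative extra: estimated correlations between the three rotation levels.
corr = [[S[m][n] / (S[m][m] * S[n][n]) ** 0.5 for n in range(3)] for m in range(3)]

for row in S:
    print([round(v, 2) for v in row])
for row in corr:
    print([round(v, 3) for v in row])
```

The diagonal of S reproduces, up to rounding, the three mean squared errors 268.8, 200.5, and 221.9.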
We can use similar methods for every line in the three analysis of variance tables. For example,
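each variable m = 1, 2, 3 has a sum of squares for S ∗ P, say,
$$SS(S*P)_{mm} \equiv h(S*P)_{mm} = 4 \sum_{h=1}^{2} \sum_{j=1}^{3} \left( \bar{y}_{h\cdot j\cdot,m} - \bar{y}_{h\cdot\cdot\cdot,m} - \bar{y}_{\cdot\cdot j\cdot,m} + \bar{y}_{\cdot\cdot\cdot\cdot,m} \right)^2.$$
We can also include cross products using $SS(S*P)_{mm'} \equiv h(S*P)_{mm'}$, where
$$h(S*P)_{mm'} = 4 \sum_{h=1}^{2} \sum_{j=1}^{3} \left( \bar{y}_{h\cdot j\cdot,m} - \bar{y}_{h\cdot\cdot\cdot,m} - \bar{y}_{\cdot\cdot j\cdot,m} + \bar{y}_{\cdot\cdot\cdot\cdot,m} \right) \left( \bar{y}_{h\cdot j\cdot,m'} - \bar{y}_{h\cdot\cdot\cdot,m'} - \bar{y}_{\cdot\cdot j\cdot,m'} + \bar{y}_{\cdot\cdot\cdot\cdot,m'} \right).$$
(The nice algebraic formulae only exist because the entire model is balanced.) For the fabric data
$$H(S*P) = \begin{bmatrix} 1186.0833 & -33.166667 & 526.79167 \\ -33.166667 & 44.333333 & -41.583333 \\ 526.79167 & -41.583333 & 250.58333 \end{bmatrix}.$$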
Note that the diagonal elements of H(S ∗ P) are the S ∗ P interaction sums of squares from Ta-
bles 19.14, 19.15, and 19.16. Table 19.17 contains the H matrices for all of the sources in the
analysis of variance.
In the standard (univariate) analysis of y1 that was performed in Section 16.2, the test for S ∗ P
interactions was based on
$$F = \frac{MS(S*P)_{11}}{MSE_{11}} = \frac{SS(S*P)_{11}\,[1/df(S*P)]}{SSE_{11}\,[1/dfE]} = \frac{h(S*P)_{11}}{e_{11}} \cdot \frac{dfE}{df(S*P)}.$$
The last two equalities are given to emphasize that the test depends on the $y_{hijk,1}$s only
through $h(S*P)_{11}[e_{11}]^{-1}$. Similarly, a multivariate test of S ∗ P is a function of the
matrices
$$H(S*P)E^{-1},$$
where $E^{-1}$ is the matrix inverse of E. A major difference between the univariate and multivariate
procedures is that there is no uniform agreement on how to use $H(S*P)E^{-1}$ to construct a test.
The generalized likelihood ratio test, also known as Wilks’ lambda, is
$$\Lambda(S*P) \equiv \frac{1}{|I + H(S*P)E^{-1}|}$$
where I indicates a 3 × 3 identity matrix and |A| denotes the determinant of a matrix A. Roy’s
maximum root statistic is the maximum eigenvalue of $H(S*P)E^{-1}$, say, $\phi_{\max}(S*P)$. On
occasion, Roy’s statistic is taken as
$$\theta_{\max}(S*P) \equiv \frac{\phi_{\max}(S*P)}{1 + \phi_{\max}(S*P)}.$$
A third statistic is the Lawley–Hotelling trace,
$$T^2(S*P) \equiv dfE \, \mathrm{tr}\left[ H(S*P)E^{-1} \right],$$
and a fourth is Pillai’s trace,
$$V(S*P) \equiv \mathrm{tr}\left[ H(S*P)\{E + H(S*P)\}^{-1} \right].$$
Similar test statistics Λ, φ , θ , T 2 and V can be constructed for all of the other main effects and
interactions. It can be shown that for H terms with only one degree of freedom, these test statistics
are equivalent to each other and to an F statistic. In such cases, we only present T 2 and the F value.
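To make the definitions concrete, here is a minimal sketch that evaluates Wilks' lambda, the Lawley–Hotelling trace, and Pillai's trace for S ∗ P from the E and H(S ∗ P) matrices printed earlier. It uses only hand-written 3 × 3 helpers (`det3`, `inv3`, `matmul`, and so on are names made up for this example), takes dfE = 12, and uses the standard definition of Pillai's trace, tr[H(E + H)^{-1}]. Roy's statistic needs an eigenvalue routine and is left out. It does not reproduce the F approximations or P values of Table 19.18.

```python
# Error and hypothesis matrices for S*P as printed in the text (Box fabric data).
E = [[3225.00, -80.50, 1656.50],
     [-80.50, 2405.50, -112.00],
     [1656.50, -112.00, 2662.50]]
H = [[1186.0833, -33.166667, 526.79167],
     [-33.166667, 44.333333, -41.583333],
     [526.79167, -41.583333, 250.58333]]
DF_E = 12

def det3(a):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    return (a[0][0] * (a[1][1] * a[2][2] - a[1][2] * a[2][1])
            - a[0][1] * (a[1][0] * a[2][2] - a[1][2] * a[2][0])
            + a[0][2] * (a[1][0] * a[2][1] - a[1][1] * a[2][0]))

def inv3(a):
    """Inverse of a 3x3 matrix: transpose of the cofactor matrix over the determinant."""
    d = det3(a)
    cof = [[0.0] * 3 for _ in range(3)]
    for i in range(3):
        for j in range(3):
            r = [x for x in range(3) if x != i]
            c = [x for x in range(3) if x != j]
            minor = a[r[0]][c[0]] * a[r[1]][c[1]] - a[r[0]][c[1]] * a[r[1]][c[0]]
            cof[i][j] = (-1) ** (i + j) * minor
    return [[cof[j][i] / d for j in range(3)] for i in range(3)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]

def add(a, b):
    return [[a[i][j] + b[i][j] for j in range(3)] for i in range(3)]

def trace(a):
    return sum(a[i][i] for i in range(3))

IDENTITY = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]

HEinv = matmul(H, inv3(E))
wilks = 1.0 / det3(add(IDENTITY, HEinv))       # Lambda = 1 / |I + H E^{-1}|
lawley_hotelling = DF_E * trace(HEinv)         # T^2 = dfE * tr(H E^{-1})
pillai = trace(matmul(H, inv3(add(E, H))))     # V = tr(H (E + H)^{-1})

print("Wilks' lambda:", wilks)
print("Lawley-Hotelling T^2:", lawley_hotelling)
print("Pillai's trace:", pillai)
```

A useful sanity check is the identity 1/|I + HE^{-1}| = |E|/|E + H|, which holds for any invertible E.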
Table 19.18 presents the test statistics for each term. When the F statistic is exactly correct, it is
given in the table. In other cases, the table presents F statistic approximations. The approximations
are commonly used and discussed; see, for example, Rao (1973, chapter 8) or Christensen (2001,
Section 1.2). Degrees of freedom for the F approximations and P values are also given.
Each effect in Table 19.18 corresponds to a combination of a whole-plot effect and a whole-plot-
by-subplot interaction from the split-plot analysis Table 19.7. For example, the multivariate effect
S corresponds to combining the effects S and S ∗ R from the univariate analysis. The highest-order
terms in the table that are significant are the F ∗ P and the S ∗ F terms. Relative to the split-plot
analysis, these suggest the presence of F ∗ P interaction or F ∗ P ∗ R interaction and S ∗ F interaction
or S ∗ F ∗ R interaction. In Section 19.2, we found the merest suggestion of an F ∗ P ∗ R interaction
but clear evidence of an F ∗ P interaction; we also found clear evidence of an S ∗ F ∗ R interaction.
However, the split-plot results were obtained under different, and perhaps less appropriate, assump-
tions.
✷
To complete a multivariate analysis, additional modeling is needed (or MANOVA contrasts for
balanced data). The MANOVA assumptions also suggest some alternative residual analysis. We
will not discuss either of these subjects. Moreover, our analysis has exploited the balance in S, F,
and P so that we have not needed to examine various sequences of models that would, in general,
determine different H matrices for the effects. (Balance in R is required for the MANOVA.)
Finally, a personal warning. One should not underestimate how much one can learn from simply
doing the analyses for the individual variables. Personally, I would look thoroughly at each indi-
vidual variable (number of rotations in our example) before worrying about what a multivariate
analysis can add.
19.4.1 Subsampling
It is my impression that many of the disasters that occur in planning and analyzing studies occur
because people misunderstand subsampling. The following is both a true story and part of the folk-
lore of the Statistics program at the University of Minnesota. A graduate student wanted to study
the effects of two drugs on mice. The student collected 200 observations in the following way. Two
mice were randomly assigned to each drug. From each mouse, tissue samples were collected at 50
sites. The subjects were the mice because the drugs were applied to the mice, not to the tissue sites.
There are two sources of variation: mouse-to-mouse variation and within-mouse variation. The 50
observations (subsamples) on each mouse are very useful in reducing the within-mouse variation but
do nothing to reduce mouse-to-mouse variation. Relative to the mouse-to-mouse variation, which
is likely to be larger than the within-mouse variation, there are only two observations that have the
same treatment. As a result, each of the two treatment groups provides only one degree of freedom
for estimating the variance that applies to treatment comparisons. In other words, the experiment
provides two degrees of freedom for (the appropriate) error. Obviously a lot of work went into col-
lecting the 200 observations. The work was wasted! Moreover, the problem in the design of this
experiment could easily have been compounded by an analysis that ignored the subsampling prob-
lem. If subsampling is ignored in the analysis of such data, the MSE is inappropriately small and
effects look more significant than they really are. (Fortunately, none of the many Statistics students
that were approached to analyze these data were willing to do it incorrectly.)
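The degrees-of-freedom bookkeeping in the mouse story can be written out explicitly. The sketch below uses the counts given in the story (2 drugs, 2 mice per drug, 50 tissue sites per mouse) and contrasts the appropriate error degrees of freedom with what an analysis that ignored subsampling would report; the variable names are illustrative only.

```python
# Counts taken from the story: 2 drugs, 2 mice per drug, 50 tissue sites per mouse.
drugs = 2
mice_per_drug = 2
sites_per_mouse = 50

n_obs = drugs * mice_per_drug * sites_per_mouse

# Mice are the experimental units, so each drug contributes (mice - 1) df for error.
df_error_appropriate = drugs * (mice_per_drug - 1)

# Treating all tissue sites as independent replicates would (wrongly) give n - drugs df.
df_error_ignoring_subsampling = n_obs - drugs

print("observations:", n_obs)
print("appropriate error df:", df_error_appropriate)
print("error df if subsampling is ignored:", df_error_ignoring_subsampling)
```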
Another example comes from Montana State University. A Range science graduate student
wanted to compare two types of mountain meadows. He had located two such meadows and was
planning to take extensive measurements on each. It had not occurred to him that this procedure
would look only at within-meadow variation and that there was variation between meadows that he
was ignoring.
Consider the subsampling model
$$y_{ijk} = \mu_i + \eta_{ij} + \varepsilon_{ijk}. \tag{19.4.1}$$
Averaging over the subsamples k gives
$$\bar{y}_{ij\cdot} = \mu_i + e_{ij} \tag{19.4.2}$$
where we define
$$e_{ij} \equiv \eta_{ij} + \bar{\varepsilon}_{ij\cdot}$$
and have $i = 1, \ldots, a$, $j = 1, \ldots, n_i$. Using Proposition 1.2.11, it is not difficult to
see that the $e_{ij}$s are independent $N(0, \sigma_w^2 + \sigma_s^2/N)$, so that Model (19.4.2) is
just an unbalanced one-way ANOVA model and can be analyzed as such. If desired, the methods of the
next subsection can be used to estimate the between-unit (whole-plot) variance $\sigma_w^2$ and the
within-unit (subplot) variance $\sigma_s^2$. Note
that our analysis in Example 19.1.1 was actually on a model similar to (19.4.2). The data analyzed
were averages of two repeat measurements of dynamic absorption.
Model (19.4.2) also helps to formalize the benefits of subsampling. We have N subsamples
that lead to $\mathrm{Var}(e_{ij}) = \sigma_w^2 + \sigma_s^2/N$. If we did not take subsamples, the
variance would be $\sigma_w^2 + \sigma_s^2$, so we have reduced one of the terms in the variance by
subsampling. If the within-unit variance $\sigma_s^2$ is large relative to the between-unit variance
$\sigma_w^2$, subsampling can be very beneficial. If the between-unit variance $\sigma_w^2$ is
substantial when compared to the within-unit variance $\sigma_s^2$, subsampling has very limited
benefits. In this latter case, it is important to obtain a substantial number of true
replications involving the between-unit variability with subsampling based on convenience (rather
than importance).
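The trade-off can be seen numerically from Var(e_ij) = σ_w² + σ_s²/N. The variance values in this sketch are hypothetical, chosen only to contrast a case where within-unit variability dominates with one where between-unit variability dominates; `var_unit_mean` is a name invented for the example.

```python
def var_unit_mean(sigma2_w, sigma2_s, n_sub):
    """Variance of a unit average: between-unit variance plus within-unit variance / N."""
    return sigma2_w + sigma2_s / n_sub

# Hypothetical variance components (not estimates from any data in the text).
scenarios = {
    "within-unit variance dominates": (1.0, 25.0),
    "between-unit variance dominates": (25.0, 1.0),
}

results = {}
for label, (s2w, s2s) in scenarios.items():
    no_sub = var_unit_mean(s2w, s2s, 1)      # no subsampling
    with_sub = var_unit_mean(s2w, s2s, 50)   # 50 subsamples per unit
    results[label] = (no_sub, with_sub)
    print(f"{label}: N=1 gives {no_sub:.2f}, N=50 gives {with_sub:.2f}")
```

No number of subsamples can push the variance of a unit average below σ_w²; only more units can reduce the variance of a treatment mean further.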
Model (19.4.1) was chosen to have unequal numbers of units on each treatment but a balanced
number of subsamples. This was done to suggest the generality of the procedure. Subsamples can
be incorporated into any linear model and, as long as the number of subsamples is constant for each
unit, a simple analysis can be obtained by averaging the subsamples for each unit and using the
averages as data. Christensen (2011, Section 11.4) provides a closely related discussion that is not
too mathematical.
EXAMPLE 19.4.1. Ott (1949) presented data on an electrical characteristic associated with ce-
ramic components for a phonograph (one of those ancient machines that played vinyl records). Ott
and Schilling (1990) and Ryan (1989) have also considered these data. Ceramic pieces were cut
from strips, each of which could provide 25 pieces. It was decided to take 7 pieces from each strip,
manufacture the 7 ceramic phonograph components, and measure the electrical characteristic on
each. The data from 4 strips are given below. (These are actually the third through sixth of the strips
reported by Ott.)
Strip Observations
1 17.3 15.8 16.8 17.2 16.2 16.9 14.9
2 16.9 15.8 16.9 16.8 16.6 16.0 16.6
3 15.5 16.6 15.9 16.5 16.1 16.2 15.7
4 13.5 14.5 16.0 15.9 13.7 15.2 15.9
The standard analysis looks for differences between the means of these four specific ceramic
strips. An alternative approach to these data is to think of the four ceramic strips as being a random
sample from the population of ceramic strips that are involved in making the assemblies. If we do
that, we have two sources of variability, variability among the observations on a given strip and
variability between different ceramic strips. Our goal in this subsection is to estimate the variances
and test whether there is any variability between strips. ✷
$$y_{ij} = \mu + \alpha_i + \varepsilon_{ij}$$
$$\begin{aligned}
\mathrm{Var}(y_{ij}) &= \mathrm{Var}(\mu + \alpha_i + \varepsilon_{ij}) \\
 &= \mathrm{Var}(\alpha_i) + \mathrm{Var}(\varepsilon_{ij}) \\
 &= \sigma_A^2 + \sigma^2.
\end{aligned}$$
Thus $\mathrm{Var}(y_{ij})$ is the sum of two variance components $\sigma_A^2$ and $\sigma^2$.
Moreover,
$$\frac{N\sigma_A^2 + \sigma^2}{\sigma^2} = 1 + \frac{N\sigma_A^2}{\sigma^2}.$$
If $H_0: \sigma_A^2 = 0$ holds, the F statistic should be about 1. In general, if $H_0$ holds,
$$\frac{MSGrps}{MSE} \sim F(a - 1, dfE),$$
and the usual F test can be interpreted as a test of $H_0: \sigma_A^2 = 0$. Interestingly, however,
for this test to be good, we need N large, not a. Typically, it is easier to get N large than it is
to get a large, so typically it is easier to tell whether $\sigma_A^2 = 0$ than it is to tell what
$\sigma_A^2$ actually is.
EXAMPLE 19.4.1 CONTINUED. For the electrical characteristic data, the analysis of variance
table is given below.

Analysis of Variance: Electrical characteristic data
Source        df       SS      MS      F      P
Treatments     3   10.873   3.624   6.45   0.002
Error         24   13.477   0.562
Total         27   24.350
The F statistic shows strong evidence that variability exists between ceramic strips. The estimate of
within-strip variability is MSE = 0.562. With 7 observations on each ceramic strip, the estimate of
between-strip variability is
$$\hat{\sigma}_A^2 = \frac{MSGrps - MSE}{N} = \frac{3.624 - 0.562}{7} = 0.437,$$
but it is not a very good estimate, being worse than an estimate based on 3 degrees of freedom.
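The ANOVA table and the variance component estimate can be recomputed directly from the four strips listed in Example 19.4.1. The following is a plain-Python sketch of the balanced one-way calculations; the variable names are chosen for this illustration and no P value is computed.

```python
# Electrical characteristic data from Example 19.4.1 (4 strips, 7 pieces each).
strips = [
    [17.3, 15.8, 16.8, 17.2, 16.2, 16.9, 14.9],
    [16.9, 15.8, 16.9, 16.8, 16.6, 16.0, 16.6],
    [15.5, 16.6, 15.9, 16.5, 16.1, 16.2, 15.7],
    [13.5, 14.5, 16.0, 15.9, 13.7, 15.2, 15.9],
]

a = len(strips)        # number of strips (groups)
N = len(strips[0])     # observations per strip (balanced)

means = [sum(s) / N for s in strips]
grand_mean = sum(sum(s) for s in strips) / (a * N)

# Between-strip and within-strip sums of squares.
ss_grps = N * sum((m - grand_mean) ** 2 for m in means)
ss_error = sum((y - m) ** 2 for s, m in zip(strips, means) for y in s)

ms_grps = ss_grps / (a - 1)
mse = ss_error / (a * (N - 1))
f_stat = ms_grps / mse

# Method-of-moments estimate of the between-strip variance component.
sigma2_A_hat = (ms_grps - mse) / N

print("strip means:", [round(m, 3) for m in means])
print("MSGrps:", round(ms_grps, 3), "MSE:", round(mse, 3), "F:", round(f_stat, 2))
print("estimated between-strip variance:", round(sigma2_A_hat, 3))
```

Printing the strip means alongside the variance component makes the later point visible: one strip mean sits well below the other three.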
While in many ways this random effects analysis seems more appropriate for the relatively
undifferentiated strips being considered, this analysis also seems less informative for these data
than the fixed effects analysis. It is easy to see that most of the between-strip “variation” is due to a
single strip, number 4, being substantially different from the others. Are we to consider this strip an
outlier in the population of ceramic strips? Having three sample means that are quite close and one
that is substantially different certainly calls into question the assumption that the random treatment
effects are normally distributed. Most importantly, some kind of analysis that looks at individual
sample means is necessary to have any chance of identifying an odd strip. ✷
While the argument given here works only for balanced data, the corresponding model fitting
ideas give similar results for unbalanced one-way ANOVA data. In particular,
$$\hat{\sigma}_A^2 = \frac{SSGrps - MSE(a - 1)}{n - \sum_{i=1}^{a} N_i^2 / n},$$
and the usual F test gives an appropriate test of H0 : σA2 = 0. Christensen (2011, Section 12.9, Sub-
section 12.10.11) provides theoretical justification for these claims but does not treat this particular
example.
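As a small consistency check, the unbalanced-data estimator can be coded and compared with the balanced formula (MSGrps − MSE)/N when all group sizes are equal. The function name `sigma2_A_unbalanced` is invented for this sketch, and the inputs are the rounded values from the electrical characteristic ANOVA table.

```python
def sigma2_A_unbalanced(ss_grps, mse, group_sizes):
    """Variance component estimate: [SSGrps - MSE(a - 1)] / [n - sum(N_i^2) / n]."""
    a = len(group_sizes)
    n = sum(group_sizes)
    denom = n - sum(size ** 2 for size in group_sizes) / n
    return (ss_grps - mse * (a - 1)) / denom

# Rounded values from the electrical characteristic ANOVA table.
SS_GRPS, MSE, A, N = 10.873, 0.562, 4, 7

general = sigma2_A_unbalanced(SS_GRPS, MSE, [N] * A)
balanced = (SS_GRPS / (A - 1) - MSE) / N   # (MSGrps - MSE) / N

print("general formula:", round(general, 4))
print("balanced formula:", round(balanced, 4))
```

With equal group sizes the denominator reduces to N(a − 1), which is why the two expressions agree.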
The ideas behind the analysis of the balanced one-way ANOVA model generalize nicely to other
balanced models. Consider the balanced two-way with replication,
$$y_{ijk} = \mu + \alpha_i + \beta_j + \gamma_{ij} + \varepsilon_{ijk},$$
When the $\beta_j$s are all equal, $MS(\beta)$ estimates $\sigma^2 + N\sigma_\gamma^2$. It follows
that to obtain an F test for equality of the $\beta_j$s, the test must reject when $MS(\beta)$ is
much larger than $MS(\gamma)$. In particular, an α-level test rejects if
$$\frac{MS(\beta)}{MS(\gamma)} > F(1 - \alpha, a - 1, [a - 1][b - 1]).$$
This is just the usual result except that the MSE has been replaced by the MS(γ ). The analysis of
effects, i.e., further modeling or contrasts, involving the β j s also follows the standard pattern but
with MS(γ ) used in place of MSE. Similar results hold for investigating the αi s. Basically, you can
think of the εi jk s as subsampling errors and do the analysis on the ȳi j· s.
The moral of this analysis is that one needs to think very carefully about whether to model inter-
actions as fixed effects or random effects. It would seem that if you do not care about interactions,
if they are just an annoyance in evaluating the main effects, you probably should treat them as ran-
dom and use the interaction mean square as the appropriate estimate of variability. A related way of
thinking is to stipulate that you do not care about any main effects unless they are large enough to
show up above any interaction. In particular, that is essentially what is done in a randomized com-
plete block design. An RCB takes the block-by-treatment interaction as the error and only treatment
effects that are strong enough to show up over and above any block-by-treatment interaction are
deemed significant. On the other hand, if interactions are something of direct interest, they should
typically be treated as fixed effects.
19.5 Exercises
EXERCISE 19.5.1. In Exercises 17.11.3, 17.11.4, and 18.7.1, we considered data from Cornell
(1988) on scaled vinyl thicknesses. Exercise 17.11.3 involved five blends of vinyl and we discussed
the fact that the production process was set up eight times with a group of five blends run on each
setting. The eight production settings were those in Exercise 17.11.4. The complete data are dis-
played in Table 19.19.
(a) Identify the design for this experiment and give an appropriate model. List all the assumptions
made in the model.
(b) Analyze the data. Give an appropriate analysis of variance table. Examine appropriate contrasts
using the LSD method with an α of .05.
(c) Check the assumptions of the model and adjust the analysis appropriately.
(d) Discuss the relationship between the current analysis and those conducted earlier.
EXERCISE 19.5.2. Wilm (1945) presented data involving the effect of logging on soil moisture
deficits under a forest. Treatments consist of five intensities of logging. Treatments were identified
as the volume of the trees left standing after logging that were larger than 9.6 inches in diameter. The
logging treatments were uncut, 6000 board-feet, 4000 board-feet, 2000 board-feet, 0 board-feet. The
experiment was conducted by selecting four blocks (A, B, C, D) of forest. These were subdivided into
five plots. Within each block each of the treatments was randomly assigned to a plot. Soil moisture
deficits were measured in each of three consecutive years, 1941, 1942, and 1943. The data are
presented in Table 19.20.
Table 19.20 Soil moisture deficits
                           Block
Treatment   Year     A      B      C      D
Uncut        41    2.40   0.98   1.38   1.37
             42    3.32   1.91   2.36   1.62
             43    2.59   1.44   1.66   1.75
6000         41    1.76   1.65   1.69   1.11
             42    2.78   2.07   2.98   2.50
             43    2.27   2.28   2.16   2.06
4000         41    1.43   1.30   0.18   1.66
             42    2.51   1.48   1.83   2.36
             43    1.54   1.46   0.16   1.84
2000         41    1.24   0.70   0.69   0.82
             42    3.29   2.00   1.38   1.98
             43    2.67   1.44   1.75   1.56
None         41    0.79   0.21   0.01   0.16
             42    1.70   1.44   2.65   2.15
             43    1.62   1.26   1.36   1.87
Treatments are volumes of timber left standing in trees with diameters greater than 9.6 inches.
Volumes are measured in board-feet.
(a) Identify the design for this experiment and give an appropriate model. List all the assumptions
made in the model.
(b) Analyze the data. Give an appropriate analysis of variance table. Examine appropriate contrasts.
In particular, compare the uncut plots to the average of the other plots and use polynomials to ex-
amine differences among the other four treatments. Discuss the reasonableness of this procedure
in which the ‘uncut’ treatment is excluded when fitting the polynomials.
(c) Check the assumptions of the model and adjust the analysis appropriately. What assumptions are
difficult to check? Identify any such assumptions that are particularly suspicious.
EXERCISE 19.5.3. Day and del Priore (1953) report data from an experiment on the noise gen-
erated by various reduction gear designs. The data were collected because of the Navy’s interest in
building quiet submarines. Primary interest focused on the direction of lubricant application. Lubri-
cants were applied either inmesh (I) or tangent (T) and either at the top (T) or the bottom (B). Thus
the direction TB indicates tangent, bottom while IT is inmesh, top.
Four additional factors were considered. Load was 25%, 100%, or 125%. The temperature of
the input lubricant was 90, 120, or 160 degrees F. The volume of lubricant flow was 0.5 gpm, 1 gpm,
or 2 gpm. The speed was either 300 rpm or 1200 rpm. Temperature and volume were of less interest
than direction; speed and load were of even less interest. It was considered that load, temperature,
and volume would not interact but that speed might interact with the other factors. There was little
idea whether direction would interact with other factors. As a result, a split-plot design with whole
plots in a 3 × 3 Latin square was used. The factors used in defining the whole-plot Latin square were
load, temperature, and volume. The subplot factors were speed and the direction factors.
The data are presented in Table 19.21. The four observations with 100% load, 90-degree tem-
perature, 0.5-gpm volume, and lubricant applied tangentially were not made. Substitutes for these
values were used. As an approximate analysis, treat the substitute values as real values but subtract
four degrees of freedom from the subplot error. Analyze the data.
EXERCISE 19.5.4. In Exercise 16.4.1 and Table 16.19 we presented Baten’s (1956) data on
lengths of steel bars. The bars were made with one of two heat treatments (W, L) and cut on one
of four screw machines (A, B, C, D) at one of three times of day (8 am, 11 am, 3 pm). There are
distressing aspects to Baten’s article. First, he never mentions what the heat treatments are. Second,
he does not discuss how the four screw machines differ or whether the same person operates the
same machine all the time. If the machines were largely the same and one specific person always
operates the same machine all the time, then machine differences would be due to operators rather
than machines. If the machines were different and one person operates the same machine all the
time, it becomes impossible to tell whether machine differences are due to machines or operators.
Most importantly, Baten does not discuss how the replications were obtained. In particular, consider
the role of day-to-day variation in the analysis.
If the 12 observations on a heat treatment–machine combination are all taken on the same day,
there is no replication in the experiment that accounts for day-to-day variation. In that case the av-
erage of the four numbers for each heat treatment–machine–time combination gives essentially one
observation and for each heat treatment–machine combination the three time means are correlated.
To obtain an analysis, the heat–machine interaction and the heat–machine–time interaction would
have to be used as the two error terms.
Suppose the 12 observations on a heat treatment–machine combination are taken on four differ-
ent days with one observation obtained on each day for each time period. Then the three observations
on a given day are correlated but the observations on different days are independent. This leads to a
traditional split-plot analysis.
Finally, suppose that the 12 observations on a heat treatment–machine combination are all taken
on 12 different days. Yet another analysis is appropriate.
Compare the results of these three different methods of analyzing the experiment. If the day-to-
day variability is no larger than the within-day variability, there should be little difference. When
considering the analysis that assumes 12 observations taken on four different days, treat the order
of the four heat treatment–machine–time observations as indicating the day. For example, with heat
treatment W and machine A, take 9, 3, and 4 as the three time observations on the second day.
EXERCISE 19.5.5. Reanalyze Mandel’s (1972) data from Example 12.4.1 and Table 12.4 assuming
that the five laboratories are a random sample from a population of laboratories. Include
estimates of both variance components.
EXERCISE 19.5.6. Reanalyze the data of Example 17.4.1 assuming that the Disk-by-Window
interaction is a random effect. Include estimates of both variance components.
EXERCISE 19.5.7. People who really want to test their skill may wish to examine the data presented
in Snedecor and Haber (1946) and repeated in Table 19.22. The experiment was to examine
the effects of three cutting dates on asparagus. Six blocks were used. One plot was assigned a cut-
ting date of June 1 (a), one a cutting date of June 15 (b), and the last a cutting date of July 1 (c).
Data were collected on these plots for 10 years.
Try to come up with an intelligible summary of the data that would be of use to someone growing
asparagus. In particular, the experiment was planned to run for the effective lifetime of the planting,
normally 20 years or longer. The experiment was cut short due to lack of labor but interest remained
in predicting behavior ten years after the termination of data collection. As most effects seem to
be significant, I would be inclined to focus on effects that seem relatively large rather than on
statistically significant effects.
EXERCISE 19.5.8. Reconsider the data of Exercises 15.5.2, 15.5.3, and Table 15.10 from Smith,
Gnanadesikan, and Hughes (1962). Perform a multivariate ACOVA on the data. Are the data re-
peated measures data? Is it reasonable to apply a split-plot model to the data? If so, do so.
Chapter 20
Logistic Regression: Predicting Counts
For the most part, this book concerns itself with measurement data and the corresponding analyses
based on normal distributions. In this chapter and the next we consider data that consist of counts.
Elementary count data were introduced in Chapter 5.
Frequently, data are collected on whether or not a certain event occurs. A mouse dies when
exposed to a dose of chloracetic acid or it does not. In the past, O-rings failed during a space shuttle
launch or they did not. Men have coronary incidents or they do not. These are modeled as random
events and we collect data on how often the event occurs. We also collect data on potential predictor
(explanatory) variables. For example, we use the size of dose to estimate the probability that a
mouse will die when exposed. We use the atmospheric temperature at launch time to estimate the
probability that O-rings fail. We may use weight, cholesterol, and blood pressure to estimate the
probability that men have coronary incidents. Once we have estimated the probability that these
events will occur, we are ready to make predictions. In this chapter we investigate the use of logistic
models to estimate probabilities. Logistic models (also known as logit models) are linear models for
the log-odds that an event will occur. For a more complete discussion of logistic and logit models
see Christensen (1997).
Section 20.1 introduces models for predicting count data. Section 20.2 presents a simple model
with one predictor variable where the data are the proportions of trials that display the event. It also
discusses the output one typically obtains from running a logistic regression program. Section 20.3
discusses how to perform model tests with count data. Section 20.4 discusses how logistic models
are fitted. Section 20.5 introduces the important special case in which each observation is a separate
trial that either displays the event or does not. Section 20.6 explores the use of multiple continuous
predictors. Section 20.7 examines ANOVA type models with Section 20.8 examining ACOVA type
models.
EXAMPLE 20.1.1. Woodward et al. (1941) reported data on 120 mice divided into 12 groups of
10. The mice in each group were exposed to a specific dose of chloracetic acid and the observations
consist of the number in each group that lived and died. Doses were measured in grams of acid per
kilogram of body weight. The data are given in Table 20.1, along with the proportions yh of mice
who died at each dose xh .
We could analyze these data using the methods discussed earlier in Chapter 5. We have samples
from twelve populations. We could test to see if the populations are the same. We don’t think they
are because we think survival depends on dose. More importantly, we want to try to model the rela-
tionship between dose level and the probability of dying, because that allows us to make predictions
about the probability of dying for any dose level that is similar to the doses in the original data. ✷
In Section 3.1 we talked about models for measurement data yh , h = 1, . . . , n with E(yh ) ≡ μh
482 20. LOGISTIC REGRESSION: PREDICTING COUNTS
yh s independent N(μh , σ 2 ),
with some model for the μh s. In Section 3.9 we got more specific about models, writing
yh s independent N[m(xh ), σ 2 ],
where xh is the value of some predictor variable or vector and m(·) is the model for the means, i.e.,
μh ≡ m(xh ).
We then discussed a variety of models m(·) that could be used for various types of predictor variables
and exploited those models in subsequent chapters.
In this chapter, we discuss similar models for data that are binomial proportions. In Section 1.4
we discussed binomial sampling. In particular, if we have N independent trials of whether some
event occurs (e.g., flipping a coin and seeing heads) and if each trial has the same probability p that
the event occurs, then the number of occurrences is a binomial random variable W , say
W ∼ Bin(N, p),
with
E(W ) = N p and Var(W ) = N p(1 − p).
We will be interested in binomial proportions
\[ y \equiv \frac{W}{N}, \]
with
\[ E(y) = p \]
and
\[ \mathrm{Var}(y) = \frac{p(1 - p)}{N}, \]
see Proposition 1.2.11. In applications, N is known and p is an unknown parameter to be modeled
and estimated.
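The mean and variance of a binomial proportion can be checked numerically. The following is a minimal sketch, assuming illustrative values N = 10 and p = 0.3 (these are not taken from any table in the book); the function name is hypothetical. It sums over the binomial probabilities from Section 1.4 rather than simulating.

```python
from math import comb

def proportion_moments(N, p):
    """Exact mean and variance of y = W/N when W ~ Bin(N, p)."""
    # Binomial probabilities Pr(W = w) for w = 0, ..., N.
    pmf = [comb(N, w) * p**w * (1 - p)**(N - w) for w in range(N + 1)]
    mean = sum((w / N) * pr for w, pr in enumerate(pmf))
    var = sum((w / N - mean) ** 2 * pr for w, pr in enumerate(pmf))
    return mean, var

# Illustrative (assumed) values; N = 10 happens to match the mouse group size.
N, p = 10, 0.3
mean, var = proportion_moments(N, p)
print(mean, p)                 # the two numbers agree up to rounding error
print(var, p * (1 - p) / N)    # the two numbers agree up to rounding error
```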
In general, we assume n independent binomial proportions yh for which we know the number of
trials Nh , i.e.,
Nh yh independent Bin(Nh , ph ), h = 1, . . . , n.
20.1 MODELS FOR BINOMIAL DATA 483
With E(yh ) = ph , much like we did for measurement data, we want to create a model for the ph s that
depends on a predictor xh . In fact, we would like to use the same models, simple linear regression,
multiple regression, one-way ANOVA and multifactor ANOVA, that we used for measurement data.
But before we can do that, we need to deal with a problem.
We want to create models for ph = E(yh ), but with binomial proportions this mean value is
always a probability and probabilities are required to be between 0 and 1. If we wrote a simple
linear regression model such as ph = β0 + β1 xh for some predictor variable x, nothing forces the
probabilities to be between 0 and 1. When modeling probabilities, it seems reasonable to ask that
they be between 0 and 1.
Rather than modeling the probabilities directly, we model a function of the probabilities that is
not restricted between 0 and 1. In particular, we model the log of the odds, rather than the actual
probabilities. The odds Oh are defined to be the probability that the event occurs, divided by the
probability that it does not occur, thus
\[ O_h \equiv \frac{p_h}{1 - p_h}. \]
Probabilities must be between 0 and 1, so the odds can take any values between 0 and +∞. Taking
the log of the odds permits any values between −∞ and +∞, so we consider models
\[ \log\left(\frac{p_h}{1 - p_h}\right) = m(x_h). \tag{20.1.1} \]
The function f(p) ≡ log[p/(1 − p)] is known as the logit transform. It maps the unit interval into
the real line. On the other hand, if
the model m(xh ) corresponds to any sort of regression model, models like (20.1.1) are called logistic
regression models. These models are named after the logistic transform, which is the inverse of the
logit transform,
\[ p = g(\eta) \equiv \frac{e^{\eta}}{1 + e^{\eta}}. \]
The functions are inverses in the sense that g( f (p)) = p and f (g(η )) = η . To perform any worth-
while data analysis requires using both the logit transform and the logistic transform, so it really
does not matter what you call the models. These days, any model of the form (20.1.1) is often called
logistic regression, regardless of whether m(xh ) corresponds to a regression model.
In Chapter 3, to perform tests and construct confidence intervals, we assumed that the yh obser-
vations were independent, with a common variance σ 2 , and normally distributed. In this chapter,
to perform tests and construct confidence intervals similar to those used earlier, we need to rely on
having large amounts of data. That can happen in two different ways. The best way is to have the
Nh values large for every value of h. In the chloracetic acid data, each Nh is 10, which is probably
large enough. Unfortunately, this best way to have the data may be the least common way of ac-
tually obtaining data. The other and more common way to get a lot of data is to have the number
of proportions n reasonably large but the Nh s possibly small. Frequently, the Nh s all equal 1. When
worrying about O-ring failure, each shuttle launch is a separate trial, Nh = 1, but we have n = 23
launches to examine. When examining coronary incidents, each man is a separate trial, Nh = 1, but
we have n = 200 men to examine. In other words, if the Nh s are all large, we don’t really care if n is
large or not. If the Nh s are not all large, we need n to be large. A key point is that n needs to be large
relative to the number of parameters we fit in our model. For the O-ring data, we will only fit two
parameters, so n = 23 is probably reasonable. For the coronary incident data, we have many more
predictors, so we need many more subjects. In fact, we will need to resist the temptation to fit too
many parameters to the data.
EXAMPLE 20.2.1. In Example 20.1.1 and Table 20.1 we presented the data of Woodward et
al. (1941) on the slaughter of mice. These data are extremely well behaved in that they all have the
same reasonably large number of trials Nh = 10, h = 1, . . . , 12, and there is only one measurement
predictor variable, the dose xh .
A simple linear logistic regression model has
\[ \log\left(\frac{p_h}{1 - p_h}\right) = \beta_0 + \beta_1 x_h, \tag{20.2.1} \]
so our model fits a straight line in dose to the log-odds. Alternatively,
\[ p_h = \frac{e^{\beta_0 + \beta_1 x_h}}{1 + e^{\beta_0 + \beta_1 x_h}}. \]
Indeed, for an arbitrary dose x we can write
\[ p(x) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}. \tag{20.2.2} \]
Standard computer output involves a table of coefficients:
Table of Coefficients: Model (20.2.1).
Predictor β̂k SE(β̂k ) t P
Constant −3.56974 0.705330 −5.06 0.000
Dose 14.6369 3.33248 4.39 0.000
The validity of everything but the point estimates relies on having large amounts of data. Using the
point estimates gives the linear predictor
\[ \hat{\eta}(x) = -3.56974 + 14.6369\,x. \]
Applying the logistic transformation to the linear predictor gives the estimated probability for any
x,
\[ \hat{p}(x) = \frac{e^{\hat{\eta}(x)}}{1 + e^{\hat{\eta}(x)}}. \]
This function is plotted in Figure 20.1. The approximate model is unlikely to fit well outside the
range of the xh values that actually occurred in Table 20.1, although since this range of xh values
gets the fitted values reasonably close to both zero and one, predicting outside the range of the
observed doses may be less of a problem than in regression for measurement data.
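As a sketch of how the fitted curve is evaluated, the following code plugs the point estimates from the table of coefficients into the logistic transform. The doses 0.1 and 0.3 are illustrative choices, not values quoted from Table 20.1, and the name x_half (the dose at which the estimated probability is one half, found by setting the linear predictor to zero) is introduced here only for illustration.

```python
import math

# Point estimates from the Table of Coefficients for Model (20.2.1).
B0_HAT, B1_HAT = -3.56974, 14.6369

def p_hat(x):
    """Estimated probability of death at dose x under the fitted model."""
    eta = B0_HAT + B1_HAT * x          # linear predictor (estimated log-odds)
    return math.exp(eta) / (1 + math.exp(eta))

# Dose at which the estimated log-odds are 0, i.e., p_hat = 0.5.
x_half = -B0_HAT / B1_HAT              # roughly 0.244

for dose in (0.1, x_half, 0.3):        # 0.1 and 0.3 are assumed, illustrative doses
    print(round(dose, 4), round(p_hat(dose), 4))
```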
The table of coefficients is used exactly like previous tables of coefficients, e.g., β̂1 = 14.64
is the estimated slope parameter and SE(β̂1 ) = 3.332 is its standard error. The t values are simply
the estimates divided by their standard errors, so they provide statistics for testing whether the
regression coefficient equals 0. The P values are based on large sample normal approximations, i.e.,
the t statistics are compared to a t(∞) distribution. Clearly, there is a significant effect for fitting the
dose, so we reject the hypothesis that β1 = 0. The dose helps explain the data.
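The t values and their large-sample P values can be recomputed from the estimates and standard errors. This is a minimal sketch that uses the standard normal distribution, i.e., t(∞), through math.erfc; the helper name two_sided_normal_p is hypothetical.

```python
import math

def two_sided_normal_p(t):
    """Two-sided P value for a statistic compared to N(0, 1), i.e., t(infinity)."""
    return math.erfc(abs(t) / math.sqrt(2))

# (estimate, standard error) pairs from the Table of Coefficients for Model (20.2.1).
rows = {"Constant": (-3.56974, 0.705330), "Dose": (14.6369, 3.33248)}

t_values = {name: est / se for name, (est, se) in rows.items()}
p_values = {name: two_sided_normal_p(t) for name, t in t_values.items()}

for name in rows:
    # Both P values are far below 0.0005, which is why the table prints 0.000.
    print(name, round(t_values[name], 2), p_values[name])
```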
Many computer programs expand the table of coefficients to include odds ratios, defined as
20.2 SIMPLE LINEAR LOGISTIC REGRESSION 485
ξk ≡ e^{βk}, and a confidence interval for the odds ratio. The (1 − α) confidence interval for ξk is
typically found by exponentiating the limits of the confidence interval for βk , i.e., it is (e^{Lk}, e^{Uk})
where Lk ≡ β̂k − t(1 − α/2, ∞)SE(β̂k ) and Uk ≡ β̂k + t(1 − α/2, ∞)SE(β̂k ) provide the (1 − α)100%
confidence limits for βk .
[Figure 20.1: fitted probabilities (vertical axis, Fitted) plotted against Dose (horizontal axis).]
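The odds-ratio interval just described can be sketched as follows, using the Dose row of the table of coefficients. The function name is hypothetical, and the normal percentile comes from Python's statistics.NormalDist. Because the doses here are small fractions of a gram per kilogram, the sketch also reports the odds ratio for a 0.1 unit change in dose, which is an illustrative choice rather than anything printed by the programs discussed.

```python
import math
from statistics import NormalDist

def odds_ratio_interval(estimate, se, alpha=0.05):
    """Odds ratio e^beta with a (1 - alpha) interval from exponentiated limits."""
    z = NormalDist().inv_cdf(1 - alpha / 2)      # t(1 - alpha/2, infinity)
    lower, upper = estimate - z * se, estimate + z * se
    return math.exp(estimate), math.exp(lower), math.exp(upper)

# Dose row of the Table of Coefficients for Model (20.2.1).
odds_ratio, lower, upper = odds_ratio_interval(14.6369, 3.33248)
print(odds_ratio, lower, upper)      # per unit (1 g/kg) change in dose

# Odds ratio for an assumed 0.1 g/kg change in dose, roughly 4.3.
print(math.exp(0.1 * 14.6369))
```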
Additional standard output includes the Log-Likelihood = −63.945 (explained in Section 20.4)
and a model-based χ 2 test for β1 = 0 that is explained in Section 20.3. The model-based test for
β1 = 0 has G2 = 23.450 with df = 1 and a P value of 0.000, obtained by comparing 23.450 to a
χ 2 (1) distribution. This test provides substantial evidence that death is related to dose. ✷
• The number of concordant pairs is C; the number of pairs with (yi − y j )( p̂i − p̂ j ) > 0.
• The number of discordant pairs is D; the number of pairs with (yi − y j )( p̂i − p̂ j ) < 0.
• The number of ties is T ; the number of pairs with ( p̂i − p̂ j ) = 0.
The idea is that the higher the percentage of concordant pairs, the better the predictive ability of the
model.
The 197 ties occur because many of the 120 cases have the same predictor variable. In particular,
the number of ties in these data is the sum over the 12 categories of the number died times the
number survived.
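The three summary measures of association commonly given are
\[ \text{Somers' } D \equiv \frac{C - D}{C + D + T} = \frac{C - D}{\binom{n}{2} - \binom{n_1}{2} - \binom{n_2}{2}}, \]
\[ \text{Goodman–Kruskal Gamma} \equiv \frac{C - D}{C + D}, \]
and
\[ \text{Kendall's Tau-a} \equiv \frac{C - D}{\binom{n}{2}}. \]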
Other versions of Kendall’s Tau make adjustments for the number of ties. It is pretty obvious that
For the same yh s, increasing the Nh s by a constant multiple does not affect any of the measures
of predictive ability but it does increase the goodness-of-fit statistics and also makes them more
valid.
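The pair counts and the three measures can be sketched in a few lines for 0-1 outcomes. The six cases below are made up for illustration and are not the mouse data; the function name is hypothetical. Only pairs with different outcomes are classified, so C + D + T equals the number of events times the number of non-events.

```python
from itertools import combinations
from math import comb

def association_measures(y, p_hat):
    """Concordant (C), discordant (D), and tied (T) pairs, with summary measures."""
    C = D = T = 0
    for i, j in combinations(range(len(y)), 2):
        if y[i] == y[j]:
            continue                     # pairs with the same outcome are not classified
        s = (y[i] - y[j]) * (p_hat[i] - p_hat[j])
        if s > 0:
            C += 1
        elif s < 0:
            D += 1
        else:
            T += 1                       # equal fitted probabilities
    n = len(y)
    return {
        "C": C, "D": D, "T": T,
        "somers_d": (C - D) / (C + D + T),
        "gamma": (C - D) / (C + D),
        "tau_a": (C - D) / comb(n, 2),
    }

# Made-up outcomes and fitted probabilities (assumptions, for illustration only).
y = [0, 0, 1, 0, 1, 1]
p_hat = [0.1, 0.3, 0.3, 0.5, 0.7, 0.9]
measures = association_measures(y, p_hat)
print(measures)
```

For these made-up values the counts are C = 7, D = 1, and T = 1, and the three measures are ordered Tau-a, Somers' D, Gamma because their denominators decrease in that order.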
This SE does not really account for the process of fitting the model, i.e., estimating ph . We can
incorporate the fitting process by incorporating the leverage, say, ah . A standardized Pearson residual
is
\[ \tilde{r}_h = \frac{y_h - \hat{p}_h}{\sqrt{\hat{p}_h (1 - \hat{p}_h)(1 - a_h)/N_h}}. \]
Leverages for logistic regression are similar in spirit to those discussed in Chapters 7 and 11, but
rather more complicated to compute. Values near 1 are still high-leverage points and the 2r/n and
3r/n rules of thumb can be applied where r is the number of (functionally distinct) parameters in
the model. Table 20.2 contains diagnostics for the mouse data. Nothing seems overly disturbing.
I prefer using the standardized Pearson residuals, but the Pearson residuals often get used be-
cause of their simplicity. When all Nh s are large, both residuals can be compared to a N(0, 1) dis-
tribution to assess whether they are consistent with the model and the other data. In this large Nh
case, much like the spirit of Chapter 5, we use the residuals to identify cases that cause problems
in the goodness-of-fit test. Even with small Nh s, where no valid goodness-of-fit test is present, the
residuals are used to identify potential problems.
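As a check on the residual formulas, the following sketch recomputes the Pearson and standardized Pearson residuals for the first five Challenger cases listed in Table 20.4, where each Nh = 1. The leverages are taken from the table rather than computed, and small discrepancies in the third decimal place are expected because the tabled p̂h values are rounded. The function name is hypothetical.

```python
import math

def pearson_residuals(y, p_hat, leverage, N=1):
    """Return (Pearson residual, standardized Pearson residual) for one case."""
    se = math.sqrt(p_hat * (1 - p_hat) / N)      # estimated SE of y, ignoring model fitting
    r = (y - p_hat) / se
    r_std = r / math.sqrt(1 - leverage)          # adjusts the SE for the leverage
    return r, r_std

# (y, p_hat, leverage) for cases 1-5 of Table 20.4.
cases = [(1, 0.939, 0.167), (1, 0.859, 0.208), (1, 0.829, 0.209),
         (1, 0.603, 0.143), (0, 0.430, 0.086)]

residuals = [pearson_residuals(y, p, a) for y, p, a in cases]
for case_number, (r, r_std) in enumerate(residuals, start=1):
    print(case_number, round(r, 3), round(r_std, 3))
```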
With measurement data, residuals are used to check for outliers in the dependent variable, i.e.,
values of the dependent variable that do not seem to belong with the rest of the data. With count
data it is uncommon to get anything that is really an outlier in the counts. The yh s are proportions,
so outliers would be values that are not between 0 and 1. With count data, large residuals really
highlight areas where the model is not fitting the data very well. If you have a high dose of poison
but very few mice die, something is wrong. The problem is often something that we have left out of
the model.
20.3 MODEL TESTING 489
Based on the results of a valid goodness-of-fit test, we already have reason to believe that a sim-
ple linear logistic regression fits the chloracetic acid data reasonably well, but for the purpose of
illustrating the procedure for testing models, we will test the simple linear logistic model against a
cubic polynomial logistic model. This section demonstrates the test. In the next section we discuss
the motivation for it.
In Section 20.2 we gave the table of coefficients and the table of goodness-of-fit tests for the
simple linear logistic regression model
\[ \log\left(\frac{p_h}{1 - p_h}\right) = \beta_0 + \beta_1 x_h. \tag{20.3.1} \]
with
Table of Coefficients: Model (20.3.2).
Predictor γ̂k SE(γ̂k ) t P
Constant −2.47396 4.99096 −0.50 0.620
dose −5.76314 83.1709 −0.07 0.945
x2 114.558 434.717 0.26 0.792
x3 −196.844 714.422 −0.28 0.783
and goodness-of-fit tests
Goodness-of-Fit Tests: Model (20.3.2).
Method Chi-Square df P
Pearson 8.7367 8 0.365
Deviance 10.1700 8 0.253
Hosmer–Lemeshow 6.3389 4 0.175
Additional standard output includes the Log-Likelihood = −63.903 and a model-based test that all
slopes are zero, i.e., 0 = γ1 = γ2 = γ3 , that has G2 = 23.534 with df = 3, and a P value of 0.000.
To test the full cubic model against the reduced simple linear model, we compute the likelihood
ratio test statistic from the log-likelihoods,
\[ G^2 = -2[(-63.945) - (-63.903)] = 0.084. \]
There are 4 parameters in Model (20.3.2) and only 2 parameters in Model (20.3.1) so there are
4 − 2 = 2 degrees of freedom associated with this test. When the total number of cases n is large
compared to the number of parameters in the full model, we can compare G2 = 0.084 to a χ 2 (4 − 2)
distribution. This provides no evidence that the cubic model fits better than the simple linear model.
Note that the validity of this test does not depend on having the Nh s large.
For these data, we can also obtain G2 by the difference in deviances reported for the two models,
The difference in the deviance degrees of freedom is 10 − 8 = 2, which is also the correct degrees
of freedom.
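The log-likelihood computation can be sketched as follows, using the reported log-likelihoods −63.945 (simple linear) and −63.903 (cubic). With 2 degrees of freedom the χ² upper tail probability has the closed form exp(−x/2), so no statistical library is needed; the function names are hypothetical.

```python
import math

def lr_statistic(loglik_full, loglik_reduced):
    """Likelihood ratio statistic G^2 = -2[loglik(reduced) - loglik(full)]."""
    return -2 * (loglik_reduced - loglik_full)

def chi2_upper_tail_2df(x):
    """Pr(chi-square with 2 df > x); this closed form holds only for 2 df."""
    return math.exp(-x / 2)

G2 = lr_statistic(loglik_full=-63.903, loglik_reduced=-63.945)
p_value = chi2_upper_tail_2df(G2)
print(round(G2, 3), round(p_value, 3))   # G^2 = 0.084 with a P value near 0.96

# The same G^2 is the difference of the two model-based tests against the
# intercept-only model reported in the text: 23.534 - 23.450.
print(round(23.534 - 23.450, 3))
```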
Although finding likelihood ratio tests by subtracting deviances and deviance degrees of free-
dom is our preferred computational tool, unfortunately, subtracting the deviances and the deviance
degrees of freedom cannot be trusted to give the correct G2 and degrees of freedom when using
programs designed for fitting logistic models (as opposed to programs for fitting generalized linear
models). As discussed in Section 20.5, many logistic regression programs pool cases with identical
predictor variables prior to computing the deviance and when models use different predictors, the
pooling often changes, which screws up the test. Subtracting the deviances and deviance degrees of
freedom does typically give the correct result when using programs for generalized linear models.
The standard output for Model (20.3.1) also included a model-based test for β1 = 0 with G2 =
23.450, df = 1, and a P value of 0.000. This is the likelihood ratio test for comparing the full model
(20.3.1) with the intercept-only model
\[ \log\left(\frac{p_h}{1 - p_h}\right) = \delta_0. \tag{20.3.3} \]
Alas, many logistic regression programs do not like to fit Model (20.3.3), so we take the program’s
word for the result of the test. (Programs for generalized linear models are more willing to fit Model
(20.3.3).) Finding the test statistic is discussed in Section 20.5.
The usual output for fitting Model (20.3.2) has a model-based test that all slopes are zero, i.e.,
that 0 = γ1 = γ2 = γ3 , for which G2 = 23.534 with df = 3 and a P value of 0.000. This is the
likelihood ratio test for the full model (20.3.2) against the reduced “intercept-only” model (20.3.3).
Generally, when fitting a model these additional reported G2 tests are for comparing the current
model to the intercept-only model (20.3.3).
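\[ p_h = \frac{e^{m(x_h)}}{1 + e^{m(x_h)}}. \tag{20.4.2} \]
Given the estimate m̂(x) we get
\[ \hat{p}(x) = \frac{e^{\hat{m}(x)}}{1 + e^{\hat{m}(x)}}. \]
For example, given the estimates β̂0 and β̂1 for a simple linear logistic regression, we get
\[ \hat{p}(x) = \frac{\exp(\hat{\beta}_0 + \hat{\beta}_1 x)}{1 + \exp(\hat{\beta}_0 + \hat{\beta}_1 x)}. \tag{20.4.3} \]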
In particular, this formula provides the p̂h s when doing predictions at the xh s.
Estimates of coefficients are found by maximizing the likelihood function. The likelihood func-
tion is the probability of getting the data that were actually observed. It is a function of the unknown
model parameters contained in m(·). Because the Nh yh s are independent binomials, the likelihood
function is
\[ L(p_1, \ldots, p_n) = \prod_{h=1}^{n} \binom{N_h}{N_h y_h} p_h^{N_h y_h} (1 - p_h)^{N_h - N_h y_h}. \tag{20.4.4} \]
For a particular proportion yh , Nh yh is Bin(Nh , ph ) and the probability from Section 1.4 is an individual
term on the right. We multiply the individual terms because the Nh yh s are independent.
If we substitute for the ph s using (20.4.2) into the likelihood function (20.4.4), the likelihood
becomes a function of the model parameters. For example, if m(xh ) = β0 + β1 xh the likelihood be-
comes a function of the model parameters β0 and β1 for known values of (xh , yh , Nh ), h = 1, . . . , n.
Computer programs maximize this function of β0 and β1 to give maximum likelihood estimates β̂0
and β̂1 along with approximate standard errors. The estimates have approximate normal distribu-
tions for large sample sizes. For the large sample approximations to be valid, it is typically enough
that the total number of trials in the entire data n be large relative to the number of model param-
eters; the individual sample sizes Nh need not be large. The normal approximations also hold if all
the Nh s are large regardless of the size of n.
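To make the maximization concrete, here is a minimal sketch of one common approach, Newton–Raphson iteration on the log of the likelihood (20.4.4) with m(xh ) = β0 + β1 xh . It is not claimed to be what any particular program does. The four groups of data are made up for illustration and are not Table 20.1; the function name and the fixed number of iterations are assumptions.

```python
import math

def fit_simple_logistic(x, successes, trials, iterations=25):
    """Newton-Raphson for logit(p_h) = b0 + b1 * x_h with binomial counts."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = 0.0                # gradient of the log-likelihood
        i00 = i01 = i11 = 0.0        # information matrix (negative Hessian)
        for xh, wh, Nh in zip(x, successes, trials):
            p = 1 / (1 + math.exp(-(b0 + b1 * xh)))
            g0 += wh - Nh * p
            g1 += (wh - Nh * p) * xh
            v = Nh * p * (1 - p)
            i00 += v
            i01 += v * xh
            i11 += v * xh * xh
        det = i00 * i11 - i01 * i01
        # Newton step: add (information inverse) times (gradient).
        b0 += (i11 * g0 - i01 * g1) / det
        b1 += (i00 * g1 - i01 * g0) / det
    # Approximate standard errors from the inverse information at convergence.
    return b0, b1, math.sqrt(i11 / det), math.sqrt(i00 / det)

# Made-up data (assumptions): dose, number of deaths, group size.
doses = [0.1, 0.2, 0.3, 0.4]
deaths = [1, 4, 7, 9]
group_sizes = [10, 10, 10, 10]

b0_hat, b1_hat, se0, se1 = fit_simple_logistic(doses, deaths, group_sizes)
print(b0_hat, b1_hat, se0, se1)
```

At the maximum the gradient is zero, which for the intercept means that the fitted counts Nh p̂h sum to the observed number of events.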
In Section 11.3 we found the least squares estimates for linear regression models. Although we
did not explicitly give the likelihood function for regression models with normally distributed data,
we mentioned that the least squares estimates were also maximum likelihood estimates. Unfortu-
nately, for logistic regression there are no closed-form solutions for the estimates and standard errors
like those presented for measurement data in Chapter 11. For logistic regression, different computer
programs may give slightly different results because the computations are more complex.
Maximum likelihood theory also provides a (generalized) likelihood ratio (LR) test for a full
model versus a reduced model. Suppose the full model is
\[ \log\left(\frac{p_{Fh}}{1 - p_{Fh}}\right) = m_F(x_h). \]
Fitting the model leads to estimated probabilities p̂Fh . The reduced model must be a special case of
the full model, say,
\[ \log\left(\frac{p_{Rh}}{1 - p_{Rh}}\right) = m_R(x_h), \]
with fitted probabilities p̂Rh . The commonly used form of the likelihood ratio test statistic is
\begin{align*}
G^2 &= -2 \log\left[\frac{L(\hat{p}_{R1}, \ldots, \hat{p}_{Rn})}{L(\hat{p}_{F1}, \ldots, \hat{p}_{Fn})}\right] \\
    &= 2 \sum_{h=1}^{n} \left\{ N_h y_h \log(\hat{p}_{Fh}/\hat{p}_{Rh}) + N_h (1 - y_h) \log[(1 - \hat{p}_{Fh})/(1 - \hat{p}_{Rh})] \right\},
\end{align*}
where the second equality is based on Equation (20.4.4). An alternative to the LR test statistic is the
Pearson test statistic, which is
\[ X^2 = \sum_{h=1}^{n} \frac{(N_h \hat{p}_{Fh} - N_h \hat{p}_{Rh})^2}{N_h \hat{p}_{Rh} (1 - \hat{p}_{Rh})} = \sum_{h=1}^{n} \left[ \frac{\hat{p}_{Fh} - \hat{p}_{Rh}}{\sqrt{\hat{p}_{Rh} (1 - \hat{p}_{Rh})/N_h}} \right]^2. \]
To compare a full and reduced model, G2 is twice the absolute value of the difference between these
values. When using logistic regression programs (as opposed to generalized linear model programs),
this is how one needs to compute G2 .
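Given fitted probabilities from a full and a reduced model, both test statistics are simple sums. The sketch below uses made-up proportions for three groups of 10 trials, with the observed proportions as the full-model fit and a common probability of 0.5 as the reduced-model fit; all data values and the function name are assumptions for illustration.

```python
import math

def lr_and_pearson(y, N, p_full, p_reduced):
    """Likelihood ratio G^2 and Pearson X^2 for a full versus a reduced model."""
    G2 = X2 = 0.0
    for yh, Nh, pf, pr in zip(y, N, p_full, p_reduced):
        G2 += 2 * (Nh * yh * math.log(pf / pr)
                   + Nh * (1 - yh) * math.log((1 - pf) / (1 - pr)))
        X2 += (Nh * pf - Nh * pr) ** 2 / (Nh * pr * (1 - pr))
    return G2, X2

# Made-up proportions (assumptions); all fitted probabilities lie strictly in (0, 1).
y = [0.2, 0.5, 0.8]
N = [10, 10, 10]
p_full = [0.2, 0.5, 0.8]        # full model reproduces the observed proportions
p_reduced = [0.5, 0.5, 0.5]     # reduced model fits one common probability

G2, X2 = lr_and_pearson(y, N, p_full, p_reduced)
print(round(G2, 4), round(X2, 4))
```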
The smallest interesting logistic model that we can fit to the data is the intercept-only model
\[ \log\left(\frac{p_h}{1 - p_h}\right) = \beta_0. \tag{20.4.5} \]
The largest logistic model that we can fit to the data is the saturated model that has a separate
parameter for each case,
\[ \log\left(\frac{p_h}{1 - p_h}\right) = \gamma_h. \tag{20.4.6} \]
Interesting models tend to be somewhere between these two. Many computer programs automati-
cally report the results of testing the fitted model against both of these.
For standard simple linear regression, we have two tests for H0 : β1 = 0, a t test and an F test, and
the two tests are equivalent, e.g., always give the same P value, cf. Section 6.1. For simple linear
logistic regression we have a t test for H0 : β1 = 0, and testing against the intercept-only model
provides a G2 test for H0 : β1 = 0. As will be seen in Section 20.5 for the O-ring data, these tests
typically do not give the same P values. The two tests are not equivalent. For the mouse data, both
P values were reported as 0.000, so one could not see that the two P values were different beyond
the three decimal points reported.
Testing a fitted model m(·) against the saturated model (20.4.6) is called a goodness-of-fit test.
The fitted probabilities under Model (20.4.6) are just the observed proportions for each case, the
yh s. The deviance for a fitted model is defined as G2 for testing the fitted model against the saturated
20.5 BINARY DATA 493
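model (20.4.6),
\begin{align*}
G^2 &= -2 \log\left[\frac{L(\hat{p}_1, \ldots, \hat{p}_n)}{L(y_1, \ldots, y_n)}\right] \tag{20.4.7} \\
    &= 2 \sum_{h=1}^{n} \left[ N_h y_h \log(y_h/\hat{p}_h) + (N_h - N_h y_h) \log((1 - y_h)/(1 - \hat{p}_h)) \right].
\end{align*}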
In this formula, if a = 0, then a log(a) is taken as zero. The degrees of freedom for the deviance
are n (the number of parameters in Model (20.4.6)) minus the number of (functionally distinct)
parameters in the fitted model.
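The deviance formula, including the convention that a log(a) is zero when a = 0, can be sketched as follows. The three groups and their fitted probabilities are made up for illustration, and the function name is hypothetical.

```python
import math

def deviance(y, N, p_hat, n_params):
    """Deviance G^2 of (20.4.7) and its degrees of freedom, n minus n_params."""
    G2 = 0.0
    for yh, Nh, ph in zip(y, N, p_hat):
        events = Nh * yh
        non_events = Nh - events
        # Terms with a zero count contribute nothing (a log(a) = 0 when a = 0).
        if events > 0:
            G2 += 2 * events * math.log(yh / ph)
        if non_events > 0:
            G2 += 2 * non_events * math.log((1 - yh) / (1 - ph))
    return G2, len(y) - n_params

# Made-up proportions and fitted probabilities (assumptions, for illustration).
y = [0.0, 0.3, 1.0]
N = [10, 10, 10]
p_hat = [0.1, 0.3, 0.8]

G2, df = deviance(y, N, p_hat, n_params=2)
print(round(G2, 4), df)
```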
The problem with the goodness-of-fit test is that the number of parameters in Model (20.4.6)
is the sample size n, so the only way for G2 to have an asymptotic χ 2 distribution is if all the Nh s
are large. For the mouse death data, the Nh s are all 10, which is probably fine, but for a great many
data sets, all the Nh s are 1, so a χ 2 test of the goodness-of-fit statistic is not appropriate. A similar
conclusion holds for the Pearson statistic.
As also discussed in the next section, in an effort to increase the size of the Nh s, many logistic
regression computer programs pool together any cases for which xh = xi . Thus, instead of having
two cases with Nh yh ∼ Bin(Nh , ph ) and Ni yi ∼ Bin(Ni , pi ), the two cases get pooled into a single
case with (Nh yh + Ni yi ) ∼ Bin(Nh + Ni , ph ). Note that if xh = xi , it follows that ph = pi and the new
proportion would be (Nh yh + Ni yi )/(Nh + Ni ). I have never encountered regression data with so few
distinct xh values that this pooling procedure actually accomplished its purpose of making all the
group sizes reasonably large, but if the mouse data were presented as 120 mice that either died or
not along with their dose, such pooling would work fine.
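The pooling step itself is just a matter of adding counts over cases that share a predictor value. The following sketch uses six made-up 0-1 cases with a single predictor; the function name and data are assumptions.

```python
from collections import defaultdict

def pool_cases(x, successes, trials):
    """Pool binomial cases with identical predictor values into single cases."""
    totals = defaultdict(lambda: [0, 0])
    for xh, wh, Nh in zip(x, successes, trials):
        totals[xh][0] += wh          # pooled number of events
        totals[xh][1] += Nh          # pooled number of trials
    # Map each distinct x to (events, trials, pooled proportion).
    return {xh: (w, N, w / N) for xh, (w, N) in sorted(totals.items())}

# Made-up 0-1 data (assumptions): six cases, three distinct predictor values.
x = [1, 2, 2, 3, 3, 3]
events = [0, 1, 0, 1, 1, 0]
trials = [1, 1, 1, 1, 1, 1]

pooled = pool_cases(x, events, trials)
print(pooled)    # 6 original cases become 3 pooled cases
```

If a second predictor were added to the model, cases would be pooled only when both predictors agree, so the pooled cases, and with them the deviance degrees of freedom, would generally change.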
Ideally, the deviance G2 of (20.4.7) could be used analogously to the SSE in normal theory and
the degrees of freedom for the deviance of (20.4.7) would be analogous to the dfE. To compare a full
and reduced model you just subtract their deviances (rather than their SSEs) and compare the test
statistic to a χ 2 with degrees of freedom equal to the difference in the deviance degrees of freedom
(rather than differencing the dfEs). This procedure works just fine when fitting the models using
programs for fitting generalized linear models. The invidious thing about the pooling procedure of
the previous paragraph is that when you change the model from reduced to full, you often change
the predictor vector xh in such a way that it changes which cases have xh = xi . When comparing a
full and a reduced model, the models may well have different cases pooled together, which means
that the difference in deviances no longer provides the appropriate G2 for testing the models. In such
cases G2 needs to be computed directly from the log-likelihood.
After discussing the commonly reported goodness-of-fit statistics in the next section, we will
no longer discuss any deviance values that are obtained by pooling. After Subsection 20.5.1, the
deviances we discuss may not be those reported by a logistic regression program but they should be
those obtained by a generalized linear models program.
Let pi be the probability that any O-ring fails in case i. The simple linear logistic regression
model is
\[ \mathrm{logit}(p_i) \equiv \log\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 x_i, \]
where xi is the known temperature and β0 and β1 are unknown intercept and slope parameters
(coefficients).
Maximum likelihood theory gives the coefficient estimates, standard errors, and t values as
Table of Coefficients: O-rings
Predictor β̂k SE(β̂k ) t P
Constant 15.0429 7.37862 2.04 0.041
Temperature −0.232163 0.108236 −2.14 0.032
The t values are the estimate divided by the standard error. For testing H0 : β1 = 0, the value t =
−2.14 yields a P value that is approximately 0.03, so there is evidence that temperature does help
predict O-ring failure. Alternatively, a model-based test of β1 = 0 compares the simple linear logistic
model to an intercept-only model and gives G2 = 7.952 with df = 1 and P = 0.005. These tests
should be reasonably valid because n = 23 is reasonably large relative to the 2 parameters in the
fitted model. The log-likelihood is ℓ = −10.158.
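The two P values for testing β1 = 0 can be recomputed from the reported numbers, which makes the difference between the t test and the G² test concrete. This sketch uses closed forms based on math.erfc for the N(0, 1) and χ²(1) tail probabilities; the helper names are hypothetical.

```python
import math

def two_sided_normal_p(t):
    """Two-sided P value for a statistic compared to N(0, 1)."""
    return math.erfc(abs(t) / math.sqrt(2))

def chi2_upper_tail_1df(x):
    """Pr(chi-square with 1 df > x); this closed form holds only for 1 df."""
    return math.erfc(math.sqrt(x / 2))

# Temperature row of the O-ring table of coefficients.
t = -0.232163 / 0.108236
p_t = two_sided_normal_p(t)

# Model-based likelihood ratio test against the intercept-only model.
p_G2 = chi2_upper_tail_1df(7.952)

print(round(t, 2), round(p_t, 3))    # -2.14 and 0.032
print(round(p_G2, 3))                # 0.005
```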
Figure 20.2 gives a plot of the estimated probabilities as a function of temperature,
\[ \hat{p}(x) = \frac{e^{15.0429 - 0.232163x}}{1 + e^{15.0429 - 0.232163x}}. \]
The Challenger was launched at x = 31 degrees, so the predicted log odds are 15.04 − 0.2321(31) =
7.8449 and the predicted probability of an O-ring failure is e^{7.8449}/(1 + e^{7.8449}) = 0.9996. Actually,
there are problems with this prediction because we are predicting very far from the observed data.
The lowest temperature at which a shuttle had previously been launched was 53 degrees, very far
from 31 degrees. According to the fitted model, a launch at 53 degrees has probability 0.939 of
O-ring failure, so even with the caveat about predicting beyond the range of the data, the model
indicates an overwhelming probability of failure.
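The two predictions quoted above can be reproduced directly from the fitted coefficients. This is a minimal sketch with a hypothetical function name; it uses the unrounded estimates from the table of coefficients, which give the same probabilities to the reported precision.

```python
import math

# Estimates from the O-ring table of coefficients.
B0_HAT, B1_HAT = 15.0429, -0.232163

def failure_probability(temperature):
    """Estimated probability of O-ring failure at the given launch temperature."""
    eta = B0_HAT + B1_HAT * temperature      # predicted log odds
    return 1 / (1 + math.exp(-eta))          # equals e^eta / (1 + e^eta)

p_31 = failure_probability(31)   # Challenger launch temperature
p_53 = failure_probability(53)   # lowest previously observed launch temperature
print(round(p_31, 4), round(p_53, 3))        # 0.9996 and 0.939
```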
Many specialized logistic regression computer programs report the following goodness-of-fit statis-
tics for the O-ring data.
[Figure 20.2: estimated probability of O-ring failure (vertical axis, Fitted, 0.0 to 1.0) plotted against Temperature (horizontal axis, 30 to 80).]
Goodness-of-Fit Tests
Method Chi-Square df P
Pearson 11.1303 14 0.676
Deviance 11.9974 14 0.607
Hosmer–Lemeshow 9.7119 8 0.286
For 0-1 data, these are all useless. The Hosmer–Lemeshow statistic does not have a χ 2 distribution.
For computing the Pearson and deviance statistics the 23 original cases have been pooled into ñ = 16
new cases based on duplicate temperatures. This gives binomial sample sizes of Ñ6 = 3, Ñ9 = 4,
Ñ12 = Ñ13 = 2, and Ñh = 1 for all other cases. With two parameters in the fitted model, the reported
degrees of freedom are 14 = 16 − 2. To have a valid χ 2 (14) test, all the Ñh s would need to be large,
but none of them are. Pooling does not give a valid χ 2 test and it also eliminates the deviance as a
useful tool in model testing.
Henceforward, we only report deviances that are not obtained by pooling. These are the
likelihood ratio test statistics for the fitted model against the saturated model with the correspond-
ing degrees of freedom. Test statistics for any full and reduced models can then be obtained by
subtracting the corresponding deviances from each other just as the degrees of freedom for the test
can be obtained by subtraction. These deviances can generally be found by fitting logistic models
as special cases in programs for fitting generalized linear models. When using specialized logistic
regression software, great care must be taken and the safest bet is to always use log-likelihoods to
obtain test statistics.
EXAMPLE 20.5.1 CONTINUED. For the simple linear logistic regression model
\[
\log\left(\frac{p_i}{1-p_i}\right) = \beta_0 + \beta_1 x_i, \tag{20.5.1}
\]
without pooling, the deviance is $G^2 = 20.315$ with $21 = 23 - 2 = n - 2$ degrees of freedom. For the
intercept-only model
\[
\log\left(\frac{p_i}{1-p_i}\right) = \beta_0 \tag{20.5.2}
\]
Table 20.4: Diagnostics for Challenger data: Generalized linear modeling program.
Case yh p̂h Leverage rh r̃h Cook
1 1 0.939 0.167 0.254 0.279 0.0078
2 1 0.859 0.208 0.405 0.455 0.0271
3 1 0.829 0.209 0.454 0.511 0.0345
4 1 0.603 0.143 0.812 0.877 0.0641
5 0 0.430 0.086 −0.8694 −0.910 0.0391
6 0 0.375 0.074 −0.7741 −0.805 0.0260
7 0 0.375 0.074 −0.7741 −0.805 0.0260
8 0 0.375 0.074 −0.7741 −0.805 0.0260
9 0 0.322 0.067 −0.6893 −0.714 0.0183
10 0 0.274 0.063 −0.6138 −0.634 0.0136
11 0 0.230 0.063 −0.5465 −0.564 0.0107
12 0 0.230 0.063 −0.5465 −0.564 0.0107
13 1 0.230 0.063 1.830 1.890 0.1196
14 1 0.230 0.063 1.830 1.890 0.1196
15 0 0.158 0.066 −0.4333 −0.448 0.0071
16 0 0.130 0.068 −0.3858 −0.400 0.0058
17 0 0.086 0.069 −0.3059 −0.317 0.0037
18 1 0.086 0.069 3.270 3.389 0.4265
19 0 0.069 0.068 −0.2723 −0.282 0.0029
20 0 0.069 0.068 −0.2723 −0.282 0.0029
21 0 0.045 0.063 −0.2159 −0.223 0.0017
22 0 0.036 0.059 −0.1922 −0.198 0.0012
23 0 0.023 0.051 −0.1524 −0.156 0.0007
It can be difficult to get even generalized linear model programs to fit the intercept-only model
but the deviance $G^2$ can be obtained from the formula in Section 20.4. Given the estimate $\hat\beta_0$ for
Model (20.5.2), we get $\hat p_i = e^{\hat\beta_0}/(1 + e^{\hat\beta_0})$ for all $i$, and apply the formula. In general, for the
intercept-only model $\hat p_i = \sum_{i=1}^n N_i y_i / \sum_{i=1}^n N_i$, which, for binary data, reduces to $\hat p_i = \sum_{i=1}^n y_i / n$.
The degrees of freedom are the number of cases minus the number of fitted parameters, $n - 1$.
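As an illustration of this calculation for the O-ring data, the sketch below computes the intercept-only deviance from the $y_h$ column of Table 20.4. It assumes the usual form of the deviance for 0-1 data, namely $-2$ times the log-likelihood (the saturated model reproduces each 0-1 response exactly); the function name is illustrative.

```python
import math

def intercept_only_deviance(y):
    """Deviance and deviance df of the intercept-only logistic model for 0-1 data."""
    n = len(y)
    p_hat = sum(y) / n  # common fitted probability, sum(y_i) / n
    log_lik = sum(
        math.log(p_hat) if yi == 1 else math.log(1.0 - p_hat) for yi in y
    )
    return -2.0 * log_lik, n - 1

# y_h column of Table 20.4 (cases 1 through 23).
y = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0]

g2_reduced, df_reduced = intercept_only_deviance(y)
g2_full, df_full = 20.315, 21  # unpooled values for Model (20.5.1) quoted in the text
g2_test = g2_reduced - g2_full
print(f"G2 = {g2_reduced:.3f} on {df_reduced} df; test G2 = {g2_test:.3f} on {df_reduced - df_full} df")
```

The difference between the two deviances agrees with the model-based test statistic $G^2 = 7.952$ on 1 df reported earlier for $\beta_1 = 0$.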
This may seem like a small number, but it is difficult to predict well in 0-1 logistic regression even
when you know the perfect model p(x). The predictive ability of the model depends on the x values
one is likely to see. For x values that correspond to p(x) close to 0 or 1, the model will make
very good predictions. But for x values with $p(x) \doteq 0.5$, we will never be able to make reliable
predictions. Those predictions will be no better than predictions for the flip of a coin.
The other commonly used predictive measures for these data are given below.
Measures of Association
Between the Response Variable and Predicted Probabilities
Pairs Number Percent Summary Measures
Concordant 85 75.9 Somers’ D 0.56
Discordant 22 19.6 Goodman–Kruskal Gamma 0.59
Ties 5 4.5 Kendall’s Tau-a 0.25
Total 112 100.0
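The summary measures in this table can be recovered from the pair counts. The sketch below assumes the standard definitions that are consistent with the reported values: Somers' D divides the concordant-minus-discordant count by all pairs with different responses, Goodman-Kruskal Gamma ignores ties, and Kendall's Tau-a divides by all $n(n-1)/2$ pairs of cases.

```python
# Pair counts from the Measures of Association table.
concordant, discordant, ties = 85, 22, 5
n_cases = 23  # number of flights

# Pairs with different responses: 7 failures times 16 non-failures in Table 20.4.
pairs = concordant + discordant + ties

somers_d = (concordant - discordant) / pairs
gamma = (concordant - discordant) / (concordant + discordant)
tau_a = (concordant - discordant) / (n_cases * (n_cases - 1) / 2)

print(f"pairs = {pairs}, D = {somers_d:.2f}, Gamma = {gamma:.2f}, Tau-a = {tau_a:.2f}")
```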
20.6 Multiple logistic regression
This section examines regression models for the log-odds of a two-category response variable in
which we use more than one predictor variable. The discussion is centered around an example.
[Figure 20.3: Each predictor variable (Age, Systolic, Diastolic, Chol, Height, Weight) plotted against y.]
Of the 200 cases, 26 had coronary incidents. The data are available on the Internet, like all the data
in this book, through the webpage:
http://stat.unm.edu/~fletcher.
The data are part of the data that go along with both this book and the book Log-Linear Models and
Logistic Regression. They are also available on the Internet via STATLIB. Figure 20.3 plots each
variable against y = CN. Figures 20.4 through 20.7 provide a scatterplot matrix of the predictor
variables.
Let $p_i$ be the probability of a coronary incident for the $i$th man. We begin with the logistic
regression model
\[
\log[p_i/(1 - p_i)] = \beta_0 + \beta_1 Ag_i + \beta_2 S_i + \beta_3 D_i + \beta_4 Ch_i + \beta_5 H_i + \beta_6 W_i \tag{20.6.1}
\]
$i = 1, \ldots, 200$. The maximum likelihood fit of this model is given in Table 20.6. The deviance df
is the number of cases, 200, minus the number of fitted parameters, 7. Based on the t values, none
of the variables really stand out. There are suggestions of age, cholesterol, and weight effects. The
(unpooled) deviance G2 would look good except that, as discussed earlier, with Ni = 1 for all i there
is no basis for comparing it to a χ 2 (193) distribution.
Prediction follows as usual,
\[
\log[\hat p_i/(1 - \hat p_i)] = \hat\beta_0 + \hat\beta_1 Ag_i + \hat\beta_2 S_i + \hat\beta_3 D_i + \hat\beta_4 Ch_i + \hat\beta_5 H_i + \hat\beta_6 W_i.
\]
[Figures 20.4 through 20.7: Scatterplot matrix of the predictor variables Age, Systolic, Diastolic, Chol, Height, and Weight.]
For a 60-year-old man with blood pressure of 140 over 90, a cholesterol reading of 200, who is 69
inches tall and weighs 200 pounds, the estimated log odds of a coronary incident are $-1.2435$, so the
estimated probability of a coronary incident is
\[
\hat p = \frac{e^{-1.2435}}{1 + e^{-1.2435}} = 0.224.
\]
Figure 20.8 Coronary incident probabilities (Fitted) as a function of Age for S = 140, D = 90, H = 69, W = 200. Solid
line: Ch = 200; dashed line: Ch = 300.
Figure 20.8 plots the estimated probability of a coronary incident as a function of age for people
with S = 140, D = 90, H = 69, W = 200 and either Ch = 200 (solid line) or Ch = 300 (dashed line).
Diagnostic quantities for the cases with the largest Cook’s distances are given in Table 20.7.
They include 19 of the 26 cases that had coronary incidents. The large residuals are for people who
had low probabilities for a coronary incident but had one nonetheless. High leverages correspond to
unusual data. For example, case 41 has the highest cholesterol. Case 108 is the heaviest man in the
data.
We now consider fitting some reduced models. Simple linear logistic regressions were fitted for
each of the variables with high t values, i.e., Ag, Ch, and W. Regressions with variables that seem
naturally paired were also fitted, i.e., S,D and H,W. Table 20.8 contains the models along with df ,
G2 , A − q, and A∗ . The first two of these are the deviance degrees of freedom and the deviance.
No P values are given because the asymptotic $\chi^2$ approximation does not hold. Also given are two
analogues of Mallows's $C_p$ statistic, $A - q$ and $A^*$. $A - q \equiv G^2 - 2(df)$ is the Akaike information
criterion (AIC) less twice the number of trials ($q \equiv 2n$). $A^*$ is a version of the Akaike information
criterion defined for comparing Model (20.6.1) to various submodels. It gives numerical values
similar to the $C_p$ statistic and is defined by
\[
A^* = (G^2 - 134.9) - (7 - 2p).
\]
Here 134.9 is the deviance $G^2$ for the full model (20.6.1), 7 comes from the degrees of freedom
for the full model (6 explanatory variables plus an intercept), and $p$ comes from the degrees of
freedom for the submodel ($p$ = 1 + number of explanatory variables). The information in $A - q$ and
$A^*$ is identical, $A^* = 258.1 + (A - q)$. (The value $258.1 = 2n - G^2[\text{full model}] - p[\text{full model}] =
400 - 134.9 - 7$ does not depend on the reduced model.) $A^*$ is listed because it is a little easier to
look at since it takes values similar to $C_p$. Computer programs rarely report $A - q$ or $A^*$. (The glm
procedure in the R language provides a version of the AIC.) $A - q$ is very easy to compute from the
deviance and its degrees of freedom.
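Because both criteria depend only on a model's deviance and its degrees of freedom, they are simple to compute by hand or in a few lines of code. The following sketch uses the identity $A^* = 258.1 + (A - q)$ stated above; the function names are illustrative, and the example values (deviance 142.7 on 198 df for the age-only model, 134.9 on 193 df for the full model) are the ones quoted in this section.

```python
N_CASES = 200                       # number of 0-1 trials
FULL_DEVIANCE, FULL_PARAMS = 134.9, 7

def a_minus_q(deviance, deviance_df):
    """A - q = G^2 - 2(df)."""
    return deviance - 2 * deviance_df

def a_star(deviance, deviance_df):
    """A* = [2n - G^2(full) - p(full)] + (A - q)."""
    constant = 2 * N_CASES - FULL_DEVIANCE - FULL_PARAMS  # 258.1
    return constant + a_minus_q(deviance, deviance_df)

print(round(a_star(142.7, 198), 1))   # age-only model
print(round(a_star(134.9, 193), 1))   # full model (20.6.1)
```

Smaller values indicate a better model under the criterion, so the age-only model is preferred to the full model here.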
Table 20.7: Diagnostics for Chapman data. Cases with high Cook’s distances.
Case yh p̂h Leverage rh r̃h Cook
5 1 0.36 0.13 1.32 1.42 0.043
19 1 0.46 0.15 1.08 1.17 0.036
21 1 0.08 0.02 3.34 3.37 0.028
27 1 0.21 0.03 1.97 1.99 0.016
29 1 0.11 0.01 2.73 2.75 0.016
39 1 0.16 0.03 2.33 2.36 0.022
41 1 0.31 0.15 1.46 1.59 0.065
42 1 0.12 0.03 2.60 2.63 0.027
44 1 0.41 0.09 1.19 1.24 0.021
48 1 0.18 0.06 2.14 2.21 0.045
51 1 0.34 0.06 1.39 1.44 0.019
54 1 0.19 0.03 2.07 2.09 0.017
55 1 0.52 0.08 0.96 1.00 0.012
81 1 0.32 0.06 1.44 1.49 0.021
84 0 0.36 0.20 -0.74 -0.83 0.026
86 1 0.03 0.01 5.95 5.98 0.052
108 0 0.45 0.17 -0.91 -1.00 0.029
111 1 0.56 0.11 0.89 0.95 0.015
113 0 0.37 0.21 -0.76 -0.85 0.027
114 0 0.46 0.14 -0.93 -1.00 0.024
116 0 0.41 0.10 -0.84 -0.89 0.013
123 1 0.36 0.07 1.35 1.40 0.022
124 1 0.12 0.02 2.70 2.72 0.019
126 1 0.13 0.04 2.64 2.70 0.047
Of the models listed in Table 20.8, Model (20.6.2), the simple linear logistic regression on age (Ag),
is the only model that is better than the full model based on the information criterion, i.e., $A^*$ is 4.8
for this model, less than the 7 for Model (20.6.1).
Asymptotically valid tests of submodels against Model (20.6.1) are available. These are per-
formed in the usual way, i.e., the differences in deviance degrees of freedom and deviance G2 s give
the appropriate values for the tests. For example, the test of Model (20.6.2) versus Model (20.6.1)
has G2 = 142.7 − 134.9 = 7.8 with df = 198 − 193 = 5. This and other tests are given below.
Tests against Model (20.6.1)
Model df G2
Ag 5 7.8
W 5 15.2**
H,W 4 11.9*
Ch 5 12.0*
S,D 4 13.0*
Intercept 6 19.7**
All of the test statistics are significant at the 0.05 level, except for that associated with Model
(20.6.2). This indicates that none of the models other than (20.6.2) is an adequate substitute for
the full model (20.6.1). In this table, one asterisk indicates significance at the 0.05 level and two
asterisks denote significance at the 0.01 level.
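The asterisks in the table can be checked by converting each $G^2$ into an upper-tail $\chi^2$ probability. Below is a minimal standard-library sketch; the helper `chi2_sf` is an illustrative implementation that builds the upper tail from the known closed forms for 1 and 2 degrees of freedom using the recurrence $Q(a+1, z) = Q(a, z) + z^a e^{-z}/\Gamma(a+1)$ for the regularized incomplete gamma function. A statistics library would normally be used instead.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(chi-square(df) > x) for integer df >= 1."""
    if x <= 0:
        return 1.0
    z = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)              # df = 2: Q(1, z) = exp(-z)
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))   # df = 1: Q(1/2, z) = erfc(sqrt(z))
    while a < df / 2.0:
        # Step from Q(a, z) to Q(a + 1, z).
        q += math.exp(a * math.log(z) - z - math.lgamma(a + 1.0))
        a += 1.0
    return q

# (model, df, G^2) rows from the table of tests against Model (20.6.1).
tests = [("Ag", 5, 7.8), ("W", 5, 15.2), ("H,W", 4, 11.9),
         ("Ch", 5, 12.0), ("S,D", 4, 13.0), ("Intercept", 6, 19.7)]

for model, df, g2 in tests:
    print(f"{model:10s} df = {df}  G2 = {g2:5.1f}  P = {chi2_sf(g2, df):.4f}")
```

These P values rest on the same asymptotic approximation as the tests themselves, so they are a guide rather than exact probabilities.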
Our next step is to investigate models that include Ag and some other variables. If we can find
one or two variables that account for most of the value G2 = 7.8, we may have an improvement over
Model (20.6.2). If it takes three or more variables to explain the 7.8, Model (20.6.2) will continue to
be the best-looking model. (Note that χ 2 (.95, 3) = 7.81, so a model with three more variables than
Model (20.6.2) and the same G2 fit as Model (20.6.1) would still not demonstrate a significant lack
of fit in Model (20.6.2).)
Fits for all models that involve Ag and either one or two other explanatory variables are given
in Table 20.9. Based on the A∗ values, two models stand out:
which gives deviance G2 = 134.9 on df = 194. This model is a reduced model relative to Model
(20.6.1), so from Table 20.9 a test of it against Model (20.6.1) has
with df = 194 − 193 = 1. The G2 is essentially 0, so the data are consistent with the reduced model.
Of course this reduced model was suggested by the fitted full model, so any formal test would be
biased—but then one does not accept null hypotheses anyway, and the whole point of choosing
this reduced model was that it seemed likely to give a G2 close to that of Model (20.6.1). We
note that the new variable S − D is still not significant in Model (20.6.5); it only has a t value of
0.006834/0.01877 = 0.36.
If we wanted to test something like H0 : β3 = −0.005, the reduced model is
and involves a known term (−0.005)Di in the linear predictor. This known term is called an offset.
To fit a model with an offset, most computer programs require that the offset be specified separately
and that the model be specified without it, i.e., as
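As a sketch of what an offset looks like, assume the hypothesis $H_0: \beta_3 = -0.005$ is imposed on Model (20.6.1); that choice of parent model is an assumption made here for illustration. The reduced model then carries the known term in its linear predictor,

\[
\log[p_i/(1 - p_i)] = \beta_0 + \beta_1 Ag_i + \beta_2 S_i + (-0.005)D_i + \beta_4 Ch_i + \beta_5 H_i + \beta_6 W_i,
\]

while the model handed to software would list only the terms with unknown coefficients,

\[
\beta_0 + \beta_1 Ag_i + \beta_2 S_i + \beta_4 Ch_i + \beta_5 H_i + \beta_6 W_i,
\]

with the vector of values $(-0.005)D_i$ supplied separately as the offset.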
We learned earlier that, relative to Model (20.6.1), either model (20.6.3) or (20.6.4) does an
adequate job of explaining the data. This conclusion was based on looking at A∗ values, but would
also be obtained by doing formal tests of models.
Christensen (1997, Section 4.4) discusses how to perform best subset selection, similar to Sec-
tion 10.2, for logistic regression. His preferred method requires access to a standard best subset
selection program that allows weighted regression. He does not recommend the score test proce-
dure used by SAS in PROC LOGISTIC. Stepwise methods, like backward elimination and forward
selection, are relatively easy to apply.
EXAMPLE 20.7.1. A study on mice examined the relationship between two drugs and muscle
tension. Each mouse had a muscle identified and its tension measured. A randomly selected drug
was administered to the mouse and the change in muscle tension was evaluated. Muscles of two
types were used. The weight of the muscle was also measured. Factors and levels are as follows.
Factor Abbreviation Levels
Change in muscle tension T High, Low
Weight of muscle W High, Low
Muscle type M Type 1, Type 2
Drug D Drug 1, Drug 2
The data in Table 20.10 are counts (rather than proportions) for every combination of the factors.
Probabilities $p_{hijk}$ can be defined for every factor combination with $p_{1ijk} + p_{2ijk} = 1$, so the odds of
a high tension change are $p_{1ijk}/p_{2ijk}$.
Change in muscle tension is a response factor. Weight, muscle type, and drug are all predictor
variables. We model the log odds of having a high change in muscle tension (given the levels of the
explanatory factors), so the observed proportion of, say, high change for Weight = Low, Muscle =
2, Drug = 1 is, from Table 20.10, 4/(4 + 6). The most general ANOVA model (saturated model)
includes all main effects and all interactions between the explanatory factors, i.e.,
As introduced in earlier chapters, we denote this model [WMD] with similar notations for other
models that focus on the highest-order effects.
Models can be fitted by maximum likelihood. Reduced models can be tested. Estimates and
asymptotic standard errors can be obtained. The analysis of model (20.7.1) is similar to that of an
unbalanced three-factor ANOVA model as illustrated in Chapter 16.
Table 20.11 gives a list of ANOVA type logit models, deviance df , deviance G2 , P values for
testing the fitted model against Model (20.7.1), and A − q values. Clearly, the best-fitting logit mod-
els are the models [MD][W] and [WM][MD]. Both involve the muscle-type-by-drug interaction and
a main effect for weight. One of the models includes the muscle-type-by-weight interaction. Note
that P values associated with saturated model goodness-of-fit tests are appropriate here because we
are not dealing with 0-1 data. (The smallest group size is 3 + 3 = 6.)
The estimated odds for a high tension change using [MD][W] are given in Table 20.12. The
estimated odds are 1.22 times greater for high-weight muscles than for low-weight muscles. For
example, in Table 20.12, 0.625/0.512 = 1.22 but also 1.22 = 0.590/0.483 = 1.827/1.495 =
0.592/0.485. This corresponds to the main effect for weight in the logit model. The odds also
involve a muscle-type-by-drug interaction. To establish the nature of this interaction, consider the
four estimated odds for high weights with various muscles and drugs. These are the four values at
the top of Table 20.12, e.g., for muscle type 1, drug 1 this is 0.625. In every muscle type–drug com-
bination other than type 1, drug 2, the estimated odds of having a high tension change are about 0.6.
The estimated probability of having a high tension change is about 0.6/(1 + 0.6) = 0.375. How-
ever, for type 1, drug 2, the estimated odds are 1.827 and the estimated probability of a high tension
change is 1.827/(1 + 1.827) = 0.646. The chance of having a high tension change is much greater
for the combination muscle type 1, drug 2 than for any other muscle type–drug combination. A
similar analysis holds for the low-weight odds $\hat p_{12jk}/(1 - \hat p_{12jk})$ but the actual values of the odds
are smaller by a multiplicative factor of 1.22 because of the main effect for weight.
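The arithmetic in this paragraph amounts to two operations: taking ratios of odds and converting odds to probabilities. A brief sketch follows, using the four (high-weight, low-weight) pairs of estimated odds quoted above from Table 20.12; the pairs are listed in the order the text gives them, without attaching muscle type and drug labels.

```python
def odds_to_probability(odds):
    """Convert odds p / (1 - p) back to the probability p."""
    return odds / (1.0 + odds)

# (high-weight odds, low-weight odds) for the four muscle type-drug combinations.
odds_pairs = [(0.625, 0.512), (0.590, 0.483), (1.827, 1.495), (0.592, 0.485)]

weight_ratios = [high / low for high, low in odds_pairs]
print([round(r, 2) for r in weight_ratios])      # each is about 1.22

print(round(odds_to_probability(0.6), 3))        # typical combination
print(round(odds_to_probability(1.827), 3))      # muscle type 1, drug 2
```

The constant ratio across the four pairs is the main effect for weight showing up on the odds scale.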
The other logit model that fits quite well is [WM][MD]. Tables 20.13 and 20.14 both contain
the estimated odds of high tension change for this model. The difference between Tables 20.13 and
20.14 is that the rows of Table 20.13 have been rearranged in Table 20.14. This sounds like a trivial
change, but examination of the tables shows that Table 20.14 is easier to interpret. The reason for
changing from Table 20.13 to Table 20.14 is the nature of the logit model. The model [WM][MD]
has M in both terms, so it is easiest to interpret the fitted model when fixing the level of M. For
a fixed level of M, the effects of W and D are additive in the log odds, although the size of those
effects changes with the level of M.
Looking at the type 2 muscles in Table 20.14, the high-weight odds are 0.919 times the
low-weight odds. Also, the drug 1 odds are 1.111 times the drug 2 odds. Neither of these is a
very striking difference. For muscle type 2, the odds of a high tension change are about the same
regardless of weight and drug. Contrary to our previous model, they do not seem to depend much on
weight and to the extent that they do depend on weight, the odds go down rather than up for higher
weights.
Looking at the type 1 muscles, we see the dominant features of the previous model reproduced.
The odds of high tension change are 1.622 times greater for high weights than for low weights. The
odds of high tension change are 2.722 times greater for drug 2 than for drug 1.
Both models indicate that for type 1 muscles, high weight increases the odds and drug 2 in-
creases the odds. Both models indicate that for type 2 muscles, drug 2 does not substantially change
the odds. The difference between the models [MD][W] and [WM][MD] is that [MD][W] indicates
that for type 2 muscles, high weight should increase the odds, but [WM][MD] indicates little change
for high weight and, in fact, what change there is indicates a decrease in the odds.
20.8 Ordered categories
In dealing with ANOVA models for measurement data, when one or more factors had quantitative
levels, it was useful to model effects with polynomials. Similar results apply to logit models.
EXAMPLE 20.8.1. Consider data in which there are four factors defining a 2 × 2 × 2 × 6 table.
The factors are
Factor        Abbreviation   Levels
Race (h)      R              White, Nonwhite
Sex (i)       S              Male, Female
Opinion (j)   O              Yes = Supports Legalized Abortion, No = Opposed to Legalized Abortion
Age (k)       A              18–25, 26–35, 36–45, 46–55, 56–65, 66+ years
Male Yes 24 18 16 12 6 4
No 5 7 7 6 8 10
Nonwhite
Female Yes 21 25 20 17 14 13
No 4 6 5 5 5 5
Opinion is the response factor. Age has ordered categories. The data are given in Table 20.15.
The probability of a Yes opinion for Race $h$, Sex $i$, Age $k$ is $p_{hik} \equiv p_{hi1k}$. The corresponding No
probability is $1 - p_{hik} \equiv p_{hi2k}$.
As in the previous section, we could fit a three-factor ANOVA type logit model to these data.
Table 20.16 contains fitting information for standard three-factor models wherein [] indicates the
intercept (grand mean) only model. From the deviances and A − q in Table 20.16 a good-fitting logit
model is
\[
\log[p_{hik}/(1 - p_{hik})] = (RS)_{hi} + A_k. \tag{20.8.1}
\]
Fitting this model gives the estimated odds of supporting relative to opposing legalized abortion that
follow.
Odds of Support versus Opposed: Model (20.8.1)
                    Age
Race      Sex      18–25  26–35  36–45  46–55  56–65  66+
White     Male     2.52   2.14   2.09   1.60   1.38   1.28
          Female   3.18   2.70   2.64   2.01   1.75   1.62
Nonwhite  Male     2.48   2.11   2.06   1.57   1.36   1.26
          Female   5.08   4.31   4.22   3.22   2.79   2.58
The deviance $G^2$ is 9.104 with 15 df. The $G^2$ for fitting [R][S][A] is 11.77 on 16 df. The difference
in $G^2$s is not large, so the reduced logit model $\log[p_{hik}/(1 - p_{hik})] = R_h + S_i + A_k$ may fit adequately,
but we continue to examine Model (20.8.1).
The odds suggest two things: 1) odds decrease as age increases, and 2) the odds for males
are about the same, regardless of race. We fit models that incorporate these suggestions. Of course,
because the data are suggesting the models, formal tests of significance will be even less appropriate
than usual but G2 s still give a reasonable measure of the quality of model fit.
To model odds that are decreasing with age we incorporate a linear trend in ages. In the absence
of specific ages to associate with the age categories we simply use the scores k = 1, 2, . . . , 6. These
The deviance G2 is 10.18 on 19 df , so the linear trend in coded ages fits very well. Recall that
Model (20.8.1) has G2 = 9.104 on 15 df , so a test of Model (20.8.2) versus Model (20.8.1) has
G2 = 10.18 − 9.104 = 1.08 on 19 − 15 = 4 df .
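As a sketch of where the 19 degrees of freedom come from, assume the linear-trend model replaces the age effects $A_k$ of Model (20.8.1) with a single slope on the scores $k$; the slope symbol $\gamma$ below is an assumption and may not match the book's notation. The $2 \times 2 \times 6$ layout provides 24 logits, so

\[
\log[p_{hik}/(1 - p_{hik})] = (RS)_{hi} + \gamma k, \qquad df = 24 - (4 + 1) = 19,
\]

whereas Model (20.8.1), with four $(RS)_{hi}$ terms and five nonredundant age effects, has $df = 24 - (4 + 5) = 15$. The difference of 4 matches the degrees of freedom of the test above.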
To incorporate the idea that males have the same odds of support, recode the indices for races
and sexes. The indices for the $(RS)_{hi}$ terms are $(h, i) = (1, 1), (1, 2), (2, 1), (2, 2)$. We recode with
new indexes $(f, e)$ having the correspondence
(h, i)   (1,1)  (1,2)  (2,1)  (2,2)
(f, e)   (1,1)  (2,1)  (1,2)  (3,1).
The model
\[
\log[p_{fek}/(1 - p_{fek})] = (RS)_{fe} + A_k
\]
gives exactly the same fit as Model (20.8.1). Together, the subscripts f , e, and k still distinguish
all of the cases in the data. The point of this recoding is that the single subscript f distinguishes
between males and the two female groups but does not distinguish between white and nonwhite
males, so now if we fit the model
\[
\log[p_{fek}/(1 - p_{fek})] = (RS)_f + A_k \tag{20.8.3}
\]
we have a model that treats the two male groups the same. To fit this, you generally do not need to
define the index e in your data file, even though it will implicitly exist in the model.
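In a data file, the recoding can be done with a small function of the original indices. The sketch below assumes the coding implied by the factor table, with $h = 1, 2$ for White, Nonwhite and $i = 1, 2$ for Male, Female; the function name is illustrative.

```python
def recode_race_sex(h, i):
    """Map (h, i) = (race, sex) to (f, e) so that f alone merges the two male groups."""
    if i == 1:            # males: common f, with e keeping track of race
        return (1, h)
    return (1 + h, 1)     # females: a separate f for each race

correspondence = {(h, i): recode_race_sex(h, i) for h in (1, 2) for i in (1, 2)}
for hi, fe in sorted(correspondence.items()):
    print(hi, "->", fe)
```

Only the column of $f$ values is needed to fit the model that treats the two male groups alike; $e$ is included to show that the pair $(f, e)$ still identifies all four race-sex groups.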
Of course, Model (20.8.3) is a reduced model relative to Model (20.8.1). Model (20.8.3) has
deviance G2 = 9.110 on 16 df , so the comparison between models has G2 = 9.110 − 9.104 = 0.006
on 16 − 15 = 1 df . We have lost almost nothing by going from Model (20.8.1) to Model (20.8.3).
Finally, we can write a model that incorporates both the trend in ages and the equality for males
This has G2 = 10.19 on 20 df . Thus, relative to Model (20.8.1), we have dropped 5 df from the
model, yet only increased the G2 by 10.19 − 9.10 = 1.09. Rather than fitting Model (20.8.4), we fit
the equivalent model that includes an intercept (grand mean) μ . The estimates and standard errors
for this model, using the side condition (RS)1 = 0, are
20.9 Exercises
EXERCISE 20.9.1. Fit a logistic model to the data of Table 20.17 that relates probability of
conviction to year. Is there evidence of a trend in the conviction rates over time? Is there evidence
for a lack of fit?
EXERCISE 20.9.2. Stigler (1986, p. 208) reports data from the Edinburgh Medical and Surgical
Journal (1817) on the relationship between heights and chest circumferences for Scottish militia
men. Measurements were made in inches. We concern ourselves with two groups of men, those
with 39-inch chests and those with 40-inch chests. The data are given in Table 20.18. Test whether
the distribution of heights is the same for these two groups, cf. Chapter 5.
Is it reasonable to fit a logistic regression to the data of Table 20.18? Why or why not? Explain
what such a model would be doing. Whether reasonable or not, fitting such a model can be done.
Fit a logistic model and discuss the results. Is there evidence for a lack of fit?
EXERCISE 20.9.3. Chapman, Masinda, and Strong (1995) give the data in Table 20.19. These
are the number out of 150 popcorn kernels that fail to pop when microwaved for a given amount of
time. There are three replicates. Fit a logistic regression with time as the predictor.
EXERCISE 20.9.4. Use the results of Subsection 12.5.2 and the not-so-obvious fact that
$\sum_{i=1}^n \hat p_i/n = \sum_{i=1}^n y_i/n \equiv \bar y$ to show that, for the mouse data of Section 20.1, the formula for $R^2$
when $n = 12$, $N_h \equiv N = 10$ is
\[
R^2 = \frac{\left[\sum_{i=1}^n (\hat p_i - \bar y)(y_i - \bar y)\right]^2}{\left[\sum_{i=1}^n (\hat p_i - \bar y)^2\right]\left[\sum_{i=1}^n (y_i - \bar y)^2\right]}.
\]
EXERCISE 20.9.5. For the n = 12 version of the mouse data, reanalyze it using the log of the
dose as a predictor variable. Create a version of the data with n = 120 and reanalyze both the x and
log(x) versions. Compare. Among other things compare the deviances and R2 values.
Chapter 21
In a longitudinal study discussed by Christensen (1997), 2121 people neither exercised regularly
nor developed cardiovascular disease during the study. These subjects were cross-classified by three
factors: Personality type (A,B), Cholesterol level (normal, high), and Diastolic Blood Pressure (nor-
mal, high). The data are given in Table 21.1.
Table 21.1 is a three-way table of counts. The three factors are Personality, Cholesterol level,
and Diastolic Blood Pressure. Each factor happens to be at two levels, but that is of no particular
consequence for our modeling of the data. We can analyze the data by fitting three-way ANOVA
type models to it. However, count data are not normally distributed, so standard ANOVA methods
are inappropriate. In particular, random variables for counts tend to have variances that depend on
their mean values. Standard sampling schemes for count data are multinomial sampling and Poisson
sampling. In this case, we can think of the data as being a sample of 2121 from a multinomial
distribution, cf. Section 1.5.
In general we assume that our data are independent Poisson random variables, say,
\[
y_h \sim \mathrm{Pois}(\mu_h), \quad h = 1, \ldots, n
\]
514 21. LOG-LINEAR MODELS: DESCRIBING COUNT DATA
In this chapter we examine log-linear models. We also relate them to logistic regression models and
show how to use log-linear models to develop prediction models for factors that have more than two
possible outcomes.
\[
\log(\mu_{ij}) = \mu + \alpha_i + \eta_j + \gamma_{ij}, \quad i = 1, \ldots, 3;\ j = 1, \ldots, 4.
\]
An alternative notation is often used for log-linear models that merely changes the names of the
parameters,
\[
\log(\mu_{ij}) = u + u_{1(i)} + u_{2(j)} + u_{12(ij)}. \tag{21.1.1}
\]
This log-linear model imposes no constraints on the table of cell means because it includes a separate
parameter for every cell in the table. Actually, the $u$, $u_{1(i)}$, and $u_{2(j)}$ terms are all redundant because
the $u_{12(ij)}$ terms alone provide a parameter for explaining every expected cell count in the table. This
model will fit the data for any two-factor table perfectly! In other words, it will lead to $\hat\mu_{ij} = y_{ij}$.
Because it has a parameter for every cell, this model is referred to as the saturated model.
The log-linear model that includes only main effects is
\[
\log(\mu_{ij}) = u + u_{1(i)} + u_{2(j)}. \tag{21.1.2}
\]
In terms of Table 21.2, if this model fits the data, it says the data are explained adequately by a model
in which religion and occupation are independent, cf. Christensen (1997). If religion and occupation
are independent, knowing one’s religion gives no new information about a person’s occupation. That
makes sense relative to the model involving only main effects, because then religion affects only the
terms $u_{1(i)}$ and has no effect on the contribution from occupation, which is the additive term $u_{2(j)}$.
On the other hand, if Model (21.1.1) applies to the data, the interaction terms $u_{12(ij)}$ allow the
possibility of different occupation effects for every religious group. (It turns out that the model of
independence does not fit these data well.)
Table 21.3 Religion and occupations: i is Religion, j is Occupation, k collapses Roman Catholic and Protes-
tant, k and m together uniquely define religions.
y i j k m
210 1 1 0 1
277 1 2 0 1
254 1 3 0 1
394 1 4 0 1
102 2 1 0 2
140 2 2 0 2
127 2 3 0 2
279 2 4 0 2
36 3 1 3 0
60 3 2 3 0
30 3 3 3 0
17 3 4 3 0
subscripts k and m that replace i and can be used to fit the reduced table. The full interaction model
(21.1.1) can be written as
\[
\log(\mu_{kmj}) = u + u_{1(km)} + u_{2(j)} + u_{12(kmj)}, \tag{21.1.3}
\]
whereas the main-effects model (21.1.2) can be rewritten as
\[
\log(\mu_{kmj}) = u + u_{1(km)} + u_{2(j)}. \tag{21.1.4}
\]
The reduced table of Tables 5.10 and 5.11 can be fitted using the model
\[
\log(\mu_{kmj}) = u + u_{1(km)} + u_{2(kj)}. \tag{21.1.5}
\]
This model has a separate parameter for each Jewish occupation, so it effectively leaves them alone,
and fits an independence model to the Roman Catholics and Protestants, cf. Christensen (1997,
Exercise 8.4.3). The pair of subscripts $km$ uniquely define the three religious groups, so in Model
(21.1.5) the term $u_{1(km)}$ is really a main effect for religions. The terms $u_{2(kj)}$ define main effects
for occupations when $k = 0$, i.e., for Roman Catholics and Protestants, but the terms $u_{2(kj)}$ define
separate effects for each Jewish occupation when $k = 3$. The key ideas are that $k$ has a unique
value for each religious category except it does not distinguish between the categories that are to
be collapsed, and that $k$ and $m$ uniquely define the religions. Thus $m$ needs to have different values
for each religion that is to be collapsed. In particular, assuming $i$ never takes on the value 0, we can
define the new variables as
\[
k = \begin{cases} 0 & \text{if row } i \text{ is collapsed} \\ i & \text{if row } i \text{ is not collapsed} \end{cases}
\qquad
m = \begin{cases} i & \text{if row } i \text{ is collapsed} \\ 0 & \text{if row } i \text{ is not collapsed.} \end{cases}
\]
Note that Model (21.1.5) is a special case of the full interaction model (21.1.3) but is more general
than the independence model (21.1.4).
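The definition of $k$ and $m$ translates directly into code. The following sketch builds both indices from the row index $i$ and a set of rows to be collapsed, and reproduces the $k$ and $m$ columns of Table 21.3; the function and variable names are illustrative.

```python
def collapse_codes(i, collapsed_rows):
    """Return (k, m) for row i; rows in collapsed_rows share k = 0 and keep m = i."""
    if i in collapsed_rows:
        return (0, i)
    return (i, 0)

collapsed = {1, 2}  # the two religions that are pooled in Table 21.3
codes = {i: collapse_codes(i, collapsed) for i in (1, 2, 3)}
print(codes)
```

Since $i$ is never 0, the pair $(k, m)$ is different for every row, which is what lets $u_{1(km)}$ act as a main effect for religion.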
For fitting Model (21.1.5), G2 = 12.206 on df = 3, which are the deviance and degrees of
freedom for the reduced table. For fitting models (21.1.2) and (21.1.4) the deviance is G2 = 64.342
on df = 6. The deviance and degrees of freedom for the collapsed table are obtained by subtraction,
G2 = 64.342 − 12.206 = 52.136 on df = 6 − 3 = 3. These are roughly similar to the corresponding
Pearson χ 2 statistics discussed in Chapter 5.
This model implies complete independence of the three factors in the table. Note that a less overpa-
rameterized version of Model (21.2.1) is
This model implies that factor 1 is independent of factor 2 given factor 3. Note that factor 3 is in
both interaction terms, which is why the model has 1 and 2 independent given 3. We can also get
two other models with similar interpretations by including only the $u_{12(ij)}$ and $u_{23(jk)}$ interactions
(factors 1 and 3 independent given factor 2) or only the $u_{12(ij)}$ and $u_{13(ik)}$ interactions (factors 2 and
3 independent given factor 1). Note that a less overparameterized version of Model (21.2.3) is
which we can abbreviate as [12][13][23]. This model has no nice interpretation in terms of indepen-
dence.
21.2 MODELS FOR THREE-FACTOR TABLES 517
21.2.1 Testing models
Testing models works much the same as it does for linear models except with Poisson or multi-
nomial data there is no need to estimate a variance. Thus, the tests are similar to looking at only
the numerator sums of squares for an analysis of variance or regression test. (This would be totally
appropriate in ANOVA and regression if we ever actually knew the value of $\sigma^2$.) The deviance of a
model is
\[
G^2 = 2 \sum_{\text{all cells}} y_{ijk} \log\left(y_{ijk}/\hat\mu_{ijk}\right),
\]
where $\hat\mu_{ijk}$ is the (maximum likelihood) estimated expected cell count based on the model we are
testing. The subscripts in $G^2$ are written for a three-dimensional table, but the subscripts are irrelevant.
What is relevant is summing over all cells.
$G^2$ gives a test of whether the model gives an adequate explanation of the data relative to the
saturated model. (Recall that the saturated model always fits the data perfectly because it has a
separate parameter for every cell in the table. Thus the estimated cell counts for a saturated model
are always just the observed cell counts.) The deviance degrees of freedom (df) for $G^2$ are the
degrees of freedom for the (interaction) terms that have been left out of the saturated model. So in
an I × J × K table, the model that drops out the three-factor interaction has (I − 1)(J − 1)(K − 1)
degrees of freedom for $G^2$. If we drop out the [12] interaction as well as the three-factor interaction,
$G^2$ has (I − 1)(J − 1)(K − 1) + (I − 1)(J − 1) degrees of freedom. If the observations y in each cell
are reasonably large, the $G^2$ statistics can be compared to a $\chi^2$ distribution with the appropriate
number of degrees of freedom. A large $G^2$ indicates that the model fits poorly as compared to the
saturated model.
Alternatively, one can use the Pearson statistic,
$$X^2 = \sum_{\text{all cells}} \frac{(y_{ijk} - \hat{\mu}_{ijk})^2}{\hat{\mu}_{ijk}}.$$
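As a minimal sketch of how the deviance and Pearson statistics are computed, the following Python evaluates both for a small table. The 2 × 2 counts and the function names are hypothetical, chosen only for illustration, and the fitted values come from an independence model rather than from any model fitted in the text.

```python
import math

def deviance(observed, fitted):
    """G^2 = 2 * sum of y * log(y / mu_hat) over all cells (cells with y = 0 add 0)."""
    return 2.0 * sum(y * math.log(y / m) for y, m in zip(observed, fitted) if y > 0)

def pearson(observed, fitted):
    """X^2 = sum of (y - mu_hat)^2 / mu_hat over all cells."""
    return sum((y - m) ** 2 / m for y, m in zip(observed, fitted))

# Hypothetical 2 x 2 table, flattened row by row (not data from the book).
y = [30, 20, 10, 40]
row = [y[0] + y[1], y[2] + y[3]]
col = [y[0] + y[2], y[1] + y[3]]
n = sum(y)

# Fitted counts under independence: (row total)(column total) / n.
mu = [row[i] * col[j] / n for i in range(2) for j in range(2)]

g2 = deviance(y, mu)
x2 = pearson(y, mu)
print(f"G^2 = {g2:.3f}, X^2 = {x2:.3f}")
```

Only the flattened list of cells matters, which is the sense in which the subscripts are irrelevant. Feeding the observed counts in as their own fitted values, as a saturated model would, gives a deviance of zero.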
where $\hat{\mu}_{Rijk}$ and $\hat{\mu}_{Fijk}$ are the maximum likelihood estimates from the reduced and full models,
respectively, and showing the last equality is beyond the scope of this book. The degrees of freedom
for the test are the difference in deviance degrees of freedom, df(Red.) − df(Full).
In Chapter 20 we introduced AIC as a tool for model selection. For log-linear models, maximizing
Akaike's information criterion amounts to choosing the model "M" that minimizes
$$A_M = G^2(M) - [q - 2r],$$
where $G^2(M)$ is the deviance for testing the M model against the saturated model, r is the number
of degrees of freedom for the M model (not the degrees of freedom for the model's deviance), and
there are q ≡ n degrees of freedom for the saturated model, i.e., q ≡ n cells in the table.
518 21. LOG-LINEAR MODELS: DESCRIBING COUNT DATA
Given a list of models to be compared along with their $G^2$ statistics and the degrees of freedom
for the tests, a slight modification of $A_M$ is easier to compute,
$$A_M - q = G^2(M) - 2[q - r] = G^2(M) - 2\,df(M).$$
EXAMPLE 21.2.1. For the personality (1), cholesterol (2), blood pressure (3) data of Table 21.1,
testing models against the saturated model gives deviance and AIC values

Model           df    G^2      A − q
[12][13][23]     1    0.613    −1.387
[12][13]         2    2.062    −1.938
[12][23]         2    2.980    −1.020
[13][23]         2    4.563     0.563
[1][23]          3    7.101     1.101
[2][13]          3    6.184     0.184
[3][12]          3    4.602    −1.398
[1][2][3]        4    8.723     0.723

Comparing the $G^2$ values to 95th percentiles of $\chi^2$ distributions with the appropriate number of
degrees of freedom, all of the models seem to explain the data adequately.
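The A − q column can be reproduced from the deviances and their degrees of freedom, since A − q equals the deviance minus twice its degrees of freedom. The short Python sketch below does this for the tabled values; the variable and function names are ours, not the book's.

```python
# (model, deviance df, G^2) as tabled in Example 21.2.1.
models = [
    ("[12][13][23]", 1, 0.613),
    ("[12][13]", 2, 2.062),
    ("[12][23]", 2, 2.980),
    ("[13][23]", 2, 4.563),
    ("[1][23]", 3, 7.101),
    ("[2][13]", 3, 6.184),
    ("[3][12]", 3, 4.602),
    ("[1][2][3]", 4, 8.723),
]

def aic_minus_q(g2, df):
    """A_M - q = G^2(M) - 2 df(M)."""
    return g2 - 2 * df

table = [(name, df, g2, round(aic_minus_q(g2, df), 3)) for name, df, g2 in models]
for name, df, g2, a in table:
    print(f"{name:<14} {df:>2} {g2:>7.3f} {a:>8.3f}")

# The model with the smallest A - q is the one AIC prefers.
best = min(table, key=lambda r: r[3])
print("smallest A - q:", best[0])
```

Because q is the same for every model, ranking by A − q gives the same ordering as ranking by $A_M$ itself.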
To test, for example, the reduced model [1][2][3] against a full model [12][13], the test statistic
is $G^2 = 8.723 - 2.062 = 6.661$ on df = 4 − 2 = 2.
With only three factors it is easy to look at all possible models. Model selection procedures
become more important when dealing with tables having more than three factors, cf. Christensen
(1997) and Section 21.4.
fits the data very well. Below are given the data and the estimated expected cell counts based on the
model.
$$2.064 = \frac{2.333}{1.130} = \frac{350(23)}{26(150)}$$
times greater if one is not ejected from the car. This is known as an odds ratio. Similarly, in a
rollover, the odds of a not-severe injury are
$$2.256 = \frac{60(80)}{19(112)}$$
times greater if one is not ejected from the car. These odds ratios are quite close to one another.
In the no-three-way-interaction model, these odds ratios are forced to be the same. If we make the
same computations using the estimated expected cell counts, we get
$$2.158 = \frac{350.49(23.49)}{25.51(149.51)} = \frac{59.51(79.51)}{19.49(112.49)}.$$
For both collisions and rollovers, the odds of a severe injury are about twice as large if the driver
is ejected from the vehicle than if not. Equivalently, the odds of having a not-severe injury are about
twice as great if the driver is not ejected from the vehicle than if the driver is ejected. It should
be noted that the odds of being severely injured in a rollover are consistently much higher than in
a collision. What we have concluded in our analysis is that the relative effect of the driver being
ejected is the same for both types of accident and that being ejected substantially increases one’s
chances of being severely injured. ✷
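The odds ratios in this example are cross-product ratios of 2 × 2 tables, so they are easy to check numerically. The sketch below uses the observed counts and the fitted collision counts exactly as they appear in the displayed ratios; the function name and the argument order are our own convention.

```python
def odds_ratio(n11, n12, n21, n22):
    """Cross-product ratio n11 * n22 / (n12 * n21) of a 2 x 2 table."""
    return (n11 * n22) / (n12 * n21)

# Counts arranged so the products match the ratios displayed in the text.
collision_obs = odds_ratio(350, 26, 150, 23)               # 350(23) / [26(150)]
rollover_obs = odds_ratio(60, 19, 112, 80)                 # 60(80) / [19(112)]
collision_fit = odds_ratio(350.49, 25.51, 149.51, 23.49)   # fitted counts, collisions

print(f"observed: collision {collision_obs:.3f}, rollover {rollover_obs:.3f}")
print(f"fitted (collision): {collision_fit:.3f}")
```

The two observed ratios differ slightly, while the no-three-factor-interaction model forces a single common value that lies between them.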
All of the models discussed in Section 21.2 have interpretations in terms of odds ratios. An odds
ratio keeps one of the three indexes fixed, say, k, and looks at quantities like
EXAMPLE 21.3.2. Consider data from Everitt (1977) and Christensen (1997) on classroom
behavior. The three factors are Classroom Behavior (Deviant or Nondeviant), Risk of the home
situation: not at risk (N) or at risk (R), and Adversity of the school situation (Low, Medium, or High). The
data and estimated expected cell counts for the model $\log(\mu_{ijk}) = u + u_{1(i)} + u_{2(j)} + u_{3(k)} + u_{23(jk)}$,
in which behavior is independent of risk and adversity, are given below.

y_ijk                                        Adversity (k)
(μ̂_ijk)                            Low              Medium              High
                 Risk (j)        N       R        N        R        N       R
Classroom        Non.           16       7       15       34        5       3
Behavior (i)                 (14.02)  (6.60)  (14.85)  (34.64)   (4.95)  (4.95)
                 Dev.            1       1        3        8        1       3
                              (2.98)  (1.40)   (3.15)   (7.36)   (1.05)  (1.05)

Thus, any odds ratio in which either j or k is held fixed always equals 1. The estimate of the log
odds of nondeviant behavior is
The odds of having a home situation that is not at risk depend on the adversity level. Up to
round-off error, the odds satisfy
$$\frac{\hat{\mu}_{i1k}}{\hat{\mu}_{i2k}} =
\begin{cases}
14.02/6.60 = 2.98/1.40 = 2.125 & k = 1 \\
14.85/34.64 = 3.15/7.36 = 0.428 & k = 2 \\
4.95/4.95 = 1.05/1.05 = 1 & k = 3
\end{cases}.$$
The odds ratios do not depend on i, so any odds ratios that fix i (but change k) will equal each other,
but will not necessarily equal 1; whereas odds ratios that fix k (but change i) will always equal 1.
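The estimated expected counts in this example can be reproduced without iterative fitting. For the model in which behavior is independent of risk and adversity, the maximum likelihood fitted value for a cell is the behavior total times the risk-by-adversity total, divided by the grand total. That closed form is a standard result that we assume here rather than take from the text. A sketch in Python, with our own variable names:

```python
# Observed counts y[i][j][k]: i = behavior (0 = Non., 1 = Dev.),
# j = risk (0 = N, 1 = R), k = adversity (0 = Low, 1 = Medium, 2 = High).
y = [
    [[16, 15, 5], [7, 34, 3]],
    [[1, 3, 1], [1, 8, 3]],
]
I, J, K = 2, 2, 3

n = sum(y[i][j][k] for i in range(I) for j in range(J) for k in range(K))
behavior_total = [sum(y[i][j][k] for j in range(J) for k in range(K)) for i in range(I)]
risk_adv_total = [[sum(y[i][j][k] for i in range(I)) for k in range(K)] for j in range(J)]

# Fitted counts: (behavior total)(risk-by-adversity total) / (grand total).
mu = [[[behavior_total[i] * risk_adv_total[j][k] / n for k in range(K)]
       for j in range(J)] for i in range(I)]

for i, label in enumerate(("Non.", "Dev.")):
    print(label, [[round(mu[i][j][k], 2) for k in range(K)] for j in range(J)])

# Odds of "not at risk" versus "at risk" at each adversity level.
odds_n_vs_r = [mu[0][0][k] / mu[0][1][k] for k in range(K)]
print([round(o, 3) for o in odds_n_vs_r])
```

The odds computed from the deviant row agree with those from the nondeviant row, which is the numerical form of the statement that the odds do not depend on i.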
EXAMPLE 21.4.1. Consider again the data of Example 20.7.1 and Table 20.10 on muscle tension
change. Here we examine models for expected cell counts but do not identify tension as a response.
The log-linear model of all main effects is
$$\log(\mu_{hijk}) = \gamma + \tau_h + \omega_i + \mu_j + \delta_k. \qquad (21.4.1)$$
21.4 HIGHER-DIMENSIONAL TABLES 521
The model of all two-factor interactions is
log(μhi jk ) = τh + ωi + μ j + δk , (21.4.1)
[T][W][M][D] (21.4.1)
[TW][TM][WM][TD][WD][MD] (21.4.2)
[TWM][TWD][TMD][WMD] . (21.4.3)
Statistics for testing models against the saturated model are given below. The only model
considered that fits these data has all three-factor interactions.
Model                        df     G^2     P      A − q
[TWM][TWD][TMD][WMD]          1     0.11   0.74    −1.89
[TW][TM][WM][TD][WD][MD]      5    47.67   0.00    37.67
[T][W][M][D]                 11   127.4    0.00   105.4
To test the reduced model of all two-factor terms, [TW][TM][WM][TD][WD][MD], against the full
model [TWM][TWD][TMD][WMD], compute $G^2 = 47.67 - 0.11 = 47.56$ with df = 5 − 1 = 4.
Clearly, the reduced model does not fit. A reasonable beginning for modelling these data
would be to find which, if any, three-factor interactions can be removed without harming the model
significantly. ✷
EXAMPLE 21.4.2. In later sections we will consider an expanded version of the abortion opinion
data of Example 20.8.1 and Table 20.15 that includes another opinion category that was ignored in
Chapter 20. The four factors now define a 2 × 2 × 3 × 6 table. The factors and levels are
Yes 24 18 16 12 6 4
Male No 5 7 7 6 8 10
Und 2 1 3 4 3 4
Nonwhite
Yes 21 25 20 17 14 13
Female No 4 6 5 5 5 5
Und 1 2 1 1 1 1
EXAMPLE 21.5.1. Men from Framingham, Massachusetts were categorized by their serum
cholesterol and systolic blood pressure. Consider the subsample that did not develop coronary heart
disease during the follow-up period. The data are as follows.
Both factors have ordered levels, although there is no one number associated with each level.
We consider quantitative levels 1, 2, 3, 4 for both factors. Obviously, this is somewhat arbitrary. An
alternative approach that involves nonlinear modeling is to estimate the quantitative levels for each
factor, cf. Christensen (1997) for more information.
Now consider four models that involve the quantitative levels:
Abbreviation    Model
[C][P][C_1]     $\log(\mu_{ij}) = u + u_{C(i)} + u_{P(j)} + C_{1i} \cdot j$
[C][P][P_1]     $\log(\mu_{ij}) = u + u_{C(i)} + u_{P(j)} + P_{1j} \cdot i$
[C][P][γ]       $\log(\mu_{ij}) = u + u_{C(i)} + u_{P(j)} + \gamma \cdot i \cdot j$
[C][P]          $\log(\mu_{ij}) = u + u_{C(i)} + u_{P(j)}$.
These are called the row effects, column effects, uniform association, and independence models,
respectively. The fits for the models relative to the saturated model are
21.5 ORDERED CATEGORIES 523
Model          df    G^2      A − q
[C][P][C_1]     6    7.404    −4.596
[C][P][P_1]     6    5.534    −6.466
[C][P][γ]       8    7.429    −8.571
[C][P]          9   20.38      2.38
Using the side conditions uC(1) = uP(1) = 0, the parameter estimates and large sample standard
errors are
Because these are obtained from the uniform association model, the odds ratios for consecutive
table entries are identical. For example, the odds of blood pressure < 127 relative to blood pressure
127–146 for men with cholesterol < 200 are 1.11 times the similar odds for men with cholesterol
of 200–219; up to round-off error
$$\frac{112.0/131.0}{81.3/105.4} = \frac{112.0(105.4)}{81.3(131.0)} = e^{0.1044} = 1.11$$
where 0.1044 = γ̂ . Similarly, the odds of blood pressure 127–146 relative to blood pressure 147–166
for men with cholesterol < 200 are 1.11 times the odds for men with cholesterol of 200–219:
$$\frac{131.0(38.5)}{105.4(43.1)} = e^{0.1044} = 1.11.$$
Also the odds of blood pressure < 127 relative to blood pressure 127–146 for men with cholesterol
200–219 are 1.11 times the odds for men with cholesterol of 220–259:
$$\frac{81.3(187.4)}{130.1(105.4)} = e^{0.1044} = 1.11.$$
For consecutive categories, the odds of lower blood pressure are 1.11 times greater with lower blood
cholesterol than with higher blood cholesterol.
Of course, we can also compare nonconsecutive categories. For categories that are one step away
from consecutive, the odds of lower blood pressure are $1.23 = e^{2(0.1044)}$ times greater with lower
cholesterol than with higher cholesterol. For example, the odds of having blood pressure <127
compared to having blood pressure of 147–166 with cholesterol <200 are $1.23 = e^{2(0.1044)}$ times
those for cholesterol 200–219. To check this observe that
$$\frac{112.0(38.5)}{81.3(43.1)} = 1.23.$$
Similarly, the odds of having blood pressure <127 compared to having blood pressure of 127–146
with cholesterol <200 are 1.23 times those for cholesterol 220–259. Extending this leads to
observing that the odds of having blood pressure <127 compared to having blood pressure of 167+
with cholesterol <200 are $2.559 = e^{9(0.1044)}$ times those for cholesterol ≥ 260.
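Under the uniform association model with scores 1 to 4 on both factors, every odds ratio is determined by γ and by how many category steps separate the rows and the columns being compared: the log odds ratio is γ times the product of the two step counts. The sketch below evaluates this for the comparisons discussed in the text, using the reported estimate 0.1044; the function name is ours.

```python
import math

GAMMA_HAT = 0.1044  # estimated uniform association parameter reported in the text

def ua_odds_ratio(chol_steps, bp_steps, gamma=GAMMA_HAT):
    """Odds ratio exp(gamma * chol_steps * bp_steps) implied by uniform association."""
    return math.exp(gamma * chol_steps * bp_steps)

for steps in [(1, 1), (1, 2), (2, 1), (3, 3)]:
    print(steps, round(ua_odds_ratio(*steps), 3))

# Check against fitted counts quoted in the text (agreement is up to round-off).
check = 112.0 * 105.4 / (81.3 * 131.0)
print(round(check, 3))
```

The (3, 3) case compares the two extreme cholesterol categories and the two extreme blood pressure categories, which is where the exponent 9 comes from.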
It is of interest to compare the estimated cell counts obtained under uniform association with
the estimated cell counts under independence. The estimated cell counts under independence are
EXAMPLE 21.5.2. For the Abortion Opinion data of Table 21.4, the model [RSO][OA] fits well.
The ages are quantitative levels. We consider whether using the quantitative nature of this factor
leads to a more succinct model. The age categories are 18–25, 26–35, 36–45, 46–55, 56–65, and
66+. For lack of a better idea, the category scores were taken as 1, 2, 3, 4, 5, and 6. Since the first
and last age categories are different from the other four, the use of the scores 1 and 6 are particularly
open to question. Two new models were considered:
Abbreviation           Model
[RSO][OA]              $\log(\mu_{hijk}) = u_{RSO(hij)} + u_{OA(jk)}$
[RSO][A][O_1]          $\log(\mu_{hijk}) = u_{RSO(hij)} + u_{A(k)} + O_{1j}\,k$
[RSO][A][O_1][O_2]     $\log(\mu_{hijk}) = u_{RSO(hij)} + u_{A(k)} + O_{1j}\,k + O_{2j}\,k^2$
Both of these are reduced models relative to [RSO][OA]. To compare models, we need the
following statistics
Model                  df    G^2      A − q
[RSO][OA]              45   24.77    −65.23
[RSO][A][O_1][O_2]     51   26.99    −75.01
[RSO][A][O_1]          53   29.33    −76.67

Comparing [RSO][A][O_1] versus [RSO][OA] gives $G^2 = 29.33 - 24.77 = 4.56$ with degrees of
freedom 53 − 45 = 8.
21.6 Offsets
Most of our examples have involved multinomial data with ANOVA type models. Now we consider
a regression with Poisson data. This example also involves a term in the linear predictor that is
known.
EXAMPLE 21.6.1. Consider the data in Table 21.5. This is data from Bissell (1972) on the number
of faults in pieces of fabric. It is reasonable to model the number of faults in any piece of fabric as
Poisson and to use the length of the fabric as a predictor variable for the number of faults.
In general, we assume the existence of n independent random variables $y_h$ with $y_h \sim \mathrm{Pois}(\mu_h)$.
In this example, a reasonable model might be that the expected number of faults $\mu_h$ is some number
$\lambda$ times the length of the piece of fabric, say $l_h$, i.e.,
$$\mu_h = \lambda l_h. \qquad (21.6.1)$$
Such a model assumes that the faults are being generated at a constant rate, and therefore the
expected number of faults is proportional to the length. We can rewrite Model (21.6.1) as a log-linear
model
$$\log(\mu_h) = \log(\lambda) + \log(l_h)$$
or
$$\log(\mu_h) = \beta_0 + (1)\log(l_h), \qquad (21.6.2)$$
where $\beta_0 \equiv \log(\lambda)$. If we generalize Model (21.6.2) it might look a bit more familiar. Using a simple
linear regression structure with $\log(l_h)$ as a predictor variable, we have the more general model
$$\log(\mu_h) = \beta_0 + \beta_1 \log(l_h).$$
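As a rough illustration of the constant-rate Model (21.6.1), in which the log of the length enters the linear predictor with a known coefficient of 1, the Python sketch below evaluates the Poisson log-likelihood and the closed-form maximum likelihood estimate of λ (total faults divided by total length). The lengths and fault counts are hypothetical and are not Bissell's data from Table 21.5; the function and variable names are ours.

```python
import math

# Hypothetical lengths and fault counts (illustrative only; NOT the data of Table 21.5).
lengths = [100.0, 250.0, 400.0, 550.0, 700.0]
faults = [1, 3, 4, 7, 8]

def log_lik(beta0):
    """Poisson log-likelihood (dropping the log y! terms) for log(mu_h) = beta0 + log(l_h)."""
    total = 0.0
    for l, y in zip(lengths, faults):
        eta = beta0 + math.log(l)      # log(l_h) has known coefficient 1: the offset
        total += y * eta - math.exp(eta)
    return total

# Setting the derivative with respect to lambda to zero gives sum(y) / sum(l).
lam_hat = sum(faults) / sum(lengths)
beta0_hat = math.log(lam_hat)

print(f"lambda_hat = {lam_hat:.5f}, beta0_hat = {beta0_hat:.4f}")
print(f"log-likelihood at the estimate: {log_lik(beta0_hat):.4f}")
```

In a generalized linear model program the same fit would normally be requested by declaring the log length as an offset, so that no coefficient is estimated for it.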
implies
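$$\begin{aligned}
\log\left(\frac{p_i}{1 - p_i}\right) &= \log\left(\frac{\mu_{i1}}{\mu_{i2}}\right) \\
&= \log(\mu_{i1}) - \log(\mu_{i2}) \\
&= \left[u_{1(i)} + u_{2(1)} + \eta_{11}x_{i1} + \eta_{21}x_{i2}\right] - \left[u_{1(i)} + u_{2(2)} + \eta_{12}x_{i1} + \eta_{22}x_{i2}\right] \\
&= u_{2(1)} - u_{2(2)} + [\eta_{11}x_{i1} - \eta_{12}x_{i1}] + [\eta_{21}x_{i2} - \eta_{22}x_{i2}] \\
&\equiv \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2}
\end{aligned}$$
where $\beta_0 \equiv u_{2(1)} - u_{2(2)}$, $\beta_1 \equiv [\eta_{11} - \eta_{12}]$, and $\beta_2 \equiv [\eta_{21} - \eta_{22}]$.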
Similar results apply to ANOVA type models. The logit model from Equation (20.8.1)
in which the terms of the logit model become terms in the log-linear model except that the response
factor O with subscript j is incorporated into the terms, e.g., (RS)hi becomes [RSO]hi j and Ak be-
comes [OA] jk . Also, the log-linear model has an effect for every combination of the explanatory
factors, e.g., it includes [RSA]hik .
The logit model
log(μhi1k /μhi2k ) = (RS)hi + γ k
where we have added the highest-order interaction term not involving O, [RSA], and made the (RS)
and γ terms depend on the opinion level j.
EXAMPLE 21.8.1. We now examine fitting models to the data from Table 21.4 on race, sex,
opinions on abortion, and age. We treat opinions as a three-category response variable. In a log-
linear model, the variables are treated symmetrically. The analysis looks for relationships among
any of the variables. Here we consider opinions as a response variable. This changes the analysis
in that we think of having separate independent samples from every combination of the predictor
variables. Under this sampling scheme, the interaction among all of the predictors, [RSA], must be
included in all models. Table 21.8 presents fits for all the models that include [RSA] and correspond
to ANOVA-type logit models.
Using AIC, the best-fitting model is clearly [RSA][RSO][OA]. The fitted values for
[RSA][RSO][OA] are given in Table 21.9.
This log-linear model can be used directly to fit multiple logit models that address specific
issues related to the multinomial responses. The method of identifying these logit models is
precisely as illustrated in the previous section. For these data we might consider two logit models, one
that examines the odds of the first category (supporting legalized abortion) to the second category
(opposing legalized abortion) and another that examines the odds of the second category (opposing
legalized abortion) to the third (being undecided),
log(μhi1k /μhi2k ) = w1RS(hi) + w1A(k)
log(μhi2k /μhi3k ) = w2RS(hi) + w2A(k) .
Alternatively, we might examine the odds of each of the first two categories relative to the third, i.e.,
the odds of supporting legalized abortion to undecided and the odds of opposing to undecided:
log(μhi1k /μhi3k ) = v1RS(hi) + v1A(k)
log(μhi2k /μhi3k ) = v2RS(hi) + v2A(k) .
21.8 MULTINOMIAL RESPONSES 529
Other possibilities exist. Of those mentioned, the only odds that seem particularly interesting to the
author are the odds of supporting to opposing. In the second pair of models, the category “unde-
cided” is being used as a standard level to which other levels are compared. This seems a particularly
bad choice in the context of these data. The fact that undecided happens to be the last opinion cate-
gory listed in the table is no reason for it to be chosen as the standard of comparison. Either of the
other categories would make a better standard.
Neither of these pairs of models is particularly appealing for these data, so we only illustrate a
few salient points before moving on. Consider the odds of support relative to opposed. The odds can
be obtained from the fitted values in Table 21.9. For example, the odds for young white males are
100.1/39.73 = 2.52. The full table of odds is given in Table 21.10. Except for nonwhite females,
the odds of support are essentially identical to those obtained in Section 20.8 in which undecideds
were excluded. The four values vary from age to age by a constant multiple depending on the ages
involved. The odds of support decrease steadily with age. The model has no inherent structure
among the four race–sex categories; however, the odds for white males and nonwhite males are
surprisingly similar. Nonwhite females are most likely to support legalized abortion, white females
are next, and males are least likely to support legalized abortion. Confidence intervals for log odds
or log odds ratios can be found using methods in Christensen (1997).
Another approach to modeling is to examine the set of three models that consists of the odds
of supporting, the odds of opposing, and the odds of undecided (in each case, the odds are defined
relative to the union of the other categories).
Finally, we could examine two models, one for the odds of supporting to opposing and one
for the odds of undecided to having an opinion. (Note the similarity to Lancaster–Irwin partition-
ing.) Fitting these two models involves fitting log-linear models to two sets of data. Eliminating all
undecideds from the data, we fit [RSA][RSO][OA] to the 2 × 2 × 2 × 6 table containing only the
“support” and “oppose” categories. We essentially did this already in Section 20.8.
We now pool the support and oppose categories to get a 2 × 2 × 2 × 6 table in which the opinions
are “support or oppose” and “undecided.” Again, the model [RSA][RSO][OA] is fitted to the data.
For this model, we report only the estimated odds.
Odds of Being Decided on Abortion
                                        Age
Race       Sex      18–25    26–35    36–45    46–55    56–65    66+
White      Male    116.79    78.52    32.67    24.34    22.26   16.95
           Female   83.43    56.08    23.34    17.38    15.90   12.11
Nonwhite   Male     23.76    15.97     6.65     4.95     4.53    3.45
           Female   68.82    46.26    19.25    14.34    13.12    9.99
The estimated odds vary from age to age by a constant multiple. The odds decrease with age,
so older people are less likely to take a position. White males are most likely to state a position.
Nonwhite males are least likely to state a position. (Recall from Section 20.8 that white and non-
white males take nearly the same positions but now we see that they state positions very differently.)
White and nonwhite females have odds of being decided that are somewhat similar.
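The statement that the estimated odds vary from age to age by a constant multiple can be checked from the tabled odds: the ratio of odds in adjacent age groups should be the same, up to round-off, for all four race–sex groups. A small Python sketch, with our own variable names:

```python
# Estimated odds of being decided, by race-sex group, across the six age groups.
odds = {
    ("White", "Male"): [116.79, 78.52, 32.67, 24.34, 22.26, 16.95],
    ("White", "Female"): [83.43, 56.08, 23.34, 17.38, 15.90, 12.11],
    ("Nonwhite", "Male"): [23.76, 15.97, 6.65, 4.95, 4.53, 3.45],
    ("Nonwhite", "Female"): [68.82, 46.26, 19.25, 14.34, 13.12, 9.99],
}

# Ratio of the odds in each age group to the odds in the next older group.
ratios = {g: [v[k] / v[k + 1] for k in range(len(v) - 1)] for g, v in odds.items()}

for group, r in ratios.items():
    print(group, [round(x, 3) for x in r])

# Largest disagreement among the four groups for each adjacent pair of ages.
spread = [max(r[k] for r in ratios.values()) - min(r[k] for r in ratios.values())
          for k in range(5)]
print([round(s, 4) for s in spread])
```

Every adjacent-age ratio exceeds 1, in line with the odds of being decided falling with age, and the small spreads are what rounding in the tabled odds would produce.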
With support and opposed collapsed, the $G^2$ for [RSA][RSO][OA] turns out to be 5.176 on
15 df. The $G^2$ for the smaller model [RSA][RO][SO][OA] is 12.71 on 16 df. The difference,
12.71 − 5.176 = 7.534 on 1 df, is very large. Although, as seen in Section 20.8 and specifically
Table 20.15, a main-effects-only logit
model fits the support–opposition data quite well, to deal with the undecided category requires a
race–sex interaction.
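To put a number on how large the difference between these two deviances is, it can be referred to a $\chi^2$ distribution with 16 − 15 = 1 degree of freedom. The sketch below does this using only the Python standard library; the tail-probability helper is our own and relies on the fact that a $\chi^2(1)$ variable is the square of a standard normal.

```python
import math

def chi2_sf_1df(x):
    """P(chi-square with 1 df > x); chi2(1) = Z^2, so the tail is erfc(sqrt(x / 2))."""
    return math.erfc(math.sqrt(x / 2.0))

g2_reduced, df_reduced = 12.71, 16   # [RSA][RO][SO][OA]
g2_full, df_full = 5.176, 15         # [RSA][RSO][OA]

diff = g2_reduced - g2_full
df_diff = df_reduced - df_full
p_value = chi2_sf_1df(diff)

print(f"G^2 difference = {diff:.3f} on {df_diff} df, P = {p_value:.4f}")
```

The helper is valid only for one degree of freedom; other degrees of freedom would need a general $\chi^2$ routine from a statistics library.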
Additional modeling similar to that in Section 20.8 can be applied to the odds of having made a
decision on legalized abortion.
EXAMPLE 21.9.1. Aitchison and Dunsmore (1975, p. 212) consider 21 individuals with one of
3 types of Cushing’s syndrome. Cushing’s syndrome involves overproduction of cortisol. The three
types considered are
A−adenoma
B−bilateral hyperplasia
C−carcinoma
The case variables considered are the rates at which two steroid metabolites are excreted in the
urine. (These are measured in milligrams per day.) The two steroids are
TETRA —Tetrahydrocortisone
and
PREG —Pregnanetriol.
The data are listed in Table 21.11. Note the strange PREG value for Case 4.
The data determine the 3 × 21 table
Case( j)
Type(i) 1 2 3 4 5 6 7 8 ··· 16 17 18 19 20 21
A 1 1 1 1 1 1 0 0 ··· 0 0 0 0 0 0
B 0 0 0 0 0 0 1 1 ··· 1 0 0 0 0 0
C 0 0 0 0 0 0 0 0 ··· 0 1 1 1 1 1
to which we fit a log-linear model.
The case variables TETRA and PREG are used to model the interaction in this table. The case
variables are highly skewed, so, following Aitchison and Dunsmore, we analyze the transformed
variables TL ≡ log(TETRA) and PL ≡ log(PREG). The transformed data are plotted in Figure 21.1.
The evaluation of the relationship is based on the relative likelihoods of the three syndrome
types. Thus with i denoting the population for any case j, our interest is in the relative sizes of
$p_{1j}$, $p_{2j}$, and $p_{3j}$. Estimates of these quantities are easily obtained from the $\hat{\mu}_{ij}$s. Simply take the
fitted mean value $\hat{\mu}_{ij}$ and divide by the number of observations from population i,
$$\hat{p}_{ij} = \frac{\hat{\mu}_{ij}}{y_{i\cdot}}. \qquad (21.9.1)$$
For a new patient of unknown syndrome type but whose values of TL and PL place them in
category j, the most likely type of Cushing's syndrome is that with the largest value among $p_{1j}$, $p_{2j}$,
and $p_{3j}$. In practice, new patients are unlikely to fall into one of the 21 previously observed
categories but the modeling procedure is flexible enough to allow allocation of individuals having any
values of TL and PL.
Discrimination
The main effects model is
$$\log(\mu_{ij}) = \alpha_i + \beta_j, \qquad i = 1, 2, 3, \quad j = 1, \ldots, 21.$$
[Figure 21.1: plot of log(Pregnanetriol) against log(Tetrahydrocortisone), with separate plotting
symbols for Adenoma, Bilateral Hyperplasia, and Carcinoma.]
The ratio p1 j /p2 j is not an odds of Type A relative to Type B. Both numbers are probabilities but they
are probabilities from different populations. The correct interpretation of p1 j /p2 j is as a likelihood
ratio, specifically the likelihood of Type A relative to Type B.
The $G^2$ for Model (21.9.2) is 12.30 on 36 degrees of freedom. As in logistic regression, although
$G^2$ is a valid measure of goodness of fit, $G^2$ cannot legitimately be compared to a $\chi^2$ distribution.
The reduced model
$$\log(\mu_{ij}) = \alpha_i + \beta_j + \gamma_{1i}(TL)_j$$
has $G^2 = 37.23$ on 38 degrees of freedom. Neither of the reduced models provides an adequate fit.
(Recall that $\chi^2$ tests of model comparisons like these were also valid for logistic regression.)
Table 21.12 contains estimated probabilities for the three populations. The probabilities are com-
puted using Equation (21.9.1) and Model (21.9.2).
Table 21.13 illustrates two Bayesian analyses. For each case j, it gives the estimated posterior
probability that the case belongs to each of the three syndrome types. The data consist of the
observed TL and PL values in category j. Given that the syndrome type is i, the estimated probability
of observing data in category j is $\hat{p}_{ij}$. Let $\pi(i)$ be the prior probability that the case is of syndrome
type i. Bayes theorem gives
$$\hat{\pi}(i|\text{Data}) = \frac{\hat{p}_{ij}\,\pi(i)}{\sum_{i=1}^{3} \hat{p}_{ij}\,\pi(i)}.$$
Two choices of prior probabilities are used in Table 21.13: probabilities proportional to sample sizes,
i.e., $\pi(i) = y_{i\cdot}/y_{\cdot\cdot}$, and equal probabilities $\pi(i) = 1/3$. Prior probabilities proportional to sample sizes
are rarely appropriate, but they relate in simple ways to standard output, so they are often given
more prominence than they deserve. Both of the sets of posterior probabilities are easily obtained.
The table entries for proportional probabilities are just the $\hat{\mu}_{ij}$ values from fitting the log-linear
model in the usual way. This follows from two facts: first, $\hat{\mu}_{ij} = y_{i\cdot}\,\hat{p}_{ij}$, and second, the model fixes
the column totals so $\hat{\mu}_{\cdot j} = 1 = y_{\cdot j}$. To obtain the equal probabilities values, simply divide the entries
in Table 21.12 by the sum of the three probabilities for each case. Cases that are misclassified by
either procedure are indicated with a double asterisk in Table 21.13.
Allocation
Model (21.9.2) includes a separate term β j for each case so it is not clear how Model (21.9.2) can
be used to allocate future cases. We begin with logit models and work back to an allocation model.
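From (21.9.2), we can model the probability ratio of type A relative to type B
$$\begin{aligned}
\log(p_{1j}/p_{2j}) &= \log(\mu_{1j}/\mu_{2j}) - \log(y_{1\cdot}/y_{2\cdot}) \qquad (21.9.3) \\
&= (\alpha_1 - \alpha_2) + (\gamma_{11} - \gamma_{12})(TL)_j + (\gamma_{21} - \gamma_{22})(PL)_j - \log(y_{1\cdot}/y_{2\cdot}).
\end{aligned}$$
The log-likelihoods of A relative to C are
$$\begin{aligned}
\log(p_{1j}/p_{3j}) &= \log(\mu_{1j}/\mu_{3j}) - \log(y_{1\cdot}/y_{3\cdot}) \qquad (21.9.4) \\
&= (\alpha_1 - \alpha_3) + (\gamma_{11} - \gamma_{13})(TL)_j + (\gamma_{21} - \gamma_{23})(PL)_j - \log(y_{1\cdot}/y_{3\cdot}).
\end{aligned}$$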
Fitting Model (21.9.2) gives the estimated parameters.
Using TETRA = 4.10 and PREG = 1.10, the assumption $\pi(i) = y_{i\cdot}/y_{\cdot\cdot}$, and more numerical
accuracy in the parameter estimates than was reported earlier,
π̂(1|Data) = 0.433
π̂(2|Data) = 0.565
π̂(3|Data) = 0.002 .
Assuming instead equal prior probabilities $\pi(i) = 1/3$,
π̂(1|Data) = 0.560
π̂(2|Data) = 0.438
π̂(3|Data) = 0.002 .
Note that the values of tetrahydrocortisone and pregnanetriol used are identical to those for case 5;
thus the π̂(i|Data)s are identical to those listed in Table 21.13 for case 5.
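The two sets of posterior probabilities are linked through Bayes theorem. With priors proportional to sample sizes, each posterior is proportional to the estimated probability of the data times $y_{i\cdot}$, so dividing by $y_{i\cdot}$ and renormalizing gives the posterior under equal priors. The sketch below applies this to the first set of reported values. The sample sizes 6, 10, and 5 are read from the 3 × 21 table, assuming the elided columns follow the displayed pattern; the variable names are ours.

```python
n_by_type = [6, 10, 5]               # y_i. for types A, B, C (from the 3 x 21 table)
post_prop = [0.433, 0.565, 0.002]    # reported posteriors with pi(i) = y_i. / y..

# With proportional priors the posterior is proportional to p_hat_ij * y_i.,
# so dividing by y_i. recovers something proportional to p_hat_ij.
rel_lik = [p / n for p, n in zip(post_prop, n_by_type)]

# Equal priors: the posterior is p_hat_ij renormalized to sum to one.
total = sum(rel_lik)
post_equal = [r / total for r in rel_lik]

print([round(p, 3) for p in post_equal])
```

Because the inputs are rounded to three decimals, the result agrees with the second set of reported values only approximately, and the tiny Type C probability is affected most in relative terms.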
To use the log-linear model approach illustrated here, one needs to fit a 3 × 21 table. Typically,
a data file of 63 entries is needed. Three rows of the data file are associated with each of the 21
cases. Each data entry has to be identified by case and by type. In addition, the case variables should
be included in the file in such a way that all three rows for a case include the corresponding case
variables, TL and PL. Model (21.9.2) is easily fitted using R or SAS PROC GENMOD.
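The file layout just described can be sketched as follows: each case contributes three rows, one per syndrome type, with a count of 1 on the row for its actual type and the same TL and PL values on all three rows. The three cases below and their TETRA and PREG values are hypothetical, not entries from Table 21.11, and the function and field names are ours.

```python
import math

TYPES = ("A", "B", "C")

def long_format(cases):
    """cases: list of (true_type, tetra, preg); returns one row per (case, type) pair."""
    rows = []
    for j, (true_type, tetra, preg) in enumerate(cases, start=1):
        for t in TYPES:
            rows.append({
                "case": j,
                "type": t,
                "y": 1 if t == true_type else 0,   # cell count in the 3 x (number of cases) table
                "TL": math.log(tetra),             # same case variables on all three rows
                "PL": math.log(preg),
            })
    return rows

# Hypothetical cases (illustrative only; NOT the data of Table 21.11).
cases = [("A", 3.0, 1.2), ("B", 8.0, 0.5), ("C", 10.0, 5.0)]
rows = long_format(cases)

for r in rows:
    print(r["case"], r["type"], r["y"], round(r["TL"], 3), round(r["PL"], 3))
```

With the 21 cases of the example, the same function would produce the 63 entries mentioned in the text.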
It is easy just to fit log-linear models to data such as that in Table 21.11 and get μ̂i j s, or, when
there are only two populations, fit logistic models and get p̂i j s. If you treat these values as estimated
probabilities for being in the various populations, you are doing a Bayesian analysis with prior
probabilities proportional to sample sizes. This is rarely an appropriate methodology.
21.10 Exercises
EXERCISE 21.10.1. Watkins, Bergman, and Horton (1994) presented data on a complicated
designed experiment that generated counts. The dependent variable is the number of ends cut by a tool.
The experiment was a half replication of an experiment with five factors each at two levels, i.e., a
half rep. of a $2^5$ or a $2^{5-1}$. The factors in the design of the experiment were the first five factors listed
in Table 21.14. There are two different chasers, two different coolants, the two speeds were coded
as intermediate (1) and high (2), two different pipes, and two different rake angles. In addition, two
covariates were observed. On each run it was noted whether the spindle was left (1) or right (2). In
the course of the experiment, two new heads were installed. A new head was installed prior to run
number 8 and also prior to the second observation of run 15. The data are given in Table 21.15. It
seems reasonable to treat all of the observations as independent Poisson random variables. Analysis
of variance type models on seven factors can be performed. Analyze the data.
EXERCISE 21.10.2. Bisgaard and Fuller (1995) give the data in Table 21.16 on the numbers of
defectives per grille in a process examined by Chrysler Motors Engineering. The data are from a
fractional factorial design. The factors are: A, Mold Cycle; B, Viscosity; C, Mold Temp; D, Mold
Pressure; E, Weight; F, Priming; G, Thickening Process; H, Glass Type; J, Cutting Pattern. Fit a
main-effects-only model. Try to fit a model with all main effects and all two-factor interactions.
EXERCISE 21.10.3. Reanalyze the data of Table 21.11 using a model based on TETRA and
PREG rather than TL and PL. How much does deleting Case 4 affect the conclusions in either
analysis?
EXERCISE 21.10.4. Find two data sets from earlier chapters that are good candidates for analysis
by log-linear models and reanalyze them. (There are many data sets that consist of counts.)
Time-to-event data is just that: measurements of how long it takes before some event occurs. If that
event involves a machine or a machine component, analyzing such data has traditionally been called
reliability analysis. If that event is the death of a medical patient, the analysis is called survival
analysis. More generally, survival analysis is used to describe any analysis of time-to-event data in
the biological or medical fields and reliability analysis is used for applications in the physical and
engineering sciences.
Traditionally, a major distinction between reliability and survival analysis was that survival anal-
ysis dealt with lost (partially observed) observations and reliability did not. By lost observations we
mean, say, patients who began the study but then were lost to the study. For such patients, the exact
time of survival is unknown; one only knows at what time the patient was last contacted alive. This
form of partial information is known as censoring. These days censoring seems to come up often
in reliability also. The examples presented in this chapter do not involve censoring, but the use of
linear structures for modeling the data depends little on whether the data are censored or not. To
introduce a detailed discussion of censoring would take us too far afield.
Another traditional difference between reliability and survival analysis has been that people
doing reliability have been happy to assume parametric models for the distribution of the data, while
survival analysis has focused strongly on nonparametric models, in which no specific distributional
assumptions are made. Personally, I think that the least important assumption one typically makes in
a data analysis is the distributional assumption. I think that the assumptions of independence, having
an appropriate mean structure (no lack of fit), and the assumption of an appropriate variability model
(e.g., equal variances for normal data), are all more important than the distributional assumption.
Moreover, if residuals are available, it is pretty easy to check the distributional assumption. For what
little it is worth, my personal opinion is that the emphasis on nonparametric methods in survival
analysis is often misplaced. (This is not to be confused with nonparametric regression methods in
which one makes no strong assumption about the functional form of the mean structure.)
In this chapter we examine two parametric approaches that are specific to the analysis of time-
to-event data. (Another parametric approach is just to take logarithms of the times and go on your
merry way.) The first parametric approach is probably the oldest, exponential regression. In stan-
dard regression, we assume that each observation has a normal distribution but that the expected
value of an observation follows some linear model. In exponential regression we assume that each
observation follows an exponential distribution but that the log of the expected value follows a linear
model. The second method is a generalization of the first. The exponential distribution is a special
case of the gamma distribution. We can assume that each observation follows a gamma distribution
but that the log of the expected value follows a linear model. A nonparametric method that involves
the use of linear structures is the Cox proportional hazards model, but that will not be discussed. As
always, this book is less concerned with data analysis, or even the basis for these models, than with
the fact that the same linear structures illustrated in earlier chapters can still be used to analyze such
538 22. EXPONENTIAL AND GAMMA REGRESSION: TIME-TO-EVENT DATA
data. Censoring and Cox models are discussed in a wide variety of places including Christensen et
al. (2011).
$$f(y|\lambda) = \lambda e^{-\lambda y}$$
for y > 0 and λ > 0. It can be shown that $E(y) \equiv \mu = 1/\lambda$ and that $\mathrm{Var}(y) = 1/\lambda^2$. Sometimes the
density is written in terms of the parameter μ.
In exponential regression, y1 , . . . , yn are independent Exp(λi ). Each yi has an associated vector
of predictor variables, xi . We assume a linear structure
\[ \log\left(\frac{1}{\lambda_i}\right) \equiv \log(\mu_i) = x_i'\beta . \]
Taking exponents gives
\[ \frac{1}{\lambda_i} = \mu_i = \exp(x_i'\beta) \]
or
\[ \lambda_i = e^{-x_i'\beta} . \]
As in previous chapters, for any model the deviance statistic is used as the basis for checking model
fits.
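As a rough illustration of the model just described, the sketch below fits an exponential regression with log(μi) = x′i β by Fisher scoring and computes its deviance. The function names and the small data set are hypothetical, and this is only one way to maximize the likelihood, not necessarily what any particular package does.

```python
import numpy as np

def fit_exponential_regression(X, y, n_iter=50, tol=1e-10):
    """Maximize the exponential likelihood with log(mu_i) = x_i' beta."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Start from the least squares fit to log(y).
    beta = np.linalg.lstsq(X, np.log(y), rcond=None)[0]
    for _ in range(n_iter):
        mu = np.exp(X @ beta)
        # Score is X'(y/mu - 1); with the log link the Fisher information is X'X.
        step = np.linalg.solve(X.T @ X, X.T @ (y / mu - 1.0))
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

def exponential_deviance(X, y, beta):
    """Twice the log-likelihood gap between the saturated and fitted models."""
    y = np.asarray(y, dtype=float)
    mu = np.exp(np.asarray(X, dtype=float) @ beta)
    return 2.0 * np.sum((y - mu) / mu - np.log(y / mu))

# Hypothetical survival times for two groups (a = 0 or 1); not the Feigl and Zelen data.
y = np.array([10.0, 20.0, 30.0, 40.0, 80.0, 120.0])
a = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
X = np.column_stack([np.ones_like(a), a])
beta_hat = fit_exponential_regression(X, y)
deviance = exponential_deviance(X, y, beta_hat)
```

With only a group indicator in the model, exp(β̂0) and exp(β̂0 + β̂1) reproduce the two group means, which gives a quick sanity check on the fit.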
EXAMPLE 22.1.1. The structure of the Feigl and Zelen data is exactly similar to an analysis of
covariance. We use similar models except that the yi s have exponential distributions rather than
normal distributions. In other words, we use exactly the same kinds of linear structures to model
these data as were used in Chapter 15. Let i indicate the AG status and let j denote the observations
within each AG status. Also, we use the base 10 log of the white blood cell count (lw) as a predictor
variable. Begin by fitting a model that includes just an overall mean,
log(μi j ) = ν . (22.1.1)
Figure 22.1 Exponential regression estimated survival functions (survival probability against time,
for AG = 1 and AG = 2) for lw = 5. Top, Model (22.1.4); bottom, Model (22.1.5).
on 30 − 29 = 1 df . It fits remarkably well, as it should since we used the data to suggest the reduced
model! ✷
Perhaps the most useful end result from an exponential regression is a plotted cumulative distri-
bution function, F(y), or plotted survival function, S(y) ≡ 1 − F(y). Such plots are made for specific
values of the vector x using the estimate of β . For exponential regression, the survival function for
given x and β is
\[ S(y|x,\beta) = \exp\left(-y e^{-x'\beta}\right). \tag{22.1.6} \]
Figure 22.1 plots the maximum likelihood estimated survival curves for lw = 5 and both AG = 1, 2.
The top panel is from Model (22.1.4) and the bottom is Model (22.1.5). They simply substitute into
Equation (22.1.6) the appropriate x vectors and estimate of β from Table 22.3. The point estimates
seem to give very different pictures of the importance of AG but remember that there is no significant
difference between the models. These pictures ignore the variability in the estimates.
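The survival function in (22.1.6) is simple to evaluate directly. The following sketch does so on a grid of times; the coefficient vector is a made-up placeholder, not the estimates from Table 22.3, and the function name is an assumption for illustration.

```python
import numpy as np

def exp_survival(y, x, beta):
    """S(y | x, beta) = exp(-y * exp(-x'beta)) for y >= 0."""
    mu = np.exp(np.dot(x, beta))          # mean survival time for this x
    return np.exp(-np.asarray(y, dtype=float) / mu)

# Hypothetical coefficients (intercept, slope on lw); NOT the fitted values.
beta = np.array([4.0, -0.5])
x = np.array([1.0, 5.0])                  # intercept and lw = 5
times = np.array([0.0, 50.0, 100.0, 150.0])
curve = exp_survival(times, x, beta)
```

Plotting `curve` against `times` for each x vector of interest would give pictures of the kind shown in Figure 22.1, although, as noted above, such curves ignore the variability in the estimates.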
R’s “glm” procedure gives different results. Because Minitab will not fit Model (22.1.5), to demon-
strate the issue I created a regression version of Model (22.1.5) by defining a 0-1 indicator variable
a ≡ AG − 1 that identifies individuals with AG = 2. I fitted the model
The correspondence between the parameters in models (22.1.7) and (22.1.5) is: β0 = ν1 , β1 = ν2 −
ν1 , β2 = γ1 , and β3 = γ2 − γ1 . Table 22.4 gives the results. The standard errors for β̂0 and β̂2 are larger
in GENMOD and Minitab than in R but the reverse is true for β̂1 and β̂3 . Note that the relatively
small t values for β3 , 0.7548/0.5668 and 0.7548/0.6145, confirm the earlier deviance test that there
is not much need to use Model (22.1.5) rather than Model (22.1.4).
In fact, neither SAS nor R fits exponential regression directly. They both fit gamma regression
(as in the next section), but they let one specify the scale parameter, which allows one to fit
exponential regression. Minitab does exponential regression but not gamma regression.
EXAMPLE 22.2.1. We can also use gamma regression to model the Feigl and Zelen data. The linear
models we consider are exactly the same as those given in the previous section. Moreover, fitting
these models gives exactly the same deviances, degrees of freedom, and parameter estimates as in
exponential regression. What differs from exponential regression are the standard errors of param-
eter estimates and how the deviances are used. For example, Table 22.5 gives parameter estimates
and standard errors for fitting Model (22.1.5) using gamma regression. The parameter estimates are
identical to those from Section 22.1 but the standard errors are different. In exponential regression,
the variance of an observation is a direct function of the mean. Thus in exponential regression,
deviances are used in ways similar to those used for binomial, multinomial, and Poisson data as
illustrated in Chapters 20 and 21. The gamma distribution has two parameters, like a normal dis-
tribution, and deviances in gamma regression are used like sums of squares error in normal theory
models.
In gamma regression, as in normal theory regression, when testing models, we must adjust for
the scale parameter. As in normal theory, the largest model fitted must be assumed to fit the data.
Thus, to test Model (22.1.4) against the larger model (22.1.5), we construct a pseudo F statistic,
\[ F_{obs} = \frac{40.319 - 38.555}{30 - 29} \bigg/ \frac{38.555}{29} = 1.33 \]
and compare the statistic to an F(1, 29) distribution. Unlike normal theory, the F distribution is
merely an approximate distribution that is valid for large samples. Clearly, the test provides no
evidence that we need separate regressions over and above the ACOVA model.
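The pseudo F statistic is just a ratio of scaled deviances, so the arithmetic can be checked in a few lines. The helper name below is hypothetical; the deviances and degrees of freedom are the ones quoted above.

```python
def pseudo_f(dev_reduced, df_reduced, dev_full, df_full):
    """Pseudo F statistic for testing a reduced gamma regression model."""
    # Change in deviance per degree of freedom given up by the reduced model.
    numerator = (dev_reduced - dev_full) / (df_reduced - df_full)
    # The full model's deviance per degree of freedom plays the role of MSE.
    denominator = dev_full / df_full
    return numerator / denominator

# Model (22.1.4) against Model (22.1.5).
f_obs = pseudo_f(40.319, 30, 38.555, 29)
print(round(f_obs, 2))  # 1.33
```

The result would be compared to an F(1, 29) distribution, remembering that the reference distribution is only a large-sample approximation here.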
As always, we can use advanced ideas of linear modeling. Suppose that in Model (22.1.4) we
want to incorporate the hypothesis that the slope of lw is −1. That gives us the model
log(μi j ) = νi + (−1)(lw)i j .
In normal theory, we would just use (−1)(lw)i j to alter the dependent variable. Here, the procedure
is a bit more complex, but standard computer programs accommodate such models by using an
offset. An offset is just a term in a linear model that is not multiplied by an unknown parameter.
Computer commands are illustrated on the website. For now, merely note that the deviance of the
model is 41.407 on 31 df . Testing the model against (22.1.4) gives
\[ F_{obs} = \frac{41.407 - 40.319}{31 - 30} \bigg/ \frac{40.319}{30} = 0.81 \]
and no evidence against H0 : γ = −1. Alternatively, for this simple hypothesis we could compute
the “Wald” statistic, which is [γ̂ − (−1)]/SE(γ̂ ), from the table of coefficients for gamma regression
with Model (22.1.4), cf. Exercise 22.3.1. ✷
Rather than using a model with log(μi ) = x′i β , some people prefer a model that involves as-
suming −1/ μi = x′i β . This is the canonical link function and is the default link in some programs.
However, in such models not all β vectors are permissible, because β must be restricted so that
x′i β < 0 for all i.
In addition to being an approach to modeling time-to-event data, gamma regression is often
used to model situations in which the data have a constant coefficient of variation. The coefficient
of variation is
\[ \frac{\sqrt{\mathrm{Var}(y_i)}}{E(y_i)} = \frac{\sqrt{\alpha/\lambda_i^2}}{\alpha/\lambda_i} = \frac{1}{\sqrt{\alpha}} . \]
In such cases, gamma regression is an alternative to doing a standard linear model analysis on the
logs of the data, cf. Section 7.3.
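A quick numerical check of the constant coefficient of variation: the sketch below assumes the gamma parameterization with mean α/λ and variance α/λ², as in the display above, and uses a hypothetical helper name.

```python
import math

def gamma_cv(alpha, lam):
    """Coefficient of variation of a gamma distribution with shape alpha and rate lam."""
    mean = alpha / lam
    variance = alpha / lam ** 2
    return math.sqrt(variance) / mean

# The rate lam changes the mean but not the coefficient of variation.
cvs = [gamma_cv(4.0, lam) for lam in (0.1, 1.0, 7.0)]
```

Each entry of `cvs` equals 1/√4 = 0.5 up to rounding, and taking α = 1 (the exponential case) gives a coefficient of variation of 1.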
22.3 Exercises
EXERCISE 22.3.1. Fit Model (22.1.4) to the Feigl-Zelen data using gamma regression and compare
the Wald test of H0 : γ = −1 to the deviance (generalized likelihood ratio) test.
EXERCISE 22.3.2. Reanalyze the Feigl-Zelen data by taking a log transform and using the methods
of Chapter 15. How do the results change? Do you have any way to decide which analysis is
superior?
EXERCISE 22.3.3. The time to an event is a measurement of time. Can you think of any reasons
why time measurements should be treated differently from other measurements?
EXERCISE 22.3.4. How would you define fitted values, residuals, and “crude” standardized residuals
(ones that do not account for variability associated with fitting the model) in gamma regression?
EXERCISE 22.3.5. One way to compare the predictive ability of models is to compute R² as
the squared sample correlation between the values (yh , ŷh ). Based on this criterion, will gamma
regression always look better than its special case exponential regression?
Chapter 23
Nonlinear Regression
Most relationships between predictor variables and the mean values of observations are nonlin-
ear. Fortunately, the “linear” in linear models refers to how the coefficients are incorporated into the
model, not to having a linear relationship between the predictor variables and the mean values of ob-
servations. In Chapter 8 we discussed methods for fitting nonlinear relationships using models that
are linear in the parameters. Moreover, Taylor’s theorem from calculus indicates that even simple
linear models and low-order polynomial models can make good approximate models to nonlinear
relationships. Nonetheless, when we have special knowledge about the relationship between mean
values and predictor variables, nonlinear regression provides a way to use that knowledge and thus
can provide much better models. The biggest difficulty with nonlinear regression is that to use it you
need detailed knowledge about the process generating the data, i.e., you need a good idea about the
appropriate nonlinear relationship between the parameters associated with the predictor variables
and the mean values of the observations. Nonlinear regression is a technique with wide applicability
in the biological and physical sciences.
From a statistical point of view, nonlinear regression models are much more difficult to work
with than linear regression models. It is harder to obtain estimates of the parameters. It is harder to
do good statistical inference once those parameter estimates are obtained. Section 1 introduces
nonlinear regression models. In Section 2 we discuss parameter estimation. Section 3 examines methods
for statistical inference. Section 4 considers the choice that is sometimes available between doing
nonlinear regression and doing linear regression on transformed data. For a much more extensive
treatment of nonlinear regression, see Seber and Wild (1989).
yi = x′i β + εi (23.1.2)
where x′i = (1, xi1 , . . . , xi p−1) and β = (β0 , . . . , β p−1 )′. These models are linear in the sense that
E(yi ) = x′i β where the unknown parameters, the β j s, are multiplied by known constants, the xi j s,
and added together. In this chapter we consider an important generalization of this model, nonlinear
regression. A nonlinear regression model is simply a model for E(yi ) that does not combine the
parameters of the model in a linear fashion.
\[ f_1(x;\beta_0,\beta_1,\beta_2) = \beta_0 + \beta_1 \sin(\beta_2 x) \]
\[ f_2(x;\beta_0,\beta_1,\beta_2) = \beta_0 + \beta_1 e^{\beta_2 x} \]
\[ f_3(x;\beta_0,\beta_1,\beta_2) = \beta_0 \big/ [1 + \beta_1 e^{\beta_2 x}] \]
\[ f_4(x;\beta_0,\beta_1,\beta_2,\beta_3) = \beta_0 + \beta_1 [e^{\beta_2 x} - e^{\beta_3 x}] . \]
Each of these can be made into a nonlinear regression model. Using f4 , we can write a model for
data pairs (yi , xi ), i = 1, . . . , n:
\[ y_i = \beta_0 + \beta_1 [e^{\beta_2 x_i} - e^{\beta_3 x_i}] + \varepsilon_i \equiv f_4(x_i;\beta_0,\beta_1,\beta_2,\beta_3) + \varepsilon_i . \]
Similarly, for k = 1, 2, 3 we can write models
\[ y_i = f_k(x_i;\beta_0,\beta_1,\beta_2) + \varepsilon_i . \]
As usual, we assume that the εi s are independent N(0, σ 2 ) random variables. As alluded to earlier,
the problem is to find an appropriate function f (·) for the data at hand. ✷
In general, for s predictor variables and p regression parameters we can write a nonlinear
regression model that generalizes Model (23.1.1) as
\[ y_i = f(x_i;\beta) + \varepsilon_i \tag{23.1.3} \]
where xi = (xi1 , . . . , xis )′ and β = (β0 , β1 , . . . , β p−1 )′ are vectors defined similarly to Model (23.1.2).
Note that
E(yi ) = f (xi ; β ).
EXAMPLE 23.1.2. Pritchard, Downie, and Bacon (1977) reported data from Jaswal et al. (1969)
on the initial rate r of benzene oxidation over a vanadium pentoxide catalyst. The predictor variables
involve three levels of the temperature, T , for the reactions, different oxygen and benzene concen-
trations, x1 and x2 , and the observed number of moles of oxygen consumed per mole of benzene, x4 .
Based on chemical theory, a steady state adsorption model was proposed. One algebraically simple
form of this model is
\[ y_i = \exp[\beta_0 + \beta_1 x_{i3}] \frac{1}{x_{i2}} + \exp[\beta_2 + \beta_3 x_{i3}] \frac{x_{i4}}{x_{i1}} + \varepsilon_i , \tag{23.1.4} \]
where y = 100/r and the temperature is involved through x3 = 1/T − 1/648. The data are given in
Table 23.1.
The function giving the mean structure for Model (23.1.4) is
\[ f(x;\beta) \equiv f(x_1,x_2,x_3,x_4;\beta_0,\beta_1,\beta_2,\beta_3) = \exp[\beta_0 + \beta_1 x_3] \frac{1}{x_2} + \exp[\beta_2 + \beta_3 x_3] \frac{x_4}{x_1} . \tag{23.1.5} \]
✷
23.2 Estimation
We used least squares estimation to obtain the β̂ j s in linear regression; we will continue to use least
squares estimation in nonlinear regression. For the linear regression Model (23.1.2), least squares
estimates minimize
\[ SSE(\beta) \equiv \sum_{i=1}^{n} [y_i - E(y_i)]^2 = \sum_{i=1}^{n} [y_i - x_i'\beta]^2 . \]
For the nonlinear regression Model (23.1.3), least squares estimates minimize
\[ SSE(\beta) \equiv \sum_{i=1}^{n} [y_i - E(y_i)]^2 = \sum_{i=1}^{n} [y_i - f(x_i;\beta)]^2 . \tag{23.2.1} \]
As shown below, in nonlinear regression with independent N(0, σ 2 ) errors, the least squares esti-
mates are also maximum likelihood estimates. Not surprisingly, finding the minimum of a function
like (23.2.1) involves extensive use of calculus. We present in detail the Gauss–Newton algorithm
for finding the least squares estimates and briefly mention an alternative method for finding the
estimates.
\[
\begin{aligned}
Y &= F(X;\beta^r) + Z_r(\beta - \beta^r) + e \\
  &= F(X;\beta^r) + Z_r\beta - Z_r\beta^r + e .
\end{aligned}
\]
From linear regression theory, the value β r+1 minimizes the function
Actually, we wish to minimize the function defined in (23.2.1). In matrix form, (23.2.1) is
Thus, for a linear regression problem, the Gauss–Newton algorithm arrives at β̂ in only one iteration.
✷
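The iteration can be sketched in a few lines: at each step, regress the current residuals Y − F(X; β r) on the derivative matrix Zr and add the resulting coefficients to β r. The function below is a minimal, hypothetical implementation with no safeguards against divergence; the demonstration uses made-up data and the linear special case, where Zr = X at every step and a single iteration reaches the least squares estimate from any starting value.

```python
import numpy as np

def gauss_newton(f, jac, x, y, beta0, max_iter=100, tol=1e-10):
    """Gauss-Newton iterations for least squares fitting of y = f(x; beta) + e."""
    beta = np.asarray(beta0, dtype=float)
    for _ in range(max_iter):
        resid = y - f(x, beta)            # Y - F(X; beta_r)
        Z = jac(x, beta)                  # n x p matrix of partial derivatives
        # Regressing the residuals on Z gives beta_{r+1} - beta_r.
        step = np.linalg.lstsq(Z, resid, rcond=None)[0]
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Linear special case with hypothetical data: f(x; beta) = X beta.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 2.9, 5.1, 7.0])
one_step = gauss_newton(lambda X, b: X @ b, lambda X, b: X, X, y,
                        beta0=[10.0, -3.0], max_iter=1)
ols = np.linalg.lstsq(X, y, rcond=None)[0]
```

Here `one_step` and `ols` agree up to rounding, even though the starting value is far from the estimate.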
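EXAMPLE 23.2.2. To perform the analysis on the benzene oxidation data, we need the partial
derivatives of the function (23.1.5):
\[ \frac{\partial f(x;\beta)}{\partial \beta_0} = \frac{1}{x_2} \exp[\beta_0 + \beta_1 x_3] \]
\[ \frac{\partial f(x;\beta)}{\partial \beta_1} = \frac{x_3}{x_2} \exp[\beta_0 + \beta_1 x_3] \]
\[ \frac{\partial f(x;\beta)}{\partial \beta_2} = \frac{x_4}{x_1} \exp[\beta_2 + \beta_3 x_3] \]
\[ \frac{\partial f(x;\beta)}{\partial \beta_3} = \frac{x_3 x_4}{x_1} \exp[\beta_2 + \beta_3 x_3] . \]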
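Hand-derived partial derivatives are easy to get wrong, so it can be worth checking them against finite differences before running the algorithm. The sketch below codes the mean function (23.1.5) and its four partial derivatives; the function names and the evaluation point are hypothetical and are not taken from Table 23.1.

```python
import numpy as np

def mean_fn(x, beta):
    """Mean function (23.1.5) for predictors x = (x1, x2, x3, x4)."""
    x1, x2, x3, x4 = x
    b0, b1, b2, b3 = beta
    return np.exp(b0 + b1 * x3) / x2 + np.exp(b2 + b3 * x3) * x4 / x1

def grad_fn(x, beta):
    """Partial derivatives of the mean function with respect to beta0, ..., beta3."""
    x1, x2, x3, x4 = x
    b0, b1, b2, b3 = beta
    e1 = np.exp(b0 + b1 * x3) / x2
    e2 = np.exp(b2 + b3 * x3) * x4 / x1
    return np.array([e1, x3 * e1, e2, x3 * e2])

# Hypothetical predictor values and parameters, chosen only for the check.
x = (100.0, 20.0, 0.01, 5.5)
beta = np.array([1.0, 2.0, -0.2, 3.0])
h = 1e-6
numeric = np.array([
    (mean_fn(x, beta + h * e) - mean_fn(x, beta - h * e)) / (2 * h)
    for e in np.eye(4)
])
analytic = grad_fn(x, beta)
```

Evaluating `grad_fn` at each observation and stacking the results as rows gives the matrix Zr used in the algorithm.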
With β 1 = (0.843092, 11427.598, 0.039828, 2018.7689)′, we illustrate one step of the algo-
rithm. The dependent variable in Model (23.2.4) is
\[
Y - F(X;\beta^1) + Z_1\beta^1 =
\begin{bmatrix} 0.458716 \\ 0.529101 \\ 0.520833 \\ 0.574713 \\ 0.657895 \\ \vdots \\ 0.140252 \\ 0.135870 \end{bmatrix}
-
\begin{bmatrix} 0.297187 \\ 0.295806 \\ 0.330450 \\ 0.367968 \\ 0.389871 \\ \vdots \\ 0.142284 \\ 0.142300 \end{bmatrix}
+
\begin{bmatrix} 0.391121 \\ 0.375497 \\ 0.382850 \\ 0.387393 \\ 0.391003 \\ \vdots \\ 0.005125 \\ 0.005123 \end{bmatrix}
=
\begin{bmatrix} 0.552649 \\ 0.608792 \\ 0.573233 \\ 0.594137 \\ 0.659027 \\ \vdots \\ 0.003093 \\ -0.001307 \end{bmatrix} .
\]
Fitting Model (23.2.4) gives the estimate β 2 = (1.42986, 12717, −0.15060, 9087.3)′. Eventually,
the sequence converges to β̂ ′ = (1.3130, 11908, −0.23463, 10559.5). ✷
In practice, methods related to Marquardt (1963) are often used to find the least squares esti-
mates. These involve use of a statistical procedure known as ridge regression, cf. Seber and Wild
(1989, p. 624). Marquardt’s method involves modifying Model (23.2.4) to estimate β − β r by sub-
tracting Zr β r from both sides of the equality. Now, rather than using the least squares estimate
β r+1 − β r = (Zr′ Zr )−1 Zr′ [Y − F(X ; β r )], the simplest form of ridge regression (cf. Christensen, 2011)
uses the estimate
\[ \beta^{r+1} - \beta^r = (Z_r'Z_r + kI_p)^{-1} Z_r'[Y - F(X;\beta^r)], \]
where I p is a p × p identity matrix and k is a number that needs to be determined. More complicated
forms of ridge regression involve replacing I p with a diagonal matrix.
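The simplest ridge form of the update differs from the Gauss–Newton step only by the added kI p term. The sketch below computes one such step; the function name, the matrix Z, and the residuals are hypothetical, and no rule for choosing k is implied.

```python
import numpy as np

def marquardt_step(Z, resid, k):
    """Ridge-type update (Z'Z + k I)^{-1} Z' resid for beta_{r+1} - beta_r."""
    p = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + k * np.eye(p), Z.T @ resid)

# Hypothetical derivative matrix and residuals Y - F(X; beta_r).
Z = np.array([[1.0, 0.1], [1.0, 0.2], [1.0, 0.4], [1.0, 0.8]])
resid = np.array([0.3, -0.1, 0.2, 0.4])

gn_step = marquardt_step(Z, resid, k=0.0)        # k = 0 is the Gauss-Newton step
damped_step = marquardt_step(Z, resid, k=10.0)   # larger k gives a shorter step
```

With k = 0 the step coincides with the least squares step, while increasing k shrinks the step toward zero, which is what makes the method more cautious when the linear approximation is poor.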
When the sequence of values β r stops changing (converges), β r is the least squares estimate.
We will use a geometric argument to justify this statement. The argument applies to both the Gauss–
Newton algorithm and the Marquardt method. By definition, SSE(β ) is the squared length of the
vector Y − F(X ; β ), i.e., it is the square of the distance between Y and F(X; β ). Geometrically, β̂
is the value of β that makes Y − F(X ; β ) as short a vector as possible. Y can be viewed as either a
From the Gauss–Newton algorithm, at convergence we have β r+1 = β r and by (23.2.5) β r+1 =
(Zr′ Zr )−1 Zr′ [Y − F(X; β r )] + β r , so we must have
\[ 0 = (Z_r'Z_r)^{-1} Z_r'[Y - F(X;\beta^r)] . \tag{23.2.6} \]
This occurs precisely when 0 = Zr′ [Y − F(X ; β r )] because you can go back and forth between the two
equations by multiplying with (Zr′ Zr ) and (Zr′ Zr )−1 , respectively. Thus β r is the value that makes
Y − F(X ; β ) as short a vector as possible and β r = β̂ . Essentially the same argument applies to the
Marquardt method except Equation (23.2.6) is replaced by 0 = (Zr′ Zr + kI p)−1 Zr′ [Y − F(X ; β r )].
The problem with this geometric argument—and indeed with the algorithms themselves—is that
sometimes there is more than one β for which Y − F(X ; β ) is perpendicular to the surface F(X ; β ).
If you start with an unfortunate choice of β 0 , the sequence might converge to a value that does not
minimize SSE(β ) over all β but only in a region around β 0 . In fact, sometimes the sequence β r
might not even converge.
23.2.2 Maximum likelihood estimation
Nonlinear regression is a problem in which least squares estimates are maximum likelihood esti-
mates. We now show this. The density of a random variable y with distribution N(μ , σ 2 ) is
\[ \phi(y) = \frac{1}{\sqrt{2\pi}\sqrt{\sigma^2}} \exp[-(y - \mu)^2/2\sigma^2] . \]
The joint density of independent random variables is obtained by multiplying the densities of the
individual random variables. From Model (23.1.3), the yi s are independent N[ f (xi ; β ), σ 2 ] random
variables, so
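\[
\begin{aligned}
\phi(Y) \equiv \phi(y_1,\ldots,y_n) &= \prod_{i=1}^{n} \phi(y_i) \\
&= \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\sqrt{\sigma^2}} \exp[-\{y_i - f(x_i;\beta)\}^2/2\sigma^2] \\
&= \left(\frac{1}{\sqrt{2\pi}}\right)^{n} \left(\sqrt{\sigma^2}\right)^{-n} \exp\left[-\frac{1}{2\sigma^2} \sum_{i=1}^{n} \{y_i - f(x_i;\beta)\}^2\right] \\
&= \left(\frac{1}{\sqrt{2\pi}}\right)^{n} \left(\sqrt{\sigma^2}\right)^{-n} \exp\left[-\frac{1}{2\sigma^2} SSE(\beta)\right] .
\end{aligned}
\]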
The density is a function of Y for fixed values of β and σ 2 . The likelihood is exactly the same
function except that the likelihood is a function of β and σ 2 for fixed values of the observations yi .
Thus, the likelihood function is
\[ L(\beta,\sigma^2) = \left(\frac{1}{\sqrt{2\pi}}\right)^{n} \left(\sqrt{\sigma^2}\right)^{-n} \exp\left[-\frac{1}{2\sigma^2} SSE(\beta)\right] . \]
The maximum likelihood estimates of β and σ 2 are those values that maximize L(β , σ 2 ). For any
given value of σ 2 , the likelihood is a simple function of SSE(β ). In fact, the likelihood is maximized
by whatever value of β that minimizes SSE(β ), i.e., the least squares estimate β̂ . Moreover, the
function SSE(β ) does not involve σ 2 , so β̂ does not involve σ 2 and the maximum of L(β , σ 2 ) occurs
wherever the maximum of L(β̂ , σ 2 ) occurs. This is now a function of σ 2 alone. Differentiating with
respect to σ 2 , it is not difficult to see that the maximum likelihood estimate of σ 2 is
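\[ \hat{\sigma}^2 = \frac{SSE(\hat{\beta})}{n} = \frac{1}{n} \sum_{i=1}^{n} \left[y_i - f(x_i;\hat{\beta})\right]^2 . \]
\[ MSE = \frac{SSE(\hat{\beta})}{n - p} = \frac{1}{n - p} \sum_{i=1}^{n} \left[y_i - f(x_i;\hat{\beta})\right]^2 . \]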
Incidentally, these exact same arguments apply to linear regression, showing that least squares
estimates are also maximum likelihood estimates in linear regression.
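The two variance estimates differ only in their divisors. As a small check, the sketch below uses the benzene values that are quoted later in this chapter (SSE = 0.0810169059 with n = 54 cases and p = 4 parameters); the function name is hypothetical.

```python
def variance_estimates(sse, n, p):
    """Return the maximum likelihood estimate SSE/n and the MSE, SSE/(n - p)."""
    return sse / n, sse / (n - p)

sigma2_mle, mse = variance_estimates(0.0810169059, 54, 4)
print(f"{mse:.10f}")  # 0.0016203381
```

The maximum likelihood estimate is always the smaller of the two, and the gap shrinks as n grows relative to p.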
To make this into an interval for y0 , simply add f (x0 ; β̂ ) − z′∗0 β̂ to each term, giving the interval
Similarly, the (1 − α )100% confidence interval from Model (23.3.1) for a point on the surface
gives a confidence interval for z′∗0 β rather than for f (x0 ; β ). Defining
\[ W_s \equiv t\left(1 - \frac{\alpha}{2},\, n - p\right) \sqrt{MSE\; z_{*0}'(Z_*'Z_*)^{-1} z_{*0}} , \]
the confidence interval for z′∗0 β is
We can also test full models against reduced models. Again, write the full model as
which, when fitted, gives SSE(β̂ ), and write the reduced model as
yi = f0 (xi ; γ ) + εi (23.3.3)
with γ ′ = (γ0 , . . . , γq−1 ). When fitted, Model (23.3.3) gives SSE(γ̂ ). The simplest way of ensuring
that Model (23.3.3) is a reduced model relative to Model (23.3.2) is by specifying constraints on the
parameters.
[Y − F0 (X ; γ̂ ) + Z∗0γ̂ ] = Z∗0 γ + e.
Alas, this model will typically not be a reduced model relative to Model (23.3.1). In fact, the depen-
dent variables (left-hand sides of the equations) are not even the same. Nonetheless, because Model
(23.3.3) is a reduced version of Model (23.3.2), we can test the models in the usual way by using
sums of squares error. Reject the reduced model with an α -level test if
Fitting the model gives estimated parameters γ̂ ′ = (9.267172, 12155.54478) with SSE =
0.4919048545 on dfE = 54 − 2 = 52 for MSE = 0.0094597087. From Inequality (23.3.4) and
Example 23.3.1, the test statistic is
\[ \frac{[0.4919048545 - 0.0810169059]/[4 - 2]}{0.0016203381} = 126.79 . \]
With an F statistic this large, the test will be rejected for any reasonable α level. ✷
[Figure: normal Q–Q plot of the standardized residuals against theoretical quantiles.]
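The test statistic above can be reproduced from the two error sums of squares and the full-model MSE. The helper below is a hypothetical convenience function; the numbers are those quoted in the example.

```python
def f_reduced_vs_full(sse_reduced, sse_full, p_full, p_reduced, mse_full):
    """F statistic comparing a reduced nonlinear model to the full model."""
    return ((sse_reduced - sse_full) / (p_full - p_reduced)) / mse_full

f_obs = f_reduced_vs_full(0.4919048545, 0.0810169059, 4, 2, 0.0016203381)
print(round(f_obs, 2))  # 126.79
```

The statistic would be referred to an F distribution with 4 − 2 = 2 and 54 − 4 = 50 degrees of freedom, which in nonlinear regression is only an approximation.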
[Figure: standardized residuals plotted against dhat and against yhat (residual–yhat plot).]
Table 23.2 contains standard diagnostic quantities from fitting Model (23.3.1). We use these
quantities in the usual way but possible problems are discussed at the end of the section. Given
that there are 54 cases, none of the standardized residuals r or standardized deleted residuals t look
exceptionally large.
Figure 23.5 contains index plots of the leverages and Cook’s distances. They simply plot the
value against the observation number for each case. Neither plot looks too bad to me (at least at
12:15 a.m. while I am doing this). However, there are some leverages that exceed the 3p/n =
3(4)/54 = 0.222 rule.
For more on how to analyze these data, see Pritchard et al. (1977) and Carroll and Ruppert
(1984). ✷
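For readers who want to compute such diagnostics themselves, the sketch below obtains leverages as the diagonal of Z(Z′Z)⁻¹Z′ and flags cases above a multiple of p/n. It assumes the leverages come from the linearized model with Z evaluated at β̂; the function names and the small design matrix are hypothetical.

```python
import numpy as np

def leverages(Z):
    """Diagonal of Z (Z'Z)^{-1} Z' without forming the full n x n matrix."""
    G = np.linalg.solve(Z.T @ Z, Z.T)          # p x n
    return np.einsum("ij,ji->i", Z, G)

def flag_high_leverage(h, p, multiple=3.0):
    """Indices of cases whose leverage exceeds multiple * p / n."""
    return np.flatnonzero(h > multiple * p / len(h))

# Hypothetical design: an intercept and one predictor with an extreme last case.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0])
Z = np.column_stack([np.ones_like(x), x])
h = leverages(Z)
flagged = flag_high_leverage(h, p=Z.shape[1])

# The cutoff used in the text: 3p/n with p = 4 and n = 54.
cutoff = 3 * 4 / 54
```

In this toy design only the last case is flagged, and the leverages sum to p, as they must.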
Unlike linear regression, where the procedure is dominated by the predictor variables, nonlinear
regression is very parameter oriented. This is perhaps excusable because in nonlinear regression
there is usually some specific theory suggesting the regression model and that theory may give
meaning to the parameters. Nonetheless, one can create big statistical problems or remove
statistical problems simply by the choice of the parameterization. For example, Model (23.1.4) can be
rewritten as
\[ y_i = \gamma_0 \exp[\gamma_1 x_{i3}] \frac{1}{x_{i2}} + \gamma_2 \exp[\gamma_3 x_{i3}] \frac{x_{i4}}{x_{i1}} + \varepsilon_i . \tag{23.3.5} \]
If, say, γ0 = 0, the entire term γ0 exp[γ1 xi3 ]/xi2 vanishes. This term is the only place in which the
parameter γ1 appears. So if γ0 = 0, it will be impossible to learn about γ1 . More to the point, if γ0
is near zero, it will be very difficult to learn about γ1 . (Of course, one could argue that from the
viewpoint of prediction, one may not care much what γ1 is if γ0 is very near zero and x3 is of
moderate size.) In any case, unlike linear regression, the value of one parameter can affect what we learn
about other parameters. (In linear regression, the values of some predictor variables affect what we
can learn about the parameters for other predictor variables, but it is not the parameters themselves
that create the problem. In fact, in nonlinear regression, as the benzene example indicates, the
predictor variables are not necessarily associated with any particular parameter.) In Model (23.1.4) we
have ameliorated the problem of γ0 near 0 by using the parameter β0 . When γ0 approaches zero, β0
approaches negative infinity, so this problem with the coefficient of xi3 , i.e. γ1 or β1 , will not arise
for finite β0 . However, unlike (23.3.5), Model (23.1.4) cannot deal with the possibility of γ0 < 0.
Similar problems can occur with γ2 .
[Figure: standardized residuals plotted against x1, x2, T, and x4.]
All of the methods in this section depend crucially on the quality of the approximation in (23.2.3)
when β r = β̂ . If this approximation is poor, these methods can be very misleading. In particular,
Cook and Tsai (1985, 1990) discuss problems with residual analysis when the approximation is
poor and discuss diagnostics for the quality of the normal approximation. St. Laurent and Cook
(1992) discuss concepts of leverage for nonlinear regression. For large samples, the true value of
β should be close to β̂ and the approximation should be good. (This conclusion also depends on
having the standard errors for functions of β̂ small in large samples.) But it is very difficult to tell
what constitutes a ‘large sample.’ As a practical matter, the quality of the approximation depends a
great deal on the amount of curvature found in f (x; β) near β = β̂ . This curvature is conveniently
measured by the second partial derivatives ∂²f (x; β )/∂ β j ∂ βk evaluated at β̂ . A good analysis of
nonlinear regression data should include an examination of curvature, but such an examination is
beyond the scope of this book, cf. Seber and Wild (1989, Chapter 4).
[Figure 23.5: index plots of the leverages and of Cook’s distances.]
23.4 Linearizable models
EXAMPLE 23.4.1. In Section 7.3 we analyzed the Hooker data using a linear model log(yi ) =
β0 + β1 xi + εi . Exponentiating both sides gives yi = exp[β0 + β1 xi + εi ], which we can rewrite as
yi = exp[β0 + β1 xi ]ξi , where ξi is a multiplicative error term with ξi = exp(εi ). Alternatively, we
could fit a nonlinear regression model
yi = exp[β0 + β1 xi ] + εi . (23.4.2)
The difference between these two models is that in the first model (the linearized model) the er-
rors on the original scale are multiplied by the regression structure exp[β0 + β1 xi ], whereas in
the nonlinear model the errors are additive, i.e., are added to the regression structure. To fit the
nonlinear model (23.4.2), we need the partial derivatives of f (x; β0 , β1 ) ≡ exp[β0 + β1 x], namely
∂ f (x; β0 , β1 )/∂ β0 = exp[β0 + β1x] and ∂ f (x; β0 , β1 )/∂ β1 = exp[β0 + β1 x]x. As mentioned earlier,
the choice between using the linearized model from Section 7.3 or the nonlinear regression model
(23.4.2) is often based on which model seems to have better residual plots, etc. Exercise 23.5.1 asks
for this comparison. ✷
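To make the two fitting strategies concrete, the sketch below fits the linearized model by ordinary least squares on log(y) and the additive-error model (23.4.2) by Gauss–Newton, using the partial derivatives given above. The function names and the data are hypothetical (the Hooker data are not reproduced here), and the linearized estimates are a natural choice of starting values for the nonlinear fit.

```python
import numpy as np

def fit_linearized(x, y):
    """Least squares fit of log(y) = beta0 + beta1 * x."""
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(X, np.log(y), rcond=None)[0]

def fit_additive(x, y, beta0, max_iter=100, tol=1e-12):
    """Gauss-Newton fit of y = exp(beta0 + beta1 * x) + error."""
    beta = np.asarray(beta0, dtype=float)
    for _ in range(max_iter):
        fitted = np.exp(beta[0] + beta[1] * x)
        Z = np.column_stack([fitted, fitted * x])    # partial derivatives
        step = np.linalg.lstsq(Z, y - fitted, rcond=None)[0]
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Hypothetical noise-free data generated from beta = (1, 0.5).
x = np.linspace(0.0, 1.0, 10)
y = np.exp(1.0 + 0.5 * x)
beta_lin = fit_linearized(x, y)
beta_nl = fit_additive(x, y, beta0=[0.9, 0.6])
```

With noise-free data both approaches recover the same parameters; with real data they generally differ because they weight the observations differently, which is why the residual plots from each fit are worth comparing.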
23.5 Exercises
EXERCISE 23.5.1. Fit the nonlinear regression (23.4.2) to the Hooker data and compare the fit
of this model to the fit of the linearized model described in Section 7.3.
EXERCISE 23.5.2. For pregnant women, Day (1966) modeled the relationship between weight z
and week of gestation x with
\[ E(y) = \beta_0 + \exp[\beta_1 + \beta_2 x] \]
where y = 1/√(z − z0 ) and z0 is the initial weight of the woman. For a woman with initial weight of
138 pounds, the data in Table 23.3 were recorded.
Fit the model yi = β0 + exp[β1 + β2 xi ] + εi . Test whether each parameter is equal to zero, give
95% confidence intervals for each parameter, give 95% prediction intervals and surface confidence
intervals for x = 21 weeks, and check the diagnostic quantities. Test the reduced model defined by
H0 : β0 = 0; β1 = 0.
EXERCISE 23.5.3. Following Bliss and James (1966), fit the model yi = (xi β0 )/(xi + β1 ) + εi to
the following data on the relationship between reaction velocity y and concentration of substrate x.
x .138 .220 .291 .560 .766 1.460
y .148 .171 .234 .324 .390 .493
Test whether each parameter is equal to zero, give 99% confidence intervals for each parameter,
give 99% prediction intervals and surface confidence intervals for x = .5, and check the diagnostic
quantities.
EXERCISE 23.5.4. Bliss and James (1966) give data on the median survival time z of house flies
following application of the pesticide DDT at a level of molar concentration x. Letting y = 100/z,
fit the model yi = β0 + β1 xi /(xi + β2 ) + εi to the data given in Table 23.4.
Test whether each parameter is equal to zero, give 99% confidence intervals for each parameter,
give 95% prediction intervals and surface confidence intervals for a concentration of x = .03, and
check the diagnostic quantities. Find the SSE and test the reduced model defined by H0 : β0 =
0, β2 = .0125. Test H0 : β2 = .0125.
Matrices
A matrix is a rectangular array of numbers. Such arrays have rows and columns. The numbers of
rows and columns are referred to as the dimensions of a matrix. A matrix with, say, 5 rows and 3
columns is referred to as a 5 × 3 matrix.
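EXAMPLE A.0.1. Three matrices are given below along with their dimensions.
\[
\underset{3 \times 2}{\begin{bmatrix} 1 & 4 \\ 2 & 5 \\ 3 & 6 \end{bmatrix}}, \qquad
\underset{2 \times 2}{\begin{bmatrix} 20 & 80 \\ 90 & 140 \end{bmatrix}}, \qquad
\underset{4 \times 1}{\begin{bmatrix} 6 \\ 180 \\ -3 \\ 0 \end{bmatrix}} .
\]
✷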
Let r be an arbitrary positive integer. A matrix with r rows and r columns, i.e., an r × r matrix,
is called a square matrix. The second matrix in Example A.0.1 is square. A matrix with only one
column, i.e., an r × 1 matrix, is a vector, sometimes called a column vector. The third matrix in
Example A.0.1 is a vector. A 1 × r matrix is sometimes called a row vector.
An arbitrary matrix A is often written
A = [ai j ]
where ai j denotes the element of A in the ith row and jth column. Two matrices are equal if they have
the same dimensions and all of their elements (entries) are equal. Thus for r × c matrices A = [ai j ]
and B = [bi j ], A = B if and only if ai j = bi j for every i = 1, . . . , r and j = 1, . . . , c.
The transpose of a matrix A, denoted A′ , changes the rows of A into columns of a new matrix
A′ . If A is an r × c matrix, the transpose A′ is a c × r matrix. In particular, if we write A′ = [ãi j ], then
the element in row i and column j of A′ is defined to be ãi j = a ji .
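EXAMPLE A.0.3.
\[ \begin{bmatrix} 1 & 4 \\ 2 & 5 \\ 3 & 6 \end{bmatrix}' = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix} \]
and
\[ \begin{bmatrix} 20 & 80 \\ 90 & 140 \end{bmatrix}' = \begin{bmatrix} 20 & 90 \\ 80 & 140 \end{bmatrix} . \]
The transpose of a column vector is a row vector,
\[ \begin{bmatrix} 6 \\ 180 \\ -3 \\ 0 \end{bmatrix}' = \begin{bmatrix} 6 & 180 & -3 & 0 \end{bmatrix} . \]
✷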
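EXAMPLE A.1.1.
\[
\begin{bmatrix} 1 & 4 \\ 2 & 5 \\ 3 & 6 \end{bmatrix} +
\begin{bmatrix} 2 & 8 \\ 4 & 10 \\ 6 & 12 \end{bmatrix} =
\begin{bmatrix} 1+2 & 4+8 \\ 2+4 & 5+10 \\ 3+6 & 6+12 \end{bmatrix} =
\begin{bmatrix} 3 & 12 \\ 6 & 15 \\ 9 & 18 \end{bmatrix} .
\]
\[
\begin{bmatrix} 20 & 80 \\ 90 & 140 \end{bmatrix} -
\begin{bmatrix} -15 & -75 \\ 80 & 130 \end{bmatrix} =
\begin{bmatrix} 35 & 155 \\ 10 & 10 \end{bmatrix} .
\]
✷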
EXAMPLE A.3.1.
\[
\begin{bmatrix} 1 & 4 \\ 2 & 5 \\ 3 & 6 \end{bmatrix}
\begin{bmatrix} 20 & 80 \\ 90 & 140 \end{bmatrix}
=
\begin{bmatrix}
(1)(20) + (4)(90) & (1)(80) + (4)(140) \\
(2)(20) + (5)(90) & (2)(80) + (5)(140) \\
(3)(20) + (6)(90) & (3)(80) + (6)(140)
\end{bmatrix}
=
\begin{bmatrix} 380 & 640 \\ 490 & 860 \\ 600 & 1080 \end{bmatrix} .
\]
The entry in the first row and column of the product matrix, (1)(20) + (4)(90), matches the
elements in the first row of the first matrix, (1 4), with the elements in the first column of the second
matrix, $\begin{bmatrix} 20 \\ 90 \end{bmatrix}$. The 1 in (1 4) is matched up with the 20 in
$\begin{bmatrix} 20 \\ 90 \end{bmatrix}$ and these numbers are multiplied.
Similarly, the 4 in (1 4) is matched up with the 90 in $\begin{bmatrix} 20 \\ 90 \end{bmatrix}$ and
the numbers are multiplied. Finally,
the two products are added to obtain the entry (1)(20) + (4)(90). Similarly, the entry in the third
row, second column of the product, (3)(80) + (6)(140), matches the elements in the third row of
the first matrix, (3 6), with the elements in the second column of the second matrix,
$\begin{bmatrix} 80 \\ 140 \end{bmatrix}$. After
multiplying and adding we get the entry (3)(80) + (6)(140). To carry out this matching, the number
of columns in the first matrix must equal the number of rows in the second matrix. The matrix
product has the same number of rows as the first matrix and the same number of columns as the
second because each row of the first matrix can be matched with each column of the second.
✷
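The product in Example A.3.1 can be checked numerically. The sketch below uses NumPy (an assumption; any matrix language would do) and also rebuilds the third-row, second-column entry by the matching rule just described.

```python
import numpy as np

A = np.array([[1, 4], [2, 5], [3, 6]])      # 3 x 2
B = np.array([[20, 80], [90, 140]])         # 2 x 2

AB = A @ B                                  # 3 x 2 product

# Third row of A matched with the second column of B (0-based indices 2 and 1).
entry_32 = sum(A[2, l] * B[l, 1] for l in range(A.shape[1]))

# Reversing the order fails: B has 2 columns but A has 3 rows.
try:
    B @ A
    reversed_defined = True
except ValueError:
    reversed_defined = False
```

Here `AB` has rows (380, 640), (490, 860), and (600, 1080), `entry_32` equals (3)(80) + (6)(140) = 1080, and `reversed_defined` is False.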
Notice that in matrix multiplication the roles of the first matrix and the second matrix are not
interchangeable. In particular, if we reverse the order of the matrices in Example A.3.1, the matrix
product
\[
\begin{bmatrix} 20 & 80 \\ 90 & 140 \end{bmatrix}
\begin{bmatrix} 1 & 4 \\ 2 & 5 \\ 3 & 6 \end{bmatrix}
\]
is undefined because the first matrix has two columns while the second matrix has three rows. Even
when the matrix products are defined for both AB and BA, the results of the multiplication typically
differ. If A is r × s and B is s × r, then AB is an r × r matrix and BA is an s × s matrix. When r ≠ s,
clearly AB ≠ BA, but even when r = s we still cannot expect AB to equal BA.
Multiplication gives
\[ AB = \begin{bmatrix} 2 & 6 \\ 4 & 14 \end{bmatrix} \]
and
\[ BA = \begin{bmatrix} 6 & 8 \\ 7 & 10 \end{bmatrix} , \]
so AB ≠ BA. ✷
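A short numerical illustration of non-commutativity follows. The matrices A and B below are an assumption: they were chosen because they reproduce the two products displayed above, and they may or may not be the pair used in the example.

```python
import numpy as np

# Assumed matrices; selected only because they yield the products shown above.
A = np.array([[1, 2], [3, 4]])
B = np.array([[0, 2], [1, 2]])

AB = A @ B
BA = B @ A
same = np.array_equal(AB, BA)
```

Both products are defined and are 2 × 2, yet `same` is False.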
In general if A = [ai j ] is an r × s matrix and B = [bi j ] is a s × c matrix, then
\[ AB = [d_{ij}] \]
is the r × c matrix with
\[ d_{ij} = \sum_{\ell=1}^{s} a_{i\ell} b_{\ell j} . \]
A useful result is that the transpose of the product AB is the product, in reverse order, of the
transposed matrices, i.e. (AB)′ = B′ A′ .
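For any r × c matrix A, the product A′ A is always symmetric. This was illustrated in
Example 23.3.2. More generally, write A = [ai j ], A′ = [ãi j ] with ãi j = a ji , and
\[ A'A = [d_{ij}] \quad \text{with} \quad d_{ij} = \sum_{\ell=1}^{r} \tilde{a}_{i\ell} a_{\ell j} . \]
Note that
\[ d_{ij} = \sum_{\ell=1}^{r} \tilde{a}_{i\ell} a_{\ell j} = \sum_{\ell=1}^{r} a_{\ell i} a_{\ell j} = \sum_{\ell=1}^{r} \tilde{a}_{j\ell} a_{\ell i} = d_{ji} \]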
so the matrix is symmetric.
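Both facts, the symmetry of A′A and the reverse-order rule for transposes of products, are easy to confirm numerically. The sketch below reuses the matrices from Example A.3.1 and assumes NumPy.

```python
import numpy as np

A = np.array([[1, 4], [2, 5], [3, 6]])      # 3 x 2
B = np.array([[20, 80], [90, 140]])         # 2 x 2

G = A.T @ A                                 # 2 x 2 matrix A'A
symmetric = np.array_equal(G, G.T)

# (AB)' equals B'A', the transposes multiplied in reverse order.
reverse_rule = np.array_equal((A @ B).T, B.T @ A.T)
```

For this A the product A′A has rows (14, 32) and (32, 77), and both checks return True.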
Diagonal matrices are square matrices with all off-diagonal elements equal to zero.
In general, a diagonal matrix is a square matrix A = [ai j ] with ai j = 0 for i ≠ j. Obviously, diagonal
matrices are symmetric.
An identity matrix is a diagonal matrix with all 1s along the diagonal, i.e., aii = 1 for all i. The
third matrix in Example A.4.2 above is a 3 × 3 identity matrix. The identity matrix gets its name
because any matrix multiplied by an identity matrix remains unchanged.
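EXAMPLE A.4.3.
\[
\begin{bmatrix} 1 & 4 \\ 2 & 5 \\ 3 & 6 \end{bmatrix}
\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}
=
\begin{bmatrix} 1 & 4 \\ 2 & 5 \\ 3 & 6 \end{bmatrix}.
\]
\[
\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 4 \\ 2 & 5 \\ 3 & 6 \end{bmatrix}
=
\begin{bmatrix} 1 & 4 \\ 2 & 5 \\ 3 & 6 \end{bmatrix}.
\]
✷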
An r × r identity matrix is denoted Ir with the subscript deleted if the dimension is clear.
A zero matrix is a matrix that consists entirely of zeros. Obviously, the product of any matrix
multiplied by a zero matrix is zero.
EXAMPLE A.4.4.
\[
\begin{bmatrix} 0 & 0 \\ 0 & 0 \\ 0 & 0 \end{bmatrix},
\qquad
\begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}.
\]
✷
Often a zero matrix is denoted by 0 where the dimension of the matrix, and the fact that it is a
matrix rather than a scalar, must be inferred from the context.
A matrix M that has the property MM = M is called idempotent. A symmetric idempotent matrix
is a perpendicular projection operator.
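EXAMPLE A.4.5. The following matrices are both symmetric and idempotent:
\[
\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix},
\qquad
\begin{bmatrix} 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \end{bmatrix},
\qquad
\begin{bmatrix} .5 & .5 & 0 \\ .5 & .5 & 0 \\ 0 & 0 & 1 \end{bmatrix}.
\]
✷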
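Both properties claimed in Example A.4.5 can be verified mechanically. The sketch below uses exact rational arithmetic from Python's `fractions` module so that entries such as 1/3 are not rounded; the helper names are hypothetical.

```python
from fractions import Fraction

def transpose(a):
    """Swap rows and columns of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    """Multiply two conformable matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

third = Fraction(1, 3)
half = Fraction(1, 2)
examples = [
    [[1, 0, 0], [0, 1, 0], [0, 0, 1]],
    [[third] * 3 for _ in range(3)],
    [[half, half, 0], [half, half, 0], [0, 0, 1]],
]

for M in examples:
    symmetric = M == transpose(M)      # M' = M
    idempotent = matmul(M, M) == M     # MM = M
    print(symmetric, idempotent)       # True True for each matrix
```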
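EXAMPLE A.5.1. Observe that the example matrix A given at the beginning of the section has
\[
\begin{bmatrix} 1 & 2 & 5 & 1 \\ 2 & 2 & 10 & 6 \\ 3 & 4 & 15 & 1 \end{bmatrix}
\begin{bmatrix} 5 \\ 0 \\ -1 \\ 0 \end{bmatrix}
=
\begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix},
\]
so the columns of A are linearly dependent. ✷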
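The dependence in Example A.5.1 amounts to a nonzero vector that A maps to the zero vector. A minimal sketch of that check, with the matrix and vector read off the display in the example, is:

```python
A = [[1, 2, 5, 1],
     [2, 2, 10, 6],
     [3, 4, 15, 1]]
v = [5, 0, -1, 0]   # nonzero coefficients: 5 * (column 1) - 1 * (column 3)

# Each entry of Av is the inner product of a row of A with v.
Av = [sum(a * x for a, x in zip(row, v)) for row in A]
print(Av)  # [0, 0, 0]
```

Because `v` is not the zero vector and `Av` is, the third column is five times the first, which is exactly the linear dependence the example points out.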
The rank of A is the smallest number of columns of A that can generate C(A). It is also the
maximum number of linearly independent columns in A.
\[
AA^{-1} = A^{-1}A = I.
\]
The inverse of A exists only if the columns of A are linearly independent. Typically, it is difficult to
find inverses without the aid of a computer. For a 2 × 2 matrix
\[
A = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix},
\]
2x + 4y = 20
3x + 4y = 10.
Using the definition of the inverse on the left-hand side of the equality and the formula in (A.6.1)
on the right-hand side gives
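\[
\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} x \\ y \end{bmatrix}
=
\begin{bmatrix} -1 & 1 \\ 3/4 & -1/2 \end{bmatrix}
\begin{bmatrix} 20 \\ 10 \end{bmatrix}
\]
or
\[
\begin{bmatrix} x \\ y \end{bmatrix}
=
\begin{bmatrix} -10 \\ 10 \end{bmatrix}.
\]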
Thus (x, y) = (−10, 10) is the solution for the two equations, i.e., 2(−10) + 4(10) = 20 and
3(−10) + 4(10) = 10. ✷
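The same calculation can be sketched in code. Formula (A.6.1) is not reproduced here, so the function below uses the standard determinant-based inverse of a 2 × 2 matrix, which is assumed to agree with it; the function name `inverse_2x2` is hypothetical. Exact fractions are used so the entries 3/4 and −1/2 appear without rounding.

```python
from fractions import Fraction

def inverse_2x2(a):
    """Inverse of a 2 x 2 matrix by the standard determinant formula."""
    (a11, a12), (a21, a22) = a
    det = Fraction(a11 * a22 - a12 * a21)
    if det == 0:
        raise ValueError("determinant is zero; no inverse exists")
    return [[a22 / det, -a12 / det],
            [-a21 / det, a11 / det]]

A = [[2, 4], [3, 4]]   # coefficients of 2x + 4y = 20 and 3x + 4y = 10
c = [20, 10]           # right-hand sides

A_inv = inverse_2x2(A)  # entries -1, 1, 3/4, -1/2
solution = [sum(m * ci for m, ci in zip(row, c)) for row in A_inv]
print(solution)         # the values -10 and 10, as exact fractions
```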
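in which the $a_{ij}$s and $c_i$s are known and the $y_i$s are variables, can be written in matrix form as
\[
\begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix}
\begin{bmatrix} y_1 \\ y_2 \\ y_3 \end{bmatrix}
=
\begin{bmatrix} c_1 \\ c_2 \\ c_3 \end{bmatrix}
\]
or
\[
AY = C.
\]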
To find Y simply observe that AY = C implies A−1 AY = A−1C and Y = A−1C. Of course this ar-
gument assumes that A−1 exists, which is not always the case. Moreover, the procedure obviously
extends to larger sets of equations.
On a computer, there are better ways of finding solutions to systems of equations than finding the
inverse of a matrix. In fact, inverses are often found by solving systems of equations. For example,
in a 3 × 3 case the first column of A−1 can be found as the solution to
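\[
\begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix}
\begin{bmatrix} y_1 \\ y_2 \\ y_3 \end{bmatrix}
=
\begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}.
\]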
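The idea of building an inverse one column at a time can be sketched as follows. This is an illustration under stated assumptions, not a description of what any particular software does: it uses plain Gauss-Jordan elimination with exact fractions, and the names `solve` and `inverse` are hypothetical. The 2 × 2 coefficient matrix of the equations 2x + 4y = 20 and 3x + 4y = 10 is reused as the test case.

```python
from fractions import Fraction

def solve(a, c):
    """Solve the square system a y = c by Gauss-Jordan elimination."""
    n = len(a)
    # Augmented matrix [a | c] with exact rational entries.
    m = [[Fraction(x) for x in row] + [Fraction(ci)] for row, ci in zip(a, c)]
    for col in range(n):
        # Find a row at or below the diagonal with a nonzero entry in this column.
        pivot = next((r for r in range(col, n) if m[r][col] != 0), None)
        if pivot is None:
            raise ValueError("columns are linearly dependent; no inverse exists")
        m[col], m[pivot] = m[pivot], m[col]
        # Scale the pivot row, then clear this column in every other row.
        p = m[col][col]
        m[col] = [x / p for x in m[col]]
        for r in range(n):
            if r != col and m[r][col] != 0:
                f = m[r][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [row[n] for row in m]

def inverse(a):
    """Column j of the inverse solves a y = e_j, the j-th column of the identity."""
    n = len(a)
    cols = [solve(a, [1 if i == j else 0 for i in range(n)]) for j in range(n)]
    # cols holds the columns of the inverse; rearrange them into rows.
    return [[cols[j][i] for j in range(n)] for i in range(n)]

A = [[2, 4], [3, 4]]
A_inv = inverse(A)   # entries -1, 1, 3/4, -1/2
```

When only the solution of AY = C is needed, calling `solve(A, C)` once does less work than forming the whole inverse and then multiplying.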
For a special type of square matrix, called an orthogonal matrix, the transpose is also the inverse.
In other words, a square matrix P is an orthogonal matrix if
P′ P = I = PP′ .
To establish that P is orthogonal, it is enough to show either that P′ P = I or that PP′ = I. Orthogonal
matrices are particularly useful in discussions of eigenvalues and principal component regression.
Proposition A.7.1. Let A, B, and C be matrices of appropriate dimensions and let λ be a scalar.
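\[
\begin{aligned}
A + B &= B + A \\
(A + B) + C &= A + (B + C) \\
(AB)C &= A(BC) \\
C(A + B) &= CA + CB \\
\lambda (A + B) &= \lambda A + \lambda B \\
(A')' &= A \\
(A + B)' &= A' + B' \\
(AB)' &= B'A' \\
(A^{-1})^{-1} &= A \\
(A')^{-1} &= (A^{-1})' \\
(AB)^{-1} &= B^{-1}A^{-1}.
\end{aligned}
\]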
The last equality only holds when A and B both have inverses. The second-to-last property implies
that the inverse of a symmetric matrix is symmetric because then A−1 = (A′ )−1 = (A−1 )′ . This is a
very important property.
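Several of the transpose and inverse rules in Proposition A.7.1 can be spot-checked on small matrices. The sketch below is only a numerical illustration, not a proof: the matrices `A`, `B`, and the symmetric matrix `S` are arbitrary choices, and the helper functions are hypothetical.

```python
from fractions import Fraction

def matmul(a, b):
    """Multiply two conformable matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    """Swap rows and columns."""
    return [list(col) for col in zip(*a)]

def inverse_2x2(a):
    """Exact inverse of a 2 x 2 matrix (standard determinant formula)."""
    (p, q), (r, s) = a
    det = Fraction(p * s - q * r)
    if det == 0:
        raise ValueError("no inverse exists")
    return [[s / det, -q / det], [-r / det, p / det]]

# Arbitrary illustrative matrices; S is symmetric.
A = [[2, 4], [3, 4]]
B = [[1, 2], [3, 4]]
S = [[2, 1], [1, 3]]

checks = {
    "(AB)' = B'A'": transpose(matmul(A, B)) == matmul(transpose(B), transpose(A)),
    "(A^-1)^-1 = A": inverse_2x2(inverse_2x2(A)) == A,
    "(A')^-1 = (A^-1)'": inverse_2x2(transpose(A)) == transpose(inverse_2x2(A)),
    "(AB)^-1 = B^-1 A^-1": inverse_2x2(matmul(A, B)) == matmul(inverse_2x2(B), inverse_2x2(A)),
    "S^-1 is symmetric": inverse_2x2(S) == transpose(inverse_2x2(S)),
}
for name, ok in checks.items():
    print(name, ok)   # each line ends with True
```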
The value 2 is also an eigenvalue with eigenvectors that are nonzero multiples of (1, −1, 0)′ .
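\[
\begin{bmatrix} 3 & 1 & -1 \\ 1 & 3 & -1 \\ -1 & -1 & 5 \end{bmatrix}
\begin{bmatrix} 1 \\ -1 \\ 0 \end{bmatrix}
=
\begin{bmatrix} 2 \\ -2 \\ 0 \end{bmatrix}
= 2 \begin{bmatrix} 1 \\ -1 \\ 0 \end{bmatrix}.
\]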
Finally, 6 is an eigenvalue with eigenvectors that are nonzero multiples of (1, 1, −2)′ . ✷
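Each eigenvalue and eigenvector pair can be confirmed by checking that Av = λv. The sketch below does this for the matrix displayed in the example, using the three pairs this appendix attributes to A (3 with (1, 1, 1)′, 2 with (1, −1, 0)′, and 6 with (1, 1, −2)′); `matvec` is a hypothetical helper.

```python
def matvec(a, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(x * y for x, y in zip(row, v)) for row in a]

A = [[3, 1, -1],
     [1, 3, -1],
     [-1, -1, 5]]

pairs = [(3, [1, 1, 1]), (2, [1, -1, 0]), (6, [1, 1, -2])]
for lam, v in pairs:
    # An eigenpair satisfies A v = lam * v.
    print(lam, matvec(A, v) == [lam * x for x in v])   # True for each pair
```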
Proposition A.8.2. Let A be a symmetric matrix, then for a diagonal matrix D(φi ) consisting
of eigenvalues there exists an orthogonal matrix P whose columns are corresponding eigenvectors
such that
A = PD(φi )P′ .
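The decomposition in Proposition A.8.2 can be illustrated with the matrix from Example A.8.1. In this sketch the eigenvectors are rescaled to unit length, an assumption needed for P to be orthogonal, and floating-point results are compared with a small tolerance. All helper names are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two conformable matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    """Swap rows and columns."""
    return [list(col) for col in zip(*a)]

def close(a, b, tol=1e-12):
    """Entrywise comparison of two matrices up to rounding error."""
    return all(math.isclose(x, y, abs_tol=tol)
               for ra, rb in zip(a, b) for x, y in zip(ra, rb))

A = [[3, 1, -1], [1, 3, -1], [-1, -1, 5]]
eigenvalues = [3, 2, 6]
eigenvectors = [[1, 1, 1], [1, -1, 0], [1, 1, -2]]

# Rescale each eigenvector to length 1; these become the columns of P.
unit = [[x / math.sqrt(sum(y * y for y in v)) for x in v] for v in eigenvectors]
P = transpose(unit)
D = [[eigenvalues[i] if i == j else 0 for j in range(3)] for i in range(3)]
I3 = [[1 if i == j else 0 for j in range(3)] for i in range(3)]

print(close(matmul(transpose(P), P), I3))             # P'P = I
print(close(matmul(matmul(P, D), transpose(P)), A))   # A = P D P'
```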
This matrix is closely related to the matrix in Example A.8.1. The matrix B has 3 as an eigenvalue
with corresponding eigenvectors that are multiples of (1, 1, 1)′ , just like the matrix A. Once again 6
is an eigenvalue with corresponding eigenvector (1, 1, −2)′ and once again (1, −1, 0)′ is an eigen-
vector, but now, unlike A, (1, −1, 0) also corresponds to the eigenvalue 6. We leave it to the reader
to verify these facts. The point is that in this matrix, 6 is an eigenvalue that has two linearly inde-
pendent eigenvectors. In such cases, any nonzero linear combination of the two eigenvectors is also
an eigenvector. For example, it is easy to see that
\[
3 \begin{bmatrix} 1 \\ -1 \\ 0 \end{bmatrix}
+ 2 \begin{bmatrix} 1 \\ 1 \\ -2 \end{bmatrix}
= \begin{bmatrix} 5 \\ -1 \\ -4 \end{bmatrix}
\]
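The claim about linear combinations can be checked numerically. The matrix B is not displayed in this passage, so the sketch below uses an assumed reconstruction: the matrix determined by the eigenvalues and eigenvectors stated above (3 for (1, 1, 1)′, and 6 for both (1, 1, −2)′ and (1, −1, 0)′). Treat the entries of `B` as an assumption, not a quotation.

```python
def matvec(a, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(x * y for x, y in zip(row, v)) for row in a]

# Assumed reconstruction of B from the stated eigenvalues and eigenvectors.
B = [[5, -1, -1],
     [-1, 5, -1],
     [-1, -1, 5]]

u = [1, -1, 0]
w = [1, 1, -2]
combo = [3 * a + 2 * b for a, b in zip(u, w)]

print(combo)                                        # [5, -1, -4]
print(matvec(B, combo) == [6 * x for x in combo])   # True
```

The combination is again an eigenvector for the eigenvalue 6 because B(3u + 2w) = 3(6u) + 2(6w) = 6(3u + 2w).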
Generally, if we need eigenvalues or eigenvectors we get a computer to find them for us.
Two frequently used functions of a square matrix are the determinant and the trace.
Definition A.8.5.
a) The determinant of a square matrix is the product of the eigenvalues of the matrix.
b) The trace of a square matrix is the sum of the eigenvalues of the matrix.
In fact, one can show that the trace of a square matrix also equals the sum of the diagonal elements
of that matrix.
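For the 3 × 3 matrix of Example A.8.1, with eigenvalues 3, 2, and 6, both parts of Definition A.8.5 can be compared with direct calculations from the entries. The cofactor expansion used for the determinant is the standard textbook formula rather than something stated in this appendix, and the function names are hypothetical.

```python
import math

def trace(m):
    """Sum of the diagonal elements."""
    return sum(m[i][i] for i in range(len(m)))

def det3(m):
    """Determinant of a 3 x 3 matrix by cofactor expansion along the first row."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

A = [[3, 1, -1],
     [1, 3, -1],
     [-1, -1, 5]]
eigenvalues = [3, 2, 6]

print(trace(A), sum(eigenvalues))        # 11 11
print(det3(A), math.prod(eigenvalues))   # 36 36
```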
Tables
B.1 Tables of the t distribution
Agresti, A. and Coull, B.A. (1998). Approximate is Better than Exact for Interval Estimation of Binomial
Proportions. The American Statistician, 52, 119–126.
Aitchison, J. and Dunsmore, I. R. (1975). Statistical Prediction Analysis. Cambridge University Press, Cam-
bridge.
Atkinson, A. C. (1973). Testing transformations to normality. Journal of the Royal Statistical Society, Series B,
35, 473–479.
Atkinson, A. C. (1985). Plots, Transformations, and Regression: An Introduction to Graphical Methods of
Diagnostic Regression Analysis. Oxford University Press, Oxford.
Bailey, D. W. (1953). The Inheritance of Maternal Influences on the Growth of the Rat. Ph.D. Thesis, University
of California.
Baten, W. D. (1956). An analysis of variance applied to screw machines. Industrial Quality Control, 10, 8–9.
Beineke, L. A. and Suddarth, S. K. (1979). Modeling joints made with light-gauge metal connector plates.
Forest Products Journal, 29, 39–44.
Berry, D. A. (1996). Statistics: A Bayesian Perspective. Wadsworth, Belmont, CA.
Bethea, R. M., Duran, B. S., and Boullion, T. L. (1985). Statistical Methods for Engineers and Scientists,
Second Edition. Marcel Dekker, New York.
Bickel, P. J., Hammel, E. A., and O’Conner, J. W. (1975). Sex bias in graduate admissions: Data from Berkeley.
Science, 187, 398–404.
Bisgaard, S. and Fuller, H. T. (1995). Reducing variation with two-level factorial experiments. Quality Engi-
neering, 8, 373–377.
Bissell, A. F. (1972). A negative binomial model with varying element sizes. Biometrika, 59, 435–441.
Bliss, C. I. (1947). 2 × 2 factorial experiments in incomplete groups for use in biological assays. Biometrics, 3,
69–88.
Bliss, C. I. and James, A. T. (1966). Fitting the rectangular hyperbola. Biometrics, 22, 573–602.
Box, G. E. P. (1950). Problems in the analysis of growth and wear curves. Biometrics, 6, 362–389.
Box, G. E. P. and Cox, D. R. (1964). An analysis of transformations. Journal of the Royal Statistical Society,
Series B, 26, 211–246.
Box, G. E. P. and Draper, N. R. (1987). Empirical Model-Building and Response Surfaces. John Wiley and
Sons, New York.
Box, G. E. P. and Tidwell, P. W. (1962). Transformations of the independent variables. Technometrics, 4, 531–
550.
Brownlee, K. A. (1960). Statistical Theory and Methodology in Science and Engineering. John Wiley and
Sons, New York.
Burt, C. (1966). The genetic determination of differences in intelligence: A study of monozygotic twins reared
together and apart. Br. J. Psych., 57, 137–153.
Carroll, R. J. and Ruppert, D. (1984). Power transformations when fitting theoretical models to data. Journal
of the American Statistical Association, 79, 321–328.
Casella, G. (2008). Statistical Design. Springer-Verlag, New York.
Cassini, J. (1740). Eléments d’astronomie. Imprimerie Royale, Paris.
600 REFERENCES
Chapman, R. E., Masinda, K. and Strong, A. (1995). Designing simple reliability experiments. Quality Engi-
neering, 7, 567–582.
Christensen, R. (1987). Plane Answers to Complex Questions: The Theory of Linear Models. Springer-Verlag,
New York.
Christensen, R. (1996). Analysis of Variance, Design, and Regression: Applied Statistical Methods. Chapman
and Hall/CRC, Boca Raton, FL.
Christensen, R. (1997). Log-Linear Models and Logistic Regression, Second Edition. Springer-Verlag, New
York.
Christensen, R. (2001). Advanced Linear Modeling (previously Linear Models for Multivariate, Time Series,
and Spatial Data). Springer-Verlag, New York.
Christensen, Ronald (2000). Linear and log-linear models. Journal of the American Statistical Association, 95,
1290-1293.
Christensen, R. (2011). Plane Answers to Complex Questions: The Theory of Linear Models, Fourth Edition.
Springer-Verlag, New York.
Christensen, R. and Bedrick, E. J. (1997). Testing the independence assumption in linear models. Journal of
the American Statistical Association, 92, 1006–1016.
Christensen, R., Johnson, W., Branscum, A. and Hanson, T. E. (2010). Bayesian Ideas and Data Analysis, CRC
Press, Boca Raton.
Cochran, W. G. and Cox, G. M. (1957). Experimental Designs, Second Edition. John Wiley and Sons, New
York.
Coleman, D. E. and Montgomery, D. C. (1993). A systematic approach to planning for a designed industrial
experiment (with discussion). Technometrics, 35, 1–27.
Conover, W. J. (1971). Practical Nonparametric Statistics. John Wiley and Sons, New York.
Cook, R. D. and Tsai, C.-L. (1985). Residuals in nonlinear regression. Biometrika, 72, 23–29.
Cook, R. D. and Tsai, C.-L. (1990). Diagnostics for assessing the accuracy of normal approximations in expo-
nential family nonlinear models. Journal of the American Statistical Association, 85, 770–777.
Cook, R. D. and Weisberg, S. (1982). Residuals and Influence in Regression. Chapman and Hall, New York.
Cornell, J. A. (1988). Analyzing mixture experiments containing process variables. A split plot approach.
Journal of Quality Technology, 20, 2–23.
Cox, D. R. (1958). Planning of Experiments. John Wiley and Sons, New York.
Cox, D. R. and Reid, N. (2000). The Theory of the Design of Experiments. Chapman and Hall/CRC, Boca
Raton, FL.
Cramér, H. (1946). Mathematical Methods of Statistics. Princeton University Press, Princeton.
Dalal, S.R., Fowlkes, E.B., and Hoadley, B. (1989). Risk analysis of the space shuttle: Pre-Challenger predic-
tion of failure. Journal of the American Statistical Association, 84, 945–957.
David, H. A. (1988). The Method of Paired Comparisons. Methuen, New York.
Day, B. B. and del Priore, F. R. (1953). The statistics in a gear-test program. Industrial Quality Control, 7,
16–20.
Day, N. E. (1966). Fitting curves to longitudinal data. Biometrics, 22, 276–291.
Deming, W. E. (1986). Out of the Crisis. MIT Center for Advanced Engineering Study, Cambridge, MA.
Devore, Jay L. (1991). Probability and Statistics for Engineering and the Sciences, Third Edition. Brooks/Cole,
Pacific Grove, CA.
Dixon, W. J. and Massey, F. J., Jr. (1969). Introduction to Statistical Analysis, Third Edition. McGraw-Hill,
New York.
Dixon, W. J. and Massey, F. J., Jr. (1983). Introduction to Statistical Analysis, Fourth Edition. McGraw-Hill,
New York.
Draper, N. and Smith, H. (1966). Applied Regression Analysis. John Wiley and Sons, New York.
Emerson, J. D. (1983). Mathematical aspects of transformation. In Understanding Robust and Exploratory
Data Analysis, edited by D.C. Hoaglin, F. Mosteller, and J.W. Tukey. John Wiley and Sons, New York.
REFERENCES 601
Everitt, B. J. (1977). The Analysis of Contingency Tables. Chapman and Hall, London.
Feigl, P. and Zelen, M. (1965). Estimation of exponential probabilities with concomitant information. Biomet-
rics, 21, 826–838.
Fienberg, S. E. (1980). The Analysis of Cross-Classified Categorical Data, Second Edition. MIT Press, Cam-
bridge, MA. Reprinted in 2007 by Springer-Verlag.
Finney, D. J. (1964). Statistical Method in Biological Assay, Second Edition. Hafner Press, New York.
Fisher, R. A. (1925). Statistical Methods for Research Workers, 14th Edition, 1970. Hafner Press, New York.
Fisher, R. A. (1935). The Design of Experiments, Ninth Edition, 1971. Hafner Press, New York.
Fisher, R. A. (1947). The analysis of covariance method for the relation between a part and the whole. Biomet-
rics, 3, 65–68.
Forbes, J. D. (1857). Further experiments and remarks on the measurement of heights by the boiling point of
water. Transactions of the Royal Society of Edinburgh, 21, 135– 143.
Fuchs, C. and Kenett, R. S. (1987). Multivariate tolerance regions on F-tests. Journal of Quality Technology,
19, 122–131.
Garner, N. R. (1956). Studies in textile testing. Industrial Quality Control, 10, 44–46.
Hader, R. J. and Grandage, A. H. E. (1958). Simple and multiple regression analyses. In Experimental Designs
in Industry, edited by V. Chew, pp. 108–137. John Wiley and Sons, New York.
Hahn, G. J. and Meeker, W. Q. (1993). Assumptions for statistical inference. The American Statistician, 47,
1–11.
Heyl, P. R. (1930). A determination of the constant of gravitation. Journal of Research of the National Bureau
of Standards, 5, 1243–1250.
Hinkelmann, K. and Kempthorne, O. (2005). Design and Analysis of Experiments: Volume 2, Advanced Exper-
imental Design. John Wiley and Sons, Hoboken, NJ.
Hinkelmann, K. and Kempthorne, O. (2008). Design and Analysis of Experiments: Volume 1, Introduction to
Experimental Design, Second Edition. John Wiley and Sons, Hoboken, NJ.
Hochberg, Y. and Tamhane, A. (1987). Multiple Comparison Procedures. John Wiley and Sons, New York.
Hosmer, D. W., Hosmer, T., le Cessie, S. and Lemeshow, S. (1997). A comparison of goodness-of-fit tests for
the logistic regression model. Statistics in Medicine, 16(9), 965–980.
Hsu, J.C. (1996). Multiple Comparisons, Theory and methods. Chapman & Hall, Boca Raton.
Inman, J., Ledolter, J., Lenth, R. V., and Niemi, L. (1992). Two case studies involving an optical emission
spectrometer. Journal of Quality Technology, 24, 27–36.
Jaswal, I. S., Mann, R. F., Juusola, J. A., and Downie, J. (1969). The vapour-phase oxidation of benzene over
a vanadium pentoxide catalyst. Canadian Journal of Chemical Engineering, 47, No. 3, 284–287.
Jensen, R. J. (1977). Evinrude’s computerized quality control productivity. Quality Progress, X, 9, 12–16.
John, P. W. M. (1961). An application of a balanced incomplete block design. Technometrics, 3, 51–54.
John, P. W. M. (1971). Statistical Design and Analysis of Experiments. Macmillan, New York.
Johnson, F. J. (1978). Automated determination of phosphorus in fertilizers: Collaborative study. Journal of
the Association of Official Analytical Chemists, 61, 533–536.
Johnson, D. E. (1998). Applied multivariate methods for data analysts. Duxbury Press, Belmont, CA.
Johnson, R. A. and Wichern, D. W. (2007). Applied Multivariate Statistical Analysis, Sixth Edition. Prentice-
Hall, Englewood Cliffs, NJ.
Jolicoeur, P. and Mosimann, J. E. (1960). Size and shape variation on the painted turtle: A principal component
analysis. Growth, 24, 339–354.
Kempthorne, O. (1952). Design and Analysis of Experiments. Krieger, Huntington, NY.
Kihlberg, J. K., Narragon, E. A., and Campbell, B. J. (1964). Automobile crash injury in relation to car size.
Cornell Aero. Lab. Report No. VJ-1823-Rll.
Koopmans, L. H. (1987). Introduction to Contemporary Statistical Methods, Second Edition. Duxbury Press,
Boston.
602 REFERENCES
Lazerwitz, B. (1961). A comparison of major United States religious groups. Journal of the American Statisti-
cal Association, 56, 568–579.
Lehmann, E. L. (1975). Nonparametrics: Statistical Methods Based on Ranks. Holden-Day, San Francisco.
Lindgren, B.W. (1968). Statistical Theory, Second Edition. Macmillan, New York.
McCullagh, P. (2000). Invariance and factorial models, with discussion. Journal of the Royal Statistical Society,
Series B, 62, 209-238.
McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models, Second Edition. Chapman and Hall,
London.
McDonald, G. C. and Schwing, R. C. (1973). Instabilities in regression estimates relating air pollution to
mortality. Technometrics, 15, 463–481.
Mandel, J. (1972). Repeatability and reproducibility. Journal of Quality Technology, 4, 74–85.
Mandel, J. (1989a). Some thoughts on variable-selection in multiple regression. Journal of Quality Technology,
21, 2–6.
Mandel, J. (1989b). The nature of collinearity. Journal of Quality Technology, 21, 268–276.
Mandel, J. and Lashof, T. W. (1987). The nature of repeatability and reproducibility. Journal of Quality Tech-
nology, 19, 29–36.
Marquardt, D. W. (1963). An algorithm for least-squares estimation of nonlinear parameters. SIAM Journal of
Applied Mathematics, 11, 431–441.
May, J. M. (1952). Extended and corrected tables of the upper percentage points of the studentized range.
Biometrika, 39, 192–193.
Mercer. W. B. and Hall, A. D. (1911). The experimental error of field trials. Journal of Agricultural Science,
iv, 107–132.
Miller, R. G., Jr. (1981). Simultaneous Statistical Inference, Second Edition. Springer-Verlag, New York.
Milliken, G. A. and Graybill, F. A. (1970). Extensions of the general linear hypothesis model. Journal of the
American Statistical Association, 65, 797–807.
Mosteller, F. and Tukey, J. W. (1977). Data Analysis and Regression. Addison-Wesley, Reading, MA.
Mulrow, J. M., Vecchia, D. F., Buonaccorsi, J. P., and Iyer, H. K. (1988). Problems with interval estimation
when data are adjusted via calibration. Journal of Quality Technology, 20, 233–247.
Nelson, P. R. (1993). Additional uses for the analysis of means and extended tables of critical values. Techno-
metrics, 35, 61–71.
Ott, E. R. (1949). Variables control charts in production research. Industrial Quality Control, 3, 30–31.
Ott, E. R. and Schilling, E. G. (1990). Process Quality Control: Trouble Shooting and Interpretation of Data,
Second Edition. McGraw-Hill, New York.
Patterson, H. D. (1950). The analysis of change-over trials. Journal of Agricultural Science, 40, 375–380.
Pauling, L. (1971). The significance of the evidence about ascorbic acid and the common cold. Proceedings of
the National Academy of Science, 68, 2678–2681.
Pritchard, D. J., Downie, J., and Bacon, D. W. (1977). Further consideration of heteroscedasticity in fitting
kinetic models. Technometrics, 19, 227–236.
Quetelet, A. (1842). A Treatise on Man and the Development of His Faculties. Chambers, Edinburgh.
Rao, C. R. (1965). Linear Statistical Inference and Its Applications. John Wiley and Sons, New York.
Rao, C. R. (1973). Linear Statistical Inference and Its Applications, Second Edition. John Wiley and Sons,
New York.
Reiss, I. L., Banward, A., and Foreman, H. (1975). Premarital contraceptive usage: A study and some theoret-
ical explorations. Journal of Marriage and the Family, 37, 619–630.
Ryan, T. P. (1989). Statistical Methods for Quality Improvement. John Wiley and Sons, New York.
St. Laurent, R. T. and Cook, R. D. (1992). Leverage and superleverage in nonlinear regression. Journal of the
American Statistical Association, 87, 985–990.
REFERENCES 603
Satterthwaite, F. E. (1946). An approximate distribution of estimates of variance components. Biometrics, 2,
110–114.
Scheffé, H. (1959). The Analysis of Variance. John Wiley and Sons, New York.
Schneider, H. and Pruett, J. M. (1994). Control charting issues in the process industries. Quality Engineering,
6, 345–373.
Seber, G. A. F. and Wild, C. J. (1989). Nonlinear Regression. John Wiley and Sons, New York. (The 2003
version appears to be just a reprint of this.)
Shapiro, S. S. and Francia, R. S. (1972). An approximate analysis of variance test for normality. Journal of the
American Statistical Association, 67, 215–216.
Shewhart, W. A. (1931). Economic Control of Quality. Van Nostrand, New York.
Shewhart, W. A. (1939). Statistical Method from the Viewpoint of Quality Control. Graduate School of the
Department of Agriculture, Washington. Reprint (1986), Dover, New York.
Shumway, R. H. and Stoffer, D. S. (2000). Time Series Analysis and Its Applications. Springer-Verlag, New
York.
Smith, H., Gnanadesikan, R., and Hughes, J. B. (1962). Multivariate analysis of variance (MANOVA). Biomet-
rics, 18, 22–41.
Snedecor, G. W. (1945a). Query. Biometrics, 1, 25.
Snedecor, G. W. (1945b). Query. Biometrics, 1, 85.
Snedecor, G. W. and Cochran, W. G. (1967). Statistical Methods, Sixth Edition. Iowa State University Press,
Ames, IA.
Snedecor, G. W. and Cochran, W. G. (1980). Statistical Methods, Seventh Edition. Iowa State University Press,
Ames, IA.
Snedecor, G. W. and Haber, E. S. (1946). Statistical methods for an incomplete experiment on a perennial crop.
Biometrics, 2, 61–69.
Stigler, S. M. (1986). The History of Statistics. Harvard University Press, Cambridge, MA.
Sulzberger, P. H. (1953). The effects of temperature on the strength of wood, plywood and glued joints. Aero-
nautical Research Consultative Committee, Australia, Department of Supply, Report ACA–46.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society,
Series B, 58, 267–288.
Tukey, J. W. (1949). One degree of freedom for nonadditivity. Biometrics, 5, 232–242.
Utts, J. (1982). The rainbow test for lack of fit in regression. Communications in Statistics—Theory and Meth-
ods, 11, 2801–2815.
Wahba, G. (1990). Spline Models for Observational Data. (Vol. 59, CBMS-NSF Regional Conference Series
in Applied Mathematics.) SIAM, Philadelphia.
Watkins, D., Bergman, A., and Horton, R. (1994). Optimization of tool life on the shop floor using design of
experiments. Quality Engineering, 6, 609–620.
Weisberg, S. (1985). Applied Linear Regression. Second Edition. John Wiley and Sons, New York.
Williams, E. J. (1959). Regression Analysis. John Wiley and Sons, New York.
Wilm, H. G. (1945). Notes on analysis of experiments replicated in time. Biometrics, 1, 16–20.
Woodward, G., Lange, S. W., Nelson, K. W., and Calvert, H. O. (1941). The acute oral toxicity of acetic,
chloracetic, dichloracetic and trichloracetic acids. Journal of Industrial Hygiene and Toxicology, 23, 78–
81.
Younger, M. S. (1979). A Handbook for Linear Regression, Duxbury Press, Belmont, CA.