Imstat
Table of contents
Preface
I Introduction to data
1 Hello data
1.1 Case study: Using stents to prevent strokes
1.2 Data basics
1.3 Chapter review
1.4 Exercises
2 Study design
2.1 Sampling principles and strategies
2.2 Experiments
2.3 Observational studies
2.4 Chapter review
2.5 Exercises
3 Applications: Data
3.1 Case study: Olympic 1500m
3.2 Simpson’s paradox
3.3 Interactive R tutorials
3.4 R labs
6 Applications: Explore
6.1 Case study: Effective communication of exploratory results
6.2 Interactive R tutorials
6.3 R labs
Appendices
A Exercise solutions
B References
Welcome to IMS2
Copyright © 2024.
Second Edition.
Version date: May 31, 2024.
This textbook and its supplements, including slides, labs, and interactive tutorials, may be downloaded
for free at
openintro.org/book/ims.
This textbook is a derivative of OpenIntro Statistics 4th Edition and Introduction to Statistics with
Randomization and Simulation 1st Edition by Diez, Barr, and Çetinkaya-Rundel, and it’s available
under a Creative Commons Attribution-ShareAlike 3.0 Unported United States License. License
details are available at the Creative Commons website:
creativecommons.org.
Source files for this book can be found on GitHub at
github.com/openintrostat/ims.
Authors
Mine Çetinkaya-Rundel
Mine Çetinkaya-Rundel is Professor of the Practice at the Department of Statistical Science at Duke
University and Developer Educator at Posit. Mine’s work focuses on innovation in statistics and
data science pedagogy, with an emphasis on computing, reproducible research, student-centered
learning, and open-source education as well as pedagogical approaches for enhancing retention of
women and under-represented minorities in STEM. Mine works on integrating computation into the
undergraduate statistics curriculum, using reproducible research methodologies and analysis of real
and complex datasets. She also organizes ASA DataFest, an annual two-day competition in which
teams of undergraduate students work to reveal insights into a rich and complex dataset. Mine has
been working on the OpenIntro project since its founding and as part of this project she co-authored
four open-source introductory statistics textbooks (including this one!). She is also the creator and
maintainer of datasciencebox.org and she teaches the popular Statistics with R MOOC on Coursera.
Johanna Hardin
Pomona College
Jo Hardin is Professor of Mathematics and Statistics at Pomona College. She collaborates with
molecular biologists to create novel statistical methods for analyzing high throughput data. She
has also worked extensively in statistics and data science education, facilitating modern curricula
for higher education instructors. She was a co-author on the 2014 ASA Curriculum Guidelines for
Undergraduate Programs in Statistical Science, and she writes on the blog teachdatascience.com. The
best part of her job is collaborating with undergraduate students. In her spare time, she loves running,
hiking, and jigsaw puzzles.
Preface
Welcome to the second edition of “Introduction to Modern Statistics”!
We hope readers will take away three ideas from this book in addition to forming a foundation of
statistical thinking and methods.
1. Statistics is an applied field with a wide range of practical applications.
2. You don’t have to be a math guru to learn from interesting, real data.
3. Data are messy, and statistical tools are imperfect. However, when you understand the strengths
and weaknesses of these tools, you can use them to learn interesting things about the world.
Textbook overview
• Part 1: Introduction to data. Data structures, variables, summaries, graphics, and basic
data collection and study design techniques.
• Part 2: Exploratory data analysis. Data visualization and summarization, with particular
emphasis on multivariable relationships.
• Part 3: Regression modeling. Modeling numerical and categorical outcomes with linear and
logistic regression and using model results to describe relationships and make predictions.
• Part 4: Foundations for inference. Case studies are used to introduce the ideas of statistical
inference with randomization tests, bootstrap intervals, and mathematical models.
• Part 5: Statistical inference. Further details of statistical inference using randomization
tests, bootstrap intervals, and mathematical models for numerical and categorical data.
• Part 6: Inferential modeling. Extending inference techniques presented thus far to linear
and logistic regression settings and evaluating model performance.
Each part contains multiple chapters and ends with a case study. Building on the content covered in
the part, the case study presents a high-level overview using the tools and techniques from the part.
In the chapters that cover statistical inference, we have presented a parallel structure that walks the
student through both computational and mathematical approaches to every inferential topic. Trying
to cover every approach for every topic is likely too much material for a one-semester class. We suggest
that you make deliberate choices for navigating the book with your students. A few potential paths
through the book (with chapter numbers in parentheses) are given as follows:
• Focus on parallel structure of computational and mathematical methods: Introduction
to data (1, 2), Exploratory data analysis (4, 5), Regression (7), Foundations (11, 12, 13, 14),
Inference (a subset of: 16, 17, 18, 19, 20, 21, 22; potentially: 16, 17, 19, 20).
• Focus on computational methods: Introduction to data (1, 2), Exploratory data analysis (4,
5), Regression (7), Foundations (11, 12, 14), Inference (computational methods only for some
subset of: 16, 17, 18, 19, 20, 21, 22).
• Focus on mathematical methods: Introduction to data (1, 2), Exploratory data analysis (4,
5), Regression (7), Foundations (11, 12, 13, 14), Inference (mathematical methods only for some
subset of: 16, 17, 18, 19, 20, 21, 22).
• Focus on modeling: Introduction to data (1, 2), Exploratory data analysis (4, 5), Regression
(7, 8, 9), Foundations (11, 12, 13, 14), Inference (19), Inferential modeling (24, 25, 26).
We expect that most courses following a classical syllabus will not have time to cover the chapters in
the last part, Inferential modeling (24, 25, 26).
Each chapter ends with a review which contains a chapter summary as well as a list of key terms
introduced in the chapter. If you’re not sure what some of these terms mean, we recommend you
go back in the text and review their definitions. We purposefully present them in alphabetical order,
instead of in order of appearance, so they will be a little more challenging to locate. However, you
should be able to easily spot them as bolded text.
Changes for the second edition
While the second edition does not represent a major change from the first edition, we have worked
hard to improve content, to add exercises, and to update text and code to reflect changes in best
practices (e.g., the book is now written in Quarto).
A brief summary of the biggest changes follows:
• Twenty-five completely new exercises were added. Most of the new exercises are concatenated
onto existing exercises so as to retain similar numbering across editions. However, a few exercises
have been moved in order to produce both odd exercises (with solutions) and even exercises
(without solutions) on the same topic.
• Multiple datasets were added or updated. For example, the pm25_2022_durham data on air
quality in Durham, NC in 2022 can be found in the openintro R package.
• Chapter 3 was re-written with an updated context and data example. Additionally, in Chapter 3,
we explore Simpson’s Paradox.
• Throughout the text and the exercises, “statistically significant” has been changed to
“statistically discernible” so as to distance ourselves from the more colloquial use of the word “significant.”
EXAMPLE
This is an example. When a question is asked here, where can the answer be found?
The answer can be found here, in the solution section of the example!
When we think the reader is ready to try determining a solution on their own, we frame it as Guided
Practice.
GUIDED PRACTICE
The reader may check or learn the answer to any Guided Practice problem by reviewing
the full solution in a footnote.1
Exercises are also provided at the end of each chapter. Solutions are given for odd-numbered exercises
in Appendix A.
1 Guided Practice problems are intended to stretch your thinking, and you can check yourself by reviewing the
A large majority of the datasets used in the book can be found in various R packages. Each time a
new dataset is introduced in the narrative, a reference to the package like the one below is provided.
Many of these datasets are in the openintro R package that contains datasets used in OpenIntro’s
open-source textbooks.2
The datasets used throughout the book come from real sources like opinion polls and scientific articles,
except for a handful of cases where we use toy data to highlight a particular feature or explain a
particular concept. References for the sources of the real data are provided at the end of the book.
Computing with R
The narrative and the exercises in the book are computing-language agnostic. However, while it’s
possible to learn about modern statistics without computing, it’s not possible to apply it. Therefore,
we invite you to navigate the concepts you have learned in each part using the interactive R tutorials
and the R labs that are included at the end of each part.
Interactive R tutorials
The self-paced and interactive R tutorials were developed using the learnr R package, and only an
internet browser is needed to complete them.
Each part comes with a tutorial comprised of 4-10 lessons and listed like this.
You can access the full list of tutorials supporting this book at
https://openintrostat.github.io/ims-tutorials.
R labs
Once you feel comfortable with the material in the tutorials, we also encourage you to apply what
you’ve learned via the computational labs that are also linked at the end of each part. The labs
consist of data analysis case studies, and they require access to R and RStudio. The first lab includes
installation instructions. If you’d rather not install the software locally, you can also try Posit Cloud
for free.
You can access the full list of labs supporting this book at openintro.org/go?id=ims-r-labs.
2 Mine Çetinkaya-Rundel and David Diez and Andrew Bray and Albert Y. Kim and Ben Baumer and Chester Ismay
and Nick Paterno and Christopher Barr (2024). openintro: Datasets and Supplemental Functions from ‘OpenIntro’
Textbooks and Labs. R package version 2.5.0. https://github.com/openintrostat/openintro.
OpenIntro, online resources, and getting involved
Acknowledgements
The OpenIntro project would not have been possible without the dedication and volunteer hours of all
those involved, and we hope you will join us in extending a huge thank you to all those who volunteer
with OpenIntro.
The authors would like to thank the following individuals:
• David Diez and Christopher Barr for their work on the 1st Edition of this book,
• Ben Baumer and Andrew Bray for their contributions in rethinking how and in which order we present
this material as well as their work as original authors of the interactive tutorial content,
• Yanina Bellini Saibene, Florencia D’Andrea, and Roxana Noelia Villafañe for their work on
creating the interactive tutorials in learnr,
• Peter Baumgartner for review and revisions of the interactive learnr tutorials,
• Will Gray for conceptual diagrams,
• Allison Theobold, Melinda Yager, and Randy Prium for their valuable feedback and review of
the book,
• Colin Rundel for feedback on content and technical help with conversion from LaTeX to R
Markdown,
• Christophe Dervieux for help with multi-output bookdown issues, and
• Müge Çetinkaya and Meenal Patel for their design vision.
We would like to also thank the developers of the open-source tools that make the development and
authoring of this book possible, e.g., Quarto, tidyverse, tidymodels, and icons8.
We are also grateful to the many teachers, students, and other readers who have helped improve
OpenIntro resources through their feedback.
PART I
Introduction to data
The first part of the book will introduce you to data, their properties, how they are collected, and
the structure of the design used for the study. Different data and settings lead to different types of
conclusions, so you’ll always want to keep in mind the data provenance, especially as you move on to
modeling and inference.
• In Chapter 1 you’ll be introduced to tidy data, an important structure for describing, visualizing,
and analyzing data.
• In Chapter 2 the focus is on study design. In particular, the critical distinction between random
sampling and randomization is made.
• Chapter 3 includes an application on the Paralympics case study where the topics from the
Introduction to data part of the book are fully developed.
We recommend you come back to review this foundational part after you cover each new part in the
textbook. In particular, it is worthwhile to consider Figure 2.8 in all of the inferential settings you
cover. Each dataset you analyze will have a slightly different context which will require thoughtful
consideration of the appropriate conclusions.
Chapter 1
Hello data
In this section we introduce a classic challenge in statistics: evaluating the efficacy of a medical
treatment. Terms in this section, and indeed much of this chapter, will all be revisited later in the
text. The plan for now is simply to get a sense of the role statistics can play in practice.
An experiment is designed to study the effectiveness of stents in treating patients at risk of stroke
(Chimowitz et al. 2011). Stents are small mesh tubes that are placed inside narrow or weak arteries
to assist in patient recovery after cardiac events and reduce the risk of an additional heart attack or
death.
Many doctors have hoped that there would be similar benefits for patients at risk of stroke. We start
by writing the principal question the researchers hope to answer:
Does the use of stents reduce the risk of stroke?
The researchers who asked this question conducted an experiment with 451 at-risk patients. Each
volunteer patient was randomly assigned to one of two groups:
• Treatment group. Patients in the treatment group received a stent and medical management.
The medical management included medications, management of risk factors, and help in lifestyle
modification.
• Control group. Patients in the control group received the same medical management as the
treatment group, but they did not receive stents.
Researchers randomly assigned 224 patients to the treatment group and 227 to the control group. In
this study, the control group provides a reference point against which we can measure the medical
impact of stents in the treatment group.
Researchers studied the effect of stents at two time points: 30 days after enrollment and 365 days after
enrollment. The results of 5 patients are summarized in Table 1.1. Patient outcomes are recorded as
stroke or no event, representing whether the patient had a stroke during that time period.
The stent30 data and stent365 data can be found in the openintro R package.
Table 1.1: Results for five patients from the stent study.
It would be difficult to answer a question on the impact of stents on the occurrence of strokes for all
study patients using these individual observations. This question is better addressed by performing
a statistical data analysis of all observations. Table 1.2 summarizes the raw data in a more helpful
way. In this table, we can quickly see what happened over the entire study. For instance, to identify
the number of patients in the treatment group who had a stroke within 30 days after the treatment,
we look in the leftmost column (30 days), at the intersection of treatment and stroke: 33. To identify
the number of control patients who did not have a stroke within 365 days after receiving treatment, we
look at the rightmost column (365 days), at the intersection of control and no event: 199.
GUIDED PRACTICE
Of the 224 patients in the treatment group, 45 had a stroke by the end of the first year.
Using these two numbers, compute the proportion of patients in the treatment group
who had a stroke by the end of their first year. (Note: answers to all Guided Practice
exercises are provided in footnotes!)1
We can compute summary statistics from the table to give us a better idea of how the impact of
the stent treatment differed between the two groups. A summary statistic is a single number
summarizing data from a sample. For instance, the primary results of the study after 1 year could
be described by two summary statistics: the proportion of people who had a stroke in the treatment
and control groups.
• Proportion who had a stroke in the treatment (stent) group: 45/224 = 0.20 = 20%.
• Proportion who had a stroke in the control group: 28/227 = 0.12 = 12%.
These two summary statistics are useful in looking for differences in the groups, and we are in for a
surprise: an additional 8% of patients in the treatment group had a stroke! This is important for two
reasons. First, it is contrary to what doctors expected, which was that stents would reduce the rate
of strokes. Second, it leads to a statistical question: do the data show a “real” difference between the
groups?
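The arithmetic behind these two summary statistics can be sketched in a few lines of Python. The counts are the ones reported above; the variable names are our own and are chosen only for illustration.

```python
# Counts reported in the text for the 365-day follow-up.
strokes = {"treatment": 45, "control": 28}
group_sizes = {"treatment": 224, "control": 227}

# A summary statistic: the proportion of each group that had a stroke.
proportions = {
    group: strokes[group] / group_sizes[group] for group in strokes
}

# Difference in proportions (treatment minus control).
difference = proportions["treatment"] - proportions["control"]

for group, p in proportions.items():
    print(f"{group}: {p:.2f}")
print(f"difference: {difference:.2f}")
```

Rounded to two decimal places, the proportions are 0.20 and 0.12, and their difference is 0.08, which matches the 8% gap described above.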
1 The proportion of the 224 patients who had a stroke within 365 days: 45/224 = 0.20.
This second question is subtle. Suppose you flip a coin 100 times. While the chance a coin lands heads
in any given coin flip is 50%, we probably won’t observe exactly 50 heads. This type of variation is
part of almost any type of data generating process. It is possible that the 8% difference in the stent
study is due to this natural variation. However, the larger the difference we observe (for a particular
sample size), the less believable it is that the difference is due to chance. So, what we are really asking
is the following: if in fact stents have no effect, how likely is it that we observe such a large difference?
While we do not yet have statistical tools to fully address this question on our own, we can comprehend
the conclusions of the published analysis: there was compelling evidence of harm by stents in this
study of stroke patients.
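As a preview of the kind of reasoning developed later in the book, the question "if stents have no effect, how likely is such a large difference?" can be explored by simulation. The sketch below is our own illustration and is not the method used in the published analysis. It assumes the 73 strokes observed after one year would have happened regardless of group, reshuffles the patients into groups of 224 and 227 many times, and counts how often chance alone puts at least 45 strokes in the treatment group. The seed and the number of shuffles are arbitrary choices.

```python
import random

# Counts reported in the text for the 365-day follow-up.
n_treatment, n_control = 224, 227
strokes_treatment, strokes_control = 45, 28

# If stents have no effect, the 73 strokes would have occurred regardless
# of group, so we can reshuffle the group labels and see what differences
# arise by chance alone. 1 = stroke, 0 = no event.
outcomes = [1] * (strokes_treatment + strokes_control)
outcomes += [0] * (n_treatment + n_control - len(outcomes))

rng = random.Random(2024)  # fixed seed (arbitrary) so the run is repeatable
n_simulations = 2000
at_least_as_extreme = 0

for _ in range(n_simulations):
    rng.shuffle(outcomes)
    # Treat the first 224 shuffled patients as the simulated treatment group.
    simulated_strokes_treatment = sum(outcomes[:n_treatment])
    # With fixed group sizes and a fixed total number of strokes, a
    # difference in proportions at least as large as the observed one is
    # the same event as 45 or more strokes in the treatment group.
    if simulated_strokes_treatment >= strokes_treatment:
        at_least_as_extreme += 1

proportion_extreme = at_least_as_extreme / n_simulations
print(f"Share of shuffles with a difference this large: {proportion_extreme:.3f}")
```

A small share means that a difference this large would be unusual if stents truly had no effect. The exact value depends on the seed and the number of shuffles.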
Be careful: Do not generalize the results of this study to all patients and all stents. This study
looked at patients with very specific characteristics who volunteered to be a part of this study and
who may not be representative of all stroke patients. In addition, there are many types of stents, and
this study only considered the self-expanding Wingspan stent (Boston Scientific). However, this study
does leave us with an important lesson: we should keep our eyes open for surprises.
Effective presentation and description of data is a first step in most analyses. This section introduces
one structure for organizing data as well as some terminology that will be used throughout this book.
Each row in Table 1.3 represents a single loan. The formal name for a row is a case or observation
or unit of observation. The columns represent characteristics of each loan, where each column is
referred to as a variable. For example, the first row represents a loan of $22,000 with an interest rate
of 10.90%, where the borrower is based in New Jersey (NJ) and has an income of $59,000.
GUIDED PRACTICE
What is the grade of the first loan in Table 1.3? And what is the home ownership status
of the borrower for that first loan? Reminder: for these Guided Practice questions, you
can check your answer in the footnote.2
In practice, it is especially important to ask clarifying questions to ensure important aspects of the
data are understood. For instance, it is always important to be sure we know what each variable
means and its units of measurement. Descriptions of the variables in the loan50 dataset are given in
Table 1.4.
Table 1.4: Variables and their descriptions for the loan50 dataset.
Variable        Description
loan_amount     Amount of the loan received, in US dollars.
interest_rate   Interest rate on the loan, in an annual percentage.
term            The length of the loan, which is always set as a whole number of months.
grade           Loan grade, which takes values A through G and represents the quality of the loan and its likelihood of being repaid.
state           US state where the borrower resides.
total_income    Borrower’s total income, including any second income, in US dollars.
homeownership   Indicates whether the person owns, owns but has a mortgage, or rents.
The data in Table 1.3 represent a data frame, which is a convenient and common way to organize
data, especially if collecting data in a spreadsheet. A data frame where each row is a unique case
(observational unit), each column is a variable, and each cell is a single value is commonly referred to
as tidy data (Wickham 2014).
When recording data, use a tidy data frame unless you have a very good reason to use a different
structure. This structure allows new cases to be added as rows or new variables as new columns and
facilitates visualization, summarization, and other statistical analyses.
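To make the tidy structure concrete, here is a minimal sketch in Python that stores a data frame as a list of rows. This is only one possible representation; in R you would typically use a data frame directly. The first row uses the four values quoted earlier for the first loan. The other rows, and the derived variable income_to_loan, are hypothetical and exist only to show how rows and columns are added.

```python
# Each dict is one case (row); each key is one variable (column).
# The first row uses the values quoted in the text for the first loan;
# the second row is made up purely for illustration.
loans = [
    {"loan_amount": 22000, "interest_rate": 10.90, "state": "NJ", "total_income": 59000},
    {"loan_amount": 5000, "interest_rate": 7.50, "state": "CA", "total_income": 40000},  # hypothetical
]

# Adding a new case means appending a row.
loans.append(
    {"loan_amount": 12000, "interest_rate": 9.25, "state": "TX", "total_income": 72000}  # hypothetical
)

# Adding a new variable means adding the same key to every row.
for loan in loans:
    loan["income_to_loan"] = loan["total_income"] / loan["loan_amount"]

# Every row has the same variables, so summaries are simple to compute.
mean_rate = sum(loan["interest_rate"] for loan in loans) / len(loans)
print(len(loans), "cases,", len(loans[0]), "variables")
print(f"mean interest rate: {mean_rate:.2f}")
```

Because every row carries the same set of variables and every cell holds a single value, adding cases or variables never requires restructuring the data that is already there.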
GUIDED PRACTICE
The grades for assignments, quizzes, and exams in a course are often recorded in a
gradebook that takes the form of a data frame. How might you organize a course’s
grade data using a data frame? Describe the observational units and variables.3
GUIDED PRACTICE
We consider data for 3,142 counties in the United States, which includes the name of
each county, the state where it resides, its population in 2017, the population change
from 2010 to 2017, poverty rate, and nine additional characteristics. How might these
data be organized in a data frame?4
The data described in the Guided Practice above represent the county dataset, which is shown as
a data frame in Table 1.5. These variables, as well as the variables in the dataset that did not fit in
Table 1.5, are described in Table 1.6.
Table 1.5: Six observations and six variables from the county dataset.
3 There are multiple strategies that can be followed. One common strategy is to have each student represented by a
row, and then add a column for each assignment, quiz, or exam. Under this setup, it is easy to review a single line to
understand the grade history of a student. There should also be columns to include student information, such as one
column to list student names.
4 Each county may be viewed as a case, and there are fourteen pieces of information recorded for each case. A table
with 3,142 rows and 14 columns could hold these data, where each row represents a county and each column represents
a particular piece of information.
Table 1.6: Variables and their descriptions for the county dataset.
Variable            Description
name                Name of county.
state               Name of state.
pop2000             Population in 2000.
pop2010             Population in 2010.
pop2017             Population in 2017.
pop_change          Population change from 2010 to 2017 (in percent).
poverty             Percent of population in poverty in 2017.
homeownership       Homeownership rate, 2006-2010.
multi_unit          Multi-unit rate: percent of housing units that are in multi-unit structures, 2006-2010.
unemployment_rate   Unemployment rate in 2017.
metro               Whether the county contains a metropolitan area, taking one of the values yes or no.
median_edu          Median education level (2013-2017), taking one of the values below_hs, hs_diploma, some_college, or bachelors.
per_capita_income   Per capita (per person) income (2013-2017).
median_hh_income    Median household income.
smoking_ban         Describes the type of county-level smoking ban in place in 2010, taking one of the values none, partial, or comprehensive.
EXAMPLE
Data were collected about students in a statistics course. Three variables were recorded for
each student: number of siblings, student height, and whether the student had previously taken
a statistics course. Classify each of the variables as continuous numerical, discrete numerical,
or categorical.
The number of siblings and student height represent numerical variables. Because the number
of siblings is a count, it is discrete. Height varies continuously, so it is a continuous numerical
variable. The last variable classifies students into two categories – those who have and those
who have not taken a statistics course – which makes this variable categorical.
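The classification in this example can be mirrored by a rough heuristic in Python based on how the values are stored. This is a sketch under our own assumptions, not a rule. The function name classify_variable and the three student records are hypothetical, and the storage type is only a hint: a number used as a label, such as a postal code, is stored as an integer but is still categorical.

```python
def classify_variable(values):
    """Rough heuristic for the variable types used in this chapter."""
    # bool is checked first because Python treats True/False as integers.
    if all(isinstance(v, (bool, str)) for v in values):
        return "categorical"
    if all(isinstance(v, int) and not isinstance(v, bool) for v in values):
        return "discrete numerical"
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
        return "continuous numerical"
    return "unclear"

# Hypothetical records for three students (not data from the book).
students = {
    "siblings": [0, 2, 1],
    "height_cm": [162.5, 180.0, 171.3],
    "took_stats_before": [True, False, False],
}

for name, values in students.items():
    print(name, "->", classify_variable(values))
```

For these made-up records the heuristic labels siblings as discrete numerical, height_cm as continuous numerical, and took_stats_before as categorical, in agreement with the reasoning in the example.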
GUIDED PRACTICE
An experiment is evaluating the effectiveness of a new drug in treating migraines. A
group variable is used to indicate the experiment group for each patient: treatment or
control. The num_migraines variable represents the number of migraines the patient
experienced during a 3-month period. Classify each variable as either numerical or
categorical.5
5 The group variable can take just one of two group names, making it categorical. The num_migraines variable
describes a count of the number of migraines, which is an outcome where basic arithmetic is sensible, which means this
is a numerical outcome; more specifically, since it represents a count, num_migraines is a discrete numerical variable.
Scatterplots are one type of graph used to study the relationship between two numerical variables.
Figure 1.2 displays the relationship between the variables homeownership and multi_unit, which is
the percent of housing units that are in multi-unit structures (e.g., apartments, condos). Each point
on the plot represents a single county. For instance, the highlighted dot corresponds to County 413
in the county dataset: Chattahoochee County, Georgia, which has 39.4% of housing units that are
in multi-unit structures and a homeownership rate of 31.3%. The scatterplot suggests a relationship
between the two variables: counties with a higher rate of housing units that are in multi-unit structures
tend to have lower homeownership rates. We might brainstorm as to why this relationship exists and
investigate each idea to determine which are the most reasonable explanations.
Figure 1.2: A scatterplot of homeownership versus the percent of housing units that are in multi-unit
structures for US counties. The highlighted dot represents Chattahoochee County, Georgia, which has a
multi-unit rate of 39.4% and a homeownership rate of 31.3%.
The multi-unit and homeownership rates are said to be associated because the plot shows a discernible
pattern. When two variables show some connection with one another, they are called associated
variables.
GUIDED PRACTICE
Examine the variables in the loan50 dataset, which are described in Table 1.4. Create
two questions about possible relationships between variables in loan50 that are of
interest to you.6
EXAMPLE
This example examines the relationship between the percent change in population from 2010
to 2017 and median household income for counties, which is visualized as a scatterplot in
Figure 1.3. Are these variables associated?
The larger the median household income for a county, the higher the population growth
observed for the county. While it isn’t true that every county with a higher median household
income has a higher population growth, the trend in the plot is evident. Since there is some
relationship between the variables, they are associated.
6 Two example questions: (1) What is the relationship between loan amount and total income? (2) If someone’s
income is above the average, will their interest rate tend to be above or below the average?
Figure 1.3: A scatterplot showing population change against median household income. Owsley County of
Kentucky is highlighted, which lost 3.63% of its population from 2010 to 2017 and had median household
income of $22,736.
Because there is a downward trend in Figure 1.2 – counties with more housing units that are in
multi-unit structures are associated with lower homeownership – these variables are said to be negatively
associated. A positive association is shown in the relationship between the median_hh_income
and pop_change variables in Figure 1.3, where counties with higher median household income tend
to have higher rates of population growth.
If two variables are not associated, then they are said to be independent. That is, two variables are
independent if there is no evident relationship between the two.
A pair of variables are either related in some way (associated) or not (independent). No
pair of variables is both associated and independent.
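One common numerical summary of the direction of a trend is the correlation coefficient, which this chapter does not introduce. We use it here only as a hedged illustration: its sign is negative for a downward linear trend and positive for an upward one. It does not detect every kind of association, so a value near zero does not by itself show that two variables are independent. In the sketch below, only the last data pair (39.4, 31.3) comes from the text; the other four pairs are hypothetical.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Percent multi-unit housing and homeownership rate for five counties.
# Only the last pair (39.4, 31.3) is quoted in the text (Chattahoochee
# County, Georgia); the other four pairs are hypothetical.
multi_unit = [5.0, 10.0, 20.0, 30.0, 39.4]
homeownership = [80.0, 75.0, 68.0, 55.0, 31.3]

r = correlation(multi_unit, homeownership)
direction = "negative" if r < 0 else "positive" if r > 0 else "no linear trend"
print(f"r = {r:.2f} ({direction} association)")
```

For these illustrative values the coefficient is negative, matching the downward pattern described for Figure 1.2.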
7 In some disciplines, it’s customary to refer to the explanatory variable as the independent variable and the
response variable as the dependent variable. However, this becomes confusing since a pair of variables might be
independent or dependent, so we avoid this language.
When we suspect one variable might causally affect another, we label the first variable
the explanatory variable and the second the response variable. We also use the
terms explanatory and response to describe variables where the response might be
predicted using the explanatory even if there is no causal relationship.
For many pairs of variables, there is no hypothesized relationship, and these labels
would not be applied to either variable in such cases.
Bear in mind that the act of labeling the variables in this way does nothing to guarantee that a causal
relationship exists. A formal evaluation to check whether one variable causes a change in another
requires an experiment.
Association ≠ Causation.
1.3.1 Summary
This chapter introduced you to the world of data. Data can be organized in many ways but tidy data,
where each row represents an observation and each column represents a variable, lends itself most
easily to statistical analysis. Many of the ideas from this chapter will be revisited as we move on to
doing end-to-end data analyses. In the next chapter you’re going to learn about how we can design
studies to collect the data we need to make conclusions with the desired scope of inference.
1.3.2 Terms
The terms introduced in this chapter are presented in Table 1.7. If you’re not sure what some of these
terms mean, we recommend you go back in the text and review their definitions. You should be able
to easily spot them as bolded text.
1.4 Exercises
                                        Length                   Gross
    Title                               Hrs  Mins  Release Date  Opening Wknd US  US      World
1   Iron Man                            2    6     5/2/2008      98.62            319.03  585.8
2   The Incredible Hulk                 1    52    6/12/2008     55.41            134.81  264.77
3   Iron Man 2                          2    4     5/7/2010      128.12           312.43  623.93
4   Thor                                1    55    5/6/2011      65.72            181.03  449.33
5   Captain America: The First Avenger  2    4     7/22/2011     65.06            176.65  370.57
... ...                                 ...  ...   ...           ...              ...     ...
23  Spiderman: Far from Home            2    9     7/2/2019      92.58            390.53  1131.93
2. Cherry Blossom Run. The data frame below contains information on runners in the 2017
Cherry Blossom Run, which is an annual road race that takes place in Washington, DC. Most
runners participate in a 10-mile run while a smaller fraction take part in a 5k run or walk. How
many observations and how many variables does this data frame have?9
                                                       Time
       Bib    Name         Sex  Age  City / Country    Net   Clock  Pace  Event
1      6      Hiwot G.     F    21   Ethiopia          3217  3217   321   10 Mile
2      22     Buze D.      F    22   Ethiopia          3232  3232   323   10 Mile
3      16     Gladys K.    F    31   Kenya             3276  3276   327   10 Mile
4      4      Mamitu D.    F    33   Ethiopia          3285  3285   328   10 Mile
5      20     Karolina N.  F    35   Poland            3288  3288   328   10 Mile
...    ...    ...          ...  ...  ...               ...   ...    ...   ...
19961  25153  Andres E.    M    33   Woodbridge, VA    5287  5334   1700  5K
3. Air pollution and birth outcomes, study components. Researchers collected data to
examine the relationship between air pollutants and preterm births in Southern California. Dur-
ing the study air pollution levels were measured by air quality monitoring stations. Specifically,
levels of carbon monoxide were recorded in parts per million, nitrogen dioxide and ozone in
parts per hundred million, and coarse particulate matter (PM10) in μg/m³. Length of gestation
data were collected on 143,196 births between the years 1989 and 1993, and air pollution
exposure during gestation was calculated for each birth. The analysis suggested that increased
ambient PM10 and, to a lesser degree, CO concentrations may be associated with the occurrence
of preterm births. (Ritz et al. 2000)
a. Identify the main research question of the study.
b. Who are the subjects in this study, and how many are included?
c. What are the variables in the study? Identify each variable as numerical or categorical. If
numerical, state whether the variable is discrete or continuous. If categorical, state whether
the variable is ordinal.
8 The mcu_films data used in this exercise can be found in the openintro R package.
9 The run17 data used in this exercise can be found in the cherryblossom R package.
4. Cheaters, study components. Researchers studying the relationship between honesty, age
and self-control conducted an experiment on 160 children between the ages of 5 and 15. Par-
ticipants reported their age, sex, and whether they were an only child or not. The researchers
asked each child to toss a fair coin in private and to record the outcome (white or black) on a
paper sheet, and said they would only reward children who report white. (Bucciol and Piovesan
2011)
a. Identify the main research question of the study.
b. Who are the subjects in this study, and how many are included?
c. The study’s findings can be summarized as follows: “Half the students were explicitly told
not to cheat and the others were not given any explicit instructions. In the no instruc-
tion group probability of cheating was found to be uniform across groups based on child’s
characteristics. In the group that was explicitly told to not cheat, girls were less likely to
cheat, and while rate of cheating didn’t vary by age for boys, it decreased with age for girls.”
How many variables were recorded for each subject in the study in order to conclude these
findings? State the variables and their types.
Pain free?
Group No Yes
Control 44 2
Treatment 33 10
a. What percent of patients in the treatment group were pain free 24 hours after receiving
acupuncture?
b. What percent were pain free in the control group?
c. In which group did a higher percent of patients become pain free 24 hours after receiving
acupuncture?
d. Your findings so far might suggest that acupuncture is an effective treatment for migraines
for all people who suffer from migraines. However, this is not the only possible conclusion.
What is one other possible explanation for the observed difference between the percentages
of patients that are pain free 24 hours after receiving acupuncture in the two groups?
e. What are the explanatory and response variables in this study?
8. Sinusitis and antibiotics. Researchers studying the effect of antibiotic treatment for acute
sinusitis compared to symptomatic treatments randomly assigned 166 adults diagnosed with
acute sinusitis to one of two groups: treatment or control. Study participants received either
a 10-day course of amoxicillin (an antibiotic) or a placebo similar in appearance and taste.
The placebo consisted of symptomatic treatments such as acetaminophen, nasal decongestants,
etc. At the end of the 10-day period, patients were asked if they experienced improvement in
symptoms. The distribution of responses is summarized below.11 (Garbutt et al. 2012)
Improvement
Group No Yes
Control 16 65
Treatment 19 66
9. Daycare fines, study components. Researchers tested the deterrence hypothesis which
predicts that the introduction of a penalty will reduce the occurrence of the behavior subject to
the fine, with the condition that the fine leaves everything else unchanged, by instituting a fine
for late pickup at daycare centers. For this study, they worked with 10 volunteer daycare centers
that did not originally impose a fine to parents for picking up their kids late. They randomly
selected 6 of these daycare centers and instituted a monetary fine (of a considerable amount)
for picking up children late and then removed it. In the remaining 4 daycare centers no fine
was introduced. The study period was divided into four parts: before the fine (weeks 1–4), the first
4 weeks with the fine (weeks 5–8), the last 8 weeks with the fine (weeks 9–16), and the period after
the fine (weeks 17–20). Throughout the study, the number of kids who were picked up late was
recorded each week for each daycare. The study found that the number of late-coming parents
increased discernibly when the fine was introduced, and no reduction occurred after the fine was
removed.12 (Gneezy and Rustichini 2000)
12. Smoking habits of UK residents. A survey was conducted to study the smoking habits of
1,691 UK residents. Below is a data frame displaying a portion of the data collected in this survey.
A blank cell indicates that data for that variable was not available for a given respondent.15
                                                           amount
      sex     age  marital_status  gross_income      smoke  weekend  weekday
1     Female  61   Married         2,600 to 5,200    No
2     Female  61   Divorced        10,400 to 15,600  Yes    5        4
3     Female  69   Widowed         5,200 to 10,400   No
4     Female  50   Married         5,200 to 10,400   No
5     Male    31   Single          10,400 to 15,600  Yes    10       20
...   ...     ...  ...             ...               ...
1691  Male    49   Divorced        Above 36,400      Yes    15       10
13. US Airports. The visualization below shows the geographical distribution of airports in the
contiguous United States and Washington, DC. This visualization was constructed based on a
dataset where each observation is an airport.16
a. List the variables you believe were necessary to create this visualization.
b. Indicate whether each variable is numerical or categorical. If numerical, identify as contin-
uous or discrete. If categorical, indicate if the variable is ordinal.
15 The smoking data used in this exercise can be found in the openintro R package.
16 The usairports data used in this exercise can be found in the airports R package.
14. UN Votes. The visualization below shows voting patterns in the United States, Canada, and
Mexico in the United Nations General Assembly on a variety of issues. Specifically, for a given
year between 1946 and 2019, it displays the percentage of roll calls in which the country voted
yes for each issue. This visualization was constructed based on a dataset where each observation
is a country/year pair.17
15. UK baby names. The visualization below shows the number of baby girls born in the United
Kingdom (comprised of England & Wales, Northern Ireland, and Scotland) who were given the
name “Fiona” over the years.18
a. List the variables you believe were necessary to create this visualization.
b. Indicate whether each variable is numerical or categorical. If numerical, identify as contin-
uous or discrete. If categorical, indicate if the variable is ordinal.
17 The data used in this exercise can be found in the unvotes R package.
18 The ukbabynames data used in this exercise can be found in the ukbabynames R package.
16. Shows on Netflix. The visualization below shows the distribution of ratings of TV shows on
Netflix (a streaming entertainment service) based on the decade they were released in and the
country they were produced in. In the dataset, each observation is a TV show.19
a. List the variables you believe were necessary to create this visualization.
b. Indicate whether each variable is numerical or categorical. If numerical, identify as contin-
uous or discrete. If categorical, indicate if the variable is ordinal.
17. Stanford Open Policing. The Stanford Open Policing project gathers, analyzes, and releases
records from traffic stops by law enforcement agencies across the United States. Their goal is
to help researchers, journalists, and policy makers investigate and improve interactions between
police and the public. The following is an excerpt from a summary table created from
the data collected as part of this project. (Pierson et al. 2020)
Driver Car
County State Race / Ethnicity Arrest rate Stops / year Search rate
Apache County AZ Black 0.016 266 0.077
Apache County AZ Hispanic 0.018 1008 0.053
Apache County AZ White 0.006 6322 0.017
Cochise County AZ Black 0.015 1169 0.047
... ... ... ... ... ...
Wood County WI Hispanic 0.029 27 0.036
Wood County WI White 0.029 1157 0.033
a. What variables were collected on each individual traffic stop in order to create the summary
table above?
b. State whether each variable is numerical or categorical. If numerical, state whether it is
continuous or discrete. If categorical, state whether it is ordinal or not.
c. Suppose we wanted to evaluate whether vehicle search rates are different for drivers of
different races. In this analysis, which variable would be the response variable and which
variable would be the explanatory variable?
19 The netflix_titles data used in this exercise can be found in the tidytuesdayR R package.
18. Space launches. The following summary table shows the number of space launches in the US
by the type of launching agency and the outcome of the launch (success or failure).20
a. What variables were collected on each launch in order to create the summary table
above?
b. State whether each variable is numerical or categorical. If numerical, state whether it is
continuous or discrete. If categorical, state whether it is ordinal or not.
c. Suppose we wanted to study how the success rate of launches varies between launching
agencies and over time. In this analysis, which variable would be the response variable and
which variable would be the explanatory variable?
19. Pet names. The city of Seattle, WA has an open data portal that includes pets registered
in the city. For each registered pet, we have information on the pet’s name and species. The
following visualization plots the proportion of dogs with a given name versus the proportion of
cats with the same name. The 20 most common cat and dog names are displayed. The diagonal
line on the plot is the 𝑥 = 𝑦 line; if a name appeared on this line, the name’s popularity would
be exactly the same for dogs and cats.21
Chapter 2
Study design
Before digging into the details of working with data, we stop to think about
how data come to be. That is, if the data are to be used to make broad
and complete conclusions, then it is important to understand who or what
the data represent. One important aspect of data provenance is sampling.
Knowing how the observational units were selected from a larger entity
will allow for generalizations back to the population from which the data
were randomly selected. Additionally, by understanding the structure of
the study, causal relationships can be separated from those relationships
which are only associated. A good question to ask oneself before working
with the data at all is, “How were these observations collected?” You will
learn a lot about the data by understanding its source.
The first step in conducting research is to identify topics or questions that are to be investigated. A
clearly laid out research question is helpful in identifying what subjects or cases should be studied
and what variables are important. It is also important to consider how data are collected so that the
data are reliable and help achieve the research goals.
GUIDED PRACTICE
For the second and third questions above, identify the target population and what
represents an individual case.1
Anecdotal evidence.
Be careful of data collected in a haphazard fashion. Such evidence may be true and
verifiable, but it may only represent extraordinary cases and therefore not be a good
representation of the population.
Anecdotal evidence typically is composed of unusual cases that we recall based on their striking
characteristics. For instance, we are more likely to remember the two people we met who took 7 years
to graduate than the six others who graduated in four years. Instead of looking at the most unusual
cases, we should examine a sample of many cases that better represent the population.
1 The question “Over the last five years, what is the average time to complete a degree for Duke undergrads?” is
only relevant to students who complete their degree; the average cannot be computed using a student who never finished
their degree. Thus, only Duke undergrads who graduated in the last five years represent cases in the population under
consideration. Each such student is an individual case. For the question “Does a new drug reduce the number of deaths
in patients with severe heart disease?”, a person with severe heart disease represents a case. The population includes
all people with severe heart disease.
Figure 2.1: 10 graduates are randomly selected from the population to be included in the sample.
EXAMPLE
Suppose we ask a student who happens to be majoring in nutrition to select several graduates
for the study. Which students do you think they might pick? Do you think their sample would
be representative of all graduates?
They might pick a disproportionate number of graduates from health-related fields, as shown
in Figure 2.2. When selecting samples by hand, we run the risk of picking a biased sample,
even if our bias is unintended.
Figure 2.2: Asked to pick a sample of graduates, a nutrition major might inadvertently pick a disproportion-
ate number of graduates from health-related majors.
If someone was permitted to pick and choose exactly which graduates were included in the sample, it
is entirely possible that the sample would overrepresent that person’s interests, which may be entirely
unintentional. This introduces bias into a sample. Sampling randomly helps address this problem.
The most basic random sample is called a simple random sample and is equivalent to drawing
names out of a hat to select cases. This means that each case in the population has an equal chance
of being included and the cases in the sample are not related to each other.
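The "names out of a hat" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population of 100 graduate identifiers and the seed are made up, and the standard library's `random.sample` is used because it draws without replacement, giving every case the same chance of being included.

```python
import random

# Hypothetical population: 100 graduates, identified by number.
population = [f"graduate_{i}" for i in range(1, 101)]

rng = random.Random(2024)  # fixed seed so the draw is reproducible

# Simple random sample of 10 graduates, drawn without replacement.
sample = rng.sample(population, k=10)
print(sample)
```

Changing the seed changes which graduates are drawn, but no graduate is ever favored over another.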
The act of taking a simple random sample helps minimize bias. However, bias can crop up in other
ways. Even when people are picked at random, e.g., for surveys, caution must be exercised if the
non-response rate is high. For instance, if only 30% of the people randomly sampled for a survey
actually respond, then it is unclear whether the results are representative of the entire population.
This non-response bias can skew results.
Figure 2.3: Due to the possibility of non-response, survey studies may only reach a certain group within the
population. It is difficult, and oftentimes impossible, to completely fix this problem.
Another common pitfall is a convenience sample, where individuals who are easily accessible are
more likely to be included in the sample. For instance, if a political survey is done by stopping people
walking in the Bronx, this will not represent all of New York City. It is often difficult to discern what
sub-population a convenience sample represents.
GUIDED PRACTICE
We can easily access ratings for products, sellers, and companies through websites.
These ratings are based only on those people who go out of their way to provide a
rating. If 50% of online reviews for a product are negative, do you think this means
that 50% of buyers are dissatisfied with the product? Why?2
2 Answers will vary. From our own anecdotal experiences, we believe people tend to rant more about products that
fell below expectations than rave about those that perform as expected. For this reason, we suspect there is a negative
bias in product ratings on sites like Amazon. However, since our experiences may not be representative, we also keep
an open mind.
Figure 2.4: Examples of simple random and stratified sampling. In the top panel, simple random sampling
was used to randomly select the 18 cases (denoted in red). In the bottom panel, stratified sampling was used:
cases were first grouped into strata, then simple random sampling was employed to randomly select 3 cases
within each stratum.
EXAMPLE
Why would it be good for cases within each stratum to be very similar?
We might get a more stable estimate for the subpopulation in a stratum if the cases are very
similar, leading to more precise estimates within each group. When we combine these estimates
into a single estimate for the full population, that population estimate will tend to be more
precise since each individual group estimate is itself more precise.
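A stratified sample like the one in the bottom panel of Figure 2.4 might be drawn as in the following sketch. The data are hypothetical (60 graduates spread over six made-up majors that serve as strata), and `stratified_sample` is a helper written for this illustration, not a library function.

```python
import random
from collections import defaultdict

def stratified_sample(cases, stratum_of, per_stratum, rng):
    """Group cases into strata, then take a simple random sample in each."""
    strata = defaultdict(list)
    for case in cases:
        strata[stratum_of(case)].append(case)
    return {name: rng.sample(members, per_stratum)
            for name, members in strata.items()}

# Hypothetical data: 60 graduates spread evenly over 6 made-up majors.
majors = ["biology", "history", "math", "nursing", "physics", "art"]
graduates = [(f"graduate_{i}", majors[i % 6]) for i in range(60)]

rng = random.Random(1)
picked = stratified_sample(graduates, lambda g: g[1], per_stratum=3, rng=rng)
for major, members in picked.items():
    print(major, [name for name, _ in members])
```

Every stratum contributes exactly 3 cases, so no major can be left out of the sample by chance.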
In a cluster sample, we break up the population into many groups, called clusters. Then we sample
a fixed number of clusters and include all observations from each of those clusters in the sample. A
multistage sample is like a cluster sample, but rather than keeping all observations in each cluster,
we would collect a random sample within each selected cluster.
Figure 2.5: Examples of cluster and multistage sampling. In the top panel, cluster sampling was used: data
were binned into nine clusters, three of these clusters were sampled, and all observations within these three
clusters were included in the sample. In the bottom panel, multistage sampling was used, which differs from
cluster sampling only in that we randomly select a subset of each cluster to be included in the sample rather
than measuring every case in each sampled cluster.
Sometimes cluster or multistage sampling can be more economical than the alternative sampling
techniques. Also, unlike stratified sampling, these approaches are most helpful when there is a lot of
case-to-case variability within a cluster but the clusters themselves do not look very different from one
another. For example, if neighborhoods represented clusters, then cluster or multistage sampling works
best when the populations inside each neighborhood are very diverse. A downside of these methods
is that more advanced techniques are typically required to analyze the data, though the methods in
this book can be extended to handle such data.
EXAMPLE
Suppose we are interested in estimating the malaria rate in a densely tropical portion of rural
Indonesia. We learn that there are 30 villages in that part of the Indonesian jungle, each more
or less like the next, but the distances between the villages are substantial. We want to test
150 individuals for malaria. What sampling method should we use?
A simple random sample would likely draw individuals from all 30 villages, which could make
data collection expensive. Stratified sampling would be a challenge since it is unclear how we
would build strata of similar individuals. However, cluster sampling or multistage sampling
seem like very good ideas. With multistage sampling, we could randomly select half of the
villages, then randomly select 10 people from each. This could reduce data collection costs
substantially in comparison to a simple random sample, and the multistage sample would still yield
reliable information, even if we would need to analyze the data with more advanced methods
than those introduced in this book.
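The two stages of the malaria example can be sketched as follows. The 30 villages, the choice of 15 villages, and the 10 people per village come from the example; the village sizes (200 residents each), the identifiers, and the seed are assumptions made only so the code runs.

```python
import random

rng = random.Random(7)

# Hypothetical population: 30 villages with 200 residents each (sizes made up).
villages = {v: [f"village{v}_person{p}" for p in range(200)] for v in range(1, 31)}

# Stage 1: randomly select half of the villages (the clusters).
chosen_villages = rng.sample(sorted(villages), k=15)

# Stage 2: randomly select 10 residents within each chosen village.
sample = [person
          for v in chosen_villages
          for person in rng.sample(villages[v], k=10)]

print(len(chosen_villages), len(sample))  # 15 150
```

A cluster sample would instead keep every resident of the chosen villages, skipping the second stage.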
2.2 Experiments
Studies where the researchers assign treatments to cases are called experiments. When this assign-
ment includes randomization, e.g., using a coin flip to decide which treatment a patient receives, it
is called a randomized experiment. Randomized experiments are fundamentally important when
trying to show a causal connection between two variables.
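Random assignment is easy to carry out with software. The sketch below uses hypothetical patient identifiers and, instead of flipping a coin for each patient, shuffles the list and splits it in half. This is a common variant that guarantees equal group sizes, which independent coin flips do not.

```python
import random

def randomize(patients, rng):
    """Shuffle patients and split them into two equal-sized groups."""
    shuffled = patients[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"treatment": shuffled[:half], "control": shuffled[half:]}

patients = [f"patient_{i}" for i in range(1, 21)]  # hypothetical IDs
groups = randomize(patients, random.Random(42))
print(len(groups["treatment"]), len(groups["control"]))  # 10 10
```

Because chance alone decides who lands in which group, the researchers cannot steer particular patients toward the treatment, even unintentionally.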
Confounding variable.
A confounding variable is one that is associated with both the explanatory and
response variables. Because it is associated with both variables, it prevents the study
from concluding that the explanatory variable caused the response variable. Consider
a silly example with total ice-cream sales as the explanatory variable and number of
boating accidents as the response variable (which may seem highly correlated). Outside
temperature is associated with both variables, and therefore we cannot conclude that
high ice-cream sales is a cause of more boating accidents.
Confounding variables may or may not be measured as part of the study. Regardless,
drawing cause-and-effect conclusions is difficult in an observational study because of the
ever-present possibility of confounding variables.
3. Replication. The more cases researchers observe, the more accurately they can estimate the
effect of the explanatory variable on the response. In a single study, we replicate by collecting
a sufficiently large sample. What is considered sufficiently large varies from experiment to
experiment, but at a minimum we want to have multiple subjects (experimental units) per
treatment group. Another way of achieving replication is replicating an entire study to verify
an earlier finding. The term replication crisis refers to the ongoing methodological crisis
in which past findings from scientific studies in several disciplines have failed to be replicated.
Pseudoreplication occurs when individual observations under different treatments are heavily
dependent on each other. For example, suppose you have 50 subjects in an experiment where
you’re taking blood pressure measurements at 10 time points throughout the course of the
study. By the end, you will have 50 × 10 = 500 measurements. Reporting that you have 500
observations would be considered pseudoreplication, as the blood pressure measurements of a
given individual are not independent of each other. Pseudoreplication often happens when the
wrong entity is replicated, and the reported sample sizes are exaggerated.
4. Blocking. Researchers sometimes know or suspect that variables, other than the treatment,
influence the response. Under these circumstances, they may first group individuals based on
this variable into blocks and then randomize cases within each block to the treatment groups.
This strategy is often referred to as blocking. For instance, if we are looking at the effect of
a drug on heart attacks, we might first split patients in the study into low-risk and high-risk
blocks, then randomly assign half the patients from each block to the control group and the other
half to the treatment group, as shown in Figure 2.6. This strategy ensures that each treatment
group has the same number of low-risk patients and the same number of high-risk patients.
3 This is a different concept than a control group, which we discuss in the second principle and in Section 2.2.2.
4 Also called a lurking variable, confounding factor, or a confounder.
Figure 2.6: Blocking for patient risk. Patients are first divided into low-risk and high-risk blocks, then
patients in each block are evenly randomized into the treatment groups. This strategy ensures equal represen-
tation of patients in each treatment group from both risk categories.
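Blocking can be sketched as randomizing separately within each block. In the illustration below, the patient list (12 low-risk and 8 high-risk patients) is hypothetical, and `blocked_assignment` is a helper written for this example rather than an existing library routine.

```python
import random

def blocked_assignment(patients, block_of, rng):
    """Randomize to treatment/control separately within each block."""
    blocks = {}
    for p in patients:
        blocks.setdefault(block_of(p), []).append(p)
    assignment = {}
    for members in blocks.values():
        shuffled = members[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        for p in shuffled[:half]:
            assignment[p] = "treatment"
        for p in shuffled[half:]:
            assignment[p] = "control"
    return assignment

# Hypothetical patients: 12 low-risk and 8 high-risk.
patients = [(f"patient_{i}", "low" if i < 12 else "high") for i in range(20)]
assignment = blocked_assignment(patients, lambda p: p[1], random.Random(3))

for risk in ("low", "high"):
    counts = {g: sum(1 for p, grp in assignment.items() if p[1] == risk and grp == g)
              for g in ("treatment", "control")}
    print(risk, counts)
```

Whatever the seed, each group receives 6 low-risk and 4 high-risk patients, so risk level cannot differ between the treatment and control groups.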
It is important to incorporate the first three experimental design principles into any study, and this
book describes applicable methods for analyzing data from such experiments. Blocking is a slightly
more advanced technique, and statistical methods in this book may be extended to analyze data
collected using blocking.
Put yourself in the place of a person in the study. If you are in the treatment group, you are given a
fancy new drug that you anticipate will help you. On the other hand, a person in the other group does
not receive the drug and sits idly, hoping her participation does not increase her risk of death. These
perspectives suggest there are actually two effects in this study: the one of interest is the effectiveness
of the drug, and the second is an emotional effect of (not) taking the drug, which is difficult to quantify.
Researchers aren’t usually interested in the emotional effect, which might bias the study. To cir-
cumvent this problem, researchers do not want patients to know which group they are in. When
researchers keep the patients uninformed about their treatment, the study is said to be blind. But
there is one problem: if a patient does not receive a treatment, they will know they’re in the control
group. A solution to this problem is to give a fake treatment to patients in the control group. This is
called a placebo, and an effective placebo is the key to making a study truly blind. A classic example
of a placebo is a sugar pill that is made to look like the actual treatment pill. However, offering such
a fake treatment may not be ethical in certain experiments. For example, in medical experiments,
typically the control group must get the current standard of care. Oftentimes, a placebo results in a
slight but real improvement in patients. This effect has been dubbed the placebo effect.
The patients are not the only ones who should be blinded: doctors and researchers can unintentionally
bias a study. When a doctor knows a patient has been given the real treatment, they might inadver-
tently give that patient more attention or care than a patient that they know is on the placebo. To
guard against this bias, which again has been found to have a measurable effect in some instances,
most modern studies employ a double-blind setup where doctors or researchers who interact with
patients are, just like the patients, unaware of who is or is not receiving the treatment.6
GUIDED PRACTICE
Look back to the study in Section 1.1 where researchers were testing whether stents
were effective at reducing strokes in at-risk patients. Is this an experiment? Was the
study blinded? Was it double-blinded?7
GUIDED PRACTICE
For the study in Section 1.1, could the researchers have employed a placebo? If so, what
would that placebo have looked like?8
You may have many questions about the ethics of sham surgeries to create a placebo. These questions
may have even arisen in your mind in the general experiment context, where a possibly helpful
treatment was withheld from individuals in the control group; the main difference is that a sham
surgery tends to create additional risk, while withholding a treatment only maintains a person’s risk.
There are always multiple viewpoints of experiments and placebos, and rarely is it obvious which is
ethically “correct”. For instance, is it ethical to use a sham surgery when it creates a risk to the
patient? However, if we do not use sham surgeries, we may promote the use of a costly treatment
that has no real effect; if this happens, money and other resources will be diverted away from other
treatments that are known to be helpful. Ultimately, this is a difficult situation where we cannot
perfectly protect both the patients who have volunteered for the study and the patients who may
benefit (or not) from the treatment in the future.
6 There are always some researchers involved in the study who do know which patients are receiving which treatment.
However, they do not interact with the study’s patients and do not tell the blinded health care professionals who is
receiving which treatment.
7 The researchers assigned the patients into their treatment groups, so this study was an experiment. However, the
patients could distinguish what treatment they received because a stent is a surgical procedure. There is no equivalent
surgical placebo, so this study was not blind. The study could not be double-blind since it was not blind.
8 Ultimately, can we make patients think they were treated with surgery? In fact, we can, and some experiments
use a sham surgery. In a sham surgery, the patient does undergo surgery, but the patient does not receive the full
treatment, though they will still get a placebo effect.
2.3 Observational studies
Studies where no treatment has been explicitly applied (or explicitly withheld) are called observational
studies. For instance, studies on the loan data and county data described in Section 1.2
would both be considered observational, as they rely on observational data.
Making causal conclusions based on experiments is often reasonable, since we can randomly assign
the explanatory variable(s), i.e., the treatments. However, making the same causal conclusions based
on observational data can be treacherous and is not recommended. Thus, observational studies are
generally only sufficient to show associations or form hypotheses that can be later checked with
experiments.
Suppose an observational study tracked sunscreen use and skin cancer, and it was found that the
more sunscreen someone used, the more likely the person was to have skin cancer. Does this mean
sunscreen causes skin cancer?
No! Some previous research tells us that using sunscreen actually reduces skin cancer risk, so maybe
there is another variable that can explain this hypothetical association between sunscreen usage and
skin cancer, as shown in Figure 2.7. One important piece of information that is absent is sun exposure.
If someone is out in the sun all day, they are more likely to use sunscreen and more likely to get skin
cancer. Exposure to the sun is unaccounted for in the simple observational investigation.
Figure 2.7: Sun exposure may be the root cause of both sunscreen use and skin cancer.
In this example, sun exposure is a confounding variable. The presence of confounding variables is
what limits the ability of observational studies to support causal claims. While one method to justify
making causal conclusions from observational studies is to exhaust the search for confounding variables,
there is no guarantee that all confounding variables can be examined or measured.
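A small simulation can make the sunscreen example concrete. Everything numeric below is invented purely for illustration: sun exposure makes people more likely both to use sunscreen and to develop skin cancer, while sunscreen lowers the cancer risk at every level of exposure.

```python
import random

rng = random.Random(0)

# Made-up probabilities, chosen only to illustrate confounding.
P_SUNSCREEN = {"high": 0.8, "low": 0.2}          # P(uses sunscreen | sun exposure)
P_CANCER = {("high", True): 0.15, ("high", False): 0.20,
            ("low", True): 0.02, ("low", False): 0.05}  # sunscreen lowers risk

people = []
for _ in range(100_000):
    sun = "high" if rng.random() < 0.5 else "low"
    sunscreen = rng.random() < P_SUNSCREEN[sun]
    cancer = rng.random() < P_CANCER[(sun, sunscreen)]
    people.append((sun, sunscreen, cancer))

def cancer_rate(rows):
    """Proportion of rows with skin cancer."""
    return sum(cancer for _, _, cancer in rows) / len(rows)

def users(rows, flag):
    """Rows whose sunscreen indicator equals flag."""
    return [r for r in rows if r[1] == flag]

# Ignoring sun exposure, sunscreen users appear to be at higher risk ...
print("overall:", cancer_rate(users(people, True)), cancer_rate(users(people, False)))

# ... but within each level of sun exposure, users have the lower rate.
for level in ("high", "low"):
    rows = [r for r in people if r[0] == level]
    print(level, cancer_rate(users(rows, True)), cancer_rate(users(rows, False)))
```

The overall comparison points the wrong way because most sunscreen users are also heavily exposed to the sun. A real observational study could only make the within-level comparison if sun exposure had been measured.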
GUIDED PRACTICE
Figure 1.2 shows a negative association between the homeownership rate and the per-
centage of housing units that are in multi-unit structures in a county. However, it is
unreasonable to conclude that there is a causal relationship between the two variables.
Suggest a variable that might explain the negative relationship.9
Observational studies come in two forms: prospective and retrospective studies. A prospective study
identifies individuals and collects information as events unfold. For instance, medical researchers may
identify and follow a group of patients over many years to assess the possible influences of behavior on
cancer risk. One example of such a study is The Nurses’ Health Study. Started in 1976 and expanded
in 1989, the Nurses’ Health Study has collected data on over 275,000 nurses and is still enrolling
participants. This prospective study recruits registered nurses and then collects data from them using
questionnaires. Retrospective studies collect data after events have taken place, e.g., researchers
may review past events in medical records. Some datasets may contain both prospectively and
retrospectively collected variables, such as medical studies which gather information on participants’
lives before they enter the study and subsequently collect data on participants throughout the study.
9 Answers will vary. Population density may be important. If a county is very dense, then this may require a larger
percentage of residents to live in housing units that are in multi-unit structures. Additionally, the high density may
contribute to increases in property value, making homeownership unfeasible for many residents.
2.4.1 Summary
A proficient analyst will have a good sense of the types of data they are working with and how to
visualize the data in order to gain a complete understanding of the variables. Equally important,
however, is the data source. In this chapter, we have discussed randomized experiments and tak-
ing good, random, representative samples from a population. When we discuss inferential methods
(starting in Chapter 11), the conclusions that can be drawn will be dependent on how the data were
collected. Figure 2.8 summarizes how sampling and assignment methods relate to the scope of infer-
ence.10 Regularly revisiting Figure 2.8 will be important when making conclusions from a given data
analysis.
Figure 2.8: Analysis conclusions should be made carefully according to how the data were collected. Very
few datasets come from the top left box because ethics usually require that treatments be randomly
assigned only to volunteers. Both representative (ideally random) sampling and experiments (random
assignment of treatments) matter for the statistical conclusions that can be drawn about populations.
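The relationship between data collection and scope of inference can be sketched as a small lookup in Python. The function name and the wording of the labels are our own illustrative choices, and the sketch is a rule of thumb rather than a substitute for thinking carefully about a particular study.

```python
def scope_of_inference(random_sampling: bool, random_assignment: bool) -> dict:
    """Rule-of-thumb scope of inference for a study design.

    Random sampling supports generalizing to the population;
    random assignment supports causal conclusions.
    """
    return {
        "generalizable": random_sampling,
        "causal": random_assignment,
    }

def describe(random_sampling: bool, random_assignment: bool) -> str:
    """Turn the two yes/no answers into a short sentence."""
    scope = scope_of_inference(random_sampling, random_assignment)
    claim = "causal conclusions" if scope["causal"] else "associations only"
    reach = "the population" if scope["generalizable"] else "the sample only"
    return f"Supports {claim}, applying to {reach}."

# An observational study based on a random sample:
print(describe(random_sampling=True, random_assignment=False))
# An experiment run on volunteers:
print(describe(random_sampling=False, random_assignment=True))
```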
2.4.2 Terms
The terms introduced in this chapter are presented in Table 2.1. If you’re not sure what some of these
terms mean, we recommend you go back in the text and review their definitions. You should be able
to easily spot them as bolded text.
10 Derived from similar figures in Chance and Rossman (2018) and Ramsey and Schafer (2012).
2.5 Exercises
2. Sleeping in college. A recent article in a college newspaper stated that college students get
an average of 5.5 hrs of sleep each night. A student who was skeptical about this value decided
to conduct a survey by randomly sampling 25 students. On average, the sampled students
slept 6.25 hours per night. Identify which value represents the sample mean and which value
represents the claimed population mean.
3. Air pollution and birth outcomes, scope of inference. Researchers collected data to ex-
amine the relationship between air pollutants and preterm births in Southern California. During
the study air pollution levels were measured by air quality monitoring stations. Length of ges-
tation data were collected on 143,196 births between the years 1989 and 1993, and air pollution
exposure during gestation was calculated for each birth. (Ritz et al. 2000)
a. Identify the population of interest and the sample in this study.
b. Comment on whether the results of the study can be generalized to the population, and if
the findings of the study can be used to establish causal relationships.
4. Cheaters, scope of inference. Researchers studying the relationship between honesty, age
and self-control conducted an experiment on 160 children between the ages of 5 and 15. The
researchers asked each child to toss a fair coin in private and to record the outcome (white or
black) on a paper sheet, and said they would only reward children who report white. Half the
students were explicitly told not to cheat and the others were not given any explicit instructions.
Differences were observed in the cheating rates in the instruction and no instruction groups,
as well as some differences across children’s characteristics within each group. (Bucciol and
Piovesan 2011)
a. Identify the population of interest and the sample in this study.
b. Comment on whether the results of the study can be generalized to the population, and if
the findings of the study can be used to establish causal relationships.
7. Relaxing after work. The General Social Survey asked the question, “After an average work
day, about how many hours do you have to relax or pursue activities that you enjoy?” to
a random sample of 1,155 Americans. The average relaxing time was found to be 1.65 hours.
Determine which of the following is an observation, a variable, a sample statistic, or a population
parameter.11
a. An American in the sample.
b. Number of hours spent relaxing after an average work day.
c. 1.65.
d. Average number of hours all Americans spend relaxing after an average work day.
8. Cats on YouTube. Suppose you want to estimate the percentage of videos on YouTube that
are cat videos. It is impossible for you to watch all videos on YouTube so you use a random video
picker to select 1000 videos for you. You find that 2% of these videos are cat videos. Determine
which of the following is an observation, a variable, a sample statistic, or a population parameter.
a. Percentage of all videos on YouTube that are cat videos.
b. 2%.
c. A video in your sample.
d. Whether a video is a cat video.
9. Course satisfaction across sections. A large college class has 160 students. All 160 students
attend the lectures together, but the students are divided into 4 groups, each of 40 students,
for lab sections administered by different teaching assistants. The professor wants to conduct a
survey about how satisfied the students are with the course, and he believes that the lab section
a student is in might affect the student’s overall satisfaction with the course.
a. What type of study is this?
b. Suggest a sampling strategy for carrying out this study.
10. Housing proposal across dorms. On a large college campus first-year students and sopho-
mores live in dorms located on the eastern part of the campus and juniors and seniors live in
dorms located on the western part of the campus. Suppose you want to collect student opinions
on a new housing structure the college administration is proposing and you want to make sure
your survey equally represents opinions from students from all years.
a. What type of study is this?
b. Suggest a sampling strategy for carrying out this study.
11 The data used in this exercise comes from the General Social Survey, 2018.
11. Internet use and life expectancy. The following scatterplot was created as part of a study
evaluating the relationship between estimated life expectancy at birth (as of 2014) and percent-
age of internet users (as of 2009) in 208 countries for which such data were available.12
a. Describe the relationship between life expectancy and percentage of internet users.
b. What type of study is this?
c. State a possible confounding variable that might explain this relationship and describe its
potential effect.
12. Stressed out. A study that surveyed a random sample of otherwise healthy high school students
found that they are more likely to get muscle cramps when they are stressed. The study also
noted that students drink more coffee and sleep less when they are stressed.
a. What type of study is this?
b. Can this study be used to conclude a causal relationship between increased stress and
muscle cramps?
c. State possible confounding variables that might explain the observed relationship between
increased stress and muscle cramps.
13. Evaluate sampling methods. A university wants to determine what fraction of its undergraduate
student body supports a new $25 annual fee to improve the student union. For each
proposed method below, indicate whether the method is reasonable or not.
a. Survey a simple random sample of 500 students.
b. Stratify students by their field of study, then sample 10% of students from each stratum.
c. Cluster students by their ages (e.g., 18 years old in one cluster, 19 years old in one cluster,
etc.), then randomly sample three clusters and survey all students in those clusters.
14. Random digit dialing. The Gallup Poll uses a procedure called random digit dialing, which
creates phone numbers based on a list of all area codes in America in conjunction with the
associated number of residential households in each area code. Give a possible reason the Gallup
Poll chooses to use random digit dialing instead of picking phone numbers from the phone book.
12 The cia_factbook data used in this exercise can be found in the openintro R package.
15. Haters are gonna hate, study confirms. A study published in the Journal of Personality
and Social Psychology asked a group of 200 randomly sampled participants recruited online
using Amazon’s Mechanical Turk to evaluate how they felt about various subjects, such as
camping, health care, architecture, taxidermy, crossword puzzles, and Japan in order to measure
their attitude towards mostly independent stimuli. Then, they presented the participants with
information about a new product: a microwave oven. This microwave oven does not exist, but
the participants didn’t know this, and were given three positive and three negative fake reviews.
People who reacted positively to the subjects on the dispositional attitude measurement also
tended to react positively to the microwave oven, and those who reacted negatively tended to
react negatively to it. Researchers concluded that “some people tend to like things, whereas
others tend to dislike things, and a more thorough understanding of this tendency will lead to a
more thorough understanding of the psychology of attitudes.” (Hepler and Albarracı́n 2013)
a. What are the cases?
b. What is (are) the response variable(s) in this study?
c. What is (are) the explanatory variable(s) in this study?
d. Does the study employ random sampling? Explain your reasoning.
e. Is this an observational study or an experiment? Explain your reasoning.
f. Can we establish a causal link between the explanatory and response variables?
g. Can the results of the study be generalized to the population at large?
16. Reading the paper. Below are excerpts from two articles published in the NY Times:
a. An excerpt from an article titled Risks: Smokers Found More Prone to Dementia is below.
Based on this study, can we conclude that smoking causes dementia later in life? Explain
your reasoning. (Rabin 2010)
“Researchers analyzed data from 23,123 health plan members who participated in a
voluntary exam and health behavior survey from 1978 to 1985, when they were 50-
60 years old. 23 years later, about 25% of the group had dementia, including 1,136
with Alzheimer’s disease and 416 with vascular dementia. After adjusting for other
factors, the researchers concluded that pack-a-day smokers were 37% more likely than
nonsmokers to develop dementia, and the risks went up with increased smoking; 44%
for one to two packs a day; and twice the risk for more than two packs.”
b. An excerpt from an article titled The School Bully Is Sleepy is below. A friend of yours
who read the article says, “The study shows that sleep disorders lead to bullying in school
children.” Is this statement justified? If not, how best can you describe the conclusion that
can be drawn from this study? (Parker-Pope 2011)
“The University of Michigan study, collected survey data from parents on each child’s
sleep habits and asked both parents and teachers to assess behavioral concerns. About
a third of the students studied were identified by parents or teachers as having prob-
lems with disruptive behavior or bullying. The researchers found that children who
had behavioral issues and those who were identified as bullies were twice as likely to
have shown symptoms of sleep disorders.”
17. Sampling strategies. A statistics student who is curious about the relationship between the
amount of time students spend on social networking sites and their performance at school decides
to conduct a survey. Various research strategies for collecting data are described below. In each,
name the sampling method proposed and any bias you might expect.
a. They randomly sample 40 students from the study’s population, give them the survey, ask
them to fill it out, and bring it back the next day.
b. They give out the survey only to their friends, making sure each one of them fills it out.
c. They post a link to an online survey on Facebook and ask their friends to fill it out.
d. They randomly sample 5 classes and ask a random sample of students from those classes
to fill out the survey.
18. Family size. Suppose we want to estimate household size, where a “household” is defined as
people living together in the same dwelling, and sharing living accommodations. If we select
students at random at an elementary school and ask them what their family size is, will this be
a good measure of household size? Or will our average be biased? If so, will it overestimate or
underestimate the true value?
19. Light and exam performance. A study is designed to test the effect of light level on exam
performance of students. The researcher believes that light levels might have different effects on
people who wear glasses and people who don’t, so they want to make sure both groups of people
are equally represented in each treatment. The treatments are fluorescent overhead lighting,
yellow overhead lighting, no overhead lighting (only desk lamps).
a. What is the response variable?
b. What is the explanatory variable? What are its levels?
c. What is the blocking variable? What are its levels?
20. Vitamin supplements. To assess the effectiveness of taking large doses of vitamin C in
reducing the duration of the common cold, researchers recruited 400 healthy volunteers from
staff and students at a university. A quarter of the patients were assigned a placebo, and the
rest were evenly divided between 1g Vitamin C, 3g Vitamin C, or 3g Vitamin C plus additives
to be taken at onset of a cold for the following two days. All tablets had identical appearance
and packaging. The nurses who handed the prescribed pills to the patients knew which patient
received which treatment, but the researchers assessing the patients when they were sick did
not. No statistically discernible differences were observed in any measure of cold duration or
severity between the four groups, and the placebo group had the shortest duration of symptoms.
(Audera et al. 2001)
a. Was this an experiment or an observational study? Why?
b. What are the explanatory and response variables in this study?
c. Were the patients blinded to their treatment?
d. Was this study double-blind?
e. Participants are ultimately able to choose whether to use the pills prescribed to them. We
might expect that not all of them will adhere and take their pills. Does this introduce a
confounding variable to the study? Explain your reasoning.
21. Light, noise, and exam performance. A study is designed to test the effect of light level
and noise level on exam performance of students. The researcher believes that light and noise
levels might have different effects on people who wear glasses and people who don’t, so they
want to make sure both groups of people are equally represented in each treatment. The light
treatments considered are fluorescent overhead lighting, yellow overhead lighting, no overhead
lighting (only desk lamps). The noise treatments considered are no noise, construction noise,
and human chatter noise.
a. What type of study is this?
b. How many factors are considered in this study? Identify them, and describe their levels.
c. What is the role of the wearing glasses variable in this study?
22. Music and learning. You would like to conduct an experiment in class to see if students learn
better if they study without any music, with music that has no lyrics (instrumental), or with
music that has lyrics. Briefly outline a design for this study.
23. Soda preference. You would like to conduct an experiment in class to see if your classmates
prefer the taste of regular Coke or Diet Coke. Briefly outline a design for this study.
24. Exercise and mental health. A researcher is interested in the effects of exercise on mental
health and they propose the following study: use stratified random sampling to ensure representative
proportions of 18-30, 31-40 and 41-55 year-olds from the population. Next, randomly
assign half the subjects from each age group to exercise twice a week, and instruct the rest not
to exercise. Conduct a mental health exam at the beginning and at the end of the study, and
compare the results.
a. What type of study is this?
b. What are the treatment and control groups in this study?
c. Does this study make use of blocking? If so, what is the blocking variable?
d. Does this study make use of blinding?
e. Comment on whether the results of the study can be used to establish a causal relationship
between exercise and mental health, and indicate whether the conclusions can be generalized
to the population at large.
f. Suppose you are given the task of determining if this proposed study should get funding.
Would you have any reservations about the study proposal?
25. Chia seeds and weight loss. Chia Pets – those terra-cotta figurines that sprout fuzzy green
hair – made the chia plant a household name. But chia has since gained a reputation as a diet
supplement. In one 2009 study, 38 men and 38 women were recruited, and the men and the women were
each divided randomly into two groups: treatment or control. One group was given 25 grams of chia seeds
twice a day, and the other was given a placebo. The subjects volunteered to be a part of the
study. After 12 weeks, the scientists found no statistically discernible difference between the
groups in appetite or weight loss. (Nieman et al. 2009)
a. What type of study is this?
b. What are the experimental and control treatments in this study?
c. Has blocking been used in this study? If so, what is the blocking variable?
d. Has blinding been used in this study?
e. Comment on whether we can make a causal statement, and indicate whether we can gen-
eralize the conclusion to the population at large.
26. City council survey. A city council has requested a household survey be conducted in a
suburban area of their city. The area is broken into many distinct and unique neighborhoods,
some including large homes, some with only apartments, and others a diverse mixture of housing
structures. For each part below, identify the sampling methods described, and describe the
statistical pros and cons of the method in the city’s context.
a. Randomly sample 200 households from the city.
b. Divide the city into 20 neighborhoods, and then sample 10 households from each neighbor-
hood.
c. Divide the city into 20 neighborhoods, randomly sample 3 neighborhoods, and then sample
all households from those 3 neighborhoods.
d. Divide the city into 20 neighborhoods, randomly sample 8 neighborhoods, and then ran-
domly sample 50 households from those neighborhoods.
e. Sample the 200 households closest to the city council offices.
27. Flawed reasoning. Identify the flaw(s) in reasoning in the following scenarios. Explain what
the individuals in the study should have done differently if they wanted to make such strong
conclusions.
a. Students at an elementary school are given a questionnaire that they are asked to return
after their parents have completed it. One of the questions asked is, “Do you find that your
work schedule makes it difficult for you to spend time with your kids after school?” Of the
parents who replied, 85% said “no”. Based on these results, the school officials conclude
that a great majority of the parents have no difficulty spending time with their kids after
school.
b. A survey is conducted on a simple random sample of 1,000 women who recently gave birth,
asking them about whether they smoked during pregnancy. A follow-up survey asking if
the children have respiratory problems is conducted 3 years later. However, only 567 of
these women are reached at the same address. The researcher reports that these 567 women
are representative of all mothers.
c. An orthopedist administers a questionnaire to 30 of his patients who do not have any
joint problems and finds that 20 of them regularly go running. He concludes that running
decreases the risk of joint problems.
28. Income and education in US counties. The scatterplot below shows the relationship be-
tween per capita income (in thousands of dollars) and percent of population with a bachelor’s
degree in 3,142 counties in the US in 2019.13
13 The county_complete data used in this exercise can be found in the openintro R package.
29. Eat well, feel better. In a public health study on the effects of consumption of fruits and
vegetables on psychological well-being in young adults, participants were randomly assigned to
three groups: (1) diet-as-usual, (2) an ecological momentary intervention involving text message
reminders to increase their fruits and vegetable consumption plus a voucher to purchase them,
or (3) a fruit and vegetable intervention in which participants were given two additional daily
servings of fresh fruits and vegetables to consume on top of their normal diet. Participants were
asked to take a nightly survey on their smartphones. Participants were student volunteers at
the University of Otago, New Zealand. At the end of the 14-day study, only participants in the
third group showed improvements to their psychological well-being across the 14 days relative
to the other groups. (Conner et al. 2017)
a. What type of study is this?
b. Identify the explanatory and response variables.
c. Comment on whether the results of the study can be generalized to the population.
d. Comment on whether the results of the study can be used to establish causal relationships.
e. A newspaper article reporting on the study states, “The results of this study provide proof
that giving young adults fresh fruits and vegetables to eat can have psychological benefits,
even over a brief period of time.” How would you suggest revising this statement so that it
can be supported by the study?
30. Screens, teens, and psychological well-being. In a study of three nationally representative
large-scale datasets from Ireland, the United States, and the United Kingdom (n = 17,247),
teenagers between the ages of 12 and 15 were asked to keep a diary of their screen time and
answer questions about how they felt or acted. The answers to these questions were then used
to compute a psychological well-being score. Additional data were collected and included in the
analysis, such as each child’s sex and age, and on the mother’s education, ethnicity, psychological
distress, and employment. The study concluded that there is little clear-cut evidence that screen
time decreases adolescent well-being. (Orben and Baukney-Przybylski 2018)
a. What type of study is this?
b. Identify the explanatory variables.
c. Identify the response variable.
d. Comment on whether the results of the study can be generalized to the population, and
why.
e. Comment on whether the results of the study can be used to establish causal relationships.
Chapter 3
Applications: Data
While many of you may be glued to the Olympic Games every four years (or every two years if you
fancy both summer and winter sports), the Paralympic Games are less popular than the Olympic
Games, even if they hold the same competitive thrills.
The Paralympic Games began as a way to help soldiers who had been wounded in World War II
rehabilitate. The first Paralympic Games were held in Rome, Italy in 1960.
Since 1988 (Seoul, South Korea), the Paralympic Games have been held a few weeks later than the
Olympic Games in the same city, in both the summer and winter.
In this case study we introduce a dataset comparing Olympic and Paralympic gold medal finishers
in the 1500m running competition (the Olympic “mile”, if a bit shorter than a full mile). The goal
of the case study is to walk you through what a data scientist does when they first get a hold of
a dataset. We also provide some “foreshadowing” of concepts and techniques we’ll introduce in the
next few chapters on exploratory data analysis. Last, we introduce Simpson’s paradox and discuss
the importance of understanding the impact of multiple variables in an analysis.
Table 3.1 shows the last five rows from the dataset, which are the five most recent 1500m races. Notice
that there are racers from both the Men’s and Women’s divisions as well as those of varying visual
impairment (T11, T12, T13, and Olympic). The T11 athletes have almost complete visual impairment,
run with a black-out blindfold, and are allowed to run with a guide-runner. T12 and T13 athletes
have some visual impairment, and the visual acuity of Olympic runners is not determined.
When you encounter a new dataset, taking a peek at the last few rows as we did in Table 3.1 should
be instinctual. It can be helpful to look at the first few rows of the data as well to get a sense of other
aspects of the data which may not be apparent in the last few rows. Table 3.2 shows the top five rows
of the paralympic_1500 dataset, which reveals that for at least the first five Olympiads, there were
no runners in the Women’s division or in the Paralympics.
At this stage it’s also useful to think about how the data were collected, as that will inform the scope
of any inference you can make based on your analysis of the data.
GUIDED PRACTICE
Do these data come from an observational study or an experiment?1
GUIDED PRACTICE
There are 82 rows and 9 columns in the dataset. What does each row and each column
represent?2
1 This is an observational study. Researchers collected data on past gold medal race times in both Olympic and
Paralympic Games.
2 Each row represents a 1500m gold medal race and each column represents a variable containing information on
each race.
Once you’ve identified the rows and columns, it’s useful to review the data dictionary to learn about
what each column in the dataset represents. The data dictionary is provided in Table 3.3.
Table 3.3: Variables and their descriptions for the paralympic_1500 dataset.
Variable             Description
year                 Year the Games took place.
city                 City of the Games.
country_of_games     Country of the Games.
division             Division: 'Men' or 'Women'.
type                 Type: 'Olympic', 'T11', 'T12', or 'T13'.
name                 Name of the athlete.
country_of_athlete   Country of athlete.
time                 Time of gold medal race, in m:s.
time_min             Time of gold medal race, in decimal minutes (min + sec/60).
We now have a better sense of what each column represents, but we do not yet know much about the
characteristics of each of the variables.
EXAMPLE
Determine whether each variable in the paralympic_1500 dataset is numerical or categori-
cal. For numerical variables, further classify them as continuous or discrete. For categorical
variables, determine if the variable is ordinal.
The numerical variables in the dataset are year (discrete) and time_min (continuous). The
categorical variables are city, country_of_games, division, type, name, and country_of_athlete.
The time variable is trickier to classify – we can think of it as numerical, but it is
classified as categorical. The categorical classification is due to the colon : which separates the
minutes from the seconds. Sometimes the data dictionary (presented in Table 3.3) isn’t sufficient
for a complete analysis, and we need to go back to the data source and try to understand the
data better before we can proceed with the analysis meaningfully.
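The relationship between time and time_min described in Table 3.3 (min + sec/60) can be sketched in a few lines of Python. The function name to_decimal_minutes is our own, and the two race times are the Rio 2016 gold medal times quoted later in this case study. The sketch also shows why text-based times are risky to compare directly.

```python
def to_decimal_minutes(time_str: str) -> float:
    """Convert a time written as 'm:s' (e.g., '3:48.29') to decimal minutes."""
    minutes, seconds = time_str.split(":")
    return int(minutes) + float(seconds) / 60

t13 = to_decimal_minutes("3:48.29")      # T13 gold medal time, Rio 2016
olympic = to_decimal_minutes("3:50.00")  # Olympic gold medal time, Rio 2016
print(t13 < olympic)  # the T13 time is the faster of the two

# Comparing the raw text can mislead: as text, '10:00.00' sorts before
# '9:59.00' because the character '1' comes before '9'.
print("10:00.00" < "9:59.00")
print(to_decimal_minutes("10:00.00") < to_decimal_minutes("9:59.00"))
```

Storing the time as a number, as time_min does, lets us compute means, minimums, and maximums without any further parsing.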
Next, let’s try to get to know each variable a little bit better. For categorical variables, this involves
figuring out what their levels are and how commonly represented they are in the data. Figure 3.1
shows the distributions of two of the categorical variables in this dataset. We can see that the United
States has hosted the Games most often, but runners from Great Britain and Kenya have won the
1500m most often. There are a large number of countries that have had a single gold medal winner
of the 1500m. Similarly, there are a large number of countries that have hosted the Games only once.
Over the last century, the name describing the country for athletes from one particular region has
changed and includes Russian Federation, Unified Team, and Russian Paralympic Committee. Both
of the visualizations are bar plots, which you will learn more about in Chapter 4.
Similarly, we can examine the distributions of the numerical variables as well. We already know that
the 1500m times are mostly between 3.5min and 4.5min, based on Table 3.1 and Table 3.2. We
can break down the 1500m time by division and type of race. Table 3.4 shows the mean, minimum,
and maximum 1500m times broken down by division and race type. Recall that the Men’s Olympic
division has taken place since 1896, whereas the Men’s Paralympic division has happened only since
1960. The maximum race time, therefore, should be interpreted in the context of the year of the
Games.
Figure 3.1: Bar plots of (a) the country of origin of the athlete and (b) the country in which the Games took place.
Table 3.4: Mean, minimum, and maximum of the gold medal times for the 1500m race broken down by
division and type of race.
Fun fact! Sometimes playing around with the dataset will uncover interesting elements about the
context in which the data were collected. A scatterplot of the Men’s 1500m broken down by race
type shows that, in each given year, the Olympic runner is substantially faster than the Paralympic
runners, with one exception. In the Rio de Janeiro 2016 Games, the T13 gold medal athlete ran
faster (3:48.29) than the Olympic gold medal athlete (3:50.00) (see Figure 3.2). In fact, some internet
sleuthing tells you that the top four T13 finishers all finished the 1500m under 3:50.00!
Figure 3.2: 1500m race time for Men’s Olympic and Paralympic athletes. The dashed grey line represents the
Rio Games in 2016.
3.2 Simpson’s paradox
So far we have examined aspects of some of the individual variables, and we have broken down the 1500m
race times in terms of division and race type. You might have already wondered how the race times
vary across years. The paralympic_1500 dataset lets us explore an important
statistical concept, Simpson’s paradox.
Simpson’s paradox is a description of three (or more) variables. The paradox happens when a third
variable reverses the relationship between the first two variables.
Let’s start by considering how the 1500m gold medal race times have changed over year. Figure 3.3
shows a scatterplot describing 1500m race times and year for Men’s Olympic and Paralympic (T11)
athletes with a line of best fit (to the entire dataset) superimposed (see Chapter 7 where we will
present fitting a line to a scatterplot). Notice that the line of best fit shows a positive relationship
between race time and year. That is, for later years, the predicted gold medal time is higher than in
earlier years.
Figure 3.3: 1500m race time for Men’s Olympic and Paralympic (T11) athletes. The line represents a line
of best fit to the entire dataset.
Of course, both your eye and your intuition are likely telling you that it wouldn’t make any sense
to try to model all of the athletes together. Instead, a separate model should be run for each of
the two types of Games: Olympic and Paralympic (T11). Figure 3.4 shows a scatterplot describing
1500m race times and year for Men’s Olympic and Paralympic (T11) athletes with a line of best fit
superimposed separately for each of the two types of races. Notice that within each type of race, the
relationship between 1500m race time and year is now negative.
Figure 3.4: 1500m race time for Men’s Olympic and Paralympic (T11) athletes. The best fit line is now fit
separately to the Olympic and Paralympic athletes.
Simpson’s paradox.
Simpson’s paradox was seen in the 1500m race data because the aggregate data showed a positive
relationship (positive slope) between year and race time but a negative relationship (negative slope)
between year and race time when broken down by the type of race.
Simpson’s paradox is observed with categorical data and with numeric data. Often the paradox
happens because the third variable (here, race type) is imbalanced. There are either more observations
in one group or the observations happen at different intervals across the two groups. In the 1500m
data, we saw that the T11 runners had fewer observations and their times were both generally slower
and more recent than the Olympic runners.
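The sign reversal can be reproduced with a few lines of Python. This is only a minimal sketch: the years and times below are invented for illustration (they are not the actual race times), and `slope` is a hypothetical helper that computes the least-squares slope of a fitted line.

```python
def slope(xs, ys):
    """Least-squares slope of the line predicting y from x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Invented data (NOT real race times). Group A has more observations,
# starts earlier, and has lower values; group B is smaller, more recent,
# and has higher values. Within each group the values fall over time.
years_a = [1900, 1920, 1940, 1960, 1980, 2000, 2020]
times_a = [240, 236, 232, 228, 224, 220, 216]
years_b = [1980, 1990, 2000, 2010, 2020]
times_b = [260, 258, 256, 254, 252]

slope_a = slope(years_a, times_a)
slope_b = slope(years_b, times_b)
slope_all = slope(years_a + years_b, times_a + times_b)

print("Group A slope:", slope_a)
print("Group B slope:", slope_b)
print("Aggregate slope:", slope_all)
```

Both within-group slopes are negative, yet the aggregate slope is positive, because the group with higher values contributes only recent observations.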
In the 1500m analysis, it would be most prudent to report the trends separately for the Olympic and
the T11 athletes. However, in other situations, it might be better to aggregate the data and report
the overall trend. Many additional examples of Simpson’s paradox and a further exploration are given
in Witmer (2021).
In this case study, we introduced you to the very first steps a data scientist takes when they start
working with a new dataset. In the next few chapters, we will introduce exploratory data analysis,
and you’ll learn more about the various types of data visualizations and summary statistics you can
make to get to know your data better.
Before you move on, we encourage you to think about whether the following questions can be answered
with this dataset and, if so, how you might go about answering them. It’s okay if your answer is
“I’m not sure”; we simply want to get your exploratory juices flowing to prime you for what’s to come!
1. Has there ever been a year when a visually impaired Paralympic gold medal athlete beat the
Olympic gold medal athlete?
2. When comparing the Paralympic and Olympic 1500m gold medal athletes, does Simpson’s paradox
hold in the Women’s division?
3. Is there a biological boundary which establishes a time under which no human could run 1500m?
3.3 Interactive R tutorials
Navigate the concepts you’ve learned in this part in R using the following self-paced tutorials. All
you need is your browser to get started!
Tutorial 1: Introduction to data
https://openintrostat.github.io/ims-tutorials/01-data
Tutorial 1 - Lesson 1: Language of data
https://openintro.shinyapps.io/ims-01-data-01
Tutorial 1 - Lesson 2: Types of studies
https://openintro.shinyapps.io/ims-01-data-02
Tutorial 1 - Lesson 3: Sampling strategies and experimental design
https://openintro.shinyapps.io/ims-01-data-03
Tutorial 1 - Lesson 4: Case study
https://openintro.shinyapps.io/ims-01-data-04
You can also access the full list of tutorials supporting this book at
https://openintrostat.github.io/ims-tutorials.
3.4 R labs
Further apply the concepts you’ve learned in this part in R with computational labs that walk you
through a data analysis case study.
Intro to R - Birth rates
https://www.openintro.org/go?id=ims-r-lab-intro-to-r
You can also access the full list of labs supporting this book at
https://www.openintro.org/go?id=ims-r-labs.
PART II
After obtaining a dataset, it is vitally important to understand the characteristics of the existing data.
Sometimes the most effective way to grasp the data is through summary statistics or other numerical
measures. Often, however, it is a picture that tells a thousand words. Knowing how to best convey
the underlying meaning in a dataset is a hugely important aspect of communicating results.
• Categorical data is the focus of Chapter 4. Both numerical and graphical summaries are pre-
sented as ways to convey information about categorical data.
• Numerical data is the focus of Chapter 5. Both numerical and graphical summaries are presented
as ways to convey information about numerical data.
• Chapter 6 does a deep dive into important considerations when creating a visualization.
While our book is software agnostic, one of the best ways to become familiar with numerical and
graphical summaries is to practice working with different datasets using statistical software. For
example, if you are interested in using R, you might try working through some of the chapters in R
for Data Science (https://r4ds.hadley.nz), specifically the parts Whole game and Visualize.
Chapter 4
In this chapter we will work with data on loans from Lending Club that you’ve previously seen in
Chapter 1. The loan50 dataset from Chapter 1 represents a sample from a larger loan dataset called
loans. This larger dataset contains information on 10,000 loans made through Lending Club. We
will examine the relationship between homeownership, which for the loans data can take a value of
rent, mortgage (owns but has a mortgage), or own, and application_type, which indicates whether the loan
application was made with a partner or whether it was an individual application.
Table 4.1 summarizes two variables: application_type and homeownership. Note that loans from
Lending Club are typically for small items or for cash, not for homes. The individuals in the dataset
are taking out loans for their personal use, and we categorize them based on their homeownership
status (which is unrelated to the purpose of the loan). A table that summarizes data for two categorical
variables in this way is called a contingency table. Each value in the table represents the number
of times a particular combination of variable outcomes occurred.
For example, the value 3496 corresponds to the number of loans in the dataset where the borrower
rents their home and the application type was by an individual. Row and column totals are also
included. The row totals provide the total counts across each row and the column totals down
each column. We can also create a table that shows only the overall percentages or proportions for
each combination of categories, or we can create a table for a single variable, such as the one shown
in Table 4.2 for the homeownership variable.
4.2. VISUALIZING TWO CATEGORICAL VARIABLES 57
Table 4.1: A contingency table for application type and homeownership.
                            homeownership
application_type    rent    mortgage     own    Total
joint                362         950     183     1495
individual          3496        3839    1170     8505
Total               3858        4789    1353    10000
Table 4.2: A table summarizing the frequencies for each value of the homeownership variable – mortgage,
own, and rent.
homeownership Count
rent 3858
mortgage 4789
own 1353
Total 10000
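The row and column totals in a contingency table follow directly from the cell counts. The minimal Python sketch below uses the counts from Table 4.1; the names `counts`, `row_totals`, and `col_totals` are our own choices for illustration.

```python
# Cell counts from Table 4.1: application type (rows) by homeownership (columns).
counts = {
    "joint": {"rent": 362, "mortgage": 950, "own": 183},
    "individual": {"rent": 3496, "mortgage": 3839, "own": 1170},
}
levels = ["rent", "mortgage", "own"]

# Row totals: sum across each row.
row_totals = {app: sum(row.values()) for app, row in counts.items()}

# Column totals: sum down each column. These are the Table 4.2 frequencies.
col_totals = {h: sum(counts[app][h] for app in counts) for h in levels}

grand_total = sum(row_totals.values())

print(row_totals)
print(col_totals)
print(grand_total)
```

The column totals reproduce the single-variable table for homeownership, which is why a one-variable frequency table can always be recovered from a two-variable contingency table.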
A bar plot is a common way to display a single categorical variable. Figure 4.1a displays a bar plot
of the homeownership variable. In Figure 4.1b the counts are converted into proportions, showing the
proportion of observations that are in each level.
Figure 4.2: Three bar plots displaying homeownership and application type variables.
EXAMPLE
Examine the three bar plots in Figure 4.2. When is the stacked, dodged, or standardized bar
plot the most useful?
The stacked bar plot is most useful when it’s reasonable to assign one variable as the explanatory
variable (here homeownership) and the other variable as the response (here application_type)
since we are effectively grouping by one variable first and then breaking it down by the
other.
Dodged bar plots are more agnostic in their display about which variable, if any, represents the
explanatory and which the response variable. It is also easy to discern the number of cases in
each of the six different group combinations. However, one downside is that it tends to require
more horizontal space; the narrowness of Plot B compared to the other two in Figure 4.2 makes
the plot feel a bit cramped. Additionally, when two groups are of very different sizes, as we
see in the group own relative to either of the other two groups, it is difficult to discern if there
is an association between the variables.
The standardized stacked bar plot is helpful if the primary variable in the stacked bar plot is
relatively imbalanced, e.g., the own category has only about a third as many observations as the
mortgage category, making the simple stacked bar plot less useful for checking for an association. The major
downside of the standardized version is that we lose all sense of how many cases each of the
bars represents.
Figure 4.3b displays the relationship between homeownership and application type. Each column is
split proportionally to the number of loans from individual and joint borrowers. For example, the
second column represents loans where the borrower has a mortgage, and it was divided into individual
loans (upper) and joint loans (lower). As another example, the bottom segment of the third column
represents loans where the borrower owns their home and applied jointly, while the upper segment
of this column represents borrowers who are homeowners and filed individually. We can again use
this plot to see that the homeownership and application_type variables are associated, since some
columns are divided in different vertical locations than others, which was the same technique used for
checking an association in the standardized stacked bar plot.
Figure 4.3: Two mosaic plots, one for homeownership alone and the other displaying the relationship between
homeownership and application type.
In Figure 4.3, we chose to first split by the homeowner status of the borrower. However, we could have
instead first split by the application type, as in Figure 4.4. As with the bar plots, it’s common to
use the explanatory variable to represent the first split in a mosaic plot, and then for the response to
break up each level of the explanatory variable if these labels are reasonable to attach to the variables
under consideration.
Figure 4.4: Mosaic plot where loans are grouped by homeownership after they have been divided into
individual and joint application types.
60 CHAPTER 4. EXPLORING CATEGORICAL DATA
In the previous sections we inspected visualizations of two categorical variables in bar plots and
mosaic plots. However, we have not discussed how the values in the bar and mosaic plots that show
proportions are calculated. In this section we will investigate the fractional breakdown of one variable
by another, and we can modify our contingency table to provide such a view. Table 4.3 shows row
proportions for Table 4.1, which are computed as the counts divided by their row totals. The value
3496 at the intersection of individual and rent is replaced by 3496/8505 = 0.411, i.e., 3496 divided
by its row total, 8505. So, what does 0.411 represent? It corresponds to the proportion of individual
applicants who rent.
Table 4.3: A contingency table with row proportions for application type and homeownership.
                            homeownership
application_type    rent    mortgage     own    Total
joint              0.242       0.635   0.122        1
individual         0.411       0.451   0.138        1
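As a quick check of the arithmetic, the row proportions can be recomputed from the Table 4.1 counts. This is a minimal Python sketch; the variable names are our own.

```python
# Cell counts from Table 4.1.
counts = {
    "joint": {"rent": 362, "mortgage": 950, "own": 183},
    "individual": {"rent": 3496, "mortgage": 3839, "own": 1170},
}

# Row proportions: each count divided by the total of its row.
row_props = {}
for app, row in counts.items():
    row_total = sum(row.values())
    row_props[app] = {h: n / row_total for h, n in row.items()}

for app, row in row_props.items():
    print(app, {h: round(p, 3) for h, p in row.items()})
```

Each row of proportions sums to 1, so the values describe how applicants of one application type are spread across the homeownership levels.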
A contingency table of the column proportions is computed in a similar way, where each is computed
as the count divided by the corresponding column total. Table 4.4 shows such a table, and here the
value 0.906 indicates that 90.6% of renters applied as individuals for the loan. This rate is higher
compared to loans from people with mortgages (80.2%) or who own their home (86.5%). Because
these rates vary between the three levels of homeownership (rent, mortgage, own), this provides
evidence that the application_type and homeownership variables may be associated.
Table 4.4: A contingency table with column proportions for application type and homeownership.
                            homeownership
application_type    rent    mortgage     own
joint              0.094       0.198   0.135
individual         0.906       0.802   0.865
Total              1.000       1.000   1.000
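The column proportions can be recomputed the same way, dividing by column totals instead of row totals. Again, this is a minimal Python sketch with variable names of our own choosing.

```python
# Cell counts from Table 4.1.
counts = {
    "joint": {"rent": 362, "mortgage": 950, "own": 183},
    "individual": {"rent": 3496, "mortgage": 3839, "own": 1170},
}
levels = ["rent", "mortgage", "own"]

# Column totals, then column proportions: each count divided by its column total.
col_totals = {h: sum(counts[app][h] for app in counts) for h in levels}
col_props = {
    app: {h: counts[app][h] / col_totals[h] for h in levels}
    for app in counts
}

for app, row in col_props.items():
    print(app, {h: round(p, 3) for h, p in row.items()})
```

Here each column sums to 1, so the values describe how borrowers at one homeownership level are split between joint and individual applications.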
Row and column proportions can also be thought of as conditional proportions as they tell us
about the proportion of observations in a given level of a categorical variable conditional on the level
of another categorical variable.
We could also have checked for an association between application_type and homeownership in
Table 4.3 using row proportions. When comparing these row proportions, we would look down columns
to see if the fraction of loans where the borrower rents, has a mortgage, or owns varied across the
application types.
GUIDED PRACTICE
What does 0.451 represent in Table 4.3? What does 0.802 represent in Table 4.4?1
GUIDED PRACTICE
What does 0.122 represent in Table 4.3? What does 0.135 represent in Table 4.4?2
1 0.451 represents the proportion of individual applicants who have a mortgage. 0.802 represents the fraction of
EXAMPLE
Data scientists use statistics to build email spam filters. By noting specific characteristics of
an email, a data scientist may be able to classify some emails as spam or not spam with high
accuracy. One such characteristic is the email format, which indicates whether an email has
any HTML content, such as bolded text. We’ll focus on email format and spam status using
the email dataset; these variables are summarized in a contingency table in Table 4.5. Which would
be more helpful to someone hoping to classify email as spam or regular email for this table:
row or column proportions?
A data scientist would be interested in how the proportion of spam changes within each email
format. This corresponds to column proportions: the proportion of spam in plain text emails
and the proportion of spam in HTML emails. If we generate the column proportions, we can
see that a higher fraction of plain text emails are spam (209/1195 = 17.5%) than
HTML emails (158/2726 = 5.8%). This information on its own is insufficient to classify
an email as spam or not spam, as over 80% of plain text emails are not spam. Yet, when
we carefully combine this information with many other characteristics, we stand a reasonable
chance of being able to classify some emails as spam or not spam with confidence. This example
points out that row and column proportions are not equivalent. Before settling on one form
for a table, it is important to consider each to ensure that the most useful table is constructed.
However, sometimes it simply isn’t clear which, if either, is more useful.
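The two spam rates quoted in this example can be verified with a short calculation. The Python sketch below uses only the counts stated in the text; the dictionary names and the format labels "text" and "HTML" are our own.

```python
# Counts quoted in the example: spam emails and all emails, by format.
spam = {"text": 209, "HTML": 158}
total = {"text": 1195, "HTML": 2726}

# Column proportions: the share of emails of each format that are spam.
spam_rate = {fmt: spam[fmt] / total[fmt] for fmt in spam}

for fmt, rate in spam_rate.items():
    print(f"{fmt}: {100 * rate:.1f}% spam, {100 * (1 - rate):.1f}% not spam")
```

The complement of the plain text spam rate is what supports the statement that over 80% of plain text emails are not spam.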
EXAMPLE
Look back to Table 4.3 and Table 4.4. Are there any obvious scenarios where one might be
more useful than the other?
None that we think are obvious! What is distinct about the email example is that the two
loan variables do not have a clear explanatory-response variable relationship that we might
hypothesize. Usually it is most useful to “condition” on the explanatory variable. For instance,
in the email example, the email format was seen as a possible explanatory variable of whether
the message was spam, so we would find it more interesting to compute the relative frequencies
(proportions) for each email format.
A pie chart is shown in Figure 4.5a alongside a bar plot representing the same information in
Figure 4.5b. Pie charts can be useful for giving a high-level overview to show how a set of cases break
down. However, it is also difficult to decipher certain details in a pie chart. For example, it’s not
immediately obvious that there are more loans where the borrower has a mortgage than rent when
looking at the pie chart, while this detail is very obvious in the bar plot.
Pie charts can work well when the goal is to visualize a categorical variable with very few levels, and
especially if each level represents a simple fraction (e.g., one-half, one-quarter, etc.). However, they
can be quite difficult to read when they are used to visualize a categorical variable with many levels.
For example, the pie chart in Figure 4.6a and the bar plot in Figure 4.6b both represent the distribution of loan
grades (A through G). In this case, it is far easier to compare the counts of each loan grade using the
bar plot than the pie chart.
Another useful technique for visualizing categorical data is a waffle chart. Waffle charts can be used
to communicate the proportion of the data that falls into each level of a categorical variable. Just
like with pie charts, they work best when the number of levels represented is low. However, unlike pie
charts, they can make it easier to compare proportions that represent non-simple fractions. Figure 4.7a
is a waffle chart of homeownership and Figure 4.7b is a waffle chart of loan status.
(a) Homeownership: rent, mortgage, and own (b) Loan status: fully paid, in grace period, and late
Some of the more interesting investigations can be considered by examining numerical data across
groups. In this section we will expand on a few methods we have already seen to make plots for
numerical data from multiple groups on the same graph as well as introduce a few new methods for
comparing numerical data across groups.
We will revisit the county dataset and compare the median household income for counties that gained
population from 2010 to 2017 versus counties that had no gain. While we might like to make a causal
connection between income and population growth, remember that these are observational data and
so such an interpretation would be, at best, half-baked.
We have data on 3142 counties in the United States. We are missing 2017 population data from 3 of
them, and of the remaining 3139 counties, in 1541 the population increased from 2010 to 2017 and
in the remaining 1598 there was no population gain. Table 4.6 shows a sample of four observations from
each group.
Table 4.6: The median household income from a random sample of four counties with population gain
between 2010 and 2017 and another random sample of four counties with no population gain.
Color can be used to split histograms (see Section 5.3 for an introduction to histograms) for numerical
variables by levels of a categorical variable. An example of this is shown in Figure 4.8a. The side-
by-side box plot is another traditional tool for comparing across groups. An example is shown in
Figure 4.8b, where there are two box plots (see Section 5.5 for an introduction to box plots), one for
each group, placed into one plotting window and drawn on the same scale.
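A side-by-side box plot is built from a few numerical summaries computed separately for each group. The Python sketch below shows that grouping step. The income values are hypothetical (they are not taken from the county dataset), and the names `income` and `summary` are our own.

```python
from statistics import median, quantiles

# Hypothetical median household incomes in thousands of dollars,
# NOT actual county data: seven made-up counties per group.
income = {
    "gain": [38.2, 41.5, 44.9, 47.3, 52.8, 61.0, 75.4],
    "no gain": [31.7, 36.4, 39.8, 40.9, 43.1, 46.6, 58.2],
}

summary = {}
for group, values in income.items():
    q1, _, q3 = quantiles(values, n=4)  # the three quartile cut points
    summary[group] = {"median": median(values), "IQR": q3 - q1}

for group, stats in summary.items():
    print(f"{group}: median = {stats['median']:.1f}, IQR = {stats['IQR']:.1f}")
```

Because both groups are summarized on the same scale, their centers and spreads can be compared directly, which is the comparison a side-by-side box plot makes visually.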
(a) Histograms
Figure 4.8: Visualizations of median household income of counties by change in population (gain or loss).
GUIDED PRACTICE
Use the plots in Figure 4.8 to compare the incomes for counties across the two groups.
What do you notice about the approximate center of each group? What do you notice
about the variability between groups? Is the shape relatively consistent between groups?
How many prominent modes are there for each group?3
GUIDED PRACTICE
What components of each plot in Figure 4.8 do you find most useful?4
Another useful visualization for comparing numerical data across groups is a ridge plot, which
combines density plots (see Section 5.5 for an introduction to density plots) for various groups drawn
on the same scale in a single plotting window. Figure 4.9 displays a ridge plot for the distribution of
median household income in counties, split by whether there was a population gain or not.
Figure 4.9: Ridge plot for median household income, where counties are split by whether there was a
population gain or not.
GUIDED PRACTICE
What components of the ridge plot in Figure 4.9 do you find most useful compared to
those in Figure 4.8?5
One last visualization technique we’ll highlight for comparing numerical data across groups is faceting.
In this technique we split (facet) the graphical display of the data across plotting windows based on
groups. Figure 4.10a displays the same information as Figure 4.8a; however, here the distributions of
median household income for counties with and without population gain are faceted across two plotting
windows. We preserve the same scale on the x and y axes for easier comparison. An advantage of this
approach is that it extends to splitting the data across levels of two categorical variables, which allows
for displaying relationships between three variables. In Figure 4.10b we have now split the data into
four groups using the pop_change and metro variables:
• top left represents counties that are not in a metropolitan area with population gain,
• top right represents counties that are in a metropolitan area with population gain,
• bottom left represents counties that are not in a metropolitan area without population gain, and
finally
• bottom right represents counties that are in a metropolitan area without population gain.
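Behind a faceted display, the data are simply split into one subset per combination of the faceting variables. The Python sketch below illustrates that split for the four panels listed above. The records are hypothetical (they are not rows of the county dataset), and the `income` field name is our own; `pop_change` and `metro` follow the variable names used in the text.

```python
from collections import defaultdict

# Hypothetical records, NOT rows from the county dataset.
# income is a made-up median household income in thousands of dollars.
counties = [
    {"pop_change": "gain", "metro": "no", "income": 44.1},
    {"pop_change": "gain", "metro": "yes", "income": 58.3},
    {"pop_change": "no gain", "metro": "no", "income": 38.6},
    {"pop_change": "no gain", "metro": "yes", "income": 47.9},
    {"pop_change": "gain", "metro": "yes", "income": 63.0},
    {"pop_change": "no gain", "metro": "no", "income": 35.2},
]

# One list of incomes per (pop_change, metro) combination, i.e., per panel.
panels = defaultdict(list)
for county in counties:
    key = (county["pop_change"], county["metro"])
    panels[key].append(county["income"])

for key in sorted(panels):
    print(key, panels[key])
```

Each key corresponds to one plotting window, and the same plot type is then drawn for every subset on shared axes.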
3 Answers may vary a little. The counties with population gains tend to have higher income (median of about $45,000)
versus counties without a gain (median of about $40,000). The variability is also slightly larger for the population gain
group. This is evident in the IQR, which is about 50% bigger in the gain group. Both distributions show slight to
moderate right skew and are unimodal. The box plots indicate there are many observations far above the median in
each group, though we should anticipate that many observations will fall beyond the whiskers when examining any
dataset that contains more than a few hundred data points.
4 Answers will vary. The side-by-side box plots are especially useful for comparing centers and spreads, while the
hollow histograms are more useful for seeing distribution shape, skew, modes, and potential anomalies.
5 The ridge plot gives us a better sense of the shape, and especially modality, of the data.
4.6. COMPARING NUMERICAL DATA ACROSS GROUPS 65
(a) By population gain. (b) By both population gain and metropolitan area.
We can continue building upon this visualization to add one more variable, median_edu, which is the
median education level in the county. In Figure 4.11, we represent median education level using color,
where pink (solid line) represents counties where the median education level is high school diploma,
yellow (dashed line) is some college degree, and red (dotted line) is Bachelor’s.
GUIDED PRACTICE
Based on Figure 4.11, what can you say about how median household income in counties
varies depending on population gain/no gain, metropolitan area/not, and median
degree?6
Figure 4.11: Distribution of median income in counties using a ridge plot, faceted by whether the county had
a population gain or not as well as whether the county is in a metropolitan area and colored by the median
education level in the county.
6 Regardless of the location (metropolitan or not) or change in population, it seems like there is an increase in
median household income from individuals with only a HS diploma, to individuals with some college, to individuals
with a Bachelor’s degree.
4.7.1 Summary
Fluently working with categorical variables is an important skill for data analysts. In this chapter we
have introduced different visualizations and numerical summaries applied to categorical variables. The
graphical visualizations are even more descriptive when two variables are presented simultaneously.
We presented bar plots, mosaic plots, pie charts, and estimations of conditional proportions.
4.7.2 Terms
The terms introduced in this chapter are presented in Table 4.7. If you’re not sure what some of these
terms mean, we recommend you go back in the text and review their definitions. You should be able
to easily spot them as bolded text.
4.8 Exercises
a. What features are apparent in the bar plot but not in the pie chart?
b. What features are apparent in the pie chart but not in the bar plot?
c. Which graph would you prefer to use for displaying these categorical data?
2. Views on immigration. Nine hundred and ten (910) randomly sampled registered voters
from Tampa, FL were asked if they thought workers who have illegally entered the US should
be (i) allowed to keep their jobs and apply for US citizenship, (ii) allowed to keep their jobs as
temporary guest workers but not allowed to apply for US citizenship, or (iii) lose their jobs and
have to leave the country. The results of the survey by political ideology are shown below.8
3. Black Lives Matter. A Washington Post-Schar School poll conducted in the United States in
June 2020, among a random national sample of 1,006 adults, asked respondents whether they
support or oppose protests following George Floyd’s killing that have taken place in cities across
the US. The survey also collected information on the age of the respondents. (Washington Post
2020) The results are summarized in the stacked bar plot below.
a. Based on the stacked bar plot, do views on the protests and age appear to be associated?
Explain your reasoning.
b. Conjecture other possible variables that might explain the potential association between
these two variables.
4. Raise taxes. A random sample of registered voters nationally were asked whether they think
it’s better to raise taxes on the rich or raise taxes on the poor. The survey also collected
information on the political party affiliation of the respondents. (Polling 2015)
a. Based on the stacked bar plot shown above, do views on raising taxes and political affiliation
appear to be associated? Explain your reasoning.
b. Conjecture other possible variables that might explain the potential association between
these two variables.
5. Heart transplant data display. The Stanford University Heart Transplant Study was conducted
to determine whether an experimental heart transplant program increased lifespan. Each
patient entering the program was officially designated a heart transplant candidate, meaning
that they were gravely ill and might benefit from a new heart. Patients were randomly assigned
into treatment and control groups. Patients in the treatment group received a transplant, and
those in the control group did not. The visualizations below display two different versions of
the study results.9 (Turnbull, Brown, and Hu 1974)
a. Provide one aspect of the two-group comparison that is easier to see from the stacked bar
plot (left).
b. Provide one aspect of the two-group comparison that is easier to see from the standardized
bar plot (right).
c. For the Heart Transplant Study which of those aspects would be more important to display?
That is, which bar plot would be better as a data visualization?
6. Shipping holiday gifts data display. A local news survey asked 500 randomly sampled Los
Angeles residents which shipping carrier they prefer to use for shipping holiday gifts. The bar
plots below show the distribution of responses by age group as well as distribution of responses
by shipping method.
a. Which graph (top or bottom) would you use to understand the shipping choices of people
of different ages? Explain.
b. Which graph (top or bottom) would you use to understand the age distribution across
different types of shipping choices? Explain.
c. A new shipping company would like to market to people over the age of 55. Who will be
their biggest competitor? Explain.
d. FedEx would like to reach out to grow their market share so as to balance the age demographics
of FedEx users. To what age group should FedEx market?
9 The heart_transplant data used in this exercise can be found in the openintro R package.
7. Meat consumption and life expectancy. In data collected for You et al. (2022), total meat
intake is associated with life expectancy (at birth) in 175 countries. Meat intake is measured in
kg per capita per year (averaged over 2011 to 2013). The two ridge plots show an association
between income and meat consumption (higher income countries tend to eat more meat) and
an association between income and life expectancy (higher income countries have higher life
expectancy).
a. Do the graphs above demonstrate that meat consumption and life expectancy are associ-
ated? That is, can you tell if countries with low meat consumption have low life expectancy?
Explain.
b. Let’s assume that you had a plot comparing meat consumption and life expectancy, and
they do seem associated. Your friend says that the plot shows that high meat consumption
leads to a longer life. You correctly say, no, we can’t tell if there is a causal relationship
because the relationship is confounded by income level. Explain what you mean.
c. How can you investigate the relationship between meat consumption and life expectancy
in the presence of confounding variables (like income)?
8. Florence Nightingale. Florence Nightingale was a nurse in the Crimean War and an early
statistician. In her notes, she opined, “In comparing the deaths of one hospital with those of
another, any statistics are justly considered absolutely valueless which do not give the ages, the
sexes, and the diseases of all the cases.” (Nightingale 1859)
a. Nightingale describes three confounding variables to consider when comparing death rates
across hospitals. What are they? Describe what makes each variable potentially confounding.
b. Provide two additional potential confounding variables for this situation. Check to make
sure that the variables are associated with both the explanatory variable (hospital) and the
response variable (death).
c. Why does Nightingale say that the statistics are “valueless” if given without being broken
down by age, sex, and disease? Explain.
9. On-time arrivals. Consider all of the flights out of New York City in 2013 that flew into
Puerto Rico (BQN), Los Angeles (LAX), or San Francisco (SFO) on the following two airlines:
JetBlue (B6) or United Airlines (UA). Below are the tabulated counts for the number of flights
delayed and on time for each airline into each city.10
a. What percent of all JetBlue flights were delayed? What percent of all United Airlines flights
were delayed? (Note, the overall delay proportions are typically what would be reported
and associated with an airline.)
b. For each of the three airports, find the percent of delayed flights for each of JetBlue and
United (you should have 6 numbers).
c. United has a higher proportion of delayed flights for each of the three cities, yet JetBlue
has a higher proportion of delayed flights overall. Explain, using the data counts provided,
how the seeming paradox could happen.11
10. US House of Representatives. The US House of Representatives is dominated by two
political parties: Democrats and Republicans. Democrats are thought to be the more liberal
party and Republicans are considered to be the more conservative party. However, within
each party there is an internal spectrum of liberal to conservative. For example, conservative
Democrats and liberal Republicans would be labeled moderate. Consider an election where the
only change in membership is that the most conservative Democrats are replaced by a set of
liberal Republicans who are more liberal than the incumbent Republicans but more conservative
than the Democrats they replaced.
a. After the election, is the Democratic wing of the House more conservative or more liberal?
Explain.
b. After the election, is the Republican wing of the House more conservative or more liberal?
Explain.
c. After the election, is the overall House membership more conservative or more liberal?
Explain.
d. In what settings would you report the outcome of the change in House membership to be
more conservative? And in what settings would you report the outcome of the change in
House membership to be more liberal?12
10 The flights data used in this exercise can be found in the nycflights13 R package.
11 The conundrum is known as Simpson’s Paradox and is explored in Chapter 3.
12 The conundrum is known as Simpson’s Paradox and is explored in Chapter 3.
Chapter 5
Consider the loan_amount variable from the loan50 dataset, which represents the loan size for each
of 50 loans in the dataset.
This variable is numerical since we can sensibly discuss the numerical difference of the size of two
loans. On the other hand, area codes and zip codes are not numerical, but rather they are categorical
variables.
Throughout this chapter, we will apply numerical methods using the loan50 and county datasets,
which were introduced in Section 1.2. If you’d like to review the variables from either dataset, see
Table 1.4 and Table 1.6.
The county data can be found in the usdata R package and the loan50 data can be
found in the openintro R package.
A scatterplot provides a case-by-case view of data for two numerical variables. In Figure 1.2, a
scatterplot was used to examine the homeownership rate against the percentage of housing units that
are in multi-unit structures (e.g., apartments) in the county dataset. Another scatterplot is shown in
Figure 5.1, comparing the total income of a borrower (total_income) and the amount they borrowed
(loan_amount) for the loan50 dataset. In any scatterplot, each point represents a single case. Since
there are 50 cases in loan50, there are 50 points in Figure 5.1.
5.1. SCATTERPLOTS FOR PAIRED DATA 73
Figure 5.1: A scatterplot of loan amount versus total income for the loan50 dataset.
Looking at Figure 5.1, we see that there are many borrowers with income below $100,000 on the left
side of the graph, while there are a handful of borrowers with income above $250,000.
Figure 5.2: A scatterplot of the median household income against the poverty rate for the county dataset.
Data are from 2017. A statistical model has also been fit to the data and is shown as a dashed line.
EXAMPLE
Figure 5.2 shows a plot of median household income against the poverty rate for 3142 counties
in the US. What can be said about the relationship between these variables?
The relationship is evidently nonlinear, as highlighted by the dashed line. This is different
from previous scatterplots we have seen, which indicate very little, if any, curvature in the
trend.
GUIDED PRACTICE
What do scatterplots reveal about the data, and how are they useful?1
GUIDED PRACTICE
Describe two variables that would have a horseshoe-shaped association in a scatterplot
(∩ or ⌢).2
1 Answers may vary. Scatterplots are helpful in quickly spotting associations relating variables, whether those
associations come in the form of simple trends or whether those relationships are more complex.
2 Consider the case where your vertical axis represents something “good” and your horizontal axis represents something
that is only good in moderation. Health and water consumption fit this description: we require some water to
survive, but consume too much and it becomes toxic and can kill a person.
74 CHAPTER 5. EXPLORING NUMERICAL DATA
Sometimes we are interested in the distribution of a single variable. In these cases, a dot plot provides
the most basic of displays. A dot plot is a one-variable scatterplot; an example using the interest
rate of 50 loans is shown in Figure 5.3.
Figure 5.3: A dot plot of interest rate for the loan50 dataset. The rates have been rounded to the nearest
percent in this plot, and the distribution’s mean is shown as a red triangle.
The mean, often called the average, is a common way to measure the center of a distribution of
data. To compute the mean interest rate, we add up all the interest rates and divide by the number
of observations.
The sample mean is often labeled 𝑥̄. The letter 𝑥 is being used as a generic placeholder for the variable
of interest and the bar over the 𝑥 communicates we are looking at the average interest rate, which for
these 50 loans is 11.57%. It’s useful to think of the mean as the balancing point of the distribution,
and it’s shown as a triangle in Figure 5.3.
Mean.
The sample mean can be calculated as the sum of the observed values divided by the
number of observations:
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$
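As a minimal sketch of this calculation, the Python snippet below adds up the 50 interest rates listed in Table 5.3 and divides by the number of observations. The variable names (interest_rate, n, x_bar) are chosen here for illustration and are not taken from the loan50 dataset itself.

```python
# Interest rates (%) for the 50 loans, copied from Table 5.3.
interest_rate = [
    5.31, 5.31, 5.32, 6.08, 6.08, 6.08, 6.71, 6.71, 7.34, 7.35,
    7.35, 7.96, 7.96, 7.96, 7.97, 9.43, 9.43, 9.44, 9.44, 9.44,
    9.92, 9.92, 9.92, 9.92, 9.93, 9.93, 10.42, 10.42, 10.90, 10.90,
    10.91, 10.91, 10.91, 11.98, 12.62, 12.62, 12.62, 14.08, 15.04, 16.02,
    17.09, 17.09, 17.09, 18.06, 18.45, 19.42, 20.00, 21.45, 24.85, 26.30,
]

n = len(interest_rate)          # number of observations
x_bar = sum(interest_rate) / n  # sum of the observed values divided by n

print(f"n = {n}, sample mean = {x_bar:.2f}%")
```

Rounded to two decimals, the result agrees with the 11.57% reported in the text.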
GUIDED PRACTICE
Examine the equation for the mean. What does 𝑥1 correspond to? And 𝑥2 ? Can you
infer a general meaning to what 𝑥𝑖 might represent?3
GUIDED PRACTICE
What was 𝑛 in this sample of loans?4
The loan50 dataset represents a sample from a larger population of loans made through Lending Club.
We could compute a mean for the entire population in the same way as the sample mean. However,
the population mean has a special label: 𝜇. The symbol 𝜇 is the Greek letter mu and represents the
average of all observations in the population. Sometimes a subscript, such as 𝑥, is used to represent
which variable the population mean refers to, e.g., 𝜇𝑥. Oftentimes it is too expensive to measure the
population mean precisely, so we often estimate 𝜇 using the sample mean, 𝑥̄.
3 𝑥1 corresponds to the interest rate for the first loan in the sample, 𝑥2 to the second loan’s interest rate, and 𝑥𝑖
corresponds to the interest rate for the 𝑖th loan in the dataset. For example, if 𝑖 = 4, then we are examining 𝑥4, which
refers to the fourth observation in the dataset.
4 The sample size was 𝑛 = 50.
5.2. DOT PLOTS AND THE MEAN 75
EXAMPLE
Although we do not have an ability to calculate the average interest rate across all loans in
the population, we can estimate the population value using the sample data. Based on the
sample of 50 loans, what would be a reasonable estimate of 𝜇𝑥 , the mean interest rate for all
loans in the full dataset?
The sample mean, 11.57%, provides a rough estimate of 𝜇𝑥. While it is not perfect, this is our
single best guess, or point estimate, of the average interest rate of all the loans in the population
under study. In Chapter 11 and beyond, we will develop tools to characterize the accuracy of
point estimates, like the sample mean. As you might have guessed, point estimates based on
larger samples tend to be more accurate than those based on smaller samples.
The mean is useful because it allows us to rescale or standardize a metric into something more easily
interpretable and comparable. Suppose we would like to understand if a new drug is more effective at
treating asthma attacks than the standard drug. A trial of 1,500 adults is set up, where 500 receive the
new drug, and 1000 receive a standard drug in the control group. Results of this trial are summarized
in Table 5.1.
Table 5.1: Results of a trial of 1500 adults that suffer from asthma.
Comparing the raw counts of 200 to 300 asthma attacks would make it appear that the new drug is
better, but this is an artifact of the imbalanced group sizes.
Instead, we should look at the average number of asthma attacks per patient in each group:
• New drug: 200/500 = 0.4 asthma attacks per patient
• Standard drug: 300/1000 = 0.3 asthma attacks per patient
The standard drug has a lower average number of asthma attacks per patient than the average in the
treatment group.
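The per-patient comparison above can be sketched in a few lines of Python. The counts come from the trial description in the text; the dictionary layout and names are illustrative assumptions.

```python
# Counts from the trial description: patients and total asthma attacks per group.
groups = {
    "new drug": {"patients": 500, "attacks": 200},
    "standard drug": {"patients": 1000, "attacks": 300},
}

# Average number of asthma attacks per patient in each group.
rates = {name: g["attacks"] / g["patients"] for name, g in groups.items()}

for name, rate in rates.items():
    print(f"{name}: {rate:.1f} asthma attacks per patient")
```

Dividing by group size puts both groups on the same per-patient scale, which is what makes the comparison fair despite the unequal group sizes.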
EXAMPLE
Come up with another example where the mean is useful for making comparisons.
Emilio opened a food truck last year where he sells burritos, and his business has stabilized
over the last 3 months. Over that 3-month period, he has made $11,000 while working 625
hours. Emilio’s average hourly earnings provide a useful statistic for evaluating whether his
venture is, at least from a financial perspective, worth it:
$$\frac{\$11{,}000}{625 \text{ hours}} = \$17.60 \text{ per hour}$$
By knowing his average hourly wage, Emilio now has put his earnings into a standard unit
that is easier to compare with many other jobs that he might consider.
EXAMPLE
Suppose we want to compute the average income per person in the US. To do so, we might
first think to take the mean of the per capita incomes across the 3,142 counties in the county
dataset. What would be a better approach?
The county dataset is special in that each county actually represents many individual people.
If we were to simply average across the income variable, we would be treating counties with
5,000 and 5,000,000 residents equally in the calculations. Instead, we should compute the total
income for each county, add up all the counties’ totals, and then divide by the number of
people in all the counties. If we completed these steps with the county data, we would find
that the per capita income for the US is $30,861. Had we computed the simple mean of per
capita income across counties, the result would have been just $26,093!
This example used what is called a weighted mean. For more information on this topic,
check out the following online supplement regarding weighted means.
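To make the difference between the two approaches concrete, here is a small sketch that uses three hypothetical counties. The populations and incomes below are invented for illustration and are not values from the county dataset.

```python
# Hypothetical counties: (population, per capita income in dollars).
# These numbers are made up for illustration only.
counties = [
    (5_000, 20_000),
    (50_000, 25_000),
    (5_000_000, 35_000),
]

# Simple mean: every county counts equally, regardless of its size.
simple_mean = sum(income for _, income in counties) / len(counties)

# Weighted mean: total income across all counties divided by total population.
total_income = sum(pop * income for pop, income in counties)
total_people = sum(pop for pop, _ in counties)
weighted_mean = total_income / total_people

print(f"simple mean of county values: ${simple_mean:,.0f}")
print(f"population-weighted mean:     ${weighted_mean:,.0f}")
```

In this made-up example the largest county has the highest income, so the weighted mean sits well above the simple mean, the same direction of difference described in the example above.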
Dot plots show the exact value for each observation. They are useful for small datasets but can become
hard to read with larger samples. Rather than showing the value of each observation, we prefer to
think of the value as belonging to a bin. For example, in the loan50 dataset, we created a table of
counts for the number of loans with interest rates between 5.0% and 7.5%, then the number of loans
with rates between 7.5% and 10.0%, and so on. Observations that fall on the boundary of a bin (e.g.,
10.00%) are allocated to the lower bin. The tabulation is shown in Table 5.2, and the binned counts
are plotted as bars in Figure 5.4 into what is called a histogram. Note that the histogram resembles
a more heavily binned version of the stacked dot plot shown in Figure 5.3.
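The binning rule described above can be sketched as follows, using the 50 interest rates listed in Table 5.3. The helper name bin_index and the text-based bar display are illustrative choices, not part of the original analysis.

```python
from bisect import bisect_left
from collections import Counter

# Interest rates (%) for the 50 loans, copied from Table 5.3.
interest_rate = [
    5.31, 5.31, 5.32, 6.08, 6.08, 6.08, 6.71, 6.71, 7.34, 7.35,
    7.35, 7.96, 7.96, 7.96, 7.97, 9.43, 9.43, 9.44, 9.44, 9.44,
    9.92, 9.92, 9.92, 9.92, 9.93, 9.93, 10.42, 10.42, 10.90, 10.90,
    10.91, 10.91, 10.91, 11.98, 12.62, 12.62, 12.62, 14.08, 15.04, 16.02,
    17.09, 17.09, 17.09, 18.06, 18.45, 19.42, 20.00, 21.45, 24.85, 26.30,
]

# Bin edges: 5.0, 7.5, 10.0, ..., 27.5 (bins are 2.5 percentage points wide).
edges = [5.0 + 2.5 * k for k in range(10)]

def bin_index(x):
    """Index of the bin (edges[k], edges[k + 1]] containing x.

    bisect_left sends a value that sits exactly on a boundary to the lower
    bin, matching the rule stated in the text. Assumes x is above edges[0].
    """
    return bisect_left(edges, x) - 1

counts = Counter(bin_index(x) for x in interest_rate)

# Print the table of counts with a crude text histogram next to it.
for k in range(len(edges) - 1):
    label = f"({edges[k]:4.1f}%, {edges[k + 1]:4.1f}%]"
    print(f"{label}: {counts[k]:2d} " + "#" * counts[k])
```

The printed bars give a rough sideways version of a histogram: the tallest bars are at the low rates and the counts thin out toward the high rates.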
Figure 5.4: A histogram of interest rate. This distribution is strongly skewed to the right.
5.3. HISTOGRAMS AND SHAPE 77
Histograms provide a view of the data density. Higher bars represent where the data are relatively
more common. For instance, there are many more loans with rates between 5% and 10% than loans
with rates between 20% and 25% in the dataset. The bars make it easy to see how the density of the
data changes relative to the interest rate.
Histograms are especially convenient for understanding the shape of the data distribution. Figure 5.4
suggests that most loans have rates under 15%, while only a handful of loans have rates above 20%.
When the distribution of a variable trails off to the right in this way and has a longer right tail, the
shape is said to be right skewed.5
Figure 5.5: A density plot of interest rate. Again, the distribution is strongly skewed to the right.
Figure 5.5 shows a density plot, which is a smoothed out histogram. The technical details for how
to draw density plots (precisely how to smooth out the histogram) are beyond the scope of this text,
but you will note that the shape, scale, and spread of the observations are displayed similarly in a
histogram as in a density plot.
Variables with the reverse characteristic – a long, thinner tail to the left – are said to be left skewed.
We also say that such a distribution has a long left tail. Variables that show roughly equal trailing
off in both directions are called symmetric.
When data trail off in one direction, the distribution has a long tail. If a distribution
has a long left tail, it is left skewed. If a distribution has a long right tail, it is right
skewed.
GUIDED PRACTICE
Besides the mean (since it was labeled), what can you see in the dot plot in Figure 5.3
that you cannot see in the histogram in Figure 5.4?6
5 Other ways to describe data that are right skewed: skewed to the right, skewed to the high end, or skewed to the
positive end.
6 The interest rates for individual loans.
Figure 5.6 shows histograms that have one, two, or three prominent peaks. Such distributions are
called unimodal, bimodal, and multimodal, respectively. Any distribution with more than two
prominent peaks is called multimodal. Notice that there was one prominent peak in the unimodal
distribution with a second less prominent peak that was not counted since it only differs from its
neighboring bins by a few observations.
Figure 5.6: Counting only prominent peaks, the distributions are (left to right) unimodal, bimodal, and
multimodal. Note that the left plot is unimodal because we are counting prominent peaks, not just any peak.
EXAMPLE
Figure 5.4 reveals only one prominent mode in the interest rate. Is the distribution unimodal,
bimodal, or multimodal?
Unimodal. Remember that uni stands for 1 (think unicycles), and bi stands for 2 (think bicycles).
GUIDED PRACTICE
Height measurements of young students and adult teachers at an elementary school
were taken. How many modes would you expect in this height dataset?7
Looking for modes isn’t about finding a clear and correct answer about the number of modes in a
distribution, which is why prominent is not rigorously defined in this book. The most important part
of this examination is to better understand your data.
The mean was introduced as a method to describe the center of a variable, and variability in the
data is also important. Here, we introduce two measures of variability: the variance and the standard
deviation. Both of these are very useful in data analysis, even though their formulas are a bit tedious
to calculate by hand. The standard deviation is the easier of the two to comprehend, as it roughly
describes how far away the typical observation is from the mean.
We call the distance of an observation from its mean its deviation. Below are the deviations for the
1st, 2nd, 3rd, and 50th observations in the interest_rate variable:
7 There might be two height groups visible in the dataset: the children (students) and the adults (teachers). That
If we square these deviations and then take an average (dividing by 𝑛 − 1 rather than 𝑛), the result is equal to the sample variance,
denoted by 𝑠²:
Standard deviation.
The sample standard deviation can be calculated as the square root of the sum of the
squared distance of each value from the mean divided by the number of observations
minus one:
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$

$$s = \sqrt{25.52} = 5.05$$
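The calculation can be traced step by step in a short sketch that uses the 50 interest rates listed in Table 5.3. Variable names are illustrative; the divisor 𝑛 − 1 follows the definition in the box above.

```python
import math

# Interest rates (%) for the 50 loans, copied from Table 5.3.
interest_rate = [
    5.31, 5.31, 5.32, 6.08, 6.08, 6.08, 6.71, 6.71, 7.34, 7.35,
    7.35, 7.96, 7.96, 7.96, 7.97, 9.43, 9.43, 9.44, 9.44, 9.44,
    9.92, 9.92, 9.92, 9.92, 9.93, 9.93, 10.42, 10.42, 10.90, 10.90,
    10.91, 10.91, 10.91, 11.98, 12.62, 12.62, 12.62, 14.08, 15.04, 16.02,
    17.09, 17.09, 17.09, 18.06, 18.45, 19.42, 20.00, 21.45, 24.85, 26.30,
]

n = len(interest_rate)
x_bar = sum(interest_rate) / n

# Deviation of each observation from the mean.
deviations = [x - x_bar for x in interest_rate]

# Sample variance: sum of squared deviations divided by n - 1.
variance = sum(d ** 2 for d in deviations) / (n - 1)

# Sample standard deviation: square root of the variance.
sd = math.sqrt(variance)

print(f"variance = {variance:.2f}, standard deviation = {sd:.2f}")
```

Rounded to two decimals, these values agree with the 25.52 and 5.05 shown above.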
While often omitted, a subscript of 𝑥 may be added to the variance and standard deviation, i.e., 𝑠𝑥² and
𝑠𝑥, if it is useful as a reminder that these are the variance and standard deviation of the observations
represented by 𝑥1, 𝑥2, …, 𝑥𝑛.
The variance is the average squared distance from the mean. The standard deviation is
the square root of the variance. The standard deviation is useful when considering how
far the data are distributed from the mean.
The standard deviation represents the typical deviation of observations from the mean.
Often about 68% of the data will be within one standard deviation of the mean and
about 95% will be within two standard deviations. However, these percentages are not
strict rules.
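One way to see how closely a particular sample follows these rough percentages is to count the observations directly. The sketch below does this for the 50 interest rates listed in Table 5.3, using Python's statistics module; the helper count_within is a name introduced here for illustration.

```python
import statistics

# Interest rates (%) for the 50 loans, copied from Table 5.3.
interest_rate = [
    5.31, 5.31, 5.32, 6.08, 6.08, 6.08, 6.71, 6.71, 7.34, 7.35,
    7.35, 7.96, 7.96, 7.96, 7.97, 9.43, 9.43, 9.44, 9.44, 9.44,
    9.92, 9.92, 9.92, 9.92, 9.93, 9.93, 10.42, 10.42, 10.90, 10.90,
    10.91, 10.91, 10.91, 11.98, 12.62, 12.62, 12.62, 14.08, 15.04, 16.02,
    17.09, 17.09, 17.09, 18.06, 18.45, 19.42, 20.00, 21.45, 24.85, 26.30,
]

x_bar = statistics.mean(interest_rate)
s = statistics.stdev(interest_rate)  # sample standard deviation (divides by n - 1)

def count_within(k):
    """Number of observations within k standard deviations of the mean."""
    return sum(1 for x in interest_rate if abs(x - x_bar) <= k * s)

for k in (1, 2):
    share = count_within(k) / len(interest_rate)
    print(f"within {k} SD: {count_within(k)} of {len(interest_rate)} ({share:.0%})")
```

For this sample the counts come out to 34 and 48 of the 50 loans, close to but not exactly the 68% and 95% guideline.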
Like the mean, the population values for variance and standard deviation have special symbols: 𝜎²
for the variance and 𝜎 for the standard deviation.
GUIDED PRACTICE
A good description of the shape of a distribution should include modality and whether
the distribution is symmetric or skewed to one side. Using Figure 5.8 as an example,
explain why such a description is important.8
8 Figure 5.8 shows three distributions that look quite different, but all have the same mean, variance, and standard
deviation. Using modality, we can distinguish between the first plot (bimodal) and the last two (unimodal). Using
skewness, we can distinguish between the last plot (right skewed) and the first two. While a picture, like a histogram,
tells a more complete story, we can use modality and shape (symmetry/skew) to characterize basic information about
a distribution.
Figure 5.7: For the interest rate variable, 34 of the 50 loans (68%) had interest rates within 1 standard
deviation of the mean, and 48 of the 50 loans (96%) had rates within 2 standard deviations. Usually about
68% of the data are within 1 standard deviation of the mean and 95% within 2 standard deviations, though
this is far from a hard rule.
Figure 5.8: Three different population distributions with the same mean (0) and standard deviation (1).
EXAMPLE
Describe the distribution of the interest_rate variable using the histogram in Figure 5.4.
The description should incorporate the center, variability, and shape of the distribution, and
it should also be placed in context. Also note any especially unusual cases.
The distribution of interest rates is unimodal and skewed to the high end. Many of the rates
fall near the mean at 11.57%, and most fall within one standard deviation (5.05%) of the mean.
There are a few exceptionally large interest rates in the sample that are above 20%.
In practice, the variance and standard deviation are sometimes used as a means to an end, where
the “end” is being able to accurately estimate the uncertainty associated with a sample statistic. For
example, in Chapter 13 the standard deviation is used in calculations that help us understand how
much a sample mean varies from one sample to the next.
5.5. BOX PLOTS, QUARTILES, AND THE MEDIAN 81
A box plot summarizes a dataset using five statistics while also identifying unusual observations.
Figure 5.9 provides a dot plot and a box plot of the interest_rate variable from the loan50 dataset.9
The dark line inside the box represents the median, which splits the data in half. 50% of the data fall
below this value and 50% fall above it. Since in the loan50 dataset there are 50 observations (an even
number), the median is defined as the average of the two observations closest to the 50th percentile.
Table 5.3 shows all interest rates, arranged in ascending order. We can see that the 25th and the 26th
values are both 9.93, which corresponds to the thick line in Figure 5.9b.
Table 5.3: Interest rates from the loan50 dataset, arranged in ascending order.
1 2 3 4 5 6 7 8 9 10
1 5.31 5.31 5.32 6.08 6.08 6.08 6.71 6.71 7.34 7.35
10 7.35 7.96 7.96 7.96 7.97 9.43 9.43 9.44 9.44 9.44
20 9.92 9.92 9.92 9.92 9.93 9.93 10.42 10.42 10.90 10.90
30 10.91 10.91 10.91 11.98 12.62 12.62 12.62 14.08 15.04 16.02
40 17.09 17.09 17.09 18.06 18.45 19.42 20.00 21.45 24.85 26.30
When there are an odd number of observations, there will be exactly one observation that splits the
data into two halves, and in such a case that observation is the median (no average needed).
If the data are ordered from smallest to largest, the median is the observation right in
the middle. If there are an even number of observations, there will be two values in the
middle, and the median is taken as their average.
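The rule for odd and even sample sizes can be written as a short function. This is a sketch for illustration; the function name and the small example lists are not from the text.

```python
def median(values):
    """Median of a list of numbers: the middle value of the sorted data,
    or the average of the two middle values when the count is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd n: single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even n: average the middle two

print(median([3, 5, 6, 7, 9]))  # odd number of observations
print(median([3, 5, 6, 7]))     # even number of observations
print(median([9.93, 9.93]))     # the two middle loan50 rates from Table 5.3
```

With 50 loans, the two middle values are the 25th and 26th in sorted order, so the last call mirrors the loan50 calculation.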
9 Box plots were introduced by Mary Eleanor Spear who considered them to be a particular type of bar plot, see
page 166 of Spear (1952). Mistakenly, box plots are often attributed to John Tukey who was the first person to call
them “box-and-whisker plots.”
The second step in building a box plot is drawing a rectangle to represent the middle 50% of the data.
The length of the box is called the interquartile range, or IQR for short. It, like the standard
deviation, is a measure of variability in data. The more variable the data, the larger the standard
deviation and IQR tend to be. The two boundaries of the box are called the first quartile (the 25th
percentile, i.e., 25% of the data fall below this value) and the third quartile (the 75th percentile, i.e.,
75% of the data fall below this value), and these are often labeled 𝑄1 and 𝑄3, respectively.
The interquartile range (IQR) is the length of the box in a box plot. It is computed as
𝐼𝑄𝑅 = 𝑄3 − 𝑄1, where 𝑄1 and 𝑄3 are the 25th and 75th percentiles, respectively.
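As a rough sketch, the quartiles and IQR of the 50 interest rates in Table 5.3 can be computed with Python's statistics.quantiles. Software packages use slightly different conventions for percentiles, so the "inclusive" method chosen here is an assumption and may not match the exact values behind the box plot in the text.

```python
import statistics

# Interest rates (%) for the 50 loans, copied from Table 5.3.
interest_rate = [
    5.31, 5.31, 5.32, 6.08, 6.08, 6.08, 6.71, 6.71, 7.34, 7.35,
    7.35, 7.96, 7.96, 7.96, 7.97, 9.43, 9.43, 9.44, 9.44, 9.44,
    9.92, 9.92, 9.92, 9.92, 9.93, 9.93, 10.42, 10.42, 10.90, 10.90,
    10.91, 10.91, 10.91, 11.98, 12.62, 12.62, 12.62, 14.08, 15.04, 16.02,
    17.09, 17.09, 17.09, 18.06, 18.45, 19.42, 20.00, 21.45, 24.85, 26.30,
]

# n=4 splits the data into quarters and returns the three cut points.
# method="inclusive" is one common percentile convention (an assumption here).
q1, q2, q3 = statistics.quantiles(interest_rate, n=4, method="inclusive")
iqr = q3 - q1

print(f"Q1 = {q1:.2f}, median = {q2:.2f}, Q3 = {q3:.2f}, IQR = {iqr:.2f}")
```

Under this convention the quartiles land near 8% and 14%, so the IQR is a little under 6 percentage points.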
GUIDED PRACTICE
What percent of the data fall between 𝑄1 and the median? What percent is between
the median and 𝑄3 ?10
Extending out from the box, the whiskers attempt to capture the data outside of the box. The
whiskers of a box plot reach to the minimum and the maximum values in the data, unless there are
points that are considered unusually high or unusually low, which are identified as potential outliers
by the box plot. These are labeled with a dot on the box plot. The purpose of labeling the outlying
points – instead of extending the whiskers to the minimum and maximum observed values – is to help
identify any observations that appear to be unusually distant from the rest of the data. There are
a variety of formulas for determining whether a particular data point is considered an outlier, and
different statistical software use different formulas. A commonly used formula is that any observation
beyond 1.5 × 𝐼𝑄𝑅 away from the first or the third quartile is considered an outlier. In a sense, the
box is like the body of the box plot and the whiskers are like its arms trying to reach the rest of the
data, up to the outliers.
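The 1.5 × IQR rule can be sketched as follows for the 50 interest rates in Table 5.3. The quartile convention (method="inclusive") and the names lower_fence and upper_fence are assumptions made for this illustration; other software may place the cutoffs slightly differently.

```python
import statistics

# Interest rates (%) for the 50 loans, copied from Table 5.3.
interest_rate = [
    5.31, 5.31, 5.32, 6.08, 6.08, 6.08, 6.71, 6.71, 7.34, 7.35,
    7.35, 7.96, 7.96, 7.96, 7.97, 9.43, 9.43, 9.44, 9.44, 9.44,
    9.92, 9.92, 9.92, 9.92, 9.93, 9.93, 10.42, 10.42, 10.90, 10.90,
    10.91, 10.91, 10.91, 11.98, 12.62, 12.62, 12.62, 14.08, 15.04, 16.02,
    17.09, 17.09, 17.09, 18.06, 18.45, 19.42, 20.00, 21.45, 24.85, 26.30,
]

q1, _, q3 = statistics.quantiles(interest_rate, n=4, method="inclusive")
iqr = q3 - q1

# Observations more than 1.5 * IQR beyond a quartile are flagged as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in interest_rate if x < lower_fence or x > upper_fence]

# Whiskers reach the most extreme observations that are not flagged.
inside = [x for x in interest_rate if lower_fence <= x <= upper_fence]
whiskers = (min(inside), max(inside))

print(f"fences: ({lower_fence:.2f}, {upper_fence:.2f})")
print(f"whiskers reach: {whiskers}")
print(f"potential outliers: {outliers}")
```

Under these assumptions the two largest rates are flagged as potential outliers, and the upper whisker stops at the largest rate that is not flagged.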
An outlier is an observation that appears extreme relative to the rest of the data.
Examining data for outliers serves many useful purposes, including
Keep in mind, however, that some datasets have a naturally long skew and outlying
points do not represent any sort of problem in the dataset.
GUIDED PRACTICE
Using the box plot in Figure 5.9b, estimate the values of the 𝑄1 , 𝑄3 , and IQR for
interest_rate in the loan50 dataset.11
10 Since 𝑄1 and 𝑄3 capture the middle 50% of the data and the median splits the data in the middle, 25% of the
data fall between 𝑄1 and the median, and another 25% falls between the median and 𝑄3.
11 These visual estimates will vary a little from one person to the next: 𝑄1 ≈ 8%, 𝑄3 ≈ 14%, IQR ≈ 14 - 8 = 6%.
5.6. ROBUST STATISTICS 83
How are the sample statistics of the interest_rate variable affected by the observation, 26.3%?
What would have happened if this loan had instead been only 15%? What would happen to these
summary statistics if the observation at 26.3% had been even larger, say 35%? The two conjectured
scenarios are plotted alongside the original data in Figure 5.10, and sample statistics are computed
under each scenario in Table 5.4.
Table 5.4: A comparison of how the median, IQR, mean, and standard deviation change as the value of an
extreme observation from the original interest data changes.
Figure 5.10: Dot plots of the original interest rate data and two modified datasets.
GUIDED PRACTICE
Which is more affected by extreme observations, the mean or median? Is the standard
deviation or IQR more affected by extreme observations?12
The median and IQR are called robust statistics because extreme observations have little effect
on their values: moving the most extreme value generally has little influence on these statistics. On
the other hand, the mean and standard deviation are more heavily influenced by changes in extreme
observations, which can be important in some situations.
EXAMPLE
The median and IQR did not change under the three scenarios in Table 5.4. Why might this
be the case?
The median and IQR are only sensitive to numbers near 𝑄1 , the median, and 𝑄3 . Since values
in these regions are stable in the three datasets, the median and IQR estimates are also stable.
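The comparison across scenarios can be reproduced with a short sketch that swaps the largest interest rate in Table 5.3 for 15% and for 35%. The helper names and the "inclusive" quartile convention are assumptions made here, so the IQR in particular may differ slightly from values produced by other software.

```python
import statistics

# Interest rates (%) for the 50 loans, copied from Table 5.3.
interest_rate = [
    5.31, 5.31, 5.32, 6.08, 6.08, 6.08, 6.71, 6.71, 7.34, 7.35,
    7.35, 7.96, 7.96, 7.96, 7.97, 9.43, 9.43, 9.44, 9.44, 9.44,
    9.92, 9.92, 9.92, 9.92, 9.93, 9.93, 10.42, 10.42, 10.90, 10.90,
    10.91, 10.91, 10.91, 11.98, 12.62, 12.62, 12.62, 14.08, 15.04, 16.02,
    17.09, 17.09, 17.09, 18.06, 18.45, 19.42, 20.00, 21.45, 24.85, 26.30,
]

def summarize(values):
    """Robust (median, IQR) and non-robust (mean, SD) summaries of a sample."""
    q1, med, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "median": med,
        "IQR": q3 - q1,
        "mean": statistics.mean(values),
        "SD": statistics.stdev(values),
    }

def replace_max(values, new_value):
    """Copy of the data with the largest observation replaced by new_value."""
    ordered = sorted(values)
    return ordered[:-1] + [new_value]

scenarios = {
    "original (26.3%)": interest_rate,
    "26.3% -> 15%": replace_max(interest_rate, 15.0),
    "26.3% -> 35%": replace_max(interest_rate, 35.0),
}
results = {name: summarize(values) for name, values in scenarios.items()}

for name, stats in results.items():
    row = ", ".join(f"{key} = {value:.2f}" for key, value in stats.items())
    print(f"{name:>17}: {row}")
```

The median and IQR are identical in all three rows, while the mean and standard deviation shift with the extreme value.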
You might not be surprised that the answer to the question “which is better, the mean or the median?”
is: it depends. The two statistics measure different things, and so their use is dependent on the context
in the analysis. Consider the following scenarios:
• Is it better to measure the average profit per customer or the median profit per customer?
– If concern is about the overall profit margin of the company, the mean is a better measure to
assess what is happening across the company. The company could have a positive median
profit per customer and still be unprofitable.
– If concern is around understanding the profit per typical customer, possibly to understand
the growth headroom for the company’s profit, the median profit per customer would tell
you more about individual customer profits.
• If you operate an app and want to know how long it takes for the app to open on your customers’
phones, do you want the mean amount of time or the median amount of time?
– The mean leads to an understanding of the overall amount of time being wasted in opening
the app.
– The median tells you about the typical user experience.
– However, if the app takes less than 5 milliseconds to launch for 50% of the users but more
than 10 seconds to launch for 10% of the users, the median doesn’t give the information
you need. In that scenario, you might want an upper percentile, like the 95th percentile.
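The app-launch scenario can be illustrated with made-up numbers. The launch times below are hypothetical and chosen only so that just over half of the launches are very fast and 10% are very slow; they are not measurements from any real app.

```python
import statistics

# Hypothetical launch times in milliseconds for 100 users (illustrative only).
launch_ms = [4] * 55 + [200] * 35 + [12_000] * 10

mean_ms = statistics.mean(launch_ms)
median_ms = statistics.median(launch_ms)

# quantiles(n=100) returns the 1st through 99th percentiles;
# index 94 is therefore the 95th percentile.
p95_ms = statistics.quantiles(launch_ms, n=100, method="inclusive")[94]

print(f"mean            = {mean_ms:,.1f} ms")
print(f"median          = {median_ms:,.1f} ms")
print(f"95th percentile = {p95_ms:,.1f} ms")
```

Here the median suggests a nearly instant launch, the mean is pulled up by the slow launches, and only the upper percentile shows how bad the experience is for the slowest users.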
GUIDED PRACTICE
The distribution of loan amounts in the loan50 dataset is right skewed, with a few large
loans lingering out into the right tail. If you were wanting to understand the typical
loan size, should you be more interested in the mean or median?13
Regardless of the choice of centrality statistic (either mean or median), for most analyses, it is important
to consider more than just the centrality. Other statistics like upper and lower percentiles, IQR,
or standard deviation provide information about the variability of the observations. And visualizing
the data through a graphical representation will typically provide a wealth of information necessary
for understanding the full data picture associated with the research question.
12 Mean is affected more than the median. Standard deviation is affected more than the IQR.
13 If we are looking to simply understand what a typical individual loan looks like, the median is probably more
useful. However, if the goal is to understand something that scales well, such as the total amount of money we might
need to have on hand if we were to offer 1,000 loans, then the mean would be more useful.
5.7. TRANSFORMING DATA 85
When data are very strongly skewed, we sometimes transform them, so they are easier to model.
Figure 5.11a and Figure 5.11b show right-skewed distributions: the distribution of the percentage of
unemployed people and the distribution of the population in all counties in the United States. The
distribution of population is more strongly skewed than the distribution of percentage unemployed,
hence the log transformation results in a much bigger change in the shape of the distribution.
EXAMPLE
Consider the histogram of county populations shown in Figure 5.11c, which shows extreme
skew. What characteristics of the plot keep it from being useful?
Nearly all of the data fall into the left-most bin, and the extreme skew obscures many of the
potentially interesting details at the low values.
Figure 5.11: Histograms of percentage unemployed, population, and their log transformed versions in all
US counties. For the plots of transformed variables, the x-value corresponds to the power of 10, e.g., 1 on the
x-axis corresponds to 10^1 = 10 and 5 on the x-axis corresponds to 10^5 = 100,000. Data are from 2017.
There are some standard transformations that may be useful for strongly right skewed data where
much of the data is positive but clustered near zero. A transformation is a rescaling of the data
using a function. For instance, a plot of the logarithm (base 10) of unemployment rates and county
populations results in the new histograms in Figure 5.11b. The transformed data are symmetric,
and any potential outliers appear much less extreme than in the original dataset. By reining in the
outliers and extreme skew, transformations often make it easier to build statistical models for the
data.
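A small sketch shows the effect of a base-10 log transformation on strongly right-skewed values. The populations below are hypothetical and are not taken from the county dataset; the gap between mean and median is used here only as a rough indicator of skew.

```python
import math
import statistics

# Hypothetical county populations (illustrative only): most are small,
# a couple are very large, giving a strongly right-skewed set of values.
populations = [1_000, 3_000, 10_000, 30_000, 100_000, 1_000_000, 10_000_000]

# Base-10 log transformation: 1,000 -> 3, 1,000,000 -> 6, and so on.
log_pop = [math.log10(p) for p in populations]

print("original scale:")
print(f"  mean = {statistics.mean(populations):,.0f}, "
      f"median = {statistics.median(populations):,.0f}")
print("log10 scale:")
print(f"  mean = {statistics.mean(log_pop):.2f}, "
      f"median = {statistics.median(log_pop):.2f}")
```

On the original scale the mean is many times the median because of the two huge values; on the log scale the mean and median are close together, which is consistent with a much more symmetric distribution.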
Transformations can also be applied to one or both variables in a scatterplot. A scatterplot of the
population change from 2010 to 2017 against the population in 2010 is shown in Figure 5.12a. It’s
difficult to decipher any interesting patterns because the population variable is so strongly skewed.
However, if we apply a log10 transformation to the population variable, as shown in Figure 5.12b, a
positive association between the variables is revealed. In fact, we may be interested in fitting a trend
line to the data when we explore methods around fitting regression lines in Chapter 7.
Figure 5.12: Scatterplots of population change and log10-transformed population change vs. population
before change.
Transformations other than the logarithm can be useful, too. For instance, the square root
($\sqrt{\text{original observation}}$) and inverse ($\frac{1}{\text{original observation}}$) are commonly used by data scientists.
Common goals in transforming data are to see the data structure differently, reduce skew, assist in
modeling, or straighten a nonlinear relationship in a scatterplot.
The county dataset offers many numerical variables that we could plot using dot plots, scatterplots, or
box plots, but they can miss the true nature of the data as geographic. When we encounter geographic
data, we should create an intensity map, where colors are used to show higher and lower values of
a variable. Figure 5.13 shows intensity maps for poverty rate in percent (poverty), unemployment
rate in percent (unemployment_rate), homeownership rate in percent (homeownership), and median
household income in $1000s (median_hh_income). The color key indicates which colors correspond to
which values. The intensity maps are not generally very helpful for getting precise values in any given
county, but they are very helpful for seeing geographic trends and generating interesting research
questions or hypotheses.
EXAMPLE
What interesting features are evident in the poverty and unemployment rate intensity maps in
Figure 5.13c and Figure 5.13d?
Poverty rates are evidently higher in a few locations. Notably, the deep south shows higher
poverty rates, as does much of Arizona and New Mexico. High poverty rates are evident in
the Mississippi flood plains a little north of New Orleans and in a large section of Kentucky.
The unemployment rate follows similar trends, and we can see correspondence between the
two variables. In fact, it makes sense for higher rates of unemployment to be closely related
to poverty rates. One observation that stands out when comparing the two maps: the poverty
rate is much higher than the unemployment rate, meaning while many people may be working,
they are not making enough to break out of poverty.
GUIDED PRACTICE
What interesting features are evident in the median household income intensity map in
Figure 5.13b?14
14 Answers will vary. There is some correspondence between high earning and metropolitan areas, where we can see
darker spots (higher median household income), though there are several exceptions. You might look for large cities
you are familiar with and try to spot them on the map as dark spots.
5.8. MAPPING DATA 87
5.9.1 Summary
Fluently working with numerical variables is an important skill for data analysts. In this chapter we
have introduced different visualizations and numerical summaries applied to numeric variables. The
graphical visualizations are even more descriptive when two variables are presented simultaneously. We
presented scatterplots, dot plots, histograms, and box plots. Numerical variables can be summarized
using the mean, median, quartiles, standard deviation, and variance.
5.9.2 Terms
The terms introduced in this chapter are presented in Table 5.5. If you’re not sure what some of these
terms mean, we recommend you go back in the text and review their definitions. You should be able
to easily spot them as bolded text.
5.10 Exercises
2. Associations. Indicate which of the plots show (a) a positive association, (b) a negative
association, or (c) no association. Also determine if the positive and negative associations are
linear or nonlinear. Each part may refer to more than one plot.
3. Reproducing bacteria. Suppose that there is only sufficient space and nutrients to support
one million bacterial cells in a petri dish. You place a few bacterial cells in this petri dish, allow
them to reproduce freely, and record the number of bacterial cells in the dish over time. Sketch
a plot representing the relationship between number of bacterial cells and time.
4. Office productivity. Office productivity is relatively low when the employees feel no stress
about their work or job security. However, high levels of stress can also lead to reduced employee
productivity. Sketch a plot to represent the relationship between stress and productivity.
5. Make-up exam. In a class of 25 students, 24 of them took an exam in class and 1 student
took a make-up exam the following day. The professor graded the first batch of 24 exams and
found an average score of 74 points with a standard deviation of 8.9 points. The student who
took the make-up the following day scored 64 points on the exam.
a. Does the new student’s score increase or decrease the average score?
b. What is the new average?
c. Does the new student’s score increase or decrease the standard deviation of the scores?
15 The mammals data used in this exercise can be found in the openintro R package.
6. Infant mortality. The infant mortality rate is defined as the number of infant deaths per
1,000 live births. This rate is often used as an indicator of the level of health in a country. The
relative frequency histogram below shows the distribution of estimated infant death rates for
224 countries for which such data were available in 2014.16
7. Days off at a mining plant. Workers at a particular mining site receive an average of 35 days
paid vacation, which is lower than the national average. The manager of this plant is under
pressure from a local union to increase the amount of paid time off. However, he does not want
to give more days off to the workers because that would be costly. Instead he decides he should
fire 10 employees in such a way as to raise the average number of days off that are reported by
his employees. In order to achieve this goal, should he fire employees who have the greatest number
of days off, the fewest days off, or about the average number of days off?
8. Medians and IQRs. For each part, compare distributions A and B based on their medians
and IQRs. You do not need to calculate these statistics; simply state how the medians and IQRs
compare. Make sure to explain your reasoning. Hint: It may be useful to sketch dot plots of
the distributions.
a. A: 3, 5, 6, 7, 9; B: 3, 5, 6, 7, 20
b. A: 3, 5, 6, 7, 9; B: 3, 5, 7, 8, 9
c. A: 1, 2, 3, 4, 5; B: 6, 7, 8, 9, 10
d. A: 0, 10, 50, 60, 100; B: 0, 100, 500, 600, 1000
9. Means and SDs. For each part, compare distributions A and B based on their means and
standard deviations. You do not need to calculate these statistics; simply state how the means
and the standard deviations compare. Make sure to explain your reasoning. Hint: It may be
useful to sketch dot plots of the distributions.
a. A: 3, 5, 5, 5, 8, 11, 11, 11, 13; B: 3, 5, 5, 5, 8, 11, 11, 11, 20
b. A: -20, 0, 0, 0, 15, 25, 30, 30; B: -40, 0, 0, 0, 15, 25, 30, 30
c. A: 0, 2, 4, 6, 8, 10; B: 20, 22, 24, 26, 28, 30
d. A: 100, 200, 300, 400, 500; B: 0, 50, 300, 550, 600
10. Histograms and box plots. Describe (in words) the distribution in the histograms below and
match them to the box plots.
16 The cia_factbook data used in this exercise can be found in the openintro R package.
5.10. EXERCISES 91
11. Air quality. Daily air quality is measured by the air quality index (AQI) reported by the
Environmental Protection Agency. This index reports the pollution level and what associated
health effects might be a concern. The index is calculated for five major air pollutants regulated
by the Clean Air Act and takes values from 0 to 300, where a higher value indicates lower air
quality. AQI was reported for 356 days in 2022 in Durham, NC. The histogram below shows
the distribution of the AQI values on these days.17
12. Median vs. mean. Estimate the median for the 400 observations shown in the histogram, and
note whether you expect the mean to be higher or lower than the median.
13. Histograms vs. box plots. Compare the two plots below. What characteristics of the distri-
bution are apparent in the histogram and not in the box plot? What characteristics are apparent
in the box plot but not in the histogram?
17 The pm25_2022_durham data used in this exercise can be found in the openintro R package.
14. Facebook friends. Facebook data indicate that 50% of Facebook users have 100 or more
friends, and that the average friend count of users is 190. What do these findings suggest about
the shape of the distribution of number of friends of Facebook users? (Backstrom 2011)
15. Distributions and appropriate statistics. For each of the following, state whether you ex-
pect the distribution to be symmetric, right skewed, or left skewed. Also specify whether the
mean or median would best represent a typical observation in the data, and whether the vari-
ability of observations would be best represented using the standard deviation or IQR. Explain
your reasoning.
a. Number of pets per household.
b. Distance to work, i.e., number of miles between work and home.
c. Heights of adult males.
d. Age at death.
e. Exam grade on an easy test.
16. Distributions and appropriate statistics. For each of the following, state whether you ex-
pect the distribution to be symmetric, right skewed, or left skewed. Also specify whether the
mean or median would best represent a typical observation in the data, and whether the vari-
ability of observations would be best represented using the standard deviation or IQR. Explain
your reasoning.
a. Housing prices in a country where 25% of the houses cost below $350,000, 50% of the houses
cost below $450,000, 75% of the houses cost below $1,000,000, and there are a meaningful
number of houses that cost more than $6,000,000.
b. Housing prices in a country where 25% of the houses cost below $300,000, 50% of the houses
cost below $600,000, 75% of the houses cost below $900,000, and very few houses cost
more than $1,200,000.
c. Number of alcoholic drinks consumed by college students in a given week. Assume that
most of these students don’t drink since they are under 21 years old, and only a few drink
excessively.
d. Annual salaries of the employees at a Fortune 500 company where only a few high level
executives earn much higher salaries than all the other employees.
e. Gestation time in humans where 25% of the babies are born by 38 weeks of gestation, 50%
of the babies are born by 39 weeks, 75% of the babies are born by 40 weeks, and the
maximum gestation length is 46 weeks.
17. TV watchers. College students in a statistics class were asked how many hours of television
they watch per week, including online streaming services. This sample yielded an average of 8.28
hours, with a standard deviation of 7.18 hours. Is the distribution of number of hours students
watch television weekly symmetric? If not, what shape would you expect this distribution to
have? Explain your reasoning.
18. Exam scores. The average on a history exam (scored out of 100 points) was 85, with a standard
deviation of 15. Is the distribution of the scores on this exam symmetric? If not, what shape
would you expect this distribution to have? Explain your reasoning.
19. Midrange. The midrange of a distribution is defined as the average of the maximum and the
minimum of that distribution. Is this statistic robust to outliers and extreme skew? Explain
your reasoning.
20. Oscar winners. The first Oscar awards for best actor and best actress were given out in 1929.
The histograms below show the age distribution for all of the best actor and best actress winners
from 1929 to 2019. Summary statistics for these distributions are also provided. Compare the
distributions of ages of best actor and actress winners.18
Mean SD n
Best actor 43.8 8.8 92
Best actress 36.2 11.9 92
21. Stats scores. The final exam scores of twenty introductory statistics students, arranged in
ascending order, are as follows: 57, 66, 69, 71, 72, 73, 74, 77, 78, 78, 79, 79, 81, 81, 82, 83, 83,
88, 89, 94. Suppose students who score above the 75th percentile on the final exam get an A in
the class. How many students will get an A in this class?
22. Income at the coffee shop. The first histogram below shows the distribution of the yearly
incomes of 40 patrons at a college coffee shop. Suppose two new people walk into the coffee
shop: one making $225,000 and the other $250,000. The second histogram shows the new income
distribution. Summary statistics are also provided, rounded to the nearest whole number.
a. Would the mean or the median best represent what we might think of as a typical income
for the 42 patrons at this coffee shop? What does this say about the robustness of the two
measures?
b. Would the standard deviation or the IQR best represent the amount of variability in the
incomes of the 42 patrons at this coffee shop? What does this say about the robustness of
the two measures?
18 The oscars data used in this exercise can be found in the openintro R package.
23. A new statistic. The statistic $\bar{x} / \text{median}$ can be used as a measure of skewness. Suppose we have
a distribution where all observations are greater than 0, $x_i > 0$. What is the expected shape of
the distribution under the following conditions? Explain your reasoning.
a. $\bar{x} / \text{median} = 1$
b. $\bar{x} / \text{median} < 1$
c. $\bar{x} / \text{median} > 1$
24. Commute times. The US census collects data on the time it takes Americans to commute
to work, among many other variables. The histogram below shows the distribution of mean
commute times in 3,142 US counties in 2017. Also shown below is a spatial intensity map of the
same data.19
a. Describe the numerical distribution and comment on whether a log transformation may be
advisable for these data.
b. Describe the spatial distribution of commuting times using the map.
25. Hispanic population. The US census collects data on race and ethnicity of Americans, among
many other variables. The histogram below shows the distribution of the percentage of the
population that is Hispanic in 3,142 counties in the US in 2010. Also shown is a histogram of
logs of these values.20
a. Describe the numerical distribution and comment on why we might want to use log-
transformed values in analyzing or modeling these data.
b. What features of the distribution of the Hispanic population in US counties are apparent
in the map but not in the histogram? What features are apparent in the histogram but
not the map?
c. Is one visualization more appropriate or helpful than the other? Explain your reasoning.
19 The county_complete data used in this exercise can be found in the usdata R package.
20 The county_complete data used in this exercise can be found in the usdata R package.
26. NYC marathon winners. The histogram and box plots below show the distribution of
finishing times for male and female (combined) winners of the New York City Marathon between
1970 and 2023.21
a. What features of the distribution are apparent in the histogram and not the box plot?
What features are apparent in the box plot but not in the histogram?
b. What may be the reason for the bimodal distribution? Explain.
c. Compare the distribution of marathon times for men and women based on the box plot
shown below.
d. The time series plot shown below is another way to look at these data. Describe what is
visible in this plot but not in the others.
21 The nyc_marathon data used in this exercise can be found in the openintro R package.
Chapter 6
Applications: Explore
Graphs can powerfully communicate ideas directly and quickly. We all know, after all, that “a picture
is worth 1000 words.” Unfortunately, however, there are times when an image conveys a message
which is inaccurate or misleading.
This chapter focuses on how graphs can best be utilized to present data accurately and effectively.
Along with data modeling, creative visualization is somewhat of an art. However, even with an
art, there are recommended guiding principles. We provide a few best practices for creating data
visualizations.
Figure 6.1: Same information displayed with two very different visualizations.
(a) Default coloring does nothing for the understanding of the data.
(b) Color draws attention directly to the bar on Buildings and Administration.
Figure 6.2: Three bar charts visualizing the same information with different coloring to highlight different
aspects.
98 CHAPTER 6. APPLICATIONS: EXPLORE
Figure 6.3: Time series plot showing monthly Duke University hiring trends over five calendar years.
How well or badly do you think the government are doing at handling Britain’s exit from
the European Union?
• Very well
• Fairly well
• Fairly badly
• Very badly
• Don’t know
1 Source: YouGov Survey Results, retrieved Oct 7, 2019.
6.1. CASE STUDY: EFFECTIVE COMMUNICATION OF EXPLORATORY RESULTS 99
Figure 6.4: Three bar charts visualizing the same information with arrangement of levels.
Figure 6.5: Stacked bar plots. Horizontal orientation makes the region labels easier to read.
Figure 6.6: Three different representations of two variables from the survey, region and opinion.
Figure 6.7: Identical bar plots with two different coloring options.
In this chapter different representations are contrasted to demonstrate best practices in creating
graphs. The fundamental principle is that your graph should provide maximal information succinctly
and clearly. Labels should be clear and oriented horizontally for the reader. Don’t forget titles and,
if possible, include the source of the data.
6.2 Interactive R tutorials
Navigate the concepts you’ve learned in this part in R using the following self-paced tutorials. All
you need is your browser to get started!
Tutorial 2: Exploratory data analysis
https://openintrostat.github.io/ims-tutorials/02-explore
Tutorial 2 - Lesson 1: Visualizing categorical data
https://openintro.shinyapps.io/ims-02-explore-01
Tutorial 2 - Lesson 2: Visualizing numerical data
https://openintro.shinyapps.io/ims-02-explore-02
Tutorial 2 - Lesson 3: Summarizing with statistics
https://openintro.shinyapps.io/ims-02-explore-03
Tutorial 2 - Lesson 4: Case study
https://openintro.shinyapps.io/ims-02-explore-04
You can also access the full list of tutorials supporting this book at https://openintrostat.github.io/ims-tutorials.
6.3 R labs
Further apply the concepts you’ve learned in this part in R with computational labs that walk you
through a data analysis case study.
Intro to data - Flight delays
https://www.openintro.org/go?id=ims-r-lab-intro-to-data
You can also access the full list of labs supporting this book at https://www.openintro.org/go?id=ims-r-labs.
PART III
Regression modeling
Among the most ubiquitous methods used to model a response variable given one or more predictor
variables is regression. Linear regression is most commonly used when the response variable is numeric
(and even better, continuous); logistic regression is used when the response variable is binary.
• In Chapter 7 you are introduced to finding a best fit line using a least squares method. Addi-
tionally, the correlation and coefficient of determination are presented as a way to describe the
strength of the linear model.
• In Chapter 8 the linear model is expanded to include multiple predictor variables in a single
model. We discuss the benefits as well as the pitfalls that can arise when using multiple predic-
tors.
• In Chapter 9 the response variable is constrained to be binary which changes the entire structure
and produces the logistic regression model. The similarities between the regression models
(namely, linear combinations of the predictors) are presented. Additionally, you see that the
logistic regression predictions are now probabilities.
• Chapter 10 includes an application on the Houses for sale case study where the topics from this
part of the book are fully developed.
Later on in the textbook, in the Inferential modeling part, we will consider how a regression model
built on a sample may or may not describe a particular population of interest.
Chapter 7
When considering linear regression, it’s helpful to think deeply about the line fitting process. In this
section, we define the form of a linear model, explore criteria for what makes a good fit, and introduce
a new statistic called correlation.
Figure 7.1: Requests from twelve separate buyers were simultaneously placed with a trading company to
purchase Target Corporation stock (ticker TGT, December 28th, 2018), and the total cost of the shares was
reported. Because the cost is computed using a linear formula, the linear fit is perfect.
106 CHAPTER 7. LINEAR REGRESSION WITH A SINGLE PREDICTOR
Linear regression is the statistical method for fitting a line to data where the relationship between
two variables, 𝑥 and 𝑦, can be modeled by a straight line with some error:
$$y = b_0 + b_1 x + e$$
The values 𝑏0 and 𝑏1 represent the model’s intercept and slope, respectively, and the error is represented
by 𝑒. These values are calculated based on the data, i.e., they are sample statistics. If the observed
data are a random sample from a target population that we are interested in making inferences about,
these values are considered to be point estimates for the population parameters 𝛽0 and 𝛽1 . We will
discuss how to make inferences about parameters of a linear model based on sample statistics in
Chapter 24.
When we use 𝑥 to predict 𝑦, we usually call 𝑥 the predictor variable and we call 𝑦 the outcome.
We also often drop the 𝑒 term when writing down the model since our main focus is often on the
prediction of the average outcome.
It is rare for all of the data to fall perfectly on a straight line. Instead, it’s more common for data
to appear as a cloud of points, such as those examples shown in Figure 7.2. In each case, the data
fall around a straight line, even if none of the observations fall exactly on the line. The first plot
shows a relatively strong downward linear trend, where the remaining variability in the data around
the line is minor relative to the strength of the relationship between 𝑥 and 𝑦. The second plot shows
an upward trend that, while evident, is not as strong as the first. The last plot shows a very weak
downward trend in the data, so slight we can hardly notice it. In each of these examples, we will have
some uncertainty regarding our estimates of the model parameters, 𝛽0 and 𝛽1 . For instance, we might
wonder, should we move the line up or down a little, or should we tilt it more or less? As we move
forward in this chapter, we will learn about criteria for line-fitting, and we will also learn about the
uncertainty associated with estimates of model parameters.
Figure 7.2: Three datasets where a linear model may be useful even though the data do not all fall exactly
on the line.
There are also cases where fitting a straight line to the data, even if there is a clear relationship
between the variables, is not helpful. One such case is shown in Figure 7.3 where there is a very clear
relationship between the variables even though the trend is not linear. We discuss nonlinear trends
in this chapter and the next, but details of fitting nonlinear models are saved for a later course.
Figure 7.3: The best fitting line for these data is flat, which is not a useful way to describe the non-linear
relationship. These data are from a physics experiment.
7.1. FITTING A LINE, RESIDUALS, AND CORRELATION 107
Figure 7.4: The common brushtail possum of Australia. Photo by Greg Schecter, flic.kr/p/9BAFbR, CC
BY 2.0 license.
Figure 7.5 shows a scatterplot for the head length (mm) and total length (cm) of the possums. Each
point represents a single possum from the data. The head and total length variables are associated:
possums with an above average total length also tend to have above average head lengths. While
the relationship is not perfectly linear, it could be helpful to partially explain the connection between
these variables with a straight line.
Figure 7.5: A scatterplot showing head length against total length for 104 brushtail possums. A point
representing a possum with head length 86.7 mm and total length 84 cm is highlighted.
We want to describe the relationship between head and total length of possums with a line. In this
example, we will use the total length as the predictor variable, 𝑥, to predict a possum’s head length,
𝑦. We could fit the linear relationship by eye, as in Figure 7.6.
Figure 7.6: A reasonable linear model was fit to represent the relationship between head length and total
length.
The equation for this line is
$$\hat{y} = 41 + 0.59x$$
A “hat” on $y$ is used to signify that this is an estimate. We can use this line to discuss properties of
possums. For instance, the equation predicts a possum with a total length of 80 cm will have a head
length of
$$\hat{y} = 41 + 0.59 \times 80 = 88.2$$
The estimate may be viewed as an average: the equation predicts that possums with a total length
of 80 cm will have an average head length of 88.2 mm. Absent further information about an 80 cm
possum, the prediction for head length that uses the average is a reasonable estimate.
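The prediction step can be sketched in a few lines of Python. This is only an illustration: the function name `predict_head_length` is ours, and the intercept and slope are the rounded values of the line fit by eye above, not the output of any fitting routine.

```python
def predict_head_length(total_length_cm, intercept=41.0, slope=0.59):
    """Predicted head length (mm) for a possum of the given total length (cm).

    Uses the rounded by-eye line y-hat = 41 + 0.59 x from the text.
    """
    return intercept + slope * total_length_cm


# Predicted average head length for possums that are 80 cm long.
prediction = predict_head_length(80)
print(round(prediction, 1))  # 88.2
```

The returned value is best read as an average for possums of that length rather than an exact forecast for any single animal.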
There may be other variables that could help us predict the head length of a possum besides its length.
Perhaps the relationship would be a little different for male possums than female possums, or perhaps
it would differ for possums from one region of Australia versus another region. Figure 7.7a shows the
relationship between total length and head length of brushtail possums, taking into consideration their
sex. Male possums (represented by blue triangles) seem to be larger in terms of total length and head
length than female possums (represented by red circles). Figure 7.7b shows the same relationship,
taking into consideration their age. It’s harder to tell if age changes the relationship between total
length and head length for these possums.
Figure 7.7: Relationship between total length and head length of brushtail possums, taking into consideration
their sex or age.
In Chapter 8, we’ll learn about how we can include more than one predictor in our model. Before we
get there, we first need to better understand how to best build a linear model with one predictor.
7.1.3 Residuals
Residuals are the leftover variation in the data after accounting for the model fit:
Data = Fit + Residual
Each observation will have a residual, and three of the residuals for the linear model we fit for the
possum data are shown in Figure 7.8. If an observation is above the regression line, then its residual,
the vertical distance from the observation to the line, is positive. Observations below the line have
negative residuals. One goal in picking the right linear model is for residuals to be as small as possible.
Figure 7.8 is almost a replica of Figure 7.6, with three points from the data highlighted. The obser-
vation marked by a red circle has a small, negative residual of about -1; the observation marked by a
gray diamond has a large positive residual of about +7; and the observation marked by a pink triangle
has a moderate negative residual of about -4. The size of a residual is usually discussed in terms of
its absolute value. For example, the residual for the observation marked by a pink triangle is larger
than that of the observation marked by a red circle because | − 4| is larger than | − 1|.
Figure 7.8: A reasonable linear model was fit to represent the relationship between head length and total
length, with three points highlighted.
The residual of the $i^{th}$ observation $(x_i, y_i)$ is the difference of the observed outcome ($y_i$)
and the outcome we would predict based on the model fit ($\hat{y}_i$):
$$e_i = y_i - \hat{y}_i$$
EXAMPLE
The linear fit shown in Figure 7.8 is given as $\hat{y} = 41 + 0.59x$. Based on this line, compute the
residual of the observation (76.0, 85.1). This observation is marked by a red circle in Figure 7.8.
Check it against the earlier visual estimate, -1.
We first compute the predicted value of the observation marked by a red circle based on the
model: $\hat{y} = 41 + 0.59x = 41 + 0.59 \times 76.0 = 85.84$. Next we compute the difference of the actual
head length and the predicted head length: $e = y - \hat{y} = 85.1 - 85.84 = -0.74$. The model’s
error is $e = -0.74$ mm, which is very close to the visual estimate of -1 mm. The negative
residual indicates that the linear model overpredicted head length for this possum.
GUIDED PRACTICE
If a model underestimates an observation, will the residual be positive or negative?
What about if it overestimates the observation?1
GUIDED PRACTICE
Compute the residuals for the observation marked by a blue diamond, (85.0, 98.6), and
the observation marked by a pink triangle, (95.5, 94.0), in the figure using the linear
relationship $\hat{y} = 41 + 0.59x$.2
Residuals are helpful in evaluating how well a linear model fits a dataset. We often display them in
a scatterplot such as the one shown in Figure 7.9 for the regression line in Figure 7.8. The residuals
are plotted with their predicted outcome variable value as the horizontal coordinate, and the vertical
coordinate as the residual. For instance, the point (85.0, 98.6) (marked by the blue diamond) had
a predicted value of 91.15 mm and a residual of 7.45 mm, so in the residual plot it is placed at
(91.15, 7.45). Creating a residual plot is sort of like tipping the scatterplot over so the regression line
is horizontal, as indicated by the dashed line.
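To make the construction of a residual plot concrete, the sketch below computes the (predicted value, residual) coordinates for the three highlighted possums. It assumes the rounded line $\hat{y} = 41 + 0.59x$ from the example, so the numbers may differ slightly from values read off a figure drawn with unrounded coefficients; the helper name `fitted` is ours.

```python
def fitted(total_length_cm, b0=41.0, b1=0.59):
    """Predicted head length (mm) from the rounded line y-hat = 41 + 0.59 x."""
    return b0 + b1 * total_length_cm


# (total length in cm, head length in mm) for the three highlighted possums.
observations = [(76.0, 85.1), (85.0, 98.6), (95.5, 94.0)]

# Each observation becomes one point in the residual plot:
# horizontal coordinate = predicted value, vertical coordinate = residual.
residual_plot_points = []
for x, y in observations:
    y_hat = fitted(x)
    residual = y - y_hat  # observed minus predicted
    residual_plot_points.append((y_hat, residual))

for y_hat, residual in residual_plot_points:
    print(f"predicted = {y_hat:.2f} mm, residual = {residual:+.2f} mm")
```

Points with positive residuals sit above the horizontal zero line in the residual plot, and points with negative residuals sit below it.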
Figure 7.9: Residual plot for the model predicting head length from total length for brushtail possums.
1 If a model underestimates an observation, then the model estimate is below the actual. The residual, which is
the actual observation value minus the model estimate, must then be positive. The opposite is true when the model
overestimates the observation: the residual is negative.
2 Gray diamond: $\hat{y} = 41 + 0.59x = 41 + 0.59 \times 85.0 = 91.15 \rightarrow e = y - \hat{y} = 98.6 - 91.15 = 7.45$. This is close to the
earlier estimate of 7. Pink triangle: $\hat{y} = 41 + 0.59x = 97.3 \rightarrow e = -3.3$. This is also close to the estimate of -4.
EXAMPLE
One purpose of residual plots is to identify characteristics or patterns still apparent in data
after fitting a model. The figure below shows three scatterplots with linear models in the first
row and residual plots in the second row. Can you identify any patterns in the residuals?
Dataset 1: The residuals show no obvious patterns. The residuals are scattered randomly
around 0, represented by the dashed line.
Dataset 2: The second dataset shows a pattern in the residuals. There is some curvature in
the scatterplot, which is more obvious in the residual plot. We should not use a straight line
to model these data. Instead, a more advanced technique should be used to model the curved
relationship, such as the variable transformations discussed in Section 5.7.
Dataset 3: The last plot shows very little upwards trend, and the residuals also show no obvious
patterns. It is reasonable to try to fit a linear model to the data. However, it is unclear whether
there is evidence that the slope parameter is different from zero. The point estimate of the
slope parameter is not zero, but we might wonder if this could just be due to chance. We will
address this scenario in Chapter 24.
Correlation, which always takes values between -1 and 1, describes the strength and
direction of the linear relationship between two variables. We denote the correlation by
𝑟.
The correlation value has no units and will not be affected by a linear change in the
units (e.g., going from inches to centimeters).
We can compute the correlation using a formula, just as we did with the sample mean and standard
deviation. The formula for correlation, however, is rather complex,3 and as with other statistics, we
generally perform the calculations on a computer or calculator.
3 Formally, we can compute the correlation for observations $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ using the formula
$$r = \frac{1}{n - 1} \sum_{i=1}^{n} \frac{x_i - \bar{x}}{s_x} \frac{y_i - \bar{y}}{s_y}$$
where $\bar{x}$, $\bar{y}$, $s_x$, and $s_y$ are the sample means and standard deviations for each variable.
Figure 7.10 shows eight plots and their corresponding correlations. Only when the relationship is
perfectly linear is the correlation either -1 or +1. If the relationship is strong and positive, the
correlation will be near +1. If it is strong and negative, it will be near -1. If there is no apparent
linear relationship between the variables, then the correlation will be near zero.
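As a rough illustration of these limiting cases, the sketch below implements the correlation as the sum of products of standardized $x$ and $y$ values divided by $n - 1$, then applies it to three small made-up datasets. The function names and the data are ours and are not taken from any figure in this chapter.

```python
from math import sqrt


def sample_sd(values):
    """Sample standard deviation (divides by n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))


def correlation(x, y):
    """Correlation r: sum of products of standardized values, divided by n - 1."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_x, s_y = sample_sd(x), sample_sd(y)
    products = (((xi - x_bar) / s_x) * ((yi - y_bar) / s_y) for xi, yi in zip(x, y))
    return sum(products) / (n - 1)


x = [1, 2, 3, 4, 5]
r_up = correlation(x, [2 * v + 1 for v in x])     # perfectly linear, increasing
r_down = correlation(x, [10 - 3 * v for v in x])  # perfectly linear, decreasing
r_curve = correlation(x, [4, 1, 0, 1, 4])         # symmetric curve, no linear trend

print(round(r_up, 6), round(r_down, 6), round(abs(r_curve), 6))
```

The first two values come out at essentially +1 and -1, while the third is essentially 0 even though $y$ is completely determined by $x$, because the relationship has no linear component.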
Figure 7.10: Sample scatterplots and their correlations. The first row shows variables with a positive
relationship, represented by the trend up and to the right. The second row shows variables with a negative
trend, where a large value in one variable is associated with a lower value in the other.
The correlation is intended to quantify the strength of a linear trend. Nonlinear trends, even when
strong, sometimes produce correlations that do not reflect the strength of the relationship; see three
such examples in Figure 7.11.
Figure 7.11: Sample scatterplots and their correlations. In each case, there is a strong relationship between
the variables. However, because the relationship is not linear, the correlation is relatively weak.
GUIDED PRACTICE
No straight line is a good fit for any of the datasets represented in Figure 7.11. Try
drawing nonlinear curves on each plot. Once you create a curve for each, describe what
is important in your fit.4
4 We’ll leave it to you to draw the lines. In general, the lines you draw should be close to most points and reflect
EXAMPLE
The plot below displays the relationships between various crop yields in countries. In the plots,
each point represents a different country. The x and y variables represent the proportion of
total yield in the last 50 years which is due to that crop type.
Order the six scatterplots from strongest negative to strongest positive linear relationship.
𝐴→𝐷→𝐵→𝐶→𝐸→𝐹
One important aspect of the correlation is that it’s unitless. That is, unlike a measurement of the slope
of a line (see the next section) which provides an increase in the y-coordinate for a one unit increase in
the x-coordinate (in units of the x and y variable), there are no units associated with the correlation
of x and y. Figure 7.12 shows the relationship between weights and heights of 507 physically active
individuals. In Figure 7.12a, weight is measured in kilograms (kg) and height in centimeters (cm). In
Figure 7.12b, weight has been converted to pounds (lbs) and height to inches (in). The correlation
coefficient (𝑟 = 0.72) is also noted on both plots. We can see that the shape of the relationship has
not changed, and neither has the correlation coefficient. The only visual change to the plot is the axis
labeling of the points.
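The unit-free property can be checked numerically. The sketch below uses a handful of made-up height and weight values (not the 507 individuals in Figure 7.12) and computes the correlation as the sample covariance divided by the product of the sample standard deviations, which is an equivalent way of writing the usual formula. The conversion factors are 2.54 cm per inch and roughly 2.20462 lbs per kg.

```python
from statistics import mean, stdev


def correlation(x, y):
    """Correlation as sample covariance divided by the product of sample SDs."""
    x_bar, y_bar = mean(x), mean(y)
    cov = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (stdev(x) * stdev(y))


# Hypothetical measurements, for illustration only.
height_cm = [160.0, 165.5, 170.2, 175.0, 180.3, 185.1]
weight_kg = [55.0, 62.3, 60.1, 72.4, 70.0, 81.2]

# The same individuals expressed in inches and pounds.
height_in = [h / 2.54 for h in height_cm]
weight_lb = [w * 2.20462 for w in weight_kg]

r_metric = correlation(height_cm, weight_kg)
r_imperial = correlation(height_in, weight_lb)

print(round(r_metric, 4), round(r_imperial, 4))
```

Both printed values agree, since multiplying a variable by a positive constant rescales its deviations and its standard deviation by the same factor.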
Figure 7.12: Two scatterplots, both displaying the relationship between weights and heights of 507 physically
healthy adults and the correlation coefficient, 𝑟 = 0.72.
Fitting linear models by eye is open to criticism since it is based on an individual’s preference. In this
section, we use least squares regression as a more rigorous approach to fitting a line to a scatterplot.
GUIDED PRACTICE
Is the correlation positive or negative in Figure 7.13?5
5 Larger family incomes are associated with lower amounts of aid, so the correlation will be negative. Using a
Figure 7.13: Gift aid and family income for a random sample of 50 first-year students from Elmhurst College.
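One option is to choose the line that minimizes the sum of the residual magnitudes,
$$|e_1| + |e_2| + \dots + |e_n|$$
which we could accomplish with a computer program. The resulting dashed line shown in Figure 7.14
demonstrates this fit can be quite reasonable.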
Figure 7.14: Gift aid and family income for a random sample of 50 first-year Elmhurst College students.
The dashed line is the line that minimizes the sum of the absolute value of residuals, the solid line is the line
that minimizes the sum of squared residuals, i.e., the least squares line.
However, a more common practice is to choose the line that minimizes the sum of the squared residuals:
$$e_1^2 + e_2^2 + \dots + e_n^2$$
The line that minimizes this least squares criterion is represented as the solid line in Figure 7.14 and is
commonly called the least squares line. The following are four possible reasons to choose the least
squares option instead of trying to minimize the sum of residual magnitudes without any squaring:
1. It is the most commonly used method.
2. Computing the least squares line is widely supported in statistical software.
3. In many applications, a residual twice as large as another residual is more than twice as bad.
For example, being off by 4 is usually more than twice as bad as being off by 2. Squaring the
residuals accounts for this discrepancy.
4. The analyses which link the model to inference about a population are most straightforward
when the line is fit through least squares.
The first two reasons are largely for tradition and convenience; the third and fourth reasons explain
why the least squares criterion is typically most helpful when working with real data.6
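The third reason can be made concrete with a minimal Python sketch that evaluates both criteria on a handful of residuals. The residual values below are hypothetical and chosen only for illustration; they do not come from the Elmhurst data.

```python
# Hypothetical residuals from some fitted line (illustration only).
residuals = [2, -4, 1, -1]

# Criterion 1: sum of residual magnitudes (absolute values).
sum_abs = sum(abs(e) for e in residuals)

# Criterion 2: sum of squared residuals (the least squares criterion).
sum_sq = sum(e ** 2 for e in residuals)

# How much worse is a residual of 4 than a residual of 2 under each criterion?
penalty_ratio_abs = abs(-4) / abs(2)      # twice as bad
penalty_ratio_sq = (-4) ** 2 / 2 ** 2     # four times as bad

print(sum_abs, sum_sq, penalty_ratio_abs, penalty_ratio_sq)
```

Under the squared criterion, the residual of 4 contributes four times as much as the residual of 2, which is the sense in which squaring treats larger misses as more than proportionally worse.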
$$\widehat{\text{aid}} = \beta_0 + \beta_1 \times \text{family\_income}$$
Here the equation is set up to predict gift aid based on a student’s family income, which would be
useful to students considering Elmhurst. These two values, 𝛽0 and 𝛽1 , are the parameters of the
regression line.
The parameters are estimated using the observed data. In practice, this estimation is done using
a computer in the same way that other estimates, like a sample mean, can be estimated using a
computer or calculator.
The dataset where these data are stored is called elmhurst. The first 5 rows of this dataset are given
in Table 7.1.
We can see that family income is recorded in a variable called family_income and gift aid from
the university is recorded in a variable called gift_aid. For now, we won’t worry about the price_paid
variable. We should also note that these data are from the 2011-2012 academic year, and all monetary
amounts are given in $1,000s, i.e., the family income of the first student in the data shown in Table 7.1
is $92,920 and they received a gift aid of $21,700. (The data source states that all numbers have been
rounded to the nearest whole dollar.)
Statistical software is usually used to compute the least squares line, and the typical output from
fitting a regression model looks like the one shown in Table 7.2. For now we will focus
on the first column of the output, which lists 𝑏0 and 𝑏1 . In Chapter 24 we will dive deeper into the
remaining columns, which tell us how accurately and precisely the intercept and slope calculated
from a sample of 50 students estimate the population intercept and slope for all students.
6 There are applications where the sum of residual magnitudes may be more useful, and there are plenty of other
criteria we might consider. However, this book only applies the least squares criterion.
7.2. LEAST SQUARES REGRESSION 117
Table 7.2: Summary of least squares fit for the Elmhurst data.
The model output tells us that the intercept is approximately 24.319 and the slope on family_income
is approximately -0.043.
But what do these values mean? Interpreting parameters in a regression model is often one of the
most important steps in the analysis.
EXAMPLE
The intercept and slope estimates for the Elmhurst data are 𝑏0 = 24.319 and 𝑏1 = -0.043. What
do these numbers really mean?
Interpreting the slope parameter is helpful in almost any application. For each additional $1,000
of family income, we would expect a student to receive a net difference of 1,000 × (-0.0431)
= -$43.10 in aid on average, i.e., $43.10 less. Note that a higher family income corresponds
to less aid because the coefficient of family income is negative in the model. We must be
cautious in this interpretation: while there is a real association, we cannot interpret a causal
connection between the variables because these data are observational. That is, increasing
a particular student’s family income may not cause the student’s aid to drop. (Although it
would be reasonable to contact the college and ask if the relationship is causal, i.e., if Elmhurst
College’s aid decisions are partially based on students’ family income.)
The estimated intercept 𝑏0 = 24.319 describes the average aid if a student’s family had no
income, $24,319. The meaning of the intercept is relevant to this application since the family
income for some students at Elmhurst is $0. In other applications, the intercept may have
little or no practical value if there are no observations where 𝑥 is near zero.
The slope describes the estimated difference in the predicted average outcome of 𝑦 if
the predictor variable 𝑥 happened to be one unit larger. The intercept describes the
average outcome of 𝑦 if 𝑥 = 0 and the linear model is valid all the way to 𝑥 = 0 (values
of 𝑥 = 0 are not observed or relevant in many applications).
If you would like to learn more about using R to fit linear models, see Section 10.2 for the interactive
R tutorials. An alternative way of calculating the intercept and slope of a least squares line is
manual calculation using formulas. While manual calculations are not commonly used by practicing
statisticians and data scientists, they are useful to work through the first time you’re learning about the
least squares line and modeling in general. Calculating the values by hand leverages two properties
of the least squares line:
1. The slope of the least squares line can be estimated by
$$b_1 = \frac{s_y}{s_x} r$$
where 𝑟 is the correlation between the two variables, and 𝑠𝑥 and 𝑠𝑦 are the sample standard deviations
of the predictor and outcome, respectively.
2. If 𝑥̄ is the sample mean of the predictor variable and 𝑦 ̄ is the sample mean of the outcome
variable, then the point (𝑥,̄ 𝑦)̄ falls on the least squares line.
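The two properties above can be checked numerically. The following is a minimal sketch in Python using only the standard library; the data values are hypothetical and serve only to show that the slope computed from the correlation and standard deviations agrees with the slope from the direct least squares formula.

```python
from statistics import mean, stdev

# Hypothetical (x, y) data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
x_bar, y_bar = mean(x), mean(y)
s_x, s_y = stdev(x), stdev(y)  # sample standard deviations

# Sum of cross-products of deviations, used for both r and the direct slope.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Sample correlation, computed from its definition.
r = s_xy / ((n - 1) * s_x * s_y)

# Property 1: the slope is b1 = (s_y / s_x) * r.
b1 = (s_y / s_x) * r

# Property 2: the line passes through (x_bar, y_bar), which determines b0.
b0 = y_bar - b1 * x_bar

# Cross-check against the direct least squares slope formula.
b1_direct = s_xy / sum((xi - x_bar) ** 2 for xi in x)

print(round(b1, 3), round(b0, 3), round(b1_direct, 3))
```

Because the intercept is obtained by forcing the line through the point of sample means, the second property holds by construction in this sketch.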
Table 7.3 shows the sample means for the family income and gift aid as $101,780 and $19,940, respec-
tively. We could plot the point (102, 19.9) on Figure 7.13 to verify it falls on the least squares line
(the solid line).
Table 7.3: Summary statistics for family income and gift aid.
EXAMPLE
Using the summary statistics in Table 7.3, compute the slope for the regression line of gift aid
against family income.
Compute the slope using the summary statistics from Table 7.3:
$$b_1 = \frac{s_y}{s_x} r = \frac{5.46}{63.2} (-0.499) = -0.0431$$
You might recall the form of a line from math class, which we can use to find the model fit, including
the estimate of 𝑏0 . Given the slope of a line and a point on the line, (𝑥0 , 𝑦0 ), the equation for the line
can be written as
𝑦 − 𝑦0 = 𝑠𝑙𝑜𝑝𝑒 × (𝑥 − 𝑥0 )
EXAMPLE
Using the point (102, 19.9) from the sample means and the slope estimate 𝑏1 = −0.0431, find
the least-squares line for predicting aid based on family income.
Apply the point-slope equation using (102, 19.9) and the slope 𝑏1 = −0.0431:
𝑦 − 𝑦0 = 𝑏1 (𝑥 − 𝑥0 )
𝑦 − 19.9 = −0.0431(𝑥 − 102)
Expanding the right side and then adding 19.9 to each side, the equation simplifies:
$$\widehat{\text{aid}} = 24.3 - 0.0431 \times \text{family\_income}$$
Here we have replaced 𝑦 with $\widehat{\text{aid}}$ and 𝑥 with family_income to put the equation in context.
The final least squares equation should always include a “hat” on the variable being predicted,
whether it is a generic “𝑦” or a named variable like “aid”.
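The hand calculation in the two examples above can be reproduced in a few lines. This sketch uses the summary statistics quoted in the text (means of 101.78 and 19.94, standard deviations of 63.2 and 5.46, and correlation -0.499, all in $1,000s); the variable names are illustrative choices, not part of any dataset.

```python
# Summary statistics reported in the text (all amounts in $1,000s).
x_bar, s_x = 101.78, 63.2   # family income: mean, standard deviation
y_bar, s_y = 19.94, 5.46    # gift aid: mean, standard deviation
r = -0.499                  # correlation between the two variables

# Slope from the correlation and standard deviations.
b1 = (s_y / s_x) * r

# Intercept from the point-slope form through (x_bar, y_bar):
# y - y_bar = b1 * (x - x_bar)  =>  b0 = y_bar - b1 * x_bar
b0 = y_bar - b1 * x_bar

print(round(b1, 4), round(b0, 1))
```

Using the unrounded means rather than (102, 19.9) changes the intercept only slightly, and it still rounds to 24.3.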
EXAMPLE
Suppose a high school senior is considering Elmhurst College. Can they simply use the linear
equation that we have estimated to calculate their financial aid from the university?
They may use it as an estimate, though some qualifiers on this approach are important. First,
all data come from one first-year class, and the way aid is determined by the university may
change from year to year. Second, the equation will provide an imperfect estimate. While the
linear equation is good at modeling the trend in the data, no individual student’s aid will be
perfectly predicted (as can be seen from the individual data points around the line).
EXAMPLE
Use the model $\widehat{\text{aid}} = 24.3 - 0.0431 \times \text{family\_income}$ to estimate the aid of another first-year
student whose family had income of $1 million.
We want to calculate the aid for a family with $1 million income. Note that in our model this
will be represented as 1,000 since the data are in $1,000s.
24.3 − 0.0431 × 1000 = −18.8
The model predicts this student will have -$18,800 in aid (!). However, Elmhurst College does
not offer negative aid where they select some students to pay extra on top of tuition to attend.
Applying a model estimate to values outside of the realm of the original data is called extrapolation.
Generally, a linear model is only an approximation of the real relationship between two variables. If
we extrapolate, we are making an unreliable bet that the approximate linear relationship will be valid
in places where it has not been analyzed.
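One way to guard against accidental extrapolation is to check each input against the range of the data used to fit the model. The sketch below is a hypothetical helper, not part of any package: the function name `predict_aid` is made up for illustration, and the upper bound of the observed range (300, in $1,000s) is an assumed placeholder rather than a value taken from the text.

```python
def predict_aid(family_income, income_range):
    """Predict gift aid (in $1,000s) and flag inputs outside the observed range."""
    low, high = income_range
    prediction = 24.3 - 0.0431 * family_income   # fitted line from the text
    is_extrapolation = not (low <= family_income <= high)
    return prediction, is_extrapolation

# Assumed range of family income in the sample, in $1,000s.
# The upper bound is a hypothetical placeholder, not a value from the text.
observed_range = (0, 300)

# A family income of $1 million is represented as 1,000.
pred, extrap = predict_aid(1000, observed_range)
print(round(pred, 1), extrap)
```

The flag does not make the prediction any more reliable; it only signals that the linear approximation is being applied where it has not been checked against data.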
or about 25%, of the outcome variable’s variation by using information about family income for
predicting aid using a linear model. It turns out that $R^2$ corresponds exactly to the squared value of
the correlation:
$$r = -0.499 \rightarrow R^2 = 0.25$$
GUIDED PRACTICE
If a linear model has a very strong negative relationship with a correlation of -0.97, how
much of the variation in the outcome is explained by the predictor?7
Since 𝑟 is always between -1 and 1, $R^2$ will always be between 0 and 1. This statistic is
called the coefficient of determination, and it measures the proportion of variation
in the outcome variable, 𝑦, that can be explained by the linear model with predictor 𝑥.
More generally, $R^2$ can be calculated as one minus the ratio of a measure of variability around the line to
a measure of total variability.
We can measure the variability in the 𝑦 values by how far they tend to fall from their
mean, $\bar{y}$. We define this value as the total sum of squares, calculated using the
formula below, where $y_i$ represents each 𝑦 value in the sample, and $\bar{y}$ represents the
mean of the 𝑦 values in the sample.
$$SST = (y_1 - \bar{y})^2 + (y_2 - \bar{y})^2 + \dots + (y_n - \bar{y})^2$$
EXAMPLE
Among 50 students in the elmhurst dataset, the total variability in gift aid is $SST = 1461$.⁹
The sum of squared residuals is $SSE = 1098$. Find $R^2$.
$$R^2 = 1 - \frac{SSE}{SST} = 1 - \frac{1098}{1461} = 0.25,$$
the same value we found when we squared the correlation: $R^2 = (-0.499)^2 = 0.25$.
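The arithmetic in this example is easy to confirm. The following short sketch computes the coefficient of determination both ways, using only the values of SST, SSE, and the correlation quoted in the text; the variable names are illustrative.

```python
# Values reported in the example (gift aid in $1,000s).
sst = 1461    # total sum of squares
sse = 1098    # sum of squared residuals
r = -0.499    # correlation between family income and gift aid

r2_from_sums = 1 - sse / sst   # proportion of variation explained by the line
r2_from_r = r ** 2             # squared correlation

print(round(r2_from_sums, 2), round(r2_from_r, 2))  # prints 0.25 0.25
```

The two unrounded values differ slightly in the third decimal place because the inputs quoted in the text are themselves rounded.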
7 About $R^2 = (-0.97)^2 = 0.94$ or 94% of the variation in the outcome variable is explained by the linear model.
8 The difference $SST - SSE$ is called the regression sum of squares, $SSR$, and can also be calculated as
$SSR = (\hat{y}_1 - \bar{y})^2 + (\hat{y}_2 - \bar{y})^2 + \dots + (\hat{y}_n - \bar{y})^2$. $SSR$ represents the variation in 𝑦 that was accounted for in our model.
9 $SST$ can be calculated by finding the sample variance of the outcome variable, $s^2$, and multiplying by $n - 1$.
A plot of the auction data is shown in Figure 7.15. Note that the original dataset contains some Mario
Kart games being sold at prices above $100 but for this analysis we have limited our focus to the 141
Mario Kart games that were sold below $100.
Figure 7.15: Total auction prices for the video game Mario Kart, divided into used (𝑥 = 0) and new (𝑥 = 1)
condition games. The least squares regression line is also shown.
To incorporate the game condition variable into a regression equation, we must convert the categories
into a numerical form. We will do so using an indicator variable called condnew, which takes value
1 when the game is new and 0 when the game is used. Using this indicator variable, the linear model
may be written as
$$\widehat{\text{price}} = b_0 + b_1 \times \text{condnew}$$
Table 7.4: Least squares regression summary for the final auction price against the condition of the game.
Using values from Table 7.4, the model equation can be summarized as
$$\widehat{\text{price}} = 42.9 + 10.9 \times \text{condnew}$$
EXAMPLE
Interpret the two parameters estimated in the model for the price of Mario Kart in eBay
auctions.
The intercept is the estimated price when condnew has a value 0, i.e., when the game is in used
condition. That is, the average selling price of a used version of the game is $42.9. The slope
indicates that, on average, new games sell for about $10.9 more than used games.
The estimated intercept is the average value of the outcome variable for the first category (i.e.,
the category corresponding to an indicator value of 0). The estimated slope is the
average difference in the outcome variable between the two categories.
Note that, fundamentally, the intercept and slope interpretations do not change when modeling cat-
egorical variables with two levels. However, when the predictor variable is binary, the coefficient
estimates (𝑏0 and 𝑏1 ) are directly interpretable with respect to the dataset at hand.
We’ll elaborate further on modeling categorical predictors in Chapter 8, where we examine the influence
of many predictor variables simultaneously using multiple regression.
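The interpretation of the coefficients for a two-level predictor can be verified directly. In this minimal sketch, the auction prices are hypothetical values made up for illustration (they are not the Mario Kart data); the point is that the least squares intercept equals the mean of the group coded 0 and the slope equals the difference in group means.

```python
from statistics import mean

# Hypothetical auction prices in dollars, for illustration only.
used_prices = [40.0, 45.0, 43.0]   # condnew = 0
new_prices = [52.0, 55.0, 55.0]    # condnew = 1

# Build the indicator variable and the outcome vector.
x = [0] * len(used_prices) + [1] * len(new_prices)
y = used_prices + new_prices

# Least squares slope and intercept from the usual formulas.
x_bar, y_bar = mean(x), mean(y)
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar

# Compare with the group means.
print(round(b0, 2), round(mean(used_prices), 2))
print(round(b1, 2), round(mean(new_prices) - mean(used_prices), 2))
```

This equivalence is specific to a single 0/1 predictor; with additional predictors the coefficients no longer reduce to simple group means.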
In this section, we discuss when outliers are important and influential. Outliers in a regression model
with one predictor and one outcome are observations that fall far from the cloud of points. These
points are especially important because they can have a strong influence on the least squares line.
Note that there are times when observations are outlying in the 𝑥 direction, the 𝑦 direction, or both.
However, being outlying in a univariate sense (in 𝑥, in 𝑦, or in both) does not by itself make a point outlying from the bivariate
model. If the points are in line with the bivariate model, they will not influence the least squares
regression line (even if the observations are outlying in the 𝑥 direction, the 𝑦 direction, or both!).
EXAMPLE
There are three plots shown in Figure 7.16a along with the corresponding least squares line
and residual plots. For each scatterplot and residual plot pair, identify the outliers and note
how they influence the least squares line. Recall that an outlier is any point that does not
appear to belong with the vast majority of the other points.
A: There is one outlier far from the other points (in the 𝑦 direction and it is an outlier of the
bivariate model), though it only appears to slightly influence the line.
B: There is one outlier on the right (in the 𝑥 and 𝑦 direction although it is not an outlier of
the bivariate model), though it is quite close to the least squares line, which suggests it wasn’t
very influential.
C: There is one point far away from the cloud (in the 𝑥 and 𝑦 direction and an outlier of
the bivariate model), and this outlier appears to pull the least squares line up on the right;
examine how the line around the primary cloud does not appear to fit very well.
7.3. OUTLIERS IN LINEAR REGRESSION 123
EXAMPLE
There are three plots shown in Figure 7.16b along with the least squares line and residual plots.
As you did in the previous example, for each scatterplot and residual plot pair, identify the outliers
and note how they influence the least squares line. Recall that an outlier is any point that
does not appear to belong with the vast majority of the other points. A point can be outlying
in the 𝑥 direction, in the 𝑦 direction, or in relation to the bivariate model.
D: There is a primary cloud and then a small secondary cloud of four outliers (with respect
to both 𝑥 and the bivariate model). The secondary cloud appears to be influencing the line
somewhat strongly, making the least squares line fit poorly almost everywhere. There might be
an interesting explanation for the dual clouds, which is something that could be investigated.
E: There is no obvious trend in the main cloud of points and the outlier on the right (with
respect to both 𝑥 and 𝑦) appears to largely (and problematically) control the slope of the least
squares line. The point creates a bivariate model when seemingly there is none.
F: There is one outlier far from the cloud (with respect to both 𝑥 and 𝑦). However, it falls
quite close to the least squares line and does not appear to be very influential (it is not outlying
with respect to the bivariate model).
Figure 7.16: Plots of six datasets, each with a least squares line and corresponding residual plot. Each
dataset has at least one outlier.
Examine the residual plots in Figure 7.16a and Figure 7.16b. In Plots C, D, and E, you will probably
find that there are a few observations which are both away from the remaining points along the x-axis
and not in the trajectory of the trend in the rest of the data. In these cases, the outliers influenced
the slope of the least squares lines. In Plot E, the bulk of the data show no clear trend, but if we fit
a line to these data, we impose a trend where there isn’t really one.
A good practice for dealing with outlying observations is to produce two analyses: one with and one
without the outlying observations. Presenting both analyses to a client and discussing the role of the
outlying observations should lead you to a more holistic understanding of the appropriate model for
the data.
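The practice of fitting the model with and without outlying observations can be sketched in a few lines. The data below are hypothetical and constructed to resemble a situation like Plot E: a cloud with no trend plus a single point far to the right. The helper `least_squares` is an illustrative function written for this sketch, not a library routine.

```python
from statistics import mean

def least_squares(x, y):
    """Return (intercept, slope) of the least squares line for paired data."""
    x_bar, y_bar = mean(x), mean(y)
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - slope * x_bar, slope

# Hypothetical data: a flat cloud plus one point far to the right.
x = [1, 2, 3, 4, 5, 20]
y = [3, 2, 3, 2, 3, 12]

# Analysis 1: all observations.
b0_all, b1_all = least_squares(x, y)

# Analysis 2: the outlying observation (the last one) left out.
b0_trim, b1_trim = least_squares(x[:-1], y[:-1])

print(round(b1_all, 3), round(b1_trim, 3))
```

In this constructed example the slope is essentially zero without the outlying point and clearly positive with it, so the single point is responsible for the apparent trend. Reporting both fits makes that dependence visible rather than hiding it.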
Leverage.
Points that fall horizontally away from the center of the cloud tend to pull harder on
the line, so we call them points with high leverage or leverage points.
Points that fall horizontally far from the center of the cloud are points of high leverage; these points can strongly
influence the slope of the least squares line. If one of these high leverage points does appear to actually
exert its influence on the slope of the line – as in Plots C, D, and E of Figure 7.16a and Figure 7.16b
– then we call it an influential point. Usually we can say a point is influential if, had we fitted the
line without it, the point would have been unusually far from the least squares line.
Types of outliers.
A point (or a group of points) that stands out from the rest of the data is called an
outlier. Outliers that fall horizontally away from the center of the cloud of points
are called leverage points. Outliers that influence the slope of the line are called
influential points.
It is tempting to remove outliers. Don’t do this without a very good reason. Models that ignore
exceptional (and interesting) cases often perform poorly. For instance, if a financial firm ignored the
largest market swings – the “outliers” – they would soon go bankrupt by making poorly thought-out
investments.
7.4. CHAPTER REVIEW 125
7.4.1 Summary
Throughout this chapter, the nuances of the linear model have been described. You have learned
how to create a linear model with explanatory variables that are numerical (e.g., total possum length)
and those that are categorical (e.g., whether a video game was new). The residuals in a linear model
are an important metric used to understand how well a model fits; high leverage points, influential
points, and other types of outliers can impact the fit of a model. Correlation is a measure of the
strength and direction of the linear relationship of two variables, without specifying which variable is
the explanatory and which is the outcome. Future chapters will focus on generalizing the linear model
from the sample of data to claims about the population of interest.
7.4.2 Terms
The terms introduced in this chapter are presented in Table 7.5. If you’re not sure what some of these
terms mean, we recommend you go back in the text and review their definitions. You should be able
to easily spot them as bolded text.
7.5 Exercises
2. Trends in residuals. Shown below are two plots of residuals remaining after fitting a linear
model to two different sets of data. For each plot, describe important features and determine if
a linear model would be appropriate for these data. Explain your reasoning.
3. Identify relationships, I. For each of the six plots, identify the strength of the relationship
(e.g., weak, moderate, or strong) in the data and whether fitting a linear model would be
reasonable.
4. Identify relationships, II. For each of the six plots, identify the strength of the relationship
(e.g., weak, moderate, or strong) in the data and whether fitting a linear model would be
reasonable.
5. Midterms and final. The two scatterplots below show the relationship between the overall
course average and two midterm exams (Exam 1 and Exam 2) recorded for 233 students during
several years for a statistics course at a university.10
a. Based on these graphs, which of the two exams has the strongest correlation with the course
grade? Explain.
b. Can you think of a reason why the correlation between the exam you chose in part (a) and
the course grade is higher?
6. Meat consumption and life expectancy. In data collected for You et al. (2022), total meat
intake is associated with life expectancy (at birth) in 175 countries. Meat intake is measured in
kg per capita per year (averaged over 2011 to 2013). Additionally, the authors collected data on
carbohydrate crops (e.g., cereals, root, sugar) in kg per capita per year (averaged over 2011
to 2013). The scatterplot on the left shows the life expectancy at birth plotted against the
per capita meat consumption. The scatterplot on the right shows the amount of carbohydrate
consumption plotted against the per capita meat consumption.
10 The exam_grades data used in this exercise can be found in the openintro R package.
7. Match the correlation, I. Match each correlation to the corresponding scatterplot.11
a. 𝑟 = −0.7
b. 𝑟 = 0.45
c. 𝑟 = 0.06
d. 𝑟 = 0.92
8. Match the correlation, II. Match each correlation to the corresponding scatterplot.12
a. 𝑟 = 0.49
b. 𝑟 = −0.48
c. 𝑟 = −0.03
d. 𝑟 = −0.85
11 The corr_match data used in this exercise can be found in the openintro R package.
12 The corr_match data used in this exercise can be found in the openintro R package.
13 The bdims data used in this exercise can be found in the openintro R package.
10. Compare correlations. Eduardo and Rosie are both collecting data on number of rainy days
in a year and the total rainfall for the year. Eduardo records rainfall in inches and Rosie in
centimeters. How will their correlation coefficients compare?
11. The Coast Starlight, correlation. The Coast Starlight Amtrak train runs from Seattle to
Los Angeles. The scatterplot below displays the distance between each stop (in miles) and the
amount of time it takes to travel from one stop to another (in minutes).14
12. Crawling babies, correlation. A study conducted at the University of Denver investigated
whether babies take longer to learn to crawl in cold months, when they are often bundled in
clothes that restrict their movement, than in warmer months. Infants born during the study
year were split into twelve groups, one for each birth month. We consider the average crawling
age of babies in each group against the average temperature when the babies are six months old
(that’s when babies often begin trying to crawl). Temperature is measured in degrees Fahrenheit
(F) and age is measured in weeks.15 (Benson 1993)
14 The coast_starlight data used in this exercise can be found in the openintro R package.
15 The babies_crawl data used in this exercise can be found in the openintro R package.
13. Meat and carbohydrate consumption. What would be the correlation between the per
capita meat consumption and per capita carbohydrate consumption if, for each country, people
always consumed
a. 3 kg more of meat than of carbohydrates each year?
b. 2 kg less of meat than of carbohydrates each year?
c. half as much meat as carbohydrates each year?
14. Graduate degrees and salaries. What would be the correlation between the annual salaries
of people with and without a graduate degree at a company if, for a certain type of position,
someone with a graduate degree always made
a. $5,000 more than those without a graduate degree?
b. 25% more than those without a graduate degree?
c. 15% less than those without a graduate degree?
15. Units of regression. Consider a regression predicting the number of calories (cal) from width
(cm) for a sample of square shaped chocolate brownies. What are the units of the correlation
coefficient, the intercept, and the slope?
16. Which is higher? Determine if (I) or (II) is higher or if they are equal: “For a regression line,
the uncertainty associated with the slope estimate, 𝑏1 , is higher when (I) there is a lot of scatter
around the regression line or (II) there is very little scatter around the regression line.” Explain
your reasoning.
17. Over-under, I. Suppose we fit a regression line to predict the shelf life of an apple based on
its weight. For a particular apple, we predict the shelf life to be 4.6 days. The apple’s residual
is -0.6 days. Did we overestimate or underestimate the shelf life of the apple? Explain your reasoning.
18. Over-under, II. Suppose we fit a regression line to predict the number of incidents of skin
cancer per 1,000 people from the number of sunny days in a year. For a particular year, we
predict the incidence of skin cancer to be 1.5 per 1,000 people, and the residual for this year is
0.5. Did we overestimate or underestimate the incidence of skin cancer? Explain your reasoning.
19. Starbucks, calories, and protein. The scatterplot below shows the relationship between
the number of calories and amount of protein (in grams) Starbucks food menu items contain.
Since Starbucks only lists the number of calories on the display items, we might be interested
in predicting the amount of protein a menu item has based on its calorie content.16
a. Describe the relationship between number of calories and amount of protein (in grams) that
Starbucks food menu items contain.
b. In this scenario, what are the predictor and outcome variables?
c. Why might we want to fit a regression line to these data?
d. What does the residuals vs. predicted plot tell us about the variability in prediction errors
based on this model for items with lower vs. higher predicted protein?
16 The starbucks data used in this exercise can be found in the openintro R package.
20. Starbucks, calories, and carbs. The scatterplot below shows the relationship between the
number of calories and amount of carbohydrates (in grams) Starbucks food menu items contain.
Since Starbucks only lists the number of calories on the display items, we might be interested
in predicting the amount of carbs a menu item has based on its calorie content.17
a. Describe the relationship between number of calories and amount of carbohydrates (in
grams) that Starbucks food menu items contain.
b. In this scenario, what are the predictor and outcome variables?
c. Why might we want to fit a regression line to these data?
d. What does the residuals vs. predicted plot tell us about the variability in prediction errors
based on this model for items with lower vs. higher predicted carbs?
21. The Coast Starlight, regression. The Coast Starlight Amtrak train runs from Seattle to
Los Angeles. The scatterplot below displays the distance between each stop (in miles) and the
amount of time it takes to travel from one stop to another (in minutes). The mean travel time
from one stop to the next on the Coast Starlight is 129 mins, with a standard deviation of 113
minutes. The mean distance traveled from one stop to the next is 108 miles with a standard
deviation of 99 miles. The correlation between travel time and distance is 0.636.18
a. Write the equation of the regression line for predicting travel time.
b. Interpret the slope and the intercept in this context.
c. Calculate 𝑅2 of the regression line for predicting travel time from distance traveled for the
Coast Starlight, and interpret 𝑅2 in the context of the application.
d. The distance between Santa Barbara and Los Angeles is 103 miles. Use the model to
estimate the time it takes for the Starlight to travel between these two cities.
e. It takes the Coast Starlight about 168 mins to travel from Santa Barbara to Los Angeles.
Calculate the residual and explain the meaning of this residual value.
f. Suppose Amtrak is considering adding a stop to the Coast Starlight 500 miles away from
Los Angeles. Would it be appropriate to use this linear model to predict the travel time
from Los Angeles to this point?
17 The starbucks data used in this exercise can be found in the openintro R package.
18 The coast_starlight data used in this exercise can be found in the openintro R package.
22. Body measurements, regression. Researchers studying anthropometry collected body and
skeletal diameter measurements, as well as age, weight, height and sex for 507 physically active
individuals. The scatterplot below shows the relationship between height and shoulder girth
(circumference of shoulders measured over deltoid muscles), both measured in centimeters. The
mean shoulder girth is 107.20 cm with a standard deviation of 10.37 cm. The mean height is
171.14 cm with a standard deviation of 9.41 cm. The correlation between height and shoulder
girth is 0.67.19 (Heinz et al. 2003)
23. Poverty and unemployment. The following scatterplot shows the relationship between percent
of population below the poverty level (poverty) and unemployment rate among those
ages 20-64 (unemployment_rate) in counties in the US, as provided by data from the 2019
American Community Survey. The regression output for the model for predicting poverty from
unemployment_rate is also provided.20
19 The bdims data used in this exercise can be found in the openintro R package.
20 The county_2019 data used in this exercise can be found in the usdata R package.
24. Cat weights. The following regression output is for predicting the heart weight (Hwt, in g) of
cats from their body weight (Bwt, in kg). The coefficients are estimated using a dataset of 144
domestic cats.21
25. Outliers, I. Identify the outliers in the scatterplots shown below, and determine what type of
outliers they are. Explain your reasoning.
26. Outliers, II. Identify the outliers in the scatterplots shown below and determine what type of
outliers they are. Explain your reasoning.
27. Urban homeowners, outliers. The scatterplot below shows the percent of families who own
their home vs. the percent of the population living in urban areas. There are 52 observations,
each corresponding to a state in the US. Puerto Rico and District of Columbia are also included.22
21 The cats data used in this exercise can be found in the MASS R package.
22 The urban_owner data used in this exercise can be found in the usdata R package.
28. Crawling babies, outliers. A study conducted at the University of Denver investigated
whether babies take longer to learn to crawl in cold months, when they are often bundled in
clothes that restrict their movement, than in warmer months. The plot below shows the relation-
ship between average crawling age of babies born in each month and the average temperature
in the month when the babies are six months old. The plot reveals a potential outlying month
when the average temperature is about 53F and average crawling age is about 28.5 weeks. Does
this point have high leverage? Is it an influential point?23 (Benson 1993)
29. True / False. Determine if the following statements are true or false. If false, explain why.
a. A correlation coefficient of -0.90 indicates a stronger linear relationship than a correlation
of 0.5.
b. Correlation is a measure of the association between any two variables.
30. Cherry trees. The scatterplots below show the relationship between height, diameter, and
volume of timber in 31 felled black cherry trees. The diameter of the tree is measured 4.5 feet
above the ground.24
23 The babies_crawl data used in this exercise can be found in the openintro R package.
24 The trees data used in this exercise can be found in the datasets R package.
31. Match the correlation, III. Match each correlation to the corresponding scatterplot.25
a. r = 0.69
b. r = 0.09
c. r = -0.91
d. r = 0.97
32. Helmets and lunches. The scatterplot shows the relationship between socioeconomic status
measured as the percentage of children in a neighborhood receiving reduced-fee lunches at school
(lunch) and the percentage of bike riders in the neighborhood wearing helmets (helmet). The
average percentage of children receiving reduced-fee lunches is 30.833% with a standard deviation
of 26.724% and the average percentage of bike riders wearing helmets is 30.883% with a standard
deviation of 16.948%.
a. If the 𝑅2 for the least-squares regression line for these data is 72%, what is the correlation
between lunch and helmet?
b. Calculate the slope and intercept for the least-squares regression line for these data.
c. Interpret the intercept of the least-squares regression line in the context of the application.
d. Interpret the slope of the least-squares regression line in the context of the application.
e. What would the value of the residual be for a neighborhood where 40% of the children
receive reduced-fee lunches and 40% of the bike riders wear helmets? Interpret the meaning
of this residual in the context of the application.
25 The corr_match data used in this exercise can be found in the openintro R package.
Chapter 8
Multiple regression extends single predictor variable regression to the case that still has one response
but many predictors (denoted 𝑥1 , 𝑥2 , 𝑥3 , …). The method is motivated by scenarios where many
variables may be simultaneously connected to an output.
We will consider data about loans from the peer-to-peer lender, Lending Club, which is a dataset we
first encountered in Chapter 1. The loan data includes terms of the loan as well as information about
the borrower. The outcome variable we would like to better understand is the interest rate assigned to
the loan. For instance, all other characteristics held constant, does it matter how much debt someone
already has? Does it matter if their income has been verified? Multiple regression will help us answer
these and other questions.
The dataset includes information on 10,000 loans, and we’ll be looking at a subset of the available
variables, some of which are new compared with those we saw in earlier chapters. The first six observations
in the dataset are shown in Table 8.1, and descriptions for each variable are shown in Table 8.2. Notice
that the past bankruptcy variable (bankruptcy) is an indicator variable, where it takes the value 1
if the borrower had a past bankruptcy in their record and 0 if not. Using an indicator variable in
place of a category name allows for these variables to be directly used in regression. Two of the other
variables are categorical (verified_income and issue_month), each of which can take one of a few
different non-numerical values; we’ll discuss how these are handled in the model in Section 8.1.
Table 8.2: Variables and their descriptions for the loans dataset.
Variable | Description
interest_rate | Interest rate on the loan, in an annual percentage.
verified_income | Categorical variable describing whether the borrower’s income source and amount have been verified, with levels Verified (source and amount verified), Source Verified (source only verified), and Not Verified.
debt_to_income | Debt-to-income ratio, which is the percentage of total debt of the borrower divided by their total income.
credit_util | Of all the credit available to the borrower, what fraction are they utilizing. For example, the credit utilization on a credit card would be the card’s balance divided by the card’s credit limit.
bankruptcy | An indicator variable for whether the borrower has a past bankruptcy in their record. This variable takes a value of 1 if the answer is yes and 0 if the answer is no.
Let’s start by fitting a linear regression model for interest rate with a single predictor indicating
whether a person has a bankruptcy in their record:
$$\widehat{\texttt{interest\_rate}} = 12.34 + 0.74 \times \texttt{bankruptcy}$$
Table 8.4: Summary of a linear model for predicting interest_rate based on whether the borrower has a
bankruptcy in their record. Degrees of freedom for this model is 9998.
EXAMPLE
Interpret the coefficient for the past bankruptcy variable in the model.
The variable takes one of two values: 1 when the borrower has a bankruptcy in their history
and 0 otherwise. A slope of 0.74 means that the model predicts a 0.74% higher interest rate
for those borrowers with a bankruptcy in their record. (See Section 7.2.6 for a review of the
interpretation for two-level categorical predictor variables.)
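The interpretation above can be checked with a few lines of code. The following is a minimal sketch that plugs the two possible values of the indicator into the fitted equation; the function name `predict_interest_rate` is our own choice for illustration, and only the two coefficients come from the text.

```python
# Coefficients reported in the text for the one-predictor model.
INTERCEPT = 12.34
SLOPE_BANKRUPTCY = 0.74


def predict_interest_rate(bankruptcy):
    """Predicted interest rate (in %) for an indicator value of 0 or 1."""
    if bankruptcy not in (0, 1):
        raise ValueError("bankruptcy is an indicator and must be 0 or 1")
    return INTERCEPT + SLOPE_BANKRUPTCY * bankruptcy


no_bankruptcy = predict_interest_rate(0)
past_bankruptcy = predict_interest_rate(1)
print(f"no past bankruptcy: {no_bankruptcy:.2f}%")
print(f"past bankruptcy:    {past_bankruptcy:.2f}%")
# The gap between the two groups is exactly the slope of the indicator.
print(f"difference:         {past_bankruptcy - no_bankruptcy:.2f}")
```

Because the indicator only takes the values 0 and 1, the intercept is the prediction for borrowers without a bankruptcy and the slope is the difference between the two groups.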
Suppose we had fit a model using a 3-level categorical variable, such as verified_income. The
output from software is shown in Table 8.5. This regression output provides multiple rows for the
variable. Each row represents the relative difference for each level of verified_income. However, we
are missing one of the levels: Not Verified. The missing level is called the reference level and it
represents the default level that other levels are measured against.
Table 8.5: Summary of a linear model for predicting interest_rate from the borrower’s income source and
amount verification. This predictor has three levels, which results in 2 rows in the regression output.
EXAMPLE
How would we write an equation for this regression model?
The equation for the regression model may be written as a model with two predictors:
$$
\begin{aligned}
\widehat{\texttt{interest\_rate}} = 11.10 &+ 1.42 \times \texttt{verified\_income}_{\texttt{Source Verified}} \\
&+ 3.25 \times \texttt{verified\_income}_{\texttt{Verified}}
\end{aligned}
$$
We use the notation $\texttt{variable}_{\texttt{level}}$ to represent indicator variables for when the categorical
variable takes a particular value. For example, $\texttt{verified\_income}_{\texttt{Source Verified}}$ would take a
value of 1 if it was for a borrower that was source verified, and it would take a value of 0
otherwise. Likewise, $\texttt{verified\_income}_{\texttt{Verified}}$ would take a value of 1 if it was for a borrower
that was verified, and 0 if it took any other value.
The notation $\texttt{variable}_{\texttt{level}}$ may feel a bit confusing. Let’s figure out how to use the equation for each
level of the verified_income variable.
EXAMPLE
Using the model for predicting interest rate from income verification type, compute the average
interest rate for borrowers whose income source and amount are both unverified.
When verified_income takes a value of Not Verified, then both indicator functions in the
equation for the linear model are set to 0:
$$\widehat{\texttt{interest\_rate}} = 11.10 + 1.42 \times 0 + 3.25 \times 0 = 11.10$$
The average interest rate for these borrowers is 11.1%. Because the level does not have its own
coefficient and it is the reference value, the indicators for the other levels for this variable all
drop out.
8.1. INDICATOR AND CATEGORICAL PREDICTORS 139
EXAMPLE
Using the model for predicting interest rate from income verification type, compute the average
interest rate for borrowers whose income is source verified.
When verified_income takes a value of Source Verified, then the corresponding variable
takes a value of 1 while the other is 0:
$$\widehat{\texttt{interest\_rate}} = 11.10 + 1.42 \times 1 + 3.25 \times 0 = 12.52$$
The average interest rate for these borrowers is 12.52%.
GUIDED PRACTICE
Compute the average interest rate for borrowers whose income source and amount are
both verified.1
When fitting a regression model with a categorical variable that has 𝑘 levels where
𝑘 > 2, software will provide a coefficient for 𝑘 − 1 of those levels. The level that
does not receive a coefficient is the reference level, and the coefficients listed for
the other levels are all considered relative to this reference level.
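To make the role of the reference level concrete, here is a small sketch of how a three-level categorical predictor can be turned into 𝑘 − 1 = 2 indicator variables and then used for prediction. The coefficients are the ones reported for the verified_income model; the helper names `indicators` and `predict_interest_rate` are hypothetical and only serve this illustration.

```python
INTERCEPT = 11.10
# One coefficient for each non-reference level of verified_income.
LEVEL_COEFFICIENTS = {
    "Source Verified": 1.42,
    "Verified": 3.25,
}
REFERENCE_LEVEL = "Not Verified"  # receives no coefficient of its own


def indicators(level):
    """Return the k - 1 indicator values (0 or 1) for a given level."""
    if level != REFERENCE_LEVEL and level not in LEVEL_COEFFICIENTS:
        raise ValueError(f"unknown level: {level!r}")
    return {name: int(level == name) for name in LEVEL_COEFFICIENTS}


def predict_interest_rate(level):
    """Intercept plus each coefficient times its indicator."""
    dummy = indicators(level)
    return INTERCEPT + sum(
        LEVEL_COEFFICIENTS[name] * dummy[name] for name in LEVEL_COEFFICIENTS
    )


for level in (REFERENCE_LEVEL, "Source Verified", "Verified"):
    print(level, indicators(level), round(predict_interest_rate(level), 2))
```

For the reference level both indicators are 0, so the prediction is just the intercept. Every other prediction is the intercept plus that level's coefficient, which is why each coefficient is read as a difference from the reference level.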
GUIDED PRACTICE
Interpret the coefficients from the model above.2
The higher interest rate for borrowers who have verified their income source or amount is surprising.
Intuitively, we would think that a loan would look less risky if the borrower’s income has been verified.
However, note that the situation may be more complex, and there may be confounding variables that
we didn’t account for. For example, perhaps lenders require borrowers with poor credit to verify
their income. That is, verifying income in our dataset might be a signal of some concerns about the
borrower rather than a reassurance that the borrower will pay back the loan. For this reason, the
borrower could be deemed higher risk, resulting in a higher interest rate. (What other confounding
variables might explain this counter-intuitive relationship suggested by the model?)
GUIDED PRACTICE
How much larger of an interest rate would we expect for a borrower who has verified
their income source and amount vs a borrower whose income source has only been
verified?3
1 When verified_income takes a value of Verified, then the corresponding variable takes a value of 1 while the
other is 0: 11.10 + 1.42 × 0 + 3.25 × 1 = 14.35. The average interest rate for these borrowers is 14.35%.
2 Each of the coefficients gives the incremental interest rate for the corresponding level relative to the Not Verified
level, which is the reference level. For example, for a borrower whose income source and amount have been verified, the
model predicts that they will have a 3.25% higher interest rate than a borrower who has not had their income source
or amount verified.
3 Relative to the Not Verified category, the Verified category has an interest rate of 3.25% higher, while the Source
Verified category is only 1.42% higher. Thus, Verified borrowers will tend to get an interest rate about
3.25% − 1.42% = 1.83% higher than Source Verified borrowers.
140 CHAPTER 8. LINEAR REGRESSION WITH MULTIPLE PREDICTORS
The world is complex, and it can be helpful to consider many factors at once in statistical modeling.
For example, we might like to use the full context of borrowers to predict the interest rate they receive
rather than using a single variable. This is the strategy used in multiple regression. While we
remain cautious about making any causal interpretations using multiple regression on observational
data, such models are a common first step in gaining insights or providing some evidence of a causal
connection.
We want to construct a model that accounts not only for any past bankruptcy or whether the borrower
had their income source or amount verified, but simultaneously accounts for all the variables in the
loans dataset: verified_income, debt_to_income, credit_util, bankruptcy, term, issue_month,
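and credit_checks.
$$
\begin{aligned}
\widehat{\texttt{interest\_rate}} = b_0 &+ b_1 \times \texttt{verified\_income}_{\texttt{Source Verified}} + b_2 \times \texttt{verified\_income}_{\texttt{Verified}} \\
&+ b_3 \times \texttt{debt\_to\_income} + b_4 \times \texttt{credit\_util} \\
&+ b_5 \times \texttt{bankruptcy} + b_6 \times \texttt{term} \\
&+ b_7 \times \texttt{credit\_checks} + b_8 \times \texttt{issue\_month}_{\texttt{Jan-2018}} \\
&+ b_9 \times \texttt{issue\_month}_{\texttt{Mar-2018}}
\end{aligned}
$$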
This equation represents a holistic approach for modeling all of the variables simultaneously. Notice
that there are two coefficients for verified_income and two coefficients for issue_month, since both
are 3-level categorical variables.
We calculate 𝑏0 , 𝑏1 , 𝑏2 , ⋯, 𝑏9 the same way as we did in the case of a model with a single predictor –
we select values that minimize the sum of the squared residuals:
$$SSE = e_1^2 + e_2^2 + \cdots + e_{10000}^2 = \sum_{i=1}^{10000} e_i^2 = \sum_{i=1}^{10000} (y_i - \hat{y}_i)^2$$
where 𝑦𝑖 and 𝑦𝑖̂ represent the observed interest rates and their estimated values according to the
model, respectively. 10,000 residuals are calculated, one for each observation. Note that these values
are sample statistics and in the case where the observed data is a random sample from a target
population that we are interested in making inferences about, they are estimates of the population
parameters 𝛽0 , 𝛽1 , 𝛽2 , ⋯, 𝛽9 . We will discuss inference based on linear models in Chapter 25; for now,
we will focus on calculating sample statistics 𝑏𝑖 .
We typically use a computer to minimize the sum of squares and compute point estimates, as shown in
the sample output in Table 8.6. Using this output, we identify 𝑏𝑖 , just as we did in the one-predictor
case.
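To give a sense of what the computer does, the sketch below finds the coefficients that minimize the sum of squared residuals for a tiny made-up dataset with two predictors. It solves the so-called normal equations with plain Gaussian elimination. This is only one of several ways to carry out the minimization and is not necessarily how statistical software does it; the data and all function names are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.

    Assumes the system has a unique solution (predictors not perfectly collinear).
    """
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for j in range(col, n + 1):
                m[row][j] -= factor * m[col][j]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(m[row][j] * x[j] for j in range(row + 1, n))
        x[row] = (m[row][n] - known) / m[row][row]
    return x


def least_squares(rows, y):
    """Return [b0, b1, ..., bk] minimizing the sum of squared residuals."""
    design = [[1.0] + list(r) for r in rows]  # leading 1 for the intercept
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)


def sse(coefs, rows, y):
    """Sum of squared residuals: the sum of (y_i - y_hat_i)^2."""
    total = 0.0
    for r, yi in zip(rows, y):
        y_hat = coefs[0] + sum(b * x for b, x in zip(coefs[1:], r))
        total += (yi - y_hat) ** 2
    return total


# Made-up data: five observations, two predictors (not from the loans data).
rows = [(0, 0), (1, 1), (2, 1), (3, 2), (4, 4)]
y = [1.2, 5.9, 8.1, 12.8, 21.0]

fit = least_squares(rows, y)
nudged = [fit[0], fit[1] + 0.1, fit[2]]  # move one coefficient slightly
print("coefficients:", [round(b, 3) for b in fit])
print("SSE at the fit:   ", round(sse(fit, rows, y), 4))
print("SSE after a nudge:", round(sse(nudged, rows, y), 4))
```

Changing any coefficient away from the fitted value increases the sum of squared residuals, which is what it means for the fitted values to minimize it.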
Table 8.6: Output for the regression model, where interest rate is the outcome and the variables listed are
the predictors. Degrees of freedom for this model is 9990.
In general, a multiple regression model with 𝑘 predictors is written as
$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k$$
EXAMPLE
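Write out the regression model using the regression output from Table 8.6. How many predictors
are there in this model?
$$
\begin{aligned}
\widehat{\texttt{interest\_rate}} = 1.89 &+ 1.00 \times \texttt{verified\_income}_{\texttt{Source Verified}} + 2.56 \times \texttt{verified\_income}_{\texttt{Verified}} \\
&+ 0.02 \times \texttt{debt\_to\_income} + 4.90 \times \texttt{credit\_util} \\
&+ 0.39 \times \texttt{bankruptcy} + 0.15 \times \texttt{term} \\
&+ 0.23 \times \texttt{credit\_checks} + 0.05 \times \texttt{issue\_month}_{\texttt{Jan-2018}} \\
&- 0.04 \times \texttt{issue\_month}_{\texttt{Mar-2018}}
\end{aligned}
$$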
If we count up the number of predictor coefficients, we get the effective number of predictors
in the model; there are nine of those. Notice that each categorical predictor counts as two,
once for each of its two levels shown in the model. In general, a categorical predictor with 𝑝
different levels will be represented by 𝑝 − 1 terms in a multiple regression model. A total of
seven variables were used as predictors to fit this model: verified_income, debt_to_income,
credit_util, bankruptcy, term, credit_checks, and issue_month.
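The fitted equation can also be written as a short function. The sketch below uses the coefficients from the equation above and applies them to a hypothetical borrower whose characteristics and observed rate we made up for illustration; they are not a row of the loans data. We also assume that the reference level of issue_month is called Feb-2018, which the text does not state.

```python
# Coefficients from the fitted full model written out above.
COEFFICIENTS = {
    "intercept": 1.89,
    "verified_income:Source Verified": 1.00,
    "verified_income:Verified": 2.56,
    "debt_to_income": 0.02,
    "credit_util": 4.90,
    "bankruptcy": 0.39,
    "term": 0.15,
    "credit_checks": 0.23,
    "issue_month:Jan-2018": 0.05,
    "issue_month:Mar-2018": -0.04,
}
NUMERIC = ("debt_to_income", "credit_util", "bankruptcy", "term", "credit_checks")


def predict_interest_rate(borrower):
    """Plug one borrower's values into the fitted equation."""
    b = COEFFICIENTS
    rate = b["intercept"]
    # Categorical predictors: add the coefficient of the matching level.
    # A level without a coefficient is treated as the reference level (adds 0).
    rate += b.get("verified_income:" + borrower["verified_income"], 0.0)
    rate += b.get("issue_month:" + borrower["issue_month"], 0.0)
    for name in NUMERIC:
        rate += b[name] * borrower[name]
    return rate


# Hypothetical borrower, made up for illustration.
borrower = {
    "verified_income": "Verified",
    "issue_month": "Feb-2018",  # assumed name of the reference level
    "debt_to_income": 20,
    "credit_util": 0.5,
    "bankruptcy": 0,
    "term": 36,
    "credit_checks": 2,
}
observed_rate = 12.00  # hypothetical observed interest rate

predicted = predict_interest_rate(borrower)
residual = observed_rate - predicted  # residual = observed - predicted
print(f"predicted: {predicted:.2f}, residual: {residual:.2f}")
print("number of predictor terms:", len(COEFFICIENTS) - 1)
```

Counting the entries other than the intercept gives the nine predictor terms discussed above, even though only seven variables were used.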
GUIDED PRACTICE
Interpret the coefficient of the variable credit_checks.4
GUIDED PRACTICE
Compute the residual of the first observation in Table 8.1 using the full model.5
4 All else held constant, for each additional inquiry into the applicant’s credit during the last 12 months, we would
expect the interest rate for the loan to be higher, on average, by 0.23 points.
5 To compute the residual, we first need the predicted value, which we compute by plugging values into the equation
from earlier. For example, verified_incomeSource Verified takes a value of 0, verified_incomeVerified takes a value of
1 (since the borrower’s income source and amount were verified), debt_to_income was 18.01, and so on. This leads
to a prediction of $\widehat{\texttt{interest\_rate}}_1 = 17.84$. The observed interest rate was 14.07%, which leads to a residual of
$e_1 = 14.07 - 17.84 = -3.77$.
EXAMPLE
We calculated a slope coefficient of 0.74 for bankruptcy in Section 8.1 while the coefficient is
0.39 here. Why is there a difference between the coefficient values between the models with
single and multiple predictors?
If we examined the data carefully, we would see that some predictors are correlated. For
instance, when we modeled the relationship of the outcome interest_rate and predictor
bankruptcy using linear regression, we were unable to control for other variables like whether
the borrower had their income verified, the borrower’s debt-to-income ratio, and other variables.
That original model was constructed in a vacuum and did not consider the full context of
everything that is considered when an interest rate is decided. When we include all of the
variables, underlying and unintentional bias that was missed by not including these other
variables is reduced or eliminated. Of course, bias can still exist from other confounding
variables.
The previous example describes a common issue in multiple regression: correlation among predictor
variables. We say two predictor variables are collinear (pronounced as co-linear) when they
are correlated, and this multicollinearity complicates model estimation. While it is impossible
to prevent multicollinearity from arising in observational data, experiments are usually designed to
prevent predictors from being multicollinear.
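A small made-up example can show why a coefficient changes when a correlated predictor is left out. In the sketch below the outcome is built exactly as 1 + 2 × x1 + 3 × x2, so a model that includes both predictors would recover a coefficient of 2 for x1. When x1 is used alone, its slope also absorbs part of the effect of x2, because the two predictors rise together. All values and names here are hypothetical.

```python
def simple_slope(x, y):
    """Least-squares slope of y on a single predictor x."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx


# Made-up predictors; x2 tends to rise with x1, so they are correlated.
x1 = [0, 1, 2, 3]
x2 = [0, 1, 1, 2]

# Outcome built exactly from both predictors.
B1, B2 = 2.0, 3.0
y = [1 + B1 * a + B2 * b for a, b in zip(x1, x2)]

slope_alone = simple_slope(x1, y)       # x1 as the only predictor
slope_x2_on_x1 = simple_slope(x1, x2)   # how x2 moves with x1

print("coefficient of x1 with x2 in the model:", B1)
print("slope of x1 when x2 is left out:       ", round(slope_alone, 3))
print("B1 + B2 * (slope of x2 on x1):         ", round(B1 + B2 * slope_x2_on_x1, 3))
```

The last two printed values agree: the single-predictor slope equals the coefficient of x1 plus the coefficient of x2 times the slope of x2 on x1. If the two predictors were uncorrelated, that extra term would vanish and the two models would give the same coefficient for x1.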
GUIDED PRACTICE
The estimated value of the intercept is 1.89, and one might be tempted to make some
interpretation of this coefficient, such as, it is the model’s predicted interest rate when
each of the variables take value zero: income source is not verified, the borrower has no
debt (debt-to-income and credit utilization are zero), and so on. Is this reasonable? Is
there any value gained by making this interpretation?6
We first used 𝑅2 in Section 7.2.5 to determine the amount of variability in the response that was
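explained by the model:
$$R^2 = 1 - \frac{\text{variability in residuals}}{\text{variability in the outcome}} = 1 - \frac{Var(e_i)}{Var(y_i)}$$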
where 𝑒𝑖 represents the residuals of the model and 𝑦𝑖 the outcomes. This equation remains valid in
the multiple regression framework, but a small enhancement can make it even more informative when
comparing models.
GUIDED PRACTICE
The variance of the residuals for the model given in the earlier Guided Practice is 18.53,
and the variance of the interest rate across all the loans is 25.01. Calculate 𝑅2 for this
model.7
This strategy for estimating 𝑅2 works when there is a single predictor. However, it becomes less
helpful when there are many variables. The regular 𝑅2 is a biased estimate of the amount of variability
explained by the model when applied to a model with more than one predictor. To get a better estimate,
we use the adjusted 𝑅2 .
6 Many of the variables do take a value 0 for at least one data point, and for those variables, it is reasonable. However,
one variable never takes a value of zero: term, which describes the length of the loan, in months. If term is set to zero,
then the loan must be paid back immediately; the borrower must give the money back as soon as they receive it, which
means it is not a real loan. Ultimately, the interpretation of the intercept in this setting is not insightful.
7 $R^2 = 1 - \frac{18.53}{25.01} = 0.2591$.
8.4. MODEL SELECTION 143
The adjusted 𝑅2 is computed as
$$R^2_{adj} = 1 - \frac{Var(e_i)}{Var(y_i)} \times \frac{n - 1}{n - k - 1}$$
where 𝑛 is the number of observations used to fit the model and 𝑘 is the number of
predictor variables in the model. Remember that a categorical predictor with 𝑝 levels
will contribute 𝑝 − 1 to the number of variables in the model.
Because 𝑘 is never negative, the adjusted 𝑅2 will be smaller (often just a little smaller) than
the unadjusted 𝑅2 . The reasoning behind the adjusted 𝑅2 lies in the degrees of freedom associated
with each variance: 𝑛 − 𝑘 − 1 for the residuals in the multiple regression context and 𝑛 − 1 for the outcome. If we were to make
predictions for new data using our current model, we would find that the unadjusted 𝑅2 would tend
to be slightly overly optimistic, while the adjusted 𝑅2 formula helps correct this bias.
GUIDED PRACTICE
There were n = 10,000 loans in the dataset and 𝑘 = 9 predictor variables in the model.
Use 𝑛, 𝑘, and the variances from the earlier Guided Practice to calculate adjusted 𝑅2
for the interest rate model.8
GUIDED PRACTICE
Suppose you added another predictor to the model, but the variance of the errors
𝑉 𝑎𝑟(𝑒𝑖 ) didn’t go down. What would happen to the 𝑅2 ? What would happen to the
adjusted 𝑅2 ?9
Adjusted 𝑅2 could also have been used in Chapter 7 where we introduced regression models with a
single predictor. However, when there is only 𝑘 = 1 predictor, adjusted 𝑅2 is very close to regular
𝑅2 , so this nuance isn’t typically important when the model has only one predictor.
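The calculations in this section are short enough to script. The sketch below computes 𝑅2 and adjusted 𝑅2 from the two variances used in the Guided Practice exercises, following the form of the calculations in the footnotes. The function names are our own, and the runs with 𝑘 = 10 and 𝑘 = 1 reuse the same variances purely as hypothetical comparisons.

```python
def r_squared(var_residuals, var_outcome):
    """Unadjusted R^2: one minus the share of variance left in the residuals."""
    return 1 - var_residuals / var_outcome


def adjusted_r_squared(var_residuals, var_outcome, n, k):
    """Adjusted R^2 for n observations and k predictor terms."""
    return 1 - (var_residuals / var_outcome) * (n - 1) / (n - k - 1)


VAR_RESIDUALS = 18.53  # variance of the residuals, from the text
VAR_OUTCOME = 25.01    # variance of the interest rates, from the text
N = 10_000

r2 = r_squared(VAR_RESIDUALS, VAR_OUTCOME)
r2_adj = adjusted_r_squared(VAR_RESIDUALS, VAR_OUTCOME, n=N, k=9)
print(f"R^2          = {r2:.4f}")
print(f"adjusted R^2 = {r2_adj:.4f}")

# Hypothetical: one more predictor that leaves the residual variance unchanged.
r2_adj_extra = adjusted_r_squared(VAR_RESIDUALS, VAR_OUTCOME, n=N, k=10)
print("adjusted R^2 goes down:", r2_adj_extra < r2_adj)

# Hypothetical: with a single predictor the adjustment is tiny.
r2_adj_single = adjusted_r_squared(VAR_RESIDUALS, VAR_OUTCOME, n=N, k=1)
print(f"gap with k = 1: {r2 - r2_adj_single:.6f}")
```

The penalty factor (𝑛 − 1)/(𝑛 − 𝑘 − 1) grows with 𝑘, so an added predictor only raises adjusted 𝑅2 if it reduces the residual variance by enough to offset the penalty.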
The best model is not always the most complicated. Sometimes including predictors that are not
evidently important can actually reduce the accuracy of predictions. In this section, we discuss model
selection strategies, which will help us eliminate predictors from the model that are found to be less
important. It’s common (and hip, at least in the statistical world) to refer to models that have
undergone such predictor pruning as parsimonious.
In practice, the model that includes all available predictors is often referred to as the full model.
The full model may not be the best model, and if it isn’t, we want to identify a smaller model that is
preferable.
8 $R^2_{adj} = 1 - \frac{18.53}{25.01} \times \frac{10000 - 1}{10000 - 9 - 1} = 0.2584$. While the difference is very small, it will be important when we fine tune
the model in the next section.
9 The unadjusted 𝑅2 would stay the same and the adjusted 𝑅2 would go down.
Backward elimination starts with the full model – the model that includes all potential predictor
variables. Predictors are eliminated one-at-a-time from the model until we cannot improve the model
any further.
Forward selection is the reverse of the backward elimination technique. Instead of eliminating
predictors one-at-a-time, we add predictors one-at-a-time until we cannot find any predictors that
improve the model any further.
An important consideration in implementing either of these stepwise selection strategies is the criterion
used to decide whether to eliminate or add a predictor. One commonly used decision criterion is
adjusted 𝑅2 . When using adjusted 𝑅2 as the decision criterion, we seek to eliminate or add predictors
depending on whether they lead to the largest improvement in adjusted 𝑅2 and we stop when adding
or eliminating another predictor does not lead to further improvement in adjusted 𝑅2 .
Adjusted 𝑅2 describes the strength of a model fit, and it is a useful tool for evaluating which predictors
are adding value to the model, where adding value means they are (likely) improving the accuracy in
predicting future outcomes.
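The backward elimination procedure described above can be sketched as a short loop. The function below takes a scoring function that returns the adjusted 𝑅2 for any set of predictors. In practice that score would come from refitting the regression model each time; here we substitute a made-up scoring rule so the example is self-contained. The predictor called `noise` and all of the numbers are hypothetical and are not results from the loans data.

```python
def backward_elimination(predictors, score):
    """Drop one predictor at a time while the score keeps improving.

    score: function mapping a frozenset of predictor names to adjusted R^2.
    """
    current = frozenset(predictors)
    best = score(current)
    while current:
        # Score every model that drops exactly one remaining predictor.
        candidates = {
            current - {name}: score(current - {name}) for name in sorted(current)
        }
        challenger = max(candidates, key=candidates.get)
        if candidates[challenger] <= best:
            break  # no single removal improves the score, so stop
        current, best = challenger, candidates[challenger]
    return current, best


# Made-up stand-in for "fit the model and report adjusted R^2":
# each predictor adds a fixed amount, minus a small penalty per predictor.
TOY_VALUE = {"term": 0.13, "credit_util": 0.07, "noise": 0.0005}
PENALTY = 0.001


def toy_score(model):
    return sum(TOY_VALUE[name] for name in sorted(model)) - PENALTY * len(model)


selected, best_score = backward_elimination(TOY_VALUE, toy_score)
print(sorted(selected), round(best_score, 4))
```

In this toy setting the procedure drops `noise` in the first step, since removing it raises the score, and then stops because dropping either remaining predictor would lower the score.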
Let’s consider two models, which are shown in Table 8.7 and Table 8.8. The first table summarizes
the full model since it includes all predictors, while the second does not include the issue_month
variable.
Table 8.7: The fit for the full regression model, including the adjusted 𝑅2 .
Table 8.8: The fit for the regression model after dropping issue month, including the adjusted 𝑅2 .
EXAMPLE
Which of the two models is better?
We compare the adjusted 𝑅2 of each model to determine which to choose. Since the second
model has a higher adjusted 𝑅2 compared to the first model, we prefer the second model to
the first. We cannot know for sure, but based on the adjusted 𝑅2 , this is our best assessment.
EXAMPLE
Results corresponding to the full model for the loans data are shown in Table 8.7. How should
we proceed under the backward elimination strategy?
Our baseline adjusted 𝑅2 from the full model is 0.2597, and we need to determine whether
dropping a predictor will improve the adjusted 𝑅2 . To check, we fit models that each drop a
different predictor, and we record the adjusted 𝑅2 :
The model without issue_month has the highest adjusted 𝑅2 of 0.2598, higher than the ad-
justed 𝑅2 for the full model; therefore, we drop issue_month from the model.
Since we eliminated a predictor from the model in the first step, we see whether we should
eliminate any additional predictors. Our baseline adjusted 𝑅2 is now $R^2_{adj} = 0.2598$. We now
fit new models, which consider eliminating each of the remaining predictors in addition to
issue_month:
EXAMPLE
Construct a model for predicting interest_rate from the loans data using forward selection.
We start with the model that includes no predictors. Then we fit each of the possible models
with just one predictor. Then we examine the adjusted 𝑅2 for each of these models:
In this first step, we compare the adjusted 𝑅2 against a baseline model that has no predictors,
which always has $R^2_{adj} = 0$. The model with one predictor that has the largest adjusted 𝑅2 is
the model with the term predictor, so we will add this variable to our model.
We repeat the process again, this time considering 2-predictor models where one of the predictors
is term and with a new baseline of $R^2_{adj} = 0.12855$:
The model including credit_util has the largest increase in adjusted 𝑅2 (0.20046) from the
baseline (0.12855). Thus, we will also add credit_util to the model as a predictor.
Now we have a new baseline adjusted 𝑅2 of 0.20046. We can continue on and see whether it
would be beneficial to add a third predictor:
The model including verified_income has the largest increase in adjusted 𝑅2 (0.24183) from
the baseline (0.20046), so we add verified_income to the model as a predictor as well.
We continue in this way, next adding debt_to_income, then credit_checks, and bankruptcy.
At this point, we come again to the issue_month variable: adding this as a predictor leads to
$R^2_{adj} = 0.25843$, while keeping all the other predictors but excluding issue_month has a higher
$R^2_{adj} = 0.25854$. This means we do not add issue_month to the model as a predictor. In this
example, we have arrived at the same model that we identified with backward elimination.
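The forward selection steps in this example follow a simple loop, sketched below. As before, the scoring function stands in for refitting the model and reading off the adjusted 𝑅2. The made-up scoring rule, the predictor called `noise`, and all of the numbers are hypothetical and only serve to make the example runnable.

```python
def forward_selection(predictors, score):
    """Add one predictor at a time while the score keeps improving.

    score: function mapping a frozenset of predictor names to adjusted R^2.
    Returns the predictors in the order they were added, and the final score.
    """
    current = frozenset()
    best = score(current)  # baseline: the model with no predictors
    remaining = set(predictors)
    added = []
    while remaining:
        # Score every model that adds exactly one of the remaining predictors.
        candidates = {name: score(current | {name}) for name in sorted(remaining)}
        challenger = max(candidates, key=candidates.get)
        if candidates[challenger] <= best:
            break  # no single addition improves the score, so stop
        current = current | {challenger}
        best = candidates[challenger]
        remaining.remove(challenger)
        added.append(challenger)
    return added, best


# Made-up stand-in for "fit the model and report adjusted R^2":
# each predictor adds a fixed amount, minus a small penalty per predictor.
TOY_VALUE = {"term": 0.13, "credit_util": 0.07, "noise": 0.0005}
PENALTY = 0.001


def toy_score(model):
    return sum(TOY_VALUE[name] for name in sorted(model)) - PENALTY * len(model)


order, final_score = forward_selection(TOY_VALUE, toy_score)
print(order, round(final_score, 4))
```

With this toy scoring rule, forward selection adds `term` first, then `credit_util`, and stops before adding `noise`. Here the contribution of each predictor does not depend on the others, so forward selection and backward elimination reach the same model; with correlated predictors in real data they need not agree.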
Backward elimination begins with the model having the largest number of predictors
and eliminates predictors one-by-one until we are satisfied that all remaining predictors
are important to the model. Forward selection starts with no predictors included in the
model, then it adds in predictors according to their importance until no other important
predictors are found. Notice that, for both methods, we have always chosen to retain
the model with the largest adjusted 𝑅2 value, even if the difference is less than half a
percent (e.g., 0.2597 versus 0.2598). One could argue that the difference between these
two models is negligible, as they both explain nearly the same amount of variability
in the interest_rate. These negligible differences are an important aspect of model
selection. It is highly advised that before you begin the model selection process, you
decide what a “meaningful” difference in adjusted 𝑅2 is for the context of your data.
Maybe this difference is 1% or maybe it is 5%. This “threshold” is what you will then
use to decide if one model is “better” than another model. Using meaningful thresholds
in model selection requires more critical thinking about what the adjusted 𝑅2 values
mean.
Additionally, backward elimination and forward selection can arrive at different final
models, because the decision for whether to include a given predictor or not depends on
the other predictors that are already in the model. With forward selection, you start
with a model that includes no predictors, and add predictors one at a time. In backward
elimination, you start with a model that includes all of the potential predictors, and
remove predictors one at a time. How much a given predictor changes the percentage
of the variability in the outcome that is explained by the model depends on the other
predictors in the model, particularly if the predictor variables are correlated with each
other.
There is no “one size fits all” model selection strategy, which is why there are so many different
model selection methods. We hope you walk away from this exploration understanding how stepwise
selection is carried out and the considerations that should be made when using stepwise selection with
regression models.
Stepwise selection using adjusted 𝑅2 as the decision criterion is one of many commonly used model
selection strategies. Stepwise selection can also be carried out with decision criteria other than adjusted
𝑅2 , such as p-values, which you’ll learn about in Chapter 24 onward, or AIC (Akaike information
criterion) or BIC (Bayesian information criterion), which you might learn about in more advanced
courses.
Alternatively, one could choose to include or exclude predictors from a model based on expert opinion
or due to research focus. In fact, many statisticians discourage the use of stepwise regression alone
for model selection and advocate, instead, for a more thoughtful approach that carefully considers the
research focus and features of the data.
8.5.1 Summary
With real data, there is often a need to describe how multiple variables can be modeled together.
In this chapter, we have presented one approach using multiple linear regression. Each coefficient
represents how the model predicts the outcome might change with one unit increase of that predictor
given the rest of the predictor variables in the model. Working with and interpreting multivariable
models can be tricky, especially when the predictor variables show multicollinearity. There is often no
perfect or “right” final model; however, using the adjusted 𝑅2 value is one way to identify important
predictor variables for a final regression model. In later chapters we will generalize multiple linear
regression models to a larger population of interest from which the dataset was sampled.
8.5.2 Terms
The terms introduced in this chapter are presented in Table 8.9. If you’re not sure what some of these
terms mean, we recommend you go back in the text and review their definitions. You should be able
to easily spot them as bolded text.
8.6 Exercises
4. Arrival delays. Consider all of the flights out of New York City in 2013 that flew into Puerto
Rico (BQN) or San Francisco (SFO) on the following two airlines: JetBlue (B6) or United Air-
lines (UA). We consider the relationship between day of the year and arrival delay (in minutes).
Note that a negative arrival delay means that the flight arrived early. The figures display a least
squares regression line for arrival delay versus time (day of the year).10
a. Does it seem like there are differences in arrival delays across time when looking at BQN
and SFO airports combined (left figure)? Explain.
b. Does it seem like there are differences in arrival delays across time when looking at BQN
and SFO airports separately (right figure)? Explain.
c. Would it be more appropriate to display the combined plot or the plots which display the
airports separately? Explain.
5. Training for a 5K. Nico signs up for a 5K (a 5,000 metre running race) 30 days prior to
the race. They decide to run a 5K every day to train for it, and each day they record the
following information: days_since_start (number of days since starting training),
days_till_race (number of days left until the race), mood (poor, good, awesome), tiredness (1-not
tired to 10-very tired), and time (time it takes to run 5K, recorded as mm:ss). The top few rows of
the data they collect are shown below. Using these data, Nico wants to build a model predicting
time from the other variables. Should they include all variables shown above in their model?
Why or why not?
6. Multiple regression fact checking. Determine which of the following statements are true
and false. For each statement that is false, explain why it is false.
a. If predictors are collinear, then removing one variable will have no influence on the point
estimate of another variable’s coefficient.
b. Suppose a numerical variable 𝑥 has a coefficient of 𝑏1 = 2.5 in the multiple regression model.
Suppose also that the first observation has 𝑥1 = 7.2, the second observation has a value of
𝑥1 = 8.2, and these two observations have the same values for all other predictors. Then
the predicted value of the second observation will be 2.5 higher than the prediction of the
first observation based on the multiple regression model.
c. If a regression model’s first variable has a coefficient of 𝑏1 = 5.7, then if we are able to
influence the data so that an observation will have its 𝑥1 be 1 larger than it would otherwise,
the value 𝑦1 for this observation would increase by 5.7.
10 The flights data used in this exercise can be found in the nycflights13 R package.
7. Baby weights and smoking. US Department of Health and Human Services, Centers for
Disease Control and Prevention collect information on births recorded in the country. The data
used here are a random sample of 1,000 births from 2014. Here, we study the relationship
between smoking and weight of the baby. The variable smoke is coded 1 if the mother is a
smoker, and 0 if not. The summary table below shows the results of a linear regression model
for predicting the average birth weight of babies, measured in pounds, based on the smoking
status of the mother.11 (ICPSR 2014)
8. Baby weights and mature moms. The following is a model for predicting baby weight from
whether the mom is classified as a mature mom (35 years or older at the time of pregnancy).
(ICPSR 2014)
9. Movie returns, prediction. A model was fit to predict return-on-investment (ROI) on movies
based on release year and genre (Adventure, Action, Drama, Horror, and Comedy). The model
output is shown below. (FiveThirtyEight 2015)
a. For a given release year, which genre of movies is predicted, on average, to have the
highest return on investment?
b. The adjusted 𝑅2 of this model is 10.71%. Adding the production budget of the movie to
the model increases the adjusted 𝑅2 to 10.84%. Should production budget be added to the
model?
11 The births14 data used in this exercise can be found in the openintro R package.
10. Movie returns by genre. A model was fit to predict return-on-investment (ROI) on movies
based on release year and genre (Adventure, Action, Drama, Horror, and Comedy). The plots
below show the predicted ROI vs. actual ROI for each of the genres separately. Do these
figures support the comment in the FiveThirtyEight.com article that states, “The return-on-
investment potential for horror movies is absurd.” Note that the x-axis range varies for each
plot. (FiveThirtyEight 2015)
11. Predicting baby weights. A more realistic approach to modeling baby weights is to consider
all possibly related variables at once. Other variables of interest include length of pregnancy in
weeks (weeks), mother’s age in years (mage), the sex of the baby (sex), smoking status of the
mother (habit), and the number of hospital visits during pregnancy (visits). Below are three
observations from this dataset.
The summary table below shows the results of a regression model for predicting the average
birth weight of babies based on all of the variables presented above.
a. Write the equation of the regression model that includes all of the variables.
b. Interpret the slopes of weeks and habit in this context.
c. If we fit a model predicting baby weight from only habit (whether the mom smokes), we
observe a difference in the slope coefficient for habit in this small model and the slope
coefficient for habit in the larger model. Why might there be a difference?
d. Calculate the residual for the first observation in the dataset.
12. Palmer penguins, predicting body mass. Researchers studying a community of Antarctic
penguins collected body measurement (bill length, bill depth, and flipper length measured in
millimeters and body mass, measured in grams), species (Adelie, Chinstrap, or Gentoo), and sex
(female or male) data on 344 penguins living on three islands (Torgersen, Biscoe, and Dream) in
the Palmer Archipelago, Antarctica.12 The summary table below shows the results of a linear
regression model for predicting body mass (which is more difficult to measure) from the other
variables in the dataset. (Gorman, Williams, and Fraser 2014a)
13. Baby weights, backwards elimination. Let’s consider a model that predicts weight of
newborns using several predictors: whether the mother is considered mature, number of weeks
of gestation, number of hospital visits during pregnancy, weight gained by the mother during
pregnancy, sex of the baby, and whether the mother smoked cigarettes during pregnancy (habit).
(ICPSR 2014)
The adjusted 𝑅2 of the full model is 0.326. We remove each variable one by one, refit the model,
and record the adjusted 𝑅2 . Which, if any, variable should be removed from the model?
• Drop mature: 0.321
• Drop weeks: 0.061
• Drop visits: 0.326
• Drop gained: 0.327
• Drop sex: 0.301
12 The penguins data used in this exercise can be found in the palmerpenguins R package.
14. Palmer penguins, backwards elimination. The following full model is built to predict
the weights of three species (Adelie, Chinstrap, or Gentoo) of penguins living in the Palmer
Archipelago, Antarctica. (Gorman, Williams, and Fraser 2014a)
The adjusted 𝑅2 of the full model is 0.9. In order to evaluate whether any of the predictors can
be dropped from the model without losing predictive performance of the model, the researchers
dropped one variable at a time, refit the model, and recorded the adjusted 𝑅2 of the smaller
model. These values are given below.
• Drop bill_length_mm: 0.87
• Drop bill_depth_mm: 0.869
• Drop flipper_length_mm: 0.861
• Drop sex: 0.845
• Drop species: 0.821
Which, if any, variable should be removed from the model first?
15. Baby weights, forward selection. Using information on the mother and the sex of the baby
(which can be determined prior to birth), we want to build a model that predicts the birth weight
of babies. In order to do so, we will evaluate six candidate predictors: whether the mother is
considered mature, number of weeks of gestation, number of hospital visits during pregnancy,
weight gained by the mother during pregnancy, sex of the baby, and whether the mother smoked
cigarettes during pregnancy (habit). We will make a decision about including them in the
model using forward selection and adjusted 𝑅2 . Below are the six models we evaluate and their
adjusted 𝑅2 values. (ICPSR 2014)
• Predict weight from mature: 0.002
• Predict weight from weeks: 0.3
• Predict weight from visits: 0.034
• Predict weight from gained: 0.021
• Predict weight from sex: 0.018
• Predict weight from habit: 0.021
Which variable should be added to the model first?
16. Palmer penguins, forward selection. Using body measurement and other relevant data
on three species (Adelie, Chinstrap, or Gentoo) of penguins living in the Palmer Archipelago,
Antarctica, we want to predict their body mass. In order to do so, we will evaluate five candidate
predictors and make a decision about including them in the model using forward selection and
adjusted 𝑅2 . Below are the five models we evaluate and their adjusted 𝑅2 values:
• Predict body mass from bill_length_mm: 0.352
• Predict body mass from bill_depth_mm: 0.22
• Predict body mass from flipper_length_mm: 0.758
• Predict body mass from sex: 0.178
• Predict body mass from species: 0.668
Which variable should be added to the model first?
Chapter 9
Logistic regression
We will consider experiment data from a study that sought to understand the effect of race and sex
on job application callback rates (Bertrand and Mullainathan 2003). To evaluate which factors were
important, job postings were identified in Boston and Chicago for the study, and researchers created
many fake resumes to send off to these jobs to see which would elicit a callback.1 The researchers
enumerated important characteristics, such as years of experience and education details, and they used
these characteristics to randomly generate fake resumes. Finally, they randomly assigned a name to
each resume, where the name would imply the applicant’s sex and race.
The first names that were used and randomly assigned in the experiment were selected so that they
would predominantly be recognized as belonging to Black or White individuals; other races were
not considered in the study. While no name would definitively be inferred as pertaining to a Black
individual or to a White individual, the researchers conducted a survey to check for racial association
of the names; names that did not pass the survey check were excluded from usage in the experiment.
You can find the full set of names that did pass the survey test and were ultimately used in the study
in Table 9.1. For example, Lakisha was a name that their survey indicated would be interpreted as
a Black woman, while Greg was a name that would generally be interpreted to be associated with a
White male.
1 We did omit discussion of some structure in the data for the analysis presented: the experiment design included
blocking, where typically four resumes were sent to each job: one for each inferred race/sex combination (as inferred
based on the first name). We did not worry about the blocking aspect, since accounting for the blocking would reduce
the standard error without notably changing the point estimates for the race and sex variables versus the analysis
performed in the section. That is, the most interesting conclusions in the study are unaffected even when completing a
more sophisticated analysis.
Table 9.1: List of all 36 unique names along with the commonly inferred race and sex associated with these
names.
The response variable of interest is whether there was a callback from the employer for the applicant,
and there were 8 attributes that were randomly assigned that we’ll consider, with special interest in
the race and sex variables. Race and sex are protected classes in the United States, meaning they are
not legally permitted factors for hiring or employment decisions. The full set of attributes considered
is provided in Table 9.2.
Table 9.2: Descriptions of nine variables from the resume dataset. Many of the variables are indicator
variables, meaning they take the value 1 if the specified characteristic is present and 0 otherwise.
variable description
received_callback Specifies whether the employer called the applicant following
submission of the application for the job.
job_city City where the job was located: Boston or Chicago.
college_degree An indicator for whether the resume listed a college degree.
years_experience Number of years of experience listed on the resume.
honors Indicator for the resume listing some sort of honors, e.g. employee
of the month.
military Indicator for if the resume listed any military experience.
has_email_address Indicator for if the resume listed an email address for the applicant.
race Race of the applicant, implied by their first name listed on the
resume.
sex Sex of the applicant (limited to only man and woman), implied by
the first name listed on the resume.
All of the attributes listed on each resume were randomly assigned, which means that no attributes
that might be favorable or detrimental to employment would favor one demographic over another
on these resumes. Importantly, due to the experimental nature of the study, we can infer causation
between these variables and the callback rate, if substantial differences are found. Our analysis will
allow us to compare the practical importance of each of the variables relative to each other.
9.2. MODELLING THE PROBABILITY OF AN EVENT 157
The outcome variable for a GLM is denoted by 𝑌𝑖 , where the index 𝑖 is used to represent
observation 𝑖. In the resume application, 𝑌𝑖 will be used to represent whether resume 𝑖
received a callback (𝑌𝑖 = 1) or not (𝑌𝑖 = 0).
The predictor variables are represented as follows: 𝑥1,𝑖 is the value of variable 1 for observation 𝑖, 𝑥2,𝑖
is the value of variable 2 for observation 𝑖, and so on.
We want to choose a transformation in the equation that makes practical and mathematical sense.
For example, we want a transformation that makes the range of possibilities on the left hand side of
the equation equal to the range of possibilities for the right hand side; if there was no transformation
in the equation, the left hand side could only take values between 0 and 1, but the right hand side
could take values well outside of the range from 0 to 1.
A common transformation for 𝑝𝑖 is the logit transformation, which may be written as
$$\operatorname{logit}(p_i) = \log_e\left(\frac{p_i}{1 - p_i}\right)$$
The logit transformation is shown in Figure 9.1. Below, we rewrite the equation relating 𝑌𝑖 to its
predictors using the logit transformation of 𝑝𝑖 :
$$\log_e\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + \cdots + \beta_k x_{k,i}$$
In our resume example, there are 8 predictor variables, so 𝑘 = 8. While the precise choice of a logit
function isn’t intuitive, it is based on theory that underpins generalized linear models, which is beyond
the scope of this book. Fortunately, once we fit a model using software, it will start to feel like we are
back in the multiple regression context, even if the interpretation of the coefficients is more complex.
To convert from values on the logistic regression scale to the probability scale, we need to back
transform and then solve for 𝑝𝑖 :
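$$
\begin{aligned}
\log_e\left(\frac{p_i}{1 - p_i}\right) &= \beta_0 + \beta_1 x_{1,i} + \cdots + \beta_k x_{k,i} \\
\frac{p_i}{1 - p_i} &= e^{\beta_0 + \beta_1 x_{1,i} + \cdots + \beta_k x_{k,i}} \\
p_i &= (1 - p_i)\, e^{\beta_0 + \beta_1 x_{1,i} + \cdots + \beta_k x_{k,i}} \\
p_i &= e^{\beta_0 + \beta_1 x_{1,i} + \cdots + \beta_k x_{k,i}} - p_i \times e^{\beta_0 + \beta_1 x_{1,i} + \cdots + \beta_k x_{k,i}} \\
p_i + p_i\, e^{\beta_0 + \beta_1 x_{1,i} + \cdots + \beta_k x_{k,i}} &= e^{\beta_0 + \beta_1 x_{1,i} + \cdots + \beta_k x_{k,i}} \\
p_i \left(1 + e^{\beta_0 + \beta_1 x_{1,i} + \cdots + \beta_k x_{k,i}}\right) &= e^{\beta_0 + \beta_1 x_{1,i} + \cdots + \beta_k x_{k,i}} \\
p_i &= \frac{e^{\beta_0 + \beta_1 x_{1,i} + \cdots + \beta_k x_{k,i}}}{1 + e^{\beta_0 + \beta_1 x_{1,i} + \cdots + \beta_k x_{k,i}}}
\end{aligned}
$$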
As with most applied data problems, we substitute in the point estimates (the observed 𝑏𝑖 ) to calculate
relevant probabilities.
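The two directions of the transformation can be sketched in a few lines of Python. This is a minimal illustration rather than the software used for the analyses in this chapter; the function names `logit`, `inv_logit`, and `predicted_probability` are our own choices for this sketch.

```python
import math

def logit(p):
    """Map a probability in (0, 1) to the log-odds scale."""
    return math.log(p / (1 - p))

def inv_logit(eta):
    """Back-transform a log-odds value: e^eta / (1 + e^eta)."""
    return math.exp(eta) / (1 + math.exp(eta))

def predicted_probability(intercept, slopes, x):
    """Evaluate b0 + b1*x1 + ... + bk*xk, then back-transform to a probability."""
    eta = intercept + sum(b * xi for b, xi in zip(slopes, x))
    return inv_logit(eta)

# Round trip: transforming and then back-transforming recovers the probability.
p = 0.25
print(round(inv_logit(logit(p)), 6))
```

Whatever value the linear combination takes, `inv_logit` returns a number strictly between 0 and 1, which is the property that motivated using a transformation in the first place.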
EXAMPLE
We start by fitting a model with a single predictor: honors. This variable indicates whether
the applicant had any type of honors listed on their resume, such as employee of the month. A
logistic regression model was fit using statistical software and the following model was found:
$$\log_e\left(\frac{\hat{p}_i}{1 - \hat{p}_i}\right) = -2.4998 + 0.8668 \times \texttt{honors}$$
a. If a resume is randomly selected from the study and it does not have any honors listed,
what is the probability it resulted in a callback?
b. What would the probability be if the resume did list some honors?
a. If a randomly chosen resume from those sent out is considered, and it does not list honors,
then honors takes the value of 0 and the right side of the model equation equals -2.4998.
Solving for $p_i$: $\frac{e^{-2.4998}}{1 + e^{-2.4998}} = 0.076$. Just as we labeled a fitted value of $y_i$ with a “hat” in
single-variable and multiple regression, we do the same for this probability: $\hat{p}_i = 0.076$.
b. If the resume had listed some honors, then the right side of the model equation is
−2.4998 + 0.8668 × 1 = −1.6330, which corresponds to a probability 𝑝𝑖̂ = 0.163. Notice
that we could examine -2.4998 and -1.6330 in Figure 9.1 to estimate the probability
before formally calculating the value.
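The two probabilities in this example can be reproduced with a short calculation. The sketch below only plugs the reported point estimates into the back-transformation; the function name `callback_probability` is a label we introduce for illustration.

```python
import math

# Point estimates reported for the single-predictor model.
b0 = -2.4998
b_honors = 0.8668

def callback_probability(honors):
    """Estimated callback probability for honors = 0 or honors = 1."""
    eta = b0 + b_honors * honors          # value on the logit scale
    return math.exp(eta) / (1 + math.exp(eta))

p_no_honors = callback_probability(0)
p_honors = callback_probability(1)
print(round(p_no_honors, 3), round(p_honors, 3))
```

Rounded to three decimals, the results match the values 0.076 and 0.163 given in the example.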
While knowing whether a resume listed honors provides some signal when predicting whether the
employer would call, we would like to account for many different variables at once to understand how
each of the different resume characteristics affected the chance of a callback.
9.3. LOGISTIC MODEL WITH MANY VARIABLES 159
We used statistical software to fit the logistic regression model with all 8 predictors described in
Table 9.2. Like multiple regression, the result may be presented in a summary table, which is shown
in Table 9.3.
Table 9.3: Summary table for the full logistic regression model for the resume callback example.
Just like multiple regression, we could trim some variables from the model. Here we’ll use a statistic
called Akaike information criterion (AIC), which is analogous to how we used adjusted 𝑅2 in
multiple regression. AIC is a popular model selection method used in many disciplines, and is praised
for its emphasis on model uncertainty and parsimony. AIC selects a “best” model by ranking models
from best to worst according to their AIC values. In the calculation of a model’s AIC, a penalty is
given for including additional variables. The penalty for added model complexity attempts to strike
a balance between underfitting (too few variables in the model) and overfitting (too many variables
in the model). When using AIC for model selection, models with a lower AIC value are considered
to be “better.” Remember that when using adjusted 𝑅2 we select models with higher values instead.
It is important to note that AIC provides information about the quality of a model relative to other
models, but does not provide information about the overall quality of a model.
Table 9.4 provides the AIC and the number of observations used to fit the model. We also know from
Table 9.3 that eight variables (with nine coefficients, including the intercept) were fit.
Table 9.4: AIC for the full logistic regression model fit to the full resume callback example.
AIC number_observations
2677 4870
We will look for models with a lower AIC using a backward elimination strategy. Table 9.5 provides
the AIC values for the model with variables as given in Table 9.6. Notice that the same number of
observations are used, but one fewer variable (college_degree is dropped from the model).
Table 9.5: AIC for the logistic regression model fit to the resume callback example without college_degree.
AIC number_observations
2676 4870
After applying the AIC criterion, the variable college_degree is eliminated (the AIC value without
college_degree is smaller than the AIC value on the full model), giving the model summarized in
Table 9.6 with fewer variables, which is what we’ll rely on for the remainder of the section.
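One step of backward elimination with AIC amounts to a simple comparison, sketched below. The function `backward_elimination_step` is a hypothetical helper written for this illustration, and the dictionary holds only the one reduced-model AIC reported above; in practice it would hold one entry for each variable that could be dropped.

```python
def backward_elimination_step(full_aic, aic_without):
    """Return the variable whose removal gives the lowest AIC,
    or None if no removal improves on the full model."""
    candidate = min(aic_without, key=aic_without.get)
    if aic_without[candidate] < full_aic:
        return candidate
    return None

full_aic = 2677                          # Table 9.4
aic_without = {"college_degree": 2676}   # Table 9.5

print(backward_elimination_step(full_aic, aic_without))
```

Because lower AIC is better, the helper reports `college_degree` as the variable to drop. The step would then be repeated on the reduced model until no removal lowers the AIC further.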
Table 9.6: Summary table for the logistic regression model for the resume callback example, where variable
selection has been performed using AIC and college_degree has been dropped from the model.
EXAMPLE
The race variable had taken only two levels: Black and White. Based on the model results,
what does the coefficient of the race variable say about callback decisions?
The coefficient shown corresponds to the level of White, and it is positive. The positive
coefficient reflects a positive gain in callback rate for resumes where the candidate’s first name
implied they were White. The model results suggest that prospective employers favor resumes
where the first name is typically interpreted to be White.
The coefficient of raceWhite in the full model in Table 9.3 is nearly identical to that in the model shown in
Table 9.6. The predictors in the experiment were thoughtfully laid out so that the coefficient estimates
would typically not be much influenced by which other predictors were in the model, which aligned
with the motivation of the study to tease out which effects were important to getting a callback. In
most observational data, it’s common for point estimates to change a little, and sometimes a lot,
depending on which other variables are included in the model.
EXAMPLE
Use the model summarized in Table 9.6 to estimate the probability of receiving a callback for a
job in Chicago where the candidate lists 14 years experience, no honors, no military experience,
includes an email address, and has a first name that implies they are a White male.
We can start by writing out the equation using the coefficients from the model:
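$$
\begin{aligned}
\log_e\left(\frac{\hat{p}}{1 - \hat{p}}\right) = {} & -2.7162 - 0.4364 \times \texttt{job\_cityChicago} + 0.0206 \times \texttt{years\_experience} \\
& + 0.7634 \times \texttt{honors} - 0.3443 \times \texttt{military} + 0.2221 \times \texttt{email} \\
& + 0.4429 \times \texttt{raceWhite} - 0.1959 \times \texttt{sexman}
\end{aligned}
$$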
Now we can add in the corresponding values of each variable for the individual of interest:
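$$
\begin{aligned}
\log_e\left(\frac{\hat{p}}{1 - \hat{p}}\right) = {} & -2.7162 - 0.4364 \times 1 + 0.0206 \times 14 \\
& + 0.7634 \times 0 - 0.3443 \times 0 + 0.2221 \times 1 \\
& + 0.4429 \times 1 - 0.1959 \times 1 = -2.3955
\end{aligned}
$$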
We can now back-solve for $\hat{p}$: the chance such an individual will receive a callback is about
$\frac{e^{-2.3955}}{1 + e^{-2.3955}} = 0.0835$.
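The same calculation can be organized so that each coefficient is paired with the applicant's value for that variable. This is a sketch using the rounded coefficients printed in the equation above; the dictionary layout and variable names are our own. With these rounded coefficients the sum comes out very close to, but not exactly, the −2.3955 shown in the text, presumably because the text used unrounded estimates.

```python
import math

intercept = -2.7162
coefficients = {
    "job_cityChicago": -0.4364,
    "years_experience": 0.0206,
    "honors": 0.7634,
    "military": -0.3443,
    "email": 0.2221,
    "raceWhite": 0.4429,
    "sexman": -0.1959,
}

# Chicago job, 14 years of experience, no honors, no military experience,
# email address listed, first name implying a White male.
applicant = {
    "job_cityChicago": 1,
    "years_experience": 14,
    "honors": 0,
    "military": 0,
    "email": 1,
    "raceWhite": 1,
    "sexman": 1,
}

log_odds = intercept + sum(coefficients[name] * applicant[name] for name in coefficients)
p_hat = math.exp(log_odds) / (1 + math.exp(log_odds))
print(round(log_odds, 3), round(p_hat, 4))
```

Changing a single entry of `applicant` (for example, setting `raceWhite` to 0) and re-running the last lines shows how the estimated probability responds to one characteristic while the others are held fixed.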
9.4. GROUPS OF DIFFERENT SIZES 161
EXAMPLE
Compute the probability of a callback for an individual with a name commonly inferred to be
from a Black male but who otherwise has the same characteristics as the one described in the
previous example.
We can complete the same steps for an individual with the same characteristics who is Black,
where the only difference in the calculation is that the indicator variable raceWhite will take a
value of 0. Doing so yields a probability of 0.0553. Let’s compare the results with those of the
previous example.
In practical terms, an individual perceived as White based on their first name would need to
apply to $\frac{1}{0.0835} \approx 12$ jobs on average to receive a callback, while an individual perceived as
Black based on their first name would need to apply to $\frac{1}{0.0553} \approx 18$ jobs on average to receive a
callback. That is, applicants who are perceived as Black need to apply to 50% more employers
to receive a callback than someone who is perceived as White based on their first name for
jobs like those in the study.
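The comparison can be checked numerically. The sketch below starts from the log-odds value of the previous example and removes the raceWhite coefficient; treating 1/p as the average number of applications per callback assumes that applications are independent and share the same callback probability, which is an assumption of this illustration.

```python
import math

def inv_logit(eta):
    """Back-transform a log-odds value to a probability."""
    return math.exp(eta) / (1 + math.exp(eta))

log_odds_white = -2.3955        # from the previous example
race_white_coef = 0.4429        # coefficient of raceWhite

# For a Black applicant the indicator raceWhite is 0, so the term drops out.
log_odds_black = log_odds_white - race_white_coef

p_white = inv_logit(log_odds_white)
p_black = inv_logit(log_odds_black)

# Average number of applications needed per callback, assuming independence.
apps_white = 1 / p_white
apps_black = 1 / p_black

print(round(p_white, 4), round(p_black, 4))
print(round(apps_white), round(apps_black))
```

The ratio `apps_black / apps_white` is about 1.5, which is the "50% more employers" figure quoted above.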
What we have quantified in the current section is alarming and disturbing. However, one aspect that
makes the racism so difficult to address is that the experiment, as well-designed as it is, cannot send us
much signal about which employers are discriminating. It is only possible to say that discrimination is
happening, even if we cannot say which particular callbacks (or non-callbacks) represent discrimination.
Finding strong evidence of racism for individual cases is a persistent challenge in enforcing
anti-discrimination laws.
Any form of discrimination is concerning, which is why we decided it was so important to discuss the
topic using data. The resume study also only examined discrimination in a single aspect: whether a
prospective employer would call a candidate who submitted their resume. There was a 50% higher
barrier for resumes simply when the candidate had a first name that was perceived to be of a Black
individual. It’s unlikely that discrimination would stop there.
EXAMPLE
Let’s consider a sex-imbalanced company that consists of 20% women and 80% men, and we’ll
suppose that the company is very large, consisting of perhaps 20,000 employees. (A more
deliberate example would include more inclusive gender identities.) Suppose when someone
goes up for promotion at the company, 5 of their colleagues are randomly chosen to provide
feedback on their work.
Now let’s imagine that 10% of the people in the company are prejudiced against the other sex.
That is, 10% of men are prejudiced against women, and similarly, 10% of women are prejudiced
against men. Who is discriminated against more at the company, men or women?
Let’s suppose we took 100 men who have gone up for promotion in the past few years. For
these men, 5 × 100 = 500 random colleagues will be tapped for their feedback, of which about
20% will be women (100 women). Of these 100 women, 10 are expected to be biased against
the man they are reviewing. Then, of the 500 colleagues reviewing them, men will experience
discrimination by about 2% of their colleagues when they go up for promotion.
Let’s do a similar calculation for 100 women who have gone up for promotion in the last few
years. They will also have 500 random colleagues providing feedback, of which about 400
(80%) will be men. Of these 400 men, about 40 (10%) hold a bias against women. Of the 500
colleagues providing feedback on the promotion packet for these women, 8% of the colleagues
hold a bias against the women.
The example highlights something profound: even in a hypothetical setting where each demographic
has the same degree of prejudice against the other demographic, the smaller group experiences the
negative effects more frequently. Additionally, if we completed a handful of examples like the
one above with different numbers, we would learn that the greater the imbalance in the population
groups, the more the smaller group is disproportionately impacted.2
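To try the calculation with different numbers, the arithmetic of the example can be wrapped in a small function. This is a sketch of the hypothetical scenario only: it assumes reviewers are drawn at random from the whole company and that the same prejudice rate applies in both groups. The function name `share_of_biased_reviewers` is ours.

```python
def share_of_biased_reviewers(share_other_group, prejudice_rate):
    """Expected fraction of a person's reviewers who belong to the other
    group and are prejudiced against the person's group."""
    return share_other_group * prejudice_rate

women_share = 0.20      # proportion of women in the company
prejudice_rate = 0.10   # same rate assumed for both groups

against_men = share_of_biased_reviewers(women_share, prejudice_rate)
against_women = share_of_biased_reviewers(1 - women_share, prejudice_rate)

print(round(against_men, 2), round(against_women, 2))
print(round(against_women / against_men, 2))   # equals (1 - p) / p
```

With 20% women the ratio is 4. Re-running with `women_share = 0.10` gives a ratio of 9, in line with the observation that a greater imbalance affects the smaller group more.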
Of course, there are other considerable real-world omissions from the hypothetical example. For
example, studies have found instances where people from an oppressed group also discriminate against
others within their own oppressed group. As another example, there are also instances where a
majority group can be oppressed, with apartheid in South Africa being one such historic example.
Ultimately, discrimination is complex, and there are many factors at play beyond the mathematical
property we observed in the previous example.
We close the chapter on the serious topic of discrimination, and we hope it inspires you to think about
the power of reasoning with data. Whether it is with a formal statistical model or by using critical
thinking skills to structure a problem, we hope the ideas you have learned will help you do more and
do better in life.
9.5.1 Summary
Logistic and linear regression models have many similarities, the strongest of which is the linear
combination of the explanatory variables, which is used to form predictions related to the response
variable. However, with logistic regression, the response variable is binary, and therefore the prediction
is the probability of a successful event. Logistic model fit and variable selection can be carried
out in similar ways as multiple linear regression.
9.5.2 Terms
The terms introduced in this chapter are presented in Table 9.7. If you’re not sure what some of these
terms mean, we recommend you go back in the text and review their definitions. You should be able
to easily spot them as bolded text.
2 If a proportion 𝑝 of a company are women and the rest of the company consists of men, then under the hypothetical
situation the ratio of rates of discrimination against women versus men would be given by (1 − 𝑝)/𝑝, a ratio that is
always greater than 1 when 𝑝 < 0.5.
9.6 Exercises
2. Logistic regression fact checking. Determine which of the following statements are true and
which are false. For each statement that is false, explain why it is false.
a. Suppose we consider the first two observations based on a logistic regression model, where
the first variable in observation 1 takes a value of 𝑥1 = 6 and observation 2 has 𝑥1 = 4.
Suppose we realized we made an error for these two observations, and the first observation
was actually 𝑥1 = 7 (instead of 6) and the second observation actually had 𝑥1 = 5 (instead
of 4). Then the predicted probability from the logistic regression model would increase the
same amount for each observation after we correct these variables.
b. When using a logistic regression model, it is impossible for the model to predict a probability
that is negative or a probability that is greater than 1.
c. Because logistic regression predicts probabilities of outcomes, observations used to build a
logistic regression model need not be independent.
d. When fitting logistic regression, we typically complete model selection using adjusted 𝑅2 .
3. Possum classification, comparing models. The common brushtail possum of the Australia
region is a bit cuter than its distant cousin, the American opossum (see Figure 7.4). We consider
104 brushtail possums from two regions in Australia, where the possums may be considered a
random sample from the population. The first region is Victoria, which is in the eastern half of
Australia and traverses the southern coast. The second region consists of New South Wales and
Queensland, which make up eastern and northeastern Australia.3
We use logistic regression to differentiate between possums in these two regions. The outcome
variable, called pop, takes value 1 when a possum is from Victoria and 0 when it is from New
South Wales or Queensland. We consider five predictors: sex (an indicator for a possum being
male), head_l (head length), skull_w (skull width), total_l (total length), and tail_l (tail
length). Each variable is summarized in a histogram. The full logistic regression model and a
reduced model after variable selection are summarized in the tables below.
See the next page for the questions.
3 The possum data used in this exercise can be found in the openintro R package.
a. Examine each of the predictors given by the individual graphs. Are there any outliers that
are likely to have a very large influence on the logistic regression model?
b. Two models are provided above for predicting the region of the possum. (In Chapter 26 we
will cover a method for deciding between the models based on p-values.) The first model
includes head_l and the second model does not. Explain why the remaining estimates
(model coefficients) change between the two models.
4. Challenger disaster and model building. On January 28, 1986, a routine launch was anticipated
for the Challenger space shuttle. Seventy-three seconds into the flight, disaster happened:
the shuttle broke apart, killing all seven crew members on board. An investigation into the
cause of the disaster focused on a critical seal called an O-ring, and it is believed that damage
to these O-rings during a shuttle launch may be related to the ambient temperature during
the launch. The table below summarizes observational data on O-rings for 23 shuttle missions,
where the mission order is based on the temperature at the time of the launch. temperature
gives the temperature in Fahrenheit, damaged represents the number of damaged O-rings, and
undamaged represents the number of O-rings that were not damaged.4
mission 1 2 3 4 5 6 7 8 9 10 11 12
temperature 53 57 58 63 66 67 67 67 68 69 70 70
damaged 5 1 1 1 0 0 0 0 0 0 1 0
undamaged 1 5 5 5 6 6 6 6 6 6 5 6
mission 13 14 15 16 17 18 19 20 21 22 23
temperature 70 70 72 73 75 75 76 76 78 79 81
damaged 1 0 0 0 0 1 0 0 0 0 0
undamaged 5 6 6 6 6 5 6 6 6 6 6
a. Each column of the table above represents a different shuttle mission. Examine these data
and describe what you observe with respect to the relationship between temperatures and
damaged O-rings.
b. Failures have been coded as 1 for a damaged O-ring and 0 for an undamaged O-ring, and
a logistic regression model was fit to these data. The regression output for this model is
given above. Describe the key components of the output in words.
c. Write out the logistic model using the point estimates of the model parameters.
d. Based on the model, do you think concerns regarding O-rings are justified? Explain.
5. Possum classification, prediction. A logistic regression model was proposed for classifying
common brushtail possums into their two regions. The outcome variable took value 1 if the
possum was from Victoria and 0 otherwise.
a. Write out the form of the model. Also identify which of the variables are positively associated
with the outcome of living in Victoria, when controlling for other variables.
b. Suppose we see a brushtail possum at a zoo in the US, and a sign says the possum had
been captured in the wild in Australia, but it doesn’t say which part of Australia. However,
the sign does indicate that the possum is male, its skull is about 63 mm wide, its tail is 37
cm long, and its total length is 83 cm. What is the reduced model’s computed probability
that this possum is from Victoria? How confident are you in the model’s accuracy of this
probability calculation?
4 The orings data used in this exercise can be found in the openintro R package.
6. Challenger disaster and prediction. On January 28, 1986, a routine launch was anticipated
for the Challenger space shuttle. Seventy-three seconds into the flight, disaster happened: the
shuttle broke apart, killing all seven crew members on board. An investigation into the cause of
the disaster focused on a critical seal called an O-ring, and it is believed that damage to these
O-rings during a shuttle launch may be related to the ambient temperature during the launch.
The investigation found that the ambient temperature at the time of the shuttle launch was
closely related to the damage of O-rings, which are a critical component of the shuttle.
a. The data provided in the previous exercise are shown in the plot. The logistic model fit to
these data may be written as
$$\log\left(\frac{\hat{p}}{1 - \hat{p}}\right) = 11.6630 - 0.2162 \times \texttt{temperature}$$
where 𝑝̂ is the model-estimated probability that an O-ring will become damaged. Use the model
to calculate the probability that an O-ring will become damaged at each of the following ambient
temperatures: 51, 53, and 55 degrees Fahrenheit. The model-estimated probabilities for several
additional ambient temperatures are provided below, where subscripts indicate the temperature:
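$$
\begin{aligned}
\hat{p}_{57} &= 0.341 & \hat{p}_{59} &= 0.251 & \hat{p}_{61} &= 0.179 & \hat{p}_{63} &= 0.124 \\
\hat{p}_{65} &= 0.084 & \hat{p}_{67} &= 0.056 & \hat{p}_{69} &= 0.037 & \hat{p}_{71} &= 0.024
\end{aligned}
$$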
b. Add the model-estimated probabilities from part (a) on the plot, then connect these dots
using a smooth curve to represent the model-estimated probabilities.
c. Describe any concerns you may have regarding applying logistic regression in this application,
and note any assumptions that are required to accept the model’s validity.
7. Spam filtering, model selection. Spam filters are built on principles similar to those used
in logistic regression. Using characteristics of individual emails, we fit a probability that each
message is spam or not spam. We have several email variables for this problem, and we won’t
describe what each variable means here for the sake of brevity, but each is either a numerical or
indicator variable.5
The AIC of the full model is 1863.5. We remove each variable one by one, refit the model, and
record the updated AIC.
a. For variable selection, we fit the full model, which includes all variables, and then we also
fit each model where we’ve dropped exactly one of the variables. In each of these reduced
models, the AIC value for the model is reported below. Based on these results, which
variable, if any, should we drop as part of model selection? Explain.
b. Consider the subsequent model selection stage (where the variable from part (a) has been
removed, and we are considering removal of a second variable). Here again we’ve computed
the AIC for each leave-one-variable-out model. Based on the results, which variable, if any,
should we drop as part of model selection? Explain.
c. Consider one more step in the process. Here again we’ve computed the AIC for each leave-
one-variable-out model. Based on the results, which variable, if any, should we drop as part
of model selection? Explain.
5 The email data used in this exercise can be found in the openintro R package.
8. Spam filtering, prediction. Recall running a logistic regression to aid in spam classification
for individual emails. In this exercise, we’ve taken a small set of the variables and fit a logistic
model with the following output:
a. Write down the model using the coefficients from the model fit.
b. Suppose we have an observation where to_multiple = 0, winner = 1, format = 0, and
re_subj = 0. What is the predicted probability that this message is spam?
c. Put yourself in the shoes of a data scientist working on a spam filter. For a given message,
how high must the probability a message is spam be before you think it would be reasonable
to put it in a spambox (which the user is unlikely to check)? What tradeoffs might you
consider? Any ideas about how you might make your spam-filtering system even better
from the perspective of someone using your email service?
9. Possum classification, model selection via AIC. A logistic regression model was proposed
for classifying common brushtail possums into their two regions. The outcome variable took
value 1 if the possum was from Victoria and 0 otherwise.
We use logistic regression to classify the 104 possums in our dataset in these two regions. The
outcome variable, called pop, takes value 1 when the possum is from Victoria and 0 when it
is from New South Wales or Queensland. We consider five predictors: sex (an indicator for a
possum being male), head_l (head length), skull_w (skull width), total_l (total length), and
tail_l (tail length).
A summary of the three models we fit and their AIC values is given below:
formula AIC
sex + head_l + skull_w + total_l + tail_l 84.2
sex + skull_w + total_l + tail_l 83.5
sex + head_l + total_l + tail_l 84.7
a. Using the AIC metric, which of the three models would be best to report?
b. If, for example, the AIC is virtually equivalent for two models that have differing numbers
of variables, which model would be preferred: the model with more variables or the model
with fewer variables? Explain.
10. Model selection. An important aspect of building a logistic regression model is figuring out
which variables to include in the model. In Chapter 9 we covered using AIC to choose between
variable subsets. In Chapter 26 we will cover using something called p-values to choose between
variable subsets. Alternatively, you might hope that a model gave the smallest number of false
positives, the smallest number of false negatives, or the highest overall accuracy. If different
criteria produce outcomes of different variable subsets for the final model, how might you decide
which model to put forward? (Hint: There is no single correct answer to this question.)
Chapter 10
Applications: Model
Take a walk around your neighborhood and you’ll probably see a few houses for sale, and you might be
able to look up their prices online. You’ll note that house prices are somewhat arbitrary – the homeowners
get to decide the listing price, and many criteria factor into this decision, e.g., what do comparable
houses (“comps” in real estate speak) sell for, how quickly they need to sell the house, etc.
In this case study we’ll formalize the process of determining the listing price of a house by using
data on current home sales. In November of 2020, information on 98 houses in the Duke Forest
neighborhood of Durham, NC was scraped from Zillow. The homes were all recently sold at the time
of data collection, and the goal of the project was to build a model for predicting the sale price based
on a particular home’s characteristics. The first four homes are shown in Table 10.1, and descriptions
of each variable are shown in Table 10.2.
Table 10.2: Variables and their descriptions for the duke_forest dataset.
Variable Description
price Sale price, in USD
bed Number of bedrooms
bath Number of bathrooms
area Area of home, in square feet
year_built Year the home was built
cooling Cooling system: central or other (other is baseline)
lot Area of the entire property, in acres
Figure 10.1: Scatterplots describing six different predictor variables’ relationship with the price of a home.
GUIDED PRACTICE
In Figure 10.1 there does not appear to be a correlation value calculated for the predictor
variable, cooling. Why not? Can the variable still be used in the linear model?1
EXAMPLE
In Figure 10.1 which variable seems to be most informative for predicting house price? Provide
two reasons for your answer.
The area of the home is the variable which is most highly correlated with price. Additionally,
the scatterplot for price vs. area seems to show a strong linear relationship between the two
variables. Note that the correlation coefficient and the scatterplot linearity will often give the
same conclusion. However, recall that the correlation coefficient is very sensitive to outliers,
so it is always wise to look at the scatterplot even when the variables are highly correlated.
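The sensitivity of the correlation coefficient to outliers can be illustrated with a small sketch. The Python code below is not part of the case study (the book's own labs use R), and the data are hypothetical, not taken from duke_forest: six points that fall exactly on a line, and then the same points with one added observation far below the line.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: six points on a perfect line.
x_clean = [1, 2, 3, 4, 5, 6]
y_clean = [2, 4, 6, 8, 10, 12]

# The same points plus one hypothetical outlier far below the line.
x_outlier = x_clean + [7]
y_outlier = y_clean + [0]

r_clean = pearson_r(x_clean, y_clean)
r_outlier = pearson_r(x_outlier, y_outlier)
print(f"r without outlier: {r_clean:.2f}")   # 1.00
print(f"r with outlier:    {r_outlier:.2f}")  # 0.25
```

In this made-up example a single unusual point drops the correlation from 1 to 0.25, which is why the scatterplot is worth checking alongside the coefficient.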
1 The correlation coefficient can only be calculated to describe the relationship between two numerical variables.
The predictor variable cooling is categorical, not numerical. It can, however, be used in the linear model as a binary
indicator variable coded, for example, with a 1 for central and 0 for other.
10.1. CASE STUDY: HOUSES FOR SALE 171
GUIDED PRACTICE
Interpret the value of 𝑏1 = 159 in the context of the problem.2
GUIDED PRACTICE
Using the output in Table 10.3, write out the model for predicting price from area.3
The residuals from the linear model can be used to assess whether a linear model is appropriate.
Figure 10.2 plots the residuals $e_i = y_i - \hat{y}_i$ on the $y$-axis and the fitted (or predicted) values $\hat{y}_i$ on the
$x$-axis.
Figure 10.2: Residuals versus predicted values for the model predicting sale price from area of home.
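As a rough illustration of how each residual is obtained, the following Python sketch (Python is used here only for illustration; the book's labs use R) applies the fitted line for price on area reported in this section, with intercept 116,652 and slope 159. The three homes listed are hypothetical values invented for the example, not rows of the duke_forest data.

```python
# Coefficients of the fitted line reported in this section:
# predicted price = 116,652 + 159 * area
INTERCEPT = 116_652
SLOPE = 159

def predict_price(area_sqft):
    """Predicted sale price (USD) for a home with the given area."""
    return INTERCEPT + SLOPE * area_sqft

def residual(actual_price, area_sqft):
    """Residual: observed price minus predicted price."""
    return actual_price - predict_price(area_sqft)

# Hypothetical homes as (area in square feet, sale price in USD).
homes = [(1500, 340_000), (2200, 450_000), (3000, 700_000)]

for area, price in homes:
    print(f"area={area}, predicted={predict_price(area)}, "
          f"residual={residual(price, area)}")
```

A negative residual means the home sold for less than the line predicts, and a positive residual means it sold for more. Plotting these residuals against the predicted values gives a plot of the kind shown in Figure 10.2.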
2 For each additional square foot of house, we would expect such houses to cost, on average, $159 more.
3 $\widehat{\text{price}} = 116{,}652 + 159 \times \text{area}$
GUIDED PRACTICE
What aspect(s) of the residual plot indicate that a linear model is appropriate? What
aspect(s) of the residual plot seem concerning when fitting a linear model?4
Table 10.4: Summary of least squares fit for price on multiple predictor variables.
EXAMPLE
Using Table 10.4, write out the linear model of price on the six predictor variables.
GUIDED PRACTICE
The value of the estimated coefficient on coolingcentral is 𝑏5 = 84,065. Interpret the
value of 𝑏5 in the context of the problem.5
A friend suggests that maybe you do not need all six variables to have a good model for price. You
consider taking a variable out, but you aren’t sure which one to remove.
4 The residual plot shows that the relationship between area and price of a home is indeed linear. However, the
residuals are quite large for expensive homes. The large residuals indicate potential outliers or increasing variability,
either of which could warrant more involved modeling techniques than are presented in this chapter.
5 The coefficient indicates that if all the other variables are kept constant, homes with central air conditioning cost
EXAMPLE
Results corresponding to the full model for the housing data are shown in Table 10.4. How
should we proceed under the backward elimination strategy?
Our baseline adjusted $R^2$ from the full model is 0.59, and we need to determine whether
dropping a predictor will improve the adjusted $R^2$. To check, we fit models that each drop a
different predictor, and we record the adjusted $R^2$:
The model without bed has the highest adjusted $R^2$ of 0.593, higher than the adjusted $R^2$ for
the full model. Because eliminating bed leads to a model with a higher adjusted $R^2$ than the
full model, we drop bed from the model. It might seem counter-intuitive to exclude number of
bedrooms from the model. After all, we would expect homes with more bedrooms to cost more,
and we can see a clear relationship between number of bedrooms and sale price in Figure 10.1.
However, note that area is still in the model, and it’s quite likely that the area of the home and
the number of bedrooms are highly associated. Therefore, the model already has information
on “how much space is available in the house” with the inclusion of area.
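The decision rule used in a single backward elimination step can be sketched in a few lines of Python (an illustrative choice; the book's labs use R). Only two numbers below come from the text: the full-model adjusted R-squared of 0.59 and the value of 0.593 after dropping bed. The values for the other five predictors are hypothetical placeholders, and the function name backward_step is also made up for this sketch.

```python
def backward_step(baseline_adj_r2, adj_r2_after_drop):
    """Pick the predictor to drop in one backward-elimination step.

    adj_r2_after_drop maps each predictor name to the adjusted R^2 of the
    model refit without that predictor. Returns the predictor whose removal
    gives the highest adjusted R^2, or None if no removal beats the baseline.
    """
    best = max(adj_r2_after_drop, key=adj_r2_after_drop.get)
    if adj_r2_after_drop[best] > baseline_adj_r2:
        return best
    return None

baseline = 0.59  # adjusted R^2 of the full model (from the text)

adj_r2_after_drop = {
    "bed": 0.593,         # from the text
    "bath": 0.580,        # hypothetical placeholder
    "area": 0.520,        # hypothetical placeholder
    "year_built": 0.570,  # hypothetical placeholder
    "cooling": 0.560,     # hypothetical placeholder
    "lot": 0.510,         # hypothetical placeholder
}

first_drop = backward_step(baseline, adj_r2_after_drop)
print(first_drop)  # bed
```

The same function would be called again with the new baseline (0.593) and the adjusted R-squared values of the models that each drop one more predictor. Elimination stops as soon as the function returns None.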
Since we eliminated a predictor from the model in the first step, we see whether we should
eliminate any additional predictors. Our baseline adjusted $R^2$ is now 0.593. We fit another set
of new models, which consider eliminating each of the remaining predictors in addition to bed:
That is, after backward elimination, we are left with the model that keeps all predictors except bed,
which we can summarize using the coefficients from Table 10.5.
Table 10.5: Summary of least squares fit for price on multiple predictor variables, excluding number of
bedrooms.
Then, the linear model for predicting sale price based on this model is as follows:
EXAMPLE
The residual plot for the model with all of the predictor variables except bed is given in
Figure 10.3. How do the residuals in Figure 10.3 compare to the residuals in Figure 10.2?
The residuals, for the most part, are randomly scattered around 0. However, there is one
extreme outlier with a residual of -$750,000, a house whose actual sale price is a lot lower
than its predicted price. Also, we observe again that the residuals are quite large for expensive
homes.
Figure 10.3: Residuals versus predicted values for the model predicting sale price from all predictors except
for number of bedrooms.
GUIDED PRACTICE
Consider a house with 1,803 square feet, 2.5 bathrooms, 0.145 acres, built in 1941, that
has central air conditioning. What is the predicted price of the home?6
GUIDED PRACTICE
If you later learned that the house (with a predicted price of $297,570) had recently
sold for $804,133, would you think the model was terrible? What if you learned that
the house was in California?7
6 $\widehat{\text{price}} = -2{,}952{,}641 + 99 \times 1803 + 36{,}228 \times 2.5 + 1{,}466 \times 1941 + 83{,}856 \times 1 + 357{,}119 \times 0.145 = \$297{,}570$.
7 A residual of $506,563 is reasonably big. Note that the large residuals (except a few homes) in Figure 10.3 are
closer to $250,000 (about half as big). After we learn that the house is in California, we realize that the model shouldn’t
be applied to the new home at all! The original data are from Durham, NC, and models based on the Durham, NC
data should be used only to explore patterns in prices for homes in Durham, NC.
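The prediction and residual discussed in the two guided practices above can be reproduced with a short Python sketch (Python is an illustrative choice; the book's labs use R). The coefficients and house characteristics are the ones that appear in the worked calculation. The dictionary keys are labels chosen for this sketch, matched to the coefficients by the values they multiply.

```python
# Coefficients from the worked prediction for the model without bed.
coef = {
    "intercept": -2_952_641,
    "area": 99,
    "bath": 36_228,
    "year_built": 1_466,
    "cooling_central": 83_856,
    "lot": 357_119,
}

# The house described in the guided practice.
house = {
    "area": 1803,          # square feet
    "bath": 2.5,
    "year_built": 1941,
    "cooling_central": 1,  # 1 = central air conditioning
    "lot": 0.145,          # acres
}

predicted = coef["intercept"] + sum(coef[name] * value
                                    for name, value in house.items())
predicted_rounded = round(predicted)

# Residual if the house actually sold for $804,133.
sale_price = 804_133
residual = sale_price - predicted_rounded

print(f"predicted price: ${predicted_rounded:,}")  # $297,570
print(f"residual:        ${residual:,}")           # $506,563
```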
10.2 Interactive R tutorials
Navigate the concepts you’ve learned in this part in R using the following self-paced tutorials. All
you need is your browser to get started!
Tutorial 3: Regression modeling
https://openintrostat.github.io/ims-tutorials/03-model
Tutorial 3 - Lesson 1: Visualizing two variables
https://openintro.shinyapps.io/ims-03-model-01
Tutorial 3 - Lesson 2: Correlation
https://openintro.shinyapps.io/ims-03-model-02
Tutorial 3 - Lesson 3: Simple linear regression
https://openintro.shinyapps.io/ims-03-model-03
Tutorial 3 - Lesson 4: Interpreting regression models
https://openintro.shinyapps.io/ims-03-model-04
Tutorial 3 - Lesson 5: Model fit
https://openintro.shinyapps.io/ims-03-model-05
Tutorial 3 - Lesson 6: Parallel slopes
https://openintro.shinyapps.io/ims-03-model-06
Tutorial 3 - Lesson 7: Evaluating and extending parallel slopes model
https://openintro.shinyapps.io/ims-03-model-07
Tutorial 3 - Lesson 8: Multiple regression
https://openintro.shinyapps.io/ims-03-model-08
Tutorial 3 - Lesson 9: Logistic regression
https://openintro.shinyapps.io/ims-03-model-09
Tutorial 3 - Lesson 10: Case study: Italian restaurants in NYC
https://openintro.shinyapps.io/ims-03-model-10
You can also access the full list of tutorials supporting this book at
https://openintrostat.github.io/ims-tutorials.
10.3 R labs
Further apply the concepts you’ve learned in this part in R with computational labs that walk you
through a data analysis case study.
Introduction to linear regression - Human Freedom Index
https://www.openintro.org/go?id=ims-r-lab-model
You can also access the full list of labs supporting this book at
https://www.openintro.org/go?id=ims-r-labs.
PART IV
Foundations of inference
Among the key concepts in statistics is making conclusions about a population using information in
a sample; the process is called statistical inference. By using computational methods as well as well-
developed mathematical theory, we can understand how one dataset differs from a different dataset
— even when the two datasets have been collected under identical settings. In this part, we will walk
through the key concepts and terms which will be applied more explicitly in later chapters.
• Chapter 11 describes randomization which involves repeatedly permuting observations to repre-
sent scenarios in which there is no association between two variables of interest.
• Chapter 12 describes bootstrapping which involves repeatedly sampling (with replacement) from
the observed data in order to produce many samples which are similar to, but different from,
the original data.
• Chapter 13 introduces the Central Limit Theorem which is a theoretical mathematical approxi-
mation to the variability in data seen through randomization and bootstrapping.
• In Chapter 14 you will be presented with a structure for describing when and how errors can
happen within statistical inference.
• Chapter 15 includes an application on the Malaria vaccine case study where the topics from this
part of the book are fully developed.
Although often computational and mathematical methods are both appropriate (and give similar
results), your study of both approaches should convince you that (1) there is almost never a single
“correct” approach, and (2) there are different ways to quantify the variability seen from dataset to
dataset.
Chapter 11
Hypothesis testing with randomization
Throughout the book so far, you have worked with data in a variety of contexts. You have learned
how to summarize and visualize the data as well as how to model multiple variables at the same time.
Sometimes the dataset at hand represents the entire research question. But more often than not, the
data have been collected to answer a research question about a larger group of which the data are a
(hopefully) representative subset.
You may agree that there is almost always variability in data – one dataset will not be identical to
a second dataset even if they are both collected from the same population using the same methods.
However, quantifying the variability in the data is neither obvious nor easy to do, i.e., answering the
question “how different is one dataset from another?” is not trivial.
First, a note on notation. We generally use $p$ to denote a population proportion and $\hat{p}$ to denote a sample
proportion. Similarly, we generally use $\mu$ to denote a population mean and $\bar{x}$ to denote a sample
mean.
EXAMPLE
Suppose your professor splits the students in your class into two groups: students who sit on
the left side of the classroom and students who sit on the right side of the classroom. If $\hat{p}_L$
represents the proportion of students on the left side of the classroom who prefer to read books
on screen and $\hat{p}_R$ represents the proportion of students on the right side of the classroom
who prefer to read books on screen, would you be surprised if $\hat{p}_L$ did not exactly
equal $\hat{p}_R$?
GUIDED PRACTICE
If we do not think the side of the room a person sits on in class is related to whether they
prefer to read books on screen, what assumption are we making about the relationship
between these two variables?1
Studying randomness of this form is a key focus of statistics. Throughout this chapter, and those
that follow, we provide three different approaches for quantifying the variability inherent in data:
randomization, bootstrapping, and mathematical models. Using the methods provided in this chapter,
we will be able to draw conclusions beyond the dataset at hand to research questions about larger
populations that the samples come from.
The first type of variability we will explore comes from experiments where the explanatory variable (or
treatment) is randomly assigned to the observational units. As you learned in Chapter 1, a randomized
experiment can be used to assess whether one variable (the explanatory variable) causes changes in a
second variable (the response variable). Every dataset has some variability in it, so to decide whether
the variability in the data is due to (1) the causal mechanism (the randomized explanatory variable in
the experiment) or instead (2) natural variability inherent to the data, we set up a sham randomized
experiment as a comparison. That is, we assume that each observational unit would have gotten the
exact same response value regardless of the treatment level. By reassigning the treatments many many
times, we can compare the actual experiment to the sham experiment. If the actual experiment has
more extreme results than any of the sham experiments, we are led to believe that it is the explanatory
variable which is causing the result and not just variability inherent to the data. Using a few different
case studies, let’s look more carefully at this idea of a randomization test.
We consider a study investigating sex discrimination in the 1970s, which is set in the context of
personnel decisions within a bank. The research question we hope to answer is, “Are individuals who
identify as female discriminated against in promotion decisions made by their managers who identify
as male?” (Rosen and Jerdee 1974)
This study considered sex roles, and only allowed for options of “male” and “female”. We should note
that the identities being considered are not gender identities and that the study allowed only for a
binary classification of sex.
GUIDED PRACTICE
Is this an observational study or an experiment? How does the type of study impact
what can be inferred from the results?2
For each supervisor both the sex associated with the assigned file and the promotion decision were
recorded. Using the results of the study summarized in Table 11.1, we would like to evaluate if
individuals who identify as female are unfairly discriminated against in promotion decisions. In this
study, a smaller proportion of the female-identifying candidates were promoted than of the male-identifying candidates (0.583 versus
0.875), but it is unclear whether the difference provides convincing evidence that individuals who
identify as female are unfairly discriminated against.
                 decision
sex       promoted   not promoted   Total
male         21            3          24
female       14           10          24
Total        35           13          48
The data are visualized in Figure 11.1 as a set of cards. Note that each card denotes a personnel file
(an observation from our dataset) and the colors indicate the decision: red for promoted and white for
not promoted. Additionally, the observations are split into male-identifying and female-identifying
groups.
Figure 11.1: The sex discrimination study can be thought of as 48 red and white cards.
EXAMPLE
Statisticians are sometimes called upon to evaluate the strength of evidence. When looking at
the rates of promotion in this study, why might we be tempted to immediately conclude that
individuals identifying as female are being discriminated against?
The large difference in promotion rates (58.3% for female personnel versus 87.5% for male per-
sonnel) suggests there might be discrimination against women in promotion decisions. However,
we cannot yet be sure if the observed difference represents discrimination or is just due to ran-
dom chance when there is no discrimination occurring. Since we wouldn’t expect the sample
proportions to be exactly equal, even if the truth was that the promotion decisions were in-
dependent of sex, we can’t rule out random chance as a possible explanation when simply
comparing the sample proportions.
11.1. SEX DISCRIMINATION CASE STUDY 181
The previous example is a reminder that there will always be variability in data (making the groups
differ), even if there are no underlying causes for that difference (e.g., even if there is no discrimination).
Table 11.1 shows there were 7 fewer promotions for female identifying personnel than for the male
personnel, a difference in promotion rates of 29.2% ($\frac{21}{24} - \frac{14}{24} = 0.292$). This observed difference is what
we call a point estimate of the true difference. The point estimate of the difference in promotion
rate is large, but the sample size for the study is small, making it unclear if the observed difference
represents discrimination or is simply due to chance. Chance can be thought of as the claim due to
natural variability; discrimination can be thought of as the claim the researchers set out to demonstrate.
We label these two competing claims, 𝐻0 and 𝐻𝐴 ∶
• 𝐻0 ∶ Null hypothesis. The variables sex and decision are independent. The difference in
promotion rates of 29.2% was due to natural variability inherent in the population.
• 𝐻𝐴 ∶ Alternative hypothesis. The variables sex and decision are not independent. The
difference in promotion rates of 29.2% was not due to natural variability, and equally qualified
female personnel are less likely to be promoted than male personnel.
Hypothesis testing.
These hypotheses are part of what is called a hypothesis test. A hypothesis test is
a statistical technique used to evaluate competing claims using data. Often, the
null hypothesis takes a stance of no difference or no effect. This hypothesis assumes
that any differences observed are due to the variability inherent in the population and
could have occurred by random chance.
If the null hypothesis and the data notably disagree, then we reject the null hypothesis
in favor of the alternative hypothesis.
There are many nuances to hypothesis testing, so do not worry if you don’t feel like a
master of hypothesis testing at the end of this section. We’ll discuss these ideas and
details many times in this chapter as well as in the chapters that follow.
What would it mean if the null hypothesis, which says the variables sex and decision are unrelated,
was true? It would mean each banker would decide whether to promote the candidate without regard
to the sex indicated on the personnel file. That is, the difference in the promotion percentages would
be due to the natural variability in how the files were randomly allocated to different bankers, and
this randomization just happened to give rise to a relatively large difference of 29.2%.
Consider the alternative hypothesis: bankers were influenced by which sex was listed on the personnel
file. If this was true, and especially if this influence was substantial, we would expect to see some
difference in the promotion rates of male and female candidates. If this sex bias was against female
candidates, we would expect a smaller fraction of promotion recommendations for female personnel
relative to the male personnel.
We will choose between the two competing claims by assessing if the data conflict so much with 𝐻0
that the null hypothesis cannot be deemed reasonable. If data and the null claim seem to be at odds
with one another, and the data seem to support 𝐻𝐴 , then we will reject the notion of independence
and conclude that the data provide evidence of discrimination.
3 The test procedure we employ in this section is sometimes referred to as a randomization test. If the explanatory
variable had not been randomly assigned, as in an observational study, the procedure would be referred to as a
permutation test.
182 CHAPTER 11. HYPOTHESIS TESTING WITH RANDOMIZATION
In the simulation, we thoroughly shuffle the 48 personnel files, 35 labelled promoted and 13 labelled
not promoted, together and we deal files into two new stacks. Note that by keeping 35 promoted and
13 not promoted, we are assuming that 35 of the bank managers would have promoted the individual
whose content is contained in the file independent of the sex indicated on their file. We will deal
24 files into the first stack, which will represent the 24 “female” files. The second stack will also
have 24 files, and it will represent the 24 “male” files. Figure 11.2 highlights both the shuffle and the
reallocation to the sham sex groups.
Figure 11.2: The sex discrimination data are shuffled and reallocated to new groups of male and female files.
Then, as we did with the original data, we tabulate the results and determine the fraction of personnel
files designated as “male” and “female” who were promoted.
Since the randomization of files in this simulation is independent of the promotion decisions, any
difference in promotion rates is due to chance. Table 11.2 shows the results of one such simulation.
Table 11.2: Simulation results, where the difference in promotion rates between male and female is purely
due to random chance.
                 decision
sex       promoted   not promoted   Total
male         18            6          24
female       17            7          24
Total        35           13          48
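One round of the card-shuffling simulation described above can be mimicked in Python (an illustrative choice; the book's labs use R). This is only a sketch of the idea: the seed is arbitrary, and the particular counts it produces will generally differ from those in Table 11.2.

```python
import random

rng = random.Random(2024)  # arbitrary seed so the shuffle is reproducible

# 48 personnel files: 35 labelled promoted and 13 labelled not promoted.
files = ["promoted"] * 35 + ["not promoted"] * 13
rng.shuffle(files)

# Deal 24 files into the "female" stack and the remaining 24 into the "male" stack.
female_stack = files[:24]
male_stack = files[24:]

female_rate = female_stack.count("promoted") / 24
male_rate = male_stack.count("promoted") / 24
diff = male_rate - female_rate

print(f"male promotion rate:   {male_rate:.3f}")
print(f"female promotion rate: {female_rate:.3f}")
print(f"difference:            {diff:.3f}")
```

Because the labels were shuffled without looking at the stacks, any difference printed here reflects chance alone.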
GUIDED PRACTICE
What is the difference in promotion rates between the two simulated groups in Ta-
ble 11.2? How does this compare to the observed difference 29.2% from the actual
study?4
4 18/24 − 17/24 = 0.042 or about 4.2% in favor of the male personnel. This difference due to chance is much smaller
Figure 11.3 shows that the difference in promotion rates is much larger in the original data than it is
in the simulated groups (0.292 > 0.042). The quantity of interest throughout this case study has been
the difference in promotion rates. We call the summary value the statistic of interest (or often the
test statistic). When we encounter different data structures, the statistic is likely to change (e.g.,
we might calculate an average instead of a proportion), but we will always want to understand how
the statistic varies from sample to sample.
Figure 11.3: We summarize the randomized data to produce one estimate of the difference in proportions
given no sex discrimination. Note that the sort step is only used to make it easier to visually calculate the
simulated sample proportions.
Figure 11.4: A stacked dot plot of differences from 100 simulations produced under the null hypothesis, 𝐻0 ,
where the simulated sex and decision are independent. Two of the 100 simulations had a difference of at least
29.2%, the difference observed in the study, and are shown as solid blue dots.
Note that the distribution of these simulated differences in proportions is centered around 0. Under
the null hypothesis our simulations made no distinction between male and female personnel files. Thus,
a center of 0 makes sense: we should expect differences from chance alone to fall around zero with
some random fluctuation for each simulation.
EXAMPLE
How often would you observe a difference of at least 29.2% (0.292) according to Figure 11.4?
Often, sometimes, rarely, or never?
It appears that a difference of at least 29.2% under the null hypothesis would only happen
about 2% of the time according to Figure 11.4. Such a low probability indicates that observing
such a large difference from chance alone is rare.
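The reasoning in this example can be checked by repeating the shuffle many times. The Python sketch below (Python is an illustrative choice; the book's labs use R) uses 10,000 shuffles instead of the 100 shown in Figure 11.4, with an arbitrary seed, and counts how often the simulated "male" stack has at least 7 more promotions than the "female" stack, which corresponds to a difference of at least 29.2%.

```python
import random

def simulated_count_difference(rng):
    """Shuffle the 48 files once; return male minus female promotion counts."""
    files = [1] * 35 + [0] * 13  # 1 = promoted, 0 = not promoted
    rng.shuffle(files)
    female_promoted = sum(files[:24])
    male_promoted = sum(files[24:])
    return male_promoted - female_promoted

rng = random.Random(11)  # arbitrary seed
n_sims = 10_000          # more shuffles than the 100 in Figure 11.4

observed_count_difference = 21 - 14  # from Table 11.1

diffs = [simulated_count_difference(rng) for _ in range(n_sims)]

# Share of shuffles at least as extreme as the observed study result.
share_extreme = sum(d >= observed_count_difference for d in diffs) / n_sims

# Average simulated difference in promotion rates (each stack has 24 files).
mean_rate_difference = sum(diffs) / n_sims / 24

print(f"share of shuffles with a difference of at least 29.2%: {share_extreme:.3f}")
print(f"average simulated difference in rates: {mean_rate_difference:.3f}")
```

Working with counts rather than proportions avoids floating-point comparisons at the cutoff. The share printed should land in the neighborhood of the roughly 2% read off Figure 11.4, and the average difference should be close to 0.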
The difference of 29.2% is a rare event if there really is no impact from listing sex in the candidates’
files, which provides us with two possible interpretations of the study results:
• If 𝐻0 , the Null hypothesis is true: Sex has no effect on promotion decision, and we observed
a difference that is so large that it would only happen rarely.
• If 𝐻𝐴 , the Alternative hypothesis is true: Sex has an effect on promotion decision, and what
we observed was actually due to equally qualified female candidates being discriminated against
in promotion decisions, which explains the large difference of 29.2%.
When we conduct formal studies, we reject a null position (the idea that the data are a result of
chance only) if the data strongly conflict with that null position.5 In our analysis, we determined
that there was only a ≈ 2% probability of obtaining a sample where ≥ 29.2% more male candidates
than female candidates get promoted under the null hypothesis, so we conclude that the data provide
strong evidence of sex discrimination against female candidates by the male supervisors. In this case,
we reject the null hypothesis in favor of the alternative.
Statistical inference is the practice of making decisions and conclusions from data in the context
of uncertainty. Errors do occur, just like rare events, and the dataset at hand might lead us to the
wrong conclusion. While a given dataset may not always lead us to a correct conclusion, statistical
inference gives us tools to control and evaluate how often these errors occur. Before getting into the
nuances of hypothesis testing, let’s work through another case study.
How rational and consistent is the behavior of the typical American college student? In this section,
we’ll explore whether college student consumers always consider the following: money not spent now
can be spent later.
In particular, we are interested in whether reminding students about this well-known fact about money
causes them to be a little thriftier. A skeptic might think that such a reminder would have no impact.
We can summarize the two different perspectives using the null and alternative hypothesis framework.
• 𝐻0 ∶ Null hypothesis. Reminding students that they can save money for later purchases will
not have any impact on students’ spending decisions.
• 𝐻𝐴 ∶ Alternative hypothesis. Reminding students that they can save money for later pur-
chases will reduce the chance they will continue with a purchase.
In this section, we’ll explore an experiment conducted by researchers that investigates this very ques-
tion for students at a university in the southwestern United States. (Frederick et al. 2009)
5 This reasoning does not generally extend to anecdotal observations. Each of us observes incredibly rare events every
day, events we could not possibly hope to predict. However, in the non-rigorous setting of anecdotal evidence, almost
anything may appear to be a rare event, so the idea of looking for rare events in day-to-day activities is treacherous. For
example, we might look at the lottery: there was only about a 1 in 302.6 million chance that the Mega Millions numbers for the
October 23, 2018 drawing (at the time, the largest jackpot in the game’s history) would be (5, 28, 62, 65, 70) with a Mega ball of (5), but nonetheless those
numbers came up! However, no matter what numbers had turned up, they would have had the same incredibly rare
odds. That is, any set of numbers we could have observed would ultimately be incredibly rare. This type of situation is
typical of our daily lives: each possible event in itself seems incredibly rare, but if we consider every alternative, those
outcomes are also incredibly rare. We should be cautious not to misinterpret such anecdotal evidence.
11.2. OPPORTUNITY COST CASE STUDY 185
                  decision
group       buy video   not buy video   Total
control        56            19           75
treatment      41            34           75
Total          97            53          150
It might be a little easier to review the results using a visualization. Figure 11.5 shows that a higher
proportion of students in the treatment group chose not to buy the video compared to those in the
control group.
Figure 11.5: Stacked bar plot of results of the opportunity cost study.
6 This context might feel strange if physical video stores predate you. If you’re curious about what those were like,
look up “Blockbuster”.
Another useful way to review the results from Table 11.3 is using row proportions, specifically con-
sidering the proportion of participants in each group who said they would buy or not buy the video.
These summaries are given in Table 11.4.
Table 11.4: The opportunity cost data are summarized using row proportions. Row proportions are
particularly useful here since we can view the proportion of buy and not buy decisions in each group.
                        decision
group        buy video    not buy video    Total
control        0.747          0.253           1
treatment      0.547          0.453           1
We will define a success in this study as a student who chooses not to buy the video.7 Then, the
value of interest is the change in video purchase rates that results from reminding students that not
spending money now means they can spend the money later.
We can construct a point estimate for this difference as (𝑇 for treatment and 𝐶 for control):
$$\hat{p}_T - \hat{p}_C = \frac{34}{75} - \frac{19}{75} = 0.453 - 0.253 = 0.200$$
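As a quick arithmetic check, the row proportions and the point estimate can be recomputed directly from the counts in Table 11.3. The following is a minimal Python sketch, not part of the original analysis; the names `counts`, `row_proportions`, `p_hat_T`, and `p_hat_C` are illustrative choices.

```python
# Counts from Table 11.3 (opportunity cost study).
counts = {
    "control":   {"buy video": 56, "not buy video": 19},
    "treatment": {"buy video": 41, "not buy video": 34},
}

def row_proportions(row):
    """Divide each count in a row by that row's total."""
    total = sum(row.values())
    return {decision: n / total for decision, n in row.items()}

proportions = {group: row_proportions(row) for group, row in counts.items()}

# A "success" is a student who chooses not to buy the video.
p_hat_T = proportions["treatment"]["not buy video"]
p_hat_C = proportions["control"]["not buy video"]
point_estimate = p_hat_T - p_hat_C

print(f"p_hat_T = {p_hat_T:.3f}, p_hat_C = {p_hat_C:.3f}")
print(f"difference = {point_estimate:.3f}")
```

Rounded to three decimals, the printed values agree with the row proportions and the difference reported in the text.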
The proportion of students who chose not to buy the video was 20 percentage points higher in the
treatment group than the control group. Is this 20 percentage point difference between the two groups
so prominent that it is unlikely to have occurred from chance alone, if there is no difference between
the spending habits of the two groups?
7 Success is often defined in a study as the outcome of interest, and a “success” may or may not actually be a positive
outcome. For example, researchers working on a study on COVID prevalence might define a “success” in the statistical
sense as a patient who has COVID-19. A more complete discussion of the term success will be given in Chapter 16.
EXAMPLE
If we are randomly assigning the cards into the simulated treatment and control groups, how
many “not buy video” cards would we expect to end up in each simulated group? What would
be the expected difference between the proportions of “not buy video” cards in each group?
Since the simulated groups are of equal size, we would expect 53/2 = 26.5, i.e., 26 or 27, “not
buy video” cards in each simulated group, yielding a simulated point estimate of the difference
in proportions of 0%. However, due to random chance, we might also expect to sometimes
observe a number a little above or below 26 and 27.
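The expected counts in this example can be written out explicitly. As a sketch, let $X_T$ and $X_C$ denote the number of “not buy video” cards dealt to the simulated treatment and control groups (this notation is ours, not the text's). Each simulated group receives 75 of the 150 cards, and 53 of the 150 cards say “not buy video”, so

$$E[X_T] = E[X_C] = 75 \times \frac{53}{150} = 26.5, \qquad E\left[\frac{X_T}{75} - \frac{X_C}{75}\right] = \frac{26.5}{75} - \frac{26.5}{75} = 0.$$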
Table 11.5: Summary of student choices against their simulated groups. The group assignment had no
connection to the student decisions, so any difference between the two groups is due to chance.
                        decision
group        buy video    not buy video    Total
control         46             29            75
treatment       51             24            75
Total           97             53           150
From this table, we can compute a difference that occurred from the first shuffle of the data (i.e., from
chance alone):
$$\hat{p}_{T,shfl1} - \hat{p}_{C,shfl1} = \frac{24}{75} - \frac{29}{75} = 0.32 - 0.387 = -0.067$$
Just one simulation will not be enough to get a sense of what sorts of differences would happen from
chance alone.
We’ll simulate another set of simulated groups and compute the new difference: 0.04.
And again: 0.12.
And again: -0.013.
We’ll do this 1,000 times.
The results are summarized in a dot plot in Figure 11.6, where each point represents the difference
from one randomization.
Figure 11.6: A stacked dot plot of 1,000 simulated (null) differences produced under the null hypothesis, 𝐻0 .
Six of the 1,000 simulations had a difference of at least 20%, which was the difference observed in the study.
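The card-shuffling procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `simulate_null_differences` and the seed are our own choices, and because the shuffles are random, the count of extreme differences will generally not match the six reported in the text exactly.

```python
import random

def simulate_null_differences(n_sims=1000, seed=11):
    """Shuffle the 150 decision cards and deal them into two groups of 75.

    Returns one simulated difference (treatment - control) in the
    proportion of "not buy video" cards per shuffle.
    """
    rng = random.Random(seed)  # seeded (arbitrarily) so the sketch is reproducible
    cards = ["not buy video"] * 53 + ["buy video"] * 97
    differences = []
    for _ in range(n_sims):
        rng.shuffle(cards)
        treatment, control = cards[:75], cards[75:]
        n_T = treatment.count("not buy video")
        n_C = control.count("not buy video")
        differences.append((n_T - n_C) / 75)
    return differences

observed = (34 - 19) / 75  # 0.20, the difference seen in the study
null_differences = simulate_null_differences()

# Number of shuffles with a difference at least as large as the observed one.
n_extreme = sum(d >= observed for d in null_differences)
print(f"{n_extreme} of {len(null_differences)} shuffles reached {observed:.2f}")
```

Both the simulated and the observed differences are computed as an integer count difference divided by 75, so the `>=` comparison does not depend on floating-point rounding.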
Since there are so many points and it is difficult to discern one point from the other, it is more
convenient to summarize the results in a histogram such as the one in Figure 11.7, where the height of
each histogram bar represents the number of simulations resulting in an outcome of that magnitude.
Figure 11.7: A histogram of 1,000 chance differences produced under the null hypothesis. Histograms like
this one are a convenient representation of data or results when there are a large number of simulations.
Under the null hypothesis (no treatment effect), we would observe a difference of at least +20% about
0.6% of the time. That is really rare! Rather than attribute the observed difference to chance, we conclude the data provide strong evidence there
is a treatment effect: reminding students before a purchase that they could instead spend the money
later on something else lowers the chance that they will continue with the purchase. Notice that we
are able to make a causal statement for this study since the study is an experiment, although we do
not know why the reminder induces a lower purchase rate.
In the last two sections, we utilized a hypothesis test, which is a formal technique for evaluating two
competing possibilities. In each scenario, we described a null hypothesis, which represented either a
skeptical perspective or a perspective of no difference. We also laid out an alternative hypothesis,
which represented a new perspective such as the possibility of a relationship between two variables or
a treatment effect in an experiment. The alternative hypothesis is usually the reason the scientists
set out to do the research in the first place.
The null hypothesis (𝐻0 ) often represents either a skeptical perspective or a claim of
“no difference” to be tested.
If a person makes a somewhat unbelievable claim, we are initially skeptical. However, if there is
sufficient evidence that supports the claim, we set aside our skepticism. The hallmarks of hypothesis
testing are also found in the US court system.
11.3. HYPOTHESIS TESTING 189
EXAMPLE
The US court considers two possible claims about a defendant: they are either innocent or
guilty.
If we set these claims up in a hypothesis framework, which would be the null hypothesis and
which the alternative?
The jury considers whether the evidence is so convincing (strong) that there is no reasonable
doubt regarding the person’s guilt. That is, the skeptical perspective (null hypothesis) is that
the person is innocent until evidence is presented that convinces the jury that the person is
guilty (alternative hypothesis).
Jurors examine the evidence to see whether it convincingly shows a defendant is guilty. Notice that if
a jury finds a defendant not guilty, this does not necessarily mean the jury is confident in the person’s
innocence. They are simply not convinced of the alternative, that the person is guilty. This is also
the case with hypothesis testing: even if we fail to reject the null hypothesis, we do not accept the null
hypothesis as truth.
Failing to find evidence in favor of the alternative hypothesis is not equivalent to finding evidence that
the null hypothesis is true. We will see this idea in greater detail in Chapter 14.
p-value.
The p-value is the probability of observing data at least as favorable to the alternative
hypothesis as our current dataset, if the null hypothesis were true. We typically use a
summary statistic of the data, such as a difference in proportions, to help compute the
p-value and evaluate the hypotheses. This summary value that is used to compute the
p-value is often called the test statistic.
EXAMPLE
In the sex discrimination study, the difference in discrimination rates was our test statistic.
What was the test statistic in the opportunity cost study covered in Section 11.2?
The test statistic in the opportunity cost study was the difference in the proportion of students
who decided against the video purchase in the treatment and control groups. In each of these
examples, the point estimate of the difference in proportions was used as the test statistic.
When the p-value is small, i.e., less than a previously set threshold, we say the results are statistically
discernible. This means the data provide such strong evidence against 𝐻0 that we reject the null
hypothesis in favor of the alternative hypothesis.8 The threshold is called the discernibility level
and often represented by 𝛼 (the Greek letter alpha).9 The value of 𝛼 represents how rare an event
needs to be in order for the null hypothesis to be rejected. Historically, many fields have used 𝛼 = 0.05
as the threshold for rejecting the null hypothesis. The value of 𝛼 can vary depending on the field or the
application.
Note that you may have heard the phrase “statistically significant” as a way to describe “statistically
discernible.” Although in everyday language “significant” would indicate that a difference is large
or meaningful, that is not necessarily the case here. The term “statistically discernible” indicates
that the p-value from a study fell below the chosen discernibility level. For example, in the sex
discrimination study, the p-value was found to be approximately 0.02. Using a discernibility level
of 𝛼 = 0.05, we would say that the data provided statistically discernible evidence against the null
hypothesis. However, this conclusion gives us no information regarding the size of the difference in
promotion rates!
Statistical discernibility.
We say that the data provide statistically discernible evidence against the null hy-
pothesis if the p-value is less than some predetermined threshold (e.g., 0.01, 0.05, 0.1).
EXAMPLE
In the opportunity cost study in Section 11.2, we analyzed an experiment where study partici-
pants had a 20% drop in likelihood of continuing with a video purchase if they were reminded
that the money, if not spent on the video, could be used for other purchases in the future. We
determined that such a large difference would only occur 6-in-1,000 times if the reminder ac-
tually had no influence on student decision-making. What is the p-value in this study? Would
you classify the result as “statistically discernible”?
The p-value was 0.006. Since the p-value is less than 0.05, the data provide statistically
discernible evidence that US college students were actually influenced by the reminder.
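The decision rule used in this example can be sketched as a small Python function. The name `is_discernible` is hypothetical, and the three thresholds are the example values mentioned in the definition of statistical discernibility.

```python
def is_discernible(p_value, alpha=0.05):
    """Return True when the p-value falls below the discernibility level."""
    return p_value < alpha

# 6 of the 1,000 simulated differences were at least as large as 0.20.
p_value = 6 / 1000

for alpha in (0.01, 0.05, 0.10):
    print(f"alpha = {alpha:.2f}: discernible = {is_discernible(p_value, alpha)}")
```

With a p-value of 0.006, the result is discernible at all three thresholds. A p-value of 0.02, as in the sex discrimination study, would be discernible at 0.05 and 0.10 but not at 0.01.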
https://www.openintro.org/book/stat/why05/
Sometimes it’s also a good idea to deviate from the standard. We’ll discuss when to
choose a threshold different than 0.05 in Chapter 14.
8 Many texts use the phrase “statistically significant” instead of “statistically discernible”. We have chosen to use
“discernible” to indicate that a precise statistical event has happened, as opposed to a notable effect which may or may
not fit the statistical definition of discernible or significant.
9 Here, too, we have chosen “discernibility level” instead of “significance level” which you will see in some texts.
11.4. CHAPTER REVIEW 191
11.4.1 Summary
Figure 11.8 provides a visual summary of the randomization testing procedure.
Figure 11.8: An example of one simulation of the full randomization procedure from a hypothetical dataset
as visualized in the first panel. We repeat the steps hundreds or thousands of times.
Question                Answer
What does it do?        Shuffles the explanatory variable to mimic the natural variability found in a randomized experiment
What is it best for?    Hypothesis testing (can also be used for confidence intervals, but not covered in this text)
11.4.2 Terms
The terms introduced in this chapter are presented in Table 11.7. If you’re not sure what some of
these terms mean, we recommend you go back in the text and review their definitions. You should be
able to easily spot them as bolded text.
11.5 Exercises
2. Identify the parameter, II. For each of the following situations, state whether the parameter
of interest is a mean or a proportion.
a. A poll shows that 64% of Americans personally worry a great deal about federal spending
and the budget deficit.
b. A survey reports that local TV news has shown a 17% increase in revenue within a two
year period while newspaper revenues decreased by 6.4% during this time period.
c. In a survey, high school and college students are asked whether they use geolocation services
on their smart phones.
d. In a survey, smart phone users are asked whether they use a web-based taxi service.
e. In a survey, smart phone users are asked how many times they used a web-based taxi
service over the last year.
3. Hypotheses. For each of the research statements below, note whether it represents a null
hypothesis claim or an alternative hypothesis claim.
a. The number of hours that grade-school children spend doing homework predicts their future
success on standardized tests.
b. King cheetahs on average run the same speed as standard spotted cheetahs.
c. For a particular student, the probability of correctly answering a 5-option multiple choice
test is larger than 0.2 (i.e., better than guessing).
d. The mean length of African elephant tusks has changed over the last 100 years.
e. The risk of facial clefts is equal for babies born to mothers who take folic acid supplements
compared with those from mothers who do not.
f. Caffeine intake during pregnancy affects mean birth weight.
g. The probability of getting in a car accident is the same if using a cell phone as if not
using a cell phone.
4. True null hypothesis. Unbeknownst to you, let’s say that the null hypothesis is actually true
in the population. You plan to run a study anyway.
a. If the level of discernibility you choose (i.e., the cutoff for your p-value) is 0.05, how likely
is it that you will mistakenly reject the null hypothesis?
b. If the level of discernibility you choose (i.e., the cutoff for your p-value) is 0.01, how likely
is it that you will mistakenly reject the null hypothesis?
c. If the level of discernibility you choose (i.e., the cutoff for your p-value) is 0.10, how likely
is it that you will mistakenly reject the null hypothesis?
5. Identify hypotheses, I. Write the null and alternative hypotheses in words and then symbols
for each of the following situations.
a. New York is known as “the city that never sleeps”. A random sample of 25 New Yorkers
were asked how much sleep they get per night. Do these data provide convincing evidence
that New Yorkers on average sleep less than 8 hours a night?
b. Employers at a firm are worried about the effect of March Madness, a basketball cham-
pionship held each spring in the US, on employee productivity. They estimate that on a
regular business day employees spend on average 15 minutes of company time checking
personal email, making personal phone calls, etc. They also collect data on how much company
time employees spend on such non-business activities during March Madness. They
want to determine if these data provide convincing evidence that employee productivity
decreases during March Madness.
6. Identify hypotheses, II. Write the null and alternative hypotheses in words and using symbols
for each of the following situations.
a. Since 2008, chain restaurants in California have been required to display calorie counts of
each menu item. Prior to menus displaying calorie counts, the average calorie intake of
diners at a restaurant was 1100 calories. After calorie counts started to be displayed on
menus, a nutritionist collected data on the number of calories consumed at this restaurant
from a random sample of diners. Do these data provide convincing evidence of a difference
in the average calorie intake of diners at this restaurant?
b. Based on the performance of those who took the GRE exam between July 1, 2004 and
June 30, 2007, the average Verbal Reasoning score was calculated to be 462. In 2021 the
average verbal score was slightly higher. Do these data provide convincing evidence that
the average GRE Verbal Reasoning score has changed since 2021?
7. Side effects of Avandia. Rosiglitazone is the active ingredient in the controversial type 2
diabetes medicine Avandia and has been linked to an increased risk of serious cardiovascular
problems such as stroke, heart failure, and death. A common alternative treatment is Pioglita-
zone, the active ingredient in a diabetes medicine called Actos. In a nationwide retrospective
observational study of 227,571 Medicare beneficiaries aged 65 years or older, it was found that
2,593 of the 67,593 patients using Rosiglitazone and 5,386 of the 159,978 using Pioglitazone had
serious cardiovascular problems. These data are summarized in the contingency table below.10
(Graham et al. 2010)
10 The avandia data used in this exercise can be found in the openintro R package.
a. Determine if each of the following statements is true or false. If false, explain why. Be
careful: The reasoning may be wrong even if the statement’s conclusion is correct. In such
cases, the statement should be considered false.
i. Since more patients on Pioglitazone had cardiovascular problems (5,386 vs. 2,593),
we can conclude that the rate of cardiovascular problems for those on a Pioglitazone
treatment is higher.
ii. The data suggest that diabetic patients who are taking Rosiglitazone are more likely to
have cardiovascular problems since the rate of incidence was (2,593 / 67,593 = 0.038)
3.8% for patients on this treatment, while it was only (5,386 / 159,978 = 0.034) 3.4%
for patients on Pioglitazone.
iii. The fact that the rate of incidence is higher for the Rosiglitazone group proves that
Rosiglitazone causes serious cardiovascular problems.
iv. Based on the information provided so far, we cannot tell if the difference between the
rates of incidences is due to a relationship between the two variables or due to chance.
b. What proportion of all patients had cardiovascular problems?
c. If the type of treatment and having cardiovascular problems were independent, how many
patients in the Rosiglitazone group would we expect to have had cardiovascular problems?
d. We can investigate the relationship between outcome and treatment in this study using a
randomization technique. While in reality we would carry out the simulations required for
randomization using statistical software, suppose we actually simulate using index cards.
In order to simulate from the independence model, which states that the outcomes were
independent of the treatment, we write whether each patient had a cardiovascular problem
on cards, shuffle all the cards together, then deal them into two groups of size 67,593 and
159,978. We repeat this simulation 100 times and each time record the difference between
the proportions of cards that say “Yes” in the Rosiglitazone and Pioglitazone groups. Use
the histogram of these differences in proportions to answer the following questions.
i. What are the claims being tested?
ii. Compared to the number calculated in part (b), which would provide more support for
the alternative hypothesis, a higher or lower proportion of patients with cardiovascular
problems in the Rosiglitazone group?
iii. What do the simulation results suggest about the relationship between taking Rosigli-
tazone and having cardiovascular problems in diabetic patients?
8. Heart transplants. The Stanford University Heart Transplant Study was conducted to de-
termine whether an experimental heart transplant program increased lifespan. Each patient
entering the program was designated an official heart transplant candidate, meaning that they
were gravely ill and would most likely benefit from a new heart. Some patients got a transplant
and some did not. The variable transplant indicates which group the patients were in; patients
in the treatment group got a transplant and those in the control group did not. Of the 34 pa-
tients in the control group, 30 died. Of the 69 people in the treatment group, 45 died. Another
variable called survived was used to indicate whether the patient was alive at the end of the
study.11 (Turnbull, Brown, and Hu 1974)
a. Does the stacked bar plot indicate that survival is independent of whether the patient got
a transplant? Explain your reasoning.
b. What do the box plots suggest about the efficacy of heart transplants?
c. What proportions of patients in the treatment and control groups died?
d. One approach for investigating whether the treatment is discernibly effective is
randomization testing.
i. What are the claims being tested?
ii. The paragraph below describes the set up for a randomization test, if we were to do it
without using statistical software. Fill in the blanks with a number or phrase.
We write alive on _____ cards representing patients who were alive at the end
of the study, and deceased on _____ cards representing patients who were not.
Then, we shuffle these cards and split them into two groups: one group of size
_____ representing treatment, and another group of size _____ representing
control. We calculate the difference between the proportion of deceased cards in
the treatment and control groups (treatment - control) and record this value. We
repeat this 100 times to build a distribution centered at _____. Lastly, we calculate
the proportion of simulations where the simulated differences in proportions
are _____. If this proportion is low, we conclude that it is unlikely to have observed
such an outcome by chance and that the null hypothesis should be rejected
in favor of the alternative.
iii. What do the simulation results shown below suggest about the effectiveness of heart
transplants?
11 The heart_transplant data used in this exercise can be found in the openintro R package.
Chapter 12
When the variability across the samples is large, we would assume that
the original statistic is possibly far from the true population parameter
of interest (and the interval estimate will be wide). When the variability
across the samples is small, we expect the sample statistic to be close to
the true parameter of interest (and the interval estimate will be narrow).
The ideal world where sampling data is free or extremely cheap is almost
never the case, and taking repeated samples from a population is usually
impossible. So, instead of using a “resample from the population” ap-
proach, bootstrapping uses a “resample from the sample” approach. In
this chapter we discuss in detail the bootstrapping process.
As seen in Chapter 11, randomization is a statistical technique suitable for evaluating whether a
difference in sample proportions is due to chance.
Randomization tests are best suited for modeling experiments where the treatment (explanatory
variable) has been randomly assigned to the observational units and there is an attempt to answer a
simple yes/no research question.
198 CHAPTER 12. CONFIDENCE INTERVALS WITH BOOTSTRAPPING
For example, consider the following research questions that can be well assessed with a randomization
test:
• Does this vaccine make it less likely that a person will get malaria?
• Does drinking caffeine affect how quickly a person can tap their finger?
• Can we predict whether candidate A will win the upcoming election?
In this chapter, however, we are interested in a different approach to understanding population
parameters. Instead of testing a claim, the goal now is to estimate the unknown value of a population
parameter.
For example,
• How much less likely am I to get malaria if I get the vaccine?
• How much faster (or slower) can a person tap their finger, on average, if they drink caffeine
first?
• What proportion of the vote will go to candidate A?
Here, we explore the situation where the focus is on a single proportion, and we introduce a new
simulation method: bootstrapping.
Bootstrapping is best suited for modeling studies where the data have been generated through random
sampling from a population. As with randomization tests, our goal with bootstrapping is to under-
stand variability of a statistic. Unlike randomization tests (which modeled how the statistic would
change if the treatment had been allocated differently), the bootstrap will model how a statistic varies
from one sample to another taken from the population. This will provide information about how
different the statistic is from the parameter of interest.
Quantifying the variability of a statistic from sample to sample is a hard problem. Fortunately,
sometimes the mathematical theory for how a statistic varies (across different samples) is well-known;
this is the case for the sample proportion as seen in Chapter 13.
However, some statistics do not have simple theory for how they vary, and bootstrapping provides
a computational approach for providing interval estimates for almost any population parameter. In
this chapter we will focus on bootstrapping to estimate a single proportion, and we will revisit
bootstrapping in Chapter 19 through Chapter 21, so you’ll get plenty of practice as well as exposure to
bootstrapping in many different data settings.
Our goal with bootstrapping will be to produce an interval estimate (a range of plausible values) for
the population parameter.
People providing an organ for donation sometimes seek the help of a special medical consultant. These
consultants assist the patient in all aspects of the surgery, with the goal of reducing the possibility of
complications during the medical procedure and recovery. Patients might choose a consultant based
in part on the historical complication rate of the consultant’s clients.
EXAMPLE
We will let 𝑝 represent the true complication rate for liver donors working with this consultant.
(The “true” complication rate will be referred to as the parameter.) We estimate 𝑝 using the
data, and label the estimate 𝑝̂.
The sample proportion for the complication rate is 3 complications divided by the 62 surgeries
the consultant has worked on: 𝑝̂ = 3/62 = 0.048.
EXAMPLE
Is it possible to assess the consultant’s claim (that the reduction in complications is due to her
work) using the data?
No. The claim is that there is a causal connection, but the data are observational, so we must
be on the lookout for confounding variables. For example, maybe patients who can afford a
medical consultant can afford better medical care, which can also lead to a lower complication
rate. While it is not possible to assess the causal claim, it is still possible to understand the
consultant’s true rate of complications.
Parameter.
We typically estimate the parameter using a point estimate from a sample of data. The
point estimate is also known as the statistic.
For example, we estimate the probability 𝑝 of a complication for a client of the medical
consultant by examining the past complication rates of her clients: 𝑝̂ = 3/62 = 0.048.
Figure 12.1: The unknown population is estimated using the observed sample data. Note that we can use
the sample to create an estimated or bootstrapped population from which to sample. The observed data
include three red and four white marbles, so the estimated population contains 3/7 red marbles and 4/7 white
marbles.
By taking repeated samples from the estimated population, the variability from sample to sample can
be observed. In Figure 12.2 the repeated bootstrap samples are obviously different both from each
other and from the original population. Recall that the bootstrap samples were taken from the same
(estimated) population, and so the differences are due entirely to natural variability in the sampling
procedure.
Figure 12.2: Bootstrap sampling provides a measure of the sample to sample variability. Note that we are
taking samples from the estimated population that was created from the observed data.
By summarizing each of the bootstrap samples (here, using the sample proportion), we see, directly,
the variability of the sample proportion, 𝑝̂, from sample to sample. The distribution of $\hat{p}_{boot}$ for the
example scenario is shown in Figure 12.3, and the full bootstrap distribution for the medical consultant
data is shown in Figure 12.6.
12.1. MEDICAL CONSULTANT CASE STUDY 201
Figure 12.3: The bootstrapped proportion is estimated for each bootstrap sample. The resulting bootstrap
distribution (dotplot) provides a measure for how the proportions vary from sample to sample
It turns out that in practice, it is very difficult for computers to work with an infinite population
(with the same proportional breakdown as in the sample). However, there is a physical and compu-
tational method which produces an equivalent bootstrap distribution of the sample proportion in a
computationally efficient manner.
Consider the observed data to be a bag of marbles, 3 of which are successes (red) and 4 of which are
failures (white). By drawing the marbles out of the bag with replacement, we depict the exact same
sampling process as was done with the infinitely large estimated population.
Figure 12.4: Taking repeated resamples from the sample data is the same process as creating an infinitely
large estimate of the population. It is computationally more feasible to take resamples directly from the
sample. Note that the resampling is now done with replacement (that is, the original sample does not ever
change) so that the original sample and estimated hypothetical population are equivalent.
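Resampling with replacement from the bag of marbles can be sketched as follows. This is a minimal Python illustration; the function name `bootstrap_proportion`, the seed, and the choice of five resamples are our own assumptions.

```python
import random

def bootstrap_proportion(sample, rng):
    """Draw len(sample) marbles with replacement; return the share of reds."""
    resample = rng.choices(sample, k=len(sample))
    return resample.count("red") / len(resample)

rng = random.Random(12)  # seed chosen arbitrarily for reproducibility
bag = ["red"] * 3 + ["white"] * 4  # the observed sample: 3 red, 4 white

# The bag itself never changes, so every draw comes from the same
# 3/7 red, 4/7 white mix as the infinitely large estimated population.
p_hat_boots = [bootstrap_proportion(bag, rng) for _ in range(5)]
print(p_hat_boots)
```

Each entry of `p_hat_boots` is a multiple of 1/7, and the values differ from one resample to the next only because of the random draws.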
Figure 12.5: A comparison of the process of sampling from the estimated infinite population and resampling
with replacement from the original sample. Note that the dotplot of bootstrapped proportions is the same
because the process by which the statistics were estimated is equivalent.
If we apply the bootstrap sampling process to the medical consultant example, we consider each client
to be one of the marbles in the bag. There will be 59 white marbles (no complication) and 3 red
marbles (complication). If we choose 62 marbles out of the bag (one at a time with replacement)
and compute the proportion of simulated patients with complications, $\hat{p}_{boot}$, then this “bootstrap”
proportion represents a single simulated proportion from the “resample from the sample” approach.
GUIDED PRACTICE
In a simulation of 62 patients, about how many would we expect to have had a compli-
cation?1
One simulation isn’t enough to get a sense of the variability from one bootstrap proportion to another
bootstrap proportion, so we repeat the simulation 10,000 times using a computer.
Figure 12.6 shows the distribution from the 10,000 bootstrap simulations. The middle 95% of the
bootstrapped proportions fall between 0% and 11.3%. This variability in the bootstrapped proportions
leads us to believe that the true probability of complication (the parameter, 𝑝) is likely to fall
somewhere between 0% and 11.3%.
The range of values for the true proportion is called a bootstrap percentile confidence interval,
and we will see it again throughout the next few sections and chapters.
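A bootstrap percentile interval for the medical consultant data can be sketched in Python as below. This is a hedged illustration, not the code used to produce the figures: the function name `bootstrap_percentile_interval`, the seed, and the nearest-rank percentile convention are all assumptions, and other percentile conventions can give slightly different endpoints.

```python
import math
import random

def bootstrap_percentile_interval(sample, n_boot=10_000, level=0.95, seed=2024):
    """Percentile interval for the proportion of 1s in a 0/1 sample."""
    rng = random.Random(seed)
    n = len(sample)
    # One bootstrap proportion per resample of size n, drawn with replacement.
    boot_props = sorted(
        sum(rng.choices(sample, k=n)) / n for _ in range(n_boot)
    )
    tail = (1 - level) / 2
    # Nearest-rank percentiles (one of several common conventions).
    lower = boot_props[max(math.ceil(tail * n_boot) - 1, 0)]
    upper = boot_props[min(math.ceil((1 - tail) * n_boot) - 1, n_boot - 1)]
    return lower, upper

# 3 complications (coded 1) and 59 surgeries without complications (coded 0).
clients = [1] * 3 + [0] * 59
lower, upper = bootstrap_percentile_interval(clients)
print(f"95% bootstrap percentile interval: {lower:.3f} to {upper:.3f}")
```

The endpoints should land close to the 0% and 11.3% reported in the text, although the upper endpoint can shift by one multiple of 1/62 depending on the seed.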
1 About 4.8% of the patients (3 on average) in the simulation will have a complication, as this is what was seen in
the sample. We will, however, see a little variation from one simulation to the next.
12.2. TAPPERS AND LISTENERS CASE STUDY 203
Figure 12.6: The original medical consultant data is bootstrapped 10,000 times. Each simulation creates a
sample from the original data where the probability of a complication is 𝑝̂ = 3/62. The bootstrap 2.5 percentile
proportion is 0 and the 97.5 percentile is 0.113. The result is: we are confident that, in the population, the
true probability of a complication is between 0% and 11.3%.
EXAMPLE
The original claim was that the consultant’s true rate of complication was under the national
rate of 10%. Does the interval estimate of 0% to 11.3% for the true probability of complica-
tion indicate that the surgical consultant has a lower rate of complications than the national
average? Explain.
No. Because the interval overlaps 10%, it might be that the consultant’s work is associated
with a lower risk of complications, or it might be that the consultant’s work is associated with
a higher risk (i.e., greater than 10%) of complications! Additionally, as previously mentioned,
because this is an observational study, even if an association can be measured, there is no
evidence that the consultant’s work is the cause of the complication rate (being higher or
lower).
Here’s a game you can try with your friends or family: pick a simple, well-known song, tap that tune
on your desk, and see if the other person can guess the song. In this simple game, you are the tapper,
and the other person is the listener.
2 This case study is described in Made to Stick by Chip and Dan Heath. Little known fact: the teaching principles
W W W R W
Wrong Wrong Wrong Correct Wrong
As before, we’ll run a total of 10,000 simulations using a computer. As seen in Figure 12.7, the range
of 95% of the resampled values of 𝑝̂𝑏𝑜𝑜𝑡 is 0.000 to 0.0583. That is, we expect that between 0% and
5.83% of people are truly able to guess the tapper’s tune.
Figure 12.7: The original listener-tapper data is bootstrapped 10,000 times. Each simulation creates a
sample where the probability of being correct is 𝑝̂ = 3/120. The 2.5 percentile proportion is 0 and the 97.5
percentile is 0.0583. The result is that we are confident that, in the population, the true percent of people
who can guess correctly is between 0% and 5.83%.
GUIDED PRACTICE
Do the data provide convincing evidence against the claim that 50% of listeners can
guess the tapper’s tune?3
3 Because 50% is not in the interval estimate for the true parameter, we can say that there is convincing evidence
against the hypothesis that 50% of listeners can guess the tune. Moreover, 50% is a substantial distance from the largest
resample statistic, suggesting that there is very convincing evidence against this hypothesis.
12.3. CONFIDENCE INTERVALS 205
A point estimate provides a single plausible value for a parameter. However, a point estimate is rarely
perfect; usually there is some error in the estimate. In addition to supplying a point estimate of a
parameter, a next logical step would be to provide a plausible range of values for the parameter.
GUIDED PRACTICE
If we want to be very certain we capture the population parameter, should we use a
wider interval (e.g., 99%) or a smaller interval (e.g., 80%)?4
The 95% bootstrap confidence interval for the parameter 𝑝 can be obtained directly
using the ordered 𝑝̂𝑏𝑜𝑜𝑡 values.
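As a minimal sketch of how the ordered bootstrap proportions yield the interval, the R code below assumes the medical consultant data are coded as 3 complications (1) and 59 cases without complication (0). The object names (`consultant`, `p_hat_boot`, `ci_95`) and the seed are illustrative choices, not part of the original analysis.

```r
# Hypothetical encoding of the consultant's 62 cases: 1 = complication, 0 = none.
consultant <- c(rep(1, 3), rep(0, 59))

set.seed(47)  # arbitrary seed so the sketch is reproducible

# Resample the 62 cases with replacement 10,000 times and
# record the proportion of complications in each resample.
p_hat_boot <- replicate(10000, mean(sample(consultant, replace = TRUE)))

# The middle 95% of the bootstrap proportions gives the percentile interval.
ci_95 <- quantile(p_hat_boot, probs = c(0.025, 0.975))
ci_95
```

The endpoints should land near 0 and 0.113, although the upper endpoint can shift by one resampled case (1/62) from run to run.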
In Section 16.1 we will discuss different percentages for the confidence interval (e.g., 90% confidence
interval or 99% confidence interval).
Section 16.1 also provides a longer discussion on what “95% confidence” actually means.
4 If we want to be more certain we will capture the fish, we might use a wider net. Likewise, we use a wider confidence
12.4.1 Summary
Figure 12.8 provides a visual summary of creating bootstrap confidence intervals.
Figure 12.8: We will use sampling with replacement to measure the variability of the statistic of interest
(here the proportion). Sampling with replacement is a computational tool which is equivalent to using the
sample as a way of estimating an infinitely large population from which to sample.
• Create the interval. After choosing a particular confidence level, use the variability of the
bootstrapped statistics to create an interval estimate that we hope will capture the true
parameter. While the interval estimate associated with the particular sample at hand may or may
not capture the parameter, the researcher knows that over their lifetime, the confidence level
will determine the percentage of their research confidence intervals that do capture the true
parameter.
• Form a conclusion. Using the confidence interval from the analysis, report on the interval
estimate for the parameter of interest. Also, be sure to write the conclusion in plain language
so casual readers can understand the results.
Table 12.2 is another look at the bootstrap process summary.
Question | Answer
What does it do? | Resamples (with replacement) from the observed data to mimic the sampling variability found by collecting data from a population
What is it best for? | Confidence intervals (can also be used for bootstrap hypothesis testing for one proportion)
What physical object represents the simulation process? | Pulling marbles from a bag with replacement
12.4.2 Terms
The terms introduced in this chapter are presented in Table 12.3. If you’re not sure what some of
these terms mean, we recommend you go back in the text and review their definitions. You should be
able to easily spot them as bolded text.
12.5 Exercises
a. Describe in words the relevant statistic and parameter for this problem. If you know the
numerical value for either one, provide it. If you don’t know the numerical value, explain
why the value is unknown.
b. What notation is used to describe, respectively, the statistic and the parameter?
c. If using software to bootstrap the original dataset, what is the statistic calculated on each
bootstrap sample?
d. When creating a bootstrap sampling distribution (histogram) of the bootstrapped sample
proportions, where should the center of the histogram lie?
e. The histogram provides a bootstrap sampling distribution for the sample proportion (with
1000 bootstrap repetitions). Using the histogram, estimate a 90% confidence interval for
the proportion of YouTube videos which take place outdoors.
f. Interpret the confidence interval in context of the data.
2. Chronic illness. In 2012 the Pew Research Foundation reported that “45% of US adults report
that they live with one or more chronic conditions.” However, this value was based on a sample,
so it may not be a perfect estimate for the population parameter of interest on its own. The
study was based on a sample of 3014 adults. Below is a distribution of 1000 bootstrapped sample
proportions from the Pew dataset. (Pew Research Center 2013) Using the distribution of 1,000
bootstrapped proportions, approximate a 92% confidence interval for the true proportion of US
adults who live with one or more chronic conditions and interpret it.
5 There are many choices for implementing a random selection of YouTube videos, but it isn’t clear how “random”
they are.
3. Social media users and news, bootstrapping. A poll conducted in 2022 found that 50%
of U.S. adults get news from social media sometimes or often. However, the value was based
on a sample, so it may not be a perfect estimate for the population parameter of interest on
its own. The study was based on a sample of 12,147 adults. Below is a distribution of 1,000
bootstrapped sample proportions from the Pew dataset. (Pew Research Center 2022) Using the
distribution of 1,000 bootstrapped proportions, approximate a 98% confidence interval for the
true proportion of US adult social media users (in 2022) who get at least some of their news
from Twitter. Interpret the interval in the context of the problem.
4. Bootstrap distributions of 𝑝̂, I. Each of the following four distributions was created using
a different dataset. Each dataset was based on 𝑛 = 23 observations. The original datasets had
the following proportions of successes:
5. Bootstrap distributions of 𝑝̂, II. Each of the following four distributions was created using
a different dataset. Each dataset was based on 𝑛 = 23 observations.
Consider each of the following values for the true population 𝑝 (proportion of success). Datasets
A, B, C, D were bootstrapped 1000 times, with bootstrap proportions as given in the histograms
provided. For each parameter value, list the datasets which could plausibly have come from that
population. (Hint: there may be more than one dataset for each parameter value.)
a. 𝑝 = 0.05
b. 𝑝 = 0.25
c. 𝑝 = 0.45
d. 𝑝 = 0.55
e. 𝑝 = 0.75
6. Bootstrap distributions of 𝑝̂, III. Each of the following four distributions was created using
a different dataset. Each dataset had the same proportion of successes (𝑝̂ = 0.4) but a different
sample size. The four datasets were given by 𝑛 = 10, 100, 500, and 1000.
Consider each of the following values for the true population 𝑝 (proportion of success). Datasets
A, B, C, D were bootstrapped 1000 times, with bootstrap proportions as given in the histograms
provided. For each parameter value, list the datasets which could plausibly have come from that
population. (Hint: there may be more than one dataset for each parameter value.)
a. 𝑝 = 0.05
b. 𝑝 = 0.25
c. 𝑝 = 0.45
d. 𝑝 = 0.55
e. 𝑝 = 0.75
7. Cyberbullying rates. Teens were surveyed about cyberbullying, and 54% to 64% reported
experiencing cyberbullying (95% confidence interval). Answer the following questions based on
this interval. (Pew Research Center 2018)
a. A newspaper claims that a majority of teens have experienced cyberbullying. Is this claim
supported by the confidence interval? Explain your reasoning.
b. A researcher conjectured that 70% of teens have experienced cyberbullying. Is this claim
supported by the confidence interval? Explain your reasoning.
c. Without actually calculating the interval, determine if the claim of the researcher from
part (b) would be supported based on a 90% confidence interval?
8. Waiting at an ER. A 95% confidence interval for the mean waiting time at an emergency room
(ER) is (128 minutes, 147 minutes). Answer the following questions based on this interval.
a. A local newspaper claims that the average waiting time at this ER exceeds 3 hours. Is this
claim supported by the confidence interval? Explain your reasoning.
b. The Dean of Medicine at this hospital claims the average wait time is 2.2 hours. Is this
claim supported by the confidence interval? Explain your reasoning.
c. Without actually calculating the interval, determine if the claim of the Dean from part (b)
would be supported based on a 99% confidence interval?
Chapter 13
In recent chapters, we have encountered four case studies. While they differ in the settings, in their
outcomes, and in the technique we have used to analyze the data, they all have something in common:
the general shape of the distribution of the statistics (called the sampling distribution). You may
have noticed that the distributions were symmetric and bell-shaped.
13.1. CENTRAL LIMIT THEOREM 213
Sampling distribution.
A sampling distribution is the distribution of all possible values of a sample statistic from
samples of a given sample size from a given population. We can think about the sampling
distribution as describing how sample statistics (e.g., the sample proportion 𝑝̂ or the
sample mean 𝑥̄) vary from one study to another. A sampling distribution is contrasted
with a data distribution which shows the variability of the observed data values. The
data distribution can be visualized from the observations themselves. However, because
a sampling distribution describes sample statistics computed from many studies, it
cannot be visualized directly from a single dataset. Instead, we use either computational
or mathematical structures to estimate the sampling distribution and hence to describe
the expected variability of the sample statistic in repeated studies.
Figure 13.1 shows the null distributions in each of the four case studies where we ran 10,000 simulations.
Note that the null distribution is the sampling distribution of the statistic created under the setting
where the null hypothesis is true. Therefore, the null distribution will always be centered at the
value of the parameter given by the null hypothesis. In the case of the opportunity cost study, which
originally had just 1,000 simulations, we have included an additional 9,000 simulations.
Figure 13.1: The null distribution for each of the four case studies presented previously. Note that the center
of each distribution is given by the value of the parameter set in the null hypothesis.
GUIDED PRACTICE
Describe the shape of the distributions and note anything that you find interesting.1
The case study for the medical consultant is the only distribution with any evident skew. As we
observed in Chapter 1, it’s common for distributions to be skewed or contain outliers. However, the
null distributions we have so far encountered have all looked somewhat similar and, for the most
part, symmetric. They all resemble a bell-shaped curve. The bell-shaped curve similarity is not a
coincidence, but rather, is guaranteed by mathematical theory.
1 In general, the distributions are reasonably symmetric. The case study for the medical consultant is the only
If we look at a proportion (or difference in proportions) and the scenario satisfies certain
conditions, then the sample proportion (or difference in proportions) will appear to
follow a bell-shaped curve called the normal distribution.
An example of a perfect normal distribution is shown in Figure 13.2. Imagine laying a normal curve
over each of the four null distributions in Figure 13.1. While the mean (center) and standard deviation
(width or spread) may change for each plot, the general shape remains roughly intact.
Mathematical theory guarantees that, if repeated samples are taken, a sample proportion or a difference
in sample proportions will follow something that resembles a normal distribution when certain
conditions are met. (Note: we typically only take one sample, but the mathematical model lets us know
what to expect if we had taken repeated samples.) These conditions fall into two general categories:
the independence between observations and the need to take a sufficiently large sample.
1. Observations in the sample are independent. Independence is guaranteed when we take a
random sample from a population. Independence can also be guaranteed if we randomly divide
individuals into treatment and control groups.
2. The sample is large enough. The sample size cannot be too small. What qualifies as “small”
differs from one context to the next, and we’ll provide suitable guidelines for proportions in
Chapter 16.
So far we have had no need for the normal distribution. We’ve been able to answer our questions
somewhat easily using simulation techniques. However, soon this will change. Simulating data can be
non-trivial. For example, some of the scenarios encountered in Chapter 8 where we introduced regres-
sion models with multiple predictors would require complex simulations in order to make inferential
conclusions. Instead, the normal distribution and other distributions like it offer a general framework
for statistical inference that applies to a very large number of settings.
Technical Conditions.
In order for the normal approximation to describe the sampling distribution of the
sample proportion as it varies from sample to sample, two conditions must hold. If
these conditions do not hold, it is unwise to use the normal distribution (and related
concepts like Z scores, probabilities from the normal curve, etc.) for inferential analyses.
1. Independent observations
2. Large enough sample: For proportions, at least 10 expected successes and 10
expected failures in the sample.
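The large-sample condition can be checked with a line or two of arithmetic. The sketch below applies it to the two case studies from the previous chapter, using the hypothesized proportions mentioned there (10% for the consultant, 50% for the listeners). The helper name `check_success_failure()` is hypothetical, and independence still has to be judged from the study design rather than computed.

```r
# Hypothetical helper: TRUE when both expected counts are at least 10.
check_success_failure <- function(n, p) {
  expected_successes <- n * p
  expected_failures  <- n * (1 - p)
  expected_successes >= 10 && expected_failures >= 10
}

# Medical consultant: n = 62 with a rate of 10% gives only 6.2 expected complications.
consultant_ok <- check_success_failure(n = 62, p = 0.10)

# Tappers and listeners: n = 120 with a claimed rate of 50% gives 60 and 60.
tappers_ok <- check_success_failure(n = 120, p = 0.50)

c(consultant = consultant_ok, tappers = tappers_ok)
```

Under these assumptions the consultant study fails the condition while the tappers study meets it, which is consistent with the consultant's distribution being the one with evident skew.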
13.2. NORMAL DISTRIBUTION 215
Among all the distributions we see in statistics, one is overwhelmingly the most common. The
symmetric, unimodal, bell curve is ubiquitous throughout statistics. It is so common that people know
it by a variety of names, including the normal curve, normal model, or normal distribution.2
Under certain conditions, sample proportions, sample means, and sample differences can be modeled
using the normal distribution. Additionally, some variables such as SAT scores and heights of US
adult males closely follow the normal distribution.
Distributions of many variables are nearly normal, but none are exactly normal. Thus,
the normal distribution, while not perfect for any single problem, is very useful for a
variety of problems.
In this section, we will discuss the normal distribution in the context of data to become familiar with
normal distribution techniques. In the following sections and beyond, we’ll move our discussion to
focus on applying the normal distribution and other related distributions to model point estimates
for hypothesis tests and for constructing confidence intervals.
Figure 13.3: Two normal distributions with different centers and spreads.
Figure 13.4: The two normal models shown in Figure 13.3 but plotted together on the same scale.
2 It is also introduced as the Gaussian distribution after Carl Friedrich Gauss, the first person to formalize its mathematical
expression.
216 CHAPTER 13. INFERENCE WITH MATHEMATICAL MODELS
If a normal distribution has mean 𝜇 and standard deviation 𝜎, we may write the distribution as
𝑁 (𝜇, 𝜎). The two distributions in Figure 13.4 can be written as
𝑁 (𝜇 = 0, 𝜎 = 1) and 𝑁 (𝜇 = 19, 𝜎 = 4)
Because the mean and standard deviation describe a normal distribution exactly, they are called the
distribution’s parameters.
EXAMPLE
Write down the short-hand for a normal distribution with the following parameters.
a. 𝑁 (𝜇 = 5, 𝜎 = 3)
b. 𝑁 (𝜇 = −100, 𝜎 = 10)
c. 𝑁 (𝜇 = 2, 𝜎 = 9)
GUIDED PRACTICE
SAT scores follow a nearly normal distribution with a mean of 1500 points and a stan-
dard deviation of 300 points. ACT scores also follow a nearly normal distribution with
mean of 21 points and a standard deviation of 5 points. Suppose Nel scored 1800 points
on their SAT and Sian scored 24 points on their ACT. Who performed better?3
Figure 13.5: Nel’s and Sian’s scores shown with the distributions of SAT and ACT scores.
3 We use the standard deviation as a guide. Nel is 1 standard deviation above average on the SAT: 1500+300 = 1800.
Sian is 0.6 standard deviations above the mean on the ACT: 21 + 0.6 × 5 = 24. In Figure 13.5, we can see that Nel did
better compared to other test takers than Sian did, so their score was better.
The solution to the previous example relies on a standardization technique called a Z score, a method
most commonly employed for nearly normal observations (but that may be used with any distribution).
The Z score of an observation is defined as the number of standard deviations it falls above or below
the mean. If the observation is one standard deviation above the mean, its Z score is 1. If it is 1.5
standard deviations below the mean, then its Z score is -1.5. If 𝑥 is an observation from a distribution
𝑁 (𝜇, 𝜎), we define the Z score mathematically as
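$$Z = \frac{x - \mu}{\sigma}$$
Using 𝜇𝑆𝐴𝑇 = 1500, 𝜎𝑆𝐴𝑇 = 300, and 𝑥𝑁𝑒𝑙 = 1800, we find Nel’s Z score:
$$Z_{Nel} = \frac{x_{Nel} - \mu_{SAT}}{\sigma_{SAT}} = \frac{1800 - 1500}{300} = 1$$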
The Z score.
The Z score of an observation is the number of standard deviations it falls above or below
the mean. We compute the Z score for an observation 𝑥 that follows a distribution with
mean 𝜇 and standard deviation 𝜎 using
$$Z = \frac{x - \mu}{\sigma}$$
GUIDED PRACTICE
Use Sian’s ACT score, 24, along with the ACT mean and standard deviation to compute
their Z score.4
Observations above the mean always have positive Z scores while those below the mean have negative
Z scores. If an observation is equal to the mean (e.g., SAT score of 1500), then the Z score is 0.
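The Z score calculation is easy to wrap in a small function. The sketch below uses a hypothetical helper named `z_score()` and applies it to the SAT and ACT values given above.

```r
# Z score: the number of standard deviations an observation lies from the mean.
z_score <- function(x, mu, sigma) {
  (x - mu) / sigma
}

z_nel  <- z_score(1800, mu = 1500, sigma = 300)  # Nel's SAT score
z_sian <- z_score(24,   mu = 21,   sigma = 5)    # Sian's ACT score

c(Nel = z_nel, Sian = z_sian)
```

Nel's Z score of 1 is larger than Sian's Z score of 0.6, which matches the earlier conclusion that Nel performed better relative to other test takers.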
EXAMPLE
Let 𝑋 represent a random variable from 𝑁 (𝜇 = 3, 𝜎 = 2), and suppose we observe 𝑥 = 5.19.
Find the Z score of 𝑥. Then, use the Z score to determine how many standard deviations above
or below the mean 𝑥 falls.
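One way to work this example is to substitute the given values into the Z score formula:

```latex
Z = \frac{x - \mu}{\sigma} = \frac{5.19 - 3}{2} = \frac{2.19}{2} = 1.095
```

Because the Z score is positive, the observation 𝑥 falls 1.095 standard deviations above the mean.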
GUIDED PRACTICE
Head lengths of brushtail possums follow a nearly normal distribution with mean 92.6
mm and standard deviation 3.6 mm. Compute the Z scores for possums with head
lengths of 95.4 mm and 85.8 mm.5
4 $Z_{Sian} = \frac{x_{Sian} - \mu_{ACT}}{\sigma_{ACT}} = \frac{24 - 21}{5} = 0.6$
5 For $x_1 = 95.4$ mm: $Z_1 = \frac{x_1 - \mu}{\sigma} = \frac{95.4 - 92.6}{3.6} = 0.78$. For $x_2 = 85.8$ mm: $Z_2 = \frac{85.8 - 92.6}{3.6} = -1.89$.
We can use Z scores to roughly identify which observations are more unusual than others. One
observation 𝑥1 is said to be more unusual than another observation 𝑥2 if the absolute value of its Z
score is larger than the absolute value of the other observation’s Z score: |𝑍1 | > |𝑍2 |. This technique
is especially insightful when a distribution is symmetric.
GUIDED PRACTICE
Which of the two brushtail possum observations in the previous guided practice is more
unusual?6
EXAMPLE
Nel from the SAT Guided Practice earned a score of 1800 on their SAT with a corresponding
𝑍 = 1. They would like to know what percentile they fall in among all SAT test-takers.
Nel’s percentile is the percentage of people who earned a lower SAT score than Nel. We
shade the area representing those individuals in Figure 13.6. The total area under the normal
curve is always equal to 1, and the proportion of people who scored below Nel on the SAT is
equal to the area shaded in Figure 13.6: 0.8413. In other words, Nel is in the 84𝑡ℎ percentile
of SAT takers.
Figure 13.6: The normal model for SAT scores, shading the area of those individuals who scored below Nel.
We can use the normal model to find percentiles or probabilities. A normal probability table,
which lists Z scores and corresponding percentiles, can be used to identify a percentile based on the
Z score (and vice versa). Statistical software can also be used.
Normal probabilities are most commonly found using statistical software which we will show here
using R. We use the software to identify the percentile corresponding to any particular Z score. For
instance, the percentile of 𝑍 = 0.43 is 0.6664, or the 66.64𝑡ℎ percentile. The pnorm() function is
available in default R and will provide the percentile associated with any cutoff on a normal curve.
The normTail() function is available in the openintro R package and will draw the associated normal
distribution curve.
pnorm(0.43, mean = 0, sd = 1)
[1] 0.666
openintro::normTail(m = 0, s = 1, L = 0.43)
6 Because the absolute value of Z score for the second observation is larger than that of the first, the second observation
We can also find the Z score associated with a percentile. For example, to identify Z for the 80𝑡ℎ
percentile, we use qnorm() which identifies the quantile for a given percentage. The quantile
represents the cutoff value. (To remember the function qnorm() as providing a cutoff, notice that both
qnorm() and “cutoff” start with the sound “kuh”. To remember the pnorm() function as providing a
probability from a given cutoff, notice that both pnorm() and “probability” start with the sound “puh”.)
We determine the Z score for the 80𝑡ℎ percentile using qnorm(): 0.84.
qnorm(0.80, mean = 0, sd = 1)
[1] 0.842
openintro::normTail(m = 0, s = 1, L = 0.842)
GUIDED PRACTICE
Determine the proportion of SAT test takers who scored better than Nel on the SAT.7
EXAMPLE
Shannon is a randomly selected SAT taker, and nothing is known about Shannon’s SAT apti-
tude. What is the probability that Shannon scores at least 1630 on their SATs?
First, always draw and label a picture of the normal distribution. (Drawings need not be exact
to be useful.) We are interested in the chance they score above 1630, so we shade the upper
tail. See the normal curve below.
The 𝑥-axis identifies the mean and the values at 2 standard deviations above and below the
mean. The simplest way to find the shaded area under the curve makes use of the Z score
of the cutoff value. With 𝜇 = 1500, 𝜎 = 300, and the cutoff value 𝑥 = 1630, the Z score is
computed as
$$Z = \frac{x - \mu}{\sigma} = \frac{1630 - 1500}{300} = \frac{130}{300} = 0.43$$
We use software to find the percentile of 𝑍 = 0.43, which yields 0.6664. However, the percentile
describes those who had a Z score lower than 0.43. To find the area above 𝑍 = 0.43, we compute
one minus the area of the lower tail, as seen below.
$$1.0000 - 0.6664 = 0.3336$$
The probability Shannon scores at least 1630 on the SAT is 0.3336. This calculation is
visualized in Figure 13.7.
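The same upper-tail probability can be computed in R without rounding the Z score first. The sketch below is one way to do it; the variable names are illustrative. Because R works with the unrounded Z score (0.4333...), its answer is roughly 0.33 and differs slightly in the third decimal place from 0.3336, which is based on 𝑍 rounded to 0.43.

```r
# Upper-tail probability for a score of at least 1630 under N(1500, 300).
z_cutoff   <- (1630 - 1500) / 300                  # about 0.43
lower_tail <- pnorm(1630, mean = 1500, sd = 300)   # area to the left of the cutoff
upper_tail <- 1 - lower_tail                       # area to the right of the cutoff

# pnorm() can also return the upper tail directly.
upper_tail_direct <- pnorm(1630, mean = 1500, sd = 300, lower.tail = FALSE)

round(c(z = z_cutoff, upper = upper_tail, direct = upper_tail_direct), 4)
```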
7 If 84% had lower scores than Nel, the number of people who had better scores must be 16%. (Generally ties are
ignored when the normal model, or any other continuous distribution, is used.)
Figure 13.7: Visual calculation of the probability that Shannon scores at least 1630 on the SAT.
For any normal probability situation, always always always draw and label the normal
curve and shade the area of interest first. The picture will provide an estimate of the
probability.
After drawing a figure to represent the situation, identify the Z score for the observation
of interest.
GUIDED PRACTICE
If the probability of Shannon scoring at least 1630 is 0.3336, then what is the probability
they score less than 1630? Draw the normal curve representing this exercise, shading
the lower region instead of the upper one.8
EXAMPLE
Edward earned a 1400 on their SAT. What is their percentile?
First, a picture is needed. Edward’s percentile is the proportion of people who do not get as
high as a 1400. These are the scores to the left of 1400, as shown below.
The mean 𝜇 = 1500, the standard deviation 𝜎 = 300, and the cutoff for the tail area 𝑥 = 1400
are used to compute the Z score:
$$Z = \frac{x - \mu}{\sigma} = \frac{1400 - 1500}{300} = -0.33$$
Statistical software can be used to find the proportion of the 𝑁 (0, 1) curve to the left of −0.33
which is 0.3707. Edward is at the 37𝑡ℎ percentile.
8 We found the probability to be 0.6664. A picture for this exercise is represented by the shaded area below “0.6664”.
EXAMPLE
Use the results of the previous example to compute the proportion of SAT takers who did
better than Edward. Also draw a new picture.
If Edward did better than 37% of SAT takers, then about 63% must have done better than
them, as shown below.
Most statistical software, as well as normal probability tables in most books, give the
area to the left. If you would like the area to the right, first find the area to the left
and then subtract the amount from one.
GUIDED PRACTICE
Stuart earned an SAT score of 2100. Draw a picture for each part. (a) What is their
percentile? (b) What percent of SAT takers did better than Stuart?9
Based on a sample of 100 men,10 the heights of adults who identify as male, between the ages of 20 and
62 in the US, are nearly normal with mean 70.0” and standard deviation 3.3”.
EXAMPLE
Kamron is 5’7” (67 inches) and Adrian is 6’4” (76 inches). (a) What is Kamron’s height
percentile? (b) What is Adrian’s height percentile? Also draw one picture for each part.
Numerical answers, calculated using statistical software (e.g., pnorm() in R): (a) 18.17th per-
centile. (b) 96.55th percentile.
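As a sketch of the software calculation, the percentiles can be found with pnorm() by supplying the mean and standard deviation of the height model directly. The variable names are illustrative.

```r
# Height model from the text: N(mean = 70.0, sd = 3.3), in inches.
kamron <- pnorm(67, mean = 70, sd = 3.3)  # proportion shorter than 67 inches
adrian <- pnorm(76, mean = 70, sd = 3.3)  # proportion shorter than 76 inches

# Express the lower-tail areas as percentiles.
round(100 * c(Kamron = kamron, Adrian = adrian), 2)
```

The results should agree with the stated answers of roughly the 18th and 97th percentiles.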
The last several problems have focused on finding the probability or percentile for a particular obser-
vation. What if you would like to know the observation corresponding to a particular percentile?
EXAMPLE
Yousef’s height is at the 40𝑡ℎ percentile. How tall are they?
In this case, the lower tail probability is known (0.40), which can be shaded on the diagram.
We want to find the observation that corresponds to the known probability of 0.4. We can find
the observation in two different ways: using the height curve seen above or using the Z score
associated with the standard normal curve centered at zero with a standard deviation of one.
If you have access to software (like R, code seen below) that allows you to specify the mean
and standard deviation of the normal curve, you can calculate the observed value on the curve
(i.e., Yousef’s height) directly.
qnorm(0.4, mean = 70, sd = 3.3)
[1] 69.2
Yousef is 69.2 inches tall. That is, Yousef is about 5’9” (this is notation for 5-feet, 9-inches).
Without access to flexible software, you will need the information given by a standard normal
curve (a normal curve centered at zero with a standard deviation of one). First, determine the
Z score associated with the 40𝑡ℎ percentile.
Because the percentile is below 50%, we know 𝑍 will be negative. Statistical software provides
the 𝑍 value to be −0.25.
qnorm(0.4, mean = 0, sd = 1)
[1] -0.253
Knowing 𝑍𝑌𝑜𝑢𝑠𝑒𝑓 = −0.25 and the population parameters 𝜇 = 70 and 𝜎 = 3.3 inches, the Z
score formula can be set up to determine Yousef’s unknown height, labeled 𝑥𝑌𝑜𝑢𝑠𝑒𝑓:
$$-0.253 = Z_{Yousef} = \frac{x_{Yousef} - \mu}{\sigma} = \frac{x_{Yousef} - 70}{3.3}$$
Solving for 𝑥𝑌𝑜𝑢𝑠𝑒𝑓 yields the height 69.2 inches. Again, Yousef is about 5’9”.
EXAMPLE
What is the adult male height at the 82𝑛𝑑 percentile?
In order to practice using Z scores, we will use the standard normal curve to solve the problem.
Next, we want to find the Z score at the 82𝑛𝑑 percentile, which will be a positive value (because
the percentile is bigger than 50%). Using qnorm(), the 82𝑛𝑑 percentile corresponds to 𝑍 = 0.92.
qnorm(0.82, mean = 0, sd = 1)
[1] 0.915
Finally, the height 𝑥 is found using the Z score formula with the known mean 𝜇, standard
deviation 𝜎, and Z score 𝑍 = 0.92:
$$0.92 = Z = \frac{x - \mu}{\sigma} = \frac{x - 70}{3.3}$$
This yields 73.04 inches or about 6’1” as the height at the 82𝑛𝑑 percentile.
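The two routes to this answer (through the standard normal curve, or directly on the height curve) can be compared in R. This is a sketch with illustrative variable names; because R keeps more decimal places of the Z score than 0.92, its result is close to, but not exactly, 73.04.

```r
# Route 1: find the Z score on the standard normal curve, then convert to inches.
z_82 <- qnorm(0.82, mean = 0, sd = 1)
height_from_z <- 70 + z_82 * 3.3

# Route 2: ask for the 82nd percentile of the height curve directly.
height_direct <- qnorm(0.82, mean = 70, sd = 3.3)

round(c(z = z_82, from_z = height_from_z, direct = height_direct), 2)
```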
GUIDED PRACTICE
(a) What is the 95𝑡ℎ percentile for SAT scores?
(b) What is the 97.5𝑡ℎ percentile of the male heights? As always with normal
probability problems, first draw a picture.11
GUIDED PRACTICE
(a) What is the probability that a randomly selected male adult is at least 6’2” (74
inches)?
(b) What is the probability that a male adult is shorter than 5’9” (69 inches)?12
11 Remember: draw a picture first, then find the Z score. (We leave the pictures to you.) The Z score can be found by
using the percentiles and the normal probability table. (a) We look for 0.95 in the probability portion (middle part) of
the normal probability table, which leads us to row 1.6 and (about) column 0.05, i.e., 𝑍95 = 1.65. Knowing 𝑍95 = 1.65,
𝜇 = 1500, and 𝜎 = 300, we set up the Z score formula: $1.65 = \frac{x_{95} - 1500}{300}$. We solve for 𝑥95: 𝑥95 = 1995. (b) Similarly,
we find 𝑍97.5 = 1.96, again set up the Z score formula for the heights, and calculate 𝑥97.5 = 76.5.
12 Numerical answers: (a) 0.1131. (b) 0.3821.
EXAMPLE
What is the probability that a randomly selected adult male is between 5’9” and 6’2”?
These heights correspond to 69 inches and 74 inches. First, draw the figure. The area of
interest is no longer an upper or lower tail.
The total area under the curve is 1. If we find the area of the two tails that are not shaded
(from the previous Guided Practice, these areas are 0.3821 and 0.1131), then we can find the
middle area:
$$1.0000 - 0.3821 - 0.1131 = 0.5048$$
That is, the probability of being between 5’9” and 6’2” is 0.5048.
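In software, a middle area is often found by subtracting two lower-tail areas rather than removing both tails from 1. The sketch below does this for the height model; the variable names are illustrative. R uses unrounded Z scores, so the result is about 0.51 and differs slightly from 0.5048 in the third decimal place.

```r
# Height model from the text: N(mean = 70.0, sd = 3.3), in inches.
below_69 <- pnorm(69, mean = 70, sd = 3.3)  # area to the left of 5'9"
below_74 <- pnorm(74, mean = 70, sd = 3.3)  # area to the left of 6'2"

# The area between the two heights is the difference of the lower tails.
middle <- below_74 - below_69
round(middle, 4)
```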
GUIDED PRACTICE
Find the percent of SAT takers who earn between 1500 and 2000.13
GUIDED PRACTICE
What percent of adult males are between 5’5” and 5’7”?14
As seen in later chapters, it turns out that many of the statistics used to summarize data (e.g., the
sample proportion, the sample mean, differences in two sample proportions, differences in two sample
means, the sample slope from a linear model, etc.) vary according to the normal distribution seen
above. The mathematical models are derived from the normal theory, but even the computational
methods (and the intuitive thinking behind both approaches) use the general bell-shaped variability
seen in most of the distributions constructed so far.
13 This is an abbreviated solution. (Be sure to draw a figure!) First find the percent who get below 1500 and the
percent that get above 2000: 𝑍1500 = 0.00 → 0.5000 (area below), 𝑍2000 = 1.67 → 0.0475 (area above). Final answer:
1.0000 − 0.5000 − 0.0475 = 0.4525.
14 5’5” is 65 inches. 5’7” is 67 inches. Numerical solution: 1.000 − 0.0649 − 0.8183 = 0.1168, i.e., 11.68%.
13.3. QUANTIFYING THE VARIABILITY OF A STATISTIC 225
Figure 13.8: Probabilities for falling within 1, 2, and 3 standard deviations of the mean in a normal
distribution.
GUIDED PRACTICE
Use pnorm() (or a Z table) to confirm that about 68%, 95%, and 99.7% of observations
fall within 1, 2, and 3 standard deviations of the mean in the normal distribution,
respectively. For instance, first find the area that falls between 𝑍 = −1 and 𝑍 = 1,
which should have an area of about 0.68. Similarly there should be an area of about
0.95 between 𝑍 = −2 and 𝑍 = 2.15
It is possible for a normal random variable to fall 4, 5, or even more standard deviations from the
mean. However, these occurrences are very rare if the data are nearly normal. The probability of
being more than 4 standard deviations above the mean (or, equally, below it) is about 1-in-30,000.
For 5 and 6 standard deviations, it is about 1-in-3.5 million and 1-in-1 billion, respectively.
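These areas can be checked with pnorm() on the standard normal curve. The sketch below computes the area within 1, 2, and 3 standard deviations, and then the tail areas beyond 4, 5, and 6 standard deviations, both for a single tail and for the two tails combined. The variable names are illustrative.

```r
# Area within k standard deviations of the mean, for k = 1, 2, 3.
k <- 1:3
within_k <- pnorm(k) - pnorm(-k)
round(within_k, 4)

# Tail areas beyond 4, 5, and 6 standard deviations.
k_far <- 4:6
one_tail <- pnorm(-k_far)   # area in one tail only
two_tail <- 2 * one_tail    # area in both tails combined

# Express the tail areas as "1-in-N" odds.
round(1 / one_tail)
round(1 / two_tail)
```

The one-tail odds correspond to falling that far from the mean in a single direction; the two-tail odds are half as long because they count both directions.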
GUIDED PRACTICE
SAT scores closely follow the normal model with mean 𝜇 = 1500 and standard deviation
𝜎 = 300. About what percent of test takers score 900 to 2100? What percent score
between 1500 and 2100 ?16
to determine the areas below 𝑍 = −1 and above 𝑍 = 1. Next verify the area between 𝑍 = −1 and 𝑍 = 1 is about 0.68.
Repeat this for 𝑍 = −2 to 𝑍 = 2 and for 𝑍 = −3 to 𝑍 = 3.
16 900 and 2100 represent two standard deviations above and below the mean, which means about 95% of test takers
will score between 900 and 2100. Since the normal model is symmetric, then half of the test takers from part (a)
(95%/2 = 47.5% of all test takers) will score 900 to 1500 while 47.5% score between 1500 and 2100.
𝑧⋆ is the cutoff value found on the normal distribution. The most common value of 𝑧⋆
is 1.96 (often approximated to be 2) indicating that the margin of error describes the
variability associated with 95% of the sampled statistics.
Notice that if the spread of the observations goes from some lower bound to some upper bound, a rough
approximation of the SE is to divide the range by 4. That is, if you notice the sample proportions go
from 0.1 to 0.4, the SE can be approximated to be 0.075.
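Both statements can be checked numerically. The sketch below recovers the 95% cutoff with qnorm() and applies the range-over-4 approximation to the example values. The reasoning behind dividing by 4 is that about 95% of statistics fall within roughly 2 SE on either side of the center, so the observed spread covers about 4 SE. Variable names are illustrative.

```r
# Cutoff z* leaving 2.5% in each tail of the standard normal distribution.
z_star <- qnorm(0.975)
round(z_star, 2)

# Rough SE from the observed spread of sample proportions (0.1 to 0.4).
lower <- 0.1
upper <- 0.4
rough_se <- (upper - lower) / 4
rough_se
```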
The approach for using the normal model in the context of inference is very similar to the practice of
applying the model to individual observations that are nearly normal. We will replace null distributions
we previously obtained using the randomization or simulation techniques and verify the results once
again using the normal model. When the sample size is sufficiently large, the normal approximation
generally provides us with the same conclusions as the simulation model.
Figure 13.9: Null distribution of differences with an overlaid normal curve for the opportunity cost study.
10,000 simulations were run for this figure.
Next, we can calculate the Z score using the observed difference, 0.20, and the two model parameters.
The standard error, 𝑆𝐸 = 0.078, is the equivalent of the model’s standard deviation.
$$Z = \frac{\text{observed difference} - 0}{SE} = \frac{0.20 - 0}{0.078} = 2.56$$
We can either use statistical software or look up 𝑍 = 2.56 in the normal probability table to determine
the right tail area: 0.0052, which is about the same as what we got for the right tail using the
randomization approach (0.006). Using this area as the p-value, we see that the p-value is less than
0.05, so we conclude that the treatment did indeed impact students’ spending.
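The software route might look like the following sketch, which uses the observed difference (0.20), the null value (0), and the standard error (0.078) from the text. The variable names are illustrative.

```r
# Opportunity cost study: Z score and right-tail area under the normal model.
obs_diff <- 0.20
null_value <- 0
se <- 0.078

z <- (obs_diff - null_value) / se

# Right-tail area beyond the observed Z score, used here as the p-value.
p_value <- pnorm(z, lower.tail = FALSE)

round(c(z = z, p_value = p_value), 4)
```

Because the code uses the unrounded Z score, the tail area can differ in the fourth decimal place from a table lookup at 𝑍 = 2.56.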
The standard error in this case is the equivalent of the standard deviation of the point
estimate, and the null value comes from the claim made in the null hypothesis.
We have confirmed that the randomization approach we used earlier and the normal distribution
approach provide almost identical p-values and conclusions in the opportunity cost case study. Next,
let’s turn our attention to the medical consultant case study.
Figure 13.10: The null distribution for the sample proportion, created from 10,000 simulated studies from
the medical consultant, along with the best-fitting normal model.
Next, we can calculate the Z score using the observed complication rate, 𝑝̂ = 0.048, along with the
mean and standard deviation of the normal model. Here again, we use the standard error for the
standard deviation.
$$Z = \frac{\hat{p} - p_0}{SE_{\hat{p}}} = \frac{0.048 - 0.10}{0.038} = -1.37$$
Identifying 𝑍 = −1.37 using statistical software or in the normal probability table, we can determine
that the left tail area is 0.0853, which is the estimated p-value for the hypothesis test. There is a small
problem: the p-value of 0.0853 is almost 30% smaller than the simulation p-value of 0.1222, which will
be calculated in Section 16.1.
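A short sketch of this comparison in R follows. The inputs (0.048, 0.10, 0.038, and the simulation p-value 0.1222) are taken from the text; the variable names are illustrative.

```r
# Medical consultant study: normal-model p-value versus the simulation p-value.
p_hat <- 0.048   # observed complication rate
p0 <- 0.10       # null value
se <- 0.038      # standard error of p_hat under the null model

z <- (p_hat - p0) / se
p_normal <- pnorm(z)   # left-tail area

p_sim <- 0.1222        # simulation p-value quoted in the text
relative_gap <- (p_sim - p_normal) / p_sim

round(c(z = z, p_normal = p_normal, relative_gap = relative_gap), 4)
```

The unrounded Z score gives a tail area that differs slightly from the table value at 𝑍 = −1.37, but the gap relative to the simulation p-value remains close to 30%.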
The discrepancy is explained by the normal model’s poor representation of the null distribution in
Figure 13.10. As noted earlier, the null distribution from the simulations is not very smooth, and the
distribution itself is slightly skewed. That’s the bad news. The good news is that we can foresee these
problems using some simple checks. We’ll learn more about these checks in the following chapters.
In Section 13.1 we noted that the two common requirements to apply the Central Limit Theorem are
(1) the observations in the sample must be independent, and (2) the sample must be sufficiently large.
The guidelines for this particular situation – which we will learn in Chapter 16 – would have alerted
us that the normal model was a poor approximation.
Statistical techniques are like a carpenter’s tools. When used responsibly, they can produce amazing
and precise results. However, if the tools are applied irresponsibly or under inappropriate conditions,
they will produce unreliable results. For this reason, with every statistical method that we introduce
in future chapters, we will carefully outline conditions when the method can reasonably be used. These
conditions should be checked in each application of the technique.
After covering the introductory topics in this course, advanced study may lead to working with
complex models which, for example, bring together many variables with different variability structure.
Working with data that come from normal populations makes higher-order models easier to estimate
and interpret. There are times when simulation, randomization, or bootstrapping are unwieldy in
either structure or computational demand. Normality can often lead to excellent approximations of
the data using straightforward modeling techniques.
A point estimate is our best guess for the value of the parameter, so it makes sense to build the
confidence interval around that value. The standard error, which is a measure of the uncertainty
associated with the point estimate, provides a guide for how large we should make the confidence
interval. The 68-95-99.7 rule tells us that, in general, 95% of observations are within 2 standard errors
of the mean. Here, we use the value 1.96 to be slightly more precise.
GUIDED PRACTICE
Compute the area between -1.96 and 1.96 for a normal distribution with mean 0 and
standard deviation 1.¹⁷
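One way to carry out this computation is with pnorm(), as in this brief sketch (the variable name is illustrative):

```r
# Area between -1.96 and 1.96 under the standard normal curve.
area_middle <- pnorm(1.96) - pnorm(-1.96)
round(area_middle, 4)
```

The result should be very close to 0.95, which is why 1.96 is paired with 95% confidence.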
EXAMPLE
The point estimate in the opportunity cost study was that 20% fewer students would buy a
video if they were reminded that money not spent now could be spent later on something else.
This point estimate can reasonably be modeled with a normal distribution with a standard
error of 𝑆𝐸 = 0.078. Construct a 95% confidence interval for the point estimate.
Since we’re told the point estimate can be modeled with a normal distribution:
point estimate ± 1.96 × 𝑆𝐸 = 0.20 ± 1.96 × 0.078 = (0.047, 0.353)
We are 95% confident that the video purchase rate resulting from the treatment is between 4.7%
and 35.3% lower than in the control group. Since this confidence interval does not contain 0,
it is consistent with our earlier hypothesis test where we rejected the notion of “no difference”.
Note that we have used SE = 0.078 from the last section. However, it would more generally be
appropriate to recompute the SE slightly differently for this confidence interval using sample
proportions. Don’t worry about this detail for now since the two resulting standard errors are,
in this case, almost identical.
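The interval arithmetic in this example can be reproduced with a few lines of R. This is a sketch using the point estimate and standard error given above; the variable names are illustrative.

```r
# 95% confidence interval for the opportunity cost study.
point_estimate <- 0.20
se <- 0.078

# point estimate +/- 1.96 * SE
ci <- point_estimate + c(-1, 1) * 1.96 * se
round(ci, 3)
```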
Table 13.1: Descriptive statistics for 30-day results for the stent study.
EXAMPLE
Consider the stent study and results. The conditions necessary to ensure the point estimate
𝑝̂𝑡𝑟𝑚𝑡 − 𝑝̂𝑐𝑡𝑟𝑙 = 0.090 is nearly normal have been verified for you, and the estimate’s standard
error is 𝑆𝐸 = 0.028. Construct a 95% confidence interval for the change in 30-day stroke rates
from usage of the stent.
The conditions for applying the normal model have already been verified, so we can proceed
to the construction of the confidence interval:
point estimate ± 1.96 × 𝑆𝐸 = 0.090 ± 1.96 × 0.028 = (0.035, 0.145)
We are 95% confident that implanting a stent in a stroke patient’s brain increased the risk of
stroke within 30 days by a rate of 0.035 to 0.145. This confidence interval can also be used in a
way analogous to a hypothesis test: since the interval does not contain 0 (is completely above
0), it means the data provide convincing evidence that the stent used in the study changed the
risk of stroke within 30 days.
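The same recipe can be wrapped in a small helper so the cutoff follows from the confidence level rather than being typed in. This is a sketch only: normal_ci() is a hypothetical helper written for this illustration, not a function from R or from the text.

```r
# Hypothetical helper: normal-model confidence interval for a point estimate.
normal_ci <- function(estimate, se, conf_level = 0.95) {
  # Cutoff that leaves (1 - conf_level) / 2 in each tail.
  z_star <- qnorm(1 - (1 - conf_level) / 2)
  estimate + c(-1, 1) * z_star * se
}

# Stent study: point estimate 0.090 with SE = 0.028.
stent_ci <- normal_ci(estimate = 0.090, se = 0.028)
round(stent_ci, 3)

# Does the interval contain 0 (no change in stroke risk)?
contains_zero <- stent_ci[1] <= 0 && 0 <= stent_ci[2]
contains_zero
```

Checking whether 0 falls inside the interval mirrors the hypothesis-test style reading described in the example.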
As with hypothesis tests, confidence intervals are imperfect. About 1-in-20 properly constructed 95%
confidence intervals will fail to capture the parameter of interest, simply due to natural variability in
the observed data. Figure 13.11 shows 25 confidence intervals for a proportion that were constructed
from 25 different datasets that all came from the same population where the true proportion was
𝑝 = 0.3. However, 1 of these 25 confidence intervals happened not to include the true value. The
interval which does not capture 𝑝 = 0.3 is not due to bad science. Instead, it is due to natural
variability, and we should expect some of our intervals to miss the parameter of interest. Indeed,
over a lifetime of creating 95% intervals, you should expect 5% of your reported intervals to miss the
parameter of interest (unfortunately, you will not ever know which of your reported intervals captured
the parameter and which missed the parameter).
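The idea that roughly 1-in-20 intervals miss can be explored by simulation. The sketch below assumes the setting of Figure 13.11 (samples of size 300 from a population with 𝑝 = 0.3) and uses the usual standard error formula for a sample proportion, sqrt(p̂(1 − p̂)/n), which is an assumption here since that formula is not given in this section. The seed and the function name covers_truth() are arbitrary choices for illustration.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

p_true <- 0.3
n <- 300

# Draw one sample, build a 95% interval, and report whether it captures p_true.
covers_truth <- function() {
  p_hat <- rbinom(1, size = n, prob = p_true) / n
  se <- sqrt(p_hat * (1 - p_hat) / n)   # assumed SE formula for a proportion
  lower <- p_hat - 1.96 * se
  upper <- p_hat + 1.96 * se
  lower <= p_true && p_true <= upper
}

# Twenty-five intervals, as in the figure.
covered_25 <- replicate(25, covers_truth())
sum(covered_25)

# Long-run capture rate over many repetitions.
long_run_coverage <- mean(replicate(10000, covers_truth()))
long_run_coverage
```

Any single batch of 25 may contain zero, one, or several misses; it is the long-run rate that should sit near 95%.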
Figure 13.11: Twenty-five samples of size 𝑛 = 300 were collected from a population with 𝑝 = 0.30. For each
sample, a confidence interval was created to try to capture the true proportion 𝑝. However, 1 of these 25
intervals did not capture 𝑝 = 0.30.
GUIDED PRACTICE
In Figure 13.11, one interval does not contain the true proportion, 𝑝 = 0.3. Does this
imply that there was a problem with the datasets that were selected?¹⁸
We are XX% confident that the population parameter is between lower and upper (where
lower and upper are both numerical values).
Incorrect language might try to describe the confidence interval as capturing the
population parameter with a certain probability.
This is one of the most common errors: while it might be useful to think of it as a
probability, the confidence level only quantifies how plausible it is that the parameter
is in the interval.
Another especially important consideration of confidence intervals is that they only try to capture the
population parameter. Our intervals say nothing about the confidence of capturing individual
observations, a proportion of the observations, or about capturing point estimates. Confidence intervals
provide an interval estimate for and attempt to capture population parameters.
18 No. Just as some observations occur more than 1.96 standard deviations from the mean, some point estimates
will be more than 1.96 standard errors from the parameter. A confidence interval only provides a plausible range of
values for a parameter. While we might say other values are implausible based on the data, this does not mean they
are impossible.
13.7.1 Summary
We can summarize the process of using the normal model as follows:
• Frame the research question. The mathematical model can be applied to both the hypothesis
testing and the confidence interval framework. Make sure that your research question is being
addressed by the most appropriate inference procedure.
• Collect data with an observational study or experiment. To address the research
question, collect data on the variables of interest. Note that your data may be a random sample
from a population or may be part of a randomized experiment.
• Model the randomness of the statistic. In many cases, the normal distribution will be
an excellent model for the randomness associated with the statistic of interest. The Central
Limit Theorem tells us that if the sample size is large enough, sample averages (which can be
calculated as either a proportion or a sample mean) will be approximately normally distributed
when describing how the statistics change from sample to sample.
• Calculate the variability of the statistic. Using formulas, come up with the standard
deviation (or more typically, an estimate of the standard deviation called the standard error) of
the statistic. The SE of the statistic will give information on how far the observed statistic is from
the null hypothesized value (if performing a hypothesis test) or from the unknown population
parameter (if creating a confidence interval).
• Use the normal distribution to quantify the variability. The normal distribution will
provide a probability which measures how likely it is for the observed statistic and the
hypothesized (or unknown) parameter to differ by the amount measured. The unusualness (or not)
of the discrepancy will form the conclusion to the research question.
• Form a conclusion. Using the p-value or the confidence interval from the analysis, report
on the research question of interest. Also, be sure to write the conclusion in plain language so
casual readers can understand the results.
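The computational steps in this summary (calculate the variability, then use the normal distribution) can be sketched as one small function. normal_inference() is a hypothetical helper written for this illustration; it assumes the point estimate and its standard error are already in hand and that a one-tailed area is wanted.

```r
# Hypothetical helper bundling the Z score, a tail-area p-value, and a
# confidence interval under the normal model.
normal_inference <- function(estimate, se, null_value = 0,
                             tail = c("right", "left"), conf_level = 0.95) {
  tail <- match.arg(tail)

  # How many standard errors the estimate sits from the null value.
  z <- (estimate - null_value) / se

  # Tail area in the direction of interest.
  p_value <- pnorm(z, lower.tail = (tail == "left"))

  # Interval: estimate +/- z* x SE.
  z_star <- qnorm(1 - (1 - conf_level) / 2)
  conf_int <- estimate + c(-1, 1) * z_star * se

  list(z = z, p_value = p_value, conf_int = conf_int)
}

# Usage with the opportunity cost numbers from earlier in the chapter.
result <- normal_inference(estimate = 0.20, se = 0.078, tail = "right")
result
```

The function deliberately leaves out the first and last steps (framing the question and writing the conclusion), which require judgment rather than arithmetic.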
Table 13.2 is another look at the mathematical model approach to inference.
| Question | Answer |
|---|---|
| What does it do? | Uses theory (primarily the Central Limit Theorem) to describe the hypothetical variability resulting from either repeated randomized experiments or random samples |
| What is it best for? | Quick analyses through, for example, calculating a Z score |
13.7.2 Terms
The terms introduced in this chapter are presented in Table 13.3. If you’re not sure what some of
these terms mean, we recommend you go back in the text and review their definitions. You should be
able to easily spot them as bolded text.
13.8 Exercises
2. Area under the curve, Part II. What percent of a standard normal distribution 𝑁 (𝜇 =
0, 𝜎 = 1) is found in each region denoted by a 𝑍 inequality below? Be sure to draw a graph. In
the text above, we used R to calculate normal probabilities. You might choose to use a different
source, such as a Shiny App or a normal table.
a. 𝑍 > −1.13
b. 𝑍 < 0.18
c. 𝑍 > 8
d. |𝑍| < 0.5
3. GRE scores, Z scores. Sophia, who took the Graduate Record Examination (GRE), scored
160 on the Verbal Reasoning section and 157 on the Quantitative Reasoning section. The mean
score for the Verbal Reasoning section for all test takers was 151 with a standard deviation of 7,
and the mean score for the Quantitative Reasoning section was 153 with a standard deviation of 7.67.
Suppose that both distributions are nearly normal. Use the information to compute each of the
following. In the text above, we used R to calculate normal probabilities. You might choose to
use a different source, such as a Shiny App or a normal table.
a. Write down the short-hand for each of the two normal distributions.
b. What is Sophia’s Z score on the Verbal Reasoning section? On the Quantitative Reasoning
section? Draw a standard normal distribution curve and mark the two Z scores.
c. What do the Z scores tell you?
d. Relative to others, which section did Sophia do better on?
e. Find her percentile scores for each of the two exams.
f. What percent of the test takers did better than her on the Verbal Reasoning section? On
the Quantitative Reasoning section?
g. Explain why simply comparing raw scores from the two sections could lead to an incorrect
conclusion as to which section a student did better on.
h. If the distributions of the scores on these exams are not nearly normal, would your answers
to parts (b) - (f) change? Explain your reasoning.
4. Triathlon times, Z scores. In triathlons, it is common for racers to be placed into age and
gender groups. Two friends, Leo and Mary, both completed the Hermosa Beach Triathlon, where
Leo competed in the “Men, Ages 30 - 34” group and Mary competed in the “Women, Ages 25
- 29” group. Leo completed the race in 1:22:28 (4948 seconds), while Mary completed the race
in 1:31:53 (5513 seconds). We can see that Leo finished faster, but they are curious about how
they did within their respective groups. Can you help them? Below is some information on the
performance of their groups. Use the information to compute each of the following. In the text
above, we used R to calculate normal probabilities. You might choose to use a different source,
such as a Shiny App or a normal table.
• The finishing times of the “Men, Ages 30 - 34” group have a mean of 4313 seconds with a
standard deviation of 583 seconds.
• The finishing times of the “Women, Ages 25 - 29” group have a mean of 5261 seconds with
a standard deviation of 807 seconds.
• The distributions of finishing times for both groups are approximately Normal.
Remember: a better performance corresponds to a faster finish.
a. Write down the short-hand for the two normal distributions.
b. What are the Z scores for each of Leo’s and Mary’s finishing times? What do the Z scores
tell you?
c. Did Leo or Mary rank better in their respective group? Explain your reasoning.
d. What percent of the triathletes did Leo finish faster than in his group?
e. What percent of the triathletes did Mary finish faster than in her group?
f. If the distributions of finishing times are not nearly normal, would your answers to parts
(b) – (e) change? Explain your reasoning.
5. GRE scores, cutoffs. Consider the previous two distributions for GRE scores: 𝑁 (𝜇 = 151, 𝜎 =
7) for the Verbal Reasoning part of the exam and 𝑁 (𝜇 = 153, 𝜎 = 7.67) for the Quantitative
Reasoning part. Use the information to compute each of the following. In the text above, we
used R to calculate normal probabilities. You might choose to use a different source, such as a
Shiny App or a normal table.
a. The score of a student who scored in the 80th percentile on the Quantitative Reasoning
section.
b. The score of a student who scored worse than 70% of test takers in the Verbal Reasoning
section.
6. Triathlon times, cutoffs. Recall the two different distributions for triathlon times: 𝑁 (𝜇 =
4313, 𝜎 = 583) for “Men, Ages 30 - 34” and 𝑁 (𝜇 = 5261, 𝜎 = 807) for the “Women, Ages 25 - 29”
group. Times are listed in seconds. Use this information to compute each of the following. In
the text above, we used R to calculate normal probabilities. You might choose to use a different
source, such as a Shiny App or a normal table.
a. The cutoff time for the fastest 5% of athletes in the men’s group, i.e., those who took the
shortest 5% of time to finish.
b. The cutoff time for the slowest 10% of athletes in the women’s group.
7. LA weather, Fahrenheit. The average daily high temperature in June in LA is 77°F with
a standard deviation of 5°F. Suppose that the temperatures in June closely follow a normal
distribution. Use the information to compute each of the following. In the text above, we used
R to calculate normal probabilities. You might choose to use a different source, such as a Shiny
App or a normal table.
a. What is the probability of observing an 83°F temperature or higher in LA during a
randomly chosen day in June?
b. How cool are the coldest 10% of the days (days with lowest high temperature) during June
in LA?
8. CAPM. The Capital Asset Pricing Model (CAPM) is a financial model that assumes returns
on a portfolio are normally distributed. Suppose a portfolio has an average annual return of
14.7% (i.e., an average gain of 14.7%) with a standard deviation of 33%. A return of 0% means
the value of the portfolio doesn’t change, a negative return means that the portfolio loses money,
and a positive return means that the portfolio gains money.
a. What percent of years does this portfolio lose money, i.e., have a return less than 0%?
b. What is the cutoff for the highest 15% of annual returns with this portfolio?
9. LA weather, Celsius. Recall the set-up that the average daily high temperature in June in LA
is 77°F with a standard deviation of 5°F, and it can be assumed that the high temperatures
follow a normal distribution. We use the following equation to convert °F (Fahrenheit) to °C
(Celsius): 𝐶 = (𝐹 − 32) × 5/9.
a. Write the probability model for the distribution of temperature in °C in June in LA.
b. What is the probability of observing a 28°C (which roughly corresponds to 83°F)
temperature or higher in June in LA? Calculate using the °C model from part (a).
c. Did you get the same answer or different answers in part (b) of this question and part (a)
of the previous question on the LA weather? Are you surprised? Explain.
d. Estimate the IQR of the temperatures (in °C) in June in LA.
10. Find the SD. Find the standard deviation of the distribution in the following situations.
a. MENSA is an organization whose members have IQs in the top 2% of the population. IQs
are normally distributed with mean 100, and the minimum IQ score required for admission
to MENSA is 132.
b. Cholesterol levels for women aged 20 to 34 follow an approximately normal distribution
with mean 185 milligrams per deciliter (mg/dl). Women with cholesterol levels above 220
mg/dl are considered to have high cholesterol and about 18.5% of women fall into this
category.
11. Chronic illness. In 2013, the Pew Research Center reported that “45% of U.S. adults
report that they live with one or more chronic conditions”. However, this value was based on
a sample, so it may not be a perfect estimate for the population parameter of interest on its
own. The study reported a standard error of about 1.2%, and a normal model may reasonably
be used in this setting.
a. Create a 95% confidence interval for the proportion of U.S. adults who live with one or
more chronic conditions. Also interpret the confidence interval in the context of the study.
(Pew Research Center 2013)
b. Identify each of the following statements as true or false. Provide an explanation to justify
each of your answers.
i. We can say with certainty that the confidence interval from part (a) contains the true
percentage of U.S. adults who suffer from a chronic illness.
ii. If we repeated this study 1,000 times and constructed a 95% confidence interval for
each study, then approximately 950 of those confidence intervals would contain the
true fraction of U.S. adults who suffer from chronic illnesses.
iii. The poll provides statistically discernible evidence (at the 𝛼 = 0.05 level) that the
percentage of U.S. adults who suffer from chronic illnesses is below 50%.
iv. Since the standard error is 1.2%, only 1.2% of people in the study communicated
uncertainty about their answer.
12. Social media users and news, mathematical model. A poll conducted in 2022 found that
50% of U.S. adults (i.e., a proportion of 0.5) get news from social media sometimes or often.
The standard error for this estimate was 0.5% (i.e., 0.005), and a normal distribution may be
used to model the sample proportion. (Pew Research Center 2022)
a. Construct a 99% confidence interval for the fraction of U.S. adults who get news on social
media sometimes or often, and interpret the confidence interval in context.
b. Identify each of the following statements as true or false. Provide an explanation to justify
each of your answers.
i. The data provide statistically discernible evidence that more than half of U.S. adults
get news through social media sometimes or often. Use a discernibility level of
𝛼 = 0.01.
ii. Since the standard error is 0.5%, we can conclude that 99.5% of all U.S. adults
were included in the study.
iii. If we want to reduce the standard error of the estimate, we should collect less data.
iv. If we construct a 90% confidence interval for the percentage of U.S. adults who get
news through social media sometimes or often, the resulting confidence interval will be
wider than a corresponding 99% confidence interval.
13. Interpreting a Z score from a sample proportion. Suppose that you conduct a hypothesis
test about a population proportion and calculate the Z score to be 0.47. Which of the following
is the best interpretation of this value? For each option that is not a good interpretation,
indicate the statistical idea being described.¹⁹
a. The probability is 0.47 that the null hypothesis is true.
b. If the null hypothesis were true, the probability would be 0.47 of obtaining a sample
proportion as far as observed from the hypothesized value of the population proportion.
c. The sample proportion is 0.47 standard errors greater than the hypothesized value of the
population proportion.
d. The sample proportion is equal to 0.47 times the standard error.
e. The sample proportion is 0.47 away from the hypothesized value of the population.
f. The sample proportion is 0.47.
14. Mental health. The General Social Survey asked the question: “For how many days during
the past 30 days was your mental health, which includes stress, depression, and problems with
emotions, not good?” Based on responses from 1,151 US residents, the survey reported a 95%
confidence interval of 3.40 to 4.24 days in 2010.
a. Interpret this interval in context of the data.
b. What does “95% confident” mean? Explain in the context of the application.
c. Suppose the researchers think a 99% confidence level would be more appropriate for this
interval. Will this new interval be narrower or wider than the 95% confidence interval?
d. If a new survey were to be done with 500 Americans, do you think the standard error of
the estimate would be larger, smaller, or about the same?
19 This exercise was inspired by discussion on Dr. Allan Rossman’s blog Ask Good Questions.
15. Repeated water samples. A nonprofit wants to understand the fraction of households that
have elevated levels of lead in their drinking water. They expect at least 5% of homes will
have elevated levels of lead, but not more than about 30%. They randomly sample 800 homes
and work with the owners to retrieve water samples, and they compute the fraction of these
homes with elevated lead levels. They repeat this 1,000 times and build a distribution of sample
proportions.
a. What is this distribution called?
b. Would you expect the shape of this distribution to be symmetric, right skewed, or left
skewed? Explain your reasoning.
c. What is the name of the variability of this distribution?
d. Suppose the researchers’ budget is reduced, and they are only able to collect 250
observations per sample, but they can still collect 1,000 samples. They build a new distribution
of sample proportions. How will the variability of this new distribution compare to the
variability of the distribution when each sample contained 800 observations?
16. Repeated student samples. Of all freshmen at a large college, 16% made the dean’s list in
the current year. As part of a class project, students randomly sample 40 students and check if
those students made the list. They repeat this 1,000 times and build a distribution of sample
proportions.
a. What is this distribution called?
b. Would you expect the shape of this distribution to be symmetric, right skewed, or left
skewed? Explain your reasoning.
c. What is the name of the variability of this distribution?
d. Suppose the students decide to sample again, this time collecting 90 students per sample,
and they again collect 1,000 samples. They build a new distribution of sample proportions.
How will the variability of this new distribution compare to the variability of the distribution
when each sample contained 40 observations?
Chapter 14
Decision Errors
Hypothesis tests are not flawless. Just think of the court system: innocent people are sometimes
wrongly convicted and the guilty sometimes walk free. Similarly, data can point to the wrong
conclusion. However, what distinguishes statistical hypothesis tests from a court system is that our
framework allows us to quantify and control how often the data lead us to the incorrect conclusion.
In a hypothesis test, there are two competing hypotheses: the null and the alternative. We make a
statement about which one might be true, but we might choose incorrectly. There are four possible
scenarios in a hypothesis test, which are summarized in Table 14.1.
| Truth | Test conclusion: Reject null hypothesis | Test conclusion: Fail to reject null hypothesis |
|---|---|---|
| Null hypothesis is true | Type I error | Good decision |
| Alternative hypothesis is true | Good decision | Type II error |
A Type I error is rejecting the null hypothesis when 𝐻0 is actually true. Since we rejected the null
hypothesis in the sex discrimination and opportunity cost studies, it is possible that we made a Type
I error in one or both of those studies. A Type II error is failing to reject the null hypothesis when
the alternative is actually true.
EXAMPLE
In a US court, the defendant is either innocent (𝐻0 ) or guilty (𝐻𝐴 ). What does a Type I error
represent in this context? What does a Type II error represent? Table 14.1 may be useful.
If the court makes a Type I error, this means the defendant is innocent (𝐻0 true) but wrongly
convicted. A Type II error means the court failed to reject 𝐻0 (i.e., failed to convict the person)
when they were in fact guilty (𝐻𝐴 true).
GUIDED PRACTICE
Consider the opportunity cost study where we concluded students were less likely to
make a DVD purchase if they were reminded that money not spent now could be spent
later. What would a Type I error represent in this context?¹
EXAMPLE
How could we reduce the Type I error rate in US courts? What influence would this have on
the Type II error rate?
To lower the Type I error rate, we might raise our standard for conviction from “beyond a
reasonable doubt” to “beyond a conceivable doubt” so fewer people would be wrongly convicted.
However, this would also make it more difficult to convict the people who are actually guilty,
so we would make more Type II errors.
GUIDED PRACTICE
How could we reduce the Type II error rate in US courts? What influence would this
have on the Type I error rate?²
The example and guided practice above provide an important lesson: if we reduce how often we make
one type of error, we generally make more of the other type.
The discernibility level provides the cutoff for the p-value which will lead to a decision of “reject
the null hypothesis.” Choosing a discernibility level for a test is important in many contexts, and the
traditional level is 0.05. However, it is sometimes helpful to adjust the discernibility level based on the
application. We may select a level that is smaller or larger than 0.05 depending on the consequences
of any conclusions reached from the test.
If making a Type I error is dangerous or especially costly, we should choose a small discernibility level
(e.g., 0.01 or 0.001). If we want to be very cautious about rejecting the null hypothesis, we demand
very strong evidence favoring the alternative 𝐻𝐴 before we would reject 𝐻0 .
If a Type II error is relatively more dangerous or much more costly than a Type I error, then we
should choose a higher discernibility level (e.g., 0.10). Here we want to be cautious about failing to
reject 𝐻0 when the null is actually false.
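The trade-off between the two error types can be made concrete under the normal model. The sketch below assumes a one-sided test whose Z statistic is centered at 0 when the null hypothesis is true and at a hypothetical true effect of 2.5 standard errors when the alternative is true. The value 2.5 and the variable names are illustrative choices, not taken from the text.

```r
# Discernibility levels to compare (each one is the Type I error rate).
alpha <- c(0.10, 0.05, 0.01, 0.001)

# Assumed true effect under the alternative, in standard error units.
effect_in_se <- 2.5

# Reject the null hypothesis when Z exceeds this cutoff.
cutoff <- qnorm(1 - alpha)

# Type II error rate: chance that Z falls below the cutoff when the
# alternative is true, i.e., when Z is centered at effect_in_se.
type2_rate <- pnorm(cutoff - effect_in_se)

data.frame(alpha = alpha,
           cutoff = round(cutoff, 2),
           type2_rate = round(type2_rate, 3))
```

Reading down the table, each smaller discernibility level pushes the cutoff higher and raises the Type II error rate, which is the pattern described in the court examples.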
1 Making a Type I error in this context would mean that reminding students that money not spent now can be
spent later does not affect their buying habits, despite the strong evidence (the data suggesting otherwise) found in
the experiment. Notice that this does not necessarily mean something was wrong with the data or that we made a
computational mistake. Sometimes data simply point us to the wrong conclusion, which is why scientific studies are
often repeated to check initial findings.
2 To lower the Type II error rate, we want to convict more guilty people. We could lower the standards for conviction
from “beyond a reasonable doubt” to “beyond a little doubt”. Lowering the bar for guilt will also result in more wrongful
convictions, raising the Type I error rate.
The discernibility level selected for a test should reflect the real-world consequences
associated with making a Type I or Type II error.
In Chapter 11 we explored whether women were discriminated against and whether a simple trick could
make students a little thriftier. In these two case studies, we have actually ignored some possibilities:
• What if men are actually discriminated against?
• What if the money trick actually makes students spend more?
These possibilities weren’t considered in our original hypotheses or analyses. The disregard of the
extra alternatives may have seemed natural since the data pointed in the directions in which we
framed the problems. However, there are two dangers if we ignore possibilities that disagree with our
data or that conflict with our world view:
1. Framing an alternative hypothesis simply to match the direction that the data point will
generally inflate the Type I error rate. After all the work we have done (and will continue to do)
to rigorously control the error rates in hypothesis tests, careless construction of the alternative
hypotheses can disrupt that hard work.
2. If we only use alternative hypotheses that agree with our worldview, then we are going to be
subjecting ourselves to confirmation bias, which means we are looking for data that support
our ideas. That’s not very scientific, and we can do better!
The original hypotheses we have seen are called one-sided hypothesis tests because they only
explored one direction of possibilities. Such hypotheses are appropriate when we are exclusively
interested in the single direction, but usually we want to consider all possibilities. To do so, let’s learn
about two-sided hypothesis tests in the context of a new study that examines the impact of using
blood thinners on patients who have undergone CPR.
Cardiopulmonary resuscitation (CPR) is a procedure used on individuals suffering a heart attack
when other emergency resources are unavailable. This procedure is helpful in providing some blood
circulation to keep a person alive, but CPR chest compression can also cause internal injuries. Internal
bleeding and other injuries that can result from CPR complicate additional treatment efforts. For
instance, blood thinners may be used to help release a clot that is causing the heart attack once a
patient arrives in the hospital. However, blood thinners negatively affect internal injuries.
Here we consider an experiment with patients who underwent CPR for a heart attack and were
subsequently admitted to a hospital. Each patient was randomly assigned to either receive a blood
thinner (treatment group) or not receive a blood thinner (control group). The outcome variable of
interest was whether the patient survived for at least 24 hours. (Böttiger et al. 2001)
EXAMPLE
Form hypotheses for this study in plain and statistical language. Let 𝑝𝐶 represent the true
survival rate of people who do not receive a blood thinner (corresponding to the control group)
and 𝑝𝑇 represent the survival rate for people receiving a blood thinner (corresponding to the
treatment group).
We want to understand whether blood thinners are helpful or harmful. We’ll consider both of
these possibilities using a two-sided hypothesis test.
• 𝐻0 ∶ Blood thinners do not have an overall survival effect, i.e., the survival proportions
are the same in each group. 𝑝𝑇 − 𝑝𝐶 = 0.
• 𝐻𝐴 ∶ Blood thinners have an impact on survival, either positive or negative, but not zero.
𝑝𝑇 − 𝑝𝐶 ≠ 0.
Note that if we had done a one-sided hypothesis test, the resulting hypotheses would have
been:
• 𝐻0 ∶ Blood thinners do not have a positive overall survival effect, i.e., the survival
proportion for the blood thinner group is the same as or lower than that of the control group.
𝑝𝑇 − 𝑝𝐶 ≤ 0.
• 𝐻𝐴 ∶ Blood thinners have a positive impact on survival. 𝑝𝑇 − 𝑝𝐶 > 0.
There were 50 patients in the experiment who did not receive a blood thinner and 40 patients who
did. The study results are shown in Table 14.2.
Table 14.2: Results for the CPR study. Patients in the treatment group were given a blood thinner, and
patients in the control group were not.
GUIDED PRACTICE
What is the observed survival rate in the control group? And in the treatment group?
Also, provide a point estimate (𝑝̂𝑇 − 𝑝̂𝐶) for the true difference in population survival
proportions across the two groups: 𝑝𝑇 − 𝑝𝐶.
According to the point estimate, for patients who have undergone CPR outside of the hospital, an
additional 13% of these patients survive when they are treated with blood thinners. However, we
wonder if this difference could be easily explainable by chance, if the treatment has no effect on
survival.
As we did in past studies, we will simulate what type of differences we might see from chance alone
under the null hypothesis. By randomly assigning each of the patient’s files to a “simulated treatment”
or “simulated control” allocation, we get a new grouping. If we repeat this simulation 1,000 times, we
can build a null distribution of the differences shown in Figure 14.1.
Figure 14.1: Null distribution of the point estimate for the difference in proportions, 𝑝̂𝑇 − 𝑝̂𝐶. The shaded
right tail shows observations that are at least as large as the observed difference, 0.13.
The right tail area is 0.135. (Note: it is only a coincidence that we also have 𝑝̂𝑇 − 𝑝̂𝐶 = 0.13.) However,
contrary to how we calculated the p-value in previous studies, the p-value of this test is not actually
the tail area we calculated, i.e., it’s not 0.135!
The p-value is defined as the probability we observe a result at least as favorable to the alternative
hypothesis as the observed difference. In this case, any differences less than or equal to -0.13 would
also provide equally strong evidence favoring the alternative hypothesis as a difference of +0.13 did. A
difference of -0.13 would correspond to a 13% higher survival rate in the control group than the treatment
group. In Figure 14.2 we have also shaded these differences in the left tail of the distribution. These
two shaded tails provide a visual representation of the p-value for a two-sided test.
Figure 14.2: Null distribution of the point estimate for the difference in proportions, 𝑝̂𝑇 − 𝑝̂𝐶. All values
that are at least as extreme as +0.13 but in either direction away from 0 are shaded.
For a two-sided test, take the single tail (in this case, 0.135) and double it to get the p-value: 0.27.
Since this p-value is larger than 0.05, we do not reject the null hypothesis. That is, we do not find
convincing evidence that the blood thinner has any influence on survival of patients who undergo
CPR prior to arriving at the hospital.
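The shuffling procedure and the doubling step described above can be sketched in Python. This is only an illustration under stated assumptions: the survival counts used below (11 of 50 control patients and 14 of 40 treatment patients) are assumed because they are consistent with the group sizes and the 13% difference reported in this section, and Table 14.2 remains the authoritative source. The function name and the seed are hypothetical, and a different seed will give a slightly different tail area.

```python
import random

def simulate_null_differences(n_control, n_treatment, n_survived, reps, seed):
    """Shuffle survival outcomes across the two groups and return the
    simulated differences in survival proportions (treatment minus control)."""
    rng = random.Random(seed)
    n_total = n_control + n_treatment
    outcomes = [1] * n_survived + [0] * (n_total - n_survived)
    diffs = []
    for _ in range(reps):
        rng.shuffle(outcomes)                 # random re-allocation of patient files
        treatment = outcomes[:n_treatment]    # "simulated treatment" group
        control = outcomes[n_treatment:]      # "simulated control" group
        diffs.append(sum(treatment) / n_treatment - sum(control) / n_control)
    return diffs

# Assumed counts (not taken from the table itself): 14/40 treatment and
# 11/50 control survivors, which give the 0.13 difference discussed in the text.
observed = 14 / 40 - 11 / 50
diffs = simulate_null_differences(n_control=50, n_treatment=40,
                                  n_survived=25, reps=1000, seed=1)

tolerance = 1e-9  # guards against floating-point ties at the observed value
right_tail = sum(d >= observed - tolerance for d in diffs) / len(diffs)
two_sided = min(1.0, 2 * right_tail)

print(f"observed difference: {observed:.2f}")
print(f"right tail area:     {right_tail:.3f}")
print(f"two-sided p-value:   {two_sided:.3f}")
```

With 1,000 shuffles the right tail area should land in the neighborhood of the value reported above, and the doubled value stays well above 0.05 regardless of the seed.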
We want to be rigorous and keep an open mind when we analyze data and evidence.
Use a one-sided hypothesis test only if you truly have interest in only one direction.
First compute the p-value for one tail of the distribution, then double that value to get
the two-sided p-value. That’s it!
EXAMPLE
Consider the situation of the medical consultant. Now that you know about one-sided and
two-sided tests, which type of test do you think is more appropriate?
The setting has been framed in the context of the consultant being helpful (which is what led
us to a one-sided test originally), but what if the consultant actually performed worse than the
average? Would we care? More than ever! Since it turns out that we care about a finding in
either direction, we should run a two-sided test. The p-value for the two-sided test is double
that of the one-sided test; here, the simulated p-value would be 0.2444.
Generally, to find a two-sided p-value we double the single tail area, which remains a reasonable
approach even when the distribution is asymmetric. However, the approach can result in p-values
larger than 1 when the point estimate is very near the mean in the null distribution; in such cases,
we write that the p-value is 1. Also, very large p-values computed in this way (e.g., 0.85) may
be slightly inflated. Typically, we do not worry too much about the precision of very large p-values
because they lead to the same analysis conclusion, even if the value is slightly off.
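The doubling rule, including the cap at 1, can be written as a small helper. This is a minimal sketch: the function name, the `null_center` argument, and the toy list of simulated statistics are all hypothetical and chosen only to show the three cases (upper tail, lower tail, and a point estimate near the center).

```python
def two_sided_p_value(null_stats, observed, null_center):
    """Double the single tail on the side of the null center where the
    observed statistic falls, and report 1 if the doubled value exceeds 1."""
    n = len(null_stats)
    if observed >= null_center:
        tail = sum(s >= observed for s in null_stats) / n
    else:
        tail = sum(s <= observed for s in null_stats) / n
    return min(1.0, 2 * tail)

# Toy, slightly asymmetric null distribution (illustrative values only).
toy_null = [-0.3, -0.2, -0.1, 0.0, 0.0, 0.1, 0.1, 0.2, 0.3, 0.4]

p_upper = two_sided_p_value(toy_null, observed=0.3, null_center=0.0)
p_lower = two_sided_p_value(toy_null, observed=-0.3, null_center=0.0)
p_center = two_sided_p_value(toy_null, observed=0.0, null_center=0.0)

print(p_upper, p_lower, p_center)
```

In the last call the single tail is 0.7, so doubling would give 1.4; the helper reports 1 instead, matching the convention described above.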
Now that we understand the difference between one-sided and two-sided tests, we must recognize
when to use each type of test. Because doing so inflates the Type I error rate, it is never okay to change
two-sided tests to one-sided tests after observing the data. We explore the consequences of ignoring
this advice in the next example.
EXAMPLE
Using 𝛼 = 0.05, we show that freely switching from two-sided tests to one-sided tests will lead
us to make twice as many Type I errors as intended.
Suppose we are interested in finding any difference from 0. We’ve created a smooth-looking
null distribution representing differences due to chance below.
First, suppose the sample difference was larger than 0. In a one-sided test, we would set 𝐻𝐴 ∶
difference > 0. If the observed difference falls in the upper 5% of the distribution, we would
reject 𝐻0 since the p-value would just be the single tail. Thus, if 𝐻0 is true, we incorrectly
reject 𝐻0 about 5% of the time when the sample mean is above the null value, as shown above.
Then, suppose the sample difference was smaller than 0. In a one-sided test, we would set 𝐻𝐴 ∶
difference < 0. If the observed difference falls in the lower 5% of the figure, we would reject
𝐻0 . That is, if 𝐻0 is true, then we would observe this situation about 5% of the time.
By examining these two scenarios, we can determine that we will make a Type I error 5%+5% =
10% of the time if we are allowed to swap to the “best” one-sided test for the data. This is
twice the error rate we prescribed with our discernibility level: 𝛼 = 0.05!
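The doubling of the Type I error rate can also be checked by simulation. The sketch below assumes, purely for illustration, that the test statistic follows a standard normal distribution when the null hypothesis is true. Choosing the "best" one-sided test after seeing the sign of the statistic amounts to rejecting whenever the statistic is beyond the one-sided cutoff in either direction. The function name and seed are hypothetical.

```python
import random
from statistics import NormalDist

alpha = 0.05
std_normal = NormalDist()
one_sided_cutoff = std_normal.inv_cdf(1 - alpha)       # about 1.645
two_sided_cutoff = std_normal.inv_cdf(1 - alpha / 2)   # about 1.960

def type1_rate(cutoff, reps=100_000, seed=2024):
    """Simulate statistics under a true null hypothesis and return the
    fraction that fall beyond `cutoff` in either direction."""
    rng = random.Random(seed)
    rejections = sum(abs(rng.gauss(0, 1)) > cutoff for _ in range(reps))
    return rejections / reps

# Switching to the one-sided test that matches the data's direction:
switching_rate = type1_rate(one_sided_cutoff)
# An honest two-sided test fixed before seeing the data:
two_sided_rate = type1_rate(two_sided_cutoff)

print(f"switching after seeing the data: {switching_rate:.3f}")
print(f"two-sided test set in advance:   {two_sided_rate:.3f}")
```

The first rate should come out near 0.10 and the second near 0.05, in line with the 5% + 5% argument in the example.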
After observing data, it is tempting to turn a two-sided test into a one-sided test. Avoid
this temptation. Hypotheses should be set up before observing the data.
14.4 Power
Although we won’t go into extensive detail here, power is an important topic for follow-up consideration
after understanding the basics of hypothesis testing. A good power analysis is a vital preliminary step
to any study, as it indicates whether the data you plan to collect will be sufficient to support the
conclusions your research aims to draw.
Often in experiment planning, there are two competing considerations:
• We want to collect enough data that we can detect important effects.
• Collecting data can be expensive, and, in experiments involving people, there may be some risk
to patients.
When planning a study, we want to know how likely we are to detect an effect we care about. In other
words, if there is a real effect, and that effect is large enough that it has practical value, then what is
the probability that we detect that effect? This probability is called the power, and we can compute
it for different sample sizes or different effect sizes.
Power.
The power of the test is the probability of rejecting the null claim when the alternative
claim is true.
How easy it is to detect the effect depends on both how big the effect is (e.g., how good
the medical treatment is) as well as the sample size.
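To see how power depends on effect size and sample size, consider a simplified setting that is an assumption of this sketch rather than a method developed in this chapter: a two-sided test of a single mean where the standard deviation is treated as known, so the test statistic is normal. The function name and the particular effect sizes and sample sizes are hypothetical.

```python
from statistics import NormalDist

def power_two_sided_z(effect, sd, n, alpha=0.05):
    """Probability of rejecting the null claim in a two-sided z-test when
    the true mean differs from the null value by `effect`."""
    z = NormalDist()
    critical = z.inv_cdf(1 - alpha / 2)
    shift = effect * n ** 0.5 / sd      # true effect measured in standard errors
    upper = 1 - z.cdf(critical - shift)  # chance the statistic lands in the upper rejection region
    lower = z.cdf(-critical - shift)     # chance it lands in the lower rejection region
    return upper + lower

# Power grows with the sample size for a fixed effect ...
for n in (10, 40, 160):
    print(f"effect = 0.5, n = {n:3d}: power = {power_two_sided_z(0.5, 1.0, n):.3f}")

# ... and with the effect size for a fixed sample size.
for effect in (0.1, 0.3, 0.5):
    print(f"effect = {effect}, n = 40: power = {power_two_sided_z(effect, 1.0, 40):.3f}")
```

When the effect is exactly zero, the function returns the discernibility level itself, since every rejection is then a Type I error.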
We think of power as the probability that you will become rich and famous from your science. In order
for your science to make a splash, you need to have good ideas! That is, you won’t become famous
if you happen to find a single Type I error which rejects the null hypothesis. Instead, you’ll become
famous if your science is very good and important (that is, if the alternative hypothesis is true). The
better your science is (i.e., the better the medical treatment), the larger the effect size and the easier
it will be for you to convince people of your work.
Not only does your science need to be solid, but you also need to have evidence (i.e., data) that shows
the effect. A few observations (e.g., 𝑛 = 2) are unlikely to be convincing because of well-known ideas
of natural variability. Indeed, the larger the dataset which provides evidence for your scientific claim,
the more likely you are to convince the community that your idea is correct.
Although a full discussion of relative power is beyond the scope of this text, you might be interested
to know that, often, paired t-tests (discussed in Section 21.3) are more powerful than independent t-
tests (discussed in Section 20.3) because the pairing reduces the inherent variability across observations.
Additionally, because the median is almost always more variable than the mean, tests based on the
mean are more powerful than tests based on the median. That is to say, reducing variability (done
in different ways depending on the experimental design and set-up of the analysis) makes a test more
powerful, in that the data are more likely to lead to rejecting the null hypothesis.
14.5.1 Summary
Although hypothesis testing provides a strong framework for making decisions based on data, as the
analyst, you need to understand how and when the process can go wrong. That is, always keep in
mind that the conclusion to a hypothesis test may not be right! Sometimes when the null hypothesis
is true, we will accidentally reject it and commit a Type I error; sometimes when the alternative
hypothesis is true, we will fail to reject the null hypothesis and commit a Type II error. The power
of the test quantifies how likely it is to obtain data which will reject the null hypothesis when indeed
the alternative is true; the power of the test is increased when larger sample sizes are taken.
14.5.2 Terms
The terms introduced in this chapter are presented in Table 14.3. If you’re not sure what some of
these terms mean, we recommend you go back in the text and review their definitions. You should be
able to easily spot them as bolded text.
14.6 Exercises
2. Testing for food safety. A food safety inspector is called upon to investigate a restaurant
with a few customer reports of poor sanitation practices. The food safety inspector uses a
hypothesis testing framework to evaluate whether regulations are not being met. If he decides
the restaurant is in gross violation, its license to serve food will be revoked.
a. Write the hypotheses in words.
b. What is a Type I error in this context?
c. What is a Type II error in this context?
d. Which error is more problematic for the restaurant owner? Why?
e. Which error is more problematic for the diners? Why?
f. As a diner, would you prefer that the food safety inspector requires strong evidence or very
strong evidence of health concerns before revoking a restaurant’s license? Explain your
reasoning.
3. Which is higher? In each part below, there is a value of interest and two scenarios: (i) and
(ii). For each part, report if the value of interest is larger under scenario (i), scenario (ii), or
whether the value is equal under the scenarios.
a. The standard error of 𝑝̂ when (i) 𝑛 = 125 or (ii) 𝑛 = 500.
b. The margin of error of a confidence interval when the confidence level is (i) 90% or (ii) 80%.
c. The p-value for a Z-statistic of 2.5 calculated based on a (i) sample with 𝑛 = 500 or based
on a (ii) sample with 𝑛 = 1000.
d. The probability of making a Type II error when the alternative hypothesis is true and the
discernibility level is (i) 0.05 or (ii) 0.10.
4. True / False. Determine if the following statements are true or false, and explain your reasoning.
If false, state how it could be corrected.
a. If a given value (for example, the null hypothesized value of a parameter) is within a 95%
confidence interval, it will also be within a 99% confidence interval.
b. Decreasing the discernibility level (𝛼) will increase the probability of making a Type I error.
c. Suppose the null hypothesis is 𝑝 = 0.5 and we fail to reject 𝐻0 . Under this scenario, the
true population proportion is 0.5.
d. With large sample sizes, even small differences between the null value and the observed
point estimate, a difference often called the effect size, will be identified as statistically
discernible.
5. Online communication. A study suggests that 60% of college students spend 10 or more hours
per week communicating with others online. You believe that this is incorrect and decide to
collect your own sample for a hypothesis test. You randomly sample 160 students from your
dorm and find that 70% spend 10 or more hours a week communicating with others online.
A friend of yours, who offers to help you with the hypothesis test, comes up with the following
set of hypotheses. Indicate any errors you see.
6. Same observation, different sample size. Suppose you conduct a hypothesis test based on
a sample where the sample size is 𝑛 = 50, and arrive at a p-value of 0.08. You then refer back
to your notes and discover that you made a careless mistake: the sample size should have been
𝑛 = 500. Will your p-value increase, decrease, or stay the same? Explain.
7. Estimating 𝜋. In a class activity, each of 100 students experimentally estimates the value of 𝜋,
10 separate times. Using the 10 measurements for 𝜋 (10 values of 𝜋̂), each student calculates a
confidence interval for 𝜋. In grading the 100 student assignments, the professor marks 7 of the
assignments wrong, indicating that the 7 students must have done their experiments or analysis
incorrectly because each of the 7 students reported confidence intervals that did not capture
the known true value of 𝜋, roughly 3.14159. Was the professor correct to mark the assignments
wrong for having CIs that did not capture the value of 3.14159? Explain.4
8. Fermenting yeast. Twenty students work individually in a biology lab to test whether using
raw sucrose versus refined sugar will lead to the same yeast fermentation rate. Each student runs
a full experiment independently of the other students in the lab. Of the twenty students, twelve
are able to reject the null hypothesis and to claim that the fermentation rates are different.5
a. Explain what type of error was likely to have occurred in this situation.
b. What change would you suggest that would lower the error rate?
9. Practical importance vs. statistical discernibility. Determine whether the following state-
ment is true or false, and explain your reasoning: “With large sample sizes, even small differences
between the null value and the observed point estimate can be statistically discernible.”
10. Hypothesis statements. For each of the research claims below, fill in the value and the
direction of the null and alternative hypotheses. That is, complete all aspects of the following
hypothesis statements. Additionally, for each item, describe 𝑝 in words.
𝐻0 ∶ 𝑝________ 𝐻𝐴 ∶ 𝑝________
a. On a pre-test to assess knowledge of the upcoming material, a professor wants to determine
if their students know, on average, more than if they were just randomly guessing. The
pre-test is 30 multiple choice questions, where each question has 5 possible responses.
b. A standard treatment is known to reduce blood pressure in 32% of patients. A clinical trial
is conducted to assess whether a new medical intervention will produce results which are
different than the standard treatment, in terms of the percent of patients who will have
reduced blood pressure.
c. In the last presidential election 67% of registered voters turned out to vote. Will the next
presidential election have a higher turn-out of voters?
Chapter 15
Applications: Foundations
In the foundations of inference chapters, we have provided three different methods for statistical
inference. We will continue to build on all three of the methods throughout the text, and by the
end, you should have an understanding of the similarities and differences between them. Meanwhile,
it is important to note that the methods are designed to mimic variability with data, and we know
that variability can come from different sources (e.g., random sampling vs. random allocation, see
Figure 2.8). In Table 15.1, we have summarized some of the ways the inferential procedures feature
specific sources of variability. We hope that you refer back to the table often as you dive more deeply
into inferential ideas in future chapters.
You might have noticed that the word distribution is used throughout this part (and will continue
to be used in future chapters). A distribution always describes variability, but sometimes it is worth
reflecting on what is varying. Typically the distribution either describes how the observations vary or
how a statistic varies. But even when describing how a statistic varies, there is a further consideration
with respect to the study design, e.g., does the statistic vary from random sample to random sample
or does it vary from random allocation to random allocation? The methods presented in this text
(and used in science generally) are typically used interchangeably across ideas of random samples or
random allocations of the treatment. Often, the two different analysis methods will give equivalent
conclusions. The most important thing to consider is how to contextualize the conclusion in terms of
the problem. See Figure 2.8 to confirm that your conclusions are appropriate.
Below, we synthesize the different types of distributions discussed throughout the text. Reading
through the different definitions and solidifying your understanding will help as you come across these
distributions in future chapters, and you can always return here to refresh your understanding of
the differences between the various distributions.
250 CHAPTER 15. APPLICATIONS: FOUNDATIONS
Table 15.1: Summary and comparison of randomization, bootstrapping, and mathematical models as inferential statistical methods.

| Question | Answer: Randomization | Answer: Bootstrapping | Answer: Mathematical models |
| --- | --- | --- | --- |
| What does it do? | Shuffles the explanatory variable to mimic the natural variability found in a randomized experiment | Resamples (with replacement) from the observed data to mimic the sampling variability found by collecting data from a population | Uses theory (primarily the Central Limit Theorem) to describe the hypothetical variability resulting from either repeated randomized experiments or random samples |
| What other random processes can be approximated? | Can also be used to describe random sampling in an observational model | Can also be used to describe random allocation in an experiment | Can also be used to describe random sampling in an observational model or random allocation in an experiment |
| What is it best for? | Hypothesis testing (can also be used for confidence intervals, but not covered in this text) | Confidence intervals (can also be used for bootstrap hypothesis testing for one proportion as well) | Quick analyses through, for example, calculating a Z score |
Distributions.
• A data distribution describes the shape, center, and variability of the observed
data.
This can also be referred to as the sample distribution but we’ll avoid that
phrase as it sounds too much like sampling distribution, which is different.
• A sampling distribution describes the shape, center, and variability of all pos-
sible values of a sample statistic from samples of a given sample size from a
given population.
Since the population is never observed, it’s never possible to observe the true
sampling distribution either. However, when certain conditions hold, the Central
Limit Theorem tells us what the sampling distribution is.
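To make the idea of a sampling distribution concrete, the sketch below repeatedly draws samples from a hypothetical population in which the true proportion is known. In practice the population is never observed, so this can only be done as a thought experiment; the value 𝑝 = 0.3, the sample size, the function name, and the seed are all illustrative assumptions. The standard error formula used for comparison, the square root of 𝑝(1 − 𝑝)/𝑛, is the standard mathematical-model result for a sample proportion.

```python
import random
from statistics import mean, stdev

def sampling_distribution(p, n, reps, seed):
    """Draw `reps` samples of size `n` from a population with success
    proportion `p` and return the sample proportion from each sample."""
    rng = random.Random(seed)
    return [sum(rng.random() < p for _ in range(n)) / n for _ in range(reps)]

p, n = 0.3, 100                       # hypothetical population and sample size
p_hats = sampling_distribution(p, n, reps=5000, seed=7)

clt_se = (p * (1 - p) / n) ** 0.5     # standard error from the mathematical model

print(f"center of simulated sampling distribution: {mean(p_hats):.3f}")
print(f"spread of simulated sampling distribution: {stdev(p_hats):.4f}")
print(f"standard error from the formula:           {clt_se:.4f}")
```

The simulated center should sit close to 0.3 and the simulated spread close to the formula's standard error, which is the sense in which the mathematical model describes the sampling distribution.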
In this case study, we consider a new malaria vaccine called PfSPZ. In the malaria study, volunteer
patients were randomized into one of two experiment groups: 14 patients received an experimental
vaccine and 6 patients received a placebo vaccine. Nineteen weeks later, all 20 patients were exposed
to a drug-sensitive strain of the malaria parasite; a drug-sensitive strain was used for ethical
reasons, allowing any infections to be treated effectively.
The results are summarized in Table 15.2, where 9 of the 14 treatment patients remained free of signs
of infection while all of the 6 patients in the control group showed some baseline signs of infection.
GUIDED PRACTICE
Is this an observational study or an experiment? What implications does the study
type have on what can be inferred from the results?1
EXAMPLE
Statisticians and data scientists are sometimes called upon to evaluate the strength of evidence.
When looking at the rates of infection for patients in the two groups in this study, what comes
to mind as we try to determine whether the data show convincing evidence of a real difference?
The observed infection rates (35.7% for the treatment group versus 100% for the control group)
suggest the vaccine may be effective. However, we cannot be sure if the observed difference
represents the vaccine’s efficacy or if there is no treatment effect and the observed difference
is just from random chance. Generally there is a little bit of fluctuation in sample data, and
we wouldn’t expect the sample proportions to be exactly equal, even if the truth was that the
infection rates were independent of getting the vaccine. Additionally, with such small samples,
perhaps it’s common to observe such large differences when we randomly split a group due to
chance alone!
This example is a reminder that the observed outcomes in the data sample may not perfectly reflect
the true relationships between variables since there is random noise. While the observed difference
in rates of infection is large, the sample size for the study is small, making it unclear if this observed
difference represents efficacy of the vaccine or whether it is simply due to chance. We label these two
competing claims, 𝐻0 and 𝐻𝐴 :
• 𝐻0 : Independence model. The variables are independent. They have no relationship, and
the observed difference between the proportion of patients who developed an infection in the
two groups, 64.3%, was due to chance.
• 𝐻𝐴 : Alternative model. The variables are not independent. The difference in infection rates
of 64.3% was not due to chance. Here (because an experiment was done), if the difference in
infection rate is not due to chance, it was the vaccine that affected the rate of infection.
What would it mean if the independence model, which says the vaccine had no influence on the rate
of infection, is true? It would mean 11 patients were going to develop an infection no matter which
group they were randomized into, and 9 patients would not develop an infection no matter which group
they were randomized into. That is, if the vaccine did not affect the rate of infection, the difference in
the infection rates was due to chance alone in how the patients were randomized.
Now consider the alternative model: infection rates were influenced by whether a patient received the
vaccine or not. If this was true, and especially if this influence was substantial, we would expect to
1 The study is an experiment, as patients were randomly assigned to an experiment group. Since this is an experiment,
the results can be used to evaluate a causal relationship between the malaria vaccine and whether patients showed signs
of an infection.
15.2. CASE STUDY: MALARIA VACCINE 253
Table 15.3: Simulation results, where any difference in infection rates is purely due to chance.
GUIDED PRACTICE
How does this compare to the observed 64.3% difference in the actual data?2
And another:
3/6 − 8/14 = −0.071
And so on until we repeat the simulation enough times to create a distribution of differences that could
have occurred if the null hypothesis was true.
2 4/6 − 7/14 = 0.167 or about 16.7% in favor of the vaccine. This difference due to chance is much smaller than the
Figure 15.1 shows a stacked dot plot of the differences found from 100 simulations, where each dot represents
a simulated difference between the infection rates (control rate minus treatment rate).
Figure 15.1: A stacked dot plot of differences from 100 simulations produced under the independence model,
𝐻0 , where in these simulations infections are unaffected by the vaccine. Two of the 100 simulations had a
difference of at least 64.3%, the difference observed in the study.
Note that the distribution of these simulated differences is centered around 0. We simulated these
differences assuming that the independence model was true, and under this condition, we expect the
difference to be near zero with some random fluctuation, where near is pretty generous in this case
since the sample sizes are so small in this study.
EXAMPLE
How often would you observe a difference of at least 64.3% (0.643) according to Figure 15.1?
Often, sometimes, rarely, or never?
It appears that a difference of at least 64.3% due to chance alone would only happen about
2% of the time according to Figure 15.1. Such a low probability indicates a rare event.
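The simulation behind Figure 15.1 can be sketched in Python using the counts given in this case study: 14 treatment and 6 control patients, of whom 11 developed an infection. The function name, the number of repetitions, and the seed are hypothetical choices. Because the groups are so small, the chance of a difference at least as large as 64.3% can also be counted exactly: it requires all 6 control patients to come from the 11 who develop an infection.

```python
import random
from math import comb

n_treatment, n_control = 14, 6
n_infected = 11                    # 5 in the treatment group + 6 in the control group
observed = 6 / 6 - 5 / 14          # control rate minus treatment rate, about 0.643

def simulate_differences(reps, seed):
    """Shuffle infection outcomes across the two groups under the
    independence model and return control-minus-treatment differences."""
    rng = random.Random(seed)
    outcomes = [1] * n_infected + [0] * (n_treatment + n_control - n_infected)
    diffs = []
    for _ in range(reps):
        rng.shuffle(outcomes)
        control = outcomes[:n_control]
        treatment = outcomes[n_control:]
        diffs.append(sum(control) / n_control - sum(treatment) / n_treatment)
    return diffs

diffs = simulate_differences(reps=10_000, seed=15)
simulated_tail = sum(d >= observed - 1e-9 for d in diffs) / len(diffs)

# Exact chance that all 6 control slots are filled by infected patients.
exact_tail = comb(11, 6) / comb(20, 6)

print(f"simulated proportion at least as large as observed: {simulated_tail:.4f}")
print(f"exact probability under the independence model:     {exact_tail:.4f}")
```

The exact probability is about 1.2%, which agrees with the rough 2% estimate read from the 100 simulations in Figure 15.1: either way, the observed difference would be a rare event under the independence model.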
The difference of 64.3% being a rare event suggests two possible interpretations of the results of the
study:
• 𝐻0 : Independence model. The vaccine has no effect on infection rate, and we just happened
to observe a difference that would only occur on a rare occasion.
• 𝐻𝐴 : Alternative model. The vaccine has an effect on infection rate, and the difference we
observed was actually due to the vaccine being effective at combating malaria, which explains
the large difference of 64.3%.
Based on the simulations, we have two options. (1) We conclude that the study results do not provide
strong evidence against the independence model. That is, we do not have sufficiently strong evidence
to conclude the vaccine had an effect in this clinical setting. (2) We conclude the evidence is sufficiently
strong to reject 𝐻0 and assert that the vaccine was useful. When we conduct formal studies, usually
we reject the notion that we just happened to observe a rare event. So in the vaccine case, we reject
the independence model in favor of the alternative. That is, we are concluding the data provide strong
evidence that the vaccine provides some protection against malaria in this clinical setting.
One field of statistics, statistical inference, is built on evaluating whether such differences are due
to chance. In statistical inference, data scientists evaluate which model is most reasonable given the
data. Errors do occur, just like rare events, and we might choose the wrong model. While we do not
always choose correctly, statistical inference gives us tools to control and evaluate how often decision
errors occur.
15.3 Interactive R tutorials
Navigate the concepts you’ve learned in this part in R using the following self-paced tutorials. All
you need is your browser to get started!
Tutorial 4: Foundations of inference
https://openintrostat.github.io/ims-tutorials/04-foundations
Tutorial 4 - Lesson 1: Sampling variability
https://openintro.shinyapps.io/ims-04-foundations-01
Tutorial 4 - Lesson 2: Randomization test
https://openintro.shinyapps.io/ims-04-foundations-02
Tutorial 4 - Lesson 3: Errors in hypothesis testing
https://openintro.shinyapps.io/ims-04-foundations-03
Tutorial 4 - Lesson 4: Parameters and confidence intervals
https://openintro.shinyapps.io/ims-04-foundations-04
You can also access the full list of tutorials supporting this book at https://openintrostat.github.io/ims-tutorials.
15.4 R labs
Further apply the concepts you’ve learned in this part in R with computational labs that walk you
through a data analysis case study.
Sampling distributions - Does science benefit you?
https://www.openintro.org/go?id=ims-r-lab-foundations-1
https://www.openintro.org/go?id=ims-r-lab-foundations-2
You can also access the full list of labs supporting this book at https://www.openintro.org/go?id=ims-r-labs.
PART V
Statistical inference
While the previous part of the textbook, Foundations of inference, introduces the core of statistical
inference, this part provides details for applying statistical inference in particular settings. Under-
standing the specific nuances given for different inferential methods can be helpful in communicating
the resulting analyses.
• Chapter 16 provides specific details about inference for a single proportion.
• Chapter 17 provides specific details about inference for comparing two proportions.
• Chapter 18 provides specific details about inference for two-way tables.
• Chapter 19 provides specific details about inference for a single mean.
• Chapter 20 provides specific details about inference for comparing two independent means.
• Chapter 21 provides specific details about inference for comparing paired means.
• Chapter 22 provides specific details about inference for comparing many means.
• Chapter 23 includes an application on the Redundant adjectives case study where the topics
from this part of the book on means are fully developed.
After working through the details in this part, you should have a good sense for how inferential
methods are similar and different when applied to different data structures. Additionally, you should
be able to apply both computational methods as well as mathematical techniques to the inference
problem at hand.
Chapter 16
Inference for a single proportion
We encountered inference methods for a single proportion in Chapter 12, exploring point estimates
and confidence intervals. In this section, we review these topics and discuss how to choose an
appropriate sample size when collecting data for single proportion contexts.
Note that there is only one variable being measured in a study which focuses on one proportion. For
each observational unit, the single variable is measured as either a success or failure (e.g., “surgical
complication” vs. “no surgical complication”). Because the nature of the research question at hand
focuses on only a single variable, there is not a way to randomize the variable across a different
(explanatory) variable. For this reason, we will not use randomization as an analysis tool when
focusing on a single proportion. Instead, we will apply bootstrapping techniques to test a given
hypothesis, and we will also revisit the associated mathematical models.
EXAMPLE
Using the data, is it possible to assess the consultant’s claim that her complication rate is less
than 10%?
No. The claim is that there is a causal connection, but the data are observational. Patients
who hire this medical consultant may have lower complication rates for other reasons.
While it is not possible to assess this causal claim, it is still possible to test for an association
using these data. For this question we ask, could the low complication rate of 𝑝̂ = 0.0484 have
simply occurred by chance, if her complication rate does not differ from the US standard rate?
GUIDED PRACTICE
Write out hypotheses in both plain and statistical language to test for the association
between the consultant’s work and the true complication rate, 𝑝, for the consultant’s
clients.1
Because, as it turns out, the conditions of working with the normal distribution are not met (see
Section 16.2), the uncertainty associated with the sample proportion should not be modeled using
the normal distribution, as doing so would underestimate the uncertainty associated with the sample
statistic. However, we would still like to assess the hypotheses from the previous Guided Practice in
the absence of the normal framework. To do so, we need to evaluate the possibility of a sample value (𝑝̂)
as far below the null value, 𝑝0 = 0.10, as what was observed. The deviation of the sample value from
the hypothesized parameter is usually quantified with a p-value.
The p-value is computed based on the null distribution, which is the distribution of the test statistic
if the null hypothesis is true. Supposing the null hypothesis is true, we can compute the p-value by
identifying the probability of observing a test statistic that favors the alternative hypothesis at least
as strongly as the observed test statistic. Here we will use a bootstrap simulation to calculate the
p-value.
1 $H_0$: There is no association between the consultant’s contributions and the clients’ complication rate. In statistical
language, $p = 0.10$. $H_A$: Patients who work with the consultant tend to have a complication rate lower than 10%, i.e.,
$p < 0.10$.
260 CHAPTER 16. INFERENCE FOR A SINGLE PROPORTION
Similar to the process described in Chapter 12, each client can be simulated using a bag of marbles
with 10% red marbles and 90% white marbles. Sampling a marble from the bag (with 10% red
marbles) is one way of simulating whether a patient has a complication if the true complication rate
is 10%. If we select 62 marbles and then compute the proportion of patients with complications in
the simulation, $\hat{p}_{sim1}$, then the resulting sample proportion is a sample from the null distribution.
There were 5 simulated cases with a complication and 57 simulated cases without a complication, i.e.,
$\hat{p}_{sim1} = 5/62 = 0.081$.
EXAMPLE
Is this one simulation enough to determine whether we should reject the null hypothesis?
No. To assess the hypotheses, we need to see a distribution of many values of $\hat{p}_{sim}$, not just a
single draw from this sampling distribution.
Figure 16.1: The null distribution for 𝑝̂, created from 10,000 simulated studies. The left tail, representing
the p-value for the hypothesis test, is colored in blue.
GUIDED PRACTICE
Because the estimated p-value is 0.117, which is larger than the discernibility level 0.05,
we cannot reject the null hypothesis. Explain what this means in plain language in the
context of the problem.2
16.2. MATHEMATICAL MODEL FOR A PROPORTION 261
GUIDED PRACTICE
Does the conclusion in the previous Guided Practice imply the consultant is good at
their job? Explain.3
Regardless of the statistical method chosen, the p-value is always derived by analyzing
the null distribution of the test statistic. The normal model poorly approximates the
null distribution for 𝑝̂ (the sample proportion) when the success-failure condition is not
satisfied. As a substitute, we can generate the null distribution using simulated sample
proportions and use this distribution to compute the tail area, i.e., the p-value.
In the previous Guided Practice, the p-value is estimated. It is not exact because the simulated null
distribution itself is only a close approximation of the sampling distribution of the sample statistic. An
exact p-value can be generated using the binomial distribution, but that method will not be covered
in this text.
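The simulation described above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used to produce Figure 16.1: the function names and the seed are hypothetical, and the observed count of 3 complications is an assumption inferred from the reported 𝑝̂ = 0.0484 with 62 cases (3/62 ≈ 0.0484). The exact binomial tail is included only as a cross-check on the simulated estimate.

```python
import random
from math import comb

def simulate_null_counts(n, p0, reps, seed=1):
    """Simulate `reps` studies of size n, assuming the true complication rate is p0."""
    rng = random.Random(seed)
    counts = []
    for _ in range(reps):
        # Each client is one marble draw: "red" (complication) with probability p0.
        counts.append(sum(rng.random() < p0 for _ in range(n)))
    return counts

def exact_lower_tail(n, p0, k):
    """Exact binomial probability of observing k or fewer complications."""
    return sum(comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(k + 1))

n, p0 = 62, 0.10
observed_count = 3  # assumption: 3/62 = 0.0484 matches the reported sample proportion

null_counts = simulate_null_counts(n, p0, reps=10_000)
# One-sided p-value: share of simulated studies at or below the observed count.
p_sim = sum(c <= observed_count for c in null_counts) / len(null_counts)
p_exact = exact_lower_tail(n, p0, observed_count)

print(f"simulated p-value: {p_sim:.3f}")
print(f"exact binomial p-value: {p_exact:.3f}")
```

The simulated value changes slightly with the seed and the number of repetitions, which is why the p-value in the text is described as an estimate.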
16.2.1 Conditions
In Section 13.2, we introduced the normal distribution and showed how it can be used as a mathematical
model to describe the variability of a statistic. There are conditions under which a sample
proportion 𝑝̂ is well modeled with a normal distribution. When the observations are independent and
the sample size is sufficiently large, the normal model will describe the sampling distribution of the
sample proportion quite well; when the observations violate the conditions, the normal model can be
inaccurate. Particularly, it can underestimate the variability of the sample proportion.
The sampling distribution for 𝑝̂ (the sample proportion) based on a sample of size 𝑛
from a population with a true proportion 𝑝 is nearly normal when:
1. The sample’s observations are independent, e.g., are from a simple random sample.
2. We expect to see at least 10 successes and 10 failures in the sample, i.e., 𝑛𝑝 ≥ 10
and 𝑛(1 − 𝑝) ≥ 10. This is called the success-failure condition.
When these conditions are met, then the sampling distribution of 𝑝̂ is nearly normal
with mean 𝑝 and standard error of 𝑝̂ as $SE = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$.
Recall that the margin of error is defined by the standard error. The margin of error for 𝑝̂ can be
directly obtained from $SE(\hat{p})$.
The margin of error is $z^\star \times \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$, where $z^\star$ is calculated from a specified percentile
on the normal distribution.
2 There is not enough evidence to reject the null hypothesis in favor of the alternative hypothesis. We cannot
conclude that there is evidence that the consultant’s surgery complication rate is lower than the US standard rate of
10%. That said, we also cannot conclude that there is evidence that the consultant’s surgery complication rate is higher
than the US standard rate of 10%. When the p-value is larger than the discernibility level, we are unable to make
conclusions about the research statement.
3 Not necessarily. There is no evidence to make a claim in either direction, so we cannot make any claims about
Typically we do not know the true proportion 𝑝, so we substitute some value to check conditions
and estimate the standard error. For confidence intervals, the sample proportion 𝑝̂ is used to check
the success-failure condition and compute the standard error. For hypothesis tests, typically the null
value – that is, the proportion claimed in the null hypothesis – is used in place of 𝑝.
The independence condition is a more nuanced requirement. When it isn’t met, it is important to
understand how and why it is violated. For example, no statistical methods are available to
truly correct the inherent biases of data from a convenience sample. On the other hand, if we took a
cluster sample (see Section 2.1.5), the observations wouldn’t be independent, but suitable statistical
methods are available for analyzing the data (but they are beyond the scope of even most second or
third courses in statistics).
EXAMPLE
In the examples based on large sample theory, we modeled 𝑝̂ using the normal distribution.
Why is this not appropriate for the case study on the medical consultant?
The independence assumption may be reasonable if each of the surgeries is from a different
surgical team. However, the success-failure condition is not satisfied. Under the null hypothesis,
we would anticipate seeing 62 × 0.10 = 6.2 complications, not the 10 required for the normal
approximation.
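The success-failure check in this example is simple arithmetic, and a small helper makes it easy to repeat. The sketch below is illustrative only: the function name `success_failure` and its `minimum` parameter are hypothetical. It is applied to the medical consultant data (𝑛 = 62, null value 0.10) and, for contrast, to a sample of 826 with a null value of 0.50, the payday loan setting that appears later in this chapter.

```python
def success_failure(n, p, minimum=10):
    """Return expected successes, expected failures, and whether both reach `minimum`."""
    expected_successes = n * p
    expected_failures = n * (1 - p)
    condition_met = expected_successes >= minimum and expected_failures >= minimum
    return expected_successes, expected_failures, condition_met

# Medical consultant: expected complications under the null fall short of 10.
consultant = success_failure(62, 0.10)
# Payday loan poll under a null value of 0.50: both expected counts are 413.
payday = success_failure(826, 0.50)

print(consultant)
print(payday)
```

For a hypothesis test the null value is passed as `p`; for a confidence interval the sample proportion would be passed instead.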
While this book is scoped to well-constrained statistical problems, do remember that this is just the
first book in what is a large library of statistical methods that are suitable for a very wide range of
data and contexts.
When the conditions are met so that the distribution of 𝑝̂ (the sample proportion) is
nearly normal, the variability of a single proportion, 𝑝̂, is well described by:
$$SE(\hat{p}) = \sqrt{\frac{p(1-p)}{n}}$$
Note that we almost never know the true value of 𝑝 (the population probability or
proportion). A more helpful formula to use is:
$$SE(\hat{p}) \approx \sqrt{\frac{(\text{best guess of } p)(1 - \text{best guess of } p)}{n}}$$
For hypothesis testing, we use 𝑝0 (the proportion specified in the null hypothesis) as
the best guess of 𝑝. For confidence intervals, we use 𝑝̂ as the best guess of 𝑝.
GUIDED PRACTICE
Consider taking many polls of registered voters (i.e., random samples) of size 300 asking
them if they support legalized marijuana. It is suspected that about 2/3 of all voters
support legalized marijuana. To understand how the sample proportion (𝑝̂) would vary
across the samples, calculate the standard error of 𝑝̂.4
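The "best guess" version of the standard error formula can be written as a one-line function. The sketch below is a minimal illustration with a hypothetical function name, `standard_error`; it is applied to the polling scenario from the Guided Practice above, using 2/3 as the best guess of 𝑝 and 𝑛 = 300.

```python
from math import sqrt

def standard_error(p_guess, n):
    """Standard error of a sample proportion, using a best guess for p."""
    return sqrt(p_guess * (1 - p_guess) / n)

# Best guess of 2/3 support among voters, polls of size 300.
se_poll = standard_error(2 / 3, 300)
print(round(se_poll, 3))  # 0.027
```

Passing the null value 𝑝0 gives the standard error for a hypothesis test, while passing 𝑝̂ gives the one used in a confidence interval.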
EXAMPLE
A simple random sample of 826 payday loan borrowers was surveyed to better understand
their interests around regulation and costs. 70% of the responses supported new regulations
on payday lenders.
1. The data are a random sample, so it is reasonable to assume that the observations are
independent and representative of the population of interest. We also must check the
success-failure condition, using 𝑝̂ in place of 𝑝 when computing a confidence interval:
826 × 0.70 ≈ 578 and 826 × (1 − 0.70) ≈ 248.
Since both values are at least 10, we can use the normal distribution to model 𝑝̂.
3. Using 𝑝̂ = 0.70, 𝑧⋆ = 1.96 for a 95% confidence interval, and the standard error 𝑆𝐸 =
0.016 from the previous Guided Practice, the confidence interval is
point estimate ± 𝑧 ⋆ × 𝑆𝐸
0.70 ± 1.96 × 0.016
(0.669 , 0.731)
We are 95% confident that the true proportion of payday borrowers who supported regulation
at the time of the poll was between 0.669 and 0.731.
There are three steps to constructing a confidence interval for 𝑝 (the true population
proportion or probability).
1. Check if it seems reasonable to assume the observations are independent and check
the success-failure condition using 𝑝̂ (the sample proportion). If the conditions
are met, the sampling distribution of 𝑝̂ may be well-approximated by the normal
model.
2. Calculate the standard error using 𝑝̂ instead of 𝑝.
3. Apply the general confidence interval formula.
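The three steps can be sketched as a short Python function. This is an illustration under stated assumptions, not a prescribed implementation: the function name `proportion_ci` is hypothetical, the independence condition cannot be checked in code and is assumed to hold, and 𝑧⋆ = 1.96 is passed in directly as in the payday loan example.

```python
from math import sqrt

def proportion_ci(p_hat, n, z_star=1.96):
    """Normal-model confidence interval for a proportion (independence assumed)."""
    # Step 1: success-failure condition, checked with the sample proportion.
    if n * p_hat < 10 or n * (1 - p_hat) < 10:
        raise ValueError("success-failure condition not met; consider bootstrapping")
    # Step 2: standard error computed with p_hat in place of p.
    se = sqrt(p_hat * (1 - p_hat) / n)
    # Step 3: point estimate plus or minus z_star standard errors.
    margin = z_star * se
    return p_hat - margin, p_hat + margin

# Payday loan poll: 70% support in a sample of 826 borrowers.
lower, upper = proportion_ci(0.70, 826)
print(round(lower, 3), round(upper, 3))  # 0.669 0.731
```

Raising an error when the success-failure condition fails is one possible design choice; it mirrors the advice in the text to avoid the normal model in that case.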
4 Because 𝑝 is unknown but expected to be around 2/3, we will use 2/3 in place of 𝑝 in the formula for the
standard error. $SE = \sqrt{\frac{p(1-p)}{n}} \approx \sqrt{\frac{2/3\,(1-2/3)}{300}} = 0.027$.
There are three components to this interval: the point estimate, “1.96”, and the standard error. The
choice of 1.96 × 𝑆𝐸 was based on capturing 95% of the sampling distribution, since the estimate is within 1.96 standard
errors of the true value about 95% of the time. The value 1.96 corresponds to the 95% confidence level.
GUIDED PRACTICE
If 𝑋 is a normally distributed random variable, how often will 𝑋 be within 2.58 standard
deviations of the mean?5
Figure 16.2: The area between -𝑧⋆ and 𝑧⋆ increases as |𝑧⋆ | becomes larger. If the confidence level is 99%, we
choose 𝑧 ⋆ such that 99% of the normal curve is between -𝑧 ⋆ and 𝑧⋆ , which corresponds to 0.5% in the lower
tail and 0.5% in the upper tail: 𝑧⋆ = 2.58.
To create a 99% confidence interval, change 1.96 in the 95% confidence interval formula to be 2.58. The
previous Guided Practice highlights that 99% of the time a normal random variable will be within 2.58
standard deviations of its mean. This approach – using the Z scores in the normal model to compute
confidence levels – is appropriate when the point estimate is associated with a normal distribution
and we can properly compute the standard error. Thus, the formula for a 99% confidence interval is:
point estimate ± 2.58 × 𝑆𝐸
The normal approximation is crucial to the precision of the 𝑧⋆ confidence intervals (in contrast to
the bootstrap percentile confidence intervals). When the normal model is not a good fit, we will use
alternative distributions that better characterize the sampling distribution or we will use bootstrapping
procedures.
5 This is equivalent to asking how often the 𝑍 score will be larger than -2.58 but less than 2.58. (For a picture, see
Figure 16.2.) To determine this probability, look up -2.58 and 2.58 in the normal probability table (0.0049 and 0.9951).
Thus, there is a 0.9951 − 0.0049 ≈ 0.99 probability that the unobserved random variable 𝑋 will be within 2.58 standard
deviations of the mean.
GUIDED PRACTICE
Create a 99% confidence interval for the impact of the stent on the risk of stroke using
the data from Section 1.1. The point estimate is 0.090, and the standard error is
𝑆𝐸 = 0.028. It has been verified for you that the point estimate can reasonably be
modeled by a normal distribution.6
If the point estimate follows the normal model with standard error 𝑆𝐸, then a confidence
interval for the population parameter is
point estimate ± 𝑧 ⋆ × 𝑆𝐸
Figure 16.2 provides a picture of how to identify 𝑧 ⋆ based on a confidence level. We select 𝑧⋆ so that
the area between -𝑧 ⋆ and 𝑧⋆ in the normal model corresponds to the confidence level.
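Instead of a normal probability table, 𝑧⋆ can be found from the inverse of the standard normal distribution function. The sketch below uses `NormalDist` from Python's standard `statistics` module; the helper name `z_star` is hypothetical. Note that the 90% value comes out near 1.645, which table-based solutions commonly round to 1.65.

```python
from statistics import NormalDist

def z_star(confidence):
    """Value z such that the area between -z and z under N(0, 1) equals `confidence`."""
    tail = (1 - confidence) / 2           # area left in each tail
    return NormalDist().inv_cdf(1 - tail)

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence: z* = {z_star(level):.3f}")

# Central area within 2.58 standard deviations of the mean, as in Figure 16.2.
central_area = NormalDist().cdf(2.58) - NormalDist().cdf(-2.58)
print(f"area within 2.58 SD: {central_area:.4f}")
```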
GUIDED PRACTICE
Previously, we found that implanting a stent in the brain of a patient at risk for a
stroke increased the risk of a stroke. The study estimated a 9% increase in the number
of patients who had a stroke, and the standard error of this estimate was about 𝑆𝐸 = 2.8%.
Compute a 90% confidence interval for the effect.7
GUIDED PRACTICE
Set up hypotheses to evaluate whether borrowers have a majority support for this type
of regulation.8
To apply the normal distribution framework in the context of a hypothesis test for a proportion, the
independence and success-failure conditions must be satisfied. In a hypothesis test, the success-failure
condition is checked using the null proportion: we verify 𝑛𝑝0 and 𝑛(1 − 𝑝0 ) are at least 10, where 𝑝0
is the null value.
6 Since the necessary conditions for applying the normal model have already been checked for us, we can go straight
to the construction of the confidence interval: point estimate ± 2.58 × 𝑆𝐸, which gives an interval of (0.018, 0.162). We
are 99% confident that implanting a stent in the brain of a patient who is at risk of stroke increases the risk of stroke
within 30 days by a rate of 0.018 to 0.162 (assuming the patients are representative of the population).
7 We must find 𝑧⋆ such that 90% of the distribution falls between -𝑧⋆ and 𝑧⋆ in the standard normal model, 𝑁(𝜇 =
0, 𝜎 = 1). We can look up -𝑧⋆ in the normal probability table by looking for a lower tail of 5% (the other 5% is in the upper
tail), thus 𝑧⋆ = 1.65. The 90% confidence interval can then be computed as point estimate ± 1.65 × 𝑆𝐸 → (4.4%, 13.6%).
(Note: the conditions for normality had earlier been confirmed for us.) That is, we are 90% confident that implanting
a stent in a stroke patient’s brain increased the risk of stroke within 30 days by 4.4% to 13.6%.
Note, the problem was set up as 90% to indicate that there was not a need for a high level of confidence (such as 95%
or 99%). A lower degree of confidence increases potential for error, but it also produces a narrower interval.
8 $H_0$: there is not support for the regulation; $H_0: p \leq 0.50$. $H_A$: the majority of borrowers support the regulation;
$H_A: p > 0.50$.
The Z score is a ratio of how the sample proportion differs from the hypothesized
proportion (𝑝0 ) as compared to the expected variability of the 𝑝̂ (sample proportion)
values.
$$Z = \frac{\hat{p} - p_0}{\sqrt{p_0(1 - p_0)/n}}$$
When the null hypothesis is true and the conditions are met, Z has a standard normal
distribution.
Conditions:
• independent observations
• large samples (𝑛𝑝0 ≥ 10 and 𝑛(1 − 𝑝0) ≥ 10)
GUIDED PRACTICE
Do payday loan borrowers support a regulation that would require lenders to pull their
credit report and evaluate their debt payments? From a random sample of 826 borrowers,
51% said they would support such a regulation. Is it reasonable to use a normal
distribution to model 𝑝̂ for a hypothesis test here?9
Set up hypotheses and verify the conditions using the null value, 𝑝0 , to ensure 𝑝̂ (the
sample proportion) is nearly normal under 𝐻0 . If the conditions hold, calculate the
standard error, again using 𝑝0 , and show the p-value in a drawing. Lastly, compute the
p-value and evaluate the hypotheses.
9 Independence holds since the poll is based on a random sample. The success-failure condition also holds, which is
checked using the null value (𝑝0 = 0.5) from 𝐻0 ∶ 𝑛𝑝0 = 826 × 0.5 = 413, 𝑛(1 − 𝑝0 ) = 826 × 0.5 = 413. Recall that here,
the best guess for 𝑝 is 𝑝0 which comes from the null hypothesis (because we assume the null hypothesis is true when
performing the testing procedure steps). 𝐻0 ∶ there is not support for the regulation; 𝐻0 ∶ 𝑝 ≤ 0.50. 𝐻𝐴 ∶ the majority
of borrowers support the regulation; 𝐻𝐴 ∶ 𝑝 > 0.50.
EXAMPLE
Using the hypotheses and data from the previous Guided Practices, evaluate whether the poll
on lending regulations provides convincing evidence that a majority of payday loan borrowers
support a new regulation that would require lenders to pull credit reports and evaluate debt
payments.
With hypotheses already set up and conditions checked, we can move onto calculations. The
standard error in the context of a one-proportion hypothesis test is computed using the null
value, $p_0$:
$$SE = \sqrt{\frac{p_0(1-p_0)}{n}} = \sqrt{\frac{0.5(1-0.5)}{826}} = 0.017$$
A picture of the normal model is shown below with the p-value represented by the shaded
region.
Based on the normal model, the test statistic can be computed as the Z score of the point
estimate:
$$Z = \frac{\text{point estimate} - \text{null value}}{SE} = \frac{0.51 - 0.50}{0.017} = 0.59$$
The single tail area which represents the p-value is 0.2776. Because the p-value is larger than
0.05, we do not reject 𝐻0 . The poll does not provide convincing evidence that a majority
of payday loan borrowers support regulations around credit checks and evaluation of debt
payments.
In Section 17.1 we discuss two-sided hypothesis tests, under which the payday example may have
been better structured. That is, we might have wanted to ask whether the borrowers support or
oppose the regulations (to study opinion in either direction away from the 50% benchmark).
In that case, the p-value would have been doubled to 0.5552 (again, we would not reject 𝐻0 ).
In the two-sided hypothesis setting, the appropriate conclusion would be to claim that the
poll does not provide convincing evidence that a majority of payday loan borrowers support
or oppose regulations around credit checks and evaluation of debt payments.
In both the one-sided and two-sided settings, the conclusion is somewhat unsatisfactory because
there is no conclusion. That is, there is no resolution one way or the other about public opinion.
We cannot claim that exactly 50% of people support the regulation, but we also cannot claim a
majority in either direction.
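The calculations in this example can be reproduced with a short Python sketch. The function name `one_proportion_z_test` is hypothetical, and the code assumes the independence and success-failure conditions have already been checked. Because the code carries the unrounded standard error (about 0.0174) through the calculation, it gives Z near 0.57 and a one-sided p-value near 0.28, slightly different from the hand calculation above, which rounds the standard error to 0.017 before dividing. The conclusion is the same either way.

```python
from math import sqrt
from statistics import NormalDist

def one_proportion_z_test(p_hat, p0, n):
    """Return the standard error, Z score, and upper-tail p-value for H_A: p > p0."""
    se = sqrt(p0 * (1 - p0) / n)          # standard error uses the null value
    z = (p_hat - p0) / se
    upper_tail = 1 - NormalDist().cdf(z)  # area to the right of Z under N(0, 1)
    return se, z, upper_tail

# Payday loan poll: 51% support among 826 borrowers, null value 0.50.
se, z, p_one_sided = one_proportion_z_test(0.51, 0.50, 826)
# Two-sided version: double the smaller tail area.
p_two_sided = 2 * min(p_one_sided, 1 - p_one_sided)

print(f"SE = {se:.4f}, Z = {z:.2f}")
print(f"one-sided p-value = {p_one_sided:.3f}, two-sided p-value = {p_two_sided:.3f}")
```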
16.3.1 Summary
Building on the foundational ideas from the previous few chapters, this chapter focused exclusively on
the single population proportion as the parameter of interest. Note that it is not possible to do a
randomization test with only one variable, so to do computational hypothesis testing, we applied a
bootstrapping framework. The bootstrap confidence interval and the mathematical framework for
both hypothesis testing and confidence intervals are similar to those applied to other data structures
and parameters. When using the mathematical model, keep in mind the success-failure conditions.
Additionally, know that bootstrapping is always more accurate with larger samples.
16.3.2 Terms
The terms introduced in this chapter are presented in Table 16.1. If you’re not sure what some of
these terms mean, we recommend you go back in the text and review their definitions. You should be
able to easily spot them as bolded text.
16.4 Exercises
2. Married at 25. A study suggests that 25% of 25-year-olds have gotten married. You
believe that this is incorrect and decide to collect your own sample for a hypothesis test. From
a random sample of 776 25 year-olds, you find that 24% of them are married. A friend of yours
offers to help you with setting up the hypothesis test and comes up with the following hypotheses.
Indicate any errors you see.
𝐻0 ∶ 𝑝̂ = 0.24 𝐻𝐴 ∶ 𝑝̂ ≠ 0.24
3. Defund the police. A Survey USA poll conducted in Seattle, WA in May 2021 reports that
of the 650 respondents (adults living in this area), 159 support proposals to defund police
departments. (Survey USA 2021)
a. A journalist writing a news story on the poll results wants to use the headline “More than 1
in 5 adults living in Seattle support proposals to defund police departments.” You caution
the journalist that they should first conduct a hypothesis test to see if the poll data provide
convincing evidence for this claim. Write the hypotheses for this test.
b. Calculate the proportion of Seattle adults in the sample who support proposals to defund
police departments.
c. Describe a setup for a simulation that would be appropriate in this situation and how the
p-value can be calculated using the simulation results.
d. The histogram below shows the distribution of 1,000 $\hat{p}_{sim}$ values under the null hypothesis.
Estimate the p-value using the plot and use it to evaluate the hypotheses.
d. After performing this analysis, the consumer group releases the following news headline:
“Infertility clinic falsely advertises better success rates”. Comment on the appropriateness
of this statement.
5. If I fits, I sits, simulated null hypothesis. A citizen science project on which type of
enclosed spaces cats are most likely to sit in compared (among other options) two different
spaces taped to the ground. The first was a square, and the second was a shape known as
Kanizsa square illusion. When comparing the two options given to 7 cats, 5 chose the square,
and 2 chose the Kanizsa square illusion. We are interested to know whether these data provide
convincing evidence that cats prefer one of the shapes over the other. (Smith, Chouinard, and
Byosiere 2021)
a. What are the null and alternative hypotheses for evaluating whether these data provide
convincing evidence that cats have a preference for one of the shapes?
b. A null hypothesis simulation (with 1,000 draws) was run, and the resulting null distribution
is displayed in the histogram below. Find the p-value using this distribution and conclude
the hypothesis test in the context of the problem.
6. Legalization of marijuana, simulated null hypothesis. The 2022 General Social Survey
asked a random sample of 1,207 US adults: “Do you think the use of marijuana should be made
legal, or not?” 65.3% of the respondents said it should be made legal. (NORC 2022) Consider
a scenario where, in order to become legal, 55% (or more) of voters must approve.
a. What are the null and alternative hypotheses for evaluating whether these data provide
convincing evidence that, if voted on, marijuana would be legalized in the US?
b. A null hypothesis simulation (with 1,000 draws) was run, and the resulting null distribution
is displayed in the histogram below. Find the p-value using this distribution and conclude
the hypothesis test in the context of the problem.
7. If I fits, I sits, standard errors. The results of a study on the type of enclosed spaces cats
are most likely to sit in show that 5 out of 7 cats chose a square taped to the ground over
a shape known as Kanizsa square illusion, which was preferred by the remaining 2 cats. To
evaluate whether these data provide convincing evidence that cats prefer one of the shapes over
the other, we set 𝐻0 ∶ 𝑝 = 0.5, where 𝑝 is the population proportion of cats who prefer square
over the Kanizsa square illusion and 𝐻𝐴 ∶ 𝑝 ≠ 0.5, which suggests some preference, without
specifying which shape is more preferred. (Smith, Chouinard, and Byosiere 2021)
a. Using the mathematical model, calculate the standard error of the sample proportion in
repeated samples of size 7.
b. A null hypothesis simulation (with 1,000 draws) was run, and the resulting null distribution
is displayed in the histogram below. This distribution shows the variability of the sample
proportion in samples of size 7 when 50% of cats prefer the square shape over the Kanizsa
square illusion. What is the approximate standard error of the sample proportion based
on this distribution?
c. Do the mathematical model and simulated draws yield similar standard errors?
d. In order to approach the problem using the mathematical model, is the success-failure
condition met for this study? Explain.
e. What features of the null distribution shown above tell us that the mathematical model
should probably not be used?
8. Legalization of marijuana, standard errors. According to the 2022 General Social Survey,
in a random sample of 1,207 US adults, 65.3% think marijuana should be made legal. (NORC
2022) Consider a scenario where, in order to become legal, 55% (or more) of voters must approve.
a. Calculate the standard error of the sample proportion using the mathematical model.
b. 1,000 sample proportions from samples of size 1,207 were drawn from a null distribution
where 55% of voters approve legalizing marijuana. The distribution of these proportions is
shown in the histogram below. Approximate the standard error of the sample proportion
based on this distribution.
c. Do the mathematical model and simulated draws yield similar standard errors?
d. In this setting (to test whether the true underlying population proportion is greater than
0.55), would there be a strong reason to choose the mathematical model over the simulated
null hypothesis (or vice versa)?
9. Statistics and employment, describe the bootstrap. A large university knows that about
70% of the full-time students are employed at least 5 hours per week. The members of the
Statistics Department wonder if the same proportion of their students work at least 5 hours per
week. They randomly sample 25 majors and find that 15 of the students work 5 or more hours
each week.
Two sampling distributions are created to describe the variability in the proportion of statistics
majors who work at least 5 hours per week. The null hypothesis simulation imposes a true
population proportion of 𝑝 = 0.7 while the data bootstrap resamples from the actual data
(which has 60% of the observations who work at least 5 hours per week).
a. The sampling was done under two different settings to generate each of the distributions
shown above. Describe the two different settings.
b. Where are each of the two distributions centered? How do their centers compare?
c. Estimate the standard error of the simulated proportions based on each distribution. Are
the two standard errors you estimate roughly equal?
d. Describe the shapes of the two distributions. Are they roughly the same?
10. National Health Plan, simulated null hypothesis. A Kaiser Family Foundation poll for a
random sample of US adults in 2019 found that 79% of Democrats, 55% of Independents, and
24% of Republicans supported a generic “National Health Plan”. There were 347 Democrats,
298 Republicans, and 617 Independents surveyed. (Kaiser Family Foundation 2019)
A political pundit on TV claims that a majority of Independents support a National Health
Plan. Do these data provide strong evidence to support this type of statement? One approach
to assessing the question of whether a majority of Independents support a National Health Plan
is to simulate 1,000 draws from a null hypothesis with 𝑝 = 0.5 as the proportion of Independents
in support.
a. Which distribution should be used to test whether the proportion of all statistics majors
who work at least 5 hours per week is 70%? And which distribution should be used to find
a confidence interval for the true proportion of statistics majors who work at least 5 hours
per week?
b. Using the appropriate histogram, test the claim that 70% of statistics majors, like their
peers, work at least 5 hours per week. State the null and alternative hypotheses, find the
p-value, and conclude the test in the context of the problem.
c. Using the appropriate histogram, find a 98% bootstrap percentile confidence interval for
the true proportion of statistics majors who work at least 5 hours per week. Interpret the
confidence interval in the context of the problem.
d. Using the appropriate histogram, find a 98% bootstrap SE confidence interval for the true
proportion of statistics majors who work at least 5 hours per week. Interpret the confidence
interval in the context of the problem.
12. CLT for proportions. Define the term “sampling distribution” of the sample proportion, and
describe how the shape, center, and spread of the sampling distribution change as the sample
size increases when 𝑝 = 0.1.
13. Vegetarian college students. Suppose that 8% of college students are vegetarians. Determine
if the following statements are true or false, and explain your reasoning.
a. The distribution of the sample proportions of vegetarians in random samples of size 60 is
approximately normal since 𝑛 ≥ 30.
b. The distribution of the sample proportions of vegetarian college students in random samples
of size 50 is right skewed.
c. A random sample of 125 college students where 12% are vegetarians would be considered
unusual.
d. A random sample of 250 college students where 12% are vegetarians would be considered
unusual.
e. The standard error would be reduced by one-half if we increased the sample size from 125
to 250.
14. Young Americans, American dream. About 77% of young adults think they can achieve
the American dream. Determine if the following statements are true or false, and explain your
reasoning. (Vaughn 2011)
a. The distribution of sample proportions of young Americans who think they can achieve the
American dream in random samples of size 20 is left skewed.
b. The distribution of sample proportions of young Americans who think they can achieve the
American dream in random samples of size 40 is approximately normal since 𝑛 ≥ 30.
c. A random sample of 60 young Americans where 85% think they can achieve the American
dream would be considered unusual.
d. A random sample of 120 young Americans where 85% think they can achieve the American
dream would be considered unusual.
15. Orange tabbies. Suppose that 90% of orange tabby cats are male. Determine if the following
statements are true or false, and explain your reasoning.
a. The distribution of sample proportions of random samples of size 30 is left skewed.
b. Using a sample size that is 4 times as large will reduce the standard error of the sample
proportion by one-half.
c. The distribution of sample proportions of random samples of size 140 is approximately
normal.
d. The distribution of sample proportions of random samples of size 280 is approximately
normal.
16. Young Americans, starting a family. About 25% of young Americans have delayed starting
a family due to the continued economic slump. Determine if the following statements are true
or false, and explain your reasoning. (Demos 2011)
a. The distribution of sample proportions of young Americans who have delayed starting a
family due to the continued economic slump in random samples of size 12 is right skewed.
b. In order for the distribution of sample proportions of young Americans who have delayed
starting a family due to the continued economic slump to be approximately normal, we
need random samples where the sample size is at least 40.
c. A random sample of 50 young Americans where 20% have delayed starting a family due to
the continued economic slump would be considered unusual.
d. A random sample of 150 young Americans where 20% have delayed starting a family due
to the continued economic slump would be considered unusual.
e. Tripling the sample size will reduce the standard error of the sample proportion by one-
third.
17. Sex equality. The General Social Survey asked a random sample of 1,390 Americans the
following question: “On the whole, do you think it should or should not be the government’s
responsibility to promote equality between men and women?” 82% of the respondents said it
“should be”. At a 95% confidence level, this sample has 2% margin of error. Based on this
information, determine if the following statements are true or false, and explain your reasoning.
(NORC 2016)
a. We are 95% confident that 80% to 84% of Americans in this sample think it’s the govern-
ment’s responsibility to promote equality between men and women.
b. We are 95% confident that 80% to 84% of all Americans think it’s the government’s respon-
sibility to promote equality between men and women.
c. If we considered many random samples of 1,390 Americans, and we calculated 95% confi-
dence intervals for each, 95% of these intervals would include the true population proportion
of Americans who think it’s the government’s responsibility to promote equality between
men and women.
d. In order to decrease the margin of error to 1%, we would need to quadruple (multiply by
4) the sample size.
e. Based on this confidence interval, there is sufficient evidence to conclude that a majority
of Americans think it’s the government’s responsibility to promote equality between men
and women.
18. Elderly drivers. A Marist Poll report states that 66% of American adults think licensed
drivers should be required to retake their road test once they reach 65 years of age, based on
a random sample of 1,018 American adults. They also report that the margin of error was 3% at the
95% confidence level. (Poll 2011)
a. Verify the margin of error reported by The Marist Poll using a mathematical model.
b. Based on a 95% confidence interval, does the poll provide convincing evidence that more
than two thirds of the population think that licensed drivers should be required to retake
their road test once they turn 65?
19. Fireworks on July 4th. A local news outlet reported that 56% of 600 randomly sampled
Kansas residents planned to set off fireworks on July 4th. Using a mathematical model, determine
the margin of error for the 56% point estimate at a 95% confidence level. (Survey USA
2012)
20. Proof of COVID-19 vaccination. In the US, businesses and schools shut down due to the
COVID-19 pandemic in March 2020, and a vaccine became publicly available for the first time
in April 2021. That month, a Gallup poll surveyed a random sample of 3,731 US adults, asking
how they felt about the COVID-19 vaccine requirement for air travel. The poll found that 57%
said they would favor it. (Gallup 2021b)
a. Describe the population parameter of interest. What is the value of the point estimate of
this parameter?
b. Check if the conditions required for constructing a confidence interval using a mathematical
model based on these data are met.
c. Construct a 95% confidence interval for the proportion of US adults who favor requiring
proof of COVID-19 vaccination for travel by airplane.
d. Without doing any calculations, describe what would happen to the confidence interval if
we decided to use a higher confidence level.
e. Without doing any calculations, describe what would happen to the confidence interval if
we used a larger sample.
21. Study abroad. A survey on 1,509 high school seniors who took the SAT and who completed
an optional web survey shows that 55% of high school seniors are fairly certain that they will
participate in a study abroad program in college. (American Council on Education 2008)
a. Is this sample a representative sample from the population of all high school seniors in the
US? Explain your reasoning.
b. Suppose the conditions for inference are met, regardless of your answer to part (a). Using a
mathematical model, construct a 90% confidence interval for the proportion of high school
seniors (of those who took the SAT) who are fairly certain they will participate in a study
abroad program in college, and interpret this interval in context.
c. What does “90% confidence” mean?
d. Based on this interval, would it be appropriate to claim that the majority of high school
seniors are fairly certain that they will participate in a study abroad program in college?
22. Legalization of marijuana, mathematical interval. The General Social Survey asked a
random sample of 1,563 US adults: “Do you think the use of marijuana should be made legal,
or not?” 60% of the respondents said it should be made legal. (NORC 2022)
a. Is 60% a sample statistic or a population parameter? Explain.
b. Using a mathematical model, construct a 95% confidence interval for the proportion of US
adults who think marijuana should be made legal, and interpret it in the context of the
data.
c. A critic points out that this 95% confidence interval is only accurate if the statistic follows
a normal distribution, or if the normal model is a good approximation. Do the technical
conditions hold for these data? Explain.
d. A news piece on this survey’s findings states, “Majority of US adults think marijuana should
be legalized.” Based on your confidence interval, is the news piece’s statement justified?
23. National Health Plan, mathematical inference. A Kaiser Family Foundation poll for a
random sample of US adults in 2019 found that 79% of Democrats, 55% of Independents, and
24% of Republicans supported a generic “National Health Plan”. There were 347 Democrats,
298 Republicans, and 617 Independents surveyed. (Kaiser Family Foundation 2019)
a. A political pundit on TV claims that a majority of Independents support a National Health
Plan. Do these data provide strong evidence to support this type of statement? Your
response should use a mathematical model.
b. Would you expect a confidence interval for the proportion of Independents who oppose the
public option plan to include 0.5? Explain.
24. Is college worth it? Among a simple random sample of 331 American adults who do not have
a four-year college degree and are not currently enrolled in school, 48% said they decided not to
go to college because they could not afford school. (Pew Research Center 2011)
a. A newspaper article states that only a minority of the Americans who decide not to go to
college do so because they cannot afford it and uses the point estimate from this survey
as evidence. Conduct a hypothesis test to determine if these data provide strong evidence
supporting this statement.
b. Would you expect a confidence interval for the proportion of American adults who decide
not to go to college because they cannot afford it to include 0.5? Explain.
25. Taste test. Some people claim that they can tell the difference between a diet soda and a
regular soda in the first sip. A researcher wanting to test this claim randomly sampled 80 such
people. He then filled 80 plain white cups with soda, half diet and half regular through random
assignment, and asked each person to take one sip from their cup and identify the soda as diet
or regular. 53 participants correctly identified the soda.
a. Do these data provide strong evidence that these people are able to detect the difference
between diet and regular soda, in other words, are the results discernibly better than just
random guessing? Your response should use a mathematical model.
b. Interpret the p-value in this context.
26. Will the coronavirus bring the world closer together? In early 2020 the COVID-19
pandemic arrived in the US; by December 2020 the first COVID-19 vaccine was available. An
April 2021 YouGov poll asked 4,265 UK adults whether they think the coronavirus will bring the
world closer together or leave us further apart. 12% of the respondents said it will bring the
world closer together. 37% said it would leave us further apart, 39% said it won’t make a
difference and the remainder didn’t have an opinion on the matter. (YouGov 2021)
a. Calculate, using a mathematical model, a 90% confidence interval for the proportion of UK
adults who think the coronavirus will bring the world closer together, and interpret the
interval in context.
b. Suppose we wanted the margin of error for the 90% confidence level to be about 0.5%. How
large of a sample size would you recommend for the poll?
27. Quality control. As part of a quality control process for computer chips, an engineer at a
factory randomly samples 212 chips during a week of production to test the current rate of chips
with severe defects. She finds that 27 of the chips are defective.
a. What population is under consideration in the dataset?
b. What parameter is being estimated?
c. What is the point estimate for the parameter?
d. What is the name of the statistic that can be used to measure the uncertainty of the point
estimate?
e. Compute the value of the statistic from part (d) using a mathematical model.
f. The historical rate of defects is 10%. Should the engineer be surprised by the observed rate
of defects during the current week?
g. Suppose the true population value was found to be 10%. If we use this proportion to
recompute the value in part (d) using $p = 0.1$ instead of $\hat{p}$, how much does the resulting
value of the statistic change?
28. Nearsighted children. Nearsightedness (myopia) is a common vision condition in which you
can see near objects clearly, but objects farther away appear blurry. It is believed that nearsightedness
affects about 8% of all children. In a random sample of 194 children, 21 are nearsighted. Using a
mathematical model, conduct a hypothesis test for the following question: do these data provide
evidence that the 8% value is inaccurate?
29. Website registration. A website is trying to increase registration for first-time visitors, ex-
posing 1% of these visitors to a new site design. Of 752 randomly sampled visitors over a month
who saw the new design, 64 registered.
a. Check the conditions for constructing a confidence interval for the proportion of first-time
visitors of the site who would register under the new design using a mathematical model.
b. Compute the standard error which would describe the variability of the point estimate
associated with repeated samples of size 752.
c. Construct and interpret a 90% confidence interval for the fraction of first-time visitors of
the site who would register under the new design (assuming stable behaviors by new visitors
over time).
30. Coupons driving visits. A store randomly samples 603 shoppers over the course of a year and
finds that 142 of them made their visit because of a coupon they’d received in the mail. Using a
mathematical model, construct a 95% confidence interval for the fraction of all shoppers during
the year whose visit was because of a coupon they’d received in the mail.
Chapter 17
The results are summarized in Table 17.1 (which is a replica of Table 14.2). 11 out of the 50 patients
in the control group and 14 out of the 40 patients in the treatment group survived.
Table 17.1: Results for the CPR study. Patients in the treatment group were given a blood thinner, and
patients in the control group were not.
Group        Survived   Died   Total
Control            11     39      50
Treatment          14     26      40
Total              25     65      90
GUIDED PRACTICE
Is this an observational study or an experiment? What implications does the study
type have on what can be inferred from the results?1
In this study, a larger proportion of patients who received blood thinner after CPR, $\hat{p}_T = 14/40 = 0.35$,
survived compared to those who did not receive blood thinner, $\hat{p}_C = 11/50 = 0.22$. However, based on
these observed proportions alone, we cannot determine whether the difference ($\hat{p}_T - \hat{p}_C = 0.35 - 0.22 = 0.13$)
provides convincing evidence that blood thinner usage after CPR is effective.
As we saw in Chapter 11, we can re-randomize the responses (survived or died) to the treatment
conditions assuming the null hypothesis is true and compute possible differences in proportions. The
process by which we randomize observations to two groups is summarized and visualized in Figure 11.8.
Figure 17.1: A stacked dot plot of differences from 100 simulations produced under the independence model
𝐻0 , where in these simulations survival is unaffected by the treatment. Twelve of the 100 simulations had a
difference of at least 13%, the difference observed in the study.
EXAMPLE
How often would you observe a difference of at least 13% (0.13) according to Figure 17.1? Is
this a rare event?
According to Figure 17.1, if the null hypothesis were true, a difference of at least 13% would
occur due to chance alone about 12% of the time. This is not a very rare event.
1 The study is an experiment, as patients were randomly assigned to an experimental group. Since this is an experiment,
the results can be used to evaluate a causal relationship between blood thinner use after CPR and whether patients
survived.
17.2. BOOTSTRAP CONFIDENCE INTERVAL FOR THE DIFFERENCE IN PROPORTIONS281
Because a difference of 13% is not a rare event under the null hypothesis, we weigh two possible
interpretations of the results of the study:
• 𝐻0 Independence model. Blood thinners after CPR have no effect on survival, and the difference
we observed is of a size that chance alone produces fairly often.
• 𝐻𝐴 Alternative model. Blood thinners after CPR increase the chance of survival, and the difference
of 13% that we observed reflects the blood thinners being effective at increasing the
chance of survival.
Since we determined that the outcome is not that rare (12% chance of observing a difference of 13% or
more under the assumption that blood thinners after CPR have no effect on survival), we fail to reject
𝐻0 , and conclude that the study results do not provide strong evidence against the independence
model. This does not mean that we have proved that blood thinners are not effective, it just means
that this study does not provide convincing evidence that they are effective in this setting.
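The randomization reasoning above can be sketched in code. The following is a minimal illustration in Python (the text itself works in R), assuming only the counts reported for the CPR study. The function names `diff_in_props` and `randomization_p_value`, the seed, and the number of shuffles are our own illustrative choices.

```python
import random


def diff_in_props(treat_survived, total_survived=25, n_treat=40, n_ctrl=50):
    """Difference in survival proportions, treatment minus control."""
    ctrl_survived = total_survived - treat_survived
    return treat_survived / n_treat - ctrl_survived / n_ctrl


def randomization_p_value(reps=10_000, seed=1):
    """Share of shuffles with a difference at least as large as the observed one."""
    rng = random.Random(seed)
    # 25 survivors (1) and 65 deaths (0) across all 90 patients.
    outcomes = [1] * 25 + [0] * 65
    observed = diff_in_props(14)  # 14/40 - 11/50
    extreme = 0
    for _ in range(reps):
        # Under H0 the outcomes are unrelated to the group labels,
        # so we shuffle them and treat the first 40 as "treatment".
        rng.shuffle(outcomes)
        simulated = diff_in_props(sum(outcomes[:40]))
        if simulated >= observed:
            extreme += 1
    return extreme / reps


p_value = randomization_p_value()
print(f"Estimated one-sided p-value: {p_value:.3f}")
```

With many shuffles the estimate should land near the 12% read off Figure 17.1, although the exact value depends on the seed and the number of repetitions.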
Statistical inference is built on evaluating how likely such differences are to occur due to chance if in
fact the null hypothesis is true. In statistical inference, data scientists evaluate which model is most
reasonable given the data. Errors do occur, just like rare events, and we might choose the wrong
model. While we do not always choose correctly, statistical inference gives us tools to control and
evaluate how often these errors occur.
In Section 17.1, we worked with the randomization distribution to understand the distribution of
$\hat{p}_1 - \hat{p}_2$ when the null hypothesis $H_0: p_1 - p_2 = 0$ is true. Now, through bootstrapping, we study the
variability of $\hat{p}_1 - \hat{p}_2$ without assuming the null hypothesis is true.
Figure 17.2: Creating two populations from which to take each of the bootstrap samples.
As before, once the population is estimated, we can randomly resample observations to create boot-
strap samples, as seen in Figure 17.3.
282 CHAPTER 17. BOOTSTRAP CI FOR THE DIFFERENCE IN PROPORTIONS
Figure 17.3: Taking each bootstrap sample from the estimated population.
The variability of the statistic (the difference in sample proportions) can be calculated by taking one
bootstrap resample from Sample 1 and one bootstrap resample from Sample 2 and calculating the
difference in the bootstrap proportions.
Figure 17.4: For example, the first bootstrap resamples from Sample 1 and Sample 2 provide resample
proportions of 2/7 and 5/9, respectively.
As always, the variability of the difference in proportions can only be estimated by repeated simulations,
in this case, repeated bootstrap resamples. Figure 17.5 shows multiple bootstrap differences
calculated for each of the repeated bootstrap samples.
Figure 17.5: For each pair of bootstrap samples, we calculate the difference in sample proportions
Repeated bootstrap simulations lead to a bootstrap sampling distribution of the statistic of interest,
here the difference in sample proportions. Figure 17.6 visualizes the process and Figure 17.7 shows
1,000 bootstrap differences in proportions for the CPR data. Note that the CPR data includes 40
and 50 people in the respective groups, and the illustrated example includes 7 and 9 people in the
two groups. Accordingly, the variability in the distribution of sample proportions is higher for the
illustrated example. As you will see in the mathematical models discussed in Section 17.3, larger
sample sizes lead to smaller standard errors for a difference in proportions.
Figure 17.6: The differences in each bootstrapped pair of proportions are combined to create the sampling
distribution of the differences in proportions.
Figure 17.7: A histogram of differences in proportions from 1,000 bootstrap simulations of the CPR data.
Note that because the CPR data has a larger sample size than the illustrated example, the variability of the
difference in proportions is much smaller with the CPR histogram.
An interval built directly from the percentiles of the bootstrapped differences is called the
percentile interval. Here we calculate the 90% confidence interval by finding the 5th and
95th percentiles of the bootstrapped differences. The 5th percentile is -0.032 and the 95th
percentile is 0.284. The result is: we are 90% confident that, in the population, the
probability of survival for individuals receiving blood thinners after CPR is between
0.032 lower and 0.284 higher than for those who did not receive blood thinners. Because the interval
contains zero, we do not have definitive evidence of the effect of blood thinners, one way or another.
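The percentile interval can be sketched as follows. This is a minimal Python illustration (not the text's R workflow) that resamples each group with replacement and reads off the 5th and 95th percentiles. The function name `bootstrap_differences` and the seed are assumptions made for the example.

```python
import random
import statistics


def bootstrap_differences(reps=1_000, seed=17):
    """Bootstrap the difference in survival proportions for the CPR data."""
    rng = random.Random(seed)
    treatment = [1] * 14 + [0] * 26  # 14 of 40 survived
    control = [1] * 11 + [0] * 39    # 11 of 50 survived
    diffs = []
    for _ in range(reps):
        # Resample each group separately, with replacement, at its original size.
        boot_t = rng.choices(treatment, k=len(treatment))
        boot_c = rng.choices(control, k=len(control))
        diffs.append(sum(boot_t) / len(boot_t) - sum(boot_c) / len(boot_c))
    return diffs


diffs = bootstrap_differences()
# quantiles(..., n=20) returns 19 cut points: the first is the 5th
# percentile and the last is the 95th percentile.
cuts = statistics.quantiles(diffs, n=20)
lower, upper = cuts[0], cuts[-1]
print(f"90% percentile interval: ({lower:.3f}, {upper:.3f})")
```

Because bootstrap resamples are random, the endpoints will differ slightly from the -0.032 and 0.284 reported above, but they should be close.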
Figure 17.8: The CPR data is bootstrapped 1,000 times. Each simulation creates a sample from the original
data where the probability of survival in the treatment group is $\hat{p}_T = 14/40$ and the probability of survival in
the control group is $\hat{p}_C = 11/50$.
Alternatively, we can use the variability in the bootstrapped differences to calculate a standard error of
the difference. The resulting interval is called the SE interval. Section 17.3 details the mathematical
model for the standard error of the difference in sample proportions, but the bootstrap distribution
typically does an excellent job of estimating the variability of the sampling distribution of the sample
statistic.
$$SE(\hat{p}_T - \hat{p}_C) \approx SE(\hat{p}_{T,boot} - \hat{p}_{C,boot}) = 0.098$$
The variability of the bootstrapped differences in proportions was calculated in R using the sd() function, but any
statistical software can calculate the standard deviation of the differences, which is exactly the quantity
we use to approximate the standard error.
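For readers working outside R, the same calculation might look like the following Python sketch, where `statistics.stdev` plays the role of sd(). The helper name `bootstrap_se` and the seed are illustrative assumptions.

```python
import random
import statistics


def bootstrap_se(reps=1_000, seed=2024):
    """Standard deviation of bootstrapped differences in proportions (CPR data)."""
    rng = random.Random(seed)
    treatment = [1] * 14 + [0] * 26  # 14 of 40 survived
    control = [1] * 11 + [0] * 39    # 11 of 50 survived
    diffs = []
    for _ in range(reps):
        boot_t = rng.choices(treatment, k=40)
        boot_c = rng.choices(control, k=50)
        diffs.append(sum(boot_t) / 40 - sum(boot_c) / 50)
    # The spread of the bootstrapped differences estimates the standard error.
    return statistics.stdev(diffs)


se_boot = bootstrap_se()
print(f"Bootstrap SE: {se_boot:.3f}")
```

The value changes a little from run to run, so it need not match 0.098 exactly.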
Note that we do not know the true distribution of $\hat{p}_T - \hat{p}_C$, so we will use a rough approximation to find
a confidence interval for $p_T - p_C$. As seen in the bootstrap histograms, the shape of the distribution is
roughly symmetric and bell-shaped. So for a rough approximation, we will apply the 68-95-99.7 rule
which tells us that 95% of observed differences should be roughly no farther than 2 SE from the true
parameter (difference in proportions). A 95% confidence interval for $p_T - p_C$ is given by:
$$\hat{p}_T - \hat{p}_C \pm 2 \cdot SE \quad \to \quad 14/40 - 11/50 \pm 2 \cdot 0.098 \quad \to \quad (-0.067, 0.327)$$
We are 95% confident that the true value of 𝑝𝑇 − 𝑝𝐶 is between -0.067 and 0.327. Again, the wide
confidence interval that contains zero indicates that the study provides very little evidence about the
effectiveness of blood thinners. For other percentages, e.g., a 90% bootstrap SE confidence interval, we
will use quantiles given by the standard normal distribution, as seen in Section 13.2 and Figure 13.8.
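As a sketch of how other confidence levels would work, the Python snippet below replaces the multiplier 2 with a standard normal quantile. It assumes the bootstrap SE of 0.098 reported above. The function name `se_interval` is our own.

```python
from statistics import NormalDist


def se_interval(point_estimate, se, confidence):
    """SE interval: point estimate plus or minus z* times SE."""
    # z* leaves (1 - confidence) / 2 in each tail of the standard normal.
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return point_estimate - z_star * se, point_estimate + z_star * se


point = 14 / 40 - 11 / 50   # observed difference, 0.13
boot_se = 0.098             # bootstrap SE reported in the text
low90, high90 = se_interval(point, boot_se, 0.90)
print(f"90% bootstrap SE interval: ({low90:.3f}, {high90:.3f})")
```

For 90% confidence the multiplier is about 1.645 rather than 2, so the interval is narrower than the 95% interval computed above.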
17.3. MATHEMATICAL MODEL FOR THE DIFFERENCE IN PROPORTIONS 285
Figure 17.9: One hypothetical population with parameter value $p_1 - p_2 = 0.47$. Twenty-five different studies,
all of which led to a different point estimate, SE, and confidence interval. The study at hand is one of the
horizontal lines (hopefully a blue line!).
The choice of 95% or 90% or even 99% as a confidence level is admittedly somewhat arbitrary; however,
it is related to the logic we used when deciding that a p-value should be declared as “discernible” if
it is lower than 0.05 (or 0.10 or 0.01, respectively). Indeed, one can show mathematically that a 95%
confidence interval and a two-sided hypothesis test at a cutoff of 0.05 will provide the same conclusion
when the same data and mathematical tools are applied for the analysis. A full derivation of the
explicit connection between confidence intervals and hypothesis tests is beyond the scope of this text.
The difference $\hat{p}_1 - \hat{p}_2$ can be modeled using a normal distribution when
1. Independence (extended). The data are independent within and between the two
groups. Generally this is satisfied if the data come from two independent random
samples or if the data come from a randomized experiment.
2. Success-failure condition. The success-failure condition holds for both groups,
where we check successes and failures in each group separately. That is, we should
have at least 10 successes and 10 failures in each of the two groups.
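When these conditions are satisfied, the standard error of $\hat{p}_1 - \hat{p}_2$ is:
$$SE(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}}$$
where $p_1$ and $p_2$ represent the population proportions, and $n_1$ and $n_2$ represent the
sample sizes.
Note that in most cases, the standard error is approximated using the observed data:
$$SE(\hat{p}_1 - \hat{p}_2) \approx \sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}$$
where $\hat{p}_1$ and $\hat{p}_2$ represent the observed sample proportions, and $n_1$ and $n_2$ represent
the sample sizes.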
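Recall that the margin of error is a multiple of the standard error. The margin of error for $\hat{p}_1 - \hat{p}_2$ can
be directly obtained from $SE(\hat{p}_1 - \hat{p}_2)$ as $z^\star \times SE$, which gives the confidence interval:
$$\text{point estimate} \pm z^\star \times SE$$
$$(\hat{p}_1 - \hat{p}_2) \pm z^\star \times \sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}$$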
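When the conditions for the normal model are met, the variability of the difference
in proportions, $\hat{p}_1 - \hat{p}_2$, is well described by:
$$SE(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}}$$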
EXAMPLE
We reconsider the experiment for patients who underwent cardiopulmonary resuscitation (CPR)
for a heart attack and were subsequently admitted to a hospital. These patients were randomly
divided into a treatment group where they received a blood thinner or the control group where
they did not receive a blood thinner. The outcome variable of interest was whether the patients
survived for at least 24 hours. The results are shown in Table 14.2. Check whether we can
model the difference in sample proportions using the normal distribution.
We first check for independence: since this is a randomized experiment, it seems reasonable to
assume that the observations are independent. Next, we check the success-failure condition for
each group. We have at least 10 successes and 10 failures in each experiment arm (11, 14, 39,
26), so this condition is also satisfied. With both conditions satisfied, the difference in sample
proportions can be reasonably modeled using a normal distribution for these data.
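The success-failure check in this example can be written as a small helper. The Python sketch below assumes the CPR counts from the example. The function name `success_failure_ok` and the dictionary layout are hypothetical conveniences.

```python
def success_failure_ok(successes, n, minimum=10):
    """True when a group has at least `minimum` successes and failures."""
    failures = n - successes
    return successes >= minimum and failures >= minimum


# (survivors, group size) for each arm of the CPR experiment.
groups = {"treatment": (14, 40), "control": (11, 50)}
checks = {name: success_failure_ok(s, n) for name, (s, n) in groups.items()}

for name, ok in checks.items():
    print(f"{name}: success-failure condition met? {ok}")
```

Independence still has to be judged from the study design (here, random assignment); the code only covers the counting condition.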
EXAMPLE
Create and interpret a 90% confidence interval of the difference for the survival rates in the
CPR study.
We’ll use $p_T$ for the survival rate in the treatment group and $p_C$ for the control group:
$$\hat{p}_T - \hat{p}_C = \frac{14}{40} - \frac{11}{50} = 0.35 - 0.22 = 0.13$$
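We use the standard error formula previously provided. As with the one-sample proportion
case, we use the sample estimates of each proportion in the formula in the confidence interval
context:
$$SE \approx \sqrt{\frac{0.35(1 - 0.35)}{40} + \frac{0.22(1 - 0.22)}{50}} = 0.095$$
For a 90% confidence interval, we use $z^\star = 1.65$:
$$\text{point estimate} \pm z^\star \times SE$$
$$0.13 \pm 1.65 \times 0.095$$
$$(-0.027, 0.287)$$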
We are 90% confident that the chance of survival for individuals receiving blood thinners is between
2.7 percentage points lower and 28.7 percentage points higher than for those in the control group. Because 0%
is contained in the interval, we do not have enough information to say whether blood thinners
help or harm heart attack patients who have been admitted after they have undergone CPR.
Note, the problem was set up as 90% to indicate that there was not a need for a high level
of confidence (such as 95% or 99%). A lower degree of confidence increases potential for error,
but it also produces a narrower interval.
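The interval in this example can be reproduced with a few lines of code. The sketch below is in Python rather than R, and the function name `two_prop_ci` is an assumption for illustration. It applies the standard error formula from this section with the sample proportions plugged in.

```python
from math import sqrt
from statistics import NormalDist


def two_prop_ci(x1, n1, x2, n2, confidence=0.90):
    """Normal-model confidence interval for p1 - p2 from success counts."""
    p1, p2 = x1 / n1, x2 / n2
    # Standard error using the observed sample proportions.
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = p1 - p2
    return diff - z_star * se, diff + z_star * se, se


# Treatment: 14 of 40 survived; control: 11 of 50 survived.
low, high, se = two_prop_ci(14, 40, 11, 50, confidence=0.90)
print(f"SE = {se:.3f}, 90% CI = ({low:.3f}, {high:.3f})")
```

The code uses the unrounded multiplier (about 1.645) instead of 1.65, which changes the endpoints only in the fourth decimal place.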
GUIDED PRACTICE
A 5-year experiment was conducted to evaluate the effectiveness of fish oils on reducing
cardiovascular events, where each subject was randomized into one of two treatment
groups (Manson et al. 2019). We’ll consider heart attack outcomes in the patients
listed in Table 17.2.
Create a 95% confidence interval for the effect of fish oils on heart attacks for patients
who are well-represented by those in the study. Also interpret the interval in the context
of the study.2
Table 17.2: Results for the study on n-3 fatty acid supplement and related health benefits.
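When the null hypothesis is that the proportions are equal, use the pooled proportion
($\hat{p}_{pool}$) of successes to verify the success-failure condition and estimate the standard
error:
$$\hat{p}_{pool} = \frac{\text{number of successes}}{\text{number of cases}} = \frac{\hat{p}_1 n_1 + \hat{p}_2 n_2}{n_1 + n_2}$$
Here $\hat{p}_1 n_1$ is the number of successes in sample 1 and $\hat{p}_2 n_2$ is the number of successes in sample 2.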
2 Because the patients were randomized, the subjects are independent, both within and between the two groups.
The success-failure condition is also met for both groups as all counts are at least 10. This satisfies the conditions
necessary to model the difference in proportions using a normal distribution. Compute the sample proportions
($\hat{p}_{\text{fish oil}} = 0.0112$, $\hat{p}_{\text{placebo}} = 0.0155$), point estimate of the difference ($0.0112 - 0.0155 = -0.0043$), and standard error
$SE = \sqrt{\frac{0.0112 \times 0.9888}{12933} + \frac{0.0155 \times 0.9845}{12938}} = 0.00143$. Next, plug the values into the general formula for a confidence interval,
where we’ll use a 95% confidence level with $z^\star = 1.96$: $-0.0043 \pm 1.96 \times 0.00143 = (-0.0071, -0.0015)$. We are 95%
confident that fish oils decrease heart attacks by 0.15 to 0.71 percentage points (off of a baseline of about 1.55%) over
a 5-year period for subjects who are similar to those in the study. Because the interval is entirely below 0, and the
treatment was randomly assigned, the data provide strong evidence that fish oil supplements reduce heart attacks in
patients like those in the study.
The Z score is the ratio of the observed difference between the two sample proportions to the
expected variability of that difference.
$$Z = \frac{(\hat{p}_1 - \hat{p}_2) - 0}{\sqrt{\hat{p}_{pool}(1 - \hat{p}_{pool})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$
When the null hypothesis is true and the conditions are met, Z has a standard normal
distribution. See the box below for calculation of the pooled proportion of successes.
Conditions:
• Independent observations
• Large samples: ($n_1 p_1 \geq 10$ and $n_1(1 - p_1) \geq 10$ and $n_2 p_2 \geq 10$ and $n_2(1 - p_2) \geq 10$)
• Check conditions using: ($n_1 \hat{p}_{pool} \geq 10$ and $n_1(1 - \hat{p}_{pool}) \geq 10$ and $n_2 \hat{p}_{pool} \geq 10$
and $n_2(1 - \hat{p}_{pool}) \geq 10$)
A mammogram is an X-ray procedure used to check for breast cancer. Whether mammograms should
be used is part of a controversial discussion, and it’s the topic of our next example where we learn
about 2-proportion hypothesis tests when 𝐻0 is 𝑝1 − 𝑝2 = 0 (or equivalently, 𝑝1 = 𝑝2 ).
A 30-year study was conducted with nearly 90,000 participants who identified as female. During a
5-year screening period, each participant was randomized to one of two groups: in the first group,
participants received regular mammograms to screen for breast cancer, and in the second group,
participants received regular non-mammogram breast cancer exams. No intervention was made during
the following 25 years of the study, and we’ll consider death resulting from breast cancer over the full
30-year period. Results from the study are summarized in Table 17.3.
If mammograms are much more effective than non-mammogram breast cancer exams, then we would
expect to see additional deaths from breast cancer in the control group. On the other hand, if
mammograms are not as effective as regular breast cancer exams, we would expect to see an increase
in breast cancer deaths in the mammogram group.
GUIDED PRACTICE
Is this study an experiment or an observational study?3
3 This is an experiment. Patients were randomized to receive mammograms or a standard breast cancer exam. We
GUIDED PRACTICE
Set up hypotheses to test whether there was a difference in breast cancer deaths in the
mammogram and control groups.4
The research question describing mammograms is set up to address specific hypotheses (in contrast
to a confidence interval for a parameter). In order to fully take advantage of the hypothesis testing
structure, we assess the randomness under the condition that the null hypothesis is true (as we always
do for hypothesis testing). Using the data from Table 17.3, we will check the conditions for using a
normal distribution to analyze the results of the study using a hypothesis test.
$$\hat{p}_{pool} = \frac{\text{number of patients who died from breast cancer in the entire study}}{\text{number of patients in the entire study}}$$
$$= \frac{500 + 505}{500 + 44{,}425 + 505 + 44{,}405}$$
$$= 0.0112$$
This proportion is an estimate of the breast cancer death rate across the entire study, and it’s our
best estimate of the proportions 𝑝𝑀𝐺𝑀 and 𝑝𝐶 if the null hypothesis is true that 𝑝𝑀𝐺𝑀 = 𝑝𝐶 . We
will also use this pooled proportion when computing the standard error.
EXAMPLE
Is it reasonable to model the difference in proportions using a normal distribution in this study?
Because the patients were randomized, observations can be assumed to be independent, both
within each group and between treatment groups. We also must check the success-failure
condition for each group. Under the null hypothesis, the proportions 𝑝𝑀𝐺𝑀 and 𝑝𝐶 are equal,
so we check the success-failure condition with our best estimate of these values under 𝐻0 , the
pooled proportion from the two samples, $\hat{p}_{pool} = 0.0112$:
$$\hat{p}_{pool} \times n_{MGM} = 0.0112 \times 44{,}925 = 503$$
$$(1 - \hat{p}_{pool}) \times n_{MGM} = 0.9888 \times 44{,}925 = 44{,}422$$
$$\hat{p}_{pool} \times n_C = 0.0112 \times 44{,}910 = 503$$
$$(1 - \hat{p}_{pool}) \times n_C = 0.9888 \times 44{,}910 = 44{,}407$$
The success-failure condition is satisfied since all values are at least 10. With both conditions
satisfied, we can safely model the difference in proportions using a normal distribution.
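The pooled version of the success-failure check can be sketched as follows. This Python snippet assumes the mammogram study counts quoted above. The variable names are our own.

```python
# Deaths from breast cancer and group sizes in the mammogram study.
deaths_mgm, n_mgm = 500, 500 + 44_425
deaths_ctrl, n_ctrl = 505, 505 + 44_405

# Pooled proportion: all deaths divided by all participants.
p_pool = (deaths_mgm + deaths_ctrl) / (n_mgm + n_ctrl)

# Expected successes (deaths) and failures in each group under H0.
expected = {
    "mammogram deaths": p_pool * n_mgm,
    "mammogram survivors": (1 - p_pool) * n_mgm,
    "control deaths": p_pool * n_ctrl,
    "control survivors": (1 - p_pool) * n_ctrl,
}
condition_met = all(count >= 10 for count in expected.values())

print(f"pooled proportion = {p_pool:.4f}")
print(f"success-failure condition met? {condition_met}")
```

Because the code keeps the pooled proportion unrounded, the expected counts differ by a fraction of a patient from the hand calculation, which uses 0.0112.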
In the previous example, the pooled proportion was used to check the success-failure condition5 . In the
next example, we see an additional place where the pooled proportion comes into play: the standard
error calculation.
4 $H_0$: the breast cancer death rate for patients screened using mammograms is the same as the breast cancer
death rate for patients in the control group, $p_{MGM} - p_C = 0$. $H_A$: the breast cancer death rate for patients screened using
mammograms is different than the breast cancer death rate for patients in the control group, $p_{MGM} - p_C \neq 0$.
5 For an example of a two-proportion hypothesis test that does not require the success-failure condition to be met,
EXAMPLE
Compute the point estimate of the difference in breast cancer death rates in the two groups,
and use the pooled proportion $\hat{p}_{pool} = 0.0112$ to calculate the standard error.
$$\hat{p}_{MGM} - \hat{p}_C = \frac{500}{500 + 44{,}425} - \frac{505}{505 + 44{,}405} = 0.01113 - 0.01125 = -0.00012$$
The breast cancer death rate in the mammogram group was 0.012 percentage points lower than in the control
group. Next, the standard error is calculated using the pooled proportion, $\hat{p}_{pool}$:
$$SE = \sqrt{\frac{\hat{p}_{pool}(1 - \hat{p}_{pool})}{n_{MGM}} + \frac{\hat{p}_{pool}(1 - \hat{p}_{pool})}{n_C}} = 0.00070$$
EXAMPLE
Using the point estimate $\hat{p}_{MGM} - \hat{p}_C = -0.00012$ and standard error $SE = 0.00070$, calculate
a p-value for the hypothesis test and write a conclusion.
The Z score is $Z = (-0.00012 - 0) / 0.00070 = -0.17$. The lower tail area is 0.4325, which we double
to get the p-value: 0.8650. Because this p-value
is larger than 0.05, we do not reject the null hypothesis. That is, a difference in breast cancer
death rates at least this large would be quite likely to occur just by chance if the null hypothesis is true. Thus, we
do not observe benefits or harm from mammograms relative to a regular breast exam.
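The two-sided p-value can be sketched in R with `pnorm()`, using the rounded point estimate and standard error from the example. Because the Z score is not rounded to two decimals here, the result is close to, but not identical to, the 0.8650 reported above.

```r
# Rounded values quoted in the example
point_est <- -0.00012
se        <- 0.00070

# Z score under H0 (null value of the difference is 0)
z <- (point_est - 0) / se

# Two-sided p-value: double the tail area beyond |z|
p_value <- 2 * pnorm(-abs(z))

z
p_value
p_value > 0.05  # TRUE means we do not reject H0 at the 0.05 level
```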
Can we conclude that mammograms have no benefits or harm? Here are a few considerations to keep
in mind when reviewing the mammogram study as well as any other medical study:
• We do not reject the null hypothesis, which means we do not have sufficient evidence to conclude
that mammograms reduce or increase breast cancer deaths.
• If mammograms are helpful or harmful, the data suggest the effect isn’t very large.
• Are mammograms more or less expensive than a non-mammogram breast exam? If one option
is much more expensive than the other and does not offer clear benefits, then we should lean
towards the less expensive option.
• The study’s authors also found that mammograms led to over-diagnosis of breast cancer, which
means some breast cancers were found (or thought to be found) but that these cancers would
not cause symptoms during patients’ lifetimes. That is, something else would kill the patient
before breast cancer symptoms appeared. This means some patients may have been treated for
breast cancer unnecessarily, and this treatment is another cost to consider. It is also important
to recognize that over-diagnosis can cause unnecessary physical or emotional harm to patients.
These considerations highlight the complexity around medical care and treatment recommendations.
Experts and medical boards who study medical treatments use considerations like those above to
provide their best recommendation based on the current evidence.
292 CHAPTER 17. BOOTSTRAP CI FOR THE DIFFERENCE IN PROPORTIONS
17.4.1 Summary
When the parameter of interest is the difference in population proportions across two groups, ran-
domization tests, bootstrapping, and mathematical modeling can be applied. For confidence intervals,
bootstrapping from each group separately will provide a sampling distribution for the difference in
sample proportions; the mathematical model shows a similar distributional shape as long as the sam-
ple size is large enough to fulfill the success-failure conditions and so that the data are representative
of the entire population. Keep in mind that some datasets will produce a confidence interval which
does not capture the true parameter; this is the nature of variability! Over your lifetime, about 95%
of the 95% confidence intervals you create will capture the parameter of interest, and about 5% won’t. For
hypothesis testing, repeated randomization of the explanatory variable creates a null distribution of
differences in sample proportions that could have occurred under the null hypothesis. Randomization
and the mathematical model will have similar null distributions, as long as the sample size is large
enough to fulfill the success-failure conditions.
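To make the bootstrap idea in this summary concrete, here is a minimal sketch that resamples each group separately, using the mammogram counts from the earlier example. It assumes that drawing a binomial count with the observed sample proportion is an acceptable stand-in for resampling the individual 0/1 outcomes with replacement; the seed, the number of resamples, and the variable names are arbitrary choices made for illustration.

```r
set.seed(17)  # arbitrary seed so the sketch is reproducible

# Observed data from the mammogram example
n_mgm     <- 44925
n_c       <- 44910
p_hat_mgm <- 500 / n_mgm
p_hat_c   <- 505 / n_c

# Bootstrap each group separately: a resample of n binary outcomes
# has a Binomial(n, p_hat) number of successes
n_boot   <- 1000
boot_mgm <- rbinom(n_boot, size = n_mgm, prob = p_hat_mgm) / n_mgm
boot_c   <- rbinom(n_boot, size = n_c, prob = p_hat_c) / n_c
boot_diffs <- boot_mgm - boot_c

# Percentile 95% bootstrap confidence interval for the difference
ci <- quantile(boot_diffs, probs = c(0.025, 0.975))
ci
sd(boot_diffs)  # bootstrap estimate of the standard error
```

The bootstrap standard error should land near the 0.00070 found with the mathematical model, which is the similarity between methods that the summary describes.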
17.4.2 Terms
The terms introduced in this chapter are presented in Table 17.4. If you’re not sure what some of
these terms mean, we recommend you go back in the text and review their definitions. You should be
able to easily spot them as bolded text.
17.5 Exercises
a. In both words and symbols provide the parameter and statistic of interest for this study.
Do you know the numerical value of either the parameter or statistic of interest? If so,
provide the numerical value.
b. The histogram above provides the sampling distribution (under randomization) for
$\hat{p}_{\text{Asian-Indian}} - \hat{p}_{\text{Chinese}}$ under repeated null randomizations ($\hat{p}$ is the proportion in the
sample who are current smokers). Estimate the standard error of $\hat{p}_{\text{Asian-Indian}} - \hat{p}_{\text{Chinese}}$
based on the randomization histogram.
c. Consider the hypothesis test to determine if there is a difference in proportion of Asian-
Indian Americans as compared to Chinese Americans who are current smokers. Write out
the null and alternative hypotheses, estimate a p-value using the randomization histogram,
and conclude the test in the context of the problem.
a. In both words and symbols provide the parameter and statistic of interest for this study.
Do you know the numerical value of either the parameter or statistic of interest? If so,
provide the numerical value.
b. The histogram above provides the sampling distribution (under randomization) for
$\hat{p}_{malaria} - \hat{p}_{control}$ under repeated null randomizations ($\hat{p}$ is the proportion of children in
the sample who contracted malaria). Estimate the standard error of $\hat{p}_{malaria} - \hat{p}_{control}$
based on the randomization histogram.
c. Consider the hypothesis test constructed to show a lower proportion of children contracting
malaria on the malaria vaccine as compared to the control vaccine. Write out the null and
alternative hypotheses, estimate a p-value using the randomization histogram, and conclude
the test in the context of the problem.
17.5. EXERCISES 295
5. COVID-19 and degree completion. A 2021 Gallup poll surveyed 3,941 students pursuing a
bachelor’s degree and 2,064 students pursuing an associate degree (students were not randomly
selected but were weighted so as to represent a random selection of currently enrolled US college
students). The poll found that 51% of the bachelor’s degree students and 44% of associate degree
students said that the COVID-19 pandemic will negatively impact their ability to complete the
degree. (Gallup 2021a)
Below are two histograms generated with different computational approaches (both use 1,000
repetitions) to research questions which could be asked of these data. One of the histograms
can be used to do a randomization test on whether the proportions of bachelor’s and associate
students who think the COVID-19 pandemic will negatively impact their ability to complete the
degree are different. The other histogram is a bootstrap distribution used to quantify the difference in the
proportions of bachelor’s and associate students who feel this way.
a. Are the center and standard error of the two graphs approximately the same? Explain.
b. Write a research question that can be addressed using the histogram generated with com-
putational method A.
c. Write a research question that can be addressed using the histogram generated with
computational method B.
6. Renewable energy. A 2021 Gallup poll surveyed 5,447 randomly sampled US adults who are
Republican (or Republican leaning) and 7,962 who are Democrats (or Democrat leaning). 31%
of Republicans and 81% of Democrats said “government regulations are necessary to encourage
businesses and consumers to rely more on renewable energy sources”. (Gallup 2021a)
Below are two histograms generated with different computational approaches (both use 1,000
repetitions) to research questions which could be asked of these data. One of the histograms can
be used to do a randomization test on whether the proportions of Republicans and Democrats
who think government regulations are necessary to encourage businesses and consumers to rely
more on renewable energy sources are different. The other histogram is a bootstrap distribution
used to quantify the difference in the proportions of Republicans and Democrats who agree with
this statement.
a. Are the center and standard error of the two graphs approximately the same? Explain.
b. Write a research question that can be addressed using the histogram generated with
computational method A.
c. Write a research question that can be addressed using the histogram generated with
computational method B.
7. HIV in sub-Saharan Africa. In July 2008 the US National Institutes of Health announced
that it was stopping a clinical study early because of unexpected results. The study population
consisted of HIV-infected women in sub-Saharan Africa who had been given single dose Nevirapine
(a treatment for HIV) while giving birth, to prevent transmission of HIV to the infant. The
study was a randomized comparison of continued treatment of a woman (after successful
childbirth) with Nevirapine vs Lopinavir, a second drug used to treat HIV. 240 women participated
in the study; 120 were randomized to each of the two treatments. Twenty-four weeks after
starting the study treatment, each woman was tested to determine if the HIV infection was
becoming worse (an outcome called virologic failure). Twenty-six of the 120 women treated with
Nevirapine experienced virologic failure, while 10 of the 120 women treated with the other drug
experienced virologic failure. (Lockman et al. 2007)
a. Create a two-way table presenting the results of this study.
b. State appropriate hypotheses to test for difference in virologic failure rates between treat-
ment groups.
c. Complete the hypothesis test and state an appropriate conclusion. (Reminder: Verify any
necessary conditions for the test.)
9. National Health Plan. A Kaiser Family Foundation poll for US adults in 2019 found that
79% of Democrats, 55% of Independents, and 24% of Republicans supported a generic “National
Health Plan”. There were 347 Democrats, 298 Republicans, and 617 Independents surveyed. 79%
of 347 Democrats and 55% of 617 Independents support a National Health Plan. (Kaiser Family
Foundation 2019)
a. Calculate a 95% confidence interval for the difference between the proportion of Democrats
and Independents who support a National Health Plan (𝑝𝐷 − 𝑝𝐼 ), and interpret it in this
context. We have already checked conditions for you.
b. True or false: If we had picked a random Democrat and a random Independent at the time
of this poll, it is more likely that the Democrat would support the National Health Plan
than the Independent.
10. Sleep deprivation, CA vs. OR, confidence interval. According to a report on sleep
deprivation by the Centers for Disease Control and Prevention, the proportion of California
residents who reported insufficient rest or sleep during each of the preceding 30 days is 8.0%,
while this proportion is 8.8% for Oregon residents. These data are based on simple random
samples of 11,545 California and 4,691 Oregon residents. Calculate a 95% confidence interval
for the difference between the proportions of Californians and Oregonians who are sleep deprived
and interpret it in context of the data. (CDC 2008)
11. Gender pay gap in medicine. A study examined the average pay for men and women entering
the workforce as doctors for 21 different positions. (Lo Sasso et al. 2011)
a. If each gender was equally paid, then we would expect about half of those positions to have
men paid more than women and women would be paid more than men in the other half of
positions. Write appropriate hypotheses to test this scenario.
b. Men were, on average, paid more in 19 of those 21 positions. Complete a hypothesis test
using your hypotheses from part (a).
12. Sleep deprivation, CA vs. OR, hypothesis test. A CDC report on sleep deprivation rates
shows that the proportion of California residents who reported insufficient rest or sleep during
each of the preceding 30 days is 8.0%, while this proportion is 8.8% for Oregon residents. These
data are based on simple random samples of 11,545 California and 4,691 Oregon residents.
a. Conduct a hypothesis test to determine if these data provide strong evidence that the rate
of sleep deprivation is different for the two states. (Reminder: Check conditions)
b. It is possible the conclusion of the test in part (a) is incorrect. If this is the case, what type
of error was made?
Suppose we are interested in estimating the difference in yawning rates between the control and
treatment groups using a confidence interval. Explain why we cannot construct such an interval
using the normal approximation. What might go wrong if we constructed the confidence interval
despite this problem?
6 The yawn data used in this exercise can be found in the openintro R package.
14. Heart transplant success. The Stanford University Heart Transplant Study was conducted to
determine whether an experimental heart transplant program increased lifespan. Each patient
entering the program was officially designated a heart transplant candidate, meaning that he was
gravely ill and might benefit from a new heart. Patients were randomly assigned into treatment
and control groups. Patients in the treatment group received a transplant, and those in the
control group did not. The visualization below displays how many patients survived and died
in each group.7 (Turnbull, Brown, and Hu 1974)
Suppose we are interested in estimating the difference in survival rate between the control and
treatment groups using a confidence interval. Explain why we cannot construct such an interval
using the normal approximation. What might go wrong if we constructed the confidence interval
despite this problem?
15. Government shutdown. The United States federal government shutdown of 2018–2019 oc-
curred from December 22, 2018 until January 25, 2019, a span of 35 days. A Survey USA poll
of 614 randomly sampled Americans during this time period reported that 48% of those who
make less than $40,000 per year and 55% of those who make $40,000 or more per year said
the government shutdown has not at all affected them personally. A 95% confidence interval
for (𝑝<40K − 𝑝≥40K ), where 𝑝 is the proportion of those who said the government shutdown has
not at all affected them personally, is (-0.16, 0.02). Based on this information, determine if the
following statements are true or false, and explain your reasoning if you identify the statement
as false. (Survey USA 2019)
a. At the 5% discernibility level, the data provide convincing evidence of a real difference in
the proportion who are not affected personally between Americans who make less than
$40,000 annually and Americans who make $40,000 or more annually.
b. We are 95% confident that 16% more to 2% fewer Americans who make less than $40,000
per year are not at all personally affected by the government shutdown compared to those
who make $40,000 or more per year.
c. A 90% confidence interval for (𝑝<40K −𝑝≥40K ) would be wider than the (−0.16, 0.02) interval.
d. A 95% confidence interval for (𝑝≥40K − 𝑝<40K ) is (-0.02, 0.16).
7 The heart_transplant data used in this exercise can be found in the openintro R package.
16. Online harassment. A Pew Research poll asked US adults aged 18-29 and 30-49 whether they
have personally experienced harassment online. A 95% confidence interval for the difference
between the proportions of 18-29 year-olds and 30-49 year-olds who have personally experienced
harassment online (𝑝18−29 − 𝑝30−49 ) was calculated to be (0.115, 0.185). Based on this informa-
tion, determine if the following statements are true or false, and explain your reasoning for each
statement you identify as false. (Pew Research Center 2021b)
a. We are 95% confident that the true proportion of 18-29 year-olds who have personally
experienced harassment online is 11.5% to 18.5% lower than the true proportion of 30-49
year-olds who have personally experienced harassment online.
b. We are 95% confident that the true proportion of 18-29 year-olds who have personally
experienced harassment online is 11.5% to 18.5% higher than the true proportion of 30-49
year-olds who have personally experienced harassment online.
c. 95% of random samples will produce 95% confidence intervals that include the true differ-
ence between the population proportions of 18-29 year-olds and 30-49 year-olds who have
personally experienced harassment online.
d. We can conclude that the difference between the proportions of 18-29 year-olds and
30-49 year-olds who have personally experienced harassment online is too large to
plausibly be due to chance, if in fact there is no difference between the two proportions.
e. The 90% confidence interval for (𝑝18−29 − 𝑝30−49 ) cannot be calculated with only the infor-
mation given in this exercise.
17. Decision errors and comparing proportions I. In the following research studies, conclusions
were made based on the data provided. It is always possible that the analysis conclusion could
be wrong, although we will almost never actually know if an error has been made or not. For
each study conclusion, specify which of a Type I or Type II error could have been made, and
state the error in the context of the problem.
a. The malaria vaccine was seen to be effective at lowering the rate of contracting malaria
(when compared to the control vaccine).
b. In the US population, Asian-Indian Americans and Chinese Americans are not observed to
have different proportions of current smokers.
c. There is no evidence to claim a difference in the proportion of Americans who are not
affected personally by a government shutdown when comparing Americans who make less
than $40,000 annually and Americans who make $40,000 or more annually.
18. Decision errors and comparing proportions II. In the following research studies, conclu-
sions were made based on the data provided. It is always possible that the analysis conclusion
could be wrong, although we will almost never actually know if an error has been made or not.
For each study conclusion, specify which of a Type I or Type II error could have been made,
and state the error in the context of the problem.
a. Of registered voters in California, the proportion who report not knowing enough to voice
an opinion on whether they support off shore drilling is different across those who have a
college degree and those who do not.
b. In comparing Californians and Oregonians, there is no evidence to support a difference in
the proportion of each who are sleep deprived.
19. Active learning. A teacher wanting to increase the active learning component of her course is
concerned about student reactions to changes she is planning to make. She conducts a survey
in her class, asking students whether they believe more active learning in the classroom (hands-on
exercises) instead of traditional lecture will help improve their learning. She does this at
the beginning and end of the semester and wants to evaluate whether students’ opinions have
changed over the semester. Can she use the methods we learned in this chapter for this analysis?
Explain your reasoning.
20. An apple a day keeps the doctor away. A physical education teacher at a high school
wanting to increase awareness on issues of nutrition and health asked her students at the be-
ginning of the semester whether they believed the expression “an apple a day keeps the doctor
away”. 40% of the students responded yes. Throughout the semester she started each class
with a discussion of a study highlighting positive effects of eating more fruits and vegetables.
She conducted the same apple-a-day survey at the end of the semester, and this time 60% of
the students responded yes. Can she use a two-proportion method from this section for this
analysis? Explain your reasoning.
21. Malaria vaccine effectiveness, effect size. A randomized controlled trial on malaria vaccine
effectiveness randomly assigned 450 children into either one of two different doses of the malaria
vaccine or a control vaccine. 89 of 292 malaria vaccine and 106 out of 147 control vaccine children
contracted malaria within 12 months after the treatment. (Datoo et al. 2021)
Recall that in order to reject the null hypothesis that the two vaccines (malaria and control) are
equivalent, we’d need the difference in sample proportions to be about 2 standard errors below the
hypothesized value of zero.
Say that the true difference (in the population) is given as 𝛿, the sample sizes are the same in
both groups (𝑛𝑚𝑎𝑙𝑎𝑟𝑖𝑎 = 𝑛𝑐𝑜𝑛𝑡𝑟𝑜𝑙 ), and the true proportion who contract malaria on the control
vaccine is 𝑝𝑐𝑜𝑛𝑡𝑟𝑜𝑙 = 0.7. If you ran your own study (in the future), how likely is it that you
would get a difference in sample proportions that was sufficiently far from zero that you could
reject the null hypothesis under each of the conditions below? (Hint: Use the mathematical model.)
a. 𝛿 = −0.1 and 𝑛𝑚𝑎𝑙𝑎𝑟𝑖𝑎 = 𝑛𝑐𝑜𝑛𝑡𝑟𝑜𝑙 = 20
b. 𝛿 = −0.4 and 𝑛𝑚𝑎𝑙𝑎𝑟𝑖𝑎 = 𝑛𝑐𝑜𝑛𝑡𝑟𝑜𝑙 = 20
c. 𝛿 = −0.1 and 𝑛𝑚𝑎𝑙𝑎𝑟𝑖𝑎 = 𝑛𝑐𝑜𝑛𝑡𝑟𝑜𝑙 = 100
d. 𝛿 = −0.4 and 𝑛𝑚𝑎𝑙𝑎𝑟𝑖𝑎 = 𝑛𝑐𝑜𝑛𝑡𝑟𝑜𝑙 = 100
e. What can you conclude about values of 𝛿 and the sample size?
22. Diabetes and unemployment. A Gallup poll surveyed Americans about their employment
status and whether they have diabetes. The survey results indicate that 1.5% of the 47,774
employed (full or part time) and 2.5% of the 5,855 unemployed 18-29 year-olds have diabetes.
(Gallup 2012)
a. Create a two-way table presenting the results of this study.
b. State appropriate hypotheses to test for difference in proportions of diabetes between em-
ployed and unemployed Americans.
c. The sample difference is about 1%. If we completed the hypothesis test, we would find
that the p-value is very small (about 0), meaning the difference is statistically discernible.
Use this result to explain the difference between statistically discernible and practically
important findings.
Chapter 18
1 For readers not as old as the authors, an iPod is basically an iPhone without any cellular service, assuming it was
Unbeknownst to the participants who were the sellers in the study, the buyers were collaborating with
the researchers to evaluate the influence of different questions on the likelihood of getting the sellers to
disclose the past issues with the iPod. The scripted buyers started with “Okay, I guess I’m supposed
to go first. So you’ve had the iPod for 2 years …” and ended with one of three questions:
• General: What can you tell me about it?
• Positive Assumption: It does not have any problems, does it?
• Negative Assumption: What problems does it have?
The question is the treatment given to the sellers, and the response is whether the question prompted
them to disclose the freezing issue with the iPod. The results are shown in Table 18.1, and the data
suggest that asking “What problems does it have?” was the most effective at getting the seller to
disclose the past freezing issues. However, you should also be asking yourself: could we see these
results due to chance alone if there really is no difference in the question asked, or is this in fact
evidence that some questions are more effective for getting at the truth?
Table 18.1: Summary of the iPod study, where a question was posed to the study participant who acted.
The hypothesis test for the iPod experiment is really about assessing whether there is convincing
evidence that there was a difference in the success rates that each question had on getting the partic-
ipant to disclose the problem with the iPod. In other words, the goal is to check whether the buyer’s
question was independent of whether the seller disclosed a problem.
EXAMPLE
From the experiment, we can compute the proportion of all sellers who disclosed the freezing
problem as 61/219 = 0.2785. If there really is no difference among the questions and 27.85% of
sellers were going to disclose the freezing problem no matter the question they were asked, how
many of the 73 people in the General group would we have expected to disclose the freezing
problem?
We would predict that 0.2785 × 73 = 20.33 sellers would disclose the problem. Obviously we
observed fewer than this, though it is not yet clear if that is due to chance variation or whether
that is because the questions vary in how effective they are at getting to the truth.
304 CHAPTER 18. INFERENCE FOR TWO-WAY TABLES
GUIDED PRACTICE
If the questions were actually equally effective, meaning about 27.85% of respondents
would disclose the freezing issue regardless of what question they were asked, about how
many sellers would we expect to hide the freezing problem from the Positive Assumption
group?2
We can compute the expected number of sellers who we would expect to disclose or hide the freezing
issue for all groups, if the questions had no impact on what they disclosed, using the same strategies
employed in the previous Example and Guided Practice to compute expected counts. These expected
counts were used to construct Table 18.2, which is the same as Table 18.1, except now the expected
counts have been added in parentheses.
Table 18.2: The observed counts and the expected counts for the iPod experiment.
The examples and exercises above provided some help in computing expected counts. In general,
expected counts for a two-way table may be computed using the row totals, column totals, and the
table total. For instance, if there was no difference between the groups, then about 27.85% of each
row should be in the first column:
Looking back to how 0.2785 was computed – as the fraction of sellers who disclosed the freezing issue
(61/219) – these three expected counts could have been computed as
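$$\left(\frac{\text{row 1 total}}{\text{table total}}\right)(\text{column 1 total}) = 20.33$$
$$\left(\frac{\text{row 1 total}}{\text{table total}}\right)(\text{column 2 total}) = 20.33$$
$$\left(\frac{\text{row 1 total}}{\text{table total}}\right)(\text{column 3 total}) = 20.33$$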
This leads us to a general formula for computing expected counts in a two-way table when we would
like to test whether there is strong evidence of an association between the column variable and row
variable.
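To calculate the expected count for the $i^{th}$ row and $j^{th}$ column, compute
$$\text{Expected Count}_{\text{row } i, \text{ col } j} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{\text{table total}}$$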
2 We would expect (1 − 0.2785) × 73 = 52.67. It is okay that this result, like the result from the previous Example, is a
fraction.
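The expected counts for every cell can be computed at once from the table margins. The sketch below uses only totals that appear in the text: 61 of 219 sellers disclosed the problem, and each question group had 73 sellers (the third group size is inferred as 219 − 73 − 73). The layout, with outcomes as rows and questions as columns, is an assumption made to match the row and column formulas above, and the variable names are illustrative.

```r
# Margins of the iPod table, taken from totals quoted in the text
outcome_totals <- c(disclose = 61, hide = 219 - 61)                # row totals
group_totals   <- c(general = 73, positive = 73, negative = 73)    # column totals
table_total    <- sum(group_totals)                                # 219

# Expected count for cell (i, j) = (row i total) * (column j total) / table total
expected <- outer(outcome_totals, group_totals) / table_total
round(expected, 2)
```

Each entry in the `disclose` row matches the 20.33 from the Example, and each entry in the `hide` row matches the 52.67 from the Guided Practice.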
18.1. RANDOMIZATION TEST OF INDEPENDENCE 305
Adding the computed value for each cell gives the chi-squared test statistic, $X^2 = 40.13$.
Is 40.13 a big number? That is, does it indicate that the observed and expected values are really
different? Or is 40.13 a value of the statistic that we would expect to see just due to natural variability?
Previously, we applied the randomization test to the setting where the research question investigated
a difference in proportions. The same idea of shuffling the data under the null hypothesis can be used
in the setting of the two-way table.
As before, the randomized data is used to find a single value for the test statistic (here a chi-squared
statistic). The chi-squared statistic for the randomized two-way table is found by comparing the
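observed and expected counts for each cell in the randomized table. For each cell, compute:
$$\frac{(\text{observed count} - \text{expected count})^2}{\text{expected count}}$$
Adding the computed value for each cell gives the chi-squared test statistic $X^2$:
$$X^2 = \sum_{\text{cells}} \frac{(\text{observed count} - \text{expected count})^2}{\text{expected count}}$$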
Figure 18.1: A histogram of chi-squared statistics from 1,000 simulations produced under the null hypothesis,
$H_0$, where the question is independent of the response. The observed statistic of 40.13 is marked by the red
line. None of the 1,000 simulations had a chi-squared value of at least 40.13. In fact, none of the simulated
chi-squared statistics came anywhere close to the observed statistic!
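A simulation like the one summarized in Figure 18.1 can be sketched as follows. Under the null hypothesis only the table margins matter, so the sketch rebuilds the data from totals quoted in the text (61 of 219 sellers disclosed; three question groups of 73) rather than from the individual cell counts. The helper `chisq_stat()`, the seed, and the variable names are hypothetical choices for illustration, not the code used to produce the figure.

```r
set.seed(18)  # arbitrary seed so the sketch is reproducible

# Rebuild one record per seller from the margins quoted in the text
question <- rep(c("general", "positive", "negative"), each = 73)
response <- rep(c("disclose", "hide"), times = c(61, 219 - 61))

# Hypothetical helper: chi-squared statistic for a two-way table of counts
chisq_stat <- function(tab) {
  expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
  sum((tab - expected)^2 / expected)
}

# Shuffle the responses across sellers to mimic independence under H0,
# then recompute the statistic for each shuffled table
null_stats <- replicate(1000, chisq_stat(table(sample(response), question)))

# Proportion of simulated statistics at least as large as the observed 40.13
mean(null_stats >= 40.13)
```

Shuffling keeps both sets of margins fixed, so each simulated table has the same expected counts and differs only in how the disclosures happen to fall across the three questions.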
Figure 18.2: The chi-squared distribution for differing degrees of freedom. The larger the degrees of freedom,
the longer the right tail extends. The smaller the degrees of freedom, the more peaked the mode on the left
becomes.
The test statistic for assessing the independence between two categorical
variables is $X^2$.
The $X^2$ statistic is a ratio of how the observed counts vary from the expected counts as
compared to the expected counts (which are a measure of how large the sample size is).
When the null hypothesis is true and the conditions are met, $X^2$ has a chi-squared
distribution with $df = (r - 1) \times (c - 1)$, where $r$ is the number of rows and $c$ is the
number of columns in the two-way table.
Conditions:
• Independent observations
• Large samples: at least 5 expected counts in each cell
To bring it back to the example, we can safely assume that the observations are independent, as the
question groups were randomly assigned. Additionally, there are at least 5 expected counts in each cell,
so the conditions for using the chi-squared distribution are met. If the null hypothesis is true (i.e.,
the questions had no impact on the sellers in the experiment), then the test statistic $X^2 = 40.13$ is
expected to follow a chi-squared distribution with 2 degrees of freedom. Using this information, we
can compute the p-value for the test, which is depicted in Figure 18.3.
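The large-sample condition and the degrees of freedom can be checked with a short sketch. It again uses only the margins quoted in the text and assumes a table with two outcome rows and three question columns; the variable names are illustrative.

```r
# Margins quoted in the text: 61 of 219 disclosed; three groups of 73 sellers
outcome_totals <- c(disclose = 61, hide = 219 - 61)
group_totals   <- c(general = 73, positive = 73, negative = 73)
expected <- outer(outcome_totals, group_totals) / sum(group_totals)

# Large-sample condition: every expected count is at least 5
min(expected) >= 5

# Degrees of freedom: (rows - 1) * (columns - 1)
df <- (nrow(expected) - 1) * (ncol(expected) - 1)
df
```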
The software R can be used to find the p-value with the function pchisq(). Just like pnorm(),
pchisq() always gives the area to the left of the cutoff value. Because, in this example, the p-value
is represented by the area to the right of 40.13, we subtract the output of pchisq() from 1.
1 - pchisq(40.13, df = 2)
[1] 1.93e-09
EXAMPLE
Find the p-value and draw a conclusion about whether the question affects the seller’s likelihood
of reporting the freezing problem.
Using a computer, we can compute a very precise value for the tail area above $X^2 = 40.13$ for
a chi-squared distribution with 2 degrees of freedom: 0.000000002.
Using a discernibility level of 𝛼 = 0.05, the null hypothesis is rejected since the p-value is
smaller. That is, the data provide convincing evidence that the question asked did affect a
seller’s likelihood to tell the truth about problems with the iPod.
EXAMPLE
Table 18.4 summarizes the results of an experiment evaluating three treatments for Type 2
Diabetes in patients aged 10-17 who were being treated with metformin. The three treat-
ments considered were continued treatment with metformin (met), treatment with metformin
combined with rosiglitazone (rosi), or a lifestyle intervention program. Each patient had
a primary outcome, which was either a lack of glycemic control (failure) or no lack of that
control (success). What are appropriate hypotheses for this test?
Typically we will use a computer to do the computational work of finding the chi-squared statistic.
However, it is always good to have a sense for what the computer is doing, and in particular, calculating
the values which would be expected if the null hypothesis is true can help to understand the null
hypothesis claim. Additionally, comparing the expected and observed values by eye often gives the
researcher some insight into why the null hypothesis for a given test is or is not rejected.
GUIDED PRACTICE
A chi-squared test for a two-way table may be used to test the hypotheses in the
diabetes Example above. To get a sense for the statistic used in the chi-squared test,
first compute the expected values for each of the six table cells.3
Note, when analyzing 2-by-2 contingency tables (that is, when both variables only have two possible
options), one guideline is to use the two-proportion methods introduced in Chapter 17.
3 The expected count for row one / column one is found by multiplying the row one total (234) and column one
18.3.1 Summary
In this chapter we extended the randomization / bootstrap / mathematical model paradigm to research
questions involving categorical variables. We continued working with one population proportion as well
as the difference in population proportions, but the test of independence allowed for hypothesis testing
on categorical variables with more than two levels. We note that the normal model was an excellent
mathematical approximation to the sampling distribution of sample proportions (or differences in
sample proportions), but that the questions with categorical variables with more than 2 levels required
a new mathematical model, the chi-squared distribution. As seen in Chapter 11, Chapter 12 and
Chapter 13, almost all the research questions can be approached using computational methods (e.g.,
randomization tests or bootstrapping) or using mathematical models. We continue to emphasize the
importance of experimental design in making conclusions about research claims. In particular, recall
that variability can come from different sources (e.g., random sampling vs. random allocation, see
Figure 2.8).
18.3.2 Terms
The terms introduced in this chapter are presented in Table 18.5. If you’re not sure what some of
these terms mean, we recommend you go back in the text and review their definitions. You should be
able to easily spot them as bolded text.
18.4 Exercises
2. Act on climate change. The table below summarizes results from a Pew Research poll which
asked respondents their generation and whether they have personally taken action to help
address climate change within the last year. The differences between generational groups may be
due to chance. Complete the following computations under the null hypothesis of independence
between an individual’s generation and whether they have personally taken action to help address
climate change within the last year. (Pew Research Center 2021a)
                             Response
Generation        Took action   Didn’t take action    Total
Gen Z                     292                  620      912
Millennial                885                2,275    3,160
Gen X                     809                2,709    3,518
Boomer & older          1,276                4,798    6,074
Total                   3,262               10,402   13,664
3. Lizard habitats, data. In order to assess whether habitat conditions are related to the sunlight
choices a lizard makes for resting, Western fence lizards (Sceloporus occidentalis) were observed
across three different microhabitats.4 (Adolph 1990; Asbury and Adolph 2007)
                   sunlight
site         sun   partial   shade   Total
desert        16        32      71     119
mountain      56        36      15     107
valley        42        40      24     106
Total        114       108     110     332
a. If the variables describing the habitat and the amount of sunlight are independent, what
proportion of lizards (total) would be expected in each of the three sunlight categories?
b. Given the proportions of each sunlight condition, how many lizards of each type would you
expect to see in the sun? in the partial sun? in the shade?
c. Compare the observed (original data) and expected (part b.) tables. From a first glance,
does it seem as though the habitat and choice of sunlight may be associated?
d. Regardless of your answer to part (c), is it possible to tell from looking only at the expected
and observed counts whether the two variables are associated?
                       Smoking
ethnicity        don’t smoke   smoke    Total
Asian-Indian           4,150     223    4,373
Chinese                4,457     279    4,736
Filipino               4,303     609    4,912
Total                 12,910   1,111   14,021
a. If the variables on ethnicity and smoking status are independent, estimate the proportion
of individuals (total) who smoke.
b. Given the overall proportion who smoke, how many of each Asian American ethnicity would
you expect to smoke?
c. Compare the observed and expected counts. From a first glance, does it seem as though
the Asian American ethnicity and choice of smoking may be associated?
d. Regardless of your answer to part (c), is it possible to tell from looking only at the expected
and observed counts whether the two variables are associated?
4 The lizard_habitat data used in this exercise can be found in the openintro R package.
5. Lizard habitats, randomize once. In order to assess whether habitat conditions are related
to the sunlight choices a lizard makes for resting, Western fence lizards (Sceloporus occidentalis)
were observed across three different microhabitats. (Adolph 1990; Asbury and Adolph 2007)
Then, the data were randomized once, where sunlight preference was randomly assigned to the
lizards across different sites. The original data are shown on the left and the results of the
randomization are shown on the right.
Recall that the Chi-squared statistic (𝑋²) measures the difference between the expected and
observed counts. Without calculating the actual statistic, report on whether the original data
or the randomized data will have a larger Chi-squared statistic. Explain your choice.
6. Disaggregating Asian American tobacco use, randomize once. In a study that aims
to disaggregate tobacco use across Asian American ethnic groups (Asian-Indian, Chinese, and
Filipino, in comparison to non-Hispanic Whites), respondents were asked whether they smoke
tobacco or not. (Rao et al. 2021) Then, the data were randomized once, where smoking status
was randomly assigned to the participants across different ethnicities. The original data are
shown on the left and the results of the randomization are shown on the right.
Recall that the Chi-squared statistic (𝑋²) measures the difference between the expected and
observed counts. Without calculating the actual statistic, report on whether the original data
or the randomized data will have a larger Chi-squared statistic. Explain your choice.
7. Lizard habitats, randomization test. In order to assess whether habitat conditions are
related to the sunlight choices a lizard makes for resting, Western fence lizards (Sceloporus
occidentalis) were observed across three different microhabitats. (Adolph 1990; Asbury and Adolph
2007) The original data were randomized 1,000 times (sunlight variable randomly assigned to
the observations across different habitats), and the histogram of the Chi-squared statistic on
each randomization is displayed.
a. The histogram above describes the Chi-squared statistics for 1,000 different randomization
datasets. When randomizing the data, is the imposed structure that the variables are
independent or that the variables are associated? Explain.
b. What is the range of plausible values for the randomized Chi-squared statistic?
c. The observed Chi-squared statistic is 68.8 (marked in red on plot). Does the observed value
provide evidence against the null hypothesis? To answer the question, state the null and
alternative hypotheses, approximate the p-value, and conclude the test in the context of
the problem.
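For readers who want to see how a randomization distribution like the one described here could be generated, the following is a minimal sketch in R. It assumes the lizard counts tabulated in exercise 3; the helper `chisq_stat()`, the seed, and the object names are our own illustrative choices, and the resulting histogram will not match the book's figure exactly because the shuffles are random.

```r
set.seed(47)  # arbitrary seed so the sketch is reproducible

# Observed counts from the lizard habitat table (rows: site, columns: sunlight)
counts <- matrix(c(16, 32, 71,
                   56, 36, 15,
                   42, 40, 24),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(site = c("desert", "mountain", "valley"),
                                 sunlight = c("sun", "partial", "shade")))

# Chi-squared statistic computed directly from a table of counts
chisq_stat <- function(tab) {
  expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
  sum((tab - expected)^2 / expected)
}

# Expand the table to one row per lizard
lizards <- as.data.frame(as.table(counts))
lizards <- lizards[rep(seq_len(nrow(lizards)), lizards$Freq), c("site", "sunlight")]

observed_stat <- chisq_stat(table(lizards$site, lizards$sunlight))

# Shuffle the sunlight labels 1,000 times, recomputing the statistic each time
null_stats <- replicate(1000, {
  shuffled <- sample(lizards$sunlight)
  chisq_stat(table(lizards$site, shuffled))
})

observed_stat                      # approximately 68.8
range(null_stats)                  # spread of the statistic under shuffling
mean(null_stats >= observed_stat)  # proportion of shuffles at least as extreme
```

Shuffling one variable while leaving the other fixed keeps both sets of marginal totals unchanged and breaks any link between the two variables, which is the structure the null hypothesis of independence describes.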
8. Disaggregating Asian American tobacco use, randomization test. Understanding cultural
differences in tobacco use across different demographic groups can lead to improved health
care education and treatment. A recent study disaggregated tobacco use across Asian American
ethnic groups including Asian-Indian (n = 4,373), Chinese (n = 4,736), and Filipino (n = 4,912),
in comparison to non-Hispanic Whites (n = 275,025). The number of current smokers in each
group was reported as Asian-Indian (n = 223), Chinese (n = 279), Filipino (n = 609), and
non-Hispanic Whites (n = 50,880). (Rao et al. 2021) The original data were randomized 1,000 times
(smoking status randomly assigned to the observations across ethnicities), and the histogram of
the Chi-squared statistic on each randomization is displayed.
a. The histogram above describes the Chi-squared statistics for 1,000 different randomization
datasets. When randomizing the data, is the imposed structure that the variables are
independent or that the variables are associated? Explain.
b. What is the range of plausible values for the randomized Chi-squared statistic?
c. The observed Chi-squared statistic is 209.42 (marked in red on plot). Does the observed
value provide evidence against the null hypothesis? To answer the question, state the null
and alternative hypotheses, approximate the p-value, and conclude the test in the context
of the problem.
9. Lizard habitats, larger data. In order to assess whether habitat conditions are related to the
sunlight choices a lizard makes for resting, Western fence lizards (Sceloporus occidentalis) were
observed across three different microhabitats. (Adolph 1990; Asbury and Adolph 2007)
Consider the situation where the dataset is 5 times larger than the original data (but has the
same proportional representation in each category). The distribution of lizards in each of the
sites resting in the sun, partial sun, and shade is as follows.
Larger data
                   sunlight
site         sun   partial   shade   Total
desert        80       160     355     595
mountain     280       180      75     535
valley       210       200     120     530
Total        570       540     550   1,660
The larger dataset was randomized 1,000 times (sunlight preference randomly assigned to the
observations across sites), and the histogram of the Chi-squared statistic on each randomization
is displayed.
a. The histogram above describes the Chi-squared statistics for 1,000 different randomizations
of the larger dataset. When randomizing the data, is the imposed structure that the
variables are independent or that the variables are associated? Explain.
b. What is the (approximate) range of plausible values for the randomized Chi-squared statistic?
c. The observed Chi-squared statistic is 343.865 (and seen in red on the graph). Does the
observed value provide evidence against the null hypothesis? To answer the question, state
the null and alternative hypotheses, approximate the p-value, and conclude the test in the
context of the problem.
d. If the alternative hypothesis is true, how does the sample size affect the ability to reject
the null hypothesis? (Hint: Consider the original data as compared with the larger dataset
that has the same proportional values.)
10. Disaggregating Asian American tobacco use, smaller data. Understanding cultural
differences in tobacco use across different demographic groups can lead to improved health care
education and treatment. A recent study disaggregated tobacco use across Asian American
ethnic groups (Rao et al. 2021).
Consider the situation where the dataset is 50 times smaller than the original data (but has
the same proportional representation in each category). The distribution of smokers in each of
the ethnicity groups in the smaller data is as follows.
Smaller data
                       Smoking
ethnicity        don’t smoke   smoke   Total
Asian-Indian              83       4      87
Chinese                   89       6      95
Filipino                  86      12      98
Total                    258      22     280
The smaller dataset was randomized 1,000 times (smoking status randomly assigned to the obser-
vations across ethnicities), and the histogram of the Chi-squared statistic on each randomization
is displayed.
a. The histogram above describes the Chi-squared statistics for 1,000 different randomizations
of the smaller dataset. When randomizing the data, is the imposed structure that the
variables are independent or that the variables are associated? Explain.
b. What is the (approximate) range of plausible values for the randomized Chi-squared statistic?
c. The observed Chi-squared statistic is 4.19 (and seen in red on the graph). Does the observed
value provide evidence against the null hypothesis? To answer the question, state the null
and alternative hypotheses, approximate the p-value, and conclude the test in the context
of the problem.
d. If the alternative hypothesis is true, how does the sample size affect the ability to reject the
null hypothesis? (Hint: Consider the original data as compared with the smaller dataset
that has the same proportional values.)
11. True / False, I. Determine if the statements below are true or false. For each false statement,
suggest an alternative wording to make it a true statement.
a. The Chi-square distribution, just like the normal distribution, has two parameters, mean
and standard deviation.
b. The Chi-square distribution is always right skewed, regardless of the value of the degrees
of freedom parameter.
c. The Chi-square statistic is always greater than or equal to 0.
d. As the degrees of freedom increases, the shape of the Chi-square distribution becomes more
skewed.
12. True / False, II. Determine if the statements below are true or false. For each false statement,
suggest an alternative wording to make it a true statement.
a. As the degrees of freedom increases, the mean of the Chi-square distribution increases.
b. If you found 𝜒² = 10 with 𝑑𝑓 = 5 you would fail to reject 𝐻0 at the 5% discernibility level.
c. When finding the p-value of a Chi-square test, we always shade the tail areas in both tails.
d. As the degrees of freedom increases, the variability of the Chi-square distribution decreases.
13. Sleep deprived transportation workers. The National Sleep Foundation conducted a survey
on the sleep habits of randomly sampled transportation workers and randomly sampled non-
transportation workers that serve as a “control” for comparison. (National Sleep Foundation
2012) The results of the survey are shown below. Conduct a hypothesis test to evaluate if these
data provide evidence of an association between sleep levels and profession.
14. Parasitic worm. Lymphatic filariasis is a disease caused by a parasitic worm. The disease
can lead to extreme swelling and other complications. Here we consider results
from a randomized experiment that compared three different drug treatment options to clear
people of this parasite, which people are working to eliminate entirely. The results for the
second year of the study are given below: (King et al. 2018)
                                       Outcome
group                 Clear at Year 2   Not Clear at Year 2   Total
Three drugs                        52                     2      54
Two drugs                          31                    24      55
Two drugs annually                 42                    14      56
Total                             125                    40     165
a. Set up hypotheses for evaluating whether there is any difference in the performance of the
treatments, and also check conditions.
b. Statistical software was used to run a Chi-square test, which output: 𝑋² = 23.7, 𝑑𝑓 = 2,
p-value < 0.0001. Use these results to evaluate the hypotheses from part (a), and
provide a conclusion in the context of the problem.
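The software output quoted in part (b) can be reproduced along the following lines. This is a sketch that assumes the counts in the table above are entered by hand; the object names `worm` and `result` are our own.

```r
# Observed counts from the table above (rows: treatment group, columns: outcome)
worm <- matrix(c(52,  2,
                 31, 24,
                 42, 14),
               nrow = 3, byrow = TRUE,
               dimnames = list(group = c("Three drugs", "Two drugs", "Two drugs annually"),
                               outcome = c("Clear at Year 2", "Not Clear at Year 2")))

result <- chisq.test(worm)

result$expected   # expected counts under independence
result$statistic  # Chi-squared statistic, approximately 23.7
result$parameter  # degrees of freedom: (3 - 1) * (2 - 1) = 2
result$p.value    # well below 0.0001
```

The `expected` component is a convenient place to look when checking the conditions for the Chi-square test, since those conditions are stated in terms of expected counts.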
15. Shipping holiday gifts. A local news survey asked 500 randomly sampled Los Angeles resi-
dents which shipping carrier they prefer to use for shipping holiday gifts. The table below shows
the distribution of responses by age group as well as the expected counts for each cell (shown in
italics).
Age
Shipping method 18-34 35-54 55+ Total
a. State the null and alternative hypotheses for testing for independence of age and preferred
shipping method for holiday gifts among Los Angeles residents.
b. Are the conditions for inference using a Chi-square test satisfied?
16. Coffee and depression. Researchers conducted a study investigating the relationship between
caffeinated coffee consumption and risk of depression in women. They collected data on 50,739
women free of depression symptoms at the start of the study in the year 1996, and these women
were followed through 2006. The researchers used questionnaires to collect data on caffeinated
coffee consumption, asked each individual about physician-diagnosed depression, and also asked
about the use of antidepressants. The table below shows the distribution of incidences of
depression by amount of caffeinated coffee consumption. (Lucas et al. 2011)
a. What type of test is appropriate for evaluating if there is an association between coffee
intake and depression?
b. Write the hypotheses for the test you identified in part (a).
c. Calculate the overall proportion of women who do and do not suffer from depression.
d. Identify the expected count for the empty cell, and calculate the contribution of this cell
to the test statistic.
e. The test statistic is 𝜒² = 20.93. What is the p-value?
f. What is the conclusion of the hypothesis test?
g. One of the authors of this study was quoted in the New York Times as saying it was
“too early to recommend that women load up on extra coffee” based on just this study.
(O’Connor 2011) Do you agree with this statement? Explain your reasoning.
Chapter 19
In this chapter, we focus on the sample mean (instead of, for example, the sample median or the range
of the observations) because of the well-studied mathematical model which describes the behavior of
the sample mean. We will not cover mathematical models which describe other statistics, but the
bootstrap and randomization techniques described below are immediately extendable to any function
of the observed data. The sample mean will be calculated in one group, two paired groups, two
independent groups, and many groups settings. The techniques described for each setting will vary
slightly, but you will be well served to find the structural similarities across the different settings.
Similar to how we can model the behavior of the sample proportion 𝑝̂ using a normal distribution,
the sample mean 𝑥̄ can also be modeled using a normal distribution when certain conditions are
met. However, we’ll soon learn that a new distribution, called the 𝑡-distribution, is more useful when
working with the sample mean. We’ll first learn about this new distribution, then we’ll use it for
confidence intervals and hypothesis tests for the mean.
320 CHAPTER 19. INFERENCE FOR A SINGLE MEAN
Consider a situation where you want to know whether you should buy a franchise of the used car
store Awesome Auto. As part of your planning, you’d like to know how much an average car from
Awesome Auto sells for. In order to go through the example more clearly, let’s say that you are only able
to randomly sample five cars from Awesome Auto. (If this were a real example, you would surely be
able to take a much larger sample size, possibly even being able to measure the entire population!)
The sample average car price of $17,140.00 is a first guess at the average car price at
Awesome Auto. However, as a student of statistics, you understand that one sample mean based on
a sample of five observations will not necessarily equal the true population average car price for all
the cars at Awesome Auto. Indeed, you can see that the observed car prices vary with a standard
deviation of $7,170.29, and surely the average car price would be different if a different sample of size
five had been taken from the population. Fortunately, as it did in previous chapters for the sample
proportion, bootstrapping will approximate the variability of the sample mean from sample to sample.
Figure 19.2: As seen previously, the idea behind bootstrapping is to consider the sample at hand as an
estimate of the population. Sampling from the sample (of 5 cars) is identical to sampling from an infinite
population which is made up of only the cars in the original sample.
By taking repeated samples from the estimated population, the variability from sample to sample
can be observed. In Figure 12.2 the repeated bootstrap samples are seen to be different both from
each other and from the original population. Recall that the bootstrap samples were taken from the
same (estimated) population, and so the differences in bootstrap samples are due entirely to natural
variability in the sampling procedure. For the situation at hand where the sample mean is the statistic
of interest, the variability from sample to sample can be seen in Figure 19.3.
Figure 19.3: To estimate the natural variability in the sample mean, different bootstrap samples are taken
from the original sample. Notice that each bootstrap resample is different from each other as well as from the
original sample
By summarizing each of the bootstrap samples (here, using the sample mean), we see, directly, the
variability of the sample mean, 𝑥̄, from sample to sample. The distribution of 𝑥̄_bs for the Awesome
Auto cars is shown in Figure 19.4.
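The resampling step described above might be sketched in R as follows. The five prices in `prices` are an assumption on our part: they were chosen so that their mean and standard deviation match the values quoted in the text ($17,140 and $7,170.29), and they are not guaranteed to be the prices in the original sample. The seed and object names are also our own.

```r
set.seed(19)  # arbitrary seed so the sketch is reproducible

# Assumed prices, chosen to reproduce the quoted mean ($17,140) and SD ($7,170.29)
prices <- c(18300, 20100, 9600, 10700, 27000)
mean(prices)
sd(prices)

# One bootstrap resample: draw 5 cars with replacement from the sample
one_resample <- sample(prices, size = length(prices), replace = TRUE)
mean(one_resample)

# Repeat 1,000 times, keeping the mean of each resample
boot_means <- replicate(1000,
                        mean(sample(prices, size = length(prices), replace = TRUE)))
summary(boot_means)
```

Sampling with replacement is what allows a resample of size five to differ from the original sample: some cars may appear more than once and others not at all.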
Figure 19.4: Because each of the bootstrap resamples represents a different set of cars, the mean of each
bootstrap resample will be a different value. Each of the bootstrapped means is calculated, and a histogram of
the values describes the inherent natural variability of the sample mean which is due to the sampling process.
Figure 19.5 summarizes one thousand bootstrap samples in a histogram of the bootstrap sample means.
The bootstrapped average car prices vary from about $10,000 to $25,000. The bootstrap percentile
confidence interval is found by locating the middle 90% (for a 90% confidence interval) or the middle 95% (for
a 95% confidence interval) of the bootstrapped statistics.
EXAMPLE
Using Figure 19.5, find the 90% and 95% bootstrap percentile confidence intervals for the true
average price of a car from Awesome Auto.
A 90% confidence interval is $12,140 to $22,007. The conclusion is that we are 90% confident
that the true average car price at Awesome Auto lies somewhere between $12,140 and $22,007.
A 95% confidence interval is $11,778 to $22,500. The conclusion is that we are 95% confident
that the true average car price at Awesome Auto lies somewhere between $11,778 and $22,500.
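In software, the percentile interval amounts to taking quantiles of the bootstrapped means. The sketch below assumes the same illustrative prices as before (chosen to match the quoted mean and standard deviation, not taken from the text); because resampling is random, the endpoints it produces will be close to, but not identical to, the values read from Figure 19.5.

```r
set.seed(19)  # arbitrary seed so the sketch is reproducible

# Assumed prices, chosen to reproduce the quoted mean and SD
prices <- c(18300, 20100, 9600, 10700, 27000)

# 1,000 bootstrapped means
boot_means <- replicate(1000, mean(sample(prices, replace = TRUE)))

# Middle 90% and middle 95% of the bootstrapped means
ci_90 <- quantile(boot_means, probs = c(0.05, 0.95))
ci_95 <- quantile(boot_means, probs = c(0.025, 0.975))
ci_90
ci_95
```

The 95% interval reaches further into both tails of the bootstrap distribution than the 90% interval, so it can never be the narrower of the two.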
Figure 19.5: The original Awesome Auto data is bootstrapped 1,000 times. The histogram provides a sense
for the variability of the average car prices from sample to sample.
19.1. BOOTSTRAP CONFIDENCE INTERVAL FOR A MEAN 323
EXAMPLE
Explain how the standard error (SE) of the bootstrapped means is calculated and what it is
measuring.
The SE of the bootstrapped means measures how variable the means are from resample to
resample. The bootstrap SE is a good approximation to the SE of means as if we had taken
repeated samples from the original population (which we agreed isn’t something we would do
because of wasted resources).
Logistically, we can find the standard deviation of the bootstrapped means using the same
calculations from Chapter 5. That is, the bootstrapped means are the individual observations
about which we measure the variability.
Although we won’t spend a lot of energy on this concept, you may be wondering about the differences
between a standard error and a standard deviation. The standard error describes how a statistic
(e.g., sample mean or sample proportion) varies from sample to sample. The standard deviation
can be thought of as a function applied to any list of numbers which measures how far those numbers
vary from their own average. So, you can have a standard deviation calculated on a column of dog
heights or a standard deviation calculated on a column of bootstrapped means from the resampled
data. Note that the standard deviation calculated on the bootstrapped means is referred to as the
bootstrap standard error of the mean.
GUIDED PRACTICE
It turns out that the standard deviation of the bootstrapped means from Figure 19.5 is
$2,891.87 (a value which is an excellent approximation for the standard error of sample
means if we were to take repeated samples from the population). (Note: in R the
calculation was done using the function sd().) The average of the observed prices is
$17,140, and we will consider the sample average to be the best guess point estimate for
𝜇. Find and interpret the confidence interval for 𝜇 (the true average cost of a car at
Awesome Auto) using the bootstrap SE confidence interval formula.2
1 There is a large literature on understanding and improving bootstrap intervals, see Hesterberg (2015) titled “What
Teachers Should Know About the Bootstrap” and Hayden (2019) titled “Questionable Claims for Simple Versions of
the Bootstrap” for more information.
2 Using the formula for the bootstrap SE interval, we find the 95% confidence interval for 𝜇 is: 17,140 ± 2 ⋅ 2,891.87 →
($11,356.26, $22,923.74). We are 95% confident that the true average car price at Awesome Auto is somewhere between
$11,356.26 and $22,923.74.
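The arithmetic behind the bootstrap SE interval can be checked directly in R. This short sketch uses only the two numbers quoted in the text; the object names are our own.

```r
point_estimate <- 17140   # observed sample mean
boot_se <- 2891.87        # SD of the bootstrapped means, as quoted in the text

# point estimate plus or minus 2 bootstrap standard errors
ci <- point_estimate + c(-2, 2) * boot_se
ci
# [1] 11356.26 22923.74
```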
EXAMPLE
Compare and contrast the two different 95% confidence intervals for 𝜇 created by finding the
percentiles of the bootstrapped means and created by finding the SE of the bootstrapped
means. Do you think the intervals should be identical?
EXAMPLE
Describe the bootstrap distribution for the standard deviation shown in Figure 19.6.
The distribution is skewed left and centered near $7,170.286, which is the point estimate from
the original data. Most observations in this distribution lie between $0 and $10,000.
GUIDED PRACTICE
Using Figure 19.6, find and interpret a 90% bootstrap percentile confidence interval for
the population standard deviation for car prices at Awesome Auto.3
Figure 19.6: The original Awesome Auto data is bootstrapped 1,000 times. The histogram provides a sense
for the variability of the standard deviation of car prices from sample to sample.
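Because the bootstrap applies to any function of the data, the same resampling loop can be reused with sd() in place of mean(). The sketch below again assumes illustrative prices chosen to match the quoted summary statistics (they are not taken from the text), so its quantiles will only roughly resemble those read from Figure 19.6.

```r
set.seed(19)  # arbitrary seed so the sketch is reproducible

# Assumed prices, chosen to reproduce the quoted mean and SD
prices <- c(18300, 20100, 9600, 10700, 27000)

# Standard deviation of each of 1,000 bootstrap resamples
boot_sds <- replicate(1000, sd(sample(prices, replace = TRUE)))

# Middle 90% of the bootstrapped standard deviations
quantile(boot_sds, probs = c(0.05, 0.95))
```

With only five observations, a resample occasionally contains the same car five times, which gives a standard deviation of exactly zero; this is one reason the bootstrap distribution of the standard deviation reaches down toward $0.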
3 Based on the percentile values in Figure 19.6, the middle 90% of the bootstrapped standard deviations is given by
the 5th ($3,602.5) and the 95th percentiles ($8,737.2). That is, we are 90% confident that the true standard deviation
of car prices is between $3,602.5 and $8,737.2. A 90% confidence level indicates that there was not a need for a high
level of confidence, such as 95% or 99%. A lower confidence level has higher potential for error, but it also produces a
narrower interval.
19.2 Mathematical model for a mean
As with the sample proportion, the variability of the sample mean is well described by the mathe-
matical theory given by the Central Limit Theorem. However, because of missing information about
the inherent variability in the population (𝜎), a 𝑡-distribution is used in place of the standard normal
when performing hypothesis tests or constructing confidence intervals.
$$\text{Mean} = \mu \qquad \text{Standard Error } (SE) = \frac{\sigma}{\sqrt{n}}$$
Before diving into confidence intervals and hypothesis tests using 𝑥̄, we first need to cover two topics:
• When we modeled 𝑝̂ using the normal distribution, certain conditions had to be satisfied. The
conditions for working with 𝑥̄ are a little more complex, and below, we will discuss how to check
conditions for inference using a mathematical model.
• The standard error is dependent on the population standard deviation, 𝜎. However, we rarely
know 𝜎, and instead we must estimate it. Because this estimation is itself imperfect, we use a
new distribution called the 𝑡-distribution to fix this problem.
There is no perfect way to check the normality condition, so instead we use two general
rules based on the number and magnitude of extreme observations. Note, it often takes
practice to get a sense for whether a normal approximation is appropriate.
• Small 𝑛: If the sample size 𝑛 is small and there are no clear outliers in the
data, then we typically assume the data come from a nearly normal distribution
to satisfy the condition.
• Large 𝑛: If the sample size 𝑛 is large and there are no particularly extreme
outliers, then we typically assume the sampling distribution of 𝑥̄ is nearly normal,
even if the underlying distribution of individual observations is not.
Some guidelines for determining whether 𝑛 is considered small or large are as follows:
slight skew is okay for sample sizes of 15, moderate skew for sample sizes of 30, and
strong skew for sample sizes of 60.
In this first course in statistics, you aren’t expected to develop perfect judgment on the normality
condition. However, you are expected to be able to handle clear cut cases based on the rules of
thumb.4
EXAMPLE
Consider the four plots provided in Figure 19.7 that come from simple random samples from
different populations. Their sample sizes are 𝑛1 = 15 and 𝑛2 = 50.
Each sample is from a simple random sample of its respective population, so the independence
condition is satisfied. Let’s next check the normality condition for each using the rule of thumb.
The first sample has fewer than 30 observations, so we are watching for any clear outliers.
None are present; while there is a small gap in the histogram on the right, this gap is small
and over 20% of the observations in this small sample are represented to the left of the gap, so
we can hardly call these clear outliers. With no clear outliers, the normality condition can be
reasonably assumed to be met.
The second sample has a sample size greater than 30 and includes an outlier that appears to be
roughly 5 times further from the center of the distribution than the next furthest observation.
This is an example of a particularly extreme outlier, so the normality condition would not be
satisfied.
It’s often helpful to also visualize the data using a box plot to assess skewness and existence
of outliers. The box plots provided underneath each histogram confirm our conclusions that
the first sample does not have any outliers and the second sample does, with one outlier being
particularly more extreme than the others.
4 More nuanced guidelines would consider further relaxing the particularly extreme outlier check when the sample
size is very large. However, we’ll leave further discussion here to a future course.
In practice, it’s typical to also do a mental check to evaluate whether we have reason to believe the
underlying population would have moderate skew (if 𝑛 < 30) or have particularly extreme outliers
(𝑛 ≥ 30) beyond what we observe in the data. For example, consider the number of followers for each
individual account on Twitter, and then imagine this distribution. The large majority of accounts
have built up a couple thousand followers or fewer, while a relatively tiny fraction have amassed tens
of millions of followers, meaning the distribution is extremely skewed. When we know the data come
from such an extremely skewed distribution, it takes some effort to understand what sample size is
large enough for the normality condition to be satisfied.
$$SE = \frac{\sigma}{\sqrt{n}} \approx \frac{s}{\sqrt{n}}$$
This strategy tends to work well when we have a lot of data and can estimate 𝜎 using 𝑠 accurately.
However, the estimate is less precise with smaller samples, and this leads to problems when using the
normal distribution to model 𝑥̄.
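As a small worked example of estimating the standard error with 𝑠 in place of 𝜎, the sketch below plugs in the Awesome Auto summary values quoted earlier in the chapter (𝑠 = $7,170.29 from 𝑛 = 5 cars). The object names are our own.

```r
s <- 7170.29   # sample standard deviation of the five car prices
n <- 5         # sample size

# estimated standard error of the sample mean
se <- s / sqrt(n)
se   # roughly 3206.65
```

With a sample this small, 𝑠 is itself an imprecise estimate of 𝜎, which is exactly the situation in which the normal model becomes unreliable.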
We’ll find it useful to use a new distribution for inference calculations called the 𝑡-distribution. A
𝑡-distribution, shown as a solid line in Figure 19.8, has a bell shape. However, its tails are thicker
than the normal distribution’s, meaning observations are more likely to fall beyond two standard
deviations from the mean than under the normal distribution.
The extra thick tails of the 𝑡-distribution are exactly the correction needed to resolve the problem
(due to extra variability of the T score) of using 𝑠 in place of 𝜎 in the 𝑆𝐸 calculation.
The 𝑡-distribution is always centered at zero and has a single parameter: degrees of freedom. The de-
grees of freedom describes the precise form of the bell-shaped 𝑡-distribution. Several 𝑡-distributions
are shown in Figure 19.9 in comparison to the normal distribution. Similar to the Chi-square distri-
bution, the shape of the 𝑡-distribution also depends on the degrees of freedom.
In general, we’ll use a 𝑡-distribution with 𝑑𝑓 = 𝑛 − 1 to model the sample mean when the sample
size is 𝑛. That is, when we have more observations, the degrees of freedom will be larger and the
𝑡-distribution will look more like the standard normal distribution; when the degrees of freedom is
about 30 or more, the 𝑡-distribution is nearly indistinguishable from the normal distribution.
Figure 19.9: The larger the degrees of freedom, the more closely the 𝑡-distribution resembles the standard
normal distribution.
The degrees of freedom describes the shape of the 𝑡-distribution. The larger the degrees
of freedom, the more closely the distribution approximates the normal distribution.
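One way to see the effect of the degrees of freedom numerically is to compare tail areas. The sketch below computes the area more than 2 units from the center for several 𝑡-distributions and for the standard normal; the particular degrees of freedom and the cutoff of 2 are arbitrary choices made for illustration.

```r
# Area more than 2 units from the center, for several degrees of freedom
df <- c(1, 2, 5, 10, 30, 100)
t_tail_area <- 2 * pt(-2, df = df)   # both tails of the t-distribution
normal_tail_area <- 2 * pnorm(-2)    # both tails of the standard normal

data.frame(df = df,
           t = round(t_tail_area, 4),
           normal = round(normal_tail_area, 4))
```

The tail area under the 𝑡-distribution is always larger than under the normal distribution, and it shrinks toward the normal value as the degrees of freedom grow.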
The 𝑡-distribution allows us greater flexibility than the normal distribution when analyzing numerical
data. In practice, it’s common to use statistical software, such as R, Python, or SAS for these analyses.
In R, the function used for calculating probabilities under a 𝑡-distribution is pt() (which should seem
similar to previous R functions, pnorm() and pchisq()). Don’t forget that with the 𝑡-distribution,
the degrees of freedom must always be specified!
For the examples and guided practices below, you may have to use a table or statistical software to
find the answers. We recommend trying the problems so as to get a sense for how the 𝑡-distribution
can vary in width depending on the degrees of freedom. No matter the approach you choose, apply
your method using the examples below to confirm your working understanding of the 𝑡-distribution.
EXAMPLE
What proportion of the 𝑡-distribution with 18 degrees of freedom falls below -2.10?
Let’s first draw the picture and shade the area below -2.10.
[1] 0.025
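The output above is consistent with a call to pt() along the following lines. The call is our reconstruction, modeled on the other examples in this section, and the book's printed value is rounded.

```r
# proportion of the t-distribution with 18 degrees of freedom that falls below -2.10
area_below <- pt(-2.10, df = 18)
round(area_below, 3)
```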
EXAMPLE
What proportion of the 𝑡-distribution with 20 degrees of freedom falls above 1.65?
Note that with 20 degrees of freedom, the 𝑡-distribution is relatively close to the normal
distribution. With a normal distribution, this would correspond to about 0.05, so we should
expect the 𝑡-distribution to give us a similar value. Using statistical software, we can obtain a
precise value: 0.0573.
# use pt() to find probability under the t-distribution
1 - pt(1.65, df = 20)
[1] 0.0573
EXAMPLE
A 𝑡-distribution with 2 degrees of freedom is shown below. Estimate the proportion of the
distribution falling more than 3 units from the mean (above or below).
With so few degrees of freedom, the 𝑡-distribution will give a notably different value than
the normal distribution. Under a normal distribution, the area would be about 0.003 using the
68-95-99.7 rule. For a 𝑡-distribution with 𝑑𝑓 = 2, the area in both tails beyond 3 units totals
0.0955. This area is dramatically different from what we obtain from the normal distribution.
# use pt() to find probability under the t-distribution
pt(-3, df = 2) + (1 - pt(3, df = 2))
[1] 0.0955
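To see how quickly this gap closes as the degrees of freedom grow, the following sketch repeats the same two-tail calculation for several degrees of freedom and compares it with the normal distribution. The particular set of degrees of freedom is our own choice for illustration.

```r
# Degrees of freedom to compare (an illustrative choice)
deg_free <- c(2, 5, 18, 30, 100)

# Area more than 3 units from 0 in both tails, using symmetry
t_areas     <- 2 * pt(-3, df = deg_free)
normal_area <- 2 * pnorm(-3)

# Side-by-side comparison; the first row reproduces the 0.0955 found above
data.frame(
  df          = deg_free,
  t_area      = round(t_areas, 4),
  normal_area = round(normal_area, 4)
)
```

The tail area shrinks toward the normal value as the degrees of freedom increase, but it stays above it for every finite 𝑑𝑓.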
330 CHAPTER 19. INFERENCE FOR A SINGLE MEAN
GUIDED PRACTICE
What proportion of the 𝑡-distribution with 19 degrees of freedom falls above -1.79 units?
Use your preferred method for finding tail areas.5
Figure 19.10: A Risso’s dolphin. Photo by Mike Baird, www.bairdphotos.com. CC BY 2.0 license.
We will identify a confidence interval for the average mercury content in dolphin muscle using a sample
of 19 Risso’s dolphins from the Taiji area in Japan. The data are summarized in Table 19.1. The
minimum and maximum observed values can be used to evaluate whether there are clear outliers.
Table 19.1: Summary of mercury content in the muscle of 19 Risso’s dolphins from the Taiji area. Measure-
ments are in micrograms of mercury per wet gram of muscle (𝜇g/wet g).
EXAMPLE
Are the independence and normality conditions satisfied for this dataset?
The observations are a simple random sample, so it is reasonable to assume that the
dolphins are independent. The summary statistics in Table 19.1 do not suggest any clear
outliers, with all observations within 2.5 standard deviations of the mean. Based on this
evidence, the normality condition seems reasonable.
In the normal model, we used 𝑧 ⋆ and the standard error to determine the width of a confidence interval.
We revise the confidence interval formula slightly when using the 𝑡-distribution:
point estimate ± 𝑡⋆𝑑𝑓 × 𝑆𝐸 → 𝑥̄ ± 𝑡⋆𝑑𝑓 × 𝑠/√𝑛
EXAMPLE
Using the summary statistics in Table 19.1, compute the standard error for the average mercury
content in the 𝑛 = 19 dolphins.
The value 𝑡⋆𝑑𝑓 is a cutoff we obtain based on the confidence level and the 𝑡-distribution with 𝑑𝑓 degrees
of freedom. That cutoff is found in the same way as with a normal distribution: we find 𝑡⋆𝑑𝑓 such that
the fraction of the 𝑡-distribution with 𝑑𝑓 degrees of freedom within a distance 𝑡⋆𝑑𝑓 of 0 matches the
confidence level of interest.
EXAMPLE
When 𝑛 = 19, what is the appropriate degrees of freedom? Find 𝑡⋆𝑑𝑓 for this degrees of freedom
and the confidence level of 95%.
The degrees of freedom is 𝑑𝑓 = 𝑛 − 1 = 18. Using statistical software, we find the cutoff where the upper tail is equal to 2.5%: 𝑡⋆18 = 2.10.
The area below -2.10 will also be equal to 2.5%. That is, 95% of the 𝑡-distribution with 𝑑𝑓 = 18
lies within 2.10 units of 0.
# use qt() to find the t-cutoff (with 95% in the middle)
qt(0.025, df = 18)
[1] -2.1
qt(0.975, df = 18)
[1] 2.1
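The same lookup can be wrapped in a small helper so that the confidence level, rather than the tail area, is the input. This is only a sketch: t_star() is a hypothetical name introduced here, not a built-in R function.

```r
# Hypothetical helper (not part of base R): cutoff for a given confidence level
t_star <- function(conf_level, df) {
  tail_area <- (1 - conf_level) / 2   # area left in EACH tail
  qt(1 - tail_area, df = df)          # upper cutoff; the lower one is its negative
}

t_star(0.95, df = 18)   # about 2.10, matching the example above
t_star(0.99, df = 18)   # a higher confidence level needs a wider cutoff
```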
If the sample has 𝑛 observations and we are examining a single mean, then we use the
𝑡-distribution with 𝑑𝑓 = 𝑛 − 1 degrees of freedom.
EXAMPLE
Compute and interpret the 95% confidence interval for the average mercury content in Risso’s
dolphins.
𝑥̄ ± 𝑡⋆18 × 𝑆𝐸
4.4 ± 2.10 × 0.528
(3.29 , 5.51)
We are 95% confident the average mercury content of muscles in Risso’s dolphins is between
3.29 and 5.51 𝜇g/wet gram, which is considered extremely high.
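The interval arithmetic can be checked in R. The sketch below assumes only the summary values quoted in the example (sample mean 4.4, standard error 0.528, 𝑑𝑓 = 18); the variable names are our own.

```r
# Summary values quoted in the example
x_bar    <- 4.4
se       <- 0.528
deg_free <- 18

# Cutoff with 95% of the t-distribution in the middle
t_cut <- qt(0.975, df = deg_free)

# Margin of error and interval endpoints
margin <- t_cut * se
ci <- c(lower = x_bar - margin, upper = x_bar + margin)
round(ci, 2)   # about 3.29 and 5.51
```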
In the confidence interval formula 𝑥̄ ± 𝑡⋆𝑑𝑓 × 𝑆𝐸, 𝑥̄ is the sample mean, 𝑡⋆𝑑𝑓 corresponds to the confidence level and degrees of
freedom 𝑑𝑓, and 𝑆𝐸 is the standard error as estimated by the sample.
GUIDED PRACTICE
The FDA’s webpage provides some data on mercury content of fish. Based on a sample
of 15 croaker white fish (Pacific), a sample mean and standard deviation were computed
as 0.287 and 0.069 ppm (parts per million), respectively. The 15 observations ranged
from 0.18 to 0.41 ppm. We will assume these observations are independent. Based
on the summary statistics of the data, do you have any objections to the normality
condition of the individual observations?6
EXAMPLE
Estimate the standard error of 𝑥̄ = 0.287 ppm using the data summaries in the previous Guided
Practice. If we are to use the 𝑡-distribution to create a 90% confidence interval for the actual
mean of the mercury content, identify the degrees of freedom and 𝑡⋆𝑑𝑓 .
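The standard error is 𝑆𝐸 = 0.069/√15 = 0.0178, and the degrees of freedom is 𝑑𝑓 = 𝑛 − 1 = 14.
For a 90% confidence interval, we need the cutoff with 5% in each tail: 𝑡⋆14 = 1.76.
# use qt() to find the t-cutoff (with 90% in the middle)
qt(0.05, df = 14)
[1] -1.76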
qt(0.95, df = 14)
[1] 1.76
GUIDED PRACTICE
Using the information and results of the previous Guided Practice and Example, com-
pute a 90% confidence interval for the average mercury content of croaker white fish
(Pacific).7
GUIDED PRACTICE
The 90% confidence interval from the previous Guided Practice is 0.256 ppm to 0.318
ppm. Can we say that 90% of croaker white fish (Pacific) have mercury levels between
0.256 and 0.318 ppm?8
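As a check on the arithmetic, the 90% interval quoted in this Guided Practice can be rebuilt step by step from the summary statistics given earlier (mean 0.287 ppm, standard deviation 0.069 ppm, 𝑛 = 15). The variable names below are our own; this is a sketch, not the book's code.

```r
# Summary statistics for the croaker white fish sample
x_bar <- 0.287
s     <- 0.069
n     <- 15

se       <- s / sqrt(n)            # standard error of the mean
deg_free <- n - 1                  # degrees of freedom for one mean
t_cut    <- qt(0.95, df = deg_free) # 5% in each tail leaves 90% in the middle

ci <- x_bar + c(-1, 1) * t_cut * se
round(ci, 3)   # about 0.256 and 0.318
```

The interval describes plausible values for the population mean, not the spread of individual fish.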
6 The sample size is under 30, so we check for obvious outliers: since all observations are within 2 standard deviations
of the mean, there are no clear outliers.
Recall that the margin of error is defined by the standard error. The margin of error for 𝑥̄ can be
directly obtained from 𝑆𝐸(𝑥̄).
The T score is a ratio of how the sample mean differs from the hypothesized mean as
compared to how the observations vary.
$$T = \frac{\bar{x} - \text{null value}}{s/\sqrt{n}}$$
When the null hypothesis is true and the conditions are met, T has a t-distribution with
𝑑𝑓 = 𝑛 − 1.
Conditions:
• Independent observations.
• Large samples and no extreme outliers.
Is the typical US runner getting faster or slower over time? We consider this question in the context
of the Cherry Blossom Race, which is a 10-mile race in Washington, DC each spring. The average
time for all runners who finished the Cherry Blossom Race in 2006 was 93.29 minutes (93 minutes
and about 17 seconds). We want to determine using data from 100 participants in the 2017 Cherry
Blossom Race whether runners in this race are getting faster or slower, versus the other possibility
that there has been no change.
GUIDED PRACTICE
What are appropriate hypotheses for this context?9
When completing a hypothesis test for the one-sample mean, the process is nearly identical to complet-
ing a hypothesis test for a single proportion. First, we find the Z score using the observed value, null
value, and standard error; however, we call it a T score since we use a 𝑡-distribution for calculating
the tail area. Then we find the p-value using the same ideas we used previously: find the one-tail area
under the sampling distribution, and double it.
8 No, a confidence interval only provides a range of plausible values for a population parameter, in this case the
population mean. It does not describe what we might observe for individual observations.
9 𝐻0 ∶ The average 10-mile run time was the same for 2006 and 2017. 𝜇 = 93.29 minutes. 𝐻𝐴 ∶ The average 10-mile
run time for 2017 was different than that of 2006. 𝜇 ≠ 93.29 minutes.
GUIDED PRACTICE
The data come from a simple random sample of all participants, so the observations are
independent. A histogram of the race times is given below to evaluate if we can move
forward with a t-test. Is the normality condition met?10
EXAMPLE
With both the independence and normality conditions satisfied, we can proceed with a hy-
pothesis test using the 𝑡-distribution. The sample mean and sample standard deviation of the
sample of 100 runners from the 2017 Cherry Blossom Race are 98.78 and 16.59 minutes, re-
spectively. Recall that the average run time in 2006 was 93.29 minutes. Find the test statistic
and p-value. What is your conclusion?
To find the test statistic (T score), we first must determine the standard error:
$$SE = 16.6/\sqrt{100} = 1.66$$
Now we can compute the T score using the sample mean (98.78), null value (93.29), and 𝑆𝐸:
$$T = \frac{98.8 - 93.29}{1.66} = 3.32$$
For 𝑑𝑓 = 100 − 1 = 99, we can determine using statistical software (or a 𝑡-table) that the
one-tail area is 0.000631, which we double to get the p-value: 0.00126.
# use pt() to find the right tail and multiply by 2 to get both tails
(1 - pt(3.32, df = 99)) * 2
[1] 0.00126
Because the p-value is smaller than 0.05, we reject the null hypothesis. That is, the data
provide convincing evidence that the average run time for the Cherry Blossom Run in 2017 is
different than the 2006 average.
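The whole calculation can also be scripted from the summary statistics. The sketch below uses the unrounded inputs (98.78 and 16.59), so the T score comes out near 3.31 instead of the hand-rounded 3.32, and the p-value differs slightly from 0.00126. The variable names are our own.

```r
# Summary statistics from the 2017 sample and the 2006 null value
x_bar      <- 98.78
s          <- 16.59
n          <- 100
null_value <- 93.29

se      <- s / sqrt(n)               # standard error of the mean
t_score <- (x_bar - null_value) / se # standard errors between estimate and null

# Two-sided p-value: area in both tails beyond |T|, with df = n - 1
p_value <- 2 * pt(-abs(t_score), df = n - 1)

t_score
p_value
```

Using -abs(t_score) keeps the calculation correct whether the observed mean falls above or below the null value.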
To help us remember to use the 𝑡-distribution, we use a 𝑇 to represent the test statistic,
and we often call this a T score. The Z score and T score are computed in the exact
same way and are conceptually identical: each represents how many standard errors the
observed value is from the null value.
10 With a sample of 100, we should only be concerned if there are particularly extreme outliers. The histogram of
the data does not show any outliers of concern (and arguably, no outliers at all).
19.3. CHAPTER REVIEW 335
19.3.1 Summary
In this chapter we extended the randomization / bootstrap / mathematical model paradigm to ques-
tions involving quantitative variables of interest. When there is only one variable of interest, we are
often hypothesizing or finding confidence intervals about the population mean. Note, however, the
bootstrap method can be used for other statistics like the population median or the population IQR.
When comparing a quantitative variable across two groups, the question often focuses on the difference
in population means (or sometimes a paired difference in means). The questions revolving around one,
two, and paired samples of means are addressed using the t-distribution; they are therefore called “t-
tests” and “t-intervals.” When considering a quantitative variable across 3 or more groups, a method
called ANOVA is applied. Again, almost all the research questions can be approached using com-
putational methods (e.g., randomization tests or bootstrapping) or using mathematical models. We
continue to emphasize the importance of experimental design in making conclusions about research
claims. In particular, recall that variability can come from different sources (e.g., random sampling
vs. random allocation, see Figure 2.8).
19.3.2 Terms
The terms introduced in this chapter are presented in Table 19.2. If you’re not sure what some of
these terms mean, we recommend you go back in the text and review their definitions. You should be
able to easily spot them as bolded text.
19.4 Exercises
2. Statistics vs. parameters: one mean. Each of the following scenarios was set up to assess
an average value. For each one, identify, in words: the statistic and the parameter.
a. Georgianna samples 20 children from a particular city and measures how many years they
have each been playing piano.
b. Traffic police officers (who are regularly exposed to lead from automobile exhaust) had
their lead levels measured in their blood.
Min 147.2
Q1 163.8
Median 170.3
Mean 171.1
Q3 177.8
Max 198.1
SD 9.4
IQR 14.0
a. What are the point estimates for the average and median heights of active adults?
b. What are the point estimates for the standard deviation and IQR of heights of active adults?
c. Is a person who is 1m 80cm (180 cm) tall considered unusually tall? And is a person who
is 1m 55cm (155cm) considered unusually short? Explain your reasoning.
d. The researchers take another random sample of physically active adults. Would you expect
the mean and the standard deviation of this new sample to be the ones given above?
Explain your reasoning.
e. The sample means obtained are point estimates for the mean height of all active individuals,
if the sample of individuals is equivalent to a simple random sample. What measure do we
use to quantify the variability of such an estimate? Compute this quantity using the data
from the original sample under the condition that the data are a simple random sample.
11 The bdims data used in this exercise can be found in the openintro R package.
4. Heights of adults, standard error. Heights of 507 physically active adults have a mean of
171 cm and a standard deviation of 9.4 cm. Provide an estimate for the standard error of the
mean for samples of the following sizes.12 (Heinz et al. 2003)
a. n = 10
b. n = 50
c. n = 100
d. n = 1000
e. The standard error of the mean is a number which describes what?
5. Heights of adults vs. kindergartners. Heights of 507 physically active adults have a mean
of 171 cm and a standard deviation of 9.4 cm.13 (Heinz et al. 2003)
a. Would you expect the standard deviation of the heights of a few hundred kindergartners
to be higher or lower than 9.4 cm? Explain your reasoning.
b. Suppose many samples of 100 adults are taken and, separately, many samples of 100
kindergartners are taken. For each of the many samples, the average height is computed.
Which set of sample averages would have a larger standard error of the mean, the adult
sample averages or the kindergartner sample averages?
a. Given the bootstrap sampling distribution for the sample mean, find an approximate value
for the standard error of the mean.
b. By looking at the bootstrap sampling distribution (1,000 bootstrap samples were taken),
find an approximate 90% bootstrap percentile confidence interval for the true average adult
height in the population from which the data were randomly sampled. Provide the interval
as well as a one-sentence interpretation of the interval.
c. By looking at the bootstrap sampling distribution (1,000 bootstrap samples were taken),
find an approximate 90% bootstrap SE confidence interval for the true average adult height
in the population from which the data were randomly sampled. Provide the interval as well
as a one-sentence interpretation of the interval.
12 The bdims data used in this exercise can be found in the openintro R package.
13 The bdims data used in this exercise can be found in the openintro R package.
14 The bdims data used in this exercise can be found in the openintro R package.
7. Identify the critical 𝑡. A random sample is selected from an approximately normal population
with unknown standard deviation. Find the degrees of freedom and the critical 𝑡-value (t⋆ ) for
the given sample size and confidence level.
a. 𝑛 = 6, CL = 90%
b. 𝑛 = 21, CL = 98%
c. 𝑛 = 29, CL = 95%
d. 𝑛 = 12, CL = 99%
8. 𝑡-distribution. The figure below shows three unimodal and symmetric curves: the standard
normal (z) distribution, the 𝑡-distribution with 5 degrees of freedom, and the 𝑡-distribution with
1 degree of freedom. Determine which is which, and explain your reasoning.
9. Find the p-value, I. A random sample is selected from an approximately normal population
with an unknown standard deviation. Find the p-value for the given sample size and test statistic.
Also determine if the null hypothesis would be rejected at 𝛼 = 0.05.
10. Find the p-value, II. A random sample is selected from an approximately normal population
with an unknown standard deviation. Find the p-value for the given sample size and test statistic.
Also determine if the null hypothesis would be rejected at 𝛼 = 0.01.
a. 𝑛 = 26, 𝑇 = 2.485
b. 𝑛 = 18, 𝑇 = 0.5
11. Length of gestation, confidence interval. Every year, the United States Department of
Health and Human Services releases to the public a large dataset containing information on
births recorded in the country. This dataset has been of interest to medical researchers who are
studying the relation between habits and practices of expectant mothers and the birth of their
children. In this exercise we work with a random sample of 1,000 cases from the dataset released
in 2014. The length of pregnancy, measured in weeks, is commonly referred to as gestation. The
histograms below show the distribution of lengths of gestation from the random sample of 1,000
births (on the left) and the distribution of bootstrapped means of gestation from 1,500 different
bootstrap samples (on the right).15
a. Given the bootstrap sampling distribution for the sample mean, find an approximate value
for the standard error of the mean.
b. By looking at the bootstrap sampling distribution (1,500 bootstrap samples were taken),
find an approximate 99% bootstrap percentile confidence interval for the true average ges-
tation length in the population from which the data were randomly sampled. Provide the
interval as well as a one-sentence interpretation of the interval.
c. By looking at the bootstrap sampling distribution (1,500 bootstrap samples were taken),
find an approximate 99% bootstrap SE confidence interval for the true average gestation
length in the population from which the data were randomly sampled. Provide the interval
as well as a one-sentence interpretation of the interval.
12. Length of gestation, hypothesis test. In this exercise we work with a random sample of 1,000
cases from the dataset released by the United States Department of Health and Human Services
in 2014. Provided below are sample statistics for gestation (length of pregnancy, measured in
weeks) of births in this sample.16
a. What is the point estimate for the average length of pregnancy for all women? What about
the median?
b. You might have heard that human gestation is typically 40 weeks. Using the data, perform
a complete hypothesis test, using mathematical models, to assess the 40 week claim. State
the null and alternative hypotheses, find the T score, find the p-value, and provide a
conclusion in context of the data.
c. A quick internet search validates the claim of “40 weeks gestation” for humans. A friend
of yours claims that there are different ways to measure gestation (starting at first day of
last period, ovulation, or conception) which will result in estimates that are a week or two
different. Another friend mentions that recent increases in cesarean births are likely to have
decreased length of gestation. Do the data provide a mechanism to distinguish between
your two friends’ claims?
15 The births14 data used in this exercise can be found in the openintro R package.
16 The births14 data used in this exercise can be found in the openintro R package.
13. Interpreting confidence intervals for population mean. For each of the following state-
ments, indicate if they are a true or false interpretation of the confidence interval. If false,
provide a reason or correction to the misinterpretation. You collect a large sample and calculate
a 95% confidence interval for the average number of cans of sodas consumed annually per adult
in the US to be (440 cans, 520 cans), i.e., on average, adults in the US consume between one and two
cans of soda per day.
a. 95% of adults in the US consume between 440 and 520 cans of soda per year.
b. There is a 95% probability that the true population average per adult yearly soda consump-
tion is between 440 and 520 cans.
c. The true population average per adult yearly soda consumption is between 440 and 520
cans, with 95% confidence.
d. The average soda consumption of the people who were sampled is between 440 and 520
cans of soda per year, with 95% confidence.
14. Interpreting p-values for population mean. For each of the following statements, indicate
if they are a true or false interpretation of the p-value. If false, provide a reason or correction to
the misinterpretation. You are wondering if the average amount of cereal in a 10oz cereal box
is greater than 10oz. You collect 50 boxes of cereal, weigh them carefully, find a T score, and a
p-value of 0.23.
a. The probability that the average weight of all cereal boxes is 10 oz is 0.23.
b. The probability that the average weight of all cereal boxes is greater than 10 oz is 0.23.
c. Because the p-value is 0.23, the average weight of all cereal boxes is 10 oz.
d. Because the p-value is small, the population average must be just barely above 10 oz.
e. If 𝐻0 is true, the probability of observing another sample with an average as extreme as or more
extreme than the data is 0.23.
15. Working backwards, I. A 95% confidence interval for a population mean, 𝜇, is given as (18.985,
21.015). The population distribution is approximately normal and the population standard
deviation is unknown. This confidence interval is based on a simple random sample of 36
observations. Assuming that all conditions necessary for inference are satisfied, and using the 𝑡-
distribution, calculate the sample mean, the margin of error, and the sample standard deviation.
16. Working backwards, II. A 90% confidence interval for a population mean is (65, 77). The pop-
ulation distribution is approximately normal and the population standard deviation is unknown.
This confidence interval is based on a simple random sample of 25 observations. Assuming that
all conditions necessary for inference are satisfied, and using the 𝑡-distribution, calculate the
sample mean, the margin of error, and the sample standard deviation.
17. Sleep habits of New Yorkers. New York is known as “the city that never sleeps”. A random
sample of 25 New Yorkers were asked how much sleep they get per night. Statistical summaries
of these data are shown below. The point estimate suggests New Yorkers sleep less than 8 hours
a night on average. Evaluate the claim that New York is the city that never sleeps keeping in
mind that, despite this claim, the true average number of hours New Yorkers sleep could be less
than 8 hours or more than 8 hours.
18. Find the mean. You are given the hypotheses shown below. We know that the sample standard
deviation is 8 and the sample size is 20. For what sample mean would the p-value be equal to
0.05? Assume that all conditions necessary for inference are satisfied.
𝐻0 ∶ 𝜇 = 60 𝐻𝐴 ∶ 𝜇 ≠ 60
19. 𝑡⋆ for the correct confidence level. As you’ve seen, the tails of a 𝑡-distribution are longer
than the standard normal which results in 𝑡⋆𝑑𝑓 being larger than 𝑧⋆ for any given confidence
level. When finding a CI for a population mean, explain how mistakenly using 𝑧⋆ (instead of
the correct 𝑡⋆𝑑𝑓 ) would affect the confidence level.
20. Possible bootstrap samples. Consider a simple random sample of the following observations:
47, 4, 92, 47, 12, 8. Which of the following could be possible bootstrap samples from the
observed data above? If the set of values could not be a bootstrap sample, indicate why not.
a. 47, 47, 47, 47, 47, 47
b. 92, 4, 13, 8, 47, 4
c. 92, 47, 12
d. 8, 47, 12, 12, 8, 4, 92
e. 12, 4, 8, 8, 92, 12
21. Play the piano. Georgianna claims that in a small city renowned for its music school, the
average child takes less than 5 years of piano lessons. We have a random sample of 20 children
from the city, with a mean of 4.6 years of piano lessons and a standard deviation of 2.2 years.
a. Evaluate Georgianna’s claim (or that the opposite might be true) using a hypothesis test.
b. Construct a 95% confidence interval for the number of years students in this city take piano
lessons, and interpret it in context of the data.
c. Do your results from the hypothesis test and the confidence interval agree? Explain your
reasoning.
22. Auto exhaust and lead exposure. Researchers interested in lead exposure due to car exhaust
sampled the blood of 52 police officers subjected to constant inhalation of automobile exhaust
fumes while working traffic enforcement in a primarily urban environment. The blood samples of
these officers had an average lead concentration of 124.32 𝜇g/l and a SD of 37.74 𝜇g/l; a previous
study of individuals from a nearby suburb, with no history of exposure, found an average blood
level concentration of 35 𝜇g/l. (Mortada et al. 2000)
a. Write down the hypotheses that would be appropriate for testing if the police officers appear
to have been exposed to a different concentration of lead.
b. Explicitly state and check all conditions necessary for inference on these data.
c. Test the hypothesis that the downtown police officers have a higher lead exposure than the
group in the previous study. Interpret your results in context.
Chapter 20
In this section we consider a difference in two population means, 𝜇1 − 𝜇2 , under the condition that
the data are not paired. Just as with a single sample, we identify conditions to ensure we can use the
𝑡-distribution with a point estimate of the difference, 𝑥̄1 − 𝑥̄2, and a new standard error formula.
The details for working through inferential problems in the two independent means setting are strik-
ingly similar to those applied to the two independent proportions setting. We first cover a random-
ization test where the observations are shuffled under the assumption that the null hypothesis is true.
Then we bootstrap the data (with no imposed null hypothesis) to create a confidence interval for the
true difference in population means, 𝜇1 − 𝜇2 . The mathematical model, here the 𝑡-distribution, is able
to describe both the randomization test and the bootstrapping as long as the conditions are met.
The inferential tools are applied to three different data contexts: determining whether stem cells can
improve heart function, exploring the relationship between pregnant women’s smoking habits and
birth weights of newborns, and exploring whether there is convincing evidence that one variation
of an exam is harder than another variation. This section is motivated by questions like “Is there
convincing evidence that newborns from mothers who smoke have a different average birth weight
than newborns from mothers who do not smoke?”
20.1. RANDOMIZATION TEST FOR THE DIFFERENCE IN MEANS 343
An instructor decided to run two slight variations of the same exam. Prior to passing out the exams,
they shuffled the exams together to ensure each student received a random version. Anticipating
complaints from students who took Version B, they would like to evaluate whether the difference
observed in the groups is so large that it provides convincing evidence that Version B was more
difficult (on average) than Version A.
Figure 20.1: Exam scores for students given one of two different exams.
GUIDED PRACTICE
Construct hypotheses to evaluate whether the observed difference in sample means,
𝑥̄𝐴 − 𝑥̄𝐵 = 3.1, is likely to have happened due to chance, if the null hypothesis is true.
We will later evaluate these hypotheses using 𝛼 = 0.01.1
GUIDED PRACTICE
Before moving on to evaluate the hypotheses in the previous Guided Practice, let’s think
carefully about the dataset. Are the observations across the two groups independent?
Are there any concerns about outliers?2
1 𝐻0 ∶ the exams are equally difficult, on average. 𝜇𝐴 − 𝜇𝐵 = 0. 𝐻𝐴 ∶ one exam was more difficult than the other,
on average. 𝜇𝐴 − 𝜇𝐵 ≠ 0.
2 Since the exams were shuffled, the “treatment” in this case was randomly assigned, so independence within and
between groups is satisfied. The summary statistics suggest the data are roughly symmetric about the mean, and the
min/max values do not suggest any outliers of concern.
344 CHAPTER 20. INFERENCE FOR COMPARING TWO INDEPENDENT MEANS
Figure 20.2: The version of the test (A or B) is randomly allocated to the test scores, under the null
assumption that the tests are equally difficult.
Building on Figure 20.2, Figure 20.3 shows the values of the simulated statistics 𝑥̄1,𝑠𝑖𝑚 − 𝑥̄2,𝑠𝑖𝑚 over
1,000 random simulations. We see that, just by chance, the difference in scores can range anywhere
from -10 points to +10 points.
Figure 20.3: Histogram of differences in means, calculated from 1,000 different randomizations of the exam
types.
Figure 20.4: Histogram of differences in means, calculated from 1,000 different randomizations of the exam
types. The observed difference of 3.1 points is plotted as a vertical line, and the area more extreme than 3.1
is shaded to represent the p-value.
EXAMPLE
Approximate the p-value depicted in Figure 20.4, and provide a conclusion in the context of
the case study.
Using software, we find that 900 of the 1,000 shuffled differences in means are less than the
observed difference (3.14). So 10% of the simulated differences are larger than the observed
difference. To get the two-sided p-value, we double this proportion: p-value = 0.2.
Previously, we specified that we would use 𝛼 = 0.01. Since the p-value is larger than 𝛼, we
do not reject the null hypothesis. That is, the data do not convincingly show that one exam
version is more difficult than the other, and the teacher should not be convinced that they
should add points to the Version B exam scores.
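The shuffling procedure behind this p-value can be sketched in a few lines of R. The exam scores are not reproduced in this section, so the scores below are simulated purely for illustration; the group sizes, means, seed, and the helper diff_in_means() are all assumptions of ours, not values from the case study.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Hypothetical scores (simulated, NOT the exam data from the text)
scores  <- c(rnorm(30, mean = 75, sd = 10), rnorm(30, mean = 72, sd = 10))
version <- rep(c("A", "B"), each = 30)

# Hypothetical helper: mean of Version A minus mean of Version B
diff_in_means <- function(values, groups) {
  mean(values[groups == "A"]) - mean(values[groups == "B"])
}

observed <- diff_in_means(scores, version)

# Under the null hypothesis the labels are interchangeable, so shuffle them
# 1,000 times and recompute the difference in means each time
shuffled <- replicate(1000, diff_in_means(scores, sample(version)))

# Two-sided p-value: double the smaller tail proportion (capped at 1)
upper   <- mean(shuffled >= observed)
lower   <- mean(shuffled <= observed)
p_value <- min(1, 2 * min(upper, lower))
p_value
```

Doubling the smaller tail is one common convention for a two-sided randomization p-value; it matches the doubling step described in the example.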
Before providing a full example working through a bootstrap analysis on actual data, we return to the
fictional Awesome Auto example as a way to visualize the two sample bootstrap setting. Consider an
expanded scenario where the research question centers on comparing the average price of a car at one
Awesome Auto franchise (Group 1) to the average price of a car at a different Awesome Auto franchise
(Group 2). The process of bootstrapping can be applied to each Group separately, and the differences
of means recalculated each time. Figure 20.5 visually describes the bootstrap process when interest is
in a statistic computed on two separate samples. The analysis proceeds as in the one sample case, but
now the (single) statistic of interest is the difference in sample means. That is, a bootstrap resample
is done on each of the groups separately, but the results are combined to have a single bootstrapped
difference in means. Repetition will produce 𝑘 bootstrapped differences in means, and the histogram
will describe the natural sampling variability associated with the difference in means.
Figure 20.5: For the two group comparison, the bootstrap resampling is done separately on each group, but
the statistic is calculated as a difference. The set of k differences is then analyzed as the statistic of interest
with conclusions drawn on the parameter of interest.
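The two-group resampling described above can be sketched as follows. The prices are made up for illustration (five cars per franchise, matching the sample size mentioned for the fictional setting), and the seed and variable names are our own assumptions.

```r
set.seed(2)  # arbitrary seed so the sketch is reproducible

# Hypothetical car prices at two franchises (invented for illustration)
group1 <- c(18300, 20100, 9600, 10700, 27000)
group2 <- c(15200, 22400, 12800, 19900, 17500)

k <- 1000  # number of bootstrap repetitions

# Resample each group separately, with replacement, then take the
# difference in resampled means as the single statistic of interest
boot_diffs <- replicate(k, {
  resample1 <- sample(group1, replace = TRUE)
  resample2 <- sample(group2, replace = TRUE)
  mean(resample1) - mean(resample2)
})

# Bootstrap standard error and a 90% percentile interval for the difference
boot_se       <- sd(boot_diffs)
percentile_ci <- quantile(boot_diffs, probs = c(0.05, 0.95))

boot_se
percentile_ci
```

With only five observations per group the interval will be wide and unstable, which is one reason larger samples are preferable in practice.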
In the following sections, we leave the fictional setting of Awesome Auto and apply the bootstrap
method to actual datasets investigating whether embryonic stem cells help improve heart function
and later to investigate characteristics of births. Note that the fictional setting allowed us to visual-
ize the bootstrap method because we had samples of size five. The visualization was important in
understanding how the bootstrap method works. However, the bootstrap relies heavily on the sample
being an outstanding proxy for the population, and five observations are almost never enough to truly
represent the nuances of a full population. To that end, we apply bootstrap methods with much larger
sample sizes, as seen in the random sample of 1,000 cases from the births data.
Group n Mean SD
ESC 9 3.50 5.17
Control 9 -4.33 2.76
The point estimate of the difference in the heart pumping variable is straightforward to find: it is the
difference in the sample means.
$$\bar{x}_{esc} - \bar{x}_{control} = 3.50 - (-4.33) = 7.83$$
Figure 20.6: Histogram of differences in means after 1,000 bootstrap samples from each of the two groups.
The observed difference is plotted as a black vertical line at 7.83. The blue dashed and red dotted lines
provide the bootstrap percentile and bootstrap SE confidence intervals, respectively, for the difference in true
population means.
GUIDED PRACTICE
Using the histogram of bootstrapped difference in means, estimate the standard error
of the differences in sample means, 𝑥̄𝐸𝑆𝐶 − 𝑥̄𝐶𝑜𝑛𝑡𝑟𝑜𝑙.3
EXAMPLE
Choose one of the bootstrap confidence intervals for the true difference in average pumping
capacity, 𝜇𝐸𝑆𝐶 − 𝜇𝐶𝑜𝑛𝑡𝑟𝑜𝑙 . Does the interval show that there is a difference across the two
treatments?
Because neither of the 90% intervals (either percentile or SE) above overlaps zero (note that
zero is never one of the bootstrapped differences so 95% and 99% intervals would have given
the same conclusion!), we conclude that the ESC treatment is substantially better with respect
to heart pumping capacity than the control.
Because the study is a randomized controlled experiment, we can conclude that it is the
treatment (ESC) which is causing the change in pumping capacity.
Every year, the US releases to the public a large dataset containing information on births recorded in
the country. This dataset has been of interest to medical researchers who are studying the relation
between habits and practices of expectant mothers and the birth of their children. We will work with
a random sample of 1,000 cases from the dataset released in 2014.
Table 20.3: Four cases from the births14 dataset. The empty cells indicate missing data.
We would like to know, is there convincing evidence that newborns from mothers who smoke have a
different average birth weight than newborns from mothers who do not smoke? We will use data from
this sample to try to answer this question.
20.3. MATHEMATICAL MODEL FOR TESTING THE DIFFERENCE IN MEANS 349
EXAMPLE
The null hypothesis represents the case of no difference between the groups.
• 𝐻0 ∶ There is no difference in average birth weight for newborns from mothers who did
and did not smoke. In statistical notation: 𝜇𝑛 −𝜇𝑠 = 0, where 𝜇𝑛 represents non-smoking
mothers and 𝜇𝑠 represents mothers who smoked.
• 𝐻𝐴 ∶ There is some difference in average newborn weights from mothers who did and
did not smoke (𝜇𝑛 − 𝜇𝑠 ≠ 0).
Table 20.4 displays sample statistics from the data. We can see that the average birth weight of babies
born to smoker moms is lower than those born to nonsmoker moms.
Habit n Mean SD
nonsmoker 867 7.27 1.23
smoker 114 6.68 1.60
Figure 20.7: The top panel represents birth weights for infants whose mothers smoked during pregnancy.
The bottom panel represents the birth weights for infants whose mothers did not smoke during pregnancy.
GUIDED PRACTICE
The summary statistics in Table 20.4 may be useful for this Guided Practice. What is
the point estimate of the population difference, 𝜇𝑛 − 𝜇𝑠 ?4
The T score is a ratio of how the groups differ as compared to how the observations
within a group vary.
\[ T = \frac{(\bar{x}_1 - \bar{x}_2) - 0}{\sqrt{s_1^2/n_1 + s_2^2/n_2}} \]
When the null hypothesis is true and the conditions are met, T has a t-distribution with
𝑑𝑓 = 𝑚𝑖𝑛(𝑛1 − 1, 𝑛2 − 1).
Conditions:
GUIDED PRACTICE
Compute the standard error of the point estimate for the average difference between
the weights of babies born to nonsmoker and smoker mothers.5
EXAMPLE
Complete the hypothesis test started in the previous Example and Guided Practice on births14
dataset and research question. Use a discernibility level of 𝛼 = 0.05. For reference, 𝑥𝑛̄ − 𝑥𝑠̄ =
0.59, 𝑆𝐸 = 0.16, and the sample sizes were 𝑛𝑛 = 867 and 𝑛𝑠 = 114.
We can find the test statistic for this test using the previous information:
\[ T = \frac{0.59 - 0}{0.16} = 3.69 \]
We find the single tail area using software. We’ll use the smaller of 𝑛𝑛 −1 = 866 and 𝑛𝑠 −1 = 113
as the degrees of freedom: 𝑑𝑓 = 113. The one tail area is roughly 0.00017; doubling this value
gives the two-tail area and p-value, 0.00034.
The p-value is smaller than the discernibility level, 0.05, so we reject the null hypothesis. The
data provide statistically discernible evidence of a difference in the average weights of babies
born to mothers who smoked during pregnancy and those who did not.
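The calculation in this Example can be reproduced from the summary statistics in Table 20.4. The sketch below uses only the Python standard library; the helper names `t_pdf` and `t_upper_tail` are our own, and the tail area is approximated by numerical integration (Simpson's rule) rather than by a statistics package. As in the text, the standard error is rounded to 0.16 before the T score is computed.

```python
import math

def t_pdf(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    log_const = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    return math.exp(log_const) / math.sqrt(df * math.pi) * (1 + x * x / df) ** (-(df + 1) / 2)

def t_upper_tail(t, df, steps=2000):
    """Area to the right of t (for t >= 0), using Simpson's rule on [0, t]."""
    h = t / steps
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

# Summary statistics from Table 20.4 (birth weights).
n_n, mean_n, sd_n = 867, 7.27, 1.23   # nonsmoker
n_s, mean_s, sd_s = 114, 6.68, 1.60   # smoker

se = math.sqrt(sd_n**2 / n_n + sd_s**2 / n_s)
se_rounded = round(se, 2)                     # the text works with the rounded SE
t_score = (mean_n - mean_s - 0) / se_rounded  # null value is 0
df = min(n_n - 1, n_s - 1)                    # conservative degrees of freedom
p_value = 2 * t_upper_tail(t_score, df)       # two-sided test

print(se_rounded, round(t_score, 2), df, p_value)
```

Using the unrounded standard error gives a slightly larger T score, but the conclusion at the 0.05 discernibility level does not change.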
This result is likely not surprising. We all know that smoking is bad for you and you’ve probably also
heard that smoking during pregnancy is bad not just for the mother but also for the baby. In
fact, some in the tobacco industry actually had the audacity to tout that as a benefit of smoking:
It’s true. The babies born from women who smoke are smaller, but they’re just as healthy
as the babies born from women who do not smoke. And some women would prefer having
smaller babies. - Joseph Cullman, Philip Morris’ Chairman of the Board on CBS’ Face the
Nation, Jan 3, 1971
Furthermore, health differences between babies born to mothers who smoke and those who do not are
not limited to weight differences.6
A small note on the power of the independent t-test (recall the discussion of power in Section 14.4). It
turns out that the independent t-test given here is often less powerful than the paired t-test discussed
in Section 21.3. That said, depending on how the data are collected, we don’t always have a mechanism
for pairing the data and reducing the inherent variability across observations.
5 \(SE(\bar{x}_n - \bar{x}_s) = \sqrt{s_n^2/n_n + s_s^2/n_s} = \sqrt{1.23^2/867 + 1.60^2/114} = 0.16\)
6 You can watch an episode of John Oliver on Last Week Tonight to explore the present day offenses of the tobacco
industry. Please be aware that there is some adult language.
20.4. MATHEMATICAL MODEL FOR ESTIMATING THE DIFFERENCE IN MEANS 351
The 𝑡-distribution can be used for inference when working with the standardized differ-
ence of two means if
• Independence (extended). The data are independent within and between the two
groups, e.g., the data come from independent random samples or from a random-
ized experiment.
• Normality. We check the outliers for each group separately.
\[ SE = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \]
The official formula for the degrees of freedom is quite complex and is generally com-
puted using software, so instead you may use the smaller of 𝑛1 − 1 and 𝑛2 − 1 for the
degrees of freedom if software isn’t readily available.
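For readers curious about what software computes, the sketch below compares the two choices of degrees of freedom. It assumes that the "official formula" is the Welch-Satterthwaite approximation, which is what statistical software typically reports for this test; the function names are hypothetical. The inputs are the ESC and control standard deviations and sample sizes from the sheep study.

```python
def welch_df(s1, n1, s2, n2):
    """Welch-Satterthwaite approximation to the degrees of freedom (assumed formula)."""
    v1, v2 = s1**2 / n1, s2**2 / n2
    return (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

def conservative_df(n1, n2):
    """The by-hand rule used in this text: the smaller of n1 - 1 and n2 - 1."""
    return min(n1 - 1, n2 - 1)

# ESC group: SD 5.17, n = 9; control group: SD 2.76, n = 9.
df_software = welch_df(5.17, 9, 2.76, 9)
df_by_hand = conservative_df(9, 9)
print(round(df_software, 1), df_by_hand)
```

The by-hand rule never gives more degrees of freedom than the approximation, so it leads to slightly wider intervals and slightly larger p-values.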
Recall that the margin of error is defined by the standard error. The margin of error for 𝑥1̄ − 𝑥2̄ can
be directly obtained from 𝑆𝐸(𝑥1̄ − 𝑥2̄ ).
The margin of error is \(t^\star_{df} \times \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}\), where \(t^\star_{df}\) is calculated from a specified percentile
on the t-distribution with df degrees of freedom.
EXAMPLE
Can the 𝑡-distribution be used to make inference using the point estimate, \(\bar{x}_{esc} - \bar{x}_{control} = 7.83\)?
First, we check for independence. Because the sheep were randomized into the groups, inde-
pendence within and between groups is satisfied.
Figure 20.8 does not reveal any clear outliers in either group. (The ESC group does look a bit
more variable, but this is not the same as having clear outliers.)
With both conditions met, we can use the 𝑡-distribution to model the difference of sample
means.
352 CHAPTER 20. INFERENCE FOR COMPARING TWO INDEPENDENT MEANS
Figure 20.8: Histograms for the difference in heart pumping function after a heart attack for both the
treatment group (ESC, which received the embryonic stem cell treatment) and the control group (which
did not receive the treatment).
Generally, we use statistical software to find the appropriate degrees of freedom, or if software isn’t
available, we can use the smaller of 𝑛1 − 1 and 𝑛2 − 1 for the degrees of freedom, e.g., if using a
𝑡-table to find tail areas. For transparency in the Examples and Guided Practice, we’ll use the latter
approach for finding 𝑑𝑓; in the case of the ESC example, this means we’ll use 𝑑𝑓 = 8.
EXAMPLE
Calculate a 95% confidence interval for the effect of ESCs on the change in heart pumping
capacity of sheep after they’ve suffered a heart attack.
We will use the sample difference and the standard error that we computed earlier:
\[ \bar{x}_{esc} - \bar{x}_{control} = 7.83 \]
\[ SE = \sqrt{\frac{5.17^2}{9} + \frac{2.76^2}{9}} = 1.95 \]
Using 𝑑𝑓 = 8, we can identify the critical value of 𝑡⋆8 = 2.31 for a 95% confidence interval.
Finally, we can enter the values into the confidence interval formula:
point estimate ± 𝑡⋆ × 𝑆𝐸
7.83 ± 2.31 × 1.95
(3.32 , 12.34)
We are 95% confident that the heart pumping function in sheep that received embryonic stem
cells is between 3.32% and 12.34% higher than for sheep that did not receive the stem cell
treatment.
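The interval in this Example can be checked with a short Python sketch. The function name `two_sample_ci` is our own, and the critical value 2.31 is taken from the text rather than computed.

```python
import math

def two_sample_ci(mean1, sd1, n1, mean2, sd2, n2, t_star):
    """Confidence interval for a difference in means: point estimate +/- t_star * SE."""
    diff = mean1 - mean2
    se = math.sqrt(sd1**2 / n1 + sd2**2 / n2)
    margin = t_star * se
    return diff - margin, diff + margin

# ESC group: mean 3.50, SD 5.17, n = 9; control group: mean -4.33, SD 2.76, n = 9.
lower, upper = two_sample_ci(3.50, 5.17, 9, -4.33, 2.76, 9, t_star=2.31)
print(round(lower, 2), round(upper, 2))
```

The sketch keeps the standard error unrounded until the final step, so results for other inputs may differ in the last digit from a calculation that rounds the SE first.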
20.5. CHAPTER REVIEW 353
20.5.1 Summary
In this chapter we extended the single mean inferential methods to questions of differences in
means. You may have seen parallels from the chapters that extended inference for a single proportion
to inference for differences in proportions. When
considering differences in sample means (indeed, when considering many quantitative statistics), we
use the t-distribution to describe the sampling distribution of the T score (the standardized difference
in sample means). Ideas of confidence level and type of error which might occur from a hypothesis
test conclusion are similar to those seen in other chapters (see Chapter 14).
20.5.2 Terms
The terms introduced in this chapter are presented in Table 20.5. If you’re not sure what some of
these terms mean, we recommend you go back in the text and review their definitions. You should be
able to easily spot them as bolded text.
20.6 Exercises
\[ H_0: \bar{x}_{AD} \le \bar{x}_{I} \qquad H_A: \bar{x}_{AD} > \bar{x}_{I} \]
2. Fill in the blanks. We use a ___ to evaluate if data provide convincing evidence of a difference
between two population means and we use a ___ to estimate this difference.
3. Diamonds, randomization test. The prices of diamonds go up as the carat weight increases,
but the increase is not smooth. For example, the difference between the size of a 0.99 carat
diamond and a 1 carat diamond is undetectable to the naked human eye, but the price of a 1
carat diamond tends to be much higher than the price of a 0.99 carat diamond. We have two
random samples of diamonds: 23 0.99 carat diamonds and 23 1 carat diamonds. In order to
be able to compare equivalent units, we first divide the price for each diamond by 100 times
its weight in carats. That is, for a 0.99 carat diamond, we divide the price by 99 and for a 1
carat diamond, we divide it by 100. Then, we randomize the carat weight to the price values
in order to simulate the null distribution of differences in average prices of 0.99 carat and 1 carat
diamonds. The null distribution (with 1,000 randomized differences) is shown below and depicts
the distribution of differences in sample means (of price per carat) if there really was no difference
in the population from which these diamonds came.7 (Wickham 2016)
Using the randomization distribution, conduct a hypothesis test to evaluate if there is a difference
between the prices per carat of diamonds that weigh 0.99 carats and diamonds that weigh 1
carat. Make sure to state your hypotheses clearly and interpret your results in context of the
data. (Wickham 2016)
7 The diamonds data used in this exercise can be found in the ggplot2 R package.
5. Diamonds, bootstrap interval. We have data on two random samples of diamonds: 23 0.99
carat diamonds and 23 1 carat diamonds. Provided below is a histogram of bootstrap differences
in means of price per carat of diamonds that weigh 0.99 carats and diamonds that weigh 1 carat.
(Wickham 2016)
a. Using the bootstrap distribution, create a (rough) 95% bootstrap percentile confidence
interval for the true population difference in prices per carat of diamonds that weigh 0.99
carats and 1 carat.
b. Using the bootstrap distribution, create a (rough) 95% bootstrap SE confidence interval
for the true population difference in prices per carat of diamonds that weigh 0.99 carats
and 1 carat. Note that the standard error of the bootstrap distribution is 4.64.
8 The lizard_run data used in this exercise can be found in the openintro R package.
6. Lizards running, bootstrap interval. We have data on top speeds (in m/sec) measured on
a laboratory race track for two species of lizards: Western fence lizard (Sceloporus occidentalis)
and Sagebrush lizard (Sceloporus graciosus). The bootstrap distribution below describes the
variability of difference in means captured from 1,000 bootstrap samples of the lizard data.
(Adolph 1987)
a. Using the bootstrap distribution, create a (rough) 90% bootstrap percentile confidence
interval for the true population difference in average speed of the Western fence lizard as
compared with Sagebrush lizard.
b. Using the bootstrap distribution, create a (rough) 90% bootstrap SE confidence interval
for the true population difference in average speed of the Western fence lizard as compared
with Sagebrush lizard.
7. Weight loss. You are reading an article in which the researchers have created a 95% confidence
interval for the difference in average weight loss for two diets. They are 95% confident that the
true difference in average weight loss over 6 months for the two diets is somewhere between 1 lb
and 25 lbs. The authors claim that, “therefore diet A (\(\bar{x}_A\) = 20 lbs average loss) results in a
much larger average weight loss as compared to diet B (\(\bar{x}_B\) = 7 lbs average loss).” Comment on
the authors’ claim.
8. Possible randomized means. Data were collected from two groups (A and B). There
were three measurements taken in Group A and two measurements in Group B.
If the data are (repeatedly) randomly allocated across the two conditions, provide the following:
(1) the values which are assigned to group A, (2) the values which are assigned to group B, and
(3) the difference in averages (\(\bar{x}_A - \bar{x}_B\)) for each of the following:
a. When the randomized difference in averages is as large as possible.
b. When the randomized difference in averages is as small as possible (a negative number
that is large in magnitude).
c. When the randomized difference in averages is as close to zero as possible.
d. When the observed values are randomly assigned to the two groups, to which of the previous
parts would you expect the difference in means to fall closest? Explain your reasoning.
9. Diamonds, mathematical test. We have data on two random samples of diamonds: one with
diamonds that weigh 0.99 carats and one with diamonds that weigh 1 carat. Each sample has
23 diamonds. Sample statistics for the price per carat of diamonds in each sample are provided
below. Conduct a hypothesis test using a mathematical model to evaluate if there is a difference
between the prices per carat of diamonds that weigh 0.99 carats and diamonds that weigh 1
carat. Make sure to state your hypotheses clearly, check relevant conditions, and interpret your
results in context of the data. (Wickham 2016)
Mean SD n
0.99 carats $44.51 $13.32 23
1 carat $57.20 $18.19 23
10. A/B testing. A/B testing is a user experience research methodology where two variants of
a page are shown to users at random. A company wants to evaluate whether users will spend
more time, on average, on Page A or Page B using an A/B test. Two user experience designers
at the company, Lucie and Müge, are tasked with conducting the analysis of the data collected.
They agree on how the null hypothesis should be set: on average, users spend the same amount
of time on Page A and Page B. Lucie believes that Page B will provide a better experience for
users and hence wants to use a one-tailed test; Müge believes that a two-tailed test would be a
better choice. Which designer do you agree with, and why?
11. Diamonds, mathematical interval. We have data on two random samples of diamonds: one
with diamonds that weigh 0.99 carats and one with diamonds that weigh 1 carat. Each sample
has 23 diamonds. Sample statistics for the price per carat of diamonds in each sample are
provided below. Assuming that the conditions for conducting inference using a mathematical
model are satisfied, construct a 95% confidence interval for the true population difference in
prices per carat of diamonds that weigh 0.99 carats and 1 carat. (Wickham 2016)
Mean SD n
0.99 carats $44.51 $13.32 23
1 carat $57.20 $18.19 23
12. True / False: comparing means. Determine if the following statements are true or false,
and explain your reasoning for statements you identify as false.
a. As the degrees of freedom increases, the 𝑡-distribution approaches normality.
b. If a 95% confidence interval for the difference between two population means contains 0, a
99% confidence interval calculated based on the same two samples will also contain 0.
c. If a 95% confidence interval for the difference between two population means contains 0, a
90% confidence interval calculated based on the same two samples will also contain 0.
13. Difference of means. We collect two random samples from two different populations. In each
part below, consider the sample means 𝑥1̄ and 𝑥2̄ that we might observe from these two samples.
14. Mindfulness intervention for nurses. In order to address extremely challenging and stressful
situations for intensive care unit nurses, researchers ran a mindfulness-based intervention (MBI)
study on 60 nurses working in three hospitals in El-Beheira, Egypt. The participants were
randomly allocated to one of the two groups: the treatment group (MBI) received 8 MBI sessions
and the control group received no intervention. The nurses’ emotional exhaustion was measured
using 9 items from a questionnaire of the Maslach Burnout Inventory-Human Services Survey
for Medical Personnel; the questions are recorded on a Likert scale where 0 indicates “Never”
and 6 indicates “Every day”. Nurses in the treatment group had an average emotional exhaustion score
of 15.47, with a standard deviation of 4.44, and nurses in the control group had an average emotional
exhaustion score of 32.43, with a standard deviation of 8.87. Do these data provide convincing
evidence that the average emotional exhaustion score is different for the nurses in the treatment
group compared to the control group? Assume that conditions for conducting inference using
mathematical models are satisfied. (Othman, Hassan, and Mohamed 2023)
15. Chicken diet: horsebean vs. linseed. Chicken farming is a multi-billion dollar industry,
and any methods that increase the growth rate of young chicks can reduce consumer costs
while increasing company profits, possibly by millions of dollars. An experiment was conducted
to measure and compare the effectiveness of various feed supplements on the growth rate of
chickens. Newly hatched chicks were randomly allocated into six groups, and each group was
given a different feed supplement. In this exercise we consider chicks that were fed horsebean
and linseed. Below are some summary statistics from this dataset along with box plots showing
the distribution of weights by feed type.9 (McNeil 1977)
Horsebean Linseed
Mean 160.2 218.8
SD 38.6 52.2
n 10 12
a. Describe the distributions of weights of chickens that were fed horsebean and linseed.
b. Do these data provide strong evidence that the average weights of chickens that were fed
linseed and horsebean are different? Use a 5% discernibility level.
c. What type of error might we have committed? Explain.
d. Would your conclusion change if we used 𝛼 = 0.01?
16. Fuel efficiency in the city. Each year the US Environmental Protection Agency (EPA)
releases fuel economy data on cars manufactured in that year. Below are summary statistics
on fuel efficiency (in miles/gallon) from random samples of cars with manual and automatic
transmissions manufactured in 2021. Do these data provide strong evidence of a difference
between the average fuel efficiency of cars with manual and automatic transmissions in terms of
their average city mileage?10 (US DOE EPA 2021)
CITY Mean SD n
Automatic 17.4 3.44 25
Manual 22.7 4.58 25
9 The chickwts data used in this exercise can be found in the datasets R package.
10 The epa2021 data used in this exercise can be found in the openintro R package.
17. Chicken diet: casein vs. soybean. Casein is a common weight gain supplement for humans.
Does it have an effect on chickens? An experiment was conducted to measure and compare the
effectiveness of various feed supplements on the growth rate of chickens. Newly hatched chicks
were randomly allocated into six groups, and each group was given a different feed supplement.
In this exercise we consider chicks that were fed casein and soybean. Assume that the conditions
for conducting inference using mathematical models are met, and using the data provided below,
test the hypothesis that the average weight of chickens that were fed casein is different than the
average weight of chickens that were fed soybean. If your hypothesis test yields a statistically
discernible result, discuss whether the higher average weight of chickens can be attributed to
the casein diet. (McNeil 1977)
18. Fuel efficiency on the highway. Each year the US Environmental Protection Agency (EPA)
releases fuel economy data on cars manufactured in that year. Below are summary statistics
on fuel efficiency (in miles/gallon) from random samples of cars with manual and automatic
transmissions manufactured in 2021. Do these data provide strong evidence of a difference
between the average fuel efficiency of cars with manual and automatic transmissions in terms of
their average highway mileage? (US DOE EPA 2021)
HIGHWAY Mean SD n
Automatic 23.7 3.90 25
Manual 30.9 5.13 25
19. Gaming, distracted eating, and intake. A group of researchers who are interested in the
possible effects of distracting stimuli during eating, such as an increase or decrease in the amount
of food consumption, monitored food intake for a group of 44 patients who were randomized
into two equal groups. The treatment group ate lunch while playing solitaire, and the control
group ate lunch without any added distractions. Patients in the treatment group ate 52.1 grams
of biscuits, with a standard deviation of 45.1 grams, and patients in the control group ate 27.1
grams of biscuits, with a standard deviation of 26.4 grams. Do these data provide convincing
evidence that the average food intake (measured in amount of biscuits consumed) is different
for the patients in the treatment group compared to the control group? Assume that conditions
for conducting inference using mathematical models are satisfied. (Oldham-Cooper et al. 2011)
20. Gaming, distracted eating, and recall. A group of researchers who are interested in the
possible effects of distracting stimuli during eating, such as an increase or decrease in the amount
of food consumption, monitored food intake for a group of 44 patients who were randomized
into two equal groups. The 22 patients in the treatment group who ate their lunch while playing
solitaire were asked to do a serial-order recall of the lunch items they ate. The average
number of items recalled by the patients in this group was 4.9, with a standard deviation of
1.8. The average number of items recalled by the patients in the control group (no distraction)
was 6.1, with a standard deviation of 1.8. Do these data provide strong evidence that the
average numbers of food items recalled by the patients in the treatment and control groups
are different? Assume that conditions for conducting inference using mathematical models are
satisfied. (Oldham-Cooper et al. 2011)
Chapter 21
Paired data represent a particular type of experimental structure where the analysis is somewhat akin
to a one-sample analysis (see Chapter 19) but has other features that resemble a two-sample analysis
(see Chapter 20). As with a two-sample analysis, quantitative measurements are made on each of
two different levels of the explanatory variable. However, because the observational unit is paired
across the two groups, the two measurements are subtracted such that only the difference is retained.
Table 21.1 presents some examples of studies where paired designs were implemented.
Table 21.1: Examples of studies where a paired design is used to measure the difference in the measurement
over two conditions.
Paired data.
Two sets of observations are paired if each observation in one set has a special correspondence
or connection with exactly one observation in the other set.
21.1. RANDOMIZATION TEST FOR THE MEAN PAIRED DIFFERENCE 361
It is worth noting that if mathematical modeling is chosen as the analysis tool, paired data inference on
the difference in measurements will be identical to the one-sample mathematical techniques described
in Chapter 19. However, recall from Chapter 19 that with pure one-sample data, the computational
tools for hypothesis testing are not easy to implement and were not presented (although the bootstrap
was presented as a computational approach for constructing a one sample confidence interval). With
paired data, the randomization test fits nicely with the structure of the experiment and is presented
here.
Consider an experiment done to measure whether tire brand Smooth Turn or tire brand Quick Spin
has longer tread wear (in cm). That is, after 1,000 miles on a car, which brand of tires has more tread,
on average?
Figure 21.1: Box plots of the tire tread data (in cm) and the brand of tire from which the original measure-
ments came.
We’d like to be able to systematically distinguish between what the Smooth Turn manufacturer sees
in the plot and what the Quick Spin manufacturer sees in the plot. Fortunately for us, we have an
excellent way to simulate the natural variability (from road conditions, etc.) that can lead to tires
being worn at different rates.
362 CHAPTER 21. INFERENCE FOR COMPARING PAIRED MEANS
Figure 21.2: The 4th car: the tire brand was randomly permuted, and in the randomization calculation, the
measurements (in cm) ended up in different groups.
Figure 21.3: The 5th car: the tire brand was randomly permuted to stay the same! In the randomization
calculation, the measurements (in cm) ended up in the original groups.
We can put the shuffled assignments for all the cars into one plot as seen in Figure 21.4b.
(a) Brand of tire is the original brand. (b) Brand of tire is the shuffled brand assignment.
Figure 21.4: Tire tread (in cm) by brand, original and shuffled. As evidenced by the colors, some of the
cars kept their original tire assignments and some cars swapped the tire assignments.
The next step in the randomization test is to sort the brands so that the assigned brand value on the x-
axis aligns with the assigned group from the randomization. Figure 21.5a shows the same randomized
groups, as seen in Figure 21.4b previously. However, Figure 21.5b sorts the randomized groups so
that we can measure the variability across groups as compared to the variability within groups.
(a) Randomized brand assignment (b) Randomized brand assignment sorted by brand.
Figure 21.6 presents a second randomization of the data. Notice that the two observations from the
same car are linked with a grey line; some of the tread values have been randomly assigned to the
other tire brand, while some are still connected to their original tire brands.
Figure 21.6: A second randomization where the brand is randomly swapped (or not) across the two tread
wear measurements (in cm) from the same car.
Figure 21.7 presents yet another randomization of the data. Again, the same observations are linked
by a grey line, and some of the tread values have been randomly assigned to the opposite tire brand
than they were originally (while some are still connected to their original tire brands).
Figure 21.7: An additional randomization where the brand is randomly swapped (or not) across the two
tread wear measurements (in cm) from the same car.
average tire tread in Smooth Turn is due to more than just natural variability: we reject 𝐻0 and
conclude that 𝜇𝑆𝑇 ≠ 𝜇𝑄𝑆 .
Figure 21.8: Histogram of 1,000 mean differences with tire brand randomly assigned across the two tread
measurements (in cm) per pair.
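The randomization procedure illustrated in the figures above can be sketched in Python. Swapping the two brand labels within a car is equivalent to flipping the sign of that car's difference, so each randomization flips every difference with probability one half. The function names, the seed, and the tread differences below are hypothetical illustrations, not the data behind Figure 21.8.

```python
import random
import statistics

def paired_randomization(diffs, reps=1000, seed=21):
    """Null distribution of the mean paired difference.

    Each repetition keeps or swaps the brand labels within every pair,
    which keeps or flips the sign of that pair's difference.
    """
    rng = random.Random(seed)
    null_means = []
    for _ in range(reps):
        shuffled = [d if rng.random() < 0.5 else -d for d in diffs]
        null_means.append(statistics.mean(shuffled))
    return null_means

def two_sided_p_value(observed, null_means):
    """Share of randomized means at least as far from zero as the observed mean."""
    extreme = sum(abs(m) >= abs(observed) for m in null_means)
    return extreme / len(null_means)

# Hypothetical differences (Smooth Turn minus Quick Spin, in cm), for illustration only.
tread_diffs = [0.002, 0.004, -0.001, 0.003, 0.005, 0.001, 0.002, 0.004]

observed = statistics.mean(tread_diffs)
null_means = paired_randomization(tread_diffs)
print(two_sided_p_value(observed, null_means))
```

Because the labels are only ever swapped within a car, the pairing is preserved in every randomization, which is what distinguishes this test from the two-sample randomization test.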
Each textbook has two corresponding prices in the dataset: one for the UCLA Bookstore and one for
Amazon. When two sets of observations have this special correspondence, they are said to be paired.
GUIDED PRACTICE
Using the histogram of bootstrapped difference in means, estimate the standard error
of the mean of the sample differences, \(\bar{x}_{diff}\).1
The bootstrap SE interval is found by computing the SE of the bootstrapped differences (\(SE_{\bar{x}_{diff}}\) =
$1.64) and the normal multiplier of \(z^\star = 2.58\). The averaged difference is \(\bar{x}_{diff}\) = $3.58. The 99% confidence
interval is: $3.58 ± 2.58 × $1.64 = (−$0.65, $7.81).
The confidence intervals seem to indicate that the UCLA Bookstore price is, on average, higher than
the Amazon price, as the majority of each confidence interval is positive. However, if the analysis
required a strong degree of certainty (e.g., 99% confidence), and the bootstrap SE interval was most
appropriate (a second course in statistics investigates the nuances of these methods), then
which bookseller is more expensive is not well determined (because the bootstrap SE interval overlaps
zero). That is, the 99% bootstrap SE interval leaves open the possibility that the UCLA Bookstore is lower, on
average, than Amazon (because of the possible negative values for the true mean difference in price).
Figure 21.9: Bootstrap distribution for the average difference in new book price at the UCLA bookstore
versus Amazon. 99% confidence intervals are superimposed using blue dashed (bootstrap percentile interval)
and red dotted (bootstrap SE interval) lines.
1 The bootstrapped differences in sample means vary roughly from 0.7 to 7.5, a range of $6.80. Although the bootstrap
distribution is not symmetric, we use the empirical rule (that with bell-shaped distributions, most observations are within
two standard errors of the center) to estimate that the standard error of the mean differences is approximately $1.70. You might note
that the standard error calculation given in Section 21.3 is \(SE(\bar{x}_{diff}) = \sqrt{s_{diff}^2/n_{diff}} = \sqrt{13.4^2/68} = \$1.62\) (values
from Section 21.3), very close to the bootstrap approximation.
21.3. MATHEMATICAL MODEL FOR THE MEAN PAIRED DIFFERENCE 367
Thinking about the differences as a single observation on an observational unit changes the paired
setting into the one-sample setting. The mathematical model for the one-sample case is covered in
Section 19.2.
n Mean SD
68 3.58 13.4
EXAMPLE
Set up a hypothesis test to determine whether, on average, there is a difference between
Amazon’s price for a book and the UCLA bookstore’s price. Also, check the conditions for
whether we can move forward with the test using the 𝑡-distribution.
Next, we check the independence and normality conditions. This is a simple random sample,
so assuming the textbooks are independent seems reasonable. While there are some outliers,
𝑛 = 68 and none of the outliers are particularly extreme, so the normality of 𝑥̄ is satisfied.
With these conditions satisfied, we can move forward with the 𝑡-distribution.
The T score is a ratio of how the sample mean difference varies from zero as compared
to how the observations vary.
\[ T = \frac{\bar{x}_{diff} - 0}{s_{diff} / \sqrt{n_{diff}}} \]
When the null hypothesis is true and the conditions are met, T has a t-distribution with
𝑑𝑓 = 𝑛𝑑𝑖𝑓𝑓 − 1.
Conditions:
EXAMPLE
Complete the hypothesis test started in the previous Example.
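To compute the test statistic, first compute the standard error associated with \(\bar{x}_{diff}\) using the standard
deviation of the differences (\(s_{diff} = 13.42\)) and the number of differences (\(n_{diff} = 68\)):
\[ SE_{\bar{x}_{diff}} = \frac{s_{diff}}{\sqrt{n_{diff}}} = \frac{13.42}{\sqrt{68}} = 1.63 \]
The test statistic is the T score of \(\bar{x}_{diff}\) under the null hypothesis that the true mean difference is zero:
\[ T = \frac{\bar{x}_{diff} - 0}{SE_{\bar{x}_{diff}}} = \frac{3.58 - 0}{1.63} = 2.20 \]
With \(df = 67\), the upper tail area beyond 2.20 is about 0.0156.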
Doubling this area gives the p-value: 0.0312. Because the p-value is less than 0.05, we reject
the null hypothesis. The data provide evidence that Amazon prices are different, on average,
than the UCLA Bookstore prices for UCLA courses.
Recall that the margin of error is defined by the standard error. The margin of error for \(\bar{x}_{diff}\) can be
directly obtained from \(SE(\bar{x}_{diff})\).
EXAMPLE
Create a 95% confidence interval for the average price difference between books at the UCLA
bookstore and books on Amazon.
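Conditions have already been verified and the standard error computed in a previous Example.
To find the confidence interval, identify \(t^\star_{67}\) using statistical software or the 𝑡-table (\(t^\star_{67} = 2.00\)),
and plug it, the point estimate, and the standard error into the confidence interval formula:
\[ \text{point estimate} \pm t^\star \times SE \quad\to\quad 3.58 \pm 2.00 \times 1.63 \quad\to\quad (0.32, 6.84) \]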
We are 95% confident that the UCLA Bookstore is, on average, between $0.32 and $6.84 more
expensive than Amazon for UCLA course books.
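The paired calculations in this section reduce to one-sample arithmetic on the differences, which the following Python sketch reproduces from the summary statistics. The variable names are our own, the critical value 2.00 is the one quoted in the text, and the standard error is rounded to two decimals to mirror the by-hand work.

```python
import math

# Summary statistics for the 68 price differences (UCLA Bookstore minus Amazon).
n_diff, mean_diff, sd_diff = 68, 3.58, 13.42

se = round(sd_diff / math.sqrt(n_diff), 2)  # rounded as in the text
t_score = (mean_diff - 0) / se              # null value is 0
df = n_diff - 1

t_star = 2.00                               # critical value quoted for df = 67
lower = mean_diff - t_star * se
upper = mean_diff + t_star * se

print(se, round(t_score, 2), df)
print(round(lower, 2), round(upper, 2))
```

Only the differences enter the calculation; the individual UCLA Bookstore and Amazon prices are not needed once each pair has been reduced to a single number.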
GUIDED PRACTICE
We have convincing evidence that Amazon is, on average, less expensive. How should
this conclusion affect UCLA student buying habits? Should UCLA students always buy
their books on Amazon?2
A small note on the power of the paired t-test (recall the discussion of power in Section 14.4). It turns
out that the paired t-test given here is often more powerful than the independent t-test discussed in
Section 20.3. That said, depending on how the data are collected, we don’t always have a mechanism
for pairing the data and reducing the inherent variability across observations.
2 The average price difference is only mildly useful for this question. Examine the distribution shown in Figure 21.10.
There are certainly a handful of cases where Amazon prices are far below the UCLA Bookstore’s, which suggests it is
worth checking Amazon (and probably other online sites) before purchasing. However, in many cases the Amazon price
is above what the UCLA Bookstore charges, and most of the time the price isn’t that different. Ultimately, if getting
a book immediately from the bookstore is notably more convenient, e.g., to get started on reading or homework, it’s
likely a good idea to go with the UCLA Bookstore unless the price difference on a specific book happens to be quite
large. For reference, this is a very different result from what we (the authors) had seen in a similar dataset from 2010.
At that time, Amazon prices were almost uniformly lower than those of the UCLA Bookstore, and by a large margin,
making the case to use Amazon over the UCLA Bookstore quite compelling. Now we frequently check
multiple websites to find the best price.
21.4.1 Summary
Like the two independent sample procedures in Chapter 20, the paired difference analysis can be
done using a t-distribution. The randomization test applied to the paired differences is slightly dif-
ferent, however. Note that when randomizing under the paired setting, each null statistic is created
by randomly assigning the group to a numerical outcome within the individual observational unit.
The procedure for creating a confidence interval for the paired difference is almost identical to the
confidence intervals created in Chapter 19 for a single mean.
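The paired randomization described above can be sketched in Python as follows. Swapping the two group labels within a single observational unit is equivalent to flipping the sign of that unit's difference, which is what the loop does. The prices below are hypothetical values invented for illustration, not the textbook data, and the function name is also an assumption.

```python
import random

def paired_randomization_test(first, second, reps=1000, seed=0):
    """Two-sided randomization test for the mean paired difference."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(first, second)]
    observed = sum(diffs) / len(diffs)
    extreme = 0
    for _ in range(reps):
        # Randomly reassign the two labels within each pair (sign flip).
        shuffled = [d if rng.random() < 0.5 else -d for d in diffs]
        null_stat = sum(shuffled) / len(shuffled)
        if abs(null_stat) >= abs(observed):
            extreme += 1
    return observed, extreme / reps

# Hypothetical prices for eight books at two sellers (illustration only)
store_a = [27.50, 40.00, 31.75, 95.20, 54.10, 12.99, 68.40, 23.00]
store_b = [25.00, 41.50, 28.25, 90.00, 55.00, 10.49, 60.15, 22.50]

observed_mean_diff, p_value = paired_randomization_test(store_a, store_b)
print(f"observed mean difference = {observed_mean_diff:.3f}, p-value = {p_value:.3f}")
```

Note that the labels are never shuffled across books: each null statistic keeps every pair intact, which is the feature that distinguishes this procedure from the two independent sample randomization test.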
21.4.2 Terms
The terms introduced in this chapter are presented in Table 21.4. If you’re not sure what some of
these terms mean, we recommend you go back in the text and review their definitions. You should be
able to easily spot them as bolded text.
21.5 Exercises
2. True / False: paired. Determine if the following statements are true or false. If false, explain.
a. In a paired analysis we first take the difference of each pair of observations, and then we
do inference on these differences.
b. Two datasets of different sizes cannot be analyzed as paired data.
c. Consider two sets of data that are paired with each other. Each observation in one dataset
has a natural correspondence with exactly one observation from the other dataset.
d. Consider two sets of data that are paired with each other. Each observation in one dataset
is subtracted from the average of the other dataset’s observations.
3. Paired or not? I. In each of the following scenarios, determine if the data are paired.
a. Compare pre- (beginning of semester) and post-test (end of semester) scores of students.
b. Assess gender-related salary gap by comparing salaries of randomly sampled men and
women.
c. Compare artery thicknesses at the beginning of a study and after 2 years of taking Vitamin
E for the same group of patients.
d. Assess effectiveness of a diet regimen by comparing the before and after weights of subjects.
4. Paired or not? II. In each of the following scenarios, determine if the data are paired.
a. We would like to know if Intel’s stock and Southwest Airlines’ stock have similar rates of
return. To find out, we take a random sample of 50 days, and record Intel’s and Southwest’s
stock on those same days.
b. We randomly sample 50 items from Target stores and note the price for each. Then we
visit Walmart and collect the price for each of those same 50 items.
c. A school board would like to determine whether there is a difference in average SAT scores
for students at one high school versus another high school in the district. To check, they
take a simple random sample of 100 students from each high school.
5. Sample size and pairing. Determine if the following statement is true or false, and if false,
explain your reasoning: If comparing means of two groups with equal sample sizes, always use
a paired test.
6. High School and Beyond, randomization test. The National Center of Education Statistics
conducted a survey of high school seniors, collecting test data on reading, writing, and several
other subjects. Here we examine a simple random sample of 200 students from this survey.
Side-by-side box plots of reading and writing scores as well as a histogram of the differences in
scores are shown below. Also provided below is a histogram of randomized averages of paired
differences of scores (read - write), with the observed difference ($\bar{x}_{read-write} = -0.545$) marked
with a red vertical line. The randomization distribution was produced by doing the following
1000 times: for each student, the two scores were randomly assigned to either read or write, and
the average was taken across all students in the sample.3
3 The hsb2 data used in this exercise can be found in the openintro R package.
7. Global warming, randomization test. Let’s consider a limited set of climate data, exam-
ining temperature differences in 1950 vs 2022. We sampled 26 locations in the US from the
National Oceanic and Atmospheric Administration’s (NOAA) historical data, where the data
was available for both years of interest. (NOAA 2023) The data are not a random sample, but
they are selected to be a representative sample across the land area of the lower 48 United States.
Using the hottest day of the year as a measure can make the results susceptible to outliers. Instead,
to get a sense for how hot a year was, we calculate the 90th percentile; that is, we find the
maximum temperature on the day that was hotter than 90% of the days that year. We want
to know: is the 90th percentile high temperature greater in 2022 or in 1950? The difference
in 90th percentile high temperature (high temperature for 2022 - high temperature for 1950)
was calculated for each of the 26 locations. The average of the 26 differences was 2.52°F with
a standard deviation of 2.95°F. We are interested in determining whether these data provide
strong evidence that the 90th percentile high temperature is higher in 2022 than in 1950.4
c. Do these data provide convincing evidence of a difference between the 90th percentile high
temperature? Estimate the p-value from the randomization test, and conclude the hypothesis
test using words like “90th percentile high temperature in 1950” and “90th percentile
high temperature in 2022.”
8. High School and Beyond, bootstrap interval. We considered the differences between the
reading and writing scores of a random sample of 200 students who took the High School and
Beyond Survey. The mean and standard deviation of the differences are $\bar{x}_{read-write} = -0.545$
and $s_{read-write} = 8.887$ points. The bootstrap distribution below was produced by bootstrapping
from the sample of differences in reading and writing scores 1,000 times.
c. Interpret both confidence intervals using words like “population” and “score”.
d. From the confidence intervals calculated above, does it appear that there is a discernible
difference in reading and writing scores, on average?
4 The us_temperature data used in this exercise can be found in the openintro R package.
9. Global warming, bootstrap interval. We considered the change in the 90th percentile high
temperature in 1950 versus 2022 at 26 sampled locations from the NOAA database. (NOAA
2023) The mean and standard deviation of the reported differences are 2.53°F and 2.95°F.
a. Calculate a 90% bootstrap percentile confidence interval for the average difference of 90th
percentile high temperature between 1950 and 2022.
b. Calculate a 90% bootstrap SE confidence interval for the average difference of 90th percentile
high temperature between 1950 and 2022.
c. Interpret both intervals in context.
d. Do the confidence intervals provide convincing evidence that there were hotter high tem-
peratures in 2022 than in 1950 at NOAA stations? Explain your reasoning.
10. High School and Beyond, mathematical test. We considered the differences between the
reading and writing scores of a random sample of 200 students who took the High School and
Beyond Survey.
a. Create hypotheses appropriate for the following research question: is there an evident
difference in the average scores of students in the reading and writing exam?
b. Check the conditions required to complete this test.
c. The average observed difference in scores is $\bar{x}_{read-write} = -0.545$, and the standard deviation
of the differences is $s_{read-write} = 8.887$ points. Do these data provide convincing
evidence of a difference between the average scores on the two exams?
d. What type of error might we have made? Explain what the error means in the context of
the application.
e. Based on the results of this hypothesis test, would you expect a confidence interval for
the average difference between the reading and writing scores to include 0? Explain your
reasoning.
11. Global warming, mathematical test. We considered the change in the 90th percentile high
temperature in 1950 versus 2022 at 26 sampled locations from the NOAA database. (NOAA
2023) The mean and standard deviation of the reported differences are 2.53°F and 2.95°F.
15. Study environment. In order to test the effects of listening to music while studying versus
studying in silence, students agree to be randomized to two treatments (i.e., study with music
or study in silence). There are two exams during the semester, so the researchers can either
randomize the students to have one exam with music and one with silence (randomly selecting
which exam corresponds to which study environment) or the researchers can randomize the
students to one study habit for both exams.
The researchers are interested in estimating the true population difference of exam score for
those who listen to music while studying as compared to those who study in silence.
a. Describe the experiment which is consistent with a paired designed experiment. How is the
treatment assigned, and how are the data collected such that the observations are paired?
b. Describe the experiment which is consistent with an independent samples experiment. How
is the treatment assigned, and how are the data collected such that the observations are
independent?
16. Friday the 13th, traffic. In the early 1990’s, researchers in the UK collected data on traffic
flow on Friday the 13th with the goal of addressing issues of how superstitions regarding Friday
the 13th affect human behavior and whether Friday the 13th is an unlucky day. The
histograms below show the distributions of numbers of cars passing by a specific intersection on
Friday the 6th and Friday the 13th for many such date pairs. Also provided are some sample
statistics, where the difference is the number of cars on the 6th minus the number of cars on the
13th.5 (Scanlon et al. 1993)
n Mean SD
sixth 10 128,385 7,259
thirteenth 10 126,550 7,664
diff 10 1,836 1,176
a. Are there any underlying structures in these data that should be considered in an analysis?
Explain.
b. What are the hypotheses for evaluating whether the number of people out on Friday the
6th is different than the number out on Friday the 13th?
c. Check conditions to carry out the hypothesis test from part (b) using mathematical models.
d. Calculate the test statistic and the p-value.
e. What is the conclusion of the hypothesis test?
f. Interpret the p-value in this context.
g. What type of error might have been made in the conclusion of your test? Explain.
5 The friday data used in this exercise can be found in the openintro R package.
17. Friday the 13th, accidents. In the early 1990’s, researchers in the UK collected data on the
number of traffic accident related emergency room (ER) admissions on Friday the 13th with the
goal of addressing issues of how superstitions regarding Friday the 13th affect human behavior
and whether Friday the 13th is an unlucky day. The histograms below show the distributions
of numbers of ER admissions at specific emergency rooms on Friday the 6th and Friday the 13th
for many such date pairs. Also provided are some sample statistics, where the difference is the
ER admissions on the 6th minus the ER admissions on the 13th. (Scanlon et al. 1993)
n Mean SD
sixth 6 8 3
thirteenth 6 11 4
diff 6 -3 3
18. Forest management. Forest rangers wanted to better understand the rate of growth for
younger trees in the park. They took measurements of a random sample of 50 young trees in 2009
and again measured those same trees in 2019. The data below summarize their measurements,
where the heights are in feet.
Year Mean SD n
2009 12.0 3.5 50
2019 24.5 9.5 50
Difference 12.5 7.2 50
Construct a 99% confidence interval for the average growth of (what had been) younger trees in
the park over 2009-2019.
Chapter 22
Sometimes we want to compare means across many groups. We might initially think to do pairwise
comparisons. For example, if there were three groups, we might be tempted to compare the first mean
with the second, then with the third, and then finally compare the second and third means for a total
of three comparisons. However, this strategy can be treacherous. If we have many groups and do
many comparisons, it is likely that we will eventually find a difference just by chance, even if there
is no difference in the populations. Instead, we should apply a holistic test to check whether there is
evidence that at least one pair of groups are in fact different, which is where ANOVA saves the day.
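To see how quickly the pairwise strategy becomes treacherous, the sketch below counts the comparisons needed for k groups and the chance of at least one false positive. It rests on a simplifying assumption that is not made in the text: the pairwise tests are treated as independent, each run at the 0.05 level. Real pairwise tests share data and are not independent, so the numbers are only a rough guide. The function names are illustrative.

```python
def pairwise_comparisons(k):
    """Number of distinct pairs among k groups."""
    return k * (k - 1) // 2

def chance_of_false_positive(k, alpha=0.05):
    """P(at least one false positive), ASSUMING independent tests at level alpha."""
    m = pairwise_comparisons(k)
    return 1 - (1 - alpha) ** m

for k in (3, 5, 10):
    m = pairwise_comparisons(k)
    risk = chance_of_false_positive(k)
    print(f"{k} groups -> {m} comparisons -> chance of a false positive about {risk:.2f}")
```

The count of comparisons grows roughly with the square of the number of groups, which is why a single holistic test is preferred when there are many groups.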
In this section, we will learn a new method called analysis of variance (ANOVA) and a new test
statistic called an 𝐹 -statistic (which we will introduce in our discussion of mathematical models).
ANOVA uses a single hypothesis test to check whether the means across many groups are equal:
• 𝐻0 ∶ The mean outcome is the same across all groups, i.e., 𝜇1 = 𝜇2 = ⋯ = 𝜇𝑘 where 𝜇𝑗 represents
the mean of the outcome for observations in category 𝑗.
• 𝐻𝐴 ∶ At least one mean is different.
Generally we must check three conditions on the data before performing ANOVA:
• the observations are independent within and between groups,
• the responses within each group are nearly normal, and
• the variability across the groups is about equal.
When the three technical conditions are met, we may perform an ANOVA to determine whether the
data provide convincing evidence against the null hypothesis that all the 𝜇𝑗 are equal.
EXAMPLE
College departments commonly run multiple sections of the same introductory course each
semester because of high demand. Consider a statistics department that runs three sections
of an introductory statistics course. We might like to determine whether there are substantial
differences in first exam scores in these three classes (Section A, Section B, and Section C).
Describe appropriate hypotheses to determine whether there are any differences between the
three classes.
Strong evidence favoring the alternative hypothesis in ANOVA is described by unusually large differ-
ences among the group means. We will soon learn that assessing the variability of the group means
relative to the variability among individual observations within each group is key to ANOVA’s success.
EXAMPLE
Examine Figure 22.1. Compare groups I, II, and III. Can you visually determine if the differences
in the group centers are unlikely to have occurred if there were no differences in the
groups? Now compare groups IV, V, and VI. Do these differences appear to be unlikely to
have occurred if there were no differences in the groups?
Any real difference in the means of groups I, II, and III is difficult to discern, because the
data within each group are very volatile relative to any differences in the average outcome.
On the other hand, it appears there are differences in the centers of groups IV, V, and VI.
For instance, group V appears to have a higher mean than that of the other two groups.
Investigating groups IV, V, and VI, we see the differences in the groups’ centers are noticeable
because those differences are large relative to the variability in the individual observations
within each group.
Figure 22.1: Side-by-side dot plot for the outcomes for six groups. Two sets of groups; first set is comprised
of Groups I, II, and III, the second set is comprised of Groups IV, V, and VI.
380 CHAPTER 22. INFERENCE FOR COMPARING MANY MEANS
We would like to discern whether there are real differences between the batting performance of baseball
players according to their position: outfielder (OF), infielder (IF), and catcher (C). We will use a
dataset called mlb_players_18, which includes batting records of 429 Major League Baseball (MLB)
players from the 2018 season who had at least 100 at bats. Six of the 429 cases represented in
mlb_players_18 are shown in Table 22.1, and descriptions for each variable are provided in Table 22.2. The
measure we will use for the player batting performance (the outcome variable) is on-base percentage
(OBP). The on-base percentage roughly represents the fraction of the time a player successfully gets
on base or hits a home run.
Table 22.1: Six cases and some of the variables from the mlb_players_18 data frame.
Table 22.2: Variables and their descriptions for the mlb_players_18 dataset.
Variable Description
name Player name
team The abbreviated name of the player’s team
position The player’s primary field position (OF, IF, C)
AB Number of opportunities at bat
H Number of hits
HR Number of home runs
RBI Number of runs batted in
AVG Batting average, which is equal to H/AB
OBP On-base percentage, which is roughly equal to the fraction of times
a player gets on base or hits a home run
GUIDED PRACTICE
The null hypothesis under consideration is the following: $\mu_{OF} = \mu_{IF} = \mu_C$. Write the
null and corresponding alternative hypotheses in plain language.1
EXAMPLE
The player positions have been divided into three groups: outfield (OF), infield (IF), and
catcher (C). What would be an appropriate point estimate of the on-base percentage by
outfielders, $\mu_{OF}$?
A good estimate of the on-base percentage by outfielders would be the sample average of OBP
for just those players whose position is outfield: $\bar{x}_{OF} = 0.320$.
1 $H_0$: The average on-base percentage is equal across the three positions. $H_A$: The average on-base percentage varies
across some (or all) groups.
22.2. RANDOMIZATION TEST FOR COMPARING MANY MEANS 381
Table 22.3 provides summary statistics for each group. Side-by-side box plots for the on-base
percentage are shown in Figure 22.2. Notice that the variability appears to be approximately constant
across groups; nearly constant variance across groups is an important assumption that must be satisfied
before we consider the ANOVA approach.
Position n Mean SD
OF 160 0.320 0.043
IF 205 0.318 0.038
C 64 0.302 0.038
Figure 22.2: Side-by-side box plots of the on-base percentage for 429 players across three groups. There are
a few potential outliers, but with large numbers of observations in each group, the outliers are not extreme
enough to have an impact on the calculations, so they are not a concern for moving forward with the analysis.
EXAMPLE
The largest difference between the sample means is between the catcher and the outfielder
positions. Consider again the original hypotheses:
• 𝐻0 ∶ 𝜇𝑂𝐹 = 𝜇𝐼𝐹 = 𝜇𝐶
• 𝐻𝐴 ∶ The average on-base percentage (𝜇𝑗 ) varies across some (or all) groups.
Why might it be inappropriate to run the test by simply estimating whether the difference of
𝜇𝐶 and 𝜇𝑂𝐹 is “statistically discernible” at a 0.05 discernibility level?
The primary issue here is that we are inspecting the data before picking the groups that will be
compared. It is inappropriate to examine all data by eye (informal testing) and only afterwards
decide which parts to formally test. This is called data snooping or data fishing. Naturally,
we would pick the groups with the large differences for the formal test, and this would lead to
an inflation in the Type I error rate. To understand inflated Type I error rates better, let’s
consider a slightly different problem.
Suppose we are to measure the aptitude for students in 20 classes in a large elementary school
at the beginning of the year. In this school, all students are randomly assigned to classrooms,
so any differences we observe between the classes at the start of the year are completely due to
chance. However, with so many groups, we will probably observe a few groups that look rather
different from each other. If we select only the classes that look different and then perform a
formal test, we will probably make the wrong conclusion that the assignment wasn’t random.
While we might only formally test differences for a few pairs of classes, we informally evaluated
the other classes by eye before choosing the most extreme cases for a comparison.
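The classroom thought experiment can be simulated. The sketch below rests on several assumptions that are not in the text: 25 students per class, scores drawn from a Normal(0, 1) distribution for every class (so the null hypothesis is true by construction), and a simple z-test with known standard deviation at the 0.05 level. It compares a pair of classes chosen before seeing the data with the most extreme pair chosen after looking.

```python
import random

def snooping_simulation(n_classes=20, class_size=25, reps=1000, seed=1):
    """Estimate Type I error rates when all classes come from the same population."""
    rng = random.Random(seed)
    se_diff = (2 / class_size) ** 0.5  # SE of the difference between two class means
    planned_rejections = 0
    snooped_rejections = 0
    for _ in range(reps):
        means = [
            sum(rng.gauss(0, 1) for _ in range(class_size)) / class_size
            for _ in range(n_classes)
        ]
        # Planned comparison: classes 0 and 1 are chosen before seeing any data.
        if abs(means[0] - means[1]) / se_diff > 1.96:
            planned_rejections += 1
        # Snooped comparison: the two most different classes are picked by eye.
        if (max(means) - min(means)) / se_diff > 1.96:
            snooped_rejections += 1
    return planned_rejections / reps, snooped_rejections / reps

planned_rate, snooped_rate = snooping_simulation()
print(f"planned pair: {planned_rate:.3f}, most extreme pair: {snooped_rate:.3f}")
```

Under these assumptions the planned comparison should reject close to 5% of the time, while the snooped comparison rejects far more often even though no class truly differs from any other.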
$$F = \frac{MSG}{MSE}$$
The $MSG$ represents a measure of the between-group variability, and $MSE$ measures the variability
within each of the groups.
The F statistic is a ratio of how the groups differ (MSG) as compared to how the
observations within a group vary (MSE).
$$F = \frac{MSG}{MSE}$$
When the null hypothesis is true and the conditions are met, F has an F-distribution
with 𝑑𝑓1 = 𝑘 − 1 and 𝑑𝑓2 = 𝑛 − 𝑘.
Conditions:
2 Let $\bar{x}$ represent the mean of outcomes across all groups. Then the mean square between groups is computed as
$$MSG = \frac{1}{df_G} SSG = \frac{1}{k-1} \sum_{j=1}^{k} n_j (\bar{x}_j - \bar{x})^2$$
where $SSG$ is called the sum of squares between groups and $n_j$ is the sample size of group $j$.
3 See additional details on ANOVA calculations for interested readers. Let $\bar{x}$ represent the mean of outcomes across
all groups. Then the sum of squares total ($SST$) is computed as
$$SST = \sum_{i=1}^{n} (x_i - \bar{x})^2$$
where the sum is over all observations in the dataset. Then we compute the sum of squared errors ($SSE$) in one
of two equivalent ways:
$$SSE = SST - SSG = (n_1 - 1)s_1^2 + (n_2 - 1)s_2^2 + \cdots + (n_k - 1)s_k^2$$
where $s_j^2$ is the sample variance (square of the standard deviation) of the residuals in group $j$. Then the $MSE$ is the
standardized form of $SSE$: $MSE = \frac{1}{df_E} SSE$.
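The formulas for $MSG$ and $MSE$ can be applied directly to group summary statistics. The following Python sketch (the function name is an assumption, not something defined in the text) uses the group sizes, means, and standard deviations from Table 22.3. Because the table reports rounded values, the results differ slightly from the mean squares the chapter reports from the full dataset.

```python
def anova_from_summary(sizes, means, sds):
    """Compute MSG, MSE, and F from per-group sample sizes, means, and SDs."""
    k = len(sizes)
    n = sum(sizes)
    grand_mean = sum(nj * xj for nj, xj in zip(sizes, means)) / n
    # Sum of squares between groups: weighted squared deviations of group means
    ssg = sum(nj * (xj - grand_mean) ** 2 for nj, xj in zip(sizes, means))
    # Sum of squared errors: pooled within-group variability
    sse = sum((nj - 1) * sj ** 2 for nj, sj in zip(sizes, sds))
    msg = ssg / (k - 1)  # df_G = k - 1
    mse = sse / (n - k)  # df_E = n - k
    return msg, mse, msg / mse

# Rounded summary statistics from Table 22.3 (OF, IF, C)
msg, mse, f_stat = anova_from_summary(
    sizes=[160, 205, 64],
    means=[0.320, 0.318, 0.302],
    sds=[0.043, 0.038, 0.038],
)
print(f"MSG = {msg:.5f}, MSE = {mse:.5f}, F = {f_stat:.2f}")
```

The second form of $SSE$ is used here because it needs only the group standard deviations, not the individual observations.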
Figure 22.3: Exam scores for students given one of three different exams.
Figure 22.4 shows the process of randomizing the three different exams to the observed exam scores.
If the null hypothesis is true, then the score on each exam should represent the true student ability
on that material. It shouldn’t matter whether they were given exam A or exam B or exam C. By
reallocating which student got which exam, we are able to understand how the difference in average
exam scores changes due only to natural variability. There is only one iteration of the randomization
process in Figure 22.4, leading to three different randomized sample means (computed assuming the
null hypothesis is true).
Figure 22.4: The version of the test (A or B or C) is randomly allocated to the test scores, under the null
assumption that the tests are equally difficult.
In the two-sample case, the null hypothesis was investigated using the difference in the sample means.
However, as noted above, with three groups (three different exams), the comparison of the three sample
means gets slightly more complicated. We have already derived the F-statistic which is exactly the
way to compare the averages across three or more groups! Recall, the F statistic is a ratio of how the
groups differ (MSG) as compared to how the observations within a group vary (MSE).
Building on Figure 22.4, Figure 22.5 shows the values of the simulated 𝐹 statistics over 1,000 random
simulations. We see that, just by chance, the F statistic can be as large as 7.
Figure 22.5: Histogram of F statistics calculated from 1,000 different randomizations of the exam type.
22.3. MATHEMATICAL MODEL FOR TEST FOR COMPARING MANY MEANS 385
Figure 22.6: Histogram of F statistics calculated from 1000 different randomizations of the exam type. The
observed F statistic is given as a red vertical line at 3.48. The area to the right is more extreme than the
observed value and represents the p-value.
Using statistical software, we can calculate that 3.6% of the randomized F test statistics were at or
above the observed test statistic of 𝐹 = 3.48. That is, the p-value of the test is 0.036. Assuming
that we had set the level of discernibility to be 𝛼 = 0.05, the p-value is smaller than the level of
discernibility which would lead us to reject the null hypothesis. We claim that the difficulty level (i.e.,
the true average score, 𝜇) is different for at least one of the exams.
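The randomization procedure for the F statistic can be sketched in Python as follows. The exam scores below are hypothetical values made up for illustration (the textbook's exam data are not reproduced here), and the function names are assumptions. Each iteration shuffles the pooled scores, deals them back into groups of the original sizes, and recomputes F; the p-value is the proportion of randomized F statistics at or above the observed one.

```python
import random

def f_statistic(groups):
    """F = MSG / MSE computed from raw observations in each group."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    ssg = 0.0
    sse = 0.0
    for g in groups:
        group_mean = sum(g) / len(g)
        ssg += len(g) * (group_mean - grand_mean) ** 2
        sse += sum((x - group_mean) ** 2 for x in g)
    return (ssg / (k - 1)) / (sse / (n - k))

def randomization_f_test(groups, reps=1000, seed=0):
    """Randomization test: reallocate group labels and recompute F each time."""
    rng = random.Random(seed)
    observed = f_statistic(groups)
    pooled = [x for g in groups for x in g]
    sizes = [len(g) for g in groups]
    extreme = 0
    for _ in range(reps):
        rng.shuffle(pooled)
        shuffled, start = [], 0
        for size in sizes:
            shuffled.append(pooled[start:start + size])
            start += size
        if f_statistic(shuffled) >= observed:
            extreme += 1
    return observed, extreme / reps

# Hypothetical exam scores for three versions of an exam (illustration only)
exam_a = [78, 85, 69, 91, 74, 82]
exam_b = [72, 80, 77, 88, 69, 75]
exam_c = [65, 70, 58, 74, 62, 68]

observed_f, p_value = randomization_f_test([exam_a, exam_b, exam_c])
print(f"observed F = {observed_f:.2f}, randomization p-value = {p_value:.3f}")
```

Only the upper tail is counted because large values of F, not small ones, indicate group means that differ by more than natural variability would explain.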
While it is tempting to say that exam C is harder than the other two (given the inability to differentiate
between exam A and exam B in Section 20.1), we must be very careful about conclusions made using
different techniques on the same data.
When the null hypothesis is true, random variability that exists in nature sometimes produces data
with p-values less than 0.05. How often does that happen? 5% of the time. That is to say, if you
use 20 different models applied to the same data where there is no signal (i.e., the null hypothesis is
true), you are reasonably likely to get a p-value less than 0.05 in one of the tests you run. The
details surrounding the ideas of this problem, called a multiple comparisons test or multiple
comparisons problem, are outside the scope of this textbook, but should be something that you
keep in the back of your head. To best mitigate any extra Type I errors, we suggest that you set up
your hypotheses and testing protocol before running any analyses. Once the conclusions have been
reached, you should report your findings instead of running a different type of test on the same data.
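As a rough illustration of "reasonably likely," suppose (as a simplifying assumption, since tests run on the same data are generally not independent) that the 20 tests behave independently and each has a 5% chance of producing a p-value below 0.05 when the null hypothesis is true. Then

$$P(\text{at least one p-value} < 0.05) = 1 - (1 - 0.05)^{20} = 1 - 0.95^{20} \approx 0.64$$

so a "discovery" somewhere among the 20 tests would be more likely than not, even with no signal in the data.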
As seen with many of the tests and statistics from previous sections, the randomization test on the F
statistic has corresponding mathematical theory to describe the distribution that can be used without
using a computational approach.
We return to the baseball example from Table 22.3 to demonstrate the mathematical model applied
to the ANOVA setting.
Analysis of variance (ANOVA) is used to test whether the mean outcome differs across
two or more groups. ANOVA uses a test statistic, the 𝐹 -statistic, which represents a
standardized ratio of variability in the sample means relative to the variability within
the groups. If 𝐻0 is true and the model conditions are satisfied, an 𝐹 -statistic follows
an 𝐹 distribution with parameters 𝑑𝑓1 = 𝑘 − 1 and 𝑑𝑓2 = 𝑛 − 𝑘. The upper tail of the
𝐹 distribution is used to represent the p-value.
GUIDED PRACTICE
For the baseball data, $MSG = 0.00803$ and $MSE = 0.00158$. Identify the degrees of
freedom associated with MSG and MSE and verify the $F$-statistic is approximately
5.077.4
EXAMPLE
The p-value corresponding to the shaded area in Figure 22.7 is equal to about 0.0066. Does
this provide strong evidence against the null hypothesis?
The p-value is smaller than 0.05, indicating the evidence is strong enough to reject the null
hypothesis at a discernibility level of 0.05. That is, the data provide strong evidence that the
average on-base percentage varies by player’s primary field position.
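The degrees of freedom, the F statistic, and the p-value for the baseball data can be checked with a short Python sketch. It relies on a special case: when $df_1 = 2$, the upper tail area of the F distribution has the closed form $(1 + 2F/df_2)^{-df_2/2}$. This shortcut applies only because there are three groups here; with more groups, statistical software is needed. The function name is illustrative.

```python
def f_upper_tail_df1_equals_2(f_stat, df2):
    """Upper-tail area of the F-distribution; this closed form is valid ONLY for df1 = 2."""
    return (1 + 2 * f_stat / df2) ** (-df2 / 2)

k, n = 3, 429                # three positions, 429 players
df1, df2 = k - 1, n - k      # degrees of freedom for MSG and MSE

# Ratio of the rounded mean squares quoted in the text
f_from_rounded = 0.00803 / 0.00158

# p-value using the F statistic reported in the text
p_value = f_upper_tail_df1_equals_2(5.077, df2)
print(f"df1 = {df1}, df2 = {df2}, F = {f_from_rounded:.3f}, p-value = {p_value:.4f}")
```

The ratio of the rounded mean squares lands slightly above 5.077 because $MSG$ and $MSE$ were rounded before dividing.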
Note that the small p-value indicates that there is a notable difference between the mean on-base
percentages of the different positions. However, the ANOVA test does not provide a mechanism for
knowing which group is driving the differences. If we move forward with all possible two mean
comparisons, we run the risk of a high Type I error rate. As we saw at the end of Section 22.2,
the follow-up question of which individual groups differ is called a problem of multiple
comparisons and is outside the scope of this text. We encourage you to learn more about multiple
comparisons, however, so that additional comparisons, after you have rejected the null hypothesis in
an ANOVA test, do not lead to undue false positive conclusions.
Table 22.5: ANOVA summary for testing whether the average on-base percentage differs across player
positions.
22.4.1 Summary
In this chapter we have provided both the randomization test and the mathematical model appropriate
for addressing questions of equality of means across two or more groups. Note that there were
important technical conditions required for confirming that the F distribution appropriately modeled
the ANOVA test statistic. Also, you may have noticed that there was no discussion of creating
confidence intervals. That is because the ANOVA statistic does not have a direct analogue parameter
to estimate. If there is interest in comparisons of mean differences (across each set of two groups),
then the methods from Chapter 20 comparing two independent means should be applied.
22.4.2 Terms
The terms introduced in this chapter are presented in Table 22.6. If you’re not sure what some of
these terms mean, we recommend you go back in the text and review their definitions. You should be
able to easily spot them as bolded text.
22.5 Exercises
2. Which test? We would like to test if students who are in the social sciences, natural sciences,
arts and humanities, and other fields spend the same amount of time, on average, studying for
a course. What type of test should we use? Explain your reasoning.
3. Cuckoo bird egg lengths, randomize once. Cuckoo birds lay their eggs in other birds’ nests,
making them known as brood parasites. One question relates to whether the size of the cuckoo
egg differs depending on the species of the host bird.5 (Latter 1902) Consider the following
plots: one represents the original data, and the other represents data where the host species has
been randomly assigned to the egg length.
a. Consider the average length of the eggs for each species. Is the average length for the
original data: more variable, less variable, or about the same as the randomized species?
Describe what you see in the plots.
b. Consider the standard deviation of the lengths of the eggs within each species. Is the within
species standard deviation of the length for the original data: bigger, smaller, or about the
same as the randomized species?
c. Recall that the F statistic’s numerator measures how much the groups vary (MSG), and the
denominator measures how much the within species values vary (MSE). Which of the plots
above would have a larger F statistic, the original data or the randomized data? Explain.
5 The Cuckoo data used in this exercise can be found in the Stat2Data R package.
4. Cuckoo bird egg lengths, randomization test. Cuckoo birds lay their eggs in other birds’
nests, making them known as brood parasites. One question relates to whether the size of
the cuckoo egg differs depending on the species of the host bird.6 (Latter 1902) Using the
randomization distribution of the F statistic (host species randomized to egg length), conduct
a hypothesis test to evaluate if there is a difference, in the population, between the average egg
lengths for different host bird species. Make sure to state your hypotheses clearly and interpret
your results in context of the data.
5. Chicken diet and weight, many groups. An experiment was conducted to measure and
compare the effectiveness of various feed supplements on the growth rate of chickens. Newly
hatched chicks were randomly allocated into six groups, and each group was given a different
feed supplement. Sample statistics and a visualization of the observed data are shown below.
(McNeil 1977)
Using the ANOVA output below, conduct a hypothesis test to determine if these data provide
convincing evidence that the average weight of chicks varies across some (or all) groups. Make
sure to check relevant conditions.
6. Teaching descriptive statistics. A study compared five different methods for teaching de-
scriptive statistics. The five methods were traditional lecture and discussion, programmed text-
book instruction, programmed text with lectures, computer instruction, and computer instruc-
tion with lectures. 45 students were randomly assigned, 9 to each method. After completing
the course, students took a 1-hour exam.
a. What are the hypotheses for evaluating if the average test scores are different for the
different teaching methods?
b. What are the degrees of freedom associated with the 𝐹 -test for evaluating these hypotheses?
c. Suppose the p-value for this test is 0.0168. What is the conclusion?
6 The Cuckoo data used in this exercise can be found in the Stat2Data R package.
7. Coffee, depression, and physical activity. Caffeine is the world’s most widely used stimu-
lant, with approximately 80% consumed in the form of coffee. Participants in a study investigat-
ing the relationship between coffee consumption and exercise were asked to report the number
of hours they spent per week on moderate (e.g., brisk walking) and vigorous (e.g., strenuous
sports and jogging) exercise. Based on these data the researchers estimated the total hours
of metabolic equivalent tasks (MET) per week, a value always greater than 0. The table be-
low gives summary statistics of MET for women in this study based on the amount of coffee
consumed. (Lucas et al. 2011)
a. Write the hypotheses for evaluating if the average physical activity level varies among the
different levels of coffee consumption.
b. Check conditions and describe any assumptions you must make to proceed with the test.
c. Below is the output associated with this test. What is the conclusion of the test?
8. Student performance across discussion sections. A professor who teaches a large intro-
ductory statistics class (197 students) with eight discussion sections would like to test if student
performance differs by discussion section, where each discussion section has a different teaching
assistant. The summary table below shows the average final exam score for each discussion
section as well as the standard deviation of scores and the number of students in each section.
The ANOVA output below can be used to test for differences between the average scores from
the different discussion sections.
Conduct a hypothesis test to determine if these data provide convincing evidence that the average
score varies across some (or all) groups. Check conditions and describe any assumptions you
must make to proceed with the test.
9. GPA and major. Undergraduate students in an introductory statistics course at Duke Uni-
versity conducted a survey about GPA and major. The density plots show the distributions of
GPA among three groups of majors. ANOVA output is also provided.
a. Write the hypotheses for testing for a difference between average GPA across majors.
b. What is the conclusion of the hypothesis test?
c. How many students answered the questions on the survey, i.e., what is the sample size?
10. Work hours and education. The General Social Survey collects data on demographics,
education, and work, among many other characteristics of US residents. (NORC 2010) Using
ANOVA, we can consider educational attainment levels for all 1,172 respondents at once. Below
are the distributions of hours worked by educational attainment and relevant summary statistics
that will be helpful in carrying out this analysis.
Educational attainment    Mean    SD      n
Lt High School            38.7    15.8    121
High School               39.6    15.0    546
Junior College            41.4    18.1    97
Bachelor                  42.5    13.6    253
Graduate                  40.8    15.5    155
a. Write hypotheses for evaluating whether the average number of hours worked varies across
the five groups.
b. Check conditions and describe any assumptions you must make to proceed with the test.
c. Below is the output associated with this test. What is the conclusion of the test?
11. True / False: ANOVA, I. Determine if the following statements are true or false in ANOVA,
and explain your reasoning for statements you identify as false.
a. As the number of groups increases, the modified discernibility level for pairwise tests in-
creases as well.
b. As the total sample size increases, the degrees of freedom for the residuals increases as well.
c. The constant variance condition can be somewhat relaxed when the sample sizes are rela-
tively consistent across groups.
d. The independence assumption can be relaxed when the total sample size is large.
12. True / False: ANOVA, II. Determine if the following statements are true or false, and explain
your reasoning for statements you identify as false.
If the null hypothesis that the means of four groups are all the same is rejected using ANOVA
at a 5% discernibility level, then…
a. we can then conclude that all the means are different from one another.
b. the standardized variability between groups is higher than the standardized variability
within groups.
c. the pairwise analysis will identify at least one pair of means that are discernibly different.
d. the appropriate 𝛼 to be used in pairwise comparisons is 0.05 / 4 = 0.0125 since there are
four groups.
13. Matching observed data with randomized F statistics. Consider the following two
datasets. The response variable is the score and the explanatory variable is whether the indi-
vidual is in one of four groups.
14. Child care hours. The China Health and Nutrition Survey aims to examine the effects of the
health, nutrition, and family planning policies and programs implemented by national and local
governments. (UNC Carolina Population Center 2006) It, for example, collects information
on number of hours Chinese parents spend taking care of their children under age 6. The
side-by-side box plots below show the distribution of this variable by educational attainment
of the parent. Also provided below is the ANOVA output for comparing average hours across
educational attainment categories.
a. Write the hypotheses for testing for a difference between the average number of hours spent
on child care across educational attainment levels.
b. What is the conclusion of the hypothesis test?
Chapter 23
Applications: Infer
23.1 Recap: Computational methods
The computational methods we have presented are used in two settings. First, in many real life appli-
cations (as in those covered here), the mathematical model and computational model give identical
conclusions. When there are no differences in conclusions, the advantage of the computational method
is that it gives the analyst a good sense for the logic of the statistical inference process. Second, when
there is a difference in the conclusions (seen primarily in methods beyond the scope of this text), it
is often the case that the computational method relies on fewer technical conditions and is therefore
more appropriate to use.
23.1.1 Randomization
An important feature of randomization tests is that the data are permuted in such a way that the null
hypothesis is true. The randomization distribution provides a distribution of the statistic of interest
under the null hypothesis, which is exactly the information needed to calculate a p-value — where the
p-value is the probability of obtaining the observed data or more extreme when the null hypothesis is
true. Although there are ways to adjust the randomization for settings other than the null hypothesis
being true, they are not covered in this book and they are not used widely. In approaching research
questions with a randomization test, be sure to ask yourself what the null hypothesis represents and
how it is that permuting the data is creating different possible null data representations.
Hypothesis tests. When using a randomization test, we proceed as follows:
• Write appropriate hypotheses.
• Compute the observed statistic of interest.
• Permute the data repeatedly, recalculating the statistic of interest each time.
• Compute the proportion of times the permuted statistics are as extreme as or more extreme
than the observed statistic; this proportion is the p-value.
• Make a conclusion based on the p-value, and write the conclusion in context and in plain language
so anyone can understand the result.
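The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in this book: the function name `randomization_test`, the number of repetitions, and the two small data vectors are all made up for the example, and the statistic of interest is assumed to be a difference in two group means.

```python
import random
from statistics import mean


def randomization_test(group_a, group_b, reps=2000, seed=0):
    """Two-sided randomization test for a difference in means (illustrative sketch)."""
    rng = random.Random(seed)
    # Observed statistic of interest.
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(reps):
        # Permuting the pooled values breaks any link between group and response,
        # which is what makes the null hypothesis true in the shuffled data.
        rng.shuffle(pooled)
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Proportion of permuted statistics at least as extreme as the observed one.
    return observed, extreme / reps


# Hypothetical data, invented for this example only.
a = [12.1, 14.3, 13.8, 15.0, 14.1, 13.2]
b = [10.2, 11.5, 10.9, 12.0, 11.1, 10.4]
obs, p = randomization_test(a, b)
print(f"observed difference = {obs:.2f}, p-value = {p:.3f}")
```

The p-value returned is an approximation that depends on the number of shuffles; more repetitions give a more stable estimate.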
23.1.2 Bootstrapping
Bootstrapping, in contrast to randomization tests, represents a proxy sampling of the original pop-
ulation. With bootstrapping, the analyst is not forcing the null hypothesis to be true (or false, for
that matter), but instead, they are replicating the variability seen in taking repeated samples from a
population. Because there is no underlying true (or false) null hypothesis, bootstrapping is typically
used for creating confidence intervals for the parameter of interest. Bootstrapping can be used to test
particular values of a parameter (e.g., by evaluating whether a particular value of interest is contained
in the confidence interval), but generally, bootstrapping is used for interval estimation instead of
testing.
Confidence intervals. The following is how we generally computed a confidence interval using
bootstrapping:
• Repeatedly resample the original data, with replacement, using the same sample size as the
original data.
• For each resample, calculate the statistic of interest.
• Calculate the confidence interval using one of the following methods:
– Bootstrap percentile interval: Obtain the endpoints representing the middle (e.g., 95%) of
the bootstrapped statistics. The endpoints will be the confidence interval.
– Bootstrap standard error (SE) interval: Find the SE of the bootstrapped statistics. The
confidence interval will be given by the original observed statistic plus or minus some
multiple (e.g., 2) of SEs.
• Put the conclusions in context and in plain language so that even readers who are not
statisticians or data scientists can understand the results.
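As a rough sketch of the two interval methods listed above, the following Python code resamples a small dataset with replacement and reports both a percentile interval and an SE interval. The function name `bootstrap_intervals`, the multiplier of 2, the number of resamples, and the data values are assumptions made for illustration.

```python
import random
from statistics import mean, stdev, quantiles


def bootstrap_intervals(data, stat=mean, reps=2000, seed=0):
    """Return (percentile interval, SE interval) for a statistic (illustrative sketch)."""
    rng = random.Random(seed)
    n = len(data)
    # Resample with replacement, using the same sample size as the original data.
    boot_stats = [stat(rng.choices(data, k=n)) for _ in range(reps)]
    # Percentile interval: with n=40, the first and last cut points are the
    # 2.5th and 97.5th percentiles, i.e., the middle 95% of bootstrapped statistics.
    cuts = quantiles(boot_stats, n=40)
    percentile_ci = (cuts[0], cuts[-1])
    # SE interval: observed statistic plus or minus 2 bootstrap standard errors.
    se = stdev(boot_stats)
    observed = stat(data)
    se_ci = (observed - 2 * se, observed + 2 * se)
    return percentile_ci, se_ci


# Hypothetical percentages, invented for this example only.
data = [0, 10, 20, 20, 30, 30, 40, 50, 60, 80, 90, 100]
pct_ci, se_ci = bootstrap_intervals(data)
print("percentile interval:", pct_ci)
print("SE interval:", se_ci)
```

When the bootstrap distribution is roughly symmetric and bell-shaped, the two intervals tend to be similar; they can differ noticeably when it is skewed.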
23.2 Recap: Mathematical models
The mathematical models which have been used to produce inferential analyses follow a consistent
framework for different parameters of interest. As a way to contrast and compare the mathematical
approach, we offer the following summaries in Table 23.1 and Table 23.2.
23.2.1 z-procedures
Generally, when the response variable is categorical (or binary), the summary statistic is a proportion
and the model used to describe the proportion is the standard normal curve (also referred to as
a 𝑧-curve or a 𝑧-distribution). We provide Table 23.1 partly as a mechanism for understanding 𝑧-
procedures and partly to highlight the extremely common usage of the 𝑧-distribution in practice.
Table 23.1: Similarities of z-methods across one sample and two independent samples analysis of a binary
response variable. 𝑝 represents the population proportion, 𝑝̂ represents the sample proportion, 𝑝0 represents
the null hypothesized proportion, 𝑝̂𝑝𝑜𝑜𝑙 represents the pooled proportion, and 𝑛 represents the sample size.
The subscripts of 1 and 2 indicate that the values are measured separately for samples 1 and 2.
Hypothesis tests. When applying the 𝑧-distribution for a hypothesis test, we proceed as follows:
• Write appropriate hypotheses.
• Verify conditions for using the 𝑧-distribution.
– One-sample: the observations (or differences) must be independent. The success-failure
condition of at least 10 successes and at least 10 failures should hold.
– For a difference of proportions: each sample must separately satisfy the success-failure
conditions, and the data in the groups must also be independent.
• Compute the point estimate of interest and the standard error.
• Compute the Z score and p-value.
• Make a conclusion based on the p-value, and write a conclusion in context and in plain language
so anyone can understand the result.
Confidence intervals. Similarly, the following is how we generally computed a confidence interval
using a 𝑧-distribution:
• Verify conditions for using the 𝑧-distribution. (See above.)
• Compute the point estimate of interest, the standard error, and 𝑧⋆ .
• Calculate the confidence interval using the general formula:
point estimate ± 𝑧⋆ × 𝑆𝐸.
• Put the conclusions in context and in plain language so that even readers who are not
statisticians or data scientists can understand the results.
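To make the 𝑧-procedure steps concrete, here is a minimal Python sketch for a single proportion. It assumes the usual convention that the standard error for the test is computed with 𝑝0 and the standard error for the interval is computed with 𝑝̂; the function name `one_proportion_z` and the counts in the example are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def one_proportion_z(successes, n, p0, conf_level=0.95):
    """z-test and z-interval for one proportion (illustrative sketch)."""
    # Success-failure condition, checked with the null value for the test.
    if n * p0 < 10 or n * (1 - p0) < 10:
        raise ValueError("success-failure condition not met")
    p_hat = successes / n
    # Hypothesis test: standard error uses the null value p0.
    se_null = sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / se_null
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided
    # Confidence interval: standard error uses the sample proportion.
    z_star = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    se_hat = sqrt(p_hat * (1 - p_hat) / n)
    ci = (p_hat - z_star * se_hat, p_hat + z_star * se_hat)
    return z, p_value, ci


# Hypothetical counts: 60 successes in 100 independent trials, testing p0 = 0.5.
z, p_value, ci = one_proportion_z(60, 100, 0.5)
print(f"Z = {z:.2f}, p-value = {p_value:.4f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

The independence condition cannot be checked by code; it has to be judged from how the data were collected.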
23.2.2 t-procedures
With quantitative response variables, the 𝑡-distribution was applied as the appropriate mathematical
model in three distinct settings. Although the three data structures are different, their similarities and
differences are worth pointing out. We provide Table 23.2 partly as a mechanism for understanding
𝑡-procedures and partly to highlight the extremely common usage of the 𝑡-distribution in practice.
Table 23.2: Similarities of 𝑡-methods across one sample, paired sample, and two independent samples analysis
of a numeric response variable. 𝜇 represents the population mean, 𝑥̄ represents the sample mean, 𝑠 represents
the standard deviation, and 𝑛 represents the sample size. The subscript of 𝑑𝑖𝑓𝑓 indicates that the values are
measured on the paired differences. The subscripts of 1 and 2 indicate that the values are measured separately
on sample 1 and sample 2.
Hypothesis tests. When applying the 𝑡-distribution for a hypothesis test, we proceed as follows:
• Write appropriate hypotheses.
• Verify conditions for using the 𝑡-distribution.
– One-sample or differences from paired data: the observations (or differences) must be
independent and nearly normal. For larger sample sizes, we can relax the nearly normal
requirement, e.g., slight skew is okay for sample sizes of 15, moderate skew for sample sizes
of 30, and strong skew for sample sizes of 60.
– For a difference of means when the data are not paired: each sample mean must separately
satisfy the one-sample conditions for the 𝑡-distribution, and the data in the groups must
also be independent.
• Compute the point estimate of interest, the standard error, and the degrees of freedom. For 𝑑𝑓,
use 𝑛 − 1 for one sample, and for two samples use either statistical software or the smaller of
𝑛1 − 1 and 𝑛2 − 1.
• Compute the T score and p-value.
• Make a conclusion based on the p-value, and write a conclusion in context and in plain language
so anyone can understand the result.
Confidence intervals. Similarly, the following is how we generally computed a confidence interval
using a 𝑡-distribution:
• Verify conditions for using the 𝑡-distribution. (See above.)
• Compute the point estimate of interest, the standard error, the degrees of freedom, and 𝑡⋆𝑑𝑓 .
• Calculate the confidence interval using the general formula:
point estimate ± 𝑡⋆𝑑𝑓 × 𝑆𝐸.
• Put the conclusions in context and in plain language so that even readers who are not
statisticians or data scientists can understand the results.
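The quantities in the 𝑡-procedure steps can be computed as in the following Python sketch. It covers the T score and degrees of freedom for one sample, paired differences, and two independent samples (using the smaller of 𝑛1 − 1 and 𝑛2 − 1), plus the general interval formula. The function names and the data are hypothetical, and the p-value and the critical value 𝑡⋆ are left to statistical software or a 𝑡-table because Python's standard library has no 𝑡-distribution.

```python
from math import sqrt
from statistics import mean, stdev


def one_sample_t(x, mu0=0.0):
    """Return (T score, df, SE) for one sample or for paired differences."""
    n = len(x)
    se = stdev(x) / sqrt(n)
    return (mean(x) - mu0) / se, n - 1, se


def paired_t(first, second):
    """Paired data reduce to a one-sample analysis of the differences."""
    diffs = [b - a for a, b in zip(first, second)]
    return one_sample_t(diffs, 0.0)


def two_sample_t(x1, x2):
    """Return (T score, conservative df, SE) for two independent samples."""
    se = sqrt(stdev(x1) ** 2 / len(x1) + stdev(x2) ** 2 / len(x2))
    df = min(len(x1), len(x2)) - 1  # smaller of n1 - 1 and n2 - 1
    return (mean(x1) - mean(x2)) / se, df, se


def t_interval(point_estimate, se, t_star):
    """General formula: point estimate plus or minus t_star times SE."""
    return point_estimate - t_star * se, point_estimate + t_star * se


# Hypothetical data, invented for this example only.
x = [1, 2, 3, 4, 5]
t_score, df, se = one_sample_t(x, mu0=2)
# t_star = 2.776 is the 95% critical value for df = 4, taken from a t-table.
ci = t_interval(mean(x), se, t_star=2.776)
print(f"T = {t_score:.3f}, df = {df}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```

Software typically uses a more precise degrees of freedom formula for two samples, so its results may differ slightly from the conservative choice used here.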
23.3 Case study: Redundant adjectives
Take a look at the images in Figure 23.1. How would you describe the circled item in Figure 23.1a?
Would you call it “the triangle”? Or “the blue triangle”? How about in Figure 23.1b? Does your
answer change?
In Figure 23.1a the circled item is the only triangle, but in Figure 23.1b the circled item is one
of two triangles. While in Figure 23.1a “the triangle” is a sufficient description for the circled item,
many of us might choose to refer to it as the “blue triangle” anyway. In Figure 23.1b there are two
triangles, so “the triangle” is no longer sufficient, and to describe the circled item we must qualify it
with the color as well, as “the blue triangle”.
Your answers to the above questions might be different if you’re answering in a language other
than English. For example, in Spanish, the adjective comes after the noun (e.g., “el triángulo azul”);
therefore, the incremental value of the additional adjective might be different for Figure 23.1a.
Researchers studying frequent use of redundant adjectives (e.g., referring to a single triangle as “the
blue triangle”) and incrementality of language processing designed an experiment where they showed
the following two images to 22 native English speakers (undergraduates from University College Lon-
don) and 22 native Spanish speakers (undergraduates from the Universidad de las Islas Baleares).
They found that in both languages, the subjects used more redundant color adjectives in denser dis-
plays where it would be more efficient. (Rubio-Fernandez, Mollica, and Jara-Ettinger 2021) One of
the displays from the study is shown in Figure 23.2.
Figure 23.2: Images used in one of the experiments described in Rubio-Fernandez, Mollica, and Jara-Ettinger
(2021).
In this case study we will examine data from the redundant adjective study, which the authors have made
available on Open Science Framework at osf.io/9hw68.
Table 23.3 shows the top six rows of the data. The full dataset has 88 rows. Remember that there are
a total of 44 subjects in the study (22 English and 22 Spanish speakers). There are two rows in the
dataset for each of the subjects: one representing data from when they were shown an image with 4
items on it and the other with 16 items on it. Each subject was asked 10 questions for each type of
image (with a different layout of items on the image for each question). The variable of interest to us
is redundant_perc, which gives the percentage of questions for which the subject used a redundant adjective
to identify “the blue triangle”. Note that the variable is a percentage, and we are interested in the
average percentage. Therefore, we will use methods for means. If the variable had been “success or
failure” (e.g., “used redundant or didn’t”), we would have used methods for proportions.
Table 23.3: Top six rows of the data collected in the study.
and that in both languages, subjects are more likely to use a redundant adjective when there are more
items in the image (i.e., in a denser display).
Figure 23.3: Results of redundant adjective usage experiment from Rubio-Fernandez, Mollica, and Jara-
Ettinger (2021). English speakers are more likely than Spanish speakers to use redundant adjectives, regardless
of number of items in image. In both languages, respondents are more likely to use a redundant adjective when
there are more items in the image.
Figure 23.4: Distribution of 1,000 bootstrapped means of redundant adjective usage percentage among
English speakers who were shown four items in images. Overlaid on the distribution is the 95% bootstrap
percentile interval that ranges from 19.1% to 56.4%.
Using a similar technique, we can also construct confidence intervals for the true mean redundant
adjective usage percentage for English speakers who are shown dense (16 item) displays and for Spanish
speakers with both types of displays (4 and 16 items). However, these confidence intervals are not very
meaningful to compare to one another as the interpretation of the “true mean redundant adjective
usage percentage” is quite an abstract concept. Instead, we might be more interested in comparative
questions such as “Does redundant adjective usage differ between dense and sparse displays among
English speakers and among Spanish speakers?” or “Does redundant adjective usage differ between
English speakers and Spanish speakers?” To answer either of these questions we need to conduct a
hypothesis test.
Table 23.4: Six participants who speak English with redundancy difference.
We can answer the research question using a hypothesis test with the following hypotheses:
𝐻0 ∶ 𝜇𝑑𝑖𝑓𝑓 = 0
𝐻𝐴 ∶ 𝜇𝑑𝑖𝑓𝑓 ≠ 0
where 𝜇𝑑𝑖𝑓𝑓 is the true difference in redundancy percentages when comparing a 16 item display with
a 4 item display. Recall that the computational method used to assess a hypothesis pertaining to the
true average of a paired difference shuffles the observed percentage across the two groups (4 item vs
16 item) but within a single participant. The shuffling process allows for repeated calculations of
potential sample differences under the condition that the null hypothesis is true.
Figure 23.5 shows the distribution of 1,000 mean differences from redundancy percentages permuted
across the two conditions. Note that the distribution is centered at 0, since the structure of randomly
assigning redundancy percentages to each item display will balance the data out such that the average
of any differences will be zero.
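The within-participant shuffle described above can be sketched in Python as follows. Swapping a participant's two percentages is equivalent to flipping the sign of that participant's difference, which is how the sketch is written. The function name `paired_randomization_test` and the eight pairs of percentages are hypothetical; they are not the study data.

```python
import random
from statistics import mean


def paired_randomization_test(cond_a, cond_b, reps=2000, seed=0):
    """Two-sided randomization test for a mean paired difference (illustrative sketch)."""
    rng = random.Random(seed)
    diffs = [b - a for a, b in zip(cond_a, cond_b)]
    observed = mean(diffs)
    extreme = 0
    for _ in range(reps):
        # Randomly swap the two conditions within each participant; a swap
        # changes the sign of that participant's difference and nothing else.
        shuffled = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(shuffled)) >= abs(observed):
            extreme += 1
    return observed, extreme / reps


# Hypothetical redundancy percentages for eight participants (not the study data).
four_items = [0, 10, 20, 0, 30, 10, 0, 20]
sixteen_items = [40, 60, 50, 30, 80, 50, 20, 70]
obs_diff, p_val = paired_randomization_test(four_items, sixteen_items)
print(f"observed mean difference = {obs_diff:.2f}, p-value = {p_val:.3f}")
```

Because each difference is equally likely to be positive or negative under the shuffle, the simulated mean differences are centered at 0, matching the description of the null distribution.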
Figure 23.5: Distribution of 1,000 mean differences of redundant adjective usage percentage among English
speakers who were shown images with 4 and 16 items. Overlaid on the distribution is the observed average
difference in the sample (solid line) as well as the difference in the other direction (dashed line), which is far
out in the tail, yielding a p-value that is approximately 0.
With such a small p-value, we reject the null hypothesis and conclude that the data provide convincing
evidence of a difference in mean redundant adjective usage percentages across different displays for
English speakers.
𝐻0 ∶ 𝜇𝐸𝑛𝑔𝑙𝑖𝑠ℎ = 𝜇𝑆𝑝𝑎𝑛𝑖𝑠ℎ
𝐻𝐴 ∶ 𝜇𝐸𝑛𝑔𝑙𝑖𝑠ℎ ≠ 𝜇𝑆𝑝𝑎𝑛𝑖𝑠ℎ
Here, the randomization process is slightly different than the paired setting (because the English
and Spanish speakers do not have a natural pairing across the two groups). To answer the research
question using a computational method, we can use a randomization test where we permute the data
across all participants under the assumption that the null hypothesis is true (no difference in mean
redundant adjective usage percentages across English vs Spanish speakers).
Figure 23.6 shows the null distributions for each of the two hypothesis tests. The p-value for the 4
item display comparison is very small (0.002) while the p-value for the 16 item display is much larger
(0.102).
Figure 23.6: Distributions of 1,000 differences in randomized means of redundant adjective usage percentage
between English and Spanish speakers. In each plot, the observed differences in the sample (solid line) and
the differences in the other direction (dashed line) are overlaid.
Based on the p-values (a measure of deviation from the null claim), we can conclude that the data
provide convincing evidence of a difference in mean redundant adjective usage percentages between
languages in 4 item displays (small p-value) but not in 16 item displays (not small p-value). The
results suggest that language patterns around redundant adjective usage might be more similar for
denser displays than sparser displays across English and Spanish speakers.
23.4 Interactive R tutorials
Navigate the concepts you’ve learned in this part in R using the following self-paced tutorials. All
you need is your browser to get started!
Tutorial 5: Statistical inference
https://openintrostat.github.io/ims-tutorials/05-infer
Tutorial 5 - Lesson 1: Inference for a single proportion
https://openintro.shinyapps.io/ims-05-infer-01
Tutorial 5 - Lesson 2: Hypothesis tests to compare proportions
https://openintro.shinyapps.io/ims-05-infer-02
Tutorial 5 - Lesson 3: Chi-squared test of independence
https://openintro.shinyapps.io/ims-05-infer-03
Tutorial 5 - Lesson 4: Chi-squared goodness of fit test
https://openintro.shinyapps.io/ims-05-infer-04
Tutorial 5 - Lesson 5: Bootstrapping for estimating a parameter
https://openintro.shinyapps.io/ims-05-infer-05
Tutorial 5 - Lesson 6: Introducing the t-distribution
https://openintro.shinyapps.io/ims-05-infer-06
Tutorial 5 - Lesson 7: Inference for difference in two means
https://openintro.shinyapps.io/ims-05-infer-07
Tutorial 5 - Lesson 8: Comparing many means
https://openintro.shinyapps.io/ims-05-infer-08
You can also access the full list of tutorials supporting this book at
https://openintrostat.github.io/ims-tutorials.
23.5 R labs
Further apply the concepts you’ve learned in this part in R with computational labs that walk you
through a data analysis case study.
Inference for categorical responses - Texting while driving
https://www.openintro.org/go?id=ims-r-lab-infer-1
https://www.openintro.org/go?id=ims-r-lab-infer-2
You can also access the full list of labs supporting this book at
https://www.openintro.org/go?id=ims-r-labs.
PART VI
Inferential modeling
In a previous part, Regression modeling, you learned how to build linear and logistic regression models
based on a set of observations. In a different part, Foundations of inference, you learned about the
structures that allow us to make inferential claims about a population given a sample of data. In this
part, we develop inferential methods applied to regression models.
• Chapter 24 provides specific details about inference for linear regression models with a single
predictor.
• Chapter 25 provides specific details about inference for linear regression models with multiple
predictors.
• Chapter 26 provides specific details about inference for logistic regression models.
• Chapter 27 includes an application on the Mario Kart case study where the topics from this
part of the book on linear regression are fully developed.
We have only scratched the surface in providing information about modeling and related inferential
methods. We hope that the ideas we’ve covered have whetted your appetite to learn more about
higher-level modeling.
Chapter 24
Inference for regression with a single predictor
24.1 Case study: Sandwich store
Consider the following hypothetical population of all of the sandwich stores of a particular chain seen
in Figure 24.1. In this made-up world, the CEO actually has all the relevant data, which is why they
can plot it here. The CEO is omniscient and can write down the population model which describes
the true population relationship between the advertising dollars and revenue. There appears to be a
linear relationship between advertising dollars and revenue (both in $1,000).
Figure 24.1: Revenue as a linear model of advertising dollars for a population of sandwich stores, in thousands
of dollars.
You may remember from Chapter 7 that the population model is:
𝑦 = 𝛽0 + 𝛽1 𝑥 + 𝜀.
Again, the omniscient CEO (with the full population information) can write down the true population
model as:
expected revenue = 11.23 + 4.8 × advertising.
𝑦̂ = 𝑏0 + 𝑏1𝑥.
Two random samples of 20 stores show different least squares regression lines in Figure 24.2a and
Figure 24.2b, depending on which observations are selected. Both trends are similar to those seen in
Figure 24.1, which describes the population.
Figure 24.2: Two random samples of 20 stores from the entire population. A linear trend between advertising
and revenue is observed in both.
Figure 24.3 shows the two samples and the least squares regressions from Figure 24.2 on the same
plot. We can see that the two lines are different. That is, there is variability in the regression line
from sample to sample. The concept of sampling variability is something you’ve seen before, but
in this chapter, you will focus on the variability of the line, often measured through the variability of a
single statistic: the slope of the line.
Figure 24.3: The linear models from the two random samples are quite similar, but not exactly the same.
Figure 24.4 shows least squares lines fit to many more random samples of 20 from the population.
Figure 24.4: If repeated samples of size 20 are taken from the entire population, each linear model will be
slightly different. The red line provides the linear fit to the entire population.
You might notice in Figure 24.4 that the 𝑦̂ values given by the lines are much more consistent in the
middle of the dataset than at the ends. The reason is that the data itself anchors the lines in such a
way that the line must pass through the center of the data cloud. The effect of the fan-shaped lines
is that predicted revenue for advertising close to $4,000 will be much more precise than the revenue
predictions made for $1,000 or $7,000 of advertising.
The distribution of slopes (for samples of size 𝑛 = 20) can be seen in Figure 24.5.
Figure 24.5: Variability of slope estimates from many different samples of stores, each of size 20.
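The idea of slopes varying from sample to sample can be reproduced with a small simulation. The sketch below builds a made-up population around the population model stated earlier (expected revenue = 11.23 + 4.8 × advertising), then repeatedly draws samples of 20 stores and records the least squares slope. The population size, the range of advertising values, and the noise standard deviation are assumptions chosen for illustration, as is the helper name `ls_slope`.

```python
import random
from statistics import mean, stdev


def ls_slope(x, y):
    """Least squares slope b1 = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx


rng = random.Random(1)

# Hypothetical population of 1,000 stores; advertising and revenue in $1,000.
# The uniform range (1 to 7) and the noise SD (5) are assumptions.
advertising = [rng.uniform(1, 7) for _ in range(1000)]
revenue = [11.23 + 4.8 * a + rng.gauss(0, 5) for a in advertising]
population = list(zip(advertising, revenue))

# Draw many random samples of 20 stores and fit a line to each one.
slopes = []
for _ in range(500):
    sample = rng.sample(population, 20)
    xs, ys = zip(*sample)
    slopes.append(ls_slope(xs, ys))

print(f"mean of sample slopes = {mean(slopes):.2f}")
print(f"SD of sample slopes   = {stdev(slopes):.2f}")
```

A histogram of `slopes` would play the role of the distribution of slope estimates: centered near the population slope, with a spread that reflects sample-to-sample variability.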
Recall, the example described in this introduction is hypothetical. That is, we created an entire
population in order to demonstrate how the slope of a line would vary from sample to sample. The tools
in this textbook are designed to evaluate only one single sample of data. With actual studies, we do
not have repeated samples, so we are not able to use repeated samples to visualize the variability in
slopes. We have seen variability in samples throughout this text, so it should not come as a surprise
that different samples will produce different linear models. However, it is nice to visually consider the
linear models produced by different slopes. Additionally, as with measuring the variability of previous
statistics (e.g., 𝑋̄1 − 𝑋̄2 or 𝑝̂1 − 𝑝̂2), the histogram of the sample statistics can provide information
related to inferential considerations.
In the following sections, the distribution (i.e., histogram) of 𝑏1 (the estimated slope coefficient) will
be constructed in the same three ways that, by now, may be familiar to you. First (in Section 24.2),
the distribution of 𝑏1 when 𝛽1 = 0 is constructed by randomizing (permuting) the response variable.
Next (in Section 24.3), we can bootstrap the data by taking random samples of size 𝑛 from the original
dataset. And last (in Section 24.4), we use mathematical tools to describe the variability using the
𝑡-distribution that was first encountered in Section 19.2.
24.2 Randomization test for the slope
Consider data on 100 randomly selected births gathered originally from the US Department of Health
and Human Services. Some of the variables are plotted in Figure 24.6.
The scientific research interest at hand will be in determining the linear relationship between weight
of baby at birth (in lbs) and number of weeks of gestation. The dataset is quite rich and deserves
exploring, but for this example, we will focus only on the weight of the baby.
The births14 data can be found in the openintro R package. We will work with a
random sample of 100 observations from these data.
Figure 24.6: Weight of baby at birth (in lbs) as plotted by four other birth variables (mother’s weight gain,
mother’s age, number of hospital visits, and weeks gestation).
As you have seen previously, statistical inference typically relies on setting a null hypothesis which
is hoped to be subsequently rejected. In the linear model setting, we might hope to have a linear
relationship between weeks and weight in settings where weeks gestation is known and weight of
baby needs to be predicted.
The relevant hypotheses for the linear model setting can be written in terms of the population slope
parameter. Here the population refers to a larger population of births in the US.
• 𝐻0 ∶ 𝛽1 = 0, there is no linear relationship between weight and weeks.
• 𝐻𝐴 ∶ 𝛽1 ≠ 0, there is some linear relationship between weight and weeks.
Recall that for the randomization test, we permute one variable to eliminate any existing relationship
between the variables. That is, we set the null hypothesis to be true, and we measure the natural
variability in the data due to sampling but not due to variables being correlated. Figure 24.7a shows
the observed data and Figure 24.7b shows one permutation of the weight variable. The careful
observer can see that each of the observed values for weight (and for weeks) exists in both the original
data plot as well as the permuted weight plot, but the weight and weeks gestation are no longer
matched for a given birth. That is, each weight value is randomly assigned to a new weeks gestation.
By repeatedly permuting the response variable, any pattern in the linear model that is observed is
due only to random chance (and not an underlying relationship). The randomization test compares
the slopes calculated from the permuted response variable with the observed slope. If the observed
slope is inconsistent with the slopes from permuting, we can conclude that there is some underlying
relationship (and that the slope is not merely due to random chance).
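The permutation procedure described above can be sketched in a few lines of code. This is a minimal illustration, not the analysis behind the figures in this chapter: the data are synthetic stand-ins (the names `weeks` and `weight`, the seed, and the generating values are all assumptions), and the helper functions `slope` and `permutation_slopes` are hypothetical names introduced here.

```python
import random


def slope(x, y):
    """Least squares slope of y regressed on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


def permutation_slopes(x, y, reps, rng):
    """Slopes computed after shuffling y, which breaks any link between x and y."""
    slopes = []
    for _ in range(reps):
        shuffled = y[:]          # copy, so the observed pairing is left intact
        rng.shuffle(shuffled)    # each y value is assigned to a new x value
        slopes.append(slope(x, shuffled))
    return slopes


# Synthetic stand-in for 100 births (NOT the births14 data): weeks of
# gestation, and a weight in pounds that rises with weeks plus random noise.
rng = random.Random(24)
weeks = [rng.uniform(32, 42) for _ in range(100)]
weight = [-5.0 + 0.33 * w + rng.gauss(0, 1.0) for w in weeks]

observed = slope(weeks, weight)
null_slopes = permutation_slopes(weeks, weight, reps=1000, rng=rng)

# Two-sided p-value: the share of permuted slopes at least as far from zero
# as the observed slope.
p_value = sum(abs(s) >= abs(observed) for s in null_slopes) / len(null_slopes)
print(f"observed slope = {observed:.3f}, p-value = {p_value:.3f}")
```

Only the response is shuffled, so every original value of both variables still appears in each permuted dataset; a histogram of `null_slopes` plays the role of the null distribution of the slope.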
24.2. RANDOMIZATION TEST FOR THE SLOPE 413
Figure 24.7: Permutation removes the linear relationship between weight and weeks. Repeated permutations
allow for quantifying the variability in the slope under the condition that there is no linear relationship (i.e.,
that the null hypothesis is true).
Table 24.1: The least squares estimates of the intercept and slope are given in the estimate column. The
observed slope is 0.335.
Figure 24.8: Two permutations of weight with slightly different least squares regression lines.
As you can see, sometimes the slope of the permuted data is positive, sometimes it is negative.
Because the randomization happens under the condition of no underlying relationship (because the
response variable is completely mixed with the explanatory variable), we expect the center of
the randomized slope distribution to be zero.
414 CHAPTER 24. INFERENCE FOR REGRESSION WITH A SINGLE PREDICTOR
Figure 24.9: Histogram of slopes given different permutations of the weight variable. The vertical red line
is at the observed value of the slope, 0.335.
As we can see from Figure 24.9, a slope estimate as extreme as the observed slope estimate (the red
line) never happened in many repeated permutations of the weight variable. That is, if indeed there
were no linear relationship between weight and weeks, the natural variability of the slopes would
produce estimates between approximately -0.15 and +0.15. We reject the null hypothesis. Therefore,
we believe that the slope observed on the original data is not just due to natural variability and indeed,
there is a linear relationship between weight of baby and weeks gestation for births in the US.
As we have seen in previous chapters, we can use bootstrapping to estimate the sampling distribution
of the statistic of interest (here, the slope) without the null assumption of no relationship (which was
the condition in the randomization test). Because interest is now in creating a CI, there is no null
hypothesis, so there won’t be any reason to permute either of the variables.
Figure 24.10: Using the original data, the weight of baby as a linear model of mother’s age. Notice that the
relationship between mother’s age and weight of baby is not as strong as the relationship we saw previously
between weeks gestation and weight of baby.
24.3. BOOTSTRAP CONFIDENCE INTERVAL FOR THE SLOPE 415
Table 24.2: The least squares estimates of the intercept and slope are given in the estimate column. The
observed slope is 0.036.
Figure 24.11: Original and one bootstrap sample of the births data. It is difficult to differentiate between
the two plots, as (within a single bootstrap sample) the observations which have been resampled twice are
plotted as points on top of one another. The red circles represent points in the original data which were not
included in the bootstrap sample. The blue circles represent a data point that was repeatedly resampled (and
is therefore darker) in the bootstrap sample. The green circles represent a particular structure to the data
which is observed in both the original and bootstrap samples.
Figure 24.11a shows the original data as compared with a single bootstrap sample in Figure 24.11b,
resulting in (slightly) different linear models. The red circles represent points in the original data which
were not included in the bootstrap sample. The blue circles represent a point that was repeatedly
resampled (and is therefore darker) in the bootstrap sample. The green circles represent a particular
structure to the data which is observed in both the original and bootstrap samples. By repeatedly
resampling, we can see dozens of bootstrapped slopes on the same plot in Figure 24.12.
416 CHAPTER 24. INFERENCE FOR REGRESSION WITH A SINGLE PREDICTOR
Figure 24.12: Repeated bootstrap resamples of size 100 are taken from the original data. Each of the
bootstrapped linear models is slightly different.
Recall that in order to create a confidence interval for the slope, we need to find the range of values that
the statistic (here the slope) takes on from different bootstrap samples. Figure 24.13 is a histogram
of the relevant bootstrapped slopes. We can see that a 95% bootstrap percentile interval for the true
population slope is given by (-0.01, 0.081). We are 95% confident that for the model describing the
population of births, predicting weight of baby from mother’s age, a one unit increase in mage (in
years) is associated with a change in predicted average baby weight of between -0.01 and 0.081
pounds. Notice that the CI contains zero, so the true relationship might be null!
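As a rough sketch of how a bootstrap percentile interval for the slope can be computed, the code below resamples whole observations (pairs) with replacement and takes the middle 95% of the resulting slopes. The data are synthetic (the names `mage` and `weight` mirror the variables in the text, but the values, seed, and generating model are assumptions), the helper functions are hypothetical, and `percentile_interval` uses a simple order-statistic rule rather than any particular software's quantile definition.

```python
import random


def slope(x, y):
    """Least squares slope of y regressed on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


def bootstrap_slopes(x, y, reps, rng):
    """Slopes from resamples of the (x, y) pairs, drawn with replacement."""
    n = len(x)
    slopes = []
    for _ in range(reps):
        idx = [rng.randrange(n) for _ in range(n)]  # same size as the original sample
        slopes.append(slope([x[i] for i in idx], [y[i] for i in idx]))
    return slopes


def percentile_interval(values, level):
    """Middle `level` share of the values, read off the sorted list."""
    ordered = sorted(values)
    n = len(ordered)
    tail = (1 - level) / 2
    return ordered[round(tail * n)], ordered[round((1 - tail) * n) - 1]


# Synthetic stand-in for 100 births (NOT the births14 data).
rng = random.Random(2014)
mage = [rng.uniform(15, 45) for _ in range(100)]
weight = [6.2 + 0.036 * a + rng.gauss(0, 1.3) for a in mage]

observed = slope(mage, weight)
boot = bootstrap_slopes(mage, weight, reps=1000, rng=rng)
lower, upper = percentile_interval(boot, level=0.95)
print(f"observed slope = {observed:.3f}, 95% percentile interval = ({lower:.3f}, {upper:.3f})")
```

Unlike the randomization test, the pairing between the two variables is kept within each resampled observation, because no null hypothesis is being imposed.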
Figure 24.13: The original births data on baby’s weight and mother’s age is bootstrapped 1,000 times. The
histogram provides a sense for the variability of the slope of the linear model from sample to sample.
24.4. MATHEMATICAL MODEL FOR TESTING THE SLOPE 417
EXAMPLE
Using Figure 24.13, calculate the bootstrap estimate for the standard error of the slope. Using
the bootstrap standard error, find a 95% bootstrap SE confidence interval for the true
population slope, and interpret the interval in context.
Notice that most of the bootstrapped slopes fall between -0.01 and +0.08 (a range of 0.09).
Using the empirical rule (that with bell-shaped distributions, most observations are within two
standard errors of the center), the standard error of the slopes is approximately 0.0225. The
critical value for a 95% confidence interval is 𝑧⋆ = 1.96 which leads to a confidence interval of
𝑏1 ± 1.96 ⋅ 𝑆𝐸 → 0.036 ± 1.96 ⋅ 0.0225 → (−0.0081, 0.0801). The bootstrap SE confidence interval
is almost identical to the bootstrap percentile interval. In context, we are 95% confident that
for the model describing the population of births, predicting weight of baby from mother’s age,
a one unit increase in mage (in years) is associated with a change in predicted average baby
weight of between -0.0081 and 0.0801 pounds.
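The arithmetic in this example can be laid out step by step. The one assumption made explicit here is how the empirical rule is applied: the range that holds most of the bootstrapped slopes is treated as spanning about four standard errors, two on each side of the center.

```latex
\begin{aligned}
SE &\approx \frac{0.08 - (-0.01)}{4} = \frac{0.09}{4} = 0.0225 \\
b_1 \pm z^{\star} \cdot SE &= 0.036 \pm 1.96 \cdot 0.0225 = 0.036 \pm 0.0441 \\
&\rightarrow (-0.0081,\ 0.0801)
\end{aligned}
```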
Figure 24.14 shows these data and the least-squares regression line:
We consider the percent change in the number of seats of the President’s party (e.g., percent change
in the number of seats for Republicans in 2018) against the unemployment rate.
Examining the data, there are no clear deviations from linearity or substantial outliers (see
Section 7.1.3 for a discussion on using residuals to visualize how well a linear model fits the data). While
the data are collected sequentially, a separate analysis was used to check for any apparent correlation
between successive observations; no such correlation was found.
Figure 24.14: The percent change in House seats for the President’s party in each election from 1898 to
2010 plotted against the unemployment rate. The two points for the Great Depression have been removed,
and a least squares regression line has been fit to the data.
GUIDED PRACTICE
The data for the Great Depression (1934 and 1938) were removed because the unemployment
rate was 21% and 18%, respectively. Do you agree that they should be removed
for this investigation? Why or why not?1
There is a negative slope in the line shown in Figure 24.14. However, this slope (and the y-intercept)
are only estimates of the parameter values. We might wonder, is this convincing evidence that the
“true” linear model has a negative slope? That is, do the data provide strong evidence that the political
theory is accurate, where the unemployment rate is a useful predictor of the midterm election? We
can frame this investigation into a statistical hypothesis test:
• 𝐻0 : 𝛽1 = 0. The true linear model has slope zero.
• 𝐻𝐴 : 𝛽1 ≠ 0. The true linear model has a slope different than zero. The unemployment rate is
predictive of whether the President’s party wins or loses seats in the House of Representatives.
We would reject 𝐻0 in favor of 𝐻𝐴 if the data provide strong evidence that the true slope parameter is
different than zero. To assess the hypotheses, we identify a standard error for the estimate, compute
an appropriate test statistic, and identify the p-value.
1 The answer to this question relies on the idea that statistical data analysis is somewhat of an art. That is, in many
situations, there is no “right” answer. As you do more and more analyses on your own, you will come to recognize
the nuanced understanding which is needed for a particular dataset. In terms of the Great Depression, we will provide
two contrasting considerations. Each of these points would have very high leverage on any least-squares regression line,
and years with such high unemployment may not help us understand what would happen in other years where the
unemployment is only modestly high. On the other hand, the Depression years are exceptional cases, and we would be
discarding important information if we exclude them from a final analysis.
Table 24.3: Output from statistical software for the regression line modeling the midterm election losses for
the President’s party as a response to unemployment.
EXAMPLE
What do the first and second columns of Table 24.3 represent?
The entries in the first column represent the least squares estimates, $b_0$ and $b_1$, and the values
in the second column correspond to the standard errors of each estimate. Using the estimates,
we could write the equation for the least squares regression line as
$$\hat{y} = -7.36 - 0.89x$$
where $\hat{y}$ in this case represents the predicted change in the number of seats for the President’s
party, and $x$ represents the unemployment rate.
We previously used a 𝑡-test statistic for hypothesis testing in the context of numerical data. Regression
is very similar. In the hypotheses we consider, the null value for the slope is 0, so we can compute the
test statistic using the T score formula:
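In general form, the T score measures how far the point estimate is from the null value in units of standard error. The display below is a sketch in this chapter's notation, where $SE_{b_1}$ (a symbol assumed here) denotes the standard error of the slope, read from the second column of Table 24.3:

```latex
T = \frac{\text{estimate} - \text{null value}}{SE} = \frac{b_1 - 0}{SE_{b_1}}
```

When the technical conditions hold, this statistic is compared with a $t$-distribution with $df = n - 2$ degrees of freedom.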
EXAMPLE
Use Table 24.3 to determine the p-value for the hypothesis test.
The last column of the table gives the p-value for the two-sided hypothesis test for the coefficient
of the unemployment rate: 0.2961. That is, the data do not provide convincing evidence that
a higher unemployment rate has any correspondence with smaller or larger losses for the
President’s party in the House of Representatives in midterm elections. If there were no linear
relationship between the two variables (i.e., if 𝛽1 = 0), then we would expect to see linear
models as or more extreme than the observed model roughly 30% of the time.
EXAMPLE
Examine Figure 7.13, which relates the Elmhurst College aid and student family income. Are
you convinced that the slope is discernibly different from zero? That is, do you think a formal
hypothesis test would reject the claim that the true slope of the line should be zero?
While the relationship between the variables is not perfect, there is an evident decreasing trend
in the data. Such a distinct trend suggests that the hypothesis test will reject the null claim
that the slope is zero.
The tools in this section help you go beyond a visual interpretation of the linear relationship toward
a formal mathematical claim about whether the slope estimate is far enough from 0 to
suggest that the true population slope is different from 0.
Table 24.4: Summary of least squares fit for the Elmhurst College data, where we are predicting the gift aid
by the university based on the family income of students.
GUIDED PRACTICE
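Table 24.4 shows statistical software output from fitting the least squares regression line
shown in Figure 7.13. Use the output to formally evaluate the following hypotheses.2
• 𝐻0 : 𝛽1 = 0. The true coefficient for family income is zero.
• 𝐻𝐴 : 𝛽1 ≠ 0. The true coefficient for family income is not zero.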
We usually rely on statistical software to identify point estimates, standard errors, test
statistics, and p-values in practice. However, be aware that software will not generally
check whether the method is appropriate, meaning we must still verify conditions are
met. See Section 24.6.
2 We look in the second row corresponding to the family income variable. We see the point estimate of the slope
of the line is -0.0431, the standard error of this estimate is 0.0108, and the 𝑡-test statistic is 𝑇 = −3.98. The p-value
corresponds exactly to the two-sided test we are interested in: 0.0002. The p-value is so small that we reject the null
hypothesis and conclude that family income and financial aid at Elmhurst College for freshmen entering in the year
2011 are negatively correlated and the true slope parameter is indeed less than 0, just as we believed in our analysis of
Figure 7.13.
24.5. MATHEMATICAL MODEL, INTERVAL FOR THE SLOPE 421
Similar to how we can conduct a hypothesis test for a model coefficient using regression output, we
can also construct confidence intervals for the slope and intercept coefficients.
Confidence intervals for model coefficients (e.g., the intercept or the slope) can be
computed using the 𝑡-distribution:
$$b_i \pm t^{\star}_{df} \times SE_{b_i}$$
where $t^{\star}_{df}$ is the appropriate $t^{\star}$ cutoff corresponding to the confidence level with the
model’s degrees of freedom, $df = n - 2$.
EXAMPLE
Compute the 95% confidence interval for the coefficient using the regression output from
Table 24.4.
The point estimate is -0.0431 and the standard error is 𝑆𝐸 = 0.0108. When constructing a
confidence interval for a model coefficient, we generally use a 𝑡-distribution. The degrees of
freedom for the distribution are noted in the regression output, 𝑑𝑓 = 48, allowing us to identify
𝑡⋆48 = 2.01 for use in the confidence interval.
We are 95% confident that for each additional unit of family income (i.e., a $1000 increase),
the university’s gift aid is predicted to decrease on average by $21.40 to $64.80.
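The arithmetic behind this interval can be written out using only the values quoted in the example (point estimate, standard error, and the $t^{\star}_{48}$ cutoff):

```latex
\begin{aligned}
b_1 \pm t^{\star}_{48} \times SE_{b_1} &= -0.0431 \pm 2.01 \times 0.0108 \\
&= -0.0431 \pm 0.0217 \\
&\rightarrow (-0.0648,\ -0.0214)
\end{aligned}
```

Assuming gift aid is recorded in the same $1000 units as family income, the endpoints correspond to predicted decreases of $64.80 and $21.40.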
On the topic of intervals in this book, we have focused exclusively on confidence intervals for model
parameters. However, there are other types of intervals that may be of interest (and are outside the
scope of this book), including prediction intervals for a response value and confidence intervals for a
mean response value in the context of regression.
In the previous sections, we used randomization and bootstrapping to perform inference when the
mathematical model was not valid due to violations of the technical conditions. In this section, we’ll
provide details for when the mathematical model is appropriate and a discussion of technical conditions
needed for the randomization and bootstrapping procedures. Recall from Section 7.1.3 that residual
plots can be used to visualize how well a linear model fits the data.
24.6.1 What are the technical conditions for the mathematical model?
When fitting a least squares line, we generally require the following:
• Linearity. The data should show a linear trend. If there is a nonlinear trend (e.g., first panel
of Figure 24.15) an advanced regression method from another book or later course should be
applied.
• Independent observations. Be cautious about applying regression to data that are sequential
observations in time such as a stock price each day. Such data may have an underlying structure
that should be considered in a different type of model and analysis. An example of a dataset
where successive observations are not independent is shown in the fourth panel of Figure 24.15.
There are also other instances where correlations within the data are important, which is further
discussed in Chapter 25.
• Nearly normal residuals. Generally, the residuals should be nearly normal. When this
condition is found to be unreasonable, it is often because of outliers or concerns about influential
points, which we’ll talk about more in Section 7.3. An example of a residual that would be
potentially concerning is shown in the second panel of Figure 24.15, where one observation
is clearly much further from the regression line than the others. Outliers should be treated
extremely carefully. Do not automatically remove an outlier if it truly belongs in the dataset.
However, be honest about its impact on the analysis. A strategy for dealing with outliers is to
present two analyses: one with the outlier and one without the outlier. Additionally, a type of
violation of normality happens when the positive residuals are smaller in magnitude than the
negative residuals (or vice versa). That is, when the residuals are not symmetrically distributed
around the line 𝑦 = 0.
Figure 24.15: Four examples showing when the methods in this chapter are insufficient to apply a linear
model to the data. The top set of graphs represents the 𝑥 and 𝑦 relationship. The bottom set of graphs is a
residual plot. First panel – linearity fails. Second panel – there are outliers, most especially one point that
is very far away from the line. Third panel – the variability of the errors is related to the value of 𝑥. Fourth
panel – a time series dataset is shown, where successive observations are highly correlated.
24.6. CHECKING MODEL CONDITIONS 423
• Constant or equal variability. The variability of points around the least squares line
remains roughly constant. An example of non-constant variability is shown in the third panel of
Figure 24.15, which represents the most common pattern observed when this condition fails: the
variability of 𝑦 is larger when 𝑥 is larger.
GUIDED PRACTICE
Should we have concerns about applying least squares regression to the Elmhurst data
in Figure 7.14?3
The technical conditions are often remembered using the LINE mnemonic. The linearity, normality,
and equality of variance conditions usually can be assessed through residual plots, as seen in
Figure 24.15. A careful consideration of the experimental design should be undertaken to confirm that
the observed values are indeed independent.
• L: linear model
• I: independent observations
• N: points are normally distributed around the line
• E: equal variability around the line for all values of the explanatory variable
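Residual plots remain the main tool for checking these conditions, but a simple numerical summary can complement them. The sketch below fits a least squares line, computes residuals, and compares the spread of the residuals for larger versus smaller values of the explanatory variable. The data are synthetic, and the helper `spread_ratio` is a hypothetical diagnostic introduced for illustration, not a formal test.

```python
import random
import statistics


def fit_line(x, y):
    """Return (intercept, slope) of the least squares line."""
    mean_x = statistics.fmean(x)
    mean_y = statistics.fmean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1


def residuals(x, y):
    """Observed minus predicted values from the least squares line."""
    b0, b1 = fit_line(x, y)
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]


def spread_ratio(x, resid):
    """SD of residuals in the upper half of x divided by SD in the lower half."""
    paired = sorted(zip(x, resid))
    half = len(paired) // 2
    low = [r for _, r in paired[:half]]
    high = [r for _, r in paired[half:]]
    return statistics.stdev(high) / statistics.stdev(low)


# Synthetic data: one response with constant variability around the line,
# and one whose variability grows with x (a "fanning" pattern).
rng = random.Random(7)
x = [rng.uniform(0, 10) for _ in range(400)]
constant = [2 + 0.5 * xi + rng.gauss(0, 1) for xi in x]
fanning = [2 + 0.5 * xi + rng.gauss(0, 0.05 + 0.4 * xi) for xi in x]

ratio_constant = spread_ratio(x, residuals(x, constant))
ratio_fanning = spread_ratio(x, residuals(x, fanning))
print(f"constant variability: ratio = {ratio_constant:.2f}")
print(f"fanning variability:  ratio = {ratio_fanning:.2f}")
```

A ratio near 1 is consistent with the equal variability condition, while a ratio well above 1 matches the pattern where the variability of 𝑦 is larger when 𝑥 is larger. Independence cannot be checked this way; it still requires thinking about how the data were collected.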
3 The trend appears to be linear, the data fall around the line with no obvious outliers, the variance is roughly
constant. The data do not come from a time series or other obvious violation to independence. Least squares regression
can be applied to these data.
24.7.1 Summary
Recall that early in the text we presented graphical techniques which communicated relationships
across multiple variables. We also used modeling to formalize the relationships. Many chapters
were dedicated to inferential methods which allowed claims about the population to be made based
on samples of data. Not only did we present the mathematical model for each of the inferential
techniques, but when appropriate, we also presented bootstrapping and permutation methods.
In Chapter 24 we brought all of those ideas together by considering inferential claims on linear models
through randomization tests, bootstrapping, and mathematical modeling. We continue to emphasize
the importance of experimental design in making conclusions about research claims. In particular,
recall that variability can come from different sources (e.g., random sampling vs. random allocation,
see Figure 2.8).
24.7.2 Terms
The terms introduced in this chapter are presented in Table 24.5. If you’re not sure what some of
these terms mean, we recommend you go back in the text and review their definitions. You should be
able to easily spot them as bolded text.
• bootstrap CI for the slope
• inference with single predictor regression
• randomization test for the slope
• t-distribution for slope
• technical conditions linear regression
• variability of the slope
24.8 Exercises
a. What are the null and alternative hypotheses for evaluating whether the slope of the model
predicting height from shoulder girth is different than 0?
b. Using the histogram which describes the distribution of slopes when the null hypothesis is
true, find the p-value and conclude the hypothesis test in the context of the problem (use
words like shoulder girth and height).
c. Is the conclusion based on the histogram of randomized slopes consistent with the conclusion
from the mathematical model? Explain your reasoning.
2. Baby’s weight and father’s age, randomization test. US Department of Health and Human
Services, Centers for Disease Control and Prevention collect information on births recorded
in the country. The data used here are a random sample of 1000 births from 2014. Here, we
study the relationship between the father’s age and the weight of the baby.5 (ICPSR 2014)
Shown below are the linear model output for predicting baby’s weight (in pounds) from father’s
age (in years) and the histogram of slopes from 1000 randomized datasets (1000 times, weight
was permuted and regressed against fage). The red vertical line is drawn at the observed slope
value which was produced in the linear model output.
a. What are the null and alternative hypotheses for evaluating whether the slope of the model
for predicting baby’s weight from father’s age is different than 0?
b. Using the histogram which describes the distribution of slopes when the null hypothesis is
true, find the p-value and conclude the hypothesis test in the context of the problem (use
words like father’s age and weight of baby). What does the conclusion of your test say
about whether the father’s age is a useful predictor of baby’s weight?
c. Is the conclusion based on the histogram of randomized slopes consistent with the conclusion
from the mathematical model? Explain your reasoning.
4 The bdims data used in this exercise can be found in the openintro R package.
5 The births14 data used in this exercise can be found in the openintro R package.
3. Body measurements, mathematical test. The scatterplot and least squares summary below
show the relationship between weight measured in kilograms and height measured in centimeters
of 507 physically active individuals. (Heinz et al. 2003)
4. Baby’s weight and father’s age, mathematical test. Is the father’s age useful in predicting
the baby’s weight? The scatterplot and least squares summary below show the relationship
between baby’s weight (measured in pounds) and father’s age for a random sample of babies.
(ICPSR 2014)
a. Using the bootstrap percentile method and the histogram above, find a 98% confidence
interval for the slope parameter.
b. Interpret the confidence interval in the context of the problem.
6. Baby’s weight and father’s age, bootstrap percentile interval. US Department of Health
and Human Services, Centers for Disease Control and Prevention collect information on births
recorded in the country. The data used here are a random sample of 1000 births from 2014.
Here, we study the relationship between the father’s age and the weight of the baby. Below is
the bootstrap distribution of the slope statistic from 1,000 different bootstrap samples of the
data. (ICPSR 2014)
a. Using the bootstrap percentile method and the histogram above, find a 95% confidence
interval for the slope parameter.
b. Interpret the confidence interval in the context of the problem.
7. Body measurements, standard error bootstrap interval. A linear model is built to predict
height based on shoulder girth (circumference of shoulders measured over deltoid muscles), both
measured in centimeters. (Heinz et al. 2003) Shown below are the linear model output for
predicting height from shoulder girth and the bootstrap distribution of the slope statistic from
1,000 different bootstrap samples of the data.
8. Baby’s weight and father’s age, standard error bootstrap interval. US Department
of Health and Human Services, Centers for Disease Control and Prevention collect information
on births recorded in the country. The data used here are a random sample of 1000 births
from 2014. Here, we study the relationship between the father’s age and the weight of the baby.
(ICPSR 2014) Shown below are the linear model output for predicting baby’s weight (in pounds)
from father’s age (in years) and the bootstrap distribution of the slope statistic from 1000
different bootstrap samples of the data.
9. Body measurements, conditions. The scatterplot below shows the residuals (on the y-axis)
from the linear model of weight vs. height from a dataset of body measurements from 507
physically active individuals. The x-axis is the height of the individuals, in cm. (Heinz et al.
2003)
a. For these data, $R^2$ is 51.84%. What is the value of the correlation coefficient? How can
you tell if it is positive or negative? (Hint: you may need to look at a previous exercise.)
b. Examine the residual plot. What do you observe? Is a simple least squares fit appropriate
for these data? Which of the LINE conditions are met or not met?
10. Baby’s weight and father’s age, conditions. The scatterplot below shows the residuals (on
the y-axis) from the linear model of baby’s weight (measured in pounds) vs. father’s age for a
random sample of babies. Father’s age is on the x-axis. (ICPSR 2014)
a. For these data, $R^2$ is 0.09%. What is the value of the correlation coefficient? How can you
tell if it is positive or negative? (Hint: you may need to look at a previous exercise.)
b. Examine the residual plot. What do you observe? Is a simple least squares fit appropriate
for these data? Which of the LINE conditions are met or not met?
11. Murders and poverty, randomization test. The following regression output is for predicting
annual murders per million (annual_murders_per_mil) from percentage living in poverty
(perc_pov) in a random sample of 20 metropolitan areas. Shown below are the linear model
output for predicting annual murders per million from percentage living in poverty for metropolitan
areas and the histogram of slopes from 1000 randomized datasets (1000 times,
annual_murders_per_mil was permuted and regressed against perc_pov). The red vertical line is drawn
at the observed slope value which was produced in the linear model output.
12. Murders and poverty, mathematical test. The table below shows the output of a linear
model predicting annual murders per million (annual_murders_per_mil) from percentage living in poverty
(perc_pov) in a random sample of 20 metropolitan areas.
a. What are the hypotheses for evaluating whether the slope of the model predicting annual
murder rate from poverty percentage is different than 0?
b. State the conclusion of the hypothesis test from part (a) in context. What does this say
about whether poverty percentage is a useful predictor of annual murder rate?
c. Calculate a 95% confidence interval for the slope of poverty percentage, and interpret it in
context.
d. Do your results from the hypothesis test and the confidence interval agree? Explain your
reasoning.
13. Murders and poverty, bootstrap percentile interval. Data on annual murders per million
(annual_murders_per_mil) and percentage living in poverty (perc_pov) are collected from a
random sample of 20 metropolitan areas. Using these data we want to estimate the slope of the
model predicting annual_murders_per_mil from perc_pov. We take 1,000 bootstrap samples
of the data and fit a linear model predicting annual_murders_per_mil from perc_pov to each
bootstrap sample. A histogram of these slopes is shown below.
a. Using the percentile bootstrap method and the histogram above, find a 90% confidence
interval for the slope parameter.
b. Interpret the confidence interval in the context of the problem.
14. Murders and poverty, standard error bootstrap interval. A linear model is built to predict
annual murders per million (annual_murders_per_mil) from percentage living in poverty
(perc_pov) in a random sample of 20 metropolitan areas. Shown below are the standard linear
model output for predicting annual murders per million from percentage living in poverty
for metropolitan areas and the bootstrap distribution of the slope statistic from 1000 different
bootstrap samples of the data.
a. Using the histogram, approximate the standard error of the slope statistic (that is, quantify
the variability of the slope statistic from sample to sample).
b. Find a 90% bootstrap SE confidence interval for the slope parameter.
c. Interpret the confidence interval in the context of the problem.
15. Murders and poverty, conditions. The scatterplot below shows the annual murders per
million vs. percentage living in poverty in a random sample of 20 metropolitan areas. The
second figure plots residuals on the y-axis and percent living in poverty on the x-axis.
a. For these data, $R^2$ is 70.56%. What is the value of the correlation coefficient? How can
you tell if it is positive or negative?
b. Examine the residual plot. What do you observe? Is a simple least squares fit appropriate
for the data? Which of the LINE conditions are met or not met?
16. I heart cats. Researchers collected data on heart and body weights of 144 domestic adult cats.
The table below shows the output of a linear model predicting heart weight (measured in grams)
from body weight (measured in kilograms) of these cats.6
a. What are the hypotheses for evaluating whether body weight is positively associated with
heart weight in cats?
b. State the conclusion of the hypothesis test from part (a) in context.
c. Calculate a 95% confidence interval for the slope of body weight, and interpret it in context.
d. Do your results from the hypothesis test and the confidence interval agree? Explain your
reasoning.
6 The cats data used in this exercise can be found in the MASS R package.
17. Beer and blood alcohol content. Many people believe that weight, drinking habits, and
many other factors are much more important in predicting blood alcohol content (BAC) than
simply considering the number of drinks a person consumed. Here we examine data from sixteen
student volunteers at Ohio State University who each drank a randomly assigned number of cans
of beer. These students were evenly divided between men and women, and they differed in weight
and drinking habits. Thirty minutes later, a police officer measured their blood alcohol content
(BAC) in grams of alcohol per deciliter of blood. The scatterplot and regression table summarize
the findings.7 (Malkevitch and Lesser 2008)
a. Describe the relationship between the number of cans of beer and BAC.
b. Write the equation of the regression line. Interpret the slope and intercept in context.
c. Do the data provide convincing evidence that drinking more cans of beer is associated with
an increase in blood alcohol? State the null and alternative hypotheses, report the p-value,
and state your conclusion.
d. The correlation coefficient for number of cans of beer and BAC is 0.89. Calculate $R^2$ and
interpret it in context.
e. Suppose we visit a bar in our own town, ask people how many drinks they have had, and
also measure their BAC. Would the relationship between number of drinks and BAC
be as strong as the relationship found in the Ohio State study? Why?
7 The bac data used in this exercise can be found in the openintro R package.
18. Urban homeowners, conditions. The scatterplot below shows the percent of families who
own their home vs. the percent of the population living in urban areas. (US Census Bureau 2010)
There are 52 observations, each corresponding to a state in the US. Puerto Rico and District of
Columbia are also included. The second figure plots residuals on the y-axis and percent of the
population living in urban areas on the x-axis.
a. For these data, 𝑅² is 29.16%. What is the value of the correlation coefficient? How can
you tell if it is positive or negative?
b. Examine the residual plot. What do you observe? Is a simple least squares fit appropriate
for the data? Which of the LINE conditions are met or not met?
19. I heart cats, LINE conditions. Researchers collected data on heart and body weights of 144
domestic adult cats. The figure below shows the output of the predicted values and residuals
generated from a linear model predicting heart weight (measured in grams) from body weight
(measured in kilograms) of these cats.
a. Examine the residual plot. Notice that for the small predicted values the residuals have
a smaller magnitude than the larger residuals seen with the larger predicted values. The
change in magnitude of the residuals across the predicted values is an indication of violation
of which LINE technical condition?
b. If the LINE condition described in part (a) is violated, might it lead to an incorrect
conclusion about the model (i.e., the least squares regression line itself), the inference of the
model (i.e., the p-value associated with the least squares regression line), neither, or both?
Explain your reasoning.
20. Beer and blood alcohol content, LINE conditions. The figure below shows the output of
the predicted values and residuals generated from a linear model predicting the blood alcohol
content (BAC) from number of cans of beer drunk by sixteen student volunteers at Ohio State
University.8 (Malkevitch and Lesser 2008)
a. Examine the residual plot. Notice that it is difficult to identify any convincing patterns for
or against violation of the LINE technical conditions. What is it about the residual plot
that makes it difficult to assess the LINE technical conditions?
b. Is there anything about the residual plot which would make you hesitate about using the
linear model for inference about all students? Is there anything about the experimental
design of the study which would make you hesitate about using the linear model for inference
about all students?
8 The bac data used in this exercise can be found in the openintro R package.
Chapter 25
In Chapter 8, the least squares regression method was used to estimate linear
models which predicted a particular response variable given more than
one explanatory variable. Here, we discuss whether each of the variables
individually is a statistically discernible predictor of the outcome or whether
the model might be just as strong without that variable. That is, as before,
we apply inferential methods to ask whether the data could have come
from a population where the particular coefficient at hand was zero. If
one of the linear model coefficients is truly zero (in the population), then
the estimate of the coefficient (using least squares) will vary around zero.
The inference task at hand is to decide whether the coefficient’s difference
from zero is large enough to conclude that the data are very unlikely to have
come from a model where the true population coefficient is zero. Both the
derivations from the mathematical model and the randomization model are
beyond the scope of this book, but we are able to calculate p-values using
statistical software. We will discuss interpreting p-values in the multiple
regression setting and note some scenarios where careful understanding of
the context and the relationship between variables is important. We use
cross-validation as a method for independent assessment of the multiple
linear regression model.
Now, our goal is to create a model where interest_rate can be predicted using the variables
debt_to_income, term, and credit_checks. As you learned in Chapter 8, least squares can be used to
find the coefficient estimates for the linear model. The unknown population model can be written as:
$$E[\text{interest\_rate}] = \beta_0 + \beta_1 \times \text{debt\_to\_income} + \beta_2 \times \text{term} + \beta_3 \times \text{credit\_checks}$$
Table 25.1: Summary of a linear model for predicting interest rate based on debt_to_income, term, and
credit_checks. Each of the variables has its own coefficient estimate as well as a p-value.
The estimated equation for the regression model may be written as a model with three predictor
variables:
$$\widehat{\text{interest\_rate}} = 4.31 + 0.041 \times \text{debt\_to\_income} + 0.16 \times \text{term} + 0.25 \times \text{credit\_checks}$$
Not only does Table 25.1 provide the estimates for the coefficients, it also provides information on the
inference analysis (i.e., hypothesis testing) which is the focus of this chapter.
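To make the estimated equation concrete, the short sketch below evaluates it for a single borrower. The coefficient values are copied from the equation above; the borrower's values (a debt-to-income of 20, a term of 36, and 2 credit checks) and the function name are invented for illustration and do not come from the loans data.

```python
# Coefficient estimates copied from the estimated equation above.
COEFFICIENTS = {
    "intercept": 4.31,
    "debt_to_income": 0.041,
    "term": 0.16,
    "credit_checks": 0.25,
}


def predict_interest_rate(debt_to_income, term, credit_checks):
    """Return the predicted interest rate from the estimated equation."""
    return (
        COEFFICIENTS["intercept"]
        + COEFFICIENTS["debt_to_income"] * debt_to_income
        + COEFFICIENTS["term"] * term
        + COEFFICIENTS["credit_checks"] * credit_checks
    )


# Hypothetical borrower (values invented for illustration only).
baseline = predict_interest_rate(debt_to_income=20, term=36, credit_checks=2)

# Same borrower with one more credit check, all else held constant.
one_more_check = predict_interest_rate(debt_to_income=20, term=36, credit_checks=3)

print(round(baseline, 2))
print(round(one_more_check - baseline, 2))
```

The difference between the two predictions is exactly the credit_checks coefficient, which is what "holding the other variables constant" means in practice.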
In Chapter 24, you learned that the hypothesis test for a linear model with one predictor1 can be
written as:
if only one predictor, 𝐻0 ∶ 𝛽1 = 0.
That is, if the true population slope is zero, the p-value measures how likely it would be to select data
which produced a slope at least as far from zero as the observed slope (𝑏1 ).
With multiple predictors, the hypothesis is similar; however, it is now conditioned on each of the
other variables remaining in the model.
if multiple predictors, 𝐻0 ∶ 𝛽𝑖 = 0 given other variables in the model
Using the example above and focusing on each of the variable p-values (here we won’t discuss the
p-value associated with the intercept), we can write out the three different hypotheses:
• 𝐻0 ∶ 𝛽1 = 0, given term and credit_checks are included in the model
• 𝐻0 ∶ 𝛽2 = 0, given debt_to_income and credit_checks are included in the model
• 𝐻0 ∶ 𝛽3 = 0, given debt_to_income and term are included in the model
The very low p-values from the software output tell us that each of the variables acts as an important
predictor in the model, despite the inclusion of the other two. Consider the p-value on 𝐻0 ∶ 𝛽1 = 0.
The low p-value says that it would be extremely unlikely to see data that produce a coefficient on
debt_to_income as large as 0.041 if the true relationship between debt_to_income and interest_rate
was non-existent (i.e., if 𝛽1 = 0) and the model also included term and credit_checks. You
might have thought that the value 0.041 is a small number (i.e., close to zero), but in the units of the
problem, 0.041 turns out to be far away from zero. It’s all about context! The p-values on term and
on credit_checks are interpreted similarly.
Sometimes a set of predictor variables can impact the model in unusual ways, often due to the predictor
variables themselves being correlated.
1 In previous sections, the term explanatory variable was used instead of predictor. The words are synonymous
and are used separately in the different sections to be consistent with how most analysts use them: explanatory variable
for testing, predictor for modeling.
25.2 Multicollinearity
In practice, there will almost always be some degree of correlation between the explanatory variables
in a multiple regression model. For regression models, it is important to understand the entire context
of the model, particularly for correlated variables. Our discussion will focus on interpreting coefficients
(and their signs) in relationship to other variables as well as the discernibility (i.e., the p-value) of
each coefficient.
Consider an example where we would like to predict how much money is in a coin dish based only
on the number of coins in the dish. We ask 26 students to tell us about their individual coin dishes,
collecting data on the total dollar amount, the total number of coins, and the total number of low
coins.2 The number of low coins is the number of coins minus the number of quarters (a quarter is
the largest commonly used US coin, at US$0.25). Figure 25.1 illustrates a sample of U.S. coins, their
total worth (total_amount), the total number_of_coins, and the number_of_low_coins.
Figure 25.1: A sample of coins with 16 total coins, 10 low coins, and a net worth of $1.90.
The collected data are given in Figure 25.2 and show that the total_amount of money is more highly
correlated with the total number_of_coins than it is with the number_of_low_coins. We also note
that the total number_of_coins and the number_of_low_coins are positively correlated.
(a) Total number of coins on the x-axis. (b) Number of low coins on the x-axis.
Figure 25.2: Two plots describing the total amount of money (USD) as a function of the total number of
coins or low coins. As you might expect, the total amount of money is more highly positively correlated with
the total number of coins than with the number of low coins.
2 In all honesty, this particular dataset is fabricated, and the original idea for the problem comes from Jeff Witmer
at Oberlin College.
440 CHAPTER 25. INFERENCE FOR REGRESSION WITH MULTIPLE PREDICTORS
Using the total number_of_coins as the predictor variable, Table 25.2 shows that the least squares
estimate of the coefficient is 0.13. For every additional coin in the dish, we would predict that the
student had US$0.13 more. The 𝑏1 = 0.13 coefficient has a small p-value associated with it, suggesting
we would be unlikely to see data like these if number_of_coins and total_amount of money were not
linearly related.
$$\widehat{\text{total\_amount}} = 0.55 + 0.13 \times \text{number\_of\_coins}$$
Table 25.2: Linear model output predicting the total amount of money based on the total number of coins.
Using number_of_low_coins as the predictor variable, Table 25.3 shows that the least squares estimate
of the coefficient is 0.02. For every additional low coin in the dish, we would predict that the student
had US$0.02 more. The 𝑏1 = 0.02 coefficient has a large p-value associated with it, suggesting we
could easily have seen data like ours even if the number_of_low_coins and total_amount of money
are not at all linearly related.
$$\widehat{\text{total\_amount}} = 2.28 + 0.02 \times \text{number\_of\_low\_coins}$$
Table 25.3: Linear model output predicting the total amount of money based on the number of low coins.
EXAMPLE
Come up with an example of two observations that have the same number of low coins but the
number of total coins differs by one. What is the difference in total amount?
Two samples of coins with the same number of low coins (3), but a different number of total
coins (4 vs 5) and a different total amount ($0.41 vs $0.66).
EXAMPLE
Come up with an example of two observations that have the same total number of coins but a
different number of low coins. What is the difference in total amount?
Two samples of coins with the same total number of coins (4), but a different number of low
coins (3 vs 4) and a different total amount ($0.41 vs $0.17).
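The two examples can be checked with a few lines of arithmetic. The sketch below is only one way to build such dishes: the specific coins in each dish are our own choice (the examples give only the counts and totals), and amounts are kept in whole cents to avoid rounding issues.

```python
# Coin values in cents; integer cents avoid floating-point rounding.
VALUE = {"penny": 1, "nickel": 5, "dime": 10, "quarter": 25}


def summarize(dish):
    """Return (number_of_coins, number_of_low_coins, total amount in cents)."""
    number_of_coins = len(dish)
    # A "low" coin is any coin that is not a quarter.
    number_of_low_coins = sum(1 for coin in dish if coin != "quarter")
    total_cents = sum(VALUE[coin] for coin in dish)
    return number_of_coins, number_of_low_coins, total_cents


# Hypothetical dishes chosen to match the counts and totals in the examples.
dish_a = ["quarter", "dime", "nickel", "penny"]             # 4 coins, 3 low
dish_b = ["quarter", "quarter", "dime", "nickel", "penny"]  # 5 coins, 3 low
dish_c = ["dime", "nickel", "penny", "penny"]               # 4 coins, 4 low

for dish in (dish_a, dish_b, dish_c):
    print(summarize(dish))
```

Comparing dish_a with dish_b adds one quarter (25 cents more), while comparing dish_a with dish_c swaps a quarter for a penny (24 cents less).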
Using both the total number_of_coins and the number_of_low_coins as predictor variables,
Table 25.4 provides the least squares estimates of both coefficients as 0.21 and -0.16. Now, with two
variables in the model, the interpretation is more nuanced.
• A coefficient interpretation always indicates a change in one variable while keeping the other
variable(s) constant.
For every additional coin in the dish while the number_of_low_coins stays constant, we would
predict that the student had US$0.21 more. Re-considering the phrase “every additional coin
in the dish while the number of low coins stays constant” makes us realize that each increase
is a single additional quarter (larger sample sizes would have led to a 𝑏1 coefficient closer to
0.25 because of the deterministic relationship described here).
• For every additional low coin in the dish while the total number_of_coins stays constant, we
would predict that the student had US$0.16 less. Re-considering the phrase “every additional
low coin in the dish while the number of total coins stays constant” makes us realize that a
quarter is being swapped out for a penny, nickel, or dime.
Considering the coefficients across Table 25.2, Table 25.3, and Table 25.4 within the context and
knowledge we have of US coins allows us to understand the correlation between variables and why the
signs of the coefficients would change depending on the model. Note also, however, that the p-value for
the number_of_low_coins coefficient changed from Table 25.3 to Table 25.4. It makes sense that the
variable describing the number_of_low_coins provides more information about the total_amount of
money when it is part of a model which also includes the total number_of_coins than it does when
it is used as a single variable in a simple linear regression model.
$$\widehat{\text{total\_amount}} = 0.80 + 0.21 \times \text{number\_of\_coins} - 0.16 \times \text{number\_of\_low\_coins}$$
Table 25.4: Linear model output predicting the total amount of money based on both the total number of
coins and the number of low coins.
When working with multiple regression models, interpreting the model coefficients is not always as
straightforward as it was with the coin example. However, we encourage you to always think carefully
about the variables in the model, consider how they might be correlated among themselves, and
work through different models to see how using different sets of variables might produce different
relationships for predicting the response variable of interest.
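Working through different models is easy to try on simulated data. The sketch below is not the 26-student dataset from this section: it generates 500 made-up coin dishes (the ranges for the number of quarters and low coins are our own assumptions), fits the three models with a small hand-written least squares routine, and prints the coefficients so the change in sign can be seen directly.

```python
import random


def least_squares(rows, y):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination.

    Each element of `rows` holds the predictor values for one observation;
    an intercept column of ones is added automatically.
    Returns [intercept, coefficient_1, coefficient_2, ...].
    """
    X = [[1.0] + list(r) for r in rows]
    p = len(X[0])
    # Build the augmented matrix [X'X | X'y].
    A = [
        [sum(x[i] * x[j] for x in X) for j in range(p)]
        + [sum(x[i] * yi for x, yi in zip(X, y))]
        for i in range(p)
    ]
    for col in range(p):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]


# Simulated coin dishes (assumed ranges, for illustration only).
rng = random.Random(1)
coins, low, amount = [], [], []
for _ in range(500):
    n_quarters = rng.randint(0, 8)
    low_values = [rng.choice([0.01, 0.05, 0.10]) for _ in range(rng.randint(0, 15))]
    coins.append(n_quarters + len(low_values))
    low.append(len(low_values))
    amount.append(0.25 * n_quarters + sum(low_values))

b_coins_only = least_squares([[c] for c in coins], amount)
b_low_only = least_squares([[l] for l in low], amount)
b_both = least_squares([[c, l] for c, l in zip(coins, low)], amount)

print("coins only:     ", [round(b, 3) for b in b_coins_only])
print("low coins only: ", [round(b, 3) for b in b_low_only])
print("both predictors:", [round(b, 3) for b in b_both])
```

In this simulation the coefficient on the number of low coins is positive when it is the only predictor and negative once the total number of coins is also in the model, and the coefficient on the total number of coins lands close to 0.25, the value of a quarter.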
Multicollinearity.
Multicollinearity happens when the predictor variables are correlated among themselves.
When the predictor variables themselves are correlated, the coefficients in a multiple
regression model can be difficult to interpret.
Although diving into the details is beyond the scope of this text, we will provide one more reflection
about multicollinearity. If the predictor variables have some degree of correlation, it can be quite
difficult to interpret the value of the coefficient or evaluate whether the variable is a statistically
discernible predictor of the outcome. However, even a model that suffers from high multicollinearity
will likely lead to unbiased predictions of the response variable. So if the task at hand is only to do
prediction (and not to interpret coefficients), multicollinearity is unlikely to cause you substantial
problems.
25.3 Cross-validation for prediction error
In Section 25.1, p-values were calculated on each of the model coefficients. The p-value gives a sense of
which variables are important to the model; however, a more extensive treatment of variable selection
is warranted in a follow-up course or textbook. Here, we use cross-validation prediction error to
focus on which variable(s) are important for predicting the response variable of interest. In general,
linear models are also used to make predictions of individual observations. In addition to model
building, cross-validation provides a method for generating predictions that are not overfit to the
particular dataset at hand. We continue to encourage you to take up further study on the topic of
cross-validation, as it is among the most important ideas in modern data analysis, and we are only
able to scratch the surface here.
Cross-validation is a computational technique which removes some observations before a model is
run, then assesses the model accuracy on the held-out sample. By removing some observations, we
provide ourselves with an independent evaluation of the model (that is, the removed observations do
not contribute to finding the parameters which minimize the least squares equation). Cross-validation
can be used in many different ways (as an independent assessment), and here we will just scratch the
surface with respect to one way the technique can be used to compare models. See Figure 25.3 for a
visual representation of the cross-validation process.
Figure 25.3: The dataset is broken into k folds (here k = 4). One at a time, a model is built using k-1 of
the folds, and predictions are calculated on the single held out sample which will be completely independent
of the model estimation.
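The fold structure in Figure 25.3 can be sketched in a few lines. This is a minimal illustration, assuming random assignment of observations to folds; the function name make_folds, the seed, and the choice of 10 observations are ours and are not part of the analysis in this chapter.

```python
import random


def make_folds(n, k, seed=0):
    """Randomly assign the indices 0..n-1 to k folds of nearly equal size."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices out like cards so fold sizes differ by at most 1.
    return [indices[i::k] for i in range(k)]


folds = make_folds(n=10, k=4)

for held_out, test_idx in enumerate(folds):
    # The model is built on every fold except the one being held out.
    train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
    print(f"fold {held_out}: fit on {sorted(train_idx)}, predict {sorted(test_idx)}")
```

Every observation is held out exactly once, so each one receives a prediction from a model that was built without it.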
Our goal in this section is to compare two different regression models which both seek to predict
the mass of an individual penguin in grams. The observations of three different penguin species
include measurements on body size and sex. The data were collected by Dr. Kristen Gorman and the
Palmer Station, Antarctica LTER as part of the Long Term Ecological Research Network. (Gorman,
Williams, and Fraser 2014b) Although not exactly aligned with this research project, you might be
able to imagine a setting where the dimensions of the penguin are known (through, for example, aerial
photographs) but the mass is not known. The first model predicts body_mass_g by using only the
bill_length_mm, a variable denoting the length of a penguin’s bill, in mm. The second model predicts
body_mass_g by using bill_length_mm, bill_depth_mm, flipper_length_mm, sex, and species.
Prediction error.
The prediction error (also previously called the residual) is the difference between the
observed value and the predicted value (from the regression model).
The presentation below (see the comparison of Figure 25.5 and Figure 25.7) shows that the model with
more variables predicts body_mass_g with much smaller errors (predicted minus actual body mass)
than the model which uses only bill_length_mm. We have deliberately used a model that intuitively
makes sense (the more body measurements, the more predictable mass is). However, in many settings,
it is not obvious which variables or which models contribute most to accurate predictions.
Cross-validation is one way to get accurate independent predictions with which to compare different models.
$$E[\text{body\_mass\_g}] = \beta_0 + \beta_1 \times \text{bill\_length\_mm}$$
$$\widehat{\text{body\_mass\_g}} = 362.31 + 87.42 \times \text{bill\_length\_mm}$$
Table 25.5: Least squares estimates for the smaller regression model predicting body_mass_g from
bill_length_mm.
$$\begin{aligned}
E[\text{body\_mass\_g}] = \beta_0 &+ \beta_1 \times \text{bill\_length\_mm} \\
&+ \beta_2 \times \text{bill\_depth\_mm} \\
&+ \beta_3 \times \text{flipper\_length\_mm} \\
&+ \beta_4 \times \text{sex}_{male} \\
&+ \beta_5 \times \text{species}_{Chinstrap} \\
&+ \beta_6 \times \text{species}_{Gentoo}
\end{aligned}$$
$$\begin{aligned}
\widehat{\text{body\_mass\_g}} = -1460.99 &+ 18.20 \times \text{bill\_length\_mm} \\
&+ 67.22 \times \text{bill\_depth\_mm} \\
&+ 15.95 \times \text{flipper\_length\_mm} \\
&+ 389.89 \times \text{sex}_{male} \\
&- 251.48 \times \text{species}_{Chinstrap} \\
&+ 1014.63 \times \text{species}_{Gentoo}
\end{aligned}$$
Table 25.6: Least squares estimates of the larger regression model predicting body_mass_g from
bill_length_mm, bill_depth_mm, flipper_length_mm, sex, and species.
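The categorical predictors enter the larger model as indicator (0/1) variables. The sketch below evaluates the estimated equation for the larger model, with coefficients copied from the equation above. It assumes that female and Adelie are the baseline levels (they have no indicator in the equation), and the measurements used are invented for illustration rather than taken from a penguin in the dataset.

```python
def predict_body_mass_g(bill_length_mm, bill_depth_mm, flipper_length_mm, sex, species):
    """Predicted body mass (g) from the estimated larger model."""
    # Indicator (0/1) variables for the categorical predictors.
    # Baseline levels (assumed): sex = "female", species = "Adelie".
    sex_male = 1 if sex == "male" else 0
    species_chinstrap = 1 if species == "Chinstrap" else 0
    species_gentoo = 1 if species == "Gentoo" else 0
    return (
        -1460.99
        + 18.20 * bill_length_mm
        + 67.22 * bill_depth_mm
        + 15.95 * flipper_length_mm
        + 389.89 * sex_male
        - 251.48 * species_chinstrap
        + 1014.63 * species_gentoo
    )


# Hypothetical measurements, invented for illustration.
gentoo_male = predict_body_mass_g(47, 15, 217, sex="male", species="Gentoo")
adelie_male = predict_body_mass_g(47, 15, 217, sex="male", species="Adelie")

print(round(gentoo_male, 2))
print(round(gentoo_male - adelie_male, 2))
```

Two penguins with identical measurements and sex but different species differ in predicted mass by exactly the species coefficient, here 1014.63 g for Gentoo relative to the baseline.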
In order to compare the smaller and larger models in terms of their ability to predict penguin
mass, we need to build models that can provide independent predictions based on the penguins in the
holdout samples created by cross-validation. To reiterate, each of the predictions that (when combined
together) will allow us to distinguish between the smaller and larger models is independent of the data which
were used to build the model. In this example, using cross-validation, we remove one quarter of the
data before running the least squares calculations. Then the least squares model is used to predict the
body_mass_g of the penguins in the holdout sample. Here we use a 4-fold cross-validation (meaning
that one quarter of the data is removed each time) to produce four different versions of each model
(other times it might be more appropriate to use 2-fold or 10-fold or even run the model separately
after removing each individual data point one at a time).
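The procedure described above can be sketched for a model with a single predictor. This is a minimal illustration under stated assumptions: the data are synthetic (generated from a line plus noise, not the penguin measurements), and the function names fit_line and cross_val_predict are ours.

```python
import random


def fit_line(x, y):
    """Least squares intercept and slope for a single predictor."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def cross_val_predict(x, y, k=4, seed=0):
    """Return one out-of-fold prediction per observation using k-fold CV."""
    indices = list(range(len(x)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    predictions = [None] * len(x)
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        # Fit on the other k-1 folds only.
        b0, b1 = fit_line([x[i] for i in train], [y[i] for i in train])
        for i in held_out:
            predictions[i] = b0 + b1 * x[i]  # the model never saw observation i
    return predictions


# Synthetic data (not the penguin measurements): a line plus noise.
rng = random.Random(42)
x = [rng.uniform(32, 60) for _ in range(40)]
y = [360 + 87 * xi + rng.gauss(0, 300) for xi in x]

cv_predictions = cross_val_predict(x, y, k=4)
cv_errors = [p - obs for p, obs in zip(cv_predictions, y)]
print([round(e) for e in cv_errors[:5]])
```

Setting k equal to the number of observations gives the variant mentioned above in which each data point is removed one at a time.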
Figure 25.4 displays how a model is fit to 3/4 of the data (note the slight differences in coefficients as
compared to Table 25.5), and then predictions are made on the holdout sample.
Figure 25.4: The smaller model. The coefficients are estimated using the least squares model on 3/4 of the
dataset with only a single predictor variable. Predictions are made on the remaining 1/4 of the observations.
The y-axis in the scatterplot represents the residual, true observed value minus the predicted value. Note that
the predictions are independent of the estimated model coefficients.
By repeating the process for each holdout quarter sample, the residuals from the model can be plotted
against the predicted values. We see that the residuals are scattered around zero without an obvious
pattern, which is consistent with a reasonable model fit, but that the prediction errors can be as
large as about ± 1,000 g.
25.3. CROSS-VALIDATION FOR PREDICTION ERROR 445
Figure 25.5: The smaller model. One quarter at a time, the data were removed from the model building,
and the body mass of the removed penguins was predicted. The least squares regression model was fit
independently of the removed penguins. The predictions of body mass are based on bill length only. The
x-axis represents the predicted value, the y-axis represents the error, the difference between predicted value
and actual value.
The cross-validation SSE is the sum of squared error associated with the predictions. Let $\hat{y}_{cv,i}$ be the
prediction for the $i^{th}$ observation where the $i^{th}$ observation was in the hold-out fold and the other
three folds were used to create the linear model. For the model using only bill_length_mm to predict
body_mass_g, the CV SSE is 141,552,822.
Cross-validation SSE.
The prediction error from the cross-validated model can be used to calculate a single
numerical summary of the model. The cross-validation SSE is the sum of squared
cross-validation prediction errors.
$$\text{CV SSE} = \sum_{i=1}^{n} (\hat{y}_{cv,i} - y_i)^2$$
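The CV SSE formula translates directly into code. In the sketch below, the four predictions and observed body masses are made-up numbers used only to show the arithmetic; in practice the predictions would come from the held-out folds.

```python
def cv_sse(cv_predictions, observed):
    """Sum of squared cross-validation prediction errors."""
    return sum((p - y) ** 2 for p, y in zip(cv_predictions, observed))


# Made-up out-of-fold predictions and observed body masses (g).
cv_predictions = [3900.0, 4200.0, 5100.0, 3600.0]
observed = [3750.0, 4300.0, 5000.0, 3650.0]

# Errors are 150, -100, 100, -50, so the squares sum to 45000.
print(cv_sse(cv_predictions, observed))
```

Because the errors are squared, the result is the same whether an error is computed as predicted minus observed or observed minus predicted.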
The same process is repeated for the larger number of explanatory variables. Note that the coefficients
estimated for the first cross-validation model (in Figure 25.6) are slightly different from the estimates
computed on the entire dataset (seen in Table 25.6). Figure 25.6 displays the cross-validation process
for the multivariable model with a full set of residual plots given in Figure 25.7. Note that the residuals
are mostly within ± 500g, providing much more precise predictions for the independent body mass
values of the individual penguins.
Figure 25.6: The larger model. The coefficients are estimated using the least squares model on 3/4 of
the dataset with the five specified predictor variables. Predictions are made on the remaining 1/4 of the
observations. The y-axis in the scatterplot represents the residual, true observed value minus the predicted
value. Note that the predictions are independent of the estimated model coefficients.
Figure 25.7: The larger model. One quarter at a time, the data were removed from the model building, and
the body mass of the removed penguins was predicted. The least squares regression model was fit independently
of the removed penguins. The predictions of body mass are based on the set of five variables described in
Table 25.6. The x-axis represents the predicted value, the y-axis represents the error, the difference between
predicted value and actual value.
Figure 25.5 shows that the independent predictions are centered around the true values (i.e., errors
are centered around zero), but that the predictions can be as much as 1,000 g off when using only
bill_length_mm to predict body_mass_g. On the other hand, when using bill_length_mm,
bill_depth_mm, flipper_length_mm, sex, and species to predict body_mass_g, the prediction errors seem
to be about half as big, as seen in Figure 25.7. For the model using bill_length_mm, bill_depth_mm,
flipper_length_mm, sex, and species to predict body_mass_g, the CV SSE is 27,728,698 (as
compared to a CV SSE of 141,552,822 for the smaller model). Consistent with visually comparing
the two sets of residual plots, the sum of squared prediction errors is smaller for the model which
uses more predictor variables. The model with more predictor variables seems like the better model
(according to the cross-validated prediction error criterion).
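The comparison of a smaller and a larger model by CV SSE can be sketched end to end. The example below uses synthetic data in which the response truly depends on two predictors (so it is not a reproduction of the penguin results), and the helper functions are our own small implementations rather than any particular package's routines.

```python
import random


def least_squares(rows, y):
    """Least squares coefficients [intercept, b1, ...] via the normal equations."""
    X = [[1.0] + list(r) for r in rows]
    p = len(X[0])
    A = [
        [sum(x[i] * x[j] for x in X) for j in range(p)]
        + [sum(x[i] * yi for x, yi in zip(X, y))]
        for i in range(p)
    ]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]


def predict(coefs, row):
    """Prediction for one observation from fitted coefficients."""
    return coefs[0] + sum(b * v for b, v in zip(coefs[1:], row))


def cv_sse(rows, y, k=4, seed=0):
    """k-fold cross-validation SSE for a least squares model."""
    indices = list(range(len(y)))
    random.Random(seed).shuffle(indices)
    total = 0.0
    for fold in (indices[i::k] for i in range(k)):
        held = set(fold)
        train = [i for i in indices if i not in held]
        coefs = least_squares([rows[i] for i in train], [y[i] for i in train])
        total += sum((predict(coefs, rows[i]) - y[i]) ** 2 for i in fold)
    return total


# Synthetic data: the response depends on both predictors.
rng = random.Random(7)
x1 = [rng.uniform(0, 10) for _ in range(80)]
x2 = [rng.uniform(0, 10) for _ in range(80)]
y = [5 + 2 * a + 3 * b + rng.gauss(0, 1) for a, b in zip(x1, x2)]

sse_small = cv_sse([[a] for a in x1], y)                 # x1 only
sse_large = cv_sse([[a, b] for a, b in zip(x1, x2)], y)  # x1 and x2

print(round(sse_small, 1), round(sse_large, 1))
```

Both models are evaluated with the same seed, and therefore the same folds, so the difference in CV SSE reflects the models rather than the way the data happened to be split.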
We have provided a very brief overview of, and an example of, using cross-validation. Cross-validation
is a computational approach to model building and model validation as an alternative to reliance
on p-values. While p-values have a role to play in understanding model coefficients, throughout this
text, we have continued to present computational methods that broaden statistical approaches to data
analysis. Cross-validation will be used again in Chapter 26 with logistic regression. We encourage
you to consider both standard inferential methods (such as p-values) and computational approaches
(such as cross-validation) as you build and use multivariable models of all varieties.
25.4.1 Summary
Building on the modeling ideas from Chapter 8, we have now introduced methods for evaluating
coefficients (based on p-values) and evaluating models (cross-validation). There are many important
aspects to consider when working with multiple variables in a single model, and we have only glanced
at a few topics. Remember, multicollinearity can make coefficient interpretation difficult. A topic not
covered in this text but important for multiple regression models is interaction, and we hope that you
learn more about how variables work together as you continue to build up your modeling skills.
25.4.2 Terms
The terms introduced in this chapter are presented in Table 25.7. If you’re not sure what some of
these terms mean, we recommend you go back in the text and review their definitions. You should be
able to easily spot them as bolded text.
25.5 Exercises
a. Calculate a 95% confidence interval for the coefficient of out_mt2 (go out more than two
nights a week) in the model, and interpret it in the context of the data.
b. Would you expect a 95% confidence interval for the slope of the remaining variables to
include 0? Explain your reasoning.
2. Tourism spending. The Association of Turkish Travel Agencies reports the number of foreign
tourists visiting Turkey and tourist spending by year. Three plots are provided: scatterplot
showing the relationship between these two variables along with the least squares fit, residuals
plot, and histogram of residuals.4
3 The gpa data used in this exercise can be found in the openintro R package.
4 The tourism data used in this exercise can be found in the openintro R package.
3. Cherry trees, collinear predictors. Timber yield is approximately equal to the volume of
a tree; however, this value is difficult to measure without first cutting the tree down. Instead,
other variables, such as height and diameter, may be used to predict a tree’s volume and yield.
Researchers wanting to understand the relationship between these variables for black cherry
trees collected data from 31 such trees in the Allegheny National Forest, Pennsylvania. Height
is measured in feet, diameter in inches (at 54 inches above ground), and volume in cubic feet.5
(Hand 1994) The plots below display the distribution of each of these variables (on the diagonal)
as well as provide information on the pairwise correlations between them.
Also provided below are three regression model outputs: volume vs. diam, volume vs. height,
and volume vs. height + diam.
a. There are three variables described in the figure, and each is paired with each other to
create three different scatterplots. Rate the pairwise relationships from most correlated to
least correlated.
b. When using only one variable to model a tree’s volume, is diameter a discernible predictor?
Is height a discernible predictor? Explain your reasoning.
c. When using both diameter and height to predict a tree’s volume, are both predictors still
discernible? Explain your reasoning.
5 The cherry data used in this exercise can be found in the openintro R package.
4. GPA, collinear predictors. In this exercise we work with data from a survey of 55 Duke
University students who were asked about their GPA, number of hours they sleep nightly, and
number of nights they go out each week. The plots below display the distributions of each of these
variables (on the diagonal) as well as their pairwise relationships and correlation coefficients.
Also provided below are three regression model outputs: gpa vs. out, gpa vs. sleepnight, and
gpa vs. out + sleepnight.
a. There are three variables described in the figure, and each is paired with each other to
create three different scatterplots. Rate the pairwise relationships from most correlated to
least correlated.
b. When using only one variable to model gpa, is out a discernible predictor? Is sleepnight
a discernible predictor? Explain your reasoning.
c. When using both out and sleepnight to predict gpa in a multiple regression model, is
either of the variables discernible? Explain your reasoning.
5. Movie returns. A FiveThirtyEight.com article reports that “Horror movies get nowhere near
as much draw at the box office as the big-time summer blockbusters or action/adventure movies,
but there’s a huge incentive for studios to continue pushing them out. The return-on-investment
potential for horror movies is absurd.” To investigate how the return-on-investment (ROI) com-
pares between genres and how this relationship has changed over time, an introductory statistics
student fit a linear regression model to predict the ratio of gross revenue of movies to the pro-
duction costs from genre and release year for 1,070 movies released between 2000 and 2018.
Using the plots given below, determine if this regression model is appropriate for these data. In
particular, use the residual plot to check the LINE conditions. (FiveThirtyEight 2015)
6. Difficult encounters. A study was conducted at a university outpatient primary care clinic
in Switzerland to identify factors associated with difficult doctor-patient encounters. The data
consist of 527 patient encounters, conducted by the 27 medical residents employed at the clinic.
After each encounter, the attending physician completed two questionnaires: the Difficult Doctor
Patient Relationship Questionnaire (DDPRQ-10) and the Patient’s Vulnerability Grid (PVG).
A higher score on the DDPRQ-10 indicates a more difficult encounter. The maximum possible
score is 60 and encounters with score 30 and higher are considered difficult. A model was fit to
predict DDPRQ-10 score from features of the attending physician: age, sex (male or not), and
years of training.
a. The intercept of the model is 30.594. What are the age, sex, and years of training of a
physician whom this model would predict to have a DDPRQ-10 score of 30.594?
b. Is there evidence of a discernible association between DDPRQ-10 score and any of the
physician features?
7. Baby’s weight, mathematical test. US Department of Health and Human Services, Centers
for Disease Control and Prevention collect information on births recorded in the country. The
data used here are a random sample of 1,000 births from 2014. Here, we study the relationship
between smoking and weight of the baby. The variable smoke is coded 1 if the mother is a
smoker, and 0 if not. The summary table below shows the results of a linear regression model
for predicting the average birth weight of babies, measured in pounds, based on the smoking
status of the mother.6 (ICPSR 2014)
a. Determine if the conditions for doing inference based on mathematical models with these
data are met using the diagnostic plots above. If not, describe how to proceed with the
analysis.
b. Using the regression output, evaluate whether the true slope of habit (i.e., whether the
mother is a smoker) is different than 0, given the other variables in the model. State the
null and alternative hypotheses, report the p-value (using a mathematical model), and state
your conclusion.
6 The births14 data used in this exercise can be found in the openintro R package.
8. Baby’s weight, collinear predictors. In this exercise we study the relationship between
the weight of the baby and two explanatory variables: number of weeks of gestation and num-
ber of pregnancy hospital visits. (ICPSR 2014) The plots below display the distributions of
each of these variables (on the diagonal) as well as their pairwise relationships and correlation
coefficients.
Also provided below are three regression model outputs: weight vs. weeks, weight vs. visits,
and weight vs. weeks + visits.
a. There are three variables described in the figure, and each is paired with each other to
create three different scatterplots. Rate the pairwise relationships from most correlated to
least correlated.
b. When using only one variable to model the baby’s weight, is weeks a discernible predictor?
Is visits a discernible predictor? Explain your reasoning.
c. When using both visits and weeks to predict the baby’s weight, are both predictors still
discernible? Explain your reasoning.
9. Baby’s weight, cross-validation. Using a random sample of 1,000 US births from 2014,
we study the relationship between the weight of the baby and various explanatory variables.
(ICPSR 2014) The plots below display prediction errors associated with two different models
designed to predict weight of baby at birth; one model uses 7 predictors, one model uses 2
predictors. Using 4-fold cross-validation, the data were split into 4 folds. Three of the folds
estimate the 𝛽𝑖 parameters using 𝑏𝑖 , and the model is applied to the held out fold for prediction.
The process was repeated 4 times (each time holding out one of the folds).
a. In the first graph, note the point at roughly (predicted = 11 and error = -4). Estimate the
observed and predicted value for that observation.
b. Using the same point, describe which cross-validation fold(s) were used to build its predic-
tion model.
c. For the plot on the top, for one of the cross-validation folds, how many coefficients were
estimated in the linear model? For the plot on the bottom, for one of the cross-validation
folds, how many coefficients were estimated in the linear model?
d. Do the values of the residuals (along the y-axis, not the x-axis) seem markedly different for
the two models? Explain your reasoning.
10. RailTrail, cross-validation. The Pioneer Valley Planning Commission (PVPC) collected data
north of Chestnut Street in Florence, MA for ninety days from April 5, 2005 to November 15,
2005. Data collectors set up a laser sensor, with breaks in the laser beam recording when a
rail-trail user passed the data collection station.7 The plots below display prediction errors
associated with two different models designed to predict the volume of riders on the RailTrail;
one model uses 6 predictors, one model uses 2 predictors. Using 3-fold cross-validation, the data
were split into 3 folds. Two of the folds estimate the 𝛽𝑖 parameters using 𝑏𝑖, and the model is
applied to the held-out fold for prediction. The process was repeated 3 times (each time holding
out one of the folds).
a. In the second graph, note the point at roughly (predicted = 400 and error = 100). Estimate
the observed and predicted value for that observation.
b. Using the same point, describe which cross-validation fold(s) were used to build its predic-
tion model.
c. For the plot on the top, for one of the cross-validation folds, how many coefficients were
estimated in the linear model? For the plot on the bottom, for one of the cross-validation
folds, how many coefficients were estimated in the linear model?
d. Do the values of the residuals (along the y-axis, not the x-axis) seem markedly different for
the two models? Explain your reasoning.
7 The RailTrail data used in this exercise can be found in the mosaicData R package.
11. Baby’s weight, cross-validation for model selection. Using a random sample of 1,000
US births from 2014, we study the relationship between the weight of the baby and various
explanatory variables. (ICPSR 2014) The plots below display prediction errors associated with
two different models designed to predict weight of baby at birth; one model uses 7 predictors,
one model uses 2 predictors. Using 4-fold cross-validation, the data were split into 4 folds. Three
of the folds estimate the 𝛽𝑖 parameters using 𝑏𝑖 , and the model is applied to the held out fold
for prediction. The process was repeated 4 times, each time holding out one of the folds.
a. Using the spread of the points (in the y-direction), which model should be chosen for a
final report on these data? Explain your reasoning.
b. Using the summary statistic (CV SSE), which model should be chosen for a final report on
these data? Explain your reasoning.
c. Why would the model with more predictors fit the data less closely than the model with
only two predictors?
12. RailTrail, cross-validation for model selection. The Pioneer Valley Planning Commission
(PVPC) collected data north of Chestnut Street in Florence, MA for ninety days from April
5, 2005 to November 15, 2005. Data collectors set up a laser sensor, with breaks in the laser
beam recording when a rail-trail user passed the data collection station. The plots below display
prediction errors associated with two different models designed to predict the volume of riders
on the RailTrail; one model uses 6 predictors, one model uses 2 predictors. Using 3-fold
cross-validation, the data were split into 3 folds. Two of the folds estimate the 𝛽𝑖 parameters using
𝑏𝑖, and the model is applied to the held-out fold for prediction. The process was repeated 3
times, each time holding out one of the folds.
a. Using the spread of the points (in the y-direction), which model should be chosen for a
final report on these data? Explain your reasoning.
b. Using the summary statistic (CV SSE), which model should be chosen for a final report on
these data? Explain your reasoning.
c. Why would the model with more predictors fit the data less closely than the model with
only two predictors?
Chapter 26
As with multiple linear regression, the inference aspect for logistic regression will focus on interpreta-
tion of coefficients and relationships between explanatory variables. Both p-values and cross-validation
will be used for assessing a logistic regression model.
Consider the email data, which describe characteristics of emails that can be used to predict whether
a particular incoming email is spam (unsolicited bulk email). Because nobody wants to read every
incoming message, it would be useful to have an automated way to identify spam emails. Which of the
variables describing each email are important for predicting the status of the email?
Before looking at the hypothesis tests associated with the coefficients (turns out they are very similar
to those in linear regression!), it is valuable to understand the technical conditions that underlie the
inference applied to the logistic regression model. Generally, as you’ve seen in the logistic regression
modeling examples, it is imperative that the response variable is binary. Additionally, the key technical
condition for logistic regression has to do with the relationship between the predictor variables (𝑥𝑖
values) and the probability the outcome will be a success. It turns out, the relationship is a specific
functional form called a logit function, where
$$\text{logit}(p) = \log_e\left(\frac{p}{1-p}\right).$$
The function may feel complicated,
and memorizing the formula of the logit is not necessary for understanding logistic regression. What
you do need to remember is that the probability of the outcome being a success is a function of a
linear combination of the explanatory variables.
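To make the last point concrete, here is a minimal Python sketch of the logit function and its inverse. The coefficient and predictor values (b0, b1, b2, x1, x2) are hypothetical and chosen only for illustration; they do not come from the email data.

```python
import math

def logit(p):
    """Log-odds of a probability p, where 0 < p < 1."""
    return math.log(p / (1 - p))

def inv_logit(x):
    """Map a log-odds value back to a probability between 0 and 1."""
    return 1 / (1 + math.exp(-x))

# Hypothetical coefficients and predictor values, for illustration only.
b0, b1, b2 = -1.0, 0.5, 2.0
x1, x2 = 3, 0

log_odds = b0 + b1 * x1 + b2 * x2  # linear combination of the predictors
p = inv_logit(log_odds)            # implied probability of a success
```

The linear combination lives on the log-odds scale, and the inverse logit converts it to a probability, so the model never predicts a probability below 0 or above 1.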
460 CHAPTER 26. INFERENCE FOR LOGISTIC REGRESSION
Table 26.1: Variables and their descriptions for the email dataset. Many of the variables are indicator
variables, meaning they take the value 1 if the specified characteristic is present and 0 otherwise.
Variable        Description
spam            Indicator for whether the email was spam.
to_multiple     Indicator for whether the email was addressed to more than one recipient.
from            Whether the message was listed as from anyone (this is usually set by default for regular outgoing email).
cc              Number of people cc’ed.
sent_email      Indicator for whether the sender had been sent an email in the last 30 days.
attach          The number of attached files.
dollar          The number of times a dollar sign or the word “dollar” appeared in the email.
winner          Indicates whether “winner” appeared in the email.
format          Indicates whether the email was written using HTML (e.g., may have included bolding or active links).
re_subj         Whether the subject started with “Re:”, “RE:”, “re:”, or “rE:”.
exclaim_subj    Whether there was an exclamation point in the subject.
urgent_subj     Whether the word “urgent” was in the email subject.
exclaim_mess    The number of exclamation points in the email message.
number          Factor variable saying whether there was no number, a small number (under 1 million), or a big number.
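There are two key conditions for fitting a logistic regression model:
1. Each outcome is independent of the other outcomes.
2. Each predictor is linearly related to logit(𝑝) if all other predictors are held constant.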
The first logistic regression model condition — independence of the outcomes — is reasonable if we
can assume that the emails that arrive in an inbox within a few months are independent of each other
with respect to whether they’re spam or not.
The second condition of the logistic regression model is not easily checked without a fairly sizable
amount of data. Luckily, we have 3921 emails in the dataset! Let’s first visualize these data by
plotting the true classification of the emails against the model’s fitted probabilities, as shown in
Figure 26.1.
Figure 26.1: The predicted probability that each of the 3921 emails is spam. Points have been jittered so
that those with nearly identical values aren’t plotted exactly on top of one another.
26.2. MULTIPLE LOGISTIC REGRESSION OUTPUT FROM SOFTWARE 461
We’d like to assess the quality of the model. For example, we might ask: if we look at emails that we
modeled as having a 10% chance of being spam, do we find that about 10% of them actually are spam?
We can check this for groups of the data by constructing a plot as follows:
1. Bucket the observations into groups based on their predicted probabilities.
2. Compute the average predicted probability for each group.
3. Compute the observed probability for each group, along with a 95% confidence interval for the
true probability of success for those individuals.
4. Plot the observed probabilities (with 95% confidence intervals) against the average predicted
probabilities for each group.
If the model does a good job describing the data, the plotted points should fall close to the line
𝑦 = 𝑥, since the predicted probabilities should be similar to the observed probabilities. We can use
the confidence intervals to roughly gauge whether anything might be amiss. Such a plot is shown in
Figure 26.2.
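The four steps above can be sketched in a few lines of Python. This is a rough illustration rather than the code used to produce Figure 26.2: the function name calibration_table, the equal-width buckets, and the normal-approximation interval (observed proportion plus or minus 1.96 standard errors) are all assumptions, and the six observations are made up.

```python
import math

def calibration_table(probs, outcomes, n_buckets=10):
    """Group observations into equal-width buckets of predicted probability.

    Returns one tuple per non-empty bucket:
    (mean predicted, observed proportion, CI lower, CI upper, count).
    """
    buckets = [[] for _ in range(n_buckets)]
    for p, y in zip(probs, outcomes):
        k = min(int(p * n_buckets), n_buckets - 1)  # p = 1.0 goes in the last bucket
        buckets[k].append((p, y))

    rows = []
    for group in buckets:
        if not group:
            continue
        n = len(group)
        mean_pred = sum(p for p, _ in group) / n
        observed = sum(y for _, y in group) / n
        # Assumed interval: normal approximation for a proportion.
        se = math.sqrt(observed * (1 - observed) / n)
        rows.append((mean_pred, observed, observed - 1.96 * se, observed + 1.96 * se, n))
    return rows

# Made-up predicted probabilities and true outcomes (1 = spam, 0 = not spam).
probs = [0.05, 0.08, 0.12, 0.15, 0.85, 0.90]
outcomes = [0, 0, 0, 1, 1, 1]
rows = calibration_table(probs, outcomes, n_buckets=2)
```

Plotting the second entry of each row against the first, with the interval as an error bar, gives a plot of the kind described. With small buckets the normal-approximation interval can be unreliable (it can even extend outside 0 to 1), which is one reason a fairly sizable dataset is needed.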
Figure 26.2: A reconfiguration of Figure 26.1. Again, the predicted probabilities are on the x-axis and the
truth is on the y-axis for each email. After data have been bucketed into predicted probability groups, the
proportion of spam emails (i.e., the observed probability) is given by the black circles. The dashed line
falls within the 95% confidence intervals for many of the buckets, suggesting the logistic fit is reasonable.
A plot like Figure 26.2 helps to better understand the deviations. Additional diagnostics may be
created that are similar to those featured in Section 24.6. For instance, we could compute residuals
as the observed outcome minus the expected outcome ($e_i = Y_i - \hat{p}_i$), and then we could create plots
of these residuals against each predictor.
As you learned in Chapter 8, optimization can be used to find the coefficient estimates for the logistic
model. The unknown population model can be written as:
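$$
\begin{aligned}
\log_e\left(\frac{p}{1-p}\right) = \beta_0 &+ \beta_1 \times \texttt{to\_multiple} \\
&+ \beta_2 \times \texttt{cc} \\
&+ \beta_3 \times \texttt{dollar} \\
&+ \beta_4 \times \texttt{urgent\_subj}
\end{aligned}
$$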
The estimated equation for the regression model may be written as a model with four predictor
variables, where 𝑝̂ is the estimated probability of being a spam email message:
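$$
\begin{aligned}
\log_e\left(\frac{\hat{p}}{1-\hat{p}}\right) = -2.05 &- 1.91 \times \texttt{to\_multiple} \\
&+ 0.02 \times \texttt{cc} \\
&- 0.07 \times \texttt{dollar} \\
&+ 2.66 \times \texttt{urgent\_subj}
\end{aligned}
$$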
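To see what the estimated equation implies for individual emails, the short Python sketch below plugs predictor values into it and converts the resulting log-odds into a probability. The function name spam_probability and the three example emails are hypothetical; the coefficients are the rounded values from the equation above, so the results are approximate.

```python
import math

def spam_probability(to_multiple, cc, dollar, urgent_subj):
    """Estimated probability of spam from the fitted four-predictor model."""
    log_odds = (-2.05 - 1.91 * to_multiple + 0.02 * cc
                - 0.07 * dollar + 2.66 * urgent_subj)
    return 1 / (1 + math.exp(-log_odds))

# Three hypothetical emails that differ in a single characteristic.
p_plain = spam_probability(to_multiple=0, cc=0, dollar=0, urgent_subj=0)
p_urgent = spam_probability(to_multiple=0, cc=0, dollar=0, urgent_subj=1)
p_multi = spam_probability(to_multiple=1, cc=0, dollar=0, urgent_subj=0)
```

Under these rounded coefficients, an email with none of the characteristics has an estimated spam probability of roughly 0.11, putting "urgent" in the subject raises it to roughly 0.65, and addressing the email to multiple recipients lowers it to roughly 0.02.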
Table 26.2: Summary of a logistic model for predicting whether an email is spam based on the variables
to_multiple, cc, dollar, and urgent_subj. Each of the variables has its own coefficient estimate and p-value.
Not only does Table 26.2 provide the estimates for the coefficients, it also provides information on the
inference analysis (i.e., hypothesis testing) which is the focus of this chapter.
As in Chapter 25, with multiple predictors, each hypothesis test (for each of the explanatory
variables) is conditioned on each of the other variables remaining in the model.
With multiple predictors: 𝐻0 ∶ 𝛽𝑖 = 0, given the other variables in the model.
Using the example above and focusing on each of the variable p-values (here we won’t discuss the
p-value associated with the intercept), we can write out the four different hypotheses (associated with
the p-value corresponding to each of the coefficients / rows in Table 26.2):
• 𝐻0 ∶ 𝛽1 = 0 given cc, dollar, and urgent_subj are included in the model
• 𝐻0 ∶ 𝛽2 = 0 given to_multiple, dollar, and urgent_subj are included in the model
• 𝐻0 ∶ 𝛽3 = 0 given to_multiple, cc, and urgent_subj are included in the model
• 𝐻0 ∶ 𝛽4 = 0 given to_multiple, cc, and dollar are included in the model
The very low p-values from the software output tell us that three of the variables (that is, all but cc)
act as statistically discernible predictors in the model at the discernibility level of 0.05, even with
the other variables included in the model. Consider the p-value on 𝐻0 ∶ 𝛽1 = 0. The low p-value says
that it would be extremely unlikely to observe data that yield a coefficient on to_multiple at least
as far from 0 as -1.91 (i.e., |𝑏1| ≥ 1.91) if the true relationship between to_multiple and spam was
non-existent (i.e., if 𝛽1 = 0) and the model also included cc, dollar, and urgent_subj. Note also
that the coefficient on dollar has a small associated p-value, but the magnitude of the coefficient is
also seemingly small (0.07). It turns out that in units of standard errors (0.02 here), 0.07 is actually
quite far from zero: it is 0.07 / 0.02 = 3.5 standard errors away. It’s all about context! The p-values
on the remaining variables are interpreted similarly. From the initial output (p-values) in Table 26.2,
it seems as though to_multiple, dollar, and urgent_subj are important variables for modeling whether
an email is spam. We remind you that although p-values provide some information about the importance
of each of the predictors in the model, there are many, arguably more important, aspects to consider
when choosing the best model.
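The "units of standard errors" idea can be checked with a short calculation. The Python sketch below divides an estimate by its standard error and converts the result into a two-sided p-value using the standard normal distribution. It assumes that the software's p-values come from this normal approximation, which is common for logistic regression output but is not stated in the text. The inputs are the rounded values quoted above for dollar, so the result is approximate.

```python
import math

def two_sided_p_value(estimate, std_error):
    """Return the z statistic and the two-sided normal p-value for H0: beta = 0."""
    z = estimate / std_error
    # P(|Z| >= |z|) for a standard normal Z, written with the complementary error function.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Rounded coefficient and standard error for dollar, as quoted in the text.
z, p_value = two_sided_p_value(-0.07, 0.02)
```

Here z is about -3.5 and the p-value is below 0.001, which is why a coefficient that looks small in absolute terms can still be statistically discernible.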
As with linear regression (see Section 25.2), existence of predictors that are correlated with each
other can affect both the coefficient estimates and the associated p-values. However, investigating
multicollinearity in a logistic regression model is saved for a text which provides more detail about
logistic regression. Next, as a model building alternative (or enhancement) to p-values, we revisit
cross-validation within the context of predicting status for each of the individual emails.
26.3. CROSS-VALIDATION FOR PREDICTION ERROR 463
The p-value is a probability measure under a setting of no relationship. That p-value provides infor-
mation about the degree of the relationship (e.g., above we measure the relationship between spam
and to_multiple using a p-value), but the p-value does not measure how well the model will predict
the individual emails (e.g., the accuracy of the model). Depending on the goal of the research project,
you might be inclined to focus on variable importance (through p-values) or you might be inclined to
focus on prediction accuracy (through cross-validation).
Here we present a method for using cross-validation accuracy to determine which variables (if
any) should be used in a model which predicts whether an email is spam. A full treatment of cross-
validation and logistic regression models is beyond the scope of this text. Using 𝑘-fold cross-validation,
we can build 𝑘 different models which are used to predict the observations in each of the 𝑘 holdout
samples. As with linear regression (see Section 25.3), we compare a smaller logistic regression model
to a larger logistic regression model. The smaller model uses only the to_multiple variable; see the
complete dataset (not cross-validated) model output in Table 26.3. The logistic regression model can
be written as follows, where 𝑝̂ is the estimated probability of being a spam email message.
The smaller model:
$$\log_e\left(\frac{\hat{p}}{1-\hat{p}}\right) = -2.12 - 1.81 \times \texttt{to\_multiple}$$
Table 26.3: The smaller model. Summary of a logistic model for predicting whether an email is spam based
on only the predictor variable to_multiple. Each of the variables has its own coefficient estimate and p-value.
For each cross-validated model, the coefficients change slightly, and the model is used to make inde-
pendent predictions on the holdout sample. The model from the first cross-validation sample is given
in Figure 26.3 and can be compared to the coefficients in Table 26.3.
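The bookkeeping behind 𝑘-fold cross-validation can be sketched as follows. This Python example only splits the row indices of the 3921 emails into 4 folds and builds the four pairs of model-building and holdout sets; fitting the logistic regression itself is left out. The function names and the random seed are hypothetical.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle the row indices 0..n-1 and deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def train_test_splits(folds):
    """For each fold, pair the remaining folds (model building) with that fold (holdout)."""
    splits = []
    for holdout, test_idx in enumerate(folds):
        train_idx = [i for j, fold in enumerate(folds) if j != holdout for i in fold]
        splits.append((train_idx, test_idx))
    return splits

folds = k_fold_indices(3921, 4)
splits = train_test_splits(folds)
# For each (train_idx, test_idx) pair, a model would be fit on the train_idx rows
# and used to predict the test_idx rows, so every email is predicted exactly once.
```

Because each email appears in exactly one holdout set, every prediction comes from a model that was built without that email.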
Figure 26.3: The smaller model. The coefficients are estimated by fitting the logistic regression model
on 3/4 of the data with a single predictor variable. Predictions are made on the remaining 1/4 of the
observations. Note that the prediction error rate is quite high.
Table 26.4: The smaller model. One quarter at a time, the data were removed from the model building,
and whether the email was spam (TRUE) or not (FALSE) was predicted. The logistic regression model was
fit independently of the removed emails. Only to_multiple is used to predict whether the email is spam.
Because we used a cutoff designed to identify spam emails, the accuracy of the non-spam email predictions
is very low. spamTP is the proportion of true spam emails that were predicted to be spam. notspamTP is the
proportion of true not spam emails that were predicted to be not spam.
Because the email dataset has a ratio of roughly 90% non-spam and 10% spam emails, a model which
always guessed non-spam would have an overall accuracy of 90%! Clearly, we would like to
capture the information in the spam emails, so our interest is in the percent of spam emails which
are identified as spam (see Table 26.4). Additionally, in the logistic regression model, we use a 10%
cutoff on the estimated probability to predict whether the email is spam. Fortunately, we have done a
great job of predicting the spam emails! However, the trade-off was that most of the non-spam emails
are now predicted to be spam, which is not acceptable for a prediction algorithm. Adding more variables
to the model may help with both the spam and non-spam predictions.
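The trade-off can be traced directly from the smaller model. The Python sketch below applies a 10% cutoff to the estimated probabilities from the rounded full-data coefficients (-2.12 and -1.81) and then computes the two proportions reported as spamTP and notspamTP. The ten emails are made up for illustration, and treating an estimated probability exactly at the cutoff as spam is an assumption.

```python
import math

def inv_logit(x):
    return 1 / (1 + math.exp(-x))

def predicted_spam(to_multiple, cutoff=0.10):
    """Predict spam when the smaller model's estimated probability reaches the cutoff."""
    p_hat = inv_logit(-2.12 - 1.81 * to_multiple)
    return p_hat >= cutoff  # assumption: a tie at the cutoff counts as spam

def true_positive_rates(actual, predicted):
    """Return (spamTP, notspamTP): the share of each true class predicted correctly."""
    pairs = list(zip(actual, predicted))
    n_spam = sum(1 for a, _ in pairs if a)
    n_notspam = len(pairs) - n_spam
    spam_hits = sum(1 for a, p in pairs if a and p)
    notspam_hits = sum(1 for a, p in pairs if not a and not p)
    return spam_hits / n_spam, notspam_hits / n_notspam

# Ten made-up emails: one spam, nine not spam.
to_multiple = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]
is_spam = [True] + [False] * 9

predictions = [predicted_spam(t) for t in to_multiple]
spam_tp, notspam_tp = true_positive_rates(is_spam, predictions)
```

With only one predictor there are just two possible estimated probabilities: roughly 0.107 when to_multiple is 0 and roughly 0.019 when it is 1. Every email sent to a single recipient therefore lands above the 10% cutoff and is labeled spam, which catches the spam email but mislabels most of the non-spam emails.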
The larger model uses to_multiple, attach, winner, format, re_subj, exclaim_mess, and number
as the set of seven predictor variables; see the complete dataset (not cross-validated) model output
in Table 26.5. The logistic regression model can be written as follows, where 𝑝̂ is the estimated
probability of being a spam email message.
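$$
\begin{aligned}
\log_e\left(\frac{\hat{p}}{1-\hat{p}}\right) = -0.34 &- 2.56 \times \texttt{to\_multiple} + 0.20 \times \texttt{attach} + 1.73 \times \texttt{winner}_{yes} \\
&- 1.28 \times \texttt{format} - 2.86 \times \texttt{re\_subj} + 0.00 \times \texttt{exclaim\_mess} \\
&- 1.07 \times \texttt{number}_{small} - 0.42 \times \texttt{number}_{big}
\end{aligned}
$$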
Table 26.5: The larger model. Summary of a logistic model for predicting whether an email is spam based
on the variables to_multiple, attach, winner, format, re_subj, exclaim_mess, and number. Each of the
variables has its own coefficient estimate and p-value.
Figure 26.4: The larger model. The coefficients are estimated by fitting the logistic regression model on
3/4 of the dataset with the seven specified predictor variables. Predictions are made on the remaining 1/4
of the observations. Note that the holdout observations were not used to estimate the model coefficients.
The predictions are now much better for both the spam and the non-spam emails (than they were with a
single predictor variable).
Table 26.6: The larger model. One quarter at a time, the data were removed from the model building,
and whether the email was spam (TRUE) or not (FALSE) was predicted. The logistic regression model was
fit independently of the removed emails. Now, the variables to_multiple, attach, winner, format, re_subj,
exclaim_mess, and number are used to predict whether the email is spam. spamTP is the proportion of true
spam emails that were predicted to be spam. notspamTP is the proportion of true not spam emails that were
predicted to be not spam.
As might be expected, the larger model (see Table 26.6) was able to capture more nuance in the emails,
which led to better predictions. However, it is not true that adding variables will always lead to
better predictions, as correlated or noise variables may dampen the signal from the set of variables
that truly predict the status. We encourage you to learn more about multiple variable models and
cross-validation in your future exploration of statistical topics.
26.4.1 Summary
Throughout the text, we have presented a modern view of introductory statistics. Earlier, we
presented graphical techniques which communicated relationships across multiple variables. We also
used modeling to formalize the relationships. In Chapter 26 we considered inferential claims on
models which include many variables used to predict the probability of the outcome being a success.
We continue to emphasize the importance of experimental design in making conclusions about research
claims. In particular, recall that variability can come from different sources (e.g., random sampling
vs. random allocation, see Figure 2.8).
As you might guess, this text has only scratched the surface of the world of statistical analyses that
can be applied to different datasets. In particular, to do justice to the topic, the linear models and
generalized linear models we have introduced can each be covered with their own course or book.
Hierarchical models, alternative methods for fitting parameters (e.g., Ridge Regression or LASSO),
and advanced computational methods applied to multivariable models (e.g., permuting the response
variable? one explanatory variable? all the explanatory variables?) are all beyond the scope of this
book. However, your successful understanding of the ideas we have covered has set you up perfectly
to move on to a higher level of statistical modeling and inference. Enjoy!
26.4.2 Terms
The terms introduced in this chapter are presented in Table 26.7. If you’re not sure what some of
these terms mean, we recommend you go back in the text and review their definitions. You should be
able to easily spot them as bolded text.
26.5 Exercises
2. Oceans and skin cancer. A researcher wants to investigate the relationship between living
within 10 miles of an ocean for at least one year of life and developing skin cancer before the
age of 50.
a. Explain why logistic regression can be used to study the relationship between these two
binary variables. What is the technical assumption describing the relationship between
the response (outcome) and explanatory (predictor) variables?
b. What other methods covered in this text might be used to address the research question of
interest? What advantages does logistic regression have over these methods?
3. Marijuana use in college. Researchers studying whether the value systems of adolescents
conflict with those of their parents asked 445 college students if they use marijuana. They also
asked the students’ parents if they used marijuana when they were in college. Based on the
regression output shown below for predicting student drug use from parent drug use, evaluate
whether parents’ marijuana usage is a discernible predictor of their kids’ marijuana usage. State
the hypotheses, the test statistics, the p-value, and the conclusion in context of the data and
the research question.1 (Ellis and Stone 1979)
4. Treating heart attacks. Researchers studying the effectiveness of Sulfinpyrazone in the pre-
vention of sudden death after a heart attack conducted a randomized experiment on 1,475
patients. Based on the regression output shown below for predicting the outcome (died or
lived, where success is defined as lived) from the treatment group (control and treatment),
evaluate whether treatment group is a discernible predictor of the outcome. State the hypothe-
ses, the test statistics, the p-value, and the conclusion in context of the data and the research
question.2 (Anturane Reinfarction Trial Research Group 1980)
1 The drug_use data used in this exercise can be found in the openintro R package.
2 The sulphinpyrazone data used in this exercise can be found in the openintro R package.
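Model with tail length:
$$\log_e\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 \times \texttt{tail\_l}$$
Model with total length and sex:
$$\log_e\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 \times \texttt{total\_l} + \beta_2 \times \texttt{sex}$$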
a. How many observations are in Fold2? Use the model with only tail length as a predictor
variable. Of the observations in Fold2, how many of them were correctly predicted to be
from Victoria? How many of them were incorrectly predicted to be from Victoria?
b. How many observations are used to build the model which predicts for the observations in
Fold2?
c. For one of the cross-validation folds, how many coefficients were estimated for the model
which uses tail length as a predictor? For one of the cross-validation folds, how many
coefficients were estimated for the model which uses total length and sex as predictors?
3 The possum data used in this exercise can be found in the openintro R package.
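Model with six predictors:
$$\log_e\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 \times \texttt{mage} + \beta_2 \times \texttt{weight} + \beta_3 \times \texttt{mature} + \beta_4 \times \texttt{visits} + \beta_5 \times \texttt{gained} + \beta_6 \times \texttt{habit}$$
Model with two predictors:
$$\log_e\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 \times \texttt{weight} + \beta_2 \times \texttt{mature}$$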
a. How many observations are in Fold2? Use the model with only weight and mature as pre-
dictor variables. Of the observations in Fold2, how many of them were correctly predicted
to be premature? How many of them were incorrectly predicted to be premature?
b. How many observations are used to build the model which predicts for the observations in
Fold2?
c. In the original dataset, are most of the births premature or full term? Explain.
d. For one of the cross-validation folds, how many coefficients were estimated for the model
which uses mage, weight, mature, visits, gained, and habit as predictors? For one of
the cross-validation folds, how many coefficients were estimated for the model which uses
weight and mature as predictors?
4 The births14 data used in this exercise can be found in the openintro R package.
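Model with tail length:
$$\log_e\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 \times \texttt{tail\_l}$$
Model with total length and sex:
$$\log_e\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 \times \texttt{total\_l} + \beta_2 \times \texttt{sex}$$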
a. For the model with tail length, how many of the observations were correctly classified?
What proportion of the observations were correctly classified?
b. For the model with total length and sex, how many of the observations were correctly
classified? What proportion of the observations were correctly classified?
c. If you have to choose between using only tail length as a predictor versus using total
length and sex as predictors (for classification into region), which model would you choose?
Explain.
d. Given the predictions above, what third model might be preferable to either of the models
above? Explain.
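Model with six predictors:
$$\log_e\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 \times \texttt{mage} + \beta_2 \times \texttt{weight} + \beta_3 \times \texttt{mature} + \beta_4 \times \texttt{visits} + \beta_5 \times \texttt{gained} + \beta_6 \times \texttt{habit}$$
Model with two predictors:
$$\log_e\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 \times \texttt{weight} + \beta_2 \times \texttt{mature}$$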
a. For the model with 6 predictors, how many of the observations were correctly classified?
What proportion of the observations were correctly classified?
b. For the model with 2 predictors, how many of the observations were correctly classified?
What proportion of the observations were correctly classified?
c. If you have to choose between the model with 6 predictors and the model with 2 predictors
(for predicting whether a baby will be premature), which model would you choose? Explain.
Chapter 27
In this case study, we consider Ebay auctions of a video game called Mario Kart for the Nintendo
Wii. The outcome variable of interest is the total price of an auction, which is the highest bid plus
the shipping cost. We will try to determine how total price is related to each characteristic in an
auction while simultaneously controlling for other variables. For instance, all other characteristics
held constant, are longer auctions associated with higher or lower prices? And, on average, how much
more do buyers tend to pay for additional Wii wheels (plastic steering wheels that attach to the Wii
controller) in auctions? Multiple regression will help us answer these and other questions.
The mariokart dataset includes results from 141 auctions. Four observations from this dataset are
shown in Table 27.1, and descriptions for each variable are shown in Table 27.2. Notice that the
condition and stock photo variables are indicator variables, similar to bankruptcy in the loans
dataset from Chapter 25.
Table 27.2: Variables and their descriptions for the mariokart dataset.
Variable        Description
price           Final auction price plus shipping costs, in US dollars.
cond_new        Indicator variable for if the game is new (1) or used (0).
stock_photo     Indicator variable for if the auction’s main photo is a stock photo.
duration        The length of the auction, in days, taking values from 1 to 10.
wheels          The number of Wii wheels included with the auction. A Wii wheel is an optional steering wheel accessory that holds the Wii controller.
27.1. CASE STUDY: MARIO KART 473
𝐸[price] = 𝛽0 + 𝛽1 × cond_new
Table 27.3: Summary of a linear model for predicting price based on cond_new.
GUIDED PRACTICE
Write down the equation for the model, note whether the slope is statistically different
from zero, and interpret the coefficient.1
Sometimes there are underlying structures or relationships between predictor variables. For instance,
new games sold on Ebay tend to come with more Wii wheels, which may have led to higher prices
for those auctions. We would like to fit a model that includes all potentially important variables
simultaneously, which would help us evaluate the relationship between a predictor variable and the
outcome while controlling for the potential influence of other variables.
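We want to construct a model that accounts for not only the game condition but simultaneously
accounts for three other variables (stock_photo, duration, and wheels):
$$E[\texttt{price}] = \beta_0 + \beta_1 \times \texttt{cond\_new} + \beta_2 \times \texttt{stock\_photo} + \beta_3 \times \texttt{duration} + \beta_4 \times \texttt{wheels}$$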
Table 27.4 summarizes the full model. Using the output, we identify the point estimates of each
coefficient and the corresponding impact (measured with information on the standard error used to
compute the p-value).
Table 27.4: Summary of a linear model for predicting price based on cond_new, stock_photo, duration,
and wheels.
GUIDED PRACTICE
Write out the model’s equation using the point estimates from Table 27.4. How many
predictors are there in the model? How many coefficients are estimated?2
GUIDED PRACTICE
What does 𝛽4 , the coefficient of variable 𝑥4 (Wii wheels), represent? What is the point
estimate of 𝛽4 ?3
GUIDED PRACTICE
Compute the residual of the first observation in Table 27.1 using the equation identified
in Table 27.4.4
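The residual calculation can be laid out step by step in Python. The coefficients below are the rounded point estimates for the full model. The predictor values for the first auction (a new game with a stock photo, a 3-day auction, and 1 wheel) and its price of 51.55 are assumed here because Table 27.1 is not reproduced in this section; they were chosen to be consistent with the fitted value and residual reported in the footnotes.

```python
def predicted_price(cond_new, stock_photo, duration, wheels):
    """Fitted total price from the four-predictor model (rounded coefficients)."""
    return (36.21 + 5.13 * cond_new + 1.08 * stock_photo
            - 0.03 * duration + 7.29 * wheels)

# Assumed values for the first auction; Table 27.1 is not shown in this section.
observed = 51.55
fitted = predicted_price(cond_new=1, stock_photo=1, duration=3, wheels=1)
residual = observed - fitted  # observed minus predicted
```

A positive residual means the auction sold for more than the model predicted for a game with those characteristics.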
EXAMPLE
In Table 27.3, we estimated a coefficient for cond_new of 𝑏1 = 10.90 with a standard error
of 𝑆𝐸𝑏1 = 1.26 when using simple linear regression. Why might there be a difference between
that estimate and the one in the multiple regression setting?
If we examined the data carefully, we would see that there is multicollinearity among some
predictors. For instance, when we estimated the connection of the outcome price and predictor
cond_new using simple linear regression, we were unable to control for other variables like the
number of Wii wheels included in the auction. That model was biased by the confounding
variable wheels. When we use both variables, this particular underlying and unintentional
bias is reduced or eliminated (though bias from other confounding variables may still remain).
Figure 27.1: Estimated slopes from linear models (price regressed on cond_new) built on 1,000 randomized
datasets. Each dataset was permuted under the null hypothesis.
2 \(\widehat{\text{price}} = 36.21 + 5.13 \times \text{cond\_new} + 1.08 \times \text{stock\_photo} - 0.03 \times \text{duration} + 7.29 \times \text{wheels}\),
with 4 predictors but 5 coefficients (including the intercept).
3 In the population of all auctions, it is the average difference in auction price for each additional Wii wheel included
when holding the other variables constant. The point estimate is \(b_4 = 7.29\).
4 \(e_i = y_i - \hat{y}_i = 51.55 - 49.62 = 1.93\).
27.1. CASE STUDY: MARIO KART 475
EXAMPLE
In Figure 27.1, the red line (the observed slope) is far from the bulk of the histogram. Explain
why the randomly permuted datasets produce slopes that are quite different from the observed
slope.
The null hypothesis is that, in the population, there is no linear relationship between the price
and the cond_new of the Mario Kart games. When the data are randomly permuted, prices
are randomly assigned to a condition (new or used), so that the null hypothesis is forced to
be true, i.e., permutation is done under the assumption that no relationship between the two
variables exists. In the actual study, the new Mario Kart games do actually cost more (on
average) than the used games! So the slope describing the actual observed relationship is not
one that is likely to have happened in a dataset randomly permuted under the assumption that
the null hypothesis is true.
GUIDED PRACTICE
Using the histogram in Figure 27.1, find the p-value and conclude the hypothesis test
in the context of the problem.5
GUIDED PRACTICE
Is the conclusion based on the histogram of randomized slopes consistent with the
conclusion obtained using the mathematical model? Explain.6
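The permutation logic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the twelve prices below are hypothetical (they are not the Mario Kart auction data), and the helper `slope` and the seed are ours. With a 0/1 predictor, the least-squares slope equals the difference in mean price between the two groups.

```python
import random


def slope(x, y):
    """Least-squares slope of y regressed on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


# HYPOTHETICAL data (not the Mario Kart auctions): 1 = new, 0 = used.
cond_new = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
price = [54.0, 51.5, 58.0, 49.0, 56.5, 43.0, 45.5, 40.0, 47.0, 41.5, 44.0, 38.5]

observed = slope(cond_new, price)

rng = random.Random(2024)  # fixed seed so the sketch is reproducible
n_perm = 1000
permuted_slopes = []
for _ in range(n_perm):
    shuffled = price[:]      # copy the prices ...
    rng.shuffle(shuffled)    # ... and break any link between price and condition
    permuted_slopes.append(slope(cond_new, shuffled))

# Two-sided p-value: share of permuted slopes at least as extreme as the observed slope.
p_value = sum(abs(s) >= abs(observed) for s in permuted_slopes) / n_perm
print(f"observed slope = {observed:.2f}, p-value = {p_value:.3f}")
```

Shuffling the prices while leaving the condition labels fixed is what forces the null hypothesis to be true in each permuted dataset.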
Although it is useful to know that there is a relationship between the condition of the game and its price, we might
be more interested in the size of the difference in price, here given by the slope of the linear regression line. That
is, 𝛽1 represents the population value for the difference in price between new Mario Kart games and
used games.
Figure 27.2: Estimated slopes from linear models (price regressed on cond_new) built on 1,000 bootstrapped
datasets. Each bootstrap sample was taken from the original Mario Kart auction data.
5 The observed slope is 10.9 which is nowhere near the range of values for the permuted slopes (roughly -5 to +5).
Because the observed slope is not a plausible value under the null distribution, the p-value is essentially zero. We reject
the null hypothesis and claim that there is a relationship between whether the game is new (or not) and the average
predicted price of the game.
6 The p-value in Table 27.3 is also essentially zero, so the null hypothesis is also rejected when the mathematical
model approach is taken. Often, the mathematical and computational approaches to inference will give quite similar
answers.
EXAMPLE
Figure 27.2 displays the slope estimates taken from bootstrap samples of the original data.
Using the histogram, estimate the standard error of the slope. Is your estimate similar to
the value of the standard error of the slope provided in the output of the mathematical linear
model?
The slopes seem to vary from approximately 8 to 14. Using the empirical rule, we know that if a
variable has a bell-shaped distribution, most of the observations will be within 2 standard deviations
of the center. The range from 8 to 14 therefore spans roughly 4 standard errors, so a rough
approximation of the standard error is (14 − 8)/4 = 1.5. The standard error given in Table 27.3
is 1.26, which is not too different from the value computed using the bootstrap approach.
GUIDED PRACTICE
Use Figure 27.2 to create a 90% standard error bootstrap confidence interval for the
true slope. Interpret the interval in context.7
GUIDED PRACTICE
Use Figure 27.2 to create a 90% bootstrap percentile confidence interval for the true
slope. Interpret the interval in context.8
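Both bootstrap intervals from the guided practice above can be sketched in Python. Again the data are hypothetical (not the Mario Kart auctions), and the helper `slope`, the seed, and the decision to redraw resamples that contain only one condition are our assumptions, made so that the slope is always defined.

```python
import random
import statistics


def slope(x, y):
    """Least-squares slope of y regressed on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


# HYPOTHETICAL data (not the Mario Kart auctions): 1 = new, 0 = used.
cond_new = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
price = [54.0, 51.5, 58.0, 49.0, 56.5, 43.0, 45.5, 40.0, 47.0, 41.5, 44.0, 38.5]

rng = random.Random(2024)  # fixed seed so the sketch is reproducible
n = len(price)
boot_slopes = []
while len(boot_slopes) < 1000:
    idx = [rng.randrange(n) for _ in range(n)]  # resample rows with replacement
    xs = [cond_new[i] for i in idx]
    if len(set(xs)) < 2:
        continue  # ASSUMPTION: redraw if the resample has only one condition
    ys = [price[i] for i in idx]
    boot_slopes.append(slope(xs, ys))

observed = slope(cond_new, price)
se_boot = statistics.stdev(boot_slopes)  # bootstrap estimate of the standard error

# 90% SE interval: observed slope plus or minus z* times the bootstrap SE.
z_star = 1.645
se_interval = (observed - z_star * se_boot, observed + z_star * se_boot)

# 90% percentile interval: leave 50 of the 1,000 slopes below and 50 above.
boot_sorted = sorted(boot_slopes)
percentile_interval = (boot_sorted[50], boot_sorted[949])

print("SE interval:", se_interval)
print("percentile interval:", percentile_interval)
```

The SE interval relies on the bootstrap distribution being roughly bell-shaped, while the percentile interval reads its cutoffs directly from the sorted slopes.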
27.1.3 Cross-validation
In Chapter 8, models were compared using \(R^2_{adj}\). In Chapter 25, however, a computational approach
was introduced to compare models by removing chunks of data one at a time and assessing how well
the variables predicted the observations that had been held out.
Figure 27.3 was created by cross-validating models with the same variables as in Table 27.3 and
Table 27.4. We applied 3-fold cross-validation, so 1/3 of the data was removed while 2/3 of the
observations were used to build each model (first on cond_new only and then on cond_new, stock_photo,
duration, and wheels). Note that each time 1/3 of the data is removed, the resulting model
will produce slightly different model coefficients.
The points in Figure 27.3 represent the prediction (x-axis) and residual (y-axis) for each observation
run through the cross-validated model. In other words, the model is built (using the other 2/3) without
the observation (which is in the 1/3) being used. The residuals give us a sense for how well the model
will do at predicting observations which were not a part of the original dataset, e.g., future studies.
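The 3-fold procedure described above can be sketched as follows for the single-predictor model. This is a minimal illustration, assuming hypothetical data (not the Mario Kart auctions) and a simple way of assigning folds; `fit_line`, the seed, and the variable names are ours.

```python
import random


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1


# HYPOTHETICAL data (not the Mario Kart auctions): 1 = new, 0 = used.
cond_new = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
price = [54.0, 51.5, 58.0, 49.0, 56.5, 43.0, 45.5, 40.0, 47.0, 41.5, 44.0, 38.5]

rng = random.Random(2024)  # fixed seed so the sketch is reproducible
indices = list(range(len(price)))
rng.shuffle(indices)
k = 3
folds = [indices[i::k] for i in range(k)]  # three folds of four observations each

cv_results = []  # one (prediction, prediction error) pair per held-out observation
for held_out in folds:
    # Build the model on the other two folds only.
    train = [i for i in indices if i not in held_out]
    b0, b1 = fit_line([cond_new[i] for i in train], [price[i] for i in train])
    # Predict the observations that were held out of the model fit.
    for i in held_out:
        prediction = b0 + b1 * cond_new[i]
        cv_results.append((prediction, price[i] - prediction))

# Cross-validation sum of squared errors (CV SSE): smaller is better.
cv_sse = sum(error ** 2 for _, error in cv_results)
print(f"CV SSE = {cv_sse:.2f}")
```

Each observation is predicted exactly once, by a model that never saw it, which is why the coefficients differ slightly from fold to fold.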
7 Using the bootstrap SE method, we know the normal percentile is \(z^\star = 1.645\), which gives a CI of
\(b_1 \pm 1.645 \cdot SE \to 10.9 \pm 1.645 \cdot 1.5 \to (8.43, 13.37)\). For games that are new, the average price is
higher by between $8.43 and $13.37 than games that are used, with 90% confidence.
8 Because there were 1,000 bootstrap resamples, we look for the cutoffs which provide 50 bootstrap slopes on the
left, 900 in the middle, and 50 on the right. Looking at the bootstrap histogram, the rough 90% confidence interval is
$9 to $13.10. For games that are new, the average price is higher by between $9.00 and $13.10 than games that are
used, with 90% confidence.
(a) price vs. cond_new (b) price vs. cond_new, stock_photo, duration, and wheels
Figure 27.3: Cross-validation predictions and errors from linear models built on two different sets of variables.
GUIDED PRACTICE
In Figure 27.3b, note the point at roughly predicted = 50 and prediction error = 10.
Estimate the observed and predicted value for that observation.9
GUIDED PRACTICE
In Figure 27.3b, for the same point at roughly predicted = 50 and prediction error =
10, describe which cross-validation fold(s) were used to build its prediction model.10
GUIDED PRACTICE
By noting the spread of the cross-validated prediction errors (on the y-axis) in Figure 27.3,
which model should be chosen for a final report on these data?11
GUIDED PRACTICE
Using the summary statistic cross-validation sum of squared errors (CV SSE), which
model should be chosen for a final report on these data?12
9 The predicted value is roughly \(\widehat{\text{price}}_i\) = $50. The observed value is roughly \(\text{price}_i\) = $60 (using \(e_i = y_i - \hat{y}_i\)).
10 The point appears to be in fold 2, so folds 1 and 3 were used to build the prediction model.
11 The cross-validated residuals on cond_new vary roughly from -15 to 15, while the cross-validated residuals on the
four predictor model vary less, roughly from -10 to 10. Given the smaller residuals from the four predictor model, it
seems as though the larger model is better.
12 The CV SSE is smaller (by a factor of almost two!) for the model with four predictors. Using a single valued
criterion (CV SSE) allows us to make a decision to choose the model with four predictors.
27.2 Interactive R tutorials
Navigate the concepts you’ve learned in this part in R using the following self-paced tutorials. All
you need is your browser to get started!
Tutorial 6: Inferential modeling
https://openintrostat.github.io/ims-tutorials/06-model-infer
Tutorial 6 - Lesson 1: Inference in regression
https://openintro.shinyapps.io/ims-06-model-infer-01
Tutorial 6 - Lesson 2: Randomization test for slope
https://openintro.shinyapps.io/ims-06-model-infer-02
Tutorial 6 - Lesson 3: t-test for slope
https://openintro.shinyapps.io/ims-06-model-infer-03
Tutorial 6 - Lesson 4: Checking technical conditions for slope inference
https://openintro.shinyapps.io/ims-06-model-infer-04
Tutorial 6 - Lesson 5: Inference beyond the simple linear regression model
https://openintro.shinyapps.io/ims-06-model-infer-05
You can also access the full list of tutorials supporting this book at
https://openintrostat.github.io/ims-tutorials.
27.3 R labs
Further apply the concepts you’ve learned in this part in R with computational labs that walk you
through a data analysis case study.
Multiple linear regression - Grading the professor
https://www.openintro.org/go?id=ims-r-lab-model-infer
You can also access the full list of labs supporting this book at
https://www.openintro.org/go?id=ims-r-labs.
Chapter A
Exercise solutions
A.1 Chapter 1
1. 23 observations and 7 variables.
3. (a) “Is there an association between air pollution exposure and preterm births?” (b) 143,196
births in Southern California between 1989 and 1993. (c) Measurements of carbon monoxide,
nitrogen dioxide, ozone, and particulate matter less than 10 \(\mu g/m^3\) (PM\(_{10}\)) collected at
air-quality-monitoring stations as well as length of gestation. Continuous numerical variables.
5. (a) “What is the effect of gamification on learning outcomes compared to traditional teaching
methods?” (b) 365 college students taking a statistics course (c) Gender (categorical), level
of studies (categorical, ordinal), academic major (categorical), expertise in English language
(categorical, ordinal), use of personal computers and games (categorical, ordinal), treatment
group (categorical), score (numerical, discrete).
7. (a) Treatment: 10/43 = 0.23 → 23%. (b) Control: 2/46 = 0.04 → 4%. (c) A higher percentage
of patients in the treatment group were pain free 24 hours after receiving acupuncture. (d) It
is possible that the observed difference between the two group percentages is due to chance. (e)
Explanatory: acupuncture or not. Response: if the patient was pain free or not.
9. (a) Experiment; researchers are evaluating the effect of fines on parents’ behavior related to
picking up their children late from daycare. (b) 10 cases: the daycare centers. (c) Number of
late pickups (discrete numerical). (d) Week (numerical, discrete), group (categorical, nominal),
number of late pickups (numerical, discrete), and study period (categorical, ordinal).
11. (a) 344 cases (penguins) are included in the data. (b) There are 4 numerical variables in the data:
bill length, bill depth, and flipper length (measured in millimeters) and body mass (measured
in grams). They are all continuous. (c) There are 3 categorical variables in the data: species
(Adelie, Chinstrap, Gentoo), island (Torgersen, Biscoe, and Dream), and sex (female and male).
13. (a) Airport ownership status (public/private), airport usage status (public/private), region (Cen-
tral, Eastern, Great Lakes, New England, Northwest Mountain, Southern, Southwest, Western
Pacific), latitude, and longitude. (b) Airport ownership status: categorical, not ordinal. Airport
usage status: categorical, not ordinal. Region: categorical, not ordinal. Latitude: numerical,
continuous. Longitude: numerical, continuous.
15. (a) Year, number of baby girls named Fiona born in that year, nation. (b) Year (numerical,
discrete), number of baby girls named Fiona born in that year (numerical, discrete), nation
(categorical, nominal).
17. (a) County, state, driver’s race, whether the car was searched or not, and whether the driver was
arrested or not. (b) All categorical, non-ordinal. (c) Response: whether the car was searched or
not. Explanatory: race of the driver.
19. (a) Observational study. (b) Dog: Lucy. Cat: Luna. (c) Oliver and Lily. (d) Positive, as the
popularity of a name for dogs increases, so does the popularity of that name for cats.
480 APPENDIX A. EXERCISE SOLUTIONS
A.2 Chapter 2
1. (a) Population mean, \(\mu_{2007} = 52\); sample mean, \(\bar{x}_{2008} = 58\). (b) Population mean, \(\mu_{2001} = 3.37\);
sample mean, \(\bar{x}_{2012} = 3.59\).
3. (a) Population: all births, sample: 143,196 births between 1989 and 1993 in Southern California.
(b) If births in this time span and geography can be considered to be representative of all
births, then the results are generalizable to the population of Southern California. However,
since the study is observational the findings cannot be used to establish causal relationships.
5. (a) The population of interest is all college students studying statistics. The sample consists of
365 such students. (b) If the students in this sample, who are likely not randomly sampled, can
be considered to be representative of all college students studying statistics, then the results
are generalizable to the population defined above. This is probably not a reasonable assump-
tion since these students are from two specific majors only. Additionally, since the study is
experimental, the findings can be used to establish causal relationships.
7. (a) Observation. (b) Variable. (c) Sample statistic (mean). (d) Population parameter (mean).
9. (a) Observational. (b) Use stratified sampling to randomly sample a fixed number of students,
say 10, from each section for a total sample size of 40 students.
11. (a) Positive, non-linear, somewhat strong. Countries in which a higher percentage of the population
have access to the internet also tend to have higher average life expectancies; however, the rise
in life expectancy trails off before around 80 years old. (b) Observational. (c) Wealth: countries
with individuals who can widely afford the internet can probably also afford basic medical care.
(Note: Answers may vary.)
13. (a) Simple random sampling is okay. In fact, it’s rare for simple random sampling to not be
a reasonable sampling method! (b) The student opinions may vary by field of study, so
stratifying by this variable makes sense and would be reasonable. (c) Students of similar ages
are probably going to have more similar opinions, and we want clusters to be diverse with
respect to the outcome of interest, so this would not be a good approach. (Additional thought:
the clusters in this case may also have very different numbers of people, which can also create
unexpected sample sizes.)
15. (a) The cases are 200 randomly sampled men and women. (b) The response variable is attitude
towards a fictional microwave oven. (c) The explanatory variable is dispositional attitude. (d)
Yes, the cases are sampled randomly, recruited online using Amazon’s Mechanical Turk. (e)
This is an observational study since there is no random assignment to treatments. (f) No, we
cannot establish a causal link between the explanatory and response variables since the study
is observational. (g) Yes, the results of the study can be generalized to the population at large
since the sample is random.
17. (a) Simple random sample. Non-response bias: if only those people who have strong opinions
about the survey respond, their sample may not be representative of the population. (b) Convenience
sample. Undercoverage bias: their sample may not be representative of the population
since it consists only of their friends. It is also possible that the study will have non-response
bias if some choose to not bring back the survey. (c) Convenience sample. This will have
similar issues to handing out surveys to friends. (d) Multi-stage sampling. If the classes are
similar to each other with respect to student composition this approach should not introduce
bias, other than potential non-response bias.
19. (a) Exam performance. (b) Light level: fluorescent overhead lighting, yellow overhead lighting,
no overhead lighting (only desk lamps). (c) Wearing glasses or not.
21. (a) Experiment. (b) Light level (fluorescent overhead lighting, yellow overhead lighting, no overhead lighting)
and noise level (no noise, construction noise, and human chatter noise). (c) Since the researchers
want to ensure equal representation of those wearing glasses and not wearing glasses, wearing
glasses is a blocking variable.
23. Need randomization and blinding. One possible outline: (1) Prepare two cups for each participant,
one containing regular Coke and the other containing Diet Coke. Make sure the cups are
identical and contain equal amounts of soda. Label the cups A (regular) and B (diet). (Be sure to
randomize A and B for each trial!) (2) Give each participant the two cups, one cup at a time,
in random order, and ask the participant to record a value that indicates how much she liked the
beverage. Be sure that neither the participant nor the person handing out the cups knows the
identity of the beverage to make this a double-blind experiment. (Answers may vary.)
25. (a) Experiment. (b) Treatment: 25 grams of chia seeds twice a day, control: placebo. (c) Yes,
gender. (d) Yes, single blind since the patients were blinded to the treatment they received. (e)
Since this is an experiment, we can make a causal statement. However, since the sample is not
random, the causal statement cannot be generalized to the population at large.
27. (a) Non-responders may have a different response to this question, e.g., parents who returned
the surveys likely don’t have difficulty spending time with their children. (b) It is unlikely that
the women who were reached at the same address 3 years later are a random sample. These
missing responders are probably renters (as opposed to homeowners) which means that they
might have a lower socio-economic status than the respondents. (c) There is no control group
in this study, this is an observational study, and there may be confounding variables, e.g., these
people may go running because they are generally healthier and/or do other exercises.
29. (a) Randomized controlled experiment. (b) Explanatory: treatment group (categorical, with
3 levels). Response variable: Psychological well-being. (c) No, because the participants were
volunteers. (d) Yes, because it was an experiment. (e) The statement should say “evidence”
instead of “proof”.
A.3 Chapter 3
A.4 Chapter 4
1. (a) We see the order of the categories and the relative frequencies in the bar plot. (b) There are
no features that are apparent in the pie chart but not in the bar plot. (c) We usually prefer to
use a bar plot as we can also see the relative frequencies of the categories in this graph.
3. (a) The horizontal locations at which the age groups break into the various opinion levels differ,
which indicates that the likelihood of supporting protests varies by age group. The two variables may
be associated. (b) Answers may vary. Political ideology/leaning and education level.
5. (a) Number of participants in each group. (b) Proportion of survival. (c) The standardized bar
plot should be displayed as a way to visualize the survival improvement in the treatment versus
the control group.
7. (a) The ridge plots do not tell us about the relationship between meat consumption and life
expectancy. While it is true that the high income group of countries has highest meat consump-
tion and highest life expectancy, we can’t, for example, differentiate meat consumption across
the low and middle income groups (so as to connect to life expectancy). Additionally, we don’t
know anything about the relationship between meat consumption and life expectancy within an
income group. (b) When a relationship is confounded we cannot determine the causal mechanism.
We don’t know if the longer life expectancy is due to meat consumption or due to higher
income (which comes with many other life-extending practices). (c) In order to investigate a
specific confounding variable, first break the data into categories according to that confounding
variable (here, income). Then look at the relationship of interest (here meat consumption and
life expectancy) separately for each of the levels of the confounding variable (income).
9. (a) 41% of the JetBlue flights are delayed. 40.7% of the United Airlines flights are delayed.
(b) For SFO: JetBlue had 39.7% delayed, United had 40% delayed (United had more delayed
flights). For LAX: JetBlue had 40.1% delayed, United had 41% delayed (United had more
delayed flights). For BQN: JetBlue had 45.7% delayed, United had 48.8% delayed (United had
more delayed flights). (c) Note that JetBlue had substantially more flights than United out of
BQN (where there was a high delay percentage). United had substantially more flights than
JetBlue out of SFO and LAX, both of which had low delay percentages. So JetBlue’s overall
percentage delay is bumped up due to the BQN flights, and United’s overall percentage delay is
bumped down due to the SFO and LAX flights.
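The reversal described in 9(c) can be reproduced with a small Python sketch. The counts below are hypothetical and are not the flight data from the exercise; they were chosen so that Airline B has the higher delay rate at each airport while Airline A, which flies mostly out of the high-delay airport, has the higher overall rate.

```python
# HYPOTHETICAL counts (not the exercise data): (delayed flights, total flights).
flights = {
    "Airline A": {"Airport X": (10, 100), "Airport Y": (400, 1000)},
    "Airline B": {"Airport X": (120, 1000), "Airport Y": (45, 100)},
}


def delay_rate(delayed, total):
    """Proportion of flights that were delayed."""
    return delayed / total


by_airport = {}  # delay rate per (airline, airport)
overall = {}     # delay rate per airline, pooling both airports
for airline, airports in flights.items():
    for airport, (delayed, total) in airports.items():
        by_airport[(airline, airport)] = delay_rate(delayed, total)
        print(f"{airline} at {airport}: {by_airport[(airline, airport)]:.1%} delayed")
    total_delayed = sum(d for d, _ in airports.values())
    total_flights = sum(t for _, t in airports.values())
    overall[airline] = delay_rate(total_delayed, total_flights)
    print(f"{airline} overall: {overall[airline]:.1%} delayed")
```

The overall rate is a weighted average of the airport rates, so an airline's mix of airports can outweigh its performance at each one.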
A.5 Chapter 5
1. (a) Positive association: mammals with longer gestation periods tend to live longer as well. (b)
Association would still be positive. (c) No, they are not independent. See part (a).
3. The graph below shows a ramp up period. There may also be a period of exponential growth
at the start before the size of the petri dish becomes a factor in slowing growth.
5. (a) Decrease: the new score is smaller than the mean of the 24 previous scores. (b) Calculate
a weighted mean. Use a weight of 24 for the old mean and 1 for the new score: (24 × 74 +
1 × 64)/(24 + 1) = 73.6. (c) The new score is more than 1 standard deviation away from the
previous mean, so increase.
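The weighted mean in 5(b) can be checked with a short Python sketch; the variable names are ours.

```python
old_mean, old_n = 74, 24  # mean and count of the previous scores
new_score = 64            # the one additional score

# Weight the old mean by the 24 scores it summarizes and the new score by 1.
new_mean = (old_n * old_mean + 1 * new_score) / (old_n + 1)
print(new_mean)  # 73.6
```

Because the new score is below the old mean, the updated mean moves down, but only slightly, since it carries 1 of 25 parts of the weight.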
7. Any 10 employees whose average number of days off is between the minimum and the mean
number of days off for the entire workforce at this plant.
9. (a) Dist B has a higher mean since 20 > 13, and a higher standard deviation since 20 is further
from the rest of the data than 13. (b) Dist A has a higher mean since −20 > −40, and Dist
B has a higher standard deviation since -40 is farther away from the rest of the data than -20.
(c) Dist B has a higher mean since all values in this distribution are higher than those in Dist A, but
both distributions have the same standard deviation since they are equally variable around their
respective means. (d) Both distributions have the same mean since they’re both centered at 300,
but Dist B has a higher standard deviation since the observations are farther from the mean
than in Dist A.
11. (a) About 26. (b) Since the distribution is right skewed the mean is higher than the median. (c)
Q1: between 15 and 20, Q3: between 35 and 40, IQR: about 20. (d) Values that are considered
to be unusually low or high lie more than 1.5×IQR away from the quartiles. Upper fence: Q3
+ 1.5 × IQR = 37.5 + 1.5 × 20 = 67.5; Lower fence: Q1 − 1.5 × IQR = 17.5 − 1.5 × 20 = −12.5;
The lowest AQI recorded is not lower than 5 and the highest AQI recorded is not higher than 65,
which are both within the fences. Therefore none of the days in this sample would be considered
to have an unusually low or high AQI.
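The fence calculation in 11(d) can be sketched in Python. The quartiles are the approximate values read from the plot in the answer, and the lowest and highest AQI values are the bounds quoted there; the variable names are ours.

```python
q1, q3 = 17.5, 37.5  # approximate quartiles used in the answer
iqr = q3 - q1        # interquartile range

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
print(lower_fence, upper_fence)  # -12.5 67.5

# Bounds quoted in the answer: no AQI below 5 and none above 65.
lowest_aqi, highest_aqi = 5, 65
has_outliers = lowest_aqi < lower_fence or highest_aqi > upper_fence
print(has_outliers)  # False
```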
13. The histogram shows that the distribution is bimodal, which is not apparent in the box plot. The
box plot makes it easy to identify more precise values of observations outside of the whiskers.
15. (a) Right skewed, there is a natural boundary at 0 and only a few people have many pets. Center:
median, variability: IQR. (b) Right skewed, there is a natural boundary at 0 and only a few
people live a very long distance from work. Center: median, variability: IQR. (c) Symmetric.
Center: mean, variability: standard deviation. (d) Left skewed. Center: median, variability:
IQR. (e) Left skewed. Center: median, variability: IQR.
17. No, we would expect this distribution to be right skewed. There are two reasons for this: there
is a natural boundary at 0 (it is not possible to watch less than 0 hours of TV) and the standard
deviation of the distribution is very large compared to the mean.
19. No, the outliers are likely the maximum and the minimum of the distribution so a statistic based
on these values cannot be robust to outliers.
21. The 75th percentile is 82.5, so 5 students will get an A. Also, by definition 25% of students will
be above the 75th percentile.
23. (a) If \(\bar{x}/\text{median} = 1\), then \(\bar{x} = \text{median}\). This is most likely to be the case for symmetric
distributions. (b) If \(\bar{x}/\text{median} < 1\), then \(\bar{x} < \text{median}\). This is most likely to be the case for left skewed
distributions, since the mean is affected (and pulled down) by the lower values more so than the
median. (c) If \(\bar{x}/\text{median} > 1\), then \(\bar{x} > \text{median}\). This is most likely to be the case for right skewed
distributions, since the mean is affected (and pulled up) by the higher values more so than the
median.
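The three cases in 23 can be illustrated with small mock datasets in Python. The numbers below are made up for illustration and do not come from the exercise.

```python
import statistics

# HYPOTHETICAL mock datasets, one per shape.
datasets = {
    "symmetric": [1, 2, 3, 4, 5],
    "left skewed": [1, 7, 8, 9, 10],   # one unusually low value
    "right skewed": [1, 2, 3, 4, 20],  # one unusually high value
}

ratios = {}
for shape, values in datasets.items():
    mean = statistics.mean(values)
    median = statistics.median(values)
    ratios[shape] = mean / median
    print(f"{shape}: mean = {mean}, median = {median}, ratio = {ratios[shape]}")
```

The extreme value pulls the mean toward the tail while the median stays put, which moves the ratio below or above 1.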
25. (a) The distribution of percentage of population that is Hispanic is extremely right skewed, with
the majority of counties having less than 10% Hispanic residents. However, there are a few counties
that have more than 90% Hispanic population. It might be preferable, in certain analyses, to
use the log-transformed values since this distribution is much less skewed. (b) The map reveals
that counties with higher proportions of Hispanic residents are clustered along the Southwest
border, all of New Mexico, a large swath of Southwest Texas, the bottom two-thirds of California,
and in Southern Florida. In the map all counties with more than 40% of Hispanic residents are
indicated by the darker shading, so it is impossible to discern how high Hispanic percentages go.
The histogram reveals that there are counties with over 90% Hispanic residents. The histogram
is also useful for estimating measures of center and spread. (c) Both visualizations are useful,
but if we could only examine one, we should examine the map since it explicitly ties geographic
data to each county’s percentage.
A.6 Chapter 6
A.7 Chapter 7
1. (a) The residual plot will show randomly distributed residuals around 0. The variance is also
approximately constant. (b) The residuals will show a fan shape, with higher variability for
smaller 𝑥. There will also be many points on the right above the line. There is trouble with the
model being fit here.
3. (a) Strong relationship, but a straight line would not fit the data. (b) Strong relationship, and a
linear fit would be reasonable. (c) Weak relationship, and trying a linear fit would be reasonable.
(d) Moderate relationship, but a straight line would not fit the data. (e) Strong relationship,
and a linear fit would be reasonable. (f) Weak relationship, and trying a linear fit would be
reasonable.
5. (a) Exam 2 since there is less of a scatter in the plot of course grade versus exam 2. Notice
that the relationship between Exam 1 and the course grade appears to be slightly nonlinear. (b)
(Answers may vary.) If Exam 2 is cumulative it might be a better indicator of how a student is
doing in the class.
7. (a) 𝑟 = −0.7 → (4). (b) 𝑟 = 0.45 → (3). (c) 𝑟 = 0.06 → (1). (d) 𝑟 = 0.92 → (2).
9. (a) There is a moderate, positive, and linear relationship between shoulder girth and height. (b)
Changing the units, even if just for one of the variables, will not change the form, direction or
strength of the relationship between the two variables.
11. (a) There is a somewhat weak, positive, possibly linear relationship between the distance traveled
and travel time. There is clustering near the lower left corner that we should take special note of.
(b) Changing the units will not change the form, direction or strength of the relationship between
the two variables. If longer distances measured in miles are associated with longer travel time
measured in minutes, longer distances measured in kilometers will be associated with longer
travel time measured in hours. (c) Changing units doesn’t affect correlation: 𝑟 = 0.636.
13. We can write the amount of carbohydrate consumption as an exact linear function of the amount of
meat consumption. (a) \(\text{carbs} = \text{meat} - 3\). (b) \(\text{carbs} = \text{meat} + 2\). (c) \(\text{carbs} = 2 \times \text{meat}\).
Since the slopes are positive and these are perfect linear relationships, the correlation will be
exactly 1 in all three parts. An alternative way to gain insight into this solution is to create a
mock dataset, e.g., 5 countries with meat consumption of 10, 20, 50, 75, and 100 kg per capita,
find the related carbohydrate consumption for each mock country, then create a scatterplot.
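The mock dataset suggested in 13 can be sketched in Python. The five meat consumption values are the ones proposed in the answer; the helper `correlation` and the dictionary of relationships are ours.

```python
import math


def correlation(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


meat = [10, 20, 50, 75, 100]  # mock meat consumption, kg per capita

# The three exact linear relationships from parts (a), (b), and (c).
relationships = {
    "carbs = meat - 3": lambda m: m - 3,
    "carbs = meat + 2": lambda m: m + 2,
    "carbs = 2 * meat": lambda m: 2 * m,
}

correlations = {}
for label, rule in relationships.items():
    carbs = [rule(m) for m in meat]
    correlations[label] = correlation(meat, carbs)
    print(f"{label}: r = {correlations[label]:.3f}")
```

Shifting or rescaling by a positive constant changes the units of the variable but not the strength or direction of the linear relationship.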
15. Correlation: no units. Intercept: cal. Slope: cal/cm.
17. Over-estimate. Since the residual is calculated as 𝑜𝑏𝑠𝑒𝑟𝑣𝑒𝑑 − 𝑝𝑟𝑒𝑑𝑖𝑐𝑡𝑒𝑑, a negative residual
means that the predicted value is higher than the observed value.
19. (a) There is a positive, moderate, linear association between number of calories and amount
of carbohydrates. In addition, the amount of carbohydrates is more variable for menu items
with higher calories, indicating non-constant variance. There also appear to be two clusters of
data: a patch of about a dozen observations in the lower left and a larger patch on the right
side. (b) Explanatory: number of calories. Response: amount of carbohydrates (in grams).
(c) With a regression line, we can predict the amount of carbohydrates for a given number of
calories. This may be useful if only calorie counts for the food items are posted but the amount
of carbohydrates in each food item is not readily available. (d) Food menu items with higher
predicted protein are predicted with higher variability than those without, suggesting that the
model is doing a better job predicting protein amount for food menu items with lower predicted
proteins.
21. (a) First calculate the slope: \(b_1 = R \times s_y / s_x = 0.636 \times 113/99 = 0.726\). Next, make use of
the fact that the regression line passes through the point \((\bar{x}, \bar{y})\): \(\bar{y} = b_0 + b_1 \times \bar{x}\). Plug in \(\bar{x}\), \(\bar{y}\),
and \(b_1\), and solve for \(b_0\): 51. Solution: \(\widehat{\text{travel time}} = 51 + 0.726 \times \text{distance}\). (b) \(b_1\): For each
additional mile in distance, the model predicts an additional 0.726 minutes in travel time. \(b_0\):
When the distance travelled is 0 miles, the travel time is expected to be 51 minutes. It does not
make sense to have a travel distance of 0 miles in this context. Here, the \(y\)-intercept serves only
to adjust the height of the line and is meaningless by itself. (c) \(R^2 = 0.636^2 = 0.40\). About 40%
of the variability in travel time is accounted for by the model, i.e., explained by the distance
travelled. (d) \(\widehat{\text{travel time}} = 51 + 0.726 \times \text{distance} = 51 + 0.726 \times 103 \approx 126\) minutes. (Note: we
should be cautious in our predictions with this model since we have not yet evaluated whether
it is a well-fit model.) (e) \(e_i = y_i - \hat{y}_i = 168 - 126 = 42\) minutes. A positive residual means that
the model underestimates the travel time. (f) No, this calculation would require extrapolation.
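The arithmetic in 21 can be retraced in Python from the summary statistics quoted in the answer. The intercept of 51 is taken as reported, because the means needed to derive it are not restated here. Following the answer, the slope is rounded to three decimals and the prediction to the nearest minute before the residual is computed. The variable names are ours.

```python
r, s_y, s_x = 0.636, 113, 99  # correlation and standard deviations from part (a)

b1 = round(r * s_y / s_x, 3)  # slope, rounded as in the answer
b0 = 51                       # intercept as reported in the answer

r_squared = r ** 2            # share of variability explained, part (c)

distance = 103                          # miles, part (d)
predicted = round(b0 + b1 * distance)   # predicted travel time in minutes

observed = 168                          # minutes, part (e)
residual = observed - predicted         # positive: the model underestimates

print(b1, round(r_squared, 2), predicted, residual)
```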
23. (a) \(\widehat{\text{poverty}} = 4.60 + 2.05 \times \text{unemployment\_rate}\). (b) The model predicts a poverty rate of 4.60%
for counties with 0% unemployment, on average. This is not a meaningful value as no counties
have such low unemployment; it just serves to adjust the height of the regression line. (c) For
each additional percentage point increase in unemployment rate, poverty rate is predicted to be higher,
on average, by 2.05%. (d) Unemployment rate explains 46% of the variability in poverty levels
in US counties. (e) \(\sqrt{0.46} = 0.678\).
25. (a) There is an outlier in the bottom right. Since it is far from the center of the data, it is a point
with high leverage. It is also an influential point since, without that observation, the regression
line would have a very different slope. (b) There is an outlier in the bottom right. Since it is
far from the center of the data, it is a point with high leverage. However, it does not appear to
be affecting the line much, so it is not an influential point. (c) The observation is in the center
of the data (in the x-axis direction), so this point does not have high leverage. This means the
point won’t have much effect on the slope of the line and so is not an influential point.
27. (a) There is a negative, moderate-to-strong, somewhat linear relationship between percent of
families who own their home and the percent of the population living in urban areas in 2010.
There is one outlier: a state where 100% of the population is urban. The variability in the
percent of homeownership also increases as we move from left to right in the plot. (b) The
outlier is located in the bottom right corner, horizontally far from the center of the other points,
so it is a point with high leverage. It is an influential point since excluding this point from the
analysis would greatly affect the slope of the regression line.
29. (a) True. (b) False, correlation is a measure of the linear association between any two numerical
variables.
31. (a) 𝑟 = 0.7 → (1) (b) 𝑟 = 0.09 → (4) (c) 𝑟 = −0.91 → (2) (d) 𝑟 = 0.96 → (3).
A.8 Chapter 8
1. Annika is right. All variables being highly correlated, including the predictor variables being
highly correlated with each other, is not desirable as this would result in multicollinearity.
3. (a) The association between meat consumption and life expectancy is positive, moderate, and
curved. (b) While tempting to say that eating meat may lead to a longer life expectancy, we
do not have any sense of why the variables are associated. We are better off thinking that the
countries with high meat consumption and high life expectancy are similar in many other ways
(e.g., income bracket). (c) Within an income bracket, the relationship between meat consumption
and life expectancy is not nearly as strong (as compared to when the data are aggregated into
one plot).
5. No, they shouldn’t include all variables as days_since_start and days_since_race are perfectly
correlated with each other. They should only include one of them.
7. (a) \(\widehat{weight} = 7.270 - 0.593 \times habit_{smoker}\). (b) The estimated body weight of babies born
to smoking mothers is 0.593 pounds lower than those who are born to non-smoking mothers.
Smoker: \(\widehat{weight} = 7.270 - 0.593 \times 1 = 6.68\) pounds. Non-smoker: \(\widehat{weight} = 7.270 - 0.593 \times 0 = 7.270\) pounds.
9. (a) Horror movies. (b) Not necessarily, the change in adjusted \(R^2\) is quite small.
11. (a) \(\widehat{weight} = -3.82 + 0.26 \times weeks + 0.02 \times mage + 0.37 \times sex_{male} + 0.02 \times visits - 0.43 \times habit_{smoker}\).
(b) \(b_{weeks}\): The model predicts a 0.26 pound increase in the birth weight of the baby for each
additional week in length of pregnancy, all else held constant. \(b_{habit_{smoker}}\): The model predicts a
0.43 pound decrease in the birth weight of the babies born to smoker mothers compared to non-smokers,
all else held constant. (c) Habit might be correlated with one of the other variables in
the model, which introduces multicollinearity and complicates model estimation. (d) −0.17 lbs.
13. Remove gained.
15. Add weeks.
A.9 Chapter 9
1. (a) False. The line is fit to predict the probability of success, not the binary outcome. (b)
False. Residuals are not used in logistic regression like they are in linear regression because the
observed value is always either zero or one (and the predicted value is a probability). The goal of
the logistic regression is not to get a perfect prediction (of zero or one), so minimizing residuals
is not part of the modeling process. (c) True.
3. (a) There are a few potential outliers, e.g., on the left in the variable total length, but nothing
that will be of serious concern in a dataset this large. (b) When coefficient estimates are sensitive
to which variables are included in the model, this typically indicates that some variables are
collinear. For example, a possum’s gender may be related to its head length, which would explain
why the coefficient for sex changed when we removed the variable. Likewise, a possum’s skull
width is likely to be related to its head length and probably even much more closely related than
the head length was to gender.
5. (a) The logistic model relating \(\hat{p}\) to the predictors may be written as \(\log\left(\frac{\hat{p}}{1-\hat{p}}\right) = 33.5095 - 1.4207 \times sex_{male} - 0.2787 \times skull\_w + 0.5687 \times total\_l - 1.8057 \times tail\_l\). Only total_l has a
positive association with a possum being from Victoria. (b) 𝑝̂ = 0.0062. While the probability
is very near zero, we have not run diagnostics on the model. We might also be a little skeptical
that the model will remain accurate for a possum found in a US zoo. For example, perhaps
the zoo selected a possum with specific characteristics but only looked in one region. On the
other hand, it is encouraging that the possum was caught in the wild. (Answers regarding the
reliability of the model probability will vary.)
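To see how a fitted log-odds equation turns into a probability, the sketch below plugs values into the model from part (a) and applies the inverse logit. The possum measurements are assumptions made for illustration (the exercise statement is not reproduced in this appendix); with these assumed inputs the result lands close to the probability reported in part (b).

```python
import math

# Coefficients taken from the fitted model in part (a)
coef = {
    "intercept": 33.5095,
    "sex_male": -1.4207,
    "skull_w": -0.2787,
    "total_l": 0.5687,
    "tail_l": -1.8057,
}

# ASSUMED measurements for the possum in part (b); hypothetical inputs
possum = {"sex_male": 1, "skull_w": 63, "total_l": 83, "tail_l": 37}

# Linear predictor on the log-odds scale
log_odds = coef["intercept"] + sum(coef[name] * value for name, value in possum.items())

# Inverse logit converts log-odds to a probability
p_hat = 1 / (1 + math.exp(-log_odds))

print(f"log-odds: {log_odds:.4f}")
print(f"estimated probability: {p_hat:.4f}")
```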
7. (a) The variable exclaim_subj should be removed, since its removal reduces AIC the most (and
the resulting model has lower AIC than the None Dropped model). (b) The variable cc should
be removed. (c) Removing any variable will increase AIC, so we should not remove any variables
from this set.
9. (a) The AIC is smallest using the variables sex, head_l, skull_w, total_l, and tail_l to
predict region (AIC = 83.52), so we would choose that model. (b) If the metric is equivalent
across two models with different numbers of variables, we usually want the model with the smaller
number of variables. This principle is sometimes referred to as Occam’s razor: the simplest explanation is often
the one that will generalize most effectively.
A.10 Chapter 10
A.11 Chapter 11
1. (a) Mean. Each student reports a numerical value: a number of hours. (b) Mean. Each
student reports a number, which is a percentage, and we can average over these percentages.
(c) Proportion. Each student reports Yes or No, so this is a categorical variable and we use a
proportion. (d) Mean. Each student reports a number, which is a percentage like in part (b).
(e) Proportion. Each student reports whether s/he expects to get a job, so this is a categorical
variable and we use a proportion.
3. (a) Alternative. (b) Null. (c) Alternative. (d) Alternative. (e) Null. (f) Alternative. (g) Null.
5. (a) 𝐻0 ∶ 𝜇 = 8 (On average, New Yorkers sleep 8 hours a night.) 𝐻𝐴 ∶ 𝜇 < 8 (On average, New
Yorkers sleep less than 8 hours a night.) (b) 𝐻0 ∶ 𝜇 = 15 (The average amount of company
time each employee spends not working is 15 minutes for March Madness.) 𝐻𝐴 ∶ 𝜇 > 15 (The
average amount of company time each employee spends not working is greater than 15 minutes
for March Madness.)
7. (a) (i) False. Instead of comparing counts, we should compare percentages of people in each
group who suffered cardiovascular problems. (ii) True. (iii) False. Association does not imply
causation. We cannot infer a causal relationship based on an observational study. The difference
from part (ii) is subtle. (iv) True. (b) Proportion of all patients who had cardiovascular problems:
\(\frac{7{,}979}{227{,}571} \approx 0.035\). (c) The expected number of heart attacks in the Rosiglitazone group, if
having cardiovascular problems and treatment were independent, can be calculated as the number
of patients in that group multiplied by the overall cardiovascular problem rate in the study:
\(67{,}593 \times \frac{7{,}979}{227{,}571} \approx 2370\). (d) (i) 𝐻0 : The treatment and cardiovascular problems are independent.
They have no relationship, and the difference in incidence rates between the Rosiglitazone and
Pioglitazone groups is due to chance. 𝐻𝐴 : The treatment and cardiovascular problems are not
independent. The difference in the incidence rates between the Rosiglitazone and Pioglitazone
groups is not due to chance and Rosiglitazone is associated with an increased risk of serious
cardiovascular problems. (ii) A higher number of patients with cardiovascular problems than
expected under the assumption of independence would provide support for the alternative hy-
pothesis as this would suggest that Rosiglitazone increases the risk of such problems. (iii) In
the actual study, we observed 2,593 cardiovascular events in the Rosiglitazone group. In the
100 simulations under the independence model, the simulated differences were never so high,
which suggests that the actual results did not come from the independence model. That is, the
variables do not appear to be independent, and we reject the independence model in favor of
the alternative. The study’s results provide convincing evidence that Rosiglitazone is associated
with an increased risk of cardiovascular problems.
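The overall rate in part (b) and the expected count in part (c) are simple enough to check directly. The following is a minimal sketch with illustrative variable names.

```python
# Counts quoted in parts (b) and (c)
total_patients = 227_571
total_events = 7_979             # patients with cardiovascular problems
rosiglitazone_patients = 67_593

# Overall cardiovascular problem rate across both treatment groups
overall_rate = total_events / total_patients

# Expected number of events in the Rosiglitazone group if treatment and
# outcome were independent
expected_events = rosiglitazone_patients * overall_rate

print(f"overall rate: {overall_rate:.3f}")
print(f"expected events under independence: {expected_events:.0f}")
```

Comparing this expected count with the 2,593 events actually observed is what motivates the alternative hypothesis in part (d).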
A.12 Chapter 12
1. (a) The statistic is the sample proportion (0.289); the parameter is the population proportion
(unknown). (b) 𝑝̂ and 𝑝. (c) Bootstrap sample proportion. (d) 0.289. (e) Roughly (0.22, 0.35).
(f) We can be 90% confident that between 22% and 35% of all YouTube videos take place
outdoors.
3. With 98% confidence, the true proportion of all US adults (in 2022) who get news from social
media sometimes or often is between 0.487 and 0.51.
5. (a) A or perhaps D. (b) A, B, C, or D. (c) B or C. (d) B. (e) None.
7. (a) This claim is reasonable, since the entire interval lies above 50%. (b) The value of 70% lies
outside of the interval, so we have convincing evidence that the researcher’s conjecture is wrong.
(c) A 90% confidence interval will be narrower than a 95% confidence interval. Even without
calculating the interval, we can tell that 70% would not fall in the interval, and we would reject
the researcher’s conjecture based on a 90% confidence level as well.
A.13 Chapter 13
1. (a) 0.089 (b) 0.069 (c) 0.589 (d) \(P(|Z| > 2) = P(Z < -2) + P(Z > 2) = 0.046\)
3. (a) Verbal: \(N(\mu = 151, \sigma = 7)\), Quant: \(N(\mu = 153, \sigma = 7.67)\). (b) \(Z_{VR} = 1.29\), \(Z_{QR} = 0.52\). (c)
She scored 1.29 standard deviations above the mean on the Verbal Reasoning section and 0.52
standard deviations above the mean on the Quantitative Reasoning section.
(d) She did better on the Verbal Reasoning section since her Z score on that section was higher.
(e) \(Perc_{VR} = 0.9007 \approx 90\%\), \(Perc_{QR} = 0.6990 \approx 70\%\). (f) 100% − 90% = 10% did better than
her on VR, and 100% − 70% = 30% did better than her on QR. (g) We cannot compare the raw
scores since they are on different scales. Comparing her percentile scores is more appropriate
when comparing her performance to others. (h) Answer to part (b) would not change as Z scores
can be calculated for distributions that are not normal. However, we could not answer parts
(d)-(f) since we cannot use the normal probability table to calculate probabilities and percentiles
without a normal model.
5. (a) 𝑍 = 0.84, which corresponds to approximately 159 on QR. (b) 𝑍 = −0.52, which corresponds
to approximately 147 on VR.
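The Z scores, percentiles, and score cutoffs above can be checked with the `statistics.NormalDist` class from the Python standard library. This sketch assumes the normal models stated for the two sections and starts from the Z values quoted in the solutions rather than from raw scores.

```python
from statistics import NormalDist

# Normal models for the two sections, as stated in the solutions
verbal = NormalDist(mu=151, sigma=7)
quant = NormalDist(mu=153, sigma=7.67)
std_normal = NormalDist()  # mean 0, standard deviation 1

# Percentiles corresponding to the reported Z scores
perc_vr = std_normal.cdf(1.29)
perc_qr = std_normal.cdf(0.52)

# Converting a Z score back to a raw score: score = mean + Z * sd
cutoff_qr = quant.mean + 0.84 * quant.stdev
cutoff_vr = verbal.mean - 0.52 * verbal.stdev

print(f"VR percentile: {perc_vr:.2f}, QR percentile: {perc_qr:.2f}")
print(f"QR cutoff: {cutoff_qr:.0f}, VR cutoff: {cutoff_vr:.0f}")
```

The percentiles differ in the third decimal place from the values in the text because the Z scores used here are already rounded to two decimals.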
A.14 Chapter 14
A.15 Chapter 15
A.16 Chapter 16
1. First, the hypotheses should be about the population proportion (𝑝), not the sample proportion.
Second, the null value should be what we are testing (0.25), not the observed value (0.29). The
correct way to set up these hypotheses is: 𝐻0 ∶ 𝑝 = 0.25 and 𝐻𝐴 ∶ 𝑝 > 0.25.
3. (a) 𝐻0 ∶ 𝑝 = 0.20, 𝐻𝐴 ∶ 𝑝 > 0.20. (b) 𝑝̂ = 159/650 = 0.245. (c) Answers will vary. Each
student can be represented with a card. Take 100 cards, 20 black cards representing those who
support proposals to defund police departments and 80 red cards representing those who do not.
Shuffle the cards and draw with replacement (shuffling each time in between draws) 650 cards
representing the 650 respondents to the poll. Calculate the proportion of black cards in this
sample, \(\hat{p}_{sim}\), i.e., the proportion of those who support proposals to defund police departments.
The p-value will be the proportion of simulations where \(\hat{p}_{sim} \geq 0.245\). (Note: We would generally
use a computer to perform the simulations.) (d) There is only one simulated proportion that
is at least 0.245, therefore the approximate p-value is 0.001. Your p-value may vary slightly
since it is based on a visual estimate. Since the p-value is smaller than 0.05, we reject 𝐻0 . The
data provide convincing evidence that the proportion of Seattle adults who support proposals
to defund police departments is greater than 0.20, i.e., more than one in five.
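The card-drawing procedure in part (c) can be mimicked on a computer. The sketch below is one possible version, assuming 1,000 simulations and an arbitrary fixed seed (both are choices made here, not taken from the text). It counts simulated supporters out of 650 draws under the null value of 0.20 and reports how often the count is at least the observed 159.

```python
import random

rng = random.Random(2024)   # arbitrary seed so the sketch is repeatable
n = 650                     # respondents in the poll
p_null = 0.20               # null value of the proportion
observed_count = 159        # observed supporters (159/650 = 0.245)
n_sims = 1000               # assumed number of simulations

def simulate_count(rng, n, p):
    """Number of 'black cards' in n draws with replacement, each with chance p."""
    return sum(rng.random() < p for _ in range(n))

sim_counts = [simulate_count(rng, n, p_null) for _ in range(n_sims)]

# One-sided p-value: share of simulations at least as extreme as the data
p_value = sum(count >= observed_count for count in sim_counts) / n_sims

print(f"approximate p-value: {p_value}")
```

Because the result depends on the random draws, the p-value will vary a little from run to run and seed to seed, but it should stay well below 0.05.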
5. (a) 𝐻0 ∶ 𝑝 = 0.5, 𝐻𝐴 ∶ 𝑝 ≠ 0.5. (b) The p-value is roughly 0.4. There is not evidence in the
data (possibly because there are only 7 cats being measured!) to conclude that the cats have a
preference one way or the other between the two shapes.
7. (a) \(SE(\hat{p}) = 0.189\). (b) Roughly 0.188. (c) Yes. (d) No. (e) The draws from the null hypothesis
are discrete (only a few distinct options) and the mathematical model is continuous (infinite
options on a continuum).
9. (a) The null hypothesis simulation was done with 𝑝 = 0.7, and the data bootstrap simulation
was done with 𝑝 = 0.6. (b) The null hypothesis simulation is centered at 0.7; the data bootstrap
is centered at 0.6. (c) The standard error of the sample proportion is given to be roughly 0.1 for
both histograms. (d) Both histograms are reasonably symmetric. Note that histograms which
describe the variability of proportions become more skewed as the center of the distribution
gets closer to 1 (or zero) because the boundary of 1.0 restricts the symmetry of the tail of the
distribution. For this reason, the null hypothesis simulation histogram is slightly more skewed
(left).
11. (a) The null hypothesis simulation distribution for testing. The data bootstrap distribution for
confidence intervals. (b) 𝐻0 ∶ 𝑝 = 0.7; 𝐻𝐴 ∶ 𝑝 ≠ 0.7. p-value > 0.05. There is no evidence that
the proportion of full-time statistics majors who work is different from 70%. (c) We are 98%
confident that the true proportion of all full-time student statistics majors who work at least 5
hours per week is between 35% and 80%. (d) Using 𝑧 ⋆ = 2.33, the 98% confidence interval is
0.367 to 0.833.
13. (a) False. Doesn’t satisfy success-failure condition. (b) True. The success-failure condition
is not satisfied. In most samples we would expect 𝑝̂ to be close to 0.08, the true population
proportion. While 𝑝̂ can be much above 0.08, it is bound below by 0, suggesting it would
take on a right skewed shape. Plotting the sampling distribution would confirm this suspicion.
(c) False. \(SE_{\hat{p}} = 0.0243\), and \(\hat{p} = 0.12\) is only \(\frac{0.12 - 0.08}{0.0243} = 1.65\) SEs away from the mean, which
would not be considered unusual. (d) True. \(\hat{p} = 0.12\) is 2.32 standard errors away from the
mean, which is often considered unusual. (e) False. Decreases the SE by a factor of \(1/\sqrt{2}\).
15. (a) True. See the reasoning of 6.1(b). (b) True. We take the square root of the sample size in the
SE formula. (c) True. The independence and success-failure conditions are satisfied. (d) True.
The independence and success-failure conditions are satisfied.
17. (a) False. A confidence interval is constructed to estimate the population proportion, not the
sample proportion. (b) True. 95% CI: 82% ± 2%. (c) True. By the definition of the confidence
level. (d) True. Quadrupling the sample size decreases the SE and ME by a factor of \(1/\sqrt{4}\).
(e) True. The 95% CI is entirely above 50%.
19. With a random sample, independence is satisfied. The success-failure condition is also satisfied.
\(ME = z^\star \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = 1.96 \sqrt{\frac{0.56 \times 0.44}{600}} = 0.0397 \approx 4\%\).
21. (a) No. The sample only represents students who took the SAT, and this was also an online
survey. (b) (0.5289, 0.5711). We are 90% confident that 53% to 57% of high school seniors who
took the SAT are fairly certain that they will participate in a study abroad program in college.
(c) 90% of such random samples would produce a 90% confidence interval that includes the true
A.17 Chapter 17
5. (a) While the standard errors of the difference in proportion across the two graphs are roughly the
same (approximately 0.012), the centers are not. Computational method A is centered at 0.07
(the difference in the observed sample proportions) and Computational method B is centered
at 0. (b) What is the difference between the proportions of Bachelor’s and Associate’s students
who believe that the COVID-19 pandemic will negatively impact their ability to complete the
degree? (c) Is the proportion of Bachelor’s students who believe that their ability to complete the
degree will be negatively impacted by the COVID-19 pandemic different than that of Associate’s
students?
7. (a) 26 Yes and 94 No in Nevaripine and 10 Yes and 110 No in Lopinavir group. (b) 𝐻0 ∶ 𝑝𝑁 = 𝑝𝐿 .
There is no difference in virologic failure rates between the Nevaripine and Lopinavir groups.
𝐻𝐴 ∶ 𝑝𝑁 ≠ 𝑝𝐿 . There is some difference in virologic failure rates between the Nevaripine and
Lopinavir groups. (c) Random assignment was used, so the observations in each group are in-
dependent. If the patients in the study are representative of those in the general population
(something impossible to check with the given information), then we can also confidently gener-
alize the findings to the population. The success-failure condition, which we would check using
the pooled proportion (\(\hat{p}_{pool} = 36/240 = 0.15\)), is satisfied. 𝑍 = 2.89 → p-value = 0.0039. Since
the p-value is low, we reject 𝐻0 . There is strong evidence of a difference in virologic failure rates
between the Nevaripine and Lopinavir groups. Treatment and virologic failure do not appear to
be independent.
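The test statistic in this solution comes from a two-proportion Z-test with a pooled standard error. The sketch below reproduces it from the counts in part (a), using `statistics.NormalDist` for the tail area; the variable names are illustrative.

```python
import math
from statistics import NormalDist

# Counts from part (a): virologic failures and group sizes
fail_n, size_n = 26, 120   # Nevaripine
fail_l, size_l = 10, 120   # Lopinavir

p_n = fail_n / size_n
p_l = fail_l / size_l

# Pooled proportion, used because H0 states the two proportions are equal
p_pool = (fail_n + fail_l) / (size_n + size_l)

se = math.sqrt(p_pool * (1 - p_pool) * (1 / size_n + 1 / size_l))
z = (p_n - p_l) / se

# Two-sided p-value from the standard normal distribution
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"pooled proportion: {p_pool:.2f}")
print(f"Z = {z:.2f}, p-value = {p_value:.4f}")
```

The p-value computed this way may differ from the reported 0.0039 in the last digit, since the text works from a Z score rounded to two decimals and a probability table.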
9. (a) Standard error: \(SE = \sqrt{\frac{0.79(1-0.79)}{347} + \frac{0.55(1-0.55)}{617}} = 0.03\). Using \(z^\star = 1.96\), we get:
\(0.79 - 0.55 \pm 1.96 \times 0.03 \rightarrow (0.181, 0.299)\). We are 95% confident that the proportion of Democrats who
support the plan is 18.1% to 29.9% higher than the proportion of Independents who support
the plan. (b) True.
11. (a) In effect, we’re checking whether men are paid more than women (or vice-versa), and we’d
expect these outcomes with either chance under the null hypothesis: 𝐻0 ∶ 𝑝 = 0.5 and 𝐻𝐴 ∶
𝑝 ≠ 0.5. We’ll use 𝑝 to represent the fraction of cases where men are paid more than women.
(b) There isn’t a good way to check independence here since the jobs are not a simple random
sample. However, independence doesn’t seem unreasonable, since the individuals in each job
are different from each other. The success-failure condition is met since we check it using the
null proportion: 𝑝0 𝑛 = (1 − 𝑝0 )𝑛 = 10.5 is greater than 10. We can compute the sample
proportion, 𝑆𝐸, and test statistic: \(\hat{p} = 19/21 = 0.905\) and \(SE = \sqrt{\frac{0.5 \times (1-0.5)}{21}} = 0.109\) and
\(Z = \frac{0.905 - 0.5}{0.109} = 3.72\). The test statistic 𝑍 corresponds to an upper tail area of about 0.0001, so
the p-value is 2 times this value: 0.0002. Because the p-value is smaller than 0.05, we reject the
notion that all these gender pay disparities are due to chance. Because we observe that men are
paid more in a higher proportion of cases and we have rejected 𝐻0 , we can conclude that men
are being paid higher amounts in ways not explainable by chance alone. If you’re curious for
more info around this topic, including a discussion about adjusting for additional factors that
affect pay, please see the following video by Healthcare Triage: youtu.be/aVhgKSULNQA.
13. Before we can calculate a confidence interval, we must first check that the conditions are met.
There aren’t at least 10 successes and 10 failures in each of the four groups (treatment/control
and yawn/not yawn), so \(\hat{p}_C - \hat{p}_T\) is not expected to be approximately normal. Therefore, we cannot
calculate a confidence interval for the difference between the proportions of participants who
yawned in the treatment and control groups using large sample techniques and a critical Z score.
15. (a) False. The confidence interval includes 0. (b) False. We are 95% confident that 16% fewer
to 2% more Americans who make less than $40,000 per year are not at all personally affected by the
government shutdown compared to those who make $40,000 or more per year. (c) False. As the
confidence level decreases the width of the confidence interval decreases as well. (d) True.
17. (a) Type I. (b) Type II. (c) Type II.
19. No. The samples at the beginning and at the end of the semester are not independent since the
survey is conducted on the same students.
21. (a) The proportion of the normal curve centered at -0.1 with a standard deviation of 0.15 that
is less than -2 * standard error is 0.09. (b) The proportion of the normal curve centered at -0.4
with a standard deviation of 0.145 that is less than -2 * standard error is 0.78. (c) The proportion
of the normal curve centered at -0.1 with a standard deviation of 0.0671 that is less than -2 *
standard error is 0.31. (d) The proportion of the normal curve centered at -0.4 with a standard
deviation of 0.0678 that is less than -2 * standard error is 1. (e) The larger the value of 𝛿 and
the larger the sample size, the more likely that the future study will lead to sample proportions
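Each of the four areas in parts (a) through (d) is the share of a normal curve that falls below a cutoff. The sketch below assumes the cutoff is −2 standard errors in every part, as stated in part (a), and uses the centers and standard deviations quoted in the solution; the function name is hypothetical.

```python
from statistics import NormalDist

def prob_below_cutoff(center, se):
    """Area below -2 * se under a normal curve with the given center and sd."""
    return NormalDist(mu=center, sigma=se).cdf(-2 * se)

# (center, standard error) pairs quoted in parts (a) through (d)
scenarios = {
    "a": (-0.1, 0.15),
    "b": (-0.4, 0.145),
    "c": (-0.1, 0.0671),
    "d": (-0.4, 0.0678),
}

power = {part: prob_below_cutoff(center, se) for part, (center, se) in scenarios.items()}

for part, value in power.items():
    print(f"({part}) {value:.2f}")
```

Under this assumption the four areas agree with the reported values, and the pattern is visible directly: a center further from zero or a smaller standard error pushes more of the curve past the cutoff.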
A.18 Chapter 18
1. (a) Two-way table is shown below. (b-i) \(E_{row_1, col_1} = \frac{(row\ 1\ total) \times (col\ 1\ total)}{table\ total} = 35\). This is lower
than the observed value. (b-ii) \(E_{row_2, col_2} = \frac{(row\ 2\ total) \times (col\ 2\ total)}{table\ total} = 115\). This is lower than
the observed value.
                                Quit
Treatment                  Yes     No    Total
Patch + support group       40    110      150
Only patch                  30    120      150
Total                       70    230      300
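The expected counts in parts (b-i) and (b-ii) follow from the row and column totals of the table. A minimal sketch, with the observed counts typed in from the table and illustrative variable names:

```python
# Observed counts from the two-way table (rows: treatment, columns: quit)
observed = {
    "Patch + support group": {"Yes": 40, "No": 110},
    "Only patch": {"Yes": 30, "No": 120},
}

row_totals = {row: sum(counts.values()) for row, counts in observed.items()}

col_totals = {}
for counts in observed.values():
    for col, value in counts.items():
        col_totals[col] = col_totals.get(col, 0) + value

table_total = sum(row_totals.values())

# Expected count = (row total) * (column total) / (table total)
expected = {
    row: {col: row_totals[row] * col_totals[col] / table_total for col in col_totals}
    for row in observed
}

for row, counts in expected.items():
    print(row, counts)
```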
3. (a) Sun = 0.343, Partial = 0.325, Shade = 0.331. (b) For each, the numbers are listed in the
order sun, partial, and shade: Desert (40.9, 38.7, 39.4), Mountain (36.7, 34.8, 35.5), Valley (36.4,
34.5, 35.1). (c) Yes. (d) We can’t evaluate the association without a formal test.
5. The original dataset will have a higher Chi-squared statistic than the randomized dataset.
7. (a) The two variables are independent. (b) The randomized Chi-squared values range from
zero to approximately 15. (c) The null hypothesis is that the variables are independent; the
alternative hypothesis is that the variables are associated. The p-value is extremely small. The
habitat provides information about the likelihood of being in the different sunshine states.
9. (a) The two variables are independent. (b) The randomized Chi-squared values range from
zero to approximately 25. (c) The null hypothesis is that the variables are independent; the
alternative hypothesis is that the variables are associated. The p-value is around 0. There is
convincing evidence to claim that site and sunlight preference are associated. (d) With larger
sample sizes, the power (the probability of rejecting 𝐻0 when 𝐻𝐴 is true) is higher.
11. (a) False. The Chi-square distribution has one parameter called degrees of freedom. (b) True.
(c) True. (d) False. As the degrees of freedom increases, the shape of the Chi-square distribution
becomes more symmetric.
13. The hypotheses are 𝐻0 ∶ Sleep levels and profession are independent. 𝐻𝐴 ∶ Sleep levels and
profession are associated. The observations are independent and the sample sizes are large
enough to conduct a Chi-square test of independence. The Chi-square statistic is 1 with 2
degrees of freedom. The p-value is 0.6. Since the p-value is high (default to alpha = 0.05), we
fail to reject 𝐻0 . The data do not provide convincing evidence of an association between sleep
levels and profession.
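For a Chi-square statistic with exactly 2 degrees of freedom, the upper tail area has a closed form, \(e^{-x/2}\), so the p-value above can be checked without statistical software. This shortcut is specific to 2 degrees of freedom; other cases need a Chi-square table or a statistics library.

```python
import math

chi_square_stat = 1.0
df = 2

# Upper tail area of a Chi-square distribution with df = 2 is exp(-x / 2).
# This closed form does NOT apply to other degrees of freedom.
assert df == 2
p_value = math.exp(-chi_square_stat / 2)

alpha = 0.05
reject_null = p_value < alpha

print(f"p-value = {p_value:.2f}, reject H0: {reject_null}")
```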
15. (a) 𝐻0 : The age of Los Angeles residents is independent of shipping carrier preference variable.
𝐻𝐴 : The age of Los Angeles residents is associated with the shipping carrier preference variable.
(b) The conditions are not satisfied since some expected counts are below 5.
A.19 Chapter 19
1. (a) Average sleep of 20 in sample vs. all New Yorkers. (b) Average height of students in study
vs all undergraduates.
3. (a) Use the sample mean to estimate the population mean: 171.1. Likewise, use the sample
median to estimate the population median: 170.3. (b) Use the sample standard deviation (9.4)
and sample IQR (177.8 − 163.8 = 14). (c) 𝑍180 = 0.95 and 𝑍155 = −1.71. Neither of these
observations is more than two standard deviations away from the mean, so neither would be
considered unusual. (d) No, sample point estimates only estimate the population parameter,
and they vary from one sample to another. Therefore we cannot expect to get the same mean
and standard deviation with each random sample. (e) We use the standard error of the mean to
measure the variability in means of random samples of same size taken from a population. The
variability in the means of random samples is quantified by the standard error. Based on this
sample, \(SE_{\bar{x}} = \frac{9.4}{\sqrt{507}} = 0.417\).
5. (a) The kindergartners will have a smaller standard deviation of heights. We would expect
their heights to be more similar to each other compared to a group of adults’ heights. (b) The
standard error of the mean will depend on the variability of individual heights. The standard
error of the adult sample averages will be around \(9.4/\sqrt{100} = 0.94\) cm. The standard error of
the kindergartner sample averages will be smaller.
7. (a) \(df = 6 - 1 = 5\), \(t^\star_{5} = 2.02\). (b) \(df = 21 - 1 = 20\), \(t^\star_{20} = 2.53\). (c) \(df = 28\), \(t^\star_{28} = 2.05\). (d)
\(df = 11\), \(t^\star_{11} = 3.11\).
9. (a) 0.085, do not reject 𝐻0 . (b) 0.003, reject 𝐻0 . (c) 0.438, do not reject 𝐻0 . (d) 0.042, reject
𝐻0 .
11. (a) Roughly 0.1 weeks. (b) Roughly (38.45 weeks, 38.85 weeks). (c) Roughly (38.49 weeks, 38.91
weeks).
13. (a) False (b) False. (c) True. (d) False.
15. The mean is the midpoint: \(\bar{x} = 20\). Identify the margin of error: \(ME = 1.015\), then use
\(t^\star_{35} = 2.03\) and \(SE = s/\sqrt{n}\) in the formula for margin of error to identify \(s = 3\).
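Solving the margin of error formula for the sample standard deviation can be written out step by step. The sketch below assumes a sample size of 36, inferred from the 35 degrees of freedom; the other values come from the solution.

```python
import math

margin_of_error = 1.015
t_star = 2.03   # critical value with 35 degrees of freedom
n = 36          # assumed from df = n - 1 = 35

# ME = t_star * s / sqrt(n), so s = ME * sqrt(n) / t_star
standard_error = margin_of_error / t_star
s = standard_error * math.sqrt(n)

print(f"SE = {standard_error:.3f}, s = {s:.3f}")
```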
17. (a) 𝐻0 : 𝜇 = 8 (New Yorkers sleep 8 hrs per night on average.) 𝐻𝐴 : 𝜇 ≠ 8 (New Yorkers sleep
less or more than 8 hrs per night on average.) (b) Independence: The sample is random. The
min/max suggest there are no concerning outliers. 𝑇 = −1.75. 𝑑𝑓 = 25 − 1 = 24. (c) p-value
= 0.093. If in fact the true population mean of the amount New Yorkers sleep per night was 8
hours, the probability of getting a random sample of 25 New Yorkers where the average amount
of sleep is 7.73 hours per night or less (or 8.27 hours or more) is 0.093. (d) Since p-value > 0.05,
do not reject 𝐻0 . The data do not provide strong evidence that New Yorkers sleep more or less
than 8 hours per night on average. (e) Yes, since we did not reject 𝐻0 .
19. With a larger critical value, the confidence interval ends up being wider. This makes intuitive
sense as when we have a small sample size and the population standard deviation is unknown,
we should have a wider interval than if we knew the population standard deviation, or if we had
a large enough sample size.
21. (a) We will conduct a 1-sample 𝑡-test. 𝐻0 : 𝜇 = 5. 𝐻𝐴 : 𝜇 ≠ 5. We’ll use 𝛼 = 0.05. This is a
random sample, so the observations are independent. To proceed, we assume the distribution
of years of piano lessons is approximately normal. \(SE = 2.2/\sqrt{20} = 0.4919\). The test statistic
is 𝑇 = (4.6 − 5)/𝑆𝐸 = −0.81. 𝑑𝑓 = 20 − 1 = 19. The one-tail area is about 0.21, so the
p-value is about 0.42, which is bigger than 𝛼 = 0.05 and we do not reject 𝐻0 . That is, we do
not have sufficiently strong evidence to reject the notion that the average is 5 years. (b) Using
\(SE = 0.4919\) and \(t^\star_{df=19} = 2.093\), the confidence interval is (3.57, 5.63). We are 95% confident
that the average number of years a child takes piano lessons in this city is 3.57 to 5.63 years.
(c) They agree, since we did not reject the null hypothesis and the null value of 5 was in the
𝑡-interval.
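The standard error, test statistic, and confidence interval in this solution can be reproduced as follows. This is a minimal sketch: the critical value 2.093 is taken from the solution rather than computed, since the Python standard library has no t-distribution.

```python
import math

n = 20
x_bar = 4.6      # sample mean, years of piano lessons
s = 2.2          # sample standard deviation
mu_0 = 5         # null value
t_star = 2.093   # critical value for df = 19, quoted from the solution

se = s / math.sqrt(n)
t_stat = (x_bar - mu_0) / se
df = n - 1

ci_lower = x_bar - t_star * se
ci_upper = x_bar + t_star * se

# The test and the interval agree when the null value lies inside the interval
null_in_interval = ci_lower <= mu_0 <= ci_upper

print(f"SE = {se:.4f}, T = {t_stat:.2f}, df = {df}")
print(f"95% CI: ({ci_lower:.2f}, {ci_upper:.2f}), contains null value: {null_in_interval}")
```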
A.20 Chapter 20
1. The hypotheses should use population means (\(\mu\)), not sample means (\(\bar{x}\)); the null hypothesis
should set the two population means equal to each other; and the alternative hypothesis should be
two-tailed and use a not equal to sign.
3. \(H_0: \mu_{0.99} = \mu_1\) and \(H_A: \mu_{0.99} \neq \mu_1\). p-value < 0.05, reject \(H_0\). The data provide convincing
evidence that the population average prices per carat of 0.99 carat and 1 carat
diamonds are different.
5. (a) We are 95% confident that the population average price per carat of 0.99 carat diamonds is
$2 to $23 lower than the population average price per carat of 1 carat diamonds. (b) We are
95% confident that the population average price per carat of 0.99 carat diamonds is $2.91 to
$21.10 lower than the population average price per carat of 1 carat diamonds.
7. The difference is not zero (statistically discernible), but there is no evidence that the difference
is large (practically important), because the interval provides values as low as 1 lb.
9. \(H_0: \mu_{0.99} = \mu_1\) and \(H_A: \mu_{0.99} \neq \mu_1\). Independence: Both samples are random and represent less
than 10% of their respective populations. Also, we have no reason to think that the 0.99 carats
are not independent of the 1 carat diamonds since they are both sampled randomly. Normality:
The distributions are not extremely skewed, hence we can assume that the distribution of the
average differences will be nearly normal as well. \(T_{22} = -2.7\), p-value = 0.0131. Since the p-value is
less than 0.05, reject 𝐻0 . The data provide convincing evidence that the population
average prices per carat of 0.99 carat and 1 carat diamonds are different.
11. We are 95% confident that the population average price per carat of 0.99 carat diamonds is $2.96
to $22.42 lower than the population average price per carat of 1 carat diamonds.
13. (a) \(\mu_{\bar{x}_1} = 15\), \(\sigma_{\bar{x}_1} = 20/\sqrt{50} = 2.8284\). (b) \(\mu_{\bar{x}_2} = 20\), \(\sigma_{\bar{x}_2} = 10/\sqrt{30} = 1.8257\). (c) \(\mu_{\bar{x}_2 - \bar{x}_1} = 20 - 15 = 5\),
\(\sigma_{\bar{x}_2 - \bar{x}_1} = \sqrt{(20/\sqrt{50})^2 + (10/\sqrt{30})^2} = 3.3665\). (d) Think of \(\bar{x}_1\) and \(\bar{x}_2\) as being
random variables, and we are considering the standard deviation of the difference of these two
random variables, so we square each standard deviation, add them together, and then take the
square root of the sum: \(SD_{\bar{x}_2 - \bar{x}_1} = \sqrt{SD_{\bar{x}_2}^2 + SD_{\bar{x}_1}^2}\).
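The rule described in part (d), squaring each standard deviation, adding, and taking the square root, can be checked against the numbers in parts (a) through (c). A minimal sketch with illustrative variable names:

```python
import math

# Population standard deviations and sample sizes used in parts (a) and (b)
sigma_1, n_1 = 20, 50
sigma_2, n_2 = 10, 30

# Standard deviation of each sample mean
sd_xbar_1 = sigma_1 / math.sqrt(n_1)
sd_xbar_2 = sigma_2 / math.sqrt(n_2)

# Variances add for the difference of independent random variables,
# so the standard deviations combine as a root sum of squares
sd_diff = math.sqrt(sd_xbar_1**2 + sd_xbar_2**2)

print(f"SD of x-bar 1: {sd_xbar_1:.4f}")
print(f"SD of x-bar 2: {sd_xbar_2:.4f}")
print(f"SD of the difference: {sd_diff:.4f}")
```

Note that the result is smaller than the plain sum of the two standard deviations, which is why they cannot simply be added.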
15. (a) Chickens fed linseed weighed an average of 218.75 grams while those fed horsebean weighed an
average of 160.20 grams. Both distributions are relatively symmetric with no apparent outliers.
There is more variability in the weights of chickens fed linseed. (b) \(H_0: \mu_{ls} = \mu_{hb}\). \(H_A: \mu_{ls} \neq \mu_{hb}\).
We leave the conditions to you to consider. 𝑇 = 3.02, 𝑑𝑓 = 𝑚𝑖𝑛(11, 9) = 9 → p-value = 0.014.
Since p-value < 0.05, reject 𝐻0 . The data provide strong evidence that there is a discernible
difference between the average weights of chickens that were fed linseed and horsebean. (c) Type
I error, since we rejected 𝐻0 . (d) Yes, since p-value > 0.01, we would not have rejected 𝐻0 .
17. 𝐻0 ∶ 𝜇𝐶 = 𝜇𝑆 . 𝐻𝐴 ∶ 𝜇𝐶 ≠ 𝜇𝑆 . 𝑇 = 3.27, 𝑑𝑓 = 11 → p-value = 0.007. Since p-value < 0.05,
reject 𝐻0 . The data provide strong evidence that the average weight of chickens that were fed
casein is different than the average weight of chickens that were fed soybean (with weights from
casein being higher). Since this is a randomized experiment, the observed difference can be
attributed to the diet.
19. 𝐻0 ∶ 𝜇𝑇 = 𝜇𝐶 . 𝐻𝐴 ∶ 𝜇𝑇 ≠ 𝜇𝐶 . 𝑇 = 2.24, 𝑑𝑓 = 21 → p-value = 0.036. Since p-value < 0.05,
reject 𝐻0 . The data provide strong evidence that the average food consumption by the patients
in the treatment group differs from that in the control group. Furthermore, the data indicate patients in
the distracted eating (treatment) group consume more food than patients in the control group.
A.21 Chapter 21
1. Paired: data are recorded in the same cities at two different time points. The temperature in a
city at one point is not independent of the temperature in the same city at another time point.
3. (a) Since it’s the same students at the beginning and the end of the semester, there is a pairing
between the datasets; for a given student their beginning and end of semester grades are dependent.
(b) Since the subjects were sampled randomly, each observation in the men’s group does
not have a special correspondence with exactly one observation in the other (women’s) group.
(c) Since it’s the same subjects at the beginning and the end of the study, there is a pairing
between the datasets; for a given subject their beginning and end of study artery thickness
are dependent. (d) Since it’s the same subjects at the beginning and the end of the study, there
is a pairing between the datasets; for a given subject their beginning and end of study
weights are dependent.
5. False. While it is true that paired analysis requires equal sample sizes, only having the equal
sample sizes isn’t, on its own, sufficient for doing a paired test. Paired tests require that there
be a special correspondence between each pair of observations in the two groups.
7. (a) Let \(diff = 2022 - 1950\). Then the hypotheses are \(H_0: \mu_{diff} = 0\) and \(H_A: \mu_{diff} \neq 0\). (b)
The observed average difference is outside the randomized differences. (c) Since the p-value
< 0.05, reject 𝐻0 . There is evidence of a difference between the average 90th percentile high
temperature in 2022 and the average 90th percentile high temperature in 1950.
9. (a) Roughly (1.5∘ F, 3.5∘ F). (b) Roughly (1.5∘ F, 3.56∘ F). (c) We are 90% confident that the
true average of the difference in 90𝑡ℎ percentile high temperature in 2022 vs 1950 is somewhere
between 1.5∘ F and 3.5∘ F. We are 90% confident that the true average of the difference in 90𝑡ℎ
percentile high temperature in 2022 vs 1950 is somewhere between 1.5∘ F and 3.56∘ F. (d) There
is a discernible difference.
11. (a) For each observation in the 1950 dataset, there is exactly one specially corresponding observa-
tion in the 2022 dataset for the same geographic location. The data are paired. (b) 𝐻0 ∶ 𝜇diff = 0
(There is no difference in the 90𝑡ℎ percentile high temperature in 1950 and 2022 for NOAA sta-
tions.) 𝐻𝐴 ∶ 𝜇diff ≠ 0 (There is a difference.) (c) Locations were not randomly sampled across
the geographic region, so we need to be careful concluding independence. However, the question
above describes the data as representative of the land area of the lower 48 states, so independence
is reasonable. The sample size is 26 which is close to 30, so we’re just looking for particularly
extreme outliers: none are present (the observation off to the right in the histogram would be
considered an outlier, but not a particularly extreme one). Therefore, the conditions are
reasonably satisfied. (d) 𝑆𝐸 = 2.95/√26 = 0.579. 𝑇 = (2.53 − 0)/0.579 = 4.37 with degrees of freedom
𝑑𝑓 = 26 − 1 = 25, which leads to a one-tail area of 0.0000954 and a p-value of about 0.0002. (e)
Since the p-value is less than 0.05, we reject 𝐻0 . The data provide strong evidence that NOAA
stations observed a hotter 90𝑡ℎ percentile high temperature in 2022 than in 1950. (f) Type I
error, since we may have incorrectly rejected 𝐻0 . This error would mean that NOAA stations
did not actually observe an increase, but the sample we took just so happened to make it appear
that this was the case. (g) No, since we rejected 𝐻0 , which had a null value of 0.
13. (a) 𝑆𝐸 = 0.579 and 𝑡⋆₂₅ = 1.71. 2.53 ± 1.71 × 0.579 → (1.54∘ F, 3.52∘ F). (b) We are 90% confident
that the true average of the difference in 90𝑡ℎ percentile high temperature in 2022 vs 1950 is
somewhere between 1.54∘ F and 3.52∘ F. (c) Yes, since the interval lies entirely above 0.
15. (a) Each student studies under each condition; use the difference in individual student scores. (b)
Each student studies under one condition; use the difference in averages across the two conditions.
17. (a) 𝐻0 ∶ 𝜇𝑑𝑖𝑓𝑓 = 0. 𝐻𝐴 ∶ 𝜇𝑑𝑖𝑓𝑓 ≠ 0. 𝑇 = −2.71. 𝑑𝑓 = 5. p-value = 0.042. Since p-value <
0.05, reject 𝐻0 . The data provide strong evidence that the average number of traffic accident
related emergency room admissions differs between Friday the 6𝑡ℎ and Friday the 13𝑡ℎ.
Furthermore, the data indicate that the direction of that difference is that accidents are lower on
Friday the 6𝑡ℎ relative to Friday the 13𝑡ℎ. (b) (-6.49, -0.17). (c) This is an observational study,
not an experiment, so we cannot easily infer the causal relationship implied by this statement.
It is true that there is a difference. However, for example, this does not mean that a responsible
adult going out on Friday the 13𝑡ℎ has a higher chance of harm than on any other night.
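The arithmetic in solutions 11(d) and 13(a) can be checked with a short sketch. The function name `paired_t_summary` and its parameter names are hypothetical; the inputs (mean difference 2.53, standard deviation 2.95, 𝑛 = 26, 𝑡⋆ = 1.71) are the values quoted in those solutions. The p-value is not computed here because that would need a 𝑡-distribution routine outside the standard library.

```python
import math

def paired_t_summary(mean_diff, sd_diff, n, t_star, null_value=0.0):
    """Return (SE, T, df, (lower, upper)) for the mean of paired differences."""
    se = sd_diff / math.sqrt(n)              # standard error of the mean difference
    t_stat = (mean_diff - null_value) / se   # test statistic under H0
    margin = t_star * se                     # margin of error for the interval
    return se, t_stat, n - 1, (mean_diff - margin, mean_diff + margin)

# Values quoted in solutions 11(d) and 13(a).
se, t_stat, df, (lower, upper) = paired_t_summary(2.53, 2.95, 26, t_star=1.71)
print(round(se, 3), round(t_stat, 2), df, round(lower, 2), round(upper, 2))
```

The printed values should match the standard error, test statistic, degrees of freedom, and interval endpoints reported above, up to rounding.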
A.22 Chapter 22
1. Alternative.
3. (a) Means across original data are more variable. (b) Standard deviations of egg lengths are
about the same for both plots. (c) F statistic is bigger for the original data.
5. 𝐻0 : 𝜇1 = 𝜇2 = ⋯ = 𝜇6 . 𝐻𝐴 : The average weight varies across some (or all) groups. In-
dependence: Chicks are randomly assigned to feed types (presumably kept separate from one
another), therefore independence of observations is reasonable. Approx. normal: the distribu-
tions of weights within each feed type appear to be fairly symmetric. Constant variance: Based
on the side-by-side box plots, the constant variance assumption appears to be reasonable. There
are differences in the actual computed standard deviations, but these might be due to chance
as these are quite small samples. 𝐹5,65 = 15.36 and the p-value is approximately 0. With such
a small p-value, we reject 𝐻0 . The data provide convincing evidence that the average weight of
chicks varies across some (or all) feed supplement groups.
7. (a) 𝐻0 : The population mean of MET for each group is equal to the others. 𝐻𝐴 : At least one
pair of means is different. (b) Independence: We don’t have any information on how the data
were collected, so we cannot assess independence. To proceed, we must assume the subjects in
each group are independent. In practice, we would inquire for more details. Normality: The
data are bound below by zero and the standard deviations are larger than the means, indicating
very strong skew. However, since the sample sizes are extremely large, even extreme skew is
acceptable. Constant variance: This condition is sufficiently met, as the standard deviations are
reasonably consistent across groups. (c) Since p-value is very small, reject 𝐻0 . The data provide
convincing evidence that the average MET differs between at least one pair of groups.
9. (a) 𝐻0 : Average GPA is the same for all majors. 𝐻𝐴 : At least one pair of means are different.
(b) Since p-value > 0.05, fail to reject 𝐻0 . The data do not provide convincing evidence of a
difference between the average GPAs across three groups of majors. (c) The total degrees of
freedom is 195 + 2 = 197, so the sample size is 197 + 1 = 198.
11. (a) False. As the number of groups increases, so does the number of comparisons and hence the
modified discernibility level decreases. (b) True. (c) True. (d) False. We need observations to
A.23 Chapter 23
A.24 Chapter 24
1. (a) 𝐻0 ∶ 𝛽1 = 0, 𝐻𝐴 ∶ 𝛽1 ≠ 0. (b) The observed slope of 0.604 is not a plausible value, the p-value
is extremely small, and the null hypothesis can be rejected. (c) The p-value is also extremely
small.
3. (a) The relationship is positive, moderate-to-strong, and linear. There are a few outliers but
no points that appear to be influential. (b) wgt̂ = −105.0113 + 1.0176 × hgt. Slope: For each
additional centimeter in height, the model predicts the average weight to be 1.0176 additional
kilograms (about 2.2 pounds). Intercept: People who are 0 centimeters tall are expected to
weigh -105.0113 kilograms. This is obviously not possible. Here, the 𝑦-intercept serves only to
adjust the height of the line and is meaningless by itself. (c) 𝐻0 : The true slope coefficient of
height is zero (𝛽1 = 0). 𝐻𝐴 : The true slope coefficient of height is different than zero (𝛽1 ≠ 0).
The p-value for the two-sided alternative hypothesis (𝛽1 ≠ 0) is incredibly small, so we reject
𝐻0 . The data provide convincing evidence that height and weight are positively correlated. The
true slope parameter is indeed greater than 0. (d) 𝑅² = 0.72² = 0.52. Approximately 52% of
the variability in weight can be explained by the height of individuals.
5. (a) Roughly 0.53 to 0.67. (b) For individuals with one cm larger shoulder girth, their average
height is predicted to be between 0.53 and 0.67 cm taller, with 98% confidence.
7. (a) Approximately 0.025. (b) 𝑏1 ± 2.33 × 𝑆𝐸 → (0.546, 0.662). (c) For individuals with one cm
larger shoulder girth, their average height is predicted to be between 0.546 and 0.662 cm taller,
with 98% confidence.
9. (a) 𝑟 = √0.518 ≈ +0.72. We know the correlation is positive due to the positive association
between the variables seen in the scatterplot (above in previous exercise). (b) The residuals have
a larger spread above the horizontal line at zero than below the line at zero. This indicates that
the values are not symmetric around zero (so therefore not normally distributed). However, the
violation is not extreme, and a simple least squares fit is probably appropriate for these data.
11. (a) 𝐻0 ∶ 𝛽1 = 0, 𝐻𝐴 ∶ 𝛽1 ≠ 0. (b) The observed slope of 2.559 is not a plausible value, the p-value
is extremely small, and the null hypothesis can be rejected. (c) The p-value is also extremely
small.
13. (a) Rough 90% confidence interval is 1.9 to 3.1. (b) For a one unit (one percentage point) increase
in poverty across given metropolitan areas, the predicted average annual murder rate will be
between 1.9 and 3.1 persons per million larger, with 90% confidence.
15. (a) 𝑟 = √0.706 ≈ +0.84. We know the correlation is positive due to the positive association
shown in the scatterplot. (b) The technical conditions all seem to be met.
17. (a) With only sixteen observations in the analysis there are not enough data points to establish
any patterns in the residual plot. That said, the sixteen observations do not show any large
deviations from the LNE conditions. We do not know if the volunteers were friends, for example, which
would violate the independence condition. (b) The layout of the points does not indicate any
deviation from the LINE technical conditions. The small number of points, however, suggests that
care should be given to making sure that the individuals in the study are a good representative
sample of the population to which we would like to infer the results.
19. (a) The Linearity and Normality conditions seem to be met. If anything, the Equal variance
condition is violated due to the fan-shaped pattern in the plot, which indicates non-constant
variability in the residuals (little variability when 𝑥 is small, more variability when 𝑥 is large).
We do not know if the cats were randomly sampled (i.e., are independent from one another),
but we have no reason to believe that they are not independent. (b) Unequal variability does
not affect the fit of the line. The line will continue to model the average heart weight of cats at
a given body weight. However, the p-value for the inference on the line will be affected by the
unequal variability. How much? Probably not much given that the violation is quite minimal.
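Several of the numerical answers in this chapter can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, the slope 0.604 is assumed to be the estimate behind the interval in 7(b) (it is the slope quoted in 1(b) and the midpoint of the reported interval), and the standard error 0.025 and multiplier 2.33 are the values given in 7(a) and 7(b).

```python
import math

def predict_weight(hgt_cm, intercept=-105.0113, slope=1.0176):
    """Predicted weight (kg) from the fitted line in solution 3(b)."""
    return intercept + slope * hgt_cm

def slope_interval(b1, se, crit):
    """Interval b1 +/- crit * SE for a regression slope."""
    return b1 - crit * se, b1 + crit * se

def r_from_r_squared(r_squared, slope_sign):
    """Correlation from R^2; the sign is copied from the sign of the slope."""
    return math.copysign(math.sqrt(r_squared), slope_sign)

# 3(b): one extra centimeter of height changes the prediction by the slope.
print(round(predict_weight(171) - predict_weight(170), 4))
# 3(d): R^2 from r = 0.72.
print(round(0.72 ** 2, 2))
# 7(b): 98% interval for the slope (b1 = 0.604 is an assumption, see above).
print(tuple(round(v, 3) for v in slope_interval(0.604, 0.025, 2.33)))
# 9(a) and 15(a): r from R^2, positive because the associations are positive.
print(round(r_from_r_squared(0.518, +1), 2), round(r_from_r_squared(0.706, +1), 2))
```

Taking the sign of 𝑟 from the slope mirrors the reasoning in 9(a) and 15(a): the square root alone gives only the magnitude, and the direction comes from the scatterplot.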
A.25 Chapter 25
1. (a) (-0.044, 0.346). We are 95% confident that students who go out more than two nights a week
on average have GPAs 0.044 points lower to 0.346 points higher than those who do not go out
more than two nights a week, when controlling for the other variables in the model. (b) Yes,
since the p-value is larger than 0.05 in all cases (not including the intercept).
3. (a) volume and diam; volume and height; diam and height. (b) Each is discernible in its own
model. (c) When both diameter and height are used in the multiple linear regression model,
both continue to be discernible predictors of volume.
5. (a) Linearity: Horror movies seem to show a much different pattern than the other genres.
While the residuals plots show a random scatter over years and in order of data collection,
there is a clear pattern in residuals for various genres, which signals that this regression model
is not appropriate for these data. Independent observations: The variability of the residuals
is higher for data that comes later in the dataset. We don’t know if the data are sorted by
year, but if so, there may be a temporal pattern in the data that violates the independence
condition. Normality: The residuals are right skewed (skewed to the high end). Constant or
Equal variability: The residuals vs. predicted values plot reveals some outliers. This plot for
only babies with predicted birth weights between 6 and 8.5 pounds looks a lot better, suggesting
that for the bulk of the data the constant variance condition is met.
7. (a) Linearity: With so many observations in the dataset, we look for particularly extreme outliers
in the histogram of residuals and do not see any. We also don’t see a non-linear pattern emerging
in the residuals vs. predicted plot. Independent observations: The sample is random and there
does not seem to be a trend in the residuals vs. order of data collection plot. Normality: The
histogram of residuals appears to be unimodal and symmetric, centered at 0. Constant or equal
variability: The residuals vs. predicted values plot reveals some outliers. This plot for only
babies with predicted birth weights between 6 and 8.5 pounds looks a lot better, suggesting that
for the bulk of the data the constant variance condition is met. All concerns raised here are relatively
mild. There are some outliers, but there is so much data that the influence of such observations
will be minor. (b) 𝐻0 : The true slope coefficient of habit is zero (𝛽5 = 0). 𝐻𝐴 : The true slope
coefficient of habit is different than zero (𝛽5 ≠ 0). The p-value for the two-sided alternative
hypothesis (𝛽5 ≠ 0) is 0.0007 (smaller than 0.05), so we reject 𝐻0 . The data provide
convincing evidence that height and weight are positively correlated, given the other variables
in the model. The true slope parameter is indeed greater than 0.
9. (a) Roughly weight̂𝑖 = 11 pounds and weight = 7 pounds. (b) Folds 1, 2, and 4 were used to
build the prediction model. (c) The plot on the top estimates 8 parameters; the plot on the
bottom estimates 3 parameters. (d) The residuals are not substantially different.
11. (a) The plots are difficult to differentiate. (b) The CV SSE is smaller for the model with only
two predictors. (c) The model with more predictors seems to be over-fitting the data used to
build the model, at the expense of fitting the cross-validation hold-out set less well for prediction.
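The cross-validation SSE referred to in solutions 9 and 11 can be sketched as follows. This is a simplified illustration, not the analysis behind the exercises: it fits a one-predictor least squares line rather than a multiple regression, assigns observations to folds by index instead of at random, and uses made-up data. The names `fit_line` and `cv_sse` are hypothetical.

```python
def fit_line(xs, ys):
    """Least squares intercept and slope for a single predictor."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def cv_sse(xs, ys, k=4):
    """Sum of squared prediction errors accumulated over k hold-out folds."""
    total = 0.0
    for fold in range(k):
        # Observation i belongs to fold i % k (a simple, non-random split).
        train = [i for i in range(len(xs)) if i % k != fold]
        held_out = [i for i in range(len(xs)) if i % k == fold]
        b0, b1 = fit_line([xs[i] for i in train], [ys[i] for i in train])
        # Errors are measured only on the fold that was not used for fitting.
        total += sum((ys[i] - (b0 + b1 * xs[i])) ** 2 for i in held_out)
    return total

# Made-up data (not from the exercises): y is roughly 2x + 1.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [3.1, 4.9, 7.2, 9.0, 10.8, 13.1, 15.0, 16.9]
print(cv_sse(xs, ys, k=4))
```

Because each fold is predicted by a model that never saw it, a model that over-fits its training folds tends to show a larger CV SSE than a simpler model, which is the comparison made in 11(b) and 11(c).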
A.26 Chapter 26
1. No, logistic regression is not appropriate because the response (or outcome) variable is not binary.
Linear regression is likely to be more appropriate.
3. 𝐻0 ∶ 𝛽1 = 0, the slope of the model predicting kids’ marijuana use in college from their parents’
marijuana use in college is 0. 𝐻𝐴 ∶ 𝛽1 ≠ 0, the slope of the model predicting kids’ marijuana
use in college from their parents’ marijuana use in college is different than 0. The test statistic
is 𝑍 = 4.09 and the associated p-value is less than 0.0001. With a small p-value we reject 𝐻0 .
The data provide convincing evidence that the slope of the model predicting kids’ marijuana
use in college from their parents’ marijuana use in college is different than 0, i.e., that parents’
marijuana use in college is a discernible predictor of kids’ marijuana use in college.
5. (a) 26 observations are in Fold2. 8 correctly and 2 incorrectly predicted to be from Victoria. (b)
78 observations are used to build the model. (c) 2 coefficients for tail length; 3 coefficients for
total length and sex.
7. (a) 76, 73.1%. (b) 58, 55.8%. (c) The tail length model should be chosen for classification
purposes. (d) A model using all three predictors might be superior to either of the smaller
models.
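The counts and percentages in solutions 5 and 7 follow from simple arithmetic, sketched below. The total of 104 observations is inferred from 5(a) and 5(b) (26 in one fold plus 78 used to build the model), and using four equal folds is an assumption consistent with those two numbers; the function names are hypothetical. It is also assumed that the counts in 7(a) and 7(b) are out of the same 104 observations.

```python
def fold_sizes(n_total, n_folds):
    """(held-out fold size, training size) when folds are equal-sized."""
    fold = n_total // n_folds
    return fold, n_total - fold

def accuracy_percent(n_correct, n_total):
    """Percent of observations classified correctly."""
    return 100 * n_correct / n_total

n_total = 26 + 78  # one fold plus the model-building observations, from 5(a) and 5(b)
print(fold_sizes(n_total, 4))
print(round(accuracy_percent(76, n_total), 1), round(accuracy_percent(58, n_total), 1))
```

The two rounded percentages should agree with those reported in 7(a) and 7(b).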
Chapter B
References
Adolph, S. C. 1987. “Physiological and Behavioral Ecology of the Lizards Sceloporus Occidentalis and
Sceloporus Graciosus.” PhD thesis, University of Washington, Seattle, Washington.
———. 1990. “Influence of Behavioral Thermoregulation on Microhabitat Use by Two Sceloporus
Lizards.” Ecology 71: 315–27.
Allais, G., M. Romoli, S. Rolando, G. Airola, I. Castagnoli Gabellari, R. Allais, and C. Benedetto.
2011. “Ear Acupuncture in the Treatment of Migraine Attacks: A Randomized Trial on the
Efficacy of Appropriate Versus Inappropriate Acupoints.” Neurological Sciences 32 (1): 173–75.
https://doi.org/10.1007/s10072-011-0525-4.
Allison, T., and D. V. Cicchetti. 1975. “Sleep in Mammals: Ecological and Constitutional Correlates.”
Arch. Hydrobiol 75: 442. https://doi.org/10.1126/science.982039.
American Council on Education. 2008. “College-Bound Students’ Interests in Study Abroad and
Other International Learning Activities.”
http://www.openintro.org/redirect.php?go=textbook-Interests_in_Study_Abroad_2008.
Anturane Reinfarction Trial Research Group. 1980. “Sulfinpyrazone in the Prevention of Sudden
Death After Myocardial Infarction.” New England Journal of Medicine 302 (5): 250–56.
https://doi.org/10.1056/NEJM198001313020502.
Asbury, D. A., and S. C. Adolph. 2007. “Behavioral Plasticity in an Ecological Generalist: Microhab-
itat Use by Western Fence Lizards.” Evolutionary Ecology Research 9: 801–15.
Audera, C., R. V. Patulny, Sander B. H, and R. M. Douglas. 2001. “Mega-Dose Vitamin c in
Treatment of the Common Cold: A Randomised Controlled Trial.” Medical Journal of Australia 175
(7): 359–62.
http://www.openintro.org/redirect.php?go=textbook-vitamin_C_cold_treatment_2001.
Backstrom, L. 2011. “Anatomy of Facebook.” Facebook Data Team’s Notes.
http://www.openintro.org/redirect.php?go=textbook-anatomy-of-facebook.
Benson, J. B. 1993. “Season of Birth and Onset of Locomotion: Theoretical and Methodological
Implications.” Infant Behavior and Development 16 (1): 69–81.
https://doi.org/10.1016/0163-6383(93)80029-8.
Bertrand, M., and S. Mullainathan. 2003. “Are Emily and Greg More Employable than Lakisha and
Jamal? A Field Experiment on Labor Market Discrimination.” https://doi.org/10.3386/w9873.
Böttiger, B. W., C. Bode, S. Kern, A. Gries, R. Gust, R. Glätzer, H. Bauer, J. Motsch, and E. Martin.
2001. “Efficacy and Safety of Thrombolytic Therapy After Initially Unsuccessful Cardiopulmonary
Resuscitation: A Prospective Clinical Trial.” The Lancet 357 (9268): 1583–85.
Bucciol, Alessandro, and Marco Piovesan. 2011. “Luck or Cheating? A Field Experiment on Honesty
with Children.” Journal of Economic Psychology 32 (1): 73–78.
https://doi.org/10.1016/j.joep.2010.12.001.
CDC. 2008. “Perceived Insufficient Rest or Sleep Among Adults – United States, 2008.”
http://www.openintro.org/redirect.php?go=textbook-Perceived_Insufficient_Rest_or_Sleep_Among_Adults.
———. 2018. “2018 Assisted Reproductive Technology Fertility Clinic Success Rates Report.”
https://www.cdc.gov/art/pdf/2018-report/ART-2018-Clinic-Report-Full.pdf.
Chance, B., and A. Rossman. 2018. Investigating Statistical Concepts, Applications, and Methods.
http://www.rossmanchance.com/iscam3/.
Chimowitz, M. I., M. J. Lynn, C. P. Derdeyn, T. N. Turan, D. Fiorella, B. F. Lane, L. S. Janis,
et al. 2011. “Stenting Versus Aggressive Medical Therapy for Intracranial Arterial Stenosis.”
New England Journal of Medicine 365 (11): 993–1003.
http://www.nejm.org/doi/full/10.1056/NEJMoa1105335.
Conner, T. S., K. L. Brookie, A. C. Carr, L. A. Mainvil, and M. CM. Vissers. 2017. “Let Them
Eat Fruit! The Effect of Fruit and Vegetable Consumption on Psychological Well-Being in Young
Adults: A Randomized Controlled Trial.” PloS One 12 (2).
https://doi.org/10.1371/journal.pone.0171206.
Datoo, M. S., M. H. Natama, A. Somé, O. Traoré, T. Rouamba, D. Bellamy, P. Yameogo, et al.
2021. “High Efficacy of a Low Dose Candidate Malaria Vaccine, R21 in 1 Adjuvant Matrix-M™,
with Seasonal Administration to Children in Burkina Faso.” The Lancet.
https://doi.org/10.1016/S0140-6736(21)00943-0.
Demos. 2011. “The State of Young America: The Poll.”
http://www.openintro.org/redirect.php?go=textbook-young_americans_2011_extra.
Ellis, G. J., and L. H. Stone. 1979. “Marijuana Use in College: An Evaluation of a Modeling
Explanation.” Youth and Society 10 (4): 323.
FiveThirtyEight. 2015. http://www.openintro.org/redirect.php?go=textbook-fivethirtyeight-scary-movies.
Frederick, S., N. Novemsky, J. Wang, R. Dhar, and S. Nowlis. 2009. “Opportunity Cost Neglect.”
Journal of Consumer Research 36 (4): 553–61.
Gallup. 2012. http://www.openintro.org/redirect.php?go=textbook-employed_americans_in_better_health_2012.
Gallup. 2021a. “Half of College Students Say COVID-19 May Impact Completion.”
https://www.openintro.org/go?id=textbook-gallup-2021-covid-college-impact.
———. 2021b. “U.S. Support for Vaccination Proof Varies by Activity, Data Collected in April 2021.”
https://www.openintro.org/go?id=textbook-gallup-2021-vaccine-proof.
Garbutt, J. M., C. Banister, E. Spitznagel, and J. F. Piccirillo. 2012. “Amoxicillin for Acute Rhinosi-
nusitis: A Randomized Controlled Trial.” JAMA: The Journal of the American Medical Association
307 (7): 685–92. https://doi.org/10.1001/jama.2012.138.
Gneezy, Uri, and Aldo Rustichini. 2000. “A Fine Is a Price.” The Journal of Legal Studies 29 (1):
1–17. https://doi.org/10.1086/468061.
Gorman, K. B., T. D. Williams, and W. R. Fraser. 2014b. “Ecological sexual dimorphism and
environmental variability within a community of Antarctic penguins (genus Pygoscelis).” PloS
One 9 (3): e90081. https://doi.org/10.1371/journal.pone.0090081.
———. 2014a. “Ecological Sexual Dimorphism and Environmental Variability Within a Community
of Antarctic Penguins (Genus Pygoscelis).” PLoS ONE 9(3) (e90081): –13.
https://doi.org/10.1371/journal.pone.0090081.
Graham, D. J., R. Ouellet-Hellstrom, T. E. MaCurdy, F. Ali, C. Sholley, C. Worrall, and J. A. Kelman.
2010. “Risk of Acute Myocardial Infarction, Stroke, Heart Failure, and Death in Elderly Medicare
Patients Treated with Rosiglitazone or Pioglitazone.” JAMA 304 (4): 411.
https://doi.org/10.1001/jama.2010.920.
Hand, D. J. 1994. A handbook of small data sets. Chapman & Hall/CRC.
Hayden, R. W. 2019. “Questionable Claims for Simple Versions of the Bootstrap.” Journal of Statistics
Education 27 (3): 208–15. https://doi.org/10.1080/10691898.2019.1669507.
Heinz, G., L. J. Peterson, R. W. Johnson, and C. J. Kerk. 2003. “Exploring Relationships in Body
Dimensions.” Journal of Statistics Education 11 (2).
http://www.openintro.org/redirect.php?go=textbook-body_dim_2003.
Hepler, J., and D. Albarracín. 2013. “Attitudes Without Objects: Evidence for a Dispositional
Attitude, Its Measurement, and Its Consequences.” Journal of Personality and Social Psychology
104 (6): 1060. https://doi.org/10.1037/a0032282.
Hesterberg, T. C. 2015. “What Teachers Should Know about the Bootstrap: Resampling in the
Undergraduate Statistics Curriculum.” The American Statistician 69 (4): 371–86.
https://doi.org/10.1080/00031305.2015.1089789.
ICPSR. 2014. “United States Department of Health and Human Services. Centers for Disease Control
and Prevention. National Center for Health Statistics. Natality Detail File, 2014 United States.
Inter-University Consortium for Political and Social Research, 2016-10-07.”
https://doi.org/10.3886/ICPSR36461.v1.
Index