Think Stats
Probability and Statistics for Programmers
Version 1.5.9
Allen B. Downey
The original form of this book is LaTeX source code. Compiling this code has the
effect of generating a device-independent representation of a textbook, which can
be converted to other formats and printed.
The cover for this book is based on a photo by Paul Friel (http://flickr.com/
people/frielp/), who made it available under the Creative Commons Attribution
license. The original photo is at http://flickr.com/photos/frielp/11999738/.
Preface
To demonstrate the kind of analysis I want students to do, the book presents
a case study that runs through all of the chapters. It uses data from two
sources:

• The National Survey of Family Growth (NSFG), conducted by the U.S. Centers for Disease Control and Prevention (see Chapter 1).

• The Behavioral Risk Factor Surveillance System (BRFSS), conducted by the National Center for Chronic Disease Prevention and Health Promotion (see Chapter 4).
Other examples use data from the IRS, the U.S. Census, and the Boston
Marathon.
I did not do that. In fact, I used almost no printed material while I was
writing this book, for several reasons:
1 A breed of dog that is about half the size of a Hyracotherium (see http://wikipedia.org/wiki/Hyracotherium).
• Proponents of old media think that the exclusive use of electronic re-
sources is lazy and unreliable. They might be right about the first part,
but I think they are wrong about the second, so I wanted to test my
theory.
The resource I used more than any other is Wikipedia, the bugbear of li-
brarians everywhere. In general, the articles I read on statistical topics were
very good (although I made a few small changes along the way). I include
references to Wikipedia pages throughout the book and I encourage you to
follow those links; in many cases, the Wikipedia page picks up where my
description leaves off. The vocabulary and notation in this book are gener-
ally consistent with Wikipedia, unless I had a good reason to deviate.
Other resources I found useful were Wolfram MathWorld and (of course)
Google. I also used two books, David MacKay’s Information Theory, In-
ference, and Learning Algorithms, which is the book that got me hooked on
Bayesian statistics, and Press et al.’s Numerical Recipes in C. But both books
are available online, so I don’t feel too bad.
Allen B. Downey
Needham MA
Contributor List
If you have a suggestion or correction, please send email to
[email protected]. If I make a change based on your feed-
back, I will add you to the contributor list (unless you ask to be omitted).
If you include at least part of the sentence the error appears in, that makes it
easy for me to search. Page and section numbers are fine, too, but not quite
as easy to work with. Thanks!
• Lisa Downey and June Downey read an early draft and made many correc-
tions and suggestions.
• Andy Pethan and Molly Farison helped debug some of the solutions, and
Molly spotted several typos.
• Gábor Lipták found a typo in the book and the relay race solution.
• Many thanks to Kevin Smith and Tim Arnold for their work on plasTeX,
which I used to convert this book to DocBook.
• Stijn Debrouwere, Leo Marihart III, Jonathan Hammler, and Kent Johnson
found errors in the first print edition.
• Jörg Beyer found typos in the book and made many corrections in the doc-
strings of the accompanying code.
Contents

Preface

1 Statistical thinking for programmers
1.5 Significance
1.6 Glossary

2 Descriptive statistics
2.2 Variance
2.3 Distributions
2.8 Outliers
2.13 Glossary

3 Cumulative distribution functions
3.3 Percentiles
3.10 Glossary

4 Continuous distributions
4.8 Glossary

5 Probability
5.3 Poincaré
5.8 Glossary

6 Operations on distributions
6.1 Skewness
6.3 PDFs
6.4 Convolution
6.8 Glossary

7 Hypothesis testing
7.5 Cross-validation

8 Estimation
8.1 The estimation game
8.2 Guess the variance
8.3 Understanding errors
8.4 Exponential distributions
8.5 Confidence intervals
8.6 Bayesian estimation
8.7 Implementing Bayesian estimation
8.8 Censored data
8.9 The locomotive problem
8.10 Glossary

9 Correlation
9.1 Standard scores
9.2 Covariance
9.3 Correlation
9.4 Making scatterplots in pyplot
9.5 Spearman’s rank correlation
9.6 Least squares fit
9.7 Goodness of fit
9.8 Correlation and Causation
9.9 Glossary
Chapter 1

Statistical thinking for programmers
This book is about turning data into knowledge. Data is cheap (at least
relatively); knowledge is harder to come by.
I will present three related pieces:
Probability is the study of random events. Most people have an intuitive
understanding of degrees of probability, which is why you can use
words like “probably” and “unlikely” without special training, but we
will talk about how to make quantitative claims about those degrees.
Statistics is the discipline of using data samples to support claims about
populations. Most statistical analysis is based on probability, which is
why these pieces are usually presented together.
Computation is a tool that is well-suited to quantitative analysis, and
computers are commonly used to process statistics. Also, computa-
tional experiments are useful for exploring concepts in probability and
statistics.
The thesis of this book is that if you know how to program, you can use
that skill to help you understand probability and statistics. These topics are
often presented from a mathematical perspective, and that approach works
well for some people. But some important ideas in this area are hard to work
with mathematically and relatively easy to approach computationally.
The rest of this chapter presents a case study motivated by a question I
heard when my wife and I were expecting our first child: do first babies
tend to arrive late?
Here are some of the responses I found in an online discussion of this question:
“My two friends that have given birth recently to their first ba-
bies, BOTH went almost 2 weeks overdue before going into
labour or being induced.”
“My first one came 2 weeks late and now I think the second one
is going to come out two weeks early!!”
“I don’t think that can be true because my sister was my
mother’s first and she was early, as with many of my cousins.”
Reports like these are called anecdotal evidence because they are based on
data that is unpublished and usually personal. In casual conversation, there
is nothing wrong with anecdotes, so I don’t mean to pick on the people I
quoted.
But we might want evidence that is more persuasive and an answer that is
more reliable. By those standards, anecdotal evidence usually fails, because:
Small number of observations: If the gestation period is longer for first ba-
bies, the difference is probably small compared to the natural varia-
tion. In that case, we might have to compare a large number of preg-
nancies to be sure that a difference exists.
Selection bias: People who join a discussion of this question might be in-
terested because their first babies were late. In that case the process of
selecting data would bias the results.
Confirmation bias: People who believe the claim might be more likely to
contribute examples that confirm it. People who doubt the claim are
more likely to cite counterexamples.
To address the limitations of anecdotes, we will use the tools of statistics, which include:

Data collection: We will use data from a large national survey that was designed explicitly with the goal of generating statistically valid inferences about the U.S. population.
Exploratory data analysis: We will look for patterns, differences, and other
features that address the questions we are interested in. At the same
time we will check for inconsistencies and identify limitations.
By performing these steps with care to avoid pitfalls, we can reach conclu-
sions that are more justifiable and more likely to be correct.
We will use data collected by the National Survey of Family Growth (NSFG)1 to investigate whether first babies tend to come late, and other questions. In order to use this data effectively, we have to understand the design of the study.
1 See http://cdc.gov/nchs/nsfg.htm.
The NSFG has been conducted seven times; each deployment is called a cy-
cle. We will be using data from Cycle 6, which was conducted from January
2002 to March 2003.
The goal of the survey is to draw conclusions about a population; the target
population of the NSFG is people in the United States aged 15-44.
The people who participate in a survey are called respondents; a group of
respondents is called a cohort. In general, cross-sectional studies are meant
to be representative, which means that every member of the target popu-
lation has an equal chance of participating. Of course that ideal is hard to
achieve in practice, but people who conduct surveys come as close as they
can.
The NSFG is not representative; instead it is deliberately oversampled.
The designers of the study recruited three groups—Hispanics, African-
Americans and teenagers—at rates higher than their representation in the
U.S. population. The reason for oversampling is to make sure that the num-
ber of respondents in each of these groups is large enough to draw valid
statistical inferences.
Of course, the drawback of oversampling is that it is not as easy to draw
conclusions about the general population based on statistics from the sur-
vey. We will come back to this point later.
Exercise 1.1 Although the NSFG has been conducted seven times,
it is not a longitudinal study. Read the Wikipedia pages http://wikipedia.org/wiki/Cross-sectional_study and http://wikipedia.org/wiki/Longitudinal_study to make sure you understand why not.
Exercise 1.2 In this exercise, you will download data from the NSFG; we
will use this data throughout the book.
5. Browse the code to get a sense of what it does. The next section ex-
plains how it works.
“Oeuf” means egg, “chapeau” means hat. It’s like those French
have a different word for everything.
Each line in the respondents file contains information about one respondent.
This information is called a record. The variables that make up a record are
called fields. A collection of records is called a table.
If you read survey.py you will see class definitions for Record, which is an
object that represents a record, and Table, which represents a table.
There are two subclasses of Record—Respondent and Pregnancy—which
contain records from the respondent and pregnancy tables. For the time
being, these classes are empty; in particular, there is no init method to ini-
tialize their attributes. Instead we will use Table.MakeRecord to convert a
line of text into a Record object.
There are also two subclasses of Table: Respondents and Pregnancies. The
init method in each class specifies the default name of the data file and the
type of record to create. Each Table object has an attribute named records,
which is a list of Record objects.
For each Table, the GetFields method returns a list of tuples that specify
the fields from the record that will be stored as attributes in each Record
object. (You might want to read that last sentence twice.)
def GetFields(self):
    return [
        ('caseid', 1, 12, int),
        ('prglength', 275, 276, int),
        ('outcome', 277, 277, int),
        ('birthord', 278, 279, int),
        ('finalwgt', 423, 440, float),
    ]
The first tuple says that the field caseid is in columns 1 through 12 and it’s
an integer. Each tuple contains the following information:
field: The name of the attribute where the field will be stored. Most of the
time I use the name from the NSFG codebook, converted to all lower
case.
start: The index of the starting column for this field. For example, the start
index for caseid is 1. You can look up these indices in the NSFG
codebook at http://nsfg.icpsr.umich.edu/cocoon/WebDocs/NSFG/
public/index.htm.
end: The index of the ending column for this field; for example, the end index for caseid is 12. Unlike in Python, the end index is inclusive.

The fourth element of each tuple is the function used to convert the field from a string to a value; for these fields it is either int or float.
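As an illustration of how these tuples are used, here is a minimal sketch of a parser for one line of the fixed-width file. The function name and the missing-value convention are hypothetical; in the book's code this work is done by Table.MakeRecord in survey.py.

def ParseLine(line, fields):
    """Extracts and converts the given fields from one line of text.

    line: string from the data file
    fields: list of (name, start, end, cast) tuples like the ones
            returned by GetFields; start and end are 1-based and inclusive
    """
    record = {}
    for name, start, end, cast in fields:
        raw = line[start-1:end]          # shift to 0-based, keep the end inclusive
        try:
            record[name] = cast(raw)
        except ValueError:
            record[name] = 'NA'          # blank fields (like birthord) stay missing
    return record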
outcome is an integer code for the outcome of the pregnancy. The code 1
indicates a live birth.
birthord is the integer birth order of each live birth; for example, the code
for a first child is 1. For outcomes other than live birth, this field is
blank.
If you read the codebook carefully, you will see that most of these variables
are recodes, which means that they are not part of the raw data collected by
the survey, but they are calculated using the raw data.
For example, prglength for live births is equal to the raw variable wksgest
(weeks of gestation) if it is available; otherwise it is estimated using mosgest
* 4.33 (months of gestation times the average number of weeks in a
month).
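As a sketch of that logic (this is not the NSFG's actual recode code; wksgest and mosgest are the raw variables named above, and 'NA' stands in for a missing value):

def RecodePrglength(wksgest, mosgest):
    # use weeks of gestation if it was reported
    if wksgest != 'NA':
        return wksgest
    # otherwise estimate it from months of gestation
    return mosgest * 4.33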
Recodes are often based on logic that checks the consistency and accuracy
of the data. In general it is a good idea to use recodes unless there is a
compelling reason to process the raw data yourself.
You might also notice that Pregnancies has a method called Recode that
does some additional checking and recoding.
Exercise 1.3 In this exercise you will write a program to explore the data in
the Pregnancies table.
1. In the directory where you put survey.py and the data files, create a
file named first.py and type or paste in the following code:
import survey
table = survey.Pregnancies()
table.ReadRecords()
print 'Number of pregnancies', len(table.records)
2. Write a loop that iterates table and counts the number of live births.
Find the documentation of outcome and confirm that your result is
consistent with the summary in the documentation.
3. Modify the loop to partition the live birth records into two groups, one
for first babies and one for the others. Again, read the documentation
of birthord to see if your results are consistent.
When you are working with a new dataset, these kinds of checks
are useful for finding errors and inconsistencies in the data, detect-
ing bugs in your program, and checking your understanding of the
way the fields are encoded.
4. Compute the average pregnancy length (in weeks) for first babies and
others. Is there a difference between the groups? How big is it?
1.5 Significance
In the previous exercise, you compared the gestation period for first babies
and others; if things worked out, you found that first babies are born about
13 hours later, on average.
A difference like that is called an apparent effect; that is, there might be
something going on, but we are not yet sure. There are several questions
we still want to ask:
• If the two groups have different means, what about other summary
statistics, like median and variance? Can we be more precise about
how the groups differ?
Answering these questions will take most of the rest of this book.
Exercise 1.4 The best way to learn about statistics is to work on a project
you are interested in. Is there a question like, “Do first babies arrive late,”
that you would like to investigate?
1.6 Glossary
anecdotal evidence: Evidence, often personal, that is collected casually
rather than by a well-designed study.
raw data: Values collected and recorded with little or no checking, calcula-
tion or interpretation.
Chapter 2

Descriptive statistics
If you have a sample of n values, xi , the mean, µ, is the sum of the values
divided by the number of values; in other words
µ = (1/n) ∑i xi
The words “mean” and “average” are sometimes used interchangeably, but I will maintain this distinction:

• The “mean” of a sample is the summary statistic computed with the previous formula.

• An “average” is one of many summary statistics you might choose to describe the typical value or the central tendency of a sample.

For example, suppose my garden contains six pumpkins: three decorative pumpkins that are 1 pound each, two pie pumpkins that are 3 pounds each, and one Atlantic Giant® pumpkin that weighs 591 pounds. The mean of this sample is 100 pounds, but if I told you “The average pumpkin in my garden is 100 pounds,” that would be wrong, or at least misleading.
In this example, there is no meaningful average because there is no typical
pumpkin.
2.2 Variance
If there is no single number that summarizes pumpkin weights, we can do
a little better with two numbers: mean and variance.
In the same way that the mean is intended to describe the central tendency,
variance is intended to describe the spread. The variance of a set of values
is
σ² = (1/n) ∑i (xi − µ)²

The term xi − µ is called the “deviation from the mean,” so variance is the mean squared deviation, which is why it is denoted σ². The square root of variance, σ, is called the standard deviation.
By itself, variance is hard to interpret. One problem is that the units are
strange; in this case the measurements are in pounds, so the variance is in
pounds squared. Standard deviation is more meaningful; in this case the
units are pounds.
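To make the formulas concrete, here is a quick sketch in plain Python (not using thinkstats.py) that computes the mean, variance and standard deviation of the pumpkin weights from the previous section:

import math

def MeanVar(xs):
    """Computes the mean and variance of a sequence of numbers."""
    n = float(len(xs))
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n
    return mu, var

pumpkins = [1, 1, 1, 3, 3, 591]
mu, var = MeanVar(pumpkins)
sigma = math.sqrt(var)
# mu is 100.0 pounds, var is about 48217 pounds squared,
# and sigma is about 220 pounds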
Exercise 2.1 For the exercises in this chapter you should download http://thinkstats.com/thinkstats.py, which contains general-purpose functions we will use throughout the book. You can read documentation of these functions in http://thinkstats.com/thinkstats.html.

Write a function called Pumpkin that uses functions from thinkstats.py to compute the mean, variance and standard deviation of the pumpkin weights in the previous section.
Exercise 2.2 Reusing code from survey.py and first.py, compute the stan-
dard deviation of gestation time for first babies and others. Does it look like
the spread is the same for the two groups?
How big is the difference in the means compared to these standard devia-
tions? What does this comparison suggest about the statistical significance
of the difference?
If you have prior experience, you might have seen a formula for variance
with n − 1 in the denominator, rather than n. This statistic is called the
“sample variance,” and it is used to estimate the variance in a population
using a sample. We will come back to this in Chapter 8.
2.3 Distributions
Summary statistics are concise, but dangerous, because they obscure the
data. An alternative is to look at the distribution of the data, which de-
scribes how often each value appears.
One way to compute the distribution of a sequence of values, t, is with a dictionary:

hist = {}
for x in t:
    hist[x] = hist.get(x, 0) + 1
The result is a dictionary that maps from values to frequencies. To get from
frequencies to probabilities, we divide through by n, which is called nor-
malization:
n = float(len(t))
pmf = {}
for x, freq in hist.items():
    pmf[x] = freq / n
The normalized histogram is called a PMF, which stands for “probability
mass function”; that is, it’s a function that maps from values to probabilities
(I’ll explain “mass” in Section 6.3).
The function MakeHistFromList takes a list of values and returns a new Hist
object. You can test it in Python’s interactive mode:
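For example, using the short list [1, 2, 2, 3, 5] (which matches the frequencies queried below):

>>> import Pmf
>>> hist = Pmf.MakeHistFromList([1, 2, 2, 3, 5])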
Hist objects provide methods to look up values and their probabilities. Freq
takes a value and returns its frequency:
>>> hist.Freq(2)
2
If you look up a value that has never appeared, the frequency is 0.
>>> hist.Freq(4)
0
Values returns an unsorted list of the values in the Hist:
>>> hist.Values()
[1, 5, 3, 2]
To loop through the values in order, you can use the built-in function
sorted:
for val in sorted(hist.Values()):
    print val, hist.Freq(val)
If you are planning to look up all of the frequencies, it is more efficient to
use Items, which returns an unsorted list of value-frequency pairs:
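for val, freq in hist.Items():
    print val, freq        # one value-frequency pair per line (a sketch)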
Exercise 2.3 The mode of a distribution is the most frequent value (see http://wikipedia.org/wiki/Mode_(statistics)). Write a function called Mode
that takes a Hist object and returns the most frequent value.
As a more challenging version, write a function called AllModes that takes
a Hist object and returns a list of value-frequency pairs in descending or-
der of frequency. Hint: the operator module provides a function called
itemgetter which you can pass as a key to sorted.
Figure 2.1: Histogram of pregnancy length in weeks for first babies and others.
Outliers: Values far from the mode are called outliers. Some of these are
just unusual cases, like babies born at 30 weeks. But many of them are
probably due to errors, either in the reporting or recording of data.
Although histograms make some features apparent, they are usually not
useful for comparing two distributions. In this example, there are fewer
“first babies” than “others,” so some of the apparent differences in the his-
tograms are due to sample sizes. We can address this problem using PMFs.
Finally, in the text, I use PMF to refer to the general concept of a probability
mass function, independent of my implementation.
To create a Pmf object, use MakePmfFromList, which takes a list of values:
>>> import Pmf
>>> pmf = Pmf.MakePmfFromList([1, 2, 2, 3, 5])
>>> print pmf
<Pmf.Pmf object at 0xb76cf68c>
Pmf and Hist objects are similar in many ways. The methods Values and
Items work the same way for both types. The biggest difference is that
a Hist maps from values to integer counters; a Pmf maps from values to
floating-point probabilities.
To look up the probability associated with a value, use Prob:
>>> pmf.Prob(2)
0.4
You can modify an existing Pmf by incrementing the probability associated
with a value:
>>> pmf.Incr(2, 0.2)
>>> pmf.Prob(2)
0.6
Or you can multiply a probability by a factor:
>>> pmf.Mult(2, 0.5)
>>> pmf.Prob(2)
0.3
If you modify a Pmf, the result may not be normalized; that is, the probabil-
ities may no longer add up to 1. To check, you can call Total, which returns
the sum of the probabilities:
>>> pmf.Total()
0.9
To renormalize, call Normalize:
>>> pmf.Normalize()
>>> pmf.Total()
1.0
Pmf objects provide a Copy method so you can make and modify a copy without affecting the original.
Given a Pmf, you can compute the mean like this:

µ = ∑i pi xi

where the xi are the unique values in the PMF and pi = PMF(xi). Similarly, you can compute the variance like this:

σ² = ∑i pi (xi − µ)²
Write functions called PmfMean and PmfVar that take a Pmf object and com-
pute the mean and variance. To test these methods, check that they are
consistent with the methods Mean and Var in Pmf.py.
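One direct translation of those formulas, using the Items method described earlier (a sketch, not necessarily how Pmf.py implements Mean and Var):

def PmfMean(pmf):
    """Computes the mean of a Pmf."""
    return sum(p * x for x, p in pmf.Items())

def PmfVar(pmf):
    """Computes the variance of a Pmf."""
    mu = PmfMean(pmf)
    return sum(p * (x - mu) ** 2 for x, p in pmf.Items())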
Figure 2.2 shows the PMF of pregnancy lengths as a bar graph. Using the
PMF, we can see more clearly where the distributions differ. First babies
seem to be less likely to arrive on time (week 39) and more likely to be late (weeks 41 and 42).

Figure 2.2: PMF of pregnancy lengths (in weeks) for first babies and others.
The code that generates the figures in this chapter is available from http://thinkstats.com/descriptive.py. To run it, you will need the modules it imports and the data from the NSFG (see Section 1.3).
Note: pyplot provides a function called hist that takes a sequence of val-
ues, computes the histogram and plots it. Since I use Hist objects, I usually
don’t use pyplot.hist.
2.8 Outliers
Outliers are values that are far from the central tendency. Outliers might
be caused by errors in collecting or processing the data, or they might be
correct but unusual measurements. It is always a good idea to check for
outliers, and sometimes it is useful and appropriate to discard them.
In the list of pregnancy lengths for live births, the 10 lowest values are {0, 4,
9, 13, 17, 17, 18, 19, 20, 21}. Values below 20 weeks are certainly errors, and
values higher than 30 weeks are probably legitimate. But values in between
are hard to interpret.
At the other end of the distribution, the highest values are:

weeks  count
43     148
44     46
45     10
46     1
47     1
48     7
50     2

Figure: Difference in PMFs of pregnancy length between first babies and others, by week.
Again, some values are almost certainly errors, but it is hard to know for
sure. One option is to trim the data by discarding some fraction of the high-
est and lowest values (see http://wikipedia.org/wiki/Truncated_mean).
Make three PMFs, one for first babies, one for others, and one for all live
births. For each PMF, compute the probability of being born early, on time,
or late.
One way to summarize data like this is with relative risk, which is a ratio
of two probabilities. For example, the probability that a first baby is born
early is 18.2%. For other babies it is 16.8%, so the relative risk is 1.08. That
means that first babies are about 8% more likely to be early.
Write code to confirm that result, then compute the relative risks of being
born on time and being late. You can download a solution from http://
thinkstats.com/risk.py.
2. Remove from the cohort all pregnancies with length less than 39.
3. Compute the PMF of the remaining durations; the result is the condi-
tional PMF.
This algorithm is conceptually clear, but not very efficient. A simple alter-
native is to remove from the distribution the values less than 39 and then
renormalize.
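Here is a sketch of that second approach, using only the Pmf methods described in this chapter (Copy, Values, Mult and Normalize); the function name is mine:

def ConditionPmf(pmf, week=39):
    """Returns a Pmf conditioned on not being born before the given week."""
    cond = pmf.Copy()
    for x in cond.Values():
        if x < week:
            cond.Mult(x, 0)      # zero out durations below the threshold
    cond.Normalize()             # renormalize so the probabilities add up to 1
    return cond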
Exercise 2.7 Write a function that implements either of these algorithms and
computes the probability that a baby will be born during Week 39, given
that it was not born prior to Week 39.
Generalize the function to compute the probability that a baby will be born
during Week x, given that it was not born prior to Week x, for all x. Plot this
value as a function of x for first babies and others.
The answer might depend on who is asking the question. For example, a
scientist might be interested in any (real) effect, no matter how small. A
doctor might only care about effects that are clinically significant; that is,
differences that affect treatment decisions. A pregnant woman might be
interested in results that are relevant to her, like the conditional probabilities
in the previous section.
How you report results also depends on your goals. If you are trying to
demonstrate the significance of an effect, you might choose summary statis-
tics, like relative risk, that emphasize differences. If you are trying to reas-
sure a patient, you might choose statistics that put the differences in context.
Exercise 2.8 Based on the results from the previous exercises, suppose you
were asked to summarize what you learned about whether first babies ar-
rive late.
Which summary statistics would you use if you wanted to get a story on
the evening news? Which ones would you use if you wanted to reassure an
anxious patient?
Finally, imagine that you are Cecil Adams, author of The Straight Dope
(http://straightdope.com), and your job is to answer the question, “Do
first babies arrive late?” Write a paragraph that uses the results in this chap-
ter to answer the question clearly, precisely, and accurately.
2.13 Glossary
central tendency: A characteristic of a sample or population; intuitively, it
is the most average value.
Chapter 3

Cumulative distribution functions

The average class size a college reports and the average class size its students experience are usually different, for two reasons:

• Students typically take 4–5 classes per semester, but professors often
teach 1 or 2.
• The number of students who enjoy a small class is small, but the num-
ber of students in a large class is (ahem!) large.
The first effect is obvious (at least once it is pointed out); the second is more
subtle. So let’s look at an example. Suppose that a college offers 65 classes
in a given semester, with the following distribution of sizes:
size    count
5-9     8
10-14   8
15-19   14
20-24   4
25-29   6
30-34   12
35-39   8
40-44   3
45-49   2
If you ask the Dean for the average class size, he would construct a PMF,
compute the mean, and report that the average class size is 24.
But if you survey a group of students, ask them how many students are in
their classes, and compute the mean, you would think that the average class
size was higher.
Exercise 3.1 Build a PMF of these data and compute the mean as perceived
by the Dean. Since the data have been grouped in bins, you can use the
mid-point of each bin.
Now find the distribution of class sizes as perceived by students and com-
pute its mean.
Suppose you want to find the distribution of class sizes at a college, but you
can’t get reliable data from the Dean. An alternative is to choose a random
sample of students and ask them the number of students in each of their
classes. Then you could compute the PMF of their responses.
The result would be biased because large classes would be oversampled,
but you could estimate the actual distribution of class sizes by applying an
appropriate transformation to the observed distribution.
Write a function called UnbiasPmf that takes the PMF of the observed values
and returns a new Pmf object that estimates the distribution of class sizes.
You can download a solution to this problem from http://thinkstats.com/class_size.py.
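The transformation is to divide each observed probability by its value (undoing the fact that a class of size x is x times as likely to be observed), then renormalize. A sketch, which may differ from the downloadable solution:

def UnbiasPmf(pmf):
    """Estimates the actual distribution of class sizes from the observed one."""
    new_pmf = pmf.Copy()
    for x in new_pmf.Values():
        new_pmf.Mult(x, 1.0 / x)   # a class of size x was x times too likely to be seen
    new_pmf.Normalize()
    return new_pmf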
Exercise 3.2 In most foot races, everyone starts at the same time. If you are
a fast runner, you usually pass a lot of people at the beginning of the race,
but after a few miles everyone around you is going at the same speed.
When I ran a long-distance (209 miles) relay race for the first time, I noticed
an odd phenomenon: when I overtook another runner, I was usually much
faster, and when another runner overtook me, he was usually much faster.
At first I thought that the distribution of speeds might be bimodal; that is,
there were many slow runners and many fast runners, but few at my speed.
Then I realized that I was the victim of selection bias. The race was unusual
in two ways: it used a staggered start, so teams started at different times;
also, many teams included runners at different levels of ability.
As a result, runners were spread out along the course with little relationship
between speed and location. When I started running my leg, the runners
near me were (pretty much) a random sample of the runners in the race.
So where does the bias come from? During my time on the course, the
chance of overtaking a runner, or being overtaken, is proportional to the
difference in our speeds. To see why, think about the extremes. If another
runner is going at the same speed as me, neither of us will overtake the
other. If someone is going so fast that they cover the entire course while I
am running, they are certain to overtake me.
Write a function called BiasPmf that takes a Pmf representing the actual
distribution of runners’ speeds, and the speed of a running observer, and
returns a new Pmf representing the distribution of runners’ speeds as seen
by the observer.
To test your function, get the distribution of speeds from a normal road race
(not a relay). I wrote a program that reads the results from the James Joyce
Ramble 10K in Dedham MA and converts the pace of each runner to MPH.
Download it from http://thinkstats.com/relay.py. Run it and look at
the PMF of speeds.
Now compute the distribution of speeds you would observe if you ran a
relay race at 7.5 MPH with this group of runners. You can download a
solution from http://thinkstats.com/relay_soln.py
3.2 The limits of PMFs

Figure 3.1 shows PMFs of birth weight (in ounces) for first babies and others, based on live births in the NSFG. Overall, these distributions resemble the familiar “bell curve,” with many
values near the mean and a few values much higher and lower.
But parts of this figure are hard to interpret. There are many spikes and
valleys, and some apparent differences between the distributions. It is hard
to tell which of these features are significant. Also, it is hard to see overall
patterns; for example, which distribution do you think has the higher mean?
Figure 3.1: PMF of birth weights. This figure shows a limitation of PMFs: they are hard to compare.

These problems can be mitigated by binning the data; that is, dividing the domain into non-overlapping intervals and counting the number of values
in each bin. Binning can be useful, but it is tricky to get the size of the bins
right. If they are big enough to smooth out noise, they might also smooth
out useful information.
3.3 Percentiles
If you have taken a standardized test, you probably got your results in the
form of a raw score and a percentile rank. In this context, the percentile
rank is the fraction of people who scored lower than you (or the same). So
if you are “in the 90th percentile,” you did as well as or better than 90% of
the people who took the exam.
Here’s how you could compute the percentile rank of a value, your_score,
relative to the scores in the sequence scores:
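def PercentileRank(scores, your_score):
    # count how many scores are less than or equal to yours (a sketch
    # consistent with the example that follows)
    count = 0
    for score in scores:
        if score <= your_score:
            count += 1
    percentile_rank = 100.0 * count / len(scores)
    return percentile_rank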
For example, if the scores in the sequence were 55, 66, 77, 88 and 99, and
you got the 88, then your percentile rank would be 100 * 4 / 5 which is
80.
If you are given a value, it is easy to find its percentile rank; going the other
way is slightly harder. If you are given a percentile rank and you want to
find the corresponding value, one option is to sort the values and search for
the one you want:
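def Percentile(scores, percentile_rank):
    # search the sorted scores for the first one whose percentile rank
    # is at least the one we are looking for (uses PercentileRank from above)
    for score in sorted(scores):
        if PercentileRank(scores, score) >= percentile_rank:
            return score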
The result of this calculation is a percentile. For example, the 50th percentile
is the value with percentile rank 50. In the distribution of exam scores, the
50th percentile is 77.
Exercise 3.4 Optional: If you only want to compute one percentile, it is not
efficient to sort the scores. A better option is the selection algorithm, which
you can read about at http://wikipedia.org/wiki/Selection_algorithm.
The CDF is a function of x, where x is any value that might appear in the
distribution. To evaluate CDF(x) for a particular value of x, we compute the
fraction of the values in the sample less than (or equal to) x.
Here’s what that looks like as a function that takes a sample, t, and a value,
x:
def Cdf(t, x):
    count = 0.0
    for value in t:
        if value <= x:
            count += 1.0

    prob = count / len(t)
    return prob
Figure 3.2: Example of a CDF.
Because xs and ps are sorted, these operations can use the bisection algo-
rithm, which is efficient. The run time is proportional to the logarithm of
the number of values; see http://wikipedia.org/wiki/Time_complexity.
Cdfs also provide Render, which returns two lists, xs and ps, suitable for
plotting the CDF. Because the CDF is a step function, these lists have two
elements for each unique value in the distribution.
The Cdf module provides several functions for making Cdfs, including
MakeCdfFromList, which takes a sequence of values and returns their Cdf.
Finally, myplot.py provides functions named Cdf and Cdfs that plot Cdfs as
lines.
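A quick example, assuming the Cdf class provides Prob(x) and Value(p) — the lookups referred to above (Prob maps a value to its cumulative probability, Value maps a probability back to a value):

import Cdf

cdf = Cdf.MakeCdfFromList([1, 2, 2, 3, 5])
print cdf.Prob(2)      # fraction of values <= 2, which is 0.6
print cdf.Value(0.5)   # the median, which is 2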
Exercise 3.5 Download Cdf.py and relay.py (see Exercise 3.2) and generate
a plot that shows the CDF of running speeds. Which gives you a better sense
of the shape of the distribution, the PMF or the CDF? You can download a
solution from http://thinkstats.com/relay_cdf.py.
Figure 3.3: CDF of birth weights for first babies and others.
This figure makes the shape of the distributions, and the differences be-
tween them, much clearer. We can see that first babies are slightly lighter
throughout the distribution, with a larger discrepancy above the mean.
Exercise 3.6 How much did you weigh at birth? If you don’t know, call your
mother or someone else who knows. Using the pooled data (all live births),
compute the distribution of birth weights and use it to find your percentile
rank. If you were a first baby, find your percentile rank in the distribution
for first babies. Otherwise use the distribution for others. If you are in the
90th percentile or higher, call your mother back and apologize.
Exercise 3.7 Suppose you and your classmates compute the percentile rank
of your birth weights and then compute the CDF of the percentile ranks.
What do you expect it to look like? Hint: what fraction of the class do you
expect to be above the median?
For example, if you are above average in weight, but way above average in
height, then you might be relatively light for your height. Here’s how you
could make that claim more precise.
1. Select a cohort of people who are the same height as you (within some
range).
For example, people who compete in foot races are usually grouped by age
and gender. To compare people in different groups, you can convert race
times to percentile ranks.
Exercise 3.8 I recently ran the James Joyce Ramble 10K in Dedham MA.
The results are available from http://coolrunning.com/results/10/ma/
Apr25_27thAn_set1.shtml. Go to that page and find my results. I came
in 97th in a field of 1633, so what is my percentile rank in the field?
It might not be obvious why this works, but since it is easier to implement
than to explain, let’s try it out.
Exercise 3.9 Write a function called Sample, that takes a Cdf and an integer,
n, and returns a list of n values chosen at random from the Cdf. Hint: use
random.random. You will find a solution to this exercise in Cdf.py.
Using the distribution of birth weights from the NSFG, generate a random
sample with 1000 elements. Compute the CDF of the sample. Make a plot
that shows the original CDF and the CDF of the random sample. For large
values of n, the distributions should be the same.
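The idea is to choose a probability p uniformly between 0 and 1 and then look up the corresponding value. A sketch, assuming the Cdf class provides a Value(p) lookup:

import random

def Sample(cdf, n):
    """Draws n random values from the distribution represented by cdf."""
    return [cdf.Value(random.random()) for i in range(n)]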
The 25th and 75th percentiles are often used to check whether a distribution is symmetric, and their difference, which is called the interquartile range, measures the spread.

Exercise 3.11 Write a function called Median that takes a Cdf and computes the median, and one called Interquartile that computes the interquartile range.

Compute the 25th, 50th, and 75th percentiles of the birth weight CDF. Do these values suggest that the distribution is symmetric?

(You might have learned that if you have an even number of elements in a sample, the median is the average of the middle two elements. This is an unnecessary special case, and it has the odd effect of generating a value that is not in the sample. As far as I’m concerned, the median is the 50th percentile. Period.)

3.10 Glossary

percentile rank: The percentage of values in a distribution that are less than or equal to a given value.
Chapter 4
Continuous distributions
4.1 The exponential distribution

The CDF of the exponential distribution is

CDF(x) = 1 − e^(−λx)
The parameter, λ, determines the shape of the distribution. Figure 4.1 shows
what this CDF looks like with λ = 2.
Figure 4.1: CDF of the exponential distribution with λ = 2.
The complementary CDF (CCDF) is 1 − CDF(x), which for the exponential distribution is

y ≈ e^(−λx)

so on a log-y scale the CCDF is a straight line with slope −λ. This gives a simple visual test for whether a dataset fits an exponential distribution.

Figure 4.2: CDF of the time between births.

Figure 4.3: CCDF of the time between births, on a log-y scale.
Hint: You can use the function pyplot.yscale to plot the y axis on a log
scale.
Or, if you use myplot, the Cdf function takes a boolean option, complement,
that determines whether to plot the CDF or CCDF, and string options,
xscale and yscale, that transform the axes; to plot a CCDF on a log-y scale:
myplot.Cdf(cdf, complement=True, xscale='linear', yscale='log')
Exercise 4.2 Collect the birthdays of the students in your class, sort them,
and compute the interarrival times in days. Plot the CDF of the interar-
rival times and the CCDF on a log-y scale. Does it look like an exponential
distribution?
4.2 The Pareto distribution

The CDF of the Pareto distribution is

CDF(x) = 1 − (x / xm)^(−α)

The parameters xm and α determine the location and shape of the distribution. xm is the minimum possible value. Figure 4.4 shows the CDF of a Pareto distribution with parameters xm = 0.5 and α = 1.
Figure 4.4: CDF of a Pareto distribution with xm = 0.5 and α = 1.
The median of this distribution is xm 2^(1/α), which is 1, but the 95th percentile is 10. By contrast, the exponential distribution with median 1 has a 95th percentile of only about 4.3.
There is a simple visual test that indicates whether an empirical distribution
fits a Pareto distribution: on a log-log scale, the CCDF looks like a straight
line. If you plot the CCDF of a sample from a Pareto distribution on a linear
scale, you expect to see a function like:
y ≈ (x / xm)^(−α)

Taking the log of both sides gives log y ≈ −α (log x − log xm), so on a log-log plot the CCDF should look like a straight line with slope −α.
Exercise 4.4 To get a feel for the Pareto distribution, imagine what the world
would be like if the distribution of human height were Pareto. Choosing the
parameters xm = 100 cm and α = 1.7, we get a distribution with a reasonable
minimum, 100 cm, and median, 150 cm.
Generate 6 billion random values from this distribution. What is the mean
of this sample? What fraction of the population is shorter than the mean?
How tall is the tallest person in Pareto World?
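Six billion values is a lot to hold in memory, so you may want to experiment with a smaller sample first. Python's random module has a Pareto generator (with xm = 1) that you can rescale; a sketch:

import random

def ParetoSample(n, xm=100.0, alpha=1.7):
    """Draws n values from a Pareto distribution with minimum xm and shape alpha."""
    return [xm * random.paretovariate(alpha) for i in range(n)]

sample = ParetoSample(10000)
# the median of the sample should be near xm * 2**(1/alpha), about 150 cm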
Exercise 4.5 Zipf’s law is an observation about how often different words
are used. The most common words have very high frequencies, but there
are many unusual words, like “hapax legomenon,” that appear only a few
times. Zipf’s law predicts that in a body of text, called a “corpus,” the dis-
tribution of word frequencies is roughly Pareto.
Find a large corpus, in any language, in electronic format. Count how many
times each word appears. Find the CCDF of the word counts and plot it on a
log-log scale. Does Zipf’s law hold? What is the value of α, approximately?
Exercise 4.6 The Weibull distribution is a generalization of the exponential distribution that comes up in failure analysis. Its CDF is

CDF(x) = 1 − e^(−(x / λ)^k)

Can you find a transformation that makes a Weibull distribution look like a straight line? What do the slope and intercept of the line indicate?
4.3 The normal distribution

The normal distribution has many properties that make it amenable for
analysis, but the CDF is not one of them. Unlike the other distributions
we have looked at, there is no closed-form expression for the normal CDF;
the most common alternative is to write it in terms of the error function,
which is a special function written erf(x):
CDF(x) = (1/2) [1 + erf((x − µ) / (σ√2))]

erf(x) = (2/√π) ∫₀ˣ e^(−t²) dt
The parameters µ and σ determine the mean and standard deviation of the
distribution.
If these formulas make your eyes hurt, don’t worry; they are easy to imple-
ment in Python2 . There are many fast and accurate ways to approximate
erf(x). You can download one of them from http://thinkstats.com/erf.py, which provides functions named erf and NormalCdf.
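If your Python includes math.erf (added in 2.7 and 3.2), the normal CDF from the formula above is a one-liner:

import math

def NormalCdf(x, mu=0.0, sigma=1.0):
    """CDF of the normal distribution with mean mu and standard deviation sigma."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))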
Figure 4.5 shows the CDF of the normal distribution with parameters µ = 2.0
and σ = 0.5. The sigmoid shape of this curve is a recognizable characteristic
of a normal distribution.
In the previous chapter we looked at the distribution of birth weights in the
NSFG. Figure 4.6 shows the empirical CDF of weights for all live births and
the CDF of a normal distribution with the same mean and variance.
The normal distribution is a good model for this dataset. A model is a useful
simplification. In this case it is useful because we can summarize the entire
distribution with just two numbers, µ = 116.5 and σ = 19.9, and the resulting error (difference between the model and the data) is small.

Figure 4.6: CDF of birth weights for all live births, with a normal model.

2 As of Python 3.2, it is even easier; erf is in the math module.
Below the 10th percentile there is a discrepancy between the data and the
model; there are more light babies than we would expect in a normal distri-
bution. If we are interested in studying preterm babies, it would be impor-
tant to get this part of the distribution right, so it might not be appropriate
to use the normal model.
Exercise 4.7 The Wechsler Adult Intelligence Scale is a test that is intended
to measure intelligence3 . Results are transformed so that the distribution of
scores in the general population is normal with µ = 100 and σ = 15.
Exercise 4.8 Plot the CDF of pregnancy lengths for all live births. Does it
look like a normal distribution?
3 Whether it does or not is a fascinating controversy that I invite you to investigate at your leisure.

4 On this topic, you might be interested to read http://wikipedia.org/wiki/Christopher_Langan.
Compute the mean and variance of the sample and plot the normal distri-
bution with the same parameters. Is the normal distribution a good model
for this data? If you had to summarize this distribution with two statistics,
what statistics would you choose?
Exercise 4.9 Write a function called Sample that generates 6 samples from a
normal distribution with µ = 0 and σ = 1. Sort and return the values.
Write a function called Samples that calls Sample 1000 times and returns a
list of 1000 lists.
If you apply zip to this list of lists, the result is 6 lists with 1000 values in
each. Compute the mean of each of these lists and print the results. I predict
that you will get something like this:

[−1.2672, −0.6418, −0.2016, 0.2016, 0.6418, 1.2672]
If you increase the number of times you call Sample, the results should con-
verge on these values.
4.4 Normal probability plot

A normal probability plot shows the sorted values in a sample versus the values you would expect if the distribution were normal. A quick way to make one:

1. Sort the values in the dataset.

2. Generate the same number of random values from a standard normal distribution, and sort them.

3. Plot the sorted values from your dataset versus the random values.
Figure 4.7: Quick-and-dirty normal probability plot of birth weights versus standard normal values.
For large datasets, this method works well. For smaller datasets, you can
improve it by generating m(n+1) − 1 values from a normal distribution,
where n is the size of the dataset and m is a multiplier. Then select every
mth element, starting with the mth.
This method works with other distributions as well, as long as you know
how to generate a random sample.
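Here is a quick-and-dirty sketch of the basic method (without the multiplier refinement); plotting the two sorted lists against each other, for example with pyplot.plot, gives the normal probability plot:

import random

def NormalProbabilityValues(values):
    """Returns sorted standard normal values and sorted data, ready to plot."""
    n = len(values)
    normal = sorted(random.normalvariate(0.0, 1.0) for i in range(n))
    data = sorted(values)
    return normal, data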
Figure 4.7 is a quick-and-dirty normal probability plot for the birth weight
data.
The curvature in this plot suggests that there are deviations from a normal
distribution; nevertheless, it is a good (enough) model for many purposes.
4.5 The lognormal distribution

If the logarithms of a set of values have a normal distribution, the values have a lognormal distribution. The CDF of the lognormal distribution is the same as the CDF of the normal distribution, with log x substituted for
x.
CDFlognormal (x) = CDFnormal (log x)
The parameters of the lognormal distribution are usually denoted µ and σ.
But remember that these parameters are not the mean and standard devia-
tion; the mean of a lognormal distribution is exp(µ + σ²/2) and the standard
deviation is ugly5 .
It turns out that the distribution of weights for adults is approximately log-
normal6 .
The National Center for Chronic Disease Prevention and Health Promotion
conducts an annual survey as part of the Behavioral Risk Factor Surveil-
lance System (BRFSS)7 . In 2008, they interviewed 414,509 respondents and
asked about their demographics, health and health risks.
Among the data they collected are the weights in kilograms of 398,484 respondents. Figure 4.8 shows the distribution of log w, where w is weight in kilograms, along with a normal model.
5 See http://wikipedia.org/wiki/Log-normal_distribution.
6 I was tipped off to this possibility by a comment (without citation) at http://
mathworld.wolfram.com/LogNormalDistribution.html. Subsequently I found a paper
that proposes the log transform and suggests a cause: Penman and Johnson, “The Chang-
ing Shape of the Body Mass Index Distribution Curve in the Population,” Preventing
Chronic Disease, 2006 July; 3(3): A74. Online at http://www.ncbi.nlm.nih.gov/pmc/
articles/PMC1636707.
7 Centers for Disease Control and Prevention (CDC). Behavioral Risk Factor Surveillance
System Survey Data. Atlanta, Georgia: U.S. Department of Health and Human Services,
Centers for Disease Control and Prevention, 2008.
Figure 4.8: CDF of adult weights (log transform) with a normal model.
The normal model is a good fit for the data, although the highest weights
exceed what we expect from the normal model even after the log transform.
Since the distribution of log w fits a normal distribution, we conclude that
w fits a lognormal distribution.
Write a program that reads adult weights from the BRFSS and generates
normal probability plots for w and log w. You can download a solution
from http://thinkstats.com/brfss_figs.py.
Exercise 4.12 The distribution of populations for cities and towns has been
proposed as an example of a real-world phenomenon that can be described
with a Pareto distribution.
The U.S. Census Bureau publishes data on the population of every incorpo-
rated city and town in the United States. I have written a small program
that downloads this data and stores it in a file. You can download it from
http://thinkstats.com/populations.py.
1. Read over the program to make sure you know what it does; then run
it to download and process the data.
3. Plot the CDF on linear and log-x scales so you can get a sense of the
shape of the distribution. Then plot the CCDF on a log-log scale to see
if it has the characteristic shape of a Pareto distribution.
4. Try out the other transformations and plots in this chapter to see if
there is a better model for this data.
What conclusion do you draw about the distribution of sizes for cities
and towns? You can download a solution from http://thinkstats.com/
populations_cdf.py.
Exercise 4.13 The Internal Revenue Service of the United States (IRS) pro-
vides data about income taxes at http://irs.gov/taxstats.
One of their files, containing information about individual incomes for 2008,
is available from http://thinkstats.com/08in11si.csv. I converted it to
a text format called CSV, which stands for “comma-separated values;” you
can read it using the csv module.
Extract the distribution of incomes from this dataset. Are any of the con-
tinuous distributions in this chapter a good model of the data? You can
download a solution from http://thinkstats.com/irs.py.
4.6 Why model?

Continuous models are also a form of data compression. When a model fits
a dataset well, a small set of parameters can summarize a large amount of
data.
4.7 Generating random numbers

Continuous CDFs are also useful for generating random numbers. If there is an efficient way to compute the inverse CDF, ICDF(p), we can generate random values with the right distribution by choosing p from a uniform distribution between 0 and 1 and then computing

x = ICDF(p)

For example, the CDF of the exponential distribution is

p = 1 − e^(−λx)

Solving for x yields

x = −log(1 − p) / λ

So in Python we can write:

import math, random

def expovariate(lam):
    p = random.random()
    x = -math.log(1 - p) / lam
    return x
I called the parameter lam because lambda is a Python keyword. Most im-
plementations of random.random can return 0 but not 1, so 1 − p can be 1
but not 0, which is good, because log 0 is undefined.
Exercise 4.14 Write a function named weibullvariate that takes lam and k
and returns a random value from the Weibull distribution with those pa-
rameters.
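A sketch of one solution, assuming the Weibull CDF 1 − e^(−(x/λ)^k) given above and inverting it the same way as expovariate:

import math
import random

def weibullvariate(lam, k):
    """Draws a random value from a Weibull distribution with scale lam and shape k."""
    p = random.random()
    return lam * (-math.log(1 - p)) ** (1.0 / k)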
4.8 Glossary
empirical distribution: The distribution of values in a sample.
normal probability plot: A plot of the sorted values in a sample versus the
expected value for each if their distribution is normal.
Chapter 5

Probability
In cases like this some people would say that the notion of probability does
not apply. This position is sometimes called frequentism because it defines
probability in terms of frequencies. If there is no set of identical trials, there
is no probability.
Frequentism is philosophically safe, but frustrating because it limits the
scope of probability to physical systems that are either random (like atomic
decay) or so unpredictable that we model them as random (like a tumbling
die). Anything involving people is pretty much off the table.
An alternative is Bayesianism, which defines probability as a degree of be-
lief that an event will occur. By this definition, the notion of probability
can be applied in almost any circumstance. One difficulty with Bayesian
probability is that it depends on a person’s state of knowledge; people with
different information might have different degrees of belief about the same
event. For this reason, many people think that Bayesian probabilities are
more subjective than frequency probabilities.
As an example, what is the probability that Thaksin Shinawatra is the Prime
Minister of Thailand? A frequentist would say that there is no probability
for this event because there is no set of trials. Thaksin either is, or is not, the
PM; it’s not a question of probability.
In contrast, a Bayesian would be willing to assign a probability to this
event based on his or her state of knowledge. For example, if you re-
member that there was a coup in Thailand in 2006, and you are pretty sure
Thaksin was the PM who was ousted, you might assign a probability like
0.1, which acknowledges the possibility that your recollection is incorrect,
or that Thaksin has been reinstated.
If you consult Wikipedia, you will learn that Thaksin is not the PM of Thai-
land (at the time I am writing). Based on this information, you might revise
your probability estimate to 0.01, reflecting the possibility that Wikipedia is
wrong.
A common way to compute the probability that two events both happen is

P(A and B) = P(A) P(B)

This formula only applies if A and B are independent, which means that if I know A occurred, that doesn’t change the probability of B, and vice versa.
For example, if A is tossing a coin and getting heads, and B is rolling a die
and getting 1, A and B are independent, because the coin toss doesn’t tell
me anything about the die roll.
But if I roll two dice, and A is getting at least one six, and B is getting two
sixes, A and B are not independent, because if I know that A occurred, the
probability of B is higher, and if I know B occurred, the probability of A is
1.
When A and B are not independent, it is often useful to compute the condi-
tional probability, P(A|B), which is the probability of A given that we know
B occurred:
P( A and B)
P( A| B) =
P( B)
From that we can derive the general relation
P(A and B) = P(A) P(B|A)
This might not be as easy to remember, but if you translate it into English
it should make sense: “The chance of both things happening is the chance
that the first one happens, and then the second one given the first.”
There is nothing special about the order of events, so we could also write
P(A and B) = P(B) P(A|B)
These relationships hold whether A and B are independent or not. If they
are independent, then P(A|B) = P(A), which gets us back where we started.
Because all probabilities are in the range 0 to 1, it is easy to show that
P(A and B) ≤ P(A)
To picture this, imagine a club that only admits people who satisfy some
requirement, A. Now suppose they add a new requirement for member-
ship, B. It seems obvious that the club will get smaller, or stay the same
if it happens that all the members satisfy B. But there are some scenar-
ios where people are surprisingly bad at this kind of analysis. For exam-
ples and discussion of this phenomenon, see http://wikipedia.org/wiki/
Conjunction_fallacy.
Exercise 5.1 If I roll two dice and the total is 8, what is the chance that one
of the dice is a 6?
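Questions like this are easy to check by simulation: generate many rolls, keep the ones where the condition holds, and count how often the event of interest occurs. A sketch for this exercise:

import random

def prob_six_given_total_8(iters=100000):
    """Estimates P(at least one die is 6 | total is 8) by simulation."""
    hits = 0
    count = 0
    for i in range(iters):
        a = random.randint(1, 6)
        b = random.randint(1, 6)
        if a + b == 8:                 # condition on the total
            count += 1
            if a == 6 or b == 6:       # event of interest
                hits += 1
    return float(hits) / count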
Exercise 5.2 If I roll 100 dice, what is the chance of getting all sixes? What is
the chance of getting no sixes?
Exercise 5.3 The following questions are adapted from Mlodinow, The
Drunkard’s Walk.
1. If a family has two children, what is the chance that they have two
girls?
2. If a family has two children and we know that at least one of them is
a girl, what is the chance that they have two girls?
3. If a family has two children and we know that the older one is a girl,
what is the chance that they have two girls?
4. If a family has two children and we know that at least one of them is
a girl named Florida, what is the chance that they have two girls?
You can assume that the probability that any child is a girl is 1/2, and that
the children in a family are independent trials (in more ways than one). You
can also assume that the percentage of girls named Florida is small.
5.2 Monty Hall

Monty Hall was the original host of the game show Let’s Make a Deal. The
Monty Hall problem is based on one of the regular games on the show. If
you are on the show, here’s what happens:
• Monty shows you three closed doors and tells you that there is a prize
behind each door: one prize is a car, the other two are less valuable
prizes like peanut butter and fake finger nails. The prizes are arranged
at random.
• The object of the game is to guess which door has the car. If you guess
right, you get to keep the car.
• So you pick a door, which we will call Door A. We’ll call the other
doors B and C.
• Before opening the door you chose, Monty likes to increase the sus-
pense by opening either Door B or C, whichever does not have the
car. (If the car is actually behind Door A, Monty can safely open B or
C, so he chooses one at random).
• Then Monty offers you the option to stick with your original choice or
switch to the one remaining unopened door.
The question is, should you “stick” or “switch” or does it make no differ-
ence?
Most people have the strong intuition that it makes no difference. There are
two doors left, they reason, so the chance that the car is behind Door A is
50%.
But that is wrong. In fact, the chance of winning if you stick with Door A is
only 1/3; if you switch, your chances are 2/3. I will explain why, but I don’t
expect you to believe me.
The key is to realize that there are three possible scenarios: the car is behind
Door A, B or C. Since the prizes are arranged at random, the probability of
each scenario is 1/3.
If your strategy is to stick with Door A, then you will win only in Scenario A, which has probability 1/3. If your strategy is to switch, you will win in Scenario B or Scenario C, so the total probability of winning is 2/3.
If you are not completely convinced by this argument, you are in good com-
pany. When a friend presented this solution to Paul Erdős, he replied, “No,
that is impossible. It should make no difference.2 ”
Exercise 5.4 Write a program that simulates the Monty Hall problem and
use it to estimate the probability of winning if you stick and if you switch.
Which do you find more convincing, the simulation or the arguments, and
why?
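If you want to try Exercise 5.4, here is one minimal way to structure the simulation (a sketch of my own, not the book's solution code). The switching strategy should win about 2/3 of the time.

import random

def trial(switch):
    doors = ['A', 'B', 'C']
    car = random.choice(doors)
    pick = 'A'                                  # your initial choice
    # Monty opens a door that is neither your pick nor the car.
    monty = random.choice([d for d in doors if d != pick and d != car])
    if switch:
        pick = next(d for d in doors if d != pick and d != monty)
    return pick == car

def estimate(switch, n=100000):
    return sum(trial(switch) for _ in range(n)) / n

print('stick: ', estimate(False))
print('switch:', estimate(True))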
Exercise 5.5 To understand the Monty Hall problem, it is important to real-
ize that by deciding which door to open, Monty is giving you information.
To see why this matters, imagine the case where Monty doesn’t know where
the prizes are, so he chooses Door B or C at random.
If he opens the door with the car, the game is over, you lose, and you don’t
get to choose whether to switch or stick.
Otherwise, are you better off switching or sticking?
5.3 Poincaré
Henri Poincaré was a French mathematician who taught at the Sorbonne
around 1900. The following anecdote about him is probably fabricated, but
it makes an interesting probability problem.
Supposedly Poincaré suspected that his local bakery was selling loaves of
bread that were lighter than the advertised weight of 1 kg, so every day for
a year he bought a loaf of bread, brought it home and weighed it. At the end
of the year, he plotted the distribution of his measurements and showed that
it fit a normal distribution with mean 950 g and standard deviation 50 g. He
brought this evidence to the bread police, who gave the baker a warning.
For the next year, Poincaré continued the practice of weighing his bread
every day. At the end of the year, he found that the average weight was
1000 g, just as it should be, but again he complained to the bread police,
and this time they fined the baker.
Why? Because the shape of the distribution was asymmetric. Unlike the
normal distribution, it was skewed to the right, which is consistent with
the hypothesis that the baker was still making 950 g loaves, but deliberately
giving Poincaré the heavier ones.
Exercise 5.6 Write a program that simulates a baker who chooses n loaves
from a distribution with mean 950 g and standard deviation 50 g, and gives
the heaviest one to Poincaré. What value of n yields a distribution with
mean 1000 g? What is the standard deviation?
Compare this distribution to a normal distribution with the same mean and
the same standard deviation. Is the difference in the shape of the distribu-
tion big enough to convince the bread police?
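For Exercise 5.6, one way to set up the simulation is sketched below (my own code, with parameter values taken from the story): the baker draws n loaves from a normal distribution with mean 950 g and standard deviation 50 g and hands over the heaviest one.

import random

def heaviest_loaf(n, mu=950, sigma=50):
    return max(random.gauss(mu, sigma) for _ in range(n))

def simulate(n, iters=10000):
    weights = [heaviest_loaf(n) for _ in range(iters)]
    mean = sum(weights) / iters
    var = sum((w - mean) ** 2 for w in weights) / iters
    return mean, var ** 0.5

# Try a few values of n to see which one pushes the mean up to about 1000 g.
for n in [2, 4, 6, 8]:
    print(n, simulate(n))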
In the BRFSS (see Section 4.5), the distribution of heights is roughly normal with parameters µ = 178 cm and σ² = 59.4 cm² for men, and µ = 163 cm and σ² = 52.8 cm² for women.
As an aside, you might notice that the standard deviation for men is higher
and wonder whether men’s heights are more variable. To compare vari-
ability between groups, it is useful to compute the coefficient of variation,
which is the standard deviation as a fraction of the mean, σ/µ. By this mea-
sure, women’s heights are slightly more variable.
5.4 Another rule of probability

If two events A and B are mutually exclusive, they cannot both occur, so the probability that either occurs is the sum P(A or B) = P(A) + P(B), and the conditional probabilities are

P(A|B) = P(B|A) = 0

But remember that the simple addition rule only applies if the events are mutually exclusive. In general, the probability of A or B or both is

P(A or B) = P(A) + P(B) − P(A and B)
The reason we have to subtract off P(A and B) is that otherwise it gets
counted twice. For example, if I flip two coins, the chance of getting at
least one tails is 1/2 + 1/2 − 1/4. I have to subtract 1/4 because otherwise I
am counting heads-heads twice. The problem becomes even clearer if I toss
three coins.
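The coin example is easy to check by enumeration; this small sketch (my own illustration) confirms 3/4 for two coins and 7/8 for three.

from itertools import product

def p_at_least_one_tail(num_coins):
    outcomes = list(product('HT', repeat=num_coins))
    hits = [o for o in outcomes if 'T' in o]
    return len(hits) / len(outcomes)

print(p_at_least_one_tail(2))   # 0.75
print(p_at_least_one_tail(3))   # 0.875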
Exercise 5.8 If I roll two dice, what is the chance of rolling at least one 6?
Exercise 5.9 What is the general formula for the probability of A or B but
not both?
C(n, k) = C(n − 1, k) + C(n − 1, k − 1)
3. Now imagine that you divide a population of 10000 people into 100
cohorts and follow them for 10 years. What is the chance that at least
one of the cohorts will have a “statistically significant” cluster? What
if we require a p-value of 1%?
4. Now imagine that you arrange 10000 people in a 100 ×100 grid and
follow them for 10 years. What is the chance that there will be at least
one 10 ×10 block anywhere in the grid with a statistically significant
cluster?
5. Finally, imagine that you follow a grid of 10000 people for 30 years.
What is the chance that there will be a 10-year interval at some point
with a 10 ×10 block anywhere in the grid with a statistically significant
cluster?
6 See Gawande, “The Cancer Cluster Myth,” New Yorker, Feb 8, 1997.
5.7 Bayes's theorem

Bayes's theorem relates the conditional probabilities of two events:

P(A|B) = P(B|A) P(A) / P(B)
To see that this is true, it helps to write P(A and B), which is the probability that A and B occur. As we saw earlier,

P(A and B) = P(A) P(B|A) = P(B) P(A|B)

Dividing through by P(B) yields Bayes's theorem. Written in terms of a hypothesis, H, and a body of evidence, E, it looks like this:

P(H|E) = P(H) P(E|H) / P(E)
In words, this equation says that the probability of H after you have seen
E is the product of P(H), which is the probability of H before you saw the
evidence, and the ratio of P(E|H), the probability of seeing the evidence
assuming that H is true, and P(E), the probability of seeing the evidence
under any circumstances (H true or not).
A classic use of Bayes’s theorem is the interpretation of clinical tests. For ex-
ample, routine testing for illegal drug use is increasingly common in work-
places and schools (See http://aclu.org/drugpolicy/testing.). The com-
panies that perform these tests maintain that the tests are sensitive, which
means that they are likely to produce a positive result if there are drugs (or
metabolites) in a sample, and specific, which means that they are likely to
yield a negative result if there are no drugs.
Now suppose these tests are applied to a workforce where the actual rate
of drug use is 5%. Of the employees who test positive, how many of them
actually use drugs?
If E is the evidence (a positive test) and D is the event that the employee uses drugs, then by Bayes's theorem

P(D|E) = P(D) P(E|D) / P(E)
The prior, P(D), is the probability of drug use before we see the outcome of the test, which is 5%. The likelihood, P(E|D), is the probability of a positive test assuming drug use, which is the sensitivity.
The normalizing constant, P(E), can be computed by considering the two possibilities, drug use (D) and no drug use (N):

P(D|E) = P(D) P(E|D) / [P(D) P(E|D) + P(N) P(E|N)]

where P(E|N), the probability of a positive test given no drug use, is the complement of the specificity.
Plugging in the given values yields P(D|E) = 0.76, which means that of the
people who test positive, about 1 in 4 is innocent.
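Here is a sketch of the computation (and of Exercise 5.14). The sensitivity and specificity that produce the 0.76 quoted above are not reproduced in this excerpt, so the numbers in the example call are placeholders, not the values used in the text.

def positive_predictive_value(prior, sensitivity, specificity):
    """P(drug use | positive test), by Bayes's theorem."""
    p_pos_given_use = sensitivity
    p_pos_given_no_use = 1 - specificity
    numer = prior * p_pos_given_use
    denom = numer + (1 - prior) * p_pos_given_no_use
    return numer / denom

# Example: 5% prior rate of drug use, with made-up test characteristics.
print(positive_predictive_value(0.05, 0.95, 0.95))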
8 I got these numbers from Gleason and Barnum, “Predictive Probabilities In Employee Drug-Testing,” at http://piercelaw.edu/risk/vol2/winter/gleason.htm.
Exercise 5.14 Write a program that takes the actual rate of drug use, and the
sensitivity and specificity of the test, and uses Bayes’s theorem to compute
P(D|E).
Suppose the same test is applied to a population where the actual rate of
drug use is 1%. What is the probability that someone who tests positive is
actually a drug user?
Exercise 5.16 The blue M&M was introduced in 1995. Before then, the color
mix in a bag of plain M&Ms was (30% Brown, 20% Yellow, 20% Red, 10%
Green, 10% Orange, 10% Tan). Afterward it was (24% Blue, 20% Green,
16% Orange, 14% Yellow, 13% Red, 13% Brown).
A friend of mine has two bags of M&Ms, and he tells me that one is from
1994 and one from 1996. He won’t tell me which is which, but he gives
me one M&M from each bag. One is yellow and one is green. What is the
probability that the yellow M&M came from the 1994 bag?
Exercise 5.17 This exercise is adapted from MacKay, Information Theory, In-
ference, and Learning Algorithms:
Elvis Presley had a twin brother who died at birth. According to the
Wikipedia article on twins:
5.8 Glossary
event: Something that may or may not occur, with some probability.
independent: Two events are independent if the occurrence of one has no effect on the probability of the other.
likelihood of the evidence: One of the terms in Bayes’s theorem, the prob-
ability of the evidence conditioned on a hypothesis.
Chapter 6

Operations on distributions
6.1 Skewness
Skewness is a statistic that measures the asymmetry of a distribution.
Given a sequence of values, xi , the sample skewness is:
g1 = m3 / m2^{3/2}

where m2 and m3 are the second and third central moments:

m2 = (1/n) Σ (xi − µ)²

m3 = (1/n) Σ (xi − µ)³
You might recognize m2 as the mean squared deviation (also known as vari-
ance); m3 is the mean cubed deviation.
Negative skewness indicates that a distribution “skews left;” that is, it ex-
tends farther to the left than the right. Positive skewness indicates that a
distribution skews right.
In practice, computing the skewness of a sample is usually not a good idea.
If there are any outliers, they have a disproportionate effect on g1 .
Another way to evaluate the asymmetry of a distribution is to look at the
relationship between the mean and median. Extreme values have more ef-
fect on the mean than the median, so in a distribution that skews left, the
mean is less than the median.
Pearson’s median skewness coefficient is an alternative measure of skew-
ness that explicitly captures the relationship between the mean, µ, and the
median, µ1/2 :
gp = 3 (µ − µ1/2) / σ
This statistic is robust, which means that it is less vulnerable to the effect of
outliers.
Exercise 6.1 Write a function named Skewness that computes g1 for a sam-
ple.
Compute the skewness for the distributions of pregnancy length and birth
weight. Are the results consistent with the shape of the distributions?
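For reference, here is one straightforward way to implement the statistics defined above (a sketch of my own, not the book's solution code).

def mean(xs):
    return sum(xs) / len(xs)

def central_moment(xs, k):
    mu = mean(xs)
    return sum((x - mu) ** k for x in xs) / len(xs)

def skewness(xs):
    """Sample skewness: g1 = m3 / m2**1.5."""
    m2 = central_moment(xs, 2)
    m3 = central_moment(xs, 3)
    return m3 / m2 ** 1.5

def pearson_median_skewness(xs, median):
    """gp = 3 (mean - median) / standard deviation; median is passed in."""
    return 3 * (mean(xs) - median) / central_moment(xs, 2) ** 0.5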
Exercise 6.2 The “Lake Wobegon effect” is an amusing nickname1 for il-
lusory superiority, which is the tendency for people to overestimate their
abilities relative to others. For example, in some surveys, more than 80%
of respondents believe that they are better than the average driver (see
http://wikipedia.org/wiki/Illusory_superiority).
What percentage of the population has more than the average number of
legs?
Exercise 6.3 The Internal Revenue Service of the United States (IRS) pro-
vides data about income taxes, and other statistics, at http://irs.gov/
taxstats. If you did Exercise 4.13, you have already worked with this
data; otherwise, follow the instructions there to extract the distribution of
incomes from this dataset.
What fraction of the population reports a taxable income below the mean?
Compute the median, mean, skewness and Pearson’s skewness of the in-
come data. Because the data has been binned, you will have to make some
approximations.
Hint: use the PMF to compute the relative mean difference (see http://
wikipedia.org/wiki/Mean_difference).
You can download a solution to this exercise from http://thinkstats.com/
gini.py.
class Exponential(RandomVariable):

    def __init__(self, lam):
        self.lam = lam

    def generate(self):
        return random.expovariate(self.lam)
The init method takes the parameter, λ, and stores it as an attribute. The
generate method returns a random value from the exponential distribution
with that parameter.
Each time you invoke generate, you get a different value. The value you
get is called a random variate, which is why many function names in the
random module include the word “variate.”
For example, an Erlang distribution is the distribution of the sum of k values from an exponential distribution with parameter λ. Here is a RandomVariable that generates Erlang variates:

class Erlang(RandomVariable):

    def __init__(self, lam, k):
        self.lam = lam
        self.k = k
        self.expo = Exponential(lam)

    def generate(self):
        total = 0
        for i in range(self.k):
            total += self.expo.generate()
        return total
The init method creates an Exponential object with the given parameter;
then generate uses it. In general, the init method can take any set of pa-
rameters and the generate function can implement any random process.
Exercise 6.4 Write a definition for a class that represents a random vari-
able with a Gumbel distribution (see http://wikipedia.org/wiki/Gumbel_
distribution).
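One way to approach Exercise 6.4 is inverse-CDF sampling: if U is uniform on (0, 1), then µ − β log(−log U) has a Gumbel distribution with location µ and scale β. The sketch below is my own code; it stands in a trivial base class for the RandomVariable used earlier, since Python's random module has no built-in Gumbel variate.

import math
import random

class RandomVariable(object):
    """Stand-in for the base class used earlier in this chapter."""

class Gumbel(RandomVariable):

    def __init__(self, mu, beta):
        self.mu = mu
        self.beta = beta

    def generate(self):
        # random.random() returns a value in [0, 1); the chance of drawing
        # exactly 0 is negligible for the purposes of this sketch.
        u = random.random()
        return self.mu - self.beta * math.log(-math.log(u))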
6.3 PDFs
The derivative of a CDF is called a probability density function, or PDF.
For example, the PDF of an exponential distribution is

PDFexpo(x) = λ e^{−λx}
Evaluating a PDF for a particular value of x is usually not useful. The result
is not a probability; it is a probability density.
In physics, density is mass per unit of volume; in order to get a mass, you
have to multiply by volume or, if the density is not constant, you have to
integrate over volume.
Similarly, to get a probability you have to integrate the probability density over an interval of x. Or, since the CDF is the integral of the PDF, we can write

P(A ≤ x ≤ B) = CDF(B) − CDF(A)
Exercise 6.5 What is the probability that a value chosen from an exponential
distribution with parameter λ falls between 1 and 20? Express your answer
as a function of λ. Keep this result handy; we will use it in Section 8.8.
Exercise 6.6 In the BRFSS (see Section 4.5), the distribution of heights is roughly normal with parameters µ = 178 cm and σ² = 59.4 cm² for men, and µ = 163 cm and σ² = 52.8 cm² for women.

In order to join Blue Man Group, you have to be male and between 5’10” and 6’1” (see http://bluemancasting.com). What percentage of the U.S. male population is in this range? Hint: see Section 4.3.
2 To take the analogy one step farther, the mean of a distribution is its center of mass, and the variance is its moment of inertia.
6.4 Convolution
Suppose we have two random variables, X and Y, with distributions CDFX
and CDFY . What is the distribution of the sum Z = X + Y?
class Sum(RandomVariable):

    def __init__(self, X, Y):
        self.X = X
        self.Y = Y

    def generate(self):
        return self.X.generate() + self.Y.generate()
Given any RandomVariables, X and Y, we can create a Sum object that rep-
resents Z. Then we can use a sample from Z to approximate CDFZ .
This approach is simple and versatile, but not very efficient; we have to
generate a large sample to estimate CDFZ accurately, and even then it is not
exact.
If CDFX and CDFY are expressed as functions, sometimes we can find CDFZ
exactly. Here’s how:
1. Start with the conditional probability that the sum is less than z, given that X has the particular value x:

P(Z ≤ z | X = x) = P(Y ≤ z − x)

Let's read that back. The left side is "the probability that the sum is less than z, given that the first term is x." Well, if the first term is x and the sum has to be less than z, then the second term has to be less than z − x.

2. The probability on the right side is just the CDF of Y evaluated at z − x:

P(Y ≤ z − x) = CDFY(z − x)

3. Good so far? Let's go on. Since we don't actually know the value of x, we have to consider all values it could have and integrate over them:

P(Z ≤ z) = ∫_{−∞}^{∞} P(Z ≤ z | X = x) PDFX(x) dx

4. To get PDFZ, take the derivative of both sides with respect to z. The result is

PDFZ(z) = ∫_{−∞}^{∞} PDFY(z − x) PDFX(x) dx

If you have studied signals and systems, you might recognize that integral. It is the convolution of PDFY and PDFX, denoted with the operator ⋆:

PDFZ = PDFY ⋆ PDFX
So the distribution of the sum is the convolution of the distributions.
See http://wiktionary.org/wiki/booyah!
As an example, suppose X and Y both have exponential distributions with parameter λ, and Z = X + Y. Now we have to remember that PDFexpo is 0 for all negative values, but we can handle that by adjusting the limits of integration:

PDFZ(z) = ∫_{0}^{z} λ e^{−λx} λ e^{−λ(z−x)} dx

Now we can combine terms and move constants outside the integral:

PDFZ(z) = λ² e^{−λz} ∫_{0}^{z} dx = λ² z e^{−λz}
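A quick Monte Carlo check of this result (my own sketch): integrating the density λ² z e^{−λz} gives the CDF 1 − e^{−λz}(1 + λz), which should match the empirical CDF of a sample of sums of two exponential variates.

import math
import random

def empirical_cdf(sample, z):
    return sum(1 for x in sample if x <= z) / len(sample)

def analytic_cdf(z, lam):
    return 1 - math.exp(-lam * z) * (1 + lam * z)

lam = 2.0
sample = [random.expovariate(lam) + random.expovariate(lam)
          for _ in range(100000)]

for z in [0.25, 0.5, 1.0, 2.0]:
    print(z, empirical_cdf(sample, z), analytic_cdf(z, lam))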
One particularly useful case is the sum of two normal variates: if X ∼ N(µX, σX²) and Y ∼ N(µY, σY²), then Z = X + Y is also normal, with

Z ∼ N(µX + µY, σX² + σY²)
Exercise 6.11 Let’s see what happens when we add values from other dis-
tributions. Choose a pair of distributions (any two of exponential, normal,
lognormal, and Pareto) and choose parameters that make their mean and
variance similar.
Generate random numbers from these distributions and compute the distri-
bution of their sums. Use the tests from Chapter 4 to see if the sum can be
modeled by a continuous distribution.
• If we add values drawn from other distributions, the sum does not
generally have one of the continuous distributions we have seen.
But it turns out that if we add up a large number of values from almost any
distribution, the distribution of the sum converges to normal.
More specifically, if the distribution of the values has mean and standard
deviation µ and σ, the distribution of the sum is approximately N (nµ, nσ2 ).
This is called the Central Limit Theorem. It is one of the most useful tools
for statistical analysis, but it comes with caveats:
• The values have to come from the same distribution (although this
requirement can be relaxed).
• The values have to be drawn from a distribution with finite mean and
variance, so most Pareto distributions are out.
• The number of values you need before you see convergence depends
on the skewness of the distribution. Sums from an exponential dis-
tribution converge for small sample sizes. Sums from a lognormal
distribution do not.
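A small demonstration of the theorem (my own sketch): as n grows, the skewness of the sum of n exponential variates shrinks toward 0, as it must if the distribution of the sum is approaching a normal distribution.

import random

def sample_skewness(xs):
    n = len(xs)
    mu = sum(xs) / n
    m2 = sum((x - mu) ** 2 for x in xs) / n
    m3 = sum((x - mu) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

for n in [1, 2, 10, 50]:
    sums = [sum(random.expovariate(1.0) for _ in range(n))
            for _ in range(10000)]
    print(n, sample_skewness(sums))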
The Central Limit Theorem explains, at least in part, the prevalence of nor-
mal distributions in the natural world. Most characteristics of animals and
other life forms are affected by a large number of genetic and environmental
factors whose effect is additive. The characteristics we measure are the sum
of a large number of small effects, so their distribution tends to be normal.
We started with PMFs, which represent the probabilities for a discrete set
of values. To get from a PMF to a CDF, we computed a cumulative sum.
To be more consistent, a discrete CDF should be called a cumulative mass
function (CMF), but as far as I can tell no one uses that term.
[Figure: the framework of distribution representations — PMF and CDF (CMF) on the discrete side, PDF and CDF on the continuous side.]
6.8 Glossary
skewness: A characteristic of a distribution; intuitively, it is a measure of
how asymmetric the distribution is.
robust: A statistic is robust if it is relatively immune to the effect of outliers.
illusory superiority: The tendency of people to imagine that they are better
than average.
Chapter 7

Hypothesis testing
Exploring the data from the NSFG, we saw several “apparent effects,” in-
cluding a number of differences between first babies and others. So far we
have taken these effects at face value; in this chapter, finally, we put them to
the test.
All three of these questions are harder than they look. Nevertheless, there
is a general structure that people use to test statistical significance:
p-value: The p-value is the probability of the apparent effect under the null
hypothesis.
For these examples, the null hypothesis is that the distributions for the two
groups are the same, and that the apparent difference is due to chance.
To compute p-values, we find the pooled distribution for all live births (first
babies and others), generate random samples that are the same size as the
observed samples, and compute the difference in means under the null hy-
pothesis.
If we generate a large number of samples, we can count how often the dif-
ference in means (due to chance) is as big or bigger than the difference we
actually observed. This fraction is the p-value.
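Here is a sketch of that procedure (my own code, not the book's hypothesis.py): it resamples, with replacement, groups of the same sizes from the pooled data and counts how often the resampled difference is at least as large as the observed one.

import random

def mean(xs):
    return sum(xs) / len(xs)

def p_value(group1, group2, iters=1000):
    observed = abs(mean(group1) - mean(group2))
    pool = group1 + group2
    n, m = len(group1), len(group2)
    count = 0
    for _ in range(iters):
        sample1 = [random.choice(pool) for _ in range(n)]
        sample2 = [random.choice(pool) for _ in range(m)]
        if abs(mean(sample1) - mean(sample2)) >= observed:
            count += 1
    return count / iters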
For pregnancy length, we observed n = 4413 first babies and m = 4735 others,
and the difference in mean was δ = 0.078 weeks. To approximate the p-value
of this effect, I pooled the distributions, generated samples with sizes n and
m and computed the difference in mean.
[Figure 7.1: CDF(x) of the resampled differences in mean.]
The mean difference is near 0, as you would expect with samples from the
same distribution. The vertical lines show the cutoffs where x = −δ or x = δ.
Of 1000 sample pairs, there were 166 where the difference in mean (positive
or negative) was as big or bigger than δ, so the p-value is approximately
0.166. In other words, we expect to see an effect as big as δ about 17% of the
time, even if the actual distribution for the two groups is the same.
So the apparent effect is not very likely, but is it unlikely enough? I’ll ad-
dress that in the next section.
Exercise 7.1 In the NSFG dataset, the difference in mean weight for first
births is 2.0 ounces. Compute the p-value of this difference.
You can start with the code I used to generate the results in this section,
which you can download from http://thinkstats.com/hypothesis.py.
For this kind of hypothesis testing, we can compute the probability of a false
positive explicitly: it turns out to be α.
To see why, think about the definition of false positive—the chance of ac-
cepting a hypothesis that is false—and the definition of a p-value—the
chance of generating the measured effect if the hypothesis is false.
Putting these together, we can ask: if the hypothesis is false, what is the
chance of generating a measured effect that will be considered significant
with threshold α? The answer is α.
But there is a price to pay: decreasing the threshold raises the standard of
evidence, which increases the chance of rejecting a valid hypothesis.
In general there is a tradeoff between Type I and Type II errors. The only
way to decrease both at the same time is to increase the sample size (or, in
some cases, decrease measurement error).
Exercise 7.2 To investigate the effect of sample size on p-value, see what
happens if you discard half of the data from the NSFG. Hint: use
random.sample. What if you discard three-quarters of the data, and so on?
1 Also known as a “Significance criterion.”
What is the smallest sample size where the difference in mean birth weight
is still significant with α = 5%? How much larger does the sample size have
to be with α = 1%?
You can start with the code I used to generate the results in this section,
which you can download from http://thinkstats.com/hypothesis.py.
Another way to interpret the result is to compute the posterior probability of the alternative hypothesis, HA, given the evidence E:

P(HA|E) = P(E|HA) P(HA) / P(E)
As an example, I’ll compute P(H A |E) for pregnancy lengths in the NSFG.
We have already computed P(E|H 0 ) = 0.166, so all we have to do is compute
P(E|H A ) and choose a value for the prior.
To compute P(E|H A ), we assume that the effect is real—that is, that the
difference in mean duration, δ, is actually what we observed, 0.078. (This
way of formulating H A is a little bit bogus. I will explain and fix the problem
in the next section.)
By generating 1000 sample pairs, one from each distribution, I estimated
P(E|H A ) = 0.494. With the prior P(H A ) = 0.5, the posterior probability of
H A is 0.748.
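Spelling out the arithmetic behind those numbers (a check of my own, assuming equal prior probabilities for HA and H0):

prior = 0.5
like_HA = 0.494     # P(E | H_A), estimated by simulation above
like_H0 = 0.166     # P(E | H_0), the p-value computed earlier
posterior = prior * like_HA / (prior * like_HA + (1 - prior) * like_H0)
print(posterior)    # approximately 0.748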
So if the prior probability of H A is 50%, the updated probability, taking into
account the evidence from this dataset, is almost 75%. It makes sense that
the posterior is higher, since the data provide some support for the hypoth-
esis. But it might seem surprising that the difference is so large, especially
since we found that the difference in means was not statistically significant.
In fact, the method I used in this section is not quite right, and it tends to
overstate the impact of the evidence. In the next section we will correct this
tendency.
Exercise 7.3 Using the data from the NSFG, what is the posterior probability
that the distribution of birth weights is different for first babies and others?
You can start with the code I used to generate the results in this section,
which you can download from http://thinkstats.com/hypothesis.py.
7.5 Cross-validation
In the previous example, we used the dataset to formulate the hypothesis
H A , and then we used the same dataset to test it. That’s not a good idea; it
is too easy to generate misleading results.
The problem is that even when the null hypothesis is true, there is likely to
be some difference, δ, between any two groups, just by chance. If we use
the observed value of δ to formulate the hypothesis, P(H A |E) is likely to be
high even when H A is false.
We can address this problem with cross-validation, which uses one dataset
to compute δ and a different dataset to evaluate H A . The first dataset is called
the training set; the second is called the testing set.
In a study like the NSFG, which studies a different cohort in each cycle, we
can use one cycle for training and another for testing. Or we can partition
the data into subsets (at random), then use one for training and one for
testing.
Exercise 7.4 If your prior probability for a hypothesis, H A , is 0.3 and new
evidence becomes available that yields a likelihood ratio of 3 relative to the
null hypothesis, H 0 , what is your posterior probability for H A ?
Exercise 7.5 This exercise is adapted from MacKay, Information Theory, Infer-
ence, and Learning Algorithms:
Two people have left traces of their own blood at the scene of
a crime. A suspect, Oliver, is tested and found to have type
O blood. The blood groups of the two traces are found to be
of type O (a common type in the local population, having fre-
quency 60%) and of type AB (a rare type, with frequency 1%).
Do these data (the blood types found at the scene) give evidence
in favor of the proposition that Oliver was one of the two people
whose blood was found at the scene?
Hint: Compute the likelihood ratio for this evidence; if it is greater than 1,
then the evidence is in favor of the proposition. For a solution and discus-
sion, see page 55 of MacKay’s book.
So maybe the distributions have the same mean and different variance. We
could test the significance of the difference in variance, but variances are less
robust than means, and hypothesis tests for variance often behave badly.
1. We define a set of categories, called cells, that each baby might fall
into. In this example, there are six cells because there are two groups
(first babies and others) and three bins (early, on time or late).
I’ll use the definitions from Section 2.10: a baby is early if it is born
during Week 37 or earlier, on time if it is born during Week 38, 39 or
40, and late if it is born during Week 41 or later.
2. Under the null hypothesis, both groups have the same distribution of pregnancy lengths, so for each cell we can compute an expected count, Ei, from the pooled distribution.

3. For each cell we compute the deviation; that is, the difference between the observed value, Oi, and the expected value, Ei.
A standard way to combine the deviations into a single test statistic is the chi-square statistic, χ² = Σ (Oi − Ei)² / Ei. When the chi-square statistic is used, this process is called a chi-square test. One feature of the chi-square test is that the distribution of the test statistic under the null hypothesis can be computed analytically.
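For reference, computing the statistic from lists of expected and observed counts takes only a few lines (my own sketch; the book's code for this section is in chi.py):

def chi_squared(expected, observed):
    """Chi-square statistic: sum of (O_i - E_i)**2 / E_i over the cells."""
    return sum((o - e) ** 2 / e for e, o in zip(expected, observed))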
Using the data from the NSFG I computed χ2 = 91.64, which would occur
by chance about one time in 10,000. I conclude that this result is statisti-
cally significant, with one caution: again we used the same dataset for ex-
ploration and testing. It would be a good idea to confirm this result with
another dataset.
You can download the code I used in this section from http://thinkstats.
com/chi.py.
Exercise 7.6 Suppose you run a casino and you suspect that a customer has
replaced a die provided by the casino with a “crooked die;” that is, one that
has been tampered with to make one of the faces more likely to come up
than the others. You apprehend the alleged cheater and confiscate the die,
but now you have to prove that it is crooked.
You roll the die 60 times and get the following results:
Value        1    2    3    4    5    6
Frequency    8    9   19    6    8   10
What is the chi-squared statistic for these values? What is the probability of
seeing a chi-squared value as large by chance?
However, there are times when a little analysis can save a lot of computing,
and Figure 7.1 is one of those times.
Remember that we were testing the observed difference in the mean be-
tween pregnancy lengths for n = 4413 first babies and m = 4735 others. We
formed the pooled distribution for all babies, drew samples with sizes n and
m, and computed the difference in sample means.
To figure out the distribution of the sample means, we have to invoke one of the properties of the normal distribution: if X is N(µ, σ²), then

aX + b ∼ N(aµ + b, a²σ²)

So as a special case:

X/n ∼ N(µ/n, σ²/n²)

By the Central Limit Theorem, the mean of a sample of size n drawn from the pooled distribution (which has variance σ²) is approximately N(µ, σ²/n), so the difference between two independent sample means is approximately N(0, σ²/n + σ²/m). Putting it all together, we conclude that the sample in Figure 7.1 is drawn from N(0, f σ²), where f = 1/n + 1/m. Plugging in n = 4413 and m = 4735, we expect the difference of sample means to be N(0, 0.0032).
import math
import erf    # erf provides NormalCdf; it is part of the code that accompanies the book

delta = 0.078
sigma = math.sqrt(0.0032)
left = erf.NormalCdf(-delta, 0.0, sigma)
right = 1 - erf.NormalCdf(delta, 0.0, sigma)
p_value = left + right
The sum of the left and right tails is the p-value, 0.168, which is pretty close
to what we estimated by resampling, 0.166. You can download the code I
used in this section from http://thinkstats.com/hypothesis_analytic.
py
7.9 Power
When the result of a hypothesis test is negative (that is, the effect is not
statistically significant), can we conclude that the effect is not real? That
depends on the power of the test.
Statistical power is the probability that the test will be positive if the null
hypothesis is false. In general, the power of a test depends on the sample
size, the magnitude of the effect, and the threshold α.
Exercise 7.7 What is the power of the test in Section 7.2, using α = 0.05 and
assuming that the actual difference between the means is 0.078 weeks?
You can estimate power by generating random samples from distributions
with the given difference in the mean, testing the observed difference in the
mean, and counting the number of positive tests.
What is the power of the test with α = 0.10?
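One way to estimate power along those lines is sketched below (my own code). It assumes normally distributed groups with standard deviation sigma, which is a simplification of the pregnancy-length data, and it reuses the resampling p_value function sketched earlier in this chapter.

import random

def estimate_power(n, m, delta, sigma, alpha=0.05, iters=100):
    positives = 0
    for _ in range(iters):
        group1 = [random.gauss(0.0, sigma) for _ in range(n)]
        group2 = [random.gauss(delta, sigma) for _ in range(m)]
        # p_value is the resampling test sketched earlier in this chapter.
        if p_value(group1, group2) < alpha:
            positives += 1
    return positives / iters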
One way to report the power of a test, along with a negative result, is to
say something like, “If the apparent effect were as large as x, this test would
reject the null hypothesis with probability p.”
7.10 Glossary
significant: An effect is statistically significant if it is unlikely to occur by
chance.
false negative: The conclusion that an effect is due to chance when it is not.
two-sided test: A test that asks, “What is the chance of an effect as big as
the observed effect, positive or negative?”
one-sided test: A test that asks, “What is the chance of an effect as big as
the observed effect, and with the same sign?”
chi-square test: A test that uses the chi-square statistic as the test statistic.
likelihood ratio: The ratio of P(E|A) to P(E|B) for two hypotheses A and
B, which is a way to report results from a Bayesian analysis without
depending on priors.
cell: In a chi-square test, the categories the observations are divided into.
power: The probability that a test will reject the null hypothesis if it is false.
Chapter 8
Estimation
I'm thinking of a distribution. I'll give you two hints; it's a normal distribution, and here's a random sample drawn from it:

{−0.441, 1.774, −0.101, −1.138, 2.975, −2.138}

What do you think is the mean parameter, µ, of this distribution?
One choice is to use the sample mean to estimate µ. Up until now we have
used the symbol µ for both the sample mean and the mean parameter, but
now to distinguish them I will use x̄ for the sample mean. In this example,
x̄ is 0.155, so it would be reasonable to guess µ = 0.155.
This process is called estimation, and the statistic we used (the sample
mean) is called an estimator.
Now suppose the same game is played with a sample that contains an outlier; say the last value above is −213.8 instead of −2.138. Now what's your estimate of µ? If you use the sample mean your guess is −35.12. Is that the best choice? What are the alternatives?
One option is to identify and discard outliers, then compute the sample
mean of the rest. Another option is to use the median as an estimator.
Which estimator is the best depends on the circumstances (for example,
whether there are outliers) and on what the goal is. Are you trying to mini-
mize errors, or maximize your chance of getting the right answer?
If there are no outliers, the sample mean minimizes the mean squared error
(MSE). If we play the game many times, and each time compute the error
x̄ − µ, the sample mean minimizes
1
m∑
MSE = ( x̄ − µ)2
Where m is the number of times you play the estimation game (not to be
confused with n, which is the size of the sample used to compute x̄).
Minimizing MSE is a nice property, but it’s not always the best strategy.
For example, suppose we are estimating the distribution of wind speeds
at a building site. If we guess too high, we might overbuild the structure,
increasing its cost. But if we guess too low, the building might collapse.
Because cost as a function of error is asymmetric, minimizing MSE is not
the best strategy.
As another example, suppose I roll three six-sided dice and ask you to pre-
dict the total. If you get it exactly right, you get a prize; otherwise you get
nothing. In this case the value that minimizes MSE is 10.5, but that would
be a terrible guess. For this game, you want an estimator that has the high-
est chance of being right, which is a maximum likelihood estimator (MLE).
If you pick 10 or 11, your chance of winning is 1 in 8, and that’s the best you
can do.
Exercise 8.1 Write a function that draws 6 values from a normal distribution
with µ = 0 and σ = 1. Use the sample mean to estimate µ and compute the
error x̄ − µ. Run the function 1000 times and compute MSE.
Now modify the program to use the median as an estimator. Compute MSE
again and compare to the MSE for x̄.
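If you want to check your answer to Exercise 8.1, a simulation along these lines (my own sketch) is enough; with no outliers, the sample mean should come out with the smaller MSE.

import random

def mean(xs):
    return sum(xs) / len(xs)

def median(xs):
    s = sorted(xs)
    n = len(s)
    return (s[n // 2] + s[(n - 1) // 2]) / 2

def mse(estimator, n=6, iters=1000, mu=0.0, sigma=1.0):
    errors = []
    for _ in range(iters):
        sample = [random.gauss(mu, sigma) for _ in range(n)]
        errors.append(estimator(sample) - mu)
    return sum(e ** 2 for e in errors) / iters

print('mean:  ', mse(mean))
print('median:', mse(median))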
An obvious way to estimate the variance, σ², of the distribution is to compute the sample variance, S² = (1/n) Σ (xi − x̄)². For large samples S² is an adequate estimator, but for small samples it tends to be too low; it is a biased estimator.
An estimator is unbiased if the expected total (or mean) error, after many
iterations of the estimation game, is 0. Fortunately, there is another simple
statistic that is an unbiased estimator of σ2 :
S²_{n−1} = (1/(n − 1)) Σ (xi − x̄)²

The biggest problem with this estimator is that its name and symbol are used inconsistently. The name “sample variance” can refer to either S² or S²_{n−1}, and the symbol S² is used for either or both.

For an explanation of why S² is biased, and a proof that S²_{n−1} is unbiased, see http://wikipedia.org/wiki/Bias_of_an_estimator.
Exercise 8.2 Write a function that draws 6 values from a normal distribution
with µ = 0 and σ = 1. Use the sample variance to estimate σ2 and compute
the error S2 − σ2 . Run the function 1000 times and compute mean error (not
squared).
Now modify the program to use the unbiased estimator S²_{n−1}. Compute the
mean error again and see if it converges to zero as you increase the number
of games.
While you are playing the game, you don’t know the errors. That is, if I
give you a sample and ask you to estimate a parameter, you can compute
the value of the estimator, but you can’t compute the error. If you could,
you wouldn’t need the estimator!
The reason we talk about estimation error is to describe the behavior of
different estimators in the long run. In this chapter we run experiments to
examine those behaviors; these experiments are artificial in the sense that
we know the actual values of the parameters, so we can compute errors.
But when you work with real data, you don’t, so you can’t.
Now let’s get back to the game.
λ̂1/2 = log(2)/µ1/2
where n is the sample size, λ̂ is the mean-based estimator from the previous
section, and χ2 (k, x) is the CDF of a chi-squared distribution with k degrees
of freedom, evaluated at x (see http://wikipedia.org/wiki/Chi-square_
distribution).
In general, confidence intervals are hard to compute analytically, but rel-
atively easy to estimate using simulation. But first we need to talk about
Bayesian estimation.
interval. But from a frequentist point of view, that is not correct because the
parameter is an unknown but fixed value. It is either in the interval you
computed or not, so the frequentist definition of probability doesn’t apply.
1. Divide the range (0.5, 1.5) into a set of equal-sized bins. For each bin,
we define H i , which is the hypothesis that the actual value of λ falls in
the i th bin. Since λ was drawn from a uniform distribution, the prior
probability, P(H i ), is the same for all i.
2. For each hypothesis Hi, compute the likelihood of the sample X; that is, the product of the exponential densities evaluated at the observed values:

P(X|Hi) = ∏j expo(λi, xj)

3. Compute the normalizing constant,

f = Σi P(Hi) P(X|Hi)

and divide through to get the posterior probability of each hypothesis, P(Hi|X) = P(Hi) P(X|Hi) / f.
The likelihood function loops through the observed values, multiplies the densities together, and ends with return likelihood. Putting it all together, the full program creates the prior, computes the likelihood of the sample under each hypothesis, and normalizes to get the posterior.
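The original listing is not reproduced in this excerpt, so the following is a minimal sketch of the computation just described (the function names and the list-based representation are mine, not the book's): a uniform prior on a grid of λ values in (0.5, 1.5), an exponential likelihood for the observed sample, and a normalized posterior.

import math

def make_uniform_prior(low=0.5, high=1.5, num_bins=101):
    lams = [low + (high - low) * i / (num_bins - 1) for i in range(num_bins)]
    prior = [1.0 / num_bins] * num_bins
    return lams, prior

def likelihood(sample, lam):
    """P(X | H_i): the product of exponential densities."""
    result = 1.0
    for x in sample:
        result *= lam * math.exp(-lam * x)
    return result

def compute_posterior(sample, lams, prior):
    unnormalized = [p * likelihood(sample, lam) for lam, p in zip(lams, prior)]
    f = sum(unnormalized)                    # the normalizing constant
    return [u / f for u in unnormalized]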
Initially, each person has a degree of confidence about their own hypothesis.
After seeing the evidence, each person updates their confidence based on
P(E|H), the likelihood of the evidence, given their hypothesis.
So the net effect is that some people get more confident, and some less,
depending on the relative likelihood of their hypothesis.
One of the strengths of Bayesian estimation is that it can deal with censored
data with relative ease. We can use the method from the previous section
with only one change: we have to replace PDFexpo with the conditional dis-
tribution:
PDFcond(λ, x) = λ e^{−λx} / Z(λ)

for 1 < x < 20, and 0 otherwise, with

Z(λ) = ∫_{1}^{20} λ e^{−λx} dx = e^{−λ} − e^{−20λ}
You might remember Z(λ) from Exercise 6.5. I told you to keep it handy.
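In code, the change is equally small; the sketch below (my own) shows the conditional density, which could replace the ordinary exponential PDF in the likelihood computation when the observations are censored to the interval (1, 20).

import math

def pdf_cond(lam, x, low=1.0, high=20.0):
    """Exponential density conditioned on low < x < high."""
    if not low < x < high:
        return 0.0
    z = math.exp(-lam * low) - math.exp(-lam * high)
    return lam * math.exp(-lam * x) / z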
Exercise 8.5 In the 2008 Minnesota Senate race the final vote count was
1,212,629 votes for Al Franken and 1,212,317 votes for Norm Coleman.
Franken was declared the winner, but as Charles Seife points out in Proofi-
ness, the margin of victory was much smaller than the margin of error, so
the result should have been considered a tie.
Assuming that there is a chance that any vote might be lost and a chance that
any vote might be double-counted, what is the probability that Coleman
actually received more votes?
Hint: you will have to fill in some details to model the error process.
8.9 The locomotive problem

The locomotive problem is a classic (it appears in Mosteller, Fifty Challenging Problems in Probability): a railroad numbers its locomotives in order 1..N. One day you see a locomotive with the number 60. Estimate how many locomotives the railroad has.

Before you read the rest of this section, try to answer the question. For best results, you should take some time to work on it before you continue.
For a given estimate, N̂, the likelihood of seeing train i is 1/ N̂ if i ≤ N̂, and
0 otherwise. So the MLE is N̂ = i. In other words, if you see train 60 and
you want to maximize your chance of getting the answer exactly right, you
should guess that there are 60 trains.
But this estimator doesn’t do very well in terms of MSE. We can do better
by choosing N̂ = ai; all we have to do is find a good value for a.
Suppose that there are, in fact, N trains. Each time we play the estimation
game, we see train i and guess ai, so the squared error is (ai − N)2 .
If we play the game N times and see each train once, the mean squared error
is
MSE = (1/N) Σ_{i=1}^{N} (ai − N)²

To minimize MSE, we take the derivative with respect to a and set it to zero:

dMSE/da = (1/N) Σ_{i=1}^{N} 2i (ai − N) = 0

Solving for a yields a = 3N / (2N + 1).
However, for large values of N, the optimal value for a converges to 3/2, so
we could choose N̂ = 3i/ 2.
If we want an estimator that is unbiased, we can minimize the mean error (ME) instead:

ME = (1/N) Σ_{i=1}^{N} (ai − N)

Setting ME = 0 and solving for a yields

a = 2N / (N + 1)

which converges to 2 for large N.
So far we have generated three estimators, i, 3i/2, and 2i, that have the
properties of maximizing likelihood, minimizing squared error, and being
unbiased.
Yet another way to generate an estimator is to choose the value that makes
the population mean equal the sample mean. If we see train i, the sample
mean is just i; the train population that has the same mean is N̂ = 2i − 1.
[Figure 8.1: Locomotive problem — posterior probability for each hypothesis.]
Applying Bayes's theorem, the posterior probability of each hypothesis is

P(Hn|i) = P(i|Hn) P(Hn) / P(i)

where Hn is the hypothesis that there are n trains, and i is the evidence: we saw train i. Again, P(i|Hn) is 1/n if i ≤ n, and 0 otherwise. The normalizing constant, P(i), is just the sum of the numerators for each hypothesis.
If the prior distribution is uniform from 1 to 200, we start with 200 hypothe-
ses and compute the likelihood for each. You can download an implementa-
tion from http://thinkstats.com/locomotive.py. Figure 8.1 shows what
the result looks like.
The 90% credible interval for this posterior is [63, 189], which is still quite
wide. Seeing one train doesn’t provide strong evidence for any of the hy-
potheses (although it does rule out the hypotheses with n < i).
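The sketch below (my own code, along the lines of what locomotive.py does) computes the posterior for a uniform prior on 1 to 200 after seeing train 60, and extracts a central 90% credible interval.

def locomotive_posterior(observation, upper=200):
    hypotheses = range(1, upper + 1)
    likelihoods = [1.0 / n if observation <= n else 0.0 for n in hypotheses]
    total = sum(likelihoods)                 # the uniform prior cancels out
    return {n: lk / total for n, lk in zip(hypotheses, likelihoods)}

def credible_interval(posterior, prob=0.9):
    tail = (1 - prob) / 2
    cumulative = 0.0
    low = high = None
    for n in sorted(posterior):
        cumulative += posterior[n]
        if low is None and cumulative >= tail:
            low = n
        if high is None and cumulative >= 1 - tail:
            high = n
    return low, high

posterior = locomotive_posterior(60)
print(credible_interval(posterior))          # roughly (63, 189)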
One way to think of different estimators is that they are implicitly based
on different priors. If there is enough evidence to swamp the priors, then
all estimators tend to converge; otherwise, as in this case, there is no single
estimator that has all of the properties we might want.
Exercise 8.6 Generalize locomotive.py to handle the case where you see
more than one train. You should only have to change a few lines of code.
See if you can answer the other questions for the case where you see more
than one train. You can find a discussion of the problem and several solu-
tions at http://wikipedia.org/wiki/German_tank_problem.
8.10 Glossary
estimation: The process of inferring the parameters of a distribution from
a sample.
Chapter 9

Correlation
9.2 Covariance
Covariance is a measure of the tendency of two variables to vary together.
If we have two series, X and Y, their deviations from the mean are
dxi = xi − µ X
dyi = yi − µY
where µ X is the mean of X and µY is the mean of Y. If X and Y vary together,
their deviations tend to have the same sign.
If we multiply them together, the product is positive when the deviations
have the same sign and negative when they have the opposite sign. So
adding up the products gives a measure of the tendency to vary together.
Covariance is the mean of these products:

Cov(X, Y) = (1/n) Σ dxi dyi
where n is the length of the two series (they have to be the same length).
Covariance is useful in some computations, but it is seldom reported as a
summary statistic because it is hard to interpret. Among other problems, its
units are the product of the units of X and Y. So the covariance of weight
and height might be in units of kilogram-meters, which doesn’t mean much.
Exercise 9.1 Write a function called Cov that takes two lists and computes
their covariance. To test your function, compute the covariance of a list
with itself and confirm that Cov(X, X) = Var(X).
You can download a solution from http://thinkstats.com/correlation.
py.
9.3 Correlation
One solution to this problem is to divide the deviations by σ, which yields standard scores, and compute the product of standard scores:

pi = (xi − µX)(yi − µY) / (σX σY)

The mean of these products is Pearson's correlation:

ρ = (1/n) Σ pi
Also, the result is necessarily between −1 and +1. To see why, we can rewrite ρ by factoring out σX and σY:

ρ = Cov(X, Y) / (σX σY)

or, writing out the sums,

ρ = Σ dxi dyi / sqrt(Σ dxi² Σ dyi²)

Then the Cauchy–Schwarz inequality implies that −1 ≤ ρ ≤ 1.¹
Most correlation in the real world is not perfect, but it is still useful. For
example, if you know someone’s height, you might be able to guess their
weight. You might not get it exactly right, but your guess will be better
than if you didn’t know the height. Pearson’s correlation is a measure of
how much better.
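For reference, both statistics take only a few lines of code (my own sketch; the book's solutions are in correlation.py):

def mean(xs):
    return sum(xs) / len(xs)

def cov(xs, ys):
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def corr(xs, ys):
    sx = cov(xs, xs) ** 0.5        # standard deviation of X
    sy = cov(ys, ys) ** 0.5
    return cov(xs, ys) / (sx * sy)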
The top row shows linear relationships with a range of correlations; you can use this row to get a sense of what different values of ρ look like. The second row shows perfect correlations with a range of slopes, which demonstrates that correlation is not related to the slope of the relationship.
1 See http://wikipedia.org/wiki/Cauchy-Schwarz_inequality.
[Figure 9.2: Simple scatterplot of weight (kg) versus height (cm) for the respondents in the BRFSS.]
that the heights were rounded to the nearest inch, converted to centimeters,
and then rounded again. Some information is lost in translation.
We can’t get that information back, but we can minimize the effect on the
scatterplot by jittering the data, which means adding random noise to re-
verse the effect of rounding off. Since these measurements were rounded to
the nearest inch, they can be off by up to 0.5 inches or 1.3 cm. So I added
uniform noise in the range −1.3 to 1.3:
jitter = 1.3
heights = [h + random.uniform(-jitter, jitter) for h in heights]
Figure 9.3 shows the result. Jittering the data makes the shape of the re-
lationship clearer. In general you should only jitter data for purposes of
visualization and avoid using jittered data for analysis.
Even with jittering, this is not the best way to represent the data. There are
many overlapping points, which hides data in the dense parts of the figure
and gives disproportionate emphasis to outliers.
We can solve that with the alpha parameter, which makes the points partly
transparent:
pyplot.scatter(heights, weights, alpha=0.2)
Figure 9.4 shows the result. Overlapping data points look darker, so darkness is proportional to density. In this version of the plot we can see an apparent artifact: a horizontal line near 90 kg or 200 pounds. Since this data is based on self-reports in pounds, the most likely explanation is that some responses were rounded off (possibly down).

Using transparency works well for moderate-sized datasets, but this figure only shows the first 1000 records in the BRFSS, out of a total of 414,509.
To handle larger datasets, one option is a hexbin plot, which divides the
graph into hexagonal bins and colors each bin according to how many data
points fall in it. pyplot provides a function called hexbin:
pyplot.hexbin(heights, weights, cmap=matplotlib.cm.Blues)
Figure 9.5 shows the result with a blue colormap. An advantage of a hexbin
is that it shows the shape of the relationship well, and it is efficient for large
datasets. A drawback is that it makes the outliers invisible.
The moral of this story is that it is not easy to make a scatterplot that is not
potentially misleading. You can download the code for these figures from
http://thinkstats.com/brfss_scatter.py.
Finally, compute Spearman’s rank correlation for weight and height. Which
coefficient do you think is the best measure of the strength of the relation-
ship? You can download a solution from http://thinkstats.com/brfss_
corr.py.
Suppose we model the relationship between X and Y with the line ŷ = α + βx. Then the residual for each observation is

εi = (α + βxi) − yi

If we get the parameters α and β wrong, the residuals get bigger, so it makes intuitive sense that the parameters we want are the ones that minimize the residuals.
The usual approach is to minimize the sum of the squared residuals, Σ εi². Why squared? There are three good reasons and one bad one:
• Squaring gives more weight to large residuals, but not so much weight
that the largest residual always dominates.
• The values of α̂ and β̂ that minimize the squared residuals can be com-
puted efficiently.
That last reason made sense when computational efficiency was more im-
portant than choosing the method most appropriate to the problem at hand.
That’s no longer the case, so it is worth considering whether squared resid-
uals are the right thing to minimize.
For example, if you are using values of X to predict values of Y, guessing
too high might be better (or worse) than guessing too low. In that case you
might want to compute some cost function, cost(εi ), and minimize total cost.
However, computing a least squares fit is quick, easy and often good enough, so here's how:

1. Compute the sample means, x̄ and ȳ, the variance of X, and the covariance of X and Y.

2. The estimated slope is

β̂ = Cov(X, Y) / Var(X)

3. And the estimated intercept is

α̂ = ȳ − β̂ x̄
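Putting those steps into code (a sketch of my own, not the book's solution):

def mean(xs):
    return sum(xs) / len(xs)

def least_squares(xs, ys):
    mx, my = mean(xs), mean(ys)
    var_x = sum((x - mx) ** 2 for x in xs) / len(xs)
    cov_xy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    beta = cov_xy / var_x
    alpha = my - beta * mx
    return alpha, beta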
To evaluate whether a location is a viable site for a wind turbine, you can
set up an anemometer to measure wind speed for a period of time. But it is
hard to measure the tail of the wind speed distribution accurately because,
by definition, events in the tail don’t happen very often.
Write a function that takes a sample from a Weibull distribution and esti-
mates its parameters.
A standard way to measure the goodness of fit of a linear model is the coefficient of determination, denoted R²:

R² = 1 − Var(ε) / Var(Y)
To understand what R2 means, suppose (again) that you are trying to guess
someone’s weight. If you didn’t know anything about them, your best strat-
egy would be to guess ȳ; in that case the MSE of your guesses would be
Var(Y):
MSE = (1/n) Σ (ȳ − yi)² = Var(Y)
But if I told you their height, you would guess α̂ + β̂ xi ; in that case your
MSE would be Var(ε).
MSE = (1/n) Σ (α̂ + β̂ xi − yi)² = Var(ε)
So the term Var(ε)/Var(Y) is the ratio of mean squared error with and with-
out the explanatory variable, which is the fraction of variability left unex-
plained by the model. The complement, R2 , is the fraction of variability
explained by the model.
If a model yields R2 = 0.64, you could say that the model explains 64% of
the variability, or it might be more precise to say that it reduces the MSE of
your predictions by 64%.
In the context of a linear least squares model, it turns out that there is a
simple relationship between the coefficient of determination and Pearson’s
correlation coefficient, ρ:
R2 = ρ2
See http://wikipedia.org/wiki/Howzzat!
Exercise 9.8 The Wechsler Adult Intelligence Scale (WAIS) is meant to be a
measure of intelligence; scores are calibrated so that the mean and standard
deviation in the general population are 100 and 15.
Suppose that you wanted to predict someone’s WAIS score based on their
SAT scores. According to one study, there is a Pearson correlation of 0.72
between total SAT scores and WAIS scores.
If you applied your predictor to a large sample, what would you expect to
be the mean squared error (MSE) of your predictions?
Hint: What is the MSE if you always guess 100?
Exercise 9.9 Write a function named Residuals that takes X, Y, α̂ and β̂ and
returns a list of εi .
Write a function named CoefDetermination that takes the εi and Y and re-
turns R2 . To test your functions, confirm that R2 = ρ2 . You can download a
solution from http://thinkstats.com/correlation.py.
Exercise 9.10 Using the height and weight data from the BRFSS (one more
time), compute α̂, β̂ and R2 . If you were trying to guess someone’s weight,
how much would it help to know their height? You can download a solution
from http://thinkstats.com/brfss_corr.py.
9.8 Correlation and Causation

A correlation between two variables does not, by itself, tell you which one causes the other, or whether some third factor causes both. So what can you do to provide evidence of causation?

1. Use time. If A comes before B, then A can cause B but not the other way around (at least according to our common understanding of causation). The order of events can help us infer the direction of causation, but it does not preclude the possibility that something else causes both A and B.
2. Use randomness. If you divide a large sample into two groups at random, the two groups are, on average, identical with respect to every variable, known or unknown. This works even if you don't know what the relevant variables are, but it works even better if you do, because you can check that the groups are identical.
These ideas are the motivation for the randomized controlled trial, in
which subjects are assigned randomly to two (or more) groups: a treatment
group that receives some kind of intervention, like a new medicine, and
a control group that receives no intervention, or another treatment whose
effects are known.
A randomized controlled trial is the most reliable way to demonstrate
a causal relationship, and the foundation of science-based medicine (see
http://wikipedia.org/wiki/Randomized_controlled_trial).
Unfortunately, controlled trials are only possible in the laboratory sciences,
medicine, and a few other disciplines. In the social sciences, controlled ex-
periments are rare, usually because they are impossible or unethical.
One alternative is to look for a natural experiment, where different “treat-
ments” are applied to groups that are otherwise similar. One danger of
natural experiments is that the groups might differ in ways that are not ap-
parent. You can read more about this topic at http://wikipedia.org/wiki/
Natural_experiment.
In some cases it is possible to infer causal relationships using regression
analysis. A linear least squares fit is a simple form of regression that ex-
plains a dependent variable using one explanatory variable. There are sim-
ilar techniques that work with arbitrary numbers of independent variables.
I won’t cover those techniques here, but there are also simple ways to con-
trol for spurious relationships. For example, in the NSFG, we saw that first
babies tend to be lighter than others (see Section 3.6). But birth weight is
also correlated with the mother’s age, and mothers of first babies tend to be
younger than mothers of other babies.
So it may be that first babies are lighter because their mothers are younger.
To control for the effect of age, we could divide the mothers into age groups
and compare birth weights for first babies and others in each age group.
If the difference between first babies and others is the same in each age
group as it was in the pooled data, we conclude that the difference is not
related to age. If there is no difference, we conclude that the effect is entirely
due to age. Or, if the difference is smaller, we can quantify how much of the
effect is due to age.
Exercise 9.11 The NSFG data includes a variable named agepreg that
records the age of the mother at the time of birth. Make a scatterplot of
mother’s age and baby’s weight for each live birth. Can you see a relation-
ship?
Compute a linear least-squares fit for these variables. What are the units
of the estimated parameters α̂ and β̂? How would you summarize these
results in a sentence or two?
Compute the average age for mothers of first babies and the average age
of other mothers. Based on the difference in ages between the groups, how
much difference do you expect in the mean birth weights? What fraction of
the actual difference in birth weights is explained by the difference in ages?
9.9 Glossary
correlation: a description of the dependence between variables.
least squares fit: A model of a dataset that minimizes the sum of squares of
the residuals.