Think Stats: Exploratory Data Analysis, Second Edition, by Allen B. Downey
Author(s): Allen B. Downey
ISBN(s): 9781491907375, 1491907371
Edition: Second edition
File Details: PDF, 10.84 MB
Year: 2015
Language: english
Think Stats, Second Edition
Exploratory Data Analysis
Allen B. Downey
If you know how to program, you have the skills to turn data into knowledge
using tools of probability and statistics. This concise introduction shows
you how to perform statistical analysis computationally, rather than
mathematically, with programs written in Python.
By working with a single case study throughout this thoroughly revised
book, you’ll learn the entire process of exploratory data analysis, from
collecting data and generating statistics to identifying patterns and testing
hypotheses. You’ll explore distributions, rules of probability, visualization,
and many other tools and concepts.
New chapters on regression, time series analysis, survival analysis, and
analytic methods will enrich your discoveries.
• Develop an understanding of probability and statistics by
writing and testing code
• Run experiments to test statistical behavior, such as generating
samples from several distributions
• Use simulations to understand concepts that are hard to grasp
mathematically
• Import data from most sources with Python, rather than rely
on data that’s cleaned and formatted for statistics tools
• Use statistical inference to answer questions about real-world
data
“This is the most comprehensive introduction to the Python data
analysis stack on the market. Practitioners who want to brush up
on their technical skills by learning about the tools available for a
modern programming language will also benefit from this book.
This is an excellent modern statistics textbook.”
Skipper Seabold, author of StatsModels
SECOND EDITION
Think Stats
Allen B. Downey
Think Stats, Second Edition
by Allen B. Downey
Copyright © 2015 Allen B. Downey. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are
also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/
institutional sales department: 800-998-9938 or [email protected].
Editors: Mike Loukides and Meghan Blanchette Indexer: Allen B. Downey
Production Editor: Melanie Yarbrough Cover Designer: Karen Montgomery
Copyeditor: Marta Justak Interior Designer: David Futato
Proofreader: Amanda Kersey Illustrator: Rebecca Demarest
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Think Stats, second edition, the cover
image of an archerfish, and related trade dress are trademarks of O’Reilly Media, Inc.
Many of the designations used by manufacturers and sellers to distinguish their products are claimed as
trademarks. Where those designations appear in this book, and O’Reilly Media, Inc. was aware of a trademark
claim, the designations have been printed in caps or initial caps.
While the publisher and the author have used good faith efforts to ensure that the information and instructions
contained in this work are accurate, the publisher and the author disclaim all responsibility for errors
or omissions, including without limitation responsibility for damages resulting from the use of or reliance
on this work. Use of the information and instructions contained in this work is at your own risk. If any code
samples or other technology this work contains or describes is subject to open source licenses or the intellectual
property rights of others, it is your responsibility to ensure that your use thereof complies with such
licenses and/or rights.
Think Stats is available under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported
License. The author maintains an online version at http://thinkstats2.com.
ISBN: 978-1-491-90733-7
[LSI]
Table of Contents
Preface

2. Distributions
  Representing Histograms
  Plotting Histograms
  NSFG Variables
  Outliers
  First Babies
  Summarizing Distributions
  Variance
  Effect Size
  Reporting Results
  Exercises
  Glossary

  Plotting PMFs
  Other Visualizations
  The Class Size Paradox
  DataFrame Indexing
  Exercises
  Glossary

5. Modeling Distributions
  The Exponential Distribution
  The Normal Distribution
  Normal Probability Plot
  The lognormal Distribution
  The Pareto Distribution
  Generating Random Numbers
  Why Model?
  Exercises
  Glossary

7. Relationships Between Variables
  Scatter Plots
  Characterizing Relationships
  Correlation
  Covariance
  Pearson’s Correlation
  Nonlinear Relationships
  Spearman’s Rank Correlation
  Correlation and Causation
  Exercises
  Glossary

8. Estimation
  The Estimation Game
  Guess the Variance
  Sampling Distributions
  Sampling Bias
  Exponential Distributions
  Exercises
  Glossary

  Testing a Linear Model
  Weighted Resampling
  Exercises
  Glossary

  Exercises
  Glossary

Index
This book is an introduction to the practical tools of exploratory data analysis. The
organization of the book follows the process I use when I start working with a dataset:
• Importing and cleaning: Whatever format the data is in, it usually takes some time
and effort to read the data, clean and transform it, and check that everything made
it through the translation process intact.
• Single variable explorations: I usually start by examining one variable at a time,
finding out what the variables mean, looking at distributions of the values, and
choosing appropriate summary statistics.
• Pair-wise explorations: To identify possible relationships between variables, I look
at tables and scatter plots, and compute correlations and linear fits.
• Multivariate analysis: If there are apparent relationships between variables, I use
multiple regression to add control variables and investigate more complex
relationships.
• Estimation and hypothesis testing: When reporting statistical results, it is important
to answer three questions: How big is the effect? How much variability should we
expect if we run the same measurement again? Is it possible that the apparent effect
is due to chance?
• Visualization: During exploration, visualization is an important tool for finding
possible relationships and effects. Then if an apparent effect holds up to scrutiny,
visualization is an effective way to communicate results.
This book takes a computational approach, which has several advantages over
mathematical approaches:
• I present most ideas using Python code, rather than mathematical notation. In
general, Python code is more readable; also, because it is executable, readers can
download it, run it, and modify it.
• Each chapter includes exercises readers can do to develop and solidify their
learning. When you write programs, you express your understanding in code; while you
are debugging the program, you are also correcting your understanding.
• Some exercises involve experiments to test statistical behavior. For example, you
can explore the Central Limit Theorem (CLT) by generating random samples and
computing their sums. The resulting visualizations demonstrate why the CLT works
and when it doesn’t.
• Some ideas that are hard to grasp mathematically are easy to understand by
simulation. For example, we approximate p-values by running random simulations,
which reinforces the meaning of the p-value.
• Because the book is based on a general-purpose programming language (Python),
readers can import data from almost any source. They are not limited to datasets
that have been cleaned and formatted for a particular statistics tool.
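As a rough sketch of the Central Limit Theorem experiment mentioned in the list above, the following uses only the Python standard library (Python 3.4 or later is assumed, for the statistics module). It is an illustration, not code from the book's repository: the names sample_sums and skewness, the choice of an exponential distribution, and the use of skewness in place of a plot are all assumptions made for this example.

```python
import random
import statistics


def sample_sums(n, iters=2000, seed=17):
    """Return `iters` sums, each the total of `n` exponential draws."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [sum(rng.expovariate(1.0) for _ in range(n)) for _ in range(iters)]


def skewness(xs):
    """Third standardized moment; close to 0 for a symmetric distribution."""
    mean = statistics.mean(xs)
    std = statistics.pstdev(xs)
    return sum(((x - mean) / std) ** 3 for x in xs) / len(xs)


# Hypothetical experiment: how asymmetric are the sums as n grows?
for n in (1, 10, 100):
    sums = sample_sums(n)
    print(n, round(statistics.mean(sums), 2), round(skewness(sums), 2))
```

A single exponential draw is strongly skewed to the right. If the CLT applies, the skewness of the sums should shrink toward zero as n grows, because the distribution of the sums becomes more symmetric and closer to normal.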
• The National Survey of Family Growth (NSFG), conducted by the U.S. Centers for
Disease Control and Prevention (CDC) to gather “information on family life, marriage
and divorce, pregnancy, infertility, use of contraception, and men’s and women’s
health.” (See http://cdc.gov/nchs/nsfg.htm.)
• The Behavioral Risk Factor Surveillance System (BRFSS), conducted by the National
Center for Chronic Disease Prevention and Health Promotion to “track
health conditions and risk behaviors in the United States.” (See http://cdc.gov/BRFSS/.)
Other examples use data from the IRS, the U.S. Census, and the Boston Marathon.
This second edition of Think Stats includes the chapters from the first edition, many of
them substantially revised, and new chapters on regression, time series analysis, survival
analysis, and analytic methods. The previous edition did not use pandas, SciPy, or
StatsModels, so all of that material is new.
I did not do that. In fact, I used almost no printed material while I was writing this book,
for several reasons:
• My goal was to explore a new approach to this material, so I didn’t want much
exposure to existing approaches.
• Since I am making this book available under a free license, I wanted to make sure
that no part of it was encumbered by copyright restrictions.
• Many readers of my books don’t have access to libraries of printed material, so I
tried to make references to resources that are freely available on the Internet.
• Some proponents of old media think that the exclusive use of electronic resources
is lazy and unreliable. They might be right about the first part, but I think they are
wrong about the second, so I wanted to test my theory.
The resource I used more than any other is Wikipedia. In general, the articles I read on
statistical topics were very good (although I made a few small changes along the way).
I include references to Wikipedia pages throughout the book and I encourage you to
follow those links; in many cases, the Wikipedia page picks up where my description
leaves off. The vocabulary and notation in this book are generally consistent with
Wikipedia, unless I had a good reason to deviate. Other resources I found useful were
Wolfram MathWorld and the Reddit statistics forum.
• You can create a copy of my repository on GitHub by pressing the Fork button. If
you don’t already have a GitHub account, you’ll need to create one. After forking,
you’ll have your own repository on GitHub that you can use to keep track of code
you write while working on this book. Then you can clone the repo, which means
that you make a copy of the files on your computer.
• Or you could clone my repository. You don’t need a GitHub account to do this, but
you won’t be able to write your changes back to GitHub.
• If you don’t want to use Git at all, you can download the files in a Zip file using the
button in the lower-right corner of the GitHub page.
All of the code is written to work in both Python 2 and Python 3 with no translation.
I developed this book using Anaconda from Continuum Analytics, which is a free
Python distribution that includes all the packages you’ll need to run the code (and lots
more). I found Anaconda easy to install. By default it does a user-level installation, not
system-level, so you don’t need administrative privileges. And it supports both Python
2 and Python 3. You can download Anaconda from Continuum.
If you don’t want to use Anaconda, you will need the following packages:
Although these are commonly used packages, they are not included with all Python
installations, and they can be hard to install in some environments. If you have trouble
installing them, I strongly recommend using Anaconda or one of the other Python
distributions that include these packages.
After you clone the repository or unzip the zip file, you should have a file called
ThinkStats2/code/nsfg.py. If you run it, it should read a data file, run some tests, and
print a message like, “All tests passed.” If you get import errors, it probably means there
are packages you need to install.
Most exercises use Python scripts, but some also use the IPython notebook. If you have
not used IPython notebook before, I suggest you start with the documentation.
I wrote this book assuming that the reader is familiar with core Python, including object-
oriented features, but not pandas, NumPy, and SciPy. If you are already familiar with
these modules, you can skip a few sections.
I assume that the reader knows basic mathematics, including logarithms, for example,
and summations. I refer to calculus concepts in a few places, but you don’t have to do
any calculus.
If you have never studied statistics, I think this book is a good place to start. And if you
have taken a traditional statistics class, I hope this book will help repair the damage.
—
Allen B. Downey is a Professor of Computer Science at the Franklin W. Olin College of
Engineering in Needham, MA.
Contributor List
If you have a suggestion or correction, please send email to downey@allendowney.com.
If I make a change based on your feedback, I will add you to the contributor
list (unless you ask to be omitted).
If you include at least part of the sentence the error appears in, that makes it easy for
me to search. Page and section numbers are fine, too, but not quite as easy to work with.
Thanks!
• Lisa Downey and June Downey read an early draft and made many corrections and
suggestions.
• Steven Zhang found several errors.
• Andy Pethan and Molly Farison helped debug some of the solutions, and Molly
spotted several typos.
• Andrew Heine found an error in my error function.
• Dr. Nikolas Akerblom knows how big a Hyracotherium is.
• Alex Morrow clarified one of the code examples.
• Jonathan Street caught an error in the nick of time.
• Gábor Lipták found a typo in the book and the relay race solution.
• Many thanks to Kevin Smith and Tim Arnold for their work on plasTeX, which I
used to convert this book to DocBook.
• George Caplan sent several suggestions for improving clarity.
• Julian Ceipek found an error and a number of typos.
• Stijn Debrouwere, Leo Marihart III, Jonathan Hammler, and Kent Johnson found
errors in the first print edition.
• Dan Kearney found a typo.
• Jeff Pickhardt found a broken link and a typo.
• Jörg Beyer found typos in the book and made many corrections in the docstrings
of the accompanying code.
• Tommie Gannert sent a patch file with a number of corrections.
• Alexander Gryzlov suggested a clarification in an exercise.
• Martin Veillette reported an error in one of the formulas for Pearson’s correlation.
• Christoph Lendenmann submitted several errata.
• Haitao Ma noticed a typo and sent me a note.
• Michael Kearney sent me many excellent suggestions.
• Alex Birch made a number of helpful suggestions.
• Lindsey Vanderlyn, Griffin Tschurwald, and Ben Small read an early version of this
book and found many errors.
• John Roth, Carol Willing, and Carol Novitsky performed technical reviews of the
book. They found many errors and made many helpful suggestions.
• Rohit Deshpande found a typesetting error.
• David Palmer sent many helpful suggestions and corrections.
• Erik Kulyk found many typos.
How to Contact Us
Please address comments and questions concerning this book to the publisher:
We have a web page for this book, where we list errata, examples, and any additional
information. You can access this page at http://bit.ly/think_stats_2e.
To comment or ask technical questions about this book, send email to bookques
[email protected].
For more information about our books, courses, conferences, and news, see our website
at http://www.oreilly.com.
Find us on Facebook: http://facebook.com/oreilly
Follow us on Twitter: http://twitter.com/oreillymedia
Watch us on YouTube: http://www.youtube.com/oreillymedia
CHAPTER 1
Exploratory Data Analysis
The thesis of this book is that data combined with practical methods can answer
questions and guide decisions under uncertainty.
As an example, I present a case study motivated by a question I heard when my wife
and I were expecting our first child: do first babies tend to arrive late?
If you Google this question, you will find plenty of discussion. Some people claim it’s
true, others say it’s a myth, and some people say it’s the other way around: first babies
come early.
In many of these discussions, people provide data to support their claims. I found many
examples like these:
“My two friends that have given birth recently to their first babies, BOTH went almost 2
weeks overdue before going into labour or being induced.”
“My first one came 2 weeks late and now I think the second one is going to come out two
weeks early!!”
“I don’t think that can be true because my sister was my mother’s first and she was early,
as with many of my cousins.”
Reports like these are called anecdotal evidence because they are based on data that is
unpublished and usually personal. In casual conversation, there is nothing wrong with
anecdotes, so I don’t mean to pick on the people I quoted.
But we might want evidence that is more persuasive and an answer that is more reliable.
By those standards, anecdotal evidence usually fails, because:
Small number of observations
If pregnancy length is longer for first babies, the difference is probably small
compared to natural variation. In that case, we might have to compare a large number
of pregnancies to be sure that a difference exists.
Selection bias
People who join a discussion of this question might be interested because their first
babies were late. In that case the process of selecting data would bias the results.
Confirmation bias
People who believe the claim might be more likely to contribute examples that
confirm it. People who doubt the claim are more likely to cite counterexamples.
Inaccuracy
Anecdotes are often personal stories, and often misremembered, misrepresented,
repeated inaccurately, etc.
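To see why a small number of observations is a problem, here is a minimal simulation sketch. All of the numbers in it are invented for illustration and are not taken from any survey: it assumes a true difference of 0.3 weeks between two groups, a standard deviation of 2.7 weeks, and normally distributed pregnancy lengths. The function name wrong_sign_rate is also hypothetical.

```python
import random
import statistics


def wrong_sign_rate(n, diff=0.3, std=2.7, iters=400, seed=1):
    """Fraction of simulated comparisons in which the group with the
    larger true mean ends up with the smaller sample mean."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    wrong = 0
    for _ in range(iters):
        # Assumed model: both groups are normal; only the mean differs.
        firsts = [rng.gauss(39.0 + diff, std) for _ in range(n)]
        others = [rng.gauss(39.0, std) for _ in range(n)]
        if statistics.mean(firsts) < statistics.mean(others):
            wrong += 1
    return wrong / iters


# Compare a handful of anecdotes with larger samples.
for n in (5, 50, 1000):
    print(n, wrong_sign_rate(n))
```

Under these assumptions, a comparison based on a few pregnancies per group should point in the wrong direction almost as often as a coin flip, while a comparison based on a thousand per group should rarely do so.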
So how can we do better?
A Statistical Approach
To address the limitations of anecdotes, we will use the tools of statistics, which include:
Data collection
We will use data from a large national survey that was designed explicitly with the
goal of generating statistically valid inferences about the U.S. population.
Descriptive statistics
We will generate statistics that summarize the data concisely, and evaluate different
ways to visualize data.
Exploratory data analysis
We will look for patterns, differences, and other features that address the questions
we are interested in. At the same time we will check for inconsistencies and identify
limitations.
Estimation
We will use data from a sample to estimate characteristics of the general population.
Hypothesis testing
Where we see apparent effects, like a difference between two groups, we will evaluate
whether the effect might have happened by chance.
By performing these steps with care to avoid pitfalls, we can reach conclusions that are
more justifiable and more likely to be correct.
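One common way to ask whether an apparent difference between two groups might have happened by chance is a permutation test. The sketch below is one possible version, written with the standard library only; it is not presented as the method used later in the book. The function name permutation_p_value and the two small data lists are hypothetical and exist only to make the example runnable.

```python
import random
import statistics


def permutation_p_value(group1, group2, iters=1000, seed=3):
    """Approximate a two-sided p-value for a difference in means by
    shuffling the pooled data and splitting it into two groups again."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    observed = abs(statistics.mean(group1) - statistics.mean(group2))
    pooled = list(group1) + list(group2)
    n = len(group1)
    count = 0
    for _ in range(iters):
        rng.shuffle(pooled)
        diff = abs(statistics.mean(pooled[:n]) - statistics.mean(pooled[n:]))
        # Count shuffles that produce a difference at least as big as observed.
        if diff >= observed:
            count += 1
    return count / iters


# Invented pregnancy lengths in weeks, for illustration only.
firsts = [39, 40, 41, 38, 40, 39, 41, 40]
others = [39, 38, 40, 39, 38, 39, 40, 38]
print(permutation_p_value(firsts, others))
```

The returned value estimates how often a difference at least as large as the observed one appears when group labels are assigned at random. A small value suggests the effect is unlikely to be due to chance alone; a large value means chance is a plausible explanation.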