Data 8 Textbook
Computational and Inferential Thinking: The Foundations of Data Science
Contents
1. Data Science
2. Causality and Experiments
3. Programming in Python
4. Data Types
5. Sequences
6. Tables
7. Visualization
8. Functions and Tables
9. Randomness
10. Sampling and Empirical Distributions
11. Testing Hypotheses
12. Comparing Two Samples
13. Estimation
14. Why the Mean Matters
15. Prediction
16. Inference for Regression
17. Classification
18. Updating Predictions
2nd Edition by Ani Adhikari, John DeNero, David Wagner.
This text was originally developed for the UC Berkeley course Data 8: Foundations of Data Science.
You can view this text online or view the source.
The contents of this book are licensed for free consumption under the following license: Creative Commons Attribution‑NonCommercial‑
NoDerivatives 4.0 International (CC BY‑NC‑ND 4.0).
huck_finn_url = 'https://www.inferentialthinking.com/data/huck_finn.txt'
huck_finn_text = read_url(huck_finn_url)
huck_finn_chapters = huck_finn_text.split('CHAPTER ')[44:]  # skip the preamble and table-of-contents entries
little_women_url = 'https://www.inferentialthinking.com/data/little_women.txt'
little_women_text = read_url(little_women_url)
little_women_chapters = little_women_text.split('CHAPTER ')[1:]
While a computer cannot understand the text of a book, it can provide us with some insight into the structure of the text. The name
huck_finn_chapters is currently bound to a list of all the chapters in the book. We can place them into a table to see how each chapter begins.
Table().with_column('Chapters', huck_finn_chapters)
Chapters
I. YOU don't know about me without you have read a book ...
II. WE went tiptoeing along a path amongst the trees bac ...
III. WELL, I got a good going‑over in the morning from o ...
IV. WELL, three or four months run along, and it was wel ...
V. I had shut the door to. Then I turned around and ther ...
VI. WELL, pretty soon the old man was up and around agai ...
VII. "GIT up! What you 'bout?" I opened my eyes and look ...
VIII. THE sun was up so high when I waked that I judged ...
IX. I wanted to go and look at a place right about the m ...
X. AFTER breakfast I wanted to talk about the dead man a ...
... (33 rows omitted)
Each chapter begins with a chapter number in Roman numerals, followed by the first sentence of the chapter. Project Gutenberg has printed the first
word of each chapter in upper case.
counts = Table().with_columns([
'Jim', np.char.count(huck_finn_chapters, 'Jim'),
'Tom', np.char.count(huck_finn_chapters, 'Tom'),
'Huck', np.char.count(huck_finn_chapters, 'Huck')
])
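The book follows this with a plot of the cumulative counts. The figure is not reproduced in this excerpt; the sketch below shows one way such a plot could be built from counts, assuming the datascience Table.plot method and numpy (this construction is an illustration, not necessarily the book's exact code).
# Cumulative counts: for each chapter, the number of mentions up to and including it
cum_counts = Table().with_columns(
    'Chapter', np.arange(1, len(huck_finn_chapters) + 1),
    'Jim', np.cumsum(counts.column('Jim')),
    'Tom', np.cumsum(counts.column('Tom')),
    'Huck', np.cumsum(counts.column('Huck'))
)
cum_counts.plot('Chapter')  # one line per character, chapter number on the horizontal axis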
In the plot above, the horizontal axis shows chapter numbers and the vertical axis shows how many times each character has been mentioned up to
and including that chapter.
You can see that Jim is a central character by the large number of times his name appears. Notice how Tom is hardly mentioned for much of the book
until he arrives and joins Huck and Jim, after Chapter 30. His curve and Jim’s rise sharply at that point, as the action involving both of them
intensifies. As for Huck, his name hardly appears at all, because he is the narrator.
Little Women is a story of four sisters growing up together during the Civil War. In this book, chapter numbers are spelled out and chapter titles are
written in all capital letters.
# The chapters of Little Women, in a table
Table().with_column('Chapters', little_women_chapters)
Chapters
ONE PLAYING PILGRIMS "Christmas won't be Christmas witho ...
TWO A MERRY CHRISTMAS Jo was the first to wake in the gr ...
THREE THE LAURENCE BOY "Jo! Jo! Where are you?" cried Me ...
FOUR BURDENS "Oh, dear, how hard it does seem to take up ...
FIVE BEING NEIGHBORLY "What in the world are you going t ...
SIX BETH FINDS THE PALACE BEAUTIFUL The big house did pr ...
SEVEN AMY'S VALLEY OF HUMILIATION "That boy is a perfect ...
EIGHT JO MEETS APOLLYON "Girls, where are you going?" as ...
NINE MEG GOES TO VANITY FAIR "I do think it was the most ...
TEN THE P.C. AND P.O. As spring came on, a new set of am ...
... (37 rows omitted)
We can track the mentions of main characters to learn about the plot of this book as well. The protagonist Jo interacts with her sisters Meg, Beth, and
Amy regularly, up until Chapter 27 when she moves to New York alone.
# Counts of names in the chapters of Little Women
counts = Table().with_columns([
'Amy', np.char.count(little_women_chapters, 'Amy'),
'Beth', np.char.count(little_women_chapters, 'Beth'),
'Jo', np.char.count(little_women_chapters, 'Jo'),
'Meg', np.char.count(little_women_chapters, 'Meg'),
'Laurie', np.char.count(little_women_chapters, 'Laurie'),
])
chars_periods_huck_finn = Table().with_columns([
'Huck Finn Chapter Length', [len(s) for s in huck_finn_chapters],
'Number of Periods', np.char.count(huck_finn_chapters, '.')
])
chars_periods_little_women = Table().with_columns([
'Little Women Chapter Length', [len(s) for s in little_women_chapters],
'Number of Periods', np.char.count(little_women_chapters, '.')
])
Here are the data for Huckleberry Finn. Each row of the table corresponds to one chapter of the novel and displays the number of characters (that is, letters, spaces, and other symbols) as well as the number of periods in the chapter. Not surprisingly, chapters with fewer characters also tend to have fewer periods: the shorter the
chapter, the fewer sentences there tend to be, and vice versa. The relation is not entirely predictable, however, as sentences are of varying lengths
and can involve other punctuation such as question marks.
chars_periods_huck_finn
Huck Finn Chapter Length Number of Periods
7026 66
11982 117
8529 72
6799 84
8166 91
14550 125
13218 127
22208 249
8081 71
7036 70
... (33 rows omitted)
Here are the corresponding data for Little Women.
chars_periods_little_women
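One way to look at the relationship between chapter length and number of periods visually (an illustration, not part of this excerpt) is to draw a scatter plot for each book with the Table.scatter method:
# Chapter length against number of periods, one plot per book
chars_periods_huck_finn.scatter('Number of Periods', 'Huck Finn Chapter Length')
chars_periods_little_women.scatter('Number of Periods', 'Little Women Chapter Length')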
2.1. Observation and Visualization: John Snow and the Broad Street Pump
One of the most powerful examples of astute observation eventually leading to the establishment of causality dates back more than 150 years. To get
your mind into the right timeframe, try to imagine London in the 1850’s. It was the world’s wealthiest city but many of its people were desperately
poor. Charles Dickens, then at the height of his fame, was writing about their plight. Disease was rife in the poorer parts of the city, and cholera was
among the most feared. It was not yet known that germs cause disease; the leading theory was that “miasmas” were the main culprit. Miasmas
manifested themselves as bad smells, and were thought to be invisible poisonous particles arising out of decaying matter. Parts of London did smell
very bad, especially in hot weather. To protect themselves against infection, those who could afford to held sweet‑smelling things to their noses.
For several years, a doctor by the name of John Snow had been following the devastating waves of cholera that hit England from time to time. The
disease arrived suddenly and was almost immediately deadly: people died within a day or two of contracting it, hundreds could die in a week, and the
total death toll in a single wave could reach tens of thousands. Snow was skeptical of the miasma theory. He had noticed that while entire households
were wiped out by cholera, the people in neighboring houses sometimes remained completely unaffected. As they were breathing the same air—and
miasmas—as their neighbors, there was no compelling association between bad smells and the incidence of cholera.
Snow had also noticed that the onset of the disease almost always involved vomiting and diarrhea. He therefore believed that the infection was
carried by something people ate or drank, not by the air that they breathed. His prime suspect was water contaminated by sewage.
At the end of August 1854, cholera struck in the overcrowded Soho district of London. As the deaths mounted, Snow recorded them diligently, using
a method that went on to become standard in the study of how diseases spread: he drew a map. On a street map of the district, he recorded the
location of each death.
Here is Snow’s original map. Each black bar represents one death. When there are multiple deaths at the same address, the bars corresponding to
those deaths are stacked on top of each other. The black discs mark the locations of water pumps. The map displays a striking revelation—the deaths
are roughly clustered around the Broad Street pump.
Snow studied his map carefully and investigated the apparent anomalies. All of them implicated the Broad Street pump. For example:
There were deaths in houses that were nearer the Rupert Street pump than the Broad Street pump. Though the Rupert Street pump was closer
as the crow flies, it was less convenient to get to because of dead ends and the layout of the streets. The residents in those houses used the
Broad Street pump instead.
There were no deaths in two blocks just east of the pump. That was the location of the Lion Brewery, where the workers drank what they
brewed. If they wanted water, the brewery had its own well.
There were scattered deaths in houses several blocks away from the Broad Street pump. Those were children who drank from the Broad Street
pump on their way to school. The pump’s water was known to be cool and refreshing.
The final piece of evidence in support of Snow’s theory was provided by two isolated deaths in the leafy and genteel Hampstead area, quite far from
Soho. Snow was puzzled by these until he learned that the deceased were Mrs. Susannah Eley, who had once lived in Broad Street, and her niece.
Mrs. Eley had water from the Broad Street pump delivered to her in Hampstead every day. She liked its taste.
Later it was discovered that a cesspit that was just a few feet away from the well of the Broad Street pump had been leaking into the well. Thus the
pump’s water was contaminated by sewage from the houses of cholera victims.
Snow used his map to convince local authorities to remove the handle of the Broad Street pump. Though the cholera epidemic was already on the
wane when he did so, it is possible that the disabling of the pump prevented many deaths from future waves of the disease.
The removal of the Broad Street pump handle has become the stuff of legend. At the Centers for Disease Control (CDC) in Atlanta, when scientists
look for simple answers to questions about epidemics, they sometimes ask each other, “Where is the handle to this pump?”
Snow’s map is one of the earliest and most powerful uses of data visualization. Disease maps of various kinds are now a standard tool for tracking
epidemics.
Towards Causality
Though the map gave Snow a strong indication that the cleanliness of the water supply was the key to controlling cholera, he was still a long way
from a convincing scientific argument that contaminated water was causing the spread of the disease. To make a more compelling case, he had to
use the method of comparison.
Scientists use comparison to identify an association between a treatment and an outcome. They compare the outcomes of a group of individuals who
got the treatment (the treatment group) to the outcomes of a group who did not (the control group). For example, researchers today might compare
the average murder rate in states that have the death penalty with the average murder rate in states that don’t.
If the results are different, that is evidence for an association. To determine causation, however, even more care is needed.
In a later, larger study, Snow compared households served by two water companies: the Southwark and Vauxhall company (S&V), which drew its water from a stretch of the Thames contaminated by London's sewage, and the Lambeth company, which had moved its intake point upstream of the sewage discharge. Snow noticed that there was no systematic difference between the people who were supplied by S&V and those supplied by Lambeth. "Each
company supplies both rich and poor, both large houses and small; there is no difference either in the condition or occupation of the persons
receiving the water of the different Companies … there is no difference whatever in the houses or the people receiving the supply of the two Water
Companies, or in any of the physical conditions with which they are surrounded …”
The only difference was in the water supply, “one group being supplied with water containing the sewage of London, and amongst it, whatever might
have come from the cholera patients, the other group having water quite free from impurity.”
Confident that he would be able to arrive at a clear conclusion, Snow summarized his data in the table below.
Supply Area Number of houses cholera deaths deaths per 10,000 houses
S&V 40,046 1,263 315
Lambeth 26,107 98 37
Rest of London 256,423 1,422 59
The numbers pointed accusingly at S&V. The death rate from cholera in the S&V houses was almost ten times the rate in the houses supplied by
Lambeth.
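As a quick arithmetic check (an illustration, not from the text), the S&V rate in the table follows directly from the counts in its row:
# Deaths per 10,000 houses for S&V: deaths divided by houses, scaled to 10,000
1263 / 40046 * 10000   # approximately 315, matching the table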
2.4. Randomization
An excellent way to avoid confounding is to assign individuals to the treatment and control groups at random, and then administer the treatment to
those who were assigned to the treatment group. Randomization keeps the two groups similar apart from the treatment.
If you are able to randomize individuals into the treatment and control groups, you are running a randomized controlled experiment, also known as a
randomized controlled trial (RCT). Sometimes, people’s responses in an experiment are influenced by their knowing which group they are in. So you
might want to run a blind experiment in which individuals do not know whether they are in the treatment group or the control group. To make this
work, you will have to give the control group a placebo, which is something that looks exactly like the treatment but in fact has no effect.
Randomized controlled experiments have long been a gold standard in the medical field, for example in establishing whether a new drug works. They
are also becoming more commonly used in other fields such as economics.
Example: Welfare subsidies in Mexico. In Mexican villages in the 1990’s, children in poor families were often not enrolled in school. One of the
reasons was that the older children could go to work and thus help support the family. Santiago Levy, a minister in the Mexican Ministry of Finance, set
out to investigate whether welfare programs could be used to increase school enrollment and improve health conditions. He conducted an RCT on a
set of villages, selecting some of them at random to receive a new welfare program called PROGRESA. The program gave money to poor families if
their children went to school regularly and the family used preventive health care. More money was given if the children were in secondary school
than in primary school, to compensate for the children’s lost wages, and more money was given for girls attending school than for boys. The
remaining villages did not get this treatment, and formed the control group. Because of the randomization, there were no confounding factors and it
was possible to establish that PROGRESA increased school enrollment. For boys, the enrollment increased from 73% in the control group to 77% in
the PROGRESA group. For girls, the increase was even greater, from 67% in the control group to almost 75% in the PROGRESA group. Due to the
success of this experiment, the Mexican government supported the program under the new name OPORTUNIDADES, as an investment in a healthy
and well educated population.
Benefits of Randomization
In the terminology that we have developed, John Snow conducted an observational study, not a randomized experiment. But he called his study a
“grand experiment” because, as he wrote, “No fewer than three hundred thousand people … were divided into two groups without their choice, and in
most cases, without their knowledge …”
Studies such as Snow’s are sometimes called “natural experiments.” However, true randomization does not simply mean that the treatment and
control groups are selected “without their choice.” Randomization has to be carried out very carefully, following the laws of probability.
The method of randomization can be as simple as tossing a coin. It may also be quite a bit more complex. But every method of randomization
consists of a sequence of carefully defined steps that allow chances to be specified mathematically. This has two important consequences.
1. It allows us to account—mathematically—for the possibility that randomization produces treatment and control groups that are quite different
from each other.
2. It allows us to make precise mathematical statements about differences between the treatment and control groups. This in turn helps us make
justifiable conclusions about whether the treatment has any effect.
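As a concrete illustration of how simple randomization can be, here is a minimal sketch (not from the text) that assigns eight hypothetical individuals to treatment or control by independent fair coin tosses, using numpy:
import numpy as np

np.random.seed(0)  # fixed seed so that this illustration is reproducible
individuals = ['person_' + str(i) for i in range(8)]
# One fair coin toss per person: heads means treatment, tails means control
assignments = np.random.choice(['treatment', 'control'], size=len(individuals))
for person, group in zip(individuals, assignments):
    print(person, 'assigned to', group)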
What if you can’t randomize?
In some situations it might not be possible to carry out a randomized controlled experiment, even when the aim is to investigate causality. For
example, suppose you want to study the effects of alcohol consumption during pregnancy, and you randomly assign some pregnant women to your
“alcohol” group. You should not expect cooperation from them if you present them with a drink. In such situations you will almost invariably be
conducting an observational study, not an experiment. Be alert for confounding factors.
In this course, you will learn how to conduct and analyze your own randomized experiments. That will involve more detail than has been presented in
this chapter. For now, just focus on the main idea: to try to establish causality, run a randomized controlled experiment if possible. If you are
conducting an observational study, you might be able to establish association but it will be harder to establish causation. Be extremely careful about
confounding factors before making conclusions about causality based on an observational study.
3. Programming in Python
Programming can dramatically improve our ability to collect and analyze information about the world, which in turn can lead to discoveries through
the kind of careful reasoning demonstrated in the previous section. In data science, the purpose of writing a program is to instruct a computer to
carry out the steps of an analysis. Computers cannot study the world on their own. People must describe precisely what steps the computer should
take in order to collect and analyze data, and those steps are expressed through programs.
3.1. Expressions
Programming languages are much simpler than human languages. Nonetheless, there are some rules of grammar to learn in any language, and that is
where we will begin. In this text, we will use the Python programming language. Learning the grammar rules is essential, and the same rules used in
the most basic programs are also central to more sophisticated programs.
Programs are made up of expressions, which describe to the computer how to combine pieces of data. For example, a multiplication expression
consists of a * symbol between two numerical expressions. Expressions, such as 3 * 4, are evaluated by the computer. The value (the result of
evaluation) of the last expression in each cell, 12 in this case, is displayed below the cell.
3 * 4
12
The grammar rules of a programming language are rigid. In Python, the * symbol cannot appear twice in a row. The computer will not try to interpret an expression that differs from its prescribed expression structures. Instead, it will report a SyntaxError. The syntax of a language is its set of grammar rules, and a SyntaxError indicates that an expression's structure doesn't match any of the rules of the language.
3 * * 4
SyntaxError: invalid syntax
Common Operators. Data science often involves combining numerical values, and the set of operators in a programming language is designed so that expressions can express any sort of arithmetic. In Python, the following operators are essential.
Expression Type Operator Example Value
Addition + 2 + 3 5
Subtraction - 2 - 3 -1
Multiplication * 2 * 3 6
Division / 7 / 3 2.66667
Remainder % 7 % 3 1
Exponentiation ** 2 ** 3 8
Python expressions obey the same familiar rules of precedence as in algebra: multiplication and division occur before addition and subtraction.
Parentheses can be used to group together smaller expressions within a larger expression.
1 + 2 * 3 * 4 * 5 / 6 ** 3 + 7 + 8 - 9 + 10
17.555555555555557
1 + 2 * (3 * 4 * 5 / 6) ** 3 + 7 + 8 - 9 + 10
2017.0
This chapter introduces many types of expressions. Learning to program involves trying out everything you learn in combination, investigating the
behavior of the computer. What happens if you divide by zero? What happens if you divide twice in a row? You don’t always need to ask an expert (or
the Internet); many of these details can be discovered by trying them out yourself.
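For instance, here is what one of those experiments looks like; the error shown below is standard Python behavior (an illustration, not taken from the text).
7 / 0
ZeroDivisionError: division by zero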
3.2. Names
Names are given to values in Python using an assignment statement. In an assignment, a name is followed by =, which is followed by any expression.
The value of the expression to the right of = is assigned to the name. Once a name has a value assigned to it, the value will be substituted for that
name in future expressions.
a = 10
b = 20
a + b
30
quarter = 1/4
half = 2 * quarter
half
0.5
However, only the current value of an expression is assigned to a name. If that value changes later, names that were defined in terms of that value will
not change automatically.
quarter = 4
half
0.5
Names must start with a letter, but can contain both letters and numbers. A name cannot contain a space; instead, it is common to use an underscore
character _ to replace each space. Names are only as useful as you make them; it’s up to the programmer to choose names that are easy to interpret.
Typically, more meaningful names can be invented than a and b. For example, to describe the sales tax on a $5 purchase in Berkeley, CA, the
following names clarify the meaning of the various quantities involved.
purchase_price = 5
state_tax_rate = 0.075
county_tax_rate = 0.02
city_tax_rate = 0
sales_tax_rate = state_tax_rate + county_tax_rate + city_tax_rate
sales_tax = purchase_price * sales_tax_rate
sales_tax
0.475
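A natural follow-up (not part of this excerpt) is the total amount paid, which simply reuses the names defined above:
purchase_price + sales_tax   # the purchase price plus tax: 5.475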
initial = 2766000
changed = 2814000
(changed - initial) / initial
0.01735357917570499
It is also typical to subtract one from the ratio of the two measurements, which yields the same value.
(changed/initial) - 1
0.017353579175704903
This value is the growth rate over 10 years. A useful property of growth rates is that they don’t change even if the values are expressed in different
units. So, for example, we can express the same relationship between thousands of people in 2002 and 2012.
initial = 2766
changed = 2814
(changed/initial) - 1
0.017353579175704903
In 10 years, the number of employees of the US Federal Government has increased by only 1.74%. In that time, the total expenditures of the US
Federal Government increased from $2.37 trillion to $3.38 trillion in 2012.
initial = 2.37
changed = 3.38
(changed/initial) - 1
0.4261603375527425
A 42.6% increase in the federal budget is much larger than the 1.74% increase in federal employees. In fact, the number of federal employees has
grown much more slowly than the population of the United States, which increased 9.21% in the same time period from 287.6 million people in 2002
to 314.1 million in 2012.
initial = 287.6
changed = 314.1
(changed/initial) - 1
0.09214186369958277
A growth rate can be negative, representing a decrease in some value. For example, the number of manufacturing jobs in the US decreased from 15.3
million in 2002 to 11.9 million in 2012, a ‑22.2% growth rate.
initial = 15.3
changed = 11.9
(changed/initial) - 1
-0.2222222222222222
An annual growth rate is a growth rate of some quantity over a single year. An annual growth rate of 0.035, accumulated each year for 10 years, gives
a much larger ten‑year growth rate of 0.41 (or 41%).
1.035 * 1.035 * 1.035 * 1.035 * 1.035 * 1.035 * 1.035 * 1.035 * 1.035 * 1.035 - 1
0.410598760621121
The same ten-year growth rate can be computed in a single step using the exponentiation operator **.
1.035 ** 10 - 1
0.410598760621121
Likewise, a ten‑year growth rate can be used to compute an equivalent annual growth rate. Below, t is the number of years that have passed between
measurements. The following computes the annual growth rate of federal expenditures over the last 10 years.
initial = 2.37
changed = 3.38
t = 10
(changed/initial) ** (1/t) - 1
0.03613617208346853
The total growth over 10 years is equivalent to a 3.6% increase each year.
In summary, a growth rate g is used to describe the relative size of an initial amount and a changed amount after some amount of time t. To
compute \(changed\), apply the growth rate g repeatedly, t times using exponentiation.
initial * (1 + g) ** t
To compute g, raise the total growth to the power of 1/t and subtract one.
(changed/initial) ** (1/t) - 1
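As a quick check (an illustration, not part of the text), plugging the annual rate computed above back into the first formula recovers the ten-year change in federal expenditures:
initial = 2.37
t = 10
g = (3.38 / initial) ** (1/t) - 1
initial * (1 + g) ** t   # recovers 3.38, up to rounding in the last few digits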
3.3. Call Expressions
Call expressions invoke functions, which are named operations. The name of the function appears first, followed by expressions in parentheses.
abs(-12)
12
round(5 - 1.3)
4
max(2, 2 + 3, 4)
5
In this last example, the max function is called on three arguments: 2, 5, and 4. The value of each expression within parentheses is passed to the
function, and the function returns the final value of the full call expression. The max function can take any number of arguments and returns the
maximum.
A few functions are available by default, such as abs and round, but most functions that are built into the Python language are stored in a collection of
functions called a module. An import statement is used to provide access to a module, such as math or operator.
import math
import operator
math.sqrt(operator.add(4, 5))
3.0
An equivalent expression can be written using the + and ** operators instead.
(4 + 5) ** 0.5
3.0
Operators and call expressions can be used together in an expression. The percent difference between two values is used to compare values for
which neither one is obviously initial or changed. For example, in 2014 Florida farms produced 2.72 billion eggs while Iowa farms produced 16.25
billion eggs (http://quickstats.nass.usda.gov/). The percent difference is 100 times the absolute value of the difference between the values, divided by
their average. In this case, the difference is larger than the average, and so the percent difference is greater than 100.
florida = 2.72
iowa = 16.25
100*abs(florida-iowa)/((florida+iowa)/2)
142.6462836056932
Learning how different functions behave is an important part of learning a programming language. A Jupyter notebook can assist in remembering the
names and effects of different functions. When editing a code cell, press the tab key after typing the beginning of a name to bring up a list of ways to
complete that name. For example, press tab after math. to see all of the functions available in the math module. Typing more characters narrows down the list of options. To learn more about a function, place a ? after its name. For example, typing math.log? will bring up a description of the log function in the
math module.
math.log?
log(x[, base])
The square brackets in the example call indicate that an argument is optional. That is, log can be called with either one or two arguments.
math.log(16, 2)
4.0
math.log(16)/math.log(2)
4.0
The list of Python’s built‑in functions is quite long and includes many functions that are never needed in data science applications. The list of
mathematical functions in the math module is similarly long. This text will introduce the most important functions in context, rather than expecting the
reader to memorize or understand these lists.
For example, if you want to see just the first two rows of a table, you can use the table method show.
cones.show(2)
Flavor Color Price
strawberry pink 3.55
chocolate light brown 4.75
... (4 rows omitted)
You can also drop columns you don't want. A table of just the flavors and prices can be created by dropping the Color column.
cones.drop('Color')
Flavor Price
strawberry 3.55
chocolate 4.75
chocolate 5.25
strawberry 5.25
chocolate 5.25
bubblegum 4.75
You can name this new table and look at it again by just typing its name.
no_colors = cones.drop('Color')
no_colors
Flavor Price
strawberry 3.55
chocolate 4.75
chocolate 5.25
strawberry 5.25
chocolate 5.25
bubblegum 4.75
Like select, the drop method creates a smaller table and leaves the original table unchanged. In order to explore your data, you can create any number of smaller tables by selecting or dropping columns. It will do no harm to your original data table.
You can also sort the rows of a table by the values in one of its columns, using the sort method. By default the rows are sorted in increasing order of the values; the optional argument descending=True reverses the order.
cones.sort('Price')
cones.sort('Price', descending=True)
4. Data Types
Every value has a type, and the built‑in type function returns the type of the result of any expression.
One type we have encountered already is a built‑in function. Python indicates that the type is a builtin_function_or_method; the distinction
between a function and a method is not important at this stage.
type(abs)
builtin_function_or_method
4.1. Numbers
Computers are designed to perform numerical calculations, but there are some important details about working with numbers that every programmer
working with quantitative data should know. Python (like most other programming languages) distinguishes between two different types of numbers:
Integers are called int values in the Python language. They can only represent whole numbers (negative, zero, or positive) that don't have a fractional component.
Real numbers are called float values (or floating point values) in the Python language. They can represent whole or fractional numbers but
have some limitations.
The type of a number is evident from the way it is displayed: int values have no decimal point and float values always have a decimal point.
# Some int values
2
1 + 3
-1234567890000000000
-1234567890000000000
# Some float values
1.2
3.0
3.0
When a float value is combined with an int value using some arithmetic operator, then the result is always a float value. In most cases, two
integers combine to form another integer, but any number (int or float) divided by another will be a float value. Very large or very small float
values are displayed using scientific notation.
1.5 + 2
3.5
3 / 1
3.0
-12345678900000000000.0
-1.23456789e+19
The type function can be used to find the type of any number.
type(3)
int
type(3 / 1)
float
The type of an expression is the type of its final value. So, the type function will never indicate that the type of an expression is a name, because
names are always evaluated to their assigned values.
x = 3
type(x) # The type of x is an int, not a name
int
type(x + 2.5)
float
The first limit can be observed below: a result that is too large in magnitude is displayed as inf, and a result too close to zero is eventually displayed as 0.0.
2e306 * 10
2e+307
2e306 * 100
inf
2e-322 / 10
2e-323
2e-322 / 100
0.0
The second limit can be observed by an expression that involves numbers with more than 15 significant digits. These extra digits are discarded before
any arithmetic is carried out.
0.6666666666666666 - 0.6666666666666666123456789
0.0
The third limit can be observed when taking the difference between two expressions that should be equivalent. For example, the expression 2 ** 0.5
computes the square root of 2, but squaring this value does not exactly recover 2.
2 ** 0.5
1.4142135623730951
(2 ** 0.5) * (2 ** 0.5)
2.0000000000000004
(2 ** 0.5) * (2 ** 0.5) - 2
4.440892098500626e-16
The final result above is 0.0000000000000004440892098500626, a number that is very close to zero. The correct answer to this arithmetic expression
is 0, but a small error in the final significant digit appears very different in scientific notation. This behavior appears in almost all programming
languages because it is the result of the standard way that arithmetic is carried out on computers.
Although float values are not always exact, they are certainly reliable and work the same way across all different kinds of computers and
programming languages.
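Because exact equality between float results can fail for reasons like the one above, a common practice (shown here as an illustration, not taken from the text) is to compare float values only up to a small tolerance:
abs((2 ** 0.5) * (2 ** 0.5) - 2) < 1e-10
True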
4.2. Strings
Much of the world’s data is text, and a piece of text represented in a computer is called a string. A string can represent a word, a sentence, or even
the contents of every book in a library. Since text can include numbers (like this: 5) or truth values (True), a string can also describe those things.
The meaning of an expression depends both upon its structure and the types of values that are being combined. So, for instance, adding two strings
together produces another string. This expression is still an addition expression, but it is combining a different type of value.
"data" + "science"
'datascience'
Addition is completely literal; it combines these two strings together without regard for their contents. It doesn’t add a space because these are
different words; that’s up to the programmer (you) to specify.
"data" + " " + "science"
'data science'
Single and double quotes can both be used to create strings: 'hi' and "hi" are identical expressions. Double quotes are often preferred because
they allow you to include apostrophes inside of strings.
"This won't work with a single-quoted string!"
"That's 2 True"
'LOUD'
Perhaps the most important method is replace, which replaces all instances of a substring within the string. The replace method takes two
arguments, the text to be replaced and its replacement.
'hitchhiker'.replace('hi', 'ma')
'matchmaker'
String methods can also be invoked using variable names, as long as those names are bound to strings. So, for instance, the following two‑step
process generates the word “degrade” starting from “train” by first creating “ingrain” and then applying a second replacement.
s = "train"
t = s.replace('t', 'ing')
u = t.replace('in', 'de')
u
'degrade'
Note that the line t = s.replace('t', 'ing') doesn’t change the string s, which is still “train”. The method call s.replace('t', 'ing') just has a
value, which is the string “ingrain”.
s
'train'
This is the first time we’ve seen methods, but methods are not unique to strings. As we will see shortly, other types of objects can have them.
4.3. Comparisons
Boolean values most often arise from comparison operators. Python includes a variety of operators that compare values. For example, 3 is larger than
1 + 1.
3 > 1 + 1
True
The value True indicates that the comparison is valid; Python has confirmed this simple fact about the relationship between 3 and 1+1. The full set of
common comparison operators are listed below.
Comparison Operator True example False Example
Less than < 2<3 2<2
Greater than > 3>2 3>3
Less than or equal <= 2 <= 2 3 <= 2
Greater or equal >= 3 >= 3 2 >= 3
Equal == 3 == 3 3 == 2
Not equal != 3 != 2 2 != 2
An expression can contain multiple comparisons, and they all must hold in order for the whole expression to be True. For example, we can express
that 1+1 is between 1 and 3 using the following expression.
1 < 1 + 1 < 3
True
The average of two numbers is always between the smaller number and the larger number. We express this relationship for the numbers x and y
below. You can try different values of x and y to confirm this relationship.
x = 12
y = 5
min(x, y) <= (x+y)/2 <= max(x, y)
True
Strings can also be compared, and their order is alphabetical. A shorter string is less than a longer string that begins with the shorter string.
"Dog" > "Catastrophe" > "Cat"
True
5. Sequences
Values can be grouped together into collections, which allows programmers to organize those values and refer to all of them with a single name. By
grouping values together, we can write code that performs a computation on many pieces of data at once.
Calling the function make_array on several values places them into an array, which is a kind of sequential collection. Below, we collect four different
temperatures into an array called highs. These are the estimated average daily high temperatures over all land on Earth (in degrees Celsius) for the
decades surrounding 1850, 1900, 1950, and 2000, respectively, expressed as deviations from the average absolute high temperature between 1951
and 1980, which was 14.48 degrees.
baseline_high = 14.48
highs = make_array(baseline_high - 0.880, baseline_high - 0.093,
baseline_high + 0.105, baseline_high + 0.684)
highs
Collections allow us to pass multiple values into a function using a single name. For instance, the sum function computes the sum of all values in a
collection, and the len function computes its length. (That’s the number of values we put in it.) Using them together, we can compute the average of
a collection.
sum(highs)/len(highs)
14.434000000000001
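The same average can also be computed with a numpy helper (a small aside, assuming numpy is imported as np, as elsewhere in the text):
np.mean(highs)   # 14.434000000000001, the same value as sum(highs)/len(highs)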
The complete chart of daily high and low temperatures appears below.
5.1. Arrays
While there are many kinds of collections in Python, we will work primarily with arrays in this class. We’ve already seen that the make_array function
can be used to create arrays of numbers.
Arrays can also contain strings or other types of values, but a single array can only contain a single kind of data. (It usually doesn’t make sense to
group together unlike data anyway.) For example:
english_parts_of_speech = make_array("noun", "pronoun", "verb", "adverb", "adjective",
"conjunction", "preposition", "interjection")
english_parts_of_speech
Returning to the temperature data, we create arrays of average daily high temperatures for the decades surrounding 1850, 1900, 1950, and 2000.
baseline_high = 14.48
highs = make_array(baseline_high - 0.880,
baseline_high - 0.093,
baseline_high + 0.105,
baseline_high + 0.684)
highs
Arrays can be used in arithmetic expressions to compute over their contents. When an array is combined with a single number, that number is
combined with each element of the array. Therefore, we can convert all of these temperatures to Fahrenheit by writing the familiar conversion
formula.
(9/5) * highs + 32
Arrays also have methods, which are functions that operate on the array values. The mean of a collection of numbers is its average value: the sum divided by the length. Each pair of parentheses in the examples below is part of a call expression; it's calling a function with no arguments to perform a computation on the array called highs. (The size of an array is an attribute rather than a method, so it is written without parentheses.)
highs.size
4
highs.sum()
57.736000000000004
highs.mean()
14.434000000000001
For example, the diff function computes the difference between each adjacent pair of elements in an array. The first element of the diff is the
second element minus the first.
np.diff(highs)
Each of these functions takes an array of numbers and returns an array.
Function Description
np.cumsum A cumulative sum: for each element, add all elements so far
np.exp Exponentiate each element
np.log Take the natural logarithm of each element
np.sqrt Take the square root of each element
np.sort Sort the elements
Each of these functions takes an array of strings and returns an array.
Function Description
np.char.lower Lowercase each element
np.char.upper Uppercase each element
np.char.strip Remove spaces at the beginning or end of each element
np.char.isalpha Whether each element is only letters (no numbers or symbols)
np.char.isnumeric Whether each element is only numeric (no letters)
Each of these functions takes both an array of strings and a search string; each returns an array.
Function Description
np.char.count Count the number of times a search string appears among the elements of an array
np.char.find The position within each element that a search string is found first
np.char.rfind The position within each element that a search string is found last
np.char.startswith Whether each element starts with the search string
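For example (a small illustration, not from the text), np.char.count can be applied to a short array of strings:
np.char.count(make_array('cat', 'catastrophe', 'dog'), 'cat')
array([1, 1, 0])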
5.2. Ranges
A range is an array of numbers in increasing or decreasing order, each separated by a regular interval. Ranges are useful in a surprisingly large
number of situations, so it’s worthwhile to learn about them.
Ranges are defined using the np.arange function, which takes either one, two, or three arguments: a start, an end, and a step.
If you pass one argument to np.arange, this becomes the end value, with start=0, step=1 assumed. Two arguments give the start and end with
step=1 assumed. Three arguments give the start, end and step explicitly.
A range always includes its start value, but does not include its end value. It counts up by step, and it stops before it gets to the end.
np.arange(end): An array starting with 0 of increasing consecutive integers, stopping before end.
np.arange(5)
array([0, 1, 2, 3, 4])
Notice how the array starts at 0 and goes only up to 4, not to the end value of 5.
np.arange(start, end): An array of consecutive increasing integers from start, stopping before end.
np.arange(3, 9)
array([3, 4, 5, 6, 7, 8])
np.arange(start, end, step): A range with a difference of step between each pair of consecutive values, starting from start and
stopping before end.
np.arange(3, 30, 5)
array([ 3,  8, 13, 18, 23, 28])
This array starts at 3, then takes a step of 5 to get to 8, then another step of 5 to get to 13, and so on.
When you specify a step, the start, end, and step can all be either positive or negative and may be whole numbers or fractions.
np.arange(1.5, -2, -0.5)
array([ 1.5,  1. ,  0.5,  0. , -0.5, -1. , -1.5])
To get an accurate approximation to \(\pi\), we’ll use the much longer array positive_term_denominators.
positive_term_denominators = np.arange(1, 10000, 4)
positive_term_denominators
The positive terms we actually want to add together are just 1 over these denominators:
positive_terms = 1 / positive_term_denominators
The negative terms have 3, 7, 11, and so on in their denominators. This array is just 2 added to positive_term_denominators.
negative_terms = 1 / (positive_term_denominators + 2)
Leibniz's formula says that \(\pi\) is 4 times the alternating sum of these terms.
4 * (sum(positive_terms) - sum(negative_terms))
3.1413926535917955
5.2.2. Footnotes
[1] Surprisingly, when we add infinitely many positive and negative fractions, the order can matter! But our approximation to \(\pi\) uses only a large
finite number of fractions, so it’s okay to add the terms in any convenient order.
baseline_low = 3.00
lows = make_array(baseline_low - 0.872, baseline_low - 0.629,
baseline_low - 0.126, baseline_low + 0.728)
lows
Suppose we'd like to compute the average daily range of temperatures for each decade. That is, we want to subtract the average daily low in the 1850s from the average daily high in the 1850s, and the same for each other decade.
We could write this laboriously using .item:
make_array(
highs.item(0) - lows.item(0),
highs.item(1) - lows.item(1),
highs.item(2) - lows.item(2),
highs.item(3) - lows.item(3)
)
As when we converted an array of temperatures from Celsius to Fahrenheit, Python provides a much cleaner way to write this:
highs - lows
Remember that np.prod multiplies all the elements of an array together. Now we can calculate Wallis’ product, to a good approximation.
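The arrays even, one_below_even, and one_above_even used in the computation below are defined earlier in the full text but are not shown in this excerpt. A minimal sketch of how they could be constructed (the upper limit of one million is an assumption, not taken from this excerpt):
even = np.arange(2, 1000001, 2)    # the even numbers 2, 4, 6, ..., 1,000,000
one_below_even = even - 1          # 1, 3, 5, ..., 999999
one_above_even = even + 1          # 3, 5, 7, ..., 1000001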
2 * np.prod(even/one_below_even) * np.prod(even/one_above_even)
3.1415910827951143
That’s \(\pi\) correct to five decimal places. Wallis clearly came up with a great formula.
5.3.3. Footnotes
[1] As we saw in the example about Leibniz’s formula, when we add infinitely many fractions, the order can matter. The same is true with multiplying
fractions, as we are doing here. But our approximation to \(\pi\) uses only a large finite number of fractions, so it’s okay to multiply the terms in any
convenient order.
6. Tables
Tables are a fundamental object type for representing data sets. A table can be viewed in two ways:
a sequence of named columns that each describe a single aspect of all entries in a data set, or
a sequence of rows that each contain all information about a single entry in a data set.
In order to use tables, import all of the module called datascience, a module created for this text.
from datascience import *
Empty tables can be created using the Table function. An empty table is useful because it can be extended to contain new rows and columns.
Table()
The with_columns method on a table constructs a new table with additional labeled columns. Each column of a table is an array. To add one new
column to a table, call with_columns with a label and an array. (The with_column method can be used with the same effect.)
Below, we begin each example with an empty table that has no columns.
Table().with_columns('Number of petals', make_array(8, 34, 5))
Number of petals
8
34
5
To add two (or more) new columns, provide the label and array for each column. All columns must have the same length, or an error will occur.
Table().with_columns(
'Number of petals', make_array(8, 34, 5),
'Name', make_array('lotus', 'sunflower', 'rose')
)
flowers.with_columns(
'Color', make_array('pink', 'yellow', 'red')
)
The method num_rows gives the number of rows in the table.
minard.num_rows
8
Column Labels
The method labels can be used to list the labels of all the columns. With minard we don’t gain much by this, but it can be very useful for tables that
are so large that not all columns are visible on the screen.
minard.labels
('Longitude', 'Latitude', 'City', 'Direction', 'Survivors')
We can change column labels using the relabeled method. This creates a new table and leaves minard unchanged.
minard.relabeled('City', 'City Name')
Longitude Latitude City Name Direction Survivors
32 54.8 Smolensk Advance 145000
33.2 54.9 Dorogobouge Advance 140000
34.4 55.5 Chjat Advance 127100
37.6 55.8 Moscou Advance 100000
34.3 55.2 Wixma Retreat 55000
32 54.6 Smolensk Retreat 24000
30.4 54.4 Orscha Retreat 20000
26.8 54.3 Moiodexno Retreat 12000
However, this method does not change the original table.
minard
The 5 columns are indexed 0, 1, 2, 3, and 4. The column Survivors can also be accessed by using its column index.
minard.column(4)
array([145000, 140000, 127100, 100000,  55000,  24000,  20000,  12000])
The 8 items in the array are indexed 0, 1, 2, and so on, up to 7. The items in the column can be accessed using item, as with any array.
minard.column(4).item(0)
145000
minard.column(4).item(5)
24000
The select method creates a new table containing only the specified columns, identified by their labels.
minard.select('Longitude', 'Latitude')
Longitude Latitude
32 54.8
33.2 54.9
34.4 55.5
37.6 55.8
34.3 55.2
32 54.6
30.4 54.4
26.8 54.3
The same selection can be made using column indices instead of labels.
minard.select(0, 1)
Longitude Latitude
32 54.8
33.2 54.9
34.4 55.5
37.6 55.8
34.3 55.2
32 54.6
30.4 54.4
26.8 54.3
The result of using select is a new table, even when you select just one column.
minard.select('Survivors')
Survivors
145000
140000
127100
100000
55000
24000
20000
12000
Notice that the result is a table, unlike the result of column, which is an array.
minard.column('Survivors')
array([145000, 140000, 127100, 100000,  55000,  24000,  20000,  12000])
Another way to create a new table consisting of a set of columns is to drop the columns you don’t want.
minard.drop('Longitude', 'Latitude', 'Direction')
The code for the positions is PG (Point Guard), SG (Shooting Guard), PF (Power Forward), SF (Small Forward), and C (Center). But what follows
doesn’t involve details about how basketball is played.
The first row shows that Paul Millsap, Power Forward for the Atlanta Hawks, had a salary of almost \(\$18.7\) million in 2015‑2016.
# This table can be found online: https://www.statcrunch.com/app/index.php?dataid=1843341
nba_salaries = Table.read_table(path_data + 'nba_salaries.csv')
nba_salaries
PLAYER POSITION TEAM '15‑'16 SALARY
Paul Millsap PF Atlanta Hawks 18.6717
Al Horford C Atlanta Hawks 12
Tiago Splitter C Atlanta Hawks 9.75625
Jeff Teague PG Atlanta Hawks 8
Kyle Korver SG Atlanta Hawks 5.74648
Thabo Sefolosha SF Atlanta Hawks 4
Mike Scott PF Atlanta Hawks 3.33333
Kent Bazemore SF Atlanta Hawks 2
Dennis Schroder PG Atlanta Hawks 1.7634
Tim Hardaway Jr. SG Atlanta Hawks 1.30452
... (407 rows omitted)
The table contains 417 rows, one for each player. Only 10 of the rows are displayed. The show method allows us to specify the number of rows, with
the default (no specification) being all the rows of the table.
nba_salaries.show(3)
Args:
``column_or_label``: the column whose values are used for sorting.
Returns:
An instance of ``Table`` containing rows sorted based on the values
in ``column_or_label``.
At the very top of this help text, the signature of the sort method appears:
sort(column_or_label, descending=False, distinct=False)
This describes the positions, names, and default values of the three arguments to sort. When calling this method, you can use either positional
arguments or named arguments, so the following three calls do exactly the same thing.
sort('SALARY', True)
sort('SALARY', descending=True)
sort(column_or_label='SALARY', descending=True)
When an argument is simply True or False, it’s a useful convention to include the argument name so that it’s more obvious what the argument value
means.
6.2. Selecting Rows
Often, we would like to extract just those rows that correspond to entries with a particular feature. For example, we might want only the rows
corresponding to the Warriors, or to players who earned more than \(\$10\) million. Or we might just want the top five earners.
# The original file is from census.gov; a local copy is used here:
data = path_data + 'nc-est2019-agesex-res.csv'
full_census_table = Table.read_table(data)
full_census_table
We can augment us_pop_by_age with a column that contains these changes, both in absolute terms and as percents relative to the value in 2014.
change = us_pop_by_age.column('2019') - us_pop_by_age.column('2014')
us_pop_change = us_pop_by_age.with_columns(
'Change', change,
'Percent Change', change/us_pop_by_age.column('2014')
)
us_pop_change.set_format('Percent Change', PercentFormatter)
AGE 2014 2019 Change Percent Change
0 3954787 3783052 ‑171735 ‑4.34%
1 3948891 3829599 ‑119292 ‑3.02%
2 3958711 3922044 ‑36667 ‑0.93%
3 4005928 3998665 ‑7263 ‑0.18%
4 4004032 4043323 39291 0.98%
5 4004576 4028281 23705 0.59%
6 4133372 4017227 ‑116145 ‑2.81%
7 4152666 4022319 ‑130347 ‑3.14%
8 4118349 4066194 ‑52155 ‑1.27%
9 4106068 4061874 ‑44194 ‑1.08%
... (92 rows omitted)
Almost all the entries displayed in the Percent Change column are negative, demonstrating a drop in population at the youngest ages. However, the
overall population grew by about 9.9 million people, a percent change of just over 3%.
us_pop_change.where('AGE', are.equal_to(999))
females.column('AGE')
For any given age, we can get the Female:Male sex ratio by dividing the number of females by the number of males.
To do this in one step, we can use column to extract the array of female counts and the corresponding array of male counts, and then simply divide
one array by the other. Elementwise division will create an array of sex ratios for all the years.
ratios = Table().with_columns(
'AGE', females.column('AGE'),
'2019 F:M RATIO', females.column('2019')/males.column('2019')
)
ratios
7. Visualization
Tables are a powerful way of organizing and visualizing data. However, large tables of numbers can be difficult to interpret, no matter how organized
they are. Sometimes it is much easier to interpret graphs than numbers.
In this chapter we will develop some of the fundamental graphical methods of data analysis. Our source of data is the Internet Movie Database, an
online database that contains information about movies, television shows, video games, and so on. The site Box Office Mojo provides many
summaries of IMDB data, some of which we have adapted. We have also used data summaries from The Numbers, a site with a tagline that says it is
“where data and the movie business meet.”
Scatter Plots
A scatter plot displays the relation between two numerical variables. You saw an example of a scatter plot in an early section where we looked at the
number of periods and number of characters in two classic novels.
The Table method scatter draws a scatter plot consisting of one point for each row of the table. Its first argument is the label of the column to be
plotted on the horizontal axis, and its second argument is the label of the column on the vertical.
actors.scatter('Number of Movies', 'Total Gross')
The plot contains 50 points, one point for each actor in the table. You can see that it slopes upwards, in general. The more movies an actor has been
in, the more the total gross of all of those movies – in general.
Formally, we say that the plot shows an association between the variables, and that the association is positive: high values of one variable tend to be
associated with high values of the other, and low values of one with low values of the other, in general.
Of course there is some variability. Some actors have high numbers of movies but middling total gross receipts. Others have middling numbers of
movies but high receipts. That the association is positive is simply a statement about the broad general trend.
Later in the course we will study how to quantify association. For the moment, we will just think about it qualitatively.
Now that we have explored how the number of movies is related to the total gross receipt, let’s turn our attention to how it is related to the average
gross receipt per movie.
actors.scatter('Number of Movies', 'Average per Movie')
This is a markedly different picture and shows a negative association. In general, the more movies an actor has been in, the less the average receipt
per movie.
Also, one of the points is quite high and off to the left of the plot. It corresponds to one actor who has a low number of movies and high average per
movie. This point is an outlier. It lies outside the general range of the data. Indeed, it is quite far from all the other points in the plot.
We will examine the negative association further by looking at points at the right and left ends of the plot.
For the right end, let’s zoom in on the main body of the plot by just looking at the portion that doesn’t have the outlier.
no_outlier = actors.where('Number of Movies', are.above(10))
no_outlier.scatter('Number of Movies', 'Average per Movie')
The negative association is still clearly visible. Let’s identify the actors corresponding to the points that lie on the right hand side of the plot where the
number of movies is large:
actors.where('Number of Movies', are.above(60))
Actor Total Gross Number of Movies Average per Movie #1 Movie Gross
Samuel L. Jackson 4772.8 69 69.2 The Avengers 623.4
Morgan Freeman 4468.3 61 73.3 The Dark Knight 534.9
Robert DeNiro 3081.3 79 39 Meet the Fockers 279.3
Liam Neeson 2942.7 63 46.7 The Phantom Menace 474.5
The great actor Robert DeNiro has the highest number of movies and the lowest average receipt per movie. Other fine actors are at points that are
not very far away, but DeNiro’s is at the extreme end.
To understand the negative association, note that the more movies an actor is in, the more variable those movies might be, in terms of style, genre,
and box office draw. For example, an actor might be in some high‑grossing action movies or comedies (such as Meet the Fockers), and also in a large
number of smaller films that may be excellent but don’t draw large crowds. Thus the actor’s value of average receipts per movie might be relatively
low.
To approach this argument from a different direction, let us now take a look at the outlier.
actors.where('Number of Movies', are.below(10))
Line Plots
Line plots, sometimes known as line graphs, are among the most common visualizations. They are often used to study chronological trends and
patterns.
The table movies_by_year contains data on movies produced by U.S. studios in each of the years 1980 through 2015. The columns are:
Column Content
Year Year
Total Gross Total domestic box office gross, in millions of dollars, of all movies released
Number of Movies Number of movies released
#1 Movie Highest grossing movie
movies_by_year = Table.read_table(path_data + 'movies_by_year.csv')
movies_by_year
The graph rises sharply and then has a gentle upwards trend, though the numbers vary noticeably from year to year. The sharp rise in the early 1980's is due in part to studios returning to the forefront of movie production after some years of filmmaker-driven movies in the 1970's.
Our focus will be on more recent years. In keeping with the theme of movies, the table of rows corresponding to the years 2000 through 2015 has been assigned to the name century_21.
century_21 = movies_by_year.where('Year', are.above(1999))
The total domestic gross receipt was higher in 2009 than in 2008, even though there was a financial crisis and a much smaller number of movies
were released.
One reason for this apparent contradiction is that people tend to go to the movies when there is a recession. “In Downturn, Americans Flock to the
Movies,” said the New York Times in February 2009. The article quotes Martin Kaplan of the University of Southern California saying, “People want to
forget their troubles, and they want to be with other people.” When holidays and expensive treats are unaffordable, movies provide welcome
entertainment and relief.
In 2009, another reason for high box office receipts was the movie Avatar and its 3D release. Not only was Avatar the #1 movie of 2009, it is also by
some calculations one of the highest grossing movies of all time, as we will see later.
century_21.where('Year', are.equal_to(2009))
If the table consists just of a column of categories and a column of frequencies, as in icecream, the method call is even simpler. You can just specify
the column containing the categories, and barh will use the values in the other column as frequencies.
icecream.barh('Flavor')
7.1.2. Design Aspects of Bar Charts
Apart from purely visual differences, there is an important fundamental distinction between bar charts and the two graphs that we saw in the
previous sections. Those were the scatter plot and the line plot, both of which display two quantitative variables – the variables on both axes are
quantitative. In contrast, the bar chart has categories on one axis and numerical quantities on the other.
This has consequences for the chart. First, the width of each bar and the space between consecutive bars is entirely up to the person who is
producing the graph, or to the program being used to produce it. Python made those choices for us. If you were to draw the bar graph by hand, you
could make completely different choices and still have a perfectly correct bar graph, provided you drew all the bars with the same width and kept all
the spaces the same.
Most importantly, the bars can be drawn in any order. The categories “chocolate,” “vanilla,” and “strawberry” have no universal rank order, unlike for
example the numbers 5, 7, and 10.
This means that we can draw a bar chart that is easier to interpret, by rearranging the bars in decreasing order. To do this, we first rearrange the rows
of icecream in decreasing order of Number of Cartons, and then draw the bar chart.
icecream.sort('Number of Cartons', descending=True).barh('Flavor')
This bar chart contains exactly the same information as the previous ones, but it is a little easier to read. While this is not a huge gain in reading a
chart with just three bars, it can be quite significant when the number of categories is large.
The Table method group allows us to count how frequently each studio appears in the table, by calling each studio a category and collecting all the
rows in each of these new categories.
The group method takes as its argument the label of the column that contains the categories. It returns a table of counts of rows in each category.
Thus group creates a distribution table that shows how the individuals (movies) are distributed among the categories (studios).
The group method lists the categories in ascending order. Since our categories are studio names and therefore represented as strings, ascending
order means alphabetical order.
The column of counts is always called count, but you can change that if you like by using relabeled.
studio_distribution = movies_and_studios.group('Studio')
studio_distribution
Studio count
AVCO 1
Buena Vista 35
Columbia 9
Disney 11
Dreamworks 3
Fox 24
IFC 1
Lionsgate 3
MGM 7
Metro 1
... (13 rows omitted)
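As noted above, the label of the column of counts can be changed with relabeled. For example:
studio_distribution.relabeled('count', 'Number of Movies')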
The table shows that there are 23 different studios and provides the count of movies released by each one. The total of the count is 200, the total
number of movies.
sum(studio_distribution.column('count'))
200
We can now use this table, along with the graphing skills acquired above, to draw a bar chart that shows which studios are most frequent among the
200 highest grossing movies.
studio_distribution.sort('count', descending=True).barh('Studio')
Buena Vista and Warner Brothers are the most common studios among the top 200 movies. Warner Brothers produces the Harry Potter movies and
Buena Vista produces Star Wars.
The smallest and largest values of Adjusted Gross in the data, in millions of 2016 dollars, are about 338.41 and 1796.18:
(338.41, 1796.18)
Let’s try bins of width 100, starting at 300 and going to 2000. You are welcome to make other choices. It is common to start with something that
seems reasonable and then adjust based on the results.
bin_counts = millions.bin('Adjusted Gross', bins=np.arange(300,2001,100))
bin_counts.show()
The last bin: Notice the bin value 2000 in the last row. That’s not the left endpoint of any bin. Instead, it’s the right endpoint of the last bin. This bin
is different from all the others in that it has the form [a, b]. It includes the data at both endpoints. In our example it doesn’t matter because no movie
made 2 billion dollars (that is, 2000 million). But this aspect of binning is important to keep in mind in case you want the bins to end exactly at the
maximum value of the data. All the counts for this last bin appear in the second‑to‑last row, and the count for the last row is always zero.
There are other ways to use the bin method. If you don’t specify any bins, the default is to produce 10 equally wide bins between the minimum and
maximum values of the data. This is often useful for getting a quick sense of the distribution, but the endpoints of the bins tend to be awkward values.
millions.bin('Adjusted Gross').show()
bin Adjusted Gross count
338.41 115
484.187 50
629.964 14
775.741 10
921.518 3
1067.3 4
1213.07 2
1358.85 0
1504.63 1
1650.4 1
1796.18 0
You can specify a number of equally wide bins. For example, the option bins=4 leads to 4 equally spaced bins.
millions.bin('Adjusted Gross', bins=4)
7.2.2. Histogram
A histogram is a visualization of the distribution of a quantitative variable. It looks very much like a bar chart but there are some important differences
that we will examine in this section. First, let’s just draw a histogram of the adjusted receipts.
The hist method generates a histogram of the values in a column. The optional unit argument is used in the labels on the two axes. The histogram
below shows the distribution of the adjusted gross amounts, in millions of 2016 dollars. We have not specified the bins, so hist creates 10 equally
wide bins between the minimum and maximum values of the data.
millions.hist('Adjusted Gross', unit="Million Dollars")
This figure has two numerical axes. We will take a quick look at the horizontal axis first, and then examine the vertical axis carefully. For now, just note
that the vertical axis does not represent percents.
7.2.3. The Horizontal Axis
Although in this dataset no movie grossed an amount that is exactly on the edge between two bins, hist does have to account for situations where
there might have been values at the edges. So hist uses the same endpoint convention as the bin method. Bins include the data at their left
endpoint, but not the data at their right endpoint, except for the rightmost bin which includes both endpoints.
We can see that there are 10 bins (some bars are so low that they are hard to see), and that they all have the same width. We can also see that none
of the movies grossed fewer than 300 million dollars; that is because we are considering only the top grossing movies.
It is a little harder to see exactly where the ends of the bins are situated. So it is hard to judge exactly where one bar ends and the next begins.
The optional argument bins can be used with hist to specify the endpoints of the bins exactly as with the bin method. We will start by setting the
numbers in bins to be 300, 400, 500, and so on, ending with 2000.
millions.hist('Adjusted Gross', bins=np.arange(300,2001,100), unit="Million Dollars")
The horizontal axis of this figure is easier to read. For example, you can see exactly where 600 is, even though it is not labeled.
A very small number of movies grossed a billion dollars (1000 million) or more. This results in the figure being skewed to the right, or, less formally,
having a long right hand tail. Distributions of variables like income or rent in large populations also often have this kind of shape.
The larger of the batteries is supposed to be 70% bigger than the smaller. So it’s meant to be bigger but not quite twice as big. However, the larger
battery in the picture looks almost four times the size of the smaller one.
The reason for this problem is that the eye picks up area as the measure of size, not just height or just width. In the picture, both dimensions have
been increased by 70%, leading to a multiplicative effect in the area.
The area principle of visualization says that when we represent a magnitude by a figure that has two dimensions, such as a rectangle, then the area of
the figure should represent the magnitude.
7.2.5. The Histogram: General Principles and Calculation
Histograms follow the area principle and have two defining properties:
1. The bins are drawn to scale and are contiguous (though some might be empty), because the values on the horizontal axis are numerical and
therefore have fixed positions on the number line.
2. The area of each bar is proportional to the number of entries in the bin.
Property 2 is the key to drawing a histogram, and is usually achieved as follows:
\[ \mbox{area of bar} ~=~ \mbox{percent of entries in bin} \]
Since areas represent percents, heights represent something other than percents. The numerical calculation of the heights just uses the fact that the
bar is a rectangle:
\[ \mbox{area of bar} = \mbox{height of bar} \times \mbox{width of bin} \]
and so
\[ \mbox{height of bar} ~=~ \frac{\mbox{area of bar}}{\mbox{width of bin}} ~=~ \frac{\mbox{percent of entries in bin}}{\mbox{width of bin}} \]
The units of height are “percent per unit on the horizontal axis.” The height is the percent in the bin relative to the width of the bin. So it is called
density or crowdedness.
When drawn using this method, the histogram is said to be drawn on the density scale. On this scale:
The area of each bar is equal to the percent of data values that are in the corresponding bin.
The total area of all the bars in the histogram is 100%. In terms of proportions, we can say that the areas of all the bars in a histogram “sum to 1”.
Recall that the table bin_counts has the counts in all the bins of the histogram, specified by bins=np.arange(300, 2001, 100). Also remember that
there are 200 movies in all.
bin_counts.show(3)
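Using those counts, we can check the bar heights directly. A sketch, assuming the counts are in a column labeled 'Adjusted Gross count', as in the binned tables displayed above:
counts = bin_counts.column('Adjusted Gross count')
percents_in_bins = counts * 100 / 200    # 200 movies in all
heights = percents_in_bins / 100         # each bin is 100 million dollars wide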
Drawing histograms on the density scale also allows us to compare histograms that are based on data sets of different sizes or have different choices
of bins. In such cases, neither bin counts nor percents may be directly comparable. But if both histograms are drawn to the density scale then areas
and densities are comparable.
If a histogram has unequal bins, then plotting on the density scale is a requirement for interpretability. For some variables, unequal bins may be
natural. For example, in the U.S. education system, elementary school consists of Grades 1‑5, middle school is Grades 6‑8, high school is Grades 9‑
12, and a Bachelor’s degree takes a further four years. Data on years of education might be binned using these intervals. In fact, no matter what the
variable, bins don’t have to be equal. It is quite common to have one very wide bin towards the left end or right end of the data, where there are not
many values.
Let’s plot a histogram of adjusted gross receipts using unequal bins, and then see what happens if we plot counts instead.
uneven = make_array(300, 350, 400, 500, 1800)
millions.hist('Adjusted Gross', bins=uneven, unit="Million Dollars")
Notice that the [400, 500) bar has the same height (0.3% per million dollars) as in the histograms above.
The areas of the other bars represent the percents in the bins, as usual. The bin method allows us to see the counts in each bin.
millions.bin('Adjusted Gross', bins=uneven)
bin Adjusted Gross count
300 14
350 54
400 60
500 72
1800 0
The [300, 350) bin has only 14 movies whereas the [500, 1800] bin has 72 movies. But the bar over the [500, 1800] bin is much shorter than the bar
over [300, 350). The [500, 1800] bin is so wide that its 72 movies are much less crowded than the 14 movies in the narrow [300, 350) bin. In other
words, there is less density over the [500, 1800] interval.
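For example, using the counts above and the height formula from earlier in this section:
\[ \mbox{height of } [300, 350) \mbox{ bar} ~=~ \frac{(14/200) \times 100\%}{50} ~=~ \frac{7\%}{50} ~=~ 0.14\% \mbox{ per million dollars} \]
\[ \mbox{height of } [500, 1800] \mbox{ bar} ~=~ \frac{(72/200) \times 100\%}{1300} ~=~ \frac{36\%}{1300} ~\approx~ 0.028\% \mbox{ per million dollars} \]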
If instead you just plot the counts using the normed=False option as shown below, the figure looks completely different and misrepresents the data.
millions.hist('Adjusted Gross', bins=uneven, normed=False)
Even though hist has been used, the figure above is NOT A HISTOGRAM. It misleadingly exaggerates the movies grossing at least 500 million
dollars. The height of each bar is simply plotted at the number of movies in the bin, without accounting for the difference in the widths of the bins. In
this count‑based figure, the shape of the distribution of movies is lost entirely.
To see this, let us split the [400, 500) bin into 10 narrower bins, each of width 10 million dollars.
some_tiny_bins = make_array(
300, 350, 400, 410, 420, 430, 440, 450, 460, 470, 480, 490, 500, 1800)
millions.hist('Adjusted Gross', bins=some_tiny_bins, unit='Million Dollars')
Some of the skinny bars are taller than 0.3 and others are shorter. By putting a flat top at the level 0.3 across the whole bin, we are deciding to ignore
the finer detail and are using the flat level as a rough approximation. Often, though not always, this is sufficient for understanding the general shape
of the distribution.
The height as a rough approximation. This observation gives us a different way of thinking about the height. Look again at the [400, 500) bin in the
earlier histograms. As we have seen, the bin is 100 million dollars wide and contains 30% of the data. Therefore the height of the corresponding bar is
0.3% per million dollars.
Now think of the bin as consisting of 100 narrow bins that are each 1 million dollars wide. The bar’s height of “0.3% per million dollars” means that as
a rough approximation, 0.3% of the movies are in each of those 100 skinny bins of width 1 million dollars.
We have the entire dataset that is being used to draw the histograms. So we can draw the histograms to as fine a level of detail as the data and our
patience will allow. Smaller bins will lead to a more detailed picture. However, if you are looking at a histogram in a book or on a website, and you
don’t have access to the underlying dataset, then it becomes important to have a clear understanding of the “rough approximation” created by the
flat tops.
More commonly, we will first select only the columns needed for our graph, and then call the method by just specifying the variable on the common
axis:
name_of_table.method(column_label_of_common_axis)
Notice how we only specified the variable (sons’ heights) on the common horizontal axis. Python drew two scatter plots: one each for the relation
between this variable and the other two.
Each point represents a row of the table, that is, a “father, mother, son” trio. For all points, the horizontal axis represents the son’s height. In the blue
points, the vertical axis represents the father’s height. In the gold points, the vertical axis represents the mother’s heights.
Both the gold and the blue scatter plots slope upwards and show a positive association between the sons’ heights and the heights of both their
parents. The blue (fathers) plot is in general higher than the gold, because the fathers were in general taller than the mothers.
# Select columns from the full table and relabel some of them
partial_census_table = full_census_table.select('SEX', 'AGE', 'POPESTIMATE2014',
'POPESTIMATE2019')
us_pop = partial_census_table.relabeled('POPESTIMATE2014',
'2014').relabeled('POPESTIMATE2019', '2019')
The two distributions are quite different. California has higher percents in the API and Hispanic categories, and correspondingly lower percents in
the Black and White categories. The percents in the Other category are quite similar in the two populations. The differences are largely due to
California’s geographical location and patterns of immigration and migration, both historically and in more recent decades.
As you can see from the graph, almost 40% of the Californian population in 2019 was Hispanic. A comparison with the population of children in the
state indicates that the Hispanic proportion is likely to be greater in future years. Among Californian children in 2019, more than 50% were in the
Hispanic category.
More complex data sets naturally give rise to varied and interesting visualizations, including overlaid graphs of different kinds. To analyze such data, it
helps to have some more skills in data manipulation, so that we can get the data into a form that allows us to use methods like those in this section. In
the next chapter we will develop some of these skills.
Defining a Function
The definition of the double function below simply doubles a number.
# Our first function definition
def double(x):
""" Double x """
return 2*x
We start any function definition by writing def. Here is a breakdown of the other parts (the syntax) of this small function:
When we run the cell above, no particular number is doubled, and the code inside the body of double is not yet evaluated. In this respect, our
function is analogous to a recipe. Each time we follow the instructions in a recipe, we need to start with ingredients. Each time we want to use our
function to double a number, we need to specify a number.
We can call double in exactly the same way we have called other functions. Each time we do that, the code in the body is executed, with the value of
the argument given the name x.
double(17)
34
double(-0.6/4)
-0.3
The two expressions above are both call expressions. In the second one, the value of the expression -0.6/4 is computed and then passed as the
argument named x to the double function. Each call expresson results in the body of double being executed, but with a different value of x.
The body of double has only a single line:
return 2*x
Executing this return statement completes execution of the double function’s body and computes the value of the call expression.
The argument to double can be any expression, as long as its value is a number. For example, it can be a name. The double function does not know
or care how its argument is computed or stored; its only job is to execute its own body using the values of the arguments passed to it.
any_name = 42
double(any_name)
84
The argument can also be any value that can be doubled. For example, a whole array of numbers can be passed as an argument to double, and the
result will be another array.
double(make_array(3, 4, 5))
array([ 6, 8, 10])
However, names that are defined inside a function, including arguments like double’s x, have only a fleeting existence. They are defined only while the
function is being called, and they are only accessible inside the body of the function. We can’t refer to x outside the body of double. The technical
terminology is that x has local scope.
Therefore the name x isn’t recognized outside the body of the function, even though we have called double in the cells above.
x
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-7-6fcf9dfbd479> in <module>
----> 1 x
Docstrings. Though double is relatively easy to understand, many functions perform complicated tasks and are difficult to use without explanation.
(You may have discovered this yourself!) Therefore, a well‑composed function has a name that evokes its behavior, as well as documentation. In
Python, this is called a docstring — a description of its behavior and expectations about its arguments. The docstring can also show example calls to
the function, where the call is preceded by >>>.
A docstring can be any string, as long as it is the first thing in a function’s body. Docstrings are typically defined using triple quotation marks at the
start and end, which allows a string to span multiple lines. The first line is conventionally a complete but short description of the function, while
following lines provide further guidance to future users of the function.
Here is a definition of a function called percent that takes two arguments. The definition includes a docstring.
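The definition itself is not shown in this copy of the text. A minimal sketch consistent with the call and output below, and with the rounding to two decimal places described afterwards, is (the parameter names are assumptions):
def percent(x, total):
    """Convert x to a percentage of total, rounded to two decimal places.

    >>> percent(33, 200)
    16.5
    """
    return round((x / total) * 100, 2)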
# A function with more than one argument
percent(33, 200)
16.5
Contrast the function percent defined above with the function percents defined below. The latter takes an array as its argument, and converts all the
numbers in the array to percents out of the total of the values in the array. The percents are all rounded to two decimal places, this time replacing
round by np.round because the argument is an array and not a number.
def percents(counts):
"""Convert the values in array_x to percents out of the total of array_x."""
total = counts.sum()
return np.round((counts/total)*100, 2)
The function percents returns an array of percents that add up to 100 apart from rounding.
some_array = make_array(7, 10, 4)
percents(some_array)
It is helpful to understand the steps Python takes to execute a function. To facilitate this, we have put a function definition and a call to that function
in the same cell below.
def biggest_difference(array_x):
"""Find the biggest difference in absolute value between two adjacent elements of
array_x."""
diffs = np.diff(array_x)
absolute_diffs = abs(diffs)
return max(absolute_diffs)
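The call that accompanied this definition is not shown in this copy; it would have been of this form (the example array is hypothetical):
some_numbers = make_array(2, 4, 5, 6, 4, -1, 1)
biggest_difference(some_numbers)
The cells below also pass percents a second argument giving the number of decimal places for rounding. That only works if percents is redefined to take an optional argument with a default value; a sketch of such a redefinition (the parameter name decimal_places and its default of 2 are assumptions) is:
def percents(counts, decimal_places=2):
    """Convert the values in the array counts to percents out of the total of counts,
    rounded to decimal_places (2 by default)."""
    total = counts.sum()
    return np.round((counts / total) * 100, decimal_places)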
parts = make_array(2, 1, 4)
print("Rounded to 1 decimal place: ", percents(parts, 1))
print("Rounded to 2 decimal places:", percents(parts, 2))
print("Rounded to 3 decimal places:", percents(parts, 3))
parts = make_array(2, 1, 4)
print("Rounded to 1 decimal place:", percents(parts, 1))
print("Rounded to the default number of decimal places:", percents(parts))
Note: Methods
Functions are called by placing argument expressions in parentheses after the function name. Any function that is defined in isolation is called in this
way. You have also seen examples of methods, which are like functions but are called using dot notation, such as some_table.sort(some_label).
The functions that you define will always be called using the function name first, passing in all of the arguments.
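The next cells call a function cut_off_at_100 whose definition is not shown in this copy. A minimal sketch consistent with the behavior described below:
def cut_off_at_100(x):
    """The smaller of x and 100"""
    return min(x, 100)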
cut_off_at_100(17)
17
cut_off_at_100(117)
100
cut_off_at_100(100)
100
The function cut_off_at_100 simply returns its argument if the argument is less than or equal to 100. But if the argument is greater than 100, it
returns 100.
In our earlier examples using Census data, we saw that the variable AGE had a value 100 that meant “100 years old or older”. Cutting off ages at 100 in
this manner is exactly what cut_off_at_100 does.
To use this function on many ages at once, we will have to be able to refer to the function itself, without actually calling it. Analogously, we might
show a cake recipe to a chef and ask her to use it to bake 6 cakes. In that scenario, we are not using the recipe to bake any cakes ourselves; our role
is merely to refer the chef to the recipe. Similarly, we can ask a table to call cut_off_at_100 on 6 different numbers in a column.
First, we create the table ages with a column for people and one for their ages. For example, person C is 52 years old.
ages = Table().with_columns(
'Person', make_array('A', 'B', 'C', 'D', 'E', 'F'),
'Age', make_array(17, 117, 52, 100, 6, 101)
)
ages
Person Age
A 17
B 117
C 52
D 100
E 6
F 101
8.1.1. apply
To cut off each of the ages at 100, we will use a new Table method. The apply method calls a function on each element of a column, forming a new
array of return values. To indicate which function to call, just name it (without quotation marks or parentheses). The name of the column of input
values is a string that must still appear within quotation marks.
ages.apply(cut_off_at_100, 'Age')
What we have done here is apply the function cut_off_at_100 to each value in the Age column of the table ages. The output is the array of
corresponding return values of the function. For example, 17 stayed 17, 117 became 100, 52 stayed 52, and so on.
This array, which has the same length as the original Age column of the ages table, can be used as the values in a new column called Cut Off Age
alongside the existing Person and Age columns.
ages.with_column(
'Cut Off Age', ages.apply(cut_off_at_100, 'Age')
)
cut_off_at_100
<function __main__.cut_off_at_100(x)>
Notice that we did not write "cut_off_at_100" with quotes (which is just a piece of text), or cut_off_at_100() (which is a function call, and an
invalid one at that). We simply wrote cut_off_at_100 to refer to the function.
Just like we can define new names for other values, we can define new names for functions. For example, suppose we want to refer to our function as
cut_off instead of cut_off_at_100. We can just write this:
cut_off = cut_off_at_100
Now cut_off is a name for a function. It’s the same function as cut_off_at_100, so the printed value is exactly the same.
cut_off
<function __main__.cut_off_at_100(x)>
Now suppose the researchers encountered a new couple, similar to those in this dataset, and wondered how tall their child would be. What would be
a good way for him to go about predicting the child’s height, given that the parent average height was, say, 68 inches?
One reasonable approach would be to base the prediction on all the points that correspond to a parent average height of around 68 inches. The
prediction equals the average child’s height calculated from those points alone.
Let’s execute this plan. For now we will just make a reasonable definition of what “around 68 inches” means, and work with that. Later in the course
we will examine the consequences of such choices.
We will take “close” to mean “within half an inch”. The figure below shows all the points corresponding to a parent average height between 67.5
inches and 68.5 inches. These are all the points in the strip between the red lines. Each of these points corresponds to one child; our prediction of
the height of the new couple’s child is the average height of all the children in the strip. That’s represented by the gold dot.
Ignore the code, and just focus on understanding the mental process of arriving at that gold dot.
heights.scatter('Parent Average')
plots.plot([67.5, 67.5], [50, 85], color='red', lw=2)
plots.plot([68.5, 68.5], [50, 85], color='red', lw=2)
plots.scatter(68, 67.62, color='gold', s=40);
In order to calculate exactly where the gold dot should be, we first need to identify all the points in the strip. These correspond to the rows where
Parent Average is between 67.5 inches and 68.5 inches.
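A sketch of that calculation, assuming the children’s heights are in a column labeled 'Child':
close_to_68 = heights.where('Parent Average', are.between(67.5, 68.5))
np.average(close_to_68.column('Child'))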
67.62
We now have a way to predict the height of a child given any value of the parent average height near those in our dataset. We can define a function
predict_child that does this. The body of the function consists of the code in the two cells above, apart from choices of names.
def predict_child(p_avg):
"""Predict the height of a child whose parents have a parent average height of
p_avg.
The prediction is the average height of the children whose parent average height is
in the range p_avg plus or minus 0.5.
"""
Given a parent average height of 68 inches, the function predict_child returns the same prediction (67.62 inches) as we got earlier. The advantage
of defining the function is that we can easily change the value of the predictor and get a new prediction.
predict_child(68)
67.62
predict_child(66)
66.08640776699029
How good are these predictions? We can get a sense of this by comparing the predictions with the data that we already have. To do this, we first
apply the function predict_child to the column of Parent Average heights, and collect the results in a new column labeled Prediction.
# Apply predict_child to all the midparent heights
heights_with_predictions = heights.with_column(
'Prediction', heights.apply(predict_child, 'Parent Average')
)
heights_with_predictions
The graph of gold dots is called a graph of averages, because each gold dot is the center of a vertical strip like the one we drew earlier. Each one
provides a prediction of a child’s height given the parent average height. For example, the scatter shows that for a parent average height of 65
inches, the predicted height of the child would be just above 65 inches, and indeed predict_child(65) evaluates to about 65.84.
predict_child(65)
65.83829787234043
Notice that the graph of averages roughly follows a straight line. This straight line is now called the regression line and is one of the most common
methods of making predictions. The calculation that we have just done is very similar to the calculation that led to the development of the regression
method, using the same data.
This example, like the one about John Snow’s analysis of cholera deaths, shows how some of the fundamental concepts of modern data science have
roots going back a long way. The method used here was a precursor to nearest neighbor prediction methods that now have powerful applications in
diverse settings. The modern field of machine learning includes the automation of such methods to make predictions based on vast and rapidly
evolving datasets.
Flavor Price
strawberry 3.55
chocolate 4.75
chocolate 6.55
strawberry 5.25
chocolate 5.25
cones.group('Flavor')
Flavor count
chocolate 3
strawberry 2
There are two distinct categories, chocolate and strawberry. The call to group creates a table of counts in each category. The column is called count
by default, and contains the number of rows in each category.
Notice that this can all be worked out from just the Flavor column. The Price column has not been used.
But what if we wanted the total price of the cones of each different flavor? That’s where the second argument of group comes in.
sum(cones.where('Flavor', are.equal_to('chocolate')).column('Price'))
16.55
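The arrays cones_choc and cones_strawb used below were created in cells not shown here, presumably by collecting the prices of each flavor:
cones_choc = cones.where('Flavor', are.equal_to('chocolate')).column('Price')
cones_strawb = cones.where('Flavor', are.equal_to('strawberry')).column('Price')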
grouped_cones = Table().with_columns(
'Flavor', make_array('chocolate', 'strawberry'),
'Array of All the Prices', make_array(cones_choc, cones_strawb)
)
# Append a column with the sum of the `Price` values in each array
price_totals = grouped_cones.with_column(
'Sum of the Array', make_array(sum(cones_choc), sum(cones_strawb))
)
price_totals
POSITION count
C 69
PF 85
PG 85
SF 82
SG 96
3. What was the average salary of the players at each of the five positions?
This time, we have to group by POSITION and take the mean of the salaries. For clarity, we will work with a table of just the positions and the salaries.
positions_and_money = nba.select('POSITION', 'SALARY')
positions_and_money.group('POSITION', np.mean)
more_cones
Flavor count
bubblegum 1
chocolate 3
strawberry 2
But now each cone has a color as well. To classify the cones by both flavor and color, we will pass a list of labels as an argument to group. The
resulting table has one row for every unique combination of values that appear together in the grouped columns. As before, a single argument (a list,
in this case, but an array would work too) gives row counts.
Although there are six cones, there are only four unique combinations of flavor and color. Two of the cones were dark brown chocolate, and two pink
strawberry.
more_cones.group(['Flavor', 'Color'])
Flavor Color count
bubblegum pink 1
chocolate dark brown 2
chocolate light brown 1
strawberry pink 2
The group method takes a list of two labels because it is flexible: it could take one or three or more. On the other hand, pivot always takes two
column labels, one to determine the columns and one to determine the rows.
pivot
The pivot method is closely related to the group method: it groups together rows that share a combination of values. It differs from group because it
organizes the resulting values in a grid. The first argument to pivot is the label of a column that contains the values that will be used to form new
columns in the result. The second argument is the label of a column used for the rows. The result gives the count of all rows of the original table that
share the combination of column and row values.
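For example, pivoting more_cones on Flavor and Color with no other arguments produces a grid of counts; compare it with the grouped table above.
more_cones.pivot('Flavor', 'Color')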
Like group, pivot can be used with additional arguments to find characteristics of each paired category. An optional third argument called values
indicates a column of values that will replace the counts in each cell of the grid. These values will not all be displayed individually, however; the fourth
argument collect indicates how to collect them all into one aggregated value to be displayed in the cell.
An example will help clarify this. Here is pivot being used to find the total price of the cones in each cell.
more_cones.pivot('Flavor', 'Color', values='Price', collect=sum)
We now have the distribution of educational attainment among adult Californians. More than 30% have a Bachelor’s degree or higher, while almost
16% lack a high school diploma.
educ_distribution = educ_totals.with_column(
'Population Percent', percents(educ_totals.column(1))
)
educ_distribution
Personal Income       Bachelor's degree or higher   College, less than 4-yr degree   High school or equivalent   No high school diploma
A: 0 to 4,999         575491                        985011                           1161873                     1204529
B: 5,000 to 9,999     326020                        810641                           626499                      597039
C: 10,000 to 14,999   452449                        798596                           692661                      664607
D: 15,000 to 24,999   773684                        1345257                          1252377                     875498
E: 25,000 to 34,999   693884                        1091642                          929218                      464564
F: 35,000 to 49,999   1122791                       1112421                          782804                      260579
G: 50,000 to 74,999   1594681                       883826                           525517                      132516
H: 75,000 and over    2986698                       748103                           323192                      58945
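A cross‑classified table like the one above is produced by a single pivot call of roughly the following form; the table name educ_income and the column labels are assumptions, since the cell that produced it is not shown in this copy, but the name totals matches the code below.
totals = educ_income.pivot('Educational Attainment', 'Personal Income')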
Here you see the power of pivot over other cross‑classification methods. Each column of counts is a distribution of personal income at a specific
level of educational attainment. Converting the counts to percents allows us to compare the four distributions.
distributions = totals.select(0).with_columns(
"Bachelor's degree or higher", percents(totals.column(1)),
'College, less than 4-yr degree', percents(totals.column(2)),
'High school or equivalent', percents(totals.column(3)),
'No high school diploma', percents(totals.column(4))
)
distributions
Personal Income       Bachelor's degree or higher   College, less than 4-yr degree   High school or equivalent   No high school diploma
A: 0 to 4,999         6.75                          12.67                            18.46                       28.29
B: 5,000 to 9,999     3.82                          10.43                            9.95                        14.02
C: 10,000 to 14,999   5.31                          10.27                            11                          15.61
D: 15,000 to 24,999   9.07                          17.3                             19.9                        20.56
E: 25,000 to 34,999   8.14                          14.04                            14.76                       10.91
F: 35,000 to 49,999   13.17                         14.31                            12.44                       6.12
G: 50,000 to 74,999   18.7                          11.37                            8.35                        3.11
H: 75,000 and over    35.03                         9.62                             5.13                        1.38
At a glance, you can see that over 35% of those with Bachelor’s degrees or higher had incomes of \(\$75,000\) and over, whereas fewer than 10% of
the people in the other education categories had that level of income.
The bar chart below compares the personal income distributions of adult Californians who have no high school diploma with those who have
completed a Bachelor’s degree or higher. The difference in the distributions is striking. There is a clear positive association between educational
attainment and personal income.
distributions.select(0, 1, 4).barh(0)
Flavor Price
strawberry 3.55
vanilla 4.75
chocolate 6.55
strawberry 5.25
chocolate 5.75
ratings = Table().with_columns(
'Kind', make_array('strawberry', 'chocolate', 'vanilla'),
'Stars', make_array(2.5, 3.5, 4)
)
ratings
Kind Stars
strawberry 2.5
chocolate 3.5
vanilla 4
Each of the tables has a column that contains ice cream flavors: cones has the column Flavor, and ratings has the column Kind. The entries in these
columns can be used to link the two tables.
The method join creates a new table in which each cone in the cones table is augmented with the Stars information in the ratings table. For each
cone in cones, join finds a row in ratings whose Kind matches the cone’s Flavor. We have to tell join to use those columns for matching.
rated = cones.join('Flavor', ratings, 'Kind')
rated
The new table rated allows us to work out the price per star, which you can think of as an informal measure of value. Low values are good – they
mean that you are paying less for each rating star.
rated.with_column('$/Star', rated.column('Price') / rated.column('Stars')).sort(3)
Flavor Price Stars $/Star
vanilla 4.75 4 1.1875
strawberry 3.55 2.5 1.42
chocolate 5.75 3.5 1.64286
chocolate 6.55 3.5 1.87143
strawberry 5.25 2.5 2.1
Though strawberry has the lowest rating among the three flavors, the less expensive strawberry cone does well on this measure because it doesn’t
cost a lot per star.
Side note. Does the order in which we list the two tables matter? Let’s try it. As you can see, this changes the order in which the columns appear, and can
potentially change the order of the rows, but it doesn’t make any fundamental difference.
ratings.join('Kind', cones, 'Flavor')
The table reviews below contains star ratings given by different reviewers:
Flavor Stars
vanilla 5
chocolate 3
vanilla 5
chocolate 4
average_review = reviews.group('Flavor', np.average)
average_review
We can get more detail by specifying a larger number of bins. But the overall shape doesn’t change much.
commute.hist('Duration', bins=60, unit='Second')
The map is created using OpenStreetMap, which is an open online mapping system that you can use just as you would use Google Maps or any other
online map. Zoom in to San Francisco to see how the stations are distributed. Click on a marker to see which station it is.
You can also represent points on a map by colored circles. Here is such a map of the San Francisco bike stations.
sf = stations.where('landmark', are.equal_to('San Francisco'))
sf_map_data = sf.select('lat', 'long', 'name').relabel('name', 'labels')
Circle.map_table(sf_map_data, color='green')
city count
Mountain View 7
Palo Alto 5
Redwood City 7
San Francisco 35
San Jose 16
colors = cities.with_column('color', make_array('blue', 'red', 'green', 'orange',
'purple'))
colors
Now the markers have five different colors for the five different cities.
To see where most of the bike rentals originate, let’s identify the start stations:
starts = commute.group('Start Station').sort('count', descending=True)
starts
9. Randomness
In the previous chapters we developed skills needed to make insightful descriptions of data. Data scientists also have to be able to understand
randomness. For example, they have to be able to assign individuals to treatment and control groups at random, and then try to say whether any
observed differences in the outcomes of the two groups are simply due to the random assignment or genuinely due to the treatment.
In this chapter, we begin our analysis of randomness. To start off, we will use Python to make choices at random. In numpy there is a sub‑module
called random that contains many functions that involve random selection. One of these functions is called choice. It picks one item at random from
an array, and it is equally likely to pick any of the items. The function call is np.random.choice(array_name), where array_name is the name of the
array from which to make the choice.
Thus the following code evaluates to treatment with chance 50%, and control with chance 50%.
two_groups = make_array('treatment', 'control')
np.random.choice(two_groups)
'treatment'
The big difference between the code above and all the other code we have run thus far is that the code above doesn’t always return the same value.
It can return either treatment or control, and we don’t know ahead of time which one it will pick. We can repeat the process by providing a second
argument, the number of times to repeat the process.
np.random.choice(two_groups, 10)
A fundamental question about random events is whether or not they occur. For example:
Did an individual get assigned to the treatment group, or not?
Is a gambler going to win money, or not?
Has a poll made an accurate prediction, or not?
Once the event has occurred, you can answer “yes” or “no” to all these questions. In programming, it is conventional to do this by labeling statements
as True or False. For example, if an individual did get assigned to the treatment group, then the statement, “The individual was assigned to the
treatment group” would be True. If not, it would be False.
3 > 1 + 1
True
The value True indicates that the comparison is valid; Python has confirmed this simple fact about the relationship between 3 and 1+1. The full set of
common comparison operators are listed below.
Comparison              Operator   True example   False example
Less than               <          2 < 3          2 < 2
Greater than            >          3 > 2          3 > 3
Less than or equal      <=         2 <= 2         3 <= 2
Greater than or equal   >=         3 >= 3         2 >= 3
Equal                   ==         3 == 3         3 == 2
Not equal               !=         3 != 2         2 != 2
Notice the two equal signs == in the comparison to determine equality. This is necessary because Python already uses = to mean assignment to a
name, as we have seen. It can’t use the same symbol for a different purpose. Thus if you want to check whether 5 is equal to 10/2, then you have
to be careful: 5 = 10/2 returns an error message because Python assumes you are trying to assign the value of the expression 10/2 to a name that is
the numeral 5. Instead, you must use 5 == 10/2, which evaluates to True.
5 = 10/2
5 == 10/2
True
An expression can contain multiple comparisons, and they all must hold in order for the whole expression to be True. For example, we can express
that 1+1 is between 1 and 3 using the following expression.
1 < 1 + 1 < 3
True
The average of two numbers is always between the smaller number and the larger number. We express this relationship for the numbers x and y
below. You can try different values of x and y to confirm this relationship.
x = 12
y = 5
min(x, y) <= (x+y)/2 <= max(x, y)
True
Comparing Strings
Strings can also be compared, and their order is alphabetical. A shorter string is less than a longer string that begins with the shorter string.
'Dog' > 'Catastrophe' > 'Cat'
True
Let’s return to random selection. Recall the array two_groups which consists of just two elements, treatment and control. To see whether a
randomly assigned individual went to the treatment group, you can use a comparison:
np.random.choice(two_groups) == 'treatment'
True
As before, the random choice will not always be the same, so the result of the comparison won’t always be the same either. It will depend on whether
treatment or control was chosen. With any cell that involves random selection, it is a good idea to run the cell several times to get a sense of the
variability in the result.
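The cell below refers to an array tosses of simulated coin tosses that was created in cells not shown in this copy. A sketch of how such an array might be created:
coin = make_array('Heads', 'Tails')
tosses = np.random.choice(coin, 7)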
The numpy method count_nonzero evaluates to the number of non‑zero (that is, True) elements of the array.
np.count_nonzero(tosses == 'Heads')
def sign(x):
    if x > 0:
        return 'Positive'
sign(3)
'Positive'
This function returns the correct sign if the input is a positive number. But if the input is not a positive number, then the if expression evaluates to a
false value, and so the return statement is skipped and the function call has no value.
sign(-3)
So let us refine our function to return Negative if the input is a negative number. We can do this by adding an elif clause, where elif is Python’s
shorthand for the phrase “else, if”.
def sign(x):
if x > 0:
return 'Positive'
elif x < 0:
return 'Negative'
Now sign returns the correct answer when the input is ‑3:
sign(-3)
'Negative'
What if the input is 0? To deal with this case, we can add another elif clause:
def sign(x):
if x > 0:
return 'Positive'
elif x < 0:
return 'Negative'
elif x == 0:
return 'Neither positive nor negative'
sign(0)
Equivalently, we can replace the final elif clause by an else clause, whose body will be executed only if all the previous comparisons are false; that
is, if the input value is equal to 0.
def sign(x):
if x > 0:
return 'Positive'
elif x < 0:
return 'Negative'
else:
return 'Neither positive nor negative'
sign(0)
There is always exactly one if clause, but there can be any number of elif clauses. Python will evaluate the if and elif expressions in the headers
in order until one is found that is a true value, then execute the corresponding body. The else clause is optional. When an else header is provided, its
else body is executed only if none of the header expressions of the previous clauses are true. The else clause must always come at the end (or not at
all).
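The next cells use a function one_bet that was defined in a cell not shown in this copy. Consistent with the outputs below, the walkthrough of its clauses that follows, and the betting rules described in the next section, a sketch of the definition is:
def one_bet(x):
    """Returns my net gain if the die shows x spots."""
    if x <= 2:
        return -1
    elif x <= 4:
        return 0
    elif x <= 6:
        return 1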
Let’s check that the function does the right thing for each different number of spots.
one_bet(1), one_bet(2), one_bet(3), one_bet(4), one_bet(5), one_bet(6)
(-1, -1, 0, 0, 1, 1)
As a review of how conditional statements work, let’s see what one_bet does when the input is 3.
First it evaluates the if expression, which is 3 <= 2 which is False. So one_bet doesn’t execute the if body.
Then it evaluates the first elif expression, which is 3 <= 4, which is True. So one_bet executes the first elif body and returns 0.
Once the body has been executed, the process is complete. The next elif expression is not evaluated.
If for some reason we use an input greater than 6, then the if expression evaluates to False as do both of the elif expressions. So one_bet does not
execute the if body nor the two elif bodies, and there is no value when you make the call below.
one_bet(17)
To play the game based on one roll of a die, you can use np.random.choice to generate the number of spots and then use that as the argument to
one_bet. Run the cell a few times to see how the output changes.
one_bet(np.random.choice(np.arange(1, 7)))
-1
At this point it is natural to want to collect the results of all the bets so that we can analyze them. In the next section we develop a way to do this
without running the cell over and over again.
9.2. Iteration
It is often the case in programming – especially when dealing with randomness – that we want to repeat a process multiple times. For example, recall
the game of betting on one roll of a die with the following rules:
If the die shows 1 or 2 spots, my net gain is ‑1 dollar.
If the die shows 3 or 4 spots, my net gain is 0 dollars.
If the die shows 5 or 6 spots, my net gain is 1 dollar.
The function bet_on_one_roll takes no argument. Each time it is called, it simulates one roll of a fair die and returns the net gain in dollars.
def bet_on_one_roll():
"""Returns my net gain on one bet"""
x = np.random.choice(np.arange(1, 7))  # roll a die once and record the number of spots
if x <= 2:
return -1
elif x <= 4:
return 0
elif x <= 6:
return 1
Playing this game once is easy:
bet_on_one_roll()
To get a sense of how variable the results are, we have to play the game over and over again. We could run the cell repeatedly, but that’s tedious, and
if we wanted to do it a thousand times or a million times, forget it.
A more automated solution is to use a for statement to loop over the contents of a sequence. This is called iteration. A for statement begins with the
word for, followed by a name we want to give each item in the sequence, followed by the word in, and ending with an expression that evaluates to a
sequence. The indented body of the for statement is executed once for each item in that sequence.
for animal in make_array('cat', 'dog', 'rabbit'):
print(animal)
cat
dog
rabbit
It is helpful to write code that exactly replicates a for statement, without using the for statement. This is called unrolling the loop.
A for statement simply replicates the code inside it, but before each iteration, it assigns a new value from the given sequence to the name we chose.
For example, here is an unrolled version of the loop above.
animal = make_array('cat', 'dog', 'rabbit').item(0)
print(animal)
animal = make_array('cat', 'dog', 'rabbit').item(1)
print(animal)
animal = make_array('cat', 'dog', 'rabbit').item(2)
print(animal)
cat
dog
rabbit
Notice that the name animal is arbitrary, just like any name we assign with =.
Here we use a for statement in a more realistic way: we print the results of betting five times on the die as described earlier. This is called simulating
the results of five bets. We use the word simulating to remind ourselves that we are not physically rolling dice and exchanging money but using
Python to mimic the process.
To repeat a process n times, it is common to use the sequence np.arange(n) in the for statement. It is also common to use a very short name for
each item. In our code we will use the name i to remind ourselves that it refers to an item.
for i in np.arange(5):
print(bet_on_one_roll())
1
-1
-1
1
1
In this case, we simply perform exactly the same (random) action several times, so the code in the body of our for statement does not actually refer
to i.
But often while using for loops it will be convenient to mutate an array – that is, change it – when augmenting it. This is done by assigning the
augmented array to the same name as the original.
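The array pets appended to below was created in an earlier cell that is not shown here; it might, for example, have been:
pets = make_array('Cat', 'Dog')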
pets = np.append(pets, 'Another Pet')
pets
outcomes = make_array()

for i in np.arange(5):
outcome_of_bet = bet_on_one_roll()
outcomes = np.append(outcomes, outcome_of_bet)
outcomes
outcomes = make_array()

i = np.arange(5).item(0)
outcome_of_bet = bet_on_one_roll()
outcomes = np.append(outcomes, outcome_of_bet)
i = np.arange(5).item(1)
outcome_of_bet = bet_on_one_roll()
outcomes = np.append(outcomes, outcome_of_bet)
i = np.arange(5).item(2)
outcome_of_bet = bet_on_one_roll()
outcomes = np.append(outcomes, outcome_of_bet)
i = np.arange(5).item(3)
outcome_of_bet = bet_on_one_roll()
outcomes = np.append(outcomes, outcome_of_bet)
i = np.arange(5).item(4)
outcome_of_bet = bet_on_one_roll()
outcomes = np.append(outcomes, outcome_of_bet)
outcomes
np.count_nonzero(outcomes)
outcomes = make_array()

for i in np.arange(300):
    outcome_of_bet = bet_on_one_roll()
    outcomes = np.append(outcomes, outcome_of_bet)

len(outcomes)
300
To see how often the three different possible results appeared, we can use the array outcomes and Table methods.
outcome_table = Table().with_column('Outcome', outcomes)
outcome_table.group('Outcome').barh(0)
Not surprisingly, each of the three outcomes ‑1, 0, and 1 appeared about 100 times out of the 300, give or take. We will examine the “give or take”
amounts more closely in later chapters.
9.3. Simulation
Simulation is the process of using a computer to mimic a physical experiment. In this class, those experiments will almost invariably involve chance.
We have seen how to simulate the results of tosses of a coin. The steps in that simulation were examples of the steps that will constitute every
simulation we do in this course. In this section we will set out those steps and follow them in examples.
In our earlier example we used np.random.choice and a for loop to generate multiple tosses. But sets of coin tosses are needed so often in data
science that np.random.choice simulates them for us if we include a second argument that is the number of times to toss.
Here are the results of 10 tosses.
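The array coin was defined in a cell not shown in this copy, presumably as:
coin = make_array('Heads', 'Tails')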
ten_tosses = np.random.choice(coin, 10)
ten_tosses
array(['Heads', 'Tails', 'Heads', 'Tails', 'Tails', 'Heads', 'Tails',
'Tails', 'Tails', 'Heads'], dtype='<U5')
Our goal is to simulate the number of heads in 100 tosses, not 10. To do that we can just repeat the same code, replacing 10 by 100.
outcomes = np.random.choice(coin, 100)
num_heads = np.count_nonzero(outcomes == 'Heads')
num_heads
46
Since we will want to do this multiple times, let’s define a function that returns the simulated value of the number of heads. We can do this using the
code developed in the cell above.
def one_simulated_value():
outcomes = np.random.choice(coin, 100)
return np.count_nonzero(outcomes == 'Heads')
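The array heads of 20,000 simulated values is built in a cell that is not shown here. A sketch following the iteration pattern used earlier in this chapter:
num_repetitions = 20000

heads = make_array()
for i in np.arange(num_repetitions):
    new_value = one_simulated_value()
    heads = np.append(heads, new_value)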
Check that the array heads contains 20,000 entries, one for each repetition of the experiment.
len(heads)
20000
To get a sense of the variability in the number of heads in 100 tosses, we can collect the results in a table and draw a histogram.
simulation_results = Table().with_columns(
'Repetition', np.arange(1, num_repetitions + 1),
'Number of Heads', heads
)
simulation_results.show(3)
Repetition Number of Heads
1 44
2 54
3 44
... (19997 rows omitted)
simulation_results.hist('Number of Heads', bins = np.arange(30.5, 69.6, 1))
Each bin has width 1 and is centered at each value of the number of heads.
Not surprisingly, the histogram looks roughly symmetric around 50 heads. The height of the bar at 50 is about 8% per unit. Since each bin is 1 unit
wide, this is the same as saying that about 8% of the repetitions produced exactly 50 heads. That’s not a huge percent, but it’s the largest compared
to the percent at every other number of heads.
The histogram also shows that in almost all of the repetitions, the number of heads in 100 tosses was somewhere between 35 and 65. Indeed, the
bulk of the repetitions produced numbers of heads in the range 45 to 55.
While in theory it is possible that the number of heads can be anywhere between 0 and 100, the simulation shows that the range of probable values is
much smaller.
This is an instance of a more general phenomenon about the variability in coin tossing, as we will see later in the course.
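The next example simulates moves in Monopoly, which are sums of two rolls of a die. It refers to an array die and an expression that appear in cells not shown in this copy; presumably they were along these lines:
die = np.arange(1, 7)
sum(np.random.choice(die, 2))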
We can use the array die and the expression above to define a function that simulates one move in Monopoly.
def one_simulated_move():
return sum(np.random.choice(die, 2))
Now we can create an array of 10000 simulated Monopoly moves, by starting with an empty collection array and augmenting it by each new
simulated move.
num_repetitions = 10000
moves = make_array()
for i in np.arange(num_repetitions):
new_move = one_simulated_move()
moves = np.append(moves, new_move)
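The distribution described below can be drawn by collecting the moves in a table and plotting a histogram with bins centered on the integers 2 through 12. A sketch:
results = Table().with_column('Sum of Two Rolls', moves)
results.hist(bins=np.arange(1.5, 12.6, 1))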
Seven is the most common value, with the frequencies falling off symmetrically on either side.
There are two doors left, one of which was the contestant’s original choice. One of the doors has the car behind it, and the other one has a goat.
The contestant now gets to choose which of the two doors to open.
The contestant has a decision to make. Which door should she choose to open, if she wants the car? Should she stick with her initial choice, or
switch to the other door? That is the Monty Hall problem.
9.4.2. Simulation
The simulation will be more complex than those we have done so far. Let’s break it down.
Step 1: What to Simulate
For each play we will simulate what’s behind all three doors:
the one the contestant first picks
the one that Monty opens
the remaining door
So we will be keeping track of three quantities, not just one.
Step 2: Simulating One Play
As is often the case in simulating a game, the bulk of the work consists of simulating one play of the game. This involves several pieces.
The goats: We start by setting up an array goats that contains unimaginative names for the two goats.
goats = make_array('first goat', 'second goat')
To help Monty conduct the game, we are going to have to identify which goat is selected and which one is revealed behind the open door. The
function other_goat takes one goat and returns the other.
def other_goat(x):
if x == 'first goat':
return 'second goat'
elif x == 'second goat':
return 'first goat'
The string watermelon is not the name of one of the goats, so when watermelon is the input then other_goat does nothing.
The options: The array hidden_behind_doors contains the three things that are behind the doors.
hidden_behind_doors = np.append(goats, 'car')
hidden_behind_doors
We are now ready to simulate one play. To do this, we will define a function monty_hall_game that takes no arguments. When the function is called, it
plays Monty’s game once and returns a list consisting of:
the contestant’s guess
what Monty reveals when he opens a door
what remains behind the other door
The game starts with the contestant choosing one door at random. In doing so, the contestant makes a random choice from among the first goat, the
second goat, and the car.
If the contestant happens to pick one of the goats, then the other goat is revealed and the car is behind the remaining door.
If the contestant happens to pick the car, then Monty reveals one of the goats and the other goat is behind the remaining door.
def monty_hall_game():
"""Return
[contestant's guess, what Monty reveals, what remains behind the other door]"""
contestant_guess = np.random.choice(hidden_behind_doors)
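    # The branches for the two goat cases are missing from this copy. Consistent with
    # the description above: if the contestant picks a goat, Monty reveals the other
    # goat and the car is behind the remaining door.
    if contestant_guess == 'first goat':
        return [contestant_guess, 'second goat', 'car']
    if contestant_guess == 'second goat':
        return [contestant_guess, 'first goat', 'car']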
if contestant_guess == 'car':
revealed = np.random.choice(goats)
return [contestant_guess, revealed, other_goat(revealed)]
Let’s play! Run the cell several times and see how the results change.
monty_hall_game()
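The loop below collects the result of each play as a row of a table games. The cell that creates the empty table is not shown here; it was presumably of this form:
games = Table(['Guess', 'Revealed', 'Remaining'])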
for i in np.arange(10000):
games.append(monty_hall_game())
The simulation is done. Notice how short the code is. The majority of the work was done in simulating the outcome of one game.
games.show(3)
Guess Revealed Remaining
first goat second goat car
first goat second goat car
car first goat second goat
... (9997 rows omitted)
9.4.3. Visualization
To see whether the contestant should stick with her original choice or switch, let’s see how frequently the car is behind each of her two options.
It is no surprise that the three doors appear about equally often as the contestant’s original guess.
original_choice = games.group('Guess')
original_choice
Guess count
car 3319
first goat 3311
second goat 3370
Once Monty has eliminated a goat, how often is the car behind the remaining door?
remaining_door = games.group('Remaining')
remaining_door
Remaining count
car 6681
first goat 1676
second goat 1643
As our earlier solution said, the car is behind the remaining door two‑thirds of the time, to a pretty good approximation. The contestant is twice as
likely to get the car if she switches than if she sticks with her original choice.
To see this graphically, we can join the two tables above and draw overlaid bar charts.
joined = original_choice.join('Guess', remaining_door, 'Remaining')
combined = joined.relabeled(0, 'Item').relabeled(1, 'Original Door').relabeled(2,
'Remaining Door')
combined
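The overlaid bar chart can then be drawn from combined; a sketch:
combined.barh(0)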
Deterministic Samples
When you simply specify which elements of a set you want to choose, without any chances involved, you create a deterministic sample.
You have done this many times, for example by using take:
top.take(make_array(3, 18, 100))
Row Index Title Studio Gross Gross (Adjusted) Year
3 E.T.: The Extra‑Terrestrial Universal 435,110,554 1,261,085,000 1982
18 The Lion King Buena Vista 422,783,777 792,511,700 1994
100 The Hunger Games Lionsgate 408,010,692 452,174,400 2012
You have also used where:
top.where('Title', are.containing('Harry Potter'))
Probability Samples
For describing random samples, some terminology will be helpful.
A population is the set of all elements from whom a sample will be drawn.
A probability sample is one for which it is possible to calculate, before the sample is drawn, the chance with which any subset of elements will enter
the sample.
In a probability sample, all elements need not have the same chance of being chosen.
For example, suppose a sample of two people is drawn from a population of three people A, B, and C as follows: Person A is always included, and one
of B and C is chosen by the toss of a coin. This is a probability sample. Person A has a higher chance of being selected than Persons B or C; indeed,
Person A is certain to be selected. Since these differences are known and quantified, they can be taken into account when working with the sample.
A Systematic Sample
Imagine all the elements of the population listed in a sequence. One method of sampling starts by choosing a random position early in the list, and
then evenly spaced positions after that. The sample consists of the elements in those positions. Such a sample is called a systematic sample.
Here we will choose a systematic sample of the rows of top. We will start by picking one of the first 10 rows at random, and then we will pick every
10th row after that.
"""Choose a random start among rows 0 through 9;
then take every 10th row."""
start = np.random.choice(np.arange(10))
top.take(np.arange(start, top.num_rows, 10))
Convenience Samples
Drawing a random sample requires care and precision. It is not haphazard even though that is a colloquial meaning of the word "random". If you stand
at a street corner and take as your sample the first ten people who pass by, you might think you're sampling at random because you didn't choose
who walked by. But it's not a random sample – it's a *sample of convenience*. You didn't know ahead of time the probability of each person entering
the sample; perhaps you hadn't even specified exactly who was in the population.
Face
1
2
3
4
5
6
Variables whose successive values are separated by the same fixed amount, such as the values on rolls of a die (successive values separated by 1),
fall into a class of variables that are called discrete. The histogram above is called a discrete histogram. Its bins are specified by the array die_bins
and ensure that each bar is centered over the corresponding integer value.
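The table die and the array die_bins used here are not defined above; a minimal version consistent with the description (a column of the six faces, and bins of width 1 centered on the integers) is:
die = Table().with_column('Face', np.arange(1, 7, 1))

# Bins of width 1, centered on the integers 1 through 6
die_bins = np.arange(0.5, 6.6, 1)

die.hist(bins = die_bins)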
It is important to remember that the die can’t show 1.3 spots, or 5.2 spots – it always shows an integer number of spots. But our visualization spreads
the probability of each value over the area of a bar. While this might seem a bit arbitrary at this stage of the course, it will become important later
when we overlay smooth curves over discrete histograms.
Before going further, let’s make sure that the numbers on the axes make sense. The probability of each face is 1/6, which is 16.67% when rounded to
two decimal places. The width of each bin is 1 unit. So the height of each bar is 16.67% per unit. This agrees with the horizontal and vertical scales of
the graph.
Face
2
4
5
5
1
6
1
4
6
5
We can use the same method to simulate as many rolls as we like, and then draw empirical histograms of the results. Because we are going to do this
repeatedly, we define a function empirical_hist_die that takes the sample size as its argument, rolls a die as many times as the argument, and then
draws a histogram of the observed results.
def empirical_hist_die(n):
die.sample(n).hist(bins = die_bins)
empirical_hist_die(1000)
As we increase the number of rolls in the simulation, the area of each bar gets closer to 16.67%, which is the area of each bar in the probability
histogram.
united.column('Delay').min()
-16
united.column('Delay').max()
580
For the purposes of this section, it is enough to zoom in on the bulk of the data and ignore the 0.8% of flights that had delays of more than 200
minutes. This restriction is just for visual convenience; the table still retains all the data.
united.where('Delay', are.above(200)).num_rows/united.num_rows
0.008390596745027125
The height of the [0, 10) bar is just under 3% per minute, which means that just under 30% of the flights had delays between 0 and 10 minutes. That
is confirmed by counting rows:
united.where('Delay', are.between(0, 10)).num_rows/united.num_rows
0.2935985533453888
As we saw with the dice, as the sample size increases, the empirical histogram of the sample more closely resembles the histogram of the population.
Compare these histograms to the population histogram above.
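The calls below use a helper empirical_hist_delay that is not defined above. A minimal sketch, in which the bins delay_bins = np.arange(-20, 201, 10) are an illustrative choice covering the bulk of the delays, is:
delay_bins = np.arange(-20, 201, 10)

def empirical_hist_delay(n):
    # Empirical histogram of the delays in a random sample of n flights
    united.sample(n).select('Delay').hist(bins = delay_bins)
    plots.title('Sample of size ' + str(n))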
empirical_hist_delay(10)
empirical_hist_delay(100)
The most consistently visible discrepancies are among the values that are rare in the population. In our example, those values are in the right-hand
tail of the distribution. But as the sample size increases, even those values begin to appear in the sample in roughly the correct proportions.
empirical_hist_delay(1000)
The two histograms clearly resemble each other, though they are not identical.
10.3.1. Parameter
Frequently, we are interested in numerical quantities associated with a population.
In a population of voters, what percent will vote for Candidate A?
In a population of Facebook users, what is the largest number of Facebook friends that the users have?
In a population of United flights, what is the median departure delay?
Numerical quantities associated with a population are called parameters. For the population of flights in united, we know the value of the parameter
“median delay”:
np.median(united.column('Delay'))
2.0
The NumPy function median returns the median (half‑way point) of an array. Among all the flights in united, the median delay was 2 minutes. That is,
about 50% of flights in the population had delays of 2 or fewer minutes:
united.where('Delay', are.below_or_equal_to(2)).num_rows / united.num_rows
0.5018444846292948
Half of all flights left no more than 2 minutes after their scheduled departure time. That’s a very short delay!
Note. The percent isn’t exactly 50 because of “ties,” that is, flights that had delays of exactly 2 minutes. There were 480 such flights. Ties are quite
common in data sets, and we will not worry about them in this course.
united.where('Delay', are.equal_to(2)).num_rows
480
10.3.2. Statistic
In many situations, we will be interested in figuring out the value of an unknown parameter. For this, we will rely on data from a large random sample
drawn from the population.
A statistic (note the singular!) is any number computed using the data in a sample. The sample median, therefore, is a statistic.
Remember that sample_1000 contains a random sample of 1000 flights from united. The observed value of the sample median is:
np.median(sample_1000.column('Delay'))
2.0
Our sample – one set of 1,000 flights – gave us one observed value of the statistic. This raises an important problem of inference:
The statistic could have been different. A fundamental consideration in using any statistic based on a random sample is that the sample could
have come out differently, and therefore the statistic could have come out differently too.
np.median(united.sample(1000).column('Delay'))
3.0
Run the cell above a few times to see how the answer varies. Often it is equal to 2, the same value as the population parameter. But sometimes it is
different.
Just how different could the statistic have been? One way to answer this is to simulate the statistic many times and note the values. A histogram
of those values will tell us about the distribution of the statistic.
Let’s recall the main steps in a simulation.
Step 1: Decide which statistic to simulate. We have already decided: the median of a random sample of 1000 flight delays.
Step 2: Define a function that returns one simulated value of the statistic (see the sketch after this list).
Step 3: Decide how many simulated values to generate. Let’s do 5,000 repetitions.
Step 4: Use a for loop to generate an array of simulated values. As usual, we will start by creating an empty array in which to collect our results.
We will then set up a for loop for generating all the simulated values. The body of the loop will consist of generating one simulated value of the
sample median, and appending it to our collection array.
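The loop below calls the Step 2 function, random_sample_median, whose definition does not appear above. A minimal version consistent with the description, the median delay of a random sample of 1000 flights drawn from united, is:
def random_sample_median():
    return np.median(united.sample(1000).column('Delay'))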
The simulation takes a noticeable amount of time to run. That is because it is performing 5000 repetitions of the process of drawing a sample of size
1000 and computing its median. That’s a lot of sampling and repeating!
medians = make_array()
for i in np.arange(5000):
medians = np.append(medians, random_sample_median())
The simulation is done. All 5,000 simulated sample medians have been collected in the array medians. Now it’s time to visualize the results.
10.3.4. Visualization
Here are the simulated random sample medians displayed in the table simulated_medians.
simulated_medians = Table().with_column('Sample Median', medians)
simulated_medians
Sample Median
2
3
1
3
2
2.5
3
3
3
2
... (4990 rows omitted)
We can also visualize the simulated data using a histogram. The histogram is called an empirical histogram of the statistic. It displays the empirical
distribution of the statistic. Remember that empirical means observed.
simulated_medians.hist(bins=np.arange(0.5, 5, 1))
You can see that the sample median is very likely to be about 2, which was the value of the population median. Since samples of 1000 flight delays
are likely to resemble the population of delays, it is not surprising that the median delays of those samples should be close to the median delay in the
population.
This is an example of how a statistic can provide a good estimate of a parameter.
Face
1
2
3
4
5
6
Run the cell below to simulate 7 rolls of a die.
die.sample(7)
Face
5
3
3
5
5
1
6
Sometimes it is more natural to sample individuals at random without replacement. This is called a simple random sample. The argument
with_replacement=False allows you to do this.
array([1, 2, 3, 4, 5, 6])
array([4, 1, 6, 3, 5, 4, 6])
The argument replace=False allows you to get a simple random sample, that is, a sample drawn at random without replacement.
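As a quick illustration (the array faces is introduced here just for this example), drawing with and without replacement looks like this:
faces = np.arange(1, 7)

# Seven draws at random with replacement; faces can repeat
np.random.choice(faces, 7)

# Six draws at random without replacement; each face appears exactly once
np.random.choice(faces, 6, replace=False)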
# Array of actor names
actor_names = actors.column('Actor')
Just as sample did, so also np.random.choice gives you the entire sequence of sampled elements. You can use array operations to answer many
questions about the sample. For example, you can find which actor was the second one to be drawn, or the number of faces of the die that appeared
more than once. Some answers might need multiple lines of code.
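For example, the following sketch answers both questions; the sample sizes and the array names actor_sample and rolls are chosen here just for illustration.
# Which actor was the second one to be drawn, in a sample of five names?
actor_sample = np.random.choice(actor_names, 5)
actor_sample.item(1)

# How many faces of the die appeared more than once in 7 rolls?
rolls = np.random.choice(np.arange(1, 7), 7)
count = 0
for face in np.arange(1, 7):
    if np.count_nonzero(rolls == face) > 1:
        count = count + 1
count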
sample_size = 300
# Distribution of sample
sample_distribution = sample_proportions(sample_size, species_proportions)
sample_distribution
1.0
The categories in species_proportions are in the order Red, Pink, White. That order is preserved by sample_proportions. If you just want the
proportion of pink‑flowering plants in the sample, you can use item:
# Sample proportion of pink-flowering plants
sample_distribution.item(1)
0.5033333333333333
You can use sample_proportions and array operations to answer questions based only on the proportions of sampled individuals in the different
categories. You will not be able to answer questions that require more detailed information about the sample, such as which of the sampled plants
had each of the different colors.
The categories in the output array of sample_proportions are in the same order as in the input array. So the proportion of Black panelists in the
random sample is item(0) of the output array. Run the cell below a few times to see how the sample proportion of Black jurors varies in a randomly
selected panel. Do you see any values as low as 0.08?
sample_proportions(sample_size, eligible_population).item(0)
0.27
The count in each category is the sample size times the corresponding proportion. So we can just as easily simulate counts instead of proportions.
Let’s define a function that does this. The function will draw a panel at random and return the number of panelists who are Black.
def one_simulated_count():
return sample_size * sample_proportions(sample_size, eligible_population).item(0)
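The simulated counts described below were collected with a loop like the one in this sketch; the number of repetitions (10,000) is an assumption, and sample_size and eligible_population are assumed to have been set for the panel example (a panel of 100 drawn from a population that is 26% Black).
counts = make_array()

for i in np.arange(10000):
    counts = np.append(counts, one_simulated_count())

Table().with_column('Count in a Random Sample', counts).hist()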
The histogram shows us what the model of random selection predicts about our statistic, the number of Black panelists in the sample.
To generate each simulated count, we drew 100 times at random from a population in which 26% were Black. So, as you would expect, most of the
simulated counts are around 26. They are not exactly 26: there is some variation. The counts range from about 15 to about 40.
jury
The bar chart shows that the distribution of the random sample resembles the eligible population but the distribution of the panels does not.
To assess whether this observation is particular to one random sample or more general, we can simulate multiple panels under the model of random
selection and see what the simulations predict. But we won’t be able to look at thousands of bar charts like the one above. We need a statistic that
will help us assess whether or not the model of random selection is supported by the data.
For this we will compute a quantity called the total variation distance between two distributions. The calculation is an extension of how we find the
distance between two numbers.
To compute the total variation distance, we first find the difference between the two proportions in each category.
# Augment the table with a column of differences between proportions
jury_with_diffs = jury.with_column(
'Difference', jury.column('Panels') - jury.column('Eligible')
)
jury_with_diffs
Next, we take the absolute value of each difference, since the total variation distance uses the sizes of the differences regardless of sign.
jury_with_diffs = jury_with_diffs.with_column(
    'Absolute Difference', np.abs(jury_with_diffs.column('Difference'))
)
jury_with_diffs
Ethnicity Eligible Panels Difference Absolute Difference
Asian/PI 0.15 0.26 0.11 0.11
Black/AA 0.18 0.08 ‑0.1 0.1
Caucasian 0.54 0.54 0 0
Hispanic 0.12 0.08 ‑0.04 0.04
Other 0.01 0.04 0.03 0.03
jury_with_diffs.column('Absolute Difference').sum() / 2
0.14
This quantity 0.14 is the total variation distance (TVD) between the distribution of ethnicities in the eligible juror population and the distribution in the
panels.
In general, the total variation distance between two distributions measures how close the distributions are. The larger the TVD, the more different the
two distributions appear.
Technical Note: We could have obtained the same result by just adding the positive differences. But our method of including all the absolute
differences eliminates the need to keep track of which differences are positive and which are not.
We will use the total variation distance between distributions as the statistic to simulate under the assumption of random selection. Large values of
the distance will be evidence against random selection.
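The function total_variation_distance used below is not defined above; based on the calculation we just did by hand, a minimal version is:
def total_variation_distance(distribution_1, distribution_2):
    # Half the sum of the absolute differences between the two distributions
    return sum(np.abs(distribution_1 - distribution_2)) / 2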
This function will help us calculate our statistic in each repetition of the simulation. But first let’s check that it gives the right answer when we use it to
compute the distance between the blue (eligible) and gold (panels) distributions above. These are the distributions in the ACLU study.
total_variation_distance(jury.column('Panels'), jury.column('Eligible'))
0.14
This agrees with the value that we computed directly without using the function.
In the cell below we use the function to compute the TVD between the distributions of the eligible jurors and one random sample. Recall that
eligible_population is the array containing the distribution of the eligible jurors, and that our sample size is 1453.
In the first line, we use sample_proportions to generate a random sample from the eligible population. In the next line we use
total_variation_distance to compute the TVD between the distributions in the random sample and the eligible population.
sample_distribution = sample_proportions(1453, eligible_population)
total_variation_distance(sample_distribution, eligible_population)
0.018265657260839632
Run the cell a few times and notice that the distances are quite a bit smaller than 0.14, the distance between the distribution of the panels and the
eligible jurors.
We are now ready to run a simulation to assess the model of random selection.
def one_simulated_tvd():
sample_distribution = sample_proportions(1453, eligible_population)
return total_variation_distance(sample_distribution, eligible_population)
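To see what the model predicts, we collect many simulated values of the TVD and draw their empirical histogram; the sketch below assumes 5,000 repetitions and stores the results in an array called tvds.
tvds = make_array()

repetitions = 5000
for i in np.arange(repetitions):
    tvds = np.append(tvds, one_simulated_tvd())

Table().with_column('TVD', tvds).hist()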
The simulation shows that the composition of the panels in the ACLU study is not consistent with the model of random selection. Our analysis
supports the ACLU’s conclusion that the panels were not representative of the distribution provided for the eligible jurors.
11.2.9. Conclusion
Because of the discussion above, it is important for us to be precise about what we can conclude from our analysis.
We can conclude that the distribution provided for the panelists who reported for service does not look like a random sample from the
estimated distribution in the eligible population.
Our discussion, like the discussion in the ACLU report, sets out reasons for some of the differences observed between the two distributions and for
why summoned panelists might not report. Almost all the reasons have their roots in historical racial bias in society, and are examples of the lasting
negative consequences of that bias.
The observed value of the test statistic, the distance between the percent of purple-flowering plants in Mendel's sample and 75%, is about 0.89:
0.8880516684607045
11.3.4. Step 3: The Distribution of the Test Statistic, Under the Null Hypothesis
The main computational aspect of a test of hypotheses is figuring out what the model in the null hypothesis predicts. Specifically, we have to figure
out what the values of the test statistic might be if the null hypothesis were true.
The test statistic is simulated based on the assumptions of the model in the null hypothesis. That model involves chance, so the statistic comes out
differently when you simulate it multiple times.
By simulating the statistic repeatedly, we get a good sense of its possible values and which ones are more likely than others. In other words, we get a
good approximation to the probability distribution of the statistic, as predicted by the model in the null hypothesis.
As with all distributions, it is very useful to visualize this distribution by a histogram, as we have done in our previous examples. Let’s go through the
entire process here.
We will start by assigning some known quantities to names.
mendel_proportions = make_array(0.75, 0.25)
mendel_proportion_purple = mendel_proportions.item(0)
sample_size = 929
Next, we will define a function that returns one simulated value of the test statistic. Then we will use a for loop to collect 10,000 simulated values in
an array.
def one_simulated_distance():
sample_proportion_purple = sample_proportions(929, mendel_proportions).item(0)
return 100 * abs(sample_proportion_purple - mendel_proportion_purple)
repetitions = 10000
distances = make_array()
for i in np.arange(repetitions):
distances = np.append(distances, one_simulated_distance())
Now we can draw the histogram of these values. This is the histogram of the distribution of the test statistic predicted by the null hypothesis.
Table().with_column(
'Distance between Sample % and 75%', distances
).hist()
plots.title('Prediction Made by the Null Hypothesis');
Look on the horizontal axis to see the typical values of the distance, as predicted by the model. They are rather small. For example, a high proportion
of the distances are in the range 0 to 1, meaning that for a high proportion of the samples, the percent of purple‑flowering plants is in the range
75% \(\pm\) 1%. That is, the sample percent is in the range 74% to 76%.
Also note that this prediction was made using Mendel’s model only, not the proportions observed by Mendel in the plants that he grew. It is time now
to compare the predictions and Mendel’s observation.
The observed statistic is like a typical distance predicted by the null hypothesis. The null hypothesis is Mendel’s model. So our test concludes that
the data are consistent with Mendel’s model.
Based on our data, Mendel’s model looks good.
0.0243
About 2.4% of the distances simulated under Mendel’s model were 3.2 or greater. By the law of averages, we can conclude that if Mendel’s model
were correct for these new plants, then there is about a 2.4% chance that the test statistic would be 3.2 or more.
That doesn’t seem like a big chance. If Mendel’s model is true for these plants, something quite unlikely has happened. This idea gives rise to the
conventions about statistical significance, such as the 5% cutoff used below.
The area to the right of 45, colored gold, is just under 5%.
np.count_nonzero(statistics >= 45) / repetitions
0.04654
Large values of the test statistic favor the alternative. So if you wanted to use a 5% cutoff for the p‑value, your decision rule would be to conclude
that the coin is unfair if the test statistic comes out to be 45 or more.
However, as the figure shows, a fair coin can produce test statistics with values 45 or more. In fact it does so with chance approximately 5%.
Summary: If the coin is fair and our test uses a 5% cutoff for deciding whether it is fair or not, then there is about a 5% chance that the test will
wrongly conclude that the coin is unfair.
smoking_and_birthweight.group('Maternal Smoker')
The mean birth weight of the babies of smokers is about 9.27 ounces less than that of the babies of non-smokers:
-9.266142572024918
We are going to compute such differences repeatedly in our simulations below, so we will define a function to do the job. The function takes two
arguments:
the name of the table of data
the label of the column that contains the Boolean variable for grouping
It returns the difference between the means of the True group and the False group.
You will soon see why we are specifying the two arguments. For now, just check that the function returns what it should.
def difference_of_means(table, group_label):
"""Takes: name of table,
column label that indicates the group to which the row belongs
Returns: Difference of mean birth weights of the two groups"""
reduced = table.select('Birth Weight', group_label)
means_table = reduced.group(group_label, np.average)
means = means_table.column(1)
return means.item(1) - means.item(0)
To check that the function is working, let’s use it to calculate the observed difference between the mean birth weights of the two groups in the
sample.
difference_of_means(births, 'Maternal Smoker')
-9.266142572024918
original_and_shuffled
When the difference between group means is computed using the shuffled labels to form the groups, it is small:
0.4747109100050153
With the original labels, the difference is the observed value we saw before:
-9.266142572024918
But could a different shuffle have resulted in a larger difference between the group averages? To get a sense of the variability, we must simulate the
difference many times.
As always, we will start by defining a function that simulates one value of the test statistic under the null hypothesis. This is just a matter of collecting
the code that we wrote above.
The function is called one_simulated_difference_of_means. It takes no arguments, and returns the difference between the mean birth weights of
two groups formed by randomly shuffling all the labels.
def one_simulated_difference_of_means():
    """Returns: Difference between mean birthweights
    of babies of smokers and non-smokers after shuffling labels"""
    shuffled_labels = births.sample(with_replacement = False).column('Maternal Smoker')
    shuffled_table = births.select('Birth Weight').with_column('Shuffled Label', shuffled_labels)
    return difference_of_means(shuffled_table, 'Shuffled Label')
Run the cell below a few times to see how the output changes.
one_simulated_difference_of_means()
-0.058299434770034964
differences = make_array()

repetitions = 5000
for i in np.arange(repetitions):
    new_difference = one_simulated_difference_of_means()
    differences = np.append(differences, new_difference)
The array differences contains 5,000 simulated values of our test statistic: the difference between the mean weight in the smoking group and the
mean weight in the non‑smoking group, when the labels have been assigned at random.
Notice how the distribution is centered roughly around 0. This makes sense, because under the null hypothesis the two groups should have roughly
the same average. Therefore the difference between the group averages should be around 0.
The observed difference in the original sample is about \(‑9.27\) ounces, which doesn’t even appear on the horizontal scale of the histogram. The
observed value of the statistic and the predicted behavior of the statistic under the null hypothesis are inconsistent.
The conclusion of the test is that the data favor the alternative over the null. It supports the hypothesis that the average birth weight of babies born to
mothers who smoke is less than the average birth weight of babies born to non‑smokers.
If you want to compute an empirical p‑value, remember that low values of the statistic favor the alternative hypothesis.
empirical_p = np.count_nonzero(differences <= observed_difference) / repetitions
empirical_p
0.0
The empirical p‑value is 0, meaning that none of the 5,000 permuted samples resulted in a difference of ‑9.27 or lower. This is only an approximation.
The exact chance of getting a difference in that range is not 0. But it is vanishingly small, according to our simulation, and therefore we can reject the
null hypothesis.
The observed difference between the average ages is about \(‑0.8\) years.
Let’s rewrite the code that compared the birth weights so that it now compares the ages of the smokers and non‑smokers.
def difference_of_means(table, group_label):
"""Takes: name of table,
column label that indicates the group to which the row belongs
Returns: Difference of mean ages of the two groups"""
reduced = table.select('Maternal Age', group_label)
means_table = reduced.group(group_label, np.average)
means = means_table.column(1)
return means.item(1) - means.item(0)
observed_age_difference = difference_of_means(births, 'Maternal Smoker')
observed_age_difference
-0.8076725017901509
Remember that the difference is calculated as the mean age of the smokers minus the mean age of the non‑smokers. The negative sign shows that
the smokers are younger on average.
Is this difference due to chance, or does it reflect an underlying difference in the population?
As before, we can use a permutation test to answer this question. If the underlying distributions of ages in the two groups are the same, then the
empirical distribution of the difference based on permuted samples will predict how the statistic should vary due to chance.
We will follow the same process as in any simulation. We will start by writing a function that returns one simulated value of the difference between
means, and then write a for loop to simulate numerous such values and collect them in an array.
def one_simulated_difference_of_means():
    """Returns: Difference between mean ages
    of smokers and non-smokers after shuffling labels"""
    shuffled_labels = births.sample(with_replacement = False).column('Maternal Smoker')
    shuffled_table = births.select('Maternal Age').with_column('Shuffled Label', shuffled_labels)
    return difference_of_means(shuffled_table, 'Shuffled Label')
age_differences = make_array()
repetitions = 5000
for i in np.arange(repetitions):
new_difference = one_simulated_difference_of_means()
age_differences = np.append(age_differences, new_difference)
The observed difference is in the tail of the empirical distribution of the differences simulated under the null hypothesis.
Table().with_column(
'Difference Between Group Means', age_differences).hist(
right_end = observed_age_difference)
# Plotting parameters; you can ignore the code below
plots.ylim(-0.1, 1.2)
plots.scatter(observed_age_difference, 0, color='red', s=40, zorder=3)
plots.title('Prediction Under the Null Hypothesis')
print('Observed Difference:', observed_age_difference)
Once again, the empirical distribution of the simulated differences is centered roughly around 0, because the simulation is under the null hypothesis
that there is no difference between the distributions of the two groups.
The empirical p‑value of the test is the proportion of simulated differences that were equal to or less than the observed difference. This is because
low values of the difference favor the alternative hypothesis that the smokers were younger on average.
empirical_p = np.count_nonzero(age_differences <= observed_age_difference) / 5000
empirical_p
0.0108
The empirical p‑value is around 1% and therefore the result is statistically significant. The test supports the hypothesis that the smokers were
younger on average.
12.2. Causality
Our methods for comparing two samples have a powerful use in the analysis of randomized controlled experiments. Since the treatment and control
groups are assigned randomly in such experiments, differences in their outcomes can be compared to what would happen just due to chance if the
treatment had no effect at all. If the observed differences are more marked than what we would predict as purely due to chance, we will have
evidence of causation. Because of the unbiased assignment of individuals to the treatment and control groups, differences in the outcomes of the
two groups can be ascribed to the treatment.
The key to the analysis of randomized controlled experiments is understanding exactly how chance enters the picture. This helps us set up clear null
and alternative hypotheses. Once that’s done, we can simply use the methods of the previous sections to complete the analysis.
Let’s see how to do this in an example.
After the randomization, we get to see the right half of a randomly selected set of tickets, and the left half of the remaining group.
The table observed_outcomes collects the information about every patient’s potential outcomes, leaving the unobserved half of each “ticket” blank.
(It’s just another way of thinking about the bta table, carrying the same information.)
observed_outcomes = Table.read_table(path_data + "observed_outcomes.csv")
observed_outcomes.show()
Group Outcome if assigned treatment Outcome if assigned control
Control Unknown 1
Control Unknown 1
Control Unknown 0
Control Unknown 0
Control Unknown 0
Control Unknown 0
Control Unknown 0
Control Unknown 0
Control Unknown 0
Control Unknown 0
Control Unknown 0
Control Unknown 0
Control Unknown 0
Control Unknown 0
Control Unknown 0
Control Unknown 0
Treatment 1 Unknown
Treatment 1 Unknown
Treatment 1 Unknown
Treatment 1 Unknown
Treatment 1 Unknown
Treatment 1 Unknown
Treatment 1 Unknown
Treatment 1 Unknown
Treatment 1 Unknown
Treatment 0 Unknown
Treatment 0 Unknown
Treatment 0 Unknown
Treatment 0 Unknown
Treatment 0 Unknown
Treatment 0 Unknown
The observed distance between the two group proportions is 0.475:
0.475
As we have done before, we will define a function that takes the following two arguments:
the name of the table of data
the column label of the group labels
and returns the distance between the two group proportions.
def distance(table, group_label):
reduced = table.select('Result', group_label)
proportions = reduced.group(group_label, np.average).column(1)
return abs(proportions.item(1) - proportions.item(0))
distance(bta, 'Group')
0.475
When the distance between the two proportions is computed using the shuffled labels to form the groups, it comes out much smaller:
0.08750000000000002
This is quite different from the distance between the two original proportions.
distance(bta_with_shuffled_labels, 'Group')
0.475
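The loop below uses a helper one_simulated_distance that is not defined above. A minimal version, following the same shuffling pattern as the permutation tests in the previous section (the column name Shuffled Label is a local choice), is:
def one_simulated_distance():
    shuffled_labels = bta.sample(with_replacement = False).column('Group')
    shuffled_table = bta.select('Result').with_column('Shuffled Label', shuffled_labels)
    return distance(shuffled_table, 'Shuffled Label')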
distances = make_array()
repetitions = 20000
for i in np.arange(repetitions):
new_distance = one_simulated_distance()
distances = np.append(distances, new_distance)
To find the empirical p‑value numerically, we must find the proportion of simulated statistics that were equal to or larger than the observed statistic.
empirical_p = np.count_nonzero(distances >= observed_distance) / repetitions
empirical_p
0.00875
This is a small p‑value. The observed statistic is in the tail of the empirical histogram of the test statistic generated under the null hypothesis.
The result is statistically significant. The test favors the alternative hypothesis over the null. The evidence supports the hypothesis that the treatment
is doing something.
The study reports a P‑value of 0.009, or 0.9%, which is not far from our empirical value.
12.2.7. Causality
Because the trials were randomized, the test is evidence that the treatment causes the difference. The random assignment of patients to the two
groups ensures that there is no confounding variable that could affect the conclusion of causality.
If the treatment had not been randomly assigned, our test would still point toward an association between the treatment and back pain outcomes
among our 31 patients. But beware: without randomization, this association would not imply that the treatment caused a change in back pain
outcomes. For example, if the patients themselves had chosen whether to administer the treatment, perhaps the patients experiencing more pain
would be more likely to choose the treatment and more likely to experience some reduction in pain even without medication. Pre‑existing pain would
then be a confounding factor in the analysis.
12.2.8. A Meta‑Analysis
While the RCT does provide evidence that the botulinum toxin A treatment helped patients, a study of 31 patients isn’t enough to establish the
effectiveness of a medical treatment. This is not just because of the small sample size. Our results in this section are valid for the 31 patients in the
study, but we are really interested in the population of all possible patients.
In 2011, a group of researchers performed a meta‑analysis of the studies on the treatment. That is, they identified all the available studies of such
treatments for low‑back pain and summarized the collated results.
There were several studies but not many could be included in a scientifically sound manner: “We excluded evidence from nineteen studies due to
non‑randomisation, incomplete or unpublished data.” Only three randomized controlled trials remained, one of which is the one we have studied in
this section. The meta‑analysis gave it the highest assessment among all the studies (LBP stands for low‑back pain): “We identified three studies
that investigated the merits of BoNT for LBP, but only one had a low risk of bias and evaluated patients with non‑specific LBP (N = 31).”
Putting it all together, the meta‑analysis concluded, “There is low quality evidence that BoNT injections improved pain, function, or both better than
saline injections and very low quality evidence that they were better than acupuncture or steroid injections. … Further research is very likely to have
an important impact on the estimate of effect and our confidence in it. Future trials should standardize patient populations, treatment protocols and
comparison groups, enlist more participants and include long‑term outcomes, cost‑benefit analysis and clinical relevance of findings.”
It takes a lot of careful work to establish that a medical treatment has a beneficial effect. Knowing how to analyze randomized controlled trials is a
crucial part of this work. Now that you know how to do that, you are well positioned to help medical and other professions establish cause‑and‑effect
relations.
12.3. Deflategate
On January 18, 2015, the Indianapolis Colts and the New England Patriots played the American Football Conference (AFC) championship game to
determine which of those teams would play in the Super Bowl. After the game, there were allegations that the Patriots’ footballs had not been inflated
as much as the regulations required; they were softer. This could be an advantage, as softer balls might be easier to catch.
For several weeks, the world of American football was consumed by accusations, denials, theories, and suspicions: the press labeled the topic
Deflategate, after the Watergate political scandal of the 1970’s. The National Football League (NFL) commissioned an independent analysis. In this
example, we will perform our own analysis of the data.
Pressure is often measured in pounds per square inch (psi). NFL rules stipulate that game balls must be inflated to have pressures in the range 12.5 psi to
psi and 13.5 psi. Each team plays with 12 balls. Teams have the responsibility of maintaining the pressure in their own footballs, but game officials
inspect the balls. Before the start of the AFC game, all the Patriots’ balls were at about 12.5 psi. Most of the Colts’ balls were at about 13.0 psi.
However, these pre‑game data were not recorded.
During the second quarter, the Colts intercepted a Patriots ball. On the sidelines, they measured the pressure of the ball and determined that it was
below the 12.5 psi threshold. Promptly, they informed officials.
At half‑time, all the game balls were collected for inspection. Two officials, Clete Blakeman and Dyrol Prioleau, measured the pressure in each of the
balls.
Here are the data. Each row corresponds to one football. Pressure is measured in psi. The Patriots ball that had been intercepted by the Colts was not
inspected at half‑time. Nor were most of the Colts’ balls – the officials simply ran out of time and had to relinquish the balls for the start of second
half play.
football = Table.read_table(path_data + 'deflategate.csv')
football.show()
Team Blakeman Prioleau
Patriots 11.5 11.8
Patriots 10.85 11.2
Patriots 11.15 11.5
Patriots 10.7 11
Patriots 11.1 11.45
Patriots 11.6 11.95
Patriots 11.85 12.3
Patriots 11.1 11.55
Patriots 10.95 11.35
Patriots 10.5 10.9
Patriots 10.9 11.35
Colts 12.7 12.35
Colts 12.75 12.3
Colts 12.5 12.95
Colts 12.55 12.15
For each of the 15 balls that were inspected, the two officials got different results. It is not uncommon that repeated measurements on the same
object yield different results, especially when the measurements are performed by different people. So we will assign to each ball the average of
the two measurements made on that ball.
football = football.with_column(
'Combined', (football.column(1)+football.column(2))/2
).drop(1, 2)
football.show()
Team Combined
Patriots 11.65
Patriots 11.025
Patriots 11.325
Patriots 10.85
Patriots 11.275
Patriots 11.775
Patriots 12.075
Patriots 11.325
Patriots 11.15
Patriots 10.7
Patriots 11.125
Colts 12.525
Colts 12.525
Colts 12.725
Colts 12.35
At a glance, it seems apparent that the Patriots’ footballs were at a lower pressure than the Colts’ balls. Because some deflation is normal during the
course of a game, the independent analysts decided to calculate the drop in pressure from the start of the game. Recall that the Patriots’ balls had all
started out at about 12.5 psi, and the Colts’ balls at about 13.0 psi. Therefore the drop in pressure for the Patriots’ balls was computed as 12.5 minus
the pressure at half‑time, and the drop in pressure for the Colts’ balls was 13.0 minus the pressure at half‑time.
We can calculate the drop in pressure for each football, by first setting up an array of the starting values. For this we will need an array consisting of
11 values each of which is 12.5, and another consisting of four values each of which is 13. We will use the NumPy function np.ones, which takes a
count as its argument and returns an array of that many elements, each of which is 1.
np.ones(11)
array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
start = np.append(12.5 * np.ones(11), 13 * np.ones(4))
start
array([12.5, 12.5, 12.5, 12.5, 12.5, 12.5, 12.5, 12.5, 12.5, 12.5, 12.5,
       13. , 13. , 13. , 13. ])
The drop in pressure for each football is the difference between the starting pressure and the combined pressure measurement.
drop = start - football.column('Combined')
football = football.with_column('Pressure Drop', drop)
football.show()
The observed difference between the mean pressure drops of the two teams is about 0.73 psi:
0.733522727272728
This positive difference reflects the fact that the average drop in pressure of the Patriots’ footballs was greater than that of the Colts.
Just as we did in the previous section, we will write a function to calculate the difference between the mean drops in the two groups. The function
difference_of_means takes two arguments: the name of the table of data and the label of the column that contains the group labels. It returns the
difference between the mean pressure drops of the two groups.
difference_of_means(football, 'Team')
0.733522727272728
When the difference is computed using the shuffled labels to form the groups, it is much smaller in magnitude:
-0.5619318181818183
difference_of_means(original_and_shuffled, 'Team')
0.733522727272728
The two teams’ average drop values are closer when the team labels are randomly assigned to the footballs than they were for the two groups
actually used in the game.
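The function one_simulated_difference called in the loop below is not defined above; a minimal version in the same style as the earlier permutation tests (assuming difference_of_means here compares the mean Pressure Drop of the two groups, and using Shuffled Label as a local column name) is:
def one_simulated_difference():
    shuffled_labels = football.sample(with_replacement = False).column('Team')
    shuffled_table = football.select('Pressure Drop').with_column('Shuffled Label', shuffled_labels)
    return difference_of_means(shuffled_table, 'Shuffled Label')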
We can now use a for loop and this function to create an array differences that contains 10,000 values of the test statistic simulated under the null
hypothesis.
differences = make_array()
repetitions = 10000
for i in np.arange(repetitions):
new_difference = one_simulated_difference()
differences = np.append(differences, new_difference)
12.3.5. Conclusion of the Test
To calculate the empirical P‑value, it’s important to recall the alternative hypothesis, which is that the Patriots’ drops are too large to be the result of
chance variation alone.
Larger drops for the Patriots favor the alternative hypothesis. So the p‑value is the chance (computed under the null hypothesis) of getting a test
statistic equal to our observed value of 0.733522727272728 or larger.
The figure below visualizes this calculation. It consists of the empirical distribution of the test statistic under the null hypothesis, with the observed
statistic marked in red on the horizontal axis and the area corresponding to the p‑value shaded in gold.
Table().with_column(
'Difference Between Group Averages', differences).hist(
left_end = observed_difference
)
plots.ylim(-0.1, 1.4)
plots.scatter(observed_difference, 0, color='red', s=30, zorder=3)
plots.title('Prediction Under the Null Hypothesis')
print('Observed Difference:', observed_difference)
By eye, the p‑value looks pretty small. We can confirm this by a calculation.
empirical_p = np.count_nonzero(differences >= observed_difference) / 10000
empirical_p
0.0026
As in previous examples of this test, the bulk of the distribution is centered around 0. Under the null hypothesis, the Patriots’ drops are a random
sample of all 15 drops, and therefore so are the Colts’. Therefore the two sets of drops should be about equal on average, and therefore their
difference should be around 0.
But the observed value of the test statistic is quite far away from the heart of the distribution. By any reasonable cutoff for what is “small”, the
empirical P‑value is small. So we end up rejecting the null hypothesis of randomness, and conclude that the Patriots’ drops were too large to reflect
chance variation alone.
The independent investigative team analyzed the data in several different ways, taking into account the laws of physics. The final report said,
“[T]he average pressure drop of the Patriots game balls exceeded the average pressure drop of the Colts balls by 0.45 to 1.02 psi,
depending on various possible assumptions regarding the gauges used, and assuming an initial pressure of 12.5 psi for the Patriots balls
and 13.0 for the Colts balls.”
– Investigative report commissioned by the NFL regarding the AFC Championship game on January 18, 2015
Our analysis shows that the average pressure drop of the Patriots’ balls exceeded that of the Colts’ balls by about 0.73 psi, which is close to the center
of the interval “0.45 to 1.02 psi” and therefore consistent with the official analysis.
Remember that our test of hypotheses does not establish the reason why the difference is not due to chance. Establishing causality is usually more
complex than running a test of hypotheses.
But the all‑important question in the football world was about causation: the question was whether the excess drop of pressure in the Patriots’
footballs was deliberate. If you are curious about the answer given by the investigators, here is the full report.
13. Estimation
In the previous chapter we began to develop ways of inferential thinking. In particular, we learned how to use data to decide between two hypotheses
about the world. But often we just want to know how big something is.
For example, in an earlier chapter we investigated how many warplanes the enemy might have. In an election year, we might want to know what
percent of voters favor a particular candidate. To assess the current economy, we might be interested in the median annual income of households in
the United States.
In this chapter, we will develop a way to estimate an unknown parameter. Remember that a parameter is a numerical value associated with a
population.
To figure out the value of a parameter, we need data. If we have the relevant data for the entire population, we can simply calculate the parameter.
But if the population is very large – for example, if it consists of all the households in the United States – then it might be too expensive and time‑
consuming to gather data from the entire population. In such situations, data scientists rely on sampling at random from the population.
This leads to a question of inference: How to make justifiable conclusions about the unknown parameter, based on the data in the random sample?
We will answer this question by using inferential thinking.
A statistic based on a random sample can be a reasonable estimate of an unknown parameter in the population. For example, you might want to use
the median annual income of sampled households as an estimate of the median annual income of all households in the U.S.
But the value of any statistic depends on the sample, and the sample is based on random draws. So every time data scientists come up with an
estimate based on a random sample, they are faced with a question:
“How different could this estimate have been, if the sample had come out differently?”
In this chapter you will learn one way of answering this question. The answer will give you the tools to estimate a numerical parameter and quantify
the amount of error in your estimate.
We will start with a preliminary about percentiles. The most famous percentile is the median, often used in summaries of income data. Other
percentiles will be important in the method of estimation that we are about to develop. So we will start by defining percentiles carefully.
13.1. Percentiles
Numerical data can be sorted in increasing or decreasing order. Thus the values of a numerical data set have a rank order. A percentile is the value at
a particular rank.
For example, if your score on a test is on the 95th percentile, a common interpretation is that only 5% of the scores were higher than yours. The
median is the 50th percentile; it is commonly assumed that 50% of the values in a data set are above the median.
But some care is required in giving percentiles a precise definition that works for all ranks and all lists. To see why, consider an extreme example
where all the students in a class score 75 on a test. Then 75 is a natural candidate for the median, but it’s not true that 50% of the scores are above
75. Also, 75 is an equally natural candidate for the 95th percentile or the 25th or any other percentile. Ties – that is, equal data values – have to be
taken into account when defining percentiles.
You also have to be careful about exactly how far up the list to go when the relevant index isn’t clear. For example, what should be the 87th percentile
of a collection of 10 values? The 8th value of the sorted collection, or the 9th, or somewhere in between?
In this section, we will give a definition that works consistently for all ranks and all lists.
The 80th percentile is the smallest value that is at least as large as 80% of the elements of sizes, that is, four‑fifths of the five elements. That’s 12:
np.sort(sizes)
The 80th percentile is a value on the list, namely 12. You can see that 80% of the values are less than or equal to it, and that it is the smallest value on
the list for which this is true.
Analogously, the 70th percentile is the smallest value in the collection that is at least as large as 70% of the elements of sizes. Now 70% of 5
elements is “3.5 elements”, so the 70th percentile is the 4th element on the list. That’s 12, the same as the 80th percentile for these data.
12
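Here is a sketch of the rule in code, using an illustrative array of five sizes whose sorted fourth element is 12, consistent with the discussion above (the actual array from the example is not shown here):
sizes_example = make_array(12, 17, 6, 9, 7)
sorted_sizes = np.sort(sizes_example)

# 70% of 5 elements is 3.5, so go up to the 4th element of the sorted list
sorted_sizes.item(3)
The percentile function used in the next example applies exactly this rule.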
13.1.2.2. Example
The table scores_and_sections contains one row for each student in a class of 359 students. The columns are the student’s discussion section and
midterm score.
scores_and_sections = Table.read_table(path_data + 'scores_by_section.csv')
scores_and_sections
Section Midterm
1 22
2 12
2 23
2 14
1 20
3 25
4 19
1 24
5 8
6 14
... (349 rows omitted)
scores_and_sections.select('Midterm').hist(bins=np.arange(-0.5, 25.6, 1))
What was the 85th percentile of the scores? To use the percentile function, create an array scores containing the midterm scores, and find the
85th percentile:
scores = scores_and_sections.column(1)
percentile(85, scores)
22
According to the percentile function, the 85th percentile was 22. To check that this is consistent with our new definition, let’s apply the definition
directly.
First, put the scores in increasing order:
sorted_scores = np.sort(scores_and_sections.column(1))
There are 359 scores in the array. So next, find 85% of 359, which is 305.15.
0.85 * 359
305.15
That’s not an integer. By our definition, the 85th percentile is the 306th element of sorted_scores, which, by Python’s indexing convention, is item
305 of the array.
# The 306th element of the sorted array
sorted_scores.item(305)
22
That’s the same as the answer we got by using percentile. In future, we will just use percentile.
13.1.3. Quartiles
The first quartile of a numerical collection is the 25th percentile. The terminology arises from the first quarter. The second quartile is the median, and
the third quartile is the 75th percentile.
For our scores data, those values are:
percentile(25, scores)
11
percentile(50, scores)
16
percentile(75, scores)
20
Distributions of scores are sometimes summarized by the “middle 50%” interval, between the first and third quartiles.
sf2019.show(3)
Organization Group   Department        Job Family            Job                          Salary   Overtime   Benefits   Total Compensation
Public Protection    Adult Probation   Information Systems   IS Trainer-Journey            91332       0        40059      ...
Public Protection    Adult Probation   Information Systems   IS Engineer-Assistant        123241       0        49279      ...
Public Protection    Adult Probation   Information Systems   IS Business Analyst-Senior   115715       0        46752      ...
... (44522 rows omitted)
There is one row for each of over 44,500 employees. There are numerous columns containing information about City departmental affiliation and
details of the different parts of the employee’s compensation package. Here is the row corresponding to London Breed, the Mayor of San Francisco in
2019.
sf2019.where('Job', 'Mayor')
sf2019.num_rows
37103
pop_median = percentile(50, sf2019.column('Total Compensation'))
pop_median
135747.0
The median total compensation of all the employees was 135,747 dollars.
From a practical perspective, there is no reason for us to draw a sample to estimate this parameter since we simply know its value. But in this section
we are going to pretend we don’t know the value, and see how well we can estimate it based on a random sample.
In later sections, we will come down to earth and work in situations where the parameter is unknown. For now, we are all‑knowing.
our_sample = sf2019.sample(500, with_replacement=False)
percentile(50, our_sample.column('Total Compensation'))
136835.0
The sample size is large. By the law of averages, the distribution of the sample resembles that of the population. Consequently the sample median is
quite comparable to the population median, though of course it is not exactly the same.
So now we have one estimate of the parameter. But had the sample come out differently, the estimate would have had a different value. We would like
to be able to quantify the amount by which the estimate could vary across samples. That measure of variability will help us measure how accurately
we can estimate the parameter.
To see how different the estimate would be if the sample had come out differently, we could just draw another sample from the population. But that
would be cheating. We are trying to mimic real life, in which we won’t have all the population data at hand.
Somehow, we have to get another random sample without sampling again from the population.
resample_1.select('Total Compensation').hist(bins=sf_bins)
resampled_median_1 = percentile(50, resample_1.column('Total Compensation'))
resampled_median_1
141793.0
A second resample can yield a different median:
135880.0
Let us collect this code and define a function one_bootstrap_median that returns one bootstrapped median of total compensation, based on
bootstrapping the original random sample that we called our_sample.
def one_bootstrap_median():
resampled_table = our_sample.sample()
bootstrapped_median = percentile(50, resampled_table.column('Total Compensation'))
return bootstrapped_median
Run the cell below a few times to see how the bootstrapped medians vary. Remember that each of them is an estimate of the population median.
one_bootstrap_median()
132175.0
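The 5,000 bootstrapped medians referred to below were collected in an array bstrap_medians by a loop of this form:
bstrap_medians = make_array()

for i in np.arange(5000):
    bstrap_medians = np.append(bstrap_medians, one_bootstrap_median())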
Here is an empirical histogram of the 5000 bootstrapped medians. The green dot is the population parameter: it is the median of the entire
population, which is what we are trying to estimate. In this example we happen to know its value, but we did not use it in the bootstrap process.
resampled_medians = Table().with_column('Bootstrap Sample Median', bstrap_medians)
median_bins=np.arange(120000, 160000, 2000)
resampled_medians.hist(bins = median_bins)
It is important to remember that the green dot is fixed: it is 135,747 dollars, the population median. The empirical histogram is the result of random
draws, and will be situated randomly relative to the green dot.
Remember also that the point of all these computations is to estimate the population median, which is the green dot. Our estimates are all the
randomly generated sampled medians whose histogram you see above. We want the set of these estimates to contain the parameter. If it doesn’t,
then the estimates are off.
left = percentile(2.5, bstrap_medians)
left
129524.0
right = percentile(97.5, bstrap_medians)
right
143446.0
The population median of 135,747 dollars is between these two numbers. The interval and the population median are shown on the histogram below.
resampled_medians.hist(bins = median_bins)
Now we will write a for loop that calls this function 100 times and collects the “middle 95%” of the bootstrapped medians each time.
The cell below will take several minutes to run since it has to perform 100 replications of sampling 500 times at random from the table and generating
5000 bootstrapped samples.
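The loop relies on a helper bootstrap_median(original_sample, num_repetitions) whose definition does not appear above; a minimal version consistent with one_bootstrap_median is:
def bootstrap_median(original_sample, num_repetitions):
    medians = make_array()
    for i in np.arange(num_repetitions):
        new_median = percentile(50, original_sample.sample().column('Total Compensation'))
        medians = np.append(medians, new_median)
    return medians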
# THE BIG SIMULATION: This one takes several minutes.
# Generate 100 intervals and put the endpoints in the table intervals
left_ends = make_array()
right_ends = make_array()
for i in np.arange(100):
original_sample = sf2019.sample(500, with_replacement=False)
medians = bootstrap_median(original_sample, 5000)
left_ends = np.append(left_ends, percentile(2.5, medians))
right_ends = np.append(right_ends, percentile(97.5, medians))
intervals = Table().with_columns(
'Left', left_ends,
'Right', right_ends
)
For each of the 100 replications of the entire process, we get one interval of estimates of the median.
intervals
Left Right
125093 139379
129925 140757
133955 146369
129335 140847
132756 145429
130167 143200
125935 138491
131092 142472
128509 140462
131270 145998
... (90 rows omitted)
The good intervals are those that contain the parameter we are trying to estimate. Typically the parameter is unknown, but in this section we happen
to know what the parameter is.
pop_median
135747.0
How many of the 100 intervals contain the population median? That’s the number of intervals where the left end is below the population median and
the right end is above.
intervals.where(
'Left', are.below(pop_median)).where(
'Right', are.above(pop_median)).num_rows
93
It takes many minutes to construct all the intervals, but try it again if you have the patience. Most likely, about 95 of the 100 intervals will be good
ones: they will contain the parameter.
It’s hard to show you all the intervals on the horizontal axis as they have large overlaps – after all, they are all trying to estimate the same parameter.
The graphic below shows each interval on the same axes by stacking them vertically. The vertical axis is simply the number of the replication from
which the interval was generated.
The green line is where the parameter is. It has a fixed position since the parameter is fixed.
Good intervals cover the parameter. There are approximately 95 of these, typically.
If an interval doesn’t cover the parameter, it’s a dud. The duds are the ones where you can see “daylight” around the green line. There are very few of
them – about 5 out of 100, typically – but they do happen.
Any method based on sampling has the possibility of being off. The beauty of methods based on random sampling is that we can quantify how often
they are likely to be off.
To summarize what the simulation shows, suppose you are estimating the population median by the following process:
Draw a large random sample from the population.
Bootstrap your random sample and get an estimate from the new random sample.
Repeat the above bootstrap step thousands of times, and get thousands of estimates.
Pick off the “middle 95%” interval of all the estimates.
That gives you one interval of estimates. If 99 other people repeat the entire process, starting with a new random sample each time, then you will
end up with 100 such intervals. About 95 of these 100 intervals will contain the population parameter.
In other words, this process of estimation captures the parameter about 95% of the time.
You can replace 95% by a different value, as long as it’s not 100. Suppose you replace 95% by 80% and keep the sample size fixed at 500. Then your
intervals of estimates will be shorter than those we simulated here, because the “middle 80%” is a smaller range than the “middle 95%”. If you keep
repeating this process, only about 80% of your intervals will contain the parameter.
births.show(3)
ratios
The median of the ratios in the sample is about 0.429 ounces per day:
0.42907801418439717
But what was the median in the population? We don’t know, so we will estimate it.
Our method will be exactly the same as in the previous section. We will bootstrap the sample 5,000 times, resulting in 5,000 estimates of the median.
Our 95% confidence interval will be the “middle 95%” of all of our estimates.
Run the cell below to see how the bootstrapped ratios vary. Remember that each of them is an estimate of the unknown ratio in the population.
one_bootstrap_median()
0.43010752688172044
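The endpoints left and right used below come from collecting 5,000 bootstrapped medians and taking the middle 95% of them; a sketch, assuming one_bootstrap_median now returns the median ratio of a bootstrap resample as described above:
bstrap_medians = make_array()
for i in np.arange(5000):
    bstrap_medians = np.append(bstrap_medians, one_bootstrap_median())

left = percentile(2.5, bstrap_medians)
right = percentile(97.5, bstrap_medians)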
make_array(left, right)
array([0.42545455, 0.43272727])
The 95% confidence interval goes from about 0.425 ounces per day to about 0.433 ounces per day. We are estimating that the median “birth weight to
gestational days” ratio in the population is somewhere in the interval 0.425 ounces per day to 0.433 ounces per day.
The estimate of 0.429 based on the original sample happens to be half‑way in between the two ends of the interval, though that need not be true in
general.
To visualize our results, let us draw the empirical histogram of our bootstrapped medians and place the confidence interval on the horizontal axis.
resampled_medians = Table().with_columns(
'Bootstrap Sample Median', bstrap_medians
)
resampled_medians.hist(bins=15)
plots.plot([left, right], [0, 0], color='yellow', lw=8);
This histogram and interval resemble those we drew in the previous section, with one big difference – there is no green dot showing where the
parameter is. We don’t know where that dot should be, or whether it is even in the interval.
We just have an interval of estimates. It is a 95% confidence interval of estimates, because the process that generates it produces a good interval
about 95% of the time. That certainly beats guessing the ratio at random!
Keep in mind that this interval is an approximate 95% confidence interval. There are many approximations involved in its computation. The
approximation is not bad, but it is not exact.
np.average(births.column('Maternal Age'))
27.228279386712096
What was the average age of the mothers in the population? We don’t know the value of this parameter.
Let’s estimate the unknown parameter by the bootstrap method. To do this, we will adapt the code for bootstrap_median to instead define the
function bootstrap_mean. The code is the same except that the statistics are means (that is, averages) instead of medians, and are collected in an
array called bstrap_means instead of bstrap_medians.
def one_bootstrap_mean():
    resample = births.sample()
    return np.average(resample.column('Maternal Age'))
make_array(left, right)
array([26.90630324, 27.55962521])
The 95% confidence interval goes from about 26.9 years to about 27.6 years. That is, we are estimating that the average age of the mothers in the
population is somewhere in the interval 26.9 years to 27.6 years.
Notice how close the two ends are to the average of about 27.2 years in the original sample. The sample size is very large – 1,174 mothers – and so
the sample averages don’t vary much. We will explore this observation further in the next chapter.
The empirical histogram of the 5,000 bootstrapped mean ages is shown below, along with the 95% confidence interval for the population mean age.
resampled_means = Table().with_columns(
'Bootstrap Sample Mean', bstrap_means
)
resampled_means.hist(bins=15)
plots.plot([left, right], [0, 0], color='yellow', lw=8);
Once again, the average of the original sample (27.23 years) is close to the center of the interval. That’s not very surprising, because each
bootstrapped sample is drawn from that same original sample. The averages of the bootstrapped samples are about symmetrically distributed on
either side of the average of the sample from which they were drawn.
Notice also that the empirical histogram of the resampled means has roughly a symmetric bell shape, even though the histogram of the sampled ages
was not symmetric at all:
births.select('Maternal Age').hist()
This is a consequence of the Central Limit Theorem of probability and statistics. In later sections, we will see what the theorem says.
array([27.01277683, 27.44293015])
resampled_means.hist(bins=15)
plots.plot([left_80, right_80], [0, 0], color='yellow', lw=8);
This 80% confidence interval is much shorter than the 95% confidence interval. It only goes from about 27.0 years to about 27.4 years. While that’s a
tight set of estimates, you know that this process only produces a good interval about 80% of the time.
The earlier process produced a wider interval but we had more confidence in the process that generated it.
To get a narrow confidence interval at a high level of confidence, you’ll have to start with a larger sample. We’ll see why in the next chapter.
0.3909710391822828
Remember that a proportion is an average of zeros and ones. So the proportion of mothers who smoked could also be calculated using array
operations as follows.
smoking = births.column('Maternal Smoker')
np.count_nonzero(smoking) / len(smoking)
0.3909710391822828
What percent of mothers in the population smoked during pregnancy? This is an unknown parameter which we can estimate by a bootstrap
confidence interval. The steps are analogous to those we took to estimate the population median and mean.
In a process that is now familiar, we will start by defining a function one_bootstrap_proportion that bootstraps the sample and returns the proportion of
smokers in the bootstrapped sample. Then we will call the function multiple times using a for loop, and get the 2.5th and 97.5th percentiles
of the bootstrapped proportions.
def one_bootstrap_proportion():
    resample = births.sample()
    smoking = resample.column('Maternal Smoker')
    return np.count_nonzero(smoking) / len(smoking)
make_array(left, right)
array([0.36286201, 0.41908007])
births.select('Maternal Age').hist()
A small percent of the sampled ages are in the (26.9, 27.6) interval, and you would expect a similar small percent in the population. The interval just
estimates one number: the average of all the ages in the population.
However, estimating a parameter by confidence intervals does have an important use besides just telling us roughly how big the parameter is.
hodgkins.show(3)
hodgkins
height rad chemo base month15 drop
164 679 180 160.57 87.77 72.8
168 311 180 98.24 67.62 30.62
173 388 239 129.04 133.33 ‑4.29
157 370 168 85.41 81.28 4.13
160 468 151 67.94 79.26 ‑11.32
170 341 96 150.51 80.97 69.54
163 453 134 129.88 69.24 60.64
175 529 264 87.45 56.48 30.97
185 392 240 149.84 106.99 42.85
178 479 216 92.24 73.43 18.81
... (12 rows omitted)
hodgkins.select('drop').hist(bins=np.arange(-20, 81, 20))
np.average(hodgkins.column('drop'))
28.615909090909096
In the sample, the average drop is about 28.6. But could this be the result of chance variation? The data are from a random sample. Could it be that in
the entire population of patients, the average drop is just 0?
To answer this, we can set up two hypotheses:
Null hypothesis: In the population, the average drop is 0.
Alternative hypothesis: In the population, the average drop is not 0.
To test this hypothesis with a 1% cutoff for the p‑value, let’s construct an approximate 99% confidence interval for the average drop in the
population.
def one_bootstrap_mean():
    resample = hodgkins.sample()
    return np.average(resample.column('drop'))
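The loop that collects the bootstrapped means and the percentiles that give the 99% interval are not shown in this excerpt. A minimal sketch, assuming 5,000 replications as in the earlier examples; for a 99% interval we take the 0.5th and 99.5th percentiles instead of the 2.5th and 97.5th.
bstrap_means = make_array()
for i in np.arange(5000):
    bstrap_means = np.append(bstrap_means, one_bootstrap_mean())

# The "middle 99%" of the bootstrapped means
left = percentile(0.5, bstrap_means)
right = percentile(99.5, bstrap_means)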
make_array(left, right)
array([17.46863636, 40.97681818])
resampled_means = Table().with_columns(
'Bootstrap Sample Mean', bstrap_means
)
resampled_means.hist()
plots.plot([left, right], [0, 0], color='yellow', lw=8);
The 99% confidence interval for the average drop in the population goes from about 17 to about 40. The interval doesn’t contain 0. So we reject the
null hypothesis.
But notice that we have done better than simply concluding that the average drop in the population isn’t 0. We have estimated how big the average
drop is. That’s a more useful result than just saying, “It’s not 0.”
A note on accuracy: Our confidence interval is quite wide, for two main reasons:
The confidence level is high (99%).
The sample size is relatively small compared to those in our earlier examples.
In the next chapter, we will examine how the sample size affects accuracy. We will also examine how the empirical distributions of sample means so
often come out bell shaped even though the distributions of the underlying data are not bell shaped at all.
13.4.3. Endnote
The terminology of a field usually comes from the leading researchers in that field. Brad Efron, who first proposed the bootstrap technique, used a
term that has American origins. Not to be outdone, Chinese statisticians have proposed their own method.
np.average(not_symmetric)
4.25
np.mean(not_symmetric)
4.25
np.mean(zero_one)
0.75
Because proportions are a special case of means, results about random sample means apply to random sample proportions as well.
array([2, 3, 3, 9])
same_distribution = make_array(2, 2, 3, 3, 3, 3, 9, 9)
np.mean(same_distribution)
4.25
The mean is a physical attribute of the histogram of the distribution. Here is the histogram of the distribution of not_symmetric or equivalently the
distribution of same_distribution.
Imagine the histogram as a figure made out of cardboard attached to a wire that runs along the horizontal axis, and imagine the bars as weights
attached at the values 2, 3, and 9. Suppose you try to balance this figure on a point on the wire. If the point is near 2, the figure will tip over to the
right. If the point is near 9, the figure will tip over to the left. Somewhere in between is the point where the figure will balance; that point is 4.25, the mean.
The mean is the center of gravity or balance point of the histogram.
To understand why that is, it helps to know some physics. The center of gravity is calculated exactly as we calculated the mean, by using the distinct
values weighted by their proportions.
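For the distribution above, the distinct values 2, 3, and 9 appear with proportions 1/4, 2/4, and 1/4, so the balance point works out to the mean we calculated:
\[ 2 \cdot \frac{1}{4} ~+~ 3 \cdot \frac{2}{4} ~+~ 9 \cdot \frac{1}{4} ~=~ 4.25 \]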
Because the mean is a balance point, it is sometimes displayed as a fulcrum or triangle at the base of the histogram.
14.1.5. The Mean and the Median
If a student’s score on a test is below average, does that imply that the student is in the bottom half of the class on that test?
Happily for the student, the answer is, “Not necessarily.” The reason has to do with the relation between the average, which is the balance point of
the histogram, and the median, which is the “half‑way point” of the data.
The relationship is easy to see in a simple example. Here is a histogram of the collection {2, 3, 3, 4} which is in the array symmetric. The distribution is
symmetric about 3. The mean and the median are both equal to 3.
symmetric = make_array(2, 3, 3, 4)
np.mean(symmetric)
3.0
percentile(50, symmetric)
In general, for symmetric distributions, the mean and the median are equal.
What if the distribution is not symmetric? Let’s compare symmetric and not_symmetric.
The blue histogram represents the original symmetric distribution. The gold histogram of not_symmetric starts out the same as the blue at the left
end, but its rightmost bar has slid over to the value 9. The brown part is where the two histograms overlap.
The median and mean of the blue distribution are both equal to 3. The median of the gold distribution is also equal to 3, though the right half is
distributed differently from the left.
But the mean of the gold distribution is not 3: the gold histogram would not balance at 3. The balance point has shifted to the right, to 4.25.
In the gold distribution, 3 out of 4 entries (75%) are below average. The student with a below average score can therefore take heart. He or she might
be in the majority of the class.
In general, if the histogram has a tail on one side (the formal term is “skewed”), then the mean is pulled away from the median in the
direction of the tail.
14.1.5.1. Example
The table sf2015 contains salary and benefits data for San Francisco City employees in 2015. As before, we will restrict our analysis to those who had
the equivalent of at least half‑time employment for the year.
sf2015 = Table.read_table(path_data + 'san_francisco_2015.csv').where('Salaries',
are.above(10000))
As we saw earlier, the highest compensation was above $600,000 but the vast majority of employees had compensations below $300,000.
sf2015.select('Total Compensation').hist(bins = np.arange(10000, 700000, 25000))
110305.79
np.mean(compensation)
114725.98411824222
Distributions of incomes of large populations tend to be right skewed. When the bulk of a population has middle to low incomes, but a very small
proportion has very high incomes, the histogram has a long, thin tail to the right.
The mean income is affected by this tail: the farther the tail stretches to the right, the larger the mean becomes. But the median is not affected by
values at the extremes of the distribution. That is why economists often summarize income distributions by the median instead of the mean.
14.2. Variability
The mean tells us where a histogram balances. But in almost every histogram we have seen, the values spread out on both sides of the mean. How
far from the mean can they be? To answer this question, we will develop a measure of variability about the mean.
We will start by describing how to calculate the measure. Then we will see why it is a good measure to calculate.
The goal is to measure roughly how far off the numbers are from their average. To do this, we first need the average:
# Step 1. The average.
mean = np.mean(any_numbers)
mean
3.75
Next, let’s find out how far each value is from the mean. These are called the deviations from the average. A “deviation from average” is just a value
minus the average. The table calculation_steps displays the results.
# Step 2. The deviations from average.
0.0
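Neither the array any_numbers nor the code for the deviations appears in this excerpt. Here is a minimal sketch consistent with the printed results (a mean of 3.75, a sum of deviations of 0, and a variance of 13.1875 below); the values in any_numbers are a hypothetical choice that reproduces those numbers.
any_numbers = make_array(1, 2, 2, 10)      # hypothetical data matching the printed results
mean = np.mean(any_numbers)                # 3.75
deviations = any_numbers - mean
calculation_steps = Table().with_columns(
    'Value', any_numbers,
    'Deviation from Average', deviations
)
sum(deviations)                            # the deviations sum to 0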
The positive deviations exactly cancel out the negative ones. This is true of all lists of numbers, no matter what the histogram of the list looks like: the
sum of the deviations from average is zero.
Since the sum of the deviations is 0, the mean of the deviations will be 0 as well:
np.mean(deviations)
0.0
Because of this, the mean of the deviations is not a useful measure of the size of the deviations. What we really want to know is roughly how big the
deviations are, regardless of whether they are positive or negative. So we need a way to eliminate the signs of the deviations.
There are two time‑honored ways of losing signs: the absolute value, and the square. It turns out that taking the square constructs a measure with
extremely powerful properties, some of which we will study in this course.
So let’s eliminate the signs by squaring all the deviations. Then we will take the mean of the squares:
# Step 3. The squared deviations from average
squared_deviations = deviations ** 2
calculation_steps = calculation_steps.with_column(
'Squared Deviations from Average', squared_deviations
)
calculation_steps
variance = np.mean(squared_deviations)
variance
13.1875
Variance: The mean squared deviation calculated above is called the variance of the values.
While the variance does give us an idea of spread, it is not on the same scale as the original variable, because its units are the square of the original
units. This makes interpretation difficult.
So we return to the original scale by taking the positive square root of the variance:
# Step 5.
# Standard Deviation: root mean squared deviation from average
# Steps of calculation: 5 4 3 2 1
sd = variance ** 0.5
sd
3.6314597615834874
3.6314597615834874
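The repeated value above is consistent with computing the same quantity directly with NumPy, which carries out all of the steps in one call. A one-line sketch, assuming the same any_numbers array:
np.std(any_numbers)     # root mean squared deviation from average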
It is no surprise that NBA players are tall! Their average height is just over 79 inches (6’7”), about 10 inches taller than the average height of men in
the United States.
mean_height = np.mean(nba13.column('Height'))
mean_height
79.06534653465347
About how far off are the players’ heights from the average? This is measured by the SD of the heights, which is about 3.45 inches.
sd_height = np.std(nba13.column('Height'))
sd_height
3.4505971830275546
The towering center Hasheem Thabeet of the Oklahoma City Thunder was the tallest player at a height of 87 inches.
nba13.sort('Height', descending=True).show(3)
7.934653465346528
That’s a deviation from average, and it is about 2.3 times the standard deviation:
(87 - mean_height)/sd_height
2.2995015194397923
In other words, the height of the tallest player was about 2.3 SDs above average.
At 69 inches tall, Isaiah Thomas was one of the two shortest NBA players in 2013. His height was about 2.9 SDs below average.
nba13.sort('Height').show(3)
-2.9169868288775844
What we have observed is that the tallest and shortest players were both just a few SDs away from the average height. This is an example of why the
SD is a useful measure of spread. No matter what the shape of the histogram, the average and the SD together tell you a lot about where the
histogram is situated on the number line.
(26.19009900990099, 4.321200441720307)
The average age was just over 26 years, and the SD was about 4.3 years.
How far off were the ages from the average? Just as we did with the heights, let’s look at an example.
Juwan Howard was the oldest player, at 40.
nba13.sort('Age in 2013', descending=True).show(3)
3.1958482778922357
What we have observed for the heights and ages is true in great generality. For all lists, the bulk of the entries are no more than 2 or 3 SDs away from
the average.
14.2.7. Example
As we saw in an earlier section, the table united contains a column Delay consisting of the departure delay times, in minutes, of thousands of
United Airlines flights in the summer of 2015. We will create a new column called Delay (Standard Units) by applying the function standard_units
to the column of delay times. This allows us to see all the delay times in minutes as well as their corresponding values in standard units.
united = Table.read_table(path_data + 'united_summer2015.csv')
united = united.with_column(
'Delay (Standard Units)', standard_units(united.column('Delay'))
)
united
0.9790235081374322
The histogram of delay times is shown below, with the horizontal axis in standard units. By the table above, the right hand tail continues all the way
out to \(z=14.27\) standard units (580 minutes). The area of the histogram outside the range \(z=‑3\) to \(z=3\) is about 2%, put together in tiny little
bits that are mostly invisible in the histogram.
united.hist('Delay (Standard Units)', bins=np.arange(-5, 15.5, 0.5))
plots.xticks(np.arange(-6, 17, 3));
64.0
sd_height = np.round(np.std(heights), 1)
sd_height
2.5
The last two lines of code in the cell above change the labeling of the horizontal axis. Now, the labels correspond to “average \(\pm\) \(z\) SDs” for \(z
= 0, \pm 1, \pm 2\), and \(\pm 3\). Because of the shape of the distribution, the “center” has an unambiguous meaning and is clearly visible at 64.
The numerical value of the shaded area can be found by calling stats.norm.cdf.
stats.norm.cdf(1)
0.8413447460685429
That’s about 84%. We can now use the symmetry of the curve and the fact that the total area under the curve is 1 to find other areas.
The area to the right of \(z=1\) is about 100% ‑ 84% = 16%.
1 - stats.norm.cdf(1)
0.15865525393145707
The area between \(z=‑1\) and \(z=1\) can be computed in several different ways. It is the gold area under the curve below.
For example, we could calculate the area as “100% ‑ two equal tails”, which works out to roughly 100% ‑ 2x16% = 68%.
Or we could note that the area between \(z=1\) and \(z=‑1\) is equal to all the area to the left of \(z=1\), minus all the area to the left of \(z=‑1\).
stats.norm.cdf(1) - stats.norm.cdf(-1)
0.6826894921370859
By a similar calculation, we see that the area between \(‑2\) and 2 is about 95%.
stats.norm.cdf(2) - stats.norm.cdf(-2)
0.9544997361036416
In other words, if a histogram is roughly bell shaped, the proportion of data in the range “average \(\pm\) 2 SDs” is about 95%.
That is quite a bit more than Chebychev’s lower bound of 75%. Chebychev’s bound is weaker because it has to work for all distributions. If we know
that a distribution is normal, we have good approximations to the proportions, not just bounds.
The table below compares what we know about all distributions and about normal distributions. Notice that when \(z=1\), Chebychev’s bound is
correct but not illuminating.
Percent in Range All Distributions: Bound Normal Distribution: Approximation
average \(\pm\) 1 SD at least 0% about 68%
average \(\pm\) 2 SDs at least 75% about 95%
average \(\pm\) 3 SDs at least 88.888…% about 99.73%
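The comparison in the table can be reproduced numerically. A short sketch, using the same stats.norm.cdf function as above and Chebychev's bound \(1 - 1/z^2\):
from scipy import stats
import numpy as np

# Chebychev's bound versus the normal approximation, for z = 1, 2, 3
for z in np.arange(1, 4):
    chebychev_bound = 1 - 1/z**2
    normal_area = stats.norm.cdf(z) - stats.norm.cdf(-z)
    print(z, round(chebychev_bound, 3), round(normal_area, 4))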
net_gain_red = make_array()
for i in np.arange(repetitions):
    spins = red.sample(num_bets)
    new_net_gain_red = spins.column('Winnings: Red').sum()
    net_gain_red = np.append(net_gain_red, new_net_gain_red)
results = Table().with_column(
'Net Gain on Red', net_gain_red
)
That’s a roughly bell shaped histogram, even though the distribution we are drawing from is nowhere near bell shaped.
Center. The distribution is centered near ‑20 dollars, roughly. To see why, note that your winnings will be $1 on about 18/38 of the bets, and ‑$1 on
the remaining 20/38. So your average winnings per dollar bet will be roughly ‑5.26 cents:
average_per_bet = 1*(18/38) + (-1)*(20/38)
average_per_bet
-0.05263157894736842
So in 400 bets you expect that your net gain will be about ‑$21:
400 * average_per_bet
-21.052631578947366
For confirmation, we can compute the mean of the 10,000 simulated net gains:
np.mean(results.column(0))
-20.9586
Spread. Run your eye along the curve starting at the center and notice that the point of inflection is near 0. On a bell shaped curve, the SD is the
distance from the center to a point of inflection. The center is roughly ‑$20, which means that the SD of the distribution is around $20.
In the next section we will see where the $20 comes from. For now, let’s confirm our observation by simply calculating the SD of the 10,000 simulated
net gains:
np.std(results.column(0))
20.029115957525438
Summary. The net gain in 400 bets is the sum of the 400 amounts won on each individual bet. The probability distribution of that sum is
approximately normal, with an average and an SD that we can approximate.
The mean delay was about 16.6 minutes and the SD was about 39.5 minutes. Notice how large the SD is, compared to the mean. Those large
deviations on the right have an effect, even though they are a very small proportion of the data.
mean_delay = np.mean(united.column('Delay'))
sd_delay = np.std(united.column('Delay'))
mean_delay, sd_delay
(16.658155515370705, 39.480199851609314)
Now suppose we sampled 400 delays at random with replacement. You could sample without replacement if you like, but the results would be very
similar to with‑replacement sampling. If you sample a few hundred out of 13,825 without replacement, you hardly change the population each time
you pull out a value.
In the sample, what could the average delay be? We expect it to be around 16 or 17, because that’s the population average; but it is likely to be
somewhat off. Let’s see what we get by sampling. We’ll work with the table delay that only contains the column of delays.
delay = united.select('Delay')
np.mean(delay.sample(400).column('Delay'))
15.59
The sample average varies according to how the sample comes out, so we will simulate the sampling process repeatedly and draw the empirical
histogram of the sample average. That will be an approximation to the probability histogram of the sample average.
sample_size = 400
repetitions = 10000
means = make_array()
for i in np.arange(repetitions):
    sample = delay.sample(sample_size)
    new_mean = np.mean(sample.column('Delay'))
    means = np.append(means, new_mean)
results = Table().with_column(
'Sample Mean', means
)
Once again, we see a rough bell shape, even though we are drawing from a very skewed distribution. The bell is centered somewhere between 16 and
17, as we expect.
model
Color
Purple
Purple
Purple
White
props = make_array()
num_plants = 200
repetitions = 10000
for i in np.arange(repetitions):
    sample = model.sample(num_plants)
    new_prop = np.count_nonzero(sample.column('Color') == 'Purple')/num_plants
    props = np.append(props, new_prop)
There’s that normal curve again, as predicted by the Central Limit Theorem, centered at around 0.75 just as you would expect.
How would this distribution change if we increased the sample size? Let’s run the code again with a sample size of 800, and collect the results of
simulations in the same table in which we collected simulations based on a sample size of 200. We will keep the number of repetitions the same as
before so that the two columns have the same length.
props2 = make_array()
num_plants = 800
for i in np.arange(repetitions):
    sample = model.sample(num_plants)
    new_prop = np.count_nonzero(sample.column('Color') == 'Purple')/num_plants
    props2 = np.append(props2, new_prop)
pop_mean = np.mean(delay.column('Delay'))
pop_mean
16.658155515370705
Now let’s take random samples and look at the probability distribution of the sample mean. As usual, we will use simulation to get an empirical
approximation to this distribution.
We will define a function simulate_sample_mean to do this, because we are going to vary the sample size later. The arguments are the name of the
table, the label of the column containing the variable, the sample size, and the number of simulations.
"""Empirical distribution of random sample means"""
means = make_array()
for i in range(repetitions):
new_sample = table.sample(sample_size)
new_sample_mean = np.mean(new_sample.column(label))
means = np.append(means, new_sample_mean)
Let us simulate the mean of a random sample of 100 delays, then of 400 delays, and finally of 625 delays. We will perform 10,000 repetitions of each
of these processes. The xlim and ylim lines set the axes consistently in all the plots for ease of comparison. You can just ignore those two lines of code
in each cell.
simulate_sample_mean(delay, 'Delay', 100, 10000)
plots.xlim(5, 35)
plots.ylim(0, 0.25);
You can see the Central Limit Theorem in action – the histograms of the sample means are roughly normal, even though the histogram of the delays
themselves is far from normal.
You can also see that each of the three histograms of the sample means is centered very close to the population mean. In each case, the “average of
sample means” is very close to 16.66 minutes, the population mean. Both values are provided in the printout above each histogram. As expected, the
sample mean is an unbiased estimate of the population mean.
39.480199851609314
Take a look at the SDs in the sample mean histograms above. In all three of them, the SD of the population of delays is about 40 minutes, because all
the samples were taken from the same population.
Now look at the SD of all 10,000 sample means, when the sample size is 100. That SD is about one-tenth of the population SD. When the sample size
is 400, the SD of all the sample means is about one-twentieth of the population SD. When the sample size is 625, the SD of the sample means is
about one-twenty-fifth of the population SD.
It seems like a good idea to compare the SD of the empirical distribution of the sample means to the quantity “population SD divided by the square
root of the sample size.”
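That quantity is easy to compute for the three sample sizes used above. A short sketch, assuming the delay table from the earlier cells; the name pop_sd defined here is also the population SD used in the comparison cell below.
pop_sd = np.std(delay.column('Delay'))     # about 39.48 minutes
for n in make_array(100, 400, 625):
    print(n, pop_sd / n ** 0.5)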
Here are the numerical values. For each sample size in the first column, 10,000 random samples of that size were drawn, and the 10,000 sample
means were calculated. The second column contains the SD of those 10,000 sample means. The third column contains the result of the calculation
“population SD divided by the square root of the sample size.”
The cell takes a while to run, as it’s a large simulation. But you’ll soon see that it’s worth the wait.
repetitions = 10000
sample_sizes = np.arange(25, 626, 25)
sd_means = make_array()
for n in sample_sizes:
    means = make_array()
    for i in np.arange(repetitions):
        means = np.append(means, np.mean(delay.sample(n).column('Delay')))
    sd_means = np.append(sd_means, np.std(means))
sd_comparison = Table().with_columns(
'Sample Size n', sample_sizes,
'SD of 10,000 Sample Means', sd_means,
'pop_sd/sqrt(n)', pop_sd/np.sqrt(sample_sizes)
)
sd_comparison
Remember that the possible values in the population are only 0 and 1.
The blue histogram (50% 1’s and 50% 0’s) has more spread than the gold. The mean is 0.5. Half the deviations from mean are equal to 0.5 and the
other half equal to ‑0.5, so the SD is 0.5.
In the gold histogram, all of the area is being squished up around 1, leading to less spread. 90% of the deviations are small: 0.1. The other 10% are
‑0.9 which is large, but overall the spread is smaller than in the blue histogram.
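As a quick check of the 90%-10% case described above, a one-line sketch using NumPy:
ninety_ten = np.append(np.ones(9), np.zeros(1))    # 90% ones, 10% zeros
np.std(ninety_ten)                                 # 0.3, less spread than the 0.5 of the 50-50 population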
The same observation would hold if we varied the proportion of 1’s or let the proportion of 0’s be larger than the proportion of 1’s. Let’s check this by
calculating the SDs of populations of 10 elements that only consist of 0’s and 1’s, in varying proportions. The function np.ones is useful for this. It
takes a positive integer as its argument and returns an array consisting of that many 1’s.
sd = make_array()
for i in np.arange(1, 10, 1):
    # Create an array of i 1's and (10-i) 0's
    population = np.append(np.ones(i), 1 - np.ones(10 - i))
    sd = np.append(sd, np.std(population))
zero_one_sds = Table().with_columns(
"Population Proportion of 1's", np.arange(0.1, 1, 0.1),
"Population SD", sd
)
zero_one_sds
Summary: The SD of a population of 1’s and 0’s is at most 0.5. That’s the value of the SD when 50% of the population is coded 1 and the other 50%
are coded 0.
15. Prediction
An important aspect of data science is to find out what data can tell us about the future. What do data about climate and pollution say about
temperatures a few decades from now? Based on a person’s internet profile, which websites are likely to interest them? How can a patient’s medical
history be used to judge how well he or she will respond to a treatment?
To answer such questions, data scientists have developed methods for making predictions. In this chapter we will study one of the most commonly
used ways of predicting the value of one variable based on the value of another.
Here is a historical dataset used for the prediction of the heights of adults based on the heights of their parents. We have studied this dataset in an
earlier section. The table heights contains data on the midparent height and child’s height (all in inches) for a population of 934 adult “children”.
Recall that the midparent height is an average of the heights of the two parents.
# Data on heights of parents and their adult children
original = Table.read_table(path_data + 'family_heights.csv')
heights = Table().with_columns(
'MidParent', original.column('midparentHeight'),
'Child', original.column('childHeight')
)
heights
MidParent Child
75.43 73.2
75.43 69.2
75.43 69
75.43 69
73.66 73.5
73.66 72.5
73.66 65.5
73.66 65.5
72.06 71
72.06 68
... (924 rows omitted)
heights.scatter('MidParent')
A primary reason for studying the data was to be able to predict the adult height of a child born to parents who were similar to those in the dataset.
We made these predictions in Section 8.1, after noticing the positive association between the two variables.
Our approach was to base the prediction on all the points that correspond to a midparent height of around the midparent height of the new person.
To do this, we wrote a function called predict_child which takes a midparent height as its argument and returns the average height of all the
children who had midparent heights within half an inch of the argument.
def predict_child(mpht):
    """Return a prediction of the height of a child
    whose parents have a midparent height of mpht."""
    close_points = heights.where('MidParent', are.between(mpht - 0.5, mpht + 0.5))
    return close_points.column('Child').mean()
We applied the function to the column of Midparent heights, and visualized the result.
# Apply predict_child to all the midparent heights
heights_with_predictions = heights.with_column(
'Prediction', heights.apply(predict_child, 'MidParent')
)
# Draw the original scatter plot along with the predicted values
heights_with_predictions.scatter('MidParent')
The prediction at a given midparent height lies roughly at the center of the vertical strip of points at the given height. This method of prediction is
called regression. Later in this chapter we will see whether we can avoid our arbitrary definition of “close” as “within 0.5 inches”. But first we
will develop a measure that can be used in many settings to decide how good one variable will be as a predictor of another.
15.1. Correlation
In this section we will develop a measure of how tightly clustered a scatter diagram is about a straight line. Formally, this is called measuring linear
association.
The table hybrid contains data on hybrid passenger cars sold in the United States from 1997 to 2013. The data were adapted from the online data
archive of Prof. Larry Winner of the University of Florida. The columns:
vehicle: model of the car
year: year of manufacture
msrp: manufacturer’s suggested retail price in 2013 dollars
acceleration: acceleration rate in km per hour per second
mpg: fuel economy in miles per gallon
class: the model’s class.
hybrid = Table.read_table(path_data + 'hybrid.csv')
hybrid
Notice the positive association. The scatter of points is sloping upwards, indicating that cars with greater acceleration tended to cost more, on
average; conversely, the cars that cost more tended to have greater acceleration on average.
The scatter diagram of MSRP versus mileage shows a negative association. Hybrid cars with higher mileage tended to cost less, on average. This
seems surprising till you consider that cars that accelerate fast tend to be less fuel efficient and have lower mileage. As the previous scatter plot
showed, those were also the cars that tended to cost more.
hybrid.scatter('mpg', 'msrp')
Along with the negative association, the scatter diagram of price versus efficiency shows a non‑linear relation between the two variables. The points
appear to be clustered around a curve, not around a straight line.
If we restrict the data just to the SUV class, however, the association between price and efficiency is still negative but the relation appears to be more
linear. The relation between the price and acceleration of SUV’s also shows a linear trend, but with a positive slope.
suv = hybrid.where('class', 'SUV')
suv.scatter('mpg', 'msrp')
suv.scatter('acceleration', 'msrp')
You will have noticed that we can derive useful information from the general orientation and shape of a scatter diagram even without paying attention
to the units in which the variables were measured.
Indeed, we could plot all the variables in standard units and the plots would look the same. This gives us a way to compare the degree of linearity in
two scatter diagrams.
Recall that in an earlier section we defined the function standard_units to convert an array of numbers to standard units.
def standard_units(any_numbers):
    "Convert any array of numbers to standard units."
    return (any_numbers - np.mean(any_numbers))/np.std(any_numbers)
We can use this function to re‑draw the two scatter diagrams for SUVs, with all the variables measured in standard units.
Table().with_columns(
'mpg (standard units)', standard_units(suv.column('mpg')),
'msrp (standard units)', standard_units(suv.column('msrp'))
).scatter(0, 1)
plots.xlim(-3, 3)
plots.ylim(-3, 3);
Table().with_columns(
'acceleration (standard units)', standard_units(suv.column('acceleration')),
'msrp (standard units)', standard_units(suv.column('msrp'))
).scatter(0, 1)
plots.xlim(-3, 3)
plots.ylim(-3, 3);
The associations that we see in these figures are the same as those we saw before. Also, because the two scatter diagrams are now drawn on exactly
the same scale, we can see that the linear relation in the second diagram is a little more fuzzy than in the first.
We will now define a measure that uses standard units to quantify the kinds of association that we have seen.
r_scatter(0.25)
r_scatter(0)
r_scatter(-0.55)
x y
1 2
2 3
3 1
4 5
5 2
6 7
Based on the scatter diagram, we expect that \(r\) will be positive but not equal to 1.
t.scatter(0, 1, s=30, color='red')
r = np.mean(t_product.column(4))
r
0.6174163971897709
Let’s call the function on the x and y columns of t. The function returns the same answer to the correlation between \(x\) and \(y\) as we got by direct
application of the formula for \(r\).
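The definition of the correlation function does not appear until later in this excerpt. For reference, here is a minimal sketch consistent with the computation of \(r\) above: the mean of the products of the two variables, each measured in standard units.
def correlation(t, label_x, label_y):
    x_in_standard_units = standard_units(t.column(label_x))
    y_in_standard_units = standard_units(t.column(label_y))
    return np.mean(x_in_standard_units * y_in_standard_units)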
correlation(t, 'x', 'y')
0.6174163971897709
As we noticed, the order in which the variables are specified doesn’t matter.
correlation(t, 'y', 'x')
0.6174163971897709
Calling correlation on columns of the table suv gives us the correlation between price and mileage as well as the correlation between price and
acceleration.
correlation(suv, 'mpg', 'msrp')
-0.6667143635709919
0.0
1.0
outlier = Table().with_columns(
'x', make_array(1, 2, 3, 4, 5),
'y', make_array(1, 2, 3, 4, 0)
)
outlier.scatter('x', 'y', s=30, color='r')
0.0
0.9847558411067434
That’s an extremely high correlation. But it’s important to note that this does not reflect the strength of the relation between the Math and Critical
Reading scores of students.
The data consist of average scores in each state. But states don’t take tests – students do. The data in the table have been created by lumping all the
students in each state into a single point at the average values of the two variables in that state. But not all students in the state will be at that point,
as students vary in their performance. If you plot a point for each student instead of just one for each state, there will be a cloud of points around
each point in the figure above. The overall picture will be more fuzzy. The correlation between the Math and Critical Reading scores of the students
will be lower than the value calculated based on state averages.
Correlations based on aggregates and averages are called ecological correlations and are frequently reported. As we have just seen, they must be
interpreted with care.
heights = Table().with_columns(
'MidParent', original.column('midparentHeight'),
'Child', original.column('childHeight')
)
def predict_child(mpht):
    """Return a prediction of the height of a child
    whose parents have a midparent height of mpht."""
    close_points = heights.where('MidParent', are.between(mpht - 0.5, mpht + 0.5))
    return close_points.column('Child').mean()
heights_with_predictions = heights.with_column(
'Prediction', heights.apply(predict_child, 'MidParent')
)
heights_with_predictions.scatter('MidParent')
15.2.1. Measuring in Standard Units
Let’s see if we can find a way to identify this line. First, notice that linear association doesn’t depend on the units of measurement – we might as well
measure both variables in standard units.
def standard_units(xyz):
    "Convert any array of numbers to standard units."
    return (xyz - np.mean(xyz))/np.std(xyz)
heights_SU = Table().with_columns(
'MidParent SU', standard_units(heights.column('MidParent')),
'Child SU', standard_units(heights.column('Child'))
)
heights_SU
MidParent SU Child SU
3.45465 1.80416
3.45465 0.686005
3.45465 0.630097
3.45465 0.630097
2.47209 1.88802
2.47209 1.60848
2.47209 ‑0.348285
2.47209 ‑0.348285
1.58389 1.18917
1.58389 0.350559
... (924 rows omitted)
On this scale, we can calculate our predictions exactly as before. But first we have to figure out how to convert our old definition of “close” points to a
value on the new scale. We had said that midparent heights were “close” if they were within 0.5 inches of each other. Since standard units measure
distances in units of SDs, we have to figure out how many SDs of midparent height correspond to 0.5 inches.
One SD of midparent heights is about 1.8 inches. So 0.5 inches is about 0.28 SDs.
sd_midparent = np.std(heights.column(0))
sd_midparent
1.8014050969207571
0.5/sd_midparent
0.277561110965367
We are now ready to modify our prediction function to make predictions on the standard units scale. All that has changed is that we are using the
table of values in standard units, and defining “close” as above.
def predict_child_su(mpht_su):
    """Return a prediction of the height (in standard units) of a child
    whose parents have a midparent height of mpht_su in standard units.
    """
    close = 0.5/sd_midparent
    close_points = heights_SU.where('MidParent SU', are.between(mpht_su - close, mpht_su + close))
    return close_points.column('Child SU').mean()
heights_with_su_predictions = heights_SU.with_column(
'Prediction SU', heights_SU.apply(predict_child_su, 'MidParent SU')
)
heights_with_su_predictions.scatter('MidParent SU')
This plot looks exactly like the plot drawn on the original scale. Only the numbers on the axes have changed. This confirms that we can understand
the prediction process by just working in standard units.
So the 45 degree line is not the “graph of averages.” That line is the green one shown below.
Both lines go through the origin (0, 0). The green line goes through the centers of the vertical strips (at least roughly), and is flatter than the red 45
degree line.
The slope of the 45 degree line is 1. So the slope of the green “graph of averages” line is a value that is positive but less than 1.
What value could that be? You’ve guessed it – it’s \(r\).
When \(r\) is close to 1, the scatter plot, the 45 degree line, and the regression line are all very close to each other. But for more moderate values of \
(r\), the regression line is noticeably flatter.
The slope and intercept of the regression line in original units can be derived from the diagram above.
\[ \mathbf{\mbox{slope of the regression line}} ~=~ r \cdot \frac{\mbox{SD of }y}{\mbox{SD of }x} \]
\[ \mathbf{\mbox{intercept of the regression line}} ~=~ \mbox{average of }y ~-~ \mbox{slope} \cdot \mbox{average of }x \]
The three functions below compute the correlation, slope, and intercept. All of them take three arguments: the name of the table, the label of the
column containing \(x\), and the label of the column containing \(y\).
def correlation(t, label_x, label_y):
    return np.mean(standard_units(t.column(label_x))*standard_units(t.column(label_y)))
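The slope and intercept functions mentioned above, and the fit helper used later in the chapter, do not appear in this excerpt. Here is a minimal sketch written to match the formulas just given; treat these definitions as assumptions based on how the functions are called below.
def slope(t, label_x, label_y):
    r = correlation(t, label_x, label_y)
    return r * np.std(t.column(label_y)) / np.std(t.column(label_x))

def intercept(t, label_x, label_y):
    return np.mean(t.column(label_y)) - slope(t, label_x, label_y) * np.mean(t.column(label_x))

def fit(t, label_x, label_y):
    # The fitted values: the height of the regression line at each value of x
    return slope(t, label_x, label_y) * t.column(label_x) + intercept(t, label_x, label_y)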
0.32094989606395924
We can also find the equation of the regression line for predicting the child’s height based on midparent height.
family_slope = slope(heights, 'MidParent', 'Child')
family_intercept = intercept(heights, 'MidParent', 'Child')
family_slope, family_intercept
(0.637360896969479, 22.63624054958975)
67.55743656799862
Our original prediction, created by taking the average height of all children who had midparent heights close to 70.48, came out to be pretty close:
67.63 inches compared to the regression line’s prediction of 67.55 inches.
heights_with_predictions.where('MidParent', are.equal_to(70.48)).show(3)
It is easier to see the line in the graph below than in the one above.
heights.with_column('Fitted', fit(heights, 'MidParent', 'Child')).scatter('MidParent')
Another way to draw the line is to use the option fit_line=True with the Table method scatter.
heights.scatter('MidParent', fit_line=True)
15.2.8. Units of Measurement of the Slope
The slope is a ratio, and it is worth taking a moment to study the units in which it is measured. Our example comes from the familiar dataset about
mothers who gave birth in a hospital system. The scatter plot of pregnancy weights versus heights looks like a football that has been used in one
game too many, but it’s close enough to a football that we can justify putting our fitted line through it. In later sections we will see how to make such
justifications more formal.
baby = Table.read_table(path_data + 'baby.csv')
3.572846259275056
The slope of the regression line is 3.57 pounds per inch. This means that for two women who are 1 inch apart in height, our prediction of pregnancy
weight will differ by 3.57 pounds. For a woman who is 2 inches taller than another, our prediction of pregnancy weight will be
\[ 2 \times 3.57 ~=~ 7.14 \]
pounds more than our prediction for the shorter woman.
Notice that the successive vertical strips in the scatter plot are one inch apart, because the heights have been rounded to the nearest inch. Another
way to think about the slope is to take any two consecutive strips (which are necessarily 1 inch apart), corresponding to two groups of women who
are separated by 1 inch in height. The slope of 3.57 pounds per inch means that the average pregnancy weight of the taller group is about 3.57
pounds more than that of the shorter group.
15.2.9. Example
Suppose that our goal is to use regression to estimate the height of a basset hound based on its weight, using a sample that looks consistent with the
regression model. Suppose the observed correlation \(r\) is 0.5, and that the summary statistics for the two variables are as in the table below:
average SD
height 14 inches 2 inches
weight 50 pounds 5 pounds
To calculate the equation of the regression line, we need the slope and the intercept.
\[ \mbox{slope} ~=~ \frac{r \cdot \mbox{SD of }y}{\mbox{SD of }x} ~=~ \frac{0.5 \cdot 2 \mbox{ inches}}{5 \mbox{ pounds}} ~=~ 0.2 ~\mbox{inches per pound} \]
\[ \mbox{intercept} ~=~ \mbox{average of }y ~-~ \mbox{slope} \cdot \mbox{average of }x ~=~ 14 \mbox{ inches} ~-~ 0.2 \mbox{ inches per pound} \cdot 50 \mbox{ pounds} ~=~ 4 \mbox{ inches} \]
The equation of the regression line allows us to calculate the estimated height, in inches, based on a given weight in pounds:
\[ \mbox{estimated height} ~=~ 0.2 \cdot \mbox{given weight} ~+~ 4 \]
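For example, a basset hound that weighs 60 pounds would have an estimated height of \(0.2 \cdot 60 + 4 = 16\) inches.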
The slope of the line measures the increase in the estimated height per unit increase in weight. The slope is positive, and it is important to note that
this does not mean that we think basset hounds get taller if they put on weight. The slope reflects the difference in the average heights of two groups
of dogs that are 1 pound apart in weight. Specifically, consider a group of dogs whose weight is \(w\) pounds, and the group whose weight is \(w+1\)
pounds. The second group is estimated to be 0.2 inches taller, on average. This is true for all values of \(w\) in the sample.
In general, the slope of the regression line can be interpreted as the average increase in \(y\) per unit increase in \(x\). Note that if the slope is
negative, then for every unit increase in \(x\), the average of \(y\) decreases.
15.2.10. Endnote
Even though we won’t establish the mathematical basis for the regression equation, we can see that it gives pretty good predictions when the scatter
plot is football shaped. It is a surprising mathematical fact that no matter what the shape of the scatter plot, the same equation gives the “best”
among all straight lines. That’s the topic of the next section.
Periods Characters
189 21759
188 22148
231 20558
... (44 rows omitted)
little_women.scatter('Periods', 'Characters')
To explore the data, we will need to use the functions correlation, slope, intercept, and fit defined in the previous section.
correlation(little_women, 'Periods', 'Characters')
0.9229576895854816
The scatter plot is remarkably close to linear, and the correlation is more than 0.92.
Corresponding to each point on the scatter plot, there is an error of prediction calculated as the actual value minus the predicted value. It is the
vertical distance between the point and the line, with a negative sign if the point is below the line.
actual = lw_with_predictions.column('Characters')
predicted = lw_with_predictions.column('Linear Prediction')
errors = actual - predicted
lw_with_predictions.with_column('Error', errors)
Periods Characters Linear Prediction Error
189 21759 21183.6 575.403
188 22148 21096.6 1051.38
231 20558 24836.7 ‑4278.67
195 25526 21705.5 3820.54
255 23395 26924.1 ‑3529.13
140 14622 16921.7 ‑2299.68
131 14431 16138.9 ‑1707.88
214 22476 23358 ‑882.043
337 33767 34056.3 ‑289.317
185 18508 20835.7 ‑2327.69
... (37 rows omitted)
We can use slope and intercept to calculate the slope and intercept of the fitted line. The graph below shows the line (in light blue). The errors
corresponding to four of the points are shown in red. There is nothing special about those four points. They were just chosen for clarity of the display.
The function lw_errors takes a slope and an intercept (in that order) as its arguments and draws the figure.
lw_reg_slope = slope(little_women, 'Periods', 'Characters')
lw_reg_intercept = intercept(little_women, 'Periods', 'Characters')
Had we used a different line to create our estimates, the errors would have been different. The graph below shows how big the errors would be if we
were to use another line for estimation. The second graph shows large errors obtained by using a line that is downright silly.
lw_errors(50, 10000)
lw_errors(-100, 50000)
lw_rmse(50, 10000)
lw_rmse(-100, 50000)
Bad lines have big values of rmse, as expected. But the rmse is much smaller if we choose a slope and intercept close to those of the regression line.
lw_rmse(90, 4000)
Root mean squared error: 2715.5391063834586
Here is the root mean squared error corresponding to the regression line. By a remarkable fact of mathematics, no other line can beat this one.
The regression line is the unique straight line that minimizes the mean squared error of estimation among all straight lines.
lw_rmse(lw_reg_slope, lw_reg_intercept)
The proof of this statement requires abstract mathematics that is beyond the scope of this course. On the other hand, we do have a powerful tool –
Python – that performs large numerical computations with ease. So we can use Python to confirm that the regression line minimizes the mean
squared error.
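The definition of lw_mse does not appear in this excerpt. A minimal sketch consistent with how it is used below: its arguments are any slope and any intercept, and it returns the mean squared error made by the corresponding line on the Little Women data.
def lw_mse(any_slope, any_intercept):
    x = little_women.column('Periods')
    y = little_women.column('Characters')
    fitted = any_slope * x + any_intercept
    return np.mean((y - fitted) ** 2)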
Let’s check that lw_mse gets the right answer for the root mean squared error of the regression line. Remember that lw_mse returns the mean
squared error, so we have to take the square root to get the rmse.
lw_mse(lw_reg_slope, lw_reg_intercept)**0.5
2701.690785311856
You can confirm that lw_mse returns the correct value for other slopes and intercepts too. For example, here is the rmse of the extremely bad line that
we tried earlier.
lw_mse(-100, 50000)**0.5
16710.11983735375
And here is the rmse for a line that is close to the regression line.
lw_mse(90, 4000)**0.5
2715.5391063834586
If we experiment with different values, we can find a low‑error slope and intercept through trial and error, but that would take a while. Fortunately,
there is a Python function that does all the trial and error for us.
The minimize function can be used to find the arguments of a function for which the function returns its minimum value. Python uses a similar trial‑
and‑error approach, following the changes that lead to incrementally lower output values.
The argument of minimize is a function that itself takes numerical arguments and returns a numerical value. For example, the function lw_mse takes a
numerical slope and intercept as its arguments and returns the corresponding mse.
The call minimize(lw_mse) returns an array consisting of the slope and the intercept that minimize the mse. These minimizing values are excellent
approximations arrived at by intelligent trial‑and‑error, not exact values based on formulas.
best = minimize(lw_mse)
best
These values are the same as the values we calculated earlier by using the slope and intercept functions. We see small deviations due to the
inexact nature of minimize, but the values are essentially the same.
print("slope from formula: ", lw_reg_slope)
print("slope from minimize: ", best.item(0))
print("intercept from formula: ", lw_reg_intercept)
print("intercept from minimize: ", best.item(1))
shotput
0.09834382159781997
5.959629098373952
Does it still make sense to use these formulas even though the scatter plot isn’t football shaped? We can answer this by finding the slope and
intercept of the line that minimizes the mse.
We will define the function shotput_linear_mse to take an arbitrary slope and intercept as arguments and return the corresponding mse. Then
minimize applied to shotput_linear_mse will return the best slope and intercept.
minimize(shotput_linear_mse)
array([0.09834382, 5.95962911])
These values are the same as those we got by using our formulas. To summarize:
No matter what the shape of the scatter plot, there is a unique line that minimizes the mean squared error of estimation. It is called the
regression line, and its slope and intercept are given by
\[ \mathbf{\mbox{slope of the regression line}} ~=~ r \cdot \frac{\mbox{SD of }y}{\mbox{SD of }x} \]
\[ \mathbf{\mbox{intercept of the regression line}} ~=~ \mbox{average of }y ~-~ \mbox{slope} \cdot \mbox{average of }x \]
fitted = fit(shotput, 'Weight Lifted', 'Shot Put Distance')
shotput.with_column('Best Straight Line', fitted).scatter('Weight Lifted')
15.4.1. Nonlinear Regression
The graph above reinforces our earlier observation that the scatter plot is a bit curved. So it is better to fit a curve than a straight line. The study
postulated a quadratic relation between the weight lifted and the shot put distance. So let’s use quadratic functions as our predictors and see if we
can find the best one.
We have to find the best quadratic function among all quadratic functions, instead of the best straight line among all straight lines. The method of
least squares allows us to do this.
The mathematics of this minimization is complicated and not easy to see just by examining the scatter plot. But numerical minimization is just as easy
as it was with linear predictors! We can get the best quadratic predictor by once again using minimize. Let’s see how this works.
Recall that a quadratic function has the form
\[ f(x) ~=~ ax^2 + bx + c \]
for constants \(a\), \(b\), and \(c\).
To find the best quadratic function to predict distance based on weight lifted, using the criterion of least squares, we will first write a function that
takes the three constants as its arguments, calculates the fitted values by using the quadratic function above, and then returns the mean squared
error.
The function is called shotput_quadratic_mse. Notice that the definition is analogous to that of lw_mse, except that the fitted values are based on a
quadratic function instead of linear.
def shotput_quadratic_mse(a, b, c):
    x = shotput.column('Weight Lifted')
    y = shotput.column('Shot Put Distance')
    fitted = a*(x**2) + b*x + c
    return np.mean((y - fitted) ** 2)
We can now use minimize just as before to find the constants that minimize the mean squared error.
best = minimize(shotput_quadratic_mse)
best
Our prediction of the shot put distance for an athlete who lifts \(x\) kilograms is about
\[ -0.00104x^2 ~+~ 0.2827x - 1.5318 \]
meters. For example, if the athlete can lift 100 kilograms, the predicted distance is 16.33 meters. On the scatter plot, that’s near the center of a
vertical strip around 100 kilograms.
(-0.00104)*(100**2) + 0.2827*100 - 1.5318
16.3382
Here are the predictions for all the values of Weight Lifted. You can see that they go through the center of the scatter plot, to a rough
approximation.
x = shotput.column(0)
shotput_fit = best.item(0)*(x**2) + best.item(1)*x + best.item(2)
Note: We fit a quadratic here because it was suggested in the original study. But it is worth noting that at the rightmost end of the graph, the
quadratic curve appears to be close to peaking, after which the curve will start going downwards. So we might not want to use this model for new
athletes who can lift weights much higher than those in our data set.
Continuing our example of estimating the heights of adult children (the response) based on the midparent height (the predictor), let us calculate the
fitted values and the residuals.
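The helper function residual used below is not shown in this excerpt. A minimal sketch, consistent with its use here (the observed value minus the fitted value) and assuming the fit function from the previous section:
def residual(table, x, y):
    return table.column(y) - fit(table, x, y)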
heights = heights.with_columns(
'Fitted Value', fit(heights, 'MidParent', 'Child'),
'Residual', residual(heights, 'MidParent', 'Child')
)
heights
MidParent Child Fitted Value Residual
75.43 73.2 70.7124 2.48763
75.43 69.2 70.7124 ‑1.51237
75.43 69 70.7124 ‑1.71237
75.43 69 70.7124 ‑1.71237
73.66 73.5 69.5842 3.91576
73.66 72.5 69.5842 2.91576
73.66 65.5 69.5842 ‑4.08424
73.66 65.5 69.5842 ‑4.08424
72.06 71 68.5645 2.43553
72.06 68 68.5645 ‑0.564467
... (924 rows omitted)
When there are so many variables to work with, it is always helpful to start with visualization. The function scatter_fit draws the scatter plot of the
data, as well as the regression line.
def scatter_fit(table, x, y):
    table.scatter(x, y, s=15)
    plots.plot(table.column(x), fit(table, x, y), lw=4, color='gold')
    plots.xlabel(x)
    plots.ylabel(y)
A residual plot can be drawn by plotting the residuals against the predictor variable. The function residual_plot does just that.
def residual_plot(table, x, y):
    x_array = table.column(x)
    t = Table().with_columns(
        x, x_array,
        'residuals', residual(table, x, y)
    )
    t.scatter(x, 'residuals', color='r')
    xlims = make_array(min(x_array), max(x_array))
    plots.plot(xlims, make_array(0, 0), color='darkblue', lw=4)
    plots.title('Residual Plot')
Our data are a dataset on the age and length of dugongs, which are marine mammals related to manatees and sea cows (image from Wikimedia
Commons). The data are in a table called dugong. Age is measured in years and length in meters. Because dugongs tend not to keep track of their
birthdays, ages are estimated based on variables such as the condition of their teeth.
dugong = Table.read_table(path_data + 'dugongs.csv')
dugong = dugong.move_to_start('Length')
dugong
Length Age
1.8 1
1.85 1.5
1.87 1.5
1.77 1.5
2.02 2.5
2.27 4
2.15 5
2.26 5
2.35 7
2.47 8
... (17 rows omitted)
If we could measure the length of a dugong, what could we say about its age? Let’s examine what our data say. Here is a regression of age (the
response) on length (the predictor). The correlation between the two variables is substantial, at 0.83.
correlation(dugong, 'Length', 'Age')
0.8296474554905714
High correlation notwithstanding, the plot shows a curved pattern that is much more visible in the residual plot.
regression_diagnostic_plots(dugong, 'Length', 'Age')
While you can spot the non‑linearity in the original scatter, it is more clearly evident in the residual plot.
At the low end of the lengths, the residuals are almost all positive; then they are almost all negative; then positive again at the high end of lengths. In
other words the regression estimates have a pattern of being too high, then too low, then too high. That means it would have been better to use a
curve instead of a straight line to estimate the ages.
When a residual plot shows a pattern, there may be a non‑linear relation between the variables.
-2.719689807647064e-16
That doesn’t look like zero, but it is a tiny number that is 0 apart from rounding error in the computation. Here it is again, correct to 10 decimal
places. The minus sign is an artifact of that rounding.
round(correlation(heights, 'MidParent', 'Residual'), 10)
-0.0
dugong = dugong.with_columns(
'Fitted Value', fit(dugong, 'Length', 'Age'),
'Residual', residual(dugong, 'Length', 'Age')
)
round(correlation(dugong, 'Length', 'Residual'), 10)
0.0
0.0
The same is true of the average of the residuals in the regression of the age of dugongs on their length. The mean of the residuals is 0, apart from
rounding error.
round(np.mean(dugong.column('Residual')), 10)
0.0
3.3880799163953426
3.388079916395342
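The two matching values above are consistent with comparing the SD of the residuals in the heights regression to \(\sqrt{1-r^2}\) times the SD of the children's heights. A sketch of that check, assuming the heights table with its Residual column from the earlier cell:
r_heights = correlation(heights, 'MidParent', 'Child')
np.std(heights.column('Residual')), np.sqrt(1 - r_heights**2) * np.std(heights.column('Child'))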
The same is true for the regression of mileage on acceleration of hybrid cars. The correlation \(r\) is negative (about ‑0.5), but \(r^2\) is positive and
therefore \(\sqrt{1‑r^2}\) is a fraction.
r = correlation(hybrid, 'acceleration', 'mpg')
r
-0.5060703843771186
hybrid = hybrid.with_columns(
'fitted mpg', fit(hybrid, 'acceleration', 'mpg'),
'residual', residual(hybrid, 'acceleration', 'mpg')
)
np.std(hybrid.column('residual')), np.sqrt(1 - r**2)*np.std(hybrid.column('mpg'))
(9.43273683343029, 9.43273683343029)
Now let us see how the SD of the residuals is a measure of how good the regression is. Remember that the average of the residuals is 0. Therefore
the smaller the SD of the residuals is, the closer the residuals are to 0. In other words, if the SD of the residuals is small, the overall size of the errors
in regression is small.
The extreme cases are when \(r=1\) or \(r=-1\). In both cases, \(\sqrt{1-r^2} = 0\). Therefore the residuals have an average of 0 and an SD of 0 as well, and therefore the residuals are all equal to 0. The regression line does a perfect job of estimation. As we saw earlier in this chapter, if \(r = \pm 1\), the scatter plot is a perfect straight line and is the same as the regression line, so indeed there is no error in the regression estimate.
But usually \(r\) is not at the extremes. If \(r\) is neither \(\pm 1\) nor 0, then \(\sqrt{1-r^2}\) is a proper fraction, and the rough overall size of the error of the regression estimate is somewhere between 0 and the SD of \(y\).
The worst case is when \(r = 0\). Then \(\sqrt{1-r^2} = 1\), and the SD of the residuals is equal to the SD of \(y\). This is consistent with the observation
that if \(r=0\) then the regression line is a flat line at the average of \(y\). In this situation, the root mean square error of regression is the root mean
squared deviation from the average of \(y\), which is the SD of \(y\). In practical terms, if \(r = 0\) then there is no linear association between the two
variables, so there is no benefit in using linear regression.
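To see why the flat line sits at the average of \(y\) when \(r = 0\), recall the formulas for the slope and intercept of the regression line:
\[ \mbox{slope} ~=~ r \cdot \frac{\mbox{SD of } y}{\mbox{SD of } x} ~=~ 0 \]
\[ \mbox{intercept} ~=~ \mbox{average of } y - \mbox{slope} \cdot \mbox{average of } x ~=~ \mbox{average of } y \]
So every regression estimate is just the average of \(y\), every residual is a deviation of \(y\) from its average, and the root mean square of those deviations is the SD of \(y\).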
The fitted values range from about 64 to about 71, whereas the heights of all the children are quite a bit more variable, ranging from about 55 to 80.
To verify the result numerically, we just have to calculate both sides of the identity.
correlation(heights, 'MidParent', 'Child')
0.32094989606395924
Here is the ratio of the SD of the fitted values to the SD of the observed heights of the children:
np.std(heights.column('Fitted Value'))/np.std(heights.column('Child'))
0.32094989606395957
The same is true for the hybrid data, where the correlation between acceleration and mpg is negative:
-0.5060703843771186
The ratio of the SD of the fitted values to the SD of the observed values of mpg is the absolute value of the correlation:
np.std(hybrid.column('fitted mpg'))/np.std(hybrid.column('mpg'))
0.5060703843771186
In reality, of course, we will never see the true line. What the simulation shows is that if the regression model looks plausible, and if we have a large sample, then the regression line is a good approximation to the true line.
16.2. Inference for the True Slope
Our simulations show that if the regression model holds and the sample size is large, then the regression line is likely to be close to the true line. This
allows us to estimate the slope of the true line.
We will use our familiar sample of mothers and their newborn babies to develop a method of estimating the slope of the true line. First, let’s see if we
believe that the regression model is an appropriate set of assumptions for describing the relation between birth weight and the number of gestational
days.
scatter_fit(baby, 'Gestational Days', 'Birth Weight')
0.40754279338885108
By and large, the scatter looks fairly evenly distributed around the line, though there are some points that are scattered on the outskirts of the main
cloud. The correlation is 0.4 and the regression line has a positive slope.
Does this reflect the fact that the true line has a positive slope? To answer this question, let us see if we can estimate the true slope. We certainly
have one estimate of it: the slope of our regression line. That’s about 0.47 ounces per day.
slope(baby, 'Gestational Days', 'Birth Weight')
0.46655687694921522
But had the scatter plot come out differently, the regression line would have been different and might have had a different slope. How do we figure
out how different the slope might have been?
We need another sample of points, so that we can draw the regression line through the new scatter plot and find its slope. But where will we get another sample?
You have guessed it – we will bootstrap our original sample. That will give us a bootstrapped scatter plot, through which we can draw a regression
line.
(0.38169251837495338, 0.55839374914417184)
An approximate 95% confidence interval for the true slope extends from about 0.38 ounces per day to about 0.56 ounces per day.
The function bootstrap_slope collects all the steps of this process. Its arguments are the name of the table, the labels of the predictor and response variables, and the desired number of bootstrap repetitions.
def bootstrap_slope(table, x, y, repetitions):
    # Bootstrap the scatter plot `repetitions` times and record each bootstrap slope
    slopes = make_array()
    for i in np.arange(repetitions):
        slopes = np.append(slopes, slope(table.sample(), x, y))

    # Find the endpoints of the 95% confidence interval for the true slope
    left = percentile(2.5, slopes)
    right = percentile(97.5, slopes)

    # Slope of the regression line from the original sample
    observed_slope = slope(table, x, y)

    # Display results
    Table().with_column('Bootstrap Slopes', slopes).hist(bins=20)
    plots.plot(make_array(left, right), make_array(0, 0), color='yellow', lw=8);
    print('Slope of regression line:', observed_slope)
    print('Approximate 95%-confidence interval for the true slope:')
    print(left, right)
When we call bootstrap_slope to find a confidence interval for the true slope when the response variable is birth weight and the predictor is
gestational days, we get an interval very close to the one we obtained earlier: approximately 0.38 ounces per day to 0.56 ounces per day.
bootstrap_slope(baby, 'Gestational Days', 'Birth Weight', 5000)
Slope of regression line: 0.466556876949
Approximate 95%-confidence interval for the true slope:
0.380373502308 0.558711930778
Now that we have a function that automates our process of estimating the slope of the true line in a regression model, we can use it on other
variables as well.
For example, let’s examine the relation between birth weight and the mother’s height. Do taller women tend to have heavier babies?
The regression model seems reasonable, based on the scatter plot, but the correlation is not high. It’s just about 0.2.
scatter_fit(baby, 'Maternal Height', 'Birth Weight')
0.20370417718968034
As before, we can use bootstrap_slope to estimate the slope of the true line in the regression model.
bootstrap_slope(baby, 'Maternal Height', 'Birth Weight', 5000)
To examine whether the true line could be flat, consider the relation between birth weight and the mother’s age. The slope of the regression line of birth weight on maternal age is:
0.085007669415825132
Though the slope is positive, it’s pretty small. The regression line is so close to flat that it raises the question of whether the true line is flat.
scatter_fit(baby, 'Maternal Age', 'Birth Weight')
We can use bootstrap_slope to estimate the slope of the true line. The calculation shows that an approximate 95% bootstrap confidence interval for
the true slope has a negative left end point and a positive right end point – in other words, the interval contains 0.
bootstrap_slope(baby, 'Maternal Age', 'Birth Weight', 5000)
The height of the point where the red line hits the regression line is the fitted value at 300 gestational days.
The function fitted_value computes this height. Like the functions correlation, slope, and intercept, its arguments include the name of the table
and the labels of the \(x\) and \(y\) columns. But it also requires a fourth argument, which is the value of \(x\) at which the estimate will be made.
def fitted_value(table, x, y, given_x):
    a = slope(table, x, y)
    b = intercept(table, x, y)
    return a * given_x + b
The fitted value at 300 gestational days is about 129.2 ounces. In other words, for a pregnancy that has a duration of 300 gestational days, our
estimate for the baby’s weight is about 129.2 ounces.
fit_300 = fitted_value(baby, 'Gestational Days', 'Birth Weight', 300)
fit_300
129.2129241703143
The function bootstrap_prediction carries out the corresponding process for predictions: in each repetition it bootstraps the original scatter plot and computes the fitted value at the specified value new_x of the predictor, and then it displays the results.
def bootstrap_prediction(table, x, y, new_x, repetitions):
    # Bootstrap the scatter plot and record the fitted value at new_x each time
    predictions = make_array()
    for i in np.arange(repetitions):
        predictions = np.append(predictions, fitted_value(table.sample(), x, y, new_x))

    # Endpoints of the approximate 95% interval, and the prediction from the original sample
    left = percentile(2.5, predictions)
    right = percentile(97.5, predictions)
    original = fitted_value(table, x, y, new_x)

    # Display results
    Table().with_column('Prediction', predictions).hist(bins=20)
    plots.xlabel('predictions at x='+str(new_x))
    plots.plot(make_array(left, right), make_array(0, 0), color='yellow', lw=8);
    print('Height of regression line at x='+str(new_x)+':', original)
    print('Approximate 95%-confidence interval:')
    print(left, right)
bootstrap_prediction(baby, 'Gestational Days', 'Birth Weight', 300, 5000)
The figure above shows a bootstrap empirical histogram of the predicted birth weight of a baby at 300 gestational days, based on 5,000 repetitions
of the bootstrap process. The empirical distribution is roughly normal.
An approximate 95% prediction interval has been constructed by taking the “middle 95%” of the predictions, that is, the interval from the 2.5th percentile to the 97.5th percentile of the predictions. The interval ranges from about 127 to about 131. The prediction based on the original sample was about 129, which is close to the center of the interval.
An interval constructed in the same way for predictions at 285 gestational days is narrower than the prediction interval at 300 gestational days. Let us investigate the reason for this.
The mean number of gestational days is about 279 days:
np.mean(baby.column('Gestational Days'))
279.1013628620102
So 285 is nearer to the center of the distribution than 300 is. Typically, the regression lines based on the bootstrap samples are closer to each other
near the center of the distribution of the predictor variable. Therefore all of the predicted values are closer together as well. This explains the
narrower width of the prediction interval.
You can see this in the figure below, which shows predictions at \(x = 285\) and \(x = 300\) for each of ten bootstrap replications. Typically, the lines
are farther apart at \(x = 300\) than at \(x = 285\), and therefore the predictions at \(x = 300\) are more variable.
17. Classification
David Wagner is the primary author of this chapter.
Machine learning is a class of techniques for automatically finding patterns in data and using them to draw inferences or make predictions. You have
already seen linear regression, which is one kind of machine learning. This chapter introduces a new one: classification.
Classification is about learning how to make predictions from past examples. We are given some examples where we have been told what the correct
prediction was, and we want to learn from those examples how to make good predictions in the future. Here are a few applications where
classification is used in practice:
For each order Amazon receives, Amazon would like to predict: is this order fraudulent? They have some information about each order (e.g.,
its total value, whether the order is being shipped to an address this customer has used before, whether the shipping address is the same as
the credit card holder’s billing address). They have lots of data on past orders, and they know which of those past orders were fraudulent and
which weren’t. They want to learn patterns that will help them predict, as new orders arrive, whether those new orders are fraudulent.
Online dating sites would like to predict: are these two people compatible? Will they hit it off? They have lots of data on which matches
they’ve suggested to their customers in the past, and they have some idea which ones were successful. As new customers sign up, they’d like
to make predictions about who might be a good match for them.
Doctors would like to know: does this patient have cancer? Based on the measurements from some lab test, they’d like to be able to predict
whether the particular patient has cancer. They have lots of data on past patients, including their lab measurements and whether they ultimately
developed cancer, and from that, they’d like to try to infer what measurements tend to be characteristic of cancer (or non‑cancer) so they can
diagnose future patients accurately.
Politicians would like to predict: are you going to vote for them? This will help them focus fundraising efforts on people who are likely to
support them, and focus get‑out‑the‑vote efforts on voters who will vote for them. Public databases and commercial databases have a lot of
information about most people: e.g., whether they own a home or rent; whether they live in a rich neighborhood or poor neighborhood; their
interests and hobbies; their shopping habits; and so on. And political campaigns have surveyed some voters and found out who they plan to
vote for, so they have some examples where the correct answer is known. From this data, the campaigns would like to find patterns that will
help them make predictions about all other potential voters.
All of these are classification tasks. Notice that in each of these examples, the prediction is a yes/no question – we call this binary classification,
because there are only two possible predictions.
In a classification task, each individual or situation where we’d like to make a prediction is called an observation. We ordinarily have many
observations. Each observation has multiple attributes, which are known (for example, the total value of the order on Amazon, or the voter’s annual
salary). Also, each observation has a class, which is the answer to the question we care about (for example, fraudulent or not, or voting for you or
not).
When Amazon is predicting whether orders are fraudulent, each order corresponds to a single observation. Each observation has several attributes:
the total value of the order, whether the order is being shipped to an address this customer has used before, and so on. The class of the observation
is either 0 or 1, where 0 means that the order is not fraudulent and 1 means that the order is fraudulent. When a customer makes a new order, we do
not observe whether it is fraudulent, but we do observe its attributes, and we will try to predict its class using those attributes.
Classification requires data. It involves looking for patterns, and to find patterns, you need data. That’s where data science comes in. In particular,
we’re going to assume that we have access to training data: a bunch of observations, where we know the class of each observation. The collection of
these pre‑classified observations is also called a training set. A classification algorithm is going to analyze the training set, and then come up with a
classifier: an algorithm for predicting the class of future observations.
Classifiers do not need to be perfect to be useful. They can be useful even if their accuracy is less than 100%. For instance, if the online dating site
occasionally makes a bad recommendation, that’s OK; their customers already expect to have to meet many people before they’ll find someone they
hit it off with. Of course, you don’t want the classifier to make too many errors — but it doesn’t have to get the right answer every single time.
ckd
Suppose Alice is a new patient who is not in the data set. If I tell you Alice’s hemoglobin level and blood glucose level, could you predict whether she
has CKD? It sure looks like it! You can see a very clear pattern here: points in the lower‑right tend to represent people who don’t have CKD, and the
rest tend to be folks with CKD. To a human, the pattern is obvious. But how can we program a computer to automatically detect patterns such as this
one?
The decision boundary is where the classifier switches from turning the red points blue to turning them gold.
Earlier, we said that we expect to get some classifications wrong, because there’s some intermingling of blue and gold points in the lower‑left.
But what about the points in the training set, that is, the points already on the scatter? Will we ever mis‑classify them?
The answer is no. Remember that 1‑nearest neighbor classification looks for the point in the training set that is nearest to the point being classified.
Well, if the point being classified is already in the training set, then its nearest neighbor in the training set is itself! And therefore it will be classified as
its own color, which will be correct because each point in the training set is already correctly colored.
In other words, if we use our training set to “test” our 1‑nearest neighbor classifier, the classifier will pass the test 100% of the time.
Mission accomplished. What a great classifier!
No, not so much. A new point in the lower‑left might easily be mis‑classified, as we noted earlier. “100% accuracy” was a nice dream while it lasted.
The lesson of this example is not to use the training set to test a classifier that is based on it.
Now let’s construct our classifier based on the points in the training sample:
training.scatter('White Blood Cell Count', 'Glucose', group='Color')
plt.xlim(-2, 6)
plt.ylim(-2, 6);
Place the test data on this graph and you can see at once that while the classifier got almost all the points right, there are some mistakes. For
example, some blue points of the test set fall in the gold region of the classifier.
Some errors notwithstanding, it looks like the classifier does fairly well on the test set. Assuming that the original sample was drawn randomly from
the underlying population, the hope is that the classifier will perform with similar accuracy on the overall population, since the test set was chosen
randomly from the original sample.
The data corresponding to the first patient is in row 0 of the table, consistent with Python’s indexing system. The Table method row accesses the row
by taking the index of the row as its argument:
ckd.row(0)
Rows have their very own data type: they are row objects. Notice how the display shows not only the values in the row but also the labels of the
corresponding columns.
Rows are in general not arrays, as their elements can be of different types. For example, some of the elements of the row above are strings (like
'abnormal') and some are numerical. So the row can’t be converted into an array.
However, rows share some characteristics with arrays. You can use item to access a particular element of a row. For example, to access the Albumin
level of Patient 0, we can look at the labels in the printout of the row above to find that it’s item 3:
ckd.row(0).item(3)
color_table = Table().with_columns(
'Class', make_array(1, 0),
'Color', make_array('darkblue', 'gold')
)
ckd = ckd.join('Class', color_table)
ckd
Class Hemoglobin Glucose Color
0 0.456884 0.133751 gold
0 1.153 ‑0.947597 gold
0 0.770138 ‑0.762223 gold
0 0.596108 ‑0.190654 gold
0 ‑0.239236 ‑0.49961 gold
0 ‑0.0304002 ‑0.159758 gold
0 0.282854 ‑0.00527964 gold
0 0.108824 ‑0.623193 gold
0 0.0740178 ‑0.515058 gold
0 0.83975 ‑0.422371 gold
... (148 rows omitted)
Here is a scatter plot of the two attributes, along with a red point corresponding to Alice, a new patient. Her value of hemoglobin is 0 (that is, at the
average) and glucose 1.1 (that is, 1.1 SDs above average).
alice = make_array(0, 1.1)
ckd.scatter('Hemoglobin', 'Glucose', group='Color')
plots.scatter(alice.item(0), alice.item(1), color='red', s=30);
To find the distance between Alice’s point and any of the other points, we only need the values of the attributes:
ckd_attributes = ckd.select('Hemoglobin', 'Glucose')
ckd_attributes
Hemoglobin Glucose
0.456884 0.133751
1.153 ‑0.947597
0.770138 ‑0.762223
0.596108 ‑0.190654
‑0.239236 ‑0.49961
‑0.0304002 ‑0.159758
0.282854 ‑0.00527964
0.108824 ‑0.623193
0.0740178 ‑0.515058
0.83975 ‑0.422371
... (148 rows omitted)
Each row consists of the coordinates of one point in our training sample. Because the rows now consist only of numerical values, it is possible to
convert them to arrays. For this, we use the function np.array, which converts any kind of sequential object, like a row, to an array. (Our old friend
make_array is for creating arrays, not for converting other kinds of sequences to arrays.)
ckd_attributes.row(3)
Row(Hemoglobin=0.5961076648232668, Glucose=-0.19065363034327712)
np.array(ckd_attributes.row(3))
This is very handy because we can now use array operations on the data in each row.
1.421664918881847
We’re going to need the distance between Alice and a bunch of points, so let’s write a function called distance that computes the distance between
any pair of points. The function will take two arrays, each containing the \((x, y)\) coordinates of a point. (Remember, those are really the Hemoglobin
and Glucose levels of a patient.)
def distance(point1, point2):
    """Returns the Euclidean distance between point1 and point2.

    Each argument is an array containing the coordinates of a point."""
    return np.sqrt(np.sum((point1 - point2)**2))
# The attributes of the patient in row 3, as an array (see the cell above)
patient3 = np.array(ckd_attributes.row(3))
distance(alice, patient3)
1.421664918881847
We have begun to build our classifier: the distance function is the first building block. Now let’s work on the next piece.
For illustration, let t be a table containing the first five rows of ckd_attributes:
t = ckd_attributes.take(np.arange(5))
t
Hemoglobin Glucose
0.456884 0.133751
1.153 ‑0.947597
0.770138 ‑0.762223
0.596108 ‑0.190654
‑0.239236 ‑0.49961
Just as an example, suppose that for each patient we want to know how unusual their most unusual attribute is. Concretely, if a patient’s hemoglobin
level is further from the average than her glucose level, we want to know how far it is from the average. If her glucose level is further from the average
than her hemoglobin level, we want to know how far that is from the average instead.
That’s the same as taking the maximum of the absolute values of the two quantities. To do this for a particular row, we can convert the row to an array
and use array operations.
def max_abs(row):
    return np.max(np.abs(np.array(row)))
max_abs(t.row(4))
0.4996102825918697
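The call above works on one row at a time. To compute the value for every row at once, we can use the Table method apply with no column labels, which applies the function to each row. The call (whose output is not shown above) looks like this:
t.apply(max_abs)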
This way of using apply will help us create the next building block of our classifier.
array([0. , 1.1])
What we need is a function that finds the distance between Alice and another point whose coordinates are contained in a row. The function distance
returns the distance between any two points whose coordinates are in arrays. We can use that to define distance_from_alice, which takes a row as
its argument and returns the distance between that row and Alice.
def distance_from_alice(row):
    """Returns distance between Alice and a row of the attributes table"""
    return distance(alice, np.array(row))
distance_from_alice(ckd_attributes.row(3))
1.421664918881847
Now we can apply the function distance_from_alice to each row of ckd_attributes, and augment the table ckd with the distances. Step 1 is
complete!
distances = ckd_attributes.apply(distance_from_alice)
ckd_with_distances = ckd.with_column('Distance from Alice', distances)
ckd_with_distances
We are well on our way to implementing our k‑nearest neighbor classifier. In the next two sections we will put it together and assess its accuracy.
17.4. Implementing the Classifier
We are now ready to implement a \(k\)‑nearest neighbor classifier based on multiple attributes. We have used only two attributes so far, for ease of
visualization. But usually predictions will be based on many attributes. Here is an example that shows how multiple attributes can be better than pairs.
There does seem to be a pattern, but it’s a pretty complex one. Nonetheless, the \(k\)-nearest neighbors classifier can still be used and will effectively “discover” the pattern. This illustrates how powerful machine learning can be: it can effectively take advantage of even patterns that we would not have anticipated, or that we would not have thought to “program into” the computer.
Let’s use this on a new dataset. The table wine contains the chemical composition of 178 different Italian wines. The classes are the grape species,
called cultivars. There are three classes but let’s just see whether we can tell Class 1 apart from the other two.
wine = Table.read_table(path_data + 'wine.csv')

# Convert Class to a binary indicator: 1 for Class 1, 0 for the other two classes
def is_one(x):
    if x == 1:
        return 1
    else:
        return 0

wine = wine.with_column('Class', wine.apply(is_one, 'Class'))
# Table of attributes only, used for distance computations below
wine_attributes = wine.drop('Class')
wine
distance(np.array(wine_attributes.row(0)), np.array(wine_attributes.row(1)))
31.265012394048398
The last wine in the table is of Class 0. Its distance from the first wine is:
distance(np.array(wine_attributes.row(0)), np.array(wine_attributes.row(177)))
506.05936766351834
That’s quite a bit bigger! Let’s do some visualization to see if Class 1 really looks different from Class 0.
wine_with_colors = wine.join('Class', color_table)
Let’s see if we can implement a classifier based on all of the attributes. After that, we’ll see how accurate it is.
def majority(topkclasses):
    """Return 1 if the majority of the Class values is 1, and 0 otherwise."""
    ones = topkclasses.where('Class', are.equal_to(1)).num_rows
    zeros = topkclasses.where('Class', are.equal_to(0)).num_rows
    if ones > zeros:
        return 1
    else:
        return 0
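The classify function called below is not defined above. Here is a minimal sketch of how it might be put together, assuming a helper closest that returns the k rows of the training table nearest to the example, measured over every column except the given output label (the same kind of helper is used later in this chapter for nearest neighbor regression):
def row_distance(row1, row2):
    """Euclidean distance between two rows of attribute values."""
    return distance(np.array(row1), np.array(row2))

def closest(training, example, k, output):
    """The k rows of training closest to example, ignoring the output column."""
    dists = training.drop(output).apply(lambda row: row_distance(row, example))
    return training.with_column('Distance', dists).sort('Distance').take(np.arange(k))

def classify(training, example, k):
    """Classify example by the majority class among its k nearest neighbors in training."""
    return majority(closest(training, example, k, 'Class'))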
Let’s see how this works on our wine data. We’ll just take the first wine and find its five nearest neighbors among all the wines. Remember that since
this wine is part of the dataset, it is its own nearest neighbor. So we should expect to see it at the top of the list, followed by four others.
First let’s extract its attributes:
special_wine = wine.drop('Class').row(0)
classify(wine, special_wine, 5)
If we change special_wine to be the last one in the dataset, is our classifier able to tell that it’s in Class 0?
special_wine = wine.drop('Class').row(177)
classify(wine, special_wine, 5)
We’ll train the classifier using the 89 wines in the training set, and evaluate how well it performs on the test set. To make our lives easier, we’ll write a
function to evaluate a classifier on every wine in the test set:
def count_zero(array):
    """Counts the number of 0's in an array"""
    return len(array) - np.count_nonzero(array)
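The evaluate_accuracy function called below is not defined above either. A sketch consistent with how it is used, building on count_zero and the classify function sketched earlier:
def count_equal(array1, array2):
    """Counts the indices at which two arrays of equal length agree."""
    return count_zero(array1 - array2)

def evaluate_accuracy(training, test, k):
    """Proportion of rows in test whose class is correctly predicted by a
    k-nearest neighbor classifier built from the training table."""
    test_attributes = test.drop('Class')
    def classify_testrow(row):
        return classify(training, row, k)
    c = test_attributes.apply(classify_testrow)
    return count_equal(c, test.column('Class')) / test.num_rows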
Now for the grand reveal – let’s see how we did. We’ll arbitrarily use \(k=5\).
evaluate_accuracy(training_set, test_set, 5)
0.898876404494382
Brittany Wenger’s science fair project was to build a classification algorithm to diagnose breast cancer. She won the grand prize for building an algorithm whose accuracy was almost 99%.
Let’s see how well we can do, with the ideas we’ve learned in this course.
So, let me tell you a little bit about the data set. Basically, if a woman has a lump in her breast, the doctors may want to take a biopsy to see if it is
cancerous. There are several different procedures for doing that. Brittany focused on fine needle aspiration (FNA), because it is less invasive than the
alternatives. The doctor gets a sample of the mass, puts it under a microscope, takes a picture, and a trained lab tech analyzes the picture to
determine whether it is cancer or not. We get a picture like one of the following:
Unfortunately, distinguishing between benign and malignant samples can be tricky. So, researchers have studied the use of machine learning to help with this
task. The idea is that we’ll ask the lab tech to analyze the image and compute various attributes: things like the typical size of a cell, how much
variation there is among the cell sizes, and so on. Then, we’ll try to use this information to predict (classify) whether the sample is malignant or not.
We have a training set of past samples from women where the correct diagnosis is known, and we’ll hope that our machine learning algorithm can
use those to learn how to predict the diagnosis for future samples.
We end up with the following data set. For the “Class” column, 1 means malignant (cancer); 0 means benign (not cancer).
patients = Table.read_table(path_data + 'breast-cancer.csv').drop('ID')
patients
Clump Thickness Uniformity of Cell Size Uniformity of Cell Shape Marginal Adhesion Single Epithelial Cell Size Bare Nuclei Bland Chromatin Normal Nucleoli
5 1 1 1 2 1 3 1
5 4 4 5 7 10 3 2
3 1 1 1 2 2 3 1
6 8 8 1 3 4 3 7
4 1 1 3 2 1 3 1
8 10 10 8 7 10 9 7
1 1 1 1 2 10 3 1
2 1 2 1 2 1 3 1
2 1 1 1 2 1 1 1
4 2 1 1 2 1 2 1
... (673 rows omitted)
So we have 9 different attributes. I don’t know how to make a 9‑dimensional scatterplot of all of them, so I’m going to pick two and plot them:
color_table = Table().with_columns(
'Class', make_array(1, 0),
'Color', make_array('darkblue', 'gold')
)
patients_with_colors = patients.join('Class', color_table)
patients_with_colors.scatter('Bland Chromatin', 'Single Epithelial Cell Size',
group='Color')
Oops. That plot is utterly misleading, because there are a bunch of points that have identical values for both the x‑ and y‑coordinates. To make it
easier to see all the data points, I’m going to add a little bit of random jitter to the x‑ and y‑values. Here’s how that looks:
For instance, you can see there are lots of samples with chromatin = 2 and epithelial cell size = 2; all non‑cancerous.
Keep in mind that the jittering is just for visualization purposes, to make it easier to get a feeling for the data. We’re ready to work with the data now,
and we’ll use the original (unjittered) data.
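The code that produced the jittered plot is not shown above. One simple way to do it, assuming a small amount of normally distributed noise is added to each coordinate purely for plotting, is:
def randomize_column(a):
    """Add a small amount of random noise to an array of values, for plotting only."""
    return a + np.random.normal(0.0, 0.09, size=len(a))

jittered = Table().with_columns(
    'Bland Chromatin (jittered)',
    randomize_column(patients_with_colors.column('Bland Chromatin')),
    'Single Epithelial Cell Size (jittered)',
    randomize_column(patients_with_colors.column('Single Epithelial Cell Size')),
    'Color', patients_with_colors.column('Color')
)
jittered.scatter('Bland Chromatin (jittered)', 'Single Epithelial Cell Size (jittered)', group='Color')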
First we’ll create a training set and a test set. The data set has 683 patients, so we’ll randomly permute the data set and put 342 of them in the
training set and the remaining 341 in the test set.
shuffled_patients = patients.sample(683, with_replacement=False)
training_set = shuffled_patients.take(np.arange(342))
test_set = shuffled_patients.take(np.arange(342, 683))
Let’s stick with 5 nearest neighbors, and see how well our classifier does.
evaluate_accuracy(training_set, test_set, 5)
0.967741935483871
Over 96% accuracy. Not bad! Once again, pretty darn good for such a simple technique.
As a footnote, you might have noticed that Brittany Wenger did even better. What techniques did she use? One key innovation is that she
incorporated a confidence score into her results: her algorithm had a way to determine when it was not able to make a confident prediction, and for
those patients, it didn’t even try to predict their diagnosis. Her algorithm was 99% accurate on the patients where it made a prediction – so that
extension seemed to help quite a bit.
17.6. Multiple Regression
Now that we have explored ways to use multiple attributes to predict a categorical variable, let us return to predicting a quantitative variable.
Predicting a numerical quantity is called regression, and a commonly used method to use multiple attributes for regression is called multiple linear
regression.
17.6.1.1. Correlation
No single attribute is sufficient to predict the sale price. For example, the area of the first floor, measured in square feet, correlates with sale price but
only explains some of its variability.
sales.scatter('1st Flr SF', 'SalePrice')
0.6424662541030225
In fact, none of the individual attributes have a correlation with sale price that is above 0.7 (except for the sale price itself).
for label in sales.labels:
    print('Correlation of', label, 'and SalePrice:\t', correlation(sales, label, 'SalePrice'))
However, combining attributes can provide higher correlation. In particular, if we sum the first floor and second floor areas, the result has a higher
correlation than any single attribute alone.
both_floors = sales.column(1) + sales.column(2)
correlation(sales.with_column('Both Floors', both_floors), 'SalePrice', 'Both Floors')
0.7821920556134877
This high correlation indicates that we should try to use more than one attribute to predict the sale price. In a dataset with multiple observed
attributes and a single numerical value to be predicted (the sale price in this case), multiple linear regression can be an effective technique.
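The cells below use a function predict that is not defined above. Judging from the fit function defined later in this section, which computes sum(best_slopes * np.array(row)), a minimal version might be:
def predict(slopes, row):
    """Predict the sale price of a row of attributes as the sum of each attribute times its slope."""
    return sum(slopes * np.array(row))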
example_row = test.drop('SalePrice').row(0)
print('Predicting sale price for:', example_row)
example_slopes = np.random.normal(10, 1, len(example_row))
print('Using slopes:', example_slopes)
print('Result:', predict(example_slopes, example_row))
Predicting sale price for: Row(1st Flr SF=1207, 2nd Flr SF=0, Total Bsmt SF=1135.0,
Garage Area=264.0, Wood Deck SF=0, Open Porch SF=240, Lot Area=9510, Year Built=1962, Yr
Sold=2006)
Using slopes: [ 9.52065867 8.58939769 11.48702417 9.50389131 9.09151019 9.86944284
10.71929443 10.88966608 8.33339346]
Result: 169429.7032316262
The result is an estimated sale price, which can be compared to the actual sale price to assess whether the slopes provide accurate predictions.
Since the example_slopes above were chosen at random, we should not expect them to provide accurate predictions at all.
print('Actual sale price:', test.column('SalePrice').item(0))
print('Predicted sale price using random slopes:', predict(example_slopes, example_row))
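The helper rmse and the names train_attributes and train_prices used below are not defined above; train_attributes is presumably a table of attribute rows from the training set and train_prices an array of the corresponding sale prices. A sketch of rmse, consistent with the root mean squared error formula used later for the nearest neighbor predictions:
def rmse(slopes, attributes, prices):
    """Root mean squared error of the predictions made with slopes,
    over every row of attributes, compared to the actual prices."""
    errors = make_array()
    for i in np.arange(len(prices)):
        predicted = predict(slopes, attributes.row(i))
        errors = np.append(errors, (predicted - prices.item(i)) ** 2)
    return np.mean(errors) ** 0.5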
def rmse_train(slopes):
    return rmse(slopes, train_attributes, train_prices)
Finally, we use the minimize function to find the slopes with the lowest RMSE. Since the function we want to minimize, rmse_train, takes an array
instead of a number, we must pass the array=True argument to minimize. When this argument is used, minimize also requires an initial guess of the
slopes so that it knows the dimension of the input array. Finally, to speed up optimization, we indicate that rmse_train is a smooth function using the
smooth=True attribute. Computation of the best slopes may take several minutes.
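The call that produced the slopes below is not shown above; under the settings just described it would look something like this, using the random example_slopes from earlier as the starting guess:
best_slopes = minimize(rmse_train, start=example_slopes, smooth=True, array=True)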
1st Flr SF 2nd Flr SF Total Bsmt SF Garage Area Wood Deck SF Open Porch SF Lot Area Year Built Yr Sold
68.7068 74.3857 56.0494 36.1706 26.4397 21.4779 0.558904 534.101 -528.216
RMSE of all training examples using the best slopes: 29311.117940347867
def rmse_test(slopes):
    return rmse(slopes, test_attributes, test_prices)
rmse_linear = rmse_test(best_slopes)
print('Test set RMSE for multiple linear regression:', rmse_linear)
If the predictions were perfect, then a scatter plot of the predicted and actual values would be a straight line with slope 1. We see that most dots fall
near that line, but there is some error in the predictions.
def fit(row):
    return sum(best_slopes * np.array(row))
test.with_column('Fitted', test.drop(0).apply(fit)).scatter('Fitted', 0)
plots.plot([0, 5e5], [0, 5e5]);
A residual plot for multiple regression typically compares the errors (residuals) to the actual values of the predicted variable. We see in the residual
plot below that we have systematically underestimated the value of expensive houses, shown by the many positive residual values on the right side of
the graph.
test.with_column('Residual', test_prices-test.drop(0).apply(fit)).scatter(0, 'Residual')
plots.plot([0, 7e5], [0, 0]);
As with simple linear regression, interpreting the result of a predictor is at least as important as making predictions. There are many lessons about
interpreting multiple regression that are not included in this textbook. A natural next step after completing this text would be to study linear modeling
and regression in further depth.
17.6.3. Nearest Neighbors for Regression
We can also use nearest neighbors to predict sale prices. The training and test tables for this approach, train_nn and test_nn, contain the sale price and a subset of the attributes:
SalePrice 1st Flr SF 2nd Flr SF Total Bsmt SF Garage Area Year Built
270000 1673 0 1673 583 2000
158000 986 537 1067 295 1949
249000 1506 0 1494 672 2005
... (998 rows omitted)
The computation of closest neighbors is identical to a nearest‑neighbor classifier. In this case, we will exclude the 'SalePrice' rather than the
'Class' column from the distance computation. The five nearest neighbors of the first test row are shown below.
example_nn_row = test_nn.drop(0).row(0)
closest(train_nn, example_nn_row, 5, 'SalePrice')
SalePrice 1st Flr SF 2nd Flr SF Total Bsmt SF Garage Area Year Built Distance
137000 1176 0 1158 303 1958 55.0182
146000 1150 0 1150 288 1961 63.6475
126175 1163 0 1162 220 1955 68.1909
157900 1188 0 1188 312 1962 73.9865
150000 1144 0 1169 286 1956 75.1332
One simple method for predicting the price is to average the prices of the nearest neighbors.
def predict_nn(example):
    """Return the average sale price among the 5 nearest neighbors of example."""
    return np.average(closest(train_nn, example, 5, 'SalePrice').column('SalePrice'))
predict_nn(example_nn_row)
143415.0
Finally, we can inspect whether our prediction is close to the true sale price for our one test example. Looks reasonable!
print('Actual sale price:', test_nn.column('SalePrice').item(0))
print('Predicted sale price using nearest neighbors:', predict_nn(example_nn_row))
17.6.3.1. Evaluation
To evaluate the performance of this approach for the whole test set, we apply predict_nn to each test example, then compute the root mean squared
error of the predictions. Computation of the predictions may take several minutes.
nn_test_predictions = test_nn.drop('SalePrice').apply(predict_nn)
rmse_nn = np.mean((test_prices - nn_test_predictions) ** 2) ** 0.5
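Since rmse_linear was computed above for multiple linear regression, we can print the two errors side by side for comparison (the values themselves are not shown here):
print('Test set RMSE for multiple linear regression: ', rmse_linear)
print('Test set RMSE for nearest neighbor regression:', rmse_nn)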
For these data, the errors of the two techniques are quite similar! For different data sets, one technique might outperform another. By computing the
RMSE of both techniques on the same data, we can compare methods fairly. One note of caution: the difference in performance might not be due to
the technique at all; it might be due to the random variation due to sampling the training and test sets in the first place.
Finally, we can draw a residual plot for these predictions. We still underestimate the prices of the most expensive houses, but the bias does not
appear to be as systematic. However, fewer residuals are very close to zero, indicating that fewer prices were predicted with very high accuracy.
test.with_column('Residual', test_prices-nn_test_predictions).scatter(0, 'Residual')
plots.plot([0, 7e5], [0, 0]);
18. Updating Predictions
We know how to use training data to classify a point into one of two categories. Our classification is just a prediction of the class, based on the most
common class among the training points that are nearest our new point.
Suppose that we eventually find out the true class of our new point. Then we will know whether we got the classification right. Also, we will have a
new point that we can add to our training set, because we know its class. This updates our training set. So, naturally, we will want to update our
classifier based on the new training set.
This chapter looks at some simple scenarios where new data leads us to update our predictions. While the examples in the chapter are simple in
terms of calculation, the method of updating can be generalized to work in complex settings and is one of the most powerful tools used for machine
learning.
Year Major
Second Undeclared
Second Undeclared
Second Undeclared
... (97 rows omitted)
To check that the proportions are correct, let’s use pivot to cross‑classify each student according to the two variables.
students.pivot('Major', 'Year')
0.5161290322580645
0.5161290322580645
0.4838709677419354
That’s about 0.484, which is less than half, consistent with our classification of Third Year.
Notice that both the posterior probabilities have the same denominator: the chance of the new information, which is that the student has Declared.
Because of this, Bayes’ method is sometimes summarized as a statement about proportionality:
\[ \mbox{posterior} ~ \propto ~ \mbox{prior} \times \mbox{likelihood} \]
Formulas are great for efficiently describing calculations. But in settings like our example about students, it is simpler not to think in terms of
formulas. Just use the tree diagram.
Overall, only 4 in 1000 of the population has the disease. The test is quite accurate: it has a very small false positive rate of 5 in 1000, and a
somewhat larger (though still small) false negative rate of 1 in 100.
Individuals might or might not know whether they have the disease; typically, people get tested to find out whether they have it.
So suppose a person is picked at random from the population and tested. If the test result is Positive, how would you classify them: Disease, or
No disease?
We can answer this by applying Bayes’ Rule and using our “more likely than not” classifier. Given that the person has tested Positive, the chance that
he or she has the disease is the proportion in the top branch, relative to the total proportion in the Test Positive branches.
(0.004 * 0.99)/(0.004 * 0.99 + 0.996*0.005 )
0.44295302013422816
Given that the person has tested Positive, the chance that he or she has the disease is about 44%. So we will classify them as: No disease.
This is a strange conclusion. We have a pretty accurate test, and a person who has tested Positive, and our classification is … that they don’t have
the disease? That doesn’t seem to make any sense.
When faced with a disturbing answer, the first thing to do is to check the calculations. The arithmetic above is correct. Let’s see if we can get the
same answer in a different way.
The function population returns a table of outcomes for 100,000 patients, with columns that show the True Condition and Test Result. The test is
the same as the one described in the tree. But the proportion who have the disease is an argument to the function.
We will call population with 0.004 as the argument, and then pivot to cross‑classify each of the 100,000 people.
population(0.004).pivot('Test Result', 'True Condition')
0.4429530201342282
That’s the answer we got by using Bayes’ Rule. The counts in the Positives column show why it is less than 1/2. Among the Positives, more people
don’t have the disease than do have the disease.
The reason is that a huge fraction of the population doesn’t have the disease in the first place. The tiny fraction of those that falsely test Positive are
still greater in number than the people who correctly test Positive. This is easier to visualize in the tree diagram:
The proportion of true Positives is a large fraction (0.99) of a tiny fraction (0.004) of the population.
The proportion of false Positives is a tiny fraction (0.005) of a large fraction (0.996) of the population.
These two proportions are comparable; the second is a little larger.
So, given that the randomly chosen person tested positive, we were right to classify them as more likely than not to not have the disease.
18.2.2. A Subjective Prior
Being right isn’t always satisfying. Classifying a Positive patient as not having the disease still seems somehow wrong, for such an accurate test.
Since the calculations are right, let’s take a look at the basis of our probability calculation: the assumption of randomness.
Our assumption was that a randomly chosen person was tested and got a Positive result. But this doesn’t happen in reality. People go in to get tested
because they think they might have the disease, or because their doctor thinks they might have the disease. People getting tested are not
randomly chosen members of the population.
That is why our intuition about people getting tested was not fitting well with the answer that we got. We were imagining a realistic situation of a
patient going in to get tested because there was some reason for them to do so, whereas the calculation was based on a randomly chosen person
being tested.
So let’s redo our calculation under the more realistic assumption that the patient is getting tested because the doctor thinks there’s a chance the
patient has the disease.
Here it’s important to note that “the doctor thinks there’s a chance” means that the chance is the doctor’s opinion, not the proportion in the
population. It is called a subjective probability. In our context of whether or not the patient has the disease, it is also a subjective prior probability.
Some researchers insist that all probabilities must be relative frequencies, but subjective probabilities abound. The chance that a candidate wins the
next election, the chance that a big earthquake will hit the Bay Area in the next decade, the chance that a particular country wins the next soccer
World Cup: none of these are based on relative frequencies or long run frequencies. Each one contains a subjective element. All calculations involving
them thus have a subjective element too.
Suppose the doctor’s subjective opinion is that there is a 5% chance that the patient has the disease. Then just the prior probabilities in the tree
diagram will change:
Given that the patient tests Positive, the chance that he or she has the disease is given by Bayes’ Rule.
(0.05 * 0.99)/(0.05 * 0.99 + 0.95 * 0.005)
0.9124423963133641
The effect of changing the prior is stunning. Even though the doctor has a pretty low prior probability (5%) that the patient has the disease, once the
patient tests Positive the posterior probability of having the disease shoots up to more than 91%.
If the patient tests Positive, it would be reasonable for the doctor to proceed as though the patient has the disease.
0.9124423963133641
Because we can generate a population that has the right proportions, we can also use simulation to confirm that our answer is reasonable. The table
pop_05 contains a population of 100,000 people generated with the doctor’s prior disease probability of 5% and the error rates of the test. We take a
simple random sample of size 10,000 from the population, and extract the table positive consisting only of those in the sample that had Positive test
results.
pop_05 = population(0.05)
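The sampling step is not shown above; under the setup just described it might look like this:
sample = pop_05.sample(10000, with_replacement=False)
positive = sample.where('Test Result', are.equal_to('Positive'))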
Among these Positive results, what proportion were true Positives? That’s the proportion of Positives that had the disease:
positive.where('True Condition', are.equal_to('Disease')).num_rows/positive.num_rows
0.9218181818181819
Run the two cells a few times and you will see that the proportion of true Positives among the Positives hovers around the value of 0.912 that we
calculated by Bayes’ Rule.
You can also use the population function with a different argument to change the prior disease probability and see how the posterior probabilities
are affected.