Foundational Python for Data Science
1 Introduction to Notebooks
Running Python Statements
Jupyter Notebooks
Google Colab
Colab Text Cells
Colab Code Cells
Colab Files
Managing Colab Documents
Colab Code Snippets
Existing Collections
System Aliases
Magic Functions
Summary
Questions
2 Fundamentals of Python
Basic Types in Python
High-Level Versus Low-Level Languages
Statements
Performing Basic Math Operations
Using Classes and Objects with Dot Notation
Summary
Questions
3 Sequences
Shared Operations
Testing Membership
Indexing
Slicing
Interrogation
Math Operations
Lists and Tuples
5 Execution Control
Compound Statements
Compound Statement Structure
Evaluating to True or False
if Statements
while Loops
for Loops
break and continue Statements
Summary
Questions
6 Functions
Defining Functions
Control Statement
Docstrings
Parameters
Return Statements
Scope in Functions
Decorators
Anonymous Functions
Summary
Questions
7 NumPy
Installing and Importing NumPy
Creating Arrays
Indexing and Slicing
Element-by-Element Operations
Filtering Values
Views Versus Copies
Some Array Methods
Broadcasting
NumPy Math
Summary
Questions
8 SciPy
SciPy Overview
The scipy.misc Submodule
The scipy.special Submodule
The scipy.stats Submodule
Discrete Distributions
Continuous Distributions
Summary
Questions
9 Pandas
About DataFrames
Creating DataFrames
Creating a DataFrame from a Dictionary
10 Visualization Libraries
matplotlib
Styling Plots
Labeled Data
Plotting Multiple Sets of Data
Object-Oriented Style
Seaborn
Seaborn Themes
Plotly
Bokeh
Other Visualization Libraries
Summary
Questions
13 Functional Programming
Introduction to Functional Programming
Scope and State
Depending on Global State
Changing State
Changing Mutable Data
Functional Programming Functions
List Comprehensions
List Comprehension Basic Syntax
Replacing map and filter
Multiple Variables
Dictionary Comprehensions
Generators
Generator Expressions
Generator Functions
Summary
Questions
14 Object-Oriented Programming
Grouping State and Function
Classes and Instances
Private Methods and Variables
Class Variables
Special Methods
Representation Methods
15 Other Topics
Sorting
Lists
Reading and Writing Files
Context Managers
datetime Objects
Regular Expressions
Character Sets
Character Classes
Groups
Named Groups
Find All
Find Iterator
Substitution
Substitution Using Named Groups
Compiling Regular Expressions
Summary
Questions
Index
I was first introduced to Python working in the film industry, where we used it to automate
data management across departments and locations. In the last decade, Python has become a
dominant tool in Data Science.
This dominance evolved due to two developments: the Jupyter notebook and powerful third-party
libraries. In 2001 Fernando Perez began the IPython project, an interactive Python
environment inspired by Maple and Mathematica notebooks.³ By 2014, the notebook-specific
part of the project was split off as the Jupyter project. These notebooks have excelled in
scientific and statistical work environments. In parallel with this development, third-party
libraries for scientific and statistical computing were developed for Python. With so many
applications, the functionality available to a Python programmer has grown immensely. With
specialized packages for everything from opening web sockets to processing natural language
text, there is more available than a beginning developer needs.
This project was the brainchild of Noah Gift.⁴ In his work as an educator, he found that
students of Data Science did not have a resource to learn just the parts of Python they needed.
There were many general Python books and books about Data Science, but not resources for
learning just the Python needed to get started in Data Science. That is what we have attempted
to provide here. This book will not teach the Python needed to set up a web page or perform
system administration. It is also not intended to teach you Data Science, but rather the Python
needed to learn Data Science.
I hope you will find this guide a good companion in your quest to grow your Data Science
knowledge.
Example Code
Most of the code shown in examples in this book can be found on GitHub at:
https://github.com/kbehrman/foundational-python-for-data-science.
1 https://docs.python.org/3/faq/general.html#why-was-python-created-in-the-first-place
2 https://www.python.org/success-stories/
3 http://blog.fperez.org/2012/01/ipython-notebook-historical.html
4 https://noahgift.com
3 Sequences
Errors using inadequate data are much less than those using no data at all.
Charles Babbage
In This Chapter
• Shared sequence operations
• Lists and tuples
• Strings and string methods
• Ranges
In Chapter 2, “Fundamentals of Python,” you learned about collections of types. This chapter
introduces the group of built-in types called sequences. A sequence is an ordered, finite collection.
You might think of a sequence as a shelf in a library, where each book on the shelf has a location
and can be accessed easily if you know its place. The books are ordered, with each book (except
those at the ends) having books before and after it. You can add books to the shelf, and you
can remove them, and it is possible for the shelf to be empty. The built-in sequence types are
lists, tuples, strings, binary strings, and ranges. This chapter covers the shared
characteristics and specifics of these types.
Shared Operations
The sequence types share quite a bit of functionality: most of the operations described in this
section apply to most members of the group. There are operations that relate to a sequence
having a finite length, operations for accessing the items in a sequence, and operations for
creating a new sequence based on a sequence’s content.
Testing Membership
You can test whether an item is a member of a sequence by using the in operation. This
operation returns True if the sequence contains an item that evaluates as equal to the item in
question, and it returns False otherwise. The following are examples of using in with different
sequence types:
'first' in ['first', 'second', 'third']
True
23 in (23,)
True
'b' in 'cat'
False
b'a' in b'ieojjza'
True
You can use the keyword not in conjunction with in to check whether something is absent from
a sequence:
'b' not in 'cat'
True
The two places you are most likely to use in and not in are in an interactive session to explore
data and as part of an if statement (see Chapter 5, “Execution Control”).
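As a small sketch of the second use, the following snippet combines not in with an if statement. The names allowed_units and unit are illustrative assumptions, not part of the text above. The last two lines also show that, for strings, in tests for a substring, while for lists it compares against whole items only.

```python
# Hypothetical data, for illustration only
allowed_units = ('cm', 'm', 'km')
unit = 'mm'

if unit not in allowed_units:
    message = unit + " is not a supported unit"
else:
    message = unit + " is supported"

# For strings, `in` matches substrings of any length
has_substring = 'nat' in 'Ignatius'

# For lists, `in` compares against whole items only
has_item = 'nat' in ['Ignatius']
```

Here message ends up describing an unsupported unit, has_substring is True, and has_item is False.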
Indexing
Because a sequence is an ordered series of items, you can access an item in a sequence by using
its position, or index. Indexes start at zero and go up to one less than the number of items. In an
eight-item sequence, for example, the first item has an index of zero, and the last item an index of
seven.
To access an item by using its index, you use square brackets around the index number. The
following example defines a string and accesses its first and fifth characters using their index
numbers:
name = "Ignatius"
name[0]
'I'
name[4]
't'
You can also index counting back from the end of a sequence by using negative index numbers:
name[-1]
's'
name[-2]
'u'
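The relationship between positive and negative indexes can be sketched as follows. This is a minimal illustration using the same name string; the variable names are assumptions made for the example. It also shows what happens when an index falls outside the valid range.

```python
name = "Ignatius"

# A negative index -k refers to the same item as index len(name) - k
last_by_negative = name[-1]
last_by_length = name[len(name) - 1]

# An index outside the valid range raises IndexError
try:
    name[8]
    out_of_range = False
except IndexError:
    out_of_range = True
```

Both lookups return 's', and out_of_range is True because the highest valid index for an eight-item sequence is 7.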
Slicing
You can use indexes to create new sequences that represent subsequences of the original. In
square brackets, supply the beginning and ending index numbers of the subsequence separated
by a colon, and a new sequence is returned:
name = "Ignatius"
name[2:5]
'nat'
The subsequence that is returned contains items starting from the first index and up to, but not
including, the ending index. If you leave out the beginning index, the subsequence starts at the
beginning of the parent sequence; if you leave out the end index, the subsequence goes to the end
of the sequence:
name[:5]
'Ignat'
name[4:]
'tius'
You can use negative index numbers to create slices counting from the end of a sequence. This
example shows how to grab the last three letters of a string:
name[-3:]
'ius'
If you want a slice to skip items, you can provide a third value, the step, that indicates what
to count by. For example, given a list of integers, you can create a slice by using just
the starting and ending index numbers:
scores = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18]
scores[3:15]
[3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
But you can also indicate the step to take, such as counting by threes:
scores[3:15:3]
[3, 6, 9, 12]
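A few more slice forms, sketched here under the assumption that scores holds the integers 0 through 18 as in the example above, may help clarify how the three values interact. The variable names are illustrative.

```python
scores = list(range(19))  # the integers 0 through 18

# Start and end omitted, step of 2: every other item
every_other = scores[::2]

# A negative step walks backward through the sequence
reversed_scores = scores[::-1]

# Unlike indexing, slicing past the end does not raise an error
tail = scores[15:100]

# A slice is a new sequence; the original still has all 19 items
original_length = len(scores)
```

Because slicing returns a new sequence, none of these operations changes scores itself.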
Interrogation
You can perform shared operations on sequences to glean information about them. Because a
sequence is finite, it has a length, which you can find by using the len function:
name = "Ignatius"
len(name)
8
You can use the min and max functions to find the minimum and maximum items, respectively:
scores = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18]
min(scores)
0
max(name)
'u'
These functions assume that the contents of a sequence can be compared in a way that implies
an ordering. For sequence types that allow for mixed item types, an error occurs if the contents
cannot be compared:
max(['Free', 2, 'b'])
-----------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-15-d8babe38f9d9> in <module>()
----> 1 max(['Free', 2, 'b'])
TypeError: '>' not supported between instances of 'int' and 'str'
You can find out how many times an item appears in a sequence by using the count method:
name.count('a')
1
You can get the index of an item in a sequence by using the index method:
name.index('s')
7
You can use the result of the index method to create a slice up to an item, such as a letter in a
string:
name[:name.index('u')]
'Ignati'
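One detail worth knowing is that the index method raises a ValueError when the item is not present. The following sketch guards against that with a membership test; prefix_before is a hypothetical helper written for this illustration, not something defined in the text.

```python
name = "Ignatius"

def prefix_before(text, marker):
    """Return the part of text before marker, or all of text if marker is absent."""
    if marker in text:
        return text[:text.index(marker)]
    return text

found = prefix_before(name, 'u')
missing = prefix_before(name, 'z')

# Calling index directly for an absent item raises ValueError
try:
    name.index('z')
    raised = False
except ValueError:
    raised = True
```

Here found is 'Ignati', missing is the whole string, and raised is True.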
Math Operations
You can perform addition and multiplication with sequences of the same type. When you do, you
conduct these operations on the sequence, not on its contents. So, for example, adding the list [1]
to the list [2] will produce the list [1,2], not [3]. Here is an example of using the plus (+) operator
to create a new string from three separate strings:
"prefix" + "-" + "postfix"
'prefix-postfix'
The multiplication (*) operator works by performing multiple additions on the whole sequence,
not on its contents:
[0,2] * 4
[0, 2, 0, 2, 0, 2, 0, 2]
This is a useful way of setting up a sequence with default values. For example, say that you want
to track scores for a set number of participants in a list. You can initialize that list so that it has an
initial score for each participant by using multiplication:
num_participants = 10
scores = [0] * num_participants
scores
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
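The following sketch illustrates two points about these operators: addition builds a new sequence and leaves the operands unchanged, and mixing sequence types raises an error. The variable names are assumptions made for the example.

```python
first = [1, 2]
second = [3]

# Addition creates a new list; first and second are unchanged
combined = first + second

# Adding sequences of different types is not allowed
try:
    [1, 2] + (3,)
    mixed_ok = True
except TypeError:
    mixed_ok = False

# Multiplication works on strings as well
separator = '-' * 10
```

After this runs, combined is [1, 2, 3], first is still [1, 2], and mixed_ok is False.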
You can create tuples by using the tuple constructor, tuple(), or using parentheses. If you want
to create a tuple with a single item, you must follow that item with a comma, or Python will
interpret the parentheses not as indicating a tuple but as indicating a logical grouping. You can
also create a tuple without parentheses by just putting a comma after an item. Listing 3.1 provides
examples of tuple creation.
tup = (1,)
tup
(1,)
tup = 1,2,
tup
(1, 2)
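Warning
A common but subtle bug occurs when you leave a trailing comma behind a value in an
assignment. It turns the value into a tuple containing the original value. So after the
assignment x = 2, (note the trailing comma), x will be (2,) and not 2. A trailing comma after
the last argument in a function call is harmless: the second argument to the function
my_function(1, 2,) is still 2 and not (2,).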
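A quick way to see where a trailing comma matters is to check the resulting types. In this sketch, my_function is a hypothetical function that simply reports the type of its second argument.

```python
def my_function(first, second):
    """Hypothetical function that returns the type of its second argument."""
    return type(second)

# A trailing comma after an assigned value creates a one-item tuple
value = 2,
value_type = type(value)

# A trailing comma after the last argument of a call does not create a tuple
arg_type = my_function(1, 2,)

# Wrapping the argument in parentheses with a comma does create a tuple
tuple_arg_type = my_function(1, (2,))
```

Here value_type and tuple_arg_type are tuple, while arg_type is int.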
You can also use the list or tuple constructors with a sequence as an argument. The following
example uses a string and creates a list of the items the string contains:
name = "Ignatius"
letters = list(name)
letters
['I', 'g', 'n', 'a', 't', 'i', 'u', 's']
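To add an item to the end of a list, you use the append method. To add an item at a specific
position, you use the insert method with the index as the first argument:
flavours = ['Chocolate', 'Vanilla']
flavours.append('SuperFudgeNutPretzelTwist')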
flavours
['Chocolate', 'Vanilla', 'SuperFudgeNutPretzelTwist']
flavours.insert(0,"sourMash")
flavours
['sourMash', 'Chocolate', 'Vanilla', 'SuperFudgeNutPretzelTwist']
To remove an item from a list, you use the pop method. With no argument, this method removes
the last item. By using an optional index argument, you can choose which item to remove. In
either case, the item is removed from the list and returned.
The following example pops the last item off the list and then pops off the item at index 0. You can see
that both items are returned when they are popped and that they are then gone from the list:
flavours.pop()
'SuperFudgeNutPretzelTwist'
flavours.pop(0)
'sourMash'
flavours
['Chocolate', 'Vanilla']
To add the contents of one list to another, you use the extend method:
desserts = ['Cookies', 'Water Melon']
desserts
['Cookies', 'Water Melon']
desserts.extend(flavours)
desserts
['Cookies', 'Water Melon', 'Chocolate', 'Vanilla']
This method modifies the first list so that it now has the contents of the second list appended to
its contents.
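The difference between append and extend is easy to miss, so here is a minimal sketch comparing them. The list names appended, extended, and joined are assumptions made for the example.

```python
flavours = ['Chocolate', 'Vanilla']

# append adds its argument as a single item, even if it is a list
appended = ['Cookies']
appended.append(flavours)

# extend adds each item of its argument individually
extended = ['Cookies']
extended.extend(flavours)

# The + operator builds a new list and leaves both operands unchanged
joined = ['Cookies'] + flavours
```

After this runs, appended has two items (the second is a nested list), while extended and joined each have three.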
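Be careful when you use multiplication to initialize a list of lists. Suppose you create a list
of four empty sublists this way:
lists = [[]] * 4
lists
[[], [], [], []]
This appears to have worked, until you modify one of the sublists: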
lists[-1].append(4)
lists
[[4], [4], [4], [4]]
All of the sublists are modified! This is because the multiplication only initializes one list and
references it four times. The references look independent until you try modifying one. The
solution to this is to use a list comprehension (discussed further in Chapter 13, “Functional
Programming”):
lists = [[] for _ in range(4)]
lists[-1].append(4)
lists
[[], [], [], [4]]
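The shared reference can be checked directly with the is operator, which tests whether two names refer to the very same object. The following sketch uses illustrative variable names and also shows why multiplication is safe for immutable items such as integers.

```python
shared = [[]] * 4
independent = [[] for _ in range(4)]

# With multiplication, every position refers to one and the same list
shared_same = shared[0] is shared[1]

# With the comprehension, each position gets its own list
independent_same = independent[0] is independent[1]

# Assigning to a position replaces the item rather than changing it,
# so a list of integers made by multiplication behaves as expected
zeros = [0] * 3
zeros[0] = 5
```

Here shared_same is True, independent_same is False, and zeros ends up as [5, 0, 0].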
Unpacking
You can assign values to multiple variables from a list or tuple in one line:
a, b, c = (1,3,4)
a
1
b
3
c
4
Or, if you want to assign multiple values to one variable while assigning single ones to the others,
you can use a * next to the variable that will take multiple values. Then that variable will absorb
all the items not assigned to other variables:
*first, middle, last = ['horse', 'carrot', 'swan', 'burrito', 'fly']
first
['horse', 'carrot', 'swan']
last
'fly'
middle
'burrito'
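A few variations on unpacking are sketched below, using the same list of words. The starred variable can appear in any one position, unpacking can swap values, and without a starred variable the number of names must match the number of items.

```python
# The starred variable can sit in the middle
first, *middle, last = ['horse', 'carrot', 'swan', 'burrito', 'fly']

# Unpacking allows swapping two values without a temporary variable
a, b = 1, 3
a, b = b, a

# Without a starred variable, the counts must match exactly
try:
    x, y = (1, 3, 4)
    mismatch = False
except ValueError:
    mismatch = True
```

Here middle is ['carrot', 'swan', 'burrito'], a and b have traded values, and mismatch is True.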
Sorting Lists
For lists you can use the built-in sort and reverse methods to change the order of the
contents. Much like the sequence min and max functions, sort works only if the
contents are comparable. Both methods modify the list in place, as shown in these examples:
name = "Ignatius"
letters = list(name)
letters
['I', 'g', 'n', 'a', 't', 'i', 'u', 's']
letters.sort()
letters
['I', 'a', 'g', 'i', 'n', 's', 't', 'u']
letters.reverse()
letters
['u', 't', 's', 'n', 'i', 'g', 'a', 'I']
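Two details of sort are sketched here: it returns None rather than the sorted list, and it accepts optional key and reverse arguments. These arguments are part of the standard list.sort method, although this chapter does not otherwise discuss them; the variable names are illustrative.

```python
letters = list("Ignatius")

# sort changes the list in place and returns None
result = letters.sort()
default_order = list(letters)

# Uppercase letters sort before lowercase ones by default;
# key=str.lower compares items without regard to case
letters.sort(key=str.lower)
case_insensitive = list(letters)

# reverse=True sorts in descending order in a single call
letters.sort(reverse=True)
descending = list(letters)
```

A common mistake is to write letters = letters.sort(), which replaces the list with None.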
Strings
A string is a sequence of characters. In Python, strings are Unicode by default, and any Unicode
character can be part of a string. Strings are represented as characters surrounded by quotation
marks. Single or double quotation marks both work, and strings made with them are equal:
'Here is a string'
'Here is a string'
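The equality of the two quoting styles can be confirmed with a direct comparison. This is a minimal sketch; the variable names are assumptions made for the example.

```python
single = 'Here is a string'
double = "Here is a string"

# The kind of quotation mark is only notation; the resulting strings are equal
same = single == double

# The enclosing quotation marks are not part of the string's contents
length = len(single)
```

Here same is True, and length counts only the 16 characters between the quotation marks.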
If you want to include quotation marks around a word or words within a string, you need to use
one type of quotation marks—single or double—to enclose that word or words and use the other
type of quotation marks to enclose the whole string. The following example shows the word is
enclosed in double quotation marks and the whole string enclosed in single quotation marks:
'Here "is" a string'
'Here "is" a string'
You enclose multiple-line strings in triple quotation marks (three double or three single
quotation marks) as shown in the following example:
a_very_large_phrase = """
Wikipedia is hosted by the Wikimedia Foundation,
a non-profit organization that also hosts a range of other projects.
"""
With Python strings you can use special characters, each preceded by a backslash. The special
characters include \t for tab, \r for carriage return, and \n for newline. These characters are
interpreted with special meaning during printing. While these characters are generally useful,
they can be inconvenient if you are representing a Windows path:
windows_path = "c:\row\the\boat\now"
print(windows_path)
ow heoat
ow
For such situations, you can use a raw string literal, in which backslashes are treated as
literal characters. You signify a raw string by prefixing the string with an r:
windows_path = r"c:\row\the\boat\now"
print(windows_path)
c:\row\the\boat\now
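Comparing lengths is one way to see what the escape sequences do. In the sketch below, the plain string collapses each of \r, \t, \b (backspace), and \n into a single character, while the raw string keeps every backslash. The variable names are illustrative.

```python
plain = "c:\row\the\boat\now"
raw = r"c:\row\the\boat\now"

# Each escape sequence in the plain string becomes one character
plain_length = len(plain)
raw_length = len(raw)

# Doubling each backslash is an alternative to the r prefix
escaped = "c:\\row\\the\\boat\\now"
same_as_raw = escaped == raw
```

The raw string is four characters longer than the plain one, one for each backslash that was preserved.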
As demonstrated in Listing 3.3, there are a number of string methods that enable you to
deal with different capitalizations.
captain = "Patrick Tayluer"
captain.capitalize()
'Patrick tayluer'
captain.lower()
'patrick tayluer'
captain.upper()
'PATRICK TAYLUER'
captain.swapcase()
'pATRICK tAYLUER'
Python 3.6 introduced format strings, or f-strings. You can insert values into f-strings at runtime
by using replacement fields, which are delimited by curly braces. You can insert any expression,
including variables, into the replacement field. An f-string is prefixed with either an F or an f, as
shown in this example:
strings_count = 5
frets_count = 24
f"Noam Pikelny's banjo has {strings_count} strings and {frets_count} frets"
"Noam Pikelny's banjo has 5 strings and 24 frets"
This example shows how to insert a mathematical expression into the replacement field:
a = 12
b = 32
f"{a} times {b} equals {a*b}"
'12 times 32 equals 384'
This example shows how to insert items from a list into the replacement field:
players = ["Tony Trischka", "Bill Evans", "Alan Munde"]
f"Performances will be held by {players[1]}, {players[0]}, and {players[2]}"
'Performances will be held by Bill Evans, Tony Trischka, and Alan Munde'
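Because a replacement field accepts any expression, method and function calls work as well. The following sketch reuses the captain and frets_count values from earlier examples; the other variable names are assumptions made for the illustration.

```python
captain = "Patrick Tayluer"
frets_count = 24

# Method calls and function calls are expressions, so they can be used too
shout = f"{captain.upper()} has {len(captain)} characters"

# An uppercase F prefix behaves the same as a lowercase f
same = F"{frets_count} frets" == f"{frets_count} frets"

# Doubled braces produce literal braces in the result
literal = f"{{frets_count}} is {frets_count}"
```

Here shout is 'PATRICK TAYLUER has 15 characters' and literal is '{frets_count} is 24'.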
Ranges
Using range objects is an efficient way to represent a series of numbers, ordered by value. They are
largely used for specifying the number of times a loop should run. Chapter 5 introduces loops.
Range objects can take start (optional), end, and step (optional) arguments. Much as with slicing,
the start is included in the range, and the end is not. Also as with slicing, you can use negative
steps to count down. Ranges calculate numbers as you request them, so a large range needs
no more memory than a small one. Listing 3.4 demonstrates how to create ranges with and
without the optional arguments. This listing makes lists from the ranges so that you can see the
full contents that the range would supply.
list(range(1, 10))
[1, 2, 3, 4, 5, 6, 7, 8, 9]
list(range(0,10,2))
[0, 2, 4, 6, 8]
list(range(10, 0, -2))
[10, 8, 6, 4, 2]
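The claim about memory can be checked with sys.getsizeof. The sketch below assumes the standard CPython interpreter, where a range object stores only its start, end, and step; the variable names are illustrative. It also shows that a range supports the shared sequence operations without being converted to a list.

```python
import sys

big = range(1_000_000_000)

# A range supports the shared sequence operations without building a list
length = len(big)
last = big[-1]
contains = 999_999_999 in big

# Slicing a range gives another range
every_third = list(range(0, 10)[::3])

# In CPython, the size of a range object does not depend on its length
small_size = sys.getsizeof(range(10))
big_size = sys.getsizeof(big)
```

Converting big to a list, by contrast, would require memory for a billion items, so it is best avoided for ranges of this size.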
Summary
This chapter covers the important group of types known as sequences. A sequence is an ordered,
finite collection of items. Lists and tuples can contain mixed types. Lists can be modified after
creation, but tuples cannot. Strings are sequences of text. Range objects are used to describe ranges
of numbers. Lists, strings, and ranges are among the most commonly used types in Python.
Questions
1. How would you test whether a is in the list my_list?
2. How would you find out how many times b appears in a string named my_string?