Unit 5 Analysis with Pandas in python
Here we focus on basic analysis, concerned with evaluating data quality and
missing values. For descriptive statistics, see Section Descriptive
Statistics.
import numpy as np
import pandas as pd
titanic = pd.read_csv("../data/titanic.csv.bz2")
2/5/24, 9:47 AM
We read the compressed titanic data file directly into pandas. As this is
comma-separated, we do not have to specify a separator as comma is the
default one.
As a first step, we may query the size of the dataset. Data frames have a
.shape attribute, exactly as numpy arrays do:
titanic.shape
## (1309, 14)
Shape tells us that the dataset contains 1309 rows and 14 columns.
Next, we may want to take a quick look at a few rows of the data. The .head()
and .tail() methods are useful for looking at the first and last few lines,
while .sample() extracts a few random rows:
titanic.head()
titanic.tail(3)
titanic.sample(4)
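The same inspection steps can be tried without the data file. The sketch below uses a small made-up data frame (the name toy and all its values are hypothetical, not Titanic data). Passing random_state to .sample() is optional; it only makes the random draw repeatable.

```python
import pandas as pd

# Hypothetical toy data, standing in for the real Titanic file
toy = pd.DataFrame({
    "name": ["Ann", "Bob", "Cid", "Dee", "Eve"],
    "age": [29.0, 2.0, 30.0, 25.0, 48.0],
    "survived": [1, 1, 0, 0, 1],
})

print(toy.shape)      # (number of rows, number of columns)
first = toy.head(2)   # first two rows
last = toy.tail(2)    # last two rows
# two random rows; random_state makes the draw reproducible
some = toy.sample(2, random_state=1)
print(first, last, some, sep="\n\n")
```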
Exercise 5.1 Extract the 1000th row from the data. What is the name of the
person? Did (s)he survive? See the solution
titanic.columns
The attribute .columns lists the variable names, but in the form of a special
Index data structure. This may be enough for e.g. looping over variables, but
we may also convert it to a list:
list(titanic.columns)
Now is the time to consult the documentation, if it exists. Below we work with
the following variables:
sex:
age:
Next, let us check the data types. Although some variables, e.g. pclass
(passenger class) and survived, look numeric, they may actually be stored as
strings. This can be queried with the .dtypes attribute:
titanic.dtypes
## pclass int64
## survived int64
## name object
## sex object
## age float64
## sibsp int64
## parch int64
## ticket object
## fare float64
## cabin object
## embarked object
## boat object
## body float64
## home.dest object
## dtype: object
The data types tell us that both pclass and survived are integers, while name
and sex are “objects”. Here this means strings, although in principle object
columns may also hold more complex objects.
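To see how a numeric-looking column can turn out to be text, here is a minimal sketch on made-up data (the frame toy and the column pclass_txt are hypothetical). It uses the type-checking helpers in pd.api.types rather than comparing dtype names, because the name pandas reports for string columns can differ between versions. Converting with pd.to_numeric is one possible remedy, shown only as an illustration.

```python
import pandas as pd

# Hypothetical toy data: "pclass_txt" looks numeric but is stored as text
toy = pd.DataFrame({
    "pclass": [1, 3, 2],
    "pclass_txt": ["1", "3", "2"],
    "age": [29.0, None, 0.5],
})

print(toy.dtypes)

# Type checks that do not depend on the exact dtype name
is_int = pd.api.types.is_integer_dtype(toy["pclass"])
is_txt_numeric = pd.api.types.is_numeric_dtype(toy["pclass_txt"])
print(is_int, is_txt_numeric)

# Convert text that holds numbers into a proper numeric column
pclass_num = pd.to_numeric(toy["pclass_txt"])
print(pclass_num.dtype)
```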
One of the first steps with a new dataset is to check the values of the
relevant variables. The pandas library contains a lot of tools for descriptive
data analysis. For categorical variables we usually want to see the explicit
values, for numeric ones we may check the minimum and maximum values.
What are the possible values for sex? A quick look at the data tells us these
are male and female, but are there any more? We can query that with the
.unique() method:
titanic.sex.unique()
The result tells us that all passengers in this data are categorized as “male” or
“female”. This looks perfectly plausible. We can also see a technical detail,
namely that the result is returned as a numpy array.
Tip:
A word about methods and method chaining. Let’s revisit the previous
command, titanic.sex.unique() . This contains three components, and
the process is applying these in a sequential order:
Some methods can be applied either to whole data frames or to individual
variables. If applied to a whole data frame, they are applied to every
variable separately, pretty much as if looping over the variables and
applying the method to each of them.
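The chaining idea in the tip can be sketched by writing the chain out one step at a time. The data frame toy below is made up for illustration; .nunique() is used as one example of a method that works on both a single variable and a whole data frame.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "sex": ["female", "male", "male", "female"],
    "pclass": [1, 3, 3, 2],
})

# The chain toy.sex.unique(), one step at a time
column = toy.sex            # pick the variable "sex" out of the data frame
values = column.unique()    # apply the method to that variable
print(values)

# A method applied to the whole data frame:
# .nunique() is computed for every column separately
per_column = toy.nunique()
print(per_column)
```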
But what about boats? (“boat” records the lifeboat in which the passenger was
found.)
titanic.boat.unique()
## array(['2', '11', nan, '3', '10', 'D', '4', '9', '6', 'B', '8', 'A',
## '7', 'C', '14', '5 9', '13', '1', '15', '5 7', '8 10', '12', '1
## '13 15 B', 'C D', '15 16', '13 15'], dtype=object)
This list gives a much more complex picture. There is a plethora of boat IDs,
including numbers and letters. In a number of cases the data also contains
multiple boat IDs, perhaps because the rescuers out at sea had more
important things to do than record boat numbers. And nan tells us that
not everyone was fortunate enough to get to a boat.
We can also easily query the number of unique values, using the .nunique()
method:
titanic.boat.nunique()
## 27
We can see that there are 27 different boat codes. But note that this is not the
same as the number of distinct values returned by .unique():
len(titanic.boat.unique())
## 28
The discrepancy arises because .unique() includes the missing value, while
.nunique() does not count missing values.
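The difference between the two counts is easy to reproduce on a small made-up series (the values below are hypothetical). The dropna=False argument of .nunique() makes it count the missing value as well.

```python
import numpy as np
import pandas as pd

# Hypothetical boat codes with one missing value
boat = pd.Series(["A", "B", np.nan, "A", "2"])

n_codes = boat.nunique()                  # missing values are not counted
n_values = len(boat.unique())             # the missing value is included
n_with_na = boat.nunique(dropna=False)    # ask .nunique to count it too
print(n_codes, n_values, n_with_na)
```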
titanic.sex.value_counts()
## male 843
## female 466
## Name: sex, dtype: int64
We see that there are almost twice as many males as females in the data.
This is plausible as well. (See also Section 7.1.)
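A short sketch of .value_counts() on made-up values (not the Titanic data) shows two optional arguments that are often handy: normalize=True returns proportions instead of counts, and dropna=False keeps missing values in the table.

```python
import numpy as np
import pandas as pd

# Hypothetical toy variable with one missing value
sex = pd.Series(["male", "female", "male", np.nan, "male"])

counts = sex.value_counts()                  # missing values are dropped
shares = sex.value_counts(normalize=True)    # proportions instead of counts
with_na = sex.value_counts(dropna=False)     # include missing values as well
print(counts, shares, with_na, sep="\n\n")
```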
For continuous variables, we do not want to count and print individual values.
Instead, we may choose to just check the minimum and maximum value. This
gives a quick overview if the values are feasible, e.g. do we see negative age,
or implausibly large age values:
titanic.age.min(), titanic.age.max()
## (0.1667, 80.0)
We can see that the youngest passenger was 0.167 years (= 2 months) and
the oldest passenger was 80 years old. Both of these values are plausible and
hence we are not worried about any too large or too small values. See Section
6.1 below for how to detect and analyze missing values.
titanic.survived.mean()
## 0.3819709702062643
2/5/24, 9:47 AM Chapter 5 Descriptive Analysis with Pandas | Machine learning in python
NB!
This figure may not be what you want because of missing values. See
Section 6.1.4 below for explanations.
The .value_counts method is handy for counting the occurrences of discrete
values, e.g. the number of women. But you cannot easily use .value_counts to
compute the number of children: there is no such category as “children”, and
hence there are no values to count. Instead, we should create a logical vector
that describes the condition, and sum it. Let’s find the number of children on
the ship using this approach.
Assume children are passengers who are less than 14 years old. Counting
can be done explicitly as:
children = titanic.age < 14
children.sum()
## 99
So there were 99 children on the Titanic. Let’s explain these steps in more
detail. First, the line
children = titanic.age < 14
creates a logical variable children, containing Trues and Falses, telling for
each passenger whether they were a child or not. A sample of this variable
looks like
children.head(5)
## 0 False
## 1 True
## 2 True
## 3 False
## 4 False
## Name: age, dtype: bool
So the second and third passengers were children, while the others in the
sample here were not. Next, children.sum() adds these values together.
Remember, as soon as we start doing math with logical values, True will be
converted to 1 and False to 0, and the latter does not matter when adding.
So we are effectively counting children here. In practice, you do not need to
create a separate variable; the comparison can be summed directly:
(titanic.age < 14).sum()
## 99
Taking the mean instead of the sum gives the proportion of children:
(titanic.age < 14).mean()
## 0.07563025210084033
NB!
This figure may not be what you mean as proportion of children because of
missing values. See Section 6.1.4 below for explanations.
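One way missing values can distort such a proportion is sketched below on made-up ages (hypothetical values, and not necessarily the explanation given in Section 6.1.4). Comparing a missing value with 14 gives False, so passengers of unknown age are silently counted as non-children.

```python
import numpy as np
import pandas as pd

# Hypothetical ages; two of them are missing
age = pd.Series([2.0, 30.0, np.nan, 10.0, np.nan, 45.0])

children = age < 14             # NaN < 14 evaluates to False
n_children = children.sum()     # True counts as 1, False as 0
share_all = children.mean()     # denominator: all six passengers

# Alternative: use only passengers whose age is known
known = age.notna()
share_known = children[known].mean()   # denominator: four passengers
print(n_children, share_all, share_known)
```

Which denominator is appropriate depends on the question being asked.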
Finally, let’s get the subset of passengers who were in either boat “A” or
boat “B”. This can be done using the .isin() method, which checks whether each
value is in a given list:
ab = titanic[titanic.boat.isin(["A", "B"])]
ab.shape
## (20, 14)
ab.head(10)
One can see that these two boats together only took 20 passengers…
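The filtering step can be split into its two parts, as in this sketch on made-up data (the frame toy is hypothetical): .isin() first produces a logical vector, which is then used to select rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "name": ["Ann", "Bob", "Cid", "Dee"],
    "boat": ["A", "11", np.nan, "B"],
})

in_ab = toy.boat.isin(["A", "B"])   # logical vector; missing values give False
ab = toy[in_ab]                     # keep only the rows where it is True
print(ab)
```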
Typical datasets we encounter contain variables that are irrelevant for the
current analysis. In such cases we may want to either select only the relevant
ones (if there are only a few such variables), or remove the irrelevant ones
(if there are not too many of those). Removing irrelevant information helps us
to see data better when printing it, to understand it better, and hence to have
fewer coding errors. It also helps with certain transformations where we now
may want to operate on the whole narrowed-down data frame instead of
selected variables only. Finally, it also helps to conserve memory and speed
up processing.
Note that the columns are returned in the order they are listed in the selection,
not in the original order.
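A minimal sketch of column selection on made-up data (the frame toy is hypothetical): indexing a data frame with a list of names returns those columns in the listed order.

```python
import pandas as pd

# Hypothetical toy data with columns in the order pclass, survived, sex, age
toy = pd.DataFrame({
    "pclass": [1, 3],
    "survived": [1, 0],
    "sex": ["female", "male"],
    "age": [29.0, 40.0],
})

# Indexing with a list of names returns a data frame with those columns,
# in the order given in the list
selected = toy[["sex", "pclass", "survived"]]
print(list(selected.columns))
```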
Instead of loading the complete dataset and selecting the desired variables
later, we can also specify which columns to read with pd.read_csv :
pd.read_csv("../data/titanic.csv.bz2",
usecols=["sex", "pclass", "survived"]).sample(4)
This achieves a similar result, although the columns now are in the original
order, not in the specified order.
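The ordering behaviour of usecols can be checked without a file on disk. The sketch below reads made-up CSV text from memory (csv_text is hypothetical) and then reorders the columns explicitly, in case the listed order matters.

```python
import io
import pandas as pd

# Hypothetical CSV content held in memory instead of a file on disk
csv_text = "pclass,survived,name,sex\n1,1,Ann,female\n3,0,Bob,male\n"

subset = pd.read_csv(io.StringIO(csv_text),
                     usecols=["sex", "pclass", "survived"])
# Columns follow the file order, not the order of the usecols list
print(list(subset.columns))

# Reorder explicitly if the listed order matters
reordered = subset[["sex", "pclass", "survived"]]
print(list(reordered.columns))
```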
Note that we need to tell .drop that we are talking about removing columns
with these names–not about rows with this index. This is what the argument
axis=1 does. Otherwise we get an error:
In particular, this means that .drop will cause an error if we try to remove
a non-existent column. This may cause problems when overwriting data in
notebooks: on the first pass you remove a variable, and when you re-run the
same cell, the variable is already gone and you get an error. As always, either
give the modified data a new name, avoid re-running the data cleaning step, or
incorporate error handling.
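The behaviour of .drop described here can be sketched on made-up data (the frame toy is hypothetical). The columns= keyword is an equivalent way of saying axis=1, and errors="ignore" is one possible form of the error handling mentioned above.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"sex": ["female", "male"], "body": [None, 135.0],
                    "ticket": ["24160", "113781"]})

smaller = toy.drop(["body", "ticket"], axis=1)   # axis=1: these are column names
same = toy.drop(columns=["body", "ticket"])      # equivalent keyword form

# Dropping a column that is already gone raises KeyError ...
try:
    smaller.drop("body", axis=1)
    failed = False
except KeyError:
    failed = True

# ... unless we ask pandas to ignore missing labels
safe = smaller.drop("body", axis=1, errors="ignore")
print(list(smaller.columns), failed, list(safe.columns))
```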
Exercise 5.3 You want to plot histogram of age distribution for Titanic
passengers, separate for males and females.
titanic.groupby("sex")
But we can use the grouped data frame for other operations. In this case we
compute the mean of survived:
titanic.groupby("sex").survived.mean()
## sex
## female 0.727468
## male 0.190985
## Name: survived, dtype: float64
The result is a series with two values (the numeric survival rates), indexed by
the group values female and male. In this example we see that the female
survival rate was about 73%, while only 19% of men survived the disaster.
We can also group by more than one variable. In that case we have to supply
the variables as a list, e.g. in order to compute the survival rate by gender
and passenger class (variable pclass) we can do
titanic.groupby(["sex", "pclass"]).survived.mean()
## sex pclass
## female 1 0.965278
## 2 0.886792
## 3 0.490741
## male 1 0.340782
## 2 0.146199
## 3 0.152130
## Name: survived, dtype: float64
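Grouped means are easy to verify by hand on a tiny made-up data frame (toy below is hypothetical). The sketch also shows how a single group can be looked up in the two-level result with a tuple.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "sex": ["female", "female", "male", "male", "male", "female"],
    "pclass": [1, 3, 1, 3, 3, 1],
    "survived": [1, 0, 1, 0, 0, 1],
})

by_sex = toy.groupby("sex").survived.mean()                  # one grouping variable
by_both = toy.groupby(["sex", "pclass"]).survived.mean()     # two: pass a list
print(by_sex, by_both, sep="\n\n")

# With two grouping variables the index has two levels;
# a single group is looked up with a tuple
rate_m3 = by_both.loc[("male", 3)]
print(rate_m3)
```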
.groupby supports more options, e.g. one can keep the group indicators as
Pandas opens string variables up to a large list of string functions through
the .str attribute. These largely replicate the re module, but the syntax is
different, and often the function names are different too. We walk through a
number of examples here, namely
Let’s use Titanic data and analyze whether there is a regular pattern of cabin
codes and passenger class. First, what are the existing cabin numbers?
titanic.cabin.unique()
## array(['B5', 'C22 C26', 'E12', 'D7', 'A36', 'C101', nan, 'C62 C64', 'B
## 'A23', 'B58 B60', 'D15', 'C6', 'D35', 'C148', 'C97', 'B49', 'C9
## 'C52', 'T', 'A31', 'C7', 'C103', 'D22', 'E33', 'A21', 'B10', 'B
## 'E40', 'B38', 'E24', 'B51 B53 B55', 'B96 B98', 'C46', 'E31', 'E
## 'B61', 'B77', 'A9', 'C89', 'A14', 'E58', 'E49', 'E52', 'E45',
## 'B22', 'B26', 'C85', 'E17', 'B71', 'B20', 'A34', 'C86', 'A16',
## 'A20', 'A18', 'C54', 'C45', 'D20', 'A29', 'C95', 'E25', 'C111',
## 'C23 C25 C27', 'E36', 'D34', 'D40', 'B39', 'B41', 'B102', 'C123
## 'E63', 'C130', 'B86', 'C92', 'A5', 'C51', 'B42', 'C91', 'C125',
## 'D10 D12', 'B82 B84', 'E50', 'D33', 'C83', 'B94', 'D49', 'D45',
## 'B69', 'B11', 'E46', 'C39', 'B18', 'D11', 'C93', 'B28', 'C49',
## 'B52 B54 B56', 'E60', 'C132', 'B37', 'D21', 'D19', 'C124', 'D17
## 'B101', 'D28', 'D6', 'D9', 'B80', 'C106', 'B79', 'C47', 'D30',
## 'C90', 'E38', 'C78', 'C30', 'C118', 'D36', 'D48', 'D47', 'C105
## 'B36', 'B30', 'D43', 'B24', 'C2', 'C65', 'B73', 'C104', 'C110',
## 'C50', 'B3', 'A24', 'A32', 'A11', 'A10', 'B57 B59 B63 B66', 'C2
## 'E44', 'A26', 'A6', 'A7', 'C31', 'A19', 'B45', 'E34', 'B78', 'B
## 'C87', 'C116', 'C55 C57', 'D50', 'E68', 'E67', 'C126', 'C68',
## 'C70', 'C53', 'B19', 'D46', 'D37', 'D26', 'C32', 'C80', 'C82',
## 'C128', 'E39 E41', 'D', 'F4', 'D56', 'F33', 'E101', 'E77', 'F2
## 'D38', 'F', 'F G63', 'F E57', 'F E46', 'F G73', 'E121', 'F E69
## 'E10', 'G6', 'F38'], dtype=object)
We can see that cabin code is typically a letter, followed by two or three digits.
We can guess that the letter denotes the deck, and number is the cabin
number in that deck. In several cases however there are apparently certain
data entry errors, manifested by just a single letter code, or multiple codes for
a single person. (Although passengers may also have booked more than one
cabin.)
Assume everyone only has a single cabin. We’ll assign a single cabin code to
each passenger. This will be the existing single cabin if coded in that way. If
the value of cabin contains multiple codes, then we take the first valid code
(i.e. the one in the form of letter + digits).
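One possible way to pick the first valid code is str.extract, sketched below on made-up cabin values. This is an assumption about how the plan could be implemented, not necessarily the approach this chapter takes; the pattern [A-Za-z]\d+ is my own reading of “letter + digits”.

```python
import numpy as np
import pandas as pd

# Hypothetical cabin values covering the cases described above
cabin = pd.Series(["B5", "C22 C26", "F G63", "T", np.nan])

# Capture the first "letter + digits" group found anywhere in the string;
# expand=False returns a Series instead of a one-column data frame
first_valid = cabin.str.extract(r"([A-Za-z]\d+)", expand=False)
print(first_valid)
```

Values without any valid code, such as the single letter T, come out as missing.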
As the first step, we find the problematic cabin codes. In most cases these
contain a space; there are also single-letter codes, which are presumably
wrong. We can identify patterns in strings using str.contains(pattern). This
function returns a vector of trues/falses, depending on whether each element
of the original string vector contains the pattern:
## 0 False
## 1 True
## 2 True
## 3 True
## 4 True
## 5 False
## Name: cabin, dtype: bool
expressions, but the latter are much more powerful. We see that neither the
first nor the sixth cabin contains a space, but the second through fifth do.
This can be confirmed by looking at the cabin values:
titanic.cabin.head(6)
## 0 B5
## 1 C22 C26
## 2 C22 C26
## 3 C22 C26
## 4 C22 C26
## 5 E12
## Name: cabin, dtype: object
## False 1268
## True 41
## Name: cabin, dtype: int64
Note that:
first, you have to use the .str attribute to open a column up for string
operations. .cabin.contains() does not work.
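A small sketch of the space check on made-up cabin values (hypothetical, not the Titanic data). Passing na=False makes missing cabins count as not containing the pattern, and the last line confirms that a series has no .contains method of its own.

```python
import numpy as np
import pandas as pd

# Hypothetical cabin values
cabin = pd.Series(["B5", "C22 C26", np.nan, "F G63", "E12"])

# na=False treats missing values as "does not contain the pattern"
has_space = cabin.str.contains(" ", na=False)
print(has_space.value_counts())

# The .str accessor is required: a Series has no .contains method
has_method = hasattr(cabin, "contains")
print(has_method)
```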
Next, we use a regular expression that identifies both types of errors: a space
in the code, and a missing number. Instead of looking for strings that contain
a space, we look for strings that match the valid pattern: a single letter
followed by digits. We demonstrate this using str.match, a function similar to
str.contains, except that it checks whether the pattern matches from the
beginning of the string:
titanic.cabin.str.match(r"\w\d+$", na=False).value_counts()
## False 1061
## True 248
## Name: cabin, dtype: int64
The result shows that 248 cabin values match the valid pattern, while 1061
do not. The latter number is much larger than the 41 codes with a space found
above, but note an important difference between the two approaches: when
looking for patterns that contain a space we were looking for wrong patterns,
now we are looking for valid patterns. By specifying na=False above, we told
pandas to treat missings as non-matching, so the 1061 mismatches include all
passengers with no cabin recorded. As we are only interested in invalid
codes, not missings, we should count missings as matching (na=True) in the
example below. Here is an example of mismatched cabin codes:
i = titanic.cabin.str.match(r"\w\d+$", na=True)
titanic.cabin[~i].sample(6)
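The effect of the na argument, and the difference between str.match and str.contains, can be traced on a few made-up cabin values (hypothetical, chosen to cover one case of each kind).

```python
import numpy as np
import pandas as pd

# Hypothetical cabin values: valid, multiple codes, letter only, missing
cabin = pd.Series(["B5", "C22 C26", "T", np.nan, "E121"])
pattern = r"\w\d+$"   # one word character, then digits up to the end

# str.match anchors the pattern at the start of the string
valid_strict = cabin.str.match(pattern, na=False)   # missing counts as mismatch
valid_lenient = cabin.str.match(pattern, na=True)   # missing counts as match

# str.contains searches anywhere, so "C22 C26" passes because of its tail "C26"
found_anywhere = cabin.str.contains(pattern, na=False)

invalid = cabin[~valid_lenient]   # invalid codes only, missings excluded
print(invalid)
```

Note that \w also matches digits and the underscore, so a stricter pattern might use [A-Z] for the deck letter.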
Finally, let’s replace the full cabin number with only the deck code. This can be
done by replacing everything that follows the first letter by an empty string.
str.replace takes two arguments: pattern and replacement. We specify the
titanic.cabin.str.replace(r"(\w).*", r"\1", regex=True).sample(6)
## 412 NaN
## 1050 NaN
## 950 NaN
## 57 B
## 1197 NaN
## 1180 NaN
## Name: cabin, dtype: object
In practical terms, it is often useful to print the original and modified codes
side by side to check that the coding was done right. We can achieve this by
creating a new variable in the data frame and printing a sample of the old and
new variables (see the section on concatenating data with pd.concat for another
approach):
## cabin deck
## 7 A36 A
## 228 C65 C
## 169 C49 C
## 304 D37 D
## 133 C92 C
## 269 A19 A
We can see that the deck codes (the ones that are not missing) are done
correctly.
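The side-by-side check can be sketched on made-up data as follows (the frame toy is hypothetical; the column name deck follows the output above). Slicing with .str[0] is shown as an alternative that gives the same deck letter for these values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"cabin": ["A36", "C65", np.nan, "F G63"]})

# Keep the first character and drop the rest; regex=True is spelled out
# because newer pandas versions treat the pattern literally by default
toy["deck"] = toy.cabin.str.replace(r"(\w).*", r"\1", regex=True)

# Slicing the first character gives the same result here
deck_alt = toy.cabin.str[0]

print(toy[["cabin", "deck"]])
```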