Unit 5: Analysis with Pandas in Python

The document provides a comprehensive guide on performing descriptive analysis using the Pandas library, focusing on data quality and missing values. It covers initial steps such as loading data, exploring datasets, checking variable types, and analyzing discrete and continuous values. Additionally, it discusses filtering data, selecting relevant variables, and computing counts and proportions for better data understanding.

Descriptive Analysis with Pandas (Chapter 5 of Machine learning in python)

Here we focus on basic analysis, concerned with evaluating data quality and missing values. For descriptive statistics, see Section Descriptive Statistics.

We assume you have loaded pandas and numpy as

import numpy as np
import pandas as pd

5.1 First steps: know your data

Next, we work through some examples of exploratory data analysis with pandas. Let's load the Titanic data here, as it allows us to demonstrate more features. Note that pd.read_csv can read compressed data formats directly; there is no need to decompress them first.

titanic = pd.read_csv("../data/titanic.csv.bz2")


We read the compressed titanic data file directly into pandas. As the file is comma-separated, we do not have to specify a separator; comma is the default.
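Since pd.read_csv also accepts file-like objects, the compressed-read behavior can be sketched without the Titanic file itself; the data below is just an illustrative stand-in:

```python
import bz2
import io

import pandas as pd

# simulate a bz2-compressed csv in memory (stand-in for titanic.csv.bz2)
raw = b"pclass,survived\n1,1\n3,0\n"
buf = io.BytesIO(bz2.compress(raw))

# with a real file path ending in ".bz2" the compression is inferred;
# for a file-like object we state it explicitly:
df = pd.read_csv(buf, compression="bz2")
print(df.shape)  # (2, 2)
```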

As a first step, we may query the size of the dataset. Data frames have a .shape attribute, exactly like numpy arrays:

titanic.shape

## (1309, 14)

Shape tells us that the dataset contains 1309 rows and 14 columns.

Next, we may want to take a quick look at a few rows of the data. The .head() and .tail() methods are useful for looking at the first and last few lines, and .sample() extracts a few random rows:

titanic.head()

## pclass survived ... body home.dest
## 0 1 1 ... NaN St Louis, MO
## 1 1 1 ... NaN Montreal, PQ / Chesterville, ON
## 2 1 0 ... NaN Montreal, PQ / Chesterville, ON
## 3 1 0 ... 135.0 Montreal, PQ / Chesterville, ON
## 4 1 0 ... NaN Montreal, PQ / Chesterville, ON
##
## [5 rows x 14 columns]


titanic.tail(3)

## pclass survived name sex age ... ca
## 1306 3 0 Zakarian, Mr. Mapriededer male 26.5 ...
## 1307 3 0 Zakarian, Mr. Ortin male 27.0 ...
## 1308 3 0 Zimmerman, Mr. Leo male 29.0 ...
##
## [3 rows x 14 columns]

titanic.sample(4)

## pclass survived ... body home.dest
## 154 1 0 ... 307.0 Montreal, PQ
## 621 3 1 ... NaN NaN
## 437 2 1 ... NaN Somerset / Bernardsville, NJ
## 1018 3 0 ... NaN NaN
##
## [4 rows x 14 columns]

Exercise 5.1 Extract the 1000th row from the data. What is the name of the
person? Did (s)he survive? See the solution

We may also want to know all the variable names in the data:

titanic.columns


## Index(['pclass', 'survived', 'name', 'sex', 'age', 'sibsp', 'parch', 'ticket',
##        'fare', 'cabin', 'embarked', 'boat', 'body', 'home.dest'],
##       dtype='object')

The attribute .columns lists the variable names, but in the form of a special Index data structure. This may be enough for, e.g., looping over variables, but we can also convert it to a list:

list(titanic.columns)

## ['pclass', 'survived', 'name', 'sex', 'age', 'sibsp', 'parch', 'ticket', 'fare', 'cabin', 'embarked', 'boat', 'body', 'home.dest']

Now it is time to consult the documentation, if it exists. Below we work with the following variables:

pclass: passenger class (1, 2, 3)

survived: whether the passenger survived (1) or not (0)

sex: passenger's sex ("male" or "female")

age: passenger's age in years

boat: boat number where the passenger was rescued.

We also check if the actual coding corresponds to the documentation.


Next, let us check the data types of the data. Although some variables, e.g. pclass (passenger class) and survived, look numeric, they may actually be strings. This can be queried via the .dtypes attribute:

titanic.dtypes

## pclass int64
## survived int64
## name object
## sex object
## age float64
## sibsp int64
## parch int64
## ticket object
## fare float64
## cabin object
## embarked object
## boat object
## body float64
## home.dest object
## dtype: object

The data types tell us that both pclass and survived are integers, while name and sex are "objects". Here that means strings; in principle they may also be more complex objects.


5.2 What are the values?

One of the first steps with a new dataset is to check what the values of the relevant variables are. The pandas library contains many tools for descriptive data analysis. For categorical variables we usually want to see the explicit values; for numeric ones we may check the minimum and maximum.

5.2.1 Different discrete values

What are the possible values for sex? A quick look at the data suggests these are male and female, but are there any others? We can query that with the .unique() method:

titanic.sex.unique()

## array(['female', 'male'], dtype=object)

The result tells us that all passengers in this data are categorized as “male” or
“female”. This looks perfectly plausible. We can also see a technical detail,
namely that the result is returned as a numpy array.

Tip:
A word about methods and method chaining. Let's revisit the previous command, titanic.sex.unique(). It contains three components, and the process applies them in sequential order:

titanic is the dataset. This is what we start with.


.sex is an attribute of titanic; here it means extracting the sex variable from the dataset. It results in a series.

Finally, .unique() is a method: it takes whatever is at its left side (here the series sex) and works with that (here it returns an array of unique values).

Some methods can be applied either to whole data frames or to individual variables. If applied to the whole data frame, they apply to every single variable separately, pretty much as if looping over the individual variables and applying the method to each of them.
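This looping behavior can be sketched with a small toy frame (a stand-in, not the actual Titanic data):

```python
import pandas as pd

# toy stand-in for the Titanic data
df = pd.DataFrame({"sex": ["male", "female", "male"],
                   "pclass": [1, 3, 3]})

# applied to a single column, .nunique() returns one number:
print(df.sex.nunique())       # 2

# applied to the whole frame, it is applied column by column,
# returning a series indexed by variable name:
print(df.nunique())
```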

But what about boats? ("boat" records the lifeboat in which the passenger was found)

titanic.boat.unique()

## array(['2', '11', nan, '3', '10', 'D', '4', '9', '6', 'B', '8', 'A',
## '7', 'C', '14', '5 9', '13', '1', '15', '5 7', '8 10', '12', '1
## '13 15 B', 'C D', '15 16', '13 15'], dtype=object)

This list gives a much more complex picture. There is a plethora of boat ids, including numbers and letters. In a number of cases the data also contains multiple boat ids, perhaps because the rescuers out on the sea had more important things to do than to record the boat numbers. And nan tells us that not everyone was fortunate enough to get to a boat.

We can also easily query the number of unique values, using the .nunique() method:


titanic.boat.nunique()

## 27

We can see that there are 27 different boat codes. But note that this is not the
same as the number of distinct values returned by .unique :

len(titanic.boat.unique())

## 28

The discrepancy arises because .unique also includes missing values, while .nunique does not count them.
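.nunique() also takes a dropna argument that makes the two viewpoints agree; a minimal sketch on a toy series:

```python
import numpy as np
import pandas as pd

boat = pd.Series(["2", "11", np.nan, "3"])
print(len(boat.unique()))           # 4 -- NaN is included
print(boat.nunique())               # 3 -- NaN is not counted
print(boat.nunique(dropna=False))   # 4 -- count NaN as a value too
```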

We can get a frequency table with .value_counts():

titanic.sex.value_counts()

## male 843
## female 466
## Name: sex, dtype: int64

We see that there are almost twice as many males as females in the data. This is plausible as well. (See also Section 7.1.)
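.value_counts() can also report shares instead of counts via its normalize argument; here sketched on a toy series with the same male/female counts as above:

```python
import pandas as pd

sex = pd.Series(["male"] * 843 + ["female"] * 466)
print(sex.value_counts())                 # counts: 843 and 466
print(sex.value_counts(normalize=True))   # shares of the total
```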


5.2.2 Continuous values

For continuous variables, we do not want to count and print individual values. Instead, we may choose to just check the minimum and maximum value. This gives a quick overview of whether the values are feasible, e.g. whether we see negative ages or implausibly large age values:

titanic.age.min(), titanic.age.max()

## (0.1667, 80.0)

We can see that the youngest passenger was 0.1667 years old (= 2 months) and the oldest passenger was 80 years old. Both of these values are plausible, so we are not worried about any implausibly large or small values. See Section 6.1 below for how to detect and analyze missing values.
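If we want more than the extremes, .describe() collects the count of non-missing values, mean, min, max and quartiles in one call; a sketch on a toy age series:

```python
import numpy as np
import pandas as pd

age = pd.Series([0.1667, 25.0, np.nan, 80.0])
print(age.min(), age.max())   # extremes, as above
# .describe() reports count (non-missing), mean, quartiles, min and max:
print(age.describe())
```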

Another common way to get an idea about a numeric variable is to compute its average. Let us compute the mean survival rate. Remember: survived = 1 means the person survived, and survived = 0 means the person went down with the Titanic. We can compute the average with the .mean() method:

titanic.survived.mean()

## 0.3819709702062643

We see that less than 40% of passengers in this dataset survived.


NB!
This figure may not be what you want because of missing values. See
Section 6.1.4 below for explanations.

See Section 6.1 below for how to count missing values.

Exercise 5.2 Consider the methods .min, .mean, .unique and .nunique. Which of these can be applied to the whole data frame? See the tip in Section 5.2.1.

See the solution

5.2.3 Counts and proportions

Another common task is to count, or to compute proportions of, certain values, such as the number of women or the percentage of children in the data. The .value_counts method (see Section 5.2.1 above) can be used to get counts of discrete values, e.g. the number of women. But you cannot easily use .value_counts to compute the number of children: there is no category "children", and hence there are no values to count. Instead, we should create a logical vector that describes the condition, and sum it. Let's find the number of children on the ship using this approach.

Assume children are passengers who are less than 14 years old. Counting
can be done explicitly as:


## create a logical variable 'children'
children = titanic.age < 14
## count children
children.sum()

## 99

So there were 99 children on the Titanic. Let's explain these steps in more detail. First, the line

children = titanic.age < 14

creates a logical variable children , containing Trues and Falses, telling for
each passenger whether they were a child or not. A sample of this variable
looks like

children.head(5)

## 0 False
## 1 True
## 2 True
## 3 False
## 4 False
## Name: age, dtype: bool

So the second and third passengers were children, while the others in this sample were not. Next, children.sum() adds these values together. Remember: as soon as we start doing math with logical values, True is converted to 1 and False to 0, and the latter does not matter when adding. So we are effectively counting children here. In practice, you do not need to create a separate variable;

(titanic.age < 14).sum()

## 99

will usually suffice.

Computing proportions can be done in a similar way: instead of .sum, we use .mean():

(titanic.age < 14).mean()

## 0.07563025210084033

Hence 7.56 percent of the passengers were children.

NB!
This figure may not be what you mean as proportion of children because of
missing values. See Section 6.1.4 below for explanations.

What happens here is similar to counting: to compute a proportion, we need to divide the count by the total number of cases. You can do it explicitly, but .mean() is easier to write, and also easier to read.
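The equivalence between explicit division and .mean() can be sketched on a toy age series; note that a comparison with NaN yields False, which is exactly the missing-value caveat mentioned above:

```python
import numpy as np
import pandas as pd

age = pd.Series([2.0, 30.0, np.nan, 10.0, 50.0])
kids = age < 14                   # NaN < 14 evaluates to False
print(kids.sum())                 # 2
print(kids.sum() / len(kids))     # 0.4 -- explicit division
print(kids.mean())                # 0.4 -- same result, easier to read
```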


5.3 Filtering data

Finally, let's extract the subset of passengers who were in either boat "A" or "B". This can be done using the .isin() method, which checks whether each value is in a given list:

ab = titanic[titanic.boat.isin(["A", "B"])]
ab.shape

## (20, 14)

ab.head(10)

## pclass survived ... body home.dest
## 14 1 1 ... NaN Hessle, Yorks
## 19 1 0 ... NaN Winnipeg, MN
## 136 1 1 ... NaN Washington, DC
## 235 1 1 ... NaN Paris / New York, NY
## 295 1 1 ... NaN Haverford, PA
## 317 1 1 ... NaN Geneva, Switzerland / Radnor, PA
## 503 2 1 ... NaN Chelsea, London
## 603 3 1 ... NaN East Providence, RI
## 605 3 1 ... NaN Perkins County, SD
## 630 3 1 ... NaN NaN
##
## [10 rows x 14 columns]

One can see that these two boats together only took 20 passengers…
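The same .isin() condition can be negated with ~ to select everyone who was not in those boats; a toy sketch:

```python
import pandas as pd

df = pd.DataFrame({"boat": ["A", "2", "B", None]})
# ~ negates the logical condition: passengers not in boat "A" or "B"
others = df[~df.boat.isin(["A", "B"])]
print(others.shape)  # (2, 1) -- note that missing boats are kept here
```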


5.4 Selecting variables

Typical datasets we encounter contain variables that are irrelevant for the current analysis. In such cases we may want either to select only the relevant ones (if there are only a few such variables) or to remove the irrelevant ones (if there are not too many of those). Removing irrelevant information helps us to see the data better when printing it, to understand it better, and hence to make fewer coding errors. It also helps with certain transformations, where we may now want to operate on the whole narrowed-down data frame instead of selected variables only. Finally, it also helps to conserve memory and speed up processing.

5.4.1 Selecting desired variables

We discussed how to select the desired variables in Section 3.3.1, but we briefly review it here too. You can select only the desired variables with dataframe[["var1", "var2", ...]]. For instance, let's focus on only sex, pclass, and survived:

titanic[["sex", "pclass", "survived"]].sample(4)

## sex pclass survived
## 904 male 3 0
## 374 male 2 0
## 140 male 1 1
## 758 male 3 1


Note that the columns are returned in the order they are listed in the selection,
not in the original order.

Instead of loading the complete dataset and selecting the desired variables
later, we can also specify which columns to read with pd.read_csv :

pd.read_csv("../data/titanic.csv.bz2",
            usecols=["sex", "pclass", "survived"]).sample(4)

## pclass survived sex
## 137 1 1 female
## 131 1 1 female
## 640 3 0 male
## 486 2 0 male

This achieves a similar result, although the columns now are in the original
order, not in the specified order.

5.4.2 Removing undesired variables

Alternatively, if we want to keep most of the variables, we may instead specify which ones to remove. This can be done with the .drop method. Below we drop a large number of variables:

titanic.drop(["name", "sibsp", "boat", "parch", "cabin", "embarked",
              "home.dest", "ticket", "body"],
             axis=1).sample(4)


## pclass survived sex age fare
## 998 3 0 male NaN 7.8958
## 274 1 1 male 45.0 134.5000
## 528 2 0 male NaN 0.0000
## 1078 3 1 female NaN 7.8792

Note that we need to tell .drop that we are talking about removing columns with these names, not rows with this index. This is what the argument axis=1 does. Otherwise we get an error:

titanic.drop(["name", "sibsp", "boat", "parch", "embarked",
              "home.dest"]).sample(4)

## Error in py_call_impl(callable, dots$args, dots$keywords): KeyError: "

In particular, this means that .drop will raise an error if we try to remove a non-existent column. This may cause problems when overwriting data in notebooks: on the first pass you remove a variable, and when you re-run the same cell, the variable is already gone and you get an error. As always, either rename the data, avoid re-running the data cleaning step, or incorporate error handling.
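One such piece of error handling is the errors="ignore" option of .drop, which makes re-running a cell harmless; a minimal sketch on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1], "b": [2]})
df = df.drop(columns=["b"])              # first run: "b" is removed
# a second identical drop would raise KeyError; errors="ignore" avoids it:
df = df.drop(columns=["b"], errors="ignore")
print(df.columns.tolist())  # ['a']
```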

Exercise 5.3 You want to plot a histogram of the age distribution for Titanic passengers, separately for males and females.

What are the relevant variable names?

How will you select the relevant columns?


See the solution

5.5 Grouped operations

Like other major dataframe libraries, pandas supports grouped operations. Grouped operations are performed essentially like this: first, the data is split into groups by the values of the grouping variable. Second, the subsequent operations are performed on each group individually, independently of the other groups. Finally, all the results are put together again, along with the group indicators.

Grouping in pandas can be done with the .groupby method. It expects the name of a grouping variable, or a list of such names. We demonstrate this by computing the survival rate by gender (gender is in the variable sex and survival in survived). By itself, .groupby just creates a grouped data frame that unfortunately does not have any useful printing method:

titanic.groupby("sex")

## <pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f20cb35f0d0>

But we can use the grouped dataframe for other operations; in this case we compute the mean of survived:

titanic.groupby("sex").survived.mean()


## sex
## female 0.727468
## male 0.190985
## Name: survived, dtype: float64

The result is a series with two values (the numeric survival rates), with the index being the group values, female and male. In this example we see that the female survival rate was 72% while only 19% of men survived the disaster.

We can also group by more than one variable. In that case we have to supply the grouping variables as a list; e.g. to compute the survival rate by gender and passenger class (variable pclass) we can do

titanic.groupby(["sex", "pclass"]).survived.mean()

## sex pclass
## female 1 0.965278
## 2 0.886792
## 3 0.490741
## male 1 0.340782
## 2 0.146199
## 3 0.152130
## Name: survived, dtype: float64

We see that passenger class was also a very important determinant of survival.

.groupby supports more options; e.g. one can keep the group indicators as variables instead of the index.
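For instance, the as_index=False option keeps the group labels as an ordinary column; a toy sketch of the survival-rate computation:

```python
import pandas as pd

df = pd.DataFrame({"sex": ["m", "f", "m", "f"],
                   "survived": [0, 1, 1, 1]})
# default: group labels become the index
print(df.groupby("sex")["survived"].mean())
# as_index=False: group labels stay as a regular column
print(df.groupby("sex", as_index=False)["survived"].mean())
```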


5.6 String operations

Pandas opens string variables to a large list of string functions using .str
attribute. These largely replicate the re module but the syntax is different, and
often the function names are different too. We do walk through a number of
examples here, namely

str.contains to search for patterns in strings

str.match to check whether a string begins with a given pattern

str.replace to replace parts of strings

Many of the string functions take advantage of regular expressions, so it is useful to have an idea of regular expression basics.

Let's use the Titanic data and analyze whether there is a regular pattern relating cabin codes and passenger class. First, what are the existing cabin numbers?

titanic.cabin.unique()


## array(['B5', 'C22 C26', 'E12', 'D7', 'A36', 'C101', nan, 'C62 C64', 'B
## 'A23', 'B58 B60', 'D15', 'C6', 'D35', 'C148', 'C97', 'B49', 'C9
## 'C52', 'T', 'A31', 'C7', 'C103', 'D22', 'E33', 'A21', 'B10', 'B
## 'E40', 'B38', 'E24', 'B51 B53 B55', 'B96 B98', 'C46', 'E31', 'E
## 'B61', 'B77', 'A9', 'C89', 'A14', 'E58', 'E49', 'E52', 'E45',
## 'B22', 'B26', 'C85', 'E17', 'B71', 'B20', 'A34', 'C86', 'A16',
## 'A20', 'A18', 'C54', 'C45', 'D20', 'A29', 'C95', 'E25', 'C111',
## 'C23 C25 C27', 'E36', 'D34', 'D40', 'B39', 'B41', 'B102', 'C123
## 'E63', 'C130', 'B86', 'C92', 'A5', 'C51', 'B42', 'C91', 'C125',
## 'D10 D12', 'B82 B84', 'E50', 'D33', 'C83', 'B94', 'D49', 'D45',
## 'B69', 'B11', 'E46', 'C39', 'B18', 'D11', 'C93', 'B28', 'C49',
## 'B52 B54 B56', 'E60', 'C132', 'B37', 'D21', 'D19', 'C124', 'D17
## 'B101', 'D28', 'D6', 'D9', 'B80', 'C106', 'B79', 'C47', 'D30',
## 'C90', 'E38', 'C78', 'C30', 'C118', 'D36', 'D48', 'D47', 'C105
## 'B36', 'B30', 'D43', 'B24', 'C2', 'C65', 'B73', 'C104', 'C110',
## 'C50', 'B3', 'A24', 'A32', 'A11', 'A10', 'B57 B59 B63 B66', 'C2
## 'E44', 'A26', 'A6', 'A7', 'C31', 'A19', 'B45', 'E34', 'B78', 'B
## 'C87', 'C116', 'C55 C57', 'D50', 'E68', 'E67', 'C126', 'C68',
## 'C70', 'C53', 'B19', 'D46', 'D37', 'D26', 'C32', 'C80', 'C82',
## 'C128', 'E39 E41', 'D', 'F4', 'D56', 'F33', 'E101', 'E77', 'F2
## 'D38', 'F', 'F G63', 'F E57', 'F E46', 'F G73', 'E121', 'F E69
## 'E10', 'G6', 'F38'], dtype=object)

We can see that a cabin code is typically a letter followed by two or three digits. We can guess that the letter denotes the deck and the number is the cabin number on that deck. In several cases, however, there are apparently data entry errors, manifested by a single-letter code or multiple codes for a single person. (Although passengers may also have booked more than one cabin.)


Assume everyone only has a single cabin. We’ll assign a single cabin code to
each passenger. This will be the existing single cabin if coded in that way. If
the value of cabin contains multiple codes, then we take the first valid code
(i.e. the one in the form of letter + digits).

5.6.1 Find patterns in strings

As the first step, we find the problematic cabin codes. In most cases these contain a space; there are also single-letter codes, which are presumably wrong. We can identify patterns in strings using str.contains(pattern). This method returns a vector of True/False values, depending on whether each original string contains the pattern:
titanic.cabin.str.contains(" ", na=False, regex=False).head(6)

## 0 False
## 1 True
## 2 True
## 3 True
## 4 True
## 5 False
## Name: cabin, dtype: bool

The argument na tells what to do with missing values; here we force them to return False. regex=False means that the pattern we supply is not a regular expression. Plain patterns are simpler and faster to use than regular expressions, but the latter are much more powerful. We see that neither the first nor the sixth cabin contains a space, but the second through fifth do. This can be confirmed by looking at the cabin values:

titanic.cabin.head(6)

## 0 B5
## 1 C22 C26
## 2 C22 C26
## 3 C22 C26
## 4 C22 C26
## 5 E12
## Name: cabin, dtype: object

We can also check how many such problematic cases we have:

titanic.cabin.str.contains(" ", na=False, regex=False).value_counts()

## False 1268
## True 41
## Name: cabin, dtype: int64

We can see we have 41 incorrectly filled cabin numbers.

Note that:

first, you have to use the .str attribute to open a column up for string operations; .cabin.contains() does not work.

.contains returns a logical value, or NA in case of a missing entry. If we want to use the result for indexing later, we want to exclude the missing entries, and hence mark the missings as False (na=False).

the first argument (pat) is normally a regular expression. However, here we can make do with a simple pattern, so no regular expression is needed; hence we say regex=False.
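The two modes can be compared on a toy cabin series; the plain pattern " " and the regular expression \s find the same spaces here:

```python
import pandas as pd

cabin = pd.Series(["B5", "C22 C26", None])
# plain substring search:
print(cabin.str.contains(" ", na=False, regex=False).tolist())
# equivalent regular-expression search for whitespace:
print(cabin.str.contains(r"\s", na=False, regex=True).tolist())
# both print [False, True, False]
```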

5.6.2 Match a pattern

Next, we use a regular expression that identifies both types of errors: a space in the code, and a missing number. Instead of looking for strings that contain a space, we look for strings that match the valid pattern: a single letter followed by digits. We demonstrate this using str.match, a method similar to str.contains, except that it checks whether the beginning of the string matches the pattern (str.contains looks for the pattern anywhere in the string). This is fortunately easy to do with regular expressions; the pattern can be coded as r"\w\d+$":

titanic.cabin.str.match(r"\w\d+$", na=False).value_counts()

## False 1061
## True 248
## Name: cabin, dtype: int64

str.match always assumes we have a regular expression, so there is no regex argument. The "r" in front of the expression means it is a "raw" string, i.e. the backslashes in the string are interpreted literally, not as special codes.


Inside the expression we have \w for a letter (word character), followed by \d+, one or more digits. The final $ means that the string must end there, i.e. it must not contain extra spaces or other symbols (such as another cabin code).
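A toy sketch of this pattern; pandas also offers str.fullmatch, which anchors both ends of the string so the trailing $ is not needed:

```python
import pandas as pd

cabin = pd.Series(["B5", "C22 C26", "D", "E121"])
print(cabin.str.match(r"\w\d+$").tolist())     # [True, False, False, True]
print(cabin.str.fullmatch(r"\w\d+").tolist())  # same result
```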

The result shows a rather different picture now: 248 codes match the valid pattern, while the space-based search above flagged only 41 codes. Note an important difference between this approach and the one above: when looking for patterns that contain a space we were looking for wrong patterns; now we are looking for valid patterns. By specifying na=False above, we told pandas to count missings among the non-matching, i.e. invalid, patterns. As we are only interested in invalid codes, not missings, we count missings as correct (na=True) in the example below. Here is an example of mismatched cabin codes:

i = titanic.cabin.str.match(r"\w\d+$", na=True)
titanic.cabin[~i].sample(6)

## 116 C23 C25 C27
## 17 B58 B60
## 111 C23 C25 C27
## 491 D
## 568 F
## 115 C23 C25 C27
## Name: cabin, dtype: object

5.6.3 Replacing strings

Finally, let's replace the full cabin number with only the deck code. This can be done by replacing everything that follows the first letter with an empty string. str.replace takes two arguments: a pattern and a replacement. We specify the former as r"(\w).*", i.e. a letter, followed by an unspecified number of other symbols, and replace the match with just that same letter. Note that in recent pandas versions str.replace treats the pattern as a plain string by default, so we pass regex=True explicitly:

titanic.cabin.str.replace(r"(\w).*", r"\1", regex=True).sample(6)

## 412 NaN
## 1050 NaN
## 950 NaN
## 57 B
## 1197 NaN
## 1180 NaN
## Name: cabin, dtype: object

The result contains the deck codes (and many missings).

In practical terms, it is often useful to print the original and modified codes side by side to see that the recoding was done right. We can achieve this by creating a new variable in the data frame, and printing out a sample of the old and new variables (see the section on concatenating data with pd.concat for another approach):

titanic["deck"] = titanic.cabin.str.replace(r"(\w).*", r"\1", regex=True)
titanic[["cabin", "deck"]].dropna().sample(6)


## cabin deck
## 7 A36 A
## 228 C65 C
## 169 C49 C
## 304 D37 D
## 133 C92 C
## 269 A19 A

We can see that the deck codes (the ones that are not missing) were extracted correctly.
