
Loading and Wrangling Data with Pandas and NumPy
Outline

1. Loading and saving data with Pandas
2. Exploratory Data Analysis (EDA) and basic data cleaning with Pandas
3. Cleaning data
4. Data transformations
• The term "data wrangling" has become a common phrase in data science; it generally means cleaning and preparing data for downstream uses such as analytics and modeling.

• We use the Chinook iTunes dataset for this session.
1. Loading and saving data with Pandas
Pandas provides several functions to load data from a variety of file types, including the CSV, Excel, and SQLite database files that we need to load into Python for analysis.

We'll start with the simplest file, the CSV. The acronym CSV stands for comma-separated values: it is a plain text file whose values are separated by commas. The first line holds the headers, which are the column labels as in a spreadsheet. Each line after that is a row of data, with the values separated by commas.
We can load it into a DataFrame like so:

• Install Pandas: !pip install pandas
• After importing the pandas library, load the data with pd.read_csv("source file")
• Look at the first 5 rows: df.head()
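A minimal end-to-end sketch (the filename itunes_data.csv is a placeholder, not from the original slides):

import pandas as pd

# Load the CSV file into a DataFrame (path is illustrative)
csv_df = pd.read_csv('itunes_data.csv')

# Look at the first 5 rows
print(csv_df.head())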
Another common file format is Excel. We can load it as a DataFrame like this:
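A sketch of the Excel equivalent (the filename is again a placeholder; depending on your pandas version, reading .xlsx files may require the openpyxl package):

import pandas as pd

# Load the first sheet of an Excel workbook into a DataFrame
excel_df = pd.read_excel('itunes_data.xlsx')
print(excel_df.head())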
Load chinook.db with SQLite in Python, as covered in the last lecture. Then we create our SQL query as a multi-line string:
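A sketch of the connection and query, reconstructed from the description below (the join and alias details follow the standard Chinook schema; SQLite identifiers are case-insensitive):

import sqlite3

# Connect to the SQLite database file
con = sqlite3.connect('chinook.db')

# Multi-line query joining tracks to genres, albums, and artists,
# selecting only non-ID columns and aliasing some of them
query = """
SELECT tracks.Name AS Track,
       tracks.Composer,
       tracks.Milliseconds,
       tracks.Bytes,
       tracks.UnitPrice,
       genres.Name AS Genre,
       albums.Title AS Album,
       artists.Name AS Artist
FROM tracks
JOIN genres ON tracks.GenreId = genres.GenreId
JOIN albums ON tracks.AlbumId = albums.AlbumId
JOIN artists ON albums.ArtistId = artists.ArtistId;
"""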
In this query, we get the data from the tracks table in the database and join it to the genres, albums, and artists tables to get the names of the genres, albums, and artists. We also select only the non-ID columns from each table. Notice we alias some of the columns, such as tracks.Name AS Track. This makes the data easier to understand once it is in the pandas DataFrame, since the column names will be changed to these aliases.
Use the pandas read command for SQL queries:

sql_df = pd.read_sql_query(query, con)

Now we can look at the top few rows of the data. We use the head() method of DataFrames with n=2 to print out two rows, and then transpose the result with T, which switches the rows and columns:

sql_df.head(2).T
1.1. Understanding the DataFrame structure and combining/concatenating multiple DataFrames

DataFrames have a certain structure: a number of columns storing data, and an index. This index can be used as one way to access the data. We can access the index like so:

sql_df.index
We can view the columns we have with this:

sql_df.columns

This gives us the list of columns as a pandas Index (a list-like structure).
We can check the type of a variable to see whether it is a DataFrame or Series:

type(sql_df)

To combine our three DataFrames into one, we'll use the pd.concat() function:

itunes_df = pd.concat([csv_df, excel_df, sql_df])
2. Exploratory Data Analysis (EDA) and basic data cleaning with Pandas

In general, we can follow this EDA checklist:

• Examine the top and bottom of the data
• Examine the data's dimensions
• Examine the datatypes and missing values
• Investigate statistical properties of the data
• Create plots of the data
To look at the bottom of the data, we use:

itunes_df.tail()
Remember that if we have many columns, we can transpose the printout with itunes_df.tail().T, which swaps the columns and rows.
There are two ways to index in pandas: by row number or by index value. To index by row number, we use iloc:

print(itunes_df.iloc[0])
print(itunes_df.iloc[-1])
With iloc, we can also choose a single column. For example, the following commands print out the value of the first row and first column (an index of [0, 0]) and the last row and last column (an index of [-1, -1]):

print(itunes_df.iloc[0, 0])
print(itunes_df.iloc[-1, -1])
Indexing can also be done by index value. From our use of tail(), we saw that our last index value is 3502. So, let's print that out using loc indexing,

print(itunes_df.loc[3502])

which takes an index value instead of a row number. This will print out all rows with the index value of 3502.
Create another DataFrame, append a copy of the last row, and use loc again:

test_df = itunes_df.copy()
test_df = pd.concat([test_df, itunes_df.loc[[3502]]])
test_df.loc[3502]

On the first line, we make a copy of the existing itunes_df so that we won't be altering our original DataFrame. Then we add on a copy of the last row with pd.concat(), so test_df now has two rows with the index value 3502.
If we do have a situation with duplicate index values, we can change our index to be unique, sequential numbers like so:

test_df.reset_index(inplace=True, drop=True)

This resets our index for test_df to a sequential RangeIndex. Note that if we don't use drop=True, then the current index is inserted as a new column in the DataFrame.
Select a column of data:

itunes_df['Milliseconds']

To select multiple columns, we can use a list of strings:

itunes_df[['Milliseconds', 'Bytes']]
2.1. Examining the data's dimensions, datatypes, and missing values

Examine the rows and columns with the shape attribute:

print(itunes_df.shape)

Looking at the datatypes and missing values can be done with info():

itunes_df.info()
Look at the number of missing values with this:

itunes_df.isna().sum()

This gives us the counts of missing values (stored as NaN, for "not a number", but also called NA, for "not available"). The first part, itunes_df.isna(), returns a DataFrame full of True and False values for each cell in the original DataFrame: True if the value is NA, False if it is not. Then .sum() sums this up along each column.
2.2. Investigating statistical properties of the data

The first step in examining the statistics of a dataset is to use a pandas command:

itunes_df.describe()
This shows a summary of a few statistics, including the number of non-missing (non-NA) values (count), the average (mean), the standard deviation (std), the minimum and maximum, and a few percentiles. Note that these statistics can also be found with functions, like df.std() for standard deviation. The 25% row means the 25th percentile. What this tells us is that 25% of the data lies at or below the value of 0.99 for the UnitPrice column, for example. We have 25% of the data contained between each of the breakpoints from min through max (including the 25th, 50th, and 75th percentiles).
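A quick sketch of computing the same numbers one at a time (the column name UnitPrice follows the query aliases above):

# Standard deviation and the 25th percentile of one column
print(itunes_df['UnitPrice'].std())
print(itunes_df['UnitPrice'].quantile(0.25))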
For non-numeric columns, we can look at the mode (the most frequent value):

itunes_df['Genre'].mode()

We can get more specific and look at how many times each unique value appears:

itunes_df['Genre'].value_counts()
If we have many unique values, we can look at only a few with indexing:

itunes_df['Genre'].value_counts()[:5]

To investigate how many unique items there are, the unique function is helpful:

itunes_df['Artist'].unique().shape
Let's look at correlations between our data columns:

numeric_df = itunes_df.select_dtypes(include=['float64', 'int64'])
correlation_matrix = numeric_df.corr()
print(correlation_matrix)

This calculation is the Pearson correlation, which measures how linearly correlated two datasets are. It ranges from -1 (inverse correlation: variable 1 increases proportionally when variable 2 decreases) to 0 (no correlation) to 1 (perfectly linear correlation: variable 1 increases proportionally when variable 2 increases). So, the correlation between a single numeric data column and itself is 1 by definition; if we plotted a dataset against itself, it would be a straight, diagonal line. We can see from our data that all of our numeric data is strongly linearly correlated, which makes sense.
2.3. Plotting with DataFrames

Plotting is part of any EDA process, and pandas makes it easy to plot data from DataFrames. A few common plots you might use from pandas are bar, histogram, and scatter plots.
First, we need to import the standard Python plotting library, matplotlib:

import matplotlib.pyplot as plt

Plot our data:

itunes_df['Milliseconds'].hist(bins=30)
plt.show()

The first line selects the Milliseconds column and uses the hist() method of pandas Series objects. We set the option bins=30 to increase it from the default of 10; this specifies how many bars the histogram is broken up into. Then we use the plt.show() command to display the plot.
Let's look at a scatter plot of song length versus song size in bytes:

itunes_df.plot.scatter(x='Milliseconds', y='Bytes')
plt.show()
Let's look at a bar plot of non-numeric data. We can use value_counts again and create a bar plot:

itunes_df['Genre'].value_counts().plot.bar()
plt.show()
3. Cleaning data

Some common data cleaning steps include:

• Removing irrelevant data
• Dealing with missing values (filling in or dropping them)
• Dealing with outliers
• Dealing with duplicate values
• Ensuring datatypes are correct
• Standardizing data formats (e.g., mismatched capitalization, converting units; see the sketch below)
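A minimal sketch of the last step, standardizing formats (the column choices and conversion are illustrative, not from the original slides):

# Work on a copy so the original DataFrame is untouched
clean_df = itunes_df.copy()

# Standardize capitalization of a text column
clean_df['Genre'] = clean_df['Genre'].str.strip().str.title()

# Convert units: milliseconds to seconds, stored in a new column
clean_df['Seconds'] = clean_df['Milliseconds'] / 1000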
4. Filtering DataFrames

We can filter data in pandas DataFrames. For example, let's get the longest songs from our data, which we saw in our scatter plot. These have lengths over 4,000,000 ms:

itunes_df[itunes_df['Milliseconds'] > 4e6]
We use the usual indexing format for DataFrames: the variable name followed by square brackets. Then, we give it a so-called Boolean mask. Try running just the inner part of the indexing:

itunes_df['Milliseconds'] > 4e6

You will see this returns a pandas Series with True or False values. This is our Boolean mask. When we provide this as an indexing command to our DataFrame, it only returns rows where our mask is True. Notice we are using scientific notation for 4,000,000 as well: 4e6 means 4 × 10⁶.
Let's take this one step further and look at the value counts of genres for songs over 2,000,000 ms. The first part, itunes_df[itunes_df['Milliseconds'] > 2e6], returns a DataFrame, which can be indexed as usual. Then we use value_counts on its Genre column to get the output:
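The full expression, reconstructed from the description above:

# Genre counts among songs longer than 2,000,000 ms
print(itunes_df[itunes_df['Milliseconds'] > 2e6]['Genre'].value_counts())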
We can get the points with the smaller values of bytes by filtering with multiple conditions:
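A sketch with two conditions (the Bytes cutoff of 0.4e9 is an assumed, illustrative value; each condition needs its own parentheses, combined with &):

# Long songs that are nevertheless small in bytes (cutoff is illustrative)
itunes_df[(itunes_df['Milliseconds'] > 2e6) & (itunes_df['Bytes'] < 0.4e9)]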
If we want to get all genres that are not TV Shows, we could use this filter:

itunes_df[itunes_df['Genre'] != 'TV Shows']

Another way to negate a condition is with the tilde character:

itunes_df[~(itunes_df['Genre'] == 'TV Shows')]
4.1. Removing irrelevant data
With our iTunes data, we may not really need the Composer column. We could drop this column like so:

itunes_df.drop('Composer', axis=1, inplace=True)
itunes_df.columns

We use the drop function of DataFrames, giving the column name as the first argument. We can drop multiple columns at once by supplying a list. The axis=1 argument specifies dropping a column, not a row, and inplace=True changes the DataFrame itself instead of returning a new, modified DataFrame. Then we examine the remaining columns with the columns attribute of our DataFrame.
If we have other irrelevant data we want to remove, say any genres that are not music, we could do so with filtering:
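A sketch of this filter (the list of non-music genres is an assumption, since the slide screenshot is not available):

# Hypothetical list of non-music genres to exclude
non_music = ['TV Shows', 'Drama', 'Comedy', 'Sci Fi & Fantasy', 'Science Fiction']

# Keep only the rows whose Genre is NOT in that list
only_music = itunes_df[~itunes_df['Genre'].isin(non_music)]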

This uses filtering with the isin method. The isin method checks whether each value is in the list or set provided to the function. In this case, we also negate the condition with the tilde (~) so that the non-music genres are excluded, and our only_music DataFrame has only genres that are music, just as the variable name suggests.
4.2. Dealing with missing values

Missing values arise all the time in datasets. Often, they are represented as NA or NaN values. In terms of dealing with missing values, we have some options:

• Leave the missing values as-is
• Drop the data
• Fill with a specific value
• Replace with the mean, median, or mode
• Use machine learning to replace missing values
For example, we saw that our Composer column has several missing values. We can use filtering to see what some of these rows look like:
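A sketch of this step (the random_state value is an arbitrary choice for reproducibility):

# Rows where Composer is missing: sample 5 reproducibly, then inspect
itunes_df[itunes_df['Composer'].isna()].sample(5, random_state=42).head()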

Here, we take a random sample of 5 datapoints with sample(), giving it a random_state so the results are the same every time we run it. Then we look at a few rows with head(). In this case, we get results from all sorts of genres: TV shows, Latin, and so on.
Another option is to drop the missing values. We can either drop the entire column, as we did earlier, or we can drop the rows with missing values like this:

itunes_df.dropna(inplace=True)

The dropna function has several other parameters (options), but here we are simply specifying that it should modify the existing DataFrame with inplace=True. By default, this drops any rows with at least one missing value.
If we were trying to do another type of machine learning, like clustering, we might want to fill the missing values with a specific value. Filling with a specific value could be done like so:
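A sketch of filling with a specific value (the placeholder string 'Unknown' is an illustrative choice):

# Replace missing Composer values with a placeholder string
itunes_df['Composer'] = itunes_df['Composer'].fillna('Unknown')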
Thank you
