7. Loading and Wrangling Data with Pandas and NumPy
Outline
1. Loading and saving data with Pandas
2. Exploratory Data Analysis (EDA) and basic data cleaning with Pandas
3. Cleaning data
4. Data transformations
• The term "data wrangling" has become a common phrase in data science, and generally means cleaning and preparing data for downstream uses such as analytics and modeling.
1. Loading and saving data with Pandas
Pandas provides several functions to load data from a variety of file types – the CSV, Excel, and SQLite database files that we need to bring into Python for analysis.
We'll start with the simplest file – the CSV. The acronym CSV stands for comma-separated values, and it is a plain text file.
Values are separated by commas. The first line holds the headers, which are the column labels in a spreadsheet. Each line after that is a row of data, with the values separated by commas.
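For example, a CSV can be loaded into a DataFrame with pd.read_csv and saved back out with to_csv (the filename here is hypothetical):
import pandas as pd
# parse the CSV file into a DataFrame; 'itunes_data.csv' is an assumed filename
csv_df = pd.read_csv('itunes_data.csv')
# save it back to disk without writing the index as an extra column
csv_df.to_csv('itunes_data_copy.csv', index=False)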
• Load chinook.db with SQLite in Python: see last lecture.
• In this query, we get the data from the tracks table in the database and join it with the genres, albums, and artists tables to get the names of the genres, albums, and artists. We also select only the non-ID columns from each table. Notice we alias some of the columns, such as tracks.name AS Track. This makes the data easier to understand once it is in the pandas DataFrame, since the column names will be changed to these aliases.
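A minimal sketch of such a query, assuming the standard chinook.db schema (lowercase plural table names); the exact column list is an assumption based on the description above:
import sqlite3
import pandas as pd
# open a connection to the database file and read the query results into a DataFrame
connection = sqlite3.connect('chinook.db')
query = """
SELECT tracks.name AS Track,
       genres.name AS Genre,
       albums.title AS Album,
       artists.name AS Artist,
       tracks.composer AS Composer,
       tracks.milliseconds AS Milliseconds,
       tracks.bytes AS Bytes,
       tracks.unitprice AS UnitPrice
FROM tracks
JOIN genres ON tracks.genreid = genres.genreid
JOIN albums ON tracks.albumid = albums.albumid
JOIN artists ON albums.artistid = artists.artistid
"""
sql_df = pd.read_sql_query(query, connection)
connection.close()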
1.1. Understanding the DataFrame structure and combining/concatenating multiple DataFrames
We can examine the index of the DataFrame with its index attribute:
sql_df.index
sql_df.columns
This gives us the list of columns as a pandas Index (a list-like structure).
We can check what type of object we are working with:
type(sql_df)
To combine our three DataFrames into one, we'll use the pd.concat()
function:
itunes_df = pd.concat([csv_df, excel_df, sql_df])
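Here, csv_df and excel_df are assumed to have been loaded earlier from the CSV and Excel files, for example (the Excel filename is hypothetical; pd.read_excel also needs an engine such as openpyxl installed):
excel_df = pd.read_excel('itunes_data.xlsx')
Note that pd.concat stacks the rows of the three DataFrames and keeps their original index values, so the combined index can contain duplicates until it is reset (reset_index is covered below).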
2. Exploratory Data Analysis (EDA) and basic data cleaning with Pandas
To look at the last few rows of our data, we use tail():
itunes_df.tail()
There are two ways to index in pandas: by row number or by index value.
With iloc, we can also choose a single element. For example, the following commands print out the value of the first row and first column (an index of [0, 0]) and the last row and last column (an index of [-1, -1]):
print(itunes_df.iloc[0, 0])
print(itunes_df.iloc[-1, -1])
Indexing can also be done by index value, using loc. From our use of tail(), we saw that our last index value is 3502. So, let's print that out using loc indexing:
print(itunes_df.loc[3502])
loc takes an index value instead of a row number, and this will print out all rows with the index value of 3502.
Create another DataFrame, append a copy of the last row, and use loc again:
test_df = itunes_df.copy()
test_df = pd.concat([test_df, itunes_df.loc[[3502]]])
test_df.loc[3502]
On the first line, we make a copy of the existing itunes_df so that we won't be altering our original DataFrame. Then we add on a copy of the last row with pd.concat. Since the index value 3502 now appears twice, test_df.loc[3502] returns both rows.
test_df.reset_index(inplace=True, drop=True)
This resets our index for test_df to a sequential RangeIndex. Note that if we don't use drop=True, then the current index is inserted as a new column in the DataFrame.
To select a single column, we index with the column name as a string:
itunes_df['Milliseconds']
To select multiple columns, we can use a list of strings:
itunes_df[['Milliseconds', 'Bytes']]
2.1. Examining the data's dimensions, datatypes, and missing values
We can check the dimensions of the DataFrame, as (rows, columns), with shape:
print(itunes_df.shape)
Looking at the datatypes and missing values can be done with info:
itunes_df.info()
To count the missing values in each column:
itunes_df.isna().sum()
This gives us the counts of missing values (stored as NaN, for "not a number," but also called NA, for "not available"). The first part, itunes_df.isna(), returns a DataFrame of True and False values for each cell in the original DataFrame – True if the value is NA, False if it is not. Then .sum() adds these up along each column, counting each True as 1 and each False as 0.
2.2. Investigating statistical properties of the data
The first step in examining the statistics of a dataset is to use the pandas describe function:
itunes_df.describe()
For non-numeric columns, we can look at the mode (most frequent value):
itunes_df['Genre'].mode()
We can get more specific and look at how many times each unique value
appears:
itunes_df['Genre'].value_counts()
If we have many unique values, we can look at only a few with indexing:
itunes_df['Genre'].value_counts()[:5]
To investigate how many unique items there are, the unique function is helpful:
itunes_df['Artist'].unique().shape
Let's look at correlations between our data columns:
numeric_df = itunes_df.select_dtypes(include=['float64', 'int64'])
correlation_matrix = numeric_df.corr()
print(correlation_matrix)
This calculates the Pearson correlation, which measures how linearly correlated two datasets are. It ranges from -1 (inverse correlation; variable 1 increases proportionally when variable 2 decreases) to 0 (no correlation) to 1 (perfect linear correlation; variable 1 increases proportionally when variable 2 increases). So, the correlation between a single numeric data column and itself is 1 by definition – if we plotted a dataset against itself, it would be a straight, diagonal line. We can see from our data that all of our numeric data is strongly linearly correlated, which makes sense.
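As a quick sanity check, we can reproduce one entry of this matrix with NumPy from the definition of Pearson correlation, r = cov(x, y) / (std(x) * std(y)). This is just an illustrative sketch, and it assumes these two columns have no missing values (corr() skips NaNs pairwise):
import numpy as np
x = itunes_df['Milliseconds'].to_numpy(dtype=float)
y = itunes_df['Bytes'].to_numpy(dtype=float)
# sample covariance divided by the product of sample standard deviations
r = np.cov(x, y)[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))
print(r)  # should match correlation_matrix.loc['Milliseconds', 'Bytes']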
2.3. Plotting with DataFrames
Plotting is part of any EDA process, and pandas makes it easy to plot data from DataFrames. A few common plots you might use from pandas are bar plots, histograms, and scatter plots.
First, we need to import the standard Python plotting library, matplotlib, and then we can plot a histogram of song lengths:
import matplotlib.pyplot as plt
itunes_df['Milliseconds'].hist(bins=30)
plt.show()
The second line selects the Milliseconds column and uses the hist() method of pandas Series objects. We set the option bins=30 to increase it from the default of 10 – this specifies how many bars the histogram is broken up into. Then we use the plt.show() command to display the plot.
Let's look at a scatter plot of song length and song size in bytes:
itunes_df.plot.scatter(x='Milliseconds', y='Bytes')
plt.show()
Let's look at a bar plot of non-numeric data. We can use value_counts again
and create a bar plot:
itunes_df['Genre'].value_counts().plot.bar()
plt.show()
3. Cleaning data
We can filter data in pandas DataFrames. For example, let's get the longest
songs from our data that we saw in our scatter plot. These have lengths
over 4,000,000 ms:
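Using the Boolean-mask indexing explained in the next subsection, the filter looks like this:
itunes_df[itunes_df['Milliseconds'] > 4e6]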
3.1. Filtering DataFrames
We use the usual indexing format for DataFrames – the variable name followed by square brackets. Then, we give it a so-called Boolean mask. Try running just the inner part of the indexing:
itunes_df['Milliseconds'] > 4e6
You will see this returns a pandas Series with True or False values. This is our Boolean mask. When we provide this as an indexing command to our DataFrame, it only returns rows where our mask is True. Notice we are using scientific notation for 4,000,000 as well – 4e6 means 4 * 10^6.
Let's take this one step further and look at the value counts of genres from
songs over 2,000,000 ms:
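A sketch chaining the filter with the value_counts call from earlier:
itunes_df[itunes_df['Milliseconds'] > 2e6]['Genre'].value_counts()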
We can get the points with the smaller Bytes values by filtering with multiple conditions:
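A sketch combining two conditions with & (each condition needs its own parentheses); the exact Bytes cutoff here is an assumption for illustration:
itunes_df[(itunes_df['Milliseconds'] > 4e6) & (itunes_df['Bytes'] < 0.6e9)]  # 0.6e9 is a hypothetical cutoff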
If we want to get all genres that are not TV Shows, we could use a filter like the following (assuming the genre label is exactly 'TV Shows'):
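itunes_df[itunes_df['Genre'] != 'TV Shows']  # keep rows whose Genre is not 'TV Shows'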
3.2. Removing irrelevant data
With our iTunes data, we may not really need the Composer column. We
could drop this column like so:
itunes_df.drop('Composer', axis=1, inplace=True)
itunes_df.columns
We use the drop function of DataFrames, giving the column name as the
first argument. We can drop multiple columns at once by supplying a list.
The axis=1 argument specifies to drop a column, not a row, and
inplace=True changes the DataFrame itself instead of returning a new,
modified DataFrame. Then we examine the remaining columns with the
columns attribute of our DataFrame.
If we have other irrelevant data we want to remove, say any genres that are
not music, we could do so with filtering:
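A minimal sketch, where the exact list of non-music genres is an assumption:
non_music = ['TV Shows', 'Drama', 'Comedy']  # hypothetical list of non-music genres
only_music = itunes_df[~itunes_df['Genre'].isin(non_music)]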
This uses filtering with the isin method, which checks whether each value is in the list or set provided to the function. In this case, we also negate the condition with the tilde (~) so that the non-music genres are excluded, and our only_music DataFrame has only genres that are music, just as the variable name suggests.
3.3. Dealing with missing values
Missing values arise all the time in datasets. Often, they are represented as
NA or NaN values.
In terms of dealing with missing values, we have some options:
• Leave the missing values as-is
• Drop the data
• Fill with a specific value
• Replace with the mean, median, or mode
• Use machine learning to replace missing values
For example, we saw that our Composer column has several missing values. We can use filtering to see what some of these rows look like – one possible approach is to combine an isna() mask with head():
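itunes_df[itunes_df['Composer'].isna()].head()  # show the first few rows with a missing Composer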
Another option is to drop the missing values. We can either drop the entire
column, as we did earlier, or we can drop the rows with missing values like
this:
itunes_df.dropna(inplace=True)
The dropna function has several other parameters (options), but we are simply specifying that it should modify the existing DataFrame with inplace=True. By default, this drops any rows with at least one missing value.
We might instead want to fill the missing values with a specific value. Filling with a specific value could be done like so (the fill value 'Unknown' here is just an illustrative choice):
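# fill missing Composer values; 'Unknown' is an assumed placeholder value
itunes_df['Composer'] = itunes_df['Composer'].fillna('Unknown')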
Thank you