7. Loading and Wrangling Data with Pandas and NumPy
Outline
1. Loading and saving data with Pandas
2. Exploratory Data Analysis (EDA) and basic data cleaning with Pandas
3. Cleaning data
4. Data transformations
• The term "data wrangling" has become a common phrase in data science, and generally means cleaning and preparing data for downstream uses such as analytics and modeling.
1. Loading and saving data with Pandas
Pandas provides several functions to load data from a variety of file types – the CSV, Excel, and SQLite database files that we need to bring into Python for analysis.
We'll start with the simplest file – the CSV. The acronym CSV stands for comma-separated values, and it is a plain text file.
Values are separated by commas. The first line holds the headers, which are the column labels in a spreadsheet. Each line after that is a row of data, with the values separated by commas.
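For example, a CSV can be loaded into a DataFrame with pd.read_csv and saved back out with to_csv (the filename here is hypothetical):
import pandas as pd
# parse the CSV file into a DataFrame; 'itunes_data.csv' is an assumed filename
csv_df = pd.read_csv('itunes_data.csv')
# save it back to disk without writing the index as an extra column
csv_df.to_csv('itunes_data_copy.csv', index=False)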
• Load chinook.db with SQLite in Python: see last lecture.
• In this query, we get the data from the tracks table in the database and join it with the genres, albums, and artists tables to get the names of the genres, albums, and artists. We also select only the non-ID columns from each table. Notice we alias some of the columns, such as tracks.name AS Track. This makes the data easier to understand once it is in the pandas DataFrame, since the column names will be changed to these aliases.
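A minimal sketch of such a query, assuming the standard chinook.db schema (lowercase plural table names); the exact column list is an assumption based on the description above:
import sqlite3
import pandas as pd
# open a connection to the database file and read the query results into a DataFrame
connection = sqlite3.connect('chinook.db')
query = """
SELECT tracks.name AS Track,
       genres.name AS Genre,
       albums.title AS Album,
       artists.name AS Artist,
       tracks.composer AS Composer,
       tracks.milliseconds AS Milliseconds,
       tracks.bytes AS Bytes,
       tracks.unitprice AS UnitPrice
FROM tracks
JOIN genres ON tracks.genreid = genres.genreid
JOIN albums ON tracks.albumid = albums.albumid
JOIN artists ON albums.artistid = artists.artistid
"""
sql_df = pd.read_sql_query(query, connection)
connection.close()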
1.1. Understanding the DataFrame structure and combining/concatenating multiple DataFrames
We can examine the index of the DataFrame with its index attribute:
sql_df.index
sql_df.columns
This gives us the list of columns as a pandas Index (a list-like structure).
We can check what type of object we are working with:
type(sql_df)
To combine our three DataFrames into one, we'll use the pd.concat()
function:
itunes_df = pd.concat([csv_df, excel_df, sql_df])
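Here, csv_df and excel_df are assumed to have been loaded earlier from the CSV and Excel files, for example (the Excel filename is hypothetical; pd.read_excel also needs an engine such as openpyxl installed):
excel_df = pd.read_excel('itunes_data.xlsx')
Note that pd.concat stacks the rows of the three DataFrames and keeps their original index values, so the combined index can contain duplicates until it is reset (reset_index is covered below).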
2. Exploratory Data Analysis (EDA) and basic data cleaning with Pandas
To look at the last few rows of our data, we use tail():
itunes_df.tail()
There are two ways to index in pandas: by row number or by index value.
With iloc, we can also choose a single element. For example, the following commands print out the value of the first row and first column (an index of [0, 0]) and the last row and last column (an index of [-1, -1]):
print(itunes_df.iloc[0, 0])
print(itunes_df.iloc[-1, -1])
Indexing can also be done by index value, using loc. From our use of tail(), we saw that our last index value is 3502. So, let's print that out using loc indexing:
print(itunes_df.loc[3502])
loc takes an index value instead of a row number, and this will print out all rows with the index value of 3502.
Create another DataFrame, append a copy of the last row, and use loc again:
test_df = itunes_df.copy()
test_df = pd.concat([test_df, itunes_df.loc[[3502]]])
test_df.loc[3502]
On the first line, we make a copy of the existing itunes_df so that we won't be altering our original DataFrame. Then we add on a copy of the last row with pd.concat. Since the index value 3502 now appears twice, test_df.loc[3502] returns both rows.
test_df.reset_index(inplace=True, drop=True)
This resets our index for test_df to a sequential RangeIndex. Note that if we don't use drop=True, then the current index is inserted as a new column in the DataFrame.
To select a single column, we index with the column name as a string:
itunes_df['Milliseconds']
To select multiple columns, we can use a list of strings:
itunes_df[['Milliseconds', 'Bytes']]
2.1. Examining the data's dimensions, datatypes, and missing values
We can check the dimensions of the DataFrame, as (rows, columns), with shape:
print(itunes_df.shape)
Looking at the datatypes and missing values can be done with info:
itunes_df.info()
To count the missing values in each column:
itunes_df.isna().sum()
This gives us the counts of missing values (stored as NaN, for "not a number," but also called NA, for "not available"). The first part, itunes_df.isna(), returns a DataFrame of True and False values for each cell in the original DataFrame – True if the value is NA, False if it is not. Then .sum() adds these up along each column, counting each True as 1 and each False as 0.
2.2. Investigating statistical properties of the data
The first step in examining the statistics of a dataset is to use the pandas describe function:
itunes_df.describe()
For non-numeric columns, we can look at the mode (most frequent value):
itunes_df['Genre'].mode()
We can get more specific and look at how many times each unique value
appears:
itunes_df['Genre'].value_counts()
If we have many unique values, we can look at only a few with indexing:
itunes_df['Genre'].value_counts()[:5]
To investigate how many unique items there are, the unique function is helpful:
itunes_df['Artist'].unique().shape
Let's look at correlations between our data columns:
numeric_df = itunes_df.select_dtypes(include=['float64', 'int64'])
correlation_matrix = numeric_df.corr()
print(correlation_matrix)
This calculates the Pearson correlation, which measures how linearly correlated two datasets are. It ranges from -1 (inverse correlation; variable 1 increases proportionally when variable 2 decreases) to 0 (no correlation) to 1 (perfect linear correlation; variable 1 increases proportionally when variable 2 increases). So, the correlation between a single numeric data column and itself is 1 by definition – if we plotted a dataset against itself, it would be a straight, diagonal line. We can see from our data that all of our numeric data is strongly linearly correlated, which makes sense.
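As a quick sanity check, we can reproduce one entry of this matrix with NumPy from the definition of Pearson correlation, r = cov(x, y) / (std(x) * std(y)). This is just an illustrative sketch, and it assumes these two columns have no missing values (corr() skips NaNs pairwise):
import numpy as np
x = itunes_df['Milliseconds'].to_numpy(dtype=float)
y = itunes_df['Bytes'].to_numpy(dtype=float)
# sample covariance divided by the product of sample standard deviations
r = np.cov(x, y)[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))
print(r)  # should match correlation_matrix.loc['Milliseconds', 'Bytes']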
2.3. Plotting with DataFrames
Plotting is part of any EDA process, and pandas makes it easy to plot data from DataFrames. A few common plots you might use from pandas are bar plots, histograms, and scatter plots.
First, we need to import the standard Python plotting library, matplotlib, and then we can plot a histogram of song lengths:
import matplotlib.pyplot as plt
itunes_df['Milliseconds'].hist(bins=30)
plt.show()
The second line selects the Milliseconds column and uses the hist() method of pandas Series objects. We set the option bins=30 to increase it from the default of 10 – this specifies how many bars the histogram is broken up into. Then we use the plt.show() command to display the plot.
Let's look at a scatter plot of song length and song size in bytes:
itunes_df.plot.scatter(x='Milliseconds', y='Bytes')
plt.show()
Let's look at a bar plot of non-numeric data. We can use value_counts again
and create a bar plot:
itunes_df['Genre'].value_counts().plot.bar()
plt.show()
3. Cleaning data
We can filter data in pandas DataFrames. For example, let's get the longest
songs from our data that we saw in our scatter plot. These have lengths
over 4,000,000 ms:
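Using the Boolean-mask indexing explained in the next subsection, the filter looks like this:
itunes_df[itunes_df['Milliseconds'] > 4e6]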
3.1. Filtering DataFrames
We use the usual indexing format for DataFrames – the variable name followed by square brackets. Then, we give it a so-called Boolean mask. Try running just the inner part of the indexing:
itunes_df['Milliseconds'] > 4e6
You will see this returns a pandas Series with True or False values. This is our Boolean mask. When we provide this as an indexing command to our DataFrame, it only returns rows where our mask is True. Notice we are using scientific notation for 4,000,000 as well – 4e6 means 4 * 10^6.
Let's take this one step further and look at the value counts of genres from
songs over 2,000,000 ms:
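A sketch chaining the filter with the value_counts call from earlier:
itunes_df[itunes_df['Milliseconds'] > 2e6]['Genre'].value_counts()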
We can get the points with the smaller Bytes values by filtering with multiple conditions:
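A sketch combining two conditions with & (each condition needs its own parentheses); the exact Bytes cutoff here is an assumption for illustration:
itunes_df[(itunes_df['Milliseconds'] > 4e6) & (itunes_df['Bytes'] < 0.6e9)]  # 0.6e9 is a hypothetical cutoff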
If we want to get all genres that are not TV Shows, we could use a filter like the following (assuming the genre label is exactly 'TV Shows'):
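itunes_df[itunes_df['Genre'] != 'TV Shows']  # keep rows whose Genre is not 'TV Shows'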
3.2. Removing irrelevant data
With our iTunes data, we may not really need the Composer column. We
could drop this column like so:
itunes_df.drop('Composer', axis=1, inplace=True)
itunes_df.columns
We use the drop function of DataFrames, giving the column name as the
first argument. We can drop multiple columns at once by supplying a list.
The axis=1 argument specifies to drop a column, not a row, and
inplace=True changes the DataFrame itself instead of returning a new,
modified DataFrame. Then we examine the remaining columns with the
columns attribute of our DataFrame.
If we have other irrelevant data we want to remove, say any genres that are
not music, we could do so with filtering:
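A minimal sketch, where the exact list of non-music genres is an assumption:
non_music = ['TV Shows', 'Drama', 'Comedy']  # hypothetical list of non-music genres
only_music = itunes_df[~itunes_df['Genre'].isin(non_music)]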
This uses filtering with the isin method, which checks whether each value is in the list or set provided to the function. In this case, we also negate the condition with the tilde (~) so that the non-music genres are excluded, and our only_music DataFrame has only genres that are music, just as the variable name suggests.
3.3. Dealing with missing values
Missing values arise all the time in datasets. Often, they are represented as
NA or NaN values.
In terms of dealing with missing values, we have some options:
• Leave the missing values as-is
• Drop the data
• Fill with a specific value
• Replace with the mean, median, or mode
• Use machine learning to replace missing values
For example, we saw that our Composer column has several missing values. We can use filtering to see what some of these rows look like – one possible approach is to combine an isna() mask with head():
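itunes_df[itunes_df['Composer'].isna()].head()  # show the first few rows with a missing Composer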
Another option is to drop the missing values. We can either drop the entire
column, as we did earlier, or we can drop the rows with missing values like
this:
itunes_df.dropna(inplace=True)
The dropna function has several other parameters (options), but we are simply specifying that it should modify the existing DataFrame with inplace=True. By default, this drops any rows with at least one missing value.
We might instead want to fill the missing values with a specific value. Filling with a specific value could be done like so (the fill value 'Unknown' here is just an illustrative choice):
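# fill missing Composer values; 'Unknown' is an assumed placeholder value
itunes_df['Composer'] = itunes_df['Composer'].fillna('Unknown')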
Thank you