Pythonic Data Cleaning With Numpy and Pandas

This document discusses using Python's Pandas and NumPy libraries to clean data. It describes dropping unnecessary columns and rows, changing indexes, using string methods to clean columns, applying functions element-wise, and renaming columns. It also lists some datasets that will be used, including books, college towns, and Olympic participation. The tutorial assumes familiarity with Pandas DataFrames and NumPy NaN values. It provides examples of removing extra dates, converting date ranges to start dates, replacing uncertain dates with NaN, converting string NaN to numeric NaN, and using a regular expression to extract four-digit numbers at the start of strings.

Uploaded by

Alok Yadav

Pythonic Data Cleaning With NumPy and Pandas

In this tutorial, we’ll leverage Python’s Pandas and NumPy libraries to clean data.

• Dropping unnecessary columns in a DataFrame
• Changing the index of a DataFrame
• Using .str methods to clean columns
• Using the DataFrame.applymap() function to clean the entire dataset, element-wise
• Renaming columns to a more recognizable set of labels
• Skipping unnecessary rows in a CSV file
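
A minimal sketch of the operations listed above, run on a small in-memory table rather than on any of the tutorial's datasets. The column names (id, place, year, notes) and all values are hypothetical. The fallback to DataFrame.applymap() is there only because newer pandas releases expose the same element-wise operation as DataFrame.map().

```python
import io

import pandas as pd

# Hypothetical toy CSV; the first line is an unwanted row above the real header.
raw = io.StringIO(
    "0,1,2,3\n"
    "id,place,year,notes\n"
    "206,London,1879,a\n"
    "216,London; Virtue & Yorston,1868,b\n"
    "218, Oxford ,1869,c\n"
)

# Skip the unwanted first row so the next row is used as the header.
df = pd.read_csv(raw, skiprows=1)

# Drop a column that is not needed.
df = df.drop(columns=["notes"])

# Use the unique identifier column as the index.
df = df.set_index("id")

# Clean a text column with vectorized .str methods:
# keep the part before any ";" and trim surrounding whitespace.
df["place"] = df["place"].str.split(";").str[0].str.strip()

# Rename columns to more recognizable labels.
df = df.rename(columns={"place": "Place of Publication", "year": "Year"})

# Apply a function to every element of the DataFrame.
elementwise = df.map if hasattr(df, "map") else df.applymap
upper = elementwise(lambda x: x.upper() if isinstance(x, str) else x)

print(df)
print(upper)
```

Each step returns a new DataFrame, so the intermediate results can be inspected one at a time before being assigned back to df.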


Here are the datasets that we will be using:

• BL-Flickr-Images-Book.csv – A CSV file containing information about books from the British Library
• university_towns.txt – A text file containing names of college towns in every US state
• olympics.csv – A CSV file summarizing the participation of all countries in the Summer and Winter Olympics

• This tutorial assumes a basic understanding of the
Pandas and NumPy libraries, including Pandas’
workhorse Series and DataFrame objects, common
methods that can be applied to these objects, and
familiarity with NumPy’s NaN values.
• Remove the extra dates in square brackets, wherever present: 1879
[1878]
• Convert date ranges to their “start date”, wherever present: 1860-63;
1839, 38-54
• Completely remove the dates we are not certain about and replace
them with NumPy’s NaN: [1897?]
• Convert the string nan to NumPy’s NaN value
• regex = r'^(\d{4})'
• The regular expression above is meant to find any four digits at the
beginning of a string, which suffices for our case. The above is a raw
string (meaning that a backslash is no longer an escape character),
which is standard practice with regular expressions.
• The \d represents any digit, and {4} repeats this rule four times.
The ^ character matches the start of a string, and the parentheses
denote a capturing group, which signals to Pandas that we want to
extract that part of the regex. (We want ^ to avoid cases
where [ starts off the string.)
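
As a rough illustration of how this regular expression handles the cases listed above, the sketch below applies it with Series.str.extract() to a small hand-made Series. The sample values are the ones quoted in the bullets plus one plain year; the variable names dates, extracted, and years are assumptions made for this example, not names taken from the datasets.

```python
import pandas as pd

# Sample values from the cases above, plus the string "nan" and a plain year.
dates = pd.Series(
    ["1879 [1878]", "1860-63", "1839, 38-54", "[1897?]", "nan", "1868"]
)

regex = r'^(\d{4})'

# expand=False returns a Series holding the single capture group.
# Rows where the pattern does not match (a leading "[" or the string "nan")
# come back as missing values.
extracted = dates.str.extract(regex, expand=False)

# Convert the extracted strings to numbers; missing values stay missing.
years = pd.to_numeric(extracted)

print(years)
```

Because the pattern is anchored with ^, the uncertain date [1897?] produces no match and ends up as a missing value, which is the behavior the list of requirements asks for.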
