
Pandas

Dr. Noman Islam


Getting Started
Introduction
• Pandas contains data structures and data manipulation tools designed to make data cleaning and analysis fast and easy in Python.
• Pandas is designed for working with tabular or heterogeneous data.
• NumPy, by contrast, is best suited for working with homogeneous numerical array data.
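For example (a minimal sketch, not from the slides), a DataFrame can hold columns of different types, while a NumPy array holds a single dtype:

import numpy as np
import pandas as pd

# Heterogeneous tabular data: each column keeps its own dtype
df = pd.DataFrame({'name': ['a', 'b'], 'score': [1.5, 2.0], 'count': [3, 4]})

# Homogeneous numerical data: one dtype for the whole array
arr = np.array([[1.5, 2.0], [3.0, 4.0]])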
Series
Using indexes to access data
Filtering
Creating series from dictionary
Checking for null values
Joining two series
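A minimal sketch (with made-up values) illustrating the Series topics above:

import pandas as pd

obj = pd.Series([4, 7, -5, 3], index=['a', 'b', 'c', 'd'])
obj['b']                                  # using the index to access data
obj[obj > 0]                              # filtering with a boolean condition
obj2 = pd.Series({'x': 1.0, 'y': None})   # creating a series from a dictionary
obj2.isnull()                             # checking for null values
pd.concat([obj, obj2])                    # joining two series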
Dataframe
Modifying column values
Adding / deleting a new column
Transpose a dataframe
Slicing in dataframe
Values attribute
Indexing, selection and filtering
Dropping entries from axis
Axis parameter
Indexing and slicing
Selection with loc and iloc
Apply function
Function application and mapping
Descriptive statistics
Reading and writing data
Filtering out missing values
Filling missing data
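A compact sketch (illustrative data, not from the slides) walking through the DataFrame topics above:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4.0, np.nan, 6.0]})
df['a'] = df['a'] * 10           # modifying column values
df['c'] = df['a'] + df['b']      # adding a new column
del df['c']                      # deleting a column
df.T                             # transposing a dataframe
df[0:2]                          # slicing rows
df.values                        # values attribute: the data as a NumPy array
df.drop(0)                       # dropping an entry from axis 0 (rows)
df.drop('b', axis=1)             # axis parameter: axis=1 operates on columns
df.loc[0, 'a']                   # label-based selection with loc
df.iloc[0, 0]                    # position-based selection with iloc
df['a'].apply(np.sqrt)           # applying a function elementwise
df.describe()                    # descriptive statistics
df.dropna()                      # filtering out missing values
df.fillna(0)                     # filling missing data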
Data Loading, Storage, and File Formats
Reading data in pandas
Optional arguments
• Indexing: can treat one or more columns as the index of the returned DataFrame, and whether to get column names from the file, the user, or not at all.
• Type inference and data conversion: includes user-defined value conversions and custom lists of missing value markers.
• Datetime parsing: includes combining capability, such as combining date and time information spread over multiple columns into a single column in the result.
• Iterating: support for iterating over chunks of very large files.
• Unclean data issues: skipping rows or a footer, comments, or other minor things like numeric data with thousands separated by commas.
Examples
import pandas as pd

# Reading a CSV file with a header row
df = pd.read_csv('examples/ex1.csv')

# read_table with an explicit comma separator gives the same result
pd.read_table('examples/ex1.csv', sep=',')

# A file with no header row: let pandas assign default column names
pd.read_csv('examples/ex2.csv', header=None)

# Or supply the column names yourself
pd.read_csv('examples/ex2.csv', names=['a', 'b', 'c', 'd', 'message'])

# Use one of the columns as the index of the returned DataFrame
names = ['a', 'b', 'c', 'd', 'message']
pd.read_csv('examples/ex2.csv', names=names, index_col='message')

• Passing a regular expression as separator:
result = pd.read_table('examples/ex3.txt', sep=r'\s+')
• Skipping rows:
pd.read_csv('examples/ex4.csv', skiprows=[0, 2, 3])
Handling null values
Reading text file in pieces
Iterating over chunk
Writing file
Working with delimited format
Writing csv data
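A sketch of reading a large text file in pieces and writing CSV back out (the file paths follow the book's examples/ directory and are assumptions):

import pandas as pd

# chunksize returns an iterator of DataFrames rather than one big frame
pieces = []
for piece in pd.read_csv('examples/ex6.csv', chunksize=1000):
    pieces.append(piece)           # or process each chunk as it arrives
df = pd.concat(pieces)

# Writing csv data
df.to_csv('examples/out.csv', index=False)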
JSON format
• pandas.read_json can automatically convert JSON datasets in specific arrangements into a Series or DataFrame.
• The default options for pandas.read_json assume that each object in the JSON array is a row in the table.
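A minimal sketch, assuming a JSON file whose top level is an array of objects:

import pandas as pd

data = pd.read_json('examples/example.json')   # each object becomes a row
data.to_json('examples/out.json')              # writing back out as JSON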
Web scraping
• The pandas.read_html function has a number of options, but by default it searches for and attempts to parse all tabular data contained within <table> tags.
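A short sketch (the HTML file name follows the book's example and is an assumption):

import pandas as pd

# read_html returns a list of DataFrames, one per parsed <table>
tables = pd.read_html('examples/fdic_failed_bank_list.html')
failures = tables[0]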
Saving in binary format
HDF5 format
Excel format
Interacting with Web API
Interacting with database
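A brief sketch covering these topics, with assumed file paths, a sample GitHub API URL, and a hypothetical 'test' table (HDF5 needs PyTables; Excel needs openpyxl):

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})
df.to_pickle('examples/frame_pickle')        # saving in binary (pickle) format
pd.read_pickle('examples/frame_pickle')
df.to_hdf('examples/mydata.h5', key='obj', format='table')   # HDF5 format
pd.read_hdf('examples/mydata.h5', key='obj')
df.to_excel('examples/ex.xlsx')              # Excel format
pd.read_excel('examples/ex.xlsx')

# Interacting with a Web API: requests + DataFrame over the JSON payload
import requests
resp = requests.get('https://api.github.com/repos/pandas-dev/pandas/issues')
issues = pd.DataFrame(resp.json())

# Interacting with a database: read query results through SQLAlchemy
import sqlalchemy as sqla
db = sqla.create_engine('sqlite:///examples/mydata.sqlite')
pd.read_sql('SELECT * FROM test', db)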
Data Cleaning and
Preparation
Handling missing data
dropna options
• With DataFrame objects, things are a bit more complex. You may want to drop rows or columns that are all NA, or only those containing any NAs. By default, dropna drops any row containing a missing value.
• Passing how='all' will drop only rows that are all NA.
• To drop columns in the same way, pass axis=1.
• df.dropna(thresh=2) keeps only rows with at least two non-NA values.
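A short demonstration of these options (sample data adapted from the book's example):

import numpy as np
import pandas as pd

df = pd.DataFrame([[1.0, 6.5, 3.0], [1.0, np.nan, np.nan],
                   [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.0]])
df.dropna()                    # drops any row containing a missing value
df.dropna(how='all')           # drops only rows that are all NA
df.dropna(axis=1, how='all')   # the same, for columns
df.dropna(thresh=2)            # keep rows with at least 2 non-NA values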
Filling In Missing Data
• Calling fillna with a constant replaces missing values with that value.
• Calling fillna with a dict, you can use a different fill value for each column:
• df.fillna({1: 0.5, 2: 0})
• fillna returns a new object, but you can modify the existing object in place:
• _ = df.fillna(0, inplace=True)
Removing Duplicates
• data.duplicated() returns a boolean Series indicating whether each row is a duplicate of an earlier one.
• Relatedly, drop_duplicates returns a DataFrame with only the rows where that duplicated array is False.
• Both methods consider all of the columns by default; you can specify any subset of them to detect duplicates:
• data.drop_duplicates(['k1'])
• duplicated and drop_duplicates by default keep the first observed value combination. Passing keep='last' will return the last one.
Map
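A minimal sketch of Series.map with a made-up lookup table:

import pandas as pd

s = pd.Series(['bacon', 'pulled pork', 'bacon'])
meat_to_animal = {'bacon': 'pig', 'pulled pork': 'pig'}
s.map(meat_to_animal)   # each value is passed through the dict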
Replace
• data.replace(-999, np.nan)
• data.replace([-999, -1000], np.nan)
• data.replace([-999, -1000], [np.nan, 0])
• data.replace({-999: np.nan, -1000: 0})
Map with index
Other index methods
• data.rename(index=str.title, columns=str.upper)
• data.rename(index={'OHIO': 'INDIANA'}, columns={'three': 'peekaboo'})
• data.rename(index={'OHIO': 'INDIANA'}, inplace=True)
Discretization and binning
• Let’s divide data into bins
• ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
• bins = [18, 25, 35, 60, 100]
• cats = pd.cut(ages, bins)
• Let's look at the numeric codes and the categories:
• cats.codes
• cats.categories
• What are the counts for each category?
• pd.value_counts(cats)
• Making the intervals closed on the left side instead of the right:
• pd.cut(ages, [18, 26, 36, 61, 100], right=False)
• Providing your own group names:
• group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
• pd.cut(ages, bins, labels=group_names)
Detecting and filtering outliers
• data.describe()
• Suppose you wanted to find values in one of the columns exceeding 3
in absolute value:
• col = data[2]
• col[np.abs(col) > 3]
• To select all rows having a value exceeding 3 or –3, you can use the any method on a boolean DataFrame:
• data[(np.abs(data) > 3).any(axis=1)]
• Here is code to cap values outside the interval –3 to 3:
• data[np.abs(data) > 3] = np.sign(data) * 3
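The snippets above assume a numeric DataFrame named data; a setup like the book's makes them runnable:

import numpy as np
import pandas as pd

data = pd.DataFrame(np.random.standard_normal((1000, 4)))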
Permutation and random sampling
• Calling permutation with the length of the axis you want to permute
produces an array of integers indicating the new ordering:
• df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))
• sampler = np.random.permutation(5)
• df.take(sampler)
• To select a random subset without replacement, you can use the
sample method on Series and DataFrame:
• df.sample(n=3)
• To generate a sample with replacement (to allow repeat choices),
pass replace=True to sample:
• choices = pd.Series([5, 7, -1, 6, 4])
• draws = choices.sample(n=10, replace=True)
Data Wrangling: Join, Combine, and Reshape
Hierarchical Indexing
• In many applications, data may be spread across a number of files or
databases or be arranged in a form that is not easy to analyze.
• This chapter focuses on tools to help combine, join, and rearrange
data.
• Hierarchical indexing is an important feature of pandas that enables
you to have multiple (two or more) index levels on an axis.
• Somewhat abstractly, it provides a way for you to work with higher
dimensional data in a lower dimensional form.
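A sketch of the kind of hierarchically indexed Series the next example assumes (adapted from the book; values are random):

import numpy as np
import pandas as pd

data = pd.Series(np.random.uniform(size=9),
                 index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                        [1, 2, 3, 1, 3, 1, 2, 2, 3]])
data['b']        # partial indexing on the outer level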
Example
• data.unstack()
• data.unstack().stack()
• With a DataFrame, either axis can have a hierarchical index:
Names for indexes
• The hierarchical levels can have names
Reordering and Sorting Levels
• The swaplevel takes two level numbers or names and returns a new
object with the levels interchanged
• sort_index, on the other hand, sorts the data using only the values in
a single level
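A sketch, assuming a DataFrame frame whose row index has two levels named 'key1' and 'key2':

frame.swaplevel('key1', 'key2')            # interchange the two levels
frame.sort_index(level=1)                  # sort using only one level's values
frame.swaplevel(0, 1).sort_index(level=0)  # swap, then sort by the new outer level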
Summary statistics
• frame.sum(level='key2')
• frame.sum(level='color', axis=1)
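Note: the level argument to sum and other aggregations was removed in pandas 2.0; the groupby spelling below is the modern equivalent (assuming frame's column index has a level named 'color'):

frame.groupby(level='key2').sum()          # replaces frame.sum(level='key2')
frame.T.groupby(level='color').sum().T     # replaces frame.sum(level='color', axis=1)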
Combining and Merging Datasets
• pandas.merge connects rows in DataFrames based on one or more
keys. This will be familiar to users of SQL or other relational
databases, as it implements database join operations.
• pandas.concat concatenates or “stacks” together objects along an
axis.
• The combine_first instance method enables splicing together
overlapping data to fill in missing values in one object with values
from another.
Merge
• pd.merge(df1, df2)
• pd.merge(df1, df2, on='key')
• pd.merge(df3, df4, left_on='lkey', right_on='rkey')
• By default merge does an 'inner' join; the keys in the result are the
intersection, or the common set found in both tables.
• Other possible options are 'left', 'right', and 'outer'.
• The outer join takes the union of the keys, combining the effect of
applying both left and right joins
• To merge with multiple keys, pass a list of column names
• A last issue to consider in merge operations is the treatment of
overlapping column names.
• Merge has a suffixes option for specifying strings to append to
overlapping names in the left and right DataFrame objects
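For example (a minimal sketch with made-up frames that share a non-key column named 'value'):

import pandas as pd

left = pd.DataFrame({'key1': ['a', 'b'], 'value': [1, 2]})
right = pd.DataFrame({'key1': ['a', 'b'], 'value': [3, 4]})
# the overlapping 'value' columns become value_left / value_right
pd.merge(left, right, on='key1', suffixes=('_left', '_right'))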
Merging on index
• In some cases, the merge key(s) in a DataFrame will be found in its
index. In this case, you can pass left_index=True or right_index=True
(or both) to indicate that the index should be used as the merge key
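A minimal sketch, with hypothetical frames where the right frame's merge key lives in its index:

import pandas as pd

left1 = pd.DataFrame({'key': ['a', 'b', 'a'], 'value': range(3)})
right1 = pd.DataFrame({'group_val': [3.5, 7.0]}, index=['a', 'b'])
pd.merge(left1, right1, left_on='key', right_index=True)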
Join method
• DataFrame has a convenient join instance method for merging by index.
Concatenating Along an Axis
• pd.concat([s1, s2, s3])
• pd.concat([s1, s2, s3], axis=1)
• pd.concat([s1, s4], axis=1, join='inner')
• pd.concat([s1, s4], axis=1, join_axes=[['a', 'c', 'b', 'e']])
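Note: join_axes was removed in pandas 1.0; using the same s1 and s4 as above, the equivalent result comes from reindexing the concatenated result:

pd.concat([s1, s4], axis=1).reindex(['a', 'c', 'b', 'e'])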
Combining Data with Overlap
• np.where(pd.isnull(a), b, a)
• Series has a combine_first method, which performs the equivalent of this operation along with pandas's usual data alignment logic.
• With DataFrames, combine_first does the same thing column by column, so you can think of it as "patching" missing data in the calling object with data from the object you pass.
