Pandas Notes(1)
Uploaded by krithikb87

Data analysis made simple: Python for Data Analysis with Pandas


The Pandas library is one of the most important and popular tools for Python data scientists and
analysts, as it is the backbone of many data projects.

Pandas is an open-source Python package for data cleaning and data manipulation. It provides
extended, flexible data structures to hold different types of labelled and relational data.

Pandas is built on the NumPy package, so a lot of the structure between them is similar. Pandas is
also used with SciPy for statistical analysis and with Matplotlib for plotting.

Using Pandas, you can do things like:

• Easily calculate statistics about your data, such as the average, distribution, and median of
columns

• Use data visualization tools, such as Matplotlib, to easily create bar plots, histograms, and
more

• Clean your data by filtering columns by particular criteria or easily removing values

• Manipulate your data flexibly using operations like merging, joining, reshaping, and more

• Read, write, and store your clean data as a database, txt file, or CSV file

Installing Pandas

You can install Pandas using the built-in Python tool pip by running the following command.

$ pip install pandas

Data types available to us in Pandas, also called dtypes:

• object: text or mixed numeric and non-numeric values

• int64: integer numbers

• bool: true/false values

• float64: floating point numbers

• category: finite list of text values

• datetime64: date and time values

• timedelta[ns]: differences between two datetimes
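As a quick illustration, here is a small hypothetical DataFrame (the column names are made up for this sketch) showing how these dtypes appear in practice via the .dtypes attribute:

```python
import pandas as pd

# A small hypothetical DataFrame mixing several of the dtypes listed above
df = pd.DataFrame({
    "name": ["apple", "banana"],   # object (text)
    "count": [3, 5],               # int64
    "price": [0.5, 0.25],          # float64
    "fresh": [True, False],        # bool
})
df["name"] = df["name"].astype("category")  # convert to category dtype

print(df.dtypes)  # shows one dtype per column
```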


A data structure is a particular way of organizing our data. Pandas has two data structures, and all
operations are based on those two objects:

• Series

• DataFrame

Series are the columns, and the DataFrame is a table composed of a collection of series. A Series can
best be described as a single column of a 2-D array that can store data of any type.

A DataFrame is like a table that stores data similar to a spreadsheet, using multiple columns and rows.
Each value in a DataFrame object is associated with a row index and a column index.

We create a Series by invoking the pd.Series() method and passing a list of values. By default,
Pandas counts the index from 0; we can also explicitly define our own index values.

The srs.values attribute returns the values stored in the Series object, and srs.index.values
returns the index values.
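A minimal sketch of those two attributes, using a throwaway Series:

```python
import pandas as pd

srs = pd.Series([10, 20, 30])  # Pandas assigns the default index 0, 1, 2

print(srs.values)        # the stored values
print(srs.index.values)  # the index labels
```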
Assign names to our values

Each index corresponds to a value in the Series object. Let's look at an example where we assign a
country name to population growth rates.

Example:

#importing pandas in our program


import pandas as pd

# Defining a series object


srs = pd.Series([11.9, 36.0, 16.6, 21.8, 34.2], index = ['China', 'India', 'USA', 'Brazil', 'Pakistan'])

# Set Series name


srs.name = "Growth Rate"

# Set index name


srs.index.name = "Country"

# printing series values


print("The Indexed Series values are:")
print(srs)

The attribute srs.name sets the name of our Series object. The attribute srs.index.name then sets the
name for the indexes.

Select entries from a Series


We select elements based on the index name or index number.
import numpy as np
import pandas as pd

srs = pd.Series(np.arange(0, 6, 1), index = ['ind0', 'ind1', 'ind2', 'ind3', 'ind4', 'ind5'])
srs.index.name = "Index"
print("The original Series:\n", srs)

print("\nSeries element at index ind3:")


print(srs['ind3']) # Fetch element at index named ind3

print("\nSeries element at position 3:")


print(srs.iloc[3]) # Fetch element at numeric position 3 (plain srs[3] also works but is deprecated)

print("\nSeries elements at multiple indexes:\n")


print(srs[['ind1', 'ind4']]) # Fetch elements at multiple indexes

The elements from the Series are selected in three ways:


• By index name, as in srs['ind3'].
• By index number, as in srs[3] (or, more explicitly, srs.iloc[3]). Keep in mind that index
numbers start from 0.
• Multiple elements at once, by passing a list of index names inside the [], as in
srs[['ind1', 'ind4']].

Drop entries from a Series


Dropping an unwanted index is a common operation in Pandas. If the drop(index_name) function is
called on a Series object with a given index name, that entry is deleted.
import numpy as np
import pandas as pd

srs = pd.Series(np.arange(0, 6, 1), index = ['ind0', 'ind1', 'ind2', 'ind3', 'ind4', 'ind5'])
srs.index.name = "Index"
print("The original Series:\n", srs)

srs = srs.drop('ind2') # drop index named ind2

print("The New Series:\n", srs)

The output shows that the ind2 index is dropped. Also, an entry can only be dropped by specifying the
index name, not the number. So, srs.drop(srs[2]) does not work.
DataFrame: the most important operations
Using the pandas.DataFrame() function
To create a Pandas DataFrame from a NumPy array, pass the array as an argument to
the pandas.DataFrame() function. You can also pass index and column labels for the
DataFrame. The following is the syntax:

df = pandas.DataFrame(data=arr, index=None, columns=None)


There are several ways to make a DataFrame in Pandas. The easiest way to create one from scratch is
to create and print a df.

We can also create a dict and pass our dictionary data to the DataFrame constructor. Say we have
some data on vegetable sales and want to organize it by type of vegetable and quantity. We pass the
dict to the constructor, and each key in our data becomes a column in the DataFrame we created,
just like a chart. The index for this DataFrame is listed as numbers by default, but we can specify it
further depending on our needs. Say we wanted to know the quantity per month; months would then
be our new index.
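A sketch of the idea, with made-up vegetable sales numbers:

```python
import pandas as pd

# Hypothetical vegetable sales data: each dict key becomes a column
data = {"carrots": [10, 20, 30], "peppers": [5, 15, 25]}

# Default construction: rows are indexed 0, 1, 2
quantity = pd.DataFrame(data)
print(quantity)

# Re-create the DataFrame with months as the index instead
quantity = pd.DataFrame(data, index=["June", "July", "August"])
print(quantity)
```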
Get info about your data
One of the first commands you run after loading your data is .info(), which provides all the essential
information about a dataset.

You can access more information with other operations, like .shape, which outputs a tuple of (rows,
columns).
We use the .columns attribute to print a dataset's column names.
quantity.columns
You can then rename your columns easily with the .rename() method.
quantity.rename(columns = {'carrots':'bananas'})
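The inspection and renaming operations above can be sketched on a small hypothetical quantity DataFrame:

```python
import pandas as pd

quantity = pd.DataFrame({"carrots": [10, 20], "peppers": [5, 15]},
                        index=["June", "July"])

quantity.info()          # index, column names, dtypes, memory usage
print(quantity.shape)    # (rows, columns)
print(quantity.columns)  # column labels

# rename() returns a new DataFrame; the original is unchanged
renamed = quantity.rename(columns={"carrots": "bananas"})
print(renamed.columns)
```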

Searching and selecting in our DataFrame


We need to know how to manipulate or access the data in our DataFrame, such as selecting,
searching, or deleting data values. You can do this either by column or by row. Let's see how it's
done. The easiest way to select a column of data is by using brackets [ ]. We can also use brackets to
select multiple columns. Say we only wanted to look at June's vegetable quantity.

Note: loc and iloc are used for locating data.


• .iloc locates by numerical index
• .loc locates by index name. This is similar to list slicing in Python.
The Pandas DataFrame object also provides methods to select specific columns. The following example
shows how it can be done.
quantity['peppers']
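A sketch of the selection styles above, on the same hypothetical quantity data:

```python
import pandas as pd

quantity = pd.DataFrame({"carrots": [10, 20], "peppers": [5, 15]},
                        index=["June", "July"])

print(quantity["peppers"])               # one column, via brackets
print(quantity[["carrots", "peppers"]])  # multiple columns
print(quantity.loc["June"])              # one row, by index name
print(quantity.iloc[0])                  # one row, by numeric position
```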

Create a new DataFrame from pre-existing columns


We can also grab multiple columns and create a new DataFrame object from them.

Reindex data in a DataFrame


We can also reindex the data, either by the indexes themselves or by the columns. Reindexing
with reindex() allows us to make changes without messing up the initial setting of the objects.
Note: The rules for reindexing are the same for Series and DataFrame objects.

#importing pandas in our program


import pandas as pd

# Defining a series object


srs1 = pd.Series([11.9, 36.0, 16.6, 21.8, 34.2], index = ['China', 'India', 'USA', 'Brazil', 'Pakistan'])

# Set Series name


srs1.name = "Growth Rate"

# Set index name


srs1.index.name = "Country"

srs2 = srs1.reindex(['China', 'India', 'Malaysia', 'USA', 'Brazil', 'Pakistan', 'England'])


print("The series with new indexes is:\n",srs2)

srs3 = srs1.reindex(['China', 'India', 'Malaysia', 'USA', 'Brazil', 'Pakistan', 'England'], fill_value=0)


print("\nThe series with new indexes is:\n",srs3)

In srs2, the reindexed Series gains the new index names Malaysia and England; since they have no
values in the original Series, they are assigned NaN by default. In srs3, the fill_value=0 argument
fills those new indexes with 0 instead. The same rules apply when reindexing a DataFrame, where the
columns keyword can be used to reindex the columns.

How to read or import Pandas data


It is quite easy to read or import data from other files using the Pandas library. In fact, we can use
various sources, such as CSV, JSON, or Excel, to load our data and access it.

Reading and importing data from CSV files


We can import data from a CSV file, which is common practice for Pandas users.
We simply create or open a CSV file, copy the data, paste it into our editor, and save it in the same
directory that houses your Python scripts.
You then use a bit of code to read the data using the read_csv function built into Pandas.
read_csv will generate its own index column by default, so if we want the first column of the file to
be the index we need to say so. We can do this by passing the index_col parameter to tell Pandas
which column to use as the index.

Once we've used Pandas to sort and clean data, we can then save it back as the original file with
simple commands. You only have to input the filename and extension.
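A self-contained sketch of the round trip; io.StringIO stands in for a CSV file on disk so the example runs without any external file:

```python
import io
import pandas as pd

# Stand-in for the contents of a CSV file
csv_data = "Country,Growth Rate\nChina,11.9\nIndia,36.0\nUSA,16.6\n"

# index_col=0 tells read_csv to use the first column as the index
df = pd.read_csv(io.StringIO(csv_data), index_col=0)
print(df)

# Saving back is a one-liner; with a filename it writes to disk instead
csv_out = df.to_csv()
```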

Reading and importing data from JSON (JavaScript Object Notation)
Examples:
{
    "glossary": {
        "title": "example glossary",
        "GlossDiv": {
            "title": "S",
            "GlossList": {
                "GlossEntry": {
                    "ID": "SGML",
                    "SortAs": "SGML",
                    "GlossTerm": "Standard Generalized Markup Language",
                    "Acronym": "SGML",
                    "Abbrev": "ISO 8879:1986",
                    "GlossDef": {
                        "para": "A meta-markup language, used to create markup languages such as DocBook.",
                        "GlossSeeAlso": ["GML", "XML"]
                    },
                    "GlossSee": "markup"
                }
            }
        }
    }
}

Say you have a JSON file. A JSON file is basically like a stored Python dict, so Pandas can easily access
and read it using the read_json function. Let's look at an example.
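A minimal sketch; again, io.StringIO stands in for a JSON file on disk:

```python
import io
import pandas as pd

# Stand-in for the contents of a JSON file
json_data = '{"Country": ["China", "India"], "Growth Rate": [11.9, 36.0]}'

df = pd.read_json(io.StringIO(json_data))
print(df)
```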
Reading and importing data from Excel files
Say you have an Excel file. You can similarly use the read_excel function to access and read that data.

Once we call the read_excel function, we pass the name of the Excel file as our argument,
so read_excel will open the file's data. We can then use print() to display the data. If we want to go one
step further, we can add the loc[] method from earlier, allowing us to read specific rows and columns
of our file.

Data Wrangling with Pandas (Combining DataFrames)


Once we have our data, we can use data wrangling processes to manipulate and prepare it for
analysis. The most common data wrangling processes are merging, concatenation, and grouping.

Merge method   Description

left           Use keys from left frame only

right          Use keys from right frame only

outer          Use union of keys from both frames

inner          Use intersection of keys from both frames


merge()
We can join or merge two DataFrames in Pandas by using the merge() function. The different
arguments to merge() allow you to perform a natural join, left join, right join, or full outer join.
There are also other join and concatenate operations, such as joins based on the index, row index,
and column index.

Join or Merge in Pandas – Syntax:


merge(left_df, right_df, on='Customer_id', how='inner')

left_df – DataFrame 1
right_df – DataFrame 2
on – Column (name) to join on. Must be found in both the left and right DataFrame objects.
how – type of join to be performed: 'left', 'right', 'outer', 'inner'. The default is an inner join.
The DataFrames must have the same column names on which the merging happens. The merge()
function in Pandas is similar to the database join operation in SQL.
UNDERSTANDING THE DIFFERENT TYPES OF JOIN OR MERGE IN PANDAS:
• Inner join or natural join: to keep only rows that match from both data frames, specify the
argument how='inner'.
• Outer join or full outer join: to keep all rows from both data frames, specify how='outer'.
• Left join or left outer join: to include all the rows of your data frame x and only those
from y that match, specify how='left'.
• Right join or right outer join: to include all the rows of your data frame y and only those
from x that match, specify how='right'.

Example:
import pandas as pd
import numpy as np

# data frame 1
d1 = {'Customer_id':pd.Series([1,2,3,4,5,6]),
'Product':pd.Series(['Oven','Oven','Oven','Television','Television','Television'])}
df1 = pd.DataFrame(d1)

# data frame 2
d2 = {'Customer_id':pd.Series([2,4,6,7,8]),
'State':pd.Series(['California','California','Texas','New York','Indiana'])}
df2 = pd.DataFrame(d2)

df1:

   Customer_id     Product
0            1        Oven
1            2        Oven
2            3        Oven
3            4  Television
4            5  Television
5            6  Television

df2:

   Customer_id       State
0            2  California
1            4  California
2            6       Texas
3            7    New York
4            8     Indiana

Inner join pandas:


Returns only the rows in which the left table has matching keys in the right table.

Example:
#inner join in python pandas
inner_join_df= pd.merge(df1, df2, on='Customer_id', how='inner')
inner_join_df

The resultant DataFrame will be:

   Customer_id     Product       State
0            2        Oven  California
1            4  Television  California
2            6  Television       Texas

Outer join in pandas:

Returns all rows from both tables, joining records from the left that have matching keys in the right
table. When there is no match in either table, NaN is returned.
Example:
# outer join in python pandas
outer_join_df=pd.merge(df1, df2, on='Customer_id', how='outer')
outer_join_df

The resultant DataFrame will be:

   Customer_id     Product       State
0            1        Oven         NaN
1            2        Oven  California
2            3        Oven         NaN
3            4  Television  California
4            5  Television         NaN
5            6  Television       Texas
6            7         NaN    New York
7            8         NaN     Indiana

Left outer join or left join pandas:

Returns all rows from the left table, and any rows with matching keys from the right table. When there
is no match in the right table, NaN is returned.

Example:
# left join in python pandas
left_join_df= pd.merge(df1, df2, on='Customer_id', how='left')
left_join_df
The resultant DataFrame will be:

   Customer_id     Product       State
0            1        Oven         NaN
1            2        Oven  California
2            3        Oven         NaN
3            4  Television  California
4            5  Television         NaN
5            6  Television       Texas

Right outer join or right join pandas:

Returns all rows from the right table, and any rows with matching keys from the left table.

Example:
# right join in python pandas
right_join_df= pd.merge(df1, df2, on='Customer_id', how='right')
right_join_df

The resultant DataFrame will be:

   Customer_id     Product       State
0            2        Oven  California
1            4  Television  California
2            6  Television       Texas
3            7         NaN    New York
4            8         NaN     Indiana


Pandas - Joining DataFrames with Concat and Append

It is frequently required to join DataFrames together, such as when data is loaded from multiple files
or even multiple sources. pandas.concat() is used to add the rows of multiple DataFrames together
and produce a new DataFrame with the combined data.

concat
The pandas.concat function joins a number of DataFrames along one of the axes. The default is to
join along the index: concatenating two DataFrames stacks them on top of one another.
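A sketch with two tiny hypothetical frames:

```python
import pandas as pd

df_a = pd.DataFrame({"x": [1, 2]})
df_b = pd.DataFrame({"x": [3, 4]})

# Default: stack rows along the index; source indexes are preserved
stacked = pd.concat([df_a, df_b])
print(stacked)  # index is 0, 1, 0, 1

# ignore_index=True rebuilds a fresh 0..n-1 index instead
combined = pd.concat([df_a, df_b], ignore_index=True)
print(combined)
```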

pandas.concat parameters:

Parameter         Note                                                                              Default

objs              list of DataFrame or Series objects

axis              the axis to concatenate along (0 = 'index', 1 = 'columns')                        0

join              how to handle indexes on the other axis ('inner' or 'outer')                      'outer'

ignore_index      boolean value on preserving the source index                                      False

keys              sequence used to create a hierarchical index using the passed keys                None

levels            list of sequences used to create a MultiIndex                                     None

names             list of names for the levels in the hierarchical index                            None

verify_integrity  boolean value to specify whether the new concatenated axis contains duplicates    False

sort              boolean value to specify sorting the non-concatenation axis if it is not already
                  aligned when join is 'outer'                                                      False

copy              boolean value to specify whether data is copied unnecessarily                     True


concat with different column names

concat with axis = 1


The concat function has a number of parameters, all of which have defaults. The axis parameter
specifies along which axis to join the DataFrames: 0 for the index (default) and 1 for the columns.

Pandas - concat() with axis=1 joins two DataFrames along the columns
concat with inner join

Pandas - concatenating two DataFrames with an inner join only keeps matching indexes

Concatenating multiple DataFrames


More than two DataFrames can be concatenated together. The default is to concatenate along the
index.

Pandas - concatenate multiple DataFrames along the index


Multiple DataFrames can also be concatenated along the columns with axis=1.

Pandas - concatenate multiple DataFrames along the columns
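The axis and join behaviour can be sketched as follows:

```python
import pandas as pd

df_a = pd.DataFrame({"x": [1, 2, 3]}, index=[0, 1, 2])
df_b = pd.DataFrame({"y": [10, 20]}, index=[1, 2])

# axis=1 places the frames side by side, aligned on the index;
# the default outer join keeps index 0 and fills 'y' with NaN there
side = pd.concat([df_a, df_b], axis=1)
print(side)

# join='inner' keeps only indexes present in every frame
inner = pd.concat([df_a, df_b], axis=1, join="inner")
print(inner)
```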

DataFrame.append
This instance method performs the same function as concat by appending a Series or DataFrame onto
the end of the calling DataFrame and returning a new DataFrame. Note that append was deprecated
and later removed in Pandas 2.0; pandas.concat() is the recommended replacement.
groupby() function in Pandas
The Pandas DataFrame.groupby() function is used to collect identical data into groups and
perform aggregate functions on the grouped data. A group by operation involves splitting the data,
applying some functions, and finally aggregating the results.

In Pandas, you can use groupby() in combination with sum(), count(), pivot(), transform(),
aggregate(), and many more methods to perform various operations on grouped data.

Aggregation function   Description

sum()                  Sum of values

mean()                 Mean (average) of values

min() / max()          Minimum / maximum value

std() / var()          Standard deviation / variance

count()                Count of non-missing values

nunique()              Number of unique values

cumsum()               Cumulative sum

agg()                  Apply multiple aggregation functions

The groupby() function is commonly used in data analysis. It is used to gain insights into the
relationships between variables.

Key points –
• groupby() is used to split data into groups based on one or more keys, allowing for efficient
analysis and aggregation of grouped data.
• It supports various aggregation functions like sum, mean, count, min, and max, which can
be applied to each group.
• You can apply multiple aggregations on different columns using .agg(), offering more
flexibility in analysis.
• The result of groupby() often returns a DataFrame with a MultiIndex, where each level
represents a grouping key.
• You can filter groups based on specific conditions by using .filter() after groupby().
• groupby() allows iteration over groups, enabling customized operations on each subset of
data.

The syntax for groupby() is as follows:

DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, observed=False, dropna=True)

Parameters of Pandas DataFrame.groupby()

• by – list of column names to group by


• axis – defaults to 0; takes 0 or 'index', 1 or 'columns'
• level – used with a MultiIndex
• as_index – SQL-style grouped output
• sort – defaults to True; specifies whether to sort the group keys
• group_keys – whether to add the group keys or not
• squeeze – deprecated in newer versions
• observed – only applies if any of the groupers are Categoricals
• dropna – defaults to True, which drops None/NaN group keys; use dropna=False to keep them

Apply the groupby() function along with the sum() function to perform the sum operation on grouped
data.
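A sketch with hypothetical course fee data:

```python
import pandas as pd

df = pd.DataFrame({
    "Courses": ["Spark", "Pandas", "Spark", "Pandas"],
    "Fee": [20000, 25000, 22000, 24000],
})

# Split into groups by course, then sum each group's fees
result = df.groupby("Courses")["Fee"].sum()
print(result)
```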

groupby() on Two or More Columns


You can also pass a list of column names to groupby() to group by multiple keys.

Add Index to the Grouped Data


By default, the groupby() function doesn't return a row index; you can add one using
the DataFrame.reset_index() method.
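Both ideas sketched together on hypothetical data: grouping by two keys, then turning the group keys back into ordinary columns with reset_index():

```python
import pandas as pd

df = pd.DataFrame({
    "Courses": ["Spark", "Pandas", "Spark"],
    "Duration": ["30d", "40d", "30d"],
    "Fee": [20000, 25000, 22000],
})

# Group by two columns, then flatten the group keys into regular columns
result = df.groupby(["Courses", "Duration"])["Fee"].sum().reset_index()
print(result)
```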

Drop NA/None/NaN (on group key) from the Result


You can also choose whether or not to include NA/None/NaN in the group keys by setting
the dropna parameter. By default, dropna is set to True, so None/NaN values are not included in the
group keys; to include them, set the dropna=False parameter.

Example:
Sort groupby() result by Group Key
To remove sorting on grouped results in Pandas, you can pass the sort=False parameter to
the groupby() function. This ensures that the grouped results are not sorted by the group key,
preserving the original order in which the courses appear in the DataFrame.

To sort the group keys (courses) in descending order after performing the groupby() operation, you
can use the sort_index() method with the ascending=False parameter. This groups the DataFrame
by Courses, calculates the sum of each group, and then sorts the group keys in descending order.
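Both sorting behaviours sketched on hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({
    "Courses": ["Spark", "Pandas", "Spark", "Hadoop"],
    "Fee": [20000, 25000, 22000, 26000],
})

# sort=False preserves the order in which groups first appear
unsorted = df.groupby("Courses", sort=False)["Fee"].sum()
print(unsorted)

# Sort the group keys in descending order after aggregating
desc = df.groupby("Courses")["Fee"].sum().sort_index(ascending=False)
print(desc)
```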
Apply More Aggregations
You can compute multiple aggregations at the same time on grouped data, simply by passing a list of
aggregate functions to aggregate().
To compute different aggregations on different columns of a grouped DataFrame, you can pass a
dictionary to the agg() function specifying the aggregation function for each column. Here, we
calculate the count on the grouped Duration column, and the min and max on the grouped Fee column.
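A sketch of per-column aggregations via a dict passed to agg(), on hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({
    "Courses": ["Spark", "Pandas", "Spark", "Pandas"],
    "Duration": ["30d", "40d", "35d", "40d"],
    "Fee": [20000, 25000, 22000, 24000],
})

# count on Duration, min and max on Fee, per course group
result = df.groupby("Courses").agg({"Duration": "count", "Fee": ["min", "max"]})
print(result)  # columns become a MultiIndex
```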

Pandas: Handling Missing Data in a DataFrame


What is Missing Data?
In the world of data science, a Pandas DataFrame is the most popular and widely accepted data
structure for storing large-scale data in the form of rows and columns, just like an Excel spreadsheet
or SQL table. A DataFrame can contain almost any type of data; however, missing data in a
DataFrame refers to values that are unavailable.
Example of Missing Data in a Pandas DataFrame
"Missing data in a DataFrame" simply means values that are unavailable or missing in a
Pandas DataFrame. Values that are missing are automatically replaced by the NaN type (here NaN
comes from NumPy). In the following example, we have two missing values in a DataFrame, which
are replaced by the NaN value.

Why Should You Handle Missing Data in a DataFrame?


In the process of exploratory data analysis, one of the most important steps is data preprocessing,
where you will mainly be dealing with missing data. Before looking into the insights of your data,
you need a clean dataset, free of outliers and missing values.
You need to handle missing data in a Pandas DataFrame because:
1. Missing values in a DataFrame negatively affect the data insights
2. Training a machine learning model needs a clean dataset
3. A DataFrame with missing values is hard to process, visualize, and build a data pipeline from
So, you need to find the missing data in your DataFrame and get rid of the missing values.

How to Find Missing Data in a DataFrame


Use functions like isna() or isnull() to detect missing values. Pair them with sum() to count missing
entries.

1. Find rows having NaN values

2. Find columns having NaN values

3. Find the percentage of missing data in a column; here, DataFrame.isna() is used to check if
the DataFrame has NA values

4. Find the number of NaN values in each row with respect to a column
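The four detection patterns above can be sketched on one small frame with NaN values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [np.nan, np.nan, 6.0]})

print(df[df.isna().any(axis=1)])  # 1. rows having NaN values
print(df.isna().sum())            # 2. NaN count per column
print(df.isna().mean() * 100)     # 3. percentage of missing data per column
print(df.isna().sum(axis=1))      # 4. NaN count in each row
```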

Different Methods to Handle Missing Data in a DataFrame


Based on the data you are working with, you may have to follow any of the following techniques for
handling missing data in a DataFrame. Review all of the methods and apply the one that best suits
your need.
The best ways to handle missing data in a DataFrame are:
1. Remove rows or columns from the DataFrame that have missing data
2. Replace the missing data with another value

1. Remove Rows or Columns Having Missing Data

We can simply find rows or columns where we have missing data and drop them using Pandas
functions.

1.1 Removing Rows Having Missing Data


In Pandas, we can use the function df.dropna() to remove all rows that have missing data.

1.2 Removing Columns Having Missing Data


Just like removing rows, we can also remove columns from our DataFrame that have missing data.
The same Pandas built-in function, df.dropna(), can be used with an extra "axis" parameter.
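Both drops sketched on a small frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [4.0, 5.0, 6.0]})

rows_dropped = df.dropna()        # drop every row containing a missing value
cols_dropped = df.dropna(axis=1)  # drop every column containing a missing value
print(rows_dropped)
print(cols_dropped)
```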
2. Replace Missing Data in a DataFrame
This method is a bit more tedious, yet a more powerful and optimistic way to handle missing data in
a DataFrame. You have many ways to replace the missing data:
1. Replace missing data with fixed values
2. Replace missing data with the mean value
3. Replace missing data with the median value

2.1 Replace Missing Data with Fixed Values in a DataFrame

We can impute the missing values in the DataFrame with a fixed value. The fixed value can be
an integer or any other data, depending on the nature of your dataset. For example, if you
are dealing with gender data, you can replace all the missing values with the word
"unknown", "Male", or "Female".
• Pandas: replace NaN with 0
• Pandas: replace NaN with an empty string

You could also impute all missing values with a random number, generated using the Python random
module.
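A sketch of fixed-value imputation with fillna(), using hypothetical gender/score data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"gender": ["Male", np.nan, "Female"],
                   "score": [1.0, np.nan, 3.0]})

df["gender"] = df["gender"].fillna("unknown")  # fixed string for text data
df["score"] = df["score"].fillna(0)            # fixed number (NaN -> 0)
print(df)
```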

2.2 Replace Missing Data with the Mean Value


You can use the mean value to replace the missing values in case the data distribution is symmetric.
You have a choice between three statistics: the mean, mode, or median. It strongly depends on the
dataset you are working on.
Example:
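A sketch of mean imputation:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"score": [10.0, np.nan, 30.0]})

# Replace the missing value with the column mean: (10 + 30) / 2 = 20
df["score"] = df["score"].fillna(df["score"].mean())
print(df)
```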

Pivot Tables in Pandas


Pivot tables are tables of grouped values that aggregate specific items of an original table into one or
more discrete categories. They are a way of creating short summaries of your original dataset that
display things such as sums of columns, averages, or any other statistic you are interested in. By
summarizing large amounts of data into pivot tables, you will usually notice some patterns, which
helps you deduce how your data behaves based on certain factors. This knowledge is very useful
because it can help subject matter experts make better strategic decisions.

Key differences:

Feature               Pivot tables                  groupby                    merge                        concat

Purpose               Reshape and summarize data    Group and aggregate        Combine data based on a key  Stack/append data

Requires aggregation  Yes                           Yes                        No                           No

Reshapes data         Yes (grid format)             No                         No                           No

Key for combining     No                            Grouping key(s)            Common key(s)                Not required

Use case              Multi-dimensional summary     Aggregating column values  Joining datasets             Adding new rows or columns

The pivot_table() function in Pandas allows us to create a spreadsheet-style pivot table, making it
easier to group and analyze our data.

To create a pivot table using this method, you need to define values for the following parameters:
• index
• columns (optional)
• values
• aggfunc

The index parameter defines what the index of your pivot table will be. In other words, it defines
how the rows of your original DataFrame are grouped into categories. If you input a list of values
instead of just one value, you will end up with a multi-index as your row index.
The columns parameter is an optional parameter that allows you to introduce an extra value into your
column index, which in turn transforms your pivot table's column index into a multi-index.
The values parameter defines which columns you want to aggregate. Essentially, it tells Pandas what
it needs to aggregate, based on some aggregation function, after your data has been grouped based
on the values you entered for the index parameter.
The aggfunc parameter defines which type of aggregation you want to perform. Based on what you
decide to use here, you can access various information such as means, sums, etc. If you want
to, you can also enter multiple values here, which will transform your column index into a
multi-index.

For example, we can reshape a DataFrame with Date as the index, City as the columns, and
Temperature as the values.
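The Date/City/Temperature example can be sketched with made-up weather readings:

```python
import pandas as pd

# Hypothetical weather readings
df = pd.DataFrame({
    "Date": ["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-02"],
    "City": ["Paris", "Rome", "Paris", "Rome"],
    "Temperature": [5, 12, 7, 14],
})

# Date becomes the row index, each City becomes a column,
# and the cells hold the (mean-aggregated) temperatures
pivot = df.pivot_table(index="Date", columns="City",
                       values="Temperature", aggfunc="mean")
print(pivot)
```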

pivot_table() uses the following syntax:

• index - keys to group by on the pivot table index


• columns - keys to group by on the pivot table columns
• values - columns used for the aggregated data of the pivot table
• aggfunc - function or list of functions used for aggregation

pivot_table() with Multiple Values

We can also create a pivot table for multiple values, i.e. Temperature and Humidity.

pivot_table() With Aggregate Functions


We can use the pivot_table() method with different aggregate functions via
the aggfunc parameter. We can set the value of aggfunc to functions such
as 'sum', 'mean', 'count', 'max', or 'min'.
Let's see an example.

In the above example, we calculated the mean temperature of each city using
the aggfunc='mean' argument in pivot_table().

Pivot Table With MultiIndex


We can create a pivot table with a MultiIndex using the pivot_table() function.
Let's look at an example.
In this example, we created a pivot table with a MultiIndex by passing a list of columns as
the index argument.
A MultiIndex contains multiple levels of indexes, with columns linked to one another through a
parent/child relationship. Here, Country is the parent column and City is the child column.

Advanced Pivot Options


Alternatively, we may use more options, each of which has a default value.

Useful pivot options are:


• fill_value - value to replace missing values with
• dropna - exclude columns whose entries are all NaN
• margins - add all-row/all-column subtotals and grand totals
• sort - sort the results
Pivot Table with Multiple aggfunc
We can use multiple aggregation functions, and the functions may differ per column:
• 'D' - mean
• 'E' - min and max
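A sketch with hypothetical columns 'A', 'D', and 'E', matching the list above (mean for 'D', min and max for 'E'):

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["x", "x", "y", "y"],
    "D": [1.0, 3.0, 5.0, 7.0],
    "E": [2.0, 4.0, 6.0, 8.0],
})

# Different aggregation functions per column via an aggfunc dict
pivot = df.pivot_table(index="A", values=["D", "E"],
                       aggfunc={"D": "mean", "E": ["min", "max"]})
print(pivot)  # columns become a MultiIndex: (D, mean), (E, max), (E, min)
```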

Pivot table: replace NaN


To replace NaN values in the pivot table, we can use the fill_value parameter. For example, we can
replace NaN values with 0.

Pivot table: remove NaN


To drop columns with NaN values, we can use the option dropna=True.
