Pandas Notes(1)
Pandas is an open-source Python package for data cleaning and data manipulation. It provides
extended, flexible data structures to hold different types of labelled and relational data.
Pandas is built on the NumPy package, so a lot of the structure between them is similar. Pandas is
also used alongside SciPy for statistical analysis or with Matplotlib for plotting functions.
Easily calculate statistics about data such as finding the average, distribution, and median of
columns
Use data visualization tools, such as Matplotlib, to easily create bar plots, histograms, and
more
Clean your data by filtering columns by particular criteria or easily removing values
Manipulate your data flexibly using operations like merging, joining, reshaping, and more
Read, write, and store your clean data as a database, txt file, or CSV file
Installing Pandas
You can install Pandas using the built-in Python tool pip by running the following command.
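pip install pandas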
Series
DataFrame
Series are the columns, and the DataFrame is a table composed of a collection of Series. A Series can
be best described as a single column of a 2-D array that can store data of any type.
A DataFrame is like a table that stores data similar to a spreadsheet, using multiple columns and rows.
Each value in a DataFrame object is associated with a row index and a column index.
The Pandas data structures are illustrated below with some additional annotation.
We create a Series by invoking the pd.Series() method and passing it a list of values.
Pandas will, by default, count the index from 0. We can also explicitly define the index values.
The srs.values attribute returns the values stored in the Series object, and
srs.index.values returns the index values.
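The original numbered code is not reproduced here; a minimal sketch of the idea (the example values are placeholders):

import pandas as pd

# Pandas assigns a default integer index starting from 0
srs = pd.Series(['low', 'medium', 'high'])

print(srs.values)        # the values stored in the Series
print(srs.index.values)  # the index values: [0 1 2]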
Assign names to our values
Each index corresponds to its value in the Series object. Let’s look at an example where we assign a
country name to population growth rates.
Example:
The attribute srs.name sets the name of our Series object. The attribute srs.index.name then sets the
name for the index.
import pandas as pd
import numpy as np

srs = pd.Series(np.arange(0, 6, 1), index=['ind0', 'ind1', 'ind2', 'ind3', 'ind4', 'ind5'])
srs.index.name = "Index"   # give the index a name
print("The original Series:\n", srs)
The output shows that the ind2 index is dropped. Also, an index can only be dropped by specifying the
index name and not the number. So, srs.drop(srs[2]) does not work.
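A minimal sketch of dropping by label, reusing the srs Series defined above:

srs = srs.drop('ind2')   # dropping by the index label works
print(srs)
# srs.drop(srs[2])       # raises a KeyError: srs[2] is the stored value 2, not an index label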
DataFrame: the most important operations
Using the pandas.DataFrame() function
To create a pandas dataframe from a numpy array, pass the numpy array as an argument to
the pandas.DataFrame() function. You can also pass the index and column labels for the
dataframe. The following is the syntax:
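A minimal sketch of that syntax (the array contents and labels here are illustrative):

import pandas as pd
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])
df = pd.DataFrame(arr, index=['row1', 'row2'], columns=['col1', 'col2', 'col3'])
print(df)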
We can also create a dict and pass our dictionary data to the DataFrame constructor. Say we have
some data on vegetable sales and want to organize it by type of vegetable and quantity. Our data
would look like this:
Each item, or value, in our data will correspond with a column in the DataFrame we created, just like
a chart. The index for this DataFrame is listed as numbers, but we can specify them further
depending on our needs. Say we wanted to know the quantity per month. That would be our new index.
We do that using the following command.
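The original commands are not shown above; a minimal sketch, with hypothetical vegetables, quantities, and months (only 'carrots' appears elsewhere in these notes):

import pandas as pd

data = {'carrots': [10, 15, 12], 'potatoes': [20, 18, 25]}   # hypothetical sales data

quantity = pd.DataFrame(data)                                     # default index 0, 1, 2
quantity = pd.DataFrame(data, index=['June', 'July', 'August'])   # quantity per month
print(quantity)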
Get info about your data
One of the first commands you run after loading your data is .info(), which provides all the essential
information about a dataset.
You can access more information with other operations, like .shape, which outputs a tuple of (rows,
columns).
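Continuing with the quantity DataFrame sketched above:

quantity.info()         # index dtype, column names, non-null counts, and dtypes
print(quantity.shape)   # a tuple of (rows, columns), here (3, 2)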
We use the .columns operator to print a dataset’s column names.
quantity.columns
You can then rename your columns easily using the .rename() method.
quantity.rename(columns={'carrots': 'bananas'})
On line 11, the indexes are changed. The new index name is added between Row2 and Row4. On
line 14, the columns keyword should be specifically used to reindex the columns of a DataFrame. The
rules are the same as for the indexes. NaN values are assigned to the whole new column by default.
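The numbered code this paragraph refers to is not included above; a minimal sketch of the same idea, using hypothetical row and column labels:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(12).reshape(4, 3),
                  index=['Row1', 'Row2', 'Row3', 'Row4'],
                  columns=['A', 'B', 'C'])

# Reindexing the rows: a new label inserted between Row2 and Row4 is filled with NaN
df2 = df.reindex(['Row1', 'Row2', 'NewRow', 'Row4'])

# Reindexing the columns: use the columns keyword; the new column 'D' is filled with NaN
df3 = df.reindex(columns=['A', 'B', 'D'])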
Once we’ve used Pandas to sort and clean data, we can then save it back as the original file with
simple commands. You only have to input the filename and extension.
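For example, continuing with the quantity DataFrame (the filenames are placeholders):

quantity.to_csv('vegetables.csv')     # save as a CSV file
quantity.to_json('vegetables.json')   # or as a JSON file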
Reading and importing data from JSON (JavaScript Object Notation)
Examples:
{
  "glossary": {
    "title": "example glossary",
    "GlossDiv": {
      "title": "S",
      "GlossList": {
        "GlossEntry": {
          "ID": "SGML",
          "SortAs": "SGML",
          "GlossTerm": "Standard Generalized Markup Language",
          "Acronym": "SGML",
          "Abbrev": "ISO 8879:1986",
          "GlossDef": {
            "para": "A meta-markup language, used to create markup languages such as DocBook.",
            "GlossSeeAlso": ["GML", "XML"]
          },
          "GlossSee": "markup"
        }
      }
    }
  }
}
Say you have a JSON file. A JSON file is basically like a stored Python dict, so Pandas can easily access
and read it using the read_json function. Let’s look at an example.
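A minimal sketch (the filename is a placeholder; deeply nested JSON like the glossary above may be easier to flatten with pandas.json_normalize):

import pandas as pd

df = pd.read_json('data.json')   # read the JSON file into a DataFrame
print(df)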
Reading and importing data from an Excel file
Say you have an Excel file. You can similarly use the read_excel function to access and read that data.
Once we call the read_excel function, we pass the name of the Excel file as our argument,
so read_excel will open the file’s data. We can then use print() to display the data. If we want to go one
step further, we can add the loc method from earlier, allowing us to read specific rows and columns
of our file.
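A minimal sketch (the filename, row labels, and column names are placeholders; read_excel also requires an engine such as openpyxl to be installed):

import pandas as pd

df = pd.read_excel('sales.xlsx')   # open and read the Excel file
print(df)

# Read specific rows and columns with .loc
print(df.loc[0:3, ['Product', 'Quantity']])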
pandas.merge() parameters:
left_df – DataFrame1
right_df – DataFrame2
on – column name(s) to join on; must be found in both the left and right DataFrame objects.
how – type of join to be performed: 'left', 'right', 'outer', or 'inner'. The default is an inner join.
The data frames must have the same column names on which the merging happens. The merge() function in
pandas is similar to the database join operation in SQL.
UNDERSTANDING THE DIFFERENT TYPES OF JOIN OR MERGE IN PANDAS:
Inner Join or Natural join: To keep only rows that match from the data frames, specify the
argument how='inner'.
Outer Join or Full outer join: To keep all rows from both data frames, specify how='outer'.
Left Join or Left outer join: To include all the rows of your data frame x and only those
from y that match, specify how='left'.
Right Join or Right outer join: To include all the rows of your data frame y and only those
from x that match, specify how='right'.
Example:
import pandas as pd
import numpy as np
# data frame 1
d1 = {'Customer_id':pd.Series([1,2,3,4,5,6]),
'Product':pd.Series(['Oven','Oven','Oven','Television','Television','Television'])}
df1 = pd.DataFrame(d1)
# data frame 2
d2 = {'Customer_id':pd.Series([2,4,6,7,8]),
'State':pd.Series(['California','California','Texas','New York','Indiana'])}
df2 = pd.DataFrame(d2)
df1:
df2:
Example:
#inner join in python pandas
inner_join_df= pd.merge(df1, df2, on='Customer_id', how='inner')
inner_join_df
Returns all rows from both tables, joining records from the left which have matching keys in the right
table. When there is no match in either table, NaN will be returned.
Example:
# outer join in python pandas
outer_join_df=pd.merge(df1, df2, on='Customer_id', how='outer')
outer_join_df
Returns all rows from the left table, and any rows with matching keys from the right table. When there
is no match from the right table, NaN will be returned.
Example:
# left join in python pandas
left_join_df = pd.merge(df1, df2, on='Customer_id', how='left')
left_join_df
The resultant data frame will be:
Returns all rows from the right table, and any rows with matching keys from the left table.
Example:
# right join in python pandas
right_join_df= pd.merge(df1, df2, on='Customer_id', how='right')
right_join_df
It is frequently required to join dataframes together, such as when data is loaded from multiple files
or even multiple sources. pandas.concat() is used to add the rows of multiple dataframes together
and produce a new dataframe with the combined data.
concat
The pandas.concat function joins a number of dataframes along one of the axes. The default is to join
along the index.
Concatenate the two dataframes together to join along the index.
Pandas - concat by default joins two dataframes along the index
Pandas.concat Parameters:
join – how to handle indexes on the other axis (options are 'inner' or 'outer'); default 'outer'
keys – sequence used to create a hierarchical index using the passed keys; default None
verify_integrity – boolean value to specify whether the new concatenated axis contains duplicates; default False
sort – boolean value to specify sorting of the non-concatenation axis if it is not already aligned when join is 'outer'; default False
Pandas - concat() with axis=1 joins two dataframes along the columns
concat with inner join
Pandas - concatenate two dataframes with inner join only keeps matching indexes
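A minimal sketch, reusing df1 and df2 from the merge example above:

# Default: concatenate along the index (axis=0)
print(pd.concat([df1, df2]))

# axis=1 joins the two dataframes along the columns
print(pd.concat([df1, df2], axis=1))

# join='inner' only keeps the indexes present in both dataframes
print(pd.concat([df1, df2], axis=1, join='inner'))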
DataFrame.append
The append() instance method performs the same function as concat by appending a Series or DataFrame onto the
end of the calling dataframe and returning a new dataframe. Note that DataFrame.append() is deprecated in recent
versions of Pandas; pandas.concat() is the recommended replacement.
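A minimal sketch, again reusing df1 and df2 from above (note the deprecation caveat):

# combined = df1.append(df2, ignore_index=True)      # works only in pandas < 2.0
combined = pd.concat([df1, df2], ignore_index=True)   # the modern equivalent
print(combined)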
groupby() function in pandas
Pandas DataFrame.groupby() function is used to collect identical data into groups and
perform aggregate functions on the grouped data. A group-by operation involves splitting the data,
applying some functions, and finally aggregating the results.
The groupby() function is commonly used in data analysis. It is used to gain insights into the
relationship between variables.
Key Points –
groupby() is used to split data into groups based on one or more keys, allowing for efficient
analysis and aggregation of grouped data.
It supports various aggregation functions like sum, mean, count, min, and max, which can
be applied to each group.
You can apply multiple aggregations on different columns using .agg(), offering more
flexibility in analysis.
The result of groupby() often returns a DataFrame with a MultiIndex, where each level
represents a grouping key.
You can filter groups based on specific conditions by using .filter() after groupby().
groupby() allows iteration over groups, enabling customized operations on each subset of
data.
Example:
output.
Apply the groupby() function along with the sum() function to perform the sum operation on grouped
data.
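The original example is not reproduced above; a minimal sketch, assuming a DataFrame with Courses, Fee, and Duration columns (the values are hypothetical):

import pandas as pd

df = pd.DataFrame({
    'Courses':  ['Spark', 'PySpark', 'Spark', 'Pandas'],
    'Fee':      [20000, 25000, 22000, 24000],
    'Duration': ['30days', '40days', '35days', '50days']
})

# Group identical course names together and sum the fee in each group
print(df.groupby('Courses')['Fee'].sum())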
output.
Example:
Example:
Sort groupby() result by Group Key
To remove sorting on grouped results in pandas, you can pass the sort=False parameter to
the groupby() function. By passing sort=False to the groupby() function, you ensure that the grouped
results are not sorted by the group key, preserving the original order of appearance of the courses in
the DataFrame.
To sort the group keys (courses) in descending order after performing the groupby() operation, you
can use the sort_index() method with the ascending=False parameter.
This code first groups the DataFrame by Courses, calculates the sum of each group, and then sorts
the group keys (courses) in descending order using the sort_index() method with ascending=False.
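A minimal sketch of both options, reusing the df defined in the groupby sketch above:

# Preserve the original order of appearance of the courses
print(df.groupby('Courses', sort=False)['Fee'].sum())

# Sort the group keys (courses) in descending order after grouping
print(df.groupby('Courses')['Fee'].sum().sort_index(ascending=False))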
Apply More Aggregations
You can compute multiple aggregations at the same time on grouped data simply by passing a list of
aggregate functions to aggregate().
Example:
To compute different aggregations on different columns in a grouped DataFrame, you can pass a
dictionary to the agg() function specifying the aggregation function for each column. Here, we calculate
the count on the grouped Duration column and the min and max on the grouped Fee column.
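A minimal sketch of both patterns, again reusing the df from the groupby sketch above:

# Multiple aggregations on the same column by passing a list to aggregate()
print(df.groupby('Courses')['Fee'].aggregate(['min', 'max']))

# Different aggregations per column by passing a dictionary to agg()
print(df.groupby('Courses').agg({'Duration': 'count', 'Fee': ['min', 'max']}))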
Example:
Output:
Output:
We can simply find the rows or columns where we have missing data and drop them by using
Pandas functions.
To replace missing data in a DataFrame you can use the following different methods:
1. Replace missing data with fixed values in DataFrame
2. Replace missing data with Mean value
3. Replace missing data with Median value
We can impute the missing values in the DataFrame with a fixed value. The fixed value can be
an integer or any other data depending on the nature of your dataset. For example, if you
are dealing with gender data, you can replace all the missing values with the word
“unknown”, “Male”, or “Female”.
Pandas Replace NaN with 0
Pandas Replace NaN with empty String
You can also impute all missing values with a random number generated using the Python random module.
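A minimal sketch of these replacement strategies, using a small hypothetical DataFrame:

import pandas as pd
import numpy as np
import random

df = pd.DataFrame({'Gender': ['Male', np.nan, 'Female'],
                   'Age': [25, np.nan, 31]})

df['Gender'] = df['Gender'].fillna('unknown')        # fixed value
df['Age'] = df['Age'].fillna(df['Age'].mean())       # mean value
# df['Age'] = df['Age'].fillna(df['Age'].median())   # or the median value
# df = df.fillna(0)                                  # replace NaN with 0
# df = df.fillna('')                                 # replace NaN with an empty string
# df = df.fillna(random.randint(0, 100))             # or a random number
print(df)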
Key Differences:
Requires Aggregation: Yes | Yes | No | No
Key for Grouping / Combining: No | Grouping key(s) | Common key(s) | Not required
The pivot_table() function in Pandas allows us to create a spreadsheet-style pivot table, making it
easier to group and analyze our data.
To create a pivot table using this method you need to define values for the following parameters:
Index
Columns (optional)
Values
Aggfunc
The index parameter defines what is going to be the index of your pivot table. For example, it defines
how the rows of your original DataFrame are going to be grouped into categories. If you input a list
of values instead of just one value, you are going to end up with a multi-index as your row index.
The columns parameter is an optional parameter that allows you to introduce an extra value to your
columns index, which in turn transforms your pivot table column index into a multi-index.
The values parameter defines which columns you want to aggregate. Essentially, it tells Pandas what
it needs to aggregate based on some aggregation function after your data has been grouped based
on the values you entered for the index parameter.
The aggfunc parameter defines which type of aggregation you want to perform. Based on what you
decide to use here, you can access various information such as the means, the sums, etc. If you want
to, you can also enter multiple values here, which will end up transforming your column index into a
multi-index.
Example:
Output:
In this example, we reshaped the DataFrame
with Date as index, City as columns and Temperature as values.
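The example code itself is not reproduced above; a minimal sketch with hypothetical readings:

import pandas as pd

df = pd.DataFrame({
    'Date': ['2024-01-01', '2024-01-01', '2024-01-02', '2024-01-02'],
    'City': ['London', 'Paris', 'London', 'Paris'],
    'Temperature': [7, 9, 6, 10],
    'Humidity': [80, 75, 82, 70]
})

# Date as the index, City as the columns, Temperature as the values
pivot = pd.pivot_table(df, index='Date', columns='City', values='Temperature')
print(pivot)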
Output:
In this example, we created a pivot table for mul ple values i.e. Temperature and Humidity.
In the above example, we calculated the mean temperature of each city using
the aggfunc='mean' argument in pivot_table().
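A minimal sketch of those two variations, reusing the df from the pivot sketch above:

# Multiple values (Temperature and Humidity), aggregated with the mean per city
pivot = pd.pivot_table(df, index='Date', columns='City',
                       values=['Temperature', 'Humidity'], aggfunc='mean')
print(pivot)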
Output: