Session2-DM Using Pandas
Introduction
• Pandas: Python Data Analysis Library
• Provides rich set of functions to process various types of
data.
• Provides flexible data manipulation techniques similar to those of
spreadsheets and relational databases.
• An open-source library providing high-performance, easy-to-use
data structures and data analysis tools.
• Built on top of NumPy.
• Integrates well with the matplotlib library, which makes it a
very handy tool for analyzing data.
• Part of the SciPy ecosystem (Scientific Computing Tools for
Python)
Data structures
• Series : One-dimensional ndarray with axis labels (including
time series).
• DataFrame: Two-dimensional size-mutable, potentially
heterogeneous tabular data structure with labeled axes (rows
and columns). Arithmetic operations align on both row and
column labels.
• The DataFrame is the primary pandas data structure.
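The label alignment mentioned above can be illustrated with a minimal sketch. The two frames and their values are made up for this example; only the `+` operator and the DataFrame constructor come from pandas itself.

```python
import pandas as pd

# Two small frames with partly overlapping row and column labels (made-up values).
left = pd.DataFrame({'a': [1, 2], 'b': [3, 4]}, index=['x', 'y'])
right = pd.DataFrame({'b': [10, 20], 'c': [30, 40]}, index=['y', 'z'])

# Addition matches cells by (row label, column label), not by position.
# Cells present in only one of the frames become NaN in the result.
total = left + right
print(total)
```

Only the cell at row 'y', column 'b' exists in both frames, so it is the only one that holds a number in the result.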
Series
• The Series is a one-dimensional array that can store various
data types, including mixed data types.
• The row labels in a Series are called the index.
• Any list, tuple or dictionary can be converted into a Series
using the pd.Series() constructor.
• As with ndarrays, the underlying data has a fixed length; operations
such as drop() or pd.concat() return a new Series rather than resizing
it in place.
• Missing data: Represented as NaN (np.nan, a float!).
• Statistical methods from ndarray have been overridden to
automatically exclude missing data.
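The last two points can be checked with a short sketch. The values below are illustrative; the names reuse those from the CGPA example in these slides.

```python
import numpy as np
import pandas as pd

# A Series with one missing value (illustrative data).
s = pd.Series([9.1, np.nan, 8.5], index=['Vishnu', 'Akash', 'Aditya'])

print(s.dtype)              # NaN is a float, so the Series is float64
pandas_mean = s.mean()      # pandas skips the missing value
numpy_mean = s.to_numpy().mean()  # plain NumPy propagates NaN

# Integers mixed with a missing value are also stored as floats.
ints = pd.Series([1, 2, None])
print(pandas_mean, numpy_mean, ints.dtype)
```

Passing `skipna=False` to `mean()` would make pandas propagate NaN as well.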
Creating a Series
import pandas as pd
s = pd.Series([9.1, 7.5, 8.63], index=['Vishnu', 'Akash', 'Aditya'],
name='CGPA')
print(s)
Vishnu 9.10
Akash 7.50
Aditya 8.63
Name: CGPA, dtype: float64
Some attributes
import pandas as pd
s = pd.Series([9.1, 7.5, 8.63], index=['Vishnu', 'Akash', 'Aditya'],
name='CGPA')
print(s.dtype)
print(s.name)
print(s.index)
float64
CGPA
Index(['Vishnu', 'Akash', 'Aditya'], dtype='object')
Creating a Series from List, Tuple, dictionary
import pandas as pd
h = ('Ram', '15-08-2010', 48, 3.2)
s = pd.Series(h)
print(s)
0 Ram
1 15-08-2010
2 48
3 3.2
dtype: object
d = {'Name' : 'Ram', 'DoB' : '15-08-2010', 'Height' : 48,
'Weight' : 3.2}
ds = pd.Series(d)
print(ds)
Name Ram
DoB 15-08-2010
Height 48
Weight 3.2
dtype: object
f = ['Ram', '15-08-2020', 48, 3.2]
# a list is converted the same way as a tuple: pd.Series(f)
print(df)
name
GOOG GOOG 2010-04-09 32.2
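Summing over columns and rows
print(df.sum())
name AAIBMGOOG
date 2001-12-012012-02-102010-04-09
shares 220
price 54.8
owner UnknownUnknownUnknown
dtype: object
# numeric_only=True skips the text columns; recent pandas versions
# raise a TypeError when asked to add strings and numbers in a row
print(df.sum(axis=1, numeric_only=True))
name
AA 112.3
IBM 40.3
GOOG 122.2
dtype: float64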
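The DataFrame behind these sums is not shown in full here, so the sketch below rebuilds one under assumptions: the per-row `shares` and `price` values are hypothetical, chosen only so that they agree with the totals printed above. It restricts both sums to numeric columns; without `numeric_only=True`, the column sum concatenates the text columns as in the printed output.

```python
import pandas as pd

# Hypothetical rows, chosen to be consistent with the printed totals.
df = pd.DataFrame({
    'name': ['AA', 'IBM', 'GOOG'],
    'date': ['2001-12-01', '2012-02-10', '2010-04-09'],
    'shares': [100, 30, 90],
    'price': [12.3, 10.3, 32.2],
    'owner': ['Unknown', 'Unknown', 'Unknown'],
})
# Use 'name' as the row index while keeping it as a column.
df = df.set_index('name', drop=False)

col_totals = df.sum(numeric_only=True)          # one total per numeric column
row_totals = df.sum(axis=1, numeric_only=True)  # one total per row
print(col_totals)
print(row_totals)
```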
Reading files
import pandas as pd
casts = pd.read_csv('cast.csv', index_col=None)
print(casts.head())
titles = pd.read_csv('titles.csv', index_col=None)
print(titles.tail())
a = pd.read_csv('cast.csv', usecols=['title', 'year'])
print(a.head(6))
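For readers without cast.csv at hand, the same calls can be tried on an in-memory file. This is a sketch only: the three rows are invented, and the column names are assumed from the output shown later in these slides.

```python
import io
import pandas as pd

# Invented stand-in for cast.csv; the middle row has no value for 'n'.
csv_text = """title,year,name,type,character,n
Movie A,2001,Actor One,actor,Hero,1
Movie B,2005,Actress Two,actress,Pilot,
Movie C,2010,Actor Three,actor,Guard,3
"""

# read_csv accepts any file-like object, not only a path.
casts_demo = pd.read_csv(io.StringIO(csv_text))

# usecols loads only the listed columns, which saves memory on large files.
subset = pd.read_csv(io.StringIO(csv_text), usecols=['title', 'year'])

print(casts_demo.head(2))   # first two rows
print(subset.tail(1))       # last row
```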
Row and column selection
t = titles['title']
print(t.head(3))
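The example above selects a single column. A minimal sketch of the other common selections follows, using a small made-up frame in place of titles.csv.

```python
import pandas as pd

# Made-up stand-in for the titles data.
titles_demo = pd.DataFrame({
    'title': ['Movie A', 'Movie B', 'Movie C'],
    'year': [2001, 2005, 2010],
})

one_col = titles_demo['title']             # single brackets: a Series
two_cols = titles_demo[['title', 'year']]  # list of names: a DataFrame
first_rows = titles_demo.iloc[0:2]         # rows by position (0 and 1)
by_label = titles_demo.loc[1, 'title']     # one cell by row label and column name
print(first_rows)
print(by_label)
```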
Filter Data
• Rows can be filtered by indexing a DataFrame with a boolean
expression.
• isnull() returns a boolean Series that marks the missing values:
c = casts
print(c['n'].isnull().head())
• To display the rows with null values, pass that boolean Series to
the DataFrame as the index:
print(c[c['n'].isnull()].head(3))
• df.isna().any() returns a boolean value for each column.
• If there is at least one missing value in that column, the result is True.
print(c.isna().any())
title False
year False
name False
type False
character False
n True
dtype: bool
• df.isna().sum() returns the number of missing values in each column.
print(c.isna().sum())
title 0
year 0
name 0
type 0
character 0
n 28966
dtype: int64
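The filtering and missing-value checks in this section can be reproduced on a small frame. The data below is invented for illustration; the column names follow the casts output above.

```python
import numpy as np
import pandas as pd

# Invented stand-in for the casts data.
casts_demo = pd.DataFrame({
    'title': ['Movie A', 'Movie B', 'Movie C', 'Movie D'],
    'year': [2001, 2005, 2010, 2003],
    'n': [1.0, np.nan, 3.0, np.nan],
})

mask = casts_demo['n'].isna()       # boolean Series, True where n is missing
missing_rows = casts_demo[mask]     # keep only the rows where the mask is True
recent = casts_demo[casts_demo['year'] > 2004]  # any boolean expression works

any_missing = casts_demo.isna().any()  # one True/False per column
n_missing = casts_demo.isna().sum()    # count of missing values per column
print(missing_rows)
print(n_missing)
```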
Handling Missing Values
• Drop missing values
• Replace missing values
Drop missing values
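• Rows or columns with missing values can be dropped with the dropna()
method. The how parameter sets the condition: 'any' drops a row or
column that contains at least one missing value, 'all' only one in
which every value is missing.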
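Both strategies listed above can be sketched on a small made-up frame. The fill value 0 is an arbitrary choice for the example, not a recommendation.

```python
import numpy as np
import pandas as pd

# Made-up frame with gaps in 'year' and 'n'.
df_demo = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'year': [2001, np.nan, 2010],
    'n': [1.0, np.nan, np.nan],
})

rows_any = df_demo.dropna(how='any')          # drop rows with at least one NaN
rows_all = df_demo.dropna(how='all')          # drop rows only if every value is NaN
cols_any = df_demo.dropna(axis=1, how='any')  # axis=1 drops columns instead of rows

# Replace missing values in selected columns only.
filled = df_demo.fillna({'n': 0})
print(rows_any)
print(filled)
```

All of these return new objects; the original frame is left unchanged.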
title year
19 Maa Durga Shakti 1999
3046 Maa Aur Mamta 1970
7470 Maa Vaibhav Laxmi 1989
• The number of occurrences of each value can be counted using the
value_counts() method.
• The following code displays the number of movies per year.
t = titles
t['year'].value_counts().head()
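A minimal sketch of value_counts() on invented data shows the ordering of the result, which matters when calling head() on it.

```python
import pandas as pd

# Invented release years.
years = pd.Series([2001, 2005, 2001, 2010, 2001, 2005], name='year')

# Result is indexed by the distinct values, most frequent first.
counts = years.value_counts()

# sort_index() reorders by year instead of by frequency.
by_year = counts.sort_index()
print(counts)
print(by_year)
```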
Plots
import matplotlib.pyplot as plt
t = titles
p = t['year'].value_counts()
p.sort_index().plot()
plt.show()
Grouping
• Groupby with column names
c = casts
cg = c.groupby(['year']).size()
cg.plot()
plt.show()
• groupby can take multiple columns for grouping
c = casts
cf = c[c['name'] == 'Aaron Abrams']
ct = cf.groupby(['year', 'title']).size().head()
print(ct)
year title
2003 The In-Laws 1
2004 Resident Evil: Apocalypse 1
Siblings 1
2005 Cinderella Man 1
Sabah 1
dtype: int64
• Grouping based on the maximum rating (column n) in each year:
c.groupby(['year']).n.max().head()
• To check the mean rating in each year:
c.groupby(['year']).n.mean().head()
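The grouping calls in this section can be tried on a small frame. The rows are invented; the column names follow the casts data, and `['n']` is used as an equivalent spelling of `.n`.

```python
import numpy as np
import pandas as pd

# Invented stand-in for the casts data, with one missing 'n'.
casts_demo = pd.DataFrame({
    'year': [2003, 2003, 2004, 2004, 2004],
    'title': ['Film A', 'Film A', 'Film B', 'Film C', 'Film C'],
    'n': [1.0, 4.0, 2.0, np.nan, 6.0],
})

per_year = casts_demo.groupby('year').size()                  # rows per year
per_year_title = casts_demo.groupby(['year', 'title']).size() # rows per (year, title)
max_n = casts_demo.groupby('year')['n'].max()                 # largest n per year
mean_n = casts_demo.groupby('year')['n'].mean()               # NaN is skipped
print(per_year_title)
print(mean_n)
```

Grouping by two columns yields a Series with a two-level index, which is what the unstack step later relies on.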
Unstack
• We want to compare and plot the total number of actors and
actresses in each decade.
• To do so, we group the data by 'type' and by decade (year//10*10).
c = casts
c_decade = c.groupby(['type', c['year']//10*10]).size()
print(c_decade)
type year
actor 1910 340
1920 590
1930 1364
1940 1253
1950 1490
1960 1879
1970 2191
1980 2874
1990 4051
2000 6787
2010 7259
actress 1910 267
1920 353
1930 511
1940 528
1950 665
1960 702
1970 1015
1980 1686
1990 2115
2000 3872
2010 4243
dtype: int64
• unstack() moves the inner index level (the decade) into columns,
giving one row per type:
us = c_decade.unstack()
print(us)
year     1910  1920  1930  1940  1950  1960  1970  1980  1990  2000  2010
type
actor     340   590  1364  1253  1490  1879  2191  2874  4051  6787  7259
actress   267   353   511   528   665   702  1015  1686  2115  3872  4243
us.plot(kind='bar')
plt.show()
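The group-then-unstack pattern can be sketched on a handful of invented rows, without the plotting step.

```python
import pandas as pd

# Invented stand-in for the casts data.
casts_demo = pd.DataFrame({
    'type': ['actor', 'actor', 'actress', 'actor', 'actress', 'actress'],
    'year': [1991, 1998, 1995, 2003, 2007, 2009],
})

# Integer division then multiplication truncates a year to its decade.
decade = casts_demo['year'] // 10 * 10

# Two grouping keys give a Series with a two-level index (type, year).
counts = casts_demo.groupby(['type', decade]).size()

# unstack() turns the inner level into columns: one row per type.
wide = counts.unstack()
print(counts)
print(wide)
```

A combination that never occurs in the data would appear as NaN in the wide table; `unstack(fill_value=0)` replaces such gaps with zeros.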
Time series
• A sequence of timestamps can be generated using the date_range
function.
• periods is the total number of samples.
• freq='M' means that the timestamps are generated with a monthly
frequency.
• By default, pandas treats 'M' as the end of the month (pandas 2.2
and later spell this alias 'ME').
• Use 'MS' for the start of the month.
rng = pd.date_range('2011-03-01 10:15', periods=10, freq='M')
print(rng)
DatetimeIndex(['2011-03-31 10:15:00', '2011-04-30 10:15:00',
'2011-05-31 10:15:00', '2011-06-30 10:15:00',
'2011-07-31 10:15:00', '2011-08-31 10:15:00',
'2011-09-30 10:15:00', '2011-10-31 10:15:00',
'2011-11-30 10:15:00', '2011-12-31 10:15:00'],
dtype='datetime64[ns]', freq='M')
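The difference between month-end and month-start frequencies can be seen side by side. This sketch passes a `pd.offsets.MonthEnd()` object rather than a string alias for the month-end case; that is a substitution made here so the example does not depend on which alias spelling a given pandas version accepts.

```python
import pandas as pd

start = '2011-03-01 10:15'

# Month-end stamps: each date is rolled forward to the last day of its month.
month_end = pd.date_range(start, periods=10, freq=pd.offsets.MonthEnd())

# Month-start stamps: the start date already falls on the 1st, so it is kept.
month_start = pd.date_range(start, periods=10, freq='MS')

# The time of day (10:15) is carried over to every generated timestamp.
print(month_end[0], month_end[-1])
print(month_start[0], month_start[-1])
```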
dtype: category
Categories (3, object): [a, b, c]
References
• https://pandas.pydata.org/pandas-docs/stable/user_guide/
• GitHub awesome-pandas
• Pandas Guide by Meher Krishna Patel
• Manipulating and analyzing data with pandas by Céline Comte