Session2-DM Using Pandas

The document provides an overview of the Pandas library in Python, focusing on its data manipulation capabilities, including data structures like Series and DataFrame. It covers how to create, access, and manipulate these structures, as well as handling missing values, filtering, and grouping data. Additionally, it discusses integration with other libraries and basic plotting functionalities.

Uploaded by

pavitradevi297
Copyright
© All Rights Reserved

Data Manipulation with Pandas
Introduction
• Pandas: Python Data Analysis Library
• Provides a rich set of functions to process various types of data.
• Provides flexible data manipulation techniques similar to those of spreadsheets and relational databases.
• Open source, providing high-performance, easy-to-use data structures and data analysis tools.
• Built on top of NumPy.
• Integrates well with the matplotlib library, which makes it a very handy tool for analyzing data.
• Part of the SciPy ecosystem (Scientific Computing Tools for Python)
Data structures
• Series: One-dimensional ndarray with axis labels (including time series).
• DataFrame: Two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). Arithmetic operations align on both row and column labels. It is the primary pandas data structure.
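As a small illustration of label alignment in DataFrame arithmetic, the sketch below adds two frames whose row and column labels only partly overlap. The frames a and b and their labels are made up for this example.

```python
import pandas as pd

# Two hypothetical frames that share only row 'r2' and column 'y'
a = pd.DataFrame({'x': [1, 2], 'y': [3, 4]}, index=['r1', 'r2'])
b = pd.DataFrame({'y': [10, 20], 'z': [30, 40]}, index=['r2', 'r3'])

# Addition matches cells by (row label, column label), not by position.
# Cells that exist in only one frame become NaN in the result.
total = a + b
print(total)
```

Only the cell at row 'r2', column 'y' exists in both frames, so it is the only non-missing value in the result.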
Series
• The Series is a one-dimensional array that can store various data types, including mixed data types.
• The row labels in a Series are called the index.
• Any list, tuple, or dictionary can be converted into a Series using the pd.Series constructor.
• Like an ndarray, the values of a Series have a fixed length; most operations return a new Series rather than resizing in place (assigning to a new label, e.g. s['new'] = 1.0, does enlarge the Series).
• Missing data: Represented as NaN (np.nan, a float!).
• Statistical methods from ndarray have been overridden to automatically exclude missing data.
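The last two points can be checked with a short sketch. The values below are made up for illustration; the point is only that the pandas mean skips NaN while the plain NumPy mean does not.

```python
import numpy as np
import pandas as pd

# Hypothetical CGPA values with one missing entry
s = pd.Series([9.1, np.nan, 8.5], index=['Vishnu', 'Akash', 'Aditya'])

pandas_mean = s.mean()               # NaN is skipped: mean of 9.1 and 8.5
numpy_mean = np.mean(s.to_numpy())   # NaN propagates through plain NumPy
strict_mean = s.mean(skipna=False)   # opt out of skipping: result is NaN

print(s.dtype)        # float64, because NaN is a float
print(pandas_mean, numpy_mean, strict_mean)
```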
Creating a Series
import pandas as pd
s = pd.Series([9.1, 7.5, 8.63], index=['Vishnu', 'Akash', 'Aditya'],
name='CGPA')
print(s)

Vishnu 9.10
Akash 7.50
Aditya 8.63
Name: CGPA, dtype: float64
Some attributes
import pandas as pd
s = pd.Series([9.1, 7.5, 8.63], index=['Vishnu', 'Akash', 'Aditya'],
name='CGPA')
print(s.dtype)
print(s.name)
print(s.index)

float64
CGPA
Index(['Vishnu', 'Akash', 'Aditya'], dtype='object')
Creating a Series from a list, tuple, or dictionary
import pandas as pd
h = ('Ram', '15-08-2010', 48, 3.2)
s = pd.Series(h)
print(s)

0           Ram
1    15-08-2010
2            48
3           3.2
dtype: object

d = {'Name' : 'Ram', 'DoB' : '15-08-2010', 'Height' : 48,
'Weight' : 3.2}
ds = pd.Series(d)
print(ds)

Name             Ram
DoB       15-08-2010
Height            48
Weight           3.2
dtype: object

f = ['Ram', '15-08-2020', 48, 3.2]
f = pd.Series(f, index = ['Name', 'DoB', 'Height', 'Weight'])
print(f)

Name             Ram
DoB       15-08-2020
Height            48
Weight           3.2
dtype: object
Accessing data
import pandas as pd
s = pd.Series([9.1, 7.5, 8.63], index=['Vishnu',
'Akash', 'Aditya'], name='CGPA')
print(s['Vishnu'])

9.1

print(s['Akash':'Aditya'])

Akash     7.50
Aditya    8.63
Name: CGPA, dtype: float64

print(s['Akash':])

Akash     7.50
Aditya    8.63
Name: CGPA, dtype: float64
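A few other ways to select from the same Series are sketched below. Note that label slices include both end points, while position slices with iloc exclude the stop position. The variable names are chosen for this example only.

```python
import pandas as pd

s = pd.Series([9.1, 7.5, 8.63], index=['Vishnu', 'Akash', 'Aditya'],
              name='CGPA')

by_label = s.loc['Vishnu':'Akash']   # label slice: both ends included
by_position = s.iloc[0:1]            # position slice: stop is excluded
picked = s[['Vishnu', 'Aditya']]     # a list of labels
high = s[s > 8]                      # a boolean condition

print(by_label)
print(by_position)
print(picked)
print(high)
```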
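Creating a View
import pandas as pd
s = pd.Series([9.1, 7.5, 8.63], index=['Vishnu', 'Akash',
'Aditya'], name='CGPA')
t = s['Akash':]
print(t)

Akash     7.50
Aditya    8.63
Name: CGPA, dtype: float64

t['Aditya'] = 9.5
print(s)

Vishnu    9.1
Akash     7.5
Aditya    9.5
Name: CGPA, dtype: float64

• Note: this output relies on the slice being a view of s. With Copy-on-Write enabled (optional in pandas 2.x and planned as the default in pandas 3.0), t behaves as a copy and s is left unchanged.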
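When a change to the slice should not reach the original Series, an explicit copy can be taken. A minimal sketch using the same data:

```python
import pandas as pd

s = pd.Series([9.1, 7.5, 8.63], index=['Vishnu', 'Akash', 'Aditya'],
              name='CGPA')

# .copy() gives t its own data, independent of s
t = s['Akash':].copy()
t['Aditya'] = 9.5

print(t['Aditya'])   # the copy was changed
print(s['Aditya'])   # the original keeps its value
```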
Adding two series (with automatic data alignment)
import pandas as pd
s = pd.Series([9.1, 7.5, 8.63],
index=['Aditya', 'Bibek', 'Satya'],
name='CGPA')
t = pd.Series([6, 6.3], index=['Bibek',
'Satya'], name='Height')
u = s.add(t)
print(u)

Aditya      NaN
Bibek     13.50
Satya     14.93
dtype: float64

v = s.add(t, fill_value=0)
print(v)

Aditya     9.10
Bibek     13.50
Satya     14.93
dtype: float64
DataFrame
• A DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes.
• A DataFrame has two indexes: a column index and a row index.
• Columns can have different dtypes and can be added and removed.
• The most common way to create a DataFrame is from a dictionary of equal-length lists.
• Further, spreadsheets and text files are read in as DataFrames.
Creating a DataFrame
import pandas as pd
import numpy as np
df = pd.DataFrame({'Height': [5.5, 6, 6.5], 'Weight': [np.nan,
230., 275.]},
index=['Aditya', 'Bivek', 'Vishnu'])
print(df)

        Height  Weight
Aditya     5.5     NaN
Bivek      6.0   230.0
Vishnu     6.5   275.0

print(df.dtypes)

Height    float64
Weight    float64
dtype: object
Other attributes
print(df.shape)
print(df.size)
print(df.columns)
print(df.index)
(3, 2)
6
Index(['Height', 'Weight'], dtype='object')
Index(['Aditya', 'Bivek', 'Vishnu'], dtype='object')
Creating a DataFrame from a dictionary
data = { 'name' : ['AA', 'IBM', 'GOOG'],
'date' : ['2001-12-01', '2012-02-10',
'2010-04-09'], 'shares' : [100, 30, 90],
'price' : [12.3, 10.3, 32.2]}
df = pd.DataFrame(data)
print(df)

   name        date  shares  price
0    AA  2001-12-01     100   12.3
1   IBM  2012-02-10      30   10.3
2  GOOG  2010-04-09      90   32.2

df['owner'] = 'Unknown'
print(df)

   name        date  shares  price    owner
0    AA  2001-12-01     100   12.3  Unknown
1   IBM  2012-02-10      30   10.3  Unknown
2  GOOG  2010-04-09      90   32.2  Unknown

df.index = ['one', 'two', 'three']
print(df)

       name        date  shares  price    owner
one      AA  2001-12-01     100   12.3  Unknown
two     IBM  2012-02-10      30   10.3  Unknown
three  GOOG  2010-04-09      90   32.2  Unknown

df = df.set_index('name', drop=False)
print(df)

      name        date  shares  price    owner
name
AA      AA  2001-12-01     100   12.3  Unknown
IBM    IBM  2012-02-10      30   10.3  Unknown
GOOG  GOOG  2010-04-09      90   32.2  Unknown
Accessing data
print(df['shares'])

name
AA      100
IBM      30
GOOG     90
Name: shares, dtype: int64

print(df.loc['AA',:])

name              AA
date      2001-12-01
shares           100
price           12.3
owner        Unknown
Name: AA, dtype: object

print(df.loc[:, 'name'])

name
AA        AA
IBM      IBM
GOOG    GOOG
Name: name, dtype: object

print(df.loc['AA', 'shares'])

100
Deleting columns and rows
del df['owner']
print(df)

      name        date  shares  price
name
AA      AA  2001-12-01     100   12.3
IBM    IBM  2012-02-10      30   10.3
GOOG  GOOG  2010-04-09      90   32.2

df.drop('shares', axis=1, inplace=True)
print(df)

      name        date  price
name
AA      AA  2001-12-01   12.3
IBM    IBM  2012-02-10   10.3
GOOG  GOOG  2010-04-09   32.2

df.drop(['AA', 'IBM'], axis=0, inplace=True)
print(df)

      name        date  price
name
GOOG  GOOG  2010-04-09   32.2
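Summing over columns and rows
# df here is the frame as it was before the deletions (with 'shares' and 'owner')
print(df.sum())

name                           AAIBMGOOG
date      2001-12-012012-02-102010-04-09
shares                               220
price                               54.8
owner              UnknownUnknownUnknown
dtype: object

# numeric_only=True is needed in pandas 2.x, where string columns are no longer skipped silently
print(df.sum(axis=1, numeric_only=True))

name
AA      112.3
IBM      40.3
GOOG    122.2
dtype: float64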
Reading files
import pandas as pd
casts = pd.read_csv('cast.csv', index_col=None)
print(casts.head())
titles = pd.read_csv('titles.csv', index_col =None)
print(titles.tail())
a=pd.read_csv('cast.csv', usecols= ['title','year'])
print(a.head(6))
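The files cast.csv and titles.csv are not part of this document. As a stand-in, the sketch below reads a few made-up rows from an in-memory string, assuming the same column names that appear later for casts (title, year, name, type, character, n). The row values are hypothetical.

```python
import io
import pandas as pd

# Hypothetical rows that only imitate the layout of cast.csv
csv_text = """title,year,name,type,character,n
Movie A,1995,Actor One,actor,Hero,1
Movie A,1995,Actress Two,actress,Doctor,2
Movie B,2003,Actor One,actor,Pilot,
"""

# read_csv accepts a file path or any file-like object
casts = pd.read_csv(io.StringIO(csv_text), index_col=None)

# usecols keeps only the listed columns
subset = pd.read_csv(io.StringIO(csv_text), usecols=['title', 'year'])

# index_col turns a column into the row index
by_title = pd.read_csv(io.StringIO(csv_text), index_col='title')

print(casts.head())
print(subset.head())
print(by_title.head())
```

The empty last field in the third row is read as NaN, which is why column n becomes a float column.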
Row and column selection
t = titles['title']
print(t.head(3))
Filter Data
• Data can be filtered by passing a boolean expression to the DataFrame's indexing operator.

movies90 = titles[ (titles['year']>=1990) & (titles['year']<2000) ]
print(movies90.head(4))
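The filter can be tried without titles.csv on a small made-up frame. The sketch below also shows between() as an alternative way to write the same range; the titles and years are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the titles DataFrame
titles = pd.DataFrame({
    'title': ['Film A', 'Film B', 'Film C', 'Film D', 'Film E'],
    'year': [1985, 1990, 1999, 2000, 2005],
})

# Each comparison yields a boolean Series; & combines them element-wise.
# The parentheses are required because & binds tighter than >= and <.
mask = (titles['year'] >= 1990) & (titles['year'] < 2000)
movies90 = titles[mask]

# between() is inclusive on both ends by default
same = titles[titles['year'].between(1990, 1999)]

print(movies90)
```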
Sorting
• Filtering keeps the rows in their original order, which here is the order of the default index. Use sort_values to order the result by a column instead.

macbeth = titles[ titles['title'] == 'Macbeth'].sort_values('year')
print(macbeth.head())
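The difference between index order and sort_values can be seen on a small made-up frame. The rows below are hypothetical and only stand in for titles.csv.

```python
import pandas as pd

# Hypothetical stand-in for the titles DataFrame
titles = pd.DataFrame({
    'title': ['Macbeth', 'Hamlet', 'Macbeth', 'Macbeth'],
    'year': [1948, 1996, 2015, 1916],
})

macbeth = titles[titles['title'] == 'Macbeth']         # original row order
by_year = macbeth.sort_values('year')                  # oldest first
newest_first = macbeth.sort_values('year', ascending=False)
back_to_index = by_year.sort_index()                   # restore index order

print(macbeth)
print(by_year)
print(newest_first)
```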
Null values
• isnull() returns True for each element that is null (missing) and False otherwise.

c = casts
print(c['n'].isnull().head())

• To display the rows with null values, pass this boolean condition to the DataFrame's indexing operator.

print(c[c['n'].isnull()].head(3))

• df.isna().any() returns a boolean value for each column.
• If there is at least one missing value in that column, the result is True.

print(c.isna().any())

title        False
year         False
name         False
type         False
character    False
n             True
dtype: bool

• df.isna().sum() returns the number of missing values in each column.

print(c.isna().sum())

title            0
year             0
name             0
type             0
character        0
n            28966
dtype: int64
Handling Missing Values
• Drop missing values
• Replace missing values
Drop missing values
• We can drop rows or columns with missing values using the dropna() function. The how parameter sets the condition for dropping.
• how='any': drop if there is any missing value
• how='all': drop if all values are missing
• The thresh parameter sets the minimum number of non-missing values a row/column must have in order to be kept.
• Because c = casts refers to the same object, inplace=True below also modifies casts.

c.dropna(axis=0, inplace=True)
print(c.head())
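The three options can be compared on a small made-up frame. This is a sketch with hypothetical data, chosen so that each option keeps a different set of rows.

```python
import numpy as np
import pandas as pd

# Non-missing values per row: row 0 has 3, row 1 has 2, row 2 has 0, row 3 has 1
df = pd.DataFrame({
    'a': [1, np.nan, np.nan, 4],
    'b': [1, 2, np.nan, np.nan],
    'c': [1, 2, np.nan, np.nan],
})

any_dropped = df.dropna(how='any')   # keeps only rows with no missing value
all_dropped = df.dropna(how='all')   # drops only rows that are entirely missing
at_least_two = df.dropna(thresh=2)   # keeps rows with at least 2 non-missing values

print(list(any_dropped.index))
print(list(all_dropped.index))
print(list(at_least_two.index))
```

None of these calls uses inplace=True, so df itself is left unchanged and each result is a new DataFrame.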
Replacing missing values
• The fillna() function of pandas conveniently handles missing values.
• Replace missing values with a scalar: c.fillna(2)
• fillna() can also be used on a particular column: c['n'].fillna(1)
• Missing values can also be replaced with the values before or after them: c.ffill() fills forward and c.bfill() fills backward.
• Older code writes this as c.fillna(axis=0, method='ffill'); the method parameter is deprecated since pandas 2.1.
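A minimal sketch of the replacement options on a made-up Series, using ffill() and bfill() for the forward and backward fills:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

with_scalar = s.fillna(0)   # every NaN becomes 0
forward = s.ffill()         # carry the previous value forward
backward = s.bfill()        # pull the next value backward

print(with_scalar.tolist())
print(forward.tolist())
print(backward.tolist())
```

Each call returns a new Series; s keeps its missing values unless the result is assigned back.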
String operations
• Various string operations can be performed through the .str accessor.

h = titles[titles['title'].str.startswith("Maa ")].head(3)
print(h)

                  title  year
19     Maa Durga Shakti  1999
3046      Maa Aur Mamta  1970
7470  Maa Vaibhav Laxmi  1989

• The number of occurrences of each value can be counted using value_counts().
• The following code displays the number of movies per year.

titles['year'].value_counts().head()
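Both operations can be tried on a small in-memory frame. The first two titles are taken from the output above; the other rows are made up for illustration.

```python
import pandas as pd

titles = pd.DataFrame({
    'title': ['Maa Durga Shakti', 'Maa Aur Mamta', 'Hamlet', 'Maat'],
    'year': [1999, 1970, 1999, 1999],
})

# .str applies a string method to every element and returns a boolean Series.
# The trailing space in 'Maa ' excludes titles such as 'Maat'.
maa = titles[titles['title'].str.startswith('Maa ')]

# value_counts() returns counts sorted from most to least frequent
per_year = titles['year'].value_counts()

print(maa)
print(per_year)
```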
Plots
import matplotlib.pyplot as plt
t = titles
p = t['year'].value_counts()
p.sort_index().plot()
plt.show()
Grouping
• groupby with column names:

c = casts
cg = c.groupby(['year']).size()
cg.plot()
plt.show()

groupby can take multiple keys for grouping
c = casts
cf = c[c['name'] == 'Aaron Abrams']
ct = cf.groupby(['year', 'title']).size().head()
print(ct)

year  title
2003  The In-Laws                  1
2004  Resident Evil: Apocalypse    1
      Siblings                     1
2005  Cinderella Man               1
      Sabah                        1
dtype: int64

• To find the maximum rating (column n) in each year:
c.groupby(['year']).n.max().head()
• To check the mean rating each year:
c.groupby(['year']).n.mean().head()
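The grouping calls can be checked on a small made-up frame. The data below is hypothetical and only imitates the year, title, and n columns of casts.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the casts DataFrame
casts = pd.DataFrame({
    'year': [2003, 2004, 2004, 2005],
    'title': ['Film A', 'Film B', 'Film C', 'Film D'],
    'n': [1.0, 3.0, 5.0, np.nan],
})

per_year = casts.groupby('year').size()        # rows per year, NaN rows included
n_max = casts.groupby('year')['n'].max()       # NaN is skipped within each group
n_mean = casts.groupby('year')['n'].mean()

# agg() computes several statistics in one pass
summary = casts.groupby('year')['n'].agg(['max', 'mean'])

print(per_year)
print(summary)
```

A group whose values are all missing, such as 2005 here, gets NaN for both statistics.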
Unstack
• We want to compare and plot the total number of actors and actresses in each decade.
• To do this, we group the data by 'type' and by decade.

c = casts
c_decade = c.groupby( ['type', c['year']//10*10] ).size()
print(c_decade)
type year
actor 1910 340
1920 590
1930 1364
1940 1253
1950 1490
1960 1879
1970 2191
1980 2874
1990 4051
2000 6787
2010 7259
actress 1910 267
1920 353
1930 511
1940 528
1950 665
1960 702
1970 1015
1980 1686
1990 2115
2000 3872
2010 4243
dtype: int64
us = c_decade.unstack()
print(us)

year     1910  1920  1930  1940  1950  1960  1970  1980  1990  2000  2010
type
actor     340   590  1364  1253  1490  1879  2191  2874  4051  6787  7259
actress   267   353   511   528   665   702  1015  1686  2115  3872  4243

us.plot(kind='bar')
plt.show()
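The group-then-unstack pattern can be followed step by step on a few made-up rows. In this sketch some type/decade combinations do not occur, so fill_value=0 is passed to unstack to show a count of zero instead of NaN.

```python
import pandas as pd

# Hypothetical stand-in for the casts DataFrame
casts = pd.DataFrame({
    'type': ['actor', 'actor', 'actress', 'actor', 'actress'],
    'year': [1991, 1999, 1995, 2003, 2011],
})

# Integer-divide by 10 and multiply back to map each year to its decade
decade = casts['year'] // 10 * 10

# Result is a Series with a two-level index: (type, decade)
counts = casts.groupby(['type', decade]).size()

# unstack() moves the innermost index level (decade) into the columns
table = counts.unstack(fill_value=0)

print(counts)
print(table)
```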
Time series
• A sequence of timestamps can be generated using the date_range function.
• periods is the total number of samples.
• freq='M' means that one timestamp is generated per month.
• By default, pandas treats 'M' as the end of the month.
• Use 'MS' for the start of the month.
• Note: pandas 2.2 and later deprecate 'M' for month end in favor of 'ME'.

rng = pd.date_range('2011-03-01 10:15', periods = 10, freq = 'M')
print(rng)

DatetimeIndex(['2011-03-31 10:15:00', '2011-04-30 10:15:00',
'2011-05-31 10:15:00', '2011-06-30 10:15:00',
'2011-07-31 10:15:00', '2011-08-31 10:15:00',
'2011-09-30 10:15:00', '2011-10-31 10:15:00',
'2011-11-30 10:15:00', '2011-12-31 10:15:00'],
dtype='datetime64[ns]', freq='M')

• pandas also supports time zone representation, converting to another time zone, and converting between time span representations.
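The sketch below illustrates the 'MS' alias together with the time zone and time span conversions mentioned above. The zone 'Asia/Kolkata' is an arbitrary choice for illustration, and the example assumes time zone data is available on the system.

```python
import pandas as pd

# 'MS' anchors each timestamp at the start of the month
rng = pd.date_range('2011-03-01 10:15', periods=3, freq='MS')

# Attach a time zone to naive timestamps, then convert to another zone
utc = rng.tz_localize('UTC')
kolkata = utc.tz_convert('Asia/Kolkata')   # UTC+05:30

# Convert timestamps (points in time) to monthly periods (time spans)
months = rng.to_period('M')

print(rng)
print(kolkata)
print(months)
```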
Categoricals
• Similar to categorical variables used in statistics.
• Practical for saving memory and sorting data.
• Examples are gender, social class, blood type, country affiliation.

s = pd.Series(["a","b","c","a"], dtype="category")
print(s)

0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): [a, b, c]
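The memory and sorting points can be illustrated with a short sketch. The size labels and their order are made up for this example; an ordered categorical sorts by the declared category order rather than alphabetically.

```python
import pandas as pd

sizes = pd.Series(['medium', 'small', 'large', 'small'])

# Declare the categories and their logical order
size_type = pd.CategoricalDtype(['small', 'medium', 'large'], ordered=True)
cat = sizes.astype(size_type)

sorted_sizes = cat.sort_values().tolist()   # follows the declared order
codes = cat.cat.codes.tolist()              # each value is stored as a small integer

# With many repeated strings, the categorical version uses less memory
big = pd.Series(['a', 'b', 'c', 'a'] * 1000)
plain_bytes = big.memory_usage(deep=True)
category_bytes = big.astype('category').memory_usage(deep=True)

print(sorted_sizes)
print(codes)
print(plain_bytes, category_bytes)
```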
References
• https://pandas.pydata.org/pandas-docs/stable/user_guide/
• GitHub awesome-pandas
• Pandas Guide by Meher Krishna Patel
• Manipulating and analyzing data with pandas by Céline Comte
