Pandas Cheat Sheet
Shwetank Singh
GritSetGrow - GSGLearn.com
gsglearn.com
Cheat Sheet: The pandas DataFrame Object
Start by importing these Python modules
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from pandas import DataFrame, Series
Note: these are the recommended import aliases

Load a DataFrame from a CSV file
df = pd.read_csv('file.csv') # often works
df = pd.read_csv('file.csv', header=0,
    index_col=0, quotechar='"', sep=':',
    na_values=['na', '-', '.', ''])
Note: refer to pandas docs for all arguments
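As a quick self-contained check of the arguments above, the same call can be tested against an in-memory string: io.StringIO stands in for a file on disk, and the column names, separator and values here are invented for illustration.

```python
import io
import pandas as pd

# colon-separated text standing in for 'file.csv'
csv_text = "idx:a:b\nr1:1:2\nr2:3:na\n"
df = pd.read_csv(io.StringIO(csv_text), header=0,
                 index_col=0, sep=':', na_values=['na'])
print(df.shape)           # (2, 2)
print(df.loc['r2', 'a'])  # 3
```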
Conceptually, a DataFrame is a two-dimensional table of data: a collection of Series (the columns of data) that share a common row index (df.index) and a column index (df.columns).

Load DataFrames from a Microsoft Excel file
# Each Excel sheet in a Python dictionary
workbook = pd.ExcelFile('file.xlsx')
dictionary = {}
for sheet_name in workbook.sheet_names:
    df = workbook.parse(sheet_name)
    dictionary[sheet_name] = df
Note: the parse() method takes many arguments like
read_csv() above. Refer to the pandas documentation.
Series object: an ordered, one-dimensional array of
data with an index. All the data in a Series is of the
same data type. Series arithmetic is vectorised after first
aligning the Series index for each of the operands.
s1 = Series(range(0,4)) # -> 0, 1, 2, 3
s2 = Series(range(1,5)) # -> 1, 2, 3, 4
s3 = s1 + s2 # -> 1, 3, 5, 7
s4 = Series(['a','b'])*3 # -> 'aaa','bbb'

Load a DataFrame from a MySQL database
import pymysql
from sqlalchemy import create_engine
engine = create_engine('mysql+pymysql://'
    + 'USER:PASSWORD@localhost/DATABASE')
df = pd.read_sql_table('table', engine)
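A minimal sketch of that index alignment, with made-up labels: where the operands' indexes do not overlap, the result is NaN.

```python
import pandas as pd

s1 = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
s2 = pd.Series([1, 2, 3], index=['b', 'c', 'd'])
s3 = s1 + s2  # aligned on the union of the two indexes
print(s3['b'])          # 21.0
print(s3.isna().sum())  # 2 (labels 'a' and 'd' have no match)
```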
Data in Series then combine into a DataFrame
# Example 1 ...
s1 = Series(range(6))
s2 = s1 * s1
s2.index = s2.index + 2 # misalign indexes
df = pd.concat([s1, s2], axis=1)

# Example 2 ...
s3 = Series({'Tom':1, 'Dick':4, 'Har':9})
s4 = Series({'Tom':3, 'Dick':2, 'Mar':5})
df = pd.concat({'A':s3, 'B':s4}, axis=1)
Note: 1st method has integer column labels
Note: 2nd method does not guarantee col order
Note: index alignment on DataFrame creation

The index object: The pandas Index provides the axis labels
for the Series and DataFrame objects. It can only contain
hashable objects. A pandas Series has one Index; and a
DataFrame has two Indexes.
# --- get Index from Series and DataFrame
idx = s.index # the Series index
idx = df.columns # the column index
idx = df.index # the row index

# --- some Index attributes
b = idx.is_monotonic_decreasing
b = idx.is_monotonic_increasing
b = idx.has_duplicates
i = idx.nlevels # multi-level indexes

# --- some Index methods
a = idx.values # get as numpy array
l = idx.tolist() # get as a python list
idx = idx.astype(dtype) # change data type
b = idx.equals(o) # check for equality
idx = idx.union(o) # union of two indexes
i = idx.nunique() # number unique labels
label = idx.min() # minimum label
label = idx.max() # maximum label

Get a DataFrame from data in a Python dictionary
# default --- assume data is in columns
df = DataFrame({
    'col0' : [1.0, 2.0, 3.0, 4.0],
    'col1' : [100, 200, 300, 400]
})
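The index-alignment note above can be verified with the Example 1 data: the combined DataFrame gets the union of the two row indexes, with NaN where one Series has no value (a sketch, not part of the original sheet).

```python
import pandas as pd

s1 = pd.Series(range(6))   # index 0..5
s2 = s1 * s1
s2.index = s2.index + 2    # misalign: index 2..7
df = pd.concat([s1, s2], axis=1)
print(df.shape)            # (8, 2) - union of indexes 0..7
```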
Version 2 May 2015 - [Draft – Mark Graph – mark dot the dot graph at gmail dot com – @Mark_Graph on twitter]
Get a DataFrame from data in a Python dictionary
# --- use helper method for data in rows
df = DataFrame.from_dict({ # data by row
    'row0' : {'col0':0, 'col1':'A'},
    'row1' : {'col0':1, 'col1':'B'}
}, orient='index')

df = DataFrame.from_dict({ # data by row
    'row0' : [1, 1+1j, 'A'],
    'row1' : [2, 2+2j, 'B']
}, orient='index')

Working with the whole DataFrame

Peek at the DataFrame contents
df.info() # index & data types
n = 4
dfh = df.head(n) # get first n rows
dft = df.tail(n) # get last n rows
dfs = df.describe() # summary stats cols
top_left_corner_df = df.iloc[:5, :5]
DataFrame non-indexing attributes
dfT = df.T # transpose rows and cols
l = df.axes # list row and col indexes
(r, c) = df.axes # from above
s = df.dtypes # Series column data types
b = df.empty # True for empty DataFrame
i = df.ndim # number of axes (2)
t = df.shape # (row-count, column-count)
(r, c) = df.shape # from above
i = df.size # row-count * column-count
a = df.values # get a numpy array for df

Create play/fake data (useful for testing)
# --- simple
df = DataFrame(np.random.rand(50,5))

# --- with a time-stamp row index:
df = DataFrame(np.random.rand(500,5))
df.index = pd.date_range('1/1/2006',
    periods=len(df), freq='M')

# --- with alphabetic row and col indexes
import string
import random
r = 52 # note: min r is 1; max r is 52
c = 5
df = DataFrame(np.random.randn(r, c),
    columns = ['col'+str(i) for i in range(c)],
    index = list((string.ascii_uppercase +
        string.ascii_lowercase)[0:r]))
df['group'] = list(''.join(
    random.choice('abcd')
    for _ in range(r)))

DataFrame utility methods
dfc = df.copy() # copy a DataFrame
dfr = df.rank() # rank each col (default)
dfs = df.sort() # sort each col (default)
dfc = df.astype(dtype) # type conversion

DataFrame iteration methods
df.iteritems() # (col-index, Series) pairs
df.iterrows() # (row-index, Series) pairs

# example ... iterating over columns
for (name, series) in df.iteritems():
    print('Col name: ' + str(name))
    print('First value: ' +
        str(series.iat[0]) + '\n')
Saving a DataFrame

Saving a DataFrame to a CSV file
df.to_csv('name.csv', encoding='utf-8')

Saving DataFrames to an Excel Workbook
from pandas import ExcelWriter
writer = ExcelWriter('filename.xlsx')
df1.to_excel(writer, 'Sheet1')
df2.to_excel(writer, 'Sheet2')
writer.save()

Saving a DataFrame to MySQL
import pymysql
from sqlalchemy import create_engine
e = create_engine('mysql+pymysql://' +
    'USER:PASSWORD@localhost/DATABASE')
df.to_sql('TABLE', e, if_exists='replace')
Note: if_exists -> 'fail', 'replace', 'append'

Saving a DataFrame to a Python dictionary
dictionary = df.to_dict()

Saving a DataFrame to a Python string
string = df.to_string()
Note: sometimes may be useful for debugging

Maths on the whole DataFrame (not a complete list)
df = df.abs() # absolute values
df = df.add(o) # add df, Series or value
s = df.count() # non NA/null values
df = df.cummax() # (cols default axis)
df = df.cummin() # (cols default axis)
df = df.cumsum() # (cols default axis)
df = df.cumprod() # (cols default axis)
df = df.diff() # 1st diff (col def axis)
df = df.div(o) # div by df, Series, value
df = df.dot(o) # matrix dot product
s = df.max() # max of axis (col def)
s = df.mean() # mean (col default axis)
s = df.median() # median (col default)
s = df.min() # min of axis (col def)
df = df.mul(o) # mul by df Series val
s = df.sum() # sum axis (cols default)
Note: The methods that return a series default to
working on columns.

DataFrame filter/select rows or cols on label info
df = df.filter(items=['a', 'b']) # by col
df = df.filter(items=[5], axis=0) # by row
df = df.filter(like='x') # keep x in col
df = df.filter(regex='x') # regex in col
df = df.select(crit=(lambda x: not x%5)) # rows
Note: select takes a Boolean function, for cols: axis=1
Note: filter defaults to cols; select defaults to rows
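The column-default behaviour noted above is easy to confirm on toy data (column names here are invented):

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3], 'y': [10, 20, 30]})
col_sums = df.sum()        # default: one value per column
row_sums = df.sum(axis=1)  # one value per row
print(col_sums['y'])       # 60
print(row_sums[0])         # 11
```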
Working with Columns

A DataFrame column is a pandas Series object

Columns value set based on criteria
df['b'] = df['a'].where(df['a']>0, other=0)
df['d'] = df['a'].where(df.b!=0, other=df.c)
Note: where other can be a Series or a scalar
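A small sketch of where() with made-up data: values satisfying the condition are kept, the rest are replaced by other.

```python
import pandas as pd

df = pd.DataFrame({'a': [-2, -1, 3, 4]})
df['b'] = df['a'].where(df['a'] > 0, other=0)
print(df['b'].tolist())   # [0, 0, 3, 4]
```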
Working with rows

Get the row index and labels
idx = df.index # get row index
label = df.index[0] # 1st row label
lst = df.index.tolist() # get as a list

Change the (row) index
df.index = idx # new ad hoc index
df.index = range(len(df)) # set with list
df = df.reset_index() # replace old w new
# note: old index stored as a col in df
df = df.reindex(index=range(len(df)))
df = df.set_index(keys=['r1','r2','etc'])
df.rename(index={'old':'new'},
    inplace=True)

Adding rows
df = original_df.append(more_rows_in_df)
Hint: convert to a DataFrame and then append. Both
DataFrames should have same column labels.

Dropping rows (by name)
df = df.drop('row_label')
df = df.drop(['row1','row2']) # multi-row

Boolean row selection by values in a column
df = df[df['col2'] >= 0.0]
df = df[(df['col3']>=1.0) |
    (df['col1']<0.0)]
df = df[df['col'].isin([1,2,5,7,11])]
df = df[~df['col'].isin([1,2,5,7,11])]
df = df[df['col'].str.contains('hello')]
Trap: bitwise "or", "and", "not" (ie. | & ~) co-opted to be
Boolean operators on a Series of Boolean
Trap: need parentheses around comparisons.

Selecting rows using isin over multiple columns
# fake up some data
data = {1:[1,2,3], 2:[1,4,9], 3:[1,8,27]}
df = pd.DataFrame(data)

# multi-column isin
lf = {1:[1, 3], 3:[8, 27]} # look for
f = df[df[list(lf)].isin(lf).all(axis=1)]

Selecting rows using an index
idx = df[df['col'] >= 2].index
print(df.ix[idx])

Select a slice of rows by integer position
[inclusive-from : exclusive-to [: step]]
default start is 0; default end is len(df)
df = df[:] # copy DataFrame
df = df[0:2] # rows 0 and 1
df = df[-1:] # the last row
df = df[2:3] # row 2 (the third row)
df = df[:-1] # all but the last row
df = df[::2] # every 2nd row (0 2 ..)
Trap: a single integer without a colon is a column label
for integer numbered columns.

Select a slice of rows by label/index
[inclusive-from : inclusive-to [ : step]]
df = df['a':'c'] # rows 'a' through 'c'
Trap: doesn't work on integer labelled rows

Append a row of column totals to a DataFrame
# Option 1: use dictionary comprehension
sums = {col: df[col].sum() for col in df}
sums_df = DataFrame(sums, index=['Total'])
df = df.append(sums_df)

# Option 2: All done with pandas
df = df.append(DataFrame(df.sum(),
    columns=['Total']).T)

Iterating over DataFrame rows
for (index, row) in df.iterrows(): # pass
Trap: row data type may be coerced.

Sorting DataFrame rows values
df = df.sort(df.columns[0],
    ascending=False)
df.sort(['col1', 'col2'], inplace=True)

Random selection of rows
import random as r
k = 20 # pick a number
selection = r.sample(range(len(df)), k)
df_sample = df.iloc[selection, :]
Note: this sample is not sorted

Sort DataFrame by its row index
df.sort_index(inplace=True) # sort by row
df = df.sort_index(ascending=False)

Drop duplicates in the row index
df['index'] = df.index # 1 create new col
df = df.drop_duplicates(cols='index',
    take_last=True) # 2 use new col
del df['index'] # 3 del the col
df.sort_index(inplace=True) # 4 tidy up

Test if two DataFrames have same row index
len(a)==len(b) and all(a.index==b.index)

Get the integer position of a row or col index label
i = df.index.get_loc('row_label')
Trap: index.get_loc() returns an integer for a unique
match. If not a unique match, may return a slice or
mask.

Get integer position of rows that meet condition
a = np.where(df['col'] >= 2) # numpy array

Test if the row index values are unique/monotonic
if df.index.is_unique: pass # ...
b = df.index.is_monotonic_increasing
b = df.index.is_monotonic_decreasing
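The parentheses trap above is worth a concrete check (data invented): each comparison must be wrapped before combining with | or &, because the bitwise operators bind more tightly than the comparison operators.

```python
import pandas as pd

df = pd.DataFrame({'col1': [-1.0, 0.5, 2.0],
                   'col3': [0.5, 1.5, 0.2]})
# parentheses required around each comparison:
sel = df[(df['col3'] >= 1.0) | (df['col1'] < 0.0)]
print(len(sel))   # 2 (the first two rows qualify)
```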
Working with cells

Selecting a cell by row and column labels
value = df.at['row', 'col']
value = df.loc['row', 'col']
value = df['col'].at['row'] # tricky
Note: .at[] fastest label based scalar lookup

Setting a cell by row and column labels
df.at['row', 'col'] = value
df.loc['row', 'col'] = value
df['col'].at['row'] = value # tricky

Selecting and slicing on labels
df = df.loc['row1':'row3', 'col1':'col3']
Note: the "to" on this slice is inclusive.

Setting a cross-section by labels
df.loc['A':'C', 'col1':'col3'] = np.nan
df.loc[1:2,'col1':'col2'] = np.zeros((2,2))
df.loc[1:2,'A':'C'] = othr.loc[1:2,'A':'C']
Remember: inclusive "to" in the slice

Selecting a cell by integer position
value = df.iat[9, 3] # [row, col]
value = df.iloc[0, 0] # [row, col]
value = df.iloc[len(df)-1,
    len(df.columns)-1]

Selecting a range of cells by int position
df = df.iloc[2:4, 2:4] # subset of the df
df = df.iloc[:5, :5] # top left corner
s = df.iloc[5, :] # returns row as Series
df = df.iloc[5:6, :] # returns row as row
Note: exclusive "to" – same as python list slicing.

Setting cell by integer position
df.iloc[0, 0] = value # [row, col]
df.iat[7, 8] = value # [row, col]

In summary: indexes and addresses
In the main, these notes focus on the simple, single
level Indexes. Pandas also has hierarchical or
multi-level Indexes (aka the MultiIndex).
• Typically, the column index (df.columns) is a list of
strings (observed variable names) or (less
commonly) integers (the default is numbered from 0
to length-1)
• Typically, the row index (df.index) might be:
  o Integers - for case or row numbers (default is
    numbered from 0 to length-1);
  o Strings – for case names; or
  o DatetimeIndex or PeriodIndex – for time series
    data (more below)

Indexing
# --- selecting columns
s = df['col_label'] # scalar
df = df[['col_label']] # one item list
df = df[['L1', 'L2']] # many item list
df = df[index] # pandas Index
df = df[s] # pandas Series

# --- selecting rows
df = df['from':'inc_to'] # label slice
df = df[3:7] # integer slice
df = df[df['col'] > 0.5] # Boolean Series

df = df.loc['label'] # single label
df = df.loc[container] # lab list/Series
df = df.loc['from':'to'] # inclusive slice
df = df.loc[bs] # Boolean Series
df = df.iloc[0] # single integer
df = df.iloc[container] # int list/Series
df = df.iloc[0:5] # exclusive slice
df = df.ix[x] # loc then iloc
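The inclusive-label versus exclusive-integer slicing rules can be contrasted directly (toy data):

```python
import pandas as pd

df = pd.DataFrame({'v': [10, 20, 30, 40]},
                  index=['a', 'b', 'c', 'd'])
by_label = df.loc['a':'c']  # inclusive "to": rows a, b, c
by_pos = df.iloc[0:2]       # exclusive "to": rows a, b
print(len(by_label), len(by_pos))   # 3 2
```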
Joining/Combining DataFrames

Three ways to join two DataFrames:
• merge (a database/SQL-like join operation)
• concat (stack side by side or one on top of the other)
• combine_first (splice the two together, choosing
values from one over the other)

Merge on indexes
df_new = pd.merge(left=df1, right=df2,
    how='outer', left_index=True,
    right_index=True)
How: 'left', 'right', 'outer', 'inner'
How: outer=union/all; inner=intersection

Merge on columns
df_new = pd.merge(left=df1, right=df2,
    how='left', left_on='col1',
    right_on='col2')
Trap: When joining on columns, the indexes on the
passed DataFrames are ignored.
Trap: many-to-many merges on a column can result in
an explosion of associated data.

Join on indexes (another way of merging)
df_new = df1.join(other=df2, on='col1',
    how='outer')
df_new = df1.join(other=df2, on=['a','b'],
    how='outer')
Note: DataFrame.join() joins on indexes by default.
DataFrame.merge() joins on common columns by default.

Simple concatenation is often the best
df = pd.concat([df1,df2], axis=0) # top/bottom
df = df1.append([df2, df3]) # top/bottom
df = pd.concat([df1,df2], axis=1) # left/right
Trap: can end up with duplicate rows or cols
Note: concat has an ignore_index parameter

Combine_first
df = df1.combine_first(other=df2)

Groupby: Split-Apply-Combine
The pandas "groupby" mechanism allows us to split the
data into groups, apply a function to each group
independently and then combine the results.

Grouping
gb = df.groupby('cat') # by one column
gb = df.groupby(['c1','c2']) # by 2 cols
gb = df.groupby(level=0) # multi-index gb
gb = df.groupby(level=['a','b']) # mi gb
print(gb.groups)
Note: groupby() returns a pandas groupby object
Note: the groupby object attribute .groups contains a
dictionary mapping of the groups.
Trap: NaN values in the group key are automatically
dropped – there will never be a NA group.

Iterating groups – usually not needed
for name, group in gb:
    print(name)
    print(group)

Selecting a group
dfa = df.groupby('cat').get_group('a')
dfb = df.groupby('cat').get_group('b')

Applying an aggregating function
# apply to a column ...
s = df.groupby('cat')['col1'].sum()
s = df.groupby('cat')['col1'].agg(np.sum)
# apply to every column in DataFrame
s = df.groupby('cat').agg(np.sum)
df_summary = df.groupby('cat').describe()
df_row_1s = df.groupby('cat').head(1)
Note: aggregating functions reduce the dimension by
one – they include: mean, sum, size, count, std, var,
sem, describe, first, last, min, max

Applying multiple aggregating functions
gb = df.groupby('cat')
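One way to apply several aggregating functions at once is agg() with a list of function names (a sketch with invented data; other forms, such as a dict of column-to-function mappings, also exist):

```python
import pandas as pd

df = pd.DataFrame({'cat': ['a', 'a', 'b'],
                   'col1': [1, 3, 10]})
gb = df.groupby('cat')
summary = gb['col1'].agg(['sum', 'mean'])
print(summary.loc['a', 'sum'])    # 4
print(summary.loc['b', 'mean'])   # 10.0
```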
Applying filtering functions
Filtering functions allow you to make selections based on
whether each group meets specified criteria
# select groups with more than 10 members
eleven = lambda x: (len(x['col1']) >= 11)
df11 = df.groupby('cat').filter(eleven)

Group by a row index (non-hierarchical index)
df = df.set_index(keys='cat')
s = df.groupby(level=0)['col1'].sum()
dfg = df.groupby(level=0).sum()

Pivot Tables

Pivot
Pivot tables move from long format to wide format data
df = DataFrame(np.random.rand(100,1))
df.columns = ['data'] # rename col
df.index = pd.period_range('3/3/2014',
    periods=len(df), freq='M')
df['year'] = df.index.year
df['month'] = df.index.month

# pivot to wide format
df = df.pivot(index='year',
    columns='month', values='data')

Working with dates, times and their indexes

Dates and time – points and spans
With its focus on time-series data, pandas has a suite of
tools for managing dates and time: either as a point in
time (a Timestamp) or as a span of time (a Period).
t = pd.Timestamp('2013-01-01')
t = pd.Timestamp('2013-01-01 21:15:06')
t = pd.Timestamp('2013-01-01 21:15:06.7')
p = pd.Period('2013-01-01', freq='M')
Note: Timestamps should be in the range of years 1678
to 2261. (Check Timestamp.max and Timestamp.min).

A Series of Timestamps or Periods
ts = ['2015-04-01 13:17:27',
      '2014-04-02 13:17:29']

# Series of Timestamps (good)
s = pd.to_datetime(pd.Series(ts))

# Series of Periods (often not so good)
s = pd.Series([pd.Period(x, freq='M')
    for x in ts])
s = pd.Series(pd.PeriodIndex(ts, freq='S'))
Note: While Periods make a very useful index; they may
be less useful in a Series.

# --- other ways to create date/period indexes
dti = pd.DatetimeIndex(date_strs)
df.index = pd.period_range('2015-01',
    periods=len(df), freq='M')
dti = pd.to_datetime(['04-01-2012'],
    dayfirst=True) # Australian date format
pi = pd.period_range('1960-01-01',
    '2015-12-31', freq='M')
Hint: unless you are working in less than seconds,
prefer PeriodIndex over DatetimeIndex.
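The point-versus-span distinction can be made concrete: a Period carries a start and an end, while a Timestamp is a single instant (the dates here are arbitrary).

```python
import pandas as pd

t = pd.Timestamp('2013-01-15 21:15:06')  # a point in time
p = pd.Period('2013-01-15', freq='M')    # the span 2013-01
print(t.day)              # 15
print(p.start_time.day)   # 1
print(p.end_time.day)     # 31
```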
Period frequency constants (not a complete list)
Name              Description
U                 Microsecond
L                 Millisecond
S                 Second
T                 Minute
H                 Hour
D                 Calendar day
B                 Business day
W-{MON, TUE, …}   Week ending on …
MS                Calendar start of month
M                 Calendar end of month
QS-{JAN, FEB, …}  Quarter start with year starting
                  (QS – December)
Q-{JAN, FEB, …}   Quarter end with year ending
                  (Q – December)
AS-{JAN, FEB, …}  Year start (AS – December)
A-{JAN, FEB, …}   Year end (A – December)

Upsampling and downsampling
# upsample from quarterly to monthly
pi = pd.period_range('1960Q1',
    periods=220, freq='Q')
df = DataFrame(np.random.rand(len(pi),5),
    index=pi)
dfm = df.resample('M', convention='end')
# use ffill or bfill to fill with values

# downsample from monthly to quarterly
dfq = dfm.resample('Q', how='sum')

Time zones
t = ['2015-06-30 00:00:00',
     '2015-12-31 00:00:00']
dti = pd.to_datetime(t
    ).tz_localize('Australia/Canberra')
dti = dti.tz_convert('UTC')
ts = pd.Timestamp('now',
    tz='Europe/London')

# get a list of all time zones
import pytz
for tz in pytz.all_timezones:
    print(tz)
Note: by default, Timestamps are created without time
zone information.

From DatetimeIndex to Python datetime objects
dti = pd.DatetimeIndex(pd.date_range(
    start='1/1/2011', periods=4, freq='M'))
s = Series([1,2,3,4], index=dti)
na = dti.to_pydatetime() # numpy array
na = s.index.to_pydatetime() # numpy array

From Timestamps to Python dates or times
df['date'] = [x.date() for x in df['TS']]
df['time'] = [x.time() for x in df['TS']]
Note: converts to datetime.date or datetime.time. But
does not convert to datetime.datetime.

Row selection with a time-series index
# start with the play data above
idx = pd.period_range('2015-01',
    periods=len(df), freq='M')
df.index = idx
february_selector = (df.index.month == 2)
february_data = df[february_selector]

From DatetimeIndex to PeriodIndex and back
df = DataFrame(np.random.randn(20,3))
df.index = pd.date_range('2015-01-01',
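A runnable sketch of the localise-then-convert pattern above; the expected hour assumes Canberra is UTC+10 at the end of June (outside daylight saving).

```python
import pandas as pd

dti = pd.to_datetime(['2015-06-30 00:00:00'])
print(dti.tz)   # None - naive by default
aware = dti.tz_localize('Australia/Canberra')
utc = aware.tz_convert('UTC')
print(utc[0])   # 2015-06-29 14:00:00+00:00
```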
Working with strings

Basic Statistics

Histogram binning
count, bins = np.histogram(df['col1'])
count, bins = np.histogram(df['col'],
    bins=5)
count, bins = np.histogram(df['col1'],
    bins=[-3,-2,-1,0,1,2,3,4])

Regression
import statsmodels.formula.api as sm
result = sm.ols(
    formula="col1 ~ col2 + col3",
    data=df).fit()
print(result.params)
print(result.summary())
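With explicit bin edges, np.histogram counts values per interval; a small worked case (the numbers are invented, and note the last bin is closed on the right):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [0.1, 0.4, 0.5, 0.9, 1.5]})
count, bins = np.histogram(df['col1'],
                           bins=[0, 0.5, 1.0, 2.0])
print(count.tolist())   # [2, 2, 1]
```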
Cautionary note