Pandas DataFrame Notes
[Diagram: a DataFrame is a two-dimensional table in which each column is a Series of data]
df = workbook.parse(sheet_name)
d[sheet_name] = df
Note: the parse() method takes many arguments like
read_csv() above. Refer to the pandas documentation.
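Note: the two lines above are the body of a loop over an Excel workbook. A minimal sketch of the whole pattern (the file name 'data.xlsx' is a placeholder):
import pandas as pd
workbook = pd.ExcelFile('data.xlsx')
d = {} # dictionary of DataFrames, keyed by sheet name
for sheet_name in workbook.sheet_names:
    df = workbook.parse(sheet_name, header=0)
    d[sheet_name] = df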
DataFrame from row data in a Python dictionary
# --- use helper method for data in rows
df = DataFrame.from_dict({ # data by row:
    # rows as python dictionaries
    'row0' : {'col0':0, 'col1':'A'},
    'row1' : {'col0':1, 'col1':'B'}
}, orient='index')

df = DataFrame.from_dict({ # data by row:
    # rows as python lists
    'row0' : [1, 1+1j, 'A'],
    'row1' : [2, 2+2j, 'B']
}, orient='index')

DataFrame of fake data – useful for testing
df = DataFrame(np.random.rand(500, 5),
    columns=list('ABCDE'))

DataFrame of fake time-series data
df = DataFrame(np.random.rand(500, 5)) - 0.5
df = df.cumsum()
df.index = pd.date_range('1/1/2017',
    periods=len(df), freq='D')

Fake data with alphabetic index and group variable
import string
import random
rows = 52
cols = 5
assert(1 <= rows <= 52) # min/max row count
df = DataFrame(np.random.randn(rows, cols),
    columns=['c' + str(i) for i in range(cols)],
    index=list((string.ascii_uppercase +
        string.ascii_lowercase)[0:rows]))
df['groupable'] = [random.choice('abcde')
    for _ in range(rows)]

Working with the whole DataFrame

Peek at the DataFrame contents/structure
df.info() # index & data types
dfh = df.head(i) # get first i rows
dft = df.tail(i) # get last i rows
dfs = df.describe() # summary stats for cols
top_left_corner_df = df.iloc[:4, :4]

DataFrame non-indexing attributes
df = df.T # transpose rows and cols
l = df.axes # list of row and col indexes
(r_idx, c_idx) = df.axes # from above
s = df.dtypes # Series of column data types
b = df.empty # True for an empty DataFrame
i = df.ndim # number of axes (it is 2)
t = df.shape # (row-count, column-count)
i = df.size # row-count * column-count
a = df.values # get a numpy array for df

DataFrame utility methods
df = df.copy() # copy a DataFrame
df = df.sort_values(by='col')
df = df.sort_values(by=['col1', 'col2'])
df = df.sort_values(by='row', axis=1)
df = df.sort_index() # axis=1 to sort cols
df = df.astype(dtype) # type conversion

DataFrame iteration methods
df.iteritems() # (col-index, Series) pairs
df.iterrows() # (row-index, Series) pairs
# example ... iterating over columns ...
for (name, series) in df.iteritems():
    print('\nCol name: ' + str(name))
    print('1st value: ' + str(series.iat[0]))
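Note: pandas also provides itertuples(), usually the fastest row iterator. A small sketch (the column name c0 assumes the fake data built above):
for row in df.itertuples():
    print(row.Index, row.c0) # namedtuple attribute access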
Multiply every column in DataFrame by a Series
df = df.mul(s, axis=0) # on matched rows
Note: also add, sub, div, etc.

Selecting columns with .loc, .iloc
df = df.loc[:, 'col1':'col2'] # inclusive
df = df.iloc[:, 0:2] # exclusive

Get the integer position of a column index label
i = df.columns.get_loc('col_name')

Test if column index values are unique/monotonic
if df.columns.is_unique: pass # ...
b = df.columns.is_monotonic_increasing
b = df.columns.is_monotonic_decreasing

Mapping a DataFrame column or Series
mapper = Series(['red', 'green', 'blue'],
    index=['r', 'g', 'b'])
s = Series(['r', 'g', 'r', 'b']).map(mapper)
# s contains: ['red', 'green', 'red', 'blue']

m = Series([True, False], index=['Y', 'N'])
df = DataFrame(np.random.choice(list('YN'),
    500, replace=True), columns=['col'])
df['col'] = df['col'].map(m)
Note: useful for decoding data before plotting
Note: sometimes referred to as a lookup function
Note: indexes can also be mapped if needed.

Find the largest and smallest values in a column
s = df['A'].nlargest(5)
s = df['A'].nsmallest(5)

Sorting the columns of a DataFrame
df = df.sort_index(axis=1, ascending=False)

Working with rows

Get the row index and labels
idx = df.index # get row index
label = df.index[0] # first row label
label = df.index[-1] # last row label
l = df.index.tolist() # get as a list
a = df.index.values # get as an array

Change the (row) index
df.index = idx # new ad hoc index
df = df.set_index('A') # col A new index
df = df.set_index(['A', 'B']) # MultiIndex
df = df.reset_index() # replace old w new
# note: old index stored as a col in df
df.index = range(len(df)) # set with list
df = df.reindex(index=range(len(df)))
df = df.set_index(keys=['r1', 'r2', 'etc'])

Adding rows
df = original_df.append(more_rows_in_df)
Hint: convert the row to a DataFrame and then append.
Both DataFrames must have the same column labels.

Dropping rows (by name)
df = df.drop('row_label')
df = df.drop(['row1', 'row2']) # multi-row

Boolean row selection by values in a column
df = df[df['col2'] >= 0.0]
df = df[(df['col3'] >= 1.0) | (df['col1'] < 0.0)]
df = df[df['col'].isin([1, 2, 5, 7, 11])]
df = df[~df['col'].isin([1, 2, 5, 7, 11])]
df = df[df['col'].str.contains('hello')]
Trap: the bitwise "or", "and" and "not" operators (i.e. |, & and ~) are co-opted to act as Boolean operators on a Series of Booleans.
Trap: you need parentheses around comparisons.

# multi-column isin
lf = {1:[1, 3], 3:[8, 27]} # look for
f = df[df[list(lf)].isin(lf).all(axis=1)]
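A small worked sketch of the multi-column isin above (hypothetical data): a row is kept only when every listed column holds one of its allowed values.
df = DataFrame({1: [1, 2, 3], 2: [4, 5, 6], 3: [8, 9, 27]})
lf = {1: [1, 3], 3: [8, 27]} # col -> allowed values
f = df[df[list(lf)].isin(lf).all(axis=1)] # keeps rows 0 and 2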
Select a slice of rows by label/index
df = df['a':'c'] # rows 'a' through 'c'
Note: [inclusive-from : inclusive-to [ : step]]
Trap: cannot work for integer labelled rows – see the previous code snippet on integer position slicing.

Append a row of column totals to a DataFrame
# Option 1: use a dictionary comprehension
sums = {col: df[col].sum() for col in df}
sums_df = DataFrame(sums, index=['Total'])
df = df.append(sums_df)

# Option 2: all done with pandas
df = df.append(DataFrame(df.sum(),
    columns=['Total']).T)

Iterating over DataFrame rows
for (index, row) in df.iterrows():
    pass
Trap: row data type may be coerced.

Sorting the rows of a DataFrame by the row index
df = df.sort_index(ascending=False)

Sorting DataFrame rows based on column values
df = df.sort_values(by=df.columns[0],
    ascending=False)
df = df.sort_values(by=['col1', 'col2'])

Random selection of rows
import random as r
k = 20 # pick a number
selection = r.sample(range(len(df)), k)
df_sample = df.iloc[selection, :] # get copy
Note: this randomly selected sample is not sorted

Drop duplicates in the row index
df['index'] = df.index # 1 create new col
df = df.drop_duplicates(subset='index',
    keep='last') # 2 use new col
del df['index'] # 3 del the col
df = df.sort_index() # 4 tidy up

Test if two DataFrames have same row index
len(a) == len(b) and all(a.index == b.index)

Get the integer position of a row or col index label
i = df.index.get_loc('row_label')
Trap: index.get_loc() returns an integer for a unique match. If not a unique match, it may return a slice or mask.

Get integer position of rows that meet condition
a = np.where(df['col'] >= 2) # numpy array

Test if the row index values are unique/monotonic
if df.index.is_unique: pass # ...
b = df.index.is_monotonic_increasing
b = df.index.is_monotonic_decreasing

Working with cells

Getting a cell by row and column labels
value = df.at['row', 'col']
value = df.loc['row', 'col']
value = df['col'].at['row'] # tricky
Note: .at[] is the fastest label-based scalar lookup
Note: .at[] does not take slices as an argument

Setting a cell by row and column labels
df.at['row', 'col'] = value
df.loc['row', 'col'] = value
df['col'].at['row'] = value # avoid!

Getting and slicing on labels
df = df.loc['row1':'row3', 'col1':'col3']
Note: the "to" on this slice is inclusive.

Setting a cross-section by labels
df.loc['A':'C', 'col1':'col3'] = np.nan
df.loc[1:2, 'col1':'col2'] = np.zeros((2, 2))
df.loc[1:2, 'A':'C'] = othr.loc[1:2, 'A':'C']
Remember: inclusive "to" in the slice

Getting a cell by integer position
value = df.iat[9, 3] # [row, col]
value = df.iloc[0, 0] # [row, col]
value = df.iloc[len(df)-1, len(df.columns)-1]

Getting a range of cells by int position
df = df.iloc[2:4, 2:4] # subset of the df
df = df.iloc[:5, :5] # top left corner
s = df.iloc[5, :] # return row as a Series
df = df.iloc[5:6, :] # returns row as a row
Note: exclusive "to" – same as python list slicing.

Setting a cell by integer position
df.iloc[0, 0] = value # [row, col]
df.iat[7, 8] = value

Setting a cell range by integer position
df.iloc[0:3, 0:5] = value
df.iloc[1:3, 1:4] = np.ones((2, 3))
df.iloc[1:3, 1:4] = np.zeros((2, 3))
df.iloc[1:3, 1:4] = np.array([[1, 1, 1],
    [2, 2, 2]])
Remember: exclusive-to in the slice

Views and copies
From the manual: Setting a copy can cause subtle errors. The rules about when a view on the data is returned are dependent on NumPy. Whenever an array of labels or a Boolean vector are involved in the indexing operation, the result will be a copy.
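A minimal sketch of that trap (hypothetical data): chained indexing may write to a temporary copy, while a single .loc call writes to the original.
df = DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df[df['a'] > 1]['b'] = 0 # chained: may only set a copy
df.loc[df['a'] > 1, 'b'] = 0 # one call: sets the original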
Summary: selecting using the DataFrame index

Using the DataFrame index to select columns
s = df['col_label'] # returns Series
df = df[['col_label']] # returns DataFrame
df = df[['L1', 'L2']] # select cols with list
df = df[index] # select cols with an index
df = df[s] # select with col label Series
Note: scalar returns a Series; list &c returns a DataFrame.

Using the DataFrame index to select rows
df = df['from':'inc_to'] # label slice
df = df[3:7] # integer slice
df = df[df['col'] > 0.5] # Boolean Series
df = df.loc['label'] # single label
df = df.loc[container] # lab list/Series
df = df.loc['from':'to'] # inclusive slice
df = df.loc[bs] # Boolean Series
df = df.iloc[0] # single integer
df = df.iloc[container] # int list/Series
df = df.iloc[0:5] # exclusive slice
Trap: a Boolean Series gets rows, a label Series gets cols.

Using the DataFrame index to select a cross-section
# r and c can be scalar, list, slice
df.loc[r, c] # label accessor (row, col)
df.iloc[r, c] # integer accessor
df[c].iloc[r] # chained – also for .loc

Using the DataFrame index to select a cell
# r and c must be label or integer
df.at[r, c] # fast scalar label accessor
df.iat[r, c] # fast scalar int accessor
df[c].iat[r] # chained – also for .at

DataFrame indexing methods
v = df.get_value(r, c) # get by row, col
df = df.set_value(r, c, v) # set by row, col
df = df.xs(key, axis) # get cross-section
df = df.filter(items, like, regex, axis)
df = df.select(crit, axis)
Note: the indexing attributes (.loc, .iloc, .at, .iat) can be used to get and set values in the DataFrame.
Note: the .loc and .iloc indexing attributes can accept python slice objects; .at and .iat do not.
Note: .loc can also accept Boolean Series arguments
Avoid: chaining in the form df[col_indexer][row_indexer]
Trap: label slices are inclusive, integer slices are exclusive.

Some index attributes and methods
b = idx.is_monotonic_decreasing
b = idx.is_monotonic_increasing
b = idx.has_duplicates
i = idx.nlevels # num of index levels
idx = idx.astype(dtype) # change data type
b = idx.equals(o) # check for equality
idx = idx.union(o) # union of two indexes
i = idx.nunique() # number unique labels
label = idx.min() # minimum label
label = idx.max() # maximum label

Joining/Combining DataFrames

Three ways to join two DataFrames:
merge (a database/SQL-like join operation)
concat (stack side by side or one on top of the other)
combine_first (splice the two together, choosing values from one over the other)

Merge on (row) indexes
df_new = pd.merge(left=df1, right=df2,
    how='outer', left_index=True,
    right_index=True)
How: 'left', 'right', 'outer', 'inner'
How: outer=union/all; inner=intersection

Merge on columns
df_new = pd.merge(left=df1, right=df2,
    how='left', left_on='col1',
    right_on='col2')
Trap: when joining on columns, the indexes on the passed DataFrames are ignored.
Trap: many-to-many merges on a column can result in an explosion of associated data.

Join on indexes (another way of merging)
df_new = df1.join(other=df2, on='col1',
    how='outer')
df_new = df1.join(other=df2, on=['a', 'b'],
    how='outer')
Note: DataFrame.join() joins on indexes by default. DataFrame.merge() joins on common columns by default.

Simple concatenation is often the best
df = pd.concat([df1, df2], axis=0) # top/bottom
df = df1.append([df2, df3]) # top/bottom
df = pd.concat([df1, df2], axis=1) # left/right
Trap: can end up with duplicate rows or cols
Note: concat has an ignore_index parameter

Combine_first
df = df1.combine_first(other=df2)

# multi-combine with python reduce()
from functools import reduce # python 3
df = reduce(lambda x, y:
    x.combine_first(y),
    [df1, df2, df3, df4, df5])
Uses the non-null values from df1. The index of the combined DataFrame will be the union of the indexes from df1 and df2.

Groupby: Split-Apply-Combine

Grouping
gb = df.groupby('cat') # by one column
gb = df.groupby(['c1', 'c2']) # by 2 cols
gb = df.groupby(level=0) # multi-index gb
gb = df.groupby(level=['a', 'b']) # mi gb
print(gb.groups)
Note: groupby() returns a pandas groupby object
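A minimal end-to-end sketch of split-apply-combine (hypothetical data):
df = DataFrame({'cat': ['a', 'a', 'b'],
    'col1': [1.0, 2.0, 3.0]})
s = df.groupby('cat')['col1'].mean()
# s: 'a' -> 1.5, 'b' -> 3.0 (one value per group)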
Note: the groupby object attribute .groups contains a dictionary mapping of the groups.
Trap: NaN values in the group key are automatically dropped – there will never be a NaN group.

The pandas "groupby" mechanism allows us to split the data into groups, apply a function to each group independently and then combine the results.

Iterating groups – usually not needed
for name, group in gb:
    print(name, group)

Selecting a group
dfa = df.groupby('cat').get_group('a')
dfb = df.groupby('cat').get_group('b')

Applying an aggregating function
# apply to a single column ...
s = df.groupby('cat')['col1'].sum()
s = df.groupby('cat')['col1'].agg(np.sum)
# apply to every column in the DataFrame
s = df.groupby('cat').agg(np.sum)
df_summary = df.groupby('cat').describe()
df_row_1s = df.groupby('cat').head(1)
Note: aggregating functions reduce the dimension by one – they include: mean, sum, size, count, std, var, sem, describe, first, last, min, max

Applying multiple aggregating functions
gb = df.groupby('cat')
# apply multiple functions to one column
dfx = gb['col2'].agg([np.sum, np.mean])
# apply multiple fns to multiple cols
dfy = gb.agg({
    'cat': np.count_nonzero,
    'col1': [np.sum, np.mean, np.std],
    'col2': [np.min, np.max]
})
Note: gb['col2'] above is shorthand for df.groupby('cat')['col2'], without the need for regrouping.

Pivot Tables: working with long and wide data

These features work with, and often create, hierarchical or multi-level Indexes (the pandas MultiIndex is powerful and complex).

Pivot, unstack, stack and melt
Pivot tables move from long format to wide format data

# --- start with data in long format
#from StringIO import StringIO # python2.7
from io import StringIO # python 3
data = """Date,Pollster,State,Party,Est
13/03/2014, Newspoll, NSW, red, 25
13/03/2014, Newspoll, NSW, blue, 28
13/03/2014, Newspoll, Vic, red, 24
13/03/2014, Newspoll, Vic, blue, 23
13/03/2014, Galaxy, NSW, red, 23
13/03/2014, Galaxy, NSW, blue, 24
13/03/2014, Galaxy, Vic, red, 26
13/03/2014, Galaxy, Vic, blue, 25
13/03/2014, Galaxy, Qld, red, 21
13/03/2014, Galaxy, Qld, blue, 27"""
df = pd.read_csv(StringIO(data),
    header=0, skipinitialspace=True)

# pivot to wide format on 'Party' column
# 1st: set up a MultiIndex for other cols
df1 = df.set_index(['Date', 'Pollster',
    'State'])
# 2nd: do the pivot
wide1 = df1.pivot(columns='Party')

# unstack to wide format on State / Party
# 1st: MultiIndex all but the Values col
df2 = df.set_index(['Date', 'Pollster',
    'State', 'Party'])
# 2nd: unstack a column to go wide on it
wide2 = df2.unstack('State')
wide3 = df2.unstack() # pop last index
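Note: melt goes the other way, from wide back to long; a minimal sketch (hypothetical wide frame):
wide = DataFrame({'State': ['NSW', 'Vic'],
    'red': [25, 24], 'blue': [28, 23]})
long_df = pd.melt(wide, id_vars=['State'],
    var_name='Party', value_name='Est')
# one row per (State, Party) pair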
Working with dates, times and their indexes

Dates and time – points and spans
With its focus on time-series data, pandas has a suite of tools for managing dates and time: either as a point in time (a Timestamp) or as a span of time (a Period).
t = pd.Timestamp('2013-01-01')
t = pd.Timestamp('2013-01-01 21:15:06')
t = pd.Timestamp('2013-01-01 21:15:06.7')
p = pd.Period('2013-01-01', freq='M')
Note: Timestamps can range from 1678 to 2261. (Check pd.Timestamp.max and pd.Timestamp.min).

A Series of Timestamps or Periods
ts = ['2015-04-01', '2014-04-02']

# Series of Timestamps
s = pd.to_datetime(pd.Series(ts))

# Series of Periods
s = s.dt.to_period('M') # from Timestamps
Note: while Periods make a very useful index, they may be less useful in a Series.
Trap: pd.to_datetime(list_of_timestamp_strings) returns a pandas DatetimeIndex object.

From non-standard strings to Timestamps
t = ['09:08:55.7654-JAN092002',
    '15:42:02.6589-FEB082016']
s = pd.Series(pd.to_datetime(t,
    format="%H:%M:%S.%f-%b%d%Y"))
Also: %B = full month name; %m = numeric month; %y = year without century; and more …

Dates and time – stamps and spans as indexes
An index of Timestamps is a DatetimeIndex.
An index of Periods is a PeriodIndex.
date_strs = ['2018-01-01', '2018-04-01',
    '2018-07-01', '2018-10-01']

dti = pd.DatetimeIndex(date_strs)

pid = pd.PeriodIndex(date_strs, freq='D')
pim = pd.PeriodIndex(date_strs, freq='M')
piq = pd.PeriodIndex(date_strs, freq='Q')

print(pid[1] - pid[0]) # 90 [days]
print(pim[1] - pim[0]) # 3 [months]
print(piq[1] - piq[0]) # 1 [quarter]

time_strs = ['2015-01-01 02:10:40.12345',
    '2015-01-01 02:10:50.67890']
pis = pd.PeriodIndex(time_strs, freq='U')

Period frequency constants (not a complete list)
Name              Description
U                 Microsecond
L                 Millisecond
S                 Second
T                 Minute
H                 Hour
D                 Calendar day
B                 Business day
W-{MON, TUE, …}   Week ending on …
MS                Calendar start of month
M                 Calendar end of month
QS-{JAN, FEB, …}  Quarter start with year starting (QS – December)
Q-{JAN, FEB, …}   Quarter end with year ending (Q – December)
AS-{JAN, FEB, …}  Year start (AS – December)
A-{JAN, FEB, …}   Year end (A – December)

DatetimeIndex from DataFrame columns
datecols = ['year', 'month', 'day']
df.index = pd.to_datetime(df[datecols])

From DatetimeIndex to Python datetime objects
dti = pd.DatetimeIndex(pd.date_range(
    start='1/1/2011', periods=4, freq='M'))
s = Series([1, 2, 3, 4], index=dti)
a = dti.to_pydatetime() # numpy array
a = s.index.to_pydatetime() # numpy array

From Timestamps to Python dates or times
df['py_date'] = [x.date() for x in df['TS']]
df['py_time'] = [x.time() for x in df['TS']]
Note: converts to datetime.date or datetime.time, but does not convert to datetime.datetime.

From DatetimeIndex to PeriodIndex and back
df = DataFrame(np.random.randn(20, 3))
df.index = pd.date_range('2015-01-01',
    periods=len(df), freq='M')
dfp = df.to_period(freq='M')
dft = dfp.to_timestamp()
Note: from period to timestamp defaults to the point in time at the start of the period.

Working with a PeriodIndex
pi = pd.period_range('1960-01', '2015-12',
    freq='M')
a = pi.values # numpy array of integers
p = pi.tolist() # python list of Periods
sp = Series(pi) # pandas Series of Periods
s = Series(pi).astype(str) # Series of strs
l = Series(pi).astype(str).tolist()
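A few of the frequency constants from the table above in use (a small sketch):
p = pd.Period('2018-03', freq='M') # calendar month
q = pd.Period('2018Q1', freq='Q-DEC') # quarter, year ending December
b = pd.date_range('2018-01-01', periods=3, freq='B') # business days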
Error handling with dates
# with errors='ignore' the input string is returned
t = pd.to_datetime('2014-02-30',
    errors='ignore')
# with errors='coerce' NaT (not a time) is returned
t = pd.to_datetime('2014-02-30',
    errors='coerce')
# NaT, like NaN, tests True for isnull()
b = pd.isnull(t) # --> True

The tail of a time-series DataFrame
df = df.last("5M") # the last five months

Upsampling and downsampling
# upsample from quarterly to monthly
pi = pd.period_range('1960Q1',
    periods=220, freq='Q')
df = DataFrame(np.random.rand(len(pi), 5),
    index=pi)
dfm = df.resample('M', convention='end').ffill()
# or use bfill to fill with values

Time zones
t = ['2015-06-30 00:00:00',
    '2015-12-31 00:00:00']
dti = pd.to_datetime(t
    ).tz_localize('Australia/Canberra')
dti = dti.tz_convert('UTC')
ts = pd.Timestamp('now',
    tz='Europe/London')

Row selection with a time-series index
february_selector = (df.index.month == 2)
february_data = df[february_selector]
mayornov_data = df[(df.index.month == 5) |
    (df.index.month == 11)]
totals = df.groupby(df.index.year).sum()
Also: year, month, day [of month], hour, minute, second, dayofweek [Mon=0 .. Sun=6], weekofyear [numbered from 1; weeks start on Monday], dayofyear [from 1], …

The Series.dt accessor attribute
DataFrame columns that contain datetime-like objects can be manipulated with the .dt accessor attribute
t = ['2012-04-14 04:06:56.307000',
    '2011-05-14 06:14:24.457000',
    '2010-06-14 08:23:07.520000']

# a Series of time stamps
s = pd.Series(pd.to_datetime(t))
print(s.dtype) # datetime64[ns]
print(s.dt.second) # 56, 24, 7
print(s.dt.month) # 4, 5, 6

# a Series of time periods
s = pd.Series(pd.PeriodIndex(t, freq='Q'))
print(s.dtype) # period[Q-DEC]
print(s.dt.quarter) # 2, 2, 2
print(s.dt.year) # 2012, 2011, 2010
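Note: the .dt accessor also exposes rounding and string formatting; a small sketch (rebuilding the Series of Timestamps from t above):
s2 = pd.Series(pd.to_datetime(t))
print(s2.dt.floor('D')) # round down to midnight
print(s2.dt.strftime('%Y-%m')) # formatted strings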
Plotting from the DataFrame

Import matplotlib, choose a matplotlib style
import matplotlib.pyplot as plt
print(plt.style.available)
plt.style.use('ggplot')

Line plot
df1 = df.cumsum()
ax = df1.plot()
# followed by the standard plot code as below

Box plot
ax = df.plot.box(vert=False)
# followed by the standard plot code as below

ax = df.boxplot(column='c1', by='c2')

Histogram
ax = df['A'].plot.hist(bins=20)
# followed by the standard plot code as below

Multiple histograms (overlapping or stacked)
ax = df.plot.hist(bins=25, alpha=0.5) # or...
ax = df.plot.hist(bins=25, stacked=True)
# followed by the standard plot code as below

Horizontal bars
ax = binned['A'][(binned.index >= -4) &
    (binned.index <= 4)].plot.barh()
# followed by the standard plot code as below

Density plot
ax = df.plot.kde()
# followed by the standard plot code as below
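Note: a hedged consolidation of the "standard plot code" referred to above, assembled from the figure-output lines used later in these notes:
fig = ax.figure # get the figure from the axes
fig.set_size_inches(8, 3)
fig.savefig('filename.png', dpi=125)
plt.close(fig) # finished with the figure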
Scatter plot
ax = df.plot.scatter(x='A', y='C')
# followed by the standard plot code as above

Pie chart
# ax below is assumed to come from a pie plot,
# e.g. ax = s.plot.pie() on a pandas Series
# followed by the standard plot output ...
ax.set_title('Pie Chart')
ax.set_aspect(1) # make it round
ax.set_ylabel('') # remove default

fig = ax.figure
fig.set_size_inches(8, 3)
fig.savefig('filename.png', dpi=125)

plt.close(fig)

A line and bar on the same chart
In matplotlib, bar charts visualise categorical or discrete data, while line charts visualise continuous data. This makes it hard to get bars and lines on the same chart. Typically, combined charts either have too many labels, and/or the lines and bars are misaligned or missing. You need to trick matplotlib a bit, and pandas makes this tricking easier.

# reindex with integers from 0; keep the old index
old = dfg.index
dfg.index = range(len(dfg))

# plot the line from pandas
ax = dfg['Annual'].plot(color='blue',
    label='Year/Year Growth')

# plot the bars from pandas
dfg['Quarter'].plot.bar(ax=ax,
    label='Q/Q Growth', width=0.8)

plt.close()
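Before the plt.close() above, the saved index can be restored as tick labels on the shared x-axis; a hedged sketch (the label format is an assumption):
ax.set_xticks(range(len(old)))
ax.set_xticklabels([str(i) for i in old]) # assumed label format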
Working with missing and non-finite data
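A minimal illustrative sketch of the usual missing-data tools (assumed examples, not this section's original content):
s = Series([1.0, np.nan, 3.0])
b = s.isnull() # Boolean mask of missing values
s1 = s.dropna() # drop the missing values
s2 = s.fillna(0.0) # or replace them
df = df.replace([np.inf, -np.inf], np.nan) # tame non-finite values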
Working with Categorical Data
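A minimal illustrative sketch of the category dtype (assumed examples, not this section's original content):
s = Series(['a', 'b', 'a', 'c']).astype('category')
i = s.cat.categories # Index(['a', 'b', 'c'], dtype='object')
c = s.cat.codes # integer codes: 0, 1, 0, 2
s = s.cat.set_categories(['c', 'b', 'a'], ordered=True)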
Working with strings
# assume that df['col'] is a series of strings
s = df['col'].str.lower()
s = df['col'].str.upper()
s = df['col'].str.len()

# the next set work like Python
df['col'] += 'suffix' # append
df['col'] *= 2 # duplicate
s = df['col1'] + df['col2'] # concatenate
Most python string functions are replicated in the pandas DataFrame and Series objects.

Regular expressions
s = df['col'].str.contains('regex')
s = df['col'].str.startswith('text') # no regex
s = df['col'].str.endswith('text') # no regex
s = df['col'].str.replace('old', 'new')
df['b'] = df.a.str.extract('(pattern)')
Note: pandas has many more regex methods.

Basic Statistics

Summary statistics
s = df['col1'].describe()
df1 = df.describe()

DataFrame – key stats methods
df.corr() # pairwise correlation cols
df.cov() # pairwise covariance cols
df.kurt() # kurtosis over cols (def)
df.mad() # mean absolute deviation
df.sem() # standard error of mean
df.var() # variance over cols (def)

Value counts
s = df['col1'].value_counts()

Histogram binning
count, bins = np.histogram(df['col1'])
count, bins = np.histogram(df['col'],
    bins=5)
count, bins = np.histogram(df['col1'],
    bins=[-3, -2, -1, 0, 1, 2, 3, 4])

Regression
import statsmodels.formula.api as smf
result = smf.ols(formula='col1 ~ col2 + col3',
    data=df).fit()
print(result.params)
print(result.summary())

Simple smoothing example using a rolling apply
k3x5 = np.array([1, 2, 3, 3, 3, 2, 1]) / 15.0
s = df['A'].rolling(window=len(k3x5),
    min_periods=len(k3x5),
    center=True).apply(
    func=lambda x: (x * k3x5).sum())
# fix the missing end data ... unsmoothed
s = df['A'].where(s.isnull(), other=s)

Cautionary note
This cheat sheet was cobbled together by tireless bots roaming the dark recesses of the Internet seeking ursine and anguine myths from a fabled land of milk and honey where it is rumoured pandas and pythons gambol together. There is no guarantee the narratives were captured and transcribed accurately. You use these notes at your own risk. You have been warned. I will not be held responsible for whatever happens to you and those you love once your eyes begin to see what is written here.

Errors: If you find any errors, please email me at [email protected]; (but please do not correct my use of Australian-English spelling conventions).