Pandas - Data Analysis Paper
Introduction
>>> data
   item        date  price    volume    ind
0  GOOG  2009-12-28  622.9   1697900   True
1  GOOG  2009-12-29  619.4   1424800   True
2  GOOG  2009-12-30  622.7   1465600   True
3  GOOG  2009-12-31  620.0   1219800   True
4  AAPL  2009-12-28  211.6  23003100  False
5  AAPL  2009-12-29  209.1  15868400  False
6  AAPL  2009-12-30  211.6  14696800  False
7  AAPL  2009-12-31  210.7  12571000  False
[Residue of a second example table: an industry column with values TECH, FIN, and AUTO, four rows each; the remaining columns were lost in extraction.]
sequential observation numbers, while the column index contains the column names. The labels are not required to be
sorted, though a subclass of Index could be implemented to
require sortedness and provide operations optimized for sorted
data (e.g. time series data).
The Index object is used for many purposes:
- Performing lookups to select subsets or slices of an object
- Providing fast data alignment routines for aligning one object with another
- Enabling intuitive slicing / selection to form new Index objects
- Forming unions and intersections of Index objects
Here are some examples of how the index is used internally:
>>> index = Index(['a', 'b', 'c', 'd', 'e'])
>>> 'c' in index
True
>>> index.get_loc('d')
3
>>> index.slice_locs('b', 'd')
(1, 4)
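The core of these operations can be illustrated with a toy stand-in for Index. `MiniIndex` below is a hypothetical sketch (the class name and internals are invented here) showing how a hash table gives constant-time label lookups; it is not pandas' actual implementation.

```python
class MiniIndex:
    """A toy immutable label index: O(1) lookups via a hash table."""

    def __init__(self, labels):
        self.labels = list(labels)
        # label -> integer location, built once up front
        self._map = {lab: i for i, lab in enumerate(self.labels)}

    def __contains__(self, label):
        return label in self._map

    def get_loc(self, label):
        return self._map[label]

    def slice_locs(self, start, end):
        # inclusive label bounds, like Index.slice_locs: (loc(start), loc(end) + 1)
        return self._map[start], self._map[end] + 1

    def union(self, other):
        # deduplicate labels from both sides, then sort
        seen = dict.fromkeys(self.labels)
        seen.update(dict.fromkeys(other.labels))
        return MiniIndex(sorted(seen))


index = MiniIndex(['a', 'b', 'c', 'd', 'e'])
print('c' in index)                # True
print(index.get_loc('d'))          # 3
print(index.slice_locs('b', 'd'))  # (1, 4)
```

Storing the label-to-location map up front is what makes repeated lookups and alignment cheap; the trade-off is that the map must be rebuilt whenever the labels change, which is one reason pandas treats an Index as immutable.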
>>> data._data
BlockManager
Items: [item date price volume ind newcol]
Axis 1: [0 1 2 3 4 5 6 7]
FloatBlock: [price volume], 2 x 8
ObjectBlock: [item date], 2 x 8
BoolBlock: [ind], 1 x 8
FloatBlock: [newcol], 1 x 8
>>> data.consolidate()._data
BlockManager
Items: [item date price volume ind newcol]
Axis 1: [0 1 2 3 4 5 6 7]
BoolBlock: [ind], 1 x 8
FloatBlock: [price volume newcol], 3 x 8
ObjectBlock: [item date], 2 x 8
The separation between the internal BlockManager object and the external, user-facing DataFrame gives the pandas developers a significant amount of freedom to modify the
internal structure to achieve better performance and memory
usage.
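The consolidation step can be modeled with a short sketch: columns are grouped by dtype so that same-typed columns can live together in one two-dimensional block. The function below is a hypothetical illustration of the bookkeeping only (the names are invented, and a real BlockManager stores 2-D ndarrays rather than lists of lists):

```python
def consolidate(columns):
    """Group (name, dtype, values) columns into one block per dtype.

    Returns {dtype: (names, data)} where data is a list of columns,
    a stand-in for the single 2-D array a real block would hold.
    """
    blocks = {}
    for name, dtype, values in columns:
        names, data = blocks.setdefault(dtype, ([], []))
        names.append(name)
        data.append(values)
    return blocks


cols = [
    ('price',  'float',  [622.9, 619.4]),
    ('item',   'object', ['GOOG', 'GOOG']),
    ('volume', 'float',  [1697900.0, 1424800.0]),
    ('ind',    'bool',   [True, True]),
]
blocks = consolidate(cols)
print(sorted(blocks))      # ['bool', 'float', 'object']
print(blocks['float'][0])  # ['price', 'volume']
```

After consolidation, operations touching all float columns can work on one contiguous block instead of many single-column arrays, which mirrors the motivation for the FloatBlock merge shown above.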
Label-based data access
Data alignment
>>> s1
AAPL     0.044
IBM      0.050
SAP      0.101
GOOG     0.113
C        0.138
SCGLY    0.037
BAR      0.200
DB       0.281
VW       0.040
>>> s2
AAPL    0.025
BAR     0.158
C       0.028
DB      0.087
F       0.004
GOOG    0.154
IBM     0.034
>>> df
            AAPL   GOOG
2009-12-28  211.6  622.9
2009-12-29  209.1  619.4
2009-12-30  211.6  622.7
2009-12-31  210.7  620
>>> df2
            AAPL
2009-12-28  2.3e+07
2009-12-29  1.587e+07
2009-12-30  1.47e+07
>>> df / df2
            AAPL       GOOG
2009-12-28  9.199e-06  NaN
2009-12-29  1.318e-05  NaN
2009-12-30  1.44e-05   NaN
2009-12-31  NaN        NaN
>>> s1.reindex(s2.index)
AAPL    0.0440877763224
BAR     0.199741007422
C       0.137747485628
DB      0.281070058049
F       NaN
GOOG    0.112861123629
IBM     0.0496445829129
>>> (s1 + s2).dropna()
AAPL    0.0686791008184
BAR     0.358165479807
C       0.16586702944
DB      0.367679872693
GOOG    0.26666583847
IBM     0.0833057542385
>>> (s1 + s2).fillna(0)
AAPL     0.0686791008184
BAR      0.358165479807
C        0.16586702944
DB       0.367679872693
F        0.0
GOOG     0.26666583847
IBM      0.0833057542385
SAP      0.0
SCGLY    0.0
VW       0.0
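Conceptually, arithmetic between two differently-indexed Series amounts to forming the union of the two indexes and inserting NaN wherever a label is missing on either side. A minimal pure-Python sketch of that rule, using plain dicts as stand-ins for Series (a simplified model, not pandas' vectorized routine):

```python
NA = float('nan')

def align_add(s1, s2):
    """Add two {label: value} 'series', aligning on the union of labels.

    Any label missing from either side yields NaN, mirroring Series + Series.
    """
    out = {}
    for label in sorted(set(s1) | set(s2)):
        if label in s1 and label in s2:
            out[label] = s1[label] + s2[label]
        else:
            out[label] = NA
    return out


s1 = {'AAPL': 0.044, 'GOOG': 0.113, 'IBM': 0.050}
s2 = {'AAPL': 0.025, 'GOOG': 0.154, 'F': 0.004}
result = align_add(s1, s2)
print(result['AAPL'])              # approximately 0.069
print(result['F'] != result['F'])  # True: NaN marks the unmatched label
```

The result carries the union index, so no observation is silently dropped; downstream code can then decide whether to drop or fill the NaN entries.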
>>> ts1
2000-01-03    0.03825
2000-01-04   -1.9884
2000-01-05    0.73255
2000-01-06   -0.0588
2000-01-07   -0.4767
2000-01-10    1.98008
2000-01-11    0.04410
>>> ts2
2000-01-03    0.03825
2000-01-06   -0.0588
2000-01-11    0.04410
2000-01-14   -0.1786
Common ndarray methods have been rewritten to automatically exclude missing data from calculations:
>>> (s1 + s2).sum()
1.3103630754662747
>>> (s1 + s2).count()
6
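The NA-excluding behavior of these reductions follows one simple rule, skip any value that is NaN, which can be sketched in plain Python (a toy model; pandas does this with vectorized boolean masks):

```python
def nan_aware_sum(values):
    """Sum that skips NaN entries, like Series.sum."""
    return sum(v for v in values if v == v)  # NaN != NaN

def nan_aware_count(values):
    """Number of non-NA observations, like Series.count."""
    return sum(1 for v in values if v == v)


data = [0.0687, float('nan'), 0.1659, 0.3677, float('nan'), 0.2667]
print(nan_aware_sum(data))    # sums only the four valid entries
print(nan_aware_count(data))  # 4
```

The `v == v` test exploits the IEEE 754 rule that NaN compares unequal to itself, which is the same trick a mask-based implementation relies on.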
Similar to R's is.na function, which detects NA (Not Available) values, pandas has special API functions isnull and
notnull for determining the validity of a data point. These
contrast with numpy.isnan in that they can be used with
dtypes other than float and also detect some other markers of
missingness occurring in the wild, such as the Python None
value.
>>> isnull(s1 + s2)
AAPL     False
BAR      False
C        False
DB       False
F         True
GOOG     False
IBM      False
SAP       True
SCGLY     True
VW        True
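The dtype-independence that distinguishes isnull from numpy.isnan can be sketched as a scalar rule (a simplified model; the real function operates on whole arrays):

```python
def isnull_scalar(x):
    """True for NaN or Python None, regardless of dtype.

    numpy.isnan would raise a TypeError on strings or None; this
    rule works for any scalar, which is what lets isnull run on
    object-dtype data.
    """
    if x is None:
        return True
    return isinstance(x, float) and x != x  # NaN != NaN


print(isnull_scalar(float('nan')))  # True
print(isnull_scalar(None))          # True
print(isnull_scalar('AAPL'))        # False
print(isnull_scalar(0.044))         # False
```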
>>> ts3.fillna(method='ffill')
2000-01-03    0.07649
2000-01-04    0.07649
2000-01-05    0.07649
2000-01-06   -0.1177
2000-01-07   -0.1177
2000-01-10   -0.1177
2000-01-11    0.08821
2000-01-14    0.08821
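Forward filling simply propagates the last valid observation; a minimal sketch of the rule that fillna(method='ffill') applies:

```python
def ffill(values):
    """Replace each NaN with the most recent non-NA value before it."""
    out, last = [], float('nan')
    for v in values:
        if v == v:   # v is a valid observation (NaN != NaN)
            last = v
        out.append(last)
    return out


series = [0.07649, float('nan'), float('nan'), -0.1177, float('nan')]
print(ffill(series))  # [0.07649, 0.07649, 0.07649, -0.1177, -0.1177]
```

Note that leading NaN values stay NaN, since there is no prior observation to propagate.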
>>> hdf
              A         B         C
foo  one   -0.9884    0.09406   1.263
     two    1.29      0.08242  -0.05576
     three  0.5366   -0.4897    0.3694
bar  one   -0.03457  -2.484    -0.2815
     two    0.03071   0.1091    1.126
baz  two   -0.9773    1.474    -0.06403
     three -1.283     0.7818   -1.071
qux  one    0.4412    2.354     0.5838
     two    0.2215   -0.7445    0.7585
     three  1.73     -0.965    -0.8457
The hierarchical index can be used with any axis. From the
pivot example earlier in the paper we obtained:
>>> pivoted = data.pivot('date', 'item')
>>> pivoted
            price         volume
            AAPL   GOOG   AAPL       GOOG
2009-12-28  211.6  622.9  2.3e+07    1.698e+06
2009-12-29  209.1  619.4  1.587e+07  1.425e+06
2009-12-30  211.6  622.7  1.47e+07   1.466e+06
2009-12-31  210.7  620    1.257e+07  1.22e+06
>>> pivoted['volume']
            AAPL       GOOG
2009-12-28  2.3e+07    1.698e+06
2009-12-29  1.587e+07  1.425e+06
2009-12-30  1.47e+07   1.466e+06
2009-12-31  1.257e+07  1.22e+06
>>> df
            AAPL   GOOG
2009-12-28  211.6  622.9
2009-12-29  209.1  619.4
2009-12-30  211.6  622.7
2009-12-31  210.7  620
>>> df.stack()
2009-12-28  AAPL    211.61
            GOOG    622.87
2009-12-29  AAPL    209.1
            GOOG    619.4
2009-12-30  AAPL    211.64
            GOOG    622.73
2009-12-31  AAPL    210.73
            GOOG    619.98
>>> df.stack().unstack()
            AAPL   GOOG
2009-12-28  211.6  622.9
2009-12-29  209.1  619.4
2009-12-30  211.6  622.7
2009-12-31  210.7  620
>>> pivoted.stack(0)
                   AAPL       GOOG
2009-12-28 price   211.6      622.9
           volume  2.3e+07    1.698e+06
2009-12-29 price   209.1      619.4
           volume  1.587e+07  1.425e+06
2009-12-30 price   211.6      622.7
           volume  1.47e+07   1.466e+06
2009-12-31 price   210.7      620
           volume  1.257e+07  1.22e+06
>>> pivoted.stack(0).max(1).unstack()
            price  volume
2009-12-28  622.9  2.3e+07
2009-12-29  619.4  1.587e+07
2009-12-30  622.7  1.47e+07
2009-12-31  620    1.257e+07
>>> pivoted
            price         volume
            AAPL   GOOG   AAPL       GOOG
2009-12-28  211.6  622.9  2.3e+07    1.698e+06
2009-12-29  209.1  619.4  1.587e+07  1.425e+06
2009-12-30  211.6  622.7  1.47e+07   1.466e+06
2009-12-31  210.7  620    1.257e+07  1.22e+06
>>> pivoted.stack()
                  price  volume
2009-12-28 AAPL   211.6  2.3e+07
           GOOG   622.9  1.698e+06
2009-12-29 AAPL   209.1  1.587e+07
           GOOG   619.4  1.425e+06
2009-12-30 AAPL   211.6  1.47e+07
           GOOG   622.7  1.466e+06
2009-12-31 AAPL   210.7  1.257e+07
           GOOG   620    1.22e+06
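stack and unstack are inverse reshapes: stack pivots a column level into the row index, and unstack pivots it back. The round trip can be sketched with plain dicts standing in for indexed data (a toy model of the idea, not pandas' implementation):

```python
def stack(frame):
    """{column: {row: value}} -> {(row, column): value}."""
    return {(row, col): val
            for col, column in frame.items()
            for row, val in column.items()}

def unstack(stacked):
    """{(row, column): value} -> {column: {row: value}}, the inverse of stack."""
    frame = {}
    for (row, col), val in stacked.items():
        frame.setdefault(col, {})[row] = val
    return frame


df = {'AAPL': {'2009-12-28': 211.6, '2009-12-29': 209.1},
      'GOOG': {'2009-12-28': 622.9, '2009-12-29': 619.4}}
tall = stack(df)
print(tall[('2009-12-28', 'GOOG')])  # 622.9
print(unstack(tall) == df)           # True: a lossless round trip
```

Because each value keeps its full (row, column) label in the stacked form, the reshape is lossless, which is why df.stack().unstack() reproduces df above.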
A very common operation in SQL-like languages and generally in statistical data analysis is to group data by some
identifiers and perform either an aggregation or transformation
of the data. For example, suppose we had a simple data set
like this:
>>> df
     A      B        C        D
0  foo    one  -1.834     1.903
1  bar    one   1.772    -0.7472
2  foo    two  -0.67     -0.309
3  bar  three   0.04931   0.3939
4  foo    two  -0.5215    1.861
5  bar    two  -3.202     0.9365
6  foo    one   0.7927    1.256
7  foo  three   0.1461   -2.655
>>> df.groupby('A').mean()
          C        D
bar  -0.4602   0.1944
foo  -0.4173   0.4112
     A      B        C       D
0  foo    one  -1.834    1.903
2  foo    two  -0.67    -0.309
4  foo    two  -0.5215   1.861
6  foo    one   0.7927   1.256
7  foo  three   0.1461  -2.655
>>> dat
                  A        B        C         D
2000-01-03   0.6371   0.672    0.9173   1.674
2000-01-04  -0.8178  -1.865   -0.23     0.5411
2000-01-05   0.314    0.2931  -0.6444  -0.9973
2000-01-06   1.913   -0.5867   0.273    0.4631
2000-01-07   1.308    0.426   -1.306    0.04358
>>> mapping
{'A': 'Group 1', 'B': 'Group 2',
 'C': 'Group 1', 'D': 'Group 2'}
>>> for name, group in dat.groupby(mapping.get,
...                                axis=1):
...     print name; print group
Group 1
                  A        C
2000-01-03   0.6371   0.9173
2000-01-04  -0.8178  -0.23
2000-01-05   0.314   -0.6444
2000-01-06   1.913    0.273
2000-01-07   1.308   -1.306
Group 2
                  B        D
2000-01-03   0.672    1.674
2000-01-04  -1.865    0.5411
2000-01-05   0.2931  -0.9973
2000-01-06  -0.5867   0.4631
2000-01-07   0.426    0.04358
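All of these GroupBy examples follow the same split-apply-combine pattern, which can be sketched in a few lines of plain Python (the values below are taken from the df example above; the real implementation is vectorized):

```python
from collections import defaultdict

def groupby_mean(keys, values):
    """Group values by key and average each group (split-apply-combine)."""
    groups = defaultdict(list)
    for key, val in zip(keys, values):
        groups[key].append(val)            # split
    return {key: sum(vals) / len(vals)     # apply + combine
            for key, vals in groups.items()}


A = ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo']
C = [-1.834, 1.772, -0.67, 0.04931, -0.5215, -3.202, 0.7927, 0.1461]
means = groupby_mean(A, C)
print(round(means['foo'], 4))  # -0.4173
```

The same skeleton covers aggregation by a column, by a mapping, or by a function: only the way the group keys are produced changes.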
>>> tips.head()
   sex     smoker  time    day  size  tip_pct
1  Female  No      Dinner  Sun     2  0.05945
2  Male    No      Dinner  Sun     3  0.1605
3  Male    No      Dinner  Sun     3  0.1666
4  Male    No      Dinner  Sun     2  0.1398
5  Female  No      Dinner  Sun     4  0.1468
[Residue of a pivot table of mean tip_pct broken down by sex and meal time (Lunch/Dinner); the table layout was lost in extraction.]
>>> df1
            AAPL   GOOG
2009-12-24  209    618.5
2009-12-28  211.6  622.9
2009-12-29  209.1  619.4
2009-12-30  211.6  622.7
2009-12-31  210.7  620
>>> df2
            MSFT   YHOO
2009-12-24  31     16.72
2009-12-28  31.17  16.88
2009-12-29  31.39  16.92
2009-12-30  30.96  16.98
>>> df1.join(df2)
            AAPL   GOOG   MSFT   YHOO
2009-12-24  209    618.5  31     16.72
2009-12-28  211.6  622.9  31.17  16.88
2009-12-29  209.1  619.4  31.39  16.92
2009-12-30  211.6  622.7  30.96  16.98
2009-12-31  210.7  620    NaN    NaN
[Residue of a second join example using a long-format frame with item (GOOG, AAPL) and value columns; the command and layout were lost in extraction.]
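The default join above matches on the row index, left-style: every row label of df1 survives, and labels absent from df2 come out as NaN. A minimal dict-based sketch of that rule (a toy model with invented helper names, not pandas' implementation):

```python
NA = float('nan')

def left_join(left, right):
    """Join two {column: {row: value}} frames on the left frame's row labels.

    Columns from `right` are reindexed to `left`'s rows; rows absent
    from `right` come out as NaN, as in df1.join(df2).
    """
    rows = sorted({r for col in left.values() for r in col})
    joined = {c: dict(col) for c, col in left.items()}
    for c, col in right.items():
        joined[c] = {r: col.get(r, NA) for r in rows}
    return joined


df1 = {'AAPL': {'2009-12-30': 211.6, '2009-12-31': 210.7}}
df2 = {'MSFT': {'2009-12-30': 30.96}}
out = left_join(df1, df2)
print(out['MSFT']['2009-12-30'])  # 30.96
print(out['MSFT']['2009-12-31'])  # nan
```

Reindexing the right-hand columns to the left frame's labels is exactly the alignment machinery from earlier in the paper, applied column by column.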
The savvy user will learn what operations are not very
efficient in DataFrame and Series and fall back on working
directly with the underlying ndarray objects (accessible
via the values attribute) in such cases. What DataFrame
sacrifices in performance it makes up for in flexibility and
expressiveness.
With 64-bit integers representing timestamps, pandas in
fact provides some of the fastest data alignment routines for
differently-indexed time series to be found in open source
software. As working with large, irregularly-spaced time series
requires having a timestamp index, pandas is well positioned to
become the gold standard for high performance open source
time series processing.
With regard to memory usage and large data sets, pandas
is currently only designed for use with in-memory data sets.
We would like to expand its capability to work with data
sets that do not fit into memory, perhaps transparently using
the multiprocessing module or a parallel computing
backend to orchestrate large scale computations.
Given the DataFrame name and feature overlap with the [R]
project and its third-party packages, pandas will draw inevitable
comparisons with R. pandas brings a robust, full-featured, and
integrated data analysis toolset to Python while maintaining a
simple and easy-to-use API. As nearly all data manipulations
involving data.frame objects in R can be easily expressed
using the pandas DataFrame, it is relatively straightforward
in most cases to port R functions to Python. It would be
useful to provide a migration guide for R users, as we have
not copied R's naming conventions or syntax in most places,
instead choosing names based on common sense and making
the syntax and API as Pythonic as possible.
R does not provide indexing functionality in nearly such a
deeply integrated way as pandas does. For example, operations
between data.frame objects will proceed in R without
regard to whether the labels match as long as they are the
same length and width. Some R packages, such as zoo and
xts, provide indexed data structures with data alignment,
but they are largely specialized to ordered time series data.
Hierarchical indexing with constant-time subset selection is
another significant feature missing from R's data structures.
Outside of the scope of this paper is a rigorous performance
comparison of R and pandas. In almost all of the benchmarks
we have run comparing R and pandas, pandas significantly
outperforms R.
Conclusions
We believe that in the coming years there will be great opportunity to attract users in need of statistical data analysis tools
to Python who might have previously chosen R, MATLAB,
or another research environment. By designing robust, easy-to-use data structures that cohere with the rest of the scientific
Python stack, we can make Python a compelling choice for
data analysis applications. In our opinion, pandas provides
a solid foundation upon which a very powerful data analysis
ecosystem can be established.
REFERENCES
[pandas]