How to use dates &
times with pandas
MANIPULATING TIME SERIES DATA IN PYTHON
Stefan Jansen
Founder & Lead Data Scientist at
Applied Artificial Intelligence
Date & time series functionality
At the root: data types for date & time information
Objects for points in time and periods
Attributes & methods reflect time-related details
Sequences of dates & periods:
Series or DataFrame columns
Index: convert object into Time Series
Many Series/DataFrame methods rely on time information in
the index to provide time-series functionality
Basic building block: pd.Timestamp
import pandas as pd  # assumed imported going forward
from datetime import datetime  # To manually create dates
time_stamp = pd.Timestamp(datetime(2017, 1, 1))
pd.Timestamp('2017-01-01') == time_stamp
True  # Understands dates as strings
time_stamp  # type: pd.Timestamp
Timestamp('2017-01-01 00:00:00')
Basic building block: pd.Timestamp
Timestamp object has many attributes to store time-specific
information
time_stamp.year
2017
time_stamp.day_name()
'Sunday'
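The attribute access shown above can be tried on a few more fields. This is a minimal sketch; the variable names are illustrative and only `year` and `day_name()` appear on the slides.

```python
from datetime import datetime

import pandas as pd

# Build the same point in time from a datetime object and from a string
from_datetime = pd.Timestamp(datetime(2017, 1, 1))
from_string = pd.Timestamp('2017-01-01')
same_moment = from_datetime == from_string

# A few time-related attributes and methods of a Timestamp
year = from_string.year
quarter = from_string.quarter
weekday = from_string.day_name()
days_in_month = from_string.days_in_month

print(same_moment, year, quarter, weekday, days_in_month)
```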
More building blocks: pd.Period & freq
Period object has freq attribute to store frequency info
period = pd.Period('2017-01')
period  # default: month-end
Period('2017-01', 'M')
period.asfreq('D')  # convert to daily
Period('2017-01-31', 'D')
Convert pd.Period() to pd.Timestamp() and back
period.to_timestamp().to_period('M')
Period('2017-01', 'M')
More building blocks: pd.Period & freq
Frequency info enables basic date arithmetic
period + 2
Period('2017-03', 'M')
pd.Timestamp('2017-01-31', 'M') + 1  # older pandas only; newer versions need an offset such as pd.offsets.MonthEnd(1)
Timestamp('2017-02-28 00:00:00', freq='M')
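As a hedged sketch of the same date arithmetic on recent pandas versions: integer addition still works for `Period`, while a `Timestamp` needs an explicit offset object. Using `pd.offsets.MonthEnd` here is an assumption about how to reproduce the slide's result, not the slide's own code.

```python
import pandas as pd

# Period arithmetic: integers are interpreted in units of the period's frequency
period = pd.Period('2017-01', freq='M')
two_months_later = period + 2          # March 2017
last_day = period.asfreq('D')          # last calendar day of the month

# Timestamp arithmetic: add an explicit offset instead of an integer
stamp = pd.Timestamp('2017-01-31')
next_month_end = stamp + pd.offsets.MonthEnd(1)

print(two_months_later, last_day, next_month_end)
```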
Sequences of dates & times
pd.date_range: start, end, periods, freq
index = pd.date_range(start='2017-1-1', periods=12, freq='M')
index
DatetimeIndex(['2017-01-31', '2017-02-28', '2017-03-31', ...,
'2017-09-30', '2017-10-31', '2017-11-30', '2017-12-31'],
dtype='datetime64[ns]', freq='M')
pd.DatetimeIndex: sequence of Timestamp objects with
frequency info
Sequences of dates & times
index[0]
Timestamp('2017-01-31 00:00:00', freq='M')
index.to_period()
PeriodIndex(['2017-01', '2017-02', '2017-03', '2017-04', ...,
'2017-11', '2017-12'], dtype='period[M]', freq='M')
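A version-tolerant sketch of the month-end sequence above. Recent pandas releases renamed the month-end alias from 'M' to 'ME', so this example passes an offset object instead of a string; that substitution is an assumption made for portability.

```python
import pandas as pd

# Twelve month-end dates, using an offset object rather than a string alias
index = pd.date_range(start='2017-01-01', periods=12, freq=pd.offsets.MonthEnd())

first, last = index[0], index[-1]

# Convert the sequence of Timestamps into a sequence of monthly Periods
periods = index.to_period()

print(first, last, periods[0])
```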
Create a time series: pd.DatetimeIndex
pd.DataFrame({'data': index}).info()
RangeIndex: 12 entries, 0 to 11
Data columns (total 1 columns):
data 12 non-null datetime64[ns]
dtypes: datetime64[ns](1)
Create a time series: pd.DatetimeIndex
np.random.random:
Random numbers: [0,1]
12 rows, 2 columns
import numpy as np
data = np.random.random(size=(12, 2))
pd.DataFrame(data=data, index=index).info()
DatetimeIndex: 12 entries, 2017-01-31 to 2017-12-31
Freq: M
Data columns (total 2 columns):
0 12 non-null float64
1 12 non-null float64
dtypes: float64(2)
Frequency aliases & time info
Let's practice!
Indexing &
resampling time
series
Time series transformation
Basic time series transformations include:
Parsing string dates and converting to datetime64
Selecting & slicing for specific subperiods
Setting & changing DatetimeIndex frequency
Upsampling vs. downsampling
Getting GOOG stock prices
google = pd.read_csv('[Link]')  # import pandas as pd
google.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 504 entries, 0 to 503
Data columns (total 2 columns):
date 504 non-null object
price 504 non-null float64
dtypes: float64(1), object(1)
google.head()
date price
0 2015-01-02 524.81
1 2015-01-05 513.87
2 2015-01-06 501.96
3 2015-01-07 501.10
4 2015-01-08 502.68
Converting string dates to datetime64
pd.to_datetime():
Parse date string
Convert to datetime64
google.date = pd.to_datetime(google.date)
google.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 504 entries, 0 to 503
Data columns (total 2 columns):
date 504 non-null datetime64[ns]
price 504 non-null float64
dtypes: datetime64[ns](1), float64(1)
Converting string dates to datetime64
.set_index():
Date into index
inplace:
don't create copy
google.set_index('date', inplace=True)
google.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 504 entries, 2015-01-02 to 2016-12-30
Data columns (total 1 columns):
price 504 non-null float64
dtypes: float64(1)
Plotting the Google stock time series
import matplotlib.pyplot as plt
google.price.plot(title='Google Stock Price')
plt.tight_layout(); plt.show()
Partial string indexing
Selecting/indexing using strings that parse to dates
google.loc['2015'].info()  # Pass string for part of date; recent pandas requires .loc for row selection
DatetimeIndex: 252 entries, 2015-01-02 to 2015-12-31
Data columns (total 1 columns):
price 252 non-null float64
dtypes: float64(1)
google.loc['2015-3':'2016-2'].info()  # Slice includes last month
DatetimeIndex: 252 entries, 2015-03-02 to 2016-02-29
Data columns (total 1 columns):
price 252 non-null float64
dtypes: float64(1)
memory usage: 3.9 KB
Partial string indexing
google.loc['2016-6-1', 'price']  # Use full date with .loc[]
734.15
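The parsing and selection steps above can be tried without the price file. A minimal sketch on a small, hypothetical DataFrame (the four prices are made up for illustration):

```python
import pandas as pd

# Hypothetical prices standing in for the stock price file
raw = pd.DataFrame({
    'date': ['2015-12-30', '2015-12-31', '2016-01-04', '2016-01-05'],
    'price': [10.0, 11.0, 12.0, 13.0],
})

raw['date'] = pd.to_datetime(raw['date'])    # strings -> datetime64
prices = raw.set_index('date')               # returns a new DataFrame with a DatetimeIndex

rows_2015 = prices.loc['2015']                  # partial string: every row in 2015
dec_to_jan = prices.loc['2015-12':'2016-01']    # slice includes all of January 2016
single = prices.loc['2016-01-04', 'price']      # full date plus column label

print(len(rows_2015), len(dec_to_jan), single)
```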
.asfreq(): set frequency
.asfreq('D'):
Convert DatetimeIndex to calendar day frequency
google.asfreq('D').info()  # set calendar day frequency
DatetimeIndex: 729 entries, 2015-01-02 to 2016-12-30
Freq: D
Data columns (total 1 columns):
price 504 non-null float64
dtypes: float64(1)
.asfreq(): set frequency
Upsampling:
Higher frequency implies new dates => missing data
google.asfreq('D').head()
price
date
2015-01-02 524.81
2015-01-03 NaN
2015-01-04 NaN
2015-01-05 513.87
2015-01-06 501.96
.asfreq(): reset frequency
.asfreq('B'):
Convert DatetimeIndex to business day frequency
google = google.asfreq('B')  # Change to business day frequency
google.info()
DatetimeIndex: 521 entries, 2015-01-02 to 2016-12-30
Freq: B
Data columns (total 1 columns):
price 504 non-null float64
dtypes: float64(1)
.asfreq(): reset frequency
google[google.price.isnull()]  # Select missing 'price' values
price
date
2015-01-19 NaN
2015-02-16 NaN
...
2016-11-24 NaN
2016-12-26 NaN
Business days that were not trading days
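To see why business-day frequency exposes non-trading days, here is a small sketch on hypothetical data in which Monday 2015-01-19 is deliberately left out (the prices are made up):

```python
import pandas as pd

# Hypothetical trading days: Thursday, Friday, then Tuesday, Wednesday
days = pd.to_datetime(['2015-01-15', '2015-01-16', '2015-01-20', '2015-01-21'])
prices = pd.DataFrame({'price': [1.0, 2.0, 3.0, 4.0]}, index=days)

daily = prices.asfreq('D')       # calendar days: adds the weekend and the missing Monday
business = prices.asfreq('B')    # business days: adds only the missing Monday

# Business days without a price were not trading days
missing = business[business['price'].isnull()]

print(len(daily), len(business), list(missing.index))
```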
Let's practice!
Lags, changes, and
returns for stock
price series
Basic time series calculations
Typical time series manipulations include:
Shift or lag values back or forward in time
Get the difference in value for a given time period
Compute the percent change over any number of periods
pandas built-in methods rely on pd.DatetimeIndex
Getting GOOG stock prices
Let pd.read_csv() do the parsing for you!
google = pd.read_csv('[Link]', parse_dates=['date'], index_col='date')
google.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 504 entries, 2015-01-02 to 2016-12-30
Data columns (total 1 columns):
price 504 non-null float64
dtypes: float64(1)
Getting GOOG stock prices
google.head()
price
date
2015-01-02 524.81
2015-01-05 513.87
2015-01-06 501.96
2015-01-07 501.10
2015-01-08 502.68
.shift(): Moving data between past & future
.shift():
defaults to periods=1
1 period into future
google['shifted'] = google.price.shift()  # default: periods=1
google.head(3)
price shifted
date
2015-01-02 524.81 NaN
2015-01-05 513.87 524.81
2015-01-06 501.96 513.87
.shift(): Moving data between past & future
.shift(periods=-1):
lagged data
1 period back in time
google['lagged'] = google.price.shift(periods=-1)
google[['price', 'lagged', 'shifted']].tail(3)
price lagged shifted
date
2016-12-28 785.05 782.79 791.55
2016-12-29 782.79 771.82 785.05
2016-12-30 771.82 NaN 782.79
Calculate one-period percent change
x_t / x_{t-1}
google['change'] = google.price.div(google.shifted)
google[['price', 'shifted', 'change']].head(3)
price shifted change
Date
2017-01-03 786.14 NaN NaN
2017-01-04 786.90 786.14 1.000967
2017-01-05 794.02 786.90 1.009048
Calculate one-period percent change
google['return'] = google.change.sub(1).mul(100)
google[['price', 'shifted', 'change', 'return']].head(3)
price shifted change return
date
2015-01-02 524.81 NaN NaN NaN
2015-01-05 513.87 524.81 0.98 -2.08
2015-01-06 501.96 513.87 0.98 -2.32
.diff(): built-in time-series change
Difference in value for two adjacent periods
x_t - x_{t-1}
google['diff'] = google.price.diff()
google[['price', 'diff']].head(3)
price diff
date
2015-01-02 524.81 NaN
2015-01-05 513.87 -10.94
2015-01-06 501.96 -11.91
.pct_change(): built-in time-series % change
Percent change for two adjacent periods
\frac{x_t}{x_{t-1}} - 1
google['pct_change'] = google.price.pct_change().mul(100)
google[['price', 'return', 'pct_change']].head(3)
price return pct_change
date
2015-01-02 524.81 NaN NaN
2015-01-05 513.87 -2.08 -2.08
2015-01-06 501.96 -2.32 -2.32
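The manual calculations and the built-in methods should agree. A short sketch that checks this, using the first three prices from the slides (the variable names are illustrative):

```python
import pandas as pd

dates = pd.to_datetime(['2015-01-02', '2015-01-05', '2015-01-06'])
price = pd.Series([524.81, 513.87, 501.96], index=dates)

# Manual versions built from .shift()
shifted = price.shift()                              # previous period's price
manual_diff = price.sub(shifted)                     # x_t - x_{t-1}
manual_return = price.div(shifted).sub(1).mul(100)   # (x_t / x_{t-1} - 1) * 100

# Built-in equivalents
builtin_diff = price.diff()
builtin_return = price.pct_change().mul(100)

print(builtin_return.round(2).tolist())
```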
Looking ahead: Get multi-period returns
google['return_3d'] = google.price.pct_change(periods=3).mul(100)
google[['price', 'return_3d']].head()
price return_3d
date
2015-01-02 524.81 NaN
2015-01-05 513.87 NaN
2015-01-06 501.96 NaN
2015-01-07 501.10 -4.517825
2015-01-08 502.68 -2.177594
Percent change for two periods, 3 trading days apart
Let's practice!
Compare time series
growth rates
Comparing stock performance
Stock price series: hard to compare at different levels
Simple solution: normalize price series to start at 100
Divide all prices by first in series, multiply by 100
Same starting point
All prices relative to starting point
Difference to starting point in percentage points
Normalizing a single series (1)
google = pd.read_csv('[Link]', parse_dates=['date'], index_col='date')
google.head(3)
price
date
2010-01-04 313.06
2010-01-05 311.68
2010-01-06 303.83
first_price = google.price.iloc[0]  # int-based selection
first_price
313.06
first_price == google.loc['2010-01-04', 'price']
True
Normalizing a single series (2)
normalized = google.price.div(first_price).mul(100)
normalized.plot(title='Google Normalized Series')
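A minimal sketch of the normalization on the three prices shown earlier, so the result can be inspected without the data file or a plot:

```python
import pandas as pd

dates = pd.to_datetime(['2010-01-04', '2010-01-05', '2010-01-06'])
price = pd.Series([313.06, 311.68, 303.83], index=dates, name='price')

first_price = price.iloc[0]                     # integer-based selection of the first value
normalized = price.div(first_price).mul(100)    # series now starts at 100

# Distance from 100 is the change since the start, in percentage points
change_since_start = normalized.sub(100)

print(normalized.round(4).tolist())
```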
Normalizing multiple series (1)
prices = pd.read_csv('stock_prices.csv',
parse_dates=['date'],
index_col='date')
prices.info()
DatetimeIndex: 1761 entries, 2010-01-04 to 2016-12-30
Data columns (total 3 columns):
AAPL 1761 non-null float64
GOOG 1761 non-null float64
YHOO 1761 non-null float64
dtypes: float64(3)
prices.head(2)
AAPL GOOG YHOO
Date
2010-01-04 30.57 313.06 17.10
2010-01-05 30.63 311.68 17.23
Normalizing multiple series (2)
prices.iloc[0]
AAPL 30.57
GOOG 313.06
YHOO 17.10
Name: 2010-01-04 00:00:00, dtype: float64
normalized = prices.div(prices.iloc[0])
normalized.head(3)
AAPL GOOG YHOO
Date
2010-01-04 1.000000 1.000000 1.000000
2010-01-05 1.001963 0.995592 1.007602
2010-01-06 0.985934 0.970517 1.004094
.div() : automatic alignment of Series index & DataFrame
columns
Comparing with a benchmark (1)
index = pd.read_csv('[Link]', parse_dates=['date'], index_col='date')
index.info()
DatetimeIndex: 1826 entries, 2010-01-01 to 2016-12-30
Data columns (total 1 columns):
SP500 1762 non-null float64
dtypes: float64(1)
prices = pd.concat([prices, index], axis=1).dropna()
prices.info()
DatetimeIndex: 1761 entries, 2010-01-04 to 2016-12-30
Data columns (total 4 columns):
AAPL 1761 non-null float64
GOOG 1761 non-null float64
YHOO 1761 non-null float64
SP500 1761 non-null float64
dtypes: float64(4)
Comparing with a benchmark (2)
prices.head(1)
AAPL GOOG YHOO SP500
2010-01-04 30.57 313.06 17.10 1132.99
normalized = prices.div(prices.iloc[0]).mul(100)
normalized.plot()
Plotting performance difference
diff = normalized[tickers].sub(normalized['SP500'], axis=0)
GOOG YHOO AAPL
2010-01-04 0.000000 0.000000 0.000000
2010-01-05 -0.752375 0.448669 -0.115294
2010-01-06 -3.314604 0.043069 -1.772895
.sub(..., axis=0) : Subtract a Series from each DataFrame
column by aligning indexes
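The two alignment rules (`.div()` with a row aligns on column labels, `.sub(..., axis=0)` aligns on the row index) can be sketched on two rows. The first row and the stock prices come from the slides; the second SP500 value is an assumed figure added for illustration.

```python
import pandas as pd

dates = pd.to_datetime(['2010-01-04', '2010-01-05'])
prices = pd.DataFrame({
    'AAPL': [30.57, 30.63],
    'GOOG': [313.06, 311.68],
    'YHOO': [17.10, 17.23],
    'SP500': [1132.99, 1136.52],   # second value is an assumption, not from the slides
}, index=dates)

# Divide every row by the first row: the Series index aligns with the column labels
normalized = prices.div(prices.iloc[0]).mul(100)

# Subtract the benchmark column from each stock column: align on the row index
tickers = ['AAPL', 'GOOG', 'YHOO']
diff = normalized[tickers].sub(normalized['SP500'], axis=0)

print(diff.round(2))
```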
Plotting performance difference
diff.plot()
Let's practice!
Changing the time
series frequency:
resampling
Changing the frequency: resampling
DatetimeIndex: set & change freq using .asfreq()
But frequency conversion affects the data
Upsampling: fill or interpolate missing data
Downsampling: aggregate existing data
pandas API:
.asfreq() , .reindex()
.resample() + transformation method
Getting started: quarterly data
dates = pd.date_range(start='2016', periods=4, freq='Q')
data = range(1, 5)
quarterly = pd.Series(data=data, index=dates)
quarterly
2016-03-31 1
2016-06-30 2
2016-09-30 3
2016-12-31 4
Freq: Q-DEC, dtype: int64 # Default: year-end quarters
Upsampling: quarter => month
monthly = quarterly.asfreq('M')  # to month-end frequency
2016-03-31 1.0
2016-04-30 NaN
2016-05-31 NaN
2016-06-30 2.0
2016-07-31 NaN
2016-08-31 NaN
2016-09-30 3.0
2016-10-31 NaN
2016-11-30 NaN
2016-12-31 4.0
Freq: M, dtype: float64
Upsampling creates missing values
monthly = monthly.to_frame('baseline') # to DataFrame
Upsampling: fill methods
monthly['ffill'] = quarterly.asfreq('M', method='ffill')
monthly['bfill'] = quarterly.asfreq('M', method='bfill')
monthly['value'] = quarterly.asfreq('M', fill_value=0)
Upsampling: fill methods
bfill: backfill
ffill: forward fill
baseline ffill bfill value
2016-03-31 1.0 1 1 1
2016-04-30 NaN 1 2 0
2016-05-31 NaN 1 2 0
2016-06-30 2.0 2 2 2
2016-07-31 NaN 2 3 0
2016-08-31 NaN 2 3 0
2016-09-30 3.0 3 3 3
2016-10-31 NaN 3 4 0
2016-11-30 NaN 3 4 0
2016-12-31 4.0 4 4 4
Add missing months: .reindex()
.reindex(): conform DataFrame to new index; same filling logic as .asfreq()
dates = pd.date_range(start='2016', periods=12, freq='M')
dates
DatetimeIndex(['2016-01-31', '2016-02-29', ..., '2016-11-30', '2016-12-31'],
dtype='datetime64[ns]', freq='M')
quarterly.reindex(dates)
2016-01-31 NaN
2016-02-29 NaN
2016-03-31 1.0
2016-04-30 NaN
2016-05-31 NaN
2016-06-30 2.0
2016-07-31 NaN
2016-08-31 NaN
2016-09-30 3.0
2016-10-31 NaN
2016-11-30 NaN
2016-12-31 4.0
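A runnable sketch of the upsampling and fill options. It deliberately uses period-start aliases ('QS', 'MS') instead of the period-end aliases on the slides, because the period-end spellings differ between pandas versions; the logic is otherwise the same.

```python
import pandas as pd

# Quarterly series with values 1..4, dated at the start of each quarter
dates = pd.date_range(start='2016-01-01', periods=4, freq='QS')
quarterly = pd.Series(data=range(1, 5), index=dates)

# Upsample to monthly: new dates have no data unless a fill rule is given
monthly = quarterly.asfreq('MS').to_frame('baseline')
monthly['ffill'] = quarterly.asfreq('MS', method='ffill')   # carry last value forward
monthly['bfill'] = quarterly.asfreq('MS', method='bfill')   # pull next value backward
monthly['value'] = quarterly.asfreq('MS', fill_value=0)     # constant for new dates

# .reindex() conforms the series to any index, here all 12 months of the year
full_year = pd.date_range(start='2016-01-01', periods=12, freq='MS')
reindexed = quarterly.reindex(full_year)

print(monthly)
```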
Let's practice!
Upsampling &
interpolation with
.resample()
Frequency conversion & transformation methods
.resample(): similar to .groupby()
Groups data within resampling period and applies one or
several methods to each group
New date determined by offset - start, end, etc.
Upsampling: fill from existing or interpolate values
Downsampling: apply aggregation to existing data
Getting started: monthly unemployment rate
unrate = pd.read_csv('[Link]', parse_dates=['Date'], index_col='Date')
unrate.info()
DatetimeIndex: 208 entries, 2000-01-01 to 2017-04-01
Data columns (total 1 columns):
UNRATE 208 non-null float64 # no frequency information
dtypes: float64(1)
unrate.head()
UNRATE
DATE
2000-01-01 4.0
2000-02-01 4.1
2000-03-01 4.0
2000-04-01 3.8
2000-05-01 4.0
Reporting date: 1st day of month
Resampling Period & Frequency Offsets
Resample creates new date for frequency offset
Several alternatives to calendar month end
Frequency             Alias  Sample Date
Calendar Month End    M      2017-04-30
Calendar Month Start  MS     2017-04-01
Business Month End    BM     2017-04-28
Business Month Start  BMS    2017-04-03
Resampling logic
Assign frequency with .resample()
unrate.asfreq('MS').info()
DatetimeIndex: 208 entries, 2000-01-01 to 2017-04-01
Freq: MS
Data columns (total 1 columns):
UNRATE 208 non-null float64
dtypes: float64(1)
unrate.resample('MS')  # creates Resampler object
DatetimeIndexResampler [freq=<MonthBegin>, axis=0, closed=left,
label=left, convention=start, base=0]
Assign frequency with .resample()
unrate.asfreq('MS').equals(unrate.resample('MS').asfreq())
True
.resample() : returns data only when calling another method
Quarterly real GDP growth
gdp = pd.read_csv('[Link]')
gdp.info()
DatetimeIndex: 69 entries, 2000-01-01 to 2017-01-01
Data columns (total 1 columns):
gpd 69 non-null float64 # no frequency info
dtypes: float64(1)
gdp.head(2)
gpd
DATE
2000-01-01 1.2
2000-04-01 7.8
Interpolate monthly real GDP growth
gdp_1 = gdp.resample('MS').ffill().add_suffix('_ffill')
gpd_ffill
DATE
2000-01-01 1.2
2000-02-01 1.2
2000-03-01 1.2
2000-04-01 7.8
Interpolate monthly real GDP growth
gdp_2 = gdp.resample('MS').interpolate().add_suffix('_inter')
gpd_inter
DATE
2000-01-01 1.200000
2000-02-01 3.400000
2000-03-01 5.600000
2000-04-01 7.800000
.interpolate(): finds points on straight line between
existing data
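Forward fill and linear interpolation can be compared side by side on the first two quarterly values from the slides. A minimal sketch; the column name `gdp` is chosen for illustration.

```python
import pandas as pd

dates = pd.to_datetime(['2000-01-01', '2000-04-01'])
gdp = pd.DataFrame({'gdp': [1.2, 7.8]}, index=dates)

# Upsample quarterly -> monthly with two different rules for the new months
gdp_ffill = gdp.resample('MS').ffill().add_suffix('_ffill')        # repeat last known value
gdp_inter = gdp.resample('MS').interpolate().add_suffix('_inter')  # straight line between values

combined = pd.concat([gdp_ffill, gdp_inter], axis=1)   # align the two results on the index

print(combined)
```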
Concatenating two DataFrames
df1 = pd.DataFrame([1, 2, 3], columns=['df1'])
df2 = pd.DataFrame([4, 5, 6], columns=['df2'])
pd.concat([df1, df2])
df1 df2
0 1.0 NaN
1 2.0 NaN
2 3.0 NaN
0 NaN 4.0
1 NaN 5.0
2 NaN 6.0
Concatenating two DataFrames
pd.concat([df1, df2], axis=1)
df1 df2
0 1 4
1 2 5
2 3 6
axis=1 : concatenate horizontally
Plot interpolated real GDP growth
pd.concat([gdp_1, gdp_2], axis=1).loc['2015':].plot()
Combine GDP growth & unemployment
pd.concat([unrate, gdp_inter], axis=1).plot();
Let's practice!
Downsampling &
aggregation
Downsampling & aggregation methods
So far: upsampling, fill logic & interpolation
Now: downsampling
hour to day
day to month, etc
How to represent the existing values at the new date?
Mean, median, last value?
Air quality: daily ozone levels
ozone = pd.read_csv('[Link]',
parse_dates=['date'],
index_col='date')
ozone.info()
DatetimeIndex: 6291 entries, 2000-01-01 to 2017-03-31
Data columns (total 1 columns):
Ozone 6167 non-null float64
dtypes: float64(1)
ozone = ozone.resample('D').asfreq()
ozone.info()
DatetimeIndex: 6300 entries, 2000-01-01 to 2017-03-31
Freq: D
Data columns (total 1 columns):
Ozone 6167 non-null float64
dtypes: float64(1)
Creating monthly ozone data
.resample().mean(): Monthly average, assigned to end of calendar month
ozone.resample('M').mean().head()
Ozone
date
2000-01-31 0.010443
2000-02-29 0.011817
2000-03-31 0.016810
2000-04-30 0.019413
2000-05-31 0.026535
ozone.resample('M').median().head()
Ozone
date
2000-01-31 0.009486
2000-02-29 0.010726
2000-03-31 0.017004
2000-04-30 0.019866
2000-05-31 0.026018
Creating monthly ozone data
ozone.resample('M').agg(['mean', 'std']).head()
Ozone
mean std
date
2000-01-31 0.010443 0.004755
2000-02-29 0.011817 0.004072
2000-03-31 0.016810 0.004977
2000-04-30 0.019413 0.006574
2000-05-31 0.026535 0.008409
.resample().agg() : List of aggregation functions like
groupby
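A small downsampling sketch on hypothetical daily readings (constant within each month, so the expected aggregates are easy to verify). It labels months by their start ('MS') because that alias is spelled the same across pandas versions.

```python
import pandas as pd

# Hypothetical daily data: 1.0 throughout January 2000, 3.0 throughout February 2000
days = pd.date_range(start='2000-01-01', end='2000-02-29', freq='D')
ozone = pd.DataFrame({'Ozone': [1.0] * 31 + [3.0] * 29}, index=days)

# Downsample day -> month with one or several aggregations per group
monthly_mean = ozone.resample('MS').mean()
monthly_stats = ozone.resample('MS').agg(['mean', 'std'])   # columns become a MultiIndex
monthly_first = ozone.resample('MS').first()                # first observation of each month

print(monthly_stats)
```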
Plotting resampled ozone data
ozone = ozone.loc['2016':]
ax = ozone.plot()
monthly = ozone.resample('M').mean()
monthly.add_suffix('_monthly').plot(ax=ax)
Resampling multiple time series
data = pd.read_csv('ozone_pm25.csv',
parse_dates=['date'],
index_col='date')
data = data.resample('D').asfreq()
data.info()
DatetimeIndex: 6300 entries, 2000-01-01 to 2017-03-31
Freq: D
Data columns (total 2 columns):
Ozone 6167 non-null float64
PM25 6167 non-null float64
dtypes: float64(2)
Resampling multiple time series
data = data.resample('BM').mean()
data.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 207 entries, 2000-01-31 to 2017-03-31
Freq: BM
Data columns (total 2 columns):
ozone 207 non-null float64
pm25 207 non-null float64
dtypes: float64(2)
Resampling multiple time series
data.resample('M').first().head(4)
Ozone PM25
date
2000-01-31 0.005545 20.800000
2000-02-29 0.016139 6.500000
2000-03-31 0.017004 8.493333
2000-04-30 0.031354 6.889474
data.resample('MS').first().head()
Ozone PM25
date
2000-01-01 0.004032 37.320000
2000-02-01 0.010583 24.800000
2000-03-01 0.007418 11.106667
2000-04-01 0.017631 11.700000
2000-05-01 0.022628 9.700000
Let's practice!
Rolling window
functions with
pandas
Window functions in pandas
Windows identify sub periods of your time series
Calculate metrics for sub periods inside the window
Create a new time series of metrics
Two types of windows:
Rolling: same size, sliding (this video)
Expanding: contain all prior values (next video)
Calculating a rolling average
data = pd.read_csv('[Link]', parse_dates=['date'], index_col='date')
DatetimeIndex: 1761 entries, 2010-01-04 to 2016-12-30
Data columns (total 1 columns):
price 1761 non-null float64
dtypes: float64(1)
Calculating a rolling average
# Integer-based window size
data.rolling(window=30).mean()  # fixed # observations
DatetimeIndex: 1761 entries, 2010-01-04 to 2017-05-24
Data columns (total 1 columns):
price 1732 non-null float64
dtypes: float64(1)
window=30: # business days
min_periods: choose value < 30 to get results for first days
Calculating a rolling average
# Offset-based window size
data.rolling(window='30D').mean()  # fixed period length
DatetimeIndex: 1761 entries, 2010-01-04 to 2017-05-24
Data columns (total 1 columns):
price 1761 non-null float64
dtypes: float64(1)
30D : # calendar days
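The difference between the two window types shows up around weekends. A sketch on ten hypothetical business-day prices (1.0 to 10.0), with a 3-observation window versus a 3-calendar-day window:

```python
import pandas as pd

# Ten business days starting Monday 2010-01-04, with made-up prices 1.0 .. 10.0
days = pd.date_range(start='2010-01-04', periods=10, freq='B')
data = pd.DataFrame({'price': [float(i) for i in range(1, 11)]}, index=days)

fixed_count = data.rolling(window=3).mean()                 # needs 3 observations per window
with_min = data.rolling(window=3, min_periods=1).mean()     # results from the first row on
by_offset = data.rolling(window='3D').mean()                # rows within the last 3 calendar days

# On the second Monday the offset window sees only that day,
# while the integer window still spans the previous Thursday and Friday
second_monday = days[5]
print(fixed_count.loc[second_monday, 'price'], by_offset.loc[second_monday, 'price'])
```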
90 day rolling mean
r90 = data.rolling(window='90D').mean()
data.join(r90.add_suffix('_mean_90')).plot()
90 & 360 day rolling means
data['mean90'] = r90
r360 = data['price'].rolling(window='360D').mean()
data['mean360'] = r360; data.plot()
Multiple rolling metrics (1)
r = data.price.rolling('90D').agg(['mean', 'std'])
r.plot(subplots=True)
Multiple rolling metrics (2)
rolling = data.price.rolling('360D')
q10 = rolling.quantile(0.1).to_frame('q10')
median = rolling.median().to_frame('median')
q90 = rolling.quantile(0.9).to_frame('q90')
pd.concat([q10, median, q90], axis=1).plot()
Let's practice!
Expanding window
functions with
pandas
Expanding windows in pandas
From rolling to expanding windows
Calculate metrics for periods up to current date
New time series reflects all historical values
Useful for running rate of return, running min/max
Two options with pandas:
.expanding() - just like .rolling()
.cumsum(), .cumprod(), .cummin() / .cummax()
The basic idea
df = pd.DataFrame({'data': range(5)})
df['expanding sum'] = df.data.expanding().sum()
df['cumulative sum'] = df.data.cumsum()
df
data expanding sum cumulative sum
0 0 0.0 0
1 1 1.0 1
2 2 3.0 3
3 3 6.0 6
4 4 10.0 10
Get data for the S&P 500
data = pd.read_csv('[Link]', parse_dates=['date'], index_col='date')
DatetimeIndex: 2519 entries, 2007-05-24 to 2017-05-24
Data columns (total 1 columns):
SP500 2519 non-null float64
How to calculate a running return
Single period return r_t: current price over last price minus 1:
r_t = \frac{P_t}{P_{t-1}} - 1
Multi-period return: product of (1 + r_t) for all periods,
minus 1:
R_T = (1 + r_1)(1 + r_2) \cdots (1 + r_T) - 1
For the period return: .pct_change()
For basic math .add() , .sub() , .mul() , .div()
For cumulative product: .cumprod()
Running rate of return in practice
pr = data.SP500.pct_change()  # period return
pr_plus_one = pr.add(1)
cumulative_return = pr_plus_one.cumprod().sub(1)
cumulative_return.mul(100).plot()
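A quick check of the running-return recipe on four hypothetical prices: compounding the period returns with `.cumprod()` should match the direct ratio of each price to the first price.

```python
import pandas as pd

price = pd.Series([100.0, 110.0, 99.0, 108.9])   # made-up prices: +10%, -10%, +10%

pr = price.pct_change()                           # period returns r_t (first value is NaN)
cumulative_return = pr.add(1).cumprod().sub(1)    # (1 + r_1)...(1 + r_T) - 1

# Cross-check: total return relative to the first price
direct = price.div(price.iloc[0]).sub(1)

# Running maximum via an expanding window
running_max = price.expanding().max()

print(cumulative_return.round(4).tolist())
```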
Getting the running min & max
data['running_min'] = data.SP500.expanding().min()
data['running_max'] = data.SP500.expanding().max()
data.plot()
Rolling annual rate of return
import numpy as np
def multi_period_return(period_returns):
    return np.prod(period_returns + 1) - 1
pr = data.SP500.pct_change()  # period return
r = pr.rolling('360D').apply(multi_period_return)
data['Rolling 1yr Return'] = r.mul(100)
data.plot(subplots=True)
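The same rolling-apply pattern on a short, hypothetical series with a 2-day window, so the compounded result can be verified by hand. Dropping the initial NaN before rolling and passing `raw=True` are choices made for this sketch, not taken from the slides.

```python
import numpy as np
import pandas as pd


def multi_period_return(period_returns):
    """Compound a window of period returns into one multi-period return."""
    return np.prod(period_returns + 1) - 1


# Made-up prices that grow by 10% every day
days = pd.date_range('2017-01-02', periods=5, freq='D')
price = pd.Series([100.0, 110.0, 121.0, 133.1, 146.41], index=days)

pr = price.pct_change().dropna()                                # daily returns of 10%
rolling_2d = pr.rolling('2D').apply(multi_period_return, raw=True)

# First window holds one return (10%); later windows hold two (1.1 * 1.1 - 1 = 21%)
print(rolling_2d.round(4).tolist())
```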
Let's practice!
Case study: S&P500
price simulation
Random walks & simulations
Daily stock returns are hard to predict
Models often assume they are random in nature
Numpy allows you to generate random numbers
From random returns to prices: use .cumprod()
Two examples:
Generate random returns
Randomly selected actual SP500 returns
Generate random numbers
from numpy.random import normal, seed
from scipy.stats import norm
import seaborn as sns
seed(42)
random_returns = normal(loc=0, scale=0.01, size=1000)
sns.distplot(random_returns, fit=norm, kde=False)
Create a random price path
return_series = pd.Series(random_returns)
random_prices = return_series.add(1).cumprod().sub(1)
random_prices.mul(100).plot()
S&P 500 prices & returns
data = pd.read_csv('[Link]', parse_dates=['date'], index_col='date')
data['returns'] = data.SP500.pct_change()
data.plot(subplots=True)
S&P return distribution
sns.distplot(data.returns.dropna().mul(100), fit=norm)
Generate random S&P 500 returns
from numpy.random import choice
sample = data.returns.dropna()
n_obs = data.returns.count()
random_walk = choice(sample, size=n_obs)
random_walk = pd.Series(random_walk, index=sample.index)
random_walk.head()
DATE
2007-05-29 -0.008357
2007-05-30 0.003702
2007-05-31 -0.013990
2007-06-01 0.008096
2007-06-04 0.013120
Random S&P 500 prices (1)
start = data.SP500.first('D')
DATE
2007-05-25 1515.73
Name: SP500, dtype: float64
sp500_random = pd.concat([start, random_walk.add(1)])  # Series.append() was removed in pandas 2.0
sp500_random.head()
DATE
2007-05-25 1515.730000
2007-05-29 0.998290
2007-05-30 0.995190
2007-05-31 0.997787
2007-06-01 0.983853
dtype: float64
Random S&P 500 prices (2)
data['SP500_random'] = sp500_random.cumprod()
data[['SP500', 'SP500_random']].plot()
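The whole path from random returns to a simulated price series can be sketched without the data file. This version draws normally distributed returns with NumPy's `default_rng` (an assumed substitute for the `seed()`/`choice()` calls on the slides) and starts from the first price shown above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)   # seeded generator so the run is repeatable

# Hypothetical daily returns on 250 business days
days = pd.date_range('2007-05-29', periods=250, freq='B')
random_walk = pd.Series(rng.normal(loc=0, scale=0.01, size=len(days)), index=days)

# Prepend the starting price, then compound (1 + r_t) day by day
start = pd.Series([1515.73], index=[pd.Timestamp('2007-05-25')])
sp500_random = pd.concat([start, random_walk.add(1)]).cumprod()

print(sp500_random.head())
```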
Let's practice!
Relationships
between time series:
correlation
Correlation & relations between series
So far, focus on characteristics of individual variables
Now: characteristic of relations between variables
Correlation: measures linear relationships
Financial markets: important for prediction and risk
management
pandas & seaborn have tools to compute & visualize
Correlation & linear relationships
Correlation coefficient: how similar is the pairwise movement
of two variables around their averages?
Varies between -1 and +1
r = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{s_x s_y}
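A sketch that computes the coefficient by hand and compares it with `.corr()`. One assumption is made explicit here: with pandas' default sample standard deviation, the sum of cross-products is also divided by N - 1. The return values are made up.

```python
import pandas as pd

# Hypothetical daily returns for two assets
returns = pd.DataFrame({
    'a': [0.01, -0.02, 0.03, 0.00],
    'b': [0.02, -0.01, 0.02, 0.01],
})

x, y = returns['a'], returns['b']
n = len(returns)

# Sum of co-movements around the means, scaled by (N - 1) and both standard deviations
cross_products = ((x - x.mean()) * (y - y.mean())).sum()
manual = cross_products / ((n - 1) * x.std() * y.std())

# Built-in pairwise correlation matrix
builtin = returns.corr().loc['a', 'b']

print(round(manual, 6), round(builtin, 6))
```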
Importing five price time series
data = pd.read_csv('[Link]', parse_dates=['date'],
index_col='date')
data = data.dropna()
data.info()
DatetimeIndex: 2469 entries, 2007-05-25 to 2017-05-22
Data columns (total 5 columns):
sp500 2469 non-null float64
nasdaq 2469 non-null float64
bonds 2469 non-null float64
gold 2469 non-null float64
oil 2469 non-null float64
Visualize pairwise linear relationships
daily_returns = data.pct_change()
sns.jointplot(x='sp500', y='nasdaq', data=daily_returns);
Calculate all correlations
correlations = daily_returns.corr()
correlations
bonds oil gold sp500 nasdaq
bonds 1.000000 -0.183755 0.003167 -0.300877 -0.306437
oil -0.183755 1.000000 0.105930 0.335578 0.289590
gold 0.003167 0.105930 1.000000 -0.007786 -0.002544
sp500 -0.300877 0.335578 -0.007786 1.000000 0.959990
nasdaq -0.306437 0.289590 -0.002544 0.959990 1.000000
Visualize all correlations
sns.heatmap(correlations, annot=True)
Let's practice!
Select index
components &
import data
Market value-weighted index
Composite performance of various stocks
Components weighted by market capitalization
Share Price x Number of Shares => Market Value
Larger components get higher percentage weightings
Key market indexes are value-weighted:
S&P 500 , NASDAQ , Wilshire 5000 , Hang Seng
Build a cap-weighted Index
Apply new skills to construct value-weighted index
Select components from exchange listing data
Get component number of shares and stock prices
Calculate component weights
Calculate index
Evaluate performance of components and index
Load stock listing data
nyse = pd.read_excel('[Link]', sheet_name='nyse',
na_values='n/a')
nyse.info()
RangeIndex: 3147 entries, 0 to 3146
Data columns (total 7 columns):
Stock Symbol 3147 non-null object # Stock Ticker
Company Name 3147 non-null object
Last Sale 3079 non-null float64 # Latest Stock Price
Market Capitalization 3147 non-null float64
IPO Year 1361 non-null float64 # Year of listing
Sector 2177 non-null object
Industry 2177 non-null object
dtypes: float64(3), object(4)
Load & prepare listing data
nyse.set_index('Stock Symbol', inplace=True)
nyse.dropna(subset=['Sector'], inplace=True)
nyse['Market Capitalization'] /= 1e6  # in Million USD
Index: 2177 entries, DDD to ZTO
Data columns (total 6 columns):
Company Name 2177 non-null object
Last Sale 2175 non-null float64
Market Capitalization 2177 non-null float64
IPO Year 967 non-null float64
Sector 2177 non-null object
Industry 2177 non-null object
dtypes: float64(3), object(3)
MANIPULATING TIME SERIES DATA IN PYTHON
Select index components
components = nyse.groupby(['Sector'])['Market Capitalization'].nlargest(1)
components.sort_values(ascending=False)
Sector Stock Symbol
Health Care JNJ 338834.390080
Energy XOM 338728.713874
Finance JPM 300283.250479
Miscellaneous BABA 275525.000000
Public Utilities T 247339.517272
Basic Industries PG 230159.644117
Consumer Services WMT 221864.614129
Consumer Non-Durables KO 183655.305119
Technology ORCL 181046.096000
Capital Goods TM 155660.252483
Transportation UPS 90180.886756
Consumer Durables ABB 48398.935676
Name: Market Capitalization, dtype: float64
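A minimal sketch of the selection step on made-up data, assuming a listing table indexed by ticker. The names `listings`, `largest`, and `toy_tickers` and all values are hypothetical; the point is only that `groupby(...).nlargest(1)` returns a Series indexed by (Sector, Stock Symbol), from which the ticker level can be extracted.

```python
import pandas as pd

# Hypothetical listing data (values are invented for illustration only)
listings = pd.DataFrame(
    {
        "Sector": ["Energy", "Energy", "Technology", "Technology"],
        "Market Capitalization": [300.0, 120.0, 180.0, 250.0],
    },
    index=pd.Index(["AAA", "BBB", "CCC", "DDD"], name="Stock Symbol"),
)

# Largest company per sector; the result carries a two-level index
# made of the group key (Sector) and the original index (Stock Symbol)
largest = listings.groupby("Sector")["Market Capitalization"].nlargest(1)

# Extract the ticker level from the MultiIndex as a plain list
toy_tickers = largest.index.get_level_values("Stock Symbol").tolist()

print(largest.sort_values(ascending=False))
print(toy_tickers)
```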
MANIPULATING TIME SERIES DATA IN PYTHON
Import & prepare listing data
tickers = components.index.get_level_values('Stock Symbol')
tickers
Index(['PG', 'TM', 'ABB', 'KO', 'WMT', 'XOM', 'JPM', 'JNJ', 'BABA', 'T',
'ORCL', 'UPS'], dtype='object', name='Stock Symbol')
tickers.tolist()
['PG',
'TM',
'ABB',
'KO',
'WMT',
...
'T',
'ORCL',
'UPS']
MANIPULATING TIME SERIES DATA IN PYTHON
Stock index components
columns = ['Company Name', 'Market Capitalization', 'Last Sale']
component_info = nyse.loc[tickers, columns]
pd.options.display.float_format = '{:,.2f}'.format
Company Name Market Capitalization Last Sale
Stock Symbol
PG Procter & Gamble Company (The) 230,159.64 90.03
TM Toyota Motor Corp Ltd Ord 155,660.25 104.18
ABB ABB Ltd 48,398.94 22.63
KO Coca-Cola Company (The) 183,655.31 42.79
WMT Wal-Mart Stores, Inc. 221,864.61 73.15
XOM Exxon Mobil Corporation 338,728.71 81.69
JPM J P Morgan Chase & Co 300,283.25 84.40
JNJ Johnson & Johnson 338,834.39 124.99
BABA Alibaba Group Holding Limited 275,525.00 110.21
T AT&T Inc. 247,339.52 40.28
ORCL Oracle Corporation 181,046.10 44.00
UPS United Parcel Service, Inc. 90,180.89 103.74
MANIPULATING TIME SERIES DATA IN PYTHON
Import & prepare listing data
data = pd.read_csv('[Link]', parse_dates=['Date'],
index_col='Date').loc[:, tickers.tolist()]
data.info()
DatetimeIndex: 252 entries, 2016-01-04 to 2016-12-30
Data columns (total 12 columns):
ABB 252 non-null float64
BABA 252 non-null float64
JNJ 252 non-null float64
JPM 252 non-null float64
KO 252 non-null float64
ORCL 252 non-null float64
PG 252 non-null float64
T 252 non-null float64
TM 252 non-null float64
UPS 252 non-null float64
WMT 252 non-null float64
XOM 252 non-null float64
dtypes: float64(12)
MANIPULATING TIME SERIES DATA IN PYTHON
Let's practice!
M A N I P U L AT I N G T I M E S E R I E S D ATA I N P Y T H O N
Build a market-cap
weighted index
M A N I P U L AT I N G T I M E S E R I E S D ATA I N P Y T H O N
Stefan Jansen
Founder & Lead Data Scientist at
Applied Artificial Intelligence
Build your value-weighted index
Key inputs:
number of shares
stock price series
Normalize index to start
at 100
MANIPULATING TIME SERIES DATA IN PYTHON
Stock index components
components
Company Name Market Capitalization Last Sale
Stock Symbol
PG Procter & Gamble Company (The) 230,159.64 90.03
TM Toyota Motor Corp Ltd Ord 155,660.25 104.18
ABB ABB Ltd 48,398.94 22.63
KO Coca-Cola Company (The) 183,655.31 42.79
WMT Wal-Mart Stores, Inc. 221,864.61 73.15
XOM Exxon Mobil Corporation 338,728.71 81.69
JPM J P Morgan Chase & Co 300,283.25 84.40
JNJ Johnson & Johnson 338,834.39 124.99
BABA Alibaba Group Holding Limited 275,525.00 110.21
T AT&T Inc. 247,339.52 40.28
ORCL Oracle Corporation 181,046.10 44.00
UPS United Parcel Service, Inc. 90,180.89 103.74
MANIPULATING TIME SERIES DATA IN PYTHON
Number of shares outstanding
shares = components['Market Capitalization'].div(components['Last Sale'])
Stock Symbol
PG 2,556.48 # Outstanding shares in million
TM 1,494.15
ABB 2,138.71
KO 4,292.01
WMT 3,033.01
XOM 4,146.51
JPM 3,557.86
JNJ 2,710.89
BABA 2,500.00
T 6,140.50
ORCL 4,114.68
UPS 869.30
dtype: float64
Market Capitalization = Number of Shares x Share Price
MANIPULATING TIME SERIES DATA IN PYTHON
Historical stock prices
data = pd.read_csv('[Link]', parse_dates=['Date'],
index_col='Date').loc[:, tickers.tolist()]
market_cap_series = data.mul(shares) # price x number of shares
market_cap_series.info()
DatetimeIndex: 252 entries, 2016-01-04 to 2016-12-30
Data columns (total 12 columns):
ABB 252 non-null float64
BABA 252 non-null float64
JNJ 252 non-null float64
JPM 252 non-null float64
...
TM 252 non-null float64
UPS 252 non-null float64
WMT 252 non-null float64
XOM 252 non-null float64
dtypes: float64(12)
MANIPULATING TIME SERIES DATA IN PYTHON
From stock prices to market value
market_cap_series.first('D').append(market_cap_series.last('D'))
ABB BABA JNJ JPM KO ORCL \\
Date
2016-01-04 37,470.14 191,725.00 272,390.43 226,350.95 181,981.42 147,099.95
2016-12-30 45,062.55 219,525.00 312,321.87 307,007.60 177,946.93 158,209.60
PG T TM UPS WMT XOM
Date
2016-01-04 200,351.12 210,926.33 181,479.12 82,444.14 186,408.74 321,188.96
2016-12-30 214,948.60 261,155.65 175,114.05 99,656.23 209,641.59 374,264.34
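Recent pandas releases removed `DataFrame.append` (2.0) and deprecated `.first()` / `.last()` (2.1), so the first-and-last-row selection may need a different spelling there. A minimal sketch on made-up data, assuming one row per date; `caps` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical market-cap series with one row per date
dates = pd.date_range("2016-01-04", periods=5, freq="D")
caps = pd.DataFrame(
    {"AAA": [1.0, 2.0, 3.0, 4.0, 5.0], "BBB": [10.0, 9.0, 8.0, 7.0, 6.0]},
    index=dates,
)

# Select the first and last rows by position
first_and_last = caps.iloc[[0, -1]]

# Equivalent: concatenate two one-row slices
also_first_and_last = pd.concat([caps.iloc[:1], caps.iloc[-1:]])

# Change in value per column between the first and last date
change_per_stock = first_and_last.diff().iloc[-1]
print(change_per_stock.sort_values())
```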
MANIPULATING TIME SERIES DATA IN PYTHON
Aggregate market value per period
agg_mcap = market_cap_series.sum(axis=1) # Total market cap
agg_mcap.plot(title='Aggregate Market Cap')
MANIPULATING TIME SERIES DATA IN PYTHON
Value-based index
index = agg_mcap.div(agg_mcap.iloc[0]).mul(100) # Divide by 1st value
index.plot(title='Market-Cap Weighted Index')
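The steps from prices to a normalized index can be sketched end to end on toy numbers. The tickers, prices, and share counts here are invented; only the sequence of operations (multiply by shares, sum across stocks, divide by the first value, scale to 100) follows the slides.

```python
import pandas as pd

# Hypothetical prices for two stocks over three days
dates = pd.date_range("2016-01-04", periods=3, freq="D")
prices = pd.DataFrame(
    {"AAA": [10.0, 11.0, 12.0], "BBB": [20.0, 19.0, 21.0]}, index=dates
)

# Hypothetical shares outstanding (in millions), indexed by ticker
toy_shares = pd.Series({"AAA": 100.0, "BBB": 50.0})

# Market value per stock and day: mul() aligns the Series on column labels
toy_market_cap = prices.mul(toy_shares)

# Aggregate market value per day, then normalize so the index starts at 100
toy_agg = toy_market_cap.sum(axis=1)
toy_index = toy_agg.div(toy_agg.iloc[0]).mul(100)
print(toy_index)
```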
MANIPULATING TIME SERIES DATA IN PYTHON
Let's practice!
M A N I P U L AT I N G T I M E S E R I E S D ATA I N P Y T H O N
Evaluate index
performance
M A N I P U L AT I N G T I M E S E R I E S D ATA I N P Y T H O N
Stefan Jansen
Founder & Lead Data Scientist at
Applied Artificial Intelligence
Evaluate your value-weighted index
Index return:
Total index return
Contribution by component
Performance vs Benchmark
Total period return
Rolling returns for sub periods
MANIPULATING TIME SERIES DATA IN PYTHON
Value-based index - recap
agg_market_cap = market_cap_series.sum(axis=1)
index = agg_market_cap.div(agg_market_cap.iloc[0]).mul(100)
index.plot(title='Market-Cap Weighted Index')
MANIPULATING TIME SERIES DATA IN PYTHON
Value contribution by stock
agg_market_cap.iloc[-1] - agg_market_cap.iloc[0]
315,037.71
MANIPULATING TIME SERIES DATA IN PYTHON
Value contribution by stock
change = market_cap_series.first('D').append(market_cap_series.last('D'))
change.diff().iloc[-1].sort_values() # or: .loc['2016-12-30']
TM -6,365.07
KO -4,034.49
ABB 7,592.41
ORCL 11,109.65
PG 14,597.48
UPS 17,212.08
WMT 23,232.85
BABA 27,800.00
JNJ 39,931.44
T 50,229.33
XOM 53,075.38
JPM 80,656.65
Name: 2016-12-30 [Link], dtype: float64
MANIPULATING TIME SERIES DATA IN PYTHON
Market-cap based weights
market_cap = components['Market Capitalization']
weights = market_cap.div(market_cap.sum())
weights.sort_values().mul(100)
Stock Symbol
ABB 1.85
UPS 3.45
TM 5.96
ORCL 6.93
KO 7.03
WMT 8.50
PG 8.81
T 9.47
BABA 10.55
JPM 11.50
XOM 12.97
JNJ 12.97
Name: Market Capitalization, dtype: float64
MANIPULATING TIME SERIES DATA IN PYTHON
Value-weighted component returns
index_return = (index.iloc[-1] / index.iloc[0] - 1) * 100
14.06
weighted_returns = weights.mul(index_return)
weighted_returns.sort_values().plot(kind='barh')
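A small sketch of the weighting arithmetic, using made-up market caps and the 14.06 index return from the slide. It apportions the total index return by weight, so the pieces add back up to the index return; it does not measure each stock's own return. The names `toy_cap`, `toy_weights`, and `toy_weighted` are hypothetical.

```python
import pandas as pd

# Hypothetical market capitalizations for three components
toy_cap = pd.Series({"AAA": 300.0, "BBB": 500.0, "CCC": 200.0})

# Weight of each component = its share of total market cap
toy_weights = toy_cap.div(toy_cap.sum())

# Total index return in percent (value taken from the slide)
toy_index_return = 14.06

# Split the index return across components in proportion to their weights
toy_weighted = toy_weights.mul(toy_index_return)
print(toy_weighted.sort_values())
```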
MANIPULATING TIME SERIES DATA IN PYTHON
Performance vs benchmark
data = index.to_frame('Index') # Convert pd.Series to pd.DataFrame
data['SP500'] = pd.read_csv('[Link]', parse_dates=['Date'],
index_col='Date')
data.SP500 = data.SP500.div(data.SP500.iloc[0], axis=0).mul(100)
MANIPULATING TIME SERIES DATA IN PYTHON
Performance vs benchmark: 30D rolling return
def multi_period_return(r):
    return (np.prod(r + 1) - 1) * 100
data.pct_change().rolling('30D').apply(multi_period_return).plot()
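To see what the rolling apply computes, here is a sketch on a short invented price series. It uses a fixed window of 3 observations instead of the `'30D'` offset (an assumption made to keep the example small) and checks that compounding the daily returns matches the direct price ratio over the same span.

```python
import numpy as np
import pandas as pd


def multi_period_return(r):
    # Compound the periodic returns in the window and express in percent
    return (np.prod(r + 1) - 1) * 100


# Hypothetical daily prices
dates = pd.date_range("2016-01-04", periods=6, freq="D")
toy_prices = pd.Series([100.0, 102.0, 101.0, 105.0, 107.0, 104.0], index=dates)

# Rolling 3-observation compounded return from daily percent changes
rolling_3 = toy_prices.pct_change().rolling(3).apply(multi_period_return)

# Direct 3-observation return from the price ratio, for comparison
direct_3 = (toy_prices / toy_prices.shift(3) - 1) * 100

print(pd.DataFrame({"rolling": rolling_3, "direct": direct_3}))
```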
MANIPULATING TIME SERIES DATA IN PYTHON
Let's practice!
M A N I P U L AT I N G T I M E S E R I E S D ATA I N P Y T H O N
Index correlation &
exporting to Excel
M A N I P U L AT I N G T I M E S E R I E S D ATA I N P Y T H O N
Stefan Jansen
Founder & Lead Data Scientist at
Applied Artificial Intelligence
Some additional analysis of your index
Daily return correlations:
Calculate among all components
Visualize the result as heatmap
Write results to Excel using .xls and .xlsx formats:
Single worksheet
Multiple worksheets
MANIPULATING TIME SERIES DATA IN PYTHON
Index components - price data
data = DataReader(tickers, 'google', start='2016', end='2017')['Close']
data.info()
DatetimeIndex: 252 entries, 2016-01-04 to 2016-12-30
Data columns (total 12 columns):
ABB 252 non-null float64
BABA 252 non-null float64
JNJ 252 non-null float64
JPM 252 non-null float64
KO 252 non-null float64
ORCL 252 non-null float64
PG 252 non-null float64
T 252 non-null float64
TM 252 non-null float64
UPS 252 non-null float64
WMT 252 non-null float64
XOM 252 non-null float64
MANIPULATING TIME SERIES DATA IN PYTHON
Index components: return correlations
daily_returns = data.pct_change()
correlations = daily_returns.corr()
ABB BABA JNJ JPM KO ORCL PG T TM UPS WMT XOM
ABB 1.00 0.40 0.33 0.56 0.31 0.53 0.34 0.29 0.48 0.50 0.15 0.48
BABA 0.40 1.00 0.27 0.27 0.25 0.38 0.21 0.17 0.34 0.35 0.13 0.21
JNJ 0.33 0.27 1.00 0.34 0.30 0.37 0.42 0.35 0.29 0.45 0.24 0.41
JPM 0.56 0.27 0.34 1.00 0.22 0.57 0.27 0.13 0.49 0.56 0.14 0.48
KO 0.31 0.25 0.30 0.22 1.00 0.31 0.62 0.47 0.33 0.50 0.25 0.29
ORCL 0.53 0.38 0.37 0.57 0.31 1.00 0.41 0.32 0.48 0.54 0.21 0.42
PG 0.34 0.21 0.42 0.27 0.62 0.41 1.00 0.43 0.32 0.47 0.33 0.34
T 0.29 0.17 0.35 0.13 0.47 0.32 0.43 1.00 0.28 0.41 0.31 0.33
TM 0.48 0.34 0.29 0.49 0.33 0.48 0.32 0.28 1.00 0.52 0.20 0.30
UPS 0.50 0.35 0.45 0.56 0.50 0.54 0.47 0.41 0.52 1.00 0.33 0.45
WMT 0.15 0.13 0.24 0.14 0.25 0.21 0.33 0.31 0.20 0.33 1.00 0.21
XOM 0.48 0.21 0.41 0.48 0.29 0.42 0.34 0.33 0.30 0.45 0.21 1.00
MANIPULATING TIME SERIES DATA IN PYTHON
Index components: return correlations
sns.heatmap(correlations, annot=True)
plt.xticks(rotation=45)
plt.title('Daily Return Correlations')
MANIPULATING TIME SERIES DATA IN PYTHON
Saving to a single Excel worksheet
correlations.to_excel(excel_writer= '[Link]',
sheet_name='correlations',
startrow=1,
startcol=1)
MANIPULATING TIME SERIES DATA IN PYTHON
Saving to multiple Excel worksheets
data.index = data.index.date # Keep only date component
with pd.ExcelWriter('stock_data.xlsx') as writer:
    correlations.to_excel(excel_writer=writer, sheet_name='correlations')
    data.to_excel(excel_writer=writer, sheet_name='prices')
    data.pct_change().to_excel(writer, sheet_name='returns')
MANIPULATING TIME SERIES DATA IN PYTHON
Let's practice!
M A N I P U L AT I N G T I M E S E R I E S D ATA I N P Y T H O N
Congratulations!
M A N I P U L AT I N G T I M E S E R I E S D ATA I N P Y T H O N
Stefan Jansen
Founder & Lead Data Scientist at
Applied Artificial Intelligence
Congratulations!
M A N I P U L AT I N G T I M E S E R I E S D ATA I N P Y T H O N
Introduction to the
Course
T I M E S E R I E S A N A LY S I S I N P Y T H O N
Rob Reider
Adjunct Professor, NYU-Courant
Consultant, Quantopian
Example of Time Series: Google Trends
TIME SERIES ANALYSIS IN PYTHON
Example of Time Series: Climate Data
TIME SERIES ANALYSIS IN PYTHON
Example of Time Series: Quarterly Earnings Data
TIME SERIES ANALYSIS IN PYTHON
Example of Multiple Series: Natural Gas and Heating
Oil
TIME SERIES ANALYSIS IN PYTHON
Goals of Course
Learn about time series models
Fit data to a time series model
Use the models to make forecasts of the future
Learn how to use the relevant statistical packages in Python
Provide concrete examples of how these models are used
TIME SERIES ANALYSIS IN PYTHON
Some Useful Pandas Tools
Changing an index to datetime
df.index = pd.to_datetime(df.index)
Plotting data
df.plot()
Slicing data
df.loc['2012']
TIME SERIES ANALYSIS IN PYTHON
Some Useful Pandas Tools
Join two DataFrames
df1.join(df2)
Resample data (e.g. from daily to weekly)
df = df.resample(rule='W').last()
TIME SERIES ANALYSIS IN PYTHON
More pandas Functions
Computing percent changes and differences of a time series
df['col'].pct_change()
df['col'].diff()
pandas correlation method of Series
df['ABC'].corr(df['XYZ'])
pandas autocorrelation
df['ABC'].autocorr()
TIME SERIES ANALYSIS IN PYTHON
Let's practice!
T I M E S E R I E S A N A LY S I S I N P Y T H O N
Correlation of Two
Time Series
T I M E S E R I E S A N A LY S I S I N P Y T H O N
Rob Reider
Adjunct Professor, NYU-Courant
Consultant, Quantopian
Correlation of Two Time Series
Plot of S&P500 and JPMorgan stock
TIME SERIES ANALYSIS IN PYTHON
Correlation of Two Time Series
Scatter plot of S&P500 and JP Morgan returns
TIME SERIES ANALYSIS IN PYTHON
More Scatter Plots
Correlation = 0.9 Correlation = 0.4
Correlation = -0.9 Correlation = 1.0
TIME SERIES ANALYSIS IN PYTHON
Common Mistake: Correlation of Two Trending Series
Dow Jones Industrial Average and UFO Sightings
([Link])
Correlation of levels: 0.94
Correlation of percent changes: ≈0
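The effect can be reproduced with simulated data. This sketch builds two unrelated series that both trend upward (the trend slopes, noise, and seed are arbitrary assumptions) and compares the correlation of levels with the correlation of percent changes.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
t = np.arange(n)

# Two independent series that share nothing but an upward trend
series_a = pd.Series(100 + 0.5 * t + rng.normal(size=n))
series_b = pd.Series(50 + 0.3 * t + rng.normal(size=n))

# Correlation of levels is dominated by the common trend
corr_levels = series_a.corr(series_b)

# Correlation of percent changes removes the trend
corr_changes = series_a.pct_change().corr(series_b.pct_change())

print("Correlation of levels:", round(corr_levels, 3))
print("Correlation of percent changes:", round(corr_changes, 3))
```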
TIME SERIES ANALYSIS IN PYTHON
Example: Correlation of Large Cap and Small Cap
Stocks
Start with stock prices of SPX (large cap) and R2000 (small
cap)
First step: Compute percentage changes of both series
df['SPX_Ret'] = df['SPX_Prices'].pct_change()
df['R2000_Ret'] = df['R2000_Prices'].pct_change()
TIME SERIES ANALYSIS IN PYTHON
Example: Correlation of Large Cap and Small Cap
Stocks
Visualize correlation with scatter plot
plt.scatter(df['SPX_Ret'], df['R2000_Ret'])
plt.show()
TIME SERIES ANALYSIS IN PYTHON
Example: Correlation of Large Cap and Small Cap
Stocks
Use pandas correlation method for Series
correlation = df['SPX_Ret'].corr(df['R2000_Ret'])
print("Correlation is: ", correlation)
Correlation is: 0.868
TIME SERIES ANALYSIS IN PYTHON
Let's practice!
T I M E S E R I E S A N A LY S I S I N P Y T H O N
Simple Linear
Regressions
T I M E S E R I E S A N A LY S I S I N P Y T H O N
Rob Reider
Adjunct Professor, NYU-Courant
Consultant, Quantopian
What is a Regression?
Simple linear regression:
y_t = α + β x_t + ϵ_t
TIME SERIES ANALYSIS IN PYTHON
What is a Regression?
Ordinary Least Squares (OLS)
TIME SERIES ANALYSIS IN PYTHON
Python Packages to Perform Regressions
Warning: the order of x and y is not consistent across packages
In statsmodels:
import statsmodels.api as sm
sm.OLS(y, x).fit()
In numpy:
np.polyfit(x, y, deg=1)
In pandas (legacy; pd.ols was removed in pandas 0.20):
pd.ols(y, x)
In scipy:
from scipy import stats
stats.linregress(x, y)
TIME SERIES ANALYSIS IN PYTHON
Example: Regression of Small Cap Returns on Large
Cap
Import the statsmodels module
import statsmodels.api as sm
As before, compute percentage changes in both series
df['SPX_Ret'] = df['SPX_Prices'].pct_change()
df['R2000_Ret'] = df['R2000_Prices'].pct_change()
Add a constant to the DataFrame for the regression intercept
df = sm.add_constant(df)
TIME SERIES ANALYSIS IN PYTHON
Regression Example (continued)
Notice that the first row of returns is NaN
SPX_Price R2000_Price SPX_Ret R2000_Ret
Date
2012-11-01 1427.589966 827.849976 NaN NaN
2012-11-02 1414.199951 814.369995 -0.009379 -0.016283
Delete the row of NaN
df = df.dropna()
Run the regression
results = sm.OLS(df['R2000_Ret'], df[['const','SPX_Ret']]).fit()
print(results.summary())
TIME SERIES ANALYSIS IN PYTHON
Regression Example (continued)
Regression output
Intercept in results.params[0]
Slope in results.params[1]
TIME SERIES ANALYSIS IN PYTHON
Regression Example (continued)
Regression output
TIME SERIES ANALYSIS IN PYTHON
Relationship Between R-Squared and Correlation
[corr(x, y)]^2 = R^2 (or R-squared)
sign(corr) = sign(regression slope)
In last example:
R-Squared = 0.753
Slope is positive
correlation = +√0.753 = 0.868
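The identity can be checked numerically. This sketch fits a simple regression with `np.polyfit` on simulated data (the coefficients, noise level, and seed are arbitrary assumptions), computes R-squared from the residuals, and compares it with the squared correlation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data: y depends linearly on x plus noise
x = rng.normal(size=200)
y = 0.5 + 1.2 * x + rng.normal(scale=0.5, size=200)

# Simple linear regression of y on x (note: polyfit takes x first)
slope, intercept = np.polyfit(x, y, deg=1)

# R-squared = 1 - residual variance / total variance
residuals = y - (intercept + slope * x)
r_squared = 1 - residuals.var() / y.var()

# Pearson correlation between x and y
corr_xy = np.corrcoef(x, y)[0, 1]

print("R-squared:          ", r_squared)
print("Correlation squared:", corr_xy ** 2)
```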
TIME SERIES ANALYSIS IN PYTHON
Let's practice!
T I M E S E R I E S A N A LY S I S I N P Y T H O N
Autocorrelation
T I M E S E R I E S A N A LY S I S I N P Y T H O N
Rob Reider
Adjunct Professor, NYU-Courant
Consultant, Quantopian
What is Autocorrelation?
Correlation of a time series with a lagged copy of itself
Also called serial correlation
Lag-one autocorrelation
TIME SERIES ANALYSIS IN PYTHON
Interpretation of Autocorrelation
Mean Reversion - Negative autocorrelation
TIME SERIES ANALYSIS IN PYTHON
Interpretation of Autocorrelation
Momentum, or Trend Following - Positive autocorrelation
TIME SERIES ANALYSIS IN PYTHON
Traders Use Autocorrelation to Make Money
Individual stocks
Historically have negative autocorrelation
Measured over short horizons (days)
Trading strategy: Buy losers and sell winners
Commodities and currencies
Historically have positive autocorrelation
Measured over longer horizons (months)
Trading strategy: Buy winners and sell losers
TIME SERIES ANALYSIS IN PYTHON
Example of Positive Autocorrelation: Exchange Rates
Use daily ¥/$ exchange rates in DataFrame df from FRED
Convert index to datetime
# Convert index to datetime
df.index = pd.to_datetime(df.index)
# Downsample from daily to monthly data
df = df.resample(rule='M').last()
# Compute returns from prices
df['Return'] = df['Price'].pct_change()
# Compute autocorrelation
autocorrelation = df['Return'].autocorr()
print("The autocorrelation is: ",autocorrelation)
The autocorrelation is: 0.0567
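Lag-one autocorrelation is just the correlation of a series with itself shifted by one period. A sketch on simulated returns (the data and seed are assumptions) comparing `Series.autocorr()` with the explicit shifted correlation:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)

# Hypothetical return series
toy_returns = pd.Series(rng.normal(size=250))

# Built-in lag-one autocorrelation
lag1_builtin = toy_returns.autocorr()

# Same quantity written out: correlate the series with its one-period lag
lag1_manual = toy_returns.corr(toy_returns.shift(1))

print(lag1_builtin, lag1_manual)
```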
TIME SERIES ANALYSIS IN PYTHON
Let's practice!
T I M E S E R I E S A N A LY S I S I N P Y T H O N
Autocorrelation
Function
T I M E S E R I E S A N A LY S I S I N P Y T H O N
Rob Reider
Adjunct Professor, NYU-Courant
Consultant, Quantopian
Autocorrelation Function
Autocorrelation Function (ACF): The autocorrelation as a
function of the lag
Equals one at lag-zero
Interesting information beyond lag-one
TIME SERIES ANALYSIS IN PYTHON
ACF Example 1: Simple Autocorrelation Function
Can use last two values in series for forecasting
TIME SERIES ANALYSIS IN PYTHON
ACF Example 2: Seasonal Earnings
Earnings for H&R Block ACF for H&R Block
TIME SERIES ANALYSIS IN PYTHON
ACF Example 3: Useful for Model Selection
Model selection
TIME SERIES ANALYSIS IN PYTHON
Plot ACF in Python
Import module:
from statsmodels.graphics.tsaplots import plot_acf
Plot the ACF:
plot_acf(x, lags=20, alpha=0.05)
TIME SERIES ANALYSIS IN PYTHON
Confidence Interval of ACF
TIME SERIES ANALYSIS IN PYTHON
Confidence Interval of ACF
Argument alpha sets the width of confidence interval
Example: alpha=0.05
5% chance that if true autocorrelation is zero, it will fall
outside blue band
Confidence bands are wider if:
Alpha lower
Fewer observations
Under some simplifying assumptions, 95% confidence bands
are ±2/√N
If you want no bands on plot, set alpha=1
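A rough sketch of the ±2/√N rule using plain numpy rather than statsmodels. The helper `sample_acf` is a hypothetical hand-written estimator, and the white-noise data and seed are assumptions; for white noise, only a small share of the lags should land outside the band.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000
white_noise = rng.normal(size=n)


def sample_acf(x, nlags):
    # Sample autocorrelation at lags 0..nlags, normalized by the lag-0 sum of squares
    x = np.asarray(x, dtype=float) - np.mean(x)
    denom = np.dot(x, x)
    return np.array(
        [np.dot(x[: len(x) - k], x[k:]) / denom for k in range(nlags + 1)]
    )


acf_values = sample_acf(white_noise, 20)

# Approximate 95% band under the simplifying assumptions
band = 2 / np.sqrt(n)

# Share of lags 1..20 whose autocorrelation falls outside the band
share_outside = np.mean(np.abs(acf_values[1:]) > band)

print("band: +/-", band)
print("share of lags outside band:", share_outside)
```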
TIME SERIES ANALYSIS IN PYTHON
ACF Values Instead of Plot
from statsmodels.tsa.stattools import acf
print(acf(x))
[ 1. -0.6765505 0.34989905 -0.01629415 -0.02507
-0.03186545 0.01399904 -0.03518128 0.02063168 -0.02620
...
0.07191516 -0.12211912 0.14514481 -0.09644228 0.05215
TIME SERIES ANALYSIS IN PYTHON
Let's practice!
T I M E S E R I E S A N A LY S I S I N P Y T H O N
White Noise
T I M E S E R I E S A N A LY S I S I N P Y T H O N
Rob Reider
Adjunct Professor, NYU-Courant
Consultant, Quantopian
What is White Noise?
White Noise is a series with:
Constant mean
Constant variance
Zero autocorrelations at all lags
Special Case: if data has normal distribution, then Gaussian
White Noise
TIME SERIES ANALYSIS IN PYTHON
Simulating White Noise
It's very easy to generate white noise
import numpy as np
noise = np.random.normal(loc=0, scale=1, size=500)
TIME SERIES ANALYSIS IN PYTHON
What Does White Noise Look Like?
plt.plot(noise)
TIME SERIES ANALYSIS IN PYTHON
Autocorrelation of White Noise
plot_acf(noise, lags=50)
TIME SERIES ANALYSIS IN PYTHON
Stock Market Returns: Close to White Noise
Autocorrelation Function for the S&P500
TIME SERIES ANALYSIS IN PYTHON
Let's practice!
T I M E S E R I E S A N A LY S I S I N P Y T H O N
Random Walk
T I M E S E R I E S A N A LY S I S I N P Y T H O N
Rob Reider
Adjunct Professor, NYU-Courant
Consultant, Quantopian
What is a Random Walk?
Today's Price = Yesterday's Price + Noise
P_t = P_{t−1} + ϵ_t
Plot of simulated data
TIME SERIES ANALYSIS IN PYTHON
What is a Random Walk?
Today's Price = Yesterday's Price + Noise
P_t = P_{t−1} + ϵ_t
Change in price is white noise
P_t − P_{t−1} = ϵ_t
Can't forecast a random walk
Best forecast for tomorrow's price is today's price
TIME SERIES ANALYSIS IN PYTHON
What is a Random Walk?
Today's Price = Yesterday's Price + Noise
P_t = P_{t−1} + ϵ_t
Random walk with drift:
P_t = μ + P_{t−1} + ϵ_t
Change in price is white noise with non-zero mean:
P_t − P_{t−1} = μ + ϵ_t
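A random walk with drift can be simulated by cumulatively summing noisy steps. In this sketch the drift, starting price, and seed are arbitrary assumptions; differencing the simulated prices recovers the steps, which are white noise with a non-zero mean.

```python
import numpy as np

rng = np.random.default_rng(4)

mu = 0.1                              # drift per period (assumed)
steps = mu + rng.normal(size=500)     # mu + white noise
start_price = 100.0                   # assumed starting level

# Each price is the previous price plus a step
walk = start_price + np.cumsum(steps)

# First differences of the walk give back the steps
walk_changes = np.diff(walk)

print("mean change:", walk_changes.mean())
```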
TIME SERIES ANALYSIS IN PYTHON
Statistical Test for Random Walk
Random walk with drift
P_t = μ + P_{t−1} + ϵ_t
Regression test for random walk
P_t = α + β P_{t−1} + ϵ_t
Test:
H_0: β = 1 (random walk)
H_1: β < 1 (not random walk)
TIME SERIES ANALYSIS IN PYTHON
Statistical Test for Random Walk
Regression test for random walk
P_t = α + β P_{t−1} + ϵ_t
Equivalent to (subtracting P_{t−1} from both sides, so β below is the original slope minus 1)
P_t − P_{t−1} = α + β P_{t−1} + ϵ_t
Test:
H_0: β = 0 (random walk)
H_1: β < 0 (not random walk)
TIME SERIES ANALYSIS IN PYTHON
Statistical Test for Random Walk
Regression test for random walk
P_t − P_{t−1} = α + β P_{t−1} + ϵ_t
Test:
H_0: β = 0 (random walk)
H_1: β < 0 (not random walk)
This test is called the Dickey-Fuller test
If you add more lagged changes on the right hand side, it's
the Augmented Dickey-Fuller test
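The regression behind the test can be sketched with numpy alone. This estimates only the slope β in the regression of the change on the lagged level; it does not compute the Dickey-Fuller test statistic or critical values, which need the special distribution used by the actual test. The simulated series, the helper name `df_slope`, and the seed are assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1000
shocks = rng.normal(size=n)

# A random walk: cumulative sum of white noise
sim_random_walk = np.cumsum(shocks)

# A stationary AR(1) with coefficient 0.5, for contrast
sim_ar1 = np.zeros(n)
for t in range(1, n):
    sim_ar1[t] = 0.5 * sim_ar1[t - 1] + shocks[t]


def df_slope(p):
    # Regress the change P_t - P_{t-1} on the lagged level P_{t-1}
    slope, _intercept = np.polyfit(p[:-1], np.diff(p), deg=1)
    return slope


beta_rw = df_slope(sim_random_walk)   # expected to be near 0
beta_ar1 = df_slope(sim_ar1)          # expected to be clearly negative

print("random walk slope:", beta_rw)
print("AR(1) slope:      ", beta_ar1)
```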
TIME SERIES ANALYSIS IN PYTHON
ADF Test in Python
Import module from statsmodels
from statsmodels.tsa.stattools import adfuller
Run Augmented Dickey-Fuller test
adfuller(x)
TIME SERIES ANALYSIS IN PYTHON
Example: Is the S&P500 a Random Walk?
# Run Augmented Dickey-Fuller Test on SPX data
results = adfuller(df['SPX'])
# Print p-value
print(results[1])
0.782253808587
# Print full results
print(results)
(-0.91720490331127869,
0.78225380858668414,
0,
1257,
{'1%': -3.4355629707955395,
'10%': -2.567995644141416,
'5%': -2.8638420633876671},
10161.888789598503)
TIME SERIES ANALYSIS IN PYTHON
Let's practice!
T I M E S E R I E S A N A LY S I S I N P Y T H O N
Stationarity
T I M E S E R I E S A N A LY S I S I N P Y T H O N
Rob Reider
Adjunct Professor, NYU-Courant
Consultant, Quantopian
What is Stationarity?
Strong stationarity: entire distribution of data is time-
invariant
Weak stationarity: mean, variance and autocorrelation are
time-invariant (i.e., for autocorrelation, corr(X_t, X_{t−τ}) is
only a function of τ)
TIME SERIES ANALYSIS IN PYTHON
Why Do We Care?
If parameters vary with time, too many parameters to
estimate
Can only estimate a parsimonious model with a few
parameters
TIME SERIES ANALYSIS IN PYTHON
Examples of Nonstationary Series
Random Walk
TIME SERIES ANALYSIS IN PYTHON
Examples of Nonstationary Series
Seasonality in series
TIME SERIES ANALYSIS IN PYTHON
Examples of Nonstationary Series
Change in Mean or Standard Deviation over time
TIME SERIES ANALYSIS IN PYTHON
Transforming Nonstationary Series Into Stationary
Series
Random Walk: plt.plot(SPY)
First difference: plt.plot(SPY.diff())
TIME SERIES ANALYSIS IN PYTHON
Transforming Nonstationary Series Into Stationary
Series
Seasonality: plt.plot(HRB)
Seasonal difference: plt.plot(HRB.diff(4))
TIME SERIES ANALYSIS IN PYTHON
Transforming Nonstationary Series Into Stationary
Series
# AMZN Quarterly Revenues
plt.plot(AMZN)
# Log of AMZN Revenues
plt.plot(np.log(AMZN))
# Log, then seasonal difference
plt.plot(np.log(AMZN).diff(4))
TIME SERIES ANALYSIS IN PYTHON
Let's practice!
T I M E S E R I E S A N A LY S I S I N P Y T H O N
Introducing an AR
Model
T I M E S E R I E S A N A LY S I S I N P Y T H O N
Rob Reider
Adjunct Professor, NYU-Courant
Consultant, Quantopian
Mathematical Description of AR(1) Model
R_t = μ + ϕ R_{t−1} + ϵ_t
Since only one lagged value on right hand side, this is called:
AR model of order 1, or
AR(1) model
AR parameter is ϕ
For stationarity, −1 < ϕ < 1
TIME SERIES ANALYSIS IN PYTHON
Interpretation of AR(1) Parameter
R_t = μ + ϕ R_{t−1} + ϵ_t
Negative ϕ: Mean Reversion
Positive ϕ: Momentum
TIME SERIES ANALYSIS IN PYTHON
Comparison of AR(1) Time Series
ϕ = 0.9 ϕ = −0.9
ϕ = 0.5 ϕ = −0.5
TIME SERIES ANALYSIS IN PYTHON
Comparison of AR(1) Autocorrelation Functions
ϕ = 0.9 ϕ = −0.9
ϕ = 0.5 ϕ = −0.5
TIME SERIES ANALYSIS IN PYTHON
Higher Order AR Models
AR(1)
R_t = μ + ϕ_1 R_{t−1} + ϵ_t
AR(2)
R_t = μ + ϕ_1 R_{t−1} + ϕ_2 R_{t−2} + ϵ_t
AR(3)
R_t = μ + ϕ_1 R_{t−1} + ϕ_2 R_{t−2} + ϕ_3 R_{t−3} + ϵ_t
...
TIME SERIES ANALYSIS IN PYTHON
Simulating an AR Process
from statsmodels.tsa.arima_process import ArmaProcess
ar = np.array([1, -0.9])
ma = np.array([1])
AR_object = ArmaProcess(ar, ma)
simulated_data = AR_object.generate_sample(nsample=1000)
plt.plot(simulated_data)
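`ArmaProcess` takes the coefficients of the lag polynomial, including the zero-lag term, so the AR coefficient enters with its sign flipped: an `ar` array of `[1, -0.9]` corresponds to ϕ = +0.9. A plain-numpy sketch of the same AR(1) recursion with μ = 0 (the sample size and seed are assumptions), checking that the lag-one autocorrelation comes out near ϕ:

```python
import numpy as np

rng = np.random.default_rng(6)

phi = 0.9          # AR(1) parameter
n_obs = 5000       # assumed sample size
innovations = rng.normal(size=n_obs)

# R_t = phi * R_{t-1} + eps_t, starting from zero
ar1_series = np.zeros(n_obs)
for t in range(1, n_obs):
    ar1_series[t] = phi * ar1_series[t - 1] + innovations[t]

# Sample lag-one autocorrelation of the simulated series
ar1_lag1 = np.corrcoef(ar1_series[:-1], ar1_series[1:])[0, 1]
print("lag-one autocorrelation:", ar1_lag1)
```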
TIME SERIES ANALYSIS IN PYTHON
Let's practice!
T I M E S E R I E S A N A LY S I S I N P Y T H O N
Estimating and
Forecasting an AR
Model
T I M E S E R I E S A N A LY S I S I N P Y T H O N
Rob Reider
Adjunct Professor, NYU-Courant
Consultant, Quantopian
Estimating an AR Model
To estimate parameters from data (simulated)
from statsmodels.tsa.arima_model import ARMA
mod = ARMA(data, order=(1,0))
result = mod.fit()
ARMA has been deprecated and replaced with ARIMA
from statsmodels.tsa.arima.model import ARIMA
mod = ARIMA(data, order=(1,0,0))
result = mod.fit()
For ARMA, order=(p,q)
For ARIMA, order=(p,d,q)
TIME SERIES ANALYSIS IN PYTHON
Estimating an AR Model
Full output (true μ = 0 and ϕ = 0.9)
print(result.summary())
TIME SERIES ANALYSIS IN PYTHON
Estimating an AR Model
Only the estimates of μ and ϕ (true μ = 0 and ϕ = 0.9)
print(result.params)
array([-0.03605989, 0.90535667])
TIME SERIES ANALYSIS IN PYTHON
Forecasting With an AR Model
from statsmodels.graphics.tsaplots import plot_predict
fig, ax = plt.subplots()
data.plot(ax=ax)
plot_predict(result, start='2012-09-27', end='2012-10-06', alpha=0.05, ax=ax)
plt.show()
Arguments of function plot_predict()
First argument is fitted model
Set alpha=None for no confidence interval
Set ax=ax to plot the data and prediction on same axes
TIME SERIES ANALYSIS IN PYTHON
Let's practice!
T I M E S E R I E S A N A LY S I S I N P Y T H O N
Choosing the Right
Model
T I M E S E R I E S A N A LY S I S I N P Y T H O N
Rob Reider
Adjunct Professor, NYU-Courant
Consultant, Quantopian
Identifying the Order of an AR Model
The order of an AR(p) model will usually be unknown
Two techniques to determine order
Partial Autocorrelation Function
Information criteria
TIME SERIES ANALYSIS IN PYTHON
Partial Autocorrelation Function (PACF)
TIME SERIES ANALYSIS IN PYTHON
Plot PACF in Python
Same as ACF, but use plot_pacf instead of plot_acf
Import module
from statsmodels.graphics.tsaplots import plot_pacf
Plot the PACF
plot_pacf(x, lags=20, alpha=0.05)
TIME SERIES ANALYSIS IN PYTHON
Comparison of PACF for Different AR Models
AR(1) AR(2)
AR(3) White Noise
TIME SERIES ANALYSIS IN PYTHON
Information Criteria
Information criteria: adjusts goodness-of-fit for number of
parameters
Two popular adjusted goodness-of-fit measures
AIC (Akaike Information Criterion)
BIC (Bayesian Information Criterion)
TIME SERIES ANALYSIS IN PYTHON
Information Criteria
Estimation output
TIME SERIES ANALYSIS IN PYTHON
Getting Information Criteria From statsmodels
You learned earlier how to fit an AR model
from statsmodels.tsa.arima.model import ARIMA
mod = ARIMA(simulated_data, order=(1,0,0))
result = mod.fit()
And to get full output
result.summary()
Or just the parameters
result.params
To get the AIC and BIC
result.aic
result.bic
TIME SERIES ANALYSIS IN PYTHON
Information Criteria
Fit a simulated AR(3) to different AR(p) models
Choose p with the lowest BIC
TIME SERIES ANALYSIS IN PYTHON
Let's practice!
T I M E S E R I E S A N A LY S I S I N P Y T H O N
Describe Model
T I M E S E R I E S A N A LY S I S I N P Y T H O N
Rob Reider
Adjunct Professor, NYU-Courant
Consultant, Quantopian
Mathematical Description of MA(1) Model
R_t = μ + ϵ_t + θ ϵ_{t−1}
Since only one lagged error on right hand side, this is called:
MA model of order 1, or
MA(1) model
MA parameter is θ
Stationary for all values of θ
TIME SERIES ANALYSIS IN PYTHON
Interpretation of MA(1) Parameter
R_t = μ + ϵ_t + θ ϵ_{t−1}
Negative θ: One-Period Mean Reversion
Positive θ: One-Period Momentum
Note: One-period autocorrelation is θ/(1 + θ^2), not θ
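The θ/(1 + θ^2) result can be checked by simulation. This sketch builds an MA(1) series directly from white noise with μ = 0 (θ = 0.5, the sample size, and the seed are assumptions) and compares the sample lag-one autocorrelation with the formula.

```python
import numpy as np

rng = np.random.default_rng(7)

theta = 0.5
n_ma = 20000
ma_shocks = rng.normal(size=n_ma + 1)

# R_t = eps_t + theta * eps_{t-1}
ma1_series = ma_shocks[1:] + theta * ma_shocks[:-1]

# Sample lag-one autocorrelation versus the theoretical value
ma1_lag1 = np.corrcoef(ma1_series[:-1], ma1_series[1:])[0, 1]
ma1_theory = theta / (1 + theta ** 2)

print("sample:", ma1_lag1)
print("theory:", ma1_theory)
```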
TIME SERIES ANALYSIS IN PYTHON
Comparison of MA(1) Autocorrelation Functions
θ = 0.9 θ = −0.9
θ = 0.5 θ = −0.5
TIME SERIES ANALYSIS IN PYTHON
Example of MA(1) Process: Intraday Stock Returns
TIME SERIES ANALYSIS IN PYTHON
Autocorrelation Function of Intraday Stock Returns
TIME SERIES ANALYSIS IN PYTHON
Higher Order MA Models
MA(1)
R_t = μ + ϵ_t + θ_1 ϵ_{t−1}
MA(2)
R_t = μ + ϵ_t + θ_1 ϵ_{t−1} + θ_2 ϵ_{t−2}
MA(3)
R_t = μ + ϵ_t + θ_1 ϵ_{t−1} + θ_2 ϵ_{t−2} + θ_3 ϵ_{t−3}
...
TIME SERIES ANALYSIS IN PYTHON
Simulating an MA Process
from statsmodels.tsa.arima_process import ArmaProcess
ar = np.array([1])
ma = np.array([1, 0.5])
AR_object = ArmaProcess(ar, ma)
simulated_data = AR_object.generate_sample(nsample=1000)
plt.plot(simulated_data)
TIME SERIES ANALYSIS IN PYTHON
Let's practice!
T I M E S E R I E S A N A LY S I S I N P Y T H O N
Estimation and
Forecasting an MA
Model
T I M E S E R I E S A N A LY S I S I N P Y T H O N
Rob Reider
Adjunct Professor, NYU-Courant
Consultant, Quantopian
Estimating an MA Model
Same as estimating an AR model (except order=(0,0,1) )
from statsmodels.tsa.arima.model import ARIMA
mod = ARIMA(simulated_data, order=(0,0,1))
result = mod.fit()
TIME SERIES ANALYSIS IN PYTHON
Forecasting an MA Model
from statsmodels.graphics.tsaplots import plot_predict
fig, ax = plt.subplots()
data.plot(ax=ax)
plot_predict(result, start='2012-09-27', end='2012-10-06', ax=ax)
plt.show()
TIME SERIES ANALYSIS IN PYTHON
Let's practice!
T I M E S E R I E S A N A LY S I S I N P Y T H O N
ARMA models
T I M E S E R I E S A N A LY S I S I N P Y T H O N
Rob Reider
Adjunct Professor, NYU-Courant
Consultant, Quantopian
ARMA Model
ARMA(1,1) model:
R_t = μ + ϕ R_{t−1} + ϵ_t + θ ϵ_{t−1}
TIME SERIES ANALYSIS IN PYTHON
Converting Between ARMA, AR, and MA Models
Converting AR(1) into an MA(∞)
R_t = μ + ϕ R_{t−1} + ϵ_t
R_t = μ + ϕ (μ + ϕ R_{t−2} + ϵ_{t−1}) + ϵ_t
⋮
R_t = μ/(1 − ϕ) + ϵ_t + ϕ ϵ_{t−1} + ϕ^2 ϵ_{t−2} + ϕ^3 ϵ_{t−3} + ...
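The repeated substitution can be verified numerically: the weight on the shock from k periods ago is ϕ^k. In this sketch the values of μ and ϕ, the sample length, and the seed are assumptions, and the series is started at its long-run mean plus the first shock so that the finite sum matches exactly.

```python
import numpy as np

rng = np.random.default_rng(8)

mu, phi_ar, n_steps = 0.2, 0.6, 50
eps = rng.normal(size=n_steps)

# Build the AR(1) recursively, starting at the long-run mean plus the first shock
r_ar = np.empty(n_steps)
r_ar[0] = mu / (1 - phi_ar) + eps[0]
for t in range(1, n_steps):
    r_ar[t] = mu + phi_ar * r_ar[t - 1] + eps[t]

# MA representation of the last value: mean plus phi^k-weighted past shocks
ma_weights = phi_ar ** np.arange(n_steps)       # 1, phi, phi^2, ...
r_ma_last = mu / (1 - phi_ar) + np.dot(ma_weights, eps[::-1])

print("recursive AR(1) value:", r_ar[-1])
print("MA representation:    ", r_ma_last)
```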
TIME SERIES ANALYSIS IN PYTHON
Let's practice!
T I M E S E R I E S A N A LY S I S I N P Y T H O N
Cointegration
Models
T I M E S E R I E S A N A LY S I S I N P Y T H O N
Rob Reider
Adjunct Professor, NYU-Courant
Consultant, Quantopian
What is Cointegration?
Two series, P_t and Q_t, can be random walks
But the linear combination P_t − c Q_t may not be a random
walk!
If that's true
P_t − c Q_t is forecastable
P_t and Q_t are said to be cointegrated
TIME SERIES ANALYSIS IN PYTHON
Analogy: Dog on a Leash
P_t = Owner
Q_t = Dog
Both series look like a random walk
Difference, or distance between them, looks mean reverting
If dog falls too far behind, it gets pulled forward
If dog gets too far ahead, it gets pulled back
TIME SERIES ANALYSIS IN PYTHON
Example: Heating Oil and Natural Gas
Heating Oil and Natural Gas both look like random walks...
TIME SERIES ANALYSIS IN PYTHON
Example: Heating Oil and Natural Gas
But the spread (difference) is mean reverting
TIME SERIES ANALYSIS IN PYTHON
What Types of Series are Cointegrated?
Economic substitutes
Heating Oil and Natural Gas
Platinum and Palladium
Corn and Wheat
Corn and Sugar
...
Bitcoin and Ethereum?
How about competitors?
Coke and Pepsi?
Apple and Blackberry? No! Leash broke and dog ran away
TIME SERIES ANALYSIS IN PYTHON
Two Steps to Test for Cointegration
Regress P_t on Q_t and get slope c
Run Augmented Dickey-Fuller test on P_t − c Q_t to test for
random walk
Alternatively, can use coint function in statsmodels that
combines both steps
from statsmodels.tsa.stattools import coint
coint(P,Q)
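The first step can be sketched with numpy on simulated data. Here Q is a random walk and P is c times Q plus a stationary AR(1) spread; the value c = 1.5, the sample size, and the seed are assumptions. The sketch recovers the slope and shows the spread is far less persistent than a random walk. It stops short of the ADF test itself.

```python
import numpy as np

rng = np.random.default_rng(9)
n_coint = 2000
c_true = 1.5

# Q is a random walk
q_series = np.cumsum(rng.normal(size=n_coint))

# Stationary AR(1) disturbance that keeps P close to c * Q (the "leash")
leash = np.zeros(n_coint)
leash_shocks = rng.normal(size=n_coint)
for t in range(1, n_coint):
    leash[t] = 0.5 * leash[t - 1] + leash_shocks[t]

p_series = c_true * q_series + leash

# Step 1: regress P on Q to estimate the slope c
c_hat, _const = np.polyfit(q_series, p_series, deg=1)

# Spread used in step 2; its lag-one autocorrelation is well below 1
spread = p_series - c_hat * q_series
spread_lag1 = np.corrcoef(spread[:-1], spread[1:])[0, 1]

print("estimated c:", c_hat)
print("spread lag-one autocorrelation:", spread_lag1)
```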
TIME SERIES ANALYSIS IN PYTHON
Let's practice!
T I M E S E R I E S A N A LY S I S I N P Y T H O N
Case Study: Climate
Change
T I M E S E R I E S A N A LY S I S I N P Y T H O N
Rob Reider
Adjunct Professor, NYU-Courant
Consultant, Quantopian
Analyzing Temperature Data
Temperature data:
New York City from 1870-2016
Downloaded from National Oceanic and Atmospheric
Administration (NOAA)
Convert index to datetime object
Plot data
TIME SERIES ANALYSIS IN PYTHON
Analyzing Temperature Data
Test for Random Walk
Take first differences
Compute ACF and PACF
Fit a few AR, MA, and ARMA models
Use Information Criterion to choose best model
Forecast temperature over next 30 years
TIME SERIES ANALYSIS IN PYTHON
Let's practice!
T I M E S E R I E S A N A LY S I S I N P Y T H O N
Congratulations
T I M E S E R I E S A N A LY S I S I N P Y T H O N
Rob Reider
Adjunct Professor, NYU-Courant
Consultant, Quantopian
Advanced Topics
GARCH Models
Nonlinear Models
Multivariate Time Series Models
Regime Switching Models
State Space Models and Kalman Filtering
...
TIME SERIES ANALYSIS IN PYTHON
Keep practicing!
T I M E S E R I E S A N A LY S I S I N P Y T H O N
Welcome to the
course!
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N
Thomas Vincent
Head of Data Science, Getty Images
Prerequisites
Intro to Python for Data Science
Intermediate Python for Data Science
VISUALIZING TIME SERIES DATA IN PYTHON
Time series in the field of Data Science
Time series are a fundamental way to store and analyze
many types of data
Financial, weather and device data are all best handled as
time series
VISUALIZING TIME SERIES DATA IN PYTHON
Time series in the field of Data Science
VISUALIZING TIME SERIES DATA IN PYTHON
Course overview
Chapter 1: Getting started and personalizing your first time
series plot
Chapter 2: Summarizing and describing time series data
Chapter 3: Advanced time series analysis
Chapter 4: Working with multiple time series
Chapter 5: Case Study
VISUALIZING TIME SERIES DATA IN PYTHON
Reading data with Pandas
import pandas as pd
df = pd.read_csv('ch2_co2_levels.csv')
print(df)
datestamp co2
0 1958-03-29 316.1
1 1958-04-05 317.3
2 1958-04-12 317.6
...
...
...
2281 2001-12-15 371.2
2282 2001-12-22 371.3
2283 2001-12-29 371.5
VISUALIZING TIME SERIES DATA IN PYTHON
Preview data with Pandas
print(df.head(n=5))
datestamp co2
0 1958-03-29 316.1
1 1958-04-05 317.3
2 1958-04-12 317.6
3 1958-04-19 317.5
4 1958-04-26 316.4
print(df.tail(n=5))
datestamp co2
2279 2001-12-01 370.3
2280 2001-12-08 370.8
2281 2001-12-15 371.2
2282 2001-12-22 371.3
2283 2001-12-29 371.5
VISUALIZING TIME SERIES DATA IN PYTHON
Check data types with Pandas
print(df.dtypes)
datestamp object
co2 float64
dtype: object
VISUALIZING TIME SERIES DATA IN PYTHON
Working with dates
To work with time series data in pandas, your date columns
need to be of the datetime64 type.
pd.to_datetime(['2009/07/31', 'test'])
ValueError: Unknown string format
pd.to_datetime(['2009/07/31', 'test'], errors='coerce')
DatetimeIndex(['2009-07-31', 'NaT'],
dtype='datetime64[ns]', freq=None)
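As a minimal sketch of applying this to a DataFrame column, assume a small hand-made frame with one unparseable entry. The column names mirror the CO2 example, but the rows are made up.

```python
import pandas as pd

# Hypothetical frame: the last datestamp cannot be parsed as a date
df = pd.DataFrame({
    'datestamp': ['1958-03-29', '1958-04-05', 'test'],
    'co2': [316.1, 317.3, 317.6],
})

# errors='coerce' turns unparseable entries into NaT instead of raising
df['datestamp'] = pd.to_datetime(df['datestamp'], errors='coerce')
n_bad = df['datestamp'].isna().sum()
print(df.dtypes)

# Drop the rows that failed to parse, then use the dates as the index
df = df.dropna(subset=['datestamp']).set_index('datestamp')
print(df)
```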
VISUALIZING TIME SERIES DATA IN PYTHON
Let's get started!
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N
Plot your first time
series
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N
Thomas Vincent
Head of Data Science, Getty Images
The Matplotlib library
In Python, matplotlib is an extensive package used to plot
data
The pyplot submodule of matplotlib is traditionally imported
using the plt alias
import matplotlib.pyplot as plt
VISUALIZING TIME SERIES DATA IN PYTHON
Plotting time series data
VISUALIZING TIME SERIES DATA IN PYTHON
Plotting time series data
import matplotlib.pyplot as plt
import pandas as pd
# Use the date column as the index so it becomes the x-axis
df = df.set_index('date_column')
df.plot()
plt.show()
VISUALIZING TIME SERIES DATA IN PYTHON
Adding style to your plots
plt.style.use('fivethirtyeight')
df.plot()
plt.show()
VISUALIZING TIME SERIES DATA IN PYTHON
FiveThirtyEight style
VISUALIZING TIME SERIES DATA IN PYTHON
Matplotlib style sheets
print(plt.style.available)
['seaborn-dark-palette', 'seaborn-darkgrid',
'seaborn-dark', 'seaborn-notebook',
'seaborn-pastel', 'seaborn-white',
'classic', 'ggplot', 'grayscale',
'dark_background', 'seaborn-poster',
'seaborn-muted', 'seaborn', 'bmh',
'seaborn-paper', 'seaborn-whitegrid',
'seaborn-bright', 'seaborn-talk',
'fivethirtyeight', 'seaborn-colorblind',
'seaborn-deep', 'seaborn-ticks']
VISUALIZING TIME SERIES DATA IN PYTHON
Describing your graphs with labels
ax = df.plot(color='blue')
ax.set_xlabel('Date')
ax.set_ylabel('The values of my Y axis')
ax.set_title('The title of my plot')
plt.show()
VISUALIZING TIME SERIES DATA IN PYTHON
Figure size, linewidth, linestyle and fontsize
ax = df.plot(figsize=(12, 5), fontsize=12,
             linewidth=3, linestyle='--')
ax.set_xlabel('Date', fontsize=16)
ax.set_ylabel('The values of my Y axis', fontsize=16)
ax.set_title('The title of my plot', fontsize=16)
plt.show()
VISUALIZING TIME SERIES DATA IN PYTHON
Let's practice!
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N
Customize your time
series plot
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N
Thomas Vincent
Head of Data Science, Getty Images
Slicing time series data
discoveries['1960':'1970']
discoveries['1950-01':'1950-12']
discoveries['1960-01-01':'1960-01-15']
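These slices rely on partial string indexing of a DatetimeIndex, where both endpoints are inclusive. A small sketch, assuming a made-up monthly discoveries frame (the values are placeholders) and using .loc, which makes the row selection explicit:

```python
import numpy as np
import pandas as pd

# Hypothetical monthly data from 1950 through 1975
index = pd.date_range('1950-01-01', '1975-12-01', freq='MS')
discoveries = pd.DataFrame({'discoveries': np.arange(len(index))}, index=index)

# A year string covers the whole year, so '1970' includes December 1970
decade = discoveries.loc['1960':'1970']

# Month strings work the same way
one_year = discoveries.loc['1950-01':'1950-12']

print(len(decade), decade.index[0], decade.index[-1])
print(len(one_year))
```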
VISUALIZING TIME SERIES DATA IN PYTHON
Plotting subset of your time series data
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
df_subset = discoveries['1960':'1970']
ax = df_subset.plot(color='blue', fontsize=14)
plt.show()
VISUALIZING TIME SERIES DATA IN PYTHON
Adding markers
# Vertical line at a given date
ax.axvline(x='1969-01-01',
           color='red',
           linestyle='--')
# Horizontal line at a given value
ax.axhline(y=100,
           color='green',
           linestyle='--')
VISUALIZING TIME SERIES DATA IN PYTHON
Using markers: the full code
ax = discoveries.plot(color='blue')
ax.set_xlabel('Date')
ax.set_ylabel('Number of great discoveries')
ax.axvline('1969-01-01', color='red', linestyle='--')
ax.axhline(4, color='green', linestyle='--')
VISUALIZING TIME SERIES DATA IN PYTHON
Highlighting regions of interest
# Shade a vertical region between two dates
ax.axvspan('1964-01-01', '1968-01-01',
           color='red', alpha=0.5)
# Shade a horizontal region between two values
ax.axhspan(8, 6, color='green',
           alpha=0.2)
VISUALIZING TIME SERIES DATA IN PYTHON
Highlighting regions of interest: the full code
ax = discoveries.plot(color='blue')
ax.set_xlabel('Date')
ax.set_ylabel('Number of great discoveries')
ax.axvspan('1964-01-01', '1968-01-01', color='red',
           alpha=0.3)
ax.axhspan(8, 6, color='green', alpha=0.3)
VISUALIZING TIME SERIES DATA IN PYTHON
Let's practice!
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N
Clean your time
series data
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N
Thomas Vincent
Head of Data Science, Getty Images
The CO2 level time series
A snippet of the weekly measurements of CO2 levels at the
Mauna Loa Observatory, Hawaii.
datestamp co2
1958-03-29 316.1
1958-04-05 317.3
1958-04-12 317.6
...
...
2001-12-15 371.2
2001-12-22 371.3
2001-12-29 371.5
VISUALIZING TIME SERIES DATA IN PYTHON
Finding missing values in a DataFrame
print(df.isnull())
datestamp co2
1958-03-29 False
1958-04-05 False
1958-04-12 False
print(df.notnull())
datestamp co2
1958-03-29 True
1958-04-05 True
1958-04-12 True
...
VISUALIZING TIME SERIES DATA IN PYTHON
Counting missing values in a DataFrame
print(df.isnull().sum())
datestamp 0
co2 59
dtype: int64
VISUALIZING TIME SERIES DATA IN PYTHON
Replacing missing values in a DataFrame
print(df)
...
5 1958-05-03 316.9
6 1958-05-10 NaN
7 1958-05-17 317.5
...
df = df.fillna(method='bfill')
print(df)
...
5 1958-05-03 316.9
6 1958-05-10 317.5
7 1958-05-17 317.5
...
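The counting and backfilling steps can be reproduced on a three-row sketch. The rows copy the values shown above; the frame itself is built by hand for illustration. Recent pandas releases deprecate the method argument of fillna, so this sketch uses the equivalent .bfill() method.

```python
import numpy as np
import pandas as pd

# Hand-built frame with one missing CO2 reading
df = pd.DataFrame(
    {'co2': [316.9, np.nan, 317.5]},
    index=pd.to_datetime(['1958-05-03', '1958-05-10', '1958-05-17']),
)

# Count missing values per column
n_missing = df.isnull().sum()
print(n_missing)

# Backfill: each gap takes the next valid observation
filled = df.bfill()
print(filled)
```

Backfilling copies a later value backwards in time, which is convenient for plotting; whether it suits a given analysis depends on how the series will be used.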
VISUALIZING TIME SERIES DATA IN PYTHON
Let's practice!
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N
Plot aggregates of
your data
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N
Thomas Vincent
Head of Data Science, Getty Images
Moving averages
In the field of time series analysis, a moving average can be
used for many different purposes:
smoothing out short-term fluctuations
removing outliers
highlighting long-term trends or cycles.
VISUALIZING TIME SERIES DATA IN PYTHON
The moving average model
co2_levels_mean = co2_levels.rolling(window=52).mean()
ax = co2_levels_mean.plot()
ax.set_xlabel("Date")
ax.set_ylabel("The values of my Y axis")
ax.set_title("52 weeks rolling mean of my time series")
plt.show()
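To see what rolling(window=52).mean() returns, here is a sketch on a simulated weekly series. The trend and seasonal terms are invented stand-ins for the CO2 data.

```python
import numpy as np
import pandas as pd

# Simulated weekly series: slow upward trend plus a yearly cycle
index = pd.date_range('1958-03-29', periods=156, freq='7D')
weeks = np.arange(len(index))
co2 = pd.Series(316 + 0.03 * weeks + 3 * np.sin(2 * np.pi * weeks / 52),
                index=index)

# Each point is the mean of the current and the previous 51 observations
rolling_mean = co2.rolling(window=52).mean()

# The first 51 entries are NaN because the window is not yet full
print(rolling_mean.isna().sum())
print(rolling_mean.dropna().head())
```

A window of 52 weeks spans one full yearly cycle, so the seasonal swing averages out and mostly the trend remains.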
VISUALIZING TIME SERIES DATA IN PYTHON
A plot of the moving average for the CO2 data
VISUALIZING TIME SERIES DATA IN PYTHON
Computing aggregate values of your time series
co2_levels.index
DatetimeIndex(['1958-03-29', '1958-04-05',...],
dtype='datetime64[ns]', name='datestamp',
length=2284, freq=None)
print(co2_levels.index.month)
array([ 3, 4, 4, ..., 12, 12, 12], dtype=int32)
print(co2_levels.index.year)
array([1958, 1958, 1958, ..., 2001,
2001, 2001], dtype=int32)
VISUALIZING TIME SERIES DATA IN PYTHON
Plotting aggregate values of your time series
index_month = co2_levels.index.month
co2_levels_by_month = co2_levels.groupby(index_month).mean()
co2_levels_by_month.plot()
plt.show()
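The grouping step can be checked on simulated data. The frame below is a made-up weekly series covering ten years, not the Mauna Loa measurements.

```python
import numpy as np
import pandas as pd

# Simulated weekly observations over roughly ten years
index = pd.date_range('1958-03-29', periods=520, freq='7D')
weeks = np.arange(len(index))
co2_levels = pd.DataFrame(
    {'co2': 316 + 0.03 * weeks + 3 * np.sin(2 * np.pi * weeks / 52)},
    index=index,
)

# Group every observation by its calendar month (1 to 12), across all years
index_month = co2_levels.index.month
co2_levels_by_month = co2_levels.groupby(index_month).mean()
print(co2_levels_by_month)
```

The result has one row per calendar month, so plotting it shows the average seasonal profile rather than the evolution over time.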
VISUALIZING TIME SERIES DATA IN PYTHON
Plotting aggregate values of your time series
VISUALIZING TIME SERIES DATA IN PYTHON
Let's practice!
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N
Summarizing the
values in your time
series data
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N
Thomas Vincent
Head of Data Science, Getty Images
Obtaining numerical summaries of your data
What is the average value of this data?
What is the maximum value observed in this time series?
VISUALIZING TIME SERIES DATA IN PYTHON
The .describe() method automatically computes key
statistics of all numeric columns in your DataFrame
print(df.describe())
co2
count 2284.000000
mean 339.657750
std 17.100899
min 313.000000
25% 323.975000
50% 337.700000
75% 354.500000
max 373.900000
VISUALIZING TIME SERIES DATA IN PYTHON
Summarizing your data with boxplots
ax1 = df.boxplot()
ax1.set_xlabel('Your first boxplot')
ax1.set_ylabel('Values of your data')
ax1.set_title('Boxplot values of your data')
plt.show()
VISUALIZING TIME SERIES DATA IN PYTHON
A boxplot of the values in the CO2 data
VISUALIZING TIME SERIES DATA IN PYTHON
Summarizing your data with histograms
ax2 = df.plot(kind='hist', bins=100)
ax2.set_xlabel('Your first histogram')
ax2.set_ylabel('Frequency of values in your data')
ax2.set_title('Histogram of your data with 100 bins')
plt.show()
VISUALIZING TIME SERIES DATA IN PYTHON
A histogram plot of the values in the CO2 data
VISUALIZING TIME SERIES DATA IN PYTHON
Summarizing your data with density plots
ax3 = df.plot(kind='density', linewidth=2)
ax3.set_xlabel('Your first density plot')
ax3.set_ylabel('Density values of your data')
ax3.set_title('Density plot of your data')
plt.show()
VISUALIZING TIME SERIES DATA IN PYTHON
A density plot of the values in the CO2 data
VISUALIZING TIME SERIES DATA IN PYTHON
Let's practice!
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N
Autocorrelation and
Partial
autocorrelation
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N
Thomas Vincent
Head of Data Science, Getty Images
Autocorrelation in time series data
Autocorrelation is measured as the correlation between a
time series and a delayed copy of itself
For example, an autocorrelation of order 3 returns the
correlation between a time series at points (t_1, t_2, t_3,
...) and its own values lagged by 3 time points, i.e. (t_4, t_5,
t_6, ...)
It is used to find repetitive patterns or periodic signals in time
series
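The definition above can be verified directly with pandas: correlating a series with a shifted copy of itself gives the same number as the built-in autocorr method. The sine wave below is an invented example with a period of 12 observations.

```python
import numpy as np
import pandas as pd

# Made-up periodic series: one full cycle every 12 observations
t = np.arange(120)
series = pd.Series(np.sin(2 * np.pi * t / 12))

# Autocorrelation of order 3: correlation with a copy delayed by 3 steps
lag = 3
manual = series.corr(series.shift(lag))
builtin = series.autocorr(lag=lag)

# At a lag equal to the period, the series lines up with itself again
seasonal = series.autocorr(lag=12)

print(manual, builtin, seasonal)
```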
VISUALIZING TIME SERIES DATA IN PYTHON
Statsmodels
statsmodels is a Python module that provides classes and
functions for the estimation of many different statistical
models, as well as for conducting statistical tests, and
statistical data exploration.
VISUALIZING TIME SERIES DATA IN PYTHON
Plotting autocorrelations
import matplotlib.pyplot as plt
from statsmodels.graphics import tsaplots
fig = tsaplots.plot_acf(co2_levels['co2'], lags=40)
plt.show()
VISUALIZING TIME SERIES DATA IN PYTHON
Interpreting autocorrelation plots
VISUALIZING TIME SERIES DATA IN PYTHON
Partial autocorrelation in time series data
Contrary to autocorrelation, partial autocorrelation removes
the effect of previous time points
For example, a partial autocorrelation function of order 3
returns the correlation between our time series (t_1, t_2, t_3,
...) and lagged values of itself by 3 time points (t_4, t_5, t_6,
...), but only after removing all effects attributable to lags 1
and 2
VISUALIZING TIME SERIES DATA IN PYTHON
Plotting partial autocorrelations
import matplotlib.pyplot as plt
from statsmodels.graphics import tsaplots
fig = tsaplots.plot_pacf(co2_levels['co2'], lags=40)
plt.show()
VISUALIZING TIME SERIES DATA IN PYTHON
Interpreting partial autocorrelations plot
VISUALIZING TIME SERIES DATA IN PYTHON
Let's practice!
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N
Seasonality, trend
and noise in time
series data
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N
Thomas Vincent
Head of Data Science, Getty Images
Properties of time series
VISUALIZING TIME SERIES DATA IN PYTHON
The properties of time series
Seasonality: does the data display a clear periodic pattern?
Trend: does the data follow a consistent upwards or
downwards slope?
Noise: are there any outlier points or missing values that are
not consistent with the rest of the data?
VISUALIZING TIME SERIES DATA IN PYTHON
Time series decomposition
import statsmodels.api as sm
import matplotlib.pyplot as plt
from pylab import rcParams
# Set the default figure size for the decomposition plot
rcParams['figure.figsize'] = 11, 9
decomposition = sm.tsa.seasonal_decompose(
    co2_levels['co2'])
fig = decomposition.plot()
plt.show()
VISUALIZING TIME SERIES DATA IN PYTHON
A plot of time series decomposition on the CO2 data
VISUALIZING TIME SERIES DATA IN PYTHON
Extracting components from time series
decomposition
print(dir(decomposition))
['__class__', '__delattr__', '__dict__',
... 'plot', 'resid', 'seasonal', 'trend']
print(decomposition.seasonal)
datestamp
1958-03-29 1.028042
1958-04-05 1.235242
1958-04-12 1.412344
1958-04-19 1.701186
VISUALIZING TIME SERIES DATA IN PYTHON
Seasonality component in time series
decomp_seasonal = decomposition.seasonal
ax = decomp_seasonal.plot(figsize=(14, 2))
ax.set_xlabel('Date')
ax.set_ylabel('Seasonality of time series')
ax.set_title('Seasonal values of the time series')
plt.show()
VISUALIZING TIME SERIES DATA IN PYTHON
Seasonality component in time series
VISUALIZING TIME SERIES DATA IN PYTHON
Trend component in time series
decomp_trend = decomposition.trend
ax = decomp_trend.plot(figsize=(14, 2))
ax.set_xlabel('Date')
ax.set_ylabel('Trend of time series')
ax.set_title('Trend values of the time series')
plt.show()
VISUALIZING TIME SERIES DATA IN PYTHON
Trend component in time series
VISUALIZING TIME SERIES DATA IN PYTHON
Noise component in time series
decomp_resid = decomposition.resid
ax = decomp_resid.plot(figsize=(14, 2))
ax.set_xlabel('Date')
ax.set_ylabel('Residual of time series')
ax.set_title('Residual values of the time series')
plt.show()
VISUALIZING TIME SERIES DATA IN PYTHON
Noise component in time series
VISUALIZING TIME SERIES DATA IN PYTHON
Let's practice!
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N
A review on what
you have learned so
far
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N
Thomas Vincent
Head of Data Science, Getty Images
So far ...
Visualize aggregates of time series data
Extract statistical summaries
Autocorrelation and Partial autocorrelation
Time series decomposition
VISUALIZING TIME SERIES DATA IN PYTHON
The airline dataset
VISUALIZING TIME SERIES DATA IN PYTHON
Let's analyze this
data!
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N
Working with more
than one time series
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N
Thomas Vincent
Head of Data Science, Getty Images
Working with multiple time series
An isolated time series
date ts1
1949-01 112
1949-02 118
1949-03 132
A file with multiple time series
date ts1 ts2 ts3 ts4 ts5 ts6 ts7
2012-01-01 2113.8 10.4 1987.0 12.1 3091.8 43.2 476.7
2012-02-01 2009.0 9.8 1882.9 12.3 2954.0 38.8 466.8
2012-03-01 2159.8 10.0 1987.9 14.3 3043.7 40.1 502.1
VISUALIZING TIME SERIES DATA IN PYTHON
The Meat production dataset
import pandas as pd
meat = pd.read_csv("[Link]")
print(meat.head(5))
date beef veal pork lamb_and_mutton broilers
0 1944-01-01 751.0 85.0 1280.0 89.0 NaN
1 1944-02-01 713.0 77.0 1169.0 72.0 NaN
2 1944-03-01 741.0 90.0 1128.0 75.0 NaN
3 1944-04-01 650.0 89.0 978.0 66.0 NaN
4 1944-05-01 681.0 106.0 1029.0 78.0 NaN
other_chicken turkey
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
VISUALIZING TIME SERIES DATA IN PYTHON
Summarizing and plotting multiple time series
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
ax = meat.plot(figsize=(12, 4), fontsize=14)
plt.show()
VISUALIZING TIME SERIES DATA IN PYTHON
Area charts
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
ax = meat.plot.area(figsize=(12, 4), fontsize=14)
plt.show()
VISUALIZING TIME SERIES DATA IN PYTHON
Let's practice!
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N
Plot multiple time
series
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N
Thomas Vincent
Head of Data Science, Getty Images
Clarity is key
In this plot, the default matplotlib color scheme assigns the
same color to the beef and turkey time series.
VISUALIZING TIME SERIES DATA IN PYTHON
The colormap argument
ax = meat.plot(colormap='Dark2', figsize=(14, 7))
ax.set_xlabel('Date')
ax.set_ylabel('Production Volume (in tons)')
plt.show()
For the full set of available colormaps, see the matplotlib colormap reference.
VISUALIZING TIME SERIES DATA IN PYTHON
Changing line colors with the colormap argument
VISUALIZING TIME SERIES DATA IN PYTHON
Enhancing your plot with information
ax = meat.plot(colormap='Dark2', figsize=(14, 7))
df_summary = meat.describe()
ax.table(
    # Specify values of cells in the table
    cellText=df_summary.values,
    # Specify width of the table
    colWidths=[0.3]*len(meat.columns),
    # Specify row labels
    rowLabels=df_summary.index,
    # Specify column labels
    colLabels=df_summary.columns,
    # Specify location of the table
    loc='top')
plt.show()
VISUALIZING TIME SERIES DATA IN PYTHON
Adding Statistical summaries to your plots
VISUALIZING TIME SERIES DATA IN PYTHON
Dealing with different scales
VISUALIZING TIME SERIES DATA IN PYTHON
Only veal
VISUALIZING TIME SERIES DATA IN PYTHON
Facet plots
# One subplot per time series, each with its own axes
meat.plot(subplots=True,
          linewidth=0.5,
          layout=(2, 4),
          figsize=(16, 10),
          sharex=False,
          sharey=False)
plt.show()
VISUALIZING TIME SERIES DATA IN PYTHON
Time for some
action!
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N
Find relationships
between multiple
time series
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N
Thomas Vincent
Head of Data Science, Getty Images
Correlations between two variables
In the field of Statistics, the correlation coefficient is a
measure used to determine the strength or lack of
relationship between two variables:
Pearson's coefficient can be used to compute the
correlation coefficient between variables for which the
relationship is thought to be linear
Kendall Tau or Spearman rank can be used to compute the
correlation coefficient between variables for which the
relationship is thought to be non-linear
VISUALIZING TIME SERIES DATA IN PYTHON
Compute correlations
from scipy.stats import pearsonr
from scipy.stats import spearmanr
from scipy.stats import kendalltau
x = [1, 2, 4, 7]
y = [1, 3, 4, 8]
pearsonr(x, y)
(0.9843, 0.01569)
spearmanr(x, y)
SpearmanrResult(correlation=1.0, pvalue=0.0)
kendalltau(x, y)
KendalltauResult(correlation=1.0, pvalue=0.0415)
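The three coefficients differ here for a reason worth spelling out: y always increases when x increases, so both rank-based measures equal 1, but the points do not lie exactly on a straight line, so Pearson's coefficient is slightly below 1. A sketch that unpacks each result into a coefficient and a p-value (the exact printed form of the result objects depends on the SciPy version):

```python
from scipy.stats import kendalltau, pearsonr, spearmanr

x = [1, 2, 4, 7]
y = [1, 3, 4, 8]

# Each function returns a (coefficient, p-value) pair
pearson_r, pearson_p = pearsonr(x, y)
spearman_rho, spearman_p = spearmanr(x, y)
kendall_tau, kendall_p = kendalltau(x, y)

print(f"Pearson  r   = {pearson_r:.4f}")
print(f"Spearman rho = {spearman_rho:.4f}")
print(f"Kendall  tau = {kendall_tau:.4f}")
```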
VISUALIZING TIME SERIES DATA IN PYTHON
What is a correlation matrix?
When computing the correlation coefficient between more
than two variables, you obtain a correlation matrix
Range: [-1, 1]
0: no relationship
1: strong positive relationship
-1: strong negative relationship
VISUALIZING TIME SERIES DATA IN PYTHON
What is a correlation matrix?
A correlation matrix is always "symmetric"
The diagonal values will always be equal to 1
x y z
x 1.00 -0.46 0.49
y -0.46 1.00 -0.61
z 0.49 -0.61 1.00
VISUALIZING TIME SERIES DATA IN PYTHON
Computing Correlation Matrices with Pandas
corr_p = meat[['beef', 'veal','turkey']].corr(method='pearson')
print(corr_p)
beef veal turkey
beef 1.000 -0.829 0.738
veal -0.829 1.000 -0.768
turkey 0.738 -0.768 1.000
corr_s = meat[['beef', 'veal','turkey']].corr(method='spearman')
print(corr_s)
beef veal turkey
beef 1.000 -0.812 0.778
veal -0.812 1.000 -0.829
turkey 0.778 -0.829 1.000
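The two properties stated earlier (symmetry and a diagonal of ones) can be checked on any correlation matrix. The frame below is random data invented for this sketch, with y constructed to move against x.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Made-up data: y is negatively related to x, z is unrelated noise
base = rng.normal(size=200)
df = pd.DataFrame({
    'x': base,
    'y': -base + rng.normal(scale=0.5, size=200),
    'z': rng.normal(size=200),
})

corr_mat = df.corr(method='pearson')
print(corr_mat.round(2))

# Symmetric: corr(x, y) equals corr(y, x); diagonal: corr(x, x) is 1
is_symmetric = np.allclose(corr_mat, corr_mat.T)
diag_is_one = np.allclose(np.diag(corr_mat), 1.0)
print(is_symmetric, diag_is_one)
```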
VISUALIZING TIME SERIES DATA IN PYTHON
Computing Correlation Matrices with Pandas
corr_mat = meat.corr(method='pearson')
VISUALIZING TIME SERIES DATA IN PYTHON
Heatmap
import seaborn as sns
sns.heatmap(corr_mat)
VISUALIZING TIME SERIES DATA IN PYTHON
Heatmap
VISUALIZING TIME SERIES DATA IN PYTHON
Clustermap
sns.clustermap(corr_mat)
VISUALIZING TIME SERIES DATA IN PYTHON
Let's practice!
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N
Apply your
knowledge to a new
dataset
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N
Thomas Vincent
Head of Data Science, Getty Images
The Jobs dataset
VISUALIZING TIME SERIES DATA IN PYTHON
Let's get started!
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N
Beyond summary
statistics
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N
Thomas Vincent
Head of Data Science, Getty Images
Facet plots of the jobs dataset
jobs.plot(subplots=True,
          layout=(4, 4),
          figsize=(20, 16),
          sharex=True,
          sharey=False)
plt.show()
VISUALIZING TIME SERIES DATA IN PYTHON
Annotating events in the jobs dataset
ax = jobs.plot(figsize=(20, 14), colormap='Dark2')
# Mark the start and end of the period of interest
ax.axvline('2008-01-01', color='black',
           linestyle='--')
ax.axvline('2009-01-01', color='black',
           linestyle='--')
VISUALIZING TIME SERIES DATA IN PYTHON
Taking seasonal average in the jobs dataset
print(jobs.index)
DatetimeIndex(['2000-01-01', '2000-02-01', '2000-03-01',
'2000-04-01', '2009-09-01','2009-10-01',
'2009-11-01', '2009-12-01','2010-01-01', '2010-02-01'],
dtype='datetime64[ns]', name='datestamp',
length=122, freq=None)
index_month = jobs.index.month
jobs_by_month = jobs.groupby(index_month).mean()
print(jobs_by_month)
datestamp Agriculture Business services Construction
1 13.763636 7.863636 12.909091
2 13.645455 7.645455 13.600000
3 13.830000 7.130000 11.290000
4 9.130000 6.270000 9.450000
5 7.100000 6.600000 8.120000
...
VISUALIZING TIME SERIES DATA IN PYTHON
Monthly averages in the jobs dataset
ax = jobs_by_month.plot(figsize=(12, 5),
colormap='Dark2')
# Place the legend outside the plot, to the right
ax.legend(bbox_to_anchor=(1.0, 0.5),
          loc='center left')
VISUALIZING TIME SERIES DATA IN PYTHON
Monthly averages in the jobs dataset
VISUALIZING TIME SERIES DATA IN PYTHON
Time to practice!
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N
Decompose time
series data
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N
Thomas Vincent
Head of Data Science, Getty Images
Python dictionaries
# Initialize a Python dictionary
my_dict = {}
# Add a key and value to your dictionary
my_dict['your_key'] = 'your_value'
# Add a second key and value to your dictionary
my_dict['your_second_key'] = 'your_second_value'
# Print out your dictionary
print(my_dict)
{'your_key': 'your_value',
'your_second_key': 'your_second_value'}
VISUALIZING TIME SERIES DATA IN PYTHON
Decomposing multiple time series with Python
dictionaries
# Import the statsmodels library
import statsmodels.api as sm
# Initialize a dictionary
my_dict = {}
# Extract the names of the time series
ts_names = jobs.columns
print(ts_names)
['ts1', 'ts2', 'ts3']
# Run time series decomposition
for ts in ts_names:
    ts_decomposition = sm.tsa.seasonal_decompose(jobs[ts])
    my_dict[ts] = ts_decomposition
VISUALIZING TIME SERIES DATA IN PYTHON
Extract decomposition components of multiple time
series
# Initialize a new dictionary
my_dict_trend = {}
# Extract the trend component
for ts in ts_names:
    my_dict_trend[ts] = my_dict[ts].trend
# Convert to a DataFrame
trend_df = pd.DataFrame.from_dict(my_dict_trend)
print(trend_df)
ts1 ts2 ts3
datestamp
2000-01-01 2.2 1.3 3.6
2000-02-01 3.4 2.1 4.7
...
VISUALIZING TIME SERIES DATA IN PYTHON
Python dictionaries
for the win!
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N
Compute
correlations
between time series
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N
Thomas Vincent
Head of Data Science, Getty Images
Trends in Jobs data
print(trend_df)
datestamp Agriculture Business services Construction
2000-01-01 NaN NaN NaN
2000-02-01 NaN NaN NaN
2000-03-01 NaN NaN NaN
2000-04-01 NaN NaN NaN
2000-05-01 NaN NaN NaN
2000-06-01 NaN NaN NaN
2000-07-01 9.170833 4.787500 6.329167
2000-08-01 9.466667 4.820833 6.304167
...
VISUALIZING TIME SERIES DATA IN PYTHON
Plotting a clustermap of the jobs correlation matrix
# Get correlation matrix of the trend_df DataFrame
trend_corr = trend_df.corr(method='spearman')
# Customize the clustermap of the trend_corr correlation matrix
fig = sns.clustermap(trend_corr, annot=True, linewidth=0.4)
plt.setp(fig.ax_heatmap.yaxis.get_majorticklabels(),
         rotation=0)
plt.setp(fig.ax_heatmap.xaxis.get_majorticklabels(),
         rotation=90)
VISUALIZING TIME SERIES DATA IN PYTHON
The jobs correlation matrix
VISUALIZING TIME SERIES DATA IN PYTHON
Let's practice!
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N
Congratulations!
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N
Thomas Vincent
Head of Data Science, Getty Images
Going further with time series
Data from Zillow Research
Kaggle competitions
Reddit Data
VISUALIZING TIME SERIES DATA IN PYTHON
Going further with time series
The importance of time series in business:
to identify seasonal patterns and trends
to study past behaviors
to produce robust forecasts
to evaluate and compare company achievements
VISUALIZING TIME SERIES DATA IN PYTHON
Getting to the next level
Manipulating Time Series Data in Python
Importing & Managing Financial Data in Python
Statistical Thinking in Python (Part 1)
Supervised Learning with scikit-learn
VISUALIZING TIME SERIES DATA IN PYTHON
Thank you!
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N
Introduction to time
series and
stationarity
ARIMA MODELS IN PYTHON
James Fulton
Climate informatics researcher
Motivation
Time series are everywhere
Science
Technology
Business
Finance
Policy
ARIMA MODELS IN PYTHON
Course content
You will learn
Structure of ARIMA models
How to fit ARIMA models
How to optimize the model
How to make forecasts
How to calculate uncertainty in predictions
ARIMA MODELS IN PYTHON
Loading and plotting
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('time_series.csv', index_col='date', parse_dates=True)
date values
2019-03-11 5.734193
2019-03-12 6.288708
2019-03-13 5.205788
2019-03-14 3.176578
ARIMA MODELS IN PYTHON
Trend
fig, ax = plt.subplots()
df.plot(ax=ax)
plt.show()
ARIMA MODELS IN PYTHON
Seasonality
ARIMA MODELS IN PYTHON
Cyclicality
ARIMA MODELS IN PYTHON
White noise
A white noise series has uncorrelated values
Heads, heads, heads, tails, heads, tails, ...
0.1, -0.3, 0.8, 0.4, -0.5, 0.9, ...
ARIMA MODELS IN PYTHON
Stationarity
Stationary Not stationary
Trend stationary: Trend is zero
Variance is constant
Autocorrelation is constant
ARIMA MODELS IN PYTHON
Train-test split
# Train data - all data up to the end of 2018
df_train = df.loc[:'2018']
# Test data - all data from 2019 onwards
df_test = df.loc['2019':]
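A quick check that such a split is clean: the two pieces should not overlap and together should cover every row. The daily frame below is invented for the sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical daily data spanning 2017 through 2019
index = pd.date_range('2017-01-01', '2019-12-31', freq='D')
df = pd.DataFrame({'values': np.arange(len(index))}, index=index)

# The year string '2018' includes every day up to 2018-12-31
df_train = df.loc[:'2018']
df_test = df.loc['2019':]

print(df_train.index.max(), df_test.index.min())
print(len(df_train), len(df_test), len(df))
```

Splitting by date rather than at random keeps the test period strictly after the training period, which is what a forecast is evaluated on.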
ARIMA MODELS IN PYTHON
Let's Practice!
ARIMA MODELS IN PYTHON
Making time series
stationary
ARIMA MODELS IN PYTHON
James Fulton
Climate informatics researcher
Overview
Statistical tests for stationarity
Making a dataset stationary
ARIMA MODELS IN PYTHON
The augmented Dickey-Fuller test
Tests for trend non-stationarity
Null hypothesis is time series is non-stationary
ARIMA MODELS IN PYTHON
Applying the adfuller test
from statsmodels.tsa.stattools import adfuller
results = adfuller(df['close'])
ARIMA MODELS IN PYTHON
Interpreting the test result
print(results)
(-1.34, 0.60, 23, 1235, {'1%': -3.435, '5%': -2.863, '10%': -2.568}, 10782.87)
0th element is test statistic (-1.34)
More negative means more likely to be stationary
1st element is p-value: (0.60)
If p-value is small → reject null hypothesis. Reject non-stationary.
4th element is a dictionary of critical values for the test statistic
ARIMA MODELS IN PYTHON
The value of plotting
Plotting time series can stop you making wrong assumptions
ARIMA MODELS IN PYTHON
The value of plotting
ARIMA MODELS IN PYTHON
Making a time series stationary
ARIMA MODELS IN PYTHON
Taking the difference
Difference: Δy_t = y_t − y_{t-1}
ARIMA MODELS IN PYTHON
Taking the difference
df_stationary = df.diff()
city_population
date
1969-09-30 NaN
1970-03-31 -0.116156
1970-09-30 0.050850
1971-03-31 -0.153261
1971-09-30 0.108389
ARIMA MODELS IN PYTHON
Taking the difference
df_stationary = df.diff().dropna()
city_population
date
1970-03-31 -0.116156
1970-09-30 0.050850
1971-03-31 -0.153261
1971-09-30 0.108389
1972-03-31 -0.029569
ARIMA MODELS IN PYTHON
Taking the difference
ARIMA MODELS IN PYTHON
Other transforms
Examples of other transforms
Take the log
np.log(df)
Take the square root
np.sqrt(df)
Take the proportional change
df.shift(1)/df
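These transforms can be compared on a tiny example. The frame below holds a made-up series growing by 10% per step, so the plain difference keeps growing while the log difference and the ratio are constant.

```python
import numpy as np
import pandas as pd

# Made-up series that grows by 10% each step
df = pd.DataFrame({'value': [100.0, 110.0, 121.0, 133.1]})

# Plain difference: y_t - y_{t-1}; the first row is NaN and is dropped
diffed = df.diff().dropna()

# Log, then difference: constant for constant percentage growth
log_diff = np.log(df).diff().dropna()

# Square root transform
rooted = np.sqrt(df)

# Ratio of the previous value to the current one, as written on the slide
ratio = (df.shift(1) / df).dropna()

print(diffed['value'].tolist())
print(log_diff['value'].round(4).tolist())
print(ratio['value'].round(4).tolist())
```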
ARIMA MODELS IN PYTHON
Let's practice!
ARIMA MODELS IN PYTHON
Intro to AR, MA and
ARMA models
ARIMA MODELS IN PYTHON
James Fulton
Climate informatics researcher
AR models
Autoregressive (AR) model
AR(1) model:
y_t = a_1 y_{t-1} + ϵ_t
AR(2) model:
y_t = a_1 y_{t-1} + a_2 y_{t-2} + ϵ_t
AR(p) model:
y_t = a_1 y_{t-1} + a_2 y_{t-2} + ... + a_p y_{t-p} + ϵ_t
ARIMA MODELS IN PYTHON
MA models
Moving average (MA) model
MA(1) model:
y_t = m_1 ϵ_{t-1} + ϵ_t
MA(2) model:
y_t = m_1 ϵ_{t-1} + m_2 ϵ_{t-2} + ϵ_t
MA(q) model:
y_t = m_1 ϵ_{t-1} + m_2 ϵ_{t-2} + ... + m_q ϵ_{t-q} + ϵ_t
ARIMA MODELS IN PYTHON
ARMA models
Autoregressive moving-average (ARMA) model
ARMA = AR + MA
ARMA(1,1) model :
y_t = a_1 y_{t-1} + m_1 ϵ_{t-1} + ϵ_t
ARMA(p, q)
p is order of AR part
q is order of MA part
ARIMA MODELS IN PYTHON
Creating ARMA data
y_t = a_1 y_{t-1} + m_1 ϵ_{t-1} + ϵ_t
ARIMA MODELS IN PYTHON
Creating ARMA data
y_t = 0.5 y_{t-1} + 0.2 ϵ_{t-1} + ϵ_t
from statsmodels.tsa.arima_process import arma_generate_sample
ar_coefs = [1, -0.5]
ma_coefs = [1, 0.2]
y = arma_generate_sample(ar_coefs, ma_coefs, nsample=100, scale=0.5)
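The coefficient lists follow the lag-polynomial convention: both start with 1 for the zero-lag term, and the AR coefficients are entered with their signs flipped, which is why a_1 = 0.5 appears as -0.5. A sketch that builds the lists from the equation's coefficients (the variable names a1 and m1 are introduced here only for illustration):

```python
import numpy as np
from statsmodels.tsa.arima_process import arma_generate_sample

np.random.seed(0)

# Coefficients as they appear in y_t = 0.5 y_{t-1} + 0.2 e_{t-1} + e_t
a1, m1 = 0.5, 0.2

# Zero-lag term first; AR signs are flipped, MA signs are kept
ar_coefs = [1, -a1]
ma_coefs = [1, m1]

# scale is the standard deviation of the shocks
y = arma_generate_sample(ar_coefs, ma_coefs, nsample=100, scale=0.5)
print(y[:5])
```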
ARIMA MODELS IN PYTHON
Creating ARMA data
y_t = 0.5 y_{t-1} + 0.2 ϵ_{t-1} + ϵ_t
ARIMA MODELS IN PYTHON
Fitting an ARMA model
from statsmodels.tsa.arima.model import ARIMA
# Instantiate model object
model = ARIMA(y, order=(1,0,1))
# Fit model
results = model.fit()
ARIMA MODELS IN PYTHON
Let's practice!
ARIMA MODELS IN PYTHON
Fitting time series
models
ARIMA MODELS IN PYTHON
James Fulton
Climate informatics researcher
Creating a model
from statsmodels.tsa.arima.model import ARIMA
# This is an ARMA(p,q) model
model = ARIMA(timeseries, order=(p,0,q))
ARIMA MODELS IN PYTHON
Creating AR and MA models
ar_model = ARIMA(timeseries, order=(p,0,0))
ma_model = ARIMA(timeseries, order=(0,0,q))
ARIMA MODELS IN PYTHON
Fitting the model and fit summary
model = ARIMA(timeseries, order=(2,0,1))
results = model.fit()
print(results.summary())
ARIMA MODELS IN PYTHON
Fit summary
SARIMAX Results
==============================================================================
Dep. Variable: y No. Observations: 1000
Model: ARMA(2, 1) Log Likelihood 148.580
Date: Thu, 25 Apr 2022 AIC -287.159
Time: [Link] BIC -262.621
Sample: 0 HQIC -277.833
Covariance Type: opg
ARIMA MODELS IN PYTHON
Fit summary
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const -0.0017 0.012 -0.147 0.883 -0.025 0.021
ar.L1.y 0.5253 0.054 9.807 0.000 0.420 0.630
ar.L2.y -0.2909 0.042 -6.850 0.000 -0.374 -0.208
ma.L1.y 0.3679 0.052 7.100 0.000 0.266 0.469
sigma2 1.6306 0.339 6.938 0.000 0.583 1.943
ARIMA MODELS IN PYTHON
Introduction to ARMAX models
Exogenous ARMA
Use external variables as well as time series
ARMAX = ARMA + linear regression
ARIMA MODELS IN PYTHON
ARMAX equation
ARMA(1,1) model:
y_t = a_1 y_{t-1} + m_1 ϵ_{t-1} + ϵ_t
ARMAX(1,1) model:
y_t = x_1 z_t + a_1 y_{t-1} + m_1 ϵ_{t-1} + ϵ_t
ARIMA MODELS IN PYTHON
ARMAX example
ARIMA MODELS IN PYTHON
ARMAX example
ARIMA MODELS IN PYTHON
Fitting ARMAX
# Instantiate the model
model = ARIMA(df['productivity'], order=(2,0,1), exog=df['hours_sleep'])
# Fit the model
results = model.fit()
ARIMA MODELS IN PYTHON
ARMAX summary
==============================================================================
coef std err z P>|z| [0.025 0.975]
-----------------------------------------------------------------------------
const -0.1936 0.092 -2.098 0.041 -0.375 -0.013
x1 0.1131 0.013 8.602 0.000 0.087 0.139
ar.L1.y 0.1917 0.252 0.760 0.450 -0.302 0.686
ar.L2.y -0.3740 0.121 -3.079 0.003 -0.612 -0.136
ma.L1.y -0.0740 0.259 -0.286 0.776 -0.581 0.433
ARIMA MODELS IN PYTHON
Let's practice!
ARIMA MODELS IN PYTHON
Forecasting
ARIMA MODELS IN PYTHON
James Fulton
Climate informatics researcher
Predicting the next value
Take an AR(1) model
$y_t = a_1 y_{t-1} + \epsilon_t$
Predict next value
$y_t = 0.6 \times 10 + \epsilon_t$
$y_t = 6.0 + \epsilon_t$
Uncertainty on prediction
$5.0 < y_t < 7.0$
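The arithmetic above can be checked in a few lines. The coefficient 0.6 and the previous value 10 are taken from the slide; the shock standard deviation of 1.0 is an assumption made here so that the interval matches the one shown, since the slides do not state it.

```python
a1 = 0.6        # AR(1) coefficient from the slide
y_prev = 10.0   # last observed value from the slide
sigma = 1.0     # assumed shock standard deviation (not given in the source)

# The shock has zero mean, so the point forecast is a1 * y_prev
point_forecast = a1 * y_prev

# A one-standard-deviation band around the point forecast
lower = point_forecast - sigma
upper = point_forecast + sigma
print(point_forecast, lower, upper)
```

A wider band (for example about two standard deviations for a 95% interval under normal shocks) would follow the same pattern.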
ARIMA MODELS IN PYTHON
One-step-ahead predictions
ARIMA MODELS IN PYTHON
Making one-step-ahead predictions
# Make predictions for last 25 values
results = model.fit()
# Make in-sample prediction
forecast = results.get_prediction(start=-25)
# forecast mean
mean_forecast = forecast.predicted_mean
Predicted mean is a pandas series
2013-10-28 1.519368
2013-10-29 1.351082
2013-10-30 1.218016
ARIMA MODELS IN PYTHON
Confidence intervals
# Get confidence intervals of forecasts
confidence_intervals = forecast.conf_int()
Confidence interval method returns pandas DataFrame
lower y upper y
2013-09-28 -4.720471 -0.815384
2013-09-29 -5.069875 0.112505
2013-09-30 -5.232837 0.766300
2013-10-01 -5.305814 1.282935
2013-10-02 -5.326956 1.703974
ARIMA MODELS IN PYTHON
Plotting predictions
plt.figure()
# Plot prediction
plt.plot(dates,
         mean_forecast.values,
         color='red',
         label='forecast')
# Shade uncertainty area
plt.fill_between(dates, lower_limits, upper_limits, color='pink')
plt.show()
ARIMA MODELS IN PYTHON
Plotting predictions
ARIMA MODELS IN PYTHON
Dynamic predictions
ARIMA MODELS IN PYTHON
Making dynamic predictions
results = model.fit()
forecast = results.get_prediction(start=-25, dynamic=True)
# forecast mean
mean_forecast = forecast.predicted_mean
# Get confidence intervals of forecasts
confidence_intervals = forecast.conf_int()
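The difference between one-step-ahead and dynamic predictions can be illustrated without statsmodels. The sketch below uses an AR(1) series with a known, hypothetical coefficient rather than a fitted model, so it only mirrors the idea behind `dynamic=True`, not the library's internals.

```python
import numpy as np

a1 = 0.6                       # hypothetical AR(1) coefficient (treated as known)
rng = np.random.default_rng(1)
n, start = 100, 75             # predict the last 25 points
y = np.zeros(n)
for t in range(1, n):
    y[t] = a1 * y[t - 1] + rng.normal()

# One-step-ahead: every prediction uses the observed previous value
one_step = a1 * y[start - 1:n - 1]

# Dynamic: after the first step, each prediction is built on the previous prediction
dynamic = np.empty(n - start)
prev = y[start - 1]
for k in range(n - start):
    prev = a1 * prev
    dynamic[k] = prev
```

Both approaches agree on the first predicted point. After that the dynamic forecast decays geometrically towards the series mean, because it never sees new observations.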
ARIMA MODELS IN PYTHON
Forecasting out of sample
forecast = results.get_forecast(steps=20)
# forecast mean
mean_forecast = forecast.predicted_mean
# Get confidence intervals of forecasts
confidence_intervals = forecast.conf_int()
ARIMA MODELS IN PYTHON
Let's practice!
ARIMA MODELS IN PYTHON
Introduction to
ARIMA models
ARIMA MODELS IN PYTHON
James Fulton
Climate informatics researcher
Non-stationary time series recap
ARIMA MODELS IN PYTHON
Forecast of differenced time series
ARIMA MODELS IN PYTHON
Reconstructing original time series after differencing
diff_forecast = results.get_forecast(steps=10).predicted_mean
from numpy import cumsum
mean_forecast = cumsum(diff_forecast)
ARIMA MODELS IN PYTHON
Reconstructing original time series after differencing
diff_forecast = results.get_forecast(steps=10).predicted_mean
from numpy import cumsum
mean_forecast = cumsum(diff_forecast) + df.iloc[-1,0]
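A small numeric check shows why the cumulative sum plus the last observed value undoes differencing. The series below is made up for illustration, and the true future differences stand in for a forecast of the differenced series.

```python
import numpy as np

y = np.array([10.0, 12.0, 11.0, 15.0, 18.0, 17.0])
history, future = y[:3], y[3:]

# Differences leading to y[3], y[4], y[5]; a stand-in for a differenced forecast
diff_forecast = np.diff(y)[2:]

# Integrate the differences and add back the last observed level
reconstructed = np.cumsum(diff_forecast) + history[-1]
```

Without the `+ history[-1]` term the result would start from zero instead of from the last known level of the series.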
ARIMA MODELS IN PYTHON
The ARIMA model
Take the difference
Fit ARMA model
Integrate forecast
Can we avoid doing so much work?
Yes!
ARIMA - Autoregressive Integrated Moving Average
ARIMA MODELS IN PYTHON
Using the ARIMA model
from statsmodels.tsa.arima.model import ARIMA
model = ARIMA(df, order=(p,d,q))
p - number of autoregressive lags
d - order of differencing
q - number of moving average lags
ARIMA(p, 0, q) = ARMA(p, q)
ARIMA MODELS IN PYTHON
Using the ARIMA model
# Create model
model = ARIMA(df, order=(2,1,1))
# Fit model
results = model.fit()
# Make forecast
mean_forecast = results.get_forecast(steps=10).predicted_mean
ARIMA MODELS IN PYTHON
Using the ARIMA model
# Make forecast
mean_forecast = results.get_forecast(steps=steps).predicted_mean
ARIMA MODELS IN PYTHON
Picking the difference order
from statsmodels.tsa.stattools import adfuller
adf = adfuller(df.iloc[:,0])
print('ADF Statistic:', adf[0])
print('p-value:', adf[1])
ADF Statistic: -2.674
p-value: 0.0784
adf = adfuller(df.diff().dropna().iloc[:,0])
print('ADF Statistic:', adf[0])
print('p-value:', adf[1])
ADF Statistic: -4.978
p-value: 2.44e-05
ARIMA MODELS IN PYTHON
Picking the difference order
model = ARIMA(df, order=(p,1,q))
ARIMA MODELS IN PYTHON
Let's practice!
ARIMA MODELS IN PYTHON
Intro to ACF and
PACF
ARIMA MODELS IN PYTHON
James Fulton
Climate informatics researcher
Motivation
ARIMA MODELS IN PYTHON
ACF and PACF
ACF - Autocorrelation Function
PACF - Partial autocorrelation function
ARIMA MODELS IN PYTHON
What is the ACF
lag-1 autocorrelation → $\mathrm{corr}(y_t, y_{t-1})$
lag-2 autocorrelation → $\mathrm{corr}(y_t, y_{t-2})$
...
lag-n autocorrelation → $\mathrm{corr}(y_t, y_{t-n})$
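The lag-n autocorrelation can be computed by hand as follows. This is a minimal sketch using one common sample estimator (lagged products of the demeaned series divided by the total sum of squares); the helper name `autocorr` and the alternating test series are illustrative assumptions.

```python
import numpy as np

def autocorr(x, lag):
    """Sample autocorrelation of x at a non-negative integer lag."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    if lag == 0:
        return 1.0
    # Products of each value with the value `lag` steps earlier
    return float(np.sum(d[lag:] * d[:-lag]) / np.sum(d * d))

# A series that flips sign every step: strongly negative at odd lags,
# strongly positive at even lags
x = np.array([1.0, -1.0] * 50)
acf_values = [autocorr(x, k) for k in range(1, 4)]
```

Plotting these values against the lag gives the kind of bar chart that `plot_acf` produces.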
ARIMA MODELS IN PYTHON
What is the ACF
ARIMA MODELS IN PYTHON
What is the PACF
ARIMA MODELS IN PYTHON
Using ACF and PACF to choose model order
AR(2) model →
ARIMA MODELS IN PYTHON
Using ACF and PACF to choose model order
MA(2) model →
ARIMA MODELS IN PYTHON
Implementation in Python
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
# Create figure
fig, (ax1, ax2) = plt.subplots(2,1, figsize=(8,8))
# Make ACF plot
plot_acf(df, lags=10, zero=False, ax=ax1)
# Make PACF plot
plot_pacf(df, lags=10, zero=False, ax=ax2)
plt.show()
ARIMA MODELS IN PYTHON
Over/under differencing and ACF and PACF
ARIMA MODELS IN PYTHON
Let's practice!
ARIMA MODELS IN PYTHON
AIC and BIC
ARIMA MODELS IN PYTHON
James Fulton
Climate informatics researcher
AIC - Akaike information criterion
Lower AIC indicates a better model
AIC likes to choose simple models with lower order
ARIMA MODELS IN PYTHON
BIC - Bayesian information criterion
Very similar to AIC
Lower BIC indicates a better model
BIC likes to choose simple models with lower order
ARIMA MODELS IN PYTHON
AIC vs BIC
BIC favors simpler models than AIC
AIC is better at choosing predictive models
BIC is better at choosing good explanatory models
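Both criteria combine the log likelihood with a penalty on the number of estimated parameters k: AIC = 2k - 2 log L and BIC = k ln(n) - 2 log L. The sketch below plugs in the log likelihood and the number of observations from the ARMA(2,1) fit summary shown earlier. The parameter count k = 5 (const, two AR terms, one MA term, sigma2) is an assumption inferred from the coefficient table.

```python
import math

log_likelihood = 148.580   # from the ARMA(2,1) fit summary
n_obs = 1000               # number of observations in that summary
k = 5                      # assumed count: const, ar.L1, ar.L2, ma.L1, sigma2

aic = 2 * k - 2 * log_likelihood
bic = k * math.log(n_obs) - 2 * log_likelihood
print(round(aic, 3), round(bic, 3))
```

The results agree with the AIC of -287.159 and BIC of -262.621 reported in that summary, up to rounding of the log likelihood. Because ln(n) exceeds 2 once n is larger than about 7, BIC penalizes each extra parameter more heavily, which is why it favors simpler models.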
ARIMA MODELS IN PYTHON
AIC and BIC in statsmodels
# Create model
model = ARIMA(df, order=(1,0,1))
# Fit model
results = model.fit()
# Print fit summary
print(results.summary())
Statespace Model Results
==============================================================================
Dep. Variable: y No. Observations: 1000
Model: SARIMAX(2, 0, 0) Log Likelihood -1399.704
Date: Fri, 10 May 2019 AIC 2805.407
Time: [Link] BIC 2820.131
Sample: 01-01-2013 HQIC 2811.003
- 09-27-2015
Covariance Type: opg
ARIMA MODELS IN PYTHON
AIC and BIC in statsmodels
# Create model
model = ARIMA(df, order=(1,0,1))
# Fit model
results = model.fit()
# Print AIC and BIC
print('AIC:', results.aic)
print('BIC:', results.bic)
AIC: 2806.36
BIC: 2821.09
ARIMA MODELS IN PYTHON
Searching over AIC and BIC
# Loop over AR order
for p in range(3):
    # Loop over MA order
    for q in range(3):
        # Fit model
        model = ARIMA(df, order=(p,0,q))
        results = model.fit()
        # Print the model order and the AIC/BIC values
        print(p, q, results.aic, results.bic)
0 0 2900.13 2905.04
0 1 2828.70 2838.52
0 2 2806.69 2821.42
1 0 2810.25 2820.06
1 1 2806.37 2821.09
1 2 2807.52 2827.15
...
ARIMA MODELS IN PYTHON
Searching over AIC and BIC
order_aic_bic = []
# Loop over AR order
for p in range(3):
    # Loop over MA order
    for q in range(3):
        # Fit model
        model = ARIMA(df, order=(p,0,q))
        results = model.fit()
        # Add order and scores to list
        order_aic_bic.append((p, q, results.aic, results.bic))
# Make DataFrame of model order and AIC/BIC scores
order_df = pd.DataFrame(order_aic_bic, columns=['p','q', 'aic', 'bic'])
ARIMA MODELS IN PYTHON
Searching over AIC and BIC
# Sort by AIC
print(order_df.sort_values('aic'))
   p  q      aic      bic
7  2  1  2804.54  2824.17
6  2  0  2805.41  2820.13
4  1  1  2806.37  2821.09
2  0  2  2806.69  2821.42
...
# Sort by BIC
print(order_df.sort_values('bic'))
   p  q      aic      bic
3  1  0  2810.25  2820.06
6  2  0  2805.41  2820.13
4  1  1  2806.37  2821.09
2  0  2  2806.69  2821.42
...
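Picking the winning order from such a search can also be done without a DataFrame. The sketch below uses only the (p, q, AIC, BIC) rows that are visible in the slides; rows elided there (shown as `...`) are not filled in.

```python
# (p, q, AIC, BIC) rows copied from the search output shown in the slides
order_aic_bic = [
    (0, 0, 2900.13, 2905.04),
    (0, 1, 2828.70, 2838.52),
    (0, 2, 2806.69, 2821.42),
    (1, 0, 2810.25, 2820.06),
    (1, 1, 2806.37, 2821.09),
    (1, 2, 2807.52, 2827.15),
    (2, 0, 2805.41, 2820.13),
    (2, 1, 2804.54, 2824.17),
]

# Lower is better for both criteria
best_by_aic = min(order_aic_bic, key=lambda row: row[2])
best_by_bic = min(order_aic_bic, key=lambda row: row[3])
print("Best by AIC: p=%d, q=%d" % best_by_aic[:2])
print("Best by BIC: p=%d, q=%d" % best_by_bic[:2])
```

On these rows AIC prefers the larger ARMA(2,1) model while BIC prefers the simpler AR(1) model, which matches the earlier point that BIC favors simpler models than AIC.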
ARIMA MODELS IN PYTHON
Non-stationary model orders
# Fit model
model = ARIMA(df, order=(2,0,1))
results = model.fit()
ValueError: Non-stationary starting autoregressive parameters
found with `enforce_stationarity` set to True.
ARIMA MODELS IN PYTHON
When certain orders don't work
# Loop over AR order
for p in range(3):
    # Loop over MA order
    for q in range(3):
        try:
            # Fit model
            model = ARIMA(df, order=(p,0,q))
            results = model.fit()
            # Print the model order and the AIC/BIC values
            print(p, q, results.aic, results.bic)
        except Exception:
            # Print AIC and BIC as None when the fit fails
            print(p, q, None, None)
ARIMA MODELS IN PYTHON
Let's practice!
ARIMA MODELS IN PYTHON
Model diagnostics
ARIMA MODELS IN PYTHON
James Fulton
Climate informatics researcher
Introduction to model diagnostics
How good is the final model?
ARIMA MODELS IN PYTHON
Residuals
# Fit model
model = ARIMA(df, order=(p,d,q))
results = model.fit()
# Assign residuals to variable
residuals = results.resid
2013-01-23 1.013129
2013-01-24 0.114055
2013-01-25 0.430698
2013-01-26 -1.247046
2013-01-27 -0.499565
... ...
ARIMA MODELS IN PYTHON
Mean absolute error
How far are the predictions from the real values?
mae = np.mean(np.abs(residuals))
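As a worked example, the mean absolute error of the five residuals printed above can be computed by hand. The values are copied from the slide; using only five points is purely for illustration.

```python
# The five residuals shown in the slides
residuals = [1.013129, 0.114055, 0.430698, -1.247046, -0.499565]

# Average of the absolute residuals
mae = sum(abs(r) for r in residuals) / len(residuals)
print(round(mae, 4))
```

Taking absolute values first matters: positive and negative residuals would otherwise cancel and understate the typical error.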
ARIMA MODELS IN PYTHON
Plot diagnostics
If the model fits well the residuals will be
white Gaussian noise
# Create the 4 diagnostics plots
results.plot_diagnostics()
plt.show()
ARIMA MODELS IN PYTHON
Residuals plot
ARIMA MODELS IN PYTHON
Histogram plus estimated density
ARIMA MODELS IN PYTHON
Normal Q-Q
ARIMA MODELS IN PYTHON
Correlogram
ARIMA MODELS IN PYTHON
Summary statistics
print(results.summary())
...
===================================================================================
Ljung-Box (Q): 32.10 Jarque-Bera (JB): 0.02
Prob(Q): 0.81 Prob(JB): 0.99
Heteroskedasticity (H): 1.28 Skew: -0.02
Prob(H) (two-sided): 0.21 Kurtosis: 2.98
===================================================================================
Prob(Q) - p-value for null hypothesis that residuals are uncorrelated
Prob(JB) - p-value for null hypothesis that residuals are normal
ARIMA MODELS IN PYTHON
Let's practice!
ARIMA MODELS IN PYTHON
Box-Jenkins method
ARIMA MODELS IN PYTHON
James Fulton
Climate informatics researcher
The Box-Jenkins method
From raw data → production model
identification
estimation
model diagnostics
ARIMA MODELS IN PYTHON
Identification
Is the time series stationary?
What differencing will make it stationary?
What transforms will make it stationary?
What values of p and q are most
promising?
ARIMA MODELS IN PYTHON
Identification tools
Plot the time series
df.plot()
Use augmented Dickey-Fuller test
adfuller()
Use transforms and/or differencing
df.diff() , np.log() , np.sqrt()
Plot ACF/PACF
plot_acf() , plot_pacf()
ARIMA MODELS IN PYTHON
Estimation
Use the data to train the model coefficients
Done for us using model.fit()
Choose between models using AIC and BIC
results.aic , results.bic
ARIMA MODELS IN PYTHON
Model diagnostics
Are the residuals uncorrelated?
Are the residuals normally distributed?
results.plot_diagnostics()
results.summary()
ARIMA MODELS IN PYTHON
Decision
ARIMA MODELS IN PYTHON
Repeat
We go through the process again with more
information
Find a better model
ARIMA MODELS IN PYTHON
Production
Ready to make forecasts
results.get_forecast()
ARIMA MODELS IN PYTHON
Box-Jenkins
ARIMA MODELS IN PYTHON
Let's practice!
ARIMA MODELS IN PYTHON
Seasonal time series
ARIMA MODELS IN PYTHON
James Fulton
Climate informatics researcher
Seasonal data
Has predictable and repeated patterns
Repeats after any amount of time
ARIMA MODELS IN PYTHON
Seasonal decomposition
time series = trend + seasonal + residual
ARIMA MODELS IN PYTHON
Seasonal decomposition using statsmodels
# Import
from statsmodels.tsa.seasonal import seasonal_decompose
# Decompose data
decomp_results = seasonal_decompose(df['IPG3113N'], period=12)
type(decomp_results)
statsmodels.tsa.seasonal.DecomposeResult
ARIMA MODELS IN PYTHON
Seasonal decomposition using statsmodels
# Plot decomposed data
decomp_results.plot()
plt.show()
ARIMA MODELS IN PYTHON
Finding seasonal period using ACF
ARIMA MODELS IN PYTHON
Identifying seasonal data using ACF
ARIMA MODELS IN PYTHON
Detrending time series
# Subtract long rolling average over N steps
df = df - df.rolling(N).mean()
# Drop NaN values
df = df.dropna()
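The effect of subtracting a rolling average can be seen on a synthetic series. This sketch uses NumPy's `np.convolve` as a stand-in for the pandas rolling mean; the linear trend, the period of 12, and the window length are hypothetical values chosen so the result is easy to verify.

```python
import numpy as np

N = 12                                   # window length: one full seasonal cycle
t = np.arange(120)
season = np.sin(2 * np.pi * t / 12)      # seasonal pattern with period 12
series = 0.5 * t + season                # linear trend plus seasonality

# Trailing rolling mean over N steps; the first N - 1 points have no full window
rolling_mean = np.convolve(series, np.ones(N) / N, mode="valid")

# Subtract the rolling mean from the matching points of the series
detrended = series[N - 1:] - rolling_mean
```

Because the window spans exactly one cycle, the seasonal part averages out of the rolling mean, and what remains after subtraction is the seasonal pattern plus a constant offset. Dropping the first N - 1 points plays the role of `dropna()` in the pandas version.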
ARIMA MODELS IN PYTHON
Identifying seasonal data using ACF
# Create figure
fig, ax = plt.subplots(1,1, figsize=(8,4))
# Plot ACF
plot_acf(df.dropna(), ax=ax, lags=25, zero=False)
plt.show()
ARIMA MODELS IN PYTHON
ARIMA models and seasonal data
ARIMA MODELS IN PYTHON
Let's practice!
ARIMA MODELS IN PYTHON
SARIMA models
ARIMA MODELS IN PYTHON
James Fulton
Climate informatics researcher
The SARIMA model
Seasonal ARIMA = SARIMA
SARIMA(p,d,q)(P,D,Q)$_S$
Non-seasonal orders
p: autoregressive order
d: differencing order
q: moving average order
Seasonal orders
P: seasonal autoregressive order
D: seasonal differencing order
Q: seasonal moving average order
S: number of time steps per cycle
ARIMA MODELS IN PYTHON
The SARIMA model
ARIMA(2,0,1) model:
$y_t = a_1 y_{t-1} + a_2 y_{t-2} + m_1 \epsilon_{t-1} + \epsilon_t$
SARIMA(0,0,0)(2,0,1)$_7$ model:
$y_t = a_7 y_{t-7} + a_{14} y_{t-14} + m_7 \epsilon_{t-7} + \epsilon_t$
ARIMA MODELS IN PYTHON
Fitting a SARIMA model
# Imports
from statsmodels.tsa.statespace.sarimax import SARIMAX
# Instantiate model
model = SARIMAX(df, order=(p,d,q), seasonal_order=(P,D,Q,S))
# Fit model
results = model.fit()
ARIMA MODELS IN PYTHON
Seasonal differencing
Subtract the time series value of one season ago
$\Delta y_t = y_t - y_{t-S}$
# Take the seasonal difference
df_diff = df.diff(S)
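Seasonal differencing is easy to verify on a synthetic series. In this sketch the period S = 12, the trend slope, and the sinusoidal seasonal pattern are hypothetical choices; array slicing plays the role of `diff(S)`.

```python
import numpy as np

S = 12
t = np.arange(60)
series = 0.1 * t + np.sin(2 * np.pi * t / S)   # linear trend plus seasonality

# Subtract the value from one season earlier
seasonal_diff = series[S:] - series[:-S]
```

The seasonal pattern cancels exactly, and the linear trend turns into a constant equal to slope times S, so the differenced series no longer repeats with period S.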
ARIMA MODELS IN PYTHON
Differencing for SARIMA models
Time series
ARIMA MODELS IN PYTHON
Differencing for SARIMA models
First difference of time series
ARIMA MODELS IN PYTHON
Differencing for SARIMA models
First difference and first seasonal difference of time series
ARIMA MODELS IN PYTHON
Finding p and q
ARIMA MODELS IN PYTHON
Finding P and Q
ARIMA MODELS IN PYTHON
Plotting seasonal ACF and PACF
# Create figure
fig, (ax1, ax2) = plt.subplots(2,1)
# Plot seasonal ACF
plot_acf(df_diff, lags=[12,24,36,48,60,72], ax=ax1)
# Plot seasonal PACF
plot_pacf(df_diff, lags=[12,24,36,48,60,72], ax=ax2)
plt.show()
ARIMA MODELS IN PYTHON
Let's practice!
ARIMA MODELS IN PYTHON
Automation and
saving
ARIMA MODELS IN PYTHON
James Fulton
Climate informatics researcher
Searching over model orders
import pmdarima as pm
results = pm.auto_arima(df)
Performing stepwise search to minimize aic
ARIMA(2,0,2)(1,1,1)[12] intercept : AIC=inf, Time=3.33 sec
ARIMA(0,0,0)(0,1,0)[12] intercept : AIC=2648.467, Time=0.062 sec
ARIMA(1,0,0)(1,1,0)[12] intercept : AIC=2279.986, Time=1.171 sec
...
ARIMA(3,0,3)(1,1,1)[12] intercept : AIC=2173.508, Time=12.487 sec
ARIMA(3,0,3)(0,1,0)[12] intercept : AIC=2297.305, Time=2.087 sec
Best model: ARIMA(3,0,3)(1,1,1)[12]
Total fit time: 245.812 seconds
ARIMA MODELS IN PYTHON
pmdarima results
print(results.summary())
results.plot_diagnostics()
ARIMA MODELS IN PYTHON
Non-seasonal search parameters
results = pm.auto_arima( df, # data
d=0, # non-seasonal difference order
start_p=1, # initial guess for p
start_q=1, # initial guess for q
max_p=3, # max value of p to test
max_q=3, # max value of q to test
)
ARIMA MODELS IN PYTHON
Seasonal search parameters
results = pm.auto_arima( df, # data
... , # non-seasonal arguments
seasonal=True, # is the time series seasonal
m=7, # the seasonal period
D=1, # seasonal difference order
start_P=1, # initial guess for P
start_Q=1, # initial guess for Q
max_P=2, # max value of P to test
max_Q=2, # max value of Q to test
)
ARIMA MODELS IN PYTHON
Other parameters
results = pm.auto_arima( df, # data
... , # model order parameters
information_criterion='aic', # used to select best model
trace=True, # print results whilst training
error_action='ignore', # ignore orders that don't work
stepwise=True, # apply intelligent order search
)
ARIMA MODELS IN PYTHON
Saving model objects
# Import
import joblib
# Select a filepath
filepath ='localpath/great_model.pkl'
# Save model to filepath
joblib.dump(model_results_object, filepath)
ARIMA MODELS IN PYTHON
Saving model objects
# Select a filepath
filepath ='localpath/great_model.pkl'
# Load model object from filepath
model_results_object = joblib.load(filepath)
ARIMA MODELS IN PYTHON
Updating model
# Add new observations and update parameters
model_results_object.update(df_new)
ARIMA MODELS IN PYTHON
Update comparison
ARIMA MODELS IN PYTHON
Let's practice!
ARIMA MODELS IN PYTHON
SARIMA and Box-
Jenkins
ARIMA MODELS IN PYTHON
James Fulton
Climate informatics researcher
Box-Jenkins
ARIMA MODELS IN PYTHON
Box-Jenkins with seasonal data
Determine if time series is seasonal
Find seasonal period
Find transforms to make data stationary
Seasonal and non-seasonal differencing
Other transforms
ARIMA MODELS IN PYTHON
Mixed differencing
D should be 0 or 1
d + D should be 0-2
ARIMA MODELS IN PYTHON
Weak vs strong seasonality
Weak seasonal pattern: use seasonal differencing if necessary
Strong seasonal pattern: always use seasonal differencing
ARIMA MODELS IN PYTHON
Additive vs multiplicative seasonality
Additive series = trend + season: proceed as usual with differencing
Multiplicative series = trend x season: apply log transform first (np.log)
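The reason the log transform helps is that the logarithm of a product is a sum. The sketch below builds a hypothetical multiplicative series (the trend and seasonal factor are made up) and checks that its log is additive.

```python
import numpy as np

t = np.arange(48)
trend = 100.0 + 2.0 * t
season = 1.0 + 0.3 * np.sin(2 * np.pi * t / 12)   # multiplicative factor around 1
series = trend * season                            # seasonal swings grow with the trend

# After the log transform the two components add instead of multiply
log_series = np.log(series)
```

Once the series is in this additive form, differencing can proceed as for any additive seasonal series. Forecasts made on the log scale need `np.exp` applied to return to the original units.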
ARIMA MODELS IN PYTHON
Multiplicative to additive seasonality
ARIMA MODELS IN PYTHON
Let's practice!
ARIMA MODELS IN PYTHON
Congratulations!
ARIMA MODELS IN PYTHON
James Fulton
Climate informatics researcher
The SARIMAX model
ARIMA MODELS IN PYTHON
Time series modeling framework
Test for stationarity and seasonality
Find promising model orders
Fit models and narrow selection with
AIC/BIC
Perform model diagnostics tests
Make forecasts
Save and update models
ARIMA MODELS IN PYTHON
Further steps
Fit data created using arma_generate_sample()
Tackle real world data! Either your own or examples from statsmodels
More time series courses here
ARIMA MODELS IN PYTHON
Good luck!
ARIMA MODELS IN PYTHON
Timeseries kinds and
applications
M A C H I N E L E A R N I N G F O R T I M E S E R I E S D ATA I N P Y T H O N
Chris Holdgraf
Fellow, Berkeley Institute for Data
Science
Time Series
MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON
What makes a time series?
Datapoint Datapoint Datapoint Datapoint Datapoint Datapoint
1 34 12 54 76 40
Timepoint Timepoint Timepoint Timepoint Timepoint Timepoint
2:00 2:01 2:02 2:03 2:04 2:05
Timepoint Timepoint Timepoint Timepoint Timepoint Timepoint
Jan Feb March April May Jun
Timepoint Timepoint Timepoint Timepoint Timepoint Timepoint
1e-9 2e-9 3e-9 4e-9 5e-9 6e-9
MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON
Reading in a time series with Pandas
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv('[Link]')
data.head()
date symbol close volume
0 2010-01-04 AAPL 214.009998 123432400.0
46 2010-01-05 AAPL 214.379993 150476200.0
92 2010-01-06 AAPL 210.969995 138040000.0
138 2010-01-07 AAPL 210.580000 119282800.0
184 2010-01-08 AAPL 211.980005 111902700.0
MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON
Plotting a pandas timeseries
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(12, 6))
data.plot('date', 'close', ax=ax)
ax.set(title="AAPL daily closing price")
MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON
A timeseries plot
MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON
Why machine learning?
We can use really big data and really complicated data
MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON
Why machine learning?
We can...
Predict the future
Automate this process
MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON
Why combine these two?
MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON
A machine learning pipeline
Feature extraction
Model fitting
Prediction and validation
MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON
Let's practice!
M A C H I N E L E A R N I N G F O R T I M E S E R I E S D ATA I N P Y T H O N
Machine learning
basics
M A C H I N E L E A R N I N G F O R T I M E S E R I E S D ATA I N P Y T H O N
Chris Holdgraf
Fellow, Berkeley Institute for Data
Science
Always begin by looking at your data
[Link]
(10, 5)
array[:3]
array([[ 0.735528 , 1.00122818, -0.28315978],
[-0.94478393, 0.18658748, -0.00241224],
[-0.74822942, -1.46636618, 0.69835096]])
MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON
Always begin by looking at your data
[Link]()
col1 col2 col3
0 0.735528 1.001228 -0.283160
1 -0.944784 0.186587 -0.002412
2 -0.748229 -1.466366 0.698351
3 1.038589 -0.171248 0.831457
4 -0.161904 0.003972 -0.321933
MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON
Always visualize your data
Make sure it looks the way you'd expect.
# Using matplotlib
fig, ax = plt.subplots()
ax.plot(...)
# Using pandas
fig, ax = plt.subplots()
df.plot(..., ax=ax)
MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON
Scikit-learn
Scikit-learn is the most popular machine learning library in Python
from sklearn.svm import LinearSVC
MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON
Preparing data for scikit-learn
scikit-learn expects a particular structure of data:
(samples, features)
Make sure that your data is at least two-dimensional
Make sure the first dimension is samples
MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON
If your data is not shaped properly
If the axes are swapped:
array.T.shape
(10, 3)
MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON
If your data is not shaped properly
If we're missing an axis, use .reshape() :
array.shape
(10,)
array.reshape(-1, 1).shape
(10, 1)
-1 will automatically fill that axis with remaining values
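Both fixes can be tried on small NumPy arrays. The arrays below are placeholders created only to show the shapes involved.

```python
import numpy as np

# Hypothetical data stored as (features, samples): axes are swapped
array = np.arange(30).reshape(3, 10)
X = array.T                      # transpose to (samples, features)

# A 1-D target with a missing feature axis
y = np.arange(10)
y_2d = y.reshape(-1, 1)          # -1 lets NumPy infer the number of rows

print(X.shape, y_2d.shape)
```

Transposing and reshaping do not change the values, only how they are arranged, so they are safe to apply right before handing data to scikit-learn.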
MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON
Fitting a model with scikit-learn
# Import a support vector classifier
from sklearn.svm import LinearSVC
# Instantiate this model
model = LinearSVC()
# Fit the model on some data
model.fit(X, y)
It is common for y to be of shape (samples, 1)
MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON
Investigating the model
# There is one coefficient per input feature
model.coef_
array([[ 0.69417875, -0.5289162 ]])
MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON
Predicting with a fit model
# Generate predictions
predictions = model.predict(X_test)
MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON
Let's practice
M A C H I N E L E A R N I N G F O R T I M E S E R I E S D ATA I N P Y T H O N
Combining
timeseries data with
machine learning
M A C H I N E L E A R N I N G F O R T I M E S E R I E S D ATA I N P Y T H O N
Chris Holdgraf
Fellow, Berkeley Institute for Data
Science
Getting to know our data
The datasets that we'll use in this course are all freely available online
There are many datasets available to download on the web; the ones we'll use come from
Kaggle
MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON
The Heartbeat Acoustic Data
Many recordings of heart sounds from different patients
Some had normally-functioning hearts, others had abnormalities
Data comes in the form of audio files + labels for each file
Can we find the "abnormal" heart beats?
MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON
Loading auditory data
from glob import glob
files = glob('data/heartbeat-sounds/files/*.wav')
print(files)
['data/heartbeat-sounds/proc/files/murmur__201101051104.wav',
...
'data/heartbeat-sounds/proc/files/murmur__201101051114.wav']
MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON
Reading in auditory data
import librosa as lr
# `load` accepts a path to an audio file
audio, sfreq = lr.load('data/heartbeat-sounds/proc/files/murmur__201101051104.wav')
print(sfreq)
2205
In this case, the sampling frequency is 2205 , meaning there are 2205 samples per second
MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON
Inferring time from samples
If we know the sampling rate of a timeseries, then we know the timestamp of each
datapoint relative to the first datapoint
Note: this assumes the sampling rate is fixed and no data points are lost
MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON
Creating a time array (I)
Create an array of indices, one for each sample, and divide by the sampling frequency
indices = np.arange(0, len(audio))
time = indices / sfreq
MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON
Creating a time array (II)
Find the time stamp for the N-1th data point. Then use linspace() to interpolate from zero
to that time
final_time = (len(audio) - 1) / sfreq
time = np.linspace(0, final_time, len(audio))
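The two ways of building a time array should give the same result. The sketch below checks this on a stand-in array of zeros; the sampling frequency of 2205 is the value from the slides and the three-second length is an assumption. The third argument of `np.linspace` is the number of points to generate, so the sketch passes `len(audio)` there.

```python
import numpy as np

sfreq = 2205
audio = np.zeros(3 * sfreq)      # stand-in for 3 seconds of audio samples

# Method I: sample indices divided by the sampling frequency
time_a = np.arange(0, len(audio)) / sfreq

# Method II: interpolate from zero to the time of the last sample
final_time = (len(audio) - 1) / sfreq
time_b = np.linspace(0, final_time, len(audio))
```

Either array can then be used as the x-axis when plotting the audio, with consecutive entries spaced 1 / sfreq seconds apart.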
MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON
The New York Stock Exchange dataset
This dataset consists of company stock values for 10 years
Can we detect any patterns in historical records that allow us to predict the value of
companies in the future?
MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON
Looking at the data
data = pd.read_csv('path/to/[Link]')
data.columns
Index(['date', 'symbol', 'close', 'volume'], dtype='object')
data.head()
date symbol close volume
0 2010-01-04 AAPL 214.009998 123432400.0
1 2010-01-04 ABT 54.459951 10829000.0
2 2010-01-04 AIG 29.889999 7750900.0
3 2010-01-04 AMAT 14.300000 18615100.0
4 2010-01-04 ARNC 16.650013 11512100.0
MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON
Timeseries with Pandas DataFrames
We can investigate the object type of each column by accessing the dtypes attribute
df['date'].dtypes
0 object
1 object
2 object
dtype: object
MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON
Converting a column to a time series
To ensure that a column within a DataFrame is treated as time series, use the
to_datetime() function
df['date'] = pd.to_datetime(df['date'])
df['date']
0 2017-01-01
1 2017-01-02
2 2017-01-03
Name: date, dtype: datetime64[ns]
MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON
Let's practice!
M A C H I N E L E A R N I N G F O R T I M E S E R I E S D ATA I N P Y T H O N
Classification and
feature engineering
M A C H I N E L E A R N I N G F O R T I M E S E R I E S D ATA I N P Y T H O N
Chris Holdgraf
Fellow, Berkeley Institute for Data
Science
Always visualize raw data before fitting models
MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON
Visualize your timeseries data!
ixs = np.arange(audio.shape[-1])
time = ixs / sfreq
fig, ax = plt.subplots()
ax.plot(time, audio)
MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON
What features to use?
Using raw timeseries data is too noisy for classification
We need to calculate features!
An easy start: summarize your audio data
MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON
Calculating multiple features
print(audio.shape)
# (n_files, time)
(20, 7000)
means = np.mean(audio, axis=-1)
maxs = np.max(audio, axis=-1)
stds = np.std(audio, axis=-1)
print(means.shape)
# (n_files,)
(20,)
MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON
Fitting a classifier with scikit-learn
We've just collapsed a 2-D dataset (samples x time) into several features of a 1-D dataset
(samples)
We can combine each feature, and use it as an input to a model
If we have a label for each sample, we can use scikit-learn to create and fit a classifier
MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON
Preparing your features for scikit-learn
# Import a linear classifier
from sklearn.svm import LinearSVC
# Stack the features as columns; note that the labels are reshaped to work with scikit-learn
X = np.column_stack([means, maxs, stds])
y = labels.reshape(-1, 1)
model = LinearSVC()
model.fit(X, y)
MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON
Scoring your scikit-learn model
from sklearn.metrics import accuracy_score
# Different input data
predictions = model.predict(X_test)
# Score our model with % correct
# Manually
percent_score = sum(predictions == labels_test) / len(labels_test)
# Using a sklearn scorer
percent_score = accuracy_score(labels_test, predictions)
MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON
Let's practice!
M A C H I N E L E A R N I N G F O R T I M E S E R I E S D ATA I N P Y T H O N
Improving the
features we use for
classification
M A C H I N E L E A R N I N G F O R T I M E S E R I E S D ATA I N P Y T H O N
Chris Holdgraf
Fellow, Berkeley Institute for Data
Science
The auditory envelope
Smooth the data to calculate the auditory envelope
Related to the total amount of audio energy present at each moment of time
MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON
Smoothing over time
Instead of averaging over all time, we can do a local average
This is called smoothing your timeseries
It removes short-term noise, while retaining the general pattern
MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON
Smoothing your data
MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON
Calculating a rolling window statistic
# Audio is a Pandas DataFrame
print(audio.shape)
# (n_times, n_audio_files)
(5000, 20)
# Smooth our data by taking the rolling mean in a window of 50 samples
window_size = 50
windowed = audio.rolling(window=window_size)
audio_smooth = windowed.mean()
MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON
Calculating the auditory envelope
First rectify your audio, then smooth it
audio_rectified = audio.apply(np.abs)
audio_envelope = audio_rectified.rolling(50).mean()
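The need to rectify before smoothing shows up clearly on a toy signal. This sketch uses NumPy's `np.convolve` in place of the pandas rolling mean, and the alternating signal with amplitude 0.8 is a hypothetical stand-in for real audio.

```python
import numpy as np

window = 50
t = np.arange(2000)
# Hypothetical signal that alternates between +0.8 and -0.8
audio = 0.8 * np.where(t % 2 == 0, 1.0, -1.0)

# Rectify: take the absolute value so positive and negative swings do not cancel
rectified = np.abs(audio)

# Smooth: moving average over `window` samples (full windows only)
envelope = np.convolve(rectified, np.ones(window) / window, mode="valid")
```

Averaging the raw signal would give roughly zero, because it oscillates around zero. Averaging the rectified signal recovers the amplitude, which is what the envelope is meant to track.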
MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON
Feature engineering the envelope
# Calculate several features of the envelope, one per sound
envelope_mean = np.mean(audio_envelope, axis=0)
envelope_std = np.std(audio_envelope, axis=0)
envelope_max = np.max(audio_envelope, axis=0)
# Create our training data for a classifier
X = np.column_stack([envelope_mean, envelope_std, envelope_max])
MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON
Preparing our features for scikit-learn
X = np.column_stack([envelope_mean, envelope_std, envelope_max])
y = labels.reshape(-1, 1)
MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON
Cross validation for classification
cross_val_score automates the process of:
Splitting data into training / validation sets
Fitting the model on training data
Scoring it on validation data
Repeating this process
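The splitting step in the list above can be sketched by hand to show what is being automated. This is an illustration of plain k-fold index splitting, not scikit-learn's own implementation; its default splitter for classifiers may differ (for example by stratifying on the labels). The sample count and fold count are hypothetical.

```python
import numpy as np

n_samples, n_folds = 20, 3
indices = np.arange(n_samples)

# Cut the sample indices into n_folds roughly equal validation blocks
folds = np.array_split(indices, n_folds)

splits = []
for i, val_idx in enumerate(folds):
    # Everything outside the current validation block is training data
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    splits.append((train_idx, val_idx))
```

For each `(train_idx, val_idx)` pair a model would be fit on the training rows and scored on the validation rows, giving one score per fold.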
MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON
Using cross_val_score
from sklearn.model_selection import cross_val_score
model = LinearSVC()
scores = cross_val_score(model, X, y, cv=3)
print(scores)
[0.60911642 0.59975305 0.61404035]
MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON
Auditory features: The Tempogram
We can summarize more complex temporal information with timeseries-specific functions
librosa is a great library for auditory and timeseries feature engineering
Here we'll calculate the tempogram, which estimates the tempo of a sound over time
We can calculate summary statistics of tempo in the same way that we can for the
envelope
MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON
Computing the tempogram
# Import librosa and calculate the tempo of a 1-D sound array
import librosa as lr
audio_tempo = [Link](audio, sr=sfreq,
hop_length=2**6, aggregate=None)
MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON
Let's practice!
M A C H I N E L E A R N I N G F O R T I M E S E R I E S D ATA I N P Y T H O N
The spectrogram -
spectral changes to
sound over time
M A C H I N E L E A R N I N G F O R T I M E S E R I E S D ATA I N P Y T H O N
Chris Holdgraf
Fellow, Berkeley Institute for Data
Science
Fourier transforms
Timeseries data can be described as a combination of quickly-changing things and slowly-
changing things
At each moment in time, we can describe the relative presence of fast- and slow-moving
components
The simplest way to do this is called a Fourier Transform
This converts a single timeseries into an array that describes the timeseries as a
combination of oscillations
MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON
A Fourier Transform (FFT)
MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON
Spectrograms: combinations of windowed Fourier
transforms
A spectrogram is a collection of windowed Fourier transforms over time
Similar to how a rolling mean was calculated:
1. Choose a window size and shape
2. At a timepoint, calculate the FFT for that window
3. Slide the window over by one
4. Aggregate the results
Called a Short-Time Fourier Transform (STFT)
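The four steps can be sketched with plain NumPy. This is an illustrative sketch, not librosa's implementation: `simple_stft`, the Hann window, and the test signal are assumptions, and the window slides by `hop` samples rather than by one to keep the output small.

```python
import numpy as np

def simple_stft(x, window_size, hop):
    """Magnitudes of windowed FFTs, shape (n_freqs, n_windows)."""
    window = np.hanning(window_size)                   # 1. window size and shape
    starts = range(0, len(x) - window_size + 1, hop)   # 3. slide the window
    frames = [np.fft.rfft(window * x[s:s + window_size])  # 2. FFT per window
              for s in starts]
    return np.abs(np.array(frames)).T                  # 4. aggregate the results

# Hypothetical signal: 50 Hz during the first second, 200 Hz during the second
sfreq = 1000
t = np.arange(2 * sfreq) / sfreq
x = np.where(t < 1, np.sin(2 * np.pi * 50 * t), np.sin(2 * np.pi * 200 * t))

spec = simple_stft(x, window_size=100, hop=50)
freqs = np.fft.rfftfreq(100, d=1 / sfreq)
first_peak = freqs[spec[:, 0].argmax()]
last_peak = freqs[spec[:, -1].argmax()]
print(spec.shape, first_peak, last_peak)
```

Each column of `spec` is the spectrum of one window, so the strongest frequency moves from the first column to the last as the sound changes.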
Calculating the STFT
We can calculate the STFT with librosa
There are several parameters we can tweak (such as window size)
For our purposes, we'll convert into decibels which normalizes the average values of all
frequencies
We can then visualize it with the specshow() function
Calculating the STFT with code
# Import the functions we'll use for the STFT
from librosa.core import stft, amplitude_to_db
from librosa.display import specshow
import matplotlib.pyplot as plt
import numpy as np
# Calculate our STFT
HOP_LENGTH = 2**4
SIZE_WINDOW = 2**7
audio_spec = stft(audio, hop_length=HOP_LENGTH, n_fft=SIZE_WINDOW)
# Convert the magnitude (phase is discarded) into decibels for visualization
spec_db = amplitude_to_db(np.abs(audio_spec))
# Visualize
fig, ax = plt.subplots()
specshow(spec_db, sr=sfreq, x_axis='time',
         y_axis='hz', hop_length=HOP_LENGTH, ax=ax)
Spectral feature engineering
Each timeseries has a different spectral pattern.
We can calculate these spectral patterns by analyzing the spectrogram.
For example, spectral bandwidth and spectral centroids describe where most of the energy
is at each moment in time
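As a rough illustration, one common definition treats each spectrogram column as a set of weights over frequency: the centroid is the weighted mean frequency and the bandwidth is the weighted standard deviation around it. The tiny `spec` array and `freqs` values below are made up, and this sketch is not librosa's implementation.

```python
import numpy as np

# Hypothetical frequency bins (Hz) and magnitude spectrogram
# (rows = frequencies, columns = time frames)
freqs = np.array([100.0, 200.0, 300.0, 400.0])
spec = np.array([[1.0, 0.0],
                 [1.0, 0.0],
                 [0.0, 1.0],
                 [0.0, 1.0]])

# Normalize each frame so its magnitudes sum to 1
weights = spec / spec.sum(axis=0)
# Centroid: energy-weighted mean frequency per frame
centroid = (freqs[:, None] * weights).sum(axis=0)
# Bandwidth: energy-weighted spread around the centroid per frame
bandwidth = np.sqrt((weights * (freqs[:, None] - centroid) ** 2).sum(axis=0))
print(centroid)   # [150. 350.]
print(bandwidth)  # [50. 50.]
```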
Calculating spectral features
# Calculate the spectral centroid and bandwidth for the spectrogram
bandwidths = lr.feature.spectral_bandwidth(S=spec)[0]
centroids = lr.feature.spectral_centroid(S=spec)[0]
# Display these features on top of the spectrogram
fig, ax = plt.subplots()
specshow(spec, x_axis='time', y_axis='hz', hop_length=HOP_LENGTH, ax=ax)
ax.plot(times_spec, centroids)
ax.fill_between(times_spec, centroids - bandwidths / 2,
                centroids + bandwidths / 2, alpha=0.5)
Combining spectral and temporal features in a
classifier
centroids_all = []
bandwidths_all = []
for spec in spectrograms:
    bandwidths = lr.feature.spectral_bandwidth(S=lr.db_to_amplitude(spec))
    centroids = lr.feature.spectral_centroid(S=lr.db_to_amplitude(spec))
    # Calculate the mean spectral bandwidth
    bandwidths_all.append(np.mean(bandwidths))
    # Calculate the mean spectral centroid
    centroids_all.append(np.mean(centroids))
# Create our X matrix
X = np.column_stack([means, stds, maxs, tempo_mean,
                     tempo_max, tempo_std, bandwidths_all, centroids_all])
Let's practice!
Predicting data over time
Classification vs. Regression
CLASSIFICATION                          REGRESSION
classification_model.predict(X_test)    regression_model.predict(X_test)
array([0, 1, 1, 0])                     array([0.2, 1.4, 3.6, 0.6])
Correlation and regression
Regression is similar to calculating correlation, with some key differences
Regression: A process that results in a formal model of the data
Correlation: A statistic that describes the data. It carries less information than a regression model.
Correlation between variables often changes over time
Timeseries often have patterns that change over time
Two timeseries that seem correlated at one moment may not remain so over time
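A rolling correlation is one way to see this. The sketch below uses synthetic data (an assumption for illustration): `y` follows `x` in the first half of the series and mirrors it in the second half, so the overall correlation hides two opposite regimes.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
x = pd.Series(rng.normal(size=n))
noise = 0.1 * rng.normal(size=n)
# y tracks x for the first 100 points, then tracks -x
y = pd.Series(np.where(np.arange(n) < n // 2, x, -x) + noise)

# One correlation for the whole series vs. a correlation per 50-point window
overall = x.corr(y)
rolling_corr = x.rolling(50).corr(y)

print(round(overall, 2))
print(round(rolling_corr.iloc[99], 2), round(rolling_corr.iloc[-1], 2))
```

The window ending at index 99 sees only the first regime and the last window sees only the second, so the two rolling values have opposite signs.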
Visualizing relationships between timeseries
fig, axs = plt.subplots(1, 2)
# Make a line plot for each timeseries
axs[0].plot(x, c='k', lw=3, alpha=.2)
axs[0].plot(y)
axs[0].set(xlabel='time', title='X values = time')
# Encode time as color in a scatterplot
axs[1].scatter(x_long, y_long, c=np.arange(len(x_long)), cmap='viridis')
axs[1].set(xlabel='x', ylabel='y', title='Color = time')
Visualizing two timeseries
Regression models with scikit-learn
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X, y)
model.predict(X)
Visualize predictions with scikit-learn
from sklearn.linear_model import Ridge
alphas = [.1, 1e2, 1e3]
ax.plot(y_test, color='k', alpha=.3, lw=3)
for ii, alpha in enumerate(alphas):
    y_predicted = Ridge(alpha=alpha).fit(X_train, y_train).predict(X_test)
    ax.plot(y_predicted, c=cmap(ii / len(alphas)))
ax.legend(['True values', 'Model 1', 'Model 2', 'Model 3'])
ax.set(xlabel="Time")
Scoring regression models
Two most common methods:
Correlation ($r$)
Coefficient of Determination ($R^2$)
Coefficient of Determination ($R^2$)
The value of $R^2$ is bounded on the top by 1, and can be infinitely low
Values closer to 1 mean the model does a better job of predicting outputs
$$R^2 = 1 - \frac{\text{error(model)}}{\text{variance(test data)}}$$
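As a quick check of the definition, the sketch below computes the ratio by hand on made-up numbers and compares it with scikit-learn's `r2_score`. It assumes "error" means the sum of squared residuals and "variance" means the sum of squared deviations of the test data from its mean.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical true values and two sets of predictions
y_test = np.array([1.0, 2.0, 3.0, 4.0])
y_predicted = np.array([1.5, 2.0, 2.5, 4.0])
y_bad = np.array([4.0, 3.0, 2.0, 1.0])

def r2_manual(y_true, y_pred):
    error = np.sum((y_true - y_pred) ** 2)            # model error
    variance = np.sum((y_true - y_true.mean()) ** 2)  # spread of the test data
    return 1 - error / variance

print(r2_manual(y_test, y_predicted))  # 0.9
print(r2_manual(y_test, y_bad))        # -3.0
```

A model that predicts worse than the mean of the test data gets a negative score, which is why the value has no lower bound.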
$R^2$ in scikit-learn
from sklearn.metrics import r2_score
# r2_score expects the true values first, then the predictions
print(r2_score(y_test, y_predicted))
0.08
Let's practice!
Cleaning and improving your data
Data is messy
Real-world data is often messy
The two most common problems are missing data and outliers
This often happens because of human error, machine sensor malfunction, database failures,
etc.
Visualizing your raw data makes it easier to spot these problems
What messy data looks like
Interpolation: using time to fill in missing data
A common way to deal with missing data is to interpolate missing values
With timeseries data, you can use time to assist in interpolation.
In this case, interpolation means using the known values on either side of a gap in the
data to make assumptions about what's missing.
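A small sketch of what "using time" can mean in pandas: `'linear'` treats rows as equally spaced, while `'time'` weights by the gaps in a datetime index. The dates and prices below are made up, with a deliberate jump from January 6 to January 11.

```python
import numpy as np
import pandas as pd

# Hypothetical prices with two missing values and unevenly spaced dates
index = pd.to_datetime(['2010-01-04', '2010-01-05', '2010-01-06', '2010-01-11'])
prices = pd.Series([10.0, np.nan, np.nan, 17.0], index=index)

# Ignores the index: fills as if the rows were equally spaced
by_position = prices.interpolate('linear')
# Uses the datetime index: 7 days span the gap, so 1.0 per day
by_time = prices.interpolate('time')

print(by_position.round(2).tolist())  # [10.0, 12.33, 14.67, 17.0]
print(by_time.tolist())               # [10.0, 11.0, 12.0, 17.0]
```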
Interpolation in Pandas
# Return a boolean that notes where missing values are
missing = prices.isna()
# Interpolate linearly within missing windows
prices_interp = prices.interpolate('linear')
# Plot the interpolated data in red and the data w/ missing values in black
ax = prices_interp.plot(c='r')
prices.plot(c='k', ax=ax, lw=2)
Visualizing the interpolated data
Using a rolling window to transform data
Another common use of rolling windows is to transform the data
We've already done this once, in order to smooth the data
However, we can also use this to do more complex transformations
Transforming data to standardize variance
A common transformation to apply to data is to standardize its mean and variance over
time. There are many ways to do this.
Here, we'll show how to convert your dataset so that each point represents the % change
over a previous window.
This makes timepoints more comparable to one another if the absolute values of data
change a lot
Transforming to percent change with Pandas
def percent_change(values):
    """Calculates the % change between the last value
    and the mean of previous values"""
    # Separate the last value and all previous values into variables
    previous_values = values[:-1]
    last_value = values[-1]
    # Calculate the % difference between the last value
    # and the mean of earlier values
    percent_change = (last_value - np.mean(previous_values)) \
        / np.mean(previous_values)
    return percent_change
Applying this to our data
# Plot the raw data
fig, axs = plt.subplots(1, 2, figsize=(10, 5))
ax = prices.plot(ax=axs[0])
# Calculate % change and plot
ax = prices.rolling(window=20).aggregate(percent_change).plot(ax=axs[1])
ax.legend_.set_visible(False)
Finding outliers in your data
Outliers are datapoints that are statistically significantly different from the rest of the dataset.
They can have negative effects on the predictive power of your model, biasing it away from
its "true" value
One solution is to remove or replace outliers with a more representative value
Be very careful about doing this: often it is difficult to determine what is a legitimately
extreme value vs. an aberration
Plotting a threshold on our data
fig, axs = plt.subplots(1, 2, figsize=(10, 5))
for data, ax in zip([prices, prices_perc_change], axs):
    # Calculate the mean / standard deviation for the data
    this_mean = data.mean()
    this_std = data.std()
    # Plot the data, with a window that is 3 standard deviations
    # around the mean
    data.plot(ax=ax)
    ax.axhline(this_mean + this_std * 3, ls='--', c='r')
    ax.axhline(this_mean - this_std * 3, ls='--', c='r')
Visualizing outlier thresholds
Replacing outliers using the threshold
# Center the data so the mean is 0
prices_outlier_centered = prices_outlier_perc - prices_outlier_perc.mean()
# Calculate standard deviation
std = prices_outlier_perc.std()
# Use the absolute value of each datapoint
# to make it easier to find outliers
outliers = np.abs(prices_outlier_centered) > (std * 3)
# Replace outliers with the median value
# We'll use np.nanmedian since there may be nans around the outliers
prices_outlier_fixed = prices_outlier_centered.copy()
prices_outlier_fixed[outliers] = np.nanmedian(prices_outlier_fixed)
Visualize the results
fig, axs = plt.subplots(1, 2, figsize=(10, 5))
prices_outlier_centered.plot(ax=axs[0])
prices_outlier_fixed.plot(ax=axs[1])
Let's practice!
Creating features over time
Extracting features with windows
Using .aggregate for feature extraction
# Visualize the raw data
print(prices.head(3))
symbol            AIG        ABT
date
2010-01-04  29.889999  54.459951
2010-01-05  29.330000  54.019953
2010-01-06  29.139999  54.319953
# Calculate a rolling window, then extract two features
feats = prices.rolling(20).aggregate([np.std, np.max]).dropna()
print(feats.head(3))
                 AIG                  ABT
                 std       amax       std       amax
date
2010-02-01  2.051966  29.889999  0.868830  56.239949
2010-02-02  2.101032  29.629999  0.869197  56.239949
2010-02-03  2.157249  29.629999  0.852509  56.239949
Check the properties of your features!
Using partial() in Python
# If we just take the mean, it returns a single value
a = np.array([[0, 1, 2], [0, 1, 2], [0, 1, 2]])
print(np.mean(a))
1.0
# We can use the partial function to initialize np.mean
# with an axis parameter
from functools import partial
mean_over_first_axis = partial(np.mean, axis=0)
print(mean_over_first_axis(a))
[0. 1. 2.]
Percentiles summarize your data
Percentiles are a useful way to get more fine-grained summaries of your data (as opposed
to using np.mean)
For a given dataset, the Nth percentile is the value where N% of the data is below that
datapoint, and (100-N)% of the data is above that datapoint.
print(np.percentile(np.linspace(0, 200), q=20))
40.0
Combining np.percentile() with partial functions to
calculate a range of percentiles
data = np.linspace(0, 100)
# Create a list of functions using a list comprehension
percentile_funcs = [partial(np.percentile, q=ii) for ii in [20, 40, 60]]
# Calculate the output of each function in the same way
percentiles = [i_func(data) for i_func in percentile_funcs]
print(percentiles)
[20.0, 40.00000000000001, 60.0]
# Calculate multiple percentiles of a rolling window
# (pass the list of functions, not the computed values)
prices.rolling(20).aggregate(percentile_funcs)
Calculating "date-based" features
Thus far we've focused on calculating "statistical" features: these are features that
correspond to statistical properties of the data, like "mean", "standard deviation", etc.
However, don't forget that timeseries data often has more "human" features associated with
it, like days of the week, holidays, etc.
These features are often useful when dealing with timeseries data that spans multiple years
(such as stock value over time)
datetime features using Pandas
# Ensure our index is datetime
prices.index = pd.to_datetime(prices.index)
# Extract datetime features
day_of_week_num = prices.index.weekday
print(day_of_week_num[:10])
Index([0 1 2 3 4 0 1 2 3 4], dtype='object')
# weekday_name was removed in pandas 1.0; day_name() replaces it
day_of_week = prices.index.day_name()
print(day_of_week[:10])
Index(['Monday' 'Tuesday' 'Wednesday' 'Thursday' 'Friday' 'Monday' 'Tuesday'
'Wednesday' 'Thursday' 'Friday'], dtype='object')
Let's practice!
Time-delayed features and auto-regressive models
The past is useful
Timeseries data almost always have information that is shared between timepoints
Information in the past can help predict what happens in the future
Often the features best-suited to predict a timeseries are previous values of the same
timeseries.
A note on smoothness and auto-correlation
A common question to ask of a timeseries: how smooth is the data?
In other words, how correlated is a timepoint with its neighboring timepoints (called autocorrelation)?
The amount of auto-correlation in data will impact your models.
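One way to put a number on smoothness is the lag-1 autocorrelation. The sketch below uses synthetic data (an assumption for illustration): white noise as a "rough" signal and its 20-point rolling mean as a "smooth" one.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Rough signal: independent random values
rough = pd.Series(rng.normal(size=500))
# Smooth signal: each point averages the previous 20 rough points
smooth = rough.rolling(20).mean().dropna()

# Correlation of each series with itself shifted by one step
rough_ac = rough.autocorr(lag=1)
smooth_ac = smooth.autocorr(lag=1)
print(round(rough_ac, 2), round(smooth_ac, 2))
```

The rough signal has autocorrelation near zero, while neighboring points of the smooth signal share most of their window and so are strongly correlated.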
Creating time-lagged features
Let's see how we could build a model that uses values in the past as input features.
We can use this to assess how auto-correlated our signal is (and lots of other stuff too)
Time-shifting data with Pandas
print(df)
df
0 0.0
1 1.0
2 2.0
3 3.0
4 4.0
# Shift a DataFrame/Series by 3 index values, so each row holds the value from 3 steps in the past
print(df.shift(3))
df
0 NaN
1 NaN
2 NaN
3 0.0
4 1.0
Creating a time-shifted DataFrame
# data is a pandas Series containing time series data
data = pd.Series(...)
# Shifts
shifts = [0, 1, 2, 3, 4, 5, 6, 7]
# Create a dictionary of time-shifted data
many_shifts = {'lag_{}'.format(ii): data.shift(ii) for ii in shifts}
# Convert them into a dataframe
many_shifts = pd.DataFrame(many_shifts)
Fitting a model with time-shifted features
# Fit the model using these input features
model = Ridge()
model.fit(many_shifts, data)
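A self-contained sketch of the same idea on synthetic data. The series below is a made-up process where each value is 0.8 times the previous one plus noise. Two choices here are assumptions rather than the slide's code: lags start at 1 (lag 0 is the target itself), and rows with NaNs created by shifting are dropped before fitting.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge

# Hypothetical series: value[i] = 0.8 * value[i - 1] + noise
rng = np.random.default_rng(0)
values = np.zeros(500)
for i in range(1, 500):
    values[i] = 0.8 * values[i - 1] + rng.normal()
data = pd.Series(values)

# Build lagged copies of the series as input features
shifts = [1, 2, 3]
many_shifts = pd.DataFrame({'lag_{}'.format(ii): data.shift(ii) for ii in shifts})

# Shifting leaves NaNs in the first rows; keep only complete rows
valid = many_shifts.dropna().index
model = Ridge().fit(many_shifts.loc[valid], data.loc[valid])
coefs = dict(zip(many_shifts.columns, model.coef_.round(2)))
print(coefs)
```

The fitted weight on `lag_1` should land near 0.8 and the others near 0, matching how the series was generated.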
Interpreting the auto-regressive model coefficients
# Visualize the fit model coefficients
fig, ax = plt.subplots()
ax.bar(many_shifts.columns, model.coef_)
ax.set(xlabel='Coefficient name', ylabel='Coefficient value')
# Set formatting so it looks nice
plt.setp(ax.get_xticklabels(), rotation=45, horizontalalignment='right')
Visualizing coefficients for a rough signal
Visualizing coefficients for a smooth signal
Let's practice!
Cross-validating timeseries data
Cross validation with scikit-learn
# Iterating over the "split" method yields train/test indices
for tr, tt in cv.split(X, y):
    model.fit(X[tr], y[tr])
    model.score(X[tt], y[tt])
Cross validation types: KFold
KFold cross-validation splits your data into multiple "folds" of equal size
It is one of the most common cross-validation routines
from sklearn.model_selection import KFold
cv = KFold(n_splits=5)
for tr, tt in cv.split(X, y):
    ...
Visualizing model predictions
fig, axs = plt.subplots(2, 1)
# Plot the indices chosen for validation on each loop
axs[0].scatter(tt, [0] * len(tt), marker='_', s=2, lw=40)
axs[0].set(ylim=[-.1, .1], title='Test set indices (color=CV loop)',
           xlabel='Index of raw data')
# Plot the model predictions on each iteration
axs[1].plot(model.predict(X[tt]))
axs[1].set(title='Test set predictions on each CV loop',
           xlabel='Prediction index')
Visualizing KFold CV behavior
A note on shuffling your data
Many CV iterators let you shuffle data as a part of the cross-validation process.
This only works if the data is i.i.d., which timeseries usually is not.
You should not shuffle your data when making predictions with timeseries.
from sklearn.model_selection import ShuffleSplit
cv = ShuffleSplit(n_splits=3)
for tr, tt in cv.split(X, y):
    ...
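To see what shuffling does to the test indices, the sketch below compares the two iterators on a toy array of 20 ordered samples. The helper `is_contiguous`, the array, and the `test_size` and `random_state` values are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import KFold, ShuffleSplit

# Hypothetical data: 20 samples in time order
X = np.arange(20).reshape(-1, 1)

def is_contiguous(indices):
    """True if the indices form one unbroken block of time."""
    return bool(np.all(np.diff(np.sort(indices)) == 1))

# Collect the test indices produced by each iterator
kfold_tests = [tt for tr, tt in KFold(n_splits=4).split(X)]
shuffle_cv = ShuffleSplit(n_splits=4, test_size=5, random_state=0)
shuffle_tests = [tt for tr, tt in shuffle_cv.split(X)]

print([is_contiguous(tt) for tt in kfold_tests])
print([sorted(tt) for tt in shuffle_tests])
```

Unshuffled KFold tests on unbroken blocks of time, whereas ShuffleSplit scatters test points among the training points, so each test point has close neighbors in the training set.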
Visualizing shuffled CV behavior
Using the time series CV iterator
Thus far, we've broken the linear passage of time in the cross validation
However, you generally should not use datapoints in the future to predict data in the past
One approach: Always use training data from the past to predict the future
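scikit-learn's `TimeSeriesSplit` follows this approach. A minimal sketch on a toy array of 12 ordered samples (the array and `n_splits=3` are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical data: 12 samples in time order
X = np.arange(12).reshape(-1, 1)
cv = TimeSeriesSplit(n_splits=3)

splits = list(cv.split(X))
for tr, tt in splits:
    # Every training index comes before every test index
    print('train:', tr.tolist(), 'test:', tt.tolist())
```

The training set grows on each iteration, and the test set is always the block of samples that immediately follows it.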
Visualizing time series cross validation iterators
# Import and initialize the cross-validation iterator
from sklearn.model_selection import TimeSeriesSplit
cv = TimeSeriesSplit(n_splits=10)
fig, ax = plt.subplots(figsize=(10, 5))
for ii, (tr, tt) in enumerate(cv.split(X, y)):
    # Plot training and test indices
    l1 = ax.scatter(tr, [ii] * len(tr), c=[plt.cm.coolwarm(.1)],
                    marker='_', lw=6)
    l2 = ax.scatter(tt, [ii] * len(tt), c=[plt.cm.coolwarm(.9)],
                    marker='_', lw=6)
ax.set(ylim=[10, -1], title='TimeSeriesSplit behavior',
       xlabel='data index', ylabel='CV iteration')
ax.legend([l1, l2], ['Training', 'Validation'])
Visualizing the TimeSeriesSplit cross validation iterator
Custom scoring functions in scikit-learn
def myfunction(estimator, X, y):
    y_pred = estimator.predict(X)
    my_custom_score = my_custom_function(y_pred, y)
    return my_custom_score
A custom correlation function for scikit-learn
def my_pearsonr(est, X, y):
    # Generate predictions and convert to a vector
    y_pred = est.predict(X).squeeze()
    # Use the numpy "corrcoef" function to calculate a correlation matrix
    my_corrcoef_matrix = np.corrcoef(y_pred, y.squeeze())
    # Return a single correlation value from the matrix
    my_corrcoef = my_corrcoef_matrix[1, 0]
    return my_corrcoef
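A function with the `(estimator, X, y)` signature can be passed to `cross_val_score` through its `scoring` argument. The sketch below shows this on synthetic data, where the data, the `LinearRegression` model, and `cv=3` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def my_pearsonr(est, X, y):
    """Correlation between the estimator's predictions and y."""
    y_pred = est.predict(X).squeeze()
    return np.corrcoef(y_pred, y.squeeze())[1, 0]

# Hypothetical data: y depends linearly on X, plus a little noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 3 * X[:, 0] + 0.1 * rng.normal(size=100)

# scikit-learn calls my_pearsonr(fitted_model, X_test, y_test) on each fold
scores = cross_val_score(LinearRegression(), X, y, cv=3, scoring=my_pearsonr)
print(scores)
```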
Let's practice!
Stationarity and stability
Stationarity
Stationary time series do not change their statistical properties over time
E.g., mean, standard deviation, trends
Most time series are non-stationary to some extent
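A rough way to see the difference is to check how much a windowed mean drifts. The sketch below uses synthetic data (an assumption for illustration): white noise, which is stationary, and its cumulative sum, a random walk, which is not.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
noise = pd.Series(rng.normal(size=1000))  # stationary: same mean throughout
walk = noise.cumsum()                     # random walk: the mean wanders

# Spread of the 100-point rolling mean; larger means the mean changes over time
noise_drift = noise.rolling(100).mean().std()
walk_drift = walk.rolling(100).mean().std()
print(round(noise_drift, 2), round(walk_drift, 2))
```

This is only an informal check, not a formal stationarity test.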
Model stability
Non-stationary data results in variability in our model
The statistical properties the model finds may change with the data
In addition, we will be less certain about the correct values of model parameters
How can we quantify this?
Cross validation to quantify parameter stability
One approach: use cross-validation
Calculate model parameters on each iteration
Assess parameter stability across all CV splits
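A minimal sketch of this loop, assuming a `Ridge` model and synthetic data whose true coefficients are 1, 0, and -2. The array name `cv_coefficients` and its shape `(n_cv_folds, n_coefficients)` are choices made for this example.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical data with known coefficients [1, 0, -2]
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, 0.0, -2.0]) + 0.1 * rng.normal(size=200)

# Fit on each training split and keep the fitted coefficients
cv = TimeSeriesSplit(n_splits=5)
cv_coefficients = np.array(
    [Ridge().fit(X[tr], y[tr]).coef_ for tr, tt in cv.split(X)])

print(cv_coefficients.shape)        # (5, 3)
print(cv_coefficients.std(axis=0))  # spread of each coefficient across splits
```

A small spread across splits suggests the parameters are stable; a large spread suggests the relationship changes over time.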
Bootstrapping the mean
Bootstrapping is a common way to assess variability
The bootstrap:
1. Take a random sample of data with replacement
2. Calculate the mean of the sample
3. Repeat this process many times (1000s)
4. Calculate the percentiles of the result (usually 2.5, 97.5)
The result is a 95% confidence interval of the mean of each coefficient.
Bootstrapping the mean
from sklearn.utils import resample
# cv_coefficients has shape (n_cv_folds, n_coefficients)
n_boots = 100
# np.zeros takes the shape as a single tuple
bootstrap_means = np.zeros((n_boots, n_coefficients))
for ii in range(n_boots):
    # Generate random indices for our data with replacement,
    # then take the sample mean
    random_sample = resample(cv_coefficients)
    bootstrap_means[ii] = random_sample.mean(axis=0)
# Compute the percentiles of choice for the bootstrapped means
percentiles = np.percentile(bootstrap_means, (2.5, 97.5), axis=0)
Plotting the bootstrapped coefficients
fig, ax = plt.subplots()
ax.scatter(many_shifts.columns, percentiles[0], marker='_', s=200)
ax.scatter(many_shifts.columns, percentiles[1], marker='_', s=200)
Assessing model performance stability
If you use TimeSeriesSplit, you can plot the model's score over time.
This is useful for finding regions of time that hurt the score
It is also useful for finding non-stationary signals
Model performance over time
def my_corrcoef(est, X, y):
    """Return the correlation coefficient
    between model predictions and a validation set."""
    return np.corrcoef(y, est.predict(X))[1, 0]
# Grab the date of the first index of each validation set
first_indices = [data.index[tt[0]] for tr, tt in cv.split(X, y)]
# Calculate the CV scores and convert to a Pandas Series
cv_scores = cross_val_score(model, X, y, cv=cv, scoring=my_corrcoef)
cv_scores = pd.Series(cv_scores, index=first_indices)
Visualizing model scores as a timeseries
fig, axs = plt.subplots(2, 1, figsize=(10, 5), sharex=True)
# Calculate a rolling mean of scores over time
cv_scores_mean = cv_scores.rolling(10, min_periods=1).mean()
cv_scores.plot(ax=axs[0])
axs[0].set(title='Validation scores (correlation)', ylim=[0, 1])
# Plot the raw data
data.plot(ax=axs[1])
axs[1].set(title='Validation data')
Visualizing model scores
Fixed windows with time series cross-validation
# Only keep the last 100 datapoints in the training data
window = 100
# Initialize the CV with this window size
cv = TimeSeriesSplit(n_splits=10, max_train_size=window)
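To see the effect of `max_train_size`, the sketch below compares training-set sizes with and without it. The toy array of 1100 ordered samples is an assumption for illustration.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical data: 1100 samples in time order
X = np.arange(1100).reshape(-1, 1)

# Default: the training set keeps growing
expanding = TimeSeriesSplit(n_splits=10)
# Fixed window: only the most recent 100 training points are kept
fixed = TimeSeriesSplit(n_splits=10, max_train_size=100)

expanding_sizes = [len(tr) for tr, tt in expanding.split(X)]
fixed_sizes = [len(tr) for tr, tt in fixed.split(X)]
print(expanding_sizes)  # [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
print(fixed_sizes)      # [100, 100, 100, 100, 100, 100, 100, 100, 100, 100]
```

With a fixed window, each model is trained only on the data just before its validation block, which can help when older data no longer resembles recent data.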
Non-stationary signals
Let's practice!
Wrapping-up
Timeseries and machine learning
The many applications of time series + machine learning
Always visualize your data first
The scikit-learn API standardizes this process
Feature extraction and classification
Summary statistics for time series classification
Combining multiple features into a single input matrix
Feature extraction for time series data
Model fitting and improving data quality
Time series features for regression
Generating predictions over time
Cleaning and improving time series data
Validating and assessing our model performance
Cross-validation with time series data (don't shuffle the data!)
Time series stationarity
Assessing model coefficient and score stability
Advanced concepts in time series
Advanced window functions
Signal processing and filtering details
Spectral analysis
Advanced machine learning
Advanced time series feature extraction (e.g., tsfresh )
More complex model architectures for regression and classification
Production-ready pipelines for time series analysis
Ways to practice
There are a lot of opportunities to practice your skills with time series data.
Kaggle has a number of time series prediction challenges
Quantopian is also useful for learning and using predictive models others have built.
Let's practice!