Manipulating Time Series with Pandas

Time series data is one of the most common data types and understanding how to work with it is a critical data science skill if you want to make predictions and report on trends. In this track, you'll learn how to manipulate time series data using pandas, work with statistical libraries including NumPy and statsmodels to analyze data, and develop your visualization skills using Matplotlib, SciPy, and seaborn. You'll then apply your time series skills using real-world data, including financial stock data, UFO sightings, CO2 levels in Maui, monthly candy production in the US, and heartbeat sounds. By the end of this track, you'll know how to forecast the future using ARIMA class models and generate predictions and insights using machine learning models.

https://ebooks-tech.sellfy.store/p/time-series-with-python/

Uploaded by

jcmayac

How to use dates & times with pandas

MANIPULATING TIME SERIES DATA IN PYTHON

Stefan Jansen
Founder & Lead Data Scientist at
Applied Artificial Intelligence

Date & time series functionality
At the root: data types for date & time information
Objects for points in time and periods

Attributes & methods reflect time-related details

Sequences of dates & periods:

Series or DataFrame columns

Index: convert object into Time Series

Many Series/DataFrame methods rely on time information in the index to provide time-series functionality


Basic building block: [Link]
import pandas as pd # assumed imported going forward
from datetime import datetime # To manually create dates
time_stamp = [Link](datetime(2017, 1, 1))
[Link]('2017-01-01') == time_stamp

True # Understands dates as strings

time_stamp # type: [Link]

Timestamp('2017-01-01 [Link]')


Basic building block: [Link]
Timestamp object has many attributes to store time-specific information

time_stamp.year

2017

time_stamp.day_name()

'Sunday'
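The [Link] placeholders on these two slides appear to stand for pandas' Timestamp class. A minimal sketch under that assumption, building the same point in time from a datetime object and from a string and then reading two of its attributes:

```python
from datetime import datetime

import pandas as pd

# Build the same point in time from a datetime object and from a string.
from_datetime = pd.Timestamp(datetime(2017, 1, 1))
from_string = pd.Timestamp('2017-01-01')

print(from_datetime == from_string)   # both describe midnight on 1 Jan 2017
print(from_string.year)               # calendar year as an integer
print(from_string.day_name())         # weekday name as a string
```

The variable names from_datetime and from_string are illustrative and do not come from the slides.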


More building blocks: [Link] & freq
period = [Link]('2017-01')
period # default: month-end

Period('2017-01', 'M')

Period object has freq attribute to store frequency info

[Link]('D') # convert to daily

Period('2017-01-31', 'D')

Convert [Link]() to [Link]() and back

period.to_timestamp().to_period('M')

Period('2017-01', 'M')


More building blocks: [Link] & freq
Frequency info enables basic date arithmetic

period + 2

Period('2017-03', 'M')

[Link]('2017-01-31', 'M') + 1

Timestamp('2017-02-28 [Link]', freq='M')
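The Period behaviour described above can be sketched as follows, assuming the [Link] placeholders refer to pd.Period and its asfreq method. The last two lines are an assumption-laden alternative to the Timestamp-plus-integer arithmetic on the slide, which newer pandas versions are likely to reject: an explicit MonthEnd offset expresses the same "next month end" step.

```python
import pandas as pd

period = pd.Period('2017-01')            # monthly frequency inferred from the string
as_day = period.asfreq('D')              # last day of the month by default
round_trip = period.to_timestamp().to_period('M')

print(period, as_day, round_trip)
print(period + 2)                        # integer arithmetic moves in units of freq

# For a Timestamp, an explicit offset expresses "next month end".
next_month_end = pd.Timestamp('2017-01-31') + pd.offsets.MonthEnd(1)
print(next_month_end)
```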


Sequences of dates & times
pd.date_range : start , end , periods , freq

index = pd.date_range(start='2017-1-1', periods=12, freq='M')

index

DatetimeIndex(['2017-01-31', '2017-02-28', '2017-03-31', ...,

'2017-09-30', '2017-10-31', '2017-11-30', '2017-12-31'],
dtype='datetime64[ns]', freq='M')

[Link] : sequence of Timestamp objects with frequency info


Sequences of dates & times
index[0]

Timestamp('2017-01-31 [Link]', freq='M')

index.to_period()

PeriodIndex(['2017-01', '2017-02', '2017-03', '2017-04', ...,

'2017-11', '2017-12'], dtype='period[M]', freq='M')
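A short sketch of the date sequence and its conversion to periods. It passes a MonthEnd offset object instead of the 'M' string for date_range, on the assumption that this avoids alias differences between pandas versions; the result should be the same twelve month-end dates.

```python
import pandas as pd

# An offset object stands in for the 'M' alias used on the slide.
index = pd.date_range(start='2017-1-1', periods=12, freq=pd.offsets.MonthEnd())
periods = index.to_period('M')   # DatetimeIndex -> PeriodIndex

print(index[0], index[-1])
print(periods[:3])
```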


Create a time series: [Link]
[Link]({'data': index}).info()

RangeIndex: 12 entries, 0 to 11
Data columns (total 1 columns):
data 12 non-null datetime64[ns]
dtypes: datetime64[ns](1)


Create a time series: [Link]
[Link] :
Random numbers: [0,1]

12 rows, 2 columns

data = [Link](size=(12, 2))
[Link](data=data, index=index).info()

DatetimeIndex: 12 entries, 2017-01-31 to 2017-12-31

Freq: M
Data columns (total 2 columns):
0 12 non-null float64
1 12 non-null float64
dtypes: float64(2)
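One way to reproduce this slide end to end, as a sketch. It assumes NumPy's Generator API for the random numbers (the slide's exact call is not recoverable from the placeholder) and an arbitrary seed.

```python
import numpy as np
import pandas as pd

index = pd.date_range(start='2017-1-1', periods=12, freq=pd.offsets.MonthEnd())
rng = np.random.default_rng(0)          # seeded generator; the seed is arbitrary
data = rng.random(size=(12, 2))         # uniform draws in [0, 1)
frame = pd.DataFrame(data=data, index=index)
frame.info()
```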


Frequency aliases & time info


Let's practice!

Indexing & resampling time series

Time series transformation
Basic time series transformations include:

Parsing string dates and converting to datetime64

Selecting & slicing for specific subperiods

Setting & changing DatetimeIndex frequency

Upsampling vs Downsampling


Getting GOOG stock prices
google = pd.read_csv('[Link]') # import pandas as pd
[Link]()

<class '[Link]'>
RangeIndex: 504 entries, 0 to 503
Data columns (total 2 columns):
date 504 non-null object
price 504 non-null float64
dtypes: float64(1), object(1)

[Link]()

date price
0 2015-01-02 524.81
1 2015-01-05 513.87
2 2015-01-06 501.96
3 2015-01-07 501.10
4 2015-01-08 502.68


Converting string dates to datetime64
pd.to_datetime() :
Parse date string

Convert to datetime64

[Link] = pd.to_datetime([Link])
[Link]()

<class '[Link]'>
RangeIndex: 504 entries, 0 to 503
Data columns (total 2 columns):
date 504 non-null datetime64[ns]
price 504 non-null float64
dtypes: datetime64[ns](1), float64(1)


Converting string dates to datetime64
.set_index() :
Date into index

inplace :
don't create copy

google.set_index('date', inplace=True)
[Link]()

<class '[Link]'>
DatetimeIndex: 504 entries, 2015-01-02 to 2016-12-30
Data columns (total 1 columns):
price 504 non-null float64
dtypes: float64(1)
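The two conversion steps can be tried without the CSV file. This sketch uses a three-row stand-in built from the first rows shown on the slides, and assigns the result of set_index instead of using inplace=True, which is a stylistic choice rather than something the slides prescribe.

```python
import pandas as pd

# Toy stand-in for the CSV on the slides, using the first rows shown there.
google = pd.DataFrame({
    'date': ['2015-01-02', '2015-01-05', '2015-01-06'],
    'price': [524.81, 513.87, 501.96],
})
google['date'] = pd.to_datetime(google['date'])   # strings -> datetime64
google = google.set_index('date')                 # returns a new frame
google.info()
```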


Plotting the Google stock time series
[Link](title='Google Stock Price')
plt.tight_layout(); [Link]()


Partial string indexing
Selecting/indexing using strings that parse to dates

google['2015'].info() # Pass string for part of date

DatetimeIndex: 252 entries, 2015-01-02 to 2015-12-31

Data columns (total 1 columns):
price 252 non-null float64
dtypes: float64(1)

google['2015-3': '2016-2'].info() # Slice includes last month

DatetimeIndex: 252 entries, 2015-03-02 to 2016-02-29

Data columns (total 1 columns):
price 252 non-null float64
dtypes: float64(1)
memory usage: 3.9 KB


Partial string indexing
[Link]['2016-6-1', 'price'] # Use full date with .loc[]

734.15
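A sketch of partial string indexing on synthetic data. It routes every selection through .loc, on the assumption that newer pandas versions no longer accept a bare google['2015'] for row selection; the prices are placeholder numbers, not GOOG data.

```python
import numpy as np
import pandas as pd

# Synthetic business-day prices; the values are placeholders, not GOOG data.
dates = pd.date_range('2015-01-02', '2016-12-30', freq='B')
prices = pd.DataFrame({'price': np.arange(len(dates), dtype=float)}, index=dates)

year_2015 = prices.loc['2015']                 # every row dated in 2015
window = prices.loc['2015-3':'2016-2']         # end month is included in full
one_day = prices.loc['2016-06-01', 'price']    # full date plus column label
```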


.asfreq(): set frequency
.asfreq('D') :
Convert DatetimeIndex to calendar day frequency

[Link]('D').info() # set calendar day frequency

DatetimeIndex: 729 entries, 2015-01-02 to 2016-12-30

Freq: D
Data columns (total 1 columns):
price 504 non-null float64
dtypes: float64(1)


.asfreq(): set frequency
Upsampling:
Higher frequency implies new dates => missing data

[Link]('D').head()

price
date
2015-01-02 524.81
2015-01-03 NaN
2015-01-04 NaN
2015-01-05 513.87
2015-01-06 501.96


.asfreq(): reset frequency
.asfreq('B') :
Convert DatetimeIndex to business day frequency

google = [Link]('B') # Change to business day frequency

[Link]()

DatetimeIndex: 521 entries, 2015-01-02 to 2016-12-30

Freq: B
Data columns (total 1 columns):
price 504 non-null float64
dtypes: float64(1)


.asfreq(): reset frequency
google[[Link]()] # Select missing 'price' values

price
date
2015-01-19 NaN
2015-02-16 NaN
...
2016-11-24 NaN
2016-12-26 NaN

Business days that were not trading days
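The difference between calendar-day and business-day frequency shows up even on a handful of rows. In this sketch, four prices from the slides are kept and 2015-01-07 is deliberately left out to mimic a non-trading business day; that gap is an assumption made for illustration.

```python
import pandas as pd

# Four trading days with 2015-01-07 left out on purpose (toy data).
dates = pd.to_datetime(['2015-01-02', '2015-01-05', '2015-01-06', '2015-01-08'])
google = pd.DataFrame({'price': [524.81, 513.87, 501.96, 502.68]}, index=dates)

daily = google.asfreq('D')      # calendar days: weekend and the gap become NaN
business = google.asfreq('B')   # business days: only the gap becomes NaN

print(daily)
print(business[business['price'].isnull()])
```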


Let's practice!

Lags, changes, and returns for stock price series

Basic time series calculations
Typical Time Series manipulations include:
Shift or lag values back or forward in time

Get the difference in value for a given time period

Compute the percent change over any number of periods

pandas built-in methods rely on [Link]


Getting GOOG stock prices
Let pd.read_csv() do the parsing for you!

google = pd.read_csv('[Link]', parse_dates=['date'], index_col='date')

[Link]()

<class '[Link]'>
DatetimeIndex: 504 entries, 2015-01-02 to 2016-12-30
Data columns (total 1 columns):
price 504 non-null float64
dtypes: float64(1)


Getting GOOG stock prices
[Link]()

price
date
2015-01-02 524.81
2015-01-05 513.87
2015-01-06 501.96
2015-01-07 501.10
2015-01-08 502.68


.shift(): Moving data between past & future
.shift() :
defaults to periods=1

1 period into future

google['shifted'] = [Link]() # default: periods=1

[Link](3)

price shifted
date
2015-01-02 524.81 NaN
2015-01-05 513.87 524.81
2015-01-06 501.96 513.87


.shift(): Moving data between past & future
.shift(periods=-1) :
lagged data

1 period back in time

google['lagged'] = [Link](periods=-1)
google[['price', 'lagged', 'shifted']].tail(3)

price lagged shifted

date
2016-12-28 785.05 782.79 791.55
2016-12-29 782.79 771.82 785.05
2016-12-30 771.82 NaN 782.79


Calculate one-period percent change
$x_t / x_{t-1}$
google['change'] = [Link]([Link])
google[['price', 'shifted', 'change']].head(3)

price shifted change

Date
2017-01-03 786.14 NaN NaN
2017-01-04 786.90 786.14 1.000967
2017-01-05 794.02 786.90 1.009048


Calculate one-period percent change
google['return'] = [Link](1).mul(100)
google[['price', 'shifted', 'change', 'return']].head(3)

price shifted change return

date
2015-01-02 524.81 NaN NaN NaN
2015-01-05 513.87 524.81 0.98 -2.08
2015-01-06 501.96 513.87 0.98 -2.32


.diff(): built-in time-series change
Difference in value for two adjacent periods

$x_t - x_{t-1}$
google['diff'] = [Link]()
google[['price', 'diff']].head(3)

price diff
date
2015-01-02 524.81 NaN
2015-01-05 513.87 -10.94
2015-01-06 501.96 -11.91


.pct_change(): built-in time-series % change
Percent change for two adjacent periods

$\frac{x_t}{x_{t-1}} - 1$

google['pct_change'] = [Link].pct_change().mul(100)
google[['price', 'return', 'pct_change']].head(3)

price return pct_change

date
2015-01-02 524.81 NaN NaN
2015-01-05 513.87 -2.08 -2.08
2015-01-06 501.96 -2.32 -2.32


Looking ahead: Get multi-period returns
google['return_3d'] = [Link].pct_change(periods=3).mul(100)
google[['price', 'return_3d']].head()

price return_3d
date
2015-01-02 524.81 NaN
2015-01-05 513.87 NaN
2015-01-06 501.96 NaN
2015-01-07 501.10 -4.517825
2015-01-08 502.68 -2.177594

Percent change for two periods, 3 trading days apart
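The manual and built-in calculations from this section can be compared side by side. This sketch uses the five prices printed on the slides and assumes the [Link] placeholders stand for methods on the price column (shift, div, sub, diff).

```python
import numpy as np
import pandas as pd

dates = pd.to_datetime(['2015-01-02', '2015-01-05', '2015-01-06',
                        '2015-01-07', '2015-01-08'])
google = pd.DataFrame({'price': [524.81, 513.87, 501.96, 501.10, 502.68]},
                      index=dates)

google['shifted'] = google['price'].shift()                 # previous row's price
google['change'] = google['price'].div(google['shifted'])   # x_t / x_{t-1}
google['return'] = google['change'].sub(1).mul(100)         # percent
google['diff'] = google['price'].diff()                     # x_t - x_{t-1}
google['pct_change'] = google['price'].pct_change().mul(100)
google['return_3d'] = google['price'].pct_change(periods=3).mul(100)
print(google.round(2))
```

The 'return' and 'pct_change' columns should agree row by row, which is the point of the built-in method.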


Let's practice!

Compare time series growth rates

Comparing stock performance
Stock price series: hard to compare at different levels

Simple solution: normalize price series to start at 100

Divide all prices by first in series, multiply by 100

Same starting point

All prices relative to starting point

Difference to starting point in percentage points


Normalizing a single series (1)
google = pd.read_csv('[Link]', parse_dates=['date'], index_col='date')
[Link](3)

price
date
2010-01-04 313.06
2010-01-05 311.68
2010-01-06 303.83

first_price = [Link][0] # int-based selection

first_price

313.06

first_price == [Link]['2010-01-04', 'price']

True


Normalizing a single series (2)
normalized = [Link](first_price).mul(100)
[Link](title='Google Normalized Series')


Normalizing multiple series (1)
prices = pd.read_csv('stock_prices.csv',
parse_dates=['date'],
index_col='date')
[Link]()

DatetimeIndex: 1761 entries, 2010-01-04 to 2016-12-30

Data columns (total 3 columns):
AAPL 1761 non-null float64
GOOG 1761 non-null float64
YHOO 1761 non-null float64
dtypes: float64(3)

[Link](2)

AAPL GOOG YHOO

Date
2010-01-04 30.57 313.06 17.10
2010-01-05 30.63 311.68 17.23


Normalizing multiple series (2)
[Link][0]

AAPL 30.57
GOOG 313.06
YHOO 17.10
Name: 2010-01-04 [Link], dtype: float64

normalized = [Link]([Link][0])
[Link](3)

AAPL GOOG YHOO

Date
2010-01-04 1.000000 1.000000 1.000000
2010-01-05 1.001963 0.995592 1.007602
2010-01-06 0.985934 0.970517 1.004094

.div() : automatic alignment of Series index & DataFrame columns
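A compact sketch of the normalization step, using the two rows of prices printed on the slides. The first row is a Series indexed by ticker, so .div() matches it against the DataFrame's column labels.

```python
import pandas as pd

dates = pd.to_datetime(['2010-01-04', '2010-01-05'])
prices = pd.DataFrame({'AAPL': [30.57, 30.63],
                       'GOOG': [313.06, 311.68],
                       'YHOO': [17.10, 17.23]}, index=dates)

first_prices = prices.iloc[0]                      # Series indexed by ticker
normalized = prices.div(first_prices).mul(100)     # aligned on column labels
print(normalized.round(4))
```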


Comparing with a benchmark (1)
index = pd.read_csv('[Link]', parse_dates=['date'], index_col='date')
[Link]()

DatetimeIndex: 1826 entries, 2010-01-01 to 2016-12-30

Data columns (total 1 columns):
SP500 1762 non-null float64
dtypes: float64(1)

prices = [Link]([prices, index], axis=1).dropna()

[Link]()

DatetimeIndex: 1761 entries, 2010-01-04 to 2016-12-30

Data columns (total 4 columns):
AAPL 1761 non-null float64
GOOG 1761 non-null float64
YHOO 1761 non-null float64
SP500 1761 non-null float64
dtypes: float64(4)


Comparing with a benchmark (2)
[Link](1)

AAPL GOOG YHOO SP500

2010-01-04 30.57 313.06 17.10 1132.99

normalized = [Link]([Link][0]).mul(100)
[Link]()


Plotting performance difference
diff = normalized[tickers].sub(normalized['SP500'], axis=0)

GOOG YHOO AAPL

2010-01-04 0.000000 0.000000 0.000000
2010-01-05 -0.752375 0.448669 -0.115294
2010-01-06 -3.314604 0.043069 -1.772895

.sub(..., axis=0) : Subtract a Series from each DataFrame column by aligning indexes
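The role of axis=0 is easier to see on toy numbers. In this sketch the normalized values are invented, and tickers is a hypothetical list of the columns to compare; the slides use a variable of that name without showing its definition.

```python
import pandas as pd

# Toy normalized series (start = 100); the numbers are illustrative only.
dates = pd.to_datetime(['2010-01-04', '2010-01-05', '2010-01-06'])
normalized = pd.DataFrame({'AAPL': [100.0, 102.0, 99.0],
                           'GOOG': [100.0, 101.0, 97.0],
                           'SP500': [100.0, 100.5, 100.0]}, index=dates)

tickers = ['AAPL', 'GOOG']   # hypothetical list of columns to compare
diff = normalized[tickers].sub(normalized['SP500'], axis=0)
print(diff)
```

Without axis=0, pandas would try to match the benchmark's dates against the column labels, and the result would contain only missing values.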


Plotting performance difference
[Link]()


Let's practice!

Changing the time series frequency: resampling

Changing the frequency: resampling
DatetimeIndex : set & change freq using .asfreq()

But frequency conversion affects the data

Upsampling: fill or interpolate missing data

Downsampling: aggregate existing data

pandas API:
.asfreq() , .reindex()

.resample() + transformation method


Getting started: quarterly data
dates = pd.date_range(start='2016', periods=4, freq='Q')
data = range(1, 5)
quarterly = [Link](data=data, index=dates)
quarterly

2016-03-31 1
2016-06-30 2
2016-09-30 3
2016-12-31 4
Freq: Q-DEC, dtype: int64 # Default: year-end quarters


Upsampling: quarter => month
monthly = [Link]('M') # to month-end frequency

2016-03-31 1.0
2016-04-30 NaN
2016-05-31 NaN
2016-06-30 2.0
2016-07-31 NaN
2016-08-31 NaN
2016-09-30 3.0
2016-10-31 NaN
2016-11-30 NaN
2016-12-31 4.0
Freq: M, dtype: float64

Upsampling creates missing values

monthly = monthly.to_frame('baseline') # to DataFrame


Upsampling: fill methods
monthly['ffill'] = [Link]('M', method='ffill')
monthly['bfill'] = [Link]('M', method='bfill')
monthly['value'] = [Link]('M', fill_value=0)


Upsampling: fill methods
bfill : backfill

ffill : forward fill

baseline ffill bfill value

2016-03-31 1.0 1 1 1
2016-04-30 NaN 1 2 0
2016-05-31 NaN 1 2 0
2016-06-30 2.0 2 2 2
2016-07-31 NaN 2 3 0
2016-08-31 NaN 2 3 0
2016-09-30 3.0 3 3 3
2016-10-31 NaN 3 4 0
2016-11-30 NaN 3 4 0
2016-12-31 4.0 4 4 4
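The table above can be reproduced with a few lines. This sketch assumes the [Link] placeholders stand for quarterly.asfreq, builds the quarter-end dates explicitly, and passes a MonthEnd offset object in place of the 'M' alias to stay independent of alias changes between pandas versions.

```python
import pandas as pd

dates = pd.to_datetime(['2016-03-31', '2016-06-30', '2016-09-30', '2016-12-31'])
quarterly = pd.Series([1, 2, 3, 4], index=dates)

month_end = pd.offsets.MonthEnd()   # offset object in place of the 'M' alias
monthly = quarterly.asfreq(month_end).to_frame('baseline')
monthly['ffill'] = quarterly.asfreq(month_end, method='ffill')
monthly['bfill'] = quarterly.asfreq(month_end, method='bfill')
monthly['value'] = quarterly.asfreq(month_end, fill_value=0)
print(monthly)
```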


Add missing months: .reindex()
dates = pd.date_range(start='2016', periods=12, freq='M')

DatetimeIndex(['2016-01-31', '2016-02-29', ..., '2016-11-30', '2016-12-31'],
dtype='datetime64[ns]', freq='M')

[Link](dates)

2016-01-31 NaN
2016-02-29 NaN
2016-03-31 1.0
2016-04-30 NaN
2016-05-31 NaN
2016-06-30 2.0
2016-07-31 NaN
2016-08-31 NaN
2016-09-30 3.0
2016-10-31 NaN
2016-11-30 NaN
2016-12-31 4.0

.reindex() :
conform DataFrame to new index

same filling logic as .asfreq()
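A sketch of .reindex() on the quarterly series, assuming the placeholder call is quarterly.reindex(dates). Unlike .asfreq(), the new index can start before the first observation, so January and February appear as missing values.

```python
import pandas as pd

quarter_ends = pd.to_datetime(['2016-03-31', '2016-06-30', '2016-09-30', '2016-12-31'])
quarterly = pd.Series([1, 2, 3, 4], index=quarter_ends)

# All twelve month ends of 2016, including January and February.
dates = pd.date_range(start='2016', periods=12, freq=pd.offsets.MonthEnd())
full_year = quarterly.reindex(dates)                 # unmatched dates -> NaN
filled = quarterly.reindex(dates, method='ffill')    # same fill options as asfreq
```

With forward fill, the first two months stay empty because no earlier observation exists to carry forward.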


Let's practice!

Upsampling & interpolation with .resample()

Frequency conversion & transformation methods
.resample() : similar to .groupby()

Groups data within resampling period and applies one or several methods to each group

New date determined by offset - start, end, etc

Upsampling: fill from existing or interpolate values

Downsampling: apply aggregation to existing data


Getting started: monthly unemployment rate
unrate = pd.read_csv('[Link]', parse_dates=['Date'], index_col='Date')
[Link]()

DatetimeIndex: 208 entries, 2000-01-01 to 2017-04-01

Data columns (total 1 columns):
UNRATE 208 non-null float64 # no frequency information
dtypes: float64(1)

[Link]()

UNRATE
DATE
2000-01-01 4.0
2000-02-01 4.1
2000-03-01 4.0
2000-04-01 3.8
2000-05-01 4.0

Reporting date: 1st day of month


Resampling Period & Frequency Offsets
Resample creates new date for frequency offset

Several alternatives to calendar month end

Frequency             Alias  Sample Date

Calendar Month End    M      2017-04-30
Calendar Month Start  MS     2017-04-01
Business Month End    BM     2017-04-28
Business Month Start  BMS    2017-04-03


Resampling logic



Assign frequency with .resample()
[Link]('MS').info()

DatetimeIndex: 208 entries, 2000-01-01 to 2017-04-01

Freq: MS
Data columns (total 1 columns):
UNRATE 208 non-null float64
dtypes: float64(1)

[Link]('MS') # creates Resampler object

DatetimeIndexResampler [freq=<MonthBegin>, axis=0, closed=left,

label=left, convention=start, base=0]


Assign frequency with .resample()
[Link]('MS').equals([Link]('MS').asfreq())

True

.resample() : returns data only when calling another method


Quarterly real GDP growth
gdp = pd.read_csv('[Link]')
[Link]()

DatetimeIndex: 69 entries, 2000-01-01 to 2017-01-01

Data columns (total 1 columns):
gpd 69 non-null float64 # no frequency info
dtypes: float64(1)

[Link](2)

gpd
DATE
2000-01-01 1.2
2000-04-01 7.8


Interpolate monthly real GDP growth
gdp_1 = [Link]('MS').ffill().add_suffix('_ffill')

gpd_ffill
DATE
2000-01-01 1.2
2000-02-01 1.2
2000-03-01 1.2
2000-04-01 7.8


Interpolate monthly real GDP growth
gdp_2 = [Link]('MS').interpolate().add_suffix('_inter')

gpd_inter
DATE
2000-01-01 1.200000
2000-02-01 3.400000
2000-03-01 5.600000
2000-04-01 7.800000

.interpolate() : finds points on straight line between existing data
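Forward fill and linear interpolation can be compared on the two quarterly observations shown above. This is a sketch that assumes the placeholder calls are gdp.resample('MS'); it keeps the column name 'gpd' exactly as printed on the slides.

```python
import numpy as np
import pandas as pd

# The two quarterly observations shown on the slide.
dates = pd.to_datetime(['2000-01-01', '2000-04-01'])
gdp = pd.DataFrame({'gpd': [1.2, 7.8]}, index=dates)

gdp_ffill = gdp.resample('MS').ffill().add_suffix('_ffill')
gdp_inter = gdp.resample('MS').interpolate().add_suffix('_inter')
print(pd.concat([gdp_ffill, gdp_inter], axis=1))
```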


Concatenating two DataFrames
df1 = [Link]([1, 2, 3], columns=['df1'])
df2 = [Link]([4, 5, 6], columns=['df2'])
[Link]([df1, df2])

df1 df2
0 1.0 NaN
1 2.0 NaN
2 3.0 NaN
0 NaN 4.0
1 NaN 5.0
2 NaN 6.0


Concatenating two DataFrames
[Link]([df1, df2], axis=1)

df1 df2
0 1 4
1 2 5
2 3 6

axis=1 : concatenate horizontally


Plot interpolated real GDP growth
[Link]([gdp_1, gdp_2], axis=1).loc['2015':].plot()


Combine GDP growth & unemployment
[Link]([unrate, gdp_inter], axis=1).plot();


Let's practice!

Downsampling & aggregation

Downsampling & aggregation methods
So far: upsampling, fill logic & interpolation

Now: downsampling
hour to day

day to month, etc

How to represent the existing values at the new date?

Mean, median, last value?


Air quality: daily ozone levels
ozone = pd.read_csv('[Link]',
parse_dates=['date'],
index_col='date')
[Link]()

DatetimeIndex: 6291 entries, 2000-01-01 to 2017-03-31

Data columns (total 1 columns):
Ozone 6167 non-null float64
dtypes: float64(1)

ozone = [Link]('D').asfreq()
[Link]()

DatetimeIndex: 6300 entries, 1998-01-05 to 2017-03-31

Freq: D
Data columns (total 1 columns):
Ozone 6167 non-null float64
dtypes: float64(1)


Creating monthly ozone data
[Link]('M').mean().head() [Link]('M').median().head()

Ozone Ozone
date date
2000-01-31 0.010443 2000-01-31 0.009486
2000-02-29 0.011817 2000-02-29 0.010726
2000-03-31 0.016810 2000-03-31 0.017004
2000-04-30 0.019413 2000-04-30 0.019866
2000-05-31 0.026535 2000-05-31 0.026018

.resample().mean() : Monthly average, assigned to end of calendar month


Creating monthly ozone data
[Link]('M').agg(['mean', 'std']).head()

Ozone
mean std
date
2000-01-31 0.010443 0.004755
2000-02-29 0.011817 0.004072
2000-03-31 0.016810 0.004977
2000-04-30 0.019413 0.006574
2000-05-31 0.026535 0.008409

.resample().agg() : List of aggregation functions like groupby
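A sketch of downsampling on synthetic daily values. It labels each month by its first day ('MS') rather than its last day as on the slides, purely to avoid alias differences between pandas versions; the aggregation logic is the same. The numbers are a simple ramp, not ozone readings.

```python
import numpy as np
import pandas as pd

# Synthetic daily values for January and February 2000 (not ozone readings).
days = pd.date_range('2000-01-01', '2000-02-29', freq='D')
ozone = pd.DataFrame({'Ozone': np.arange(len(days), dtype=float)}, index=days)

monthly_mean = ozone.resample('MS').mean()              # one row per month
monthly_stats = ozone.resample('MS').agg(['mean', 'std'])
print(monthly_stats)
```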


Plotting resampled ozone data
ozone = [Link]['2016':]
ax = [Link]()
monthly = [Link]('M').mean()
monthly.add_suffix('_monthly').plot(ax=ax)


Resampling multiple time series
data = pd.read_csv('ozone_pm25.csv',
parse_dates=['date'],
index_col='date')
data = [Link]('D').asfreq()
[Link]()

DatetimeIndex: 6300 entries, 2000-01-01 to 2017-03-31

Freq: D
Data columns (total 2 columns):
Ozone 6167 non-null float64
PM25 6167 non-null float64
dtypes: float64(2)


Resampling multiple time series
data = [Link]('BM').mean()
[Link]()

<class '[Link]'>
DatetimeIndex: 207 entries, 2000-01-31 to 2017-03-31
Freq: BM
Data columns (total 2 columns):
ozone 207 non-null float64
pm25 207 non-null float64
dtypes: float64(2)


Resampling multiple time series
[Link]('M').first().head(4)

Ozone PM25
date
2000-01-31 0.005545 20.800000
2000-02-29 0.016139 6.500000
2000-03-31 0.017004 8.493333
2000-04-30 0.031354 6.889474

[Link]('MS').first().head()

Ozone PM25
date
2000-01-01 0.004032 37.320000
2000-02-01 0.010583 24.800000
2000-03-01 0.007418 11.106667
2000-04-01 0.017631 11.700000
2000-05-01 0.022628 9.700000


Let's practice!

Rolling window functions with pandas

Window functions in pandas
Windows identify sub periods of your time series

Calculate metrics for sub periods inside the window

Create a new time series of metrics

Two types of windows:

Rolling: same size, sliding (this video)

Expanding: contain all prior values (next video)


Calculating a rolling average
data = pd.read_csv('[Link]', parse_dates=['date'], index_col='date')

DatetimeIndex: 1761 entries, 2010-01-04 to 2016-12-30

Data columns (total 1 columns):
price 1761 non-null float64
dtypes: float64(1)


Calculating a rolling average
# Integer-based window size
[Link](window=30).mean() # fixed # observations

DatetimeIndex: 1761 entries, 2010-01-04 to 2017-05-24

Data columns (total 1 columns):
price 1732 non-null float64
dtypes: float64(1)

window=30 : # business days

min_periods : choose value < 30 to get results for first days


Calculating a rolling average
# Offset-based window size
[Link](window='30D').mean() # fixed period length

DatetimeIndex: 1761 entries, 2010-01-04 to 2017-05-24

Data columns (total 1 columns):
price 1761 non-null float64
dtypes: float64(1)

30D : # calendar days
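The difference between integer and offset windows is visible on six business days. This sketch uses invented prices and a 3-observation / 3-calendar-day window instead of 30, so every value can be checked by hand.

```python
import pandas as pd

dates = pd.date_range('2010-01-04', periods=6, freq='B')   # Mon-Fri, then Mon
data = pd.DataFrame({'price': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]}, index=dates)

by_count = data['price'].rolling(window=3).mean()       # last 3 observations
by_offset = data['price'].rolling(window='3D').mean()   # last 3 calendar days
relaxed = data['price'].rolling(window=3, min_periods=1).mean()
print(pd.concat([by_count, by_offset, relaxed], axis=1,
                keys=['3 obs', '3D', '3 obs, min_periods=1']))
```

On the second Monday the offset window holds only that day, because the weekend and the preceding Friday fall outside the three calendar days, while the integer window still averages the last three observations.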


90 day rolling mean
r90 = [Link](window='90D').mean()
[Link](r90.add_suffix('_mean_90')).plot()


90 & 360 day rolling means
data['mean90'] = r90
r360 = data['price'].rolling(window='360D').mean()
data['mean360'] = r360; [Link]()


Multiple rolling metrics (1)
r = [Link]('90D').agg(['mean', 'std'])
[Link](subplots = True)


Multiple rolling metrics (2)
rolling = [Link]('360D')
q10 = [Link](0.1).to_frame('q10')
median = [Link]().to_frame('median')
q90 = [Link](0.9).to_frame('q90')
[Link]([q10, median, q90], axis=1).plot()


Let's practice!

Expanding window functions with pandas

Expanding windows in pandas
From rolling to expanding windows

Calculate metrics for periods up to current date

New time series reflects all historical values

Useful for running rate of return, running min/max

Two options with pandas:

.expanding() - just like .rolling()

.cumsum() , .cumprod() , .cummin() / .cummax()


The basic idea
df = [Link]({'data': range(5)})
df['expanding sum'] = [Link]().sum()
df['cumulative sum'] = [Link]()
df

data expanding sum cumulative sum

0 0 0.0 0
1 1 1.0 1
2 2 3.0 3
3 3 6.0 6
4 4 10.0 10


Get data for the S&P 500
data = pd.read_csv('[Link]', parse_dates=['date'], index_col='date')

DatetimeIndex: 2519 entries, 2007-05-24 to 2017-05-24

Data columns (total 1 columns):
SP500 2519 non-null float64


How to calculate a running return
Single period return $r_t$: current price over last price minus 1:

$r_t = \frac{P_t}{P_{t-1}} - 1$

Multi-period return: product of $(1 + r_t)$ for all periods, minus 1:

$R_T = (1 + r_1)(1 + r_2) \ldots (1 + r_T) - 1$

For the period return: .pct_change()

For basic math .add() , .sub() , .mul() , .div()

For cumulative product: .cumprod()


Running rate of return in practice
pr = data.SP500.pct_change() # period return
pr_plus_one = [Link](1)
cumulative_return = pr_plus_one.cumprod().sub(1)
cumulative_return.mul(100).plot()
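The running return can be checked on a toy price path, assuming the placeholder on the slide is pr.add(1). Compounding the period returns should give the same number as dividing each price by the first one and subtracting 1, because the intermediate prices cancel out.

```python
import pandas as pd

prices = pd.Series([100.0, 110.0, 99.0, 120.0])   # toy price path

pr = prices.pct_change()                           # r_t = P_t / P_{t-1} - 1
cumulative_return = pr.add(1).cumprod().sub(1)     # R_T = prod(1 + r_t) - 1

# The compounded return telescopes to last price over first price, minus 1.
direct = prices.div(prices.iloc[0]).sub(1)
print(pd.concat([cumulative_return, direct], axis=1, keys=['cumprod', 'direct']))
```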


Getting the running min & max
data['running_min'] = [Link]().min()
data['running_max'] = [Link]().max()
[Link]()


Rolling annual rate of return
def multi_period_return(period_returns):
    return [Link](period_returns + 1) - 1
pr = data.SP500.pct_change() # period return
r = [Link]('360D').apply(multi_period_return)
data['Rolling 1yr Return'] = [Link](100)
[Link](subplots=True)
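A runnable miniature of the rolling return, assuming the placeholder inside multi_period_return is NumPy's prod. It uses a two-day window on four invented prices rather than '360D' on ten years of data, and drops the leading missing return before rolling.

```python
import numpy as np
import pandas as pd

def multi_period_return(period_returns):
    """Compound a window of single-period returns into one return."""
    return np.prod(period_returns + 1) - 1

dates = pd.date_range('2017-01-02', periods=4, freq='D')
prices = pd.Series([100.0, 110.0, 121.0, 133.1], index=dates)   # +10% per day
pr = prices.pct_change().dropna()                               # drop leading NaN

# Two-day window here; the slide uses '360D' on ten years of data.
rolling_return = pr.rolling('2D').apply(multi_period_return, raw=True)
print(rolling_return.mul(100).round(2))
```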


Rolling annual rate of return
data['Rolling 1yr Return'] = [Link](100)
[Link](subplots=True)


Let's practice!

Case study: S&P500 price simulation

Random walks & simulations
Daily stock returns are hard to predict

Models often assume they are random in nature

Numpy allows you to generate random numbers

From random returns to prices: use .cumprod()

Two examples:
Generate random returns

Randomly selected actual SP500 returns


Generate random numbers
from [Link] import normal, seed
from [Link] import norm
seed(42)
random_returns = normal(loc=0, scale=0.01, size=1000)
[Link](random_returns, fit=norm, kde=False)


Create a random price path
return_series = [Link](random_returns)
random_prices = return_series.add(1).cumprod().sub(1)
random_prices.mul(100).plot()
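A sketch of the random path without plotting. It swaps the slides' seed/normal functions for NumPy's Generator API, which is an assumption about style rather than the course's method, and adds a hypothetical starting level of 100 to show a price path next to the cumulative return.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)                        # seeded for repeatability
random_returns = rng.normal(loc=0, scale=0.01, size=1000)

return_series = pd.Series(random_returns)
cumulative = return_series.add(1).cumprod().sub(1)     # cumulative return path
price_path = return_series.add(1).cumprod().mul(100)   # hypothetical start of 100
```

Note that the series called random_prices on the slide is, strictly speaking, a cumulative return; multiplying the compounded factors by a starting level gives prices.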


S&P 500 prices & returns
data = pd.read_csv('[Link]', parse_dates=['date'], index_col='date')
data['returns'] = data.SP500.pct_change()
[Link](subplots=True)


S&P return distribution
[Link]([Link]().mul(100), fit=norm)


Generate random S&P 500 returns
from [Link] import choice
sample = [Link]()
n_obs = [Link]()
random_walk = choice(sample, size=n_obs)
random_walk = [Link](random_walk, index=[Link])
random_walk.head()

DATE
2007-05-29 -0.008357
2007-05-30 0.003702
2007-05-31 -0.013990
2007-06-01 0.008096
2007-06-04 0.013120


Random S&P 500 prices (1)
start = [Link]('D')

DATE
2007-05-25 1515.73
Name: SP500, dtype: float64

sp500_random = [Link](random_walk.add(1))
sp500_random.head()

DATE
2007-05-25 1515.730000
2007-05-29 0.998290
2007-05-30 0.995190
2007-05-31 0.997787
2007-06-01 0.983853
dtype: float64


Random S&P 500 prices (2)
data['SP500_random'] = sp500_random.cumprod()
data[['SP500', 'SP500_random']].plot()
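The resampling idea can be sketched on six invented prices. Only the first price (1515.73) comes from the slides. The sketch uses pd.concat to put the start price in front of the random return factors, on the assumption that Series.append is unavailable in newer pandas versions, and takes the first price with iloc rather than the placeholder call shown above.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('2007-05-25', periods=6, freq='B')
sp500 = pd.Series([1515.73, 1518.11, 1530.23, 1530.62, 1536.34, 1539.18],
                  index=dates)                       # made-up prices after the first

returns = sp500.pct_change().dropna()
rng = np.random.default_rng(42)
random_walk = pd.Series(rng.choice(returns.to_numpy(), size=len(returns)),
                        index=returns.index)         # resample with replacement

start = sp500.iloc[:1]                               # first price as a Series
sp500_random = pd.concat([start, random_walk.add(1)]).cumprod()
```

The cumulative product turns the start price followed by the factors (1 + r) into a simulated price path on the original dates.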


Let's practice!

Relationships between time series: correlation

Correlation & relations between series
So far, focus on characteristics of individual variables

Now: characteristic of relations between variables

Correlation: measures linear relationships

Financial markets: important for prediction and risk management

pandas & seaborn have tools to compute & visualize


Correlation & linear relationships
Correlation coefficient: how similar is the pairwise movement of two variables around their averages?

Varies between -1 and +1

$r = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{s_x s_y}$


Importing five price time series
data = pd.read_csv('[Link]', parse_dates=['date'],
index_col='date')
data = [Link]().info()

DatetimeIndex: 2469 entries, 2007-05-25 to 2017-05-22

Data columns (total 5 columns):
sp500 2469 non-null float64
nasdaq 2469 non-null float64
bonds 2469 non-null float64
gold 2469 non-null float64
oil 2469 non-null float64


Visualize pairwise linear relationships
daily_returns = data.pct_change()
[Link](x='sp500', y='nasdaq', data=daily_returns);


Calculate all correlations
correlations = [Link]()
correlations

bonds oil gold sp500 nasdaq

bonds 1.000000 -0.183755 0.003167 -0.300877 -0.306437
oil -0.183755 1.000000 0.105930 0.335578 0.289590
gold 0.003167 0.105930 1.000000 -0.007786 -0.002544
sp500 -0.300877 0.335578 -0.007786 1.000000 0.959990
nasdaq -0.306437 0.289590 -0.002544 0.959990 1.000000


Visualize all correlations
[Link](correlations, annot=True)
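A sketch that computes a correlation matrix on synthetic returns and checks one entry by hand. The manual calculation assumes s_x and s_y are sample standard deviations, in which case the sum of cross-products also has to be divided by N - 1 to match the value returned by .corr(). The column names a, b and c are placeholders.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=250)
daily_returns = pd.DataFrame({
    'a': x,
    'b': 0.5 * x + rng.normal(scale=0.5, size=250),   # partly driven by 'a'
    'c': rng.normal(size=250),                         # independent noise
})

correlations = daily_returns.corr()                    # pairwise Pearson r

# Manual Pearson r for one pair, using sample standard deviations.
a, b = daily_returns['a'], daily_returns['b']
manual = ((a - a.mean()) * (b - b.mean())).sum() / ((len(a) - 1) * a.std() * b.std())
```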


Let's practice!
M A N I P U L AT I N G T I M E S E R I E S D ATA I N P Y T H O N
Select index
components &
import data
M A N I P U L AT I N G T I M E S E R I E S D ATA I N P Y T H O N

Stefan Jansen
Founder & Lead Data Scientist at
Applied Artificial Intelligence
Market value-weighted index
Composite performance of various stocks

Components weighted by market capitalization

Share Price x Number of Shares => Market Value

Larger components get higher percentage weightings

Key market indexes are value-weighted:

S&P 500 , NASDAQ , Wilshire 5000 , Hang Seng

MANIPULATING TIME SERIES DATA IN PYTHON

Build a cap-weighted Index
Apply new skills to construct value-weighted index
Select components from exchange listing data

Get component number of shares and stock prices

Calculate component weights

Calculate index

Evaluate performance of components and index

MANIPULATING TIME SERIES DATA IN PYTHON

Load stock listing data
nyse = pd.read_excel('[Link]', sheet_name='nyse',
                     na_values='n/a')
nyse.info()

RangeIndex: 3147 entries, 0 to 3146

Data columns (total 7 columns):
Stock Symbol 3147 non-null object # Stock Ticker
Company Name 3147 non-null object
Last Sale 3079 non-null float64 # Latest Stock Price
Market Capitalization 3147 non-null float64
IPO Year 1361 non-null float64 # Year of listing
Sector 2177 non-null object
Industry 2177 non-null object
dtypes: float64(3), object(4)

MANIPULATING TIME SERIES DATA IN PYTHON

Load & prepare listing data
nyse.set_index('Stock Symbol', inplace=True)
nyse.dropna(subset=['Sector'], inplace=True)
nyse['Market Capitalization'] /= 1e6 # in Million USD

Index: 2177 entries, DDD to ZTO

Data columns (total 6 columns):
Company Name 2177 non-null object
Last Sale 2175 non-null float64
Market Capitalization 2177 non-null float64
IPO Year 967 non-null float64
Sector 2177 non-null object
Industry 2177 non-null object
dtypes: float64(3), object(3)

MANIPULATING TIME SERIES DATA IN PYTHON

Select index components
components = nyse.groupby(['Sector'])['Market Capitalization'].nlargest(1)
components.sort_values(ascending=False)

Sector Stock Symbol

Health Care JNJ 338834.390080
Energy XOM 338728.713874
Finance JPM 300283.250479
Miscellaneous BABA 275525.000000
Public Utilities T 247339.517272
Basic Industries PG 230159.644117
Consumer Services WMT 221864.614129
Consumer Non-Durables KO 183655.305119
Technology ORCL 181046.096000
Capital Goods TM 155660.252483
Transportation UPS 90180.886756
Consumer Durables ABB 48398.935676
Name: Market Capitalization, dtype: float64
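
A minimal sketch of the selection step on a toy listing table. The tickers, sectors, and values are hypothetical and only show the shape of the result: a Series with a two-level index of sector and stock symbol.

```python
import pandas as pd

# Hypothetical listing data indexed by ticker (not the NYSE file from the slides)
listings = pd.DataFrame(
    {
        "Sector": ["Tech", "Tech", "Energy", "Energy"],
        "Market Capitalization": [10.0, 30.0, 20.0, 5.0],
    },
    index=pd.Index(["AAA", "BBB", "CCC", "DDD"], name="Stock Symbol"),
)

# Largest company by market cap within each sector
components = listings.groupby("Sector")["Market Capitalization"].nlargest(1)

# The ticker symbols live in the second index level
tickers = components.index.get_level_values("Stock Symbol").tolist()
print(components)
print(tickers)
```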

MANIPULATING TIME SERIES DATA IN PYTHON

Import & prepare listing data
tickers = components.index.get_level_values('Stock Symbol')
tickers

Index(['PG', 'TM', 'ABB', 'KO', 'WMT', 'XOM', 'JPM', 'JNJ', 'BABA', 'T',
       'ORCL', 'UPS'], dtype='object', name='Stock Symbol')

tickers.tolist()

['PG',
'TM',
'ABB',
'KO',
'WMT',
...
'T',
'ORCL',
'UPS']

MANIPULATING TIME SERIES DATA IN PYTHON

Stock index components
columns = ['Company Name', 'Market Capitalization', 'Last Sale']
component_info = nyse.loc[tickers, columns]
pd.options.display.float_format = '{:,.2f}'.format

Company Name Market Capitalization Last Sale

Stock Symbol
PG Procter & Gamble Company (The) 230,159.64 90.03
TM Toyota Motor Corp Ltd Ord 155,660.25 104.18
ABB ABB Ltd 48,398.94 22.63
KO Coca-Cola Company (The) 183,655.31 42.79
WMT Wal-Mart Stores, Inc. 221,864.61 73.15
XOM Exxon Mobil Corporation 338,728.71 81.69
JPM J P Morgan Chase & Co 300,283.25 84.40
JNJ Johnson & Johnson 338,834.39 124.99
BABA Alibaba Group Holding Limited 275,525.00 110.21
T AT&T Inc. 247,339.52 40.28
ORCL Oracle Corporation 181,046.10 44.00
UPS United Parcel Service, Inc. 90,180.89 103.74

MANIPULATING TIME SERIES DATA IN PYTHON

Import & prepare listing data
data = pd.read_csv('[Link]', parse_dates=['Date'],
                   index_col='Date').loc[:, tickers.tolist()]
data.info()

DatetimeIndex: 252 entries, 2016-01-04 to 2016-12-30

Data columns (total 12 columns):
ABB 252 non-null float64
BABA 252 non-null float64
JNJ 252 non-null float64
JPM 252 non-null float64
KO 252 non-null float64
ORCL 252 non-null float64
PG 252 non-null float64
T 252 non-null float64
TM 252 non-null float64
UPS 252 non-null float64
WMT 252 non-null float64
XOM 252 non-null float64
dtypes: float64(12)

MANIPULATING TIME SERIES DATA IN PYTHON

Let's practice!
M A N I P U L AT I N G T I M E S E R I E S D ATA I N P Y T H O N
Build a market-cap
weighted index
M A N I P U L AT I N G T I M E S E R I E S D ATA I N P Y T H O N

Stefan Jansen
Founder & Lead Data Scientist at
Applied Artificial Intelligence
Build your value-weighted index
Key inputs:
number of shares

stock price series

Normalize index to start

at 100

MANIPULATING TIME SERIES DATA IN PYTHON

Stock index components
components

Company Name Market Capitalization Last Sale

MANIPULATING TIME SERIES DATA IN PYTHON

Number of shares outstanding
shares = components['Market Capitalization'].div(components['Last Sale'])

Stock Symbol
PG 2,556.48 # Outstanding shares in million
TM 1,494.15
ABB 2,138.71
KO 4,292.01
WMT 3,033.01
XOM 4,146.51
JPM 3,557.86
JNJ 2,710.89
BABA 2,500.00
T 6,140.50
ORCL 4,114.68
UPS 869.30
dtype: float64

Market Capitalization = Number of Shares x Share Price

MANIPULATING TIME SERIES DATA IN PYTHON

Historical stock prices
data = pd.read_csv('[Link]', parse_dates=['Date'],
                   index_col='Date').loc[:, tickers.tolist()]
market_cap_series = data.mul(shares)  # price x number of shares
market_cap_series.info()

DatetimeIndex: 252 entries, 2016-01-04 to 2016-12-30

Data columns (total 12 columns):
ABB 252 non-null float64
BABA 252 non-null float64
JNJ 252 non-null float64
JPM 252 non-null float64
...
TM 252 non-null float64
UPS 252 non-null float64
WMT 252 non-null float64
XOM 252 non-null float64
dtypes: float64(12)

MANIPULATING TIME SERIES DATA IN PYTHON

From stock prices to market value
market_cap_series.first('D').append(market_cap_series.last('D'))

ABB BABA JNJ JPM KO ORCL \\

Date
2016-01-04 37,470.14 191,725.00 272,390.43 226,350.95 181,981.42 147,099.95
2016-12-30 45,062.55 219,525.00 312,321.87 307,007.60 177,946.93 158,209.60
PG T TM UPS WMT XOM
Date
2016-01-04 200,351.12 210,926.33 181,479.12 82,444.14 186,408.74 321,188.96
2016-12-30 214,948.60 261,155.65 175,114.05 99,656.23 209,641.59 374,264.34

MANIPULATING TIME SERIES DATA IN PYTHON

Aggregate market value per period
agg_mcap = market_cap_series.sum(axis=1) # Total market cap
agg_mcap.plot(title='Aggregate Market Cap')

MANIPULATING TIME SERIES DATA IN PYTHON

Value-based index
index = agg_mcap.div(agg_mcap.iloc[0]).mul(100) # Divide by 1st value
index.plot(title='Market-Cap Weighted Index')
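
The steps from prices to a normalized index can be sketched on a toy example. The two tickers, prices, and share counts below are invented for illustration; the arithmetic follows the slides.

```python
import pandas as pd

# Hypothetical prices for two stocks over three days
prices = pd.DataFrame(
    {"AAA": [10.0, 11.0, 12.0], "BBB": [20.0, 19.0, 21.0]},
    index=pd.to_datetime(["2016-01-04", "2016-01-05", "2016-01-06"]),
)
# Hypothetical number of shares outstanding per stock
shares = pd.Series({"AAA": 100.0, "BBB": 50.0})

# Market value per stock and day: price x number of shares
market_cap_series = prices.mul(shares)

# Aggregate market value per day, then normalize so the first day equals 100
agg_mcap = market_cap_series.sum(axis=1)
index = agg_mcap.div(agg_mcap.iloc[0]).mul(100)
print(index)
```

With these inputs the aggregate market cap is 2000, 2050, and 2250, so the index reads 100, 102.5, and 112.5.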

MANIPULATING TIME SERIES DATA IN PYTHON

Let's practice!
M A N I P U L AT I N G T I M E S E R I E S D ATA I N P Y T H O N
Evaluate index
performance
M A N I P U L AT I N G T I M E S E R I E S D ATA I N P Y T H O N

Stefan Jansen
Founder & Lead Data Scientist at
Applied Artificial Intelligence
Evaluate your value-weighted index
Index return:
Total index return

Contribution by component

Performance vs Benchmark
Total period return

Rolling returns for sub periods

MANIPULATING TIME SERIES DATA IN PYTHON

Value-based index - recap
agg_market_cap = market_cap_series.sum(axis=1)
index = agg_market_cap.div(agg_market_cap.iloc[0]).mul(100)
index.plot(title='Market-Cap Weighted Index')

MANIPULATING TIME SERIES DATA IN PYTHON

Value contribution by stock
agg_market_cap.iloc[-1] - agg_market_cap.iloc[0]

315,037.71

MANIPULATING TIME SERIES DATA IN PYTHON

Value contribution by stock
change = market_cap_series.first('D').append(market_cap_series.last('D'))
change.diff().iloc[-1].sort_values() # or: .loc['2016-12-30']

TM -6,365.07
KO -4,034.49
ABB 7,592.41
ORCL 11,109.65
PG 14,597.48
UPS 17,212.08
WMT 23,232.85
BABA 27,800.00
JNJ 39,931.44
T 50,229.33
XOM 53,075.38
JPM 80,656.65
Name: 2016-12-30 00:00:00, dtype: float64

MANIPULATING TIME SERIES DATA IN PYTHON

Market-cap based weights
market_cap = components['Market Capitalization']
weights = market_cap.div(market_cap.sum())
weights.sort_values().mul(100)

Stock Symbol
ABB 1.85
UPS 3.45
TM 5.96
ORCL 6.93
KO 7.03
WMT 8.50
PG 8.81
T 9.47
BABA 10.55
JPM 11.50
XOM 12.97
JNJ 12.97
Name: Market Capitalization, dtype: float64

MANIPULATING TIME SERIES DATA IN PYTHON

Value-weighted component returns
index_return = (index.iloc[-1] / index.iloc[0] - 1) * 100

14.06

weighted_returns = weights.mul(index_return)
weighted_returns.sort_values().plot(kind='barh')
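
A small sketch of the weighting step, assuming three hypothetical components. Only the 14.06% index return is taken from the slide. Because the weights sum to one, the weighted returns add back up to the index return.

```python
import pandas as pd

# Hypothetical market capitalizations (illustrative values only)
market_cap = pd.Series({"AAA": 600.0, "BBB": 300.0, "CCC": 100.0})

# Each component's share of total market cap
weights = market_cap.div(market_cap.sum())

# Total index return in percent, as shown on the slide
index_return = 14.06

# Allocate the index return to components in proportion to their weights
weighted_returns = weights.mul(index_return)
print(weighted_returns.sort_values())
```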

MANIPULATING TIME SERIES DATA IN PYTHON

Performance vs benchmark
data = index.to_frame('Index') # Convert pd.Series to pd.DataFrame
data['SP500'] = pd.read_csv('[Link]', parse_dates=['Date'],
                            index_col='Date')
data.SP500 = data.SP500.div(data.SP500.iloc[0], axis=0).mul(100)

MANIPULATING TIME SERIES DATA IN PYTHON

Performance vs benchmark: 30D rolling return
def multi_period_return(r):
    # Compound the periodic returns in the window; result in percent
    return (np.prod(r + 1) - 1) * 100
data.pct_change().rolling('30D').apply(multi_period_return).plot()
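
To make the compounding explicit, here is a minimal sketch on a hypothetical price series that grows 10% per day. It uses a two-day window instead of 30 days so the result can be checked by hand.

```python
import numpy as np
import pandas as pd

def multi_period_return(r):
    # Compound the periodic returns in the window; result in percent
    return (np.prod(r + 1) - 1) * 100

# Hypothetical prices growing by 10% each day
dates = pd.date_range("2016-01-01", periods=5, freq="D")
prices = pd.Series([100.0, 110.0, 121.0, 133.1, 146.41], index=dates)
returns = prices.pct_change().dropna()

# Time-based window covering the current and the previous calendar day
rolling_2d = returns.rolling("2D").apply(multi_period_return, raw=True)
print(rolling_2d)
```

The first window holds a single 10% return. Every later window compounds two of them to (1.1 × 1.1 − 1) × 100 = 21%.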

MANIPULATING TIME SERIES DATA IN PYTHON

Let's practice!
M A N I P U L AT I N G T I M E S E R I E S D ATA I N P Y T H O N
Index correlation &
exporting to Excel
M A N I P U L AT I N G T I M E S E R I E S D ATA I N P Y T H O N

Stefan Jansen
Founder & Lead Data Scientist at
Applied Artificial Intelligence
Some additional analysis of your index
Daily return correlations:

Calculate among all components

Visualize the result as heatmap

Write results to excel using .xls and .xlsx formats:

Single worksheet

Multiple worksheets

MANIPULATING TIME SERIES DATA IN PYTHON

Index components - price data
data = DataReader(tickers, 'google', start='2016', end='2017')['Close']
data.info()

DatetimeIndex: 252 entries, 2016-01-04 to 2016-12-30

MANIPULATING TIME SERIES DATA IN PYTHON

Index components: return correlations
daily_returns = data.pct_change()
correlations = daily_returns.corr()

ABB BABA JNJ JPM KO ORCL PG T TM UPS WMT XOM

ABB 1.00 0.40 0.33 0.56 0.31 0.53 0.34 0.29 0.48 0.50 0.15 0.48
BABA 0.40 1.00 0.27 0.27 0.25 0.38 0.21 0.17 0.34 0.35 0.13 0.21
JNJ 0.33 0.27 1.00 0.34 0.30 0.37 0.42 0.35 0.29 0.45 0.24 0.41
JPM 0.56 0.27 0.34 1.00 0.22 0.57 0.27 0.13 0.49 0.56 0.14 0.48
KO 0.31 0.25 0.30 0.22 1.00 0.31 0.62 0.47 0.33 0.50 0.25 0.29
ORCL 0.53 0.38 0.37 0.57 0.31 1.00 0.41 0.32 0.48 0.54 0.21 0.42
PG 0.34 0.21 0.42 0.27 0.62 0.41 1.00 0.43 0.32 0.47 0.33 0.34
T 0.29 0.17 0.35 0.13 0.47 0.32 0.43 1.00 0.28 0.41 0.31 0.33
TM 0.48 0.34 0.29 0.49 0.33 0.48 0.32 0.28 1.00 0.52 0.20 0.30
UPS 0.50 0.35 0.45 0.56 0.50 0.54 0.47 0.41 0.52 1.00 0.33 0.45
WMT 0.15 0.13 0.24 0.14 0.25 0.21 0.33 0.31 0.20 0.33 1.00 0.21
XOM 0.48 0.21 0.41 0.48 0.29 0.42 0.34 0.33 0.30 0.45 0.21 1.00

MANIPULATING TIME SERIES DATA IN PYTHON

Index components: return correlations
sns.heatmap(correlations, annot=True)
plt.xticks(rotation=45)
plt.title('Daily Return Correlations')

MANIPULATING TIME SERIES DATA IN PYTHON

Saving to a single Excel worksheet
correlations.to_excel(excel_writer= '[Link]',
sheet_name='correlations',
startrow=1,
startcol=1)

MANIPULATING TIME SERIES DATA IN PYTHON

Saving to multiple Excel worksheets
data.index = data.index.date # Keep only date component
with pd.ExcelWriter('stock_data.xlsx') as writer:
    correlations.to_excel(excel_writer=writer, sheet_name='correlations')
    data.to_excel(excel_writer=writer, sheet_name='prices')
    data.pct_change().to_excel(writer, sheet_name='returns')

MANIPULATING TIME SERIES DATA IN PYTHON

Let's practice!
M A N I P U L AT I N G T I M E S E R I E S D ATA I N P Y T H O N
Congratulations!
M A N I P U L AT I N G T I M E S E R I E S D ATA I N P Y T H O N

Stefan Jansen
Founder & Lead Data Scientist at
Applied Artificial Intelligence
Introduction to the
Course
T I M E S E R I E S A N A LY S I S I N P Y T H O N

Rob Reider
Adjunct Professor, NYU-Courant
Consultant, Quantopian
Example of Time Series: Google Trends

TIME SERIES ANALYSIS IN PYTHON

Example of Time Series: Climate Data

TIME SERIES ANALYSIS IN PYTHON

Example of Time Series: Quarterly Earnings Data

TIME SERIES ANALYSIS IN PYTHON

Example of Multiple Series: Natural Gas and Heating
Oil

TIME SERIES ANALYSIS IN PYTHON

Goals of Course
Learn about time series models

Fit data to a time series model

Use the models to make forecasts of the future

Learn how to use the relevant statistical packages in Python

Provide concrete examples of how these models are used

TIME SERIES ANALYSIS IN PYTHON

Some Useful Pandas Tools
Changing an index to datetime

df.index = pd.to_datetime(df.index)

Plotting data

df.plot()

Slicing data

df.loc['2012']

TIME SERIES ANALYSIS IN PYTHON

Some Useful Pandas Tools
Join two DataFrames

df1.join(df2)

Resample data (e.g. from daily to weekly)

df = df.resample(rule='W').last()
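
A quick sketch of the resampling call on two weeks of made-up daily data. Each weekly row keeps the last observation of its week; with `rule='W'` the weeks end on Sunday.

```python
import numpy as np
import pandas as pd

# Hypothetical daily data: 14 consecutive days starting on a Monday
daily = pd.DataFrame(
    {"value": np.arange(14)},
    index=pd.date_range("2024-01-01", periods=14, freq="D"),
)

# Downsample to weekly frequency, keeping the last value in each week
weekly = daily.resample(rule="W").last()
print(weekly)
```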

TIME SERIES ANALYSIS IN PYTHON

More pandas Functions
Computing percent changes and differences of a time series

df['col'].pct_change()
df['col'].diff()

pandas correlation method of Series

df['ABC'].corr(df['XYZ'])

pandas autocorrelation

df['ABC'].autocorr()

TIME SERIES ANALYSIS IN PYTHON

Let's practice!
T I M E S E R I E S A N A LY S I S I N P Y T H O N
Correlation of Two
Time Series
T I M E S E R I E S A N A LY S I S I N P Y T H O N

Rob Reider
Adjunct Professor, NYU-Courant
Consultant, Quantopian
Correlation of Two Time Series
Plot of S&P500 and JPMorgan stock

TIME SERIES ANALYSIS IN PYTHON

Correlation of Two Time Series
Scatter plot of S&P500 and JP Morgan returns

TIME SERIES ANALYSIS IN PYTHON

More Scatter Plots
Correlation = 0.9 Correlation = 0.4

Correlation = -0.9 Correlation = 1.0

TIME SERIES ANALYSIS IN PYTHON

Common Mistake: Correlation of Two Trending Series
Dow Jones Industrial Average and UFO Sightings
([Link])

Correlation of levels: 0.94

Correlation of percent changes: ≈0
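
The effect can be reproduced with two synthetic series that share nothing but an upward trend. The series below are invented for illustration and are not the Dow Jones or UFO data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1000
trend = np.arange(n)

# Two unrelated series: each is a linear trend plus independent noise
a = pd.Series(100 + 0.5 * trend + rng.normal(scale=2, size=n))
b = pd.Series(50 + 0.2 * trend + rng.normal(scale=1, size=n))

# Correlating levels picks up the common trend
corr_levels = a.corr(b)

# Correlating percent changes removes the trend and exposes the lack of a link
corr_changes = a.pct_change().corr(b.pct_change())

print(corr_levels, corr_changes)
```

The level correlation should come out close to one and the correlation of percent changes close to zero. That is why returns, not prices, are usually correlated.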

TIME SERIES ANALYSIS IN PYTHON

Example: Correlation of Large Cap and Small Cap
Stocks
Start with stock prices of SPX (large cap) and R2000 (small
cap)

First step: Compute percentage changes of both series

df['SPX_Ret'] = df['SPX_Prices'].pct_change()
df['R2000_Ret'] = df['R2000_Prices'].pct_change()

TIME SERIES ANALYSIS IN PYTHON

Example: Correlation of Large Cap and Small Cap
Stocks
Visualize correlation with scatter plot
plt.scatter(df['SPX_Ret'], df['R2000_Ret'])
plt.show()

TIME SERIES ANALYSIS IN PYTHON

Example: Correlation of Large Cap and Small Cap
Stocks
Use pandas correlation method for Series

correlation = df['SPX_Ret'].corr(df['R2000_Ret'])
print("Correlation is: ", correlation)

Correlation is: 0.868

TIME SERIES ANALYSIS IN PYTHON

Let's practice!
T I M E S E R I E S A N A LY S I S I N P Y T H O N
Simple Linear
Regressions
T I M E S E R I E S A N A LY S I S I N P Y T H O N

Rob Reider
Adjunct Professor, NYU-Courant
Consultant, Quantopian
What is a Regression?
Simple linear regression:

yt = α + βxt + ϵt

TIME SERIES ANALYSIS IN PYTHON

What is a Regression?
Ordinary Least Squares (OLS)

TIME SERIES ANALYSIS IN PYTHON

Python Packages to Perform Regressions
Warning: the order of x and y is not consistent across packages

In statsmodels:
import statsmodels.api as sm
sm.OLS(y, x).fit()

In numpy:
np.polyfit(x, y, deg=1)

In pandas (older versions only; pd.ols was removed in pandas 0.20):
pd.ols(y, x)

In scipy:
from scipy import stats
stats.linregress(x, y)

TIME SERIES ANALYSIS IN PYTHON

Example: Regression of Small Cap Returns on Large
Cap
Import the statsmodels module
import statsmodels.api as sm

As before, compute percentage changes in both series

df['SPX_Ret'] = df['SPX_Prices'].pct_change()
df['R2000_Ret'] = df['R2000_Prices'].pct_change()

Add a constant to the DataFrame for the regression intercept

df = sm.add_constant(df)

TIME SERIES ANALYSIS IN PYTHON

Regression Example (continued)
Notice that the first row of returns is NaN
             SPX_Price    R2000_Price  SPX_Ret    R2000_Ret
Date
2012-11-01   1427.589966  827.849976   NaN        NaN
2012-11-02   1414.199951  814.369995   -0.009379  -0.016283

Delete the row of NaN

df = df.dropna()

Run the regression

results = sm.OLS(df['R2000_Ret'], df[['const','SPX_Ret']]).fit()
print(results.summary())

TIME SERIES ANALYSIS IN PYTHON

Regression Example (continued)
Regression output

Intercept in results.params[0]

Slope in results.params[1]

TIME SERIES ANALYSIS IN PYTHON

Relationship Between R-Squared and Correlation
[corr(x, y)]^2 = R^2 (or R-squared)
sign(corr) = sign(regression slope)
In last example:
R-Squared = 0.753

Slope is positive

correlation = +√0.753 = 0.868
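
The identity can be checked numerically. The sketch below uses synthetic data and `np.polyfit` rather than the statsmodels regression from the slides; the coefficients and seed are arbitrary.

```python
import numpy as np

# Synthetic regression data (assumption): y depends linearly on x plus noise
rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = 0.5 + 1.2 * x + rng.normal(scale=0.7, size=500)

# Least-squares line; polyfit returns the highest-degree coefficient first
slope, intercept = np.polyfit(x, y, deg=1)

# R-squared = 1 - (residual variance / total variance)
residuals = y - (intercept + slope * x)
r_squared = 1 - residuals.var() / y.var()

corr = np.corrcoef(x, y)[0, 1]
print(corr ** 2, r_squared, np.sign(corr) == np.sign(slope))
```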

TIME SERIES ANALYSIS IN PYTHON

Let's practice!
T I M E S E R I E S A N A LY S I S I N P Y T H O N
Autocorrelation
T I M E S E R I E S A N A LY S I S I N P Y T H O N

Rob Reider
Adjunct Professor, NYU-Courant
Consultant, Quantopian
What is Autocorrelation?
Correlation of a time series with a lagged copy of itself

Also called serial correlation

Lag-one autocorrelation
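
In pandas terms, the lag-one autocorrelation is the correlation between a series and a copy of itself shifted by one period. A minimal sketch on made-up returns:

```python
import numpy as np
import pandas as pd

# Hypothetical return series
rng = np.random.default_rng(3)
returns = pd.Series(rng.normal(size=300))

# Built-in lag-one autocorrelation
lag1_builtin = returns.autocorr(lag=1)

# Same quantity spelled out: correlate the series with its lagged copy
lag1_manual = returns.corr(returns.shift(1))

print(lag1_builtin, lag1_manual)
```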

TIME SERIES ANALYSIS IN PYTHON

Interpretation of Autocorrelation
Mean Reversion - Negative autocorrelation

TIME SERIES ANALYSIS IN PYTHON

Interpretation of Autocorrelation
Momentum, or Trend Following - Positive autocorrelation

TIME SERIES ANALYSIS IN PYTHON

Traders Use Autocorrelation to Make Money
Individual stocks
Historically have negative autocorrelation

Measured over short horizons (days)

Trading strategy: Buy losers and sell winners

Commodities and currencies

Historically have positive autocorrelation

Measured over longer horizons (months)

Trading strategy: Buy winners and sell losers

TIME SERIES ANALYSIS IN PYTHON

Example of Positive Autocorrelation: Exchange Rates
Use daily ¥/$ exchange rates in DataFrame df from FRED

Convert index to datetime

# Convert index to datetime
df.index = pd.to_datetime(df.index)
# Downsample from daily to monthly data
df = df.resample(rule='M').last()
# Compute returns from prices
df['Return'] = df['Price'].pct_change()
# Compute autocorrelation
autocorrelation = df['Return'].autocorr()
print("The autocorrelation is: ",autocorrelation)

The autocorrelation is: 0.0567

TIME SERIES ANALYSIS IN PYTHON

Let's practice!
T I M E S E R I E S A N A LY S I S I N P Y T H O N
Autocorrelation
Function
T I M E S E R I E S A N A LY S I S I N P Y T H O N

Rob Reider
Adjunct Professor, NYU-Courant
Consultant, Quantopian
Autocorrelation Function
Autocorrelation Function (ACF): The autocorrelation as a
function of the lag

Equals one at lag-zero

Interesting information beyond lag-one

TIME SERIES ANALYSIS IN PYTHON

ACF Example 1: Simple Autocorrelation Function
Can use last two values in series for forecasting

TIME SERIES ANALYSIS IN PYTHON

ACF Example 2: Seasonal Earnings
Earnings for H&R Block ACF for H&R Block

TIME SERIES ANALYSIS IN PYTHON

ACF Example 3: Useful for Model Selection
Model selection

TIME SERIES ANALYSIS IN PYTHON

Plot ACF in Python
Import module:
from statsmodels.graphics.tsaplots import plot_acf

Plot the ACF:

plot_acf(x, lags= 20, alpha=0.05)

TIME SERIES ANALYSIS IN PYTHON

Confidence Interval of ACF
Argument alpha sets the width of confidence interval

Example: alpha=0.05
5% chance that if true autocorrelation is zero, it will fall
outside blue band

Confidence bands are wider if:

Alpha lower

Fewer observations

Under some simplifying assumptions, 95% confidence bands
are ±2/√N

If you want no bands on plot, set alpha=1
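
The ±2/√N rule of thumb can be tried out on simulated white noise. The sketch below computes the sample ACF by hand with NumPy instead of calling statsmodels; the helper `acf_manual` is a hypothetical name introduced only for this example.

```python
import numpy as np

def acf_manual(x, nlags):
    # Sample autocorrelations for lags 0..nlags, using the overall mean
    x = np.asarray(x, dtype=float) - np.mean(x)
    denom = np.dot(x, x)
    lags = [np.dot(x[k:], x[:-k]) / denom for k in range(1, nlags + 1)]
    return np.array([1.0] + lags)

rng = np.random.default_rng(0)
n = 1000
noise = rng.normal(size=n)

acf_values = acf_manual(noise, nlags=20)

# Approximate 95% band for a series with no true autocorrelation
band = 2 / np.sqrt(n)

# Share of lags 1..20 whose sample autocorrelation stays inside the band
share_inside = np.mean(np.abs(acf_values[1:]) <= band)
print(band, share_inside)
```

For white noise, most lags should stay inside the band, with roughly one in twenty falling outside by chance.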

TIME SERIES ANALYSIS IN PYTHON

ACF Values Instead of Plot
from statsmodels.tsa.stattools import acf
print(acf(x))

[ 1. -0.6765505 0.34989905 -0.01629415 -0.02507

-0.03186545 0.01399904 -0.03518128 0.02063168 -0.02620
...
0.07191516 -0.12211912 0.14514481 -0.09644228 0.05215

TIME SERIES ANALYSIS IN PYTHON

Let's practice!
T I M E S E R I E S A N A LY S I S I N P Y T H O N
White Noise
T I M E S E R I E S A N A LY S I S I N P Y T H O N

Rob Reider
Adjunct Professor, NYU-Courant
Consultant, Quantopian
What is White Noise?
White Noise is a series with:
Constant mean

Constant variance

Zero autocorrelations at all lags

Special Case: if data has normal distribution, then Gaussian

White Noise

TIME SERIES ANALYSIS IN PYTHON

Simulating White Noise
It's very easy to generate white noise
import numpy as np
noise = np.random.normal(loc=0, scale=1, size=500)

TIME SERIES ANALYSIS IN PYTHON

What Does White Noise Look Like?
plt.plot(noise)

TIME SERIES ANALYSIS IN PYTHON

Autocorrelation of White Noise
plot_acf(noise, lags=50)

TIME SERIES ANALYSIS IN PYTHON

Stock Market Returns: Close to White Noise
Autocorrelation Function for the S&P500

TIME SERIES ANALYSIS IN PYTHON

Let's practice!
T I M E S E R I E S A N A LY S I S I N P Y T H O N
Random Walk
T I M E S E R I E S A N A LY S I S I N P Y T H O N

Rob Reider
Adjunct Professor, NYU-Courant
Consultant, Quantopian
What is a Random Walk?
Today's Price = Yesterday's Price + Noise

Pt = Pt−1 + ϵt
Plot of simulated data

TIME SERIES ANALYSIS IN PYTHON

What is a Random Walk?
Today's Price = Yesterday's Price + Noise

Pt = Pt−1 + ϵt
Change in price is white noise

Pt − Pt−1 = ϵt
Can't forecast a random walk

Best forecast for tomorrow's price is today's price
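
A short simulation sketch of the same idea: cumulating white noise yields a random walk, and differencing the walk gives the noise back. The starting price of 100 and the seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)

# White-noise shocks
noise = rng.normal(loc=0, scale=1, size=500)

# Random walk: each price is the previous price plus a shock
prices = 100 + np.cumsum(noise)

# Price changes recover the shocks (all but the first one)
changes = np.diff(prices)

# With zero-mean shocks, the best forecast for tomorrow is today's price
forecast_tomorrow = prices[-1]
print(np.allclose(changes, noise[1:]), forecast_tomorrow)
```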

TIME SERIES ANALYSIS IN PYTHON

What is a Random Walk?
Today's Price = Yesterday's Price + Noise

Pt = Pt−1 + ϵt
Random walk with drift:

Pt = μ + Pt−1 + ϵt
Change in price is white noise with non-zero mean:

Pt − Pt−1 = μ + ϵt

TIME SERIES ANALYSIS IN PYTHON

Statistical Test for Random Walk
Random walk with drift

Pt = μ + Pt−1 + ϵt
Regression test for random walk

Pt = α + β Pt−1 + ϵt
Test:

H0 : β = 1 (random walk)
H1 : β < 1 (not random walk)

TIME SERIES ANALYSIS IN PYTHON

Statistical Test for Random Walk
Regression test for random walk

Pt = α + β Pt−1 + ϵt
Equivalent to

Pt − Pt−1 = α + β Pt−1 + ϵt
Test:

H0 : β = 0 (random walk)
H1 : β < 0 (not random walk)

TIME SERIES ANALYSIS IN PYTHON

Statistical Test for Random Walk
Regression test for random walk

Pt − Pt−1 = α + β Pt−1 + ϵt
Test:

H0 : β = 0 (random walk)
H1 : β < 0 (not random walk)
This test is called the Dickey-Fuller test

If you add more lagged changes on the right hand side, it's
the Augmented Dickey-Fuller test
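
To see what the regression picks up, the sketch below estimates only the slope β of the change on the lagged level, once for a simulated random walk and once for a stationary AR(1) series. This is not a full Dickey-Fuller test: a real test compares the t-statistic against special critical values, which `adfuller` handles. The helper name `df_slope` and the simulation settings are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(11)
n = 2000
eps = rng.normal(size=n)

# Random walk: cumulated shocks
random_walk = np.cumsum(eps)

# Stationary AR(1) with coefficient 0.5, built from the same shocks
ar1 = np.empty(n)
ar1[0] = eps[0]
for t in range(1, n):
    ar1[t] = 0.5 * ar1[t - 1] + eps[t]

def df_slope(series):
    # Regress the change P_t - P_{t-1} on the lagged level P_{t-1}
    lagged = series[:-1]
    change = np.diff(series)
    beta, alpha = np.polyfit(lagged, change, deg=1)
    return beta

beta_rw = df_slope(random_walk)
beta_ar = df_slope(ar1)
print(beta_rw, beta_ar)
```

The slope should be close to zero for the random walk and clearly negative (around −0.5) for the AR(1) series.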

TIME SERIES ANALYSIS IN PYTHON

ADF Test in Python
Import module from statsmodels

from statsmodels.tsa.stattools import adfuller

Run Augmented Dickey-Fuller test

adfuller(x)

TIME SERIES ANALYSIS IN PYTHON

Example: Is the S&P500 a Random Walk?
# Run Augmented Dickey-Fuller Test on SPX data
results = adfuller(df['SPX'])

# Print p-value
print(results[1])

0.782253808587

# Print full results

print(results)

(-0.91720490331127869,
0.78225380858668414,
0,
1257,
{'1%': -3.4355629707955395,
'10%': -2.567995644141416,
'5%': -2.8638420633876671},
10161.888789598503)

TIME SERIES ANALYSIS IN PYTHON

Let's practice!
T I M E S E R I E S A N A LY S I S I N P Y T H O N
Stationarity
T I M E S E R I E S A N A LY S I S I N P Y T H O N

Rob Reider
Adjunct Professor, NYU-Courant
Consultant, Quantopian
What is Stationarity?
Strong stationarity: entire distribution of data is time-
invariant

Weak stationarity: mean, variance and autocorrelation are

time-invariant (i.e., for autocorrelation, corr(Xt , Xt−τ ) is
only a function of τ)

TIME SERIES ANALYSIS IN PYTHON

Why Do We Care?
If parameters vary with time, too many parameters to
estimate

Can only estimate a parsimonious model with a few

parameters

TIME SERIES ANALYSIS IN PYTHON

Examples of Nonstationary Series
Random Walk

TIME SERIES ANALYSIS IN PYTHON

Examples of Nonstationary Series
Seasonality in series

TIME SERIES ANALYSIS IN PYTHON

Examples of Nonstationary Series
Change in Mean or Standard Deviation over time

TIME SERIES ANALYSIS IN PYTHON

Transforming Nonstationary Series Into Stationary
Series
Random Walk: plt.plot(SPY)
First difference: plt.plot(SPY.diff())

TIME SERIES ANALYSIS IN PYTHON

Transforming Nonstationary Series Into Stationary
Series
Seasonality: plt.plot(HRB)
Seasonal difference: plt.plot(HRB.diff(4))

TIME SERIES ANALYSIS IN PYTHON

Transforming Nonstationary Series Into Stationary
Series
AMZN Quarterly Revenues
plt.plot(AMZN)

# Log of AMZN Revenues
plt.plot(np.log(AMZN))

# Log, then seasonal difference
plt.plot(np.log(AMZN).diff(4))
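
Why the log followed by a seasonal difference works can be seen on a stylized quarterly series with exponential growth and a fixed seasonal pattern. The numbers below are invented and are not Amazon revenues.

```python
import numpy as np
import pandas as pd

# Stylized quarterly revenue: 5% growth per quarter times a seasonal factor
t = np.arange(24)
seasonal = np.array([1.0, 0.8, 0.9, 1.5])[t % 4]
revenue = pd.Series(100 * np.exp(0.05 * t) * seasonal)

# The log turns exponential growth into a linear trend; the four-quarter
# difference then removes both the trend and the seasonal pattern
stationary = np.log(revenue).diff(4)
print(stationary.dropna().head())
```

In this noise-free example every remaining value equals 4 × 0.05 = 0.2, the yearly log growth rate.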

TIME SERIES ANALYSIS IN PYTHON

Let's practice!
T I M E S E R I E S A N A LY S I S I N P Y T H O N
Introducing an AR
Model
T I M E S E R I E S A N A LY S I S I N P Y T H O N

Rob Reider
Adjunct Professor, NYU-Courant
Consultant, Quantopian
Mathematical Description of AR(1) Model
Rt = μ + ϕ Rt−1 + ϵt
Since only one lagged value on right hand side, this is called:
AR model of order 1, or

AR(1) model

AR parameter is ϕ
For stationarity, −1 < ϕ < 1

TIME SERIES ANALYSIS IN PYTHON

Interpretation of AR(1) Parameter
Rt = μ + ϕ Rt−1 + ϵt
Negative ϕ: Mean Reversion
Positive ϕ: Momentum

TIME SERIES ANALYSIS IN PYTHON

Comparison of AR(1) Time Series
ϕ = 0.9 ϕ = −0.9

ϕ = 0.5 ϕ = −0.5

TIME SERIES ANALYSIS IN PYTHON

Comparison of AR(1) Autocorrelation Functions
ϕ = 0.9 ϕ = −0.9

ϕ = 0.5 ϕ = −0.5

TIME SERIES ANALYSIS IN PYTHON

Higher Order AR Models
AR(1)

Rt = μ + ϕ1 Rt−1 + ϵt
AR(2)

Rt = μ + ϕ1 Rt−1 + ϕ2 Rt−2 + ϵt
AR(3)

Rt = μ + ϕ1 Rt−1 + ϕ2 Rt−2 + ϕ3 Rt−3 + ϵt

...

TIME SERIES ANALYSIS IN PYTHON

Simulating an AR Process
from statsmodels.tsa.arima_process import ArmaProcess
ar = np.array([1, -0.9])
ma = np.array([1])
AR_object = ArmaProcess(ar, ma)
simulated_data = AR_object.generate_sample(nsample=1000)
plt.plot(simulated_data)
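
Note the sign convention: `ArmaProcess` takes the coefficients of the lag polynomial, including the zero lag, so `[1, -0.9]` corresponds to ϕ = +0.9. As a cross-check that does not depend on statsmodels, here is a plain NumPy sketch of the same AR(1) recursion with μ = 0; the seed and sample size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)
phi, n = 0.9, 5000
eps = rng.normal(size=n)

# AR(1) recursion with mu = 0: R_t = phi * R_{t-1} + eps_t
r = np.empty(n)
r[0] = eps[0]
for t in range(1, n):
    r[t] = phi * r[t - 1] + eps[t]

# The lag-one sample autocorrelation should be close to phi
lag1 = np.corrcoef(r[1:], r[:-1])[0, 1]
print(lag1)
```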

TIME SERIES ANALYSIS IN PYTHON

Let's practice!
T I M E S E R I E S A N A LY S I S I N P Y T H O N
Estimating and
Forecasting an AR
Model
T I M E S E R I E S A N A LY S I S I N P Y T H O N

Rob Reider
Adjunct Professor, NYU-Courant
Consultant, Quantopian
Estimating an AR Model
To estimate parameters from data (simulated)

from statsmodels.tsa.arima_model import ARMA

mod = ARMA(data, order=(1,0))
result = mod.fit()

ARMA has been deprecated and replaced with ARIMA

from statsmodels.tsa.arima.model import ARIMA

mod = ARIMA(data, order=(1,0,0))
result = mod.fit()

For ARMA, order=(p,q)

For ARIMA, order=(p,d,q)

TIME SERIES ANALYSIS IN PYTHON

Estimating an AR Model
Full output (true μ = 0 and ϕ = 0.9)
print(result.summary())

TIME SERIES ANALYSIS IN PYTHON

Estimating an AR Model
Only the estimates of μ and ϕ (true μ = 0 and ϕ = 0.9)
print(result.params)

array([-0.03605989, 0.90535667])

TIME SERIES ANALYSIS IN PYTHON

Forecasting With an AR Model
from statsmodels.graphics.tsaplots import plot_predict
fig, ax = plt.subplots()
data.plot(ax=ax)
plot_predict(result, start='2012-09-27', end='2012-10-06', alpha=0.05, ax=ax)
plt.show()

Arguments of function plot_predict()

First argument is fitted model

Set alpha=None for no confidence interval

Set ax=ax to plot the data and prediction on same axes

TIME SERIES ANALYSIS IN PYTHON

Let's practice!
T I M E S E R I E S A N A LY S I S I N P Y T H O N
Choosing the Right
Model
T I M E S E R I E S A N A LY S I S I N P Y T H O N

Rob Reider
Adjunct Professor, NYU-Courant
Consultant, Quantopian
Identifying the Order of an AR Model
The order of an AR(p) model will usually be unknown

Two techniques to determine order

Partial Autocorrelation Function

Information criteria

TIME SERIES ANALYSIS IN PYTHON

Partial Autocorrelation Function (PACF)

TIME SERIES ANALYSIS IN PYTHON

Plot PACF in Python
Same as ACF, but use plot_pacf instead of plot_acf

Import module
from statsmodels.graphics.tsaplots import plot_pacf

Plot the PACF

plot_pacf(x, lags= 20, alpha=0.05)

TIME SERIES ANALYSIS IN PYTHON

Comparison of PACF for Different AR Models
AR(1) AR(2)

AR(3) White Noise

TIME SERIES ANALYSIS IN PYTHON

Information Criteria
Information criteria: adjusts goodness-of-fit for number of
parameters

Two popular adjusted goodness-of-fit measures

AIC (Akaike Information Criterion)

BIC (Bayesian Information Criterion)

TIME SERIES ANALYSIS IN PYTHON

Information Criteria
Estimation output

TIME SERIES ANALYSIS IN PYTHON

Getting Information Criteria From statsmodels
You learned earlier how to fit an AR model
from statsmodels.tsa.arima.model import ARIMA
mod = ARIMA(simulated_data, order=(1,0,0))
result = mod.fit()

And to get full output

result.summary()

Or just the parameters

result.params

To get the AIC and BIC

result.aic
result.bic

TIME SERIES ANALYSIS IN PYTHON

Information Criteria
Fit a simulated AR(3) to different AR(p) models

Choose p with the lowest BIC

TIME SERIES ANALYSIS IN PYTHON

Let's practice!
T I M E S E R I E S A N A LY S I S I N P Y T H O N
Describe Model
T I M E S E R I E S A N A LY S I S I N P Y T H O N

Rob Reider
Adjunct Professor, NYU-Courant
Consultant, Quantopian
Mathematical Description of MA(1) Model
Rt = μ + ϵt + θ ϵt−1
Since only one lagged error on right hand side, this is called:
MA model of order 1, or

MA(1) model

MA parameter is θ
Stationary for all values of θ

TIME SERIES ANALYSIS IN PYTHON

Interpretation of MA(1) Parameter
Rt = μ + ϵt + θ ϵt−1
Negative θ: One-Period Mean Reversion
Positive θ: One-Period Momentum
Note: One-period autocorrelation is θ/(1 + θ^2), not θ
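
The note can be verified by simulation. The sketch below builds an MA(1) series directly from white noise with θ = 0.5 and μ = 0; the seed and sample size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(9)
theta, n = 0.5, 20000

# One extra shock so that every R_t has a lagged shock available
eps = rng.normal(size=n + 1)

# MA(1) with mu = 0: R_t = eps_t + theta * eps_{t-1}
r = eps[1:] + theta * eps[:-1]

# Sample autocorrelations at lags one and two
lag1 = np.corrcoef(r[1:], r[:-1])[0, 1]
lag2 = np.corrcoef(r[2:], r[:-2])[0, 1]

# Theoretical lag-one autocorrelation
theory = theta / (1 + theta ** 2)
print(lag1, theory, lag2)
```

With θ = 0.5 the theoretical value is 0.4, and the autocorrelation beyond lag one should be near zero.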

TIME SERIES ANALYSIS IN PYTHON

Comparison of MA(1) Autocorrelation Functions
θ = 0.9 θ = −0.9

θ = 0.5 θ = −0.5

TIME SERIES ANALYSIS IN PYTHON

Example of MA(1) Process: Intraday Stock Returns

TIME SERIES ANALYSIS IN PYTHON

Autocorrelation Function of Intraday Stock Returns

TIME SERIES ANALYSIS IN PYTHON

Higher Order MA Models
MA(1)

Rt = μ + ϵt + θ1 ϵt−1
MA(2)

Rt = μ + ϵt + θ1 ϵt−1 + θ2 ϵt−2
MA(3)

Rt = μ + ϵt + θ1 ϵt−1 + θ2 ϵt−2 + θ3 ϵt−3

...

TIME SERIES ANALYSIS IN PYTHON

Simulating an MA Process
from statsmodels.tsa.arima_process import ArmaProcess
ar = np.array([1])
ma = np.array([1, 0.5])
MA_object = ArmaProcess(ar, ma)
simulated_data = MA_object.generate_sample(nsample=1000)
plt.plot(simulated_data)

TIME SERIES ANALYSIS IN PYTHON

Let's practice!
T I M E S E R I E S A N A LY S I S I N P Y T H O N
Estimation and
Forecasting an MA
Model
T I M E S E R I E S A N A LY S I S I N P Y T H O N

Rob Reider
Adjunct Professor, NYU-Courant
Consultant, Quantopian
Estimating an MA Model
Same as estimating an AR model (except order=(0,0,1) )

from statsmodels.tsa.arima.model import ARIMA

mod = ARIMA(simulated_data, order=(0,0,1))
result = mod.fit()

TIME SERIES ANALYSIS IN PYTHON

Forecasting an MA Model
from statsmodels.graphics.tsaplots import plot_predict
fig, ax = plt.subplots()
data.plot(ax=ax)
plot_predict(result, start='2012-09-27', end='2012-10-06', ax=ax)
plt.show()

TIME SERIES ANALYSIS IN PYTHON

Let's practice!
T I M E S E R I E S A N A LY S I S I N P Y T H O N
ARMA models
T I M E S E R I E S A N A LY S I S I N P Y T H O N

Rob Reider
Adjunct Professor, NYU-Courant
Consultant, Quantopian
ARMA Model
ARMA(1,1) model:

Rt = μ + ϕ Rt−1 + ϵt + θ ϵt−1

TIME SERIES ANALYSIS IN PYTHON

Converting Between ARMA, AR, and MA Models
Converting AR(1) into an MA(∞)

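R_t = \mu + \phi R_{t-1} + \epsilon_t

R_t = \mu + \phi(\mu + \phi R_{t-2} + \epsilon_{t-1}) + \epsilon_t

\vdots

R_t = \frac{\mu}{1-\phi} + \epsilon_t + \phi \epsilon_{t-1} + \phi^2 \epsilon_{t-2} + \phi^3 \epsilon_{t-3} + \dots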
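
The substitution can be checked numerically. The sketch below assumes the process starts at its long-run mean μ/(1 − ϕ) with no shocks before t = 0, so the infinite sum reduces to a finite one; the parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
mu, phi, n = 0.2, 0.6, 200
eps = rng.normal(size=n)
mean = mu / (1 - phi)

# AR(1) recursion, started at the long-run mean (assumption)
r_rec = np.empty(n)
prev = mean
for t in range(n):
    prev = mu + phi * prev + eps[t]
    r_rec[t] = prev

# MA representation: mean plus shocks weighted by powers of phi
weights = phi ** np.arange(n)
r_ma = np.array(
    [mean + np.dot(weights[: t + 1], eps[t::-1]) for t in range(n)]
)

print(np.allclose(r_rec, r_ma))
```

All weights on past shocks enter with a positive sign as powers of ϕ, which is what the two constructions agreeing confirms.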

TIME SERIES ANALYSIS IN PYTHON

Let's practice!
T I M E S E R I E S A N A LY S I S I N P Y T H O N
Cointegration
Models
T I M E S E R I E S A N A LY S I S I N P Y T H O N

Rob Reider
Adjunct Professor, NYU-Courant
Consultant, Quantopian
What is Cointegration?
Two series, Pt and Qt can be random walks
But the linear combination Pt − c Qt may not be a random
walk!

If that's true
Pt − c Qt is forecastable
Pt and Qt are said to be cointegrated

TIME SERIES ANALYSIS IN PYTHON

Analogy: Dog on a Leash
Pt = Owner
Qt = Dog
Both series look like a random walk

Difference, or distance between them, looks mean reverting

If dog falls too far behind, it gets pulled forward

If dog gets too far ahead, it gets pulled back

TIME SERIES ANALYSIS IN PYTHON

Example: Heating Oil and Natural Gas
Heating Oil and Natural Gas both look like random walks...

TIME SERIES ANALYSIS IN PYTHON

Example: Heating Oil and Natural Gas
But the spread (difference) is mean reverting

TIME SERIES ANALYSIS IN PYTHON

What Types of Series are Cointegrated?
Economic substitutes
Heating Oil and Natural Gas

Platinum and Palladium

Corn and Wheat

Corn and Sugar

...

Bitcoin and Ethereum?

How about competitors?

Coke and Pepsi?

Apple and Blackberry? No! Leash broke and dog ran away

TIME SERIES ANALYSIS IN PYTHON

Two Steps to Test for Cointegration
Regress Pt on Qt and get slope c
Run Augmented Dickey-Fuller test on Pt − c Qt to test for
random walk

Alternatively, can use coint function in statsmodels that

combines both steps

from statsmodels.tsa.stattools import coint

coint(P,Q)
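
The first of the two steps can be sketched with NumPy on a simulated pair that is cointegrated by construction. In place of the ADF test in the second step, the sketch only looks at the lag-one autocorrelation of the spread as an informal indication. The series, the true slope of 2, and the seed are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(21)
n = 2000

# Q is a random walk; P follows 2 * Q plus stationary noise
Q = np.cumsum(rng.normal(size=n))
P = 2.0 * Q + rng.normal(scale=0.5, size=n)

# Step 1: regress P on Q to estimate the slope c
c, intercept = np.polyfit(Q, P, deg=1)

# Step 2 (informal stand-in for ADF): inspect the spread P - c * Q
spread = P - c * Q
spread_lag1 = np.corrcoef(spread[1:], spread[:-1])[0, 1]
q_lag1 = np.corrcoef(Q[1:], Q[:-1])[0, 1]

print(c, spread_lag1, q_lag1)
```

The random walk Q is almost perfectly autocorrelated, while the spread shows hardly any autocorrelation, consistent with a mean-reverting combination.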

TIME SERIES ANALYSIS IN PYTHON

Let's practice!
T I M E S E R I E S A N A LY S I S I N P Y T H O N
Case Study: Climate
Change
T I M E S E R I E S A N A LY S I S I N P Y T H O N

Rob Reider
Adjunct Professor, NYU-Courant
Consultant, Quantopian
Analyzing Temperature Data
Temperature data:
New York City from 1870-2016

Downloaded from National Oceanic and Atmospheric
Administration (NOAA)

Convert index to datetime object

Plot data

TIME SERIES ANALYSIS IN PYTHON

Analyzing Temperature Data
Test for Random Walk

Take first differences

Compute ACF and PACF

Fit a few AR, MA, and ARMA models

Use Information Criterion to choose best model

Forecast temperature over next 30 years

TIME SERIES ANALYSIS IN PYTHON

Let's practice!
T I M E S E R I E S A N A LY S I S I N P Y T H O N
Congratulations
T I M E S E R I E S A N A LY S I S I N P Y T H O N

Rob Reider
Adjunct Professor, NYU-Courant
Consultant, Quantopian
Advanced Topics
GARCH Models

Nonlinear Models

Multivariate Time Series Models

Regime Switching Models

State Space Models and Kalman Filtering

...

TIME SERIES ANALYSIS IN PYTHON

Keep practicing!
T I M E S E R I E S A N A LY S I S I N P Y T H O N
Welcome to the
course!
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N

Thomas Vincent
Head of Data Science, Getty Images
Prerequisites
Intro to Python for Data Science

Intermediate Python for Data Science

VISUALIZING TIME SERIES DATA IN PYTHON

Time series in the field of Data Science
Time series are a fundamental way to store and analyze
many types of data

Financial, weather and device data are all best handled as
time series

VISUALIZING TIME SERIES DATA IN PYTHON

Time series in the field of Data Science

VISUALIZING TIME SERIES DATA IN PYTHON

Course overview
Chapter 1: Getting started and personalizing your first time
series plot

Chapter 2: Summarizing and describing time series data

Chapter 3: Advanced time series analysis

Chapter 4: Working with multiple time series

Chapter 5: Case Study

VISUALIZING TIME SERIES DATA IN PYTHON

Reading data with Pandas
import pandas as pd
df = pd.read_csv('ch2_co2_levels.csv')
print(df)

datestamp co2
0 1958-03-29 316.1
1 1958-04-05 317.3
2 1958-04-12 317.6
...
...
...
2281 2001-12-15 371.2
2282 2001-12-22 371.3
2283 2001-12-29 371.5

VISUALIZING TIME SERIES DATA IN PYTHON

Preview data with Pandas
print(df.head(n=5))

datestamp co2
0 1958-03-29 316.1
1 1958-04-05 317.3
2 1958-04-12 317.6
3 1958-04-19 317.5
4 1958-04-26 316.4

print(df.tail(n=5))

datestamp co2
2279 2001-12-01 370.3
2280 2001-12-08 370.8
2281 2001-12-15 371.2
2282 2001-12-22 371.3
2283 2001-12-29 371.5

VISUALIZING TIME SERIES DATA IN PYTHON

Check data types with Pandas
print(df.dtypes)

datestamp object
co2 float64
dtype: object

VISUALIZING TIME SERIES DATA IN PYTHON

Working with dates
To work with time series data in pandas, your date columns
need to be of the datetime64 type.

pd.to_datetime(['2009/07/31', 'test'])

ValueError: Unknown string format

pd.to_datetime(['2009/07/31', 'test'], errors='coerce')

DatetimeIndex(['2009-07-31', 'NaT'],
dtype='datetime64[ns]', freq=None)
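Applied to a DataFrame column, the same conversion might look like the sketch below. The three-row frame is hypothetical sample data; dropping the unparseable row is one possible choice, not the only one.

```python
import pandas as pd

# Hypothetical data: one malformed date string among valid ones
df = pd.DataFrame({
    "datestamp": ["1958-03-29", "1958-04-05", "not a date"],
    "co2": [316.1, 317.3, 317.6],
})

# errors='coerce' turns unparseable strings into NaT instead of raising
df["datestamp"] = pd.to_datetime(df["datestamp"], errors="coerce")
print(df.dtypes)

# Drop rows whose date could not be parsed, then use the dates as the index
df = df.dropna(subset=["datestamp"]).set_index("datestamp")
print(df)
```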

VISUALIZING TIME SERIES DATA IN PYTHON

Let's get started!
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N
Plot your first time
series
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N

Thomas Vincent
Head of Data Science, Getty Images
The Matplotlib library
In Python, matplotlib is an extensive package used to plot
data

The pyplot submodule of matplotlib is traditionally imported
using the plt alias

import matplotlib.pyplot as plt

VISUALIZING TIME SERIES DATA IN PYTHON

Plotting time series data

VISUALIZING TIME SERIES DATA IN PYTHON

Plotting time series data
import matplotlib.pyplot as plt
import pandas as pd

df = df.set_index('date_column')
df.plot()
plt.show()

VISUALIZING TIME SERIES DATA IN PYTHON

Adding style to your plots
plt.style.use('fivethirtyeight')
df.plot()
plt.show()

VISUALIZING TIME SERIES DATA IN PYTHON

FiveThirtyEight style

VISUALIZING TIME SERIES DATA IN PYTHON

Matplotlib style sheets
print(plt.style.available)

['seaborn-dark-palette', 'seaborn-darkgrid',
'seaborn-dark', 'seaborn-notebook',
'seaborn-pastel', 'seaborn-white',
'classic', 'ggplot', 'grayscale',
'dark_background', 'seaborn-poster',
'seaborn-muted', 'seaborn', 'bmh',
'seaborn-paper', 'seaborn-whitegrid',
'seaborn-bright', 'seaborn-talk',
'fivethirtyeight', 'seaborn-colorblind',
'seaborn-deep', 'seaborn-ticks']

VISUALIZING TIME SERIES DATA IN PYTHON

Describing your graphs with labels
ax = df.plot(color='blue')

ax.set_xlabel('Date')
ax.set_ylabel('The values of my Y axis')
ax.set_title('The title of my plot')
plt.show()

VISUALIZING TIME SERIES DATA IN PYTHON

Figure size, linewidth, linestyle and fontsize
ax = df.plot(figsize=(12, 5), fontsize=12,
             linewidth=3, linestyle='--')
ax.set_xlabel('Date', fontsize=16)
ax.set_ylabel('The values of my Y axis', fontsize=16)
ax.set_title('The title of my plot', fontsize=16)
plt.show()

VISUALIZING TIME SERIES DATA IN PYTHON

Let's practice!
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N
Customize your time
series plot
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N

Thomas Vincent
Head of Data Science, Getty Images
Slicing time series data
discoveries['1960':'1970']

discoveries['1950-01':'1950-12']

discoveries['1960-01-01':'1960-01-15']
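A small sketch of date-string slicing, assuming a hypothetical yearly series in place of the discoveries data. It uses .loc, which makes the label-based intent explicit, and shows that both endpoints of the slice are included.

```python
import numpy as np
import pandas as pd

# Hypothetical yearly series standing in for the discoveries data
index = pd.date_range("1950-01-01", "1979-01-01", freq="YS")
discoveries = pd.DataFrame({"discoveries": np.arange(len(index))}, index=index)

# Both endpoints are included when slicing a DatetimeIndex with date strings
subset = discoveries.loc["1960":"1970"]
print(subset.index.min(), subset.index.max(), len(subset))
```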

VISUALIZING TIME SERIES DATA IN PYTHON

Plotting subset of your time series data
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
df_subset = discoveries['1960':'1970']

ax = df_subset.plot(color='blue', fontsize=14)
plt.show()

VISUALIZING TIME SERIES DATA IN PYTHON

Adding markers
ax.axvline(x='1969-01-01',
           color='red',
           linestyle='--')

ax.axhline(y=100,
           color='green',
           linestyle='--')

VISUALIZING TIME SERIES DATA IN PYTHON

Using markers: the full code
ax = discoveries.plot(color='blue')
ax.set_xlabel('Date')
ax.set_ylabel('Number of great discoveries')
ax.axvline('1969-01-01', color='red', linestyle='--')
ax.axhline(4, color='green', linestyle='--')

VISUALIZING TIME SERIES DATA IN PYTHON

Highlighting regions of interest
ax.axvspan('1964-01-01', '1968-01-01',
           color='red', alpha=0.5)

ax.axhspan(8, 6, color='green',
           alpha=0.2)

VISUALIZING TIME SERIES DATA IN PYTHON

Highlighting regions of interest: the full code
ax = discoveries.plot(color='blue')
ax.set_xlabel('Date')
ax.set_ylabel('Number of great discoveries')

ax.axvspan('1964-01-01', '1968-01-01', color='red',
           alpha=0.3)
ax.axhspan(8, 6, color='green', alpha=0.3)

VISUALIZING TIME SERIES DATA IN PYTHON

Let's practice!
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N
Clean your time
series data
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N

Thomas Vincent
Head of Data Science, Getty Images
The CO2 level time series
A snippet of the weekly measurements of CO2 levels at the
Mauna Loa Observatory, Hawaii.

datestamp co2
1958-03-29 316.1
1958-04-05 317.3
1958-04-12 317.6
...
...
2001-12-15 371.2
2001-12-22 371.3
2001-12-29 371.5

VISUALIZING TIME SERIES DATA IN PYTHON

Finding missing values in a DataFrame
print(df.isnull())

datestamp co2
1958-03-29 False
1958-04-05 False
1958-04-12 False

print(df.notnull())

datestamp co2
1958-03-29 True
1958-04-05 True
1958-04-12 True
...

VISUALIZING TIME SERIES DATA IN PYTHON

Counting missing values in a DataFrame
print(df.isnull().sum())

datestamp 0
co2 59
dtype: int64

VISUALIZING TIME SERIES DATA IN PYTHON

Replacing missing values in a DataFrame
print(df)

...
5 1958-05-03 316.9
6 1958-05-10 NaN
7 1958-05-17 317.5
...

df = df.fillna(method='bfill')
print(df)

...
5 1958-05-03 316.9
6 1958-05-10 317.5
7 1958-05-17 317.5
...
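For comparison, here is a minimal sketch of three common ways to fill the same gap, using a hypothetical three-point sample. The .bfill() and .ffill() methods are the direct equivalents of the method='bfill' and method='ffill' arguments; interpolation is an additional option not shown on the slide.

```python
import numpy as np
import pandas as pd

# Small hypothetical sample with one gap
co2 = pd.Series(
    [316.9, np.nan, 317.5],
    index=pd.to_datetime(["1958-05-03", "1958-05-10", "1958-05-17"]),
)

backfilled = co2.bfill()          # use the next valid observation
forwardfilled = co2.ffill()       # use the previous valid observation
interpolated = co2.interpolate()  # linear interpolation between neighbors
print(backfilled.iloc[1], forwardfilled.iloc[1], interpolated.iloc[1])
```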

VISUALIZING TIME SERIES DATA IN PYTHON

Let's practice!
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N
Plot aggregates of
your data
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N

Thomas Vincent
Head of Data Science, Getty Images
Moving averages
In the field of time series analysis, a moving average can be
used for many different purposes:
smoothing out short-term fluctuations

removing outliers

highlighting long-term trends or cycles.

VISUALIZING TIME SERIES DATA IN PYTHON

The moving average model
co2_levels_mean = co2_levels.rolling(window=52).mean()

ax = co2_levels_mean.plot()
ax.set_xlabel("Date")
ax.set_ylabel("The values of my Y axis")
ax.set_title("52 weeks rolling mean of my time series")

plt.show()
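To see what .rolling(window=...).mean() computes, a tiny sketch on five hypothetical values may help. With a window of 3, each result averages a point with the two before it, so the first two results are NaN.

```python
import pandas as pd

values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# A window of 3 averages each point with the two before it;
# the first window-1 results are NaN because the window is incomplete
rolling_mean = values.rolling(window=3).mean()
print(rolling_mean.tolist())
```

The same effect explains why a 52-week rolling mean starts almost a year after the first observation.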

VISUALIZING TIME SERIES DATA IN PYTHON

A plot of the moving average for the CO2 data

VISUALIZING TIME SERIES DATA IN PYTHON

Computing aggregate values of your time series
co2_levels.index

DatetimeIndex(['1958-03-29', '1958-04-05',...],
dtype='datetime64[ns]', name='datestamp',
length=2284, freq=None)

print(co2_levels.index.month)

array([ 3, 4, 4, ..., 12, 12, 12], dtype=int32)

print(co2_levels.index.year)

array([1958, 1958, 1958, ..., 2001,
       2001, 2001], dtype=int32)

VISUALIZING TIME SERIES DATA IN PYTHON

Plotting aggregate values of your time series
index_month = co2_levels.index.month
co2_levels_by_month = co2_levels.groupby(index_month).mean()
co2_levels_by_month.plot()

plt.show()
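A minimal sketch of the grouping step on hypothetical data: two years of monthly values are grouped by calendar month, so each of the 12 results averages the observations that share a month across years.

```python
import pandas as pd

# Two hypothetical years of monthly values
index = pd.date_range("2000-01-01", periods=24, freq="MS")
series = pd.Series(range(24), index=index, dtype=float)

# Group by calendar month (1-12) and average across years
by_month = series.groupby(series.index.month).mean()
print(by_month.head(3))
```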

VISUALIZING TIME SERIES DATA IN PYTHON

Plotting aggregate values of your time series

VISUALIZING TIME SERIES DATA IN PYTHON

Let's practice!
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N
Summarizing the
values in your time
series data
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N

Thomas Vincent
Head of Data Science, Getty Images
Obtaining numerical summaries of your data
What is the average value of this data?

What is the maximum value observed in this time series?

VISUALIZING TIME SERIES DATA IN PYTHON

The .describe() method automatically computes key
statistics of all numeric columns in your DataFrame

print(df.describe())

co2
count 2284.000000
mean 339.657750
std 17.100899
min 313.000000
25% 323.975000
50% 337.700000
75% 354.500000
max 373.900000

VISUALIZING TIME SERIES DATA IN PYTHON

Summarizing your data with boxplots
ax1 = df.boxplot()
ax1.set_xlabel('Your first boxplot')
ax1.set_ylabel('Values of your data')
ax1.set_title('Boxplot values of your data')

plt.show()

VISUALIZING TIME SERIES DATA IN PYTHON

A boxplot of the values in the CO2 data

VISUALIZING TIME SERIES DATA IN PYTHON

Summarizing your data with histograms
ax2 = df.plot(kind='hist', bins=100)
ax2.set_xlabel('Your first histogram')
ax2.set_ylabel('Frequency of values in your data')
ax2.set_title('Histogram of your data with 100 bins')

plt.show()

VISUALIZING TIME SERIES DATA IN PYTHON

A histogram plot of the values in the CO2 data

VISUALIZING TIME SERIES DATA IN PYTHON

Summarizing your data with density plots
ax3 = df.plot(kind='density', linewidth=2)
ax3.set_xlabel('Your first density plot')
ax3.set_ylabel('Density values of your data')
ax3.set_title('Density plot of your data')

plt.show()

VISUALIZING TIME SERIES DATA IN PYTHON

A density plot of the values in the CO2 data

VISUALIZING TIME SERIES DATA IN PYTHON

Let's practice!
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N
Autocorrelation and
Partial
autocorrelation
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N

Thomas Vincent
Head of Data Science, Getty Images
Autocorrelation in time series data
Autocorrelation is measured as the correlation between a
time series and a delayed copy of itself

For example, an autocorrelation of order 3 returns the
correlation between a time series at points (t_1, t_2, t_3,
...) and its own values lagged by 3 time points, i.e. (t_4, t_5,
t_6, ...)

It is used to find repetitive patterns or periodic signals in time
series
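The definition can be checked by hand, as in this sketch. The sine signal with a period of 6 is a hypothetical example chosen so that a lag of 3 lands exactly half a cycle away; the manual correlation is compared with the pandas Series.autocorr method.

```python
import numpy as np
import pandas as pd

# Hypothetical signal that repeats every 6 time points
t = np.arange(120)
series = pd.Series(np.sin(2 * np.pi * t / 6))

lag = 3
# Correlate the series with a copy of itself shifted by `lag` points
manual = np.corrcoef(series.iloc[lag:], series.iloc[:-lag])[0, 1]
builtin = series.autocorr(lag=lag)
print(manual, builtin)
```

A strongly negative value at half the period, and a strongly positive one at the full period, is the signature of a periodic signal.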

VISUALIZING TIME SERIES DATA IN PYTHON

Statsmodels
statsmodels is a Python module that provides classes and
functions for the estimation of many different statistical
models, as well as for conducting statistical tests, and
statistical data exploration.

VISUALIZING TIME SERIES DATA IN PYTHON

Plotting autocorrelations
import matplotlib.pyplot as plt
from statsmodels.graphics import tsaplots
fig = tsaplots.plot_acf(co2_levels['co2'], lags=40)

plt.show()

VISUALIZING TIME SERIES DATA IN PYTHON

Interpreting autocorrelation plots

VISUALIZING TIME SERIES DATA IN PYTHON

Partial autocorrelation in time series data
Contrary to autocorrelation, partial autocorrelation removes
the effect of previous time points

For example, a partial autocorrelation function of order 3
returns the correlation between our time series (t1, t2, t3,
...) and lagged values of itself by 3 time points (t4, t5, t6,
...), but only after removing all effects attributable to lags 1
and 2
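One way to make "removing the effects of lags 1 and 2" concrete is a regression, sketched below under stated assumptions: the AR(1) process and its coefficient are hypothetical, and the lag-3 partial autocorrelation is taken as the lag-3 coefficient when regressing on lags 1 to 3. For an AR(1), the plain autocorrelation at lag 3 is clearly non-zero while the partial one is close to zero.

```python
import numpy as np

rng = np.random.default_rng(42)
n, phi = 5000, 0.7  # hypothetical AR(1) process

y = np.zeros(n)
for t in range(1, n):
    y[t] = phi * y[t - 1] + rng.normal()

# Plain autocorrelation at lag 3 (close to phi**3 for an AR(1))
acf3 = np.corrcoef(y[3:], y[:-3])[0, 1]

# Partial autocorrelation at lag 3: the lag-3 coefficient in a regression of
# y_t on y_{t-1}, y_{t-2}, y_{t-3}, so lags 1 and 2 are accounted for
X = np.column_stack([np.ones(n - 3), y[2:-1], y[1:-2], y[:-3]])
coefs, *_ = np.linalg.lstsq(X, y[3:], rcond=None)
pacf3 = coefs[3]
print(acf3, pacf3)
```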

VISUALIZING TIME SERIES DATA IN PYTHON

Plotting partial autocorrelations
import matplotlib.pyplot as plt

from statsmodels.graphics import tsaplots

fig = tsaplots.plot_pacf(co2_levels['co2'], lags=40)

plt.show()

VISUALIZING TIME SERIES DATA IN PYTHON

Interpreting partial autocorrelations plot

VISUALIZING TIME SERIES DATA IN PYTHON

Let's practice!
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N
Seasonality, trend
and noise in time
series data
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N

Thomas Vincent
Head of Data Science, Getty Images
Properties of time series

VISUALIZING TIME SERIES DATA IN PYTHON

The properties of time series
Seasonality: does the data display a clear periodic pattern?

Trend: does the data follow a consistent upwards or
downwards slope?

Noise: are there any outlier points or missing values that are
not consistent with the rest of the data?

VISUALIZING TIME SERIES DATA IN PYTHON

Time series decomposition
import statsmodels.api as sm
import matplotlib.pyplot as plt
from pylab import rcParams

rcParams['figure.figsize'] = 11, 9
decomposition = sm.tsa.seasonal_decompose(
    co2_levels['co2'])
fig = decomposition.plot()

plt.show()

VISUALIZING TIME SERIES DATA IN PYTHON

A plot of time series decomposition on the CO2 data

VISUALIZING TIME SERIES DATA IN PYTHON

Extracting components from time series
decomposition
print(dir(decomposition))

['__class__', '__delattr__', '__dict__',
... 'plot', 'resid', 'seasonal', 'trend']

print(decomposition.seasonal)

datestamp
1958-03-29 1.028042
1958-04-05 1.235242
1958-04-12 1.412344
1958-04-19 1.701186

VISUALIZING TIME SERIES DATA IN PYTHON

Seasonality component in time series
decomp_seasonal = decomposition.seasonal

ax = decomp_seasonal.plot(figsize=(14, 2))
ax.set_xlabel('Date')
ax.set_ylabel('Seasonality of time series')
ax.set_title('Seasonal values of the time series')

plt.show()

VISUALIZING TIME SERIES DATA IN PYTHON

Seasonality component in time series

VISUALIZING TIME SERIES DATA IN PYTHON

Trend component in time series
decomp_trend = decomposition.trend

ax = decomp_trend.plot(figsize=(14, 2))
ax.set_xlabel('Date')
ax.set_ylabel('Trend of time series')
ax.set_title('Trend values of the time series')

plt.show()

VISUALIZING TIME SERIES DATA IN PYTHON

Trend component in time series

VISUALIZING TIME SERIES DATA IN PYTHON

Noise component in time series
decomp_resid = decomposition.resid

ax = decomp_resid.plot(figsize=(14, 2))
ax.set_xlabel('Date')
ax.set_ylabel('Residual of time series')
ax.set_title('Residual values of the time series')

plt.show()

VISUALIZING TIME SERIES DATA IN PYTHON

Noise component in time series

VISUALIZING TIME SERIES DATA IN PYTHON

Let's practice!
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N
A review on what
you have learned so
far
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N

Thomas Vincent
Head of Data Science, Getty Images
So far ...
Visualize aggregates of time series data

Extract statistical summaries

Autocorrelation and Partial autocorrelation

Time series decomposition

VISUALIZING TIME SERIES DATA IN PYTHON

The airline dataset

VISUALIZING TIME SERIES DATA IN PYTHON

Let's analyze this
data!
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N
Working with more
than one time series
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N

Thomas Vincent
Head of Data Science, Getty Images
Working with multiple time series
An isolated time series

date ts1
1949-01 112
1949-02 118
1949-03 132

A file with multiple time series

date ts1 ts2 ts3 ts4 ts5 ts6 ts7

2012-01-01 2113.8 10.4 1987.0 12.1 3091.8 43.2 476.7
2012-02-01 2009.0 9.8 1882.9 12.3 2954.0 38.8 466.8
2012-03-01 2159.8 10.0 1987.9 14.3 3043.7 40.1 502.1

VISUALIZING TIME SERIES DATA IN PYTHON

The Meat production dataset
import pandas as pd
meat = pd.read_csv("[Link]")
print(meat.head(5))

date beef veal pork lamb_and_mutton broilers

0 1944-01-01 751.0 85.0 1280.0 89.0 NaN
1 1944-02-01 713.0 77.0 1169.0 72.0 NaN
2 1944-03-01 741.0 90.0 1128.0 75.0 NaN
3 1944-04-01 650.0 89.0 978.0 66.0 NaN
4 1944-05-01 681.0 106.0 1029.0 78.0 NaN

other_chicken turkey
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN

VISUALIZING TIME SERIES DATA IN PYTHON

Summarizing and plotting multiple time series
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
ax = meat.plot(figsize=(12, 4), fontsize=14)

plt.show()

VISUALIZING TIME SERIES DATA IN PYTHON

Area charts
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
ax = meat.plot.area(figsize=(12, 4), fontsize=14)

plt.show()

VISUALIZING TIME SERIES DATA IN PYTHON

Let's practice!
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N
Plot multiple time
series
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N

Thomas Vincent
Head of Data Science, Getty Images
Clarity is key
In this plot, the default matplotlib color scheme assigns the
same color to the beef and turkey time series.

VISUALIZING TIME SERIES DATA IN PYTHON

The colormap argument
ax = meat.plot(colormap='Dark2', figsize=(14, 7))
ax.set_xlabel('Date')
ax.set_ylabel('Production Volume (in tons)')

plt.show()

For the full set of available colormaps, click here.

VISUALIZING TIME SERIES DATA IN PYTHON

Changing line colors with the colormap argument

VISUALIZING TIME SERIES DATA IN PYTHON

Enhancing your plot with information
ax = meat.plot(colormap='Dark2', figsize=(14, 7))
df_summary = meat.describe()

# Specify values of cells in the table
ax.table(cellText=df_summary.values,
         # Specify width of the table
         colWidths=[0.3]*len(meat.columns),
         # Specify row labels
         rowLabels=df_summary.index,
         # Specify column labels
         colLabels=df_summary.columns,
         # Specify location of the table
         loc='top')

plt.show()

VISUALIZING TIME SERIES DATA IN PYTHON

Adding Statistical summaries to your plots

VISUALIZING TIME SERIES DATA IN PYTHON

Dealing with different scales

VISUALIZING TIME SERIES DATA IN PYTHON

Only veal

VISUALIZING TIME SERIES DATA IN PYTHON

Facet plots
meat.plot(subplots=True,
          linewidth=0.5,
          layout=(2, 4),
          figsize=(16, 10),
          sharex=False,
          sharey=False)

plt.show()

VISUALIZING TIME SERIES DATA IN PYTHON

VISUALIZING TIME SERIES DATA IN PYTHON
Time for some
action!
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N
Find relationships
between multiple
time series
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N

Thomas Vincent
Head of Data Science, Getty Images
Correlations between two variables
In the field of Statistics, the correlation coefficient is a
measure used to determine the strength or lack of
relationship between two variables:
Pearson's coefficient can be used to compute the
correlation coefficient between variables for which the
relationship is thought to be linear

Kendall Tau or Spearman rank can be used to compute the
correlation coefficient between variables for which the
relationship is thought to be non-linear

VISUALIZING TIME SERIES DATA IN PYTHON

Compute correlations
from scipy.stats import pearsonr
from scipy.stats import spearmanr
from scipy.stats import kendalltau
x = [1, 2, 4, 7]
y = [1, 3, 4, 8]
pearsonr(x, y)

(0.9843, 0.01569)

spearmanr(x, y)

SpearmanrResult(correlation=1.0, pvalue=0.0)

kendalltau(x, y)

KendalltauResult(correlation=1.0, pvalue=0.0415)
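The Pearson and Spearman values above can be reproduced with NumPy alone, as a rough sketch. Ranking by a double argsort is an assumption that only holds because this data has no ties; the Kendall statistic and the p-values are not recomputed here.

```python
import numpy as np

x = np.array([1, 2, 4, 7])
y = np.array([1, 3, 4, 8])

# Pearson: covariance scaled by both standard deviations
pearson = np.corrcoef(x, y)[0, 1]

# Spearman: Pearson correlation of the ranks (no ties in this data,
# so a double argsort is enough to rank the values)
rank_x = x.argsort().argsort()
rank_y = y.argsort().argsort()
spearman = np.corrcoef(rank_x, rank_y)[0, 1]
print(round(pearson, 4), round(spearman, 4))
```

Spearman reaches exactly 1 because y increases whenever x does, even though the relationship is not perfectly linear.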

VISUALIZING TIME SERIES DATA IN PYTHON

What is a correlation matrix?
When computing the correlation coefficient between more
than two variables, you obtain a correlation matrix
Range: [-1, 1]

0: no relationship

1: strong positive relationship

-1: strong negative relationship

VISUALIZING TIME SERIES DATA IN PYTHON

What is a correlation matrix?
A correlation matrix is always "symmetric"

The diagonal values will always be equal to 1

x y z
x 1.00 -0.46 0.49
y -0.46 1.00 -0.61
z 0.49 -0.61 1.00

VISUALIZING TIME SERIES DATA IN PYTHON

Computing Correlation Matrices with Pandas
corr_p = meat[['beef', 'veal','turkey']].corr(method='pearson')
print(corr_p)

beef veal turkey

beef 1.000 -0.829 0.738
veal -0.829 1.000 -0.768
turkey 0.738 -0.768 1.000

corr_s = meat[['beef', 'veal','turkey']].corr(method='spearman')

print(corr_s)

beef veal turkey

beef 1.000 -0.812 0.778
veal -0.812 1.000 -0.829
turkey 0.778 -0.829 1.000

VISUALIZING TIME SERIES DATA IN PYTHON

Computing Correlation Matrices with Pandas
corr_mat = meat.corr(method='pearson')

VISUALIZING TIME SERIES DATA IN PYTHON

Heatmap
import seaborn as sns
sns.heatmap(corr_mat)

VISUALIZING TIME SERIES DATA IN PYTHON

Heatmap

VISUALIZING TIME SERIES DATA IN PYTHON

Clustermap
sns.clustermap(corr_mat)

VISUALIZING TIME SERIES DATA IN PYTHON

VISUALIZING TIME SERIES DATA IN PYTHON
Let's practice!
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N
Apply your
knowledge to a new
dataset
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N

Thomas Vincent
Head of Data Science, Getty Images
The Jobs dataset

VISUALIZING TIME SERIES DATA IN PYTHON

Let's get started!
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N
Beyond summary
statistics
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N

Thomas Vincent
Head of Data Science, Getty Images
Facet plots of the jobs dataset
jobs.plot(subplots=True,
          layout=(4, 4),
          figsize=(20, 16),
          sharex=True,
          sharey=False)

plt.show()

VISUALIZING TIME SERIES DATA IN PYTHON

VISUALIZING TIME SERIES DATA IN PYTHON
Annotating events in the jobs dataset
ax = jobs.plot(figsize=(20, 14), colormap='Dark2')
ax.axvline('2008-01-01', color='black',
           linestyle='--')
ax.axvline('2009-01-01', color='black',
           linestyle='--')

VISUALIZING TIME SERIES DATA IN PYTHON

VISUALIZING TIME SERIES DATA IN PYTHON
Taking seasonal average in the jobs dataset
print(jobs.index)

DatetimeIndex(['2000-01-01', '2000-02-01', '2000-03-01',

'2000-04-01', '2009-09-01','2009-10-01',
'2009-11-01', '2009-12-01','2010-01-01', '2010-02-01'],
dtype='datetime64[ns]', name='datestamp',
length=122, freq=None)

index_month = jobs.index.month
jobs_by_month = jobs.groupby(index_month).mean()
print(jobs_by_month)

datestamp Agriculture Business services Construction

1 13.763636 7.863636 12.909091
2 13.645455 7.645455 13.600000
3 13.830000 7.130000 11.290000
4 9.130000 6.270000 9.450000
5 7.100000 6.600000 8.120000
...

VISUALIZING TIME SERIES DATA IN PYTHON

Monthly averages in the jobs dataset
ax = jobs_by_month.plot(figsize=(12, 5),
colormap='Dark2')

ax.legend(bbox_to_anchor=(1.0, 0.5),
          loc='center left')

VISUALIZING TIME SERIES DATA IN PYTHON

Monthly averages in the jobs dataset

VISUALIZING TIME SERIES DATA IN PYTHON

Time to practice!
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N
Decompose time
series data
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N

Thomas Vincent
Head of Data Science, Getty Images
Python dictionaries
# Initialize a Python dictionary
my_dict = {}

# Add a key and value to your dictionary
my_dict['your_key'] = 'your_value'

# Add a second key and value to your dictionary
my_dict['your_second_key'] = 'your_second_value'

# Print out your dictionary
print(my_dict)

{'your_key': 'your_value',
'your_second_key': 'your_second_value'}

VISUALIZING TIME SERIES DATA IN PYTHON

Decomposing multiple time series with Python
dictionaries
# Import the statsmodels library
import statsmodels.api as sm
# Initialize a dictionary
my_dict = {}
# Extract the names of the time series
ts_names = jobs.columns
print(ts_names)

['ts1', 'ts2', 'ts3']

# Run time series decomposition
for ts in ts_names:
    ts_decomposition = sm.tsa.seasonal_decompose(jobs[ts])
    my_dict[ts] = ts_decomposition

VISUALIZING TIME SERIES DATA IN PYTHON

Extract decomposition components of multiple time
series
# Initialize a new dictionary
my_dict_trend = {}
# Extract the trend component
for ts in ts_names:
    my_dict_trend[ts] = my_dict[ts].trend
# Convert to a DataFrame
trend_df = pd.DataFrame.from_dict(my_dict_trend)
print(trend_df)

ts1 ts2 ts3

datestamp
2000-01-01 2.2 1.3 3.6
2000-02-01 3.4 2.1 4.7
...

VISUALIZING TIME SERIES DATA IN PYTHON

Python dictionaries
for the win!
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N
Compute
correlations
between time series
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N

Thomas Vincent
Head of Data Science, Getty Images
Trends in Jobs data
print(trend_df)

datestamp Agriculture Business services Construction

2000-01-01 NaN NaN NaN

2000-02-01 NaN NaN NaN
2000-03-01 NaN NaN NaN
2000-04-01 NaN NaN NaN
2000-05-01 NaN NaN NaN
2000-06-01 NaN NaN NaN
2000-07-01 9.170833 4.787500 6.329167
2000-08-01 9.466667 4.820833 6.304167
...

VISUALIZING TIME SERIES DATA IN PYTHON

Plotting a clustermap of the jobs correlation matrix
# Get correlation matrix of the trend_df DataFrame
trend_corr = trend_df.corr(method='spearman')

# Customize the clustermap of the trend_corr correlation matrix
fig = sns.clustermap(trend_corr, annot=True, linewidth=0.4)

plt.setp(fig.ax_heatmap.yaxis.get_majorticklabels(),
         rotation=0)

plt.setp(fig.ax_heatmap.xaxis.get_majorticklabels(),
         rotation=90)

VISUALIZING TIME SERIES DATA IN PYTHON

The jobs correlation matrix

VISUALIZING TIME SERIES DATA IN PYTHON

Let's practice!
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N
Congratulations!
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N

Thomas Vincent
Head of Data Science, Getty Images
Going further with time series
Data from Zillow Research

Kaggle competitions

Reddit Data

VISUALIZING TIME SERIES DATA IN PYTHON

Going further with time series
The importance of time series in business:
to identify seasonal patterns and trends

to study past behaviors

to produce robust forecasts

to evaluate and compare company achievements

VISUALIZING TIME SERIES DATA IN PYTHON

Getting to the next level
Manipulating Time Series Data in Python

Importing & Managing Financial Data in Python

Statistical Thinking in Python (Part 1)

Supervised Learning with scikit-learn

VISUALIZING TIME SERIES DATA IN PYTHON

Thank you!
V I S U A L I Z I N G T I M E S E R I E S D ATA I N P Y T H O N
Introduction to time
series and
stationarity
ARIMA MODELS IN PYTHON

James Fulton
Climate informatics researcher
Motivation
Time series are everywhere

Science

Technology
Business

Finance

Policy

ARIMA MODELS IN PYTHON

Course content
You will learn

Structure of ARIMA models

How to fit ARIMA model

How to optimize the model

How to make forecasts

How to calculate uncertainty in predictions

ARIMA MODELS IN PYTHON

Loading and plotting
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('time_series.csv', index_col='date', parse_dates=True)

date values
2019-03-11 5.734193
2019-03-12 6.288708
2019-03-13 5.205788
2019-03-14 3.176578

ARIMA MODELS IN PYTHON

Trend
fig, ax = plt.subplots()
df.plot(ax=ax)
plt.show()

ARIMA MODELS IN PYTHON

Seasonality

ARIMA MODELS IN PYTHON

Cyclicality

ARIMA MODELS IN PYTHON

White noise
White noise series has uncorrelated values

Heads, heads, heads, tails, heads, tails, ...

0.1, -0.3, 0.8, 0.4, -0.5, 0.9, ...

ARIMA MODELS IN PYTHON

Stationarity
Stationary Not stationary

Trend stationary: Trend is zero

ARIMA MODELS IN PYTHON

Stationarity
Stationary Not stationary

Trend stationary: Trend is zero

Variance is constant

ARIMA MODELS IN PYTHON

Stationarity
Stationary Not stationary

Trend stationary: Trend is zero

Variance is constant

Autocorrelation is constant

ARIMA MODELS IN PYTHON

Train-test split
# Train data - all data up to the end of 2018
df_train = df.loc[:'2018']

# Test data - all data from 2019 onwards
df_test = df.loc['2019':]
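A minimal runnable sketch of the split, assuming a hypothetical daily series that straddles the 2018/2019 boundary. The point is that the split follows time order, so every test observation comes after every training observation.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series spanning the 2018/2019 boundary
index = pd.date_range("2018-12-25", "2019-01-05", freq="D")
df = pd.DataFrame({"values": np.arange(len(index))}, index=index)

# Split by time, never by random shuffling, so the test set lies in the future
df_train = df.loc[:"2018"]
df_test = df.loc["2019":]
print(len(df_train), len(df_test))
```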

ARIMA MODELS IN PYTHON

Let's Practice!
ARIMA MODELS IN PYTHON
Making time series
stationary
ARIMA MODELS IN PYTHON

James Fulton
Climate informatics researcher
Overview
Statistical tests for stationarity
Making a dataset stationary

ARIMA MODELS IN PYTHON

The augmented Dickey-Fuller test
Tests for trend non-stationarity
Null hypothesis is time series is non-stationary

ARIMA MODELS IN PYTHON

Applying the adfuller test

from statsmodels.tsa.stattools import adfuller

results = adfuller(df['close'])

ARIMA MODELS IN PYTHON

Interpreting the test result
print(results)

(-1.34, 0.60, 23, 1235, {'1%': -3.435, '5%': -2.863, '10%': -2.568}, 10782.87)

0th element is test statistic (-1.34)

More negative means more likely to be stationary

1st element is p-value: (0.60)

If p-value is small → reject null hypothesis. Reject non-stationary.

4th element is a dictionary of critical values for the test statistic

ARIMA MODELS IN PYTHON

Interpreting the test result
print(results)

(-1.34, 0.60, 23, 1235, {'1%': -3.435, '5%': -2.863, '10%': -2.568}, 10782.87)

0th element is test statistic (-1.34)

More negative means more likely to be stationary

1st element is p-value: (0.60)

If p-value is small → reject null hypothesis. Reject non-stationary.

4th element is a dictionary of critical values for the test statistic

1 [Link]

ARIMA MODELS IN PYTHON

The value of plotting
Plotting time series can stop you making wrong assumptions

ARIMA MODELS IN PYTHON

The value of plotting

ARIMA MODELS IN PYTHON

Making a time series stationary

ARIMA MODELS IN PYTHON

Taking the difference

Difference: Δyt = yt − yt−1

ARIMA MODELS IN PYTHON

Taking the difference
df_stationary = df.diff()

city_population
date
1969-09-30 NaN
1970-03-31 -0.116156
1970-09-30 0.050850
1971-03-31 -0.153261
1971-09-30 0.108389

ARIMA MODELS IN PYTHON

Taking the difference
df_stationary = df.diff().dropna()

city_population
date
1970-03-31 -0.116156
1970-09-30 0.050850
1971-03-31 -0.153261
1971-09-30 0.108389
1972-03-31 -0.029569
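To illustrate why the first row is NaN and what the difference recovers, here is a short sketch on a simulated random walk. The increments are hypothetical; differencing the walk returns exactly those increments, minus the first one, which has no predecessor.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
steps = rng.normal(size=200)        # hypothetical white-noise increments
walk = pd.Series(np.cumsum(steps))  # random walk built from them

# diff() computes y_t - y_{t-1}; the first value has no predecessor, hence NaN
diffed = walk.diff()
stationary = diffed.dropna()
print(diffed.head(3))
```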

ARIMA MODELS IN PYTHON

Taking the difference

ARIMA MODELS IN PYTHON

Other transforms
Examples of other transforms

Take the log

np.log(df)

Take the square root

np.sqrt(df)

Take the proportional change

df.shift(1)/df

ARIMA MODELS IN PYTHON

Let's practice!
ARIMA MODELS IN PYTHON
Intro to AR, MA and
ARMA models
ARIMA MODELS IN PYTHON

James Fulton
Climate informatics researcher
AR models
Autoregressive (AR) model

AR(1) model :
yt = a1 yt−1 + ϵt

ARIMA MODELS IN PYTHON

AR models
Autoregressive (AR) model

AR(1) model :
yt = a1 yt−1 + ϵt

AR(2) model :
yt = a1 yt−1 + a2 yt−2 + ϵt

AR(p) model :
yt = a1 yt−1 + a2 yt−2 + ... + ap yt−p + ϵt

ARIMA MODELS IN PYTHON

MA models
Moving average (MA) model

MA(1) model :
yt = m1 ϵt−1 + ϵt

MA(2) model :
yt = m1 ϵt−1 + m2 ϵt−2 + ϵt

MA(q) model :
yt = m1 ϵt−1 + m2 ϵt−2 + ... + mq ϵt−q + ϵt

ARIMA MODELS IN PYTHON

ARMA models
Autoregressive moving-average (ARMA) model

ARMA = AR + MA

ARMA(1,1) model :
yt = a1 yt−1 + m1 ϵt−1 + ϵt

ARMA(p, q)

p is order of AR part

q is order of MA part

ARIMA MODELS IN PYTHON

Creating ARMA data
yt = a1 yt−1 + m1 ϵt−1 + ϵt

ARIMA MODELS IN PYTHON

Creating ARMA data
yt = 0.5yt−1 + 0.2ϵt−1 + ϵt

from statsmodels.tsa.arima_process import arma_generate_sample

ar_coefs = [1, -0.5]
ma_coefs = [1, 0.2]
y = arma_generate_sample(ar_coefs, ma_coefs, nsample=100, scale=0.5)
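The coefficient lists include a leading 1 and the AR coefficient appears with its sign flipped, which follows from writing the model in lag-polynomial form. The sketch below simulates the same equation directly with NumPy rather than calling statsmodels; the sample size and seed are arbitrary choices, and the lag-1 autocorrelation is printed only as a sanity check.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20000
a1, m1 = 0.5, 0.2
eps = rng.normal(scale=0.5, size=n)

# Direct simulation of y_t = a1*y_{t-1} + m1*eps_{t-1} + eps_t
y = np.zeros(n)
y[0] = eps[0]
for t in range(1, n):
    y[t] = a1 * y[t - 1] + m1 * eps[t - 1] + eps[t]

# In lag-polynomial form this is (1 - a1*L) y_t = (1 + m1*L) eps_t,
# which is why the AR coefficient is passed with its sign flipped
ar_coefs = [1, -a1]
ma_coefs = [1, m1]

lag1 = np.corrcoef(y[1:], y[:-1])[0, 1]
print(ar_coefs, ma_coefs, round(lag1, 2))
```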

ARIMA MODELS IN PYTHON

Creating ARMA data
yt = 0.5yt−1 + 0.2ϵt−1 + ϵt

ARIMA MODELS IN PYTHON

Fitting an ARMA model
from statsmodels.tsa.arima.model import ARIMA
# Instantiate model object
model = ARIMA(y, order=(1,0,1))
# Fit model
results = model.fit()

ARIMA MODELS IN PYTHON

Let's practice!
ARIMA MODELS IN PYTHON
Fitting time series
models
ARIMA MODELS IN PYTHON

James Fulton
Climate informatics researcher
Creating a model
from statsmodels.tsa.arima.model import ARIMA

# This is an ARMA(p,q) model

model = ARIMA(timeseries, order=(p,0,q))

ARIMA MODELS IN PYTHON

Creating AR and MA models
ar_model = ARIMA(timeseries, order=(p,0,0))

ma_model = ARIMA(timeseries, order=(0,0,q))

ARIMA MODELS IN PYTHON

Fitting the model and fit summary
model = ARIMA(timeseries, order=(2,0,1))
results = model.fit()

print(results.summary())

ARIMA MODELS IN PYTHON

Fit summary
SARIMAX Results
==============================================================================
Dep. Variable: y No. Observations: 1000
Model: ARMA(2, 1) Log Likelihood 148.580
Date: Thu, 25 Apr 2022 AIC -287.159
Time: [Link] BIC -262.621
Sample: 0 HQIC -277.833
Covariance Type: opg
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const -0.0017 0.012 -0.147 0.883 -0.025 0.021
ar.L1.y 0.5253 0.054 9.807 0.000 0.420 0.630
ar.L2.y -0.2909 0.042 -6.850 0.000 -0.374 -0.208
ma.L1.y 0.3679 0.052 7.100 0.000 0.266 0.469

ARIMA MODELS IN PYTHON

Fit summary
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
const -0.0017 0.012 -0.147 0.883 -0.025 0.021
ar.L1.y 0.5253 0.054 9.807 0.000 0.420 0.630
ar.L2.y -0.2909 0.042 -6.850 0.000 -0.374 -0.208
ma.L1.y 0.3679 0.052 7.100 0.000 0.266 0.469
sigma2 1.6306 0.339 6.938 0.000 0.583 1.943

ARIMA MODELS IN PYTHON

Introduction to ARMAX models
Exogenous ARMA
Use external variables as well as time series

ARMAX = ARMA + linear regression

ARIMA MODELS IN PYTHON

ARMAX equation
ARMA(1,1) model :
$y_t = a_1 y_{t-1} + m_1 \epsilon_{t-1} + \epsilon_t$

ARMAX(1,1) model :
$y_t = x_1 z_t + a_1 y_{t-1} + m_1 \epsilon_{t-1} + \epsilon_t$

ARIMA MODELS IN PYTHON

ARMAX example

ARIMA MODELS IN PYTHON

Fitting ARMAX
# Instantiate the model
model = ARIMA(df['productivity'], order=(2,0,1), exog=df['hours_sleep'])

# Fit the model
results = model.fit()

ARIMA MODELS IN PYTHON

ARMAX summary
==============================================================================
coef std err z P>|z| [0.025 0.975]
-----------------------------------------------------------------------------
const -0.1936 0.092 -2.098 0.041 -0.375 -0.013
x1 0.1131 0.013 8.602 0.000 0.087 0.139
ar.L1.y 0.1917 0.252 0.760 0.450 -0.302 0.686
ar.L2.y -0.3740 0.121 -3.079 0.003 -0.612 -0.136
ma.L1.y -0.0740 0.259 -0.286 0.776 -0.581 0.433

ARIMA MODELS IN PYTHON

Let's practice!
ARIMA MODELS IN PYTHON
Forecasting
ARIMA MODELS IN PYTHON

James Fulton
Climate informatics researcher
Predicting the next value
Take an AR(1) model

$y_t = a_1 y_{t-1} + \epsilon_t$

Predict next value (with $a_1 = 0.6$ and $y_{t-1} = 10$)

$y_t = 0.6 \times 10 + \epsilon_t$

$y_t = 6.0 + \epsilon_t$

Uncertainty on prediction

$5.0 < y_t < 7.0$
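The arithmetic above can be sketched in a few lines. This is only an illustration under stated assumptions: the shock is taken to be Gaussian, the noise standard deviation of 0.5 is a hypothetical value chosen so that the interval lands close to the 5.0 to 7.0 range on the slide, and `one_step_forecast` is an illustrative name.

```python
def one_step_forecast(a1, y_prev, noise_std, z=1.96):
    """Point forecast and an approximate 95% interval for the next AR(1) value."""
    mean = a1 * y_prev
    return mean, mean - z * noise_std, mean + z * noise_std

# a1 and y_prev come from the slide; noise_std is an assumed value
mean, lower, upper = one_step_forecast(a1=0.6, y_prev=10, noise_std=0.5)
print(mean, lower, upper)
```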

ARIMA MODELS IN PYTHON

One-step-ahead predictions

ARIMA MODELS IN PYTHON

Making one-step-ahead predictions
# Make predictions for last 25 values
results = model.fit()
# Make in-sample prediction
forecast = results.get_prediction(start=-25)
# forecast mean
mean_forecast = forecast.predicted_mean

Predicted mean is a pandas series

2013-10-28 1.519368
2013-10-29 1.351082
2013-10-30 1.218016

ARIMA MODELS IN PYTHON

Confidence intervals
# Get confidence intervals of forecasts
confidence_intervals = forecast.conf_int()

Confidence interval method returns pandas DataFrame

lower y upper y
2013-09-28 -4.720471 -0.815384
2013-09-29 -5.069875 0.112505
2013-09-30 -5.232837 0.766300
2013-10-01 -5.305814 1.282935
2013-10-02 -5.326956 1.703974

ARIMA MODELS IN PYTHON

Plotting predictions
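import matplotlib.pyplot as plt

# Dates and interval bounds taken from the forecast objects created earlier
dates = mean_forecast.index
lower_limits = confidence_intervals['lower y']
upper_limits = confidence_intervals['upper y']

plt.figure()

# Plot prediction
plt.plot(dates,
         mean_forecast.values,
         color='red',
         label='forecast')
# Shade uncertainty area
plt.fill_between(dates, lower_limits, upper_limits, color='pink')

plt.show()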

ARIMA MODELS IN PYTHON

Plotting predictions

ARIMA MODELS IN PYTHON

Dynamic predictions

ARIMA MODELS IN PYTHON

Making dynamic predictions
results = model.fit()
forecast = results.get_prediction(start=-25, dynamic=True)

# forecast mean
mean_forecast = forecast.predicted_mean

# Get confidence intervals of forecasts

confidence_intervals = forecast.conf_int()
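The difference between one-step-ahead and dynamic predictions is easiest to see on a toy AR(1) model. The sketch below is a simplified illustration, not what `get_prediction` does internally; the function names and the three-point series are made up for the example.

```python
def one_step_predictions(y, a1, start):
    """Predict y[t] from the observed y[t-1] for every t >= start."""
    return [a1 * y[t - 1] for t in range(start, len(y))]

def dynamic_predictions(y, a1, start):
    """Predict y[start] from the observed y[start-1], then feed predictions forward."""
    preds = []
    prev = y[start - 1]
    for _ in range(start, len(y)):
        prev = a1 * prev
        preds.append(prev)
    return preds

observed = [10.0, 8.0, 7.0]
one_step = one_step_predictions(observed, a1=0.5, start=1)  # uses 10.0, then 8.0
dynamic = dynamic_predictions(observed, a1=0.5, start=1)    # uses 10.0, then its own 5.0
print(one_step, dynamic)
```

Because dynamic predictions never look at observations after the start point, their errors can compound, which is why their uncertainty tends to grow with the horizon.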

ARIMA MODELS IN PYTHON

Forecasting out of sample
forecast = results.get_forecast(steps=20)

# forecast mean
mean_forecast = forecast.predicted_mean

# Get confidence intervals of forecasts

confidence_intervals = forecast.conf_int()

ARIMA MODELS IN PYTHON

Let's practice!
ARIMA MODELS IN PYTHON
Introduction to
ARIMA models
ARIMA MODELS IN PYTHON

James Fulton
Climate informatics researcher
Non-stationary time series recap

ARIMA MODELS IN PYTHON

Forecast of differenced time series

ARIMA MODELS IN PYTHON

Reconstructing original time series after differencing
diff_forecast = results.get_forecast(steps=10).predicted_mean
from numpy import cumsum
# Integrate the differenced forecast and add back the last observed value
mean_forecast = cumsum(diff_forecast) + df.iloc[-1,0]
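The same reconstruction can be checked on small numbers without any libraries. This is a minimal sketch assuming first-order differencing; `integrate_forecast` and the sample values are hypothetical.

```python
from itertools import accumulate

def integrate_forecast(diff_forecast, last_observed):
    """Undo first-order differencing: running sum of the steps plus the last known level."""
    return [last_observed + total for total in accumulate(diff_forecast)]

# Forecast changes of +1, +2, -1 starting from a last observed level of 10
levels = integrate_forecast([1.0, 2.0, -1.0], last_observed=10.0)
print(levels)
```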

ARIMA MODELS IN PYTHON

Reconstructing original time series after differencing

ARIMA MODELS IN PYTHON

The ARIMA model

Take the difference

Fit ARMA model

Integrate forecast

Can we avoid doing so much work?

Yes!

ARIMA - Autoregressive Integrated Moving Average

ARIMA MODELS IN PYTHON

Using the ARIMA model
from statsmodels.tsa.arima.model import ARIMA
model = ARIMA(df, order=(p,d,q))

p - number of autoregressive lags

d - order of differencing

q - number of moving average lags

ARIMA(p, 0, q) = ARMA(p, q)

ARIMA MODELS IN PYTHON

Using the ARIMA model
# Create model
model = ARIMA(df, order=(2,1,1))
# Fit model
results = model.fit()
# Make forecast
mean_forecast = results.get_forecast(steps=10).predicted_mean

ARIMA MODELS IN PYTHON

Using the ARIMA model
# Make forecast
mean_forecast = results.get_forecast(steps=steps).predicted_mean

ARIMA MODELS IN PYTHON

Picking the difference order
from statsmodels.tsa.stattools import adfuller

# Test the original series
adf = adfuller(df.iloc[:,0])
print('ADF Statistic:', adf[0])
print('p-value:', adf[1])

ADF Statistic: -2.674

p-value: 0.0784

# Test the first difference of the series
adf = adfuller(df.diff().dropna().iloc[:,0])
print('ADF Statistic:', adf[0])
print('p-value:', adf[1])

ADF Statistic: -4.978

p-value: 2.44e-05

ARIMA MODELS IN PYTHON

Picking the difference order
model = ARIMA(df, order=(p,1,q))

ARIMA MODELS IN PYTHON

Let's practice!
ARIMA MODELS IN PYTHON
Intro to ACF and
PACF
ARIMA MODELS IN PYTHON

James Fulton
Climate informatics researcher
Motivation

ARIMA MODELS IN PYTHON

ACF and PACF
ACF - Autocorrelation Function

PACF - Partial autocorrelation function

ARIMA MODELS IN PYTHON

What is the ACF
lag-1 autocorrelation → $\mathrm{corr}(y_t, y_{t-1})$
lag-2 autocorrelation → $\mathrm{corr}(y_t, y_{t-2})$

...

lag-n autocorrelation → $\mathrm{corr}(y_t, y_{t-n})$
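To make the definition concrete, here is a plain-Python sketch of one common sample estimator of the lag-n autocorrelation. The function name and the alternating test series are illustrative assumptions; plotting helpers in libraries may differ in details such as bias adjustment.

```python
def autocorrelation(y, lag):
    """Sample autocorrelation of y at the given lag (lag >= 1)."""
    n = len(y)
    mean = sum(y) / n
    denom = sum((v - mean) ** 2 for v in y)
    numer = sum((y[t] - mean) * (y[t - lag] - mean) for t in range(lag, n))
    return numer / denom

series = [1.0, -1.0] * 4  # alternating series of length 8
lag1 = autocorrelation(series, 1)
lag2 = autocorrelation(series, 2)
print(lag1, lag2)
```

An alternating series is strongly negatively correlated with itself at lag 1 and positively correlated at lag 2, which is what the two values reflect.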

ARIMA MODELS IN PYTHON

What is the ACF

ARIMA MODELS IN PYTHON

What is the PACF

ARIMA MODELS IN PYTHON

Using ACF and PACF to choose model order

AR(2) model →

ARIMA MODELS IN PYTHON

Using ACF and PACF to choose model order

MA(2) model →

ARIMA MODELS IN PYTHON

Using ACF and PACF to choose model order

ARIMA MODELS IN PYTHON

Using ACF and PACF to choose model order

ARIMA MODELS IN PYTHON

Implementation in Python
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Create figure
fig, (ax1, ax2) = plt.subplots(2,1, figsize=(8,8))
# Make ACF plot
plot_acf(df, lags=10, zero=False, ax=ax1)
# Make PACF plot
plot_pacf(df, lags=10, zero=False, ax=ax2)

plt.show()

ARIMA MODELS IN PYTHON

Implementation in Python

ARIMA MODELS IN PYTHON

Over/under differencing and ACF and PACF

ARIMA MODELS IN PYTHON

Over/under differencing and ACF and PACF

ARIMA MODELS IN PYTHON

Let's practice!
ARIMA MODELS IN PYTHON
AIC and BIC
ARIMA MODELS IN PYTHON

James Fulton
Climate informatics researcher
AIC - Akaike information criterion
Lower AIC indicates a better model
AIC likes to choose simple models with lower order

ARIMA MODELS IN PYTHON

BIC - Bayesian information criterion
Very similar to AIC
Lower BIC indicates a better model

BIC likes to choose simple models with lower order

ARIMA MODELS IN PYTHON

AIC vs BIC
BIC favors simpler models than AIC
AIC is better at choosing predictive models

BIC is better at choosing a good explanatory model

ARIMA MODELS IN PYTHON

AIC and BIC in statsmodels
# Create model
model = ARIMA(df, order=(1,0,1))
# Fit model
results = model.fit()
# Print fit summary
print(results.summary())

Statespace Model Results

==============================================================================
Dep. Variable: y No. Observations: 1000
Model: SARIMAX(2, 0, 0) Log Likelihood -1399.704
Date: Fri, 10 May 2019 AIC 2805.407
Time: [Link] BIC 2820.131
Sample: 01-01-2013 HQIC 2811.003
- 09-27-2015
Covariance Type: opg
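The AIC and BIC figures in a summary like this one can be reproduced from the log likelihood with the standard formulas AIC = 2k - 2 ln L and BIC = k ln(n) - 2 ln L. The sketch below plugs in the numbers printed above; the parameter count k = 3 (two AR coefficients plus the noise variance) is an assumption inferred from the reported model order.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike information criterion."""
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion."""
    return n_params * math.log(n_obs) - 2 * log_likelihood

# Log likelihood and observation count read from the summary; k=3 is assumed
llf, k, n = -1399.704, 3, 1000
aic_value = aic(llf, k)
bic_value = bic(llf, k, n)
print(round(aic_value, 3), round(bic_value, 3))
```

Since ln(1000) is larger than 2, each extra parameter costs more under BIC than under AIC here, which is why BIC leans toward simpler models.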

ARIMA MODELS IN PYTHON

AIC and BIC in statsmodels
# Create model
model = ARIMA(df, order=(1,0,1))
# Fit model
results = model.fit()
# Print AIC and BIC
print('AIC:', results.aic)
print('BIC:', results.bic)

AIC: 2806.36
BIC: 2821.09

ARIMA MODELS IN PYTHON

Searching over AIC and BIC
# Loop over AR order
for p in range(3):
    # Loop over MA order
    for q in range(3):
        # Fit model
        model = ARIMA(df, order=(p,0,q))
        results = model.fit()
        # Print the model order and the AIC/BIC values
        print(p, q, results.aic, results.bic)

0 0 2900.13 2905.04
0 1 2828.70 2838.52
0 2 2806.69 2821.42
1 0 2810.25 2820.06
1 1 2806.37 2821.09
1 2 2807.52 2827.15
...

ARIMA MODELS IN PYTHON

Searching over AIC and BIC
order_aic_bic = []
# Loop over AR order
for p in range(3):
    # Loop over MA order
    for q in range(3):
        # Fit model
        model = ARIMA(df, order=(p,0,q))
        results = model.fit()
        # Add order and scores to list
        order_aic_bic.append((p, q, results.aic, results.bic))

# Make DataFrame of model order and AIC/BIC scores
order_df = pd.DataFrame(order_aic_bic, columns=['p','q', 'aic', 'bic'])

ARIMA MODELS IN PYTHON

Searching over AIC and BIC
# Sort by AIC
print(order_df.sort_values('aic'))

   p  q      aic      bic
7  2  1  2804.54  2824.17
6  2  0  2805.41  2820.13
4  1  1  2806.37  2821.09
2  0  2  2806.69  2821.42
...

# Sort by BIC
print(order_df.sort_values('bic'))

   p  q      aic      bic
3  1  0  2810.25  2820.06
6  2  0  2805.41  2820.13
4  1  1  2806.37  2821.09
2  0  2  2806.69  2821.42
...
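Picking the winning order from such a table does not need a DataFrame. The sketch below uses only the rows that are visible in the search output on these slides (elided rows are left out rather than guessed) and selects the minimum under each criterion.

```python
# (p, q, aic, bic) rows copied from the printed search results; elided rows omitted
order_aic_bic = [
    (0, 0, 2900.13, 2905.04),
    (0, 1, 2828.70, 2838.52),
    (0, 2, 2806.69, 2821.42),
    (1, 0, 2810.25, 2820.06),
    (1, 1, 2806.37, 2821.09),
    (1, 2, 2807.52, 2827.15),
    (2, 0, 2805.41, 2820.13),
    (2, 1, 2804.54, 2824.17),
]

# Lower is better for both criteria
best_by_aic = min(order_aic_bic, key=lambda row: row[2])
best_by_bic = min(order_aic_bic, key=lambda row: row[3])
print('Best by AIC:', best_by_aic[:2])
print('Best by BIC:', best_by_bic[:2])
```

The two criteria disagree on these rows: AIC prefers the larger (2, 1) model while BIC prefers the smaller (1, 0) model, consistent with BIC favoring simpler models.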

ARIMA MODELS IN PYTHON

Non-stationary model orders
# Fit model
model = ARIMA(df, order=(2,0,1))
results = model.fit()

ValueError: Non-stationary starting autoregressive parameters found with `enforce_stationarity` set to True.

ARIMA MODELS IN PYTHON

When certain orders don't work
# Loop over AR order
for p in range(3):
    # Loop over MA order
    for q in range(3):
        try:
            # Fit model
            model = ARIMA(df, order=(p,0,q))
            results = model.fit()

            # Print the model order and the AIC/BIC values
            print(p, q, results.aic, results.bic)
        except Exception:
            # Print AIC and BIC as None when the fit fails
            print(p, q, None, None)

ARIMA MODELS IN PYTHON

Let's practice!
ARIMA MODELS IN PYTHON
Model diagnostics
ARIMA MODELS IN PYTHON

James Fulton
Climate informatics researcher
Introduction to model diagnostics
How good is the final model?

ARIMA MODELS IN PYTHON

Residuals
# Fit model
model = ARIMA(df, order=(p,d,q))
results = model.fit()
# Assign residuals to variable
residuals = results.resid

2013-01-23 1.013129
2013-01-24 0.114055
2013-01-25 0.430698
2013-01-26 -1.247046
2013-01-27 -0.499565
... ...

ARIMA MODELS IN PYTHON

Mean absolute error
How far are the predictions from the real values?

mae = np.mean(np.abs(residuals))
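As a small worked example, the mean absolute error can be computed by hand for the five residuals printed on the earlier slide. This sketch covers only those five values, so the result is illustrative rather than the MAE of the full model.

```python
# The five residuals shown in the sample output above
residuals = [1.013129, 0.114055, 0.430698, -1.247046, -0.499565]

# Average of the absolute residuals
mae = sum(abs(r) for r in residuals) / len(residuals)
print(mae)
```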

ARIMA MODELS IN PYTHON

Plot diagnostics
If the model fits well the residuals will be
white Gaussian noise

# Create the 4 diagnostics plots

results.plot_diagnostics()
plt.show()

ARIMA MODELS IN PYTHON

Residuals plot

ARIMA MODELS IN PYTHON

Histogram plus estimated density

ARIMA MODELS IN PYTHON

Normal Q-Q

ARIMA MODELS IN PYTHON

Correlogram

ARIMA MODELS IN PYTHON

Summary statistics
print(results.summary())

...
===================================================================================
Ljung-Box (Q): 32.10 Jarque-Bera (JB): 0.02
Prob(Q): 0.81 Prob(JB): 0.99
Heteroskedasticity (H): 1.28 Skew: -0.02
Prob(H) (two-sided): 0.21 Kurtosis: 2.98
===================================================================================

Prob(Q) - p-value for null hypothesis that residuals are uncorrelated

Prob(JB) - p-value for null hypothesis that residuals are normal

ARIMA MODELS IN PYTHON

Let's practice!
ARIMA MODELS IN PYTHON
Box-Jenkins method
ARIMA MODELS IN PYTHON

James Fulton
Climate informatics researcher
The Box-Jenkins method
From raw data → production model

identification

estimation
model diagnostics

ARIMA MODELS IN PYTHON

Identification
Is the time series stationary?
What differencing will make it stationary?

What transforms will make it stationary?

What values of p and q are most

promising?

ARIMA MODELS IN PYTHON

Identification tools
Plot the time series
df.plot()

Use augmented Dickey-Fuller test

adfuller()

Use transforms and/or differencing

[Link]() , [Link]() , [Link]()

Plot ACF/PACF
plot_acf() , plot_pacf()

ARIMA MODELS IN PYTHON

Estimation
Use the data to train the model coefficients
Done for us using model.fit()

Choose between models using AIC and BIC

results.aic , results.bic

ARIMA MODELS IN PYTHON

Model diagnostics
Are the residuals uncorrelated?

Are the residuals normally distributed?

results.plot_diagnostics()

results.summary()

ARIMA MODELS IN PYTHON

Decision

ARIMA MODELS IN PYTHON

Repeat
We go through the process again with more
information

Find a better model

ARIMA MODELS IN PYTHON

Production
Ready to make forecasts
results.get_forecast()

ARIMA MODELS IN PYTHON

Box-Jenkins

ARIMA MODELS IN PYTHON

Let's practice!
ARIMA MODELS IN PYTHON
Seasonal time series
ARIMA MODELS IN PYTHON

James Fulton
Climate informatics researcher
Seasonal data
Has predictable and repeated patterns
Repeats after any amount of time

ARIMA MODELS IN PYTHON

Seasonal decomposition

time series = trend + seasonal + residual

ARIMA MODELS IN PYTHON

Seasonal decomposition using statsmodels
# Import
from statsmodels.tsa.seasonal import seasonal_decompose

# Decompose data
decomp_results = seasonal_decompose(df['IPG3113N'], period=12)

type(decomp_results)

statsmodels.tsa.seasonal.DecomposeResult

ARIMA MODELS IN PYTHON

Seasonal decomposition using statsmodels
# Plot decomposed data
decomp_results.plot()
plt.show()

ARIMA MODELS IN PYTHON

Finding seasonal period using ACF

ARIMA MODELS IN PYTHON

Identifying seasonal data using ACF

ARIMA MODELS IN PYTHON

Detrending time series
# Subtract long rolling average over N steps
df = df - df.rolling(N).mean()
# Drop NaN values
df = df.dropna()

ARIMA MODELS IN PYTHON

Identifying seasonal data using ACF
# Create figure
fig, ax = plt.subplots(1,1, figsize=(8,4))

# Plot ACF
plot_acf(df.dropna(), ax=ax, lags=25, zero=False)
plt.show()

ARIMA MODELS IN PYTHON

ARIMA models and seasonal data

ARIMA MODELS IN PYTHON

Let's practice!
ARIMA MODELS IN PYTHON
SARIMA models
ARIMA MODELS IN PYTHON

James Fulton
Climate informatics researcher
The SARIMA model
Seasonal ARIMA = SARIMA

SARIMA$(p,d,q)(P,D,Q)_S$

Non-seasonal orders
p: autoregressive order
d: differencing order
q: moving average order

Seasonal orders
P: seasonal autoregressive order
D: seasonal differencing order
Q: seasonal moving average order
S: number of time steps per cycle

ARIMA MODELS IN PYTHON

The SARIMA model
ARIMA(2,0,1) model :
$y_t = a_1 y_{t-1} + a_2 y_{t-2} + m_1 \epsilon_{t-1} + \epsilon_t$

SARIMA$(0,0,0)(2,0,1)_7$ model :
$y_t = a_7 y_{t-7} + a_{14} y_{t-14} + m_7 \epsilon_{t-7} + \epsilon_t$

ARIMA MODELS IN PYTHON

Fitting a SARIMA model
# Imports
from statsmodels.tsa.statespace.sarimax import SARIMAX
# Instantiate model
model = SARIMAX(df, order=(p,d,q), seasonal_order=(P,D,Q,S))
# Fit model
results = model.fit()

ARIMA MODELS IN PYTHON

Seasonal differencing
Subtract the time series value of one season ago

$\Delta y_t = y_t - y_{t-S}$

# Take the seasonal difference

df_diff = df.diff(S)
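Seasonal differencing can be checked on a short list by hand. The sketch below is a plain-Python illustration with a made-up series and period; unlike the pandas call, which keeps the first S rows as NaN, it simply returns the values that can be computed.

```python
def seasonal_difference(y, period):
    """Return y[t] - y[t - period] for every t where both values exist."""
    return [y[t] - y[t - period] for t in range(period, len(y))]

# Two cycles of a period-4 pattern, with the second cycle shifted up by 1
series = [1, 2, 3, 4, 2, 3, 4, 5]
diffed = seasonal_difference(series, period=4)
print(diffed)
```

The repeating pattern cancels out and only the cycle-to-cycle change remains.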

ARIMA MODELS IN PYTHON

Differencing for SARIMA models

Time series

ARIMA MODELS IN PYTHON

Differencing for SARIMA models

First difference of time series

ARIMA MODELS IN PYTHON

Differencing for SARIMA models

First difference and first seasonal difference of time series

ARIMA MODELS IN PYTHON

Finding p and q

ARIMA MODELS IN PYTHON

Finding P and Q

ARIMA MODELS IN PYTHON

Plotting seasonal ACF and PACF
# Create figure
fig, (ax1, ax2) = plt.subplots(2,1)

# Plot seasonal ACF
plot_acf(df_diff, lags=[12,24,36,48,60,72], ax=ax1)

# Plot seasonal PACF
plot_pacf(df_diff, lags=[12,24,36,48,60,72], ax=ax2)

plt.show()

ARIMA MODELS IN PYTHON

Let's practice!
ARIMA MODELS IN PYTHON
Automation and
saving
ARIMA MODELS IN PYTHON

James Fulton
Climate informatics researcher
Searching over model orders
import pmdarima as pm

results = pm.auto_arima(df)

Performing stepwise search to minimize aic

ARIMA(2,0,2)(1,1,1)[12] intercept : AIC=inf, Time=3.33 sec
ARIMA(0,0,0)(0,1,0)[12] intercept : AIC=2648.467, Time=0.062 sec
ARIMA(1,0,0)(1,1,0)[12] intercept : AIC=2279.986, Time=1.171 sec

...

ARIMA(3,0,3)(1,1,1)[12] intercept : AIC=2173.508, Time=12.487 sec

ARIMA(3,0,3)(0,1,0)[12] intercept : AIC=2297.305, Time=2.087 sec

Best model: ARIMA(3,0,3)(1,1,1)[12]

Total fit time: 245.812 seconds

ARIMA MODELS IN PYTHON

pmdarima results
print(results.summary())

results.plot_diagnostics()

ARIMA MODELS IN PYTHON

Non-seasonal search parameters
results = pm.auto_arima( df, # data
d=0, # non-seasonal difference order
start_p=1, # initial guess for p
start_q=1, # initial guess for q
max_p=3, # max value of p to test
max_q=3, # max value of q to test
)

1 [Link]

ARIMA MODELS IN PYTHON

Seasonal search parameters
results = pm.auto_arima( df, # data
... , # non-seasonal arguments
seasonal=True, # is the time series seasonal
m=7, # the seasonal period
D=1, # seasonal difference order
start_P=1, # initial guess for P
start_Q=1, # initial guess for Q
max_P=2, # max value of P to test
max_Q=2, # max value of Q to test
)

ARIMA MODELS IN PYTHON

Other parameters
results = pm.auto_arima( df, # data
... , # model order parameters
information_criterion='aic', # used to select best model
trace=True, # print results whilst training
error_action='ignore', # ignore orders that don't work
stepwise=True, # apply intelligent order search
)

ARIMA MODELS IN PYTHON

Saving model objects
# Import
import joblib

# Select a filepath
filepath ='localpath/great_model.pkl'

# Save model to filepath

joblib.dump(model_results_object, filepath)

ARIMA MODELS IN PYTHON

Saving model objects
# Select a filepath
filepath ='localpath/great_model.pkl'

# Load model object from filepath

model_results_object = joblib.load(filepath)

ARIMA MODELS IN PYTHON

Updating model
# Add new observations and update parameters
model_results_object.update(df_new)

ARIMA MODELS IN PYTHON

Update comparison

ARIMA MODELS IN PYTHON

Let's practice!
ARIMA MODELS IN PYTHON
SARIMA and Box-
Jenkins
ARIMA MODELS IN PYTHON

James Fulton
Climate informatics researcher
Box-Jenkins

ARIMA MODELS IN PYTHON

Box-Jenkins with seasonal data
Determine if time series is seasonal
Find seasonal period

Find transforms to make data stationary

Seasonal and non-seasonal differencing

Other transforms

ARIMA MODELS IN PYTHON

Mixed differencing
D should be 0 or 1

d + D should be 0-2

ARIMA MODELS IN PYTHON

Weak vs strong seasonality

Weak seasonal pattern: use seasonal differencing if necessary

Strong seasonal pattern: always use seasonal differencing

ARIMA MODELS IN PYTHON

Additive vs multiplicative seasonality

Additive series = trend + season: proceed as usual with differencing

Multiplicative series = trend x season: apply log transform first - [Link]

ARIMA MODELS IN PYTHON

Multiplicative to additive seasonality

ARIMA MODELS IN PYTHON

Let's practice!
ARIMA MODELS IN PYTHON
Congratulations!
ARIMA MODELS IN PYTHON

James Fulton
Climate informatics researcher
The SARIMAX model

ARIMA MODELS IN PYTHON

Time series modeling framework
Test for stationarity and seasonality

Find promising model orders

Fit models and narrow selection with

AIC/BIC

Perform model diagnostics tests

Make forecasts

Save and update models

ARIMA MODELS IN PYTHON

Further steps
Fit data created using arma_generate_sample()
Tackle real world data! Either your own or examples from statsmodels

More time series courses here

1 [Link]

ARIMA MODELS IN PYTHON

Good luck!
ARIMA MODELS IN PYTHON
Timeseries kinds and
applications
M A C H I N E L E A R N I N G F O R T I M E S E R I E S D ATA I N P Y T H O N

Chris Holdgraf
Fellow, Berkeley Institute for Data
Science
Time Series

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON

What makes a time series?
Datapoint:  1     34    12    54    76    40

Timepoint:  2:00  2:01  2:02  2:03  2:04  2:05

Timepoint:  Jan   Feb   March April May   Jun

Timepoint:  1e-9  2e-9  3e-9  4e-9  5e-9  6e-9

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON

Reading in a time series with Pandas
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv('[Link]')
data.head()

date symbol close volume

0 2010-01-04 AAPL 214.009998 123432400.0
46 2010-01-05 AAPL 214.379993 150476200.0
92 2010-01-06 AAPL 210.969995 138040000.0
138 2010-01-07 AAPL 210.580000 119282800.0
184 2010-01-08 AAPL 211.980005 111902700.0

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON

Plotting a pandas timeseries
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(12, 6))
data.plot('date', 'close', ax=ax)
ax.set(title="AAPL daily closing price")

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON

A timeseries plot

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON

Why machine learning?
We can use really big data and really complicated data

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON

Why machine learning?
We can...

Predict the future

Automate this process

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON

Why combine these two?

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON

A machine learning pipeline
Feature extraction

Model fitting

Prediction and validation

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON

Let's practice!
M A C H I N E L E A R N I N G F O R T I M E S E R I E S D ATA I N P Y T H O N
Machine learning
basics
M A C H I N E L E A R N I N G F O R T I M E S E R I E S D ATA I N P Y T H O N

Chris Holdgraf
Fellow, Berkeley Institute for Data
Science
Always begin by looking at your data
array.shape

(10, 5)

array[:3]

array([[ 0.735528 , 1.00122818, -0.28315978],

[-0.94478393, 0.18658748, -0.00241224],
[-0.74822942, -1.46636618, 0.69835096]])

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON

Always begin by looking at your data
df.head()

col1 col2 col3

0 0.735528 1.001228 -0.283160
1 -0.944784 0.186587 -0.002412
2 -0.748229 -1.466366 0.698351
3 1.038589 -0.171248 0.831457
4 -0.161904 0.003972 -0.321933

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON

Always visualize your data
Make sure it looks the way you'd expect.

# Using matplotlib
fig, ax = plt.subplots()
ax.plot(...)

# Using pandas
fig, ax = plt.subplots()
df.plot(..., ax=ax)

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON

Scikit-learn
Scikit-learn is the most popular machine learning library in Python

from sklearn.svm import LinearSVC

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON

Preparing data for scikit-learn
scikit-learn expects a particular structure of data:

(samples, features)

Make sure that your data is at least two-dimensional

Make sure the first dimension is samples

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON

If your data is not shaped properly
If the axes are swapped:

array.T.shape

(10, 3)

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON

If your data is not shaped properly
If we're missing an axis, use .reshape() :

array.shape

(10,)

array.reshape(-1, 1).shape

(10, 1)

-1 will automatically fill that axis with remaining values

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON

Fitting a model with scikit-learn
# Import a support vector classifier
from sklearn.svm import LinearSVC

# Instantiate this model

model = LinearSVC()

# Fit the model on some data

model.fit(X, y)

It is common for y to be of shape (samples, 1)

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON

Investigating the model
# There is one coefficient per input feature
model.coef_

array([[ 0.69417875, -0.5289162 ]])

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON

Predicting with a fit model
# Generate predictions
predictions = model.predict(X_test)

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON

Let's practice
M A C H I N E L E A R N I N G F O R T I M E S E R I E S D ATA I N P Y T H O N
Combining
timeseries data with
machine learning
M A C H I N E L E A R N I N G F O R T I M E S E R I E S D ATA I N P Y T H O N

Chris Holdgraf
Fellow, Berkeley Institute for Data
Science
Getting to know our data
The datasets that we'll use in this course are all freely-available online

There are many datasets available to download on the web, the ones we'll use come from
Kaggle

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON

The Heartbeat Acoustic Data
Many recordings of heart sounds from different patients

Some had normally-functioning hearts, others had abnormalities

Data comes in the form of audio files + labels for each file

Can we find the "abnormal" heart beats?

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON

Loading auditory data
from glob import glob
files = glob('data/heartbeat-sounds/files/*.wav')

print(files)

['data/heartbeat-sounds/proc/files/murmur__201101051104.wav',
...
'data/heartbeat-sounds/proc/files/murmur__201101051114.wav']

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON

Reading in auditory data
import librosa as lr
# `load` accepts a path to an audio file
audio, sfreq = lr.load('data/heartbeat-sounds/proc/files/murmur__201101051104.wav')

print(sfreq)

2205

In this case, the sampling frequency is 2205 , meaning there are 2205 samples per second

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON

Inferring time from samples
If we know the sampling rate of a timeseries, then we know the timestamp of each
datapoint relative to the first datapoint

Note: this assumes the sampling rate is fixed and no data points are lost

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON

Creating a time array (I)
Create an array of indices, one for each sample, and divide by the sampling frequency

indices = np.arange(0, len(audio))

time = indices / sfreq

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON

Creating a time array (II)
Find the time stamp for the last (N-1th) data point. Then use linspace() to interpolate from zero
to that time, with one point per sample

final_time = (len(audio) - 1) / sfreq

time = np.linspace(0, final_time, len(audio))
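The two ways of building a time array should give the same timestamps as long as both produce one value per sample. The following plain-Python sketch checks this on a tiny example; the function names and the values (5 samples at 2 samples per second) are hypothetical.

```python
def time_from_indices(n_samples, sfreq):
    """Timestamps as sample index divided by the sampling frequency."""
    return [i / sfreq for i in range(n_samples)]

def time_from_endpoints(n_samples, sfreq):
    """Timestamps as n_samples evenly spaced points from 0 to the final time."""
    if n_samples == 1:
        return [0.0]
    final_time = (n_samples - 1) / sfreq
    step = final_time / (n_samples - 1)
    return [i * step for i in range(n_samples)]

t1 = time_from_indices(n_samples=5, sfreq=2)
t2 = time_from_endpoints(n_samples=5, sfreq=2)
print(t1, t2)
```

The key detail is that the number of evenly spaced points equals the number of samples, not the sampling frequency.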

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON

The New York Stock Exchange dataset
This dataset consists of company stock values for 10 years

Can we detect any patterns in historical records that allow us to predict the value of
companies in the future?

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON

Looking at the data
data = pd.read_csv('path/to/[Link]')

data.columns

Index(['date', 'symbol', 'close', 'volume'], dtype='object')

data.head()

date symbol close volume

0 2010-01-04 AAPL 214.009998 123432400.0
1 2010-01-04 ABT 54.459951 10829000.0
2 2010-01-04 AIG 29.889999 7750900.0
3 2010-01-04 AMAT 14.300000 18615100.0
4 2010-01-04 ARNC 16.650013 11512100.0

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON

Timeseries with Pandas DataFrames
We can investigate the object type of each column by accessing the dtypes attribute

df['date'].dtypes

0 object
1 object
2 object
dtype: object

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON

Converting a column to a time series
To ensure that a column within a DataFrame is treated as time series, use the
to_datetime() function

df['date'] = pd.to_datetime(df['date'])

df['date']

0 2017-01-01
1 2017-01-02
2 2017-01-03
Name: date, dtype: datetime64[ns]

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON

Let's practice!
M A C H I N E L E A R N I N G F O R T I M E S E R I E S D ATA I N P Y T H O N
Classification and
feature engineering
M A C H I N E L E A R N I N G F O R T I M E S E R I E S D ATA I N P Y T H O N

Chris Holdgraf
Fellow, Berkeley Institute for Data
Science
Always visualize raw data before fitting models

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON

Visualize your timeseries data!
ixs = np.arange(audio.shape[-1])
time = ixs / sfreq
fig, ax = plt.subplots()
ax.plot(time, audio)

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON

What features to use?
Using raw timeseries data is too noisy for classification

We need to calculate features!

An easy start: summarize your audio data

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON

Calculating multiple features
print(audio.shape)
# (n_files, time)

(20, 7000)

means = np.mean(audio, axis=-1)

maxs = np.max(audio, axis=-1)
stds = np.std(audio, axis=-1)

print(means.shape)
# (n_files,)

(20,)

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON

Fitting a classifier with scikit-learn
We've just collapsed a 2-D dataset (samples x time) into several features of a 1-D dataset
(samples)

We can combine each feature, and use it as an input to a model

If we have a label for each sample, we can use scikit-learn to create and fit a classifier

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON

Preparing your features for scikit-learn
# Import a linear classifier
from sklearn.svm import LinearSVC

# Note that the features are stacked into shape (samples, features) for scikit-learn

X = np.column_stack([means, maxs, stds])
y = labels.reshape(-1, 1)
model = LinearSVC()
model.fit(X, y)

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON

Scoring your scikit-learn model
from sklearn.metrics import accuracy_score

# Different input data

predictions = model.predict(X_test)

# Score our model with % correct

# Manually
percent_score = sum(predictions == labels_test) / len(labels_test)
# Using a sklearn scorer
percent_score = accuracy_score(labels_test, predictions)

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON

Let's practice!
M A C H I N E L E A R N I N G F O R T I M E S E R I E S D ATA I N P Y T H O N
Improving the
features we use for
classification
M A C H I N E L E A R N I N G F O R T I M E S E R I E S D ATA I N P Y T H O N

Chris Holdgraf
Fellow, Berkeley Institute for Data
Science
The auditory envelope
Smooth the data to calculate the auditory envelope

Related to the total amount of audio energy present at each moment of time

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON

Smoothing over time
Instead of averaging over all time, we can do a local average

This is called smoothing your timeseries

It removes short-term noise, while retaining the general pattern

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON

Smoothing your data

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON

Calculating a rolling window statistic
# Audio is a Pandas DataFrame
print(audio.shape)
# (n_times, n_audio_files)

(5000, 20)

# Smooth our data by taking the rolling mean in a window of 50 samples

window_size = 50
windowed = audio.rolling(window=window_size)
audio_smooth = windowed.mean()

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON

Calculating the auditory envelope
First rectify your audio, then smooth it

audio_rectified = audio.apply(np.abs)
audio_envelope = audio_rectified.rolling(50).mean()
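The rectify-then-smooth idea can be traced on a handful of numbers. This is a minimal plain-Python sketch with hypothetical helper names; it returns only the fully filled windows, whereas a pandas rolling mean would keep the first `window - 1` positions as NaN.

```python
def rolling_mean(values, window):
    """Mean over each full trailing window of the given size."""
    return [sum(values[i - window + 1 : i + 1]) / window
            for i in range(window - 1, len(values))]

def envelope(audio, window):
    """Rectify the signal (absolute value), then smooth it with a rolling mean."""
    rectified = [abs(v) for v in audio]
    return rolling_mean(rectified, window)

env = envelope([2.0, -4.0, 6.0, -8.0], window=2)
print(env)
```

Without the rectification step the positive and negative swings would largely cancel in the average, and the smoothed signal would hover near zero regardless of loudness.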

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON
Feature engineering the envelope
# Calculate several features of the envelope, one per sound
envelope_mean = np.mean(audio_envelope, axis=0)
envelope_std = np.std(audio_envelope, axis=0)
envelope_max = np.max(audio_envelope, axis=0)

# Create our training data for a classifier

X = np.column_stack([envelope_mean, envelope_std, envelope_max])

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON

Preparing our features for scikit-learn
X = np.column_stack([envelope_mean, envelope_std, envelope_max])
y = labels.reshape(-1, 1)

MACHINE LEARNING FOR TIME SERIES DATA IN PYTHON

Cross validation for classification
cross_val_score automates the process of:
Splitting data into training / validation sets

Fitting the model on training data

Scoring it on validation data

Repeating this process

Using cross_val_score
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

model = LinearSVC()
scores = cross_val_score(model, X, y, cv=3)
print(scores)

[0.60911642 0.59975305 0.61404035]
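
For a version that runs on its own, here is a small sketch. It assumes synthetic, well-separated features in place of the envelope statistics, so the scores will be far higher than the ones printed above:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Synthetic features: two well-separated classes (an assumption for illustration)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, size=(60, 3)),
               rng.normal(2, 1, size=(60, 3))])
y = np.array([0] * 60 + [1] * 60)

# Split into 3 folds, fit on the training part, score on the validation part
model = LinearSVC()
scores = cross_val_score(model, X, y, cv=3)
print(scores)  # one accuracy value per fold
```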

Auditory features: The Tempogram
We can summarize more complex temporal information with timeseries-specific functions

librosa is a great library for auditory and timeseries feature engineering

Here we'll calculate the tempogram, which estimates the tempo of a sound over time

We can calculate summary statistics of tempo in the same way that we can for the
envelope

Computing the tempogram
# Import librosa and calculate the tempo of a 1-D sound array
import librosa as lr
audio_tempo = lr.beat.tempo(audio, sr=sfreq,
                            hop_length=2**6, aggregate=None)

Let's practice!

The spectrogram - spectral changes to sound over time

Fourier transforms
Timeseries data can be described as a combination of quickly-changing things and slowly-
changing things

At each moment in time, we can describe the relative presence of fast- and slow-moving
components

The simplest way to do this is called a Fourier Transform

This converts a single timeseries into an array that describes the timeseries as a
combination of oscillations
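
As a minimal sketch of this idea, assuming a synthetic signal built from a slow and a fast sine wave (the sampling rate and frequencies are made up for illustration), NumPy's FFT recovers the two oscillation frequencies:

```python
import numpy as np

# Synthetic signal: a slow 5 Hz oscillation plus a weaker, fast 40 Hz one
sfreq = 200                      # samples per second (assumed)
times = np.arange(400) / sfreq   # 2 seconds of data
signal = np.sin(2 * np.pi * 5 * times) + 0.5 * np.sin(2 * np.pi * 40 * times)

# FFT of a real-valued signal: one complex coefficient per frequency
spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(len(signal), d=1 / sfreq)
amplitudes = np.abs(spectrum)

# The two largest amplitudes sit at the two oscillation frequencies
top_freqs = np.sort(freqs[np.argsort(amplitudes)[-2:]])
print(top_freqs)
```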

A Fourier Transform (FFT)

Spectrograms: combinations of windowed Fourier transforms
A spectrogram is a collection of windowed Fourier transforms over time

Similar to how a rolling mean was calculated:

1. Choose a window size and shape

2. At a timepoint, calculate the FFT for that window

3. Slide the window over by one

4. Aggregate the results

Called a Short-Time Fourier Transform (STFT)
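
The four steps above can be sketched with NumPy alone. This is a simplified illustration, not librosa's implementation: `simple_stft` is a hypothetical helper, the Hann window and the hop of 16 samples (rather than 1) are assumptions, and the signal is synthetic.

```python
import numpy as np

def simple_stft(signal, window_size=128, hop_length=16):
    """Windowed FFTs of a 1-D signal; returns (n_frequencies, n_windows)."""
    window = np.hanning(window_size)                  # 1. window size and shape
    starts = range(0, len(signal) - window_size + 1, hop_length)
    frames = [np.fft.rfft(window * signal[s:s + window_size])  # 2. FFT per window
              for s in starts]                        # 3. slide the window along
    return np.abs(np.array(frames)).T                 # 4. aggregate the results

# Synthetic signal (assumed): 10 Hz in the first half, 50 Hz in the second
sfreq = 256
times = np.arange(2 * sfreq) / sfreq
signal = np.where(times < 1, np.sin(2 * np.pi * 10 * times),
                  np.sin(2 * np.pi * 50 * times))

spec = simple_stft(signal)
freqs = np.fft.rfftfreq(128, d=1 / sfreq)

# Dominant frequency in the first and in the last window
first_peak = freqs[spec[:, 0].argmax()]
last_peak = freqs[spec[:, -1].argmax()]
print(spec.shape, first_peak, last_peak)
```

A single FFT over the whole signal would show both frequencies but not when each occurs; the windowed version keeps that timing information.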

Calculating the STFT
We can calculate the STFT with librosa

There are several parameters we can tweak (such as window size)

For our purposes, we'll convert into decibels which normalizes the average values of all
frequencies

We can then visualize it with the specshow() function

Calculating the STFT with code
# Import the functions we'll use for the STFT
from librosa.core import stft, amplitude_to_db
from librosa.display import specshow
import matplotlib.pyplot as plt

# Calculate our STFT
HOP_LENGTH = 2**4
SIZE_WINDOW = 2**7
audio_spec = stft(audio, hop_length=HOP_LENGTH, n_fft=SIZE_WINDOW)

# Convert into decibels for visualization
spec_db = amplitude_to_db(audio_spec)

# Visualize
fig, ax = plt.subplots()
specshow(spec_db, sr=sfreq, x_axis='time',
         y_axis='hz', hop_length=HOP_LENGTH, ax=ax)

Spectral feature engineering
Each timeseries has a different spectral pattern.

We can calculate these spectral patterns by analyzing the spectrogram.

For example, spectral bandwidth and spectral centroids describe where most of the energy
is at each moment in time

Calculating spectral features
# Calculate the spectral centroid and bandwidth for the spectrogram
bandwidths = lr.feature.spectral_bandwidth(S=spec)[0]
centroids = lr.feature.spectral_centroid(S=spec)[0]

# Display these features on top of the spectrogram
fig, ax = plt.subplots()
specshow(spec, x_axis='time', y_axis='hz', hop_length=HOP_LENGTH, ax=ax)
ax.plot(times_spec, centroids)
ax.fill_between(times_spec, centroids - bandwidths / 2,
                centroids + bandwidths / 2, alpha=0.5)

Combining spectral and temporal features in a
classifier
centroids_all = []
bandwidths_all = []
for spec in spectrograms:
    bandwidths = lr.feature.spectral_bandwidth(S=lr.db_to_amplitude(spec))
    centroids = lr.feature.spectral_centroid(S=lr.db_to_amplitude(spec))
    # Calculate the mean spectral bandwidth
    bandwidths_all.append(np.mean(bandwidths))
    # Calculate the mean spectral centroid
    centroids_all.append(np.mean(centroids))

# Create our X matrix
X = np.column_stack([means, stds, maxs, tempo_mean,
                     tempo_max, tempo_std, bandwidths_all, centroids_all])

Let's practice!

Predicting data over time

Classification vs. Regression
CLASSIFICATION
classification_model.predict(X_test)
array([0, 1, 1, 0])

REGRESSION
regression_model.predict(X_test)
array([0.2, 1.4, 3.6, 0.6])

Correlation and regression
Regression is similar to calculating correlation, with some key differences
Regression: A process that results in a formal model of the data

Correlation: A statistic that describes the data. Less information than regression model.

Correlation between variables often changes over time
Timeseries often have patterns that change over time

Two timeseries that seem correlated at one moment may not remain so over time

Visualizing relationships between timeseries
fig, axs = plt.subplots(1, 2)

# Make a line plot for each timeseries
axs[0].plot(x, c='k', lw=3, alpha=.2)
axs[0].plot(y)
axs[0].set(xlabel='time', title='X values = time')

# Encode time as color in a scatterplot
axs[1].scatter(x_long, y_long, c=np.arange(len(x_long)), cmap='viridis')
axs[1].set(xlabel='x', ylabel='y', title='Color = time')

Visualizing two timeseries

Regression models with scikit-learn
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X, y)
model.predict(X)

Visualize predictions with scikit-learn
from sklearn.linear_model import Ridge

# fig, ax and the colormap are assumed here; the slide does not define them
fig, ax = plt.subplots()
cmap = plt.cm.viridis

alphas = [.1, 1e2, 1e3]
ax.plot(y_test, color='k', alpha=.3, lw=3)
for ii, alpha in enumerate(alphas):
    y_predicted = Ridge(alpha=alpha).fit(X_train, y_train).predict(X_test)
    ax.plot(y_predicted, c=cmap(ii / len(alphas)))
ax.legend(['True values', 'Model 1', 'Model 2', 'Model 3'])
ax.set(xlabel="Time")
Scoring regression models
Two most common methods:
Correlation ($r$)

Coefficient of determination ($R^2$)

Coefficient of Determination ($R^2$)
The value of $R^2$ is bounded on the top by 1, and can be infinitely low

Values closer to 1 mean the model does a better job of predicting outputs

$R^2 = 1 - \frac{\text{error(model)}}{\text{variance(testdata)}}$

$R^2$ in scikit-learn
from sklearn.metrics import r2_score
# r2_score expects the true values first, then the predictions
print(r2_score(y_test, y_predicted))

0.08
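
To make the formula concrete, here is a minimal NumPy sketch. `r_squared` is a hypothetical helper and the numbers are made up; it takes "error" as the sum of squared prediction errors and "variance" as the total squared deviation of the true values from their mean, which is the usual definition.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """1 - (sum of squared errors) / (total variation of the true values)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    error = np.sum((y_true - y_pred) ** 2)
    variance = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - error / variance

y_test = np.array([1.0, 2.0, 3.0, 4.0])  # made-up values for illustration

perfect = r_squared(y_test, y_test)                       # model error is zero
baseline = r_squared(y_test, np.full(4, y_test.mean()))   # always predicts the mean
poor = r_squared(y_test, y_test[::-1])                    # worse than the mean
print(perfect, baseline, poor)
```

The three cases show the bounds described above: a perfect model reaches 1, predicting the mean gives 0, and a model worse than the mean goes negative.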

Let's practice!

Cleaning and improving your data

Data is messy
Real-world data is often messy

The two most common problems are missing data and outliers

This often happens because of human error, machine sensor malfunction, database failures, etc.

Visualizing your raw data makes it easier to spot these problems

What messy data looks like

Interpolation: using time to fill in missing data
A common way to deal with missing data is to interpolate missing values

With timeseries data, you can use time to assist in interpolation.

In this case, interpolation means using the known values on either side of a gap in the data to make assumptions about what's missing.

Interpolation in Pandas
# Return a boolean that notes where missing values are
missing = prices.isna()

# Interpolate linearly within missing windows
prices_interp = prices.interpolate('linear')

# Plot the interpolated data in red and the data w/ missing values in black
ax = prices_interp.plot(c='r')
prices.plot(c='k', ax=ax, lw=2)
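
A small self-contained sketch of linear interpolation, assuming a hypothetical six-day price series with a two-day gap:

```python
import numpy as np
import pandas as pd

# Hypothetical daily prices with a two-day gap
index = pd.date_range('2010-01-04', periods=6, freq='D')
prices = pd.Series([10.0, 11.0, np.nan, np.nan, 14.0, 15.0], index=index)

# Boolean mask of the missing values
missing = prices.isna()

# Fill the gap on a straight line between the known neighbors
prices_interp = prices.interpolate('linear')
print(prices_interp[missing])
```

The gap is filled evenly between 11 and 14, which is reasonable for short gaps but can hide real movement when many consecutive values are missing.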

Visualizing the interpolated data

Using a rolling window to transform data
Another common use of rolling windows is to transform the data

We've already done this once, in order to smooth the data

However, we can also use this to do more complex transformations

Transforming data to standardize variance
A common transformation to apply to data is to standardize its mean and variance over
time. There are many ways to do this.

Here, we'll show how to convert your dataset so that each point represents the % change
over a previous window.

This makes timepoints more comparable to one another if the absolute values of data
change a lot

Transforming to percent change with Pandas
def percent_change(values):
    """Calculates the % change between the last value
    and the mean of previous values"""
    # Separate the last value and all previous values into variables
    values = np.asarray(values)  # works for both Series and array input
    previous_values = values[:-1]
    last_value = values[-1]

    # Calculate the % difference between the last value
    # and the mean of earlier values
    percent_change = (last_value - np.mean(previous_values)) \
        / np.mean(previous_values)
    return percent_change

Applying this to our data
# Plot the raw data
fig, axs = plt.subplots(1, 2, figsize=(10, 5))
ax = prices.plot(ax=axs[0])

# Calculate % change and plot
ax = prices.rolling(window=20).aggregate(percent_change).plot(ax=axs[1])
ax.legend_.set_visible(False)
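
To see what the transformation produces, here is a sketch on a tiny made-up series with a window of 4. It uses `rolling(...).apply(..., raw=True)` so the function receives plain NumPy arrays; that call is a substitution for the `aggregate` call on the slide.

```python
import numpy as np
import pandas as pd

def percent_change(values):
    """% change between the last value and the mean of the previous ones."""
    values = np.asarray(values)
    previous_mean = np.mean(values[:-1])
    return (values[-1] - previous_mean) / previous_mean

# Made-up series: flat at 100, then a jump to 110 at the last step
prices = pd.Series([100.0, 100.0, 100.0, 100.0, 110.0])

# Each output compares a value with the mean of the 3 values before it
prices_perc = prices.rolling(window=4).apply(percent_change, raw=True)
print(prices_perc)
```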

Finding outliers in your data
Outliers are datapoints that are statistically significantly different from the rest of the dataset.

They can have negative effects on the predictive power of your model, biasing it away from its "true" value

One solution is to remove or replace outliers with a more representative value

Be very careful about doing this - often it is difficult to determine what is a legitimately extreme value vs an aberration

Plotting a threshold on our data
fig, axs = plt.subplots(1, 2, figsize=(10, 5))
for data, ax in zip([prices, prices_perc_change], axs):
    # Calculate the mean / standard deviation for the data
    this_mean = data.mean()
    this_std = data.std()

    # Plot the data, with a window that is 3 standard deviations
    # around the mean
    data.plot(ax=ax)
    ax.axhline(this_mean + this_std * 3, ls='--', c='r')
    ax.axhline(this_mean - this_std * 3, ls='--', c='r')

Visualizing outlier thresholds

Replacing outliers using the threshold
# Center the data so the mean is 0
prices_outlier_centered = prices_outlier_perc - prices_outlier_perc.mean()

# Calculate standard deviation
std = prices_outlier_perc.std()

# Use the absolute value of each datapoint
# to make it easier to find outliers
outliers = np.abs(prices_outlier_centered) > (std * 3)

# Replace outliers with the median value
# We'll use np.nanmedian since there may be nans around the outliers
prices_outlier_fixed = prices_outlier_centered.copy()
prices_outlier_fixed[outliers] = np.nanmedian(prices_outlier_fixed)
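
The same thresholding can be run end to end on synthetic data. This sketch assumes small random fluctuations with a single injected spike; the variable names are shortened stand-ins for the ones above.

```python
import numpy as np
import pandas as pd

# Made-up percent-change data: small fluctuations plus one extreme spike
rng = np.random.default_rng(0)
values = rng.normal(0, 0.01, size=200)
values[50] = 1.0
prices_perc = pd.Series(values)

# Center the data, then flag points more than 3 standard deviations away
centered = prices_perc - prices_perc.mean()
std = prices_perc.std()
outliers = np.abs(centered) > (std * 3)

# Replace flagged points with the median (nanmedian ignores missing values)
fixed = centered.copy()
fixed[outliers] = np.nanmedian(fixed)
print(outliers.sum(), fixed.abs().max())
```

Note that the spike itself inflates the standard deviation used for the threshold, so with several large outliers a 3-standard-deviation rule can miss some of them.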

Visualize the results
fig, axs = plt.subplots(1, 2, figsize=(10, 5))
prices_outlier_centered.plot(ax=axs[0])
prices_outlier_fixed.plot(ax=axs[1])

Let's practice!

Creating features over time

Extracting features with windows

Using .aggregate for feature extraction
# Visualize the raw data
print(prices.head(3))

symbol            AIG        ABT
date
2010-01-04  29.889999  54.459951
2010-01-05  29.330000  54.019953
2010-01-06  29.139999  54.319953

# Calculate a rolling window, then extract two features
feats = prices.rolling(20).aggregate([np.std, np.max]).dropna()
print(feats.head(3))

                 AIG                  ABT
                 std       amax       std       amax
date
2010-02-01  2.051966  29.889999  0.868830  56.239949
2010-02-02  2.101032  29.629999  0.869197  56.239949
2010-02-03  2.157249  29.629999  0.852509  56.239949

Check the properties of your features!

Using partial() in Python
# If we just take the mean, it returns a single value
a = np.array([[0, 1, 2], [0, 1, 2], [0, 1, 2]])
print(np.mean(a))

1.0

# We can use the partial function to initialize np.mean
# with an axis parameter
from functools import partial
mean_over_first_axis = partial(np.mean, axis=0)

print(mean_over_first_axis(a))

[0. 1. 2.]

Percentiles summarize your data
Percentiles are a useful way to get more fine-grained summaries of your data (as opposed to using np.mean)

For a given dataset, the Nth percentile is the value where N% of the data is below that datapoint, and (100-N)% of the data is above that datapoint.

print(np.percentile(np.linspace(0, 200), q=20))

40.0

Combining np.percentile() with partial functions to calculate a range of percentiles
data = np.linspace(0, 100)

# Create a list of functions using a list comprehension
percentile_funcs = [partial(np.percentile, q=ii) for ii in [20, 40, 60]]

# Calculate the output of each function in the same way
percentiles = [i_func(data) for i_func in percentile_funcs]
print(percentiles)

[20.0, 40.00000000000001, 60.0]

# Calculate multiple percentiles of a rolling window
# (pass the list of functions, not the values computed above)
prices.rolling(20).aggregate(percentile_funcs)

Calculating "date-based" features
Thus far we've focused on calculating "statistical" features - these are features that correspond to statistical properties of the data, like "mean", "standard deviation", etc.

However, don't forget that timeseries data often has more "human" features associated with it, like days of the week, holidays, etc.

These features are often useful when dealing with timeseries data that spans multiple years (such as stock value over time)

datetime features using Pandas
# Ensure our index is datetime
prices.index = pd.to_datetime(prices.index)

# Extract datetime features
day_of_week_num = prices.index.weekday
print(day_of_week_num[:10])

Index([0 1 2 3 4 0 1 2 3 4], dtype='object')

# weekday_name was removed in pandas 1.0; day_name() is its replacement
day_of_week = prices.index.day_name()
print(day_of_week[:10])

Index(['Monday' 'Tuesday' 'Wednesday' 'Thursday' 'Friday' 'Monday' 'Tuesday'
       'Wednesday' 'Thursday' 'Friday'], dtype='object')
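
As a runnable sketch, assuming a hypothetical business-day index that starts on Monday, 2010-01-04, the date-based features can be collected into a table aligned with the data:

```python
import pandas as pd

# Hypothetical business-day index starting on Monday, 2010-01-04
index = pd.to_datetime(pd.bdate_range('2010-01-04', periods=6))
prices = pd.Series(range(6), index=index)

# Numeric day of week (Monday=0) and the day name
day_of_week_num = prices.index.weekday
day_of_week = prices.index.day_name()

# Combine them into a feature table aligned with the data
features = pd.DataFrame({'weekday': day_of_week_num,
                         'day_name': day_of_week,
                         'month': prices.index.month}, index=prices.index)
print(features)
```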

Let's practice!

Time-delayed features and auto-regressive models

The past is useful
Timeseries data almost always have information that is shared between timepoints

Information in the past can help predict what happens in the future

Often the features best-suited to predict a timeseries are previous values of the same timeseries.

A note on smoothness and auto-correlation
A common question to ask of a timeseries: how smooth is the data?

In other words, how correlated is a timepoint with its neighboring timepoints (called autocorrelation)?

The amount of auto-correlation in data will impact your models.

Creating time-lagged features
Let's see how we could build a model that uses values in the past as input features.

We can use this to assess how auto-correlated our signal is (and lots of other stuff too)

Time-shifting data with Pandas
print(df)

    df
0  0.0
1  1.0
2  2.0
3  3.0
4  4.0

# Shift a DataFrame/Series by 3 index values, so each row holds the value from 3 steps in the past
print(df.shift(3))

    df
0  NaN
1  NaN
2  NaN
3  0.0
4  1.0

Creating a time-shifted DataFrame
# data is a pandas Series containing time series data
data = pd.Series(...)

# Shifts
shifts = [0, 1, 2, 3, 4, 5, 6, 7]

# Create a dictionary of time-shifted data
many_shifts = {'lag_{}'.format(ii): data.shift(ii) for ii in shifts}

# Convert them into a dataframe
many_shifts = pd.DataFrame(many_shifts)
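
A small sketch of the lagged table on made-up data. Two choices here are assumptions of this example rather than part of the slide: lag 0 is left out, because it is the series itself and would let a model trivially reproduce the target, and rows without a full history are dropped.

```python
import numpy as np
import pandas as pd

# Made-up series 0, 1, ..., 9 so that the lags are easy to read
data = pd.Series(np.arange(10, dtype=float))

# One column per lag: lag_k at time t holds the value from time t - k
shifts = [1, 2, 3]
many_shifts = pd.DataFrame({'lag_{}'.format(ii): data.shift(ii) for ii in shifts})

# The first rows have no past values (NaN); drop them before fitting a model
X = many_shifts.dropna()
y = data.loc[X.index]
print(X.head(3))
```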

Fitting a model with time-shifted features
from sklearn.linear_model import Ridge
# Shifting leaves NaNs in the first rows; fill them so the model can be fit
X = many_shifts.fillna(np.nanmedian(many_shifts))
# Fit the model using these input features
model = Ridge()
model.fit(X, data)

Interpreting the auto-regressive model coefficients
# Visualize the fit model coefficients
fig, ax = plt.subplots()
ax.bar(many_shifts.columns, model.coef_)
ax.set(xlabel='Coefficient name', ylabel='Coefficient value')

# Set formatting so it looks nice
plt.setp(ax.get_xticklabels(), rotation=45, horizontalalignment='right')

Visualizing coefficients for a rough signal

Visualizing coefficients for a smooth signal

Let's practice!

Cross-validating timeseries data

Cross validation with scikit-learn
# Iterating over the "split" method yields train/test indices
for tr, tt in cv.split(X, y):
    model.fit(X[tr], y[tr])
    model.score(X[tt], y[tt])

Cross validation types: KFold
KFold cross-validation splits your data into multiple "folds" of equal size

It is one of the most common cross-validation routines

from sklearn.model_selection import KFold

cv = KFold(n_splits=5)
for tr, tt in cv.split(X, y):
    ...

Visualizing model predictions
fig, axs = plt.subplots(2, 1)

# Plot the indices chosen for validation on each loop
axs[0].scatter(tt, [0] * len(tt), marker='_', s=2, lw=40)
axs[0].set(ylim=[-.1, .1], title='Test set indices (color=CV loop)',
           xlabel='Index of raw data')

# Plot the model predictions on each iteration
axs[1].plot(model.predict(X[tt]))
axs[1].set(title='Test set predictions on each CV loop',
           xlabel='Prediction index')

Visualizing KFold CV behavior

A note on shuffling your data
Many CV iterators let you shuffle data as a part of the cross-validation process.

This only works if the data is i.i.d., which timeseries usually is not.

You should not shuffle your data when making predictions with timeseries.

from sklearn.model_selection import ShuffleSplit

cv = ShuffleSplit(n_splits=3)
for tr, tt in cv.split(X, y):
    ...

Visualizing shuffled CV behavior

Using the time series CV iterator
Thus far, we've broken the linear passage of time in the cross validation

However, you generally should not use datapoints in the future to predict data in the past

One approach: Always use training data from the past to predict the future
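
scikit-learn's `TimeSeriesSplit` follows this approach. A minimal sketch, assuming 12 placeholder samples that are already ordered in time:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 12 time-ordered samples (placeholder data)
X = np.arange(12).reshape(-1, 1)

cv = TimeSeriesSplit(n_splits=3)
splits = list(cv.split(X))
for tr, tt in splits:
    # Every training index comes before every validation index
    print('train:', tr, 'validation:', tt)
```

The training set grows on each iteration, so later splits see more history than earlier ones.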

Visualizing time series cross validation iterators
# Import and initialize the cross-validation iterator
from sklearn.model_selection import TimeSeriesSplit
cv = TimeSeriesSplit(n_splits=10)

fig, ax = plt.subplots(figsize=(10, 5))

for ii, (tr, tt) in enumerate(cv.split(X, y)):
    # Plot training and test indices
    l1 = ax.scatter(tr, [ii] * len(tr), c=[plt.cm.coolwarm(.1)],
                    marker='_', lw=6)
    l2 = ax.scatter(tt, [ii] * len(tt), c=[plt.cm.coolwarm(.9)],
                    marker='_', lw=6)
    ax.set(ylim=[10, -1], title='TimeSeriesSplit behavior',
           xlabel='data index', ylabel='CV iteration')
    ax.legend([l1, l2], ['Training', 'Validation'])

Visualizing the TimeSeriesSplit cross validation iterator

Custom scoring functions in scikit-learn
def myfunction(estimator, X, y):
    y_pred = estimator.predict(X)
    my_custom_score = my_custom_function(y_pred, y)
    return my_custom_score

A custom correlation function for scikit-learn
def my_pearsonr(est, X, y):
    # Generate predictions and convert to a vector
    y_pred = est.predict(X).squeeze()

    # Use the numpy "corrcoef" function to calculate a correlation matrix
    my_corrcoef_matrix = np.corrcoef(y_pred, y.squeeze())

    # Return a single correlation value from the matrix
    my_corrcoef = my_corrcoef_matrix[1, 0]
    return my_corrcoef

Let's practice!

Stationarity and stability

Stationarity
Stationary time series do not change their statistical properties over time

E.g., mean, standard deviation, trends

Most time series are non-stationary to some extent

Model stability
Non-stationary data results in variability in our model

The statistical properties the model finds may change with the data

In addition, we will be less certain about the correct values of model parameters

How can we quantify this?

Cross validation to quantify parameter stability
One approach: use cross-validation

Calculate model parameters on each iteration

Assess parameter stability across all CV splits

Bootstrapping the mean
Bootstrapping is a common way to assess variability

The bootstrap:
1. Take a random sample of data with replacement

2. Calculate the mean of the sample

3. Repeat this process many times (1000s)

4. Calculate the percentiles of the result (usually 2.5, 97.5)

The result is a 95% confidence interval of the mean of each coefficient.
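
The steps above can be sketched with NumPy alone. The sample here is made up, standing in for one coefficient's values across CV splits, and `rng.choice` is used for the resampling step:

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up sample standing in for one coefficient's values across CV splits
sample = rng.normal(loc=5.0, scale=1.0, size=100)

n_boots = 1000
bootstrap_means = np.zeros(n_boots)
for ii in range(n_boots):
    # Steps 1 and 2: random sample with replacement, then its mean
    resampled = rng.choice(sample, size=len(sample), replace=True)
    bootstrap_means[ii] = resampled.mean()

# Step 4: percentiles of the bootstrapped means give a 95% confidence interval
lower, upper = np.percentile(bootstrap_means, (2.5, 97.5))
print(lower, upper)
```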

Bootstrapping the mean
from sklearn.utils import resample

# cv_coefficients has shape (n_cv_folds, n_coefficients)
n_boots = 100
bootstrap_means = np.zeros((n_boots, n_coefficients))
for ii in range(n_boots):
    # Generate random indices for our data with replacement,
    # then take the sample mean
    random_sample = resample(cv_coefficients)
    bootstrap_means[ii] = random_sample.mean(axis=0)

# Compute the percentiles of choice for the bootstrapped means
percentiles = np.percentile(bootstrap_means, (2.5, 97.5), axis=0)

Plotting the bootstrapped coefficients
fig, ax = plt.subplots()
ax.scatter(many_shifts.columns, percentiles[0], marker='_', s=200)
ax.scatter(many_shifts.columns, percentiles[1], marker='_', s=200)

Assessing model performance stability
If using the TimeSeriesSplit, you can plot the model's score over time.

This is useful in finding certain regions of time that hurt the score

Also useful to find non-stationary signals

Model performance over time
def my_corrcoef(est, X, y):
    """Return the correlation coefficient
    between model predictions and a validation set."""
    return np.corrcoef(y, est.predict(X))[1, 0]

# Grab the date of the first index of each validation set
first_indices = [data.index[tt[0]] for tr, tt in cv.split(X, y)]

# Calculate the CV scores and convert to a Pandas Series
cv_scores = cross_val_score(model, X, y, cv=cv, scoring=my_corrcoef)
cv_scores = pd.Series(cv_scores, index=first_indices)

Visualizing model scores as a timeseries
fig, axs = plt.subplots(2, 1, figsize=(10, 5), sharex=True)

# Calculate a rolling mean of scores over time
cv_scores_mean = cv_scores.rolling(10, min_periods=1).mean()
cv_scores.plot(ax=axs[0])
axs[0].set(title='Validation scores (correlation)', ylim=[0, 1])

# Plot the raw data
data.plot(ax=axs[1])
axs[1].set(title='Validation data')

Visualizing model scores

Fixed windows with time series cross-validation
# Only keep the last 100 datapoints in the training data
window = 100

# Initialize the CV with this window size

cv = TimeSeriesSplit(n_splits=10, max_train_size=window)
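
To see the effect of `max_train_size`, here is a small sketch with placeholder data and a shorter window of 10 samples (both assumptions for illustration). Each training set keeps only the most recent points, so its start moves forward in time instead of staying at index 0:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(60).reshape(-1, 1)  # placeholder, time-ordered data

# Keep at most the 10 most recent datapoints in each training set
cv = TimeSeriesSplit(n_splits=5, max_train_size=10)
train_sizes = [len(tr) for tr, tt in cv.split(X)]
train_starts = [int(tr[0]) for tr, tt in cv.split(X)]
print(train_sizes, train_starts)
```

A fixed window lets the model forget old data, which can help when the signal is non-stationary.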

Non-stationary signals

Let's practice!

Wrapping-up

Timeseries and machine learning
The many applications of time series + machine learning

Always visualize your data first

The scikit-learn API standardizes this process

Feature extraction and classification
Summary statistics for time series classification

Combining multiple features into a single input matrix

Feature extraction for time series data

Model fitting and improving data quality
Time series features for regression

Generating predictions over time

Cleaning and improving time series data

Validating and assessing our model performance
Cross-validation with time series data (don't shuffle the data!)

Time series stationarity

Assessing model coefficient and score stability

Advanced concepts in time series
Advanced window functions

Signal processing and filtering details

Spectral analysis

Advanced machine learning
Advanced time series feature extraction (e.g., tsfresh)

More complex model architectures for regression and classification

Production-ready pipelines for time series analysis

Ways to practice
There are a lot of opportunities to practice your skills with time series data.

Kaggle has a number of time series predictions challenges

Quantopian is also useful for learning and using predictive models others have built.

Let's practice!
M A C H I N E L E A R N I N G F O R T I M E S E R I E S D ATA I N P Y T H O N

Essential Python Libraries and Frameworks
No ratings yet
Essential Python Libraries and Frameworks
170 pages
DuckDB Documentation Overview
No ratings yet
DuckDB Documentation Overview
721 pages
Flight Delay Analysis with Dask Python
No ratings yet
Flight Delay Analysis with Dask Python
32 pages
Building Dask Bags for Parallel Computing
No ratings yet
Building Dask Bags for Parallel Computing
33 pages
Data Wrangling with Pandas Techniques
No ratings yet
Data Wrangling with Pandas Techniques
13 pages
Stock Price Prediction with ML Tools
No ratings yet
Stock Price Prediction with ML Tools
19 pages
100 Pandas Exercises Collection
No ratings yet
100 Pandas Exercises Collection
6 pages
Visual Guide to Pandas Essentials
No ratings yet
Visual Guide to Pandas Essentials
99 pages
Dask Arrays and Float64 Operations
No ratings yet
Dask Arrays and Float64 Operations
41 pages
Learn Python For Finance & Accounting A Comprehensive and Step by Step Guide To Unleashe Your Career Potential
100% (1)
Learn Python For Finance & Accounting A Comprehensive and Step by Step Guide To Unleashe Your Career Potential
364 pages
Looker
100% (1)
Looker
57 pages
Overview of Python Pandas Library
No ratings yet
Overview of Python Pandas Library
19 pages
Time Series Forecasting with Python
No ratings yet
Time Series Forecasting with Python
30 pages
Python for Finance: Data-Driven Insights
No ratings yet
Python for Finance: Data-Driven Insights
502 pages
OOP Concepts in Python Programming
No ratings yet
OOP Concepts in Python Programming
94 pages
Comprehensive Data Science Cheat Sheets
No ratings yet
Comprehensive Data Science Cheat Sheets
18 pages
Python for Stock Market Analysis Guide
No ratings yet
Python for Stock Market Analysis Guide
8 pages
Machine Learning for Finance Syllabus
100% (1)
Machine Learning for Finance Syllabus
131 pages
Essential Python Libraries for Data Science
No ratings yet
Essential Python Libraries for Data Science
12 pages
Azure Machine Learning Engineering Guide
No ratings yet
Azure Machine Learning Engineering Guide
17 pages
Webinar: Python for Data Science Basics
100% (1)
Webinar: Python for Data Science Basics
96 pages
Programming AI with Python Guide
No ratings yet
Programming AI with Python Guide
56 pages
Advanced Regex and Socket Cheat Sheet
No ratings yet
Advanced Regex and Socket Cheat Sheet
237 pages
Time Series Analysis with Python Guide
100% (1)
Time Series Analysis with Python Guide
835 pages
Python MySQL Connection Tutorial
No ratings yet