Algorithmic Differentiation - C++ and Extremum Estimation - Matt P. Dziubinski - CppCon 2015
Matt P. Dziubinski
CppCon 2015
[email protected] // @matt_dz
Department of Mathematical Sciences, Aalborg University
CREATES (Center for Research in Econometric Analysis of Time Series)
Outline
Why?
Numerical Optimization
Calculating Derivatives
Resources
References
2
Slides
3
Why?
Motivation: Data & Models
5
Motivation: Parametric Models
θ̂ = argmax_{θ∈Θ} Q(θ)
6
Numerical Optimization
Numerical Optimization: Algorithms
8
Numerical Optimization: Gradient
9
Numerical Optimization: Algorithmic Differentiation
• Essentially:
• an automatic computational differentiation method
• based on the systematic application of the chain rule
• exact to the maximum extent allowed by the machine precision
10
Calculating Derivatives
Calculating Derivatives
• Main approaches:
• Finite Differencing
• Algorithmic Differentiation (AD), a.k.a. Automatic Differentiation
• Symbolic Differentiation
12
Finite Differencing
• Recall f′(x) = lim_{h→0} [f(x + h) − f(x)] / h
• forward-difference approximation:
  ∂f/∂x_i(x) ≈ [f(x + h e_i) − f(x)] / h
• central-difference approximation:
  ∂f/∂x_i(x) ≈ [f(x + h e_i) − f(x − h e_i)] / (2h)
• Computational cost:
  • forward-difference formula: p + 1 function evaluations
  • central-difference formula: 2p function evaluations
13
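To make the evaluation-cost comparison concrete, here is a minimal C++ sketch (not from the slides) of forward- and central-difference gradient approximations for a scalar function of p variables; the callable F, the step size h, and the use of std::vector<double> are illustrative choices.

#include <cstddef>
#include <vector>

// Forward-difference gradient: p + 1 evaluations of f.
template <typename F>
std::vector<double> gradient_forward(F f, std::vector<double> x, double h)
{
    const std::size_t p = x.size();
    std::vector<double> g(p);
    const double fx = f(x);              // 1 evaluation
    for (std::size_t i = 0; i < p; ++i)  // p further evaluations
    {
        const double xi = x[i];
        x[i] = xi + h;
        g[i] = (f(x) - fx) / h;
        x[i] = xi;                       // restore the i-th coordinate
    }
    return g;
}

// Central-difference gradient: 2p evaluations of f.
template <typename F>
std::vector<double> gradient_central(F f, std::vector<double> x, double h)
{
    const std::size_t p = x.size();
    std::vector<double> g(p);
    for (std::size_t i = 0; i < p; ++i)
    {
        const double xi = x[i];
        x[i] = xi + h;
        const double f_plus = f(x);
        x[i] = xi - h;
        const double f_minus = f(x);
        g[i] = (f_plus - f_minus) / (2.0 * h);
        x[i] = xi;
    }
    return g;
}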
Finite Differencing — Truncation Error
• from the Taylor expansion: truncation error O(h) for the forward-difference formula, O(h²) for the central-difference formula
14
Floating Point Arithmetic
15
Floating Point Representation
16
Floating Point Representation Properties
17
Finite Differencing — Round-off Error I
• Consider f′(x) ≈ [f(x + h) − f(x)] / h
• Suppose that x and x + h are exact
• Round-off errors when evaluating f:
  [f(x + h)(1 + δ1) − f(x)(1 + δ2)] / h = [f(x + h) − f(x)] / h + [δ1 f(x + h) − δ2 f(x)] / h
• We have |δi| ≤ ϵ_M for i ∈ {1, 2}
• ⇒ round-off error bound: ϵ_M (|f(x + h)| + |f(x)|) / h
• ⇒ round-off error bound order: O(ϵ_M / h)
• round-off error — increases with smaller h
  • error contribution inversely proportional to h, proportionality constant O(ϵ_M)
18
Finite Differencing — Round-off Error II
• ϵ_M ≠ 0 ⇒ ∃ĥ : x + ĥ = x
• i.e., x and x + h not guaranteed to be exactly representable, either
• ⇒ f(x) = f(x + ĥ)
• ⇒ [f(x + ĥ) − f(x)] / ĥ = 0
• ⇒ no correct digits in the result at (or below) step-size ĥ
19
Finite Differencing — Truncation—Round-off Trade-off
• total error ≈ truncation error + round-off error, e.g. O(h) + O(ϵ_M / h) for the forward-difference formula: decreasing h helps only up to a point (roughly h ≈ √ϵ_M), after which round-off dominates
20
Finite Differencing — Central-Difference Formula
21
Finite Difference Approximation Error
Example: f(x) = cos(sin(x) ∗ cos(x)), f′(x) = −cos(2 ∗ x) ∗ sin(sin(x) ∗ cos(x))
22
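To see the truncation/round-off trade-off numerically, the following standalone C++ sketch (not from the slides) compares the forward- and central-difference approximations of this example's derivative against the analytic f′ at x = 1 for a range of step sizes; the evaluation point and the step-size grid are illustrative choices.

#include <cmath>
#include <cstdio>

double f(double x)  { return std::cos(std::sin(x) * std::cos(x)); }
double df(double x) { return -std::cos(2 * x) * std::sin(std::sin(x) * std::cos(x)); }

int main()
{
    const double x = 1.0;
    for (int k = 1; k <= 15; ++k)
    {
        const double h = std::pow(10.0, -k);
        const double fd_forward = (f(x + h) - f(x)) / h;
        const double fd_central = (f(x + h) - f(x - h)) / (2 * h);
        std::printf("h = 1e-%02d  forward error = %.3e  central error = %.3e\n",
                    k, std::fabs(fd_forward - df(x)), std::fabs(fd_central - df(x)));
    }
    // Typical pattern: the error first shrinks as h decreases (truncation error
    // dominates), then grows again for very small h (round-off error dominates).
}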
Symbolic Differentiation
23
Numerical Objective Function & Algorithmic Differentiation I
24
Numerical Objective Function & Algorithmic Differentiation II
25
Algorithmic Differentiation — Idea I
• Idea:
  • computer code for evaluating the function can be broken down
  • into a composition of elementary arithmetic operations
  • to which the chain rule (one of the basic rules of calculus) can be applied
• e.g., given an objective function Q(θ) = f(g(θ)), we have Q′ = (f ◦ g)′ = (f′ ◦ g) · g′
26
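A concrete instance (not from the slides, chosen only to illustrate the rule): take g(θ) = θ² and f(u) = log(u), so Q(θ) = log(θ²); then Q′(θ) = f′(g(θ)) · g′(θ) = (1/θ²) · 2θ = 2/θ. AD computes exactly this product of local derivatives alongside the function evaluation, without forming a symbolic expression for Q.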
Algorithmic Differentiation — Idea II
27
Algorithmic Differentiation — Idea III
28
Algorithmic Differentiation — Implementations
• operator overloading
  • e.g., ADOL-C — keep track of the elementary computations during the function evaluation, produce the derivatives at given inputs
29
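A minimal sketch of the operator-overloading approach (illustrative only; this is not ADOL-C's or any other library's API): a "dual number" carries a value together with a derivative, and each overloaded operation applies the corresponding chain-rule step.

#include <cmath>
#include <cstdio>

// Value and derivative propagated together (forward mode).
struct Dual
{
    double value;
    double deriv;
};

Dual operator*(Dual a, Dual b)  // product rule
{
    return {a.value * b.value, a.deriv * b.value + a.value * b.deriv};
}
Dual sin(Dual a) { return {std::sin(a.value),  std::cos(a.value) * a.deriv}; }
Dual cos(Dual a) { return {std::cos(a.value), -std::sin(a.value) * a.deriv}; }

int main()
{
    // Differentiate f(x) = cos(sin(x) * cos(x)) at x = 1:
    // seed the active variable's derivative with 1.
    const Dual x{1.0, 1.0};
    const Dual y = cos(sin(x) * cos(x));
    // y.deriv matches the analytic f'(x) = -cos(2 * x) * sin(sin(x) * cos(x)).
    std::printf("f(1) = %f, f'(1) = %f\n", y.value, y.deriv);
}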
Algorithmic Differentiation — Example
MLE: per-observation Gaussian log-likelihood contribution
ℓt(µ, σ²; xt) = −½ log(2π) − ½ log(σ²) − ½ (xt − µ)² / σ²
30
Algorithmic Differentiation — Evaluation Trace
ℓt = v7 = −1.612086
32
Algorithmic Differentiation — Evaluation Trace
33
Algorithmic Differentiation — Forward Mode
34
Algorithmic Differentiation — Forward Mode
• in our case
  • start with v̇−1 = 1.0 (active differentiation variable µ),
  • v̇0 = 0.0 (passive: we are not differentiating with respect to σ² and thus treat it as a constant)
  • then, given that v1 = ϕ1(v0) = log(v0), we obtain v̇1 = (∂ϕ1(v0)/∂v0) v̇0 = v̇0/v0 = 0.0
  • analogously, v2 = ϕ2(v1) = −½ log(2π) − ½ v1, hence v̇2 = (∂ϕ2(v1)/∂v1) v̇1 = −½ v̇1 = 0.0
35
Algorithmic Differentiation — Augmented Evaluation Trace
v−1 = µ = 6
v̇−1 = ∂v−1 /∂v−1 = 1
v0 = σ2 = 4
v̇0 = ∂v0 /∂v−1 = 0
ℓt = v7 = −1.612086
ℓ̇t = v̇7 = 0
36
Algorithmic Differentiation — Augmented Evaluation Trace
37
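The trace values can be checked directly: the printed ℓt is consistent with xt = 6 (an inference from the numbers shown, not something stated in this transcript). With µ = 6 and σ² = 4, propagating value/tangent pairs by hand, as in the forward-mode recursions above, reproduces both ℓt and ℓ̇t; the intermediate decomposition below is one possible choice, not necessarily the slides' v1, …, v7.

#include <cmath>
#include <cstdio>

int main()
{
    const double xt = 6.0;                      // assumed, inferred from the trace
    const double two_pi = 2.0 * std::acos(-1.0);

    // Independent variables and their tangents (derivatives w.r.t. mu):
    const double mu = 6.0, d_mu = 1.0;          // active
    const double s2 = 4.0, d_s2 = 0.0;          // passive

    // log(sigma^2)
    const double v1 = std::log(s2);
    const double d1 = d_s2 / s2;
    // -1/2 log(2 pi) - 1/2 v1
    const double v2 = -0.5 * std::log(two_pi) - 0.5 * v1;
    const double d2 = -0.5 * d1;
    // residual, its square, scaled by 1/sigma^2
    const double v3 = xt - mu,  d3 = -d_mu;
    const double v4 = v3 * v3,  d4 = 2.0 * v3 * d3;
    const double v5 = v4 / s2,  d5 = (d4 * s2 - v4 * d_s2) / (s2 * s2);
    // l_t and its derivative w.r.t. mu
    const double lt  = v2 - 0.5 * v5;
    const double dlt = d2 - 0.5 * d5;

    std::printf("l_t = %.6f  dl_t/dmu = %.6f\n", lt, dlt);  // -1.612086 and 0.000000
}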
AD — Reverse Mode
38
AD — Forward Mode vs. Reverse Mode
• In general, for f : Rⁿ → Rᵐ
  • rough Cost_{AD,Forward}(f) ∈ n · O(Cost(f))
  • rough Cost_{AD,Reverse}(f) ∈ m · O(Cost(f))
  • in particular, for a scalar-valued objective (m = 1), reverse mode delivers the full gradient at a cost of the same order as a single function evaluation, independently of n
39
Illustration: Simulation Study — Setup
log(St+1 / St) ≡ rt+1 = µ + ϵt+1    (1)
ϵt+1 = √(vt+1) · wt+1    (2)
vt+1 = ω + a ϵt² + b vt    (3)
w ∼ GWN(0, 1) under P    (4)
40
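For reference, a self-contained C++ sketch of simulating one return path from (1)–(4); this is not the slides' code, and the initialization of v at its unconditional mean mv = ω/(1 − a − b) as well as the example parameter values are illustrative choices.

#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Simulate T log-returns from the GARCH(1,1) model (1)-(4).
std::vector<double> simulate_garch(std::size_t T, double mu, double omega,
                                   double a, double b, unsigned seed)
{
    std::mt19937 gen(seed);
    std::normal_distribution<double> w(0.0, 1.0);   // Gaussian white noise (4)

    std::vector<double> r(T);
    double v = omega / (1.0 - a - b);               // start at the unconditional variance m_v
    for (std::size_t t = 0; t < T; ++t)
    {
        const double eps = std::sqrt(v) * w(gen);   // (2) innovation
        r[t] = mu + eps;                            // (1) log-return
        v = omega + a * eps * eps + b * v;          // (3) next period's conditional variance
    }
    return r;
}

// Example call (parameter values purely illustrative):
//   const auto returns = simulate_garch(1000, 0.05, 0.05, 0.10, 0.85, 42u);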
Illustration: Simulation Study — Setup — Notation
• θ = (µ, ω, a, b)⊤
• r — logarithmic return of the spot price
• µ — conditional mean of r
• v — conditional variance of r
• non-negativity: ω, a, b > 0
• w — randomness source (innovation): standard Gaussian white noise
• mv = ω / (1 − a − b) — the unconditional mean of v (unconditional variance of r)
• pv = (a + b) — the mean-reversion rate or the persistence rate
• when pv < 1 (weak stationarity), the conditional variance will revert to the unconditional variance at a geometric rate of pv
• smaller pv — less persistent sensitivity of the volatility expectation to past market shocks
• pv = 1 — integrated GARCH (IGARCH) process
• common finding: pv close to 1 in financial data, a "near-integrated" process, Bollerslev and Engle (1993)
41
Illustration: Monte Carlo Simulation
42
Illustration: Reliability — AD, L-BFGS
43
Illustration: Reliability — FD, TNR
44
Illustration: Findings — Reliability
• Reliability of estimation:
  • using AD: successful convergence in the vast majority of Monte Carlo experiments
  • using FD: very high convergence failure ratio
45
Illustration: Findings
46
Example Source Code
Example using Rcpp
Rcpp::sourceCpp('ExampleGaussianRcpp.cpp')
48
ExampleGaussianRcpp.cpp I
// [[Rcpp::plugins("cpp11")]]
#include <cstddef>
// [[Rcpp::depends(BH)]]
#include <boost/math/constants/constants.hpp>
// [[Rcpp::depends(RcppEigen)]]
#include <RcppEigen.h>
#include <cmath>
49
ExampleGaussianRcpp.cpp II
namespace model
{
// 2 parameters: mu, sigma^2
enum parameter : std::size_t { mu, s2 };
constexpr std::size_t parameters_count = 2;
}
50
ExampleGaussianRcpp.cpp III
// [[Rcpp::export]]
double l_t_cpp(double xt,
const Eigen::Map<Eigen::VectorXd> parameters)
{
const auto mu = parameters[model::parameter::mu];
const auto s2 = parameters[model::parameter::s2];
constexpr auto two_pi =
boost::math::constants::two_pi<double>();
using std::log;
using std::pow;
return -.5 * log(two_pi) - .5 * log(s2) -
.5 * pow(xt - mu, 2) / s2;
}
51
AD Example using RcppEigen
Rcpp::sourceCpp('ExampleGaussianRcppEigen.cpp')
52
ExampleGaussianRcppEigen.cpp I
// [[Rcpp::plugins("cpp11")]]
#include <cstddef>
// [[Rcpp::depends(BH)]]
#include <boost/math/constants/constants.hpp>
// [[Rcpp::depends(RcppEigen)]]
#include <RcppEigen.h>
#include <unsupported/Eigen/AutoDiff>
#include <cmath>
53
ExampleGaussianRcppEigen.cpp II
namespace model
{
// 2 parameters: mu, sigma^2
enum parameter : std::size_t { mu, s2 };
constexpr std::size_t parameters_count = 2;
54
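The snippets below use model::parameter_vector<Scalar>, whose definition does not appear in this transcript; a plausible definition consistent with that usage (an assumption, not necessarily the original code) is a fixed-size Eigen vector over the model's parameters, closing the namespace opened above:

// hypothetical completion of `namespace model` (assumed, not from the slides)
template <typename Scalar>
using parameter_vector = Eigen::Matrix<Scalar, parameters_count, 1>;
}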
ExampleGaussianRcppEigen.cpp III
// note: data `xt` -- double-precision number(s), just as before
// only the parameters adjusted to `ScalarType`
template <typename VectorType>
typename VectorType::Scalar
l_t_cpp_AD(double xt,
const VectorType & parameters)
{
using Scalar = typename VectorType::Scalar;
const Scalar & mu = parameters[model::parameter::mu];
const Scalar & s2 = parameters[model::parameter::s2];
// note: `two_pi` is, as always, a double-precision constant
constexpr auto two_pi =
boost::math::constants::two_pi<double>();
using std::log; using std::pow;
return -.5 * log(two_pi) - .5 * log(s2) -
.5 * pow(xt - mu, 2) / s2;
}
55
ExampleGaussianRcppEigen.cpp IV
// [[Rcpp::export]]
double l_t_value_cpp(double xt,
const Eigen::Map<Eigen::VectorXd> parameters)
{
return l_t_cpp_AD(xt, parameters);
}
56
ExampleGaussianRcppEigen.cpp V
// objective function together with its gradient
// `xt`: input (data)
// `parameters_input`: input (model parameters)
// `gradient_output`: output (computed gradient)
// returns: computed objective function value
// [[Rcpp::export]]
double l_t_value_gradient_cpp
(
double xt,
const Eigen::Map<Eigen::VectorXd> parameters_input,
Eigen::Map<Eigen::VectorXd> gradient_output
)
{
using parameter_vector = model::parameter_vector<double>;
using AD = Eigen::AutoDiffScalar<parameter_vector>;
using VectorAD = model::parameter_vector<AD>;
57
ExampleGaussianRcppEigen.cpp VI
58
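The body on slide VI is not present in this transcript. A plausible completion, using Eigen's AutoDiffScalar together with the aliases set up above (the seeding loop and the exact statements are an assumption about the original code, not a transcription of it):

  // Seed the AD parameters: take the value from the input and a unit
  // derivative direction for parameter i (so derivatives are w.r.t. theta_i).
  VectorAD parameters;
  for (std::size_t i = 0; i < model::parameters_count; ++i)
  {
    parameters[i].value() = parameters_input[i];
    parameters[i].derivatives() =
        parameter_vector::Unit(model::parameters_count, i);
  }

  // One evaluation of the templated objective: the AD scalar carries
  // the function value and its gradient together.
  const AD result = l_t_cpp_AD(xt, parameters);
  gradient_output = result.derivatives();
  return result.value();
}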
Data Parallel Objective Function I
59
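The first (serial) variant referenced in the benchmark below is not in the transcript; given that the next slide is labeled "Second: Parallelize using OpenMP", it presumably is the same loop without the pragma. A sketch (assumed, not the original code):

// First: evaluate the per-observation objective over the whole sample
// [[Rcpp::export]]
Eigen::VectorXd l_t_value_cppDP
(
  const Eigen::Map<Eigen::VectorXd> xs,
  const Eigen::Map<Eigen::VectorXd> parameters
)
{
  const std::size_t sample_size = xs.size();
  Eigen::VectorXd result(sample_size);
  for (std::size_t t = 0; t != sample_size; ++t)
  {
    result[t] = l_t_cpp_AD(xs[t], parameters);
  }
  return result;
}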
Data Parallel Objective Function II
// Second: Parallelize using OpenMP
// [[Rcpp::export]]
Eigen::VectorXd l_t_value_cppDP_OMP
(
const Eigen::Map<Eigen::VectorXd> xs,
const Eigen::Map<Eigen::VectorXd> parameters
)
{
const std::size_t sample_size = xs.size();
Eigen::VectorXd result(sample_size);
#pragma omp parallel for default(none) shared(result)
for (std::size_t t = 0; t < sample_size; ++t)
{
result[t] = l_t_cpp_AD(xs[t], parameters);
}
return result;
}
60
Data Parallel Objective Function Performance I
> require("microbenchmark")
> microbenchmark(
+ l_t(xs, fixed_parameters),
+ l_t_value_cppDP(xs, fixed_parameters),
+ l_t_value_cppDP_OMP(xs, fixed_parameters)
+ )
Unit: microseconds
expr median neval
l_t_value_cppDP(xs, fixed_parameters) 458.618 100
l_t_value_cppDP_OMP(xs, fixed_parameters) 213.526 100
61
Data Parallel Gradient I
// [[Rcpp::export]]
Eigen::VectorXd l_t_value_gradient_cppDP
(
const Eigen::Map<Eigen::VectorXd> xs,
const Eigen::Map<Eigen::VectorXd> parameters_input,
Eigen::Map<Eigen::MatrixXd> gradient_output
)
{
const std::size_t sample_size = xs.size();
Eigen::VectorXd result(sample_size);
for (std::size_t t = 0; t != sample_size; ++t)
{
using parameter_vector = model::parameter_vector<double>;
using AD = Eigen::AutoDiffScalar<parameter_vector>;
using VectorAD = model::parameter_vector<AD>;
62
Data Parallel Gradient II
63
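Slides 63 and 65 (the loop bodies of the serial and OpenMP gradient variants) are not in this transcript; a plausible per-iteration body, reusing the seeding pattern from the single-observation version (assumed, not the original code), would be the same for both variants:

    // Inside the loop over observations t:
    VectorAD parameters;
    for (std::size_t i = 0; i < model::parameters_count; ++i)
    {
      parameters[i].value() = parameters_input[i];
      parameters[i].derivatives() =
          parameter_vector::Unit(model::parameters_count, i);
    }
    const AD lt = l_t_cpp_AD(xs[t], parameters);
    result[t] = lt.value();
    // one gradient row per observation
    gradient_output.row(t) = lt.derivatives().transpose();
  }
  return result;
}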
Data Parallel Gradient III
// [[Rcpp::export]]
Eigen::VectorXd l_t_value_gradient_cppDP_OMP
(
const Eigen::Map<Eigen::VectorXd> xs,
const Eigen::Map<Eigen::VectorXd> parameters_input,
Eigen::Map<Eigen::MatrixXd> gradient_output
)
{
const std::size_t sample_size = xs.size();
Eigen::VectorXd result(sample_size);
#pragma omp parallel for default(none) \
shared(result, gradient_output)
for (std::size_t t = 0; t < sample_size; ++t)
{
using parameter_vector = model::parameter_vector<double>;
using AD = Eigen::AutoDiffScalar<parameter_vector>;
using VectorAD = model::parameter_vector<AD>;
64
Data Parallel Gradient IV
65
Data Parallel Gradient Performance I
microbenchmark(
l_t_value_gradient_cppDP(xs, fixed_parameters, gradient),
l_t_value_gradient_cppDP_OMP(xs, fixed_parameters, gradient)
)
Unit: microseconds
expr median neval
l_t_value_gradient_cppDP 631.3375 100
l_t_value_gradient_cppDP_OMP 258.6945 100
66
Data Parallel Gradient Performance — Conclusions
Worth noting:
67
Caveats
• Source code
  • Recall: AD requires access to the source code
  • We don’t always have it: consider closed-source software (proprietary, third party)
  • Another challenge: source code spanning multiple programming languages
68
Resources
General
http://www.autodiff.org/
http://en.wikipedia.org/wiki/Automatic_differentiation
70
Resources I
71
Resources II
A Step by Step Backpropagation Example
http://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/
Adjoint Methods in Computational Finance
http://www.hpcfinance.eu/sites/www.hpcfinance.eu/files/Uwe%20Nau
http://www.nag.com/Market/seminars/Uwe_AD_Slides_July13.pdf
Adjoints and Automatic (Algorithmic) Differentiation in Computational Finance
http://arxiv.org/abs/1107.1831v1
Algorithmic Differentiation in More Depth
http://www.nag.com/pss/ad-in-more-depth
72
Resources III
Algorithmic Differentiation of a GPU Accelerated Application
http://www.nag.co.uk/Market/events/jdt-hpc-new-thinking-in-finance-presentation.pdf
Automatic Differentiation and QuantLib
https://quantlib.wordpress.com/tag/automatic-differentiation/
Automatic differentiation in deep learning by Shawn Tan
https://cdn.rawgit.com/shawntan/presentations/master/Deep Learning-Copy1.slides.html
Calculus on Computational Graphs: Backpropagation
https://colah.github.io/posts/2015-08-Backprop/
73
Resources IV
Computing derivatives for nonlinear optimization: Forward mode automatic differentiation
http://nbviewer.ipython.org/github/joehuchette/OR-software-tools-2015/blob/master/6-nonlinear-opt/Nonlinear-DualNumbers.ipynb
Introduction to Automatic Differentiation
http://alexey.radul.name/ideas/2013/introduction-to-automatic-differentiation/
Jarrett Revels: Automatic differentiation
https://www.youtube.com/watch?v=PrXUl0sanro
74
Resources V
75
Floating Point Numbers
• http://www.johndcook.com/blog/2009/04/06/anatomy-of-a-floating-point-number/
• https://randomascii.wordpress.com/category/floating-point/
In particular:
• https://randomascii.wordpress.com/2012/02/25/comparing-floating-point-numbers-2012-edition/
• https://randomascii.wordpress.com/2012/04/05/floating-point-complexities/
• https://randomascii.wordpress.com/2013/01/03/top-eight-entertaining-blog-facts-for-2012/
76
Software I
https://en.wikipedia.org/wiki/Automatic_differentiation#Software
ADNumber, Adept, ADOL-C, CppAD, Eigen (Auto Diff module)
CasADi
https://github.com/casadi/casadi/wiki
CasADi is a symbolic framework for algorithmic (a.k.a. automatic)
differentiation and numeric optimization.
CppAD
https://github.com/coin-or/CppAD/
77
Software II
Dali
https://github.com/JonathanRaiman/Dali
An automatic differentiation library that uses reverse-mode differentiation (backpropagation) to differentiate recurrent neural networks, or most mathematical expressions through control flow, while loops, recursion.
Open Porous Media Automatic Differentiation Library
https://github.com/OPM/opm-autodiff
Utilities for automatic differentiation and simulators based on AD.
The Stan Math Library (stan::math: includes a C++
reverse-mode automatic differentiation library)
https://github.com/stan-dev/math
78
References
References I
81
Thank You!
Questions?
82