Algorithmic Differentiation

C++ & Extremum Estimation

Matt P. Dziubinski
CppCon 2015
[email protected] // @matt_dz
Department of Mathematical Sciences, Aalborg University
CREATES (Center for Research in Econometric Analysis of Time Series)
Outline

Why?

Numerical Optimization

Calculating Derivatives

Example Source Code

Resources

References

Slides

• Latest version: https://speakerdeck.com/mattpd

Why?
Motivation: Data & Models

• Have: data, y (numbers)
• Want: understand, explain, forecast
• Tool: model, f(y; θ)
• Need: connection, θ
• Idea: fitting the model to the data
• How: cost function — minimize / maximize
• Nowadays one of the most widely applied approaches —
estimation (econometrics, statistics), calibration (finance),
training ((supervised) machine learning)
• In practice: generally no closed-form expressions —
estimation relies on numerical optimization
Motivation: Parametric Models

• Parametric model: parameter vector θ ∈ Θ ⊆ R^p,
Θ — parameter space
• An estimator θ̂ is an extremum estimator if ∃Q : Θ → R s.t.

θ̂ = argmax_{θ ∈ Θ} Q(θ)

• can be obtained as an extremum (maximum or minimum) of a
scalar objective function
• class includes OLS (Ordinary Least Squares), NLS (Nonlinear Least Squares),
GMM (Generalized Method of Moments), and QMLE (Quasi-Maximum
Likelihood Estimation)
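As a concrete instance (my illustration, not from the slides): ordinary least squares fits the extremum-estimator template with Q(θ) equal to the negative sum of squared residuals of a linear model y ≈ θ·x — a case where, unusually, the argmax also has a closed form we can check against.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Q(theta): negative sum of squared residuals for the linear model y = theta * x.
// Maximizing Q over theta is the OLS problem cast as extremum estimation.
double Q(double theta, const std::vector<double>& x, const std::vector<double>& y) {
    double ssr = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        const double r = y[i] - theta * x[i];
        ssr += r * r;
    }
    return -ssr;
}

// Closed-form OLS solution (exists for this toy case, unlike in general):
// theta_hat = sum(x_i * y_i) / sum(x_i * x_i)
double ols(const std::vector<double>& x, const std::vector<double>& y) {
    double xy = 0.0, xx = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        xy += x[i] * y[i];
        xx += x[i] * x[i];
    }
    return xy / xx;
}
```

For data generated exactly by y = 2x, the closed-form θ̂ is 2 and Q attains its maximum there.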
Numerical Optimization
Numerical Optimization: Algorithms

• Numerical optimization algorithms can be broadly divided into two
categories:
• derivative-free — do not rely on knowledge of the objective
function’s gradient in the optimization process
• gradient-based — need the gradient of the objective function
• iteration form:
initial iterate: θ0
new iterate: θk = θk−1 + sk, iteration k ≥ 1
step: sk = αk pk
length: scalar αk > 0
direction: vector pk s.t. Hk pk = ∇Q(θk)
• Hk:
Hk = I (Steepest Ascent) — many variants, including SGD (Stochastic
Gradient Descent)
Hk = ∇²Q(θk) (Newton)
Hk = a Hessian approximation satisfying the secant equation:
Hk+1 sk = ∇Qk+1 − ∇Qk (Quasi-Newton)
Numerical Optimization: Gradient

• Gradient-based algorithms often exhibit superior convergence
rates (superlinear or even quadratic)
• However: this property depends on the gradient being either
exact — or at minimum sufficiently accurate in the limit,
Nocedal and Wright (2006, Chapter 3)
• For many modern models, analytical gradient formulas are not
available in closed form — or are non-trivial to derive
• This often leads to the use of numerical approximations — in
particular, finite difference methods (FDMs), Nocedal and
Wright (2006, Chapter 8)
• Inference: covariance matrix estimators also involve the use of
derivative information.
Numerical Optimization: Algorithmic Differentiation

• Need a way to obtain the gradient — preferably: fast and accurate
• This is where Algorithmic Differentiation (AD) comes in:
• fast
• and accurate
• and simple
• Essentially:
• an automatic computational differentiation method
• based on the systematic application of the chain rule
• exact to the maximum extent allowed by the machine precision
• How does it compare to other computational differentiation
methods?
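One way to see the "systematic application of the chain rule" concretely is forward-mode AD via dual numbers — a minimal sketch of mine (not the talk's implementation): every value carries its derivative, and every operation propagates both, exactly, to machine precision.

```cpp
#include <cassert>
#include <cmath>

// Dual number: value + derivative, propagated through every operation.
struct Dual {
    double val;  // f(x)
    double der;  // f'(x)
};

// Product rule and chain rule, applied once per primitive operation:
Dual operator*(Dual a, Dual b) { return {a.val * b.val, a.der * b.val + a.val * b.der}; }
Dual sin(Dual a) { return {std::sin(a.val),  std::cos(a.val) * a.der}; }
Dual cos(Dual a) { return {std::cos(a.val), -std::sin(a.val) * a.der}; }

// The function from the slides' later example: f(x) = cos(sin(x) * cos(x)).
// Written generically, it runs on double or Dual unchanged.
template <typename T>
T f(T x) {
    using std::sin;
    using std::cos;
    return cos(sin(x) * cos(x));
}

// Seed der = 1 to differentiate with respect to x.
double df(double x) { return f(Dual{x, 1.0}).der; }
```

The result agrees with the analytical derivative f′(x) = −cos(2x)·sin(sin(x)·cos(x)) to round-off, with no step-size h to tune.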
Calculating Derivatives
Calculating Derivatives

• Main approaches:
• Finite Differencing
• Algorithmic Differentiation (AD), a.k.a. Automatic Differentiation
• Symbolic Differentiation

Finite Differencing

• Recall f′(x) = lim_{h→0} [f(x + h) − f(x)] / h
• forward-difference approximation:

∂f/∂xi (x) ≈ [f(x + h ei) − f(x)] / h

• central-difference approximation:

∂f/∂xi (x) ≈ [f(x + h ei) − f(x − h ei)] / (2h)

• Computational cost:
• forward-difference formula: p + 1 function evaluations
• central-difference formula: 2p function evaluations
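The two univariate formulas are one line each — a sketch of mine, with f = exp chosen so the exact derivative is known for comparison:

```cpp
#include <cassert>
#include <cmath>

// Forward-difference approximation: (f(x + h) - f(x)) / h
template <typename F>
double forward_diff(F f, double x, double h) {
    return (f(x + h) - f(x)) / h;
}

// Central-difference approximation: (f(x + h) - f(x - h)) / (2h)
template <typename F>
double central_diff(F f, double x, double h) {
    return (f(x + h) - f(x - h)) / (2.0 * h);
}
```

For a gradient in R^p these become p + 1 (forward) or 2p (central) objective evaluations, one per coordinate direction ei.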
Finite Differencing — Truncation Error

• Truncation Error Analysis — Taylor’s theorem
• Consider a univariate function f and the forward-difference formula:

f′(x) ≈ [f(x + h) − f(x)] / h

• Recall Taylor’s formula:

f(x + h) = f(x) + h f′(x) + (1/2) h² f′′(x) + (1/6) h³ f′′′(x) + O(h⁴)

• Hence

f′(x) = [f(x + h) − f(x)] / h − (1/2) h f′′(x) − (1/6) h² f′′′(x) − O(h³)

• Truncation error: c1 h + c2 h² + . . .
• Decreases with smaller h
Floating Point Arithmetic

• Floating Point Arithmetic: NO associativity (or distributivity) of
addition (and multiplication)
• Example:
• Input:
a = 1234.567; b = 45.678; c = 0.0004
a + b + c
c + b + a
(a + b + c) - (c + b + a)
• Output:
• 1280.2454
• 1280.2454
• -2.273737e-13
• “Many a serious mathematician has attempted to give rigorous
analyses of a sequence of floating point operations, but has found
the task to be so formidable that he has tried to content himself with
plausibility arguments instead.” — Donald E. Knuth, TAOCP Vol. 2
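The same experiment in C++ (my sketch; `sum_ltr` is just a named left-to-right sum, mirroring how `a + b + c` associates in both that snippet and C++): reordering the operands changes which intermediate roundings occur, so with the slides' values the two sums should differ by roughly −2.3e-13 in binary64.

```cpp
#include <cassert>

// Left-to-right sum of three doubles: (x + y) + z — the same association
// the expression x + y + z has. Each '+' rounds its result to the nearest
// double, so the operand order determines which roundings happen.
double sum_ltr(double x, double y, double z) {
    return (x + y) + z;
}
```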
Floating Point Representation

• Floating point representation: inspired by scientific notation
• The IEEE Standard for Floating-Point Arithmetic (IEEE 754)
• Form: significant digits × base^exponent
• Machine epsilon, ϵM := inf{ϵ > 0 : 1.0 + ϵ ̸= 1.0}
• the difference between 1.0 and the next representable
floating-point value, 1.0 + ϵM
• IEEE 754 binary32 (single precision): 2^−23 ≈ 1.19209e-07 (1 sign
bit, 8 exponent bits, leaving 23 bits for the significand)
• IEEE 754 binary64 (double precision): 2^−52 ≈ 2.22045e-16 (1 sign
bit, 11 exponent bits, leaving 52 bits for the significand)
• floating point number representation: fl(x) = x(1 + δ), where
|δ| < ϵM
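Both epsilons are available directly from the standard library — a two-line check of the values above:

```cpp
#include <cassert>
#include <limits>

// Machine epsilon for the two IEEE 754 formats: the gap between 1.0 and
// the next representable value of each type.
const float  eps32 = std::numeric_limits<float>::epsilon();   // 2^-23
const double eps64 = std::numeric_limits<double>::epsilon();  // 2^-52
```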
Floating Point Representation Properties

• floating point number representation: fl(x) = x(1 + δ), where
|δ| < ϵM
• representation (round-off) error: x − fl(x)
• absolute round-off error: |x − fl(x)| < ϵM |x|
• relative round-off error: |x − fl(x)|/|x| < ϵM
• ϵM a.k.a. unit round-off, u = ULP(1.0)
• ULP — Unit in the Last Place
• the distance between the two closest straddling floating-point
numbers a and b (i.e., those with a ≤ x ≤ b and a ̸= b)
• a measure of precision in numeric calculations. In base b, if x has
exponent E, then ULP(x) = ϵM · b^E.
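ULP(x) can be computed with the standard `std::nextafter` — a sketch checking both u = ULP(1.0) = ϵM and the scaling ULP(x) = ϵM · b^E (here b = 2, so ULP doubles when the exponent increases by one):

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// ULP(x): distance from x to the next representable double toward +infinity.
double ulp(double x) {
    return std::nextafter(x, std::numeric_limits<double>::infinity()) - x;
}
```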
Finite Differencing — Round-off Error I

• Consider

f′(x) ≈ [f(x + h) − f(x)] / h

• Suppose that x and x + h are exact
• Round-off errors when evaluating f:

[f(x + h)(1 + δ1) − f(x)(1 + δ2)] / h = [f(x + h) − f(x)] / h + [δ1 f(x + h) − δ2 f(x)] / h

• We have |δi| ≤ ϵM for i ∈ {1, 2}.
• =⇒ round-off error bound:

ϵM [|f(x + h)| + |f(x)|] / h

• =⇒ round-off error bound order: O(ϵM / h)
• round-off error — increases with smaller h
• error contribution inversely proportional to h, with proportionality
constant O(ϵM)
Finite Differencing — Round-off Error II

• ϵM ̸= 0 =⇒ ∃ĥ : x + ĥ = x
• i.e., x and x + h are not guaranteed to be exactly representable, either
• =⇒ f(x) = f(x + ĥ)
• =⇒ f(x + ĥ) − f(x) = 0
• =⇒ no correct digits in the result at (or below) step-size ĥ
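This failure mode is directly observable (my sketch): any step below roughly ϵM·|x|/2 is absorbed by rounding, so the difference quotient's numerator is identically zero.

```cpp
#include <cassert>

// True when the step h is so small that x + h rounds back to x exactly,
// making f(x + h) - f(x) == 0 for any f evaluated at those arguments.
bool step_vanishes(double x, double h) {
    return (x + h) == x;
}
```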
Finite Differencing — Truncation—Round-off Trade-off

• Total error: O(h) + O(ϵM / h) (forward-difference formula)
• Best accuracy when the contributions are approximately equal
• h ≈ ϵM / h
• h² ≈ ϵM
• h ≈ √ϵM
• binary64: 1.490116e-08
• total error’s minimum value ∈ O(√ϵM)
• binary64: 1.490116e-08
• Total error: O(h²) + O(ϵM / h) (central-difference formula)
• h ≈ ϵM^(1/3)
• binary64: 6.055454e-06
• total error’s minimum value ∈ O(ϵM^(2/3))
• binary64: 3.666853e-11
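The trade-off can be checked numerically — a sketch of mine using f = exp at x = 0 (so the exact derivative is 1): the forward-difference error near h ≈ √ϵM beats both a much larger h (truncation-dominated) and a much smaller h (round-off-dominated).

```cpp
#include <cassert>
#include <cmath>

// Absolute error of the forward-difference approximation to f'(x),
// using f = exp so the exact derivative exp(x) is known.
double fd_error(double x, double h) {
    const double approx = (std::exp(x + h) - std::exp(x)) / h;
    return std::abs(approx - std::exp(x));
}
```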
Finite Differencing — Central-Difference Formula

 • Subtracting the backward-difference approximation from the
   forward-difference approximation:
 • (central-difference)

   f′(x) = (f(x + h) − f(x − h)) / (2h) − O(h²)

 • Truncation error ∈ O(h²) (second-order approximation)
 • Round-off error ∈ O(ϵM/h)
 • Balanced trade-off:
   • h ≈ ϵM^(1/3)
   • total error’s minimum value ∈ O(ϵM^(2/3))

21
Finite Difference Approximation Error
Example: f(x) = cos(sin(x) ∗ cos(x)), f′(x) = −cos(2 ∗ x) ∗ sin(sin(x) ∗ cos(x))

 • Total error: O(h) + O(ϵM/h) (forward-difference formula)
   • h ≈ √ϵM — binary64: 1.490116e-08
   • total error’s minimum value ∈ O(√ϵM) — binary64: 1.490116e-08
 • Total error: O(h²) + O(ϵM/h) (central-difference formula)
   • h ≈ ϵM^(1/3) — binary64: 6.055454e-06
   • total error’s minimum value ∈ O(ϵM^(2/3)) — binary64: 3.666853e-11

22
Symbolic Differentiation

 • Algebraic specification for f manipulated by symbolic
   manipulation tools to produce new algebraic expressions for
   each component of the gradient
 • Commonly used symbolic manipulation tools can be found in
   the packages Mathematica (also Wolfram Alpha), Macsyma,
   Maple, or Mathics.
 • Disadvantages:
   • narrow applicability
   • expression swell

23
Numerical Objective Function & Algorithmic Differentiation I

 • Symbolic differentiation (SD) applicable to “paper-and-pencil”
   algebraic expressions
   • Input: function specified by an algebraic formula or closed-form
     expression
 • Objective function in numerical optimization — a computer
   program
 • How are computer programs different?
   • named intermediates
   • program branches
   • joint allocations / mutation / overwriting
   • iterative loops
 • Do we have to resort to finite differences?

24
Numerical Objective Function & Algorithmic Differentiation II

 • Algorithmic Differentiation (AD) offers another solution.
 • AD — a computational differentiation method with roots in the
   late 1960s, Griewank (2012)
   • Machine Learning — Neural Networks — Backpropagation: Bryson,
     Denham, Dreyfus (1963), Werbos (1974)
 • AD widely known and applied in natural sciences — Naumann
   (2012), and computational finance — Giles and Glasserman (2006),
   Homescu (2011)

25
Algorithmic Differentiation — Idea I

 • Idea:
   • computer code for evaluating the function can be broken down
   • into a composition of elementary arithmetic operations
   • to which the chain rule (one of the basic rules of calculus) can be
     applied
   • e.g., given an objective function Q(θ) = f(g(θ)), we have

     Q′ = (f ◦ g)′ = (f′ ◦ g) · g′

 • Relies on the systematic application of the chain rule by a
   computer program
   • operating at the source code level
   • performed automatically
   • with minimal or no user intervention

26
Algorithmic Differentiation — Idea II

 • Mathematical operations performed exactly
   • to the maximum extent allowed by the machine precision — i.e.,
     accurate to within round-off error
 • truncation error: none
 • round-off error: equivalent to using a source code analogously
   implementing the analytical derivatives.

27
Algorithmic Differentiation — Idea III

 • AD similar to symbolic differentiation
 • major difference:
   • SD restricted to symbolic manipulation of algebraic expressions
   • AD operates on (and integrates with) the existing source code of a
     computer program, Nocedal and Wright (2006, Section 8.2).
 • Note: less than required by SD, more than required by finite
   differences (which can operate on black box implementations)
 • AD Input: (source code of) a computer program for evaluating a vector
   function
 • AD Output: (source code of) a computer program implementing
   derivative(s) thereof

28
Algorithmic Differentiation — Implementations

 • Two implementation styles, Naumann (2012):
   • source code transformation (SCT)
     • e.g., ADIFOR — produce new code that calculates both function and
       derivative values
   • operator overloading
     • e.g., ADOL-C — keep track of the elementary computations during the
       function evaluation, produce the derivatives at given inputs
 • Two modes: forward mode and reverse mode
 • Example: Eigen’s Auto Diff module

29
Algorithmic Differentiation — Example

MLE

 • a more concrete illustration: example in the context of
   statistical estimation
 • evaluation and differentiation of the loglikelihood function of a
   normally distributed sample
 • we have T > 1 observations x1, x2, ..., xT, i.i.d. ∼ N(µ, σ²), with
   µ ∈ R and σ ∈ R+
 • our parameter vector is θ = (µ, σ²)
 • the likelihood function is
   L(θ; x) = f(x; θ) = (σ√(2π))^(−T) exp[−½ Σ_{t=1}^{T} (xt − µ)²/σ²]

30
Algorithmic Differentiation — Evaluation Trace

 • objective function: the loglikelihood function,
   ℓ(θ) = log L(θ; x) = Σ_{t=1}^{T} ℓt(θ)
 • a contribution given by
   ℓt(θ) = −½ log(2π) − ½ log(σ²) − ½ (xt − µ)²/σ².
 • goal: compute the value of ℓt corresponding to the arguments
   xt = 6, µ = 6, and σ² = 4.
 • A computer program may execute the sequence of operations
   represented by an evaluation trace, Griewank and Walther
   (2008)

31
Algorithmic Differentiation — Evaluation Trace

Variable   Operation         Expression               Value
v−1      = µ                                        = 6
v0       = σ²                                       = 4
v1       = ϕ1(v0)          = log(v0)                = 1.386294
v2       = ϕ2(v1)          = −½ log(2π) − ½ v1      = −1.612086
v3       = ϕ3(v−1)         = xt − v−1               = 0
v4       = ϕ4(v3)          = v3²                    = 0
v5       = ϕ5(v0, v4)      = v4/v0                  = 0
v6       = ϕ6(v5)          = −½ v5                  = 0
v7       = ϕ7(v2, v6)      = v2 + v6                = −1.612086

ℓt       = v7                                       = −1.612086

32
Algorithmic Differentiation — Evaluation Trace

 • evaluation of ℓt broken down
   • into a composition of intermediate subexpressions vi, with
     i ∈ {1, . . . , 7}
   • intermediate subexpressions: elementary operations and
     functions ϕi with known, predefined derivatives
   • note: similar to the static single assignment form (SSA) used in
     compiler design and implementation
 • We can now apply the chain rule to the intermediate
   operations vi and collect the intermediate derivative results in
   order to obtain the derivative of the final expression ℓt = v7
   • Recall: Two modes of propagating the intermediate information:
     the forward mode and the reverse mode
   • forward mode: Eigen’s Auto Diff module used here

33
Algorithmic Differentiation — Forward Mode

 • goal: the derivative of the output variable ℓt (dependent
   variable) with respect to the input variable µ (independent
   variable)
 • idea: compute the numerical value of the derivative of each of
   the intermediate subexpressions, keeping track of the resulting
   derivative values along the way.
   • for each variable vi: introduce another variable, v̇i = ∂vi/∂µ
   • then: apply the chain rule mechanically to each intermediate
     subexpression
   • assigning the resulting numerical value to each v̇i.

34
Algorithmic Differentiation — Forward Mode

 • in our case
   • start with v̇−1 = 1.0 (active differentiation variable µ),
   • v̇0 = 0.0 (passive: we are not differentiating with respect to σ² and
     thus treat it as a constant)
   • then, given that v1 = ϕ1(v0) = log(v0), we obtain
     v̇1 = (∂ϕ1(v0)/∂v0)v̇0 = v̇0/v0 = 0.0
   • analogously, v2 = ϕ2(v1) = −½ log(2π) − ½ v1, hence
     v̇2 = (∂ϕ2(v1)/∂v1)v̇1 = −½ v̇1 = 0.0
 • by continuing this procedure we ultimately obtain an
   augmented, forward-derived, evaluation trace

35
Algorithmic Differentiation — Augmented Evaluation Trace

Variable   Operation                          Expression     Value
v−1      = µ                                               = 6
v̇−1      = ∂v−1/∂v−1                                       = 1
v0       = σ²                                              = 4
v̇0       = ∂v0/∂v−1                                        = 0
v1       = ϕ1(v0)                         = log(v0)        = 1.386294
v̇1       = (∂ϕ1/∂v0)v̇0                    = v̇0/v0          = 0
...        ...                              ...              ...
v7       = ϕ7(v2, v6)                     = v2 + v6        = −1.612086
v̇7       = (∂ϕ7/∂v2)v̇2 + (∂ϕ7/∂v6)v̇6     = v̇2 + v̇6        = 0

ℓt       = v7                                              = −1.612086
ℓ̇t       = v̇7                                              = 0

36
Algorithmic Differentiation — Augmented Evaluation Trace

 • The basic forward mode of AD — called “forward” because the
   derivative values v̇i are carried along simultaneously with the
   values vi themselves
 • The problem addressed by the implementations:
   • “transform a program with a particular evaluation trace into an
     augmented program whose evaluation trace also contains exactly
     the same extra variables and additional lines as in the derived
     evaluation trace,” Griewank and Walther (2008).

37
AD — Reverse Mode

 • An alternative to the forward approach — the reverse or adjoint
   mode.
 • Griewank and Walther (2008): “Rather than choosing an input
   variable and calculating the sensitivity of every intermediate
   variable with respect to that input, we choose instead an
   output variable and calculate the sensitivity of that output with
   respect to each of the intermediate variables. As it turns out
   adjoint sensitivities must naturally be computed backward, i.e.,
   starting from the output variables.”
 • This talk: we shall not treat it in any further detail and refer
   anyone interested to Griewank and Walther (2008).
 • Worth noting: the choice between the modes depends on the
   computational cost.

38
AD — Forward Mode vs. Reverse Mode

 • In general, for f : Rn → Rm
   • rough CostAD,Forward(f) ∈ n O(Cost(f))
   • rough CostAD,Reverse(f) ∈ m O(Cost(f))
 • Which is better? Depends: e.g., gradient vs. Jacobian
 • Forward mode performance: can establish a worst-case
   performance bound for AD when used for gradient computation

39
Illustration: Simulation Study — Setup

The Model: GARCH(1, 1)

The spot asset price S follows the following process under the
physical probability measure P:

    log(St+1/St) ≡ rt+1 = µ + ϵt+1        (1)
    ϵt+1 = √(vt+1) wt+1                   (2)
    vt+1 = ω + a ϵt² + b vt               (3)
    w ∼ GWN(0, 1) under P                 (4)

40
Illustration: Simulation Study — Setup — Notation

• θ = (µ, ω, a, b)⊤
• r — logarithmic return of the spot price
• µ — conditional mean of r
• v — conditional variance of r
• non-negativity: ω, a, b > 0
• w — randomness source (innovation): standard Gaussian white noise
• mv = 1−a−b
ω
— the unconditional mean of v (unconditional variance of
r)
• pv = (a + b) — the mean-reversion rate or the persistence rate
• when pv < 1, weak stationarity, the conditional variance will revert
to the unconditional variance at a geometric rate of pv
• smaller pv — less persistent sensitivity of the volatility expectation
to past market shocks
• pv = 1 — integrated GARCH (IGARCH) process

41
Illustration: Simulation Study — Setup — Notation

• θ = (µ, ω, a, b)⊤
• r — logarithmic return of the spot price
• µ — conditional mean of r
• v — conditional variance of r
• non-negativity: ω, a, b > 0
• w — randomness source (innovation): standard Gaussian white noise
• mv = ω / (1 − a − b) — the unconditional mean of v (the unconditional variance of r)
• pv = a + b — the mean-reversion rate or the persistence rate
• when pv < 1 (weak stationarity), the conditional variance reverts to the unconditional variance at a geometric rate of pv
• smaller pv — less persistent sensitivity of the volatility expectation to past market shocks
• pv = 1 — integrated GARCH (IGARCH) process
• common finding: pv close to 1 in financial data, a ”near-integrated” process, Bollerslev and Engle (1993).

41
Illustration: Monte Carlo Simulation

• DGP (data-generating process): near-integrated scenario, θDGP:
  µ = 0.01 (risk-free rate), ω = 2e−6, a = 0.10, b = 0.85
• M = 10,000 (MC estimation experiments / replications count)
• N = 2,000 ... 16,000 (sample size)
• 2,000 observations / 252 business days — roughly 8 years of data
• S&P 500 from January 3rd, 1950 — over 16,000 observations

42
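For these DGP values the persistence and unconditional variance follow directly from the notation slide: pv = a + b = 0.95 (near-integrated), and mv = ω/(1 − pv) = 2e−6/0.05 = 4e−5. A quick sanity-check sketch (illustrative, not the talk's code):

```cpp
#include <cassert>
#include <cmath>

// DGP parameters from the slide above (illustrative check, not talk code).
constexpr double omega = 2e-6, a = 0.10, b = 0.85;
constexpr double p_v = a + b;               // persistence: 0.95, near-integrated
constexpr double m_v = omega / (1.0 - p_v); // unconditional variance: 4e-5
```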
Illustration: Reliability — AD, L-BFGS

43
Illustration: Reliability — FD, TNR

44
Illustration: Findings — Reliability

• Reliability of estimation:
  • using AD: successful convergence in the vast majority of Monte Carlo experiments
  • using FD: very high convergence failure ratio
• Successful-convergence ratios (fraction of MC experiments with successful convergence within a 0.01 neighborhood of the true DGP values) for the best algorithm-gradient pairs:
  • L-BFGS(AD): average = 0.811, median = 0.99999903,
  • L-BFGS(FD): average = 0.446, median = 0.02893254 (sic),
  • Truncated Newton with restarting (TNR) (FD): average = 0.708, median = 0.99856261.

45
Illustration: Findings

• Overall: when using the fastest successful algorithm-gradient pair for both AD and FD:
  • AD achieves 12.5 times higher accuracy (in terms of the error norm) with a 3× slowdown compared to L-BFGS(FD),
  • while achieving 4 times higher accuracy and a 1.1× speedup compared to TNR(FD).

46
Example Source Code
Example using Rcpp

Rcpp::sourceCpp('ExampleGaussianRcpp.cpp')

• Note: Rcpp not necessary for any of this:
  • Feel free to skip over any line with Rcpp::
  • Add #include <Eigen/Dense> and #include <unsupported/Eigen/AutoDiff> for plain vanilla Eigen

48
ExampleGaussianRcpp.cpp I

// [[Rcpp::plugins("cpp11")]]
#include <cstddef>

// [[Rcpp::depends(BH)]]
#include <boost/math/constants/constants.hpp>

// [[Rcpp::depends(RcppEigen)]]
#include <RcppEigen.h>

#include <cmath>

49
ExampleGaussianRcpp.cpp II

namespace model
{
// 2 parameters: mu, sigma^2
enum parameter : std::size_t { mu, s2 };
constexpr std::size_t parameters_count = 2;
}

50
ExampleGaussianRcpp.cpp III

// [[Rcpp::export]]
double l_t_cpp(double xt,
const Eigen::Map<Eigen::VectorXd> parameters)
{
const auto mu = parameters[model::parameter::mu];
const auto s2 = parameters[model::parameter::s2];
constexpr auto two_pi =
boost::math::constants::two_pi<double>();
using std::log;
using std::pow;
return -.5 * log(two_pi) - .5 * log(s2) -
.5 * pow(xt - mu, 2) / s2;
}

51
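As a quick check of the log-density formula above: at xt = µ with σ² = 1 it reduces to −½ log(2π) ≈ −0.918939. A standalone replica (a sketch without the Rcpp/Boost machinery; Boost's two_pi constant is simply 2π):

```cpp
#include <cassert>
#include <cmath>

// Standalone replica of l_t_cpp above, without Rcpp/Boost
// (Boost's two_pi constant is simply 2*pi).
double l_t(double xt, double mu, double s2)
{
    const double two_pi = 2.0 * std::acos(-1.0);
    return -.5 * std::log(two_pi) - .5 * std::log(s2)
           - .5 * std::pow(xt - mu, 2) / s2;
}
```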
AD Example using RcppEigen

Rcpp::sourceCpp('ExampleGaussianRcppEigen.cpp')

52
ExampleGaussianRcppEigen.cpp I

// [[Rcpp::plugins("cpp11")]]
#include <cstddef>

// [[Rcpp::depends(BH)]]
#include <boost/math/constants/constants.hpp>

// [[Rcpp::depends(RcppEigen)]]
#include <RcppEigen.h>
#include <unsupported/Eigen/AutoDiff>

#include <cmath>

53
ExampleGaussianRcppEigen.cpp II

namespace model
{
// 2 parameters: mu, sigma^2
enum parameter : std::size_t { mu, s2 };
constexpr std::size_t parameters_count = 2;

template <typename ScalarType>
using parameter_vector =
    Eigen::Matrix<ScalarType, parameters_count, 1>;
}

54
ExampleGaussianRcppEigen.cpp III
// note: data `xt` -- double-precision number(s), just as before
// only the parameters adjusted to `ScalarType`
template <typename VectorType>
typename VectorType::Scalar
l_t_cpp_AD(double xt,
const VectorType & parameters)
{
using Scalar = typename VectorType::Scalar;
const Scalar & mu = parameters[model::parameter::mu];
const Scalar & s2 = parameters[model::parameter::s2];
// note: `two_pi` is, as always, a double-precision constant
constexpr auto two_pi =
boost::math::constants::two_pi<double>();
using std::log; using std::pow;
return -.5 * log(two_pi) - .5 * log(s2) -
.5 * pow(xt - mu, 2) / s2;
}

55
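The forward-mode idea behind Eigen::AutoDiffScalar can be spelled out with a minimal hand-rolled dual number (a sketch for intuition, not the talk's implementation). For this log-density the analytic gradient is ∂l/∂µ = (x − µ)/σ² and ∂l/∂σ² = −½/σ² + ½(x − µ)²/σ⁴, which the dual arithmetic reproduces to machine precision:

```cpp
#include <cassert>
#include <cmath>

// Minimal dual number: value plus one derivative slot, illustrating the
// forward-mode idea behind Eigen::AutoDiffScalar (a sketch, not the talk's code).
struct Dual { double v, d; };
Dual operator-(Dual a) { return {-a.v, -a.d}; }
Dual operator-(Dual a, Dual b) { return {a.v - b.v, a.d - b.d}; }
Dual operator*(Dual a, Dual b) { return {a.v * b.v, a.d * b.v + a.v * b.d}; }
Dual operator/(Dual a, Dual b)
{ return {a.v / b.v, (a.d * b.v - a.v * b.d) / (b.v * b.v)}; }
Dual log(Dual a) { return {std::log(a.v), a.d / a.v}; }

// Same Gaussian log-density as l_t_cpp_AD, with Dual parameters:
// set mu.d = 1 (s2.d = 0) to get dl/dmu; set s2.d = 1 for dl/ds2.
Dual l_t_dual(double xt, Dual mu, Dual s2)
{
    const Dual half{0.5, 0.0};
    const Dual two_pi{2.0 * std::acos(-1.0), 0.0};
    const Dual diff = Dual{xt, 0.0} - mu;
    return -(half * log(two_pi)) - half * log(s2) - half * diff * diff / s2;
}
```

One active variable per pass, exactly like the `parameter_vector::Unit(p)` loop in the Eigen version that follows.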
ExampleGaussianRcppEigen.cpp IV

// wrapper over `l_t_cpp_AD`


// (simply dispatching to the existing implementation)
//
// (Rcpp-exportable explicit template instantiations
// would render this unnecessary)

// [[Rcpp::export]]
double l_t_value_cpp(double xt,
const Eigen::Map<Eigen::VectorXd> parameters)
{
return l_t_cpp_AD(xt, parameters);
}

56
ExampleGaussianRcppEigen.cpp V
// objective function together with its gradient
// `xt`: input (data)
// `parameters_input`: input (model parameters)
// `gradient_output`: output (computed gradient)
// returns: computed objective function value
// [[Rcpp::export]]
double l_t_value_gradient_cpp
(
double xt,
const Eigen::Map<Eigen::VectorXd> parameters_input,
Eigen::Map<Eigen::VectorXd> gradient_output
)
{
using parameter_vector = model::parameter_vector<double>;
using AD = Eigen::AutoDiffScalar<parameter_vector>;
using VectorAD = model::parameter_vector<AD>;

57
ExampleGaussianRcppEigen.cpp VI

parameter_vector parameters = parameters_input;


VectorAD parameters_AD = parameters.cast<AD>();

constexpr auto P = model::parameters_count;


for (std::size_t p = 0; p != P; ++p)
{
// forward mode:
// active differentiation variable `p`
// (one parameter at a time)
parameters_AD(p).derivatives() = parameter_vector::Unit(p);
}

const AD & loglikelihood_result = l_t_cpp_AD(xt, parameters_AD);


gradient_output = loglikelihood_result.derivatives();
return loglikelihood_result.value();
}

58
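A central finite difference over the same log-density gives a quick cross-check against the AD gradient (a sketch; the step-size trade-off it embodies is exactly what FD gets wrong on harder problems):

```cpp
#include <cassert>
#include <cmath>

// Plain replica of the Gaussian log-density (no Rcpp/Boost).
double l_t_plain(double xt, double mu, double s2)
{
    const double two_pi = 2.0 * std::acos(-1.0);
    return -.5 * std::log(two_pi) - .5 * std::log(s2)
           - .5 * (xt - mu) * (xt - mu) / s2;
}

// Central finite difference for dl/dmu: O(h^2) truncation error plus
// roundoff of order eps/h, the step-size trade-off AD avoids entirely.
double dl_dmu_fd(double xt, double mu, double s2, double h = 1e-6)
{
    return (l_t_plain(xt, mu + h, s2) - l_t_plain(xt, mu - h, s2)) / (2.0 * h);
}
```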
Data Parallel Objective Function I

// First: make the data parallelism (DP) explicit


// [[Rcpp::export]]
Eigen::VectorXd l_t_value_cppDP
(
const Eigen::Map<Eigen::VectorXd> xs,
const Eigen::Map<Eigen::VectorXd> parameters
)
{
const std::size_t sample_size = xs.size();
Eigen::VectorXd result(sample_size);
for (std::size_t t = 0; t != sample_size; ++t)
{
result[t] = l_t_cpp_AD(xs[t], parameters);
}
return result;
}

59
Data Parallel Objective Function II
// Second: Parallelize using OpenMP
// [[Rcpp::export]]
Eigen::VectorXd l_t_value_cppDP_OMP
(
const Eigen::Map<Eigen::VectorXd> xs,
const Eigen::Map<Eigen::VectorXd> parameters
)
{
const std::size_t sample_size = xs.size();
Eigen::VectorXd result(sample_size);
#pragma omp parallel for default(none) shared(result)
for (std::size_t t = 0; t < sample_size; ++t)
{
result[t] = l_t_cpp_AD(xs[t], parameters);
}
return result;
}

60
Data Parallel Objective Function Performance I

> require("microbenchmark")

> microbenchmark(
+ l_t(xs, fixed_parameters),
+ l_t_value_cppDP(xs, fixed_parameters),
+ l_t_value_cppDP_OMP(xs, fixed_parameters)
+ )
Unit: microseconds
expr median neval
l_t_value_cppDP(xs, fixed_parameters) 458.618 100
l_t_value_cppDP_OMP(xs, fixed_parameters) 213.526 100

Note: changing l_t_cpp_AD to l_t_cpp_AD_DP (analogously) can also help, but we’re not done yet...

61
Data Parallel Gradient I

// [[Rcpp::export]]
Eigen::VectorXd l_t_value_gradient_cppDP
(
const Eigen::Map<Eigen::VectorXd> xs,
const Eigen::Map<Eigen::VectorXd> parameters_input,
Eigen::Map<Eigen::MatrixXd> gradient_output
)
{
const std::size_t sample_size = xs.size();
Eigen::VectorXd result(sample_size);
for (std::size_t t = 0; t != sample_size; ++t)
{
using parameter_vector = model::parameter_vector<double>;
using AD = Eigen::AutoDiffScalar<parameter_vector>;
using VectorAD = model::parameter_vector<AD>;

62
Data Parallel Gradient II

parameter_vector parameters = parameters_input;


VectorAD parameters_AD = parameters.cast<AD>();

constexpr auto P = model::parameters_count;


for (std::size_t p = 0; p != P; ++p)
{
parameters_AD(p).derivatives() = parameter_vector::Unit(p);
}
const AD & loglikelihood_result = l_t_cpp_AD(xs[t],
parameters_AD);
gradient_output.row(t) = loglikelihood_result.derivatives();
result[t] = loglikelihood_result.value();
}
return result;
}

63
Data Parallel Gradient III
// [[Rcpp::export]]
Eigen::VectorXd l_t_value_gradient_cppDP_OMP
(
const Eigen::Map<Eigen::VectorXd> xs,
const Eigen::Map<Eigen::VectorXd> parameters_input,
Eigen::Map<Eigen::MatrixXd> gradient_output
)
{
const std::size_t sample_size = xs.size();
Eigen::VectorXd result(sample_size);
#pragma omp parallel for default(none) \
shared(result, gradient_output)
for (std::size_t t = 0; t < sample_size; ++t)
{
using parameter_vector = model::parameter_vector<double>;
using AD = Eigen::AutoDiffScalar<parameter_vector>;
using VectorAD = model::parameter_vector<AD>;

64
Data Parallel Gradient IV

parameter_vector parameters = parameters_input;


VectorAD parameters_AD = parameters.cast<AD>();

constexpr auto P = model::parameters_count;


for (std::size_t p = 0; p != P; ++p)
{
parameters_AD(p).derivatives() = parameter_vector::Unit(p);
}
const AD & loglikelihood_result = l_t_cpp_AD(xs[t],
parameters_AD);
gradient_output.row(t) = loglikelihood_result.derivatives();
result[t] = loglikelihood_result.value();
}
return result;
}

65
Data Parallel Gradient Performance I

microbenchmark(
l_t_value_gradient_cppDP(xs, fixed_parameters, gradient),
l_t_value_gradient_cppDP_OMP(xs, fixed_parameters, gradient)
)
Unit: microseconds
expr median neval
l_t_value_gradient_cppDP 631.3375 100
l_t_value_gradient_cppDP_OMP 258.6945 100

66
Data Parallel Gradient Performance — Conclusions

Worth noting:

• speed-up over the naive serial version thanks to OpenMP
• we can do better: a truly data-parallel (non-naive) objective function implementation l_t_cpp_AD_DP, with reverse-mode AD
• left as an exercise for the audience! :-)

67
Caveats

• Source code
  • Recall: AD requires access to the source code
  • We don’t always have it: consider closed-source software (proprietary, third-party)
  • Another challenge: source code spanning multiple programming languages
• Reverse mode — memory requirements
• Perverse (but valid) code:
  if (x == 1.0) then y = 0.0 else y = x − 1.0
  • derivative at x = 1?

68
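The branching pitfall in the last bullet is easy to reproduce with a bare-bones dual number (a sketch, not from the talk): the function equals y = x − 1 for every x, so dy/dx = 1 everywhere, but AD differentiates only the branch actually taken and reports 0 at x = 1.

```cpp
#include <cassert>

// Forward-mode dual (value, derivative) -- just enough to show the pitfall.
struct Dual { double v, d; };

// The "perverse" function: mathematically y == x - 1 for every x,
// so dy/dx == 1 everywhere -- but AD only sees the branch taken.
Dual f(Dual x)
{
    if (x.v == 1.0) return Dual{0.0, 0.0}; // constant branch: derivative 0
    return Dual{x.v - 1.0, x.d};           // d(x - 1)/dx = 1
}
```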
Resources
General

http://www.autodiff.org/
http://en.wikipedia.org/wiki/Automatic_differentiation

70
Resources I

A Gentle Introduction to Backpropagation


http://numericinsight.blogspot.com/2014/07/a-gentle-
introduction-to-backpropagation.html
http://www.numericinsight.com/uploads/A_Gentle_Introduction_to_Ba
A Multi-Core Benchmark Used to Improve Algorithmic
Differentiation
http://www.seanet.com/~bradbell/cppad_thread.htm
A Multi-Core Algorithmic Differentiation Benchmarking System —
Brad Bell
https://vimeo.com/39008544

71
Resources II
A Step by Step Backpropagation Example
http://mattmazur.com/2015/03/17/a-step-by-step-
backpropagation-example/
Adjoint Methods in Computational Finance
http://www.hpcfinance.eu/sites/www.hpcfinance.eu/files/Uwe%20Nau
http://www.nag.com/Market/seminars/Uwe_AD_Slides_July13.pdf
Adjoints and Automatic (Algorithmic) Differentiation in
Computational Finance
http://arxiv.org/abs/1107.1831v1
Algorithmic Differentiation in More Depth
http://www.nag.com/pss/ad-in-more-depth

72
Resources III
Algorithmic Differentiation of a GPU Accelerated Application
http://www.nag.co.uk/Market/events/jdt-hpc-new-thinking-in-
finance-presentation.pdf
Automatic Differentiation and QuantLib
https://quantlib.wordpress.com/tag/automatic-differentiation/
Automatic differentiation in deep learning by Shawn Tan
https://cdn.rawgit.com/shawntan/presentations/master/Deep
Learning-Copy1.slides.html
Calculus on Computational Graphs: Backpropagation
https://colah.github.io/posts/2015-08-Backprop/

73
Resources IV
Computing derivatives for nonlinear optimization: Forward mode
automatic differentiation
http://nbviewer.ipython.org/github/joehuchette/OR-software-
tools-2015/blob/master/6-nonlinear-opt/Nonlinear-
DualNumbers.ipynb
Introduction to Automatic Differentiation
http://alexey.radul.name/ideas/2013/introduction-to-automatic-
differentiation/
Jarrett Revels: Automatic differentiation
https://www.youtube.com/watch?v=PrXUl0sanro

74
Resources V

Neural Networks (Part I) – Understanding the Mathematics behind


backpropagation
https://biasvariance.wordpress.com/2015/08/18/neural-
networks-understanding-the-math-behind-backpropagation-
part-i/

75
Floating Point Numbers

• http://www.johndcook.com/blog/2009/04/06/anatomy-of-a-
floating-point-number/

Everything by Bruce Dawson:

• https://randomascii.wordpress.com/category/floating-point/

In particular:

• https://randomascii.wordpress.com/2012/02/25/
comparing-floating-point-numbers-2012-edition/
• https://randomascii.wordpress.com/2012/04/05/
floating-point-complexities/
• https://randomascii.wordpress.com/2013/01/03/
top-eight-entertaining-blog-facts-for-2012/
76
Software I

https://en.wikipedia.org/wiki/Automatic_differentiation#Software
ADNumber, Adept, ADOL-C, CppAD, Eigen (Auto Diff module)
CasADi
https://github.com/casadi/casadi/wiki
CasADi is a symbolic framework for algorithmic (a.k.a. automatic)
differentiation and numeric optimization.
CppAD
https://github.com/coin-or/CppAD/

77
Software II

Dali
https://github.com/JonathanRaiman/Dali
An automatic differentiation library that uses reverse-mode
differentation (backpropagation) to differentiate recurrent neural
networks, or most mathematical expressions through control flow,
while loops, recursion.
Open Porous Media Automatic Differentiation Library
https://github.com/OPM/opm-autodiff
Utilities for automatic differentiation and simulators based on AD.
The Stan Math Library (stan::math: includes a C++
reverse-mode automatic differentiation library)
https://github.com/stan-dev/math

78
References
References I

Bollerslev, Tim, and Robert Engle. 1993. “Common Persistence in Conditional Variances.” Econometrica 61 (1): 167–86. http://www.jstor.org/stable/2951782.

Giles, Michael, and Paul Glasserman. 2006. “Smoking Adjoints: Fast Monte Carlo Greeks.” Risk 19: 88–92.

Griewank, Andreas. 2012. “Who Invented the Reverse Mode of Differentiation?” Optimization Stories, Documenta Mathematica, Extra Volume ISMP (2012): 389–400.

Griewank, Andreas, and Andrea Walther. 2008. Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation. 2nd ed. Other Titles in Applied Mathematics 105. Philadelphia, PA: SIAM. http://www.ec-securehost.com/SIAM/OT105.html.

80

References II

Homescu, C. 2011. “Adjoints and Automatic (Algorithmic) Differentiation in Computational Finance.” ArXiv E-Prints, July. http://arxiv.org/abs/1107.1831.

Naumann, Uwe. 2012. The Art of Differentiating Computer Programs: An Introduction to Algorithmic Differentiation. Software, Environments, and Tools 24. Philadelphia, PA: SIAM. http://www.ec-securehost.com/SIAM/SE24.html.

Nocedal, Jorge, and Stephen J. Wright. 2006. Numerical Optimization. 2nd ed. New York, NY: Springer.

81
Thank You!
Questions?

82
