Algorithmic Differentiation

C++ & Extremum Estimation

Matt P. Dziubinski
CppCon 2015
[email protected] // @matt_dz
Department of Mathematical Sciences, Aalborg University
CREATES (Center for Research in Econometric Analysis of Time Series)
Outline

Why?

Numerical Optimization

Calculating Derivatives

Example Source Code

Resources

References

Slides

• Latest version: https://speakerdeck.com/mattpd

Why?
Motivation: Data & Models

• Have: data, y (numbers)
• Want: understand, explain, forecast
• Tool: model, f(y; θ)
• Need: connection, θ
• Idea: fitting the model to the data
• How: cost function — minimize / maximize
• Nowadays one of the most widely applied approaches —
estimation (econometrics, statistics), calibration (finance),
training ((supervised) machine learning)
• In practice: generally no closed-form expressions —
estimation relies on numerical optimization
Motivation: Parametric Models

• Parametric model: parameter vector θ ∈ Θ ⊆ R^p,
Θ — parameter space
• An estimator θ̂ is an extremum estimator if ∃Q : Θ → R s.t.

θ̂ = argmax_{θ ∈ Θ} Q(θ)

• can be obtained as an extremum (maximum or minimum) of a
scalar objective function
• class includes OLS (Ordinary Least Squares), NLS (Nonlinear Least Squares),
GMM (Generalized Method of Moments), and QMLE (Quasi-Maximum
Likelihood Estimation)
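As a concrete instance (my illustration, not from the slides): ordinary least squares fits the extremum-estimator template with Q(θ) equal to the negative sum of squared residuals of a linear model y ≈ θ·x — a case where, unusually, the argmax also has a closed form we can check against.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Q(theta): negative sum of squared residuals for the linear model y = theta * x.
// Maximizing Q over theta is the OLS problem cast as extremum estimation.
double Q(double theta, const std::vector<double>& x, const std::vector<double>& y) {
    double ssr = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        const double r = y[i] - theta * x[i];
        ssr += r * r;
    }
    return -ssr;
}

// Closed-form OLS solution (exists for this toy case, unlike in general):
// theta_hat = sum(x_i * y_i) / sum(x_i * x_i)
double ols(const std::vector<double>& x, const std::vector<double>& y) {
    double xy = 0.0, xx = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        xy += x[i] * y[i];
        xx += x[i] * x[i];
    }
    return xy / xx;
}
```

For data generated exactly by y = 2x, the closed-form θ̂ is 2 and Q attains its maximum there.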
Numerical Optimization
Numerical Optimization: Algorithms

• Numerical optimization algorithms can be broadly divided into two
categories:
• derivative-free — do not rely on knowledge of the objective
function’s gradient in the optimization process
• gradient-based — need the gradient of the objective function
• iteration form:
initial iterate: θ0
new iterate: θk = θk−1 + sk, iteration k ≥ 1
step: sk = αk pk
length: scalar αk > 0
direction: vector pk s.t. Hk pk = ∇Q(θk)
• Hk:
Hk = I (Steepest Ascent) — many variants, including SGD (Stochastic
Gradient Descent)
Hk = ∇²Q(θk) (Newton)
Hk = a Hessian approximation satisfying the secant equation:
Hk+1 sk = ∇Qk+1 − ∇Qk (Quasi-Newton)
Numerical Optimization: Gradient

• Gradient-based algorithms often exhibit superior convergence
rates (superlinear or even quadratic)
• However: this property depends on the gradient being either
exact — or at minimum sufficiently accurate in the limit,
Nocedal and Wright (2006, Chapter 3)
• For many modern models, analytical gradient formulas are not
available in closed form — or are non-trivial to derive
• This often leads to the use of numerical approximations — in
particular, finite difference methods (FDMs), Nocedal and
Wright (2006, Chapter 8)
• Inference: covariance matrix estimators also involve the use of
derivative information.
Numerical Optimization: Algorithmic Differentiation

• Need a way to obtain the gradient — preferably: fast and accurate
• This is where Algorithmic Differentiation (AD) comes in:
• fast
• and accurate
• and simple
• Essentially:
• an automatic computational differentiation method
• based on the systematic application of the chain rule
• exact to the maximum extent allowed by the machine precision
• How does it compare to other computational differentiation
methods?
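One way to see the "systematic application of the chain rule" concretely is forward-mode AD via dual numbers — a minimal sketch of mine (not the talk's implementation): every value carries its derivative, and every operation propagates both, exactly, to machine precision.

```cpp
#include <cassert>
#include <cmath>

// Dual number: value + derivative, propagated through every operation.
struct Dual {
    double val;  // f(x)
    double der;  // f'(x)
};

// Product rule and chain rule, applied once per primitive operation:
Dual operator*(Dual a, Dual b) { return {a.val * b.val, a.der * b.val + a.val * b.der}; }
Dual sin(Dual a) { return {std::sin(a.val),  std::cos(a.val) * a.der}; }
Dual cos(Dual a) { return {std::cos(a.val), -std::sin(a.val) * a.der}; }

// The function from the slides' later example: f(x) = cos(sin(x) * cos(x)).
// Written generically, it runs on double or Dual unchanged.
template <typename T>
T f(T x) {
    using std::sin;
    using std::cos;
    return cos(sin(x) * cos(x));
}

// Seed der = 1 to differentiate with respect to x.
double df(double x) { return f(Dual{x, 1.0}).der; }
```

The result agrees with the analytical derivative f′(x) = −cos(2x)·sin(sin(x)·cos(x)) to round-off, with no step-size h to tune.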
Calculating Derivatives
Calculating Derivatives

• Main approaches:
• Finite Differencing
• Algorithmic Differentiation (AD), a.k.a. Automatic Differentiation
• Symbolic Differentiation

Finite Differencing

• Recall f′(x) = lim_{h→0} [f(x + h) − f(x)] / h
• forward-difference approximation:

∂f/∂xi (x) ≈ [f(x + h ei) − f(x)] / h

• central-difference approximation:

∂f/∂xi (x) ≈ [f(x + h ei) − f(x − h ei)] / (2h)

• Computational cost:
• forward-difference formula: p + 1 function evaluations
• central-difference formula: 2p function evaluations
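The two univariate formulas are one line each — a sketch of mine, with f = exp chosen so the exact derivative is known for comparison:

```cpp
#include <cassert>
#include <cmath>

// Forward-difference approximation: (f(x + h) - f(x)) / h
template <typename F>
double forward_diff(F f, double x, double h) {
    return (f(x + h) - f(x)) / h;
}

// Central-difference approximation: (f(x + h) - f(x - h)) / (2h)
template <typename F>
double central_diff(F f, double x, double h) {
    return (f(x + h) - f(x - h)) / (2.0 * h);
}
```

For a gradient in R^p these become p + 1 (forward) or 2p (central) objective evaluations, one per coordinate direction ei.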
Finite Differencing — Truncation Error

• Truncation Error Analysis — Taylor’s theorem
• Consider a univariate function f and the forward-difference formula:

f′(x) ≈ [f(x + h) − f(x)] / h

• Recall Taylor’s formula:

f(x + h) = f(x) + h f′(x) + (1/2) h² f′′(x) + (1/6) h³ f′′′(x) + O(h⁴)

• Hence

f′(x) = [f(x + h) − f(x)] / h − (1/2) h f′′(x) − (1/6) h² f′′′(x) − O(h³)

• Truncation error: c1 h + c2 h² + . . .
• Decreases with smaller h
Floating Point Arithmetic

• Floating Point Arithmetic: NO associativity (or distributivity) of
addition (and multiplication)
• Example:
• Input:
a = 1234.567; b = 45.678; c = 0.0004
a + b + c
c + b + a
(a + b + c) - (c + b + a)
• Output:
• 1280.2454
• 1280.2454
• -2.273737e-13
• “Many a serious mathematician has attempted to give rigorous
analyses of a sequence of floating point operations, but has found
the task to be so formidable that he has tried to content himself with
plausibility arguments instead.” — Donald E. Knuth, TAOCP Vol. 2
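The same experiment in C++ (my sketch; `sum_ltr` is just a named left-to-right sum, mirroring how `a + b + c` associates in both that snippet and C++): reordering the operands changes which intermediate roundings occur, so with the slides' values the two sums should differ by roughly −2.3e-13 in binary64.

```cpp
#include <cassert>

// Left-to-right sum of three doubles: (x + y) + z — the same association
// the expression x + y + z has. Each '+' rounds its result to the nearest
// double, so the operand order determines which roundings happen.
double sum_ltr(double x, double y, double z) {
    return (x + y) + z;
}
```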
Floating Point Representation

• Floating point representation: inspired by scientific notation
• The IEEE Standard for Floating-Point Arithmetic (IEEE 754)
• Form: significant digits × base^exponent
• Machine epsilon, ϵM := inf{ϵ > 0 : 1.0 + ϵ ̸= 1.0}
• the difference between 1.0 and the next representable
floating-point value, 1.0 + ϵM
• IEEE 754 binary32 (single precision): 2^−23 ≈ 1.19209e-07 (1 sign
bit, 8 exponent bits, leaving 23 bits for the significand)
• IEEE 754 binary64 (double precision): 2^−52 ≈ 2.22045e-16 (1 sign
bit, 11 exponent bits, leaving 52 bits for the significand)
• floating point number representation: fl(x) = x(1 + δ), where
|δ| < ϵM
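Both epsilons are available directly from the standard library — a two-line check of the values above:

```cpp
#include <cassert>
#include <limits>

// Machine epsilon for the two IEEE 754 formats: the gap between 1.0 and
// the next representable value of each type.
const float  eps32 = std::numeric_limits<float>::epsilon();   // 2^-23
const double eps64 = std::numeric_limits<double>::epsilon();  // 2^-52
```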
Floating Point Representation Properties

• floating point number representation: fl(x) = x(1 + δ), where
|δ| < ϵM
• representation (round-off) error: x − fl(x)
• absolute round-off error: |x − fl(x)| < ϵM |x|
• relative round-off error: |x − fl(x)|/|x| < ϵM
• ϵM a.k.a. unit round-off, u = ULP(1.0)
• ULP — Unit in the Last Place
• the distance between the two closest straddling floating-point
numbers a and b (i.e., those with a ≤ x ≤ b and a ̸= b)
• a measure of precision in numeric calculations. In base b, if x has
exponent E, then ULP(x) = ϵM · b^E.
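ULP(x) can be computed with the standard `std::nextafter` — a sketch checking both u = ULP(1.0) = ϵM and the scaling ULP(x) = ϵM · b^E (here b = 2, so ULP doubles when the exponent increases by one):

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// ULP(x): distance from x to the next representable double toward +infinity.
double ulp(double x) {
    return std::nextafter(x, std::numeric_limits<double>::infinity()) - x;
}
```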
Finite Differencing — Round-off Error I

• Consider

f′(x) ≈ [f(x + h) − f(x)] / h

• Suppose that x and x + h are exact
• Round-off errors when evaluating f:

[f(x + h)(1 + δ1) − f(x)(1 + δ2)] / h = [f(x + h) − f(x)] / h + [δ1 f(x + h) − δ2 f(x)] / h

• We have |δi| ≤ ϵM for i ∈ {1, 2}.
• =⇒ round-off error bound:

ϵM [|f(x + h)| + |f(x)|] / h

• =⇒ round-off error bound order: O(ϵM / h)
• round-off error — increases with smaller h
• error contribution inversely proportional to h, with proportionality
constant O(ϵM)
Finite Differencing — Round-off Error II

• ϵM ̸= 0 =⇒ ∃ĥ : x + ĥ = x
• i.e., x and x + h are not guaranteed to be exactly representable, either
• =⇒ f(x) = f(x + ĥ)
• =⇒ f(x + ĥ) − f(x) = 0
• =⇒ no correct digits in the result at (or below) step-size ĥ
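This failure mode is directly observable (my sketch): any step below roughly ϵM·|x|/2 is absorbed by rounding, so the difference quotient's numerator is identically zero.

```cpp
#include <cassert>

// True when the step h is so small that x + h rounds back to x exactly,
// making f(x + h) - f(x) == 0 for any f evaluated at those arguments.
bool step_vanishes(double x, double h) {
    return (x + h) == x;
}
```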
Finite Differencing — Truncation—Round-off Trade-off

• Total error: O(h) + O(ϵM / h) (forward-difference formula)
• Best accuracy when the contributions are approximately equal
• h ≈ ϵM / h
• h² ≈ ϵM
• h ≈ √ϵM
• binary64: 1.490116e-08
• total error’s minimum value ∈ O(√ϵM)
• binary64: 1.490116e-08
• Total error: O(h²) + O(ϵM / h) (central-difference formula)
• h ≈ ϵM^(1/3)
• binary64: 6.055454e-06
• total error’s minimum value ∈ O(ϵM^(2/3))
• binary64: 3.666853e-11
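The trade-off can be checked numerically — a sketch of mine using f = exp at x = 0 (so the exact derivative is 1): the forward-difference error near h ≈ √ϵM beats both a much larger h (truncation-dominated) and a much smaller h (round-off-dominated).

```cpp
#include <cassert>
#include <cmath>

// Absolute error of the forward-difference approximation to f'(x),
// using f = exp so the exact derivative exp(x) is known.
double fd_error(double x, double h) {
    const double approx = (std::exp(x + h) - std::exp(x)) / h;
    return std::abs(approx - std::exp(x));
}
```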
Finite Differencing — Central-Difference Formula

 • Subtracting the backward-difference approximation from the
   forward-difference approximation:
 • (central-difference)

   f′(x) = (f(x + h) − f(x − h)) / (2h) − O(h²)

 • Truncation error ∈ O(h²) (second-order approximation)
 • Round-off error ∈ O(ϵM/h)
 • Balanced trade-off:
   • h ≈ ϵM^(1/3)
   • total error’s minimum value ∈ O(ϵM^(2/3))

21
Finite Difference Approximation Error
Example: f(x) = cos(sin(x) ∗ cos(x)), f′(x) = −cos(2 ∗ x) ∗ sin(sin(x) ∗ cos(x))

 • Total error: O(h) + O(ϵM/h) (forward-difference formula)
   • h ≈ √ϵM — binary64: 1.490116e-08
   • total error’s minimum value ∈ O(√ϵM) — binary64: 1.490116e-08
 • Total error: O(h²) + O(ϵM/h) (central-difference formula)
   • h ≈ ϵM^(1/3) — binary64: 6.055454e-06
   • total error’s minimum value ∈ O(ϵM^(2/3)) — binary64: 3.666853e-11

22
Symbolic Differentiation

 • Algebraic specification for f manipulated by symbolic
   manipulation tools to produce new algebraic expressions for
   each component of the gradient
 • Commonly used symbolic manipulation tools can be found in
   the packages Mathematica (also Wolfram Alpha), Macsyma,
   Maple, or Mathics.
 • Disadvantages:
   • narrow applicability
   • expression swell

23
Numerical Objective Function & Algorithmic Differentiation I

 • Symbolic differentiation (SD) applicable to “paper-and-pencil”
   algebraic expressions
   • Input: function specified by an algebraic formula or closed-form
     expression
 • Objective function in numerical optimization — a computer
   program
 • How are computer programs different?
   • named intermediates
   • program branches
   • joint allocations / mutation / overwriting
   • iterative loops
 • Do we have to resort to finite differences?

24
Numerical Objective Function & Algorithmic Differentiation II

 • Algorithmic Differentiation (AD) offers another solution.
 • AD — a computational differentiation method with roots in the
   late 1960s, Griewank (2012)
   • Machine Learning — Neural Networks — Backpropagation: Bryson,
     Denham, Dreyfus (1963), Werbos (1974)
 • AD widely known and applied in natural sciences — Naumann
   (2012), and computational finance — Giles and Glasserman (2006),
   Homescu (2011)

25
Algorithmic Differentiation — Idea I

 • Idea:
   • computer code for evaluating the function can be broken down
   • into a composition of elementary arithmetic operations
   • to which the chain rule (one of the basic rules of calculus) can be
     applied
   • e.g., given an objective function Q(θ) = f(g(θ)), we have

     Q′ = (f ◦ g)′ = (f′ ◦ g) · g′

 • Relies on the systematic application of the chain rule by a
   computer program
   • operating at the source code level
   • performed automatically
   • with minimal or no user intervention

26
Algorithmic Differentiation — Idea II

 • Mathematical operations performed exactly
   • to the maximum extent allowed by the machine precision — i.e.,
     accurate to within round-off error
 • truncation error: none
 • round-off error: equivalent to using a source code analogously
   implementing the analytical derivatives.

27
Algorithmic Differentiation — Idea III

 • AD similar to symbolic differentiation
 • major difference:
   • SD restricted to symbolic manipulation of algebraic expressions
   • AD operates on (and integrates with) the existing source code of a
     computer program, Nocedal and Wright (2006, Section 8.2).
 • Note: less than required by SD, more than required by finite
   differences (which can operate on black box implementations)
 • AD Input: (source code of) a computer program for evaluating a vector
   function
 • AD Output: (source code of) a computer program implementing
   derivative(s) thereof

28
Algorithmic Differentiation — Implementations

 • Two implementation styles, Naumann (2012):
   • source code transformation (SCT)
     • e.g., ADIFOR — produce new code that calculates both function and
       derivative values
   • operator overloading
     • e.g., ADOL-C — keep track of the elementary computations during the
       function evaluation, produce the derivatives at given inputs
 • Two modes: forward mode and reverse mode
 • Example: Eigen’s Auto Diff module

29
Algorithmic Differentiation — Example

MLE

 • a more concrete illustration: example in the context of
   statistical estimation
 • evaluation and differentiation of the loglikelihood function of a
   normally distributed sample
 • we have T > 1 observations x1, x2, ..., xT, i.i.d. ∼ N(µ, σ²), with
   µ ∈ R and σ ∈ R+
 • our parameter vector is θ = (µ, σ²)
 • the likelihood function is
   L(θ; x) = f(x; θ) = (σ√(2π))^(−T) exp[−½ Σ_{t=1}^{T} (xt − µ)²/σ²]

30
Algorithmic Differentiation — Evaluation Trace

 • objective function: the loglikelihood function,
   ℓ(θ) = log L(θ; x) = Σ_{t=1}^{T} ℓt(θ)
 • a contribution given by
   ℓt(θ) = −½ log(2π) − ½ log(σ²) − ½ (xt − µ)²/σ².
 • goal: compute the value of ℓt corresponding to the arguments
   xt = 6, µ = 6, and σ² = 4.
 • A computer program may execute the sequence of operations
   represented by an evaluation trace, Griewank and Walther
   (2008)

31
Algorithmic Differentiation — Evaluation Trace

Variable   Operation         Expression               Value
v−1      = µ                                        = 6
v0       = σ²                                       = 4
v1       = ϕ1(v0)          = log(v0)                = 1.386294
v2       = ϕ2(v1)          = −½ log(2π) − ½ v1      = −1.612086
v3       = ϕ3(v−1)         = xt − v−1               = 0
v4       = ϕ4(v3)          = v3²                    = 0
v5       = ϕ5(v0, v4)      = v4/v0                  = 0
v6       = ϕ6(v5)          = −½ v5                  = 0
v7       = ϕ7(v2, v6)      = v2 + v6                = −1.612086

ℓt       = v7                                       = −1.612086

32
Algorithmic Differentiation — Evaluation Trace

 • evaluation of ℓt broken down
   • into a composition of intermediate subexpressions vi, with
     i ∈ {1, . . . , 7}
   • intermediate subexpressions: elementary operations and
     functions ϕi with known, predefined derivatives
   • note: similar to the static single assignment form (SSA) used in
     compiler design and implementation
 • We can now apply the chain rule to the intermediate
   operations vi and collect the intermediate derivative results in
   order to obtain the derivative of the final expression ℓt = v7
   • Recall: Two modes of propagating the intermediate information:
     the forward mode and the reverse mode
   • forward mode: Eigen’s Auto Diff module used here

33
Algorithmic Differentiation — Forward Mode

 • goal: the derivative of the output variable ℓt (dependent
   variable) with respect to the input variable µ (independent
   variable)
 • idea: compute the numerical value of the derivative of each of
   the intermediate subexpressions, keeping track of the resulting
   derivative values along the way.
   • for each variable vi: introduce another variable, v̇i = ∂vi/∂µ
   • then: apply the chain rule mechanically to each intermediate
     subexpression
   • assigning the resulting numerical value to each v̇i.

34
Algorithmic Differentiation — Forward Mode

 • in our case
   • start with v̇−1 = 1.0 (active differentiation variable µ),
   • v̇0 = 0.0 (passive: we are not differentiating with respect to σ² and
     thus treat it as a constant)
   • then, given that v1 = ϕ1(v0) = log(v0), we obtain
     v̇1 = (∂ϕ1(v0)/∂v0)v̇0 = v̇0/v0 = 0.0
   • analogously, v2 = ϕ2(v1) = −½ log(2π) − ½ v1, hence
     v̇2 = (∂ϕ2(v1)/∂v1)v̇1 = −½ v̇1 = 0.0
 • by continuing this procedure we ultimately obtain an
   augmented, forward-derived, evaluation trace

35
Algorithmic Differentiation — Augmented Evaluation Trace

Variable   Operation                          Expression     Value
v−1      = µ                                               = 6
v̇−1      = ∂v−1/∂v−1                                       = 1
v0       = σ²                                              = 4
v̇0       = ∂v0/∂v−1                                        = 0
v1       = ϕ1(v0)                         = log(v0)        = 1.386294
v̇1       = (∂ϕ1/∂v0)v̇0                    = v̇0/v0          = 0
...        ...                              ...              ...
v7       = ϕ7(v2, v6)                     = v2 + v6        = −1.612086
v̇7       = (∂ϕ7/∂v2)v̇2 + (∂ϕ7/∂v6)v̇6     = v̇2 + v̇6        = 0

ℓt       = v7                                              = −1.612086
ℓ̇t       = v̇7                                              = 0

36
Algorithmic Differentiation — Augmented Evaluation Trace

 • The basic forward mode of AD — called “forward” because the
   derivative values v̇i are carried along simultaneously with the
   values vi themselves
 • The problem addressed by the implementations:
   • “transform a program with a particular evaluation trace into an
     augmented program whose evaluation trace also contains exactly
     the same extra variables and additional lines as in the derived
     evaluation trace,” Griewank and Walther (2008).

37
AD — Reverse Mode

 • An alternative to the forward approach — the reverse or adjoint
   mode.
 • Griewank and Walther (2008): “Rather than choosing an input
   variable and calculating the sensitivity of every intermediate
   variable with respect to that input, we choose instead an
   output variable and calculate the sensitivity of that output with
   respect to each of the intermediate variables. As it turns out
   adjoint sensitivities must naturally be computed backward, i.e.,
   starting from the output variables.”
 • This talk: we shall not treat it in any further detail and refer
   anyone interested to Griewank and Walther (2008).
 • Worth noting: the choice between the modes depends on the
   computational cost.

38
AD — Forward Mode vs. Reverse Mode

 • In general, for f : Rn → Rm
   • rough CostAD,Forward(f) ∈ n O(Cost(f))
   • rough CostAD,Reverse(f) ∈ m O(Cost(f))
 • Which is better? Depends: e.g., gradient vs. Jacobian
 • Forward mode performance: can establish a worst-case
   performance bound for AD when used for gradient computation

39
Illustration: Simulation Study — Setup

The Model: GARCH(1, 1)

The spot asset price S follows the following process under the
physical probability measure P:

    log(St+1/St) ≡ rt+1 = µ + ϵt+1        (1)
    ϵt+1 = √(vt+1) wt+1                   (2)
    vt+1 = ω + a ϵt² + b vt               (3)
    w ∼ GWN(0, 1) under P                 (4)

40
Illustration: Simulation Study — Setup — Notation

• θ = (µ, ω, a, b)⊤
• r — logarithmic return of the spot price
• µ — conditional mean of r
• v — conditional variance of r
• non-negativity: ω, a, b > 0
• w — randomness source (innovation): standard Gaussian white noise
• mv = 1−a−b
ω
— the unconditional mean of v (unconditional variance of
r)
• pv = (a + b) — the mean-reversion rate or the persistence rate
• when pv < 1, weak stationarity, the conditional variance will revert
to the unconditional variance at a geometric rate of pv
• smaller pv — less persistent sensitivity of the volatility expectation
to past market shocks
• pv = 1 — integrated GARCH (IGARCH) process

41
Illustration: Simulation Study — Setup — Notation

• θ = (µ, ω, a, b)⊤
• r — logarithmic return of the spot price
• µ — conditional mean of r
• v — conditional variance of r
• non-negativity: ω, a, b > 0
• w — randomness source (innovation): standard Gaussian white noise
• mv = ω / (1 − a − b) — the unconditional mean of v (the unconditional variance of r)
• pv = a + b — the mean-reversion rate or the persistence rate
• when pv < 1 (weak stationarity), the conditional variance reverts to the unconditional variance at a geometric rate of pv
• smaller pv — less persistent sensitivity of the volatility expectation to past market shocks
• pv = 1 — integrated GARCH (IGARCH) process
• common finding: pv close to 1 in financial data, a ”near-integrated” process, Bollerslev and Engle (1993).

41
Illustration: Monte Carlo Simulation

• DGP (data-generating process): near-integrated scenario, θDGP:
  µ = 0.01 (risk-free rate), ω = 2e−6, a = 0.10, b = 0.85
• M = 10,000 (MC estimation experiments / replications count)
• N = 2,000 ... 16,000 (sample size)
• 2,000 observations / 252 business days — roughly 8 years of data
• S&P 500 from January 3rd, 1950 — over 16,000 observations

42
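For these DGP values the persistence and unconditional variance follow directly from the notation slide: pv = a + b = 0.95 (near-integrated), and mv = ω/(1 − pv) = 2e−6/0.05 = 4e−5. A quick sanity-check sketch (illustrative, not the talk's code):

```cpp
#include <cassert>
#include <cmath>

// DGP parameters from the slide above (illustrative check, not talk code).
constexpr double omega = 2e-6, a = 0.10, b = 0.85;
constexpr double p_v = a + b;               // persistence: 0.95, near-integrated
constexpr double m_v = omega / (1.0 - p_v); // unconditional variance: 4e-5
```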
Illustration: Reliability — AD, L-BFGS

43
Illustration: Reliability — FD, TNR

44
Illustration: Findings — Reliability

• Reliability of estimation:
  • using AD: successful convergence in the vast majority of Monte Carlo experiments
  • using FD: very high convergence failure ratio
• Successful-convergence ratios (fraction of MC experiments with successful convergence within a 0.01 neighborhood of the true DGP values) for the best algorithm-gradient pairs:
  • L-BFGS(AD): average = 0.811, median = 0.99999903,
  • L-BFGS(FD): average = 0.446, median = 0.02893254 (sic),
  • Truncated Newton with restarting (TNR) (FD): average = 0.708, median = 0.99856261.

45
Illustration: Findings

• Overall: when using the fastest successful algorithm-gradient pair for both AD and FD:
  • AD achieves 12.5 times higher accuracy (in terms of the error norm) with a 3× slowdown compared to L-BFGS(FD),
  • while achieving 4 times higher accuracy and a 1.1× speedup compared to TNR(FD).

46
Example Source Code
Example using Rcpp

Rcpp::sourceCpp('ExampleGaussianRcpp.cpp')

• Note: Rcpp not necessary for any of this:
  • Feel free to skip over any line with Rcpp::
  • Add #include <Eigen/Dense> and #include <unsupported/Eigen/AutoDiff> for plain vanilla Eigen

48
ExampleGaussianRcpp.cpp I

// [[Rcpp::plugins("cpp11")]]
#include <cstddef>

// [[Rcpp::depends(BH)]]
#include <boost/math/constants/constants.hpp>

// [[Rcpp::depends(RcppEigen)]]
#include <RcppEigen.h>

#include <cmath>

49
ExampleGaussianRcpp.cpp II

namespace model
{
// 2 parameters: mu, sigma^2
enum parameter : std::size_t { mu, s2 };
constexpr std::size_t parameters_count = 2;
}

50
ExampleGaussianRcpp.cpp III

// [[Rcpp::export]]
double l_t_cpp(double xt,
const Eigen::Map<Eigen::VectorXd> parameters)
{
const auto mu = parameters[model::parameter::mu];
const auto s2 = parameters[model::parameter::s2];
constexpr auto two_pi =
boost::math::constants::two_pi<double>();
using std::log;
using std::pow;
return -.5 * log(two_pi) - .5 * log(s2) -
.5 * pow(xt - mu, 2) / s2;
}

51
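As a quick check of the log-density formula above: at xt = µ with σ² = 1 it reduces to −½ log(2π) ≈ −0.918939. A standalone replica (a sketch without the Rcpp/Boost machinery; Boost's two_pi constant is simply 2π):

```cpp
#include <cassert>
#include <cmath>

// Standalone replica of l_t_cpp above, without Rcpp/Boost
// (Boost's two_pi constant is simply 2*pi).
double l_t(double xt, double mu, double s2)
{
    const double two_pi = 2.0 * std::acos(-1.0);
    return -.5 * std::log(two_pi) - .5 * std::log(s2)
           - .5 * std::pow(xt - mu, 2) / s2;
}
```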
AD Example using RcppEigen

Rcpp::sourceCpp('ExampleGaussianRcppEigen.cpp')

52
ExampleGaussianRcppEigen.cpp I

// [[Rcpp::plugins("cpp11")]]
#include <cstddef>

// [[Rcpp::depends(BH)]]
#include <boost/math/constants/constants.hpp>

// [[Rcpp::depends(RcppEigen)]]
#include <RcppEigen.h>
#include <unsupported/Eigen/AutoDiff>

#include <cmath>

53
ExampleGaussianRcppEigen.cpp II

namespace model
{
// 2 parameters: mu, sigma^2
enum parameter : std::size_t { mu, s2 };
constexpr std::size_t parameters_count = 2;

template <typename ScalarType>
using parameter_vector =
    Eigen::Matrix<ScalarType, parameters_count, 1>;
}

54
ExampleGaussianRcppEigen.cpp III
// note: data `xt` -- double-precision number(s), just as before
// only the parameters adjusted to `ScalarType`
template <typename VectorType>
typename VectorType::Scalar
l_t_cpp_AD(double xt,
const VectorType & parameters)
{
using Scalar = typename VectorType::Scalar;
const Scalar & mu = parameters[model::parameter::mu];
const Scalar & s2 = parameters[model::parameter::s2];
// note: `two_pi` is, as always, a double-precision constant
constexpr auto two_pi =
boost::math::constants::two_pi<double>();
using std::log; using std::pow;
return -.5 * log(two_pi) - .5 * log(s2) -
.5 * pow(xt - mu, 2) / s2;
}

55
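The forward-mode idea behind Eigen::AutoDiffScalar can be spelled out with a minimal hand-rolled dual number (a sketch for intuition, not the talk's implementation). For this log-density the analytic gradient is ∂l/∂µ = (x − µ)/σ² and ∂l/∂σ² = −½/σ² + ½(x − µ)²/σ⁴, which the dual arithmetic reproduces to machine precision:

```cpp
#include <cassert>
#include <cmath>

// Minimal dual number: value plus one derivative slot, illustrating the
// forward-mode idea behind Eigen::AutoDiffScalar (a sketch, not the talk's code).
struct Dual { double v, d; };
Dual operator-(Dual a) { return {-a.v, -a.d}; }
Dual operator-(Dual a, Dual b) { return {a.v - b.v, a.d - b.d}; }
Dual operator*(Dual a, Dual b) { return {a.v * b.v, a.d * b.v + a.v * b.d}; }
Dual operator/(Dual a, Dual b)
{ return {a.v / b.v, (a.d * b.v - a.v * b.d) / (b.v * b.v)}; }
Dual log(Dual a) { return {std::log(a.v), a.d / a.v}; }

// Same Gaussian log-density as l_t_cpp_AD, with Dual parameters:
// set mu.d = 1 (s2.d = 0) to get dl/dmu; set s2.d = 1 for dl/ds2.
Dual l_t_dual(double xt, Dual mu, Dual s2)
{
    const Dual half{0.5, 0.0};
    const Dual two_pi{2.0 * std::acos(-1.0), 0.0};
    const Dual diff = Dual{xt, 0.0} - mu;
    return -(half * log(two_pi)) - half * log(s2) - half * diff * diff / s2;
}
```

One active variable per pass, exactly like the `parameter_vector::Unit(p)` loop in the Eigen version that follows.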
ExampleGaussianRcppEigen.cpp IV

// wrapper over `l_t_cpp_AD`


// (simply dispatching to the existing implementation)
//
// (Rcpp-exportable explicit template instantiations
// would render this unnecessary)

// [[Rcpp::export]]
double l_t_value_cpp(double xt,
const Eigen::Map<Eigen::VectorXd> parameters)
{
return l_t_cpp_AD(xt, parameters);
}

56
ExampleGaussianRcppEigen.cpp V
// objective function together with its gradient
// `xt`: input (data)
// `parameters_input`: input (model parameters)
// `gradient_output`: output (computed gradient)
// returns: computed objective function value
// [[Rcpp::export]]
double l_t_value_gradient_cpp
(
double xt,
const Eigen::Map<Eigen::VectorXd> parameters_input,
Eigen::Map<Eigen::VectorXd> gradient_output
)
{
using parameter_vector = model::parameter_vector<double>;
using AD = Eigen::AutoDiffScalar<parameter_vector>;
using VectorAD = model::parameter_vector<AD>;

57
ExampleGaussianRcppEigen.cpp VI

parameter_vector parameters = parameters_input;


VectorAD parameters_AD = parameters.cast<AD>();

constexpr auto P = model::parameters_count;


for (std::size_t p = 0; p != P; ++p)
{
// forward mode:
// active differentiation variable `p`
// (one parameter at a time)
parameters_AD(p).derivatives() = parameter_vector::Unit(p);
}

const AD & loglikelihood_result = l_t_cpp_AD(xt, parameters_AD);


gradient_output = loglikelihood_result.derivatives();
return loglikelihood_result.value();
}

58
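A central finite difference over the same log-density gives a quick cross-check against the AD gradient (a sketch; the step-size trade-off it embodies is exactly what FD gets wrong on harder problems):

```cpp
#include <cassert>
#include <cmath>

// Plain replica of the Gaussian log-density (no Rcpp/Boost).
double l_t_plain(double xt, double mu, double s2)
{
    const double two_pi = 2.0 * std::acos(-1.0);
    return -.5 * std::log(two_pi) - .5 * std::log(s2)
           - .5 * (xt - mu) * (xt - mu) / s2;
}

// Central finite difference for dl/dmu: O(h^2) truncation error plus
// roundoff of order eps/h, the step-size trade-off AD avoids entirely.
double dl_dmu_fd(double xt, double mu, double s2, double h = 1e-6)
{
    return (l_t_plain(xt, mu + h, s2) - l_t_plain(xt, mu - h, s2)) / (2.0 * h);
}
```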
Data Parallel Objective Function I

// First: make the data parallelism (DP) explicit


// [[Rcpp::export]]
Eigen::VectorXd l_t_value_cppDP
(
const Eigen::Map<Eigen::VectorXd> xs,
const Eigen::Map<Eigen::VectorXd> parameters
)
{
const std::size_t sample_size = xs.size();
Eigen::VectorXd result(sample_size);
for (std::size_t t = 0; t != sample_size; ++t)
{
result[t] = l_t_cpp_AD(xs[t], parameters);
}
return result;
}

59
Data Parallel Objective Function II
// Second: Parallelize using OpenMP
// [[Rcpp::export]]
Eigen::VectorXd l_t_value_cppDP_OMP
(
const Eigen::Map<Eigen::VectorXd> xs,
const Eigen::Map<Eigen::VectorXd> parameters
)
{
const std::size_t sample_size = xs.size();
Eigen::VectorXd result(sample_size);
#pragma omp parallel for default(none) shared(result)
for (std::size_t t = 0; t < sample_size; ++t)
{
result[t] = l_t_cpp_AD(xs[t], parameters);
}
return result;
}

60
Data Parallel Objective Function Performance I

> require("microbenchmark")

> microbenchmark(
+ l_t(xs, fixed_parameters),
+ l_t_value_cppDP(xs, fixed_parameters),
+ l_t_value_cppDP_OMP(xs, fixed_parameters)
+ )
Unit: microseconds
expr median neval
l_t_value_cppDP(xs, fixed_parameters) 458.618 100
l_t_value_cppDP_OMP(xs, fixed_parameters) 213.526 100

Note: changing l_t_cpp_AD to l_t_cpp_AD_DP (analogously) can also help, but we’re not done yet...

61
Data Parallel Gradient I

// [[Rcpp::export]]
Eigen::VectorXd l_t_value_gradient_cppDP
(
const Eigen::Map<Eigen::VectorXd> xs,
const Eigen::Map<Eigen::VectorXd> parameters_input,
Eigen::Map<Eigen::MatrixXd> gradient_output
)
{
const std::size_t sample_size = xs.size();
Eigen::VectorXd result(sample_size);
for (std::size_t t = 0; t != sample_size; ++t)
{
using parameter_vector = model::parameter_vector<double>;
using AD = Eigen::AutoDiffScalar<parameter_vector>;
using VectorAD = model::parameter_vector<AD>;

62
Data Parallel Gradient II

parameter_vector parameters = parameters_input;


VectorAD parameters_AD = parameters.cast<AD>();

constexpr auto P = model::parameters_count;


for (std::size_t p = 0; p != P; ++p)
{
parameters_AD(p).derivatives() = parameter_vector::Unit(p);
}
const AD & loglikelihood_result = l_t_cpp_AD(xs[t],
parameters_AD);
gradient_output.row(t) = loglikelihood_result.derivatives();
result[t] = loglikelihood_result.value();
}
return result;
}

63
Data Parallel Gradient III
// [[Rcpp::export]]
Eigen::VectorXd l_t_value_gradient_cppDP_OMP
(
const Eigen::Map<Eigen::VectorXd> xs,
const Eigen::Map<Eigen::VectorXd> parameters_input,
Eigen::Map<Eigen::MatrixXd> gradient_output
)
{
const std::size_t sample_size = xs.size();
Eigen::VectorXd result(sample_size);
#pragma omp parallel for default(none) \
shared(result, gradient_output)
for (std::size_t t = 0; t < sample_size; ++t)
{
using parameter_vector = model::parameter_vector<double>;
using AD = Eigen::AutoDiffScalar<parameter_vector>;
using VectorAD = model::parameter_vector<AD>;

64
Data Parallel Gradient IV

parameter_vector parameters = parameters_input;


VectorAD parameters_AD = parameters.cast<AD>();

constexpr auto P = model::parameters_count;


for (std::size_t p = 0; p != P; ++p)
{
parameters_AD(p).derivatives() = parameter_vector::Unit(p);
}
const AD & loglikelihood_result = l_t_cpp_AD(xs[t],
parameters_AD);
gradient_output.row(t) = loglikelihood_result.derivatives();
result[t] = loglikelihood_result.value();
}
return result;
}

65
Data Parallel Gradient Performance I

microbenchmark(
l_t_value_gradient_cppDP(xs, fixed_parameters, gradient),
l_t_value_gradient_cppDP_OMP(xs, fixed_parameters, gradient)
)
Unit: microseconds
expr median neval
l_t_value_gradient_cppDP 631.3375 100
l_t_value_gradient_cppDP_OMP 258.6945 100

66
Data Parallel Gradient Performance — Conclusions

Worth noting:

• speed-up over the naive serial version thanks to OpenMP
• we can do better: a truly data-parallel (non-naive) objective function implementation l_t_cpp_AD_DP, with reverse-mode AD
• left as an exercise for the audience! :-)

67
Caveats

• Source code
  • Recall: AD requires access to the source code
  • We don’t always have it: consider closed-source software (proprietary, third-party)
  • Another challenge: source code spanning multiple programming languages
• Reverse mode — memory requirements
• Perverse (but valid) code:
  if (x == 1.0) then y = 0.0 else y = x − 1.0
  • derivative at x = 1?

68
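The branching pitfall in the last bullet is easy to reproduce with a bare-bones dual number (a sketch, not from the talk): the function equals y = x − 1 for every x, so dy/dx = 1 everywhere, but AD differentiates only the branch actually taken and reports 0 at x = 1.

```cpp
#include <cassert>

// Forward-mode dual (value, derivative) -- just enough to show the pitfall.
struct Dual { double v, d; };

// The "perverse" function: mathematically y == x - 1 for every x,
// so dy/dx == 1 everywhere -- but AD only sees the branch taken.
Dual f(Dual x)
{
    if (x.v == 1.0) return Dual{0.0, 0.0}; // constant branch: derivative 0
    return Dual{x.v - 1.0, x.d};           // d(x - 1)/dx = 1
}
```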
Resources
General

http://www.autodiff.org/
http://en.wikipedia.org/wiki/Automatic_differentiation

70
Resources I

A Gentle Introduction to Backpropagation


http://numericinsight.blogspot.com/2014/07/a-gentle-
introduction-to-backpropagation.html
http://www.numericinsight.com/uploads/A_Gentle_Introduction_to_Ba
A Multi-Core Benchmark Used to Improve Algorithmic
Differentiation
http://www.seanet.com/~bradbell/cppad_thread.htm
A Multi-Core Algorithmic Differentiation Benchmarking System —
Brad Bell
https://vimeo.com/39008544

71
Resources II
A Step by Step Backpropagation Example
http://mattmazur.com/2015/03/17/a-step-by-step-
backpropagation-example/
Adjoint Methods in Computational Finance
http://www.hpcfinance.eu/sites/www.hpcfinance.eu/files/Uwe%20Nau
http://www.nag.com/Market/seminars/Uwe_AD_Slides_July13.pdf
Adjoints and Automatic (Algorithmic) Differentiation in
Computational Finance
http://arxiv.org/abs/1107.1831v1
Algorithmic Differentiation in More Depth
http://www.nag.com/pss/ad-in-more-depth

72
Resources III
Algorithmic Differentiation of a GPU Accelerated Application
http://www.nag.co.uk/Market/events/jdt-hpc-new-thinking-in-
finance-presentation.pdf
Automatic Differentiation and QuantLib
https://quantlib.wordpress.com/tag/automatic-differentiation/
Automatic differentiation in deep learning by Shawn Tan
https://cdn.rawgit.com/shawntan/presentations/master/Deep
Learning-Copy1.slides.html
Calculus on Computational Graphs: Backpropagation
https://colah.github.io/posts/2015-08-Backprop/

73
Resources IV
Computing derivatives for nonlinear optimization: Forward mode
automatic differentiation
http://nbviewer.ipython.org/github/joehuchette/OR-software-
tools-2015/blob/master/6-nonlinear-opt/Nonlinear-
DualNumbers.ipynb
Introduction to Automatic Differentiation
http://alexey.radul.name/ideas/2013/introduction-to-automatic-
differentiation/
Jarrett Revels: Automatic differentiation
https://www.youtube.com/watch?v=PrXUl0sanro

74
Resources V

Neural Networks (Part I) – Understanding the Mathematics behind


backpropagation
https://biasvariance.wordpress.com/2015/08/18/neural-
networks-understanding-the-math-behind-backpropagation-
part-i/

75
Floating Point Numbers

• http://www.johndcook.com/blog/2009/04/06/anatomy-of-a-
floating-point-number/

Everything by Bruce Dawson:

• https://randomascii.wordpress.com/category/floating-point/

In particular:

• https://randomascii.wordpress.com/2012/02/25/
comparing-floating-point-numbers-2012-edition/
• https://randomascii.wordpress.com/2012/04/05/
floating-point-complexities/
• https://randomascii.wordpress.com/2013/01/03/
top-eight-entertaining-blog-facts-for-2012/
76
Software I

https://en.wikipedia.org/wiki/Automatic_differentiation#Software
ADNumber, Adept, ADOL-C, CppAD, Eigen (Auto Diff module)
CasADi
https://github.com/casadi/casadi/wiki
CasADi is a symbolic framework for algorithmic (a.k.a. automatic)
differentiation and numeric optimization.
CppAD
https://github.com/coin-or/CppAD/

77
Software II

Dali
https://github.com/JonathanRaiman/Dali
An automatic differentiation library that uses reverse-mode
differentation (backpropagation) to differentiate recurrent neural
networks, or most mathematical expressions through control flow,
while loops, recursion.
Open Porous Media Automatic Differentiation Library
https://github.com/OPM/opm-autodiff
Utilities for automatic differentiation and simulators based on AD.
The Stan Math Library (stan::math: includes a C++
reverse-mode automatic differentiation library)
https://github.com/stan-dev/math

78
References
References I

Bollerslev, Tim, and Robert Engle. 1993. “Common Persistence in Conditional Variances.” Econometrica 61 (1): 167–86. http://www.jstor.org/stable/2951782.

Giles, Michael, and Paul Glasserman. 2006. “Smoking Adjoints: Fast Monte Carlo Greeks.” Risk 19: 88–92.

Griewank, Andreas. 2012. “Who Invented the Reverse Mode of Differentiation?” Optimization Stories, Documenta Mathematica, Extra Volume ISMP (2012): 389–400.

Griewank, Andreas, and Andrea Walther. 2008. Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation. 2nd ed. Other Titles in Applied Mathematics 105. Philadelphia, PA: SIAM. http://www.ec-securehost.com/SIAM/OT105.html.

80

References II

Homescu, C. 2011. “Adjoints and Automatic (Algorithmic) Differentiation in Computational Finance.” ArXiv E-Prints, July. http://arxiv.org/abs/1107.1831.

Naumann, Uwe. 2012. The Art of Differentiating Computer Programs: An Introduction to Algorithmic Differentiation. Software, Environments, and Tools 24. Philadelphia, PA: SIAM. http://www.ec-securehost.com/SIAM/SE24.html.

Nocedal, Jorge, and Stephen J. Wright. 2006. Numerical Optimization. 2nd ed. New York, NY: Springer.

81
Thank You!
Questions?

82
