Engineering Design Optimization

Joaquim R. R. A. Martins, University of Michigan
Andrew Ning, Brigham Young University

First electronic edition: January 2020.
Contents

Preface
Acknowledgements
1 Introduction
1.1 Design Optimization Process
1.2 Optimization Problem Formulation
1.3 Optimization Problem Classification
1.4 Optimization Algorithms
1.5 Selecting an Optimization Approach
1.6 Notation
1.7 Summary
Problems
3.10 Summary
Problems
Bibliography
Index
1 Introduction
Optimization is a human instinct. People constantly seek to improve
their lives and the systems that surround them. Optimization is intrinsic
in biology, as exemplified by the evolution of species. Birds optimize
their wings’ shape in real time, and dogs have been shown to find
optimal trajectories. Even more broadly, many laws of physics relate to
optimization, such as the principle of minimum energy. As Leonhard
Euler once wrote, “nothing at all takes place in the universe in which
some rule of maximum or minimum does not appear.”
The term optimization is often used to mean “improvement”, but
mathematically, it is a much more precise concept: finding the best
possible solution by changing variables that can be controlled, often
subject to constraints. Optimization has a broad appeal because it is
applicable in all domains and because of the human desire to make
things better. Any problem where a decision needs to be made can be
cast as an optimization problem.
Although some simple optimization problems can be solved analytically, most practical problems of interest are too complex to be
solved this way. The advent of numerical computing, together with
the development of optimization algorithms, has enabled us to solve
problems of increasing complexity.
[Fig. 1.2: Flowcharts of the conventional design process (manual iteration: change the design manually until the specifications are met) and the design optimization process (formulate the optimization problem; the optimizer updates the design variables until optimality is achieved).]
When optimization is used, the decision to finalize the design is made only when the current design satisfies the optimality conditions that ensure that no other design "close by" is better. The design changes are made automatically by the optimization algorithm and do not require intervention from the designer.
This automated process does not usually provide a “push-button”
solution; it requires human intervention and expertise (often more
expertise than in the traditional process). Human decisions are still
needed in the design optimization process. Before running an op-
timization, in addition to determining the specifications and initial
design, engineers need to formulate the design problem. This requires
expertise in both the subject area and numerical optimization. The
designer must decide what the objective is, which parameters can be
changed, and which constraints must be enforced. These decisions
have profound effects on the outcome, so it is crucial that the designer
We illustrate several advantages of design optimization in Fig. 1.3, which shows the notional variations of system performance, cost, and uncertainty as a function of time in design. When using optimization, the system performance increases more rapidly compared with the conventional process, achieving a better end result in a shorter total time. As a result, the cumulative cost of the design process is lower.

[Fig. 1.3: Notional system performance and cumulative cost over design time for the conventional design process versus design optimization.]
The design variables are represented by the vector

$$x = \left[x_1, x_2, \ldots, x_{n_x}\right] . \qquad (1.1)$$

Design variables are often bounded:

$$\underline{x}_i \le x_i \le \overline{x}_i , \quad i = 1, \ldots, n_x , \qquad (1.2)$$

where $\underline{x}$ and $\overline{x}$ are lower and upper bounds on the design variables, respectively. These are also known as bound constraints or side constraints. Some design variables may be unbounded or bounded on only one side.
When all the design variables are continuous, the optimization problem is said to be continuous.†

† This is not to be confused with the continuity of the objective and constraint functions, which we discuss in Section 1.3.
Consider a wing design problem where the wing planform shape is rectangular. The planform could be parametrized by the span (𝑏) and the chord (𝑐), as shown in Fig. 1.5, so that 𝑥 = [𝑏, 𝑐]. However, this choice is not unique. Two other variables are often used in aircraft design: wing area (𝑆) and wing aspect ratio (𝐴𝑅), as shown in Fig. 1.6.

[Fig. 1.5: Wingspan (𝑏) and chord (𝑐).]
[Fig. 1.6: Contours of constant 𝑏, 𝑐, 𝑆, and 𝐴𝑅 for the two different sets of design variables, 𝑥 = [𝑏, 𝑐] and 𝑥 = [𝑆, 𝐴𝑅].]
Because these variables are not independent (𝑆 = 𝑏𝑐 and 𝐴𝑅 = 𝑏²/𝑆), we cannot just add them to the set of design variables. Instead, we must pick any two variables out of the four to parametrize the design because we have four possible variables and two dependency relationships.
For this wing, the variables must be positive to be physically meaningful, so we must remember to explicitly bound them to be greater than zero in an optimization. The variables should be bounded from below by small positive values because numerical models are probably not prepared to handle zero values. No upper bound is needed unless the optimization algorithm requires it.
Shapes such as these can be represented more compactly with splines. This is a commonly used technique in optimization because reducing the number of design variables often speeds up an optimization with little if any loss in the model parameterization fidelity. Figure 1.7 shows an example spline describing the shape of a turbine blade. In this example, only four design variables are used to represent the curved shape.

[Fig. 1.7: A spline with four control points (𝑐₁, …, 𝑐₄) describing a turbine blade shape as a function of blade fraction.]
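To make this concrete, here is a minimal sketch of spline-based shape parameterization using SciPy. The four control-point values and their placement along the blade fraction are illustrative assumptions, not data from the book.

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Four design variables: spline control-point values (illustrative numbers).
control_y = np.array([1.2, 1.6, 1.4, 0.9])
# Control points placed along the normalized blade fraction [0, 1].
control_x = np.linspace(0.0, 1.0, len(control_y))

# The spline maps 4 design variables to a smooth, continuous shape.
shape = CubicSpline(control_x, control_y)

# Evaluate the shape at many points, e.g., for an analysis mesh.
blade_fraction = np.linspace(0.0, 1.0, 50)
chord_distribution = shape(blade_fraction)
print(chord_distribution[:5])
```

The optimizer then varies only the four control values, while the analysis sees a smooth shape evaluated at as many points as it needs.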
If the objective function does not capture the true intent of the designer, it does not matter how precisely the function and its optimum point are computed—the mathematical optimum will be non-optimal from the engineering point of view. A bad choice for the objective function is a common mistake in design optimization.

The choice of objective function is not always obvious. For example, minimizing the weight of a vehicle might sound like a good idea, but this might result in a vehicle that is too expensive to manufacture. In this case, manufacturing cost would probably be a better objective. However, there is a trade-off between manufacturing cost and the performance of the vehicle. It might not be obvious which of these objectives is the most appropriate one because this trade-off depends on customer preferences. This issue motivates multiobjective optimization, which is the subject of Chapter 9. Multiobjective optimization does not yield a single design but rather a range of designs that settle for different trade-offs between the objectives.

[Fig. 1.8: A maximization problem can be transformed into an equivalent minimization problem by minimizing −𝑓(𝑥).]
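The transformation in Fig. 1.8 is trivial to apply in code; a minimal sketch (the quadratic objective here is a made-up placeholder, chosen only so the answer is easy to verify):

```python
from scipy.optimize import minimize_scalar

def f(x):
    # Hypothetical objective we want to MAXIMIZE (placeholder function).
    return -(x - 2.0) ** 2 + 3.0

# Maximize f by minimizing its negative.
res = minimize_scalar(lambda x: -f(x))
print(res.x, f(res.x))  # optimum near x = 2, f = 3
```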
Experimenting with different objectives should be part of the design
exploration process (this is represented by the outer loop in the design
optimization process in Fig. 1.2). Results from optimizing the “wrong”
objective can still yield insights into the design trade-offs and trends
for the system at hand.
In Ex. 1.1, we have the luxury of being able to visualize the design space because we have only two variables. For more than three variables, it becomes impossible to visualize the design space. We can also visualize the objective function for two variables, as shown in Fig. 1.9. In this figure, we plot the function values using the vertical axis, which results in a three-dimensional surface. Although plotting the surface might provide intuition about the function, it is not possible to locate the points accurately when drawing on a two-dimensional surface. Another possibility is to plot the contours of the function, which are lines of constant value, as shown in Fig. 1.10. We prefer this type of plot.

[Fig. 1.9: A function of two variables (𝑓 = 𝑥₁² + 𝑥₂² in this case) can be visualized by plotting a three-dimensional surface or contour plot.]
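A minimal sketch of both visualizations for the function in Fig. 1.9, using Matplotlib (the grid bounds are arbitrary choices):

```python
import numpy as np
import matplotlib.pyplot as plt

x1, x2 = np.meshgrid(np.linspace(-2, 2, 100), np.linspace(-2, 2, 100))
f = x1**2 + x2**2  # function from Fig. 1.9

fig = plt.figure(figsize=(8, 4))
ax3d = fig.add_subplot(1, 2, 1, projection="3d")
ax3d.plot_surface(x1, x2, f)         # 3-D surface view
ax2d = fig.add_subplot(1, 2, 2)
ax2d.contour(x1, x2, f, levels=15)   # contour (isoline) view
ax2d.set_aspect("equal")
plt.show()
```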
In three dimensions, a contour becomes a surface of constant value, called an isosurface.
Figure 1.11 shows the required power contours for the two choices of design variable sets discussed in Ex. 1.1. We can locate the minimum graphically (denoted by the dot). Although the two optimum solutions are the same, the shapes of the objective function contours are different. In this case, using the aspect ratio and wing area simplifies the relationship between the design variables and the objective by aligning the two main curvature trends with each design variable. Thus, the parameterization can change the effectiveness of the optimization.
[Fig. 1.11: Required power contours for two different choices of design variable sets. The optimal wing is the same for both cases, but the functional form of the objective is simplified in the one on the right.]
The optimal wing for this problem has an aspect ratio that is much higher
than that typically seen in airplanes or birds. Although the high aspect ratio
increases aerodynamic efficiency, it adversely affects the structural strength,
which we did not consider here. Thus, as in most engineering problems, we
need to add constraints and consider multiple disciplines.
1.2.3 Constraints
The vast majority of practical design optimization problems require the
enforcement of constraints. These are functions of the design variables
that we want to restrict in some way. Like the objective function,
constraints are computed through a model whose complexity can vary
widely. The feasible region is the set of points that satisfy all constraints.
We seek to minimize the objective function within this feasible design
space.
When we restrict a function to being equal to a fixed value, we call this an equality constraint, denoted by ℎ(𝑥) = 0. When the function is required to be less than or equal to a certain value, we have an inequality constraint, denoted by 𝑔(𝑥) ≤ 0.¶ Although we use the "less or equal" convention, some texts and software programs use "greater or equal" instead. There is no loss of generality with either convention because we can always multiply the constraint by −1 to convert between the two.

¶ A strict inequality, 𝑔(𝑥) < 0, is never used because then 𝑥 could be arbitrarily close to the equality. Because the optimum is at 𝑔 = 0 for an active constraint, the exact solution would then be ill-defined from a mathematical perspective. Also, the difference is not meaningful when using finite-precision arithmetic (which is always the case when using a computer).
Tip 1.2 Check the inequality convention
Some texts and papers omit the equality constraints without loss
of generality because an equality constraint can be replaced by two
inequality constraints. More specifically, an equality constraint, ℎ(𝑥) =
0, is equivalent to enforcing two inequality constraints, ℎ(𝑥) ≥ 0 and
ℎ(𝑥) ≤ 0.
[Figure: 𝑓(𝑥) with an active inequality constraint, 𝑔₁(𝑥) ≤ 0 (left), and with an active equality constraint, ℎ₁(𝑥) = 0 (right).]
For problems with two design variables, we can plot the constraints in the design space as shown in Ex. 1.2 and obtain the solution graphically. In addition to the possibility of a large number of design variables and computationally expensive objective function evaluations, we now add the possibility of a large number of constraints, which might also be expensive to evaluate. Again, this is further motivation for the optimization techniques covered in this book.

[Fig. 1.13: Minimum-power wing with a constraint on bending stress compared with the unconstrained case.]
Putting these elements together, the general optimization problem can be stated as

$$\begin{aligned}
\text{minimize} \quad & f(x) \\
\text{by varying} \quad & \underline{x}_i \le x_i \le \overline{x}_i , \quad i = 1, \ldots, n_x \\
\text{subject to} \quad & g_j(x) \le 0 , \quad j = 1, \ldots, n_g \\
& h_l(x) = 0 , \quad l = 1, \ldots, n_h . \qquad (1.4)
\end{aligned}$$
The optimizer starts with an initial design 𝑥₀ and then queries the analysis for a sequence of designs until it finds the optimum design, 𝑥*.

[Fig. 1.14: The analysis computes the objective (𝑓) and constraint values (𝑔, ℎ) for a given set of design variables (𝑥).]
Tip 1.3 Using an optimization software package

The setup of an optimization problem varies depending on the particular software package, so read the documentation carefully. Most optimization software requires you to define the objective and constraints as callback functions. These are passed to the optimizer, which calls them back as needed during the optimization process. The functions take the design variable values as inputs and output the function values, as shown in Fig. 1.14. Study the software documentation for the details on how to use it.∗∗ To make sure you understand how to use a given optimization package, test it on simple problems for which you know the solution first (see Prob. 1.4).

∗∗ Optimization software resources include the optimization toolboxes in MATLAB, scipy.optimize.minimize in Python, Optim.jl or Ipopt.jl in Julia, NLopt for multiple languages, and the Solver add-in in Microsoft Excel. The pyOptSparse framework provides a common Python wrapper for many existing optimization codes and facilitates the testing of different methods.1 SNOW.jl wraps a few optimizers and multiple derivative computation methods in Julia.

When the optimizer queries the analysis for a given 𝑥, for most methods, the constraints do not have to be feasible. The optimizer is responsible for changing 𝑥 so that the constraints are satisfied. The objective and constraint functions must depend on the design variables; if a function does not depend on any of the design variables, it cannot be affected by the optimization.

1. Wu et al., pyOptSparse: A Python framework for large-scale constrained nonlinear optimization of sparse systems, 2020.
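As an illustration of the callback-function pattern, here is a minimal sketch using scipy.optimize.minimize on a small made-up problem (the objective and constraint are placeholders, chosen only so the solution is easy to verify):

```python
import numpy as np
from scipy.optimize import minimize

def objective(x):
    # Placeholder objective: squared distance from the point (2, 1).
    return (x[0] - 2.0) ** 2 + (x[1] - 1.0) ** 2

def constraint(x):
    # SciPy's convention is g(x) >= 0 for inequality constraints,
    # the opposite sign of the g(x) <= 0 convention in this book.
    return 1.0 - x[0] ** 2 - x[1] ** 2  # stay inside the unit circle

x0 = np.array([0.5, 0.5])  # initial design
res = minimize(objective, x0,
               constraints=[{"type": "ineq", "fun": constraint}],
               bounds=[(0, None), (0, None)])
print(res.x)  # optimum lies on the circle, near [0.894, 0.447]
```

Note the sign flip in the constraint callback; this is exactly the kind of convention detail that Tip 1.3 warns you to check in the documentation.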
[Figure: Optimization problem classification. Design variables: continuous, discrete, or mixed. Objective: single or multiobjective. Constraints: constrained or unconstrained. Smoothness: continuous or discontinuous. Linearity: linear or nonlinear.]
1.3.1 Smoothness

The degree of function smoothness with respect to variations in the design variables depends on the continuity of the function values and their derivatives. When the value of the function varies continuously, the function is said to be 𝐶⁰ continuous. If the first derivatives also vary continuously, then the function is 𝐶¹ continuous, and so on. Figure 1.17 shows a discontinuous function, a 𝐶⁰ function, and a 𝐶¹ function. As we will see later, discontinuities in the function value or derivatives limit the type of optimization algorithm that can be used.

[Fig. 1.17: Discontinuous function (top), 𝐶⁰ continuous function (middle), and 𝐶¹ continuous function (bottom).]
1.3.2 Linearity

The functions of interest could be linear or nonlinear. When both the objective and constraint functions are linear, the optimization problem is known as a linear optimization problem. These problems are easier to solve than general nonlinear ones, and there are entire books and courses dedicated to the subject. The first numerical optimization algorithms were developed to solve linear optimization problems.

Proving that a function is unimodal would require evaluating the function at every point in the domain, which is computationally prohibitive. However, it is much easier to prove multimodality—all we need to do is find two distinct local minima.

[Fig. 1.19: Types of minima.]
[Figure: Optimization algorithm classification. Order of information: zeroth, first, or second. Search: local or global. Algorithm type: mathematical or heuristic. Function evaluation: direct or surrogate model.]
1.4.5 Stochasticity
This attribute is independent of the stochasticity of the model that
we mentioned previously, and it is strictly related to whether the
optimization algorithm itself contains steps that are determined at
random or not.
A deterministic optimization algorithm always evaluates the same
points and converges to the same result, given the same initial conditions.
In contrast, a stochastic optimization algorithm evaluates a different set
of points if run multiple times from the same initial conditions, even
if the models for the objective and constraints are deterministic. For
example, most evolutionary algorithms include steps determined by
generating random numbers. Gradient-based algorithms are usually
deterministic, but some exceptions exist, such as stochastic gradient
descent (see Section 10.5).
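A minimal sketch of this distinction using SciPy's differential evolution (an evolutionary algorithm); the one-dimensional multimodal objective is a made-up example:

```python
import numpy as np
from scipy.optimize import differential_evolution

def f(x):
    # Made-up multimodal objective (the model itself is deterministic).
    return x[0] ** 2 + 3.0 * np.sin(5.0 * x[0])

bounds = [(-3.0, 3.0)]

# A stochastic optimizer evaluates a different sequence of points
# each run, even though the objective is deterministic...
run1 = differential_evolution(f, bounds)
run2 = differential_evolution(f, bounds)

# ...unless the random number generator is seeded for reproducibility.
rep1 = differential_evolution(f, bounds, seed=0)
rep2 = differential_evolution(f, bounds, seed=0)
print(rep1.x, rep2.x)  # identical when seeded
```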
[Fig. 1.24: Decision tree for selecting an optimization algorithm. Convex problems: linear or quadratic optimization (Ch. 11). Discrete problems (Ch. 8): branch and bound if linear, dynamic programming if the problem has Markov-chain structure, otherwise bit-encoded SA or GA. Continuous differentiable problems: BFGS if unconstrained (Ch. 4), SQP or IP if constrained (Ch. 5), with multistart if multimodal (Ch. 6). Non-differentiable problems (Ch. 7): Nelder–Mead, or gradient-free methods such as DIRECT, GPS, GA, and PS if multimodal.]
The first node asks about convexity. Although it is often not immediately apparent if the problem is convex, with some experience, we can usually discern whether we should attempt to reformulate it as a convex problem. In most instances, convexity occurs for problems with simple objectives and constraints (e.g., linear or quadratic), such as in control applications where the optimization is performed repeatedly. A convex problem can be solved with general gradient-based or gradient-free algorithms, but it would be inefficient not to take advantage of the convex formulation structure if we can do so.

The next node asks about discrete variables. Problems with discrete design variables are generally much harder to solve, so we might consider alternatives that avoid using discrete variables when possible. For example, a wind turbine's position in a field could be posed as a discrete variable within a discrete set of options. Alternatively, we could represent the wind turbine's position with two continuous coordinate variables. That level of flexibility may or may not be desirable but generally leads to better solutions.
Exceptions arise from a desire not to deviate from the standard conventions used in each field. We explicitly note these exceptions as needed. For example, the objective function 𝑓 is a scalar function, and the Lagrange multipliers (𝜆 and 𝜎) are vectors.

A vector is shown in Fig. 1.25. For more compact notation, we may write a column vector horizontally, with its components separated by commas, for example, 𝑥 = [𝑥₁, 𝑥₂, …, 𝑥ₙ]. We refer to a vector with 𝑛 components as an 𝑛-vector, which is equivalent to writing 𝑥 ∈ ℝⁿ. An (𝑛 × 𝑚) matrix has 𝑛 rows and 𝑚 columns, which is equivalent to writing 𝐴 ∈ ℝⁿˣᵐ.

[Fig. 1.25: An 𝑛-vector, 𝑥.] [Fig. 1.26: An (𝑛 × 𝑚) matrix, 𝐴.]
Tip 1.5 Work out the dimensions of the vectors and matrices
As you read this book, we encourage you to work out the dimensions of
the vectors and matrices in the operations within each equation and verify the
dimensions of the result for consistency. This will enhance your understanding
of the equations.
1.7 Summary
Problems
$$x_1^2 + x_2^2 \le 1$$

$$x_1 - 3 x_2 + \tfrac{1}{2} \ge 0 ,$$

and bound constraints:

$$x_1 \ge 0 , \quad x_2 \ge 0 .$$

Plot the constraints and identify the feasible region. Find the constrained minimum graphically. Use optimization software to solve the constrained minimization problem. Which of the inequality constraints and bounds are active at the solution?
2 A Short History of Optimization
2.2 Optimization Revolution: Derivatives and Calculus

The scientific revolution generated significant optimization developments in the seventeenth and eighteenth centuries that intertwined with other developments in mathematics and physics.

[Fig. 2.2: The law of reflection can be derived by minimizing the length of the light beam.]
In the early seventeenth century, Johannes Kepler published a book in which he derived the optimal dimensions of a wine barrel.7 He became interested in this problem when he bought a barrel of wine, and the merchant charged him based on a diagonal length (see Fig. 2.3). This outraged Kepler because he realized that the amount of wine could vary for the same diagonal length, depending on the barrel proportions.

7. Kepler, Nova stereometria doliorum vinariorum (New Solid Geometry of Wine Barrels), 1615.
Incidentally, Kepler also formulated an optimization problem when
looking for his second wife, seeking to maximize the likelihood of satis-
faction. This “marriage problem” later became known as the “secretary
problem”, which is an optimal-stopping problem that has since been
solved using dynamic optimization (mentioned in Section 1.4.6 and
discussed in Section 8.5).8
Willebrord Snell discovered the law of refraction in 1621, a formula that describes the relationship between the angles of incidence and refraction when light passes through a boundary between two different media, such as air, glass, or water. Whereas Hero minimized the length to derive the law of reflection, Snell minimized time. These laws were generalized by Fermat in the principle of least time (or Fermat's principle), which states that a ray of light going from one point to another follows the path that takes the least time.

[Fig. 2.3: Wine barrels were measured by inserting a ruler in the tap hole until it hit the corner.]

8. Ferguson, Who solved the secretary problem? 1989.
Choosing the nearest point each time does not, in general, result in the shortest overall path. This is a combinatorial optimization problem that later became known as the traveling salesperson problem, one of the most intensively studied problems in optimization (Chapter 8).
In 1939, William Karush derived the necessary conditions for inequality constrained problems in his master's thesis.15 His approach generalized the method of Lagrange multipliers, which only allowed for equality constraints. Harold Kuhn and Albert Tucker independently rediscovered these conditions and published their seminal paper in 1951. These became known as the Karush–Kuhn–Tucker (KKT) conditions, which constitute the foundation of gradient-based constrained optimization algorithms (see Section 5.3).

15. Karush, Minima of functions of several variables with inequalities as side constraints, 1939.
Leonid Kantorovich developed a technique to solve linear programming problems in 1939 after having been given the task of optimizing production in the Soviet government's plywood industry. However, his contribution was neglected for ideological reasons. In the United States, Tjalling Koopmans rediscovered linear programming in the early 1940s when working on ship-transportation problems. In 1947, George Dantzig published the first complete algorithm for solving linear programming problems—the simplex algorithm.16 In the same year, von Neumann developed the theory of duality for linear programming problems. Kantorovich and Koopmans later shared the 1975 Nobel Memorial Prize in Economic Sciences "for their contributions to the theory of optimum allocation of resources". Dantzig was not included, presumably because his work was more theoretical. The development of the simplex algorithm and the widespread practical applications of linear programming sparked a revolution in optimization. The first international conference on optimization, the International Symposium on Mathematical Programming, was held in Chicago in 1949.

16. Dantzig, Linear programming and extensions, 1998.
In 1951, George Box and Kenneth Wilson developed the response-surface methodology (surrogate modeling), which enables optimization of systems based on experimental data (as opposed to a physics-based model). They developed a method to build a quadratic model where the number of data points scales linearly with the number of inputs instead of exponentially, striking a balance between accuracy and ease of application. In the same year, Danie Krige developed a surrogate model based on a stochastic process, which is now known as the kriging model. He developed this model in his master's thesis to estimate the most likely distribution of gold based on a limited number of borehole samples.17 These approaches are foundational in surrogate-based optimization (Chapter 10).

17. Krige, A statistical approach to some mine valuation and allied problems on the Witwatersrand, 1951.
In 1952, Harry Markowitz published a paper on portfolio theory.

In the 1950s, Richard Bellman formulated dynamic programming and derived what became known as the Bellman equation (Section 8.5), which was first applied to engineering control theory and subsequently became a core principle in economic theory.
In 1959, William Davidon developed the first quasi-Newton method for solving nonlinear optimization problems, which relies on approximations of the curvature based on gradient information. He was motivated by his work at Argonne National Laboratory, where he used a coordinate-descent method to perform an optimization that kept crashing the computer before converging. Although Davidon's approach was a breakthrough in nonlinear optimization, his original paper was rejected. It was eventually published more than 30 years later in the first issue of the SIAM Journal on Optimization.20 Fortunately, his valuable insight had been recognized well before that by Roger Fletcher and Michael Powell, who further developed the method.21 The method became known as the Davidon–Fletcher–Powell (DFP) method (Section 4.4.4).

20. Davidon, Variable metric method for minimization, 1991.
21. Fletcher and Powell, A rapidly convergent descent method for minimization, 1963.
Another quasi-Newton approximation method was independently proposed in 1970 by Charles Broyden, Roger Fletcher, Donald Goldfarb, and David Shanno, now called the Broyden–Fletcher–Goldfarb–Shanno (BFGS) method. Larry Armijo, A. Goldstein, and Philip Wolfe developed the conditions for the line search that ensure convergence in gradient-based methods (see Section 4.3.2).22

22. Wolfe, Convergence conditions for ascent methods, 1969.
Leveraging the developments in unconstrained optimization, researchers sought methods for solving constrained problems. Penalty and barrier methods were developed but fell out of favor because of numerical issues (see Section 5.4). In another effort to solve nonlinear constrained problems, Robert Wilson proposed the sequential quadratic programming (SQP) method in his PhD thesis.23 SQP consists of applying the Newton method to solve the KKT conditions (see Section 5.5). Shih-Ping Han reinvented SQP in 1976,24 and Michael Powell popularized this method in a series of papers starting from 1977.25

23. Wilson, A simplicial algorithm for concave programming, 1963.
24. Han, Superlinearly convergent variable metric algorithms for general nonlinear programming problems, 1976.
25. Powell, Algorithms for nonlinear constraints that use Lagrangian functions, 1978.
Historically, many rules and conventions prevented women from having the same opportunities as men. The first known female mathematician, Hypatia, lived in Alexandria (Egypt) in the fourth century and was brutally murdered for political motives. In the eighteenth century, Sophie Germain corresponded with famous mathematicians under a male pseudonym to avoid gender bias. She could not get a university degree because she was female but was nevertheless a pioneer in elasticity theory. Ada Lovelace famously wrote the first computer program in the nineteenth century.64 In the late nineteenth century, Sofia Kovalevskaya became the first woman to obtain a doctorate in mathematics but had to be tutored privately because she was not allowed to attend lectures. Similarly, Emmy Noether, who made many fundamental contributions to abstract algebra in the early twentieth century, had to overcome rules that prevented women from enrolling in universities and being employed as faculty.65

64. Hollings et al., Ada Lovelace: The Making of a Computer Scientist, 2014.
65. Osen, Women in Mathematics, 1974.
Benjamin Banneker, a free African American who was a self-taught mathematician and astronomer, corresponded directly with Thomas Jefferson and successfully challenged the morality of the U.S. government's views on race and humanity.69 Historically black colleges and universities were established in the United States after the American Civil War.

68. Rothstein, The Color of Law: A Forgotten History of How Our Government Segregated America, 2017.
69. King, More than slaves: Black founders, Benjamin Banneker, and critical intellectual agency, 2014.
2.6 Summary

3 Numerical Models and Solvers
The structure's weight is assumed not to contribute to the loading. Finally, the displacements are assumed to be small relative to the dimensions of the truss members.
The structure is discretized by pinned bar elements. The discrete governing
equations for any truss structure can be derived using the finite-element method.
This leads to the linear system
𝐾𝑢 = 𝑞 ,
where 𝐾 is the stiffness matrix, 𝑞 is the vector of applied loads, and 𝑢 represents
the displacements that we want to compute. At each joint, there are two degrees
of freedom (horizontal and vertical) that describe the displacement and applied
force. Because there are 9 joints, each with 2 degrees of freedom, the size of
this linear system is 18.
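A minimal sketch of assembling and solving such a system with NumPy; the 3×3 "stiffness" matrix and load vector below are made-up stand-ins for the actual 18×18 truss system:

```python
import numpy as np

# Made-up symmetric positive-definite stiffness matrix and load vector,
# standing in for the 18x18 system of the truss example.
K = np.array([[ 4.0, -1.0,  0.0],
              [-1.0,  4.0, -2.0],
              [ 0.0, -2.0,  4.0]])
q = np.array([1.0, 0.0, 2.0])

# Solve K u = q for the displacements without forming K^{-1}.
u = np.linalg.solve(K, q)
print(u)
print(np.allclose(K @ u, q))  # residual check: r(u) = K u - q ≈ 0
```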
$$r_i(u_1, \ldots, u_n) = 0 , \quad i = 1, \ldots, n , \qquad (3.1)$$

where 𝑟 is a vector of residuals that has the same size as the vector of state variables 𝑢. The equations defining the residuals could be any expression that can be coded in a computer program. No matter how complex the mathematical model, it can always be written as a set of equations in this form, which we write more compactly as 𝑟(𝑢) = 0.

Finding the state variables that satisfy this set of equations requires a solver, as illustrated in Fig. 3.3. We review the various types of solvers in Section 3.6. Solving a set of implicit equations is more costly than computing explicit functions, and it is typically the most expensive step in the overall optimization process.

[Fig. 3.3: A solver finds the state variables 𝑢 such that 𝑟(𝑢) = 0.]
The linear system from Ex. 3.1 is an example of a system of implicit equations, which we can write as a set of residuals by moving the right-hand-side vector to the left to obtain

$$r(u) = K u - q = 0 ,$$

where 𝑢 represents the state variables. Although the solution for 𝑢 could be written as an explicit function, 𝑢 = 𝐾⁻¹𝑞, this is usually not done because it is computationally inefficient and intractable for large-scale systems. Instead, we use a linear solver that does not explicitly form the inverse of the stiffness matrix (see Appendix B).
In addition to computing the displacements, we might also want to compute the axial stress (𝜎) in each of the 15 truss members. This is an explicit function of the displacements, which is given by the linear relationship

$$\sigma = S u .$$
We can still use the residual notation to represent explicit functions to write all the functions in a model (implicit and explicit) as 𝑟(𝑢) = 0 without loss of generality. Suppose we have an implicit system of equations, 𝑟ᵣ(𝑢ᵣ) = 0, followed by a set of explicit functions whose output is a vector 𝑢_f = 𝑓(𝑢ᵣ), as illustrated in Fig. 3.4. We can rewrite the explicit function as a residual by moving all the terms to one side to get 𝑟_f(𝑢ᵣ, 𝑢_f) = 𝑓(𝑢ᵣ) − 𝑢_f = 0. Then, we can concatenate the residuals and variables for the implicit and explicit equations as

$$r(u) \equiv \begin{bmatrix} r_r(u_r) \\ f(u_r) - u_f \end{bmatrix} = 0 , \quad \text{where} \quad u \equiv \begin{bmatrix} u_r \\ u_f \end{bmatrix} . \qquad (3.2)$$

The solver arrangement would then be as shown in Fig. 3.5.

[Fig. 3.4: A model with implicit and explicit functions.] [Fig. 3.5: Explicit functions can be written in residual form and added to the solver.]

Even though it is more natural to just evaluate explicit functions instead of adding them to a solver, in some cases, it is helpful to use the residual to represent the entire model with the compact notation, 𝑟(𝑢) = 0. This will be helpful in later chapters when we compute derivatives (Chapter 6) and solve systems that mix multiple implicit and explicit sets of equations (Chapter 13).
Consider the following system of equations:

$$u_1^2 + 2 u_2 - 1 = 0$$
$$u_1 + \cos(u_1) - u_2 = 0$$
$$f(u_1, u_2) = u_1 + u_2 .$$
The first two equations are written in implicit form, and the third equation is given as an explicit function. The first equation could be manipulated to obtain an explicit function of either 𝑢₁ or 𝑢₂. The second equation does not have a closed-form solution and cannot be written as an explicit function for 𝑢₁. The third equation is an explicit function of 𝑢₁ and 𝑢₂. In this case, we could solve the first two equations for 𝑢₁ and 𝑢₂ using a nonlinear solver and then evaluate 𝑓(𝑢₁, 𝑢₂). Alternatively, we can write the whole system as implicit residual equations by defining the value of 𝑓(𝑢₁, 𝑢₂) as 𝑢₃,

$$r_1(u) = u_1^2 + 2 u_2 - 1 = 0$$
$$r_2(u) = u_1 + \cos(u_1) - u_2 = 0$$
$$r_3(u) = u_1 + u_2 - u_3 = 0 .$$

Then we can use the same nonlinear solver to solve all three equations simultaneously.
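A minimal sketch of solving this three-residual system with SciPy's general nonlinear solver (the initial guess is an arbitrary choice):

```python
import numpy as np
from scipy.optimize import root

def residuals(u):
    u1, u2, u3 = u
    return [u1**2 + 2*u2 - 1,       # r1
            u1 + np.cos(u1) - u2,   # r2
            u1 + u2 - u3]           # r3 defines u3 = f(u1, u2)

sol = root(residuals, x0=[0.5, 0.5, 1.0])  # arbitrary starting guess
print(sol.x, sol.success)
```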
[Fig. 3.6: Discretization methods in one spatial dimension: mesh points, cells, and elements.]
With any of these discretization methods, the final result is a set of algebraic equations that we can write as 𝑟(𝑢) = 0 and solve for the state variables 𝑢. This is a potentially large set of equations depending on the domain and discretization (e.g., it is common to have millions of equations in three-dimensional computational fluid dynamics problems). The number of state variables of the discretized model is equal to the number of equations for a complete and well-defined model. In the most general case, the set of equations could be implicit and nonlinear.

When a problem involves both space and time, the prevailing approach is to decouple the discretization in space from the discretization in time—called the method of lines (see Fig. 3.7).

[Fig. 3.7: PDEs in space and time are often discretized in space first to yield an ODE in time.]

3.5 Numerical Errors

An absolute error is the magnitude of the difference between the exact value (𝑥*) and the computed value (𝑥), which we can write as |𝑥 − 𝑥*|.
This is the more useful error measure in most cases. When the exact value 𝑥* is close to zero, however, this definition breaks down. To address this, we avoid the division by zero by using

$$\varepsilon = \frac{|x - x^*|}{1 + |x^*|} . \qquad (3.4)$$

This error metric combines the properties of absolute and relative errors. When |𝑥*| ≫ 1, this metric is similar to the relative error, but when |𝑥*| ≪ 1, it becomes similar to the absolute error.
Suppose that three decimal digits are available to represent a number (and that we use base 10 for simplicity). Then, 𝜀ℳ = 0.005 because any number smaller than this results in 1 + 𝜀 = 1 when rounded to three digits. For example, 1.00 + 0.00499 = 1.00499, which rounds to 1.00. On the other hand, 1.00 + 0.005 = 1.005, which rounds to 1.01 and satisfies Eq. 3.6.

If we try to store 24.11 using three digits, we get 24.1. The relative error is

$$\frac{24.11 - 24.1}{24.11} \approx 0.0004 ,$$

which is lower than the maximum possible representation error of 𝜀ℳ = 0.005 established in Ex. 3.4.
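The same experiment can be run in double precision; a minimal sketch:

```python
import numpy as np

eps = np.finfo(np.float64).eps   # machine epsilon, about 2.2e-16
print(eps)
print(1.0 + eps == 1.0)          # False: 1 + eps is distinguishable from 1
print(1.0 + eps / 4.0 == 1.0)    # True: the sum rounds back down to 1.0
```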
For addition and subtraction, an error can occur even when the two operands are represented exactly. Before addition and subtraction, the computer must convert the two numbers to the same exponent. When adding numbers with different exponents, several digits from the smaller number vanish (see Fig. 3.8). If the difference in the two exponents is greater than the magnitude of the exponent of 𝜀ℳ, the smaller number vanishes completely—a consequence of Eq. 3.6. The relative error incurred in addition is still 𝜀ℳ.

[Fig. 3.8: Adding or subtracting numbers of differing exponents results in a loss in the number of digits corresponding to the difference in the exponents.]

On the other hand, subtraction can incur much greater relative errors when subtracting two numbers that have the same exponent and are close to each other. In this case, the digits that match between the two numbers cancel each other and reduce the number of significant digits. When the relative difference between two numbers is less than machine precision, all digits match, and the subtraction result is zero (see Fig. 3.9). This is called subtractive cancellation and is a serious issue when approximating derivatives via finite differences (see Section 6.4).

[Fig. 3.9: Subtracting two nearly equal numbers cancels their common digits; the gray boxes indicate digits that are identical between the two numbers.]
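A minimal sketch of subtractive cancellation, using a forward-difference derivative of sin(x) where the step is made far too small:

```python
import numpy as np

x = 1.0
exact = np.cos(x)  # d/dx sin(x)

for h in [1e-4, 1e-8, 1e-12, 1e-16]:
    fd = (np.sin(x + h) - np.sin(x)) / h   # forward difference
    print(h, abs(fd - exact))
# The error first shrinks with h (truncation error), then grows again as
# cancellation destroys the significant digits; at h = 1e-16 the two
# function values are identical in double precision and fd is zero.
```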
If we use double precision and plot many points in a small interval, we can see that the function exhibits the step pattern shown in Fig. 3.10. The numerical minimum of this function is anywhere in the interval around 𝑥 = 2 where the numerical value is zero. This interval is much larger than the machine precision (𝜀ℳ = 2.2 × 10⁻¹⁶). An additional error is incurred in the function computation itself.

[Fig. 3.10: At double precision, the function values near 𝑥 = 2 exhibit discrete steps with a magnitude of around 10⁻¹⁵.]

Truncation errors, in contrast, arise even if we could do the arithmetic with infinite precision.† When discretizing a mathematical model with partial derivatives as described in Section 3.4, these are approximated by truncated Taylor series expansions that ignore higher-order terms. When the model includes integrals, they are approximated as finite sums. In either case, a mesh of points where the relevant states and functions are evaluated is required. Discretization errors generally decrease as the spacing between the points decreases.

† Roundoff error, discussed in the previous section, is sometimes also referred to as truncation error because digits are truncated. However, we avoid this confusing naming and only use truncation error to refer to a truncation in the number of operations.
Consider the convergence of an iterative solver shown in Fig. 3.11. The norm of the residuals decreases gradually until a limit is reached (near 10⁻¹⁰ in this case). This limit represents the lowest error achieved with the iterative solver and is determined by other sources of error, such as roundoff and truncation errors. If we terminate before reaching the limit (either by setting a convergence tolerance to a value higher than 10⁻¹⁰ or setting an iteration limit lower than 400 iterations), we incur an additional error. However, it might be desirable to trade off a less precise solution for a lower computational effort.

[Fig. 3.11: Residual norm versus iteration number for an iterative solver.]
[Fig. 3.12: To find the level of numerical noise of a function of interest with respect to an input parameter (left), we magnify both axes by several orders of magnitude and evaluate the function at points that are closely spaced (right); here the noise level is around 10⁻⁸.]
The overall attitude toward programming should be that all code has bugs
until it is verified through testing. Programmers who are skilled at debugging
are not necessarily any better at spotting errors by reading code or by stepping
through a debugger than average programmers. Instead, effective programmers
use a systematic approach to narrow down where the problem is occurring.
Beginners often try to debug by running the entire program. Even experi-
enced programmers have a hard time debugging at that level. One primary
strategy discussed in this section is to write modular code. It is much easier
to test and debug small functions. Reliable complex programs are built up
through a series of well-tested modular functions. Sometimes we need to
simplify or break up functions even further to narrow down the problem. We
might need to streamline and remove pieces, make sure a simple case works,
then slowly rebuild the complexity.
You should also become comfortable reading and understanding the error
messages and stack traces produced by the program. These messages seem
obscure at first, but through practice and researching what the error messages
mean, they become valuable information sources.
Of course, you should carefully reread the code, looking for errors, but
reading through it again and again is unlikely to yield new insights. Instead,
it can be helpful to step away from the code and hypothesize the most likely
ways the function could fail. You can then test and eliminate hypotheses to
narrow down the problem.
3 Numerical Models and Solvers 59
There are several methods available for solving the discretized gov-
erning equations (Eq. 3.1). We want to solve the governing equations
for a fixed set of design variables, so 𝑥 will not appear in the solution
algorithms. Our objective is to find the state variables 𝑢 such that
𝑟(𝑢) = 0.
This is not a book about solvers, but it is essential to understand the
characteristics of these solvers because they affect the cost and precision
of the function evaluations in the overall optimization process. Thus,
we provide an overview and some of the most relevant details in this
section.∗ In addition, the solution of coupled systems builds on these ∗ Ascher and Greif74 provide a more de-
tailed introduction to the numerical meth-
solvers, as we will see in Section 13.2. Finally, some of the optimization ods mentioned in this chapter.
algorithms detailed in later chapters use these solvers. 74. Ascher and Greif, A First Course in
There are two main types of solvers, depending on whether the Numerical Methods, 2011.
Solver SOR
Newton
+ linear solver CG
Krylov subspace
Nonlinear
Nonlinear GMRES
variants of
fixed point
Linear systems can be solved directly or iteratively. Direct methods are based on the concept of Gaussian elimination, which can be expressed in matrix form as a factorization into lower and upper triangular matrices that are easier to solve (LU factorization). Cholesky factorization is a more efficient variant of LU factorization that applies only to symmetric positive-definite matrices.

Whereas direct solvers obtain the solution 𝑢 at the end of a process, iterative solvers start with a guess for 𝑢 and successively improve it with each iteration, as illustrated in Fig. 3.14.
Direct methods are the right choice for many problems because they are generally robust. Also, the solution is guaranteed for a fixed number of operations, 𝒪(𝑛³) in this case. However, for large systems where 𝐴 is sparse, the cost of direct methods can become prohibitive, whereas iterative methods remain viable. Iterative methods have other advantages, such as being able to trade between computational cost and precision. They can also be restarted from a good guess (see Appendix B.4).

[Fig. 3.14: Whereas direct methods only yield the solution at the end of the process, iterative methods produce approximate intermediate results.]

Tip 3.5 Do not compute the inverse of 𝐴

Because some numerical libraries have functions to compute 𝐴⁻¹, you might be tempted to do this and then multiply by a vector to compute 𝑢 = 𝐴⁻¹𝑏. This is a bad idea because finding the inverse is computationally expensive. Instead, use LU factorization or another method from Fig. 3.13.
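A minimal sketch contrasting the two approaches in NumPy (the random test system is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
A = rng.standard_normal((n, n)) + n * np.eye(n)  # well-conditioned test matrix
b = rng.standard_normal(n)

u_good = np.linalg.solve(A, b)   # factorize and solve: preferred
u_bad = np.linalg.inv(A) @ b     # forms A^{-1} explicitly: avoid

print(np.linalg.norm(A @ u_good - b))  # residual of the direct solve
print(np.linalg.norm(A @ u_bad - b))   # typically larger, and slower to get
```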
The spatial discretization of a time-dependent problem yields a system of ordinary differential equations of the form

$$\frac{\mathrm{d}u}{\mathrm{d}t} = -r(u, t) , \qquad (3.7)$$

which is called the semi-discrete form. A time-integration scheme is then used to solve for the time history. The integration scheme can be either explicit or implicit, depending on whether it involves evaluating explicit expressions or solving implicit equations. If a system under a certain condition has a steady state, these techniques can be used to solve for the steady state (d𝑢/d𝑡 = 0).
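A minimal sketch of an explicit time-integration scheme (forward Euler) applied to the semi-discrete form; the residual function and step size are illustrative choices:

```python
import numpy as np

def r(u, t):
    # Illustrative residual: drives u toward the steady state u = 1.
    return u - 1.0

u = np.array([3.0])   # initial condition
dt = 0.1              # time step (explicit schemes limit how large this can be)
for k in range(100):
    u = u + dt * (-r(u, k * dt))  # forward Euler update of du/dt = -r(u, t)
print(u)  # approaches the steady state where r(u) = 0
```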
A sequence of iterates 𝑥ₖ converges to the solution 𝑥* when

$$\lim_{k \to \infty} \| x_k - x^* \| = 0 , \qquad (3.8)$$

which means that the norm of the error tends to zero as the number of iterations tends to infinity.

The rate of convergence of a sequence is of order 𝑝 with asymptotic error constant 𝛾 when 𝑝 is the largest number that satisfies∗

$$0 \le \lim_{k \to \infty} \frac{\| x_{k+1} - x^* \|}{\| x_k - x^* \|^p} = \gamma < \infty . \qquad (3.9)$$

∗ Some authors refer to 𝑝 as the rate of convergence. Here, we characterize the rate of convergence by two metrics: order and error constant.
Asymptotic here refers to the fact that this is the behavior in the limit, when we are close to the solution. There is no guarantee that the initial and intermediate iterations satisfy this condition.

To avoid dealing with limits, let us consider the condition expressed in Eq. 3.9 at all iterations. We can relate the error from one iteration to the next by

$$\| x_{k+1} - x^* \| = \gamma_k \| x_k - x^* \|^p . \qquad (3.10)$$

When 𝑝 = 1, we have linear order of convergence; when 𝑝 = 2, we have quadratic order of convergence. Quadratic convergence is a highly valued characteristic for an iterative algorithm, and in practice, orders of convergence greater than 𝑝 = 2 are usually not worthwhile to consider.

When we have linear convergence, then

$$\| x_{k+1} - x^* \| = \gamma_k \| x_k - x^* \| . \qquad (3.11)$$

If 𝛾 = 0.1 and the starting error norm is 0.1, the error norm sequence is 10⁻¹, 10⁻², 10⁻³, 10⁻⁴, …. Thus, after six iterations, we get six-digit precision. Now suppose that 𝛾 = 0.9. Then the error decreases by only 10 percent in each iteration, and far more iterations are needed for the same precision.

When we have quadratic convergence (𝑝 = 2),

$$\| x_{k+1} - x^* \| = \gamma_k \| x_k - x^* \|^2 . \qquad (3.14)$$

If 𝛾 = 1, then the error norm sequence with a starting error norm of 0.1 would be

$$10^{-1}, \; 10^{-2}, \; 10^{-4}, \; 10^{-8}, \; \ldots . \qquad (3.15)$$

This yields more than six digits of precision in just three iterations! In this case, the number of correct digits doubles at every iteration. When 𝛾 > 1, the convergence will not be as fast, but the series will still converge.
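These rates are easy to observe numerically; a minimal sketch that generates both sequences:

```python
# Linear convergence (p = 1, gamma = 0.1) vs. quadratic (p = 2, gamma = 1).
e_lin, e_quad = 0.1, 0.1
for k in range(5):
    print(f"{k}: linear {e_lin:.1e}   quadratic {e_quad:.1e}")
    e_lin = 0.1 * e_lin    # ||e_{k+1}|| = 0.1 ||e_k||
    e_quad = e_quad ** 2   # ||e_{k+1}|| = ||e_k||^2: correct digits double
```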
A special case is superlinear convergence, where 𝑝 = 1 and 𝛾ₖ → 0, that is,

$$\| x_{k+1} - x^* \| = \gamma_k \| x_k - x^* \| , \quad \gamma_k \to 0 . \qquad (3.16)$$

[Fig. 3.15: Sample sequences for linear, superlinear, and quadratic convergence plotted on a linear scale (left) and a logarithmic scale (right).]
When using a linear scale plot, you can only see differences in two significant digits. To reveal changes beyond three digits, you should use a logarithmic scale. This need arises frequently when plotting the convergence behavior of optimization algorithms.
The same convergence metrics can be applied to the norm of the residuals,

$$\| r_{k+1} \| = \gamma_k \| r_k \|^p . \qquad (3.18)$$

Replacing the derivative in Newton's method with a finite difference of the two most recent iterates,

$$u_{k+1} = u_k - r(u_k) \, \frac{u_k - u_{k-1}}{r(u_k) - r(u_{k-1})} ,$$

gives the secant method, which is useful when the derivative is not available. The convergence rate is not quadratic like Newton's method, but it is superlinear.
Example 3.7 Newton's method and the secant method for a single variable

Consider the residual equation 𝑟(𝑢) = 2𝑢³ + 4𝑢² + 𝑢 − 2 = 0. Newton's method yields the iteration

$$u_{k+1} = u_k - \frac{2 u_k^3 + 4 u_k^2 + u_k - 2}{6 u_k^2 + 8 u_k + 1} .$$

When we start with the guess 𝑢₀ = 1.5 (left plot in Fig. 3.16), the iterations are well behaved, and the method converges quadratically.

[Fig. 3.16: Newton iterations starting from different starting points.]

The iterations for the secant method are shown in Fig. 3.17, where we can see the successive secant lines replacing the exact tangent lines used in Newton's method.

[Fig. 3.17: Secant method applied to a one-dimensional function.]
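A minimal sketch of both methods for this residual (the starting points are taken from the example):

```python
def r(u):
    return 2*u**3 + 4*u**2 + u - 2

def dr(u):
    return 6*u**2 + 8*u + 1

# Newton's method: uses the analytic derivative.
u = 1.5
for _ in range(8):
    u = u - r(u) / dr(u)
print(u)  # converges to the root near u* = 0.537

# Secant method: approximates the derivative from the last two iterates.
u_prev, u = 1.5, 1.3
while abs(r(u)) > 1e-12:
    u_prev, u = u, u - r(u) * (u - u_prev) / (r(u) - r(u_prev))
print(u)
```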
This means that if the derivative is close to zero or the curvature tends to a large number at the solution, Newton's method will not converge as well or will not converge at all.

Now we consider the general case where we have 𝑛 nonlinear equations and 𝑛 unknowns, expressed as 𝑟(𝑢) = 0. Similar to the single-variable case, we derive the Newton step from a truncated Taylor series. However, the Taylor series needs to be multidimensional in both the independent variable and the function. Consider first the multidimensionality of the independent variable, 𝑢, for a component of the residuals, 𝑟ᵢ(𝑢). The first two terms of the Taylor series about 𝑢ₖ for a step Δ𝑢 (which is now a vector with arbitrary direction and magnitude) are

$$r_i(u_k + \Delta u) \approx r_i(u_k) + \sum_{j=1}^{n} \left. \frac{\partial r_i}{\partial u_j} \right|_{u = u_k} \Delta u_j . \qquad (3.28)$$

Setting this approximation to zero and writing it for all components yields the linear system

$$J_k \, \Delta u_k = -r_k , \qquad (3.30)$$

where 𝐽 is the Jacobian matrix of the residuals. Each Newton iteration then updates the states with

$$u_{k+1} = u_k + \Delta u_k . \qquad (3.31)$$
Example 3.8 Newton's method applied to two equations

Consider the two equations

$$u_2 = \frac{1}{u_1} , \qquad u_2 = \sqrt{u_1} .$$

This corresponds to the two lines shown in Fig. 3.18, where the solution is at their intersection, 𝑢* = (1, 1). (In this example, the two equations are explicit, and we could solve them by substitution, but they could have been implicit.) In residual form,

$$r_1 = u_2 - \frac{1}{u_1} = 0$$
$$r_2 = u_2 - \sqrt{u_1} = 0 .$$

The Jacobian can be derived analytically, and the Newton step is given by the linear system

$$\begin{bmatrix} \dfrac{1}{u_1^2} & 1 \\[2mm] -\dfrac{1}{2\sqrt{u_1}} & 1 \end{bmatrix} \begin{bmatrix} \Delta u_1 \\ \Delta u_2 \end{bmatrix} = - \begin{bmatrix} u_2 - \dfrac{1}{u_1} \\[2mm] u_2 - \sqrt{u_1} \end{bmatrix} .$$

[Fig. 3.18: Newton iterations.]

Starting from 𝑢 = (2, 3) yields the iterations shown in the following table, with the quadratic convergence shown in Fig. 3.19. (Only the first two rows survive in this excerpt.)

𝑢₁          𝑢₂          ‖𝑢 − 𝑢*‖       ‖𝑟‖
2.000 000   3.000 000   2.24           2.96
0.485 281   0.878 680   5.29 × 10⁻¹    1.20

[Fig. 3.19: Quadratic convergence of the residual norm ‖𝑟‖.]
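A minimal sketch of these Newton iterations with NumPy:

```python
import numpy as np

def residual(u):
    return np.array([u[1] - 1.0 / u[0],
                     u[1] - np.sqrt(u[0])])

def jacobian(u):
    return np.array([[1.0 / u[0]**2,        1.0],
                     [-0.5 / np.sqrt(u[0]), 1.0]])

u = np.array([2.0, 3.0])  # starting guess from the example
for k in range(10):
    du = np.linalg.solve(jacobian(u), -residual(u))  # Newton step (Eq. 3.30)
    u = u + du                                       # update (Eq. 3.31)
    print(k, u, np.linalg.norm(residual(u)))
# converges quadratically to u* = (1, 1)
```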
In the optimization problem statements so far, the governing equations were not included because they were assumed to be part of the computation of the objective and constraints for a given 𝑥. However, we can include them in the problem statement for completeness as follows:

$$\begin{aligned}
\text{minimize} \quad & f(x; u) \\
\text{by varying} \quad & x_i , \quad i = 1, \ldots, n_x \\
\text{subject to} \quad & g_j(x; u) \le 0 , \quad j = 1, \ldots, n_g \\
& h_k(x; u) = 0 , \quad k = 1, \ldots, n_h \\
& \underline{x}_i \le x_i \le \overline{x}_i , \quad i = 1, \ldots, n_x \\
\text{while solving} \quad & r_l(u; x) = 0 , \quad l = 1, \ldots, n_u \\
\text{by varying} \quad & u_l , \quad l = 1, \ldots, n_u . \qquad (3.32)
\end{aligned}$$

Here, "while solving" means that the governing equations are solved at each optimization iteration to find a valid 𝑢 for each value of 𝑥. The semicolon in 𝑓(𝑥; 𝑢) indicates that 𝑢 is fixed while the optimizer determines the next value of 𝑥.

[Fig. 3.21: Computing the objective (𝑓) and constraint functions (𝑔, ℎ) for a given set of design variables (𝑥) usually involves the solution of a numerical model (𝑟 = 0) by varying the state variables (𝑢).]
Recalling the truss problem of Ex. 3.2, suppose we want to minimize the mass of the structure (𝑚) by varying the cross-sectional areas of the truss members (𝑎), subject to stress constraints. The structural mass is an explicit function that can be written as

$$m = \sum_{i=1}^{15} \rho \, a_i \, l_i ,$$

where 𝜌 is the material density, 𝑎ᵢ is the cross-sectional area of each member 𝑖, and 𝑙ᵢ is the member length. This function depends on the design variables directly and does not depend on the displacements.
The optimization problem is

$$\begin{aligned}
\text{minimize} \quad & m(a) \\
\text{by varying} \quad & a_i \ge a_{\min} , \quad i = 1, \ldots, 15 \\
\text{subject to} \quad & |\sigma_j(a, u)| - \sigma_{\max} \le 0 , \quad j = 1, \ldots, 15 \\
\text{while solving} \quad & K u - q = 0 \quad \text{(system of 18 equations)} \\
\text{by varying} \quad & u_l , \quad l = 1, \ldots, 18 .
\end{aligned}$$

The governing equations are a linear set of equations whose solution determines the displacements (𝑢) of a given design (𝑎) for a load condition (𝑞). We mentioned previously that the objective and constraint functions are usually explicit functions of the state variables, design variables, or both. As we saw in Ex. 3.2, the mass is an explicit function of the cross-sectional areas. In this case, it does not even depend on the state variables. The constraint function is also explicit, but in this case, it is just a function of the state variables. This example illustrates a common situation where the solution of the state variables requires the solution of implicit equations (structural solver), whereas the constraints (stresses) and objective (weight) are explicit functions of the states and design variables.
In the full-space approach, the optimizer varies both the design variables and the state variables and enforces the governing equations as constraints:

$$\begin{aligned}
\text{minimize} \quad & f(x, u) \\
\text{by varying} \quad & x_i , \quad i = 1, \ldots, n_x \\
& u_l , \quad l = 1, \ldots, n_u \\
\text{subject to} \quad & g_j(x, u) \le 0 , \quad j = 1, \ldots, n_g \\
& h_k(x, u) = 0 , \quad k = 1, \ldots, n_h \\
& \underline{x}_i \le x_i \le \overline{x}_i , \quad i = 1, \ldots, n_x \\
& r_l(x, u) = 0 , \quad l = 1, \ldots, n_u . \qquad (3.33)
\end{aligned}$$

[Fig. 3.22: In the full-space approach, the governing equations are solved by the optimizer by varying the state variables.]

This approach is described in more detail in Section 13.4.3.
More generally, the optimization constraints and equations in a
model are interchangeable. Suppose a set of equations in a model can
be satisfied by varying a corresponding set of state variables. In that case,
these equations and variables can be moved to the optimization problem
statement as equality constraints and design variables, respectively.
To solve the structural sizing problem (Ex. 3.9) using a full-space approach, we forgo the linear solver by adding 𝑢 to the set of design variables and letting the optimizer enforce the governing equations. This results in the following problem:

$$\begin{aligned}
\text{minimize} \quad & m(a) \\
\text{by varying} \quad & a_i \ge a_{\min} , \quad i = 1, \ldots, 15 \\
& u_l , \quad l = 1, \ldots, 18 \\
\text{subject to} \quad & |\sigma_j(a, u)| - \sigma_{\max} \le 0 , \quad j = 1, \ldots, 15 \\
& K u - q = 0 \quad \text{(system of 18 equations)} .
\end{aligned}$$
Before you optimize, you should be familiar with the analysis (model and
solver) that computes the objective and constraints. If possible, make several
parameter sweeps to see what the functions look like—whether they are smooth,
whether they seem unimodal or not, what the trends are, and the range of
values. You should also get an idea of the computational effort required and if
that varies significantly. Finally, you should test the robustness of the analysis
to different inputs because the optimization is likely to ask for extreme values.
3.10 Summary
Problems
3.2 Choose an engineering system that you are familiar with and
describe each of the components illustrated in Fig. 3.1 for that
system. List all the options for the mathematical and numerical
models that you can think of, and describe the assumptions for
each model. What type of solver is usually used for each model
(see Section 3.6)? What are the state variables for each model?
$$\frac{u_1^2}{4} + u_2^2 = 1$$
$$4 u_1 u_2 = \pi$$
$$f = 4 (u_1 + u_2) .$$
3.4 Reproduce a plot similar to the one shown in Fig. 3.10 for 𝑓(𝑥) = cos(𝑥) + 1 in the neighborhood of 𝑥 = 𝜋.
$$r(u) = u^3 - 6 u^2 + 12 u - 8 = 0 .$$

$$E - e \sin(E) = M ,$$
3.7 Consider the equation from Prob. 3.5, where we replace one of the coefficients with a parameter 𝑎 as follows:

$$r(u) = a u^3 - 6 u^2 + 12 u - 8 = 0 .$$
3.8 Reproduce the solution of Ex. 3.8 and then try different initial
guesses. Can you define a distinct region from where Newton’s
method converges?
3.9 Choose a problem that you are familiar with and find the magni-
tude of numerical noise in one or more outputs of interest with
respect to one or more inputs of interest. What means do you
have to decrease the numerical noise? What is the lowest possible
level of noise you can achieve?
4 Unconstrained Gradient-Based Optimization

In this chapter we focus on unconstrained optimization problems with continuous design variables, which we can write as

$$\underset{x}{\text{minimize}} \quad f(x) .$$
4.1 Fundamentals
Each component of the gradient is a partial derivative of the function with respect to one design variable, defined as the limit

$$\frac{\partial f}{\partial x_i} = \lim_{\varepsilon \to 0} \frac{f(x_1, \ldots, x_i + \varepsilon, \ldots, x_n) - f(x_1, \ldots, x_i, \ldots, x_n)}{\varepsilon} . \qquad (4.3)$$
Each component in the gradient vector quantifies the function’s local
rate of change with respect to the corresponding design variable, as
shown in Fig. 4.2 for the two-dimensional case. In other words, these
components represent the slope of the function along each coordinate
direction. The gradient is a vector pointing in the direction of the
greatest function increase from the current point.
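Equation 4.3 suggests a simple (if error-prone, per Section 6.4) numerical approximation of the gradient; a minimal sketch, using a quadratic test function that reappears later in this chapter:

```python
import numpy as np

def gradient_fd(f, x, eps=1e-7):
    """Forward-difference approximation of the gradient (Eq. 4.3, small eps)."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        x_step = x.copy()
        x_step[i] += eps          # perturb one design variable at a time
        g[i] = (f(x_step) - f(x)) / eps
    return g

f = lambda x: x[0]**2 + 2*x[1]**2 - x[0]*x[1]
print(gradient_fd(f, np.array([-0.5, 1.0])))  # exact gradient is [-2, 4.5]
```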
[Fig. 4.2: Components of the gradient vector in the two-dimensional case.]

The gradient vectors are normal to the surfaces of constant 𝑓 in 𝑛-dimensional space (isosurfaces). In the two-dimensional case, gradient vectors are perpendicular to the function contour lines. This defines the vector field plotted in Fig. 4.3, where each vector points in the direction of the steepest local increase.
[Fig. 4.3: Gradient vector field for a function with a minimum, a maximum, and saddle points, shown as contours (left) and as a surface (right).]
Consider the wing design problem from Ex. 1.1, where the objective function is the required power (𝑃). For the derivative of power with respect to span (𝜕𝑃/𝜕𝑏), the units are watts per meter (W/m). For a wing with 𝑐 = 1 m and 𝑏 = 12 m, we have 𝑃 = 1087.85 W and 𝜕𝑃/𝜕𝑏 = −41.65 W/m. This means that for an increase in span of 1 m, the linear approximation predicts a decrease in power of 41.65 W (to 𝑃 = 1046.20 W). However, the actual power at 𝑏 = 13 m is 1059.77 W because the function is nonlinear (see Fig. 4.4). The relative derivative for this same design can be computed as (𝜕𝑃/𝜕𝑏)(𝑏/𝑃) = −0.459, which means that for a 1 percent increase in span, the linear approximation predicts a 0.459 percent decrease in power. The actual decrease is 0.310 percent.
[Fig. 4.4: Power versus span and the corresponding derivative.]
The directional derivative of 𝑓 in a direction 𝑝 projects the gradient onto that direction:

$$\nabla_p f(x) = \nabla f^\top p . \qquad (4.6)$$

[Fig. 4.5: Projection of the gradient in an arbitrary unit direction 𝑝.]
From the gradient projection, we can see why the gradient is the direction of the steepest increase. If we use this definition of the dot product,

$$\nabla_p f(x) = \nabla f^\top p = \| \nabla f \| \, \| p \| \cos \theta , \qquad (4.7)$$

where 𝜃 is the angle between the two vectors, we can see that this is maximized when 𝜃 = 0°. That is, the directional derivative is largest when 𝑝 points in the same direction as ∇𝑓.
If 𝜃 is in the interval (−90, 90)◦ , the directional derivative is positive
and is thus in a direction of increase, as shown in Fig. 4.6. If 𝜃 is in the
interval (90, 180]◦ , the directional derivative is negative, and 𝑝 points
in a descent direction. Finally, if 𝜃 = ±90◦ , the directional derivative
is 0, and thus the function value does not change for small steps; it
is locally flat in that direction. This condition occurs when ∇ 𝑓 and 𝑝
are orthogonal; therefore, the gradient is orthogonal to the function
isosurfaces.
[Fig. 4.6: The gradient ∇f is always orthogonal to contour lines (surfaces), and the directional derivative in the direction p is given by ∇f^⊺p; it is positive for directions of increase, negative for descent directions, and zero along the contour line tangent.]
To get the correct slope in the original units of 𝑥, the direction should
be normalized as 𝑝ˆ = 𝑝/k𝑝k. However, in some of the gradient-based
algorithms of this chapter, 𝑝 is not normalized because the length
contains useful information. If 𝑝 is not normalized, the slopes and
variable axis are scaled by a constant.
Consider, for example, the function

$$f(x_1, x_2) = x_1^2 + 2x_2^2 - x_1 x_2.$$
[Fig. 4.7: Function contours and direction p (left), one-dimensional slice along p (middle), directional derivative for all directions on a polar plot (right).]
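As a concrete illustration of Eqs. 4.6 and 4.7 (ours, not from the text), the following Python sketch evaluates the gradient of this example function and its directional derivative along an arbitrarily chosen direction:

```python
import numpy as np

def f(x):
    # Example function: f(x1, x2) = x1^2 + 2*x2^2 - x1*x2
    return x[0]**2 + 2 * x[1]**2 - x[0] * x[1]

def grad_f(x):
    # Analytic gradient of the example function
    return np.array([2 * x[0] - x[1], 4 * x[1] - x[0]])

x = np.array([-0.5, 0.5])        # evaluation point (arbitrary choice)
p = np.array([1.0, 1.0])         # search direction (arbitrary choice)
p_hat = p / np.linalg.norm(p)    # normalize to get the slope in the original units

slope = grad_f(x) @ p_hat        # directional derivative, Eq. 4.6
print(slope)
```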
4.1.2 Curvature and Hessians

The rate of change of the gradient (the curvature) is also useful information because it tells us if a function's slope is increasing (positive curvature), decreasing (negative curvature), or stationary (zero curvature).
In one dimension, the gradient reduces to a scalar (the slope), and
the curvature is also a scalar that can be calculated by taking the second
derivative of the function. To quantify curvature in 𝑛 dimensions, we
need to take the partial derivative of each gradient component 𝑗 with
respect to each coordinate direction 𝑖, yielding
$$\frac{\partial^2 f}{\partial x_i \partial x_j}. \quad (4.8)$$
Arranging all of these second derivatives in a matrix yields the Hessian,

$$H_f(x) = \begin{bmatrix} \dfrac{\partial^2 f}{\partial x_1^2} & \dfrac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \dfrac{\partial^2 f}{\partial x_1 \partial x_n} \\[6pt] \dfrac{\partial^2 f}{\partial x_2 \partial x_1} & \dfrac{\partial^2 f}{\partial x_2^2} & \cdots & \dfrac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \dfrac{\partial^2 f}{\partial x_n \partial x_1} & \dfrac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \dfrac{\partial^2 f}{\partial x_n^2} \end{bmatrix}, \quad (4.10)$$

whose elements are

$$H_{f_{ij}} = \frac{\partial^2 f}{\partial x_i \partial x_j}. \quad (4.11)$$
Because of the symmetry of second derivatives, the Hessian is a sym-
metric matrix with 𝑛(𝑛 + 1)/2 independent elements.
Each row 𝑖 of the Hessian is a vector that quantifies the rate of
change of all components 𝑗 of the gradient vector with respect to the
direction 𝑖. On the other hand, each column 𝑗 of the matrix quantifies
the rate of change of component 𝑗 of the gradient vector with respect to
all coordinate directions 𝑖. Because the Hessian is symmetric, the rows
and columns are transposes of each other, and these two interpretations
are equivalent.
We can find the rate of change of the gradient in an arbitrary
normalized direction 𝑝 by taking the product 𝐻𝑝. This yields an 𝑛-
vector that quantifies the rate of change of the gradient in the direction
𝑝, where each component of the vector is the rate of the change of the
corresponding partial derivative with respect to a movement along 𝑝.
Therefore, this product is defined as follows:
$$H p = \nabla_p\left(\nabla f(x)\right) = \lim_{\varepsilon \to 0} \frac{\nabla f(x + \varepsilon p) - \nabla f(x)}{\varepsilon}. \quad (4.12)$$
Because of the symmetry of second derivatives, we can also interpret
this as the rate of change in the directional derivative of the function
along 𝑝 with respect to each of the components of 𝑝.
To find the curvature of the one-dimensional function along a
direction 𝑝, we need to project 𝐻𝑝 onto direction 𝑝 as
$$\nabla_p \nabla_p f(x) = p^\intercal H p. \quad (4.13)$$
Consider again the quadratic function from the earlier example, $f(x_1, x_2) = x_1^2 + 2x_2^2 - x_1 x_2$, whose contours are shown in Fig. 4.8 (left). These contours are ellipses that have the same center. The Hessian of this quadratic is
$$H = \begin{bmatrix} 2 & -1 \\ -1 & 4 \end{bmatrix},$$

which is constant. To find the curvature in the direction $p = \left[-\tfrac{1}{2}, -\tfrac{\sqrt{3}}{2}\right]$, we compute

$$p^\intercal H p = \begin{bmatrix} -\tfrac{1}{2} & -\tfrac{\sqrt{3}}{2} \end{bmatrix} \begin{bmatrix} 2 & -1 \\ -1 & 4 \end{bmatrix} \begin{bmatrix} -\tfrac{1}{2} \\[2pt] -\tfrac{\sqrt{3}}{2} \end{bmatrix} = \frac{7 - \sqrt{3}}{2}.$$
The principal curvature directions can be computed by solving the eigenvalue
problem (Eq. 4.14). This yields two eigenvalues and two corresponding
eigenvectors,
$$\kappa_A = 3 + \sqrt{2}, \quad v_A = \begin{bmatrix} 1 - \sqrt{2} \\ 1 \end{bmatrix}, \qquad \text{and} \qquad \kappa_B = 3 - \sqrt{2}, \quad v_B = \begin{bmatrix} 1 + \sqrt{2} \\ 1 \end{bmatrix}.$$
[Fig. 4.8: Contours of f for Ex. 4.4 and the two principal curvature directions in red. The polar plot shows the curvature p^⊺Hp, with the eigenvectors pointing at the directions of principal curvature; all other directions have curvature values in between.]
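The curvature computation and the eigendecomposition above are easy to verify numerically; the following sketch (an illustration, not from the text) uses NumPy:

```python
import numpy as np

H = np.array([[2.0, -1.0],
              [-1.0, 4.0]])               # Hessian from the example
p = np.array([-0.5, -np.sqrt(3) / 2])     # unit direction from the example

curvature = p @ H @ p                     # projected curvature, Eq. 4.13
print(curvature, (7 - np.sqrt(3)) / 2)    # both approximately 2.634

kappa, V = np.linalg.eigh(H)              # principal curvatures and directions
print(kappa)                              # [3 - sqrt(2), 3 + sqrt(2)]
```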
Consider the same polynomial from Ex. 4.1. Differentiating the gradient
we obtained previously yields the Hessian:
$$H(x_1, x_2) = \begin{bmatrix} 6x_1 & 4x_2 \\ 4x_2 & 4x_1 - 6x_2 \end{bmatrix}.$$
We can visualize the variation of the Hessian by plotting the principal curvatures
at different points (Fig. 4.9).
[Fig. 4.9: Principal curvatures of the polynomial at different points, with the minimum, maximum, and saddle points marked.]
Using the gradient and Hessian of the two-variable polynomial from Ex. 4.1
and Ex. 4.5, we can use Eq. 4.15 to construct a second-order Taylor expansion
about 𝑥0 ,
$$\tilde{f}(p) = f(x_0) + \begin{bmatrix} 3x_1^2 + 2x_2^2 - 20 \\ 4x_1 x_2 - 3x_2^2 \end{bmatrix}^\intercal p + \frac{1}{2}\, p^\intercal \begin{bmatrix} 6x_1 & 4x_2 \\ 4x_2 & 4x_1 - 6x_2 \end{bmatrix} p,$$

where the gradient and Hessian are evaluated at $x_0$.
Figure 4.10 shows the resulting Taylor series expansions about different points.
We perform three expansions, each about three critical points: the minimum
(left), the maximum (middle), and the saddle point (right). The expansion
about the minimum yields a convex quadratic that is a good approximation of
the original function near the minimum but becomes worse as we step farther
away. The expansion about the maximum shows a similar trend except that the
approximation is a concave quadratic. Finally, the expansion about the saddle
point yields a saddle function.
[Fig. 4.10: Second-order Taylor expansions about the minimum (left), the maximum (middle), and the saddle point (right), with the corresponding one-dimensional slices below each contour plot.]
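The following Python sketch (ours, not from the text) builds this second-order Taylor model; the gradient and Hessian are those from Exs. 4.1 and 4.5, and the function value at $x_0$ is passed in because the polynomial itself is not restated here:

```python
import numpy as np

def grad(x):
    # Gradient of the polynomial from Ex. 4.1
    return np.array([3 * x[0]**2 + 2 * x[1]**2 - 20,
                     4 * x[0] * x[1] - 3 * x[1]**2])

def hess(x):
    # Hessian from Ex. 4.5
    return np.array([[6 * x[0], 4 * x[1]],
                     [4 * x[1], 4 * x[0] - 6 * x[1]]])

def taylor2(f0, x0, p):
    # Second-order Taylor expansion about x0, evaluated at a step p
    return f0 + grad(x0) @ p + 0.5 * p @ hess(x0) @ p
```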
Expanding $f$ about a candidate optimal point $x^*$ yields

$$f(x^* + p) = f(x^*) + \nabla f(x^*)^\intercal p + \frac{1}{2}\, p^\intercal H(x^*)\, p + \ldots \quad (4.16)$$
For 𝑥 ∗ to be an optimal point, we must have 𝑓 (𝑥 ∗ + 𝑝) ≥ 𝑓 (𝑥 ∗ ) for all 𝑝.
This implies that the first- and second-order terms in the Taylor series
have to be nonnegative, that is,
$$\nabla f(x^*)^\intercal p + \frac{1}{2}\, p^\intercal H(x^*)\, p \ge 0 \quad \text{for all } p. \quad (4.17)$$
Because the magnitude of $p$ is small, we can always find a $p$ such that the first term dominates. Therefore, we require that

$$\nabla f(x^*)^\intercal p \ge 0 \quad \text{for all } p. \quad (4.18)$$
Because 𝑝 can be in any arbitrary direction, the only way this inequality
can be satisfied is if all the elements of the gradient are zero (refer to
Fig. 4.6),
∇ 𝑓 (𝑥 ∗ ) = 0 . (4.19)
This is the first-order necessary optimality condition for an unconstrained problem. It is necessary because if any element of the gradient were nonzero, there would be descent directions (e.g., $p = -\nabla f$) for which the inequality would not be satisfied.
Because the gradient term has to be zero, we must now satisfy the remaining term in the inequality (Eq. 4.17), that is,

$$p^\intercal H(x^*)\, p \ge 0 \quad \text{for all } p. \quad (4.20)$$
From Eq. 4.13, we know that this term represents the curvature in
direction 𝑝, so this means that the function curvature must be positive
or zero when projected in any direction. You may recognize this
inequality as the definition of a positive-semidefinite matrix. In other
words, the Hessian 𝐻(𝑥 ∗ ) must be positive semidefinite.
[Fig. 4.11: Quadratic functions with different types of Hessians.]

In summary, the necessary optimality conditions for an unconstrained optimization problem are

$$\nabla f(x^*) = 0 \quad \text{and} \quad H(x^*) \text{ is positive semidefinite}. \quad (4.21)$$
The sufficient optimality conditions are

$$\nabla f(x^*) = 0 \quad \text{and} \quad H(x^*) \text{ is positive definite}. \quad (4.22)$$
Consider the function

$$f(x_1, x_2) = \frac{1}{2}x_1^4 + 2x_1^3 + \frac{3}{2}x_1^2 + x_2^2 - 2x_1 x_2,$$

whose minima we can find by solving the optimality conditions analytically. To find the critical points, we solve for the points at which the gradient is equal to zero,
$$\nabla f = \begin{bmatrix} \dfrac{\partial f}{\partial x_1} \\[6pt] \dfrac{\partial f}{\partial x_2} \end{bmatrix} = \begin{bmatrix} 2x_1^3 + 6x_1^2 + 3x_1 - 2x_2 \\ 2x_2 - 2x_1 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}.$$
From the second equation, we have that 𝑥2 = 𝑥1 . Substituting this into the first
equation yields
$$x_1\left(2x_1^2 + 6x_1 + 1\right) = 0.$$
The solution of this equation yields three points:

$$x_A = \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \quad x_B = \begin{bmatrix} -\tfrac{3}{2} - \tfrac{\sqrt{7}}{2} \\[2pt] -\tfrac{3}{2} - \tfrac{\sqrt{7}}{2} \end{bmatrix}, \quad x_C = \begin{bmatrix} -\tfrac{3}{2} + \tfrac{\sqrt{7}}{2} \\[2pt] -\tfrac{3}{2} + \tfrac{\sqrt{7}}{2} \end{bmatrix}.$$
To classify these points, we need to compute the Hessian matrix. Differentiating the gradient, we get

$$H(x_1, x_2) = \begin{bmatrix} \dfrac{\partial^2 f}{\partial x_1^2} & \dfrac{\partial^2 f}{\partial x_1 \partial x_2} \\[6pt] \dfrac{\partial^2 f}{\partial x_2 \partial x_1} & \dfrac{\partial^2 f}{\partial x_2^2} \end{bmatrix} = \begin{bmatrix} 6x_1^2 + 12x_1 + 3 & -2 \\ -2 & 2 \end{bmatrix}.$$
The Hessian at the first point is

$$H(x_A) = \begin{bmatrix} 3 & -2 \\ -2 & 2 \end{bmatrix},$$

whose eigenvalues ($\kappa_1 \approx 0.438$ and $\kappa_2 \approx 4.562$) are both positive, so $x_A$ is a local minimum. The Hessian at the second point is

$$H(x_B) = \begin{bmatrix} 9 + 3\sqrt{7} & -2 \\ -2 & 2 \end{bmatrix}.$$

The eigenvalues are $\kappa_1 \approx 1.737$ and $\kappa_2 \approx 17.200$, so this point is another local minimum. For the third point,

$$H(x_C) = \begin{bmatrix} 9 - 3\sqrt{7} & -2 \\ -2 & 2 \end{bmatrix}.$$
The eigenvalues for this Hessian are 𝜅 1 ≈ −0.523 and 𝜅 2 ≈ 3.586, so this point
is a saddle point.
Figure 4.12 shows these three critical points. To find out which of the two
local minima is the global one, we evaluate the function at each of these points.
Because 𝑓 (𝑥 𝐵 ) < 𝑓 (𝑥 𝐴 ), 𝑥 𝐵 is the global minimum.
[Fig. 4.12: The three critical points: x_A (local minimum), x_B (global minimum), and x_C (saddle point).]
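As a numerical cross-check of this classification (an illustration, not from the book), the eigenvalues of the Hessian at the three critical points can be computed with NumPy; recall that $x_2 = x_1$ at all three points:

```python
import numpy as np

def hessian(x1):
    # Hessian from the example above (only x1 appears in its entries)
    return np.array([[6 * x1**2 + 12 * x1 + 3, -2.0],
                     [-2.0, 2.0]])

roots = [0.0, (-3 - np.sqrt(7)) / 2, (-3 + np.sqrt(7)) / 2]
for x1 in roots:
    eig = np.linalg.eigvalsh(hessian(x1))
    kind = ("local minimum" if np.all(eig > 0) else
            "local maximum" if np.all(eig < 0) else "saddle point")
    print(f"x1 = {x1:+.4f}: eigenvalues {eig}, {kind}")
```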
A common convergence criterion is based on the infinity norm of the gradient,

$$\|\nabla f\|_\infty \le \tau, \quad (4.23)$$

where $\tau$ is a specified tolerance.
The absolute and relative conditions on the objective are of the same
form, although they only use an absolute value rather than a norm
because the objective is scalar.
[Fig. 4.13: Taylor series quadratic models are only guaranteed to be accurate near the point about which the series is expanded (x_k).]

A truncated Taylor series is, in general, only a good model within a small neighborhood, as shown in Fig. 4.13, which shows three quadratic models of the same function based on three different points. All
quadratic approximations match the local gradient and curvature at
the respective points. However, the Taylor series quadratic about the
first point (left plot) yields a quadratic without a minimum (the only
critical point is a saddle point). The second point (middle plot) yields
a quadratic whose minimum is closer to the true minimum. Finally,
the Taylor series about the actual minimum point (right plot) yields a
quadratic with the same minimum, as expected, but we can see how
the quadratic model worsens the farther we are from the point.
Because the Taylor series is only guaranteed to be a good model
locally, we need a globalization strategy to ensure convergence to an
optimum. Globalization here means making the algorithm robust
enough that it can converge to a local minimum when starting from
any point in the domain. This should not be confused with finding the
global minimum, which is a separate issue (see Tip 4.8). There are two
main globalization strategies: line search and trust region.
The line search approach consists of three main steps for every iteration (Fig. 4.14):

1. Determine the search direction, p_k, using any of the methods from Section 4.4.
2. Determine the step length, α_k, using a line search algorithm.
3. Update the design variables, x_{k+1} = x_k + α_k p_k, and increment the iteration index.
For the line search subproblem, we assume that we are given a starting location at $x_k$ and a suitable search direction $p_k$ along which to search (Fig. 4.16). The line search then operates solely on points along direction $p_k$ starting from $x_k$, which can be written as

$$x_{k+1} = x_k + \alpha p_k, \quad (4.26)$$

where the scalar $\alpha$ is always positive and represents how far we go in the direction $p_k$. This equation produces a one-dimensional slice of $n$-dimensional space, as illustrated in Fig. 4.17.

[Fig. 4.16: The line search starts from a given point x_k and searches solely along direction p_k.]
[Fig. 4.17: The line search projects the n-dimensional problem onto one dimension, where the independent variable is α.]
Consider the bean function whose contours are shown in Fig. 4.19. At point $x_k$, the direction $p_k$ is a descent direction. However, it would be wasteful to spend a lot of effort determining the exact minimum in the $p_k$ direction because it would not take us any closer to the minimum of the overall function (the dot on the right side of the plot). Instead, we should find a point that is good enough and then update the search direction.

[Fig. 4.19: The descent direction does not necessarily point toward the minimum, in which case it would be wasteful to do an exact line search.]

To simplify the notation for the line search, we define the single-variable function

$$\phi(\alpha) = f(x_k + \alpha p_k), \quad (4.27)$$

where $\alpha = 0$ corresponds to the start of the line search ($x_k$ in Fig. 4.17), and thus $\phi(0) = f(x_k)$. Then, using $x = x_k + \alpha p_k$, the slope of the single-variable function is
$$\phi'(\alpha) = \frac{\partial f(x)}{\partial \alpha} = \sum_{i=1}^{n} \frac{\partial f}{\partial x_i} \frac{\partial x_i}{\partial \alpha}. \quad (4.28)$$

Because $\partial x_i / \partial \alpha = p_i$, this yields

$$\phi'(\alpha) = \nabla f(x_k + \alpha p_k)^\intercal p_k, \quad (4.29)$$
which is the directional derivative along the search direction. The slope at the start of a given line search is

$$\phi'(0) = \nabla f_k^\intercal p_k. \quad (4.30)$$

Figure 4.20 is a version of the one-dimensional slice from Fig. 4.17 in this notation. The $\alpha$ axis and the slopes scale with the magnitude of $p_k$.

[Fig. 4.20: For the line search, we denote the function as φ(α) with the same value as f. The slope φ′(α) is the gradient of f projected onto the search direction.]

4.3.1 Sufficient Decrease and Backtracking

The simplest line search algorithm to find a "good enough" point relies on the sufficient decrease condition combined with a backtracking algorithm. The sufficient decrease condition, also known as the Armijo condition, is given by the inequality
$$\phi(\alpha) \le \phi(0) + \mu_1 \alpha \phi'(0), \quad (4.31)$$

where $\mu_1$ is a constant such that $0 < \mu_1 \le 1$.† The quantity $\alpha\phi'(0)$ represents the expected decrease of the function, assuming the function continued at the same slope. The multiplier $\mu_1$ states that Eq. 4.31 will be satisfied as long as we achieve even a small fraction of the expected decrease, as shown in Fig. 4.21. In practice, this constant is several orders of magnitude smaller than 1, typically $\mu_1 = 10^{-4}$. Because $p_k$ is a descent direction, and thus $\phi'(0) = \nabla f_k^\intercal p_k < 0$, there is always a positive $\alpha$ that satisfies this condition for a smooth function.

† This condition can be problematic near a local minimum because φ(0) and φ(α) are so similar that their subtraction is inaccurate. Hager and Zhang⁷⁷ introduced a condition with improved accuracy, along with an efficient line search based on a secant method. [77. Hager and Zhang, A new conjugate gradient method with guaranteed descent and an efficient line search, 2005.]
[Fig. 4.21: The sufficient decrease line has a slope that is a small fraction of the slope at the start of the line search.]

The concept is illustrated in Fig. 4.22, which shows a function with a negative slope at α = 0 and a sufficient decrease line whose slope is a fraction of that initial slope. When starting a line search, we know the function value and slope at α = 0, but we do not know how the function varies elsewhere until we evaluate it. Because we do not want to evaluate the function too many times, the first point whose value is below the sufficient decrease line is deemed acceptable. The sufficient decrease line slope in Fig. 4.22 is exaggerated for illustration purposes; for typical values of μ1, the line is indistinguishable from horizontal when plotted.

[Fig. 4.22: A function with a negative initial slope and its sufficient decrease line.]
Inputs:
  α_init > 0: Initial step size
  0 < μ1 < 1: Sufficient decrease factor
  0 < ρ < 1: Backtracking factor

α = α_init
while φ(α) > φ(0) + μ1 α φ′(0) do    {function value is above sufficient decrease line}
  α = ρα    {backtrack}
end while
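A minimal Python sketch of this backtracking algorithm follows (the function names are ours); `phi` and `dphi0` correspond to φ(α) and φ′(0) in the text:

```python
def backtracking(phi, phi0, dphi0, alpha_init, mu1=1e-4, rho=0.7):
    """Backtracking line search (a sketch of Alg. 4.2).

    phi:    function of the step size, phi(alpha) = f(x_k + alpha * p_k)
    phi0:   phi(0)
    dphi0:  initial slope phi'(0) = grad_f(x_k) @ p_k (must be negative)
    """
    alpha = alpha_init
    while phi(alpha) > phi0 + mu1 * alpha * dphi0:  # above sufficient decrease line
        alpha *= rho                                # backtrack
    return alpha
```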
Suppose we do a line search starting from x = [−1.25, 1.25] in the direction p = [4, 0.75], as shown in Fig. 4.23. Applying the backtracking algorithm with μ1 = 10⁻⁴ and ρ = 0.7 produces the iterations shown in Fig. 4.24. The sufficient decrease line appears to be horizontal, but that is because the small negative slope cannot be discerned in a plot for typical values of μ1. Using a large initial step of α_init = 1.2 (Fig. 4.24, left), several iterations are required. For a small initial step of α_init = 0.05 (Fig. 4.24, right), the algorithm satisfies sufficient decrease at the first iteration but misses the opportunity for further reductions.

[Fig. 4.23: A line search direction for an example problem.]
[Fig. 4.24: Backtracking iterations for a large initial step (left) and a small initial step (right).]
The sufficient curvature condition is

$$|\phi'(\alpha)| \le \mu_2 |\phi'(0)|. \quad (4.32)$$

This condition requires that the magnitude of the slope at the new point be lower than the magnitude of the slope at the start of the line search by a factor of μ2, as shown in Fig. 4.25. It is called the sufficient curvature condition because, by comparing the two slopes, we quantify the function's rate of change in the slope, that is, the curvature. Typical values of μ2 range from 0.1 to 0.9; the best value depends on the method for determining the search direction and is also problem dependent.

[Fig. 4.25: The sufficient curvature condition requires the function slope magnitude to be a fraction of the initial slope.]
The bracketing phase is given by Alg. 4.3 and illustrated in Fig. 4.28.
For brevity, we use a notation in the following algorithms where,
for example, 𝜙0 ≡ 𝜙(0) and 𝜙low ≡ 𝜙(𝛼low ). Overall, the bracketing
algorithm increases the step size until it either finds an interval that
must contain a point satisfying the strong Wolfe conditions or a point
that already meets those conditions.
We start the line search with a guess for the step size, which defines
the first interval. For a smooth continuous function, we are guaranteed
to have a minimum within an interval if either of the following hold:
1. The function value at the candidate step is higher than the value
at the start of the line search.
2. The step satisfies sufficient decrease, and the slope is positive.
These two scenarios are illustrated in the top two rows of Fig. 4.28. In
either case, we have an interval within which we can find a point that
satisfies the strong Wolfe conditions using the pinpointing algorithm.
The order of the arguments to the pinpoint function in Alg. 4.3 is significant
because this function assumes that the function value corresponding
to the first 𝛼 is the lower one. The third row in Fig. 4.28 illustrates the
scenario where the point satisfies the strong Wolfe conditions, in which
case the line search is finished.
If the point satisfies sufficient decrease and the slope at that point
is negative, we assume that there are better points farther along the
line, and the algorithm increases the step size. This larger step and the
previous one define a new interval that has moved away from the line
search starting point. We repeat the procedure and check the scenarios
for this new interval. To save function calls, bracketing should return
not just 𝛼 ∗ but also the corresponding function value and gradient to
the outer function.
Inputs:
  α_init > 0: Initial step size
  φ(0), φ′(0): Computed in the outer routine; passed in to save function calls
  0 < μ1 < 1: Sufficient decrease factor
  μ1 < μ2 < 1: Sufficient curvature factor
  σ > 1: Step size increase factor

α1 = 0; α2 = α_init; first = true
while true do
  φ2 = φ(α2)
  if φ2 > φ0 + μ1 α2 φ′0 or (not first and φ2 > φ1) then
    α* = pinpoint(α1, α2, ...)    {1 ⇒ low, 2 ⇒ high}
    return α*
  end if
  φ′2 = φ′(α2)    {if not computed previously}
  if |φ′2| ≤ −μ2 φ′0 then    {step is acceptable; exit line search}
    return α* = α2
  else if φ′2 ≥ 0 then    {bracketed a minimum}
    α* = pinpoint(α2, α1, ...)    {find acceptable step; 2 ⇒ low, 1 ⇒ high}
    return α*
  else    {slope is negative}
    α1 = α2
    α2 = σ α2    {increase step}
  end if
  first = false
end while
If the bracketing phase does not find a point that satisfies the
strong Wolfe conditions, it finds an interval where we are guaranteed
to find such a point in the pinpointing phase described in Alg. 4.4
and illustrated in Fig. 4.29. The intervals generated by this algorithm,
bounded by 𝛼 low and 𝛼high , always have the following properties:
1. The interval has one or more points that satisfy the strong Wolfe
conditions.
2. Among all the points generated so far that satisfy the sufficient
decrease condition, 𝛼low has the lowest function value.
3. The slope at α_low is such that the function decreases from α_low toward α_high.
Inputs:
  α_low, α_high: Interval from the bracketing phase
  φ0, φ′0: Function value and slope at α = 0, computed in the outer routine
  0 < μ1 < 1: Sufficient decrease factor
  μ1 < μ2 < 1: Sufficient curvature factor

k = 0
while true do
  Find α_p in interval (α_low, α_high)    {use interpolation (see Section 4.3.3), which uses φ_low, φ_high, and φ′ from at least one endpoint}
  φ_p = φ(α_p)    {also evaluate φ′_p if derivatives are available}
  if φ_p > φ0 + μ1 α_p φ′0 or φ_p > φ_low then
    α_high = α_p    {also update φ_high = φ_p and, if using cubic interpolation, φ′_high = φ′_p}
  else
    φ′_p = φ′(α_p)    {if not already computed}
    if |φ′_p| ≤ −μ2 φ′0 then
      return α* = α_p
    else if φ′_p (α_high − α_low) ≥ 0 then
      α_high = α_low
    end if
    α_low = α_p
  end if
  k = k + 1
end while
In theory, the line search given in Alg. 4.3 followed by Alg. 4.4 is
guaranteed to find a step length satisfying the strong Wolfe conditions.
In practice, some additional considerations are needed for improved
robustness. One of these criteria is to ensure that the new point
in the pinpoint algorithm is not so close to an endpoint as to cause
the interpolation to be ill-conditioned. A fallback option in case
the interpolation fails could be a simpler algorithm, such as bisection.
Another criterion is to ensure that the loop does not continue indefinitely
in case finite-precision arithmetic leads to indistinguishable function
value changes. A limit on the number of iterations might be necessary.
Let us perform the same line search as in the backtracking example, but using bracketing and pinpointing instead of backtracking. In this example, we use quadratic interpolation.

[Fig. 4.30: Bracketing and pinpointing phases of the line search for a large initial step (left) and a small initial step (right).]

The result is a point that is much better than the one obtained with backtracking.
Quadratic interpolation uses a model of the form

$$\tilde{\phi}(\alpha) = c_0 + c_1 \alpha + c_2 \alpha^2, \quad (4.33)$$

whose coefficients are determined from two function values and one derivative:

$$\begin{aligned} \phi(\alpha_1) &= c_0 + c_1 \alpha_1 + c_2 \alpha_1^2 \\ \phi(\alpha_2) &= c_0 + c_1 \alpha_2 + c_2 \alpha_2^2 \\ \phi'(\alpha_1) &= c_1 + 2 c_2 \alpha_1. \end{aligned} \quad (4.34)$$

We can use these three equations to find the three coefficients based on function and derivative values. Once we have the coefficients for the quadratic, we can find the minimum of the quadratic analytically.
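The analytic minimizer of the quadratic interpolant can be written in several equivalent forms; the following sketch (our derivation from Eqs. 4.33 and 4.34, centered at α1, and not necessarily the book's exact expression) returns the stationary point of the fit:

```python
def quadratic_interp_min(a1, phi1, dphi1, a2, phi2):
    """Minimizer of the quadratic through (a1, phi1) and (a2, phi2)
    with slope dphi1 at a1 (a sketch based on Eqs. 4.33-4.34)."""
    # Centered form: phi(a) ~ phi1 + dphi1*(a - a1) + c2*(a - a1)^2
    da = a2 - a1
    c2 = (phi2 - phi1 - dphi1 * da) / da**2
    return a1 - dphi1 / (2 * c2)  # stationary point; a minimum when c2 > 0
```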
[Fig. 4.31: Quadratic interpolation with two function values and one derivative.]

Cubic interpolation uses the function values and derivatives at both points. With these four pieces of information, we can perform a cubic interpolation,

$$\tilde{\phi}(\alpha) = c_0 + c_1 \alpha + c_2 \alpha^2 + c_3 \alpha^3, \quad (4.36)$$

as shown in Fig. 4.32. To determine the four coefficients, we apply the boundary conditions:

$$\begin{aligned} \tilde{\phi}(\alpha_1) &= c_0 + c_1 \alpha_1 + c_2 \alpha_1^2 + c_3 \alpha_1^3 \\ \tilde{\phi}(\alpha_2) &= c_0 + c_1 \alpha_2 + c_2 \alpha_2^2 + c_3 \alpha_2^3 \\ \tilde{\phi}'(\alpha_1) &= c_1 + 2 c_2 \alpha_1 + 3 c_3 \alpha_1^2 \\ \tilde{\phi}'(\alpha_2) &= c_1 + 2 c_2 \alpha_2 + 3 c_3 \alpha_2^2. \end{aligned}$$

[Fig. 4.32: Cubic interpolation with two function values and two derivatives.]
4.4.1 Steepest Descent

The steepest-descent method uses the search direction

$$p = -\nabla f. \quad (4.40)$$

[Fig. 4.33: The steepest-descent direction points in the opposite direction of the gradient.]

One major issue with steepest descent is that, in general, the entries in the gradient and its overall scale can vary greatly depending on the magnitudes of the objective function and design variables. The gradient itself contains no information about an appropriate step length, and therefore the search direction is often better posed as a normalized direction,

$$p_k = -\frac{\nabla f_k}{\|\nabla f_k\|}. \quad (4.41)$$

Algorithm 4.5 provides the complete steepest-descent procedure.
Inputs:
𝑥 0 : Starting point
𝜏: Convergence tolerance
Outputs:
𝑥 ∗ : Optimal point
𝑓 (𝑥 ∗ ): Minimum function value
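A minimal Python sketch of this procedure follows (ours, not the book's code), reusing the `backtracking` function from the earlier sketch as the line search:

```python
import numpy as np

def steepest_descent(f, grad_f, x0, tau=1e-6, alpha_init=1.0, max_iter=5000):
    """Steepest descent with a line search (a sketch of Alg. 4.5)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.max(np.abs(g)) <= tau:     # convergence check, Eq. 4.23
            break
        p = -g / np.linalg.norm(g)       # normalized direction, Eq. 4.41
        phi = lambda a: f(x + a * p)
        alpha = backtracking(phi, f(x), g @ p, alpha_init)
        x = x + alpha * p                # update design variables
    return x, f(x)
```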
The initial step size for each line search can be estimated from the previous iteration using

$$\alpha_k = \alpha_{k-1} \frac{\nabla f_{k-1}^\intercal p_{k-1}}{\nabla f_k^\intercal p_k}. \quad (4.43)$$
[Fig. 4.34: Steepest-descent optimization paths for quadratic functions with β = 1, 5, and 15.]
We minimize the bean function,

$$f(x_1, x_2) = (1 - x_1)^2 + (1 - x_2)^2 + \frac{1}{2}\left(2x_2 - x_1^2\right)^2,$$

using the steepest-descent algorithm with an exact line search and a convergence tolerance of $\|\nabla f\|_\infty \le 10^{-6}$. The optimization path, which takes 34 iterations, is shown in Fig. 4.36. Although it takes only a few iterations to get close to the minimum, it takes many more to satisfy the specified convergence tolerance.
[Fig. 4.36: Steepest-descent optimization path.]

Tip 4.4 Scale the design variables and the objective function
Problem scaling is one of the most crucial considerations in practical
optimization. Steepest descent is susceptible to scaling, as demonstrated in
Ex. 4.10. Even though we will learn about less sensitive methods, poor scaling
can decrease the effectiveness of any method for general nonlinear functions.
A common cause of poor scaling is unit choice. For example, consider a
problem with two types of design variables, where one type is the material
thickness, on the order of 10−6 m, and the other type is the length of the
structure, on the order of 1 m. If both distances are measured in meters, the
derivative in the thickness direction is much larger than the derivative in the
length direction. In other words, the design space would have a valley that
is steep in one direction and shallow in the other. The optimizer would have
great difficulty in navigating this type of design space.
Similarly, if the objective is power and a typical value is 106 W, the gradients
would likely be relatively small, and satisfying convergence tolerances may be
challenging.
A good rule of thumb is to scale the objective function and every design
variable to be around unity. The scaling of the objective is only needed after
the model analysis computes the function and can be written as
𝑓¯ = 𝑓 /𝑠 𝑓 , (4.45)
where 𝑠 𝑓 is the scaling factor, which could be the value of the objective at the
starting point, 𝑓 (𝑥0 ), or another typical value. Multiplying the functions by a
scalar does not change the optimal solution but can significantly improve the
ability of the optimizer to find the optimum.
Scaling the design variables is more involved because scaling them changes the value that the optimizer would pass to the model and thus changes their meaning. In general, we might use different scaling factors for different types of variables, so we represent these as an n-vector, s_x. Starting with the physical design variables, x_0, we obtain the scaled variables by dividing them by the scaling factors:

$$\bar{x}_0 = x_0 \oslash s_x, \quad (4.46)$$

where ⊘ denotes element-wise division. Then, because the optimizer works with the scaled variables, we need to convert them back to physical variables by multiplying them by the scaling factors:

$$x = \bar{x} \odot s_x, \quad (4.47)$$

where ⊙ denotes element-wise multiplication.

[Fig. 4.37: The optimizer works with the scaled quantities x̄ and f̄; the physical values x and f are recovered before and after each model evaluation.]
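The following hedged Python sketch (the names are ours) wraps an objective function so that an optimizer works entirely in the scaled space of Eqs. 4.45 to 4.47:

```python
import numpy as np

def make_scaled_objective(f, s_x, s_f):
    """Wrap an objective so the optimizer works in scaled space.

    f:   objective in physical variables
    s_x: n-vector of design-variable scale factors
    s_f: scalar objective scale factor (e.g., f at the starting point)
    """
    s_x = np.asarray(s_x, dtype=float)
    def f_scaled(x_bar):
        x = x_bar * s_x        # Eq. 4.47: back to physical variables
        return f(x) / s_f      # Eq. 4.45: scaled objective
    return f_scaled

# Usage sketch: start the optimizer from x0_bar = x0 / s_x.
```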
4.4.2 Conjugate Gradient

Consider a quadratic function of the form

$$f(x) = \frac{1}{2} x^\intercal A x - b^\intercal x, \quad (4.48)$$

where $A$ is a symmetric positive-definite matrix. Setting the gradient to zero yields

$$\nabla f(x^*) = A x^* - b = 0. \quad (4.49)$$

Thus, finding the minimum of a quadratic amounts to solving the linear system $Ax = b$, and the residual vector is the gradient of the quadratic. If $A$ were a positive-definite diagonal matrix, the contours would be elliptical, as shown in Fig. 4.38 (or hyper-ellipsoids in the $n$-dimensional case), and the axes of the ellipses would align with the coordinate directions. In that case, we could converge to the minimum by successively performing an exact line search in each coordinate direction for a total of $n$ line searches.

[Fig. 4.38: For a quadratic function with elliptical contours and the principal axes aligned with the coordinate axes, we can find the minimum in n steps, where n is the number of dimensions, by using a coordinate search.]
In the more general case (but still assuming $A$ to be positive definite), the axes of the ellipses form an orthogonal coordinate system in some rotated orientation. If we express a point as a linear combination of $n$ linearly independent directions, $x = \sum_{i=0}^{n-1} \alpha_i p_i$, substituting into the quadratic yields

$$f(x) = \frac{1}{2}\left(\sum_{i=0}^{n-1} \alpha_i p_i\right)^\intercal A \left(\sum_{j=0}^{n-1} \alpha_j p_j\right) - b^\intercal \sum_{i=0}^{n-1} \alpha_i p_i = \frac{1}{2}\sum_{i=0}^{n-1}\sum_{j=0}^{n-1} \alpha_i \alpha_j\, p_i^\intercal A p_j - \sum_{i=0}^{n-1} \alpha_i\, b^\intercal p_i. \quad (4.51)$$
If the directions are conjugate with respect to $A$, that is, $p_i^\intercal A p_j = 0$ for all $i \ne j$, then the double-sum term in Eq. 4.51 can be simplified to a single sum, and we can write

$$f(x^*) = \sum_{i=0}^{n-1} \left( \frac{1}{2} \alpha_i^2\, p_i^\intercal A p_i - \alpha_i\, b^\intercal p_i \right). \quad (4.53)$$
Because each term in this sum involves only one direction $p_i$, we have reduced the original problem to a series of one-dimensional quadratic functions that can be minimized one at a time. Two possible conjugate directions are shown for the two-dimensional case in Fig. 4.41.
[Fig. 4.41: By minimizing along a sequence of conjugate directions in turn, we can find the minimum of a quadratic in n steps, where n is the number of dimensions.]

Each one-dimensional problem corresponds to minimizing the quadratic with respect to the step length $\alpha_i$. Differentiating each term and setting it to zero yields

$$\alpha_i\, p_i^\intercal A p_i - b^\intercal p_i = 0 \quad \Rightarrow \quad \alpha_i = \frac{b^\intercal p_i}{p_i^\intercal A p_i}, \quad (4.54)$$

which corresponds to the result of an exact line search in direction $p_i$.
There are many possible sets of vectors that are conjugate with respect to $A$, including the eigenvectors. The conjugate gradient method finds these directions starting with the steepest-descent direction,

$$p_0 = -\nabla f(x_0), \quad (4.55)$$

and then finds each subsequent direction using the update

$$p_k = -\nabla f_k + \beta_k p_{k-1}. \quad (4.56)$$

[Fig. 4.42: The conjugate gradient search direction update combines the steepest-descent direction with the previous conjugate gradient direction.]
One option is the Fletcher–Reeves formula,

$$\beta_k = \frac{\nabla f_k^\intercal \nabla f_k}{\nabla f_{k-1}^\intercal \nabla f_{k-1}}. \quad (4.57)$$

An alternative that often performs better for general nonlinear functions is the Polak–Ribière formula,

$$\beta_k = \frac{\nabla f_k^\intercal \left(\nabla f_k - \nabla f_{k-1}\right)}{\nabla f_{k-1}^\intercal \nabla f_{k-1}}. \quad (4.59)$$

Because this formula can yield a negative value, it is common to apply the reset

$$\beta \leftarrow \max(0, \beta). \quad (4.60)$$
Inputs:
  x_0: Starting point
  τ: Convergence tolerance
Outputs:
  x*: Optimal point
  f(x*): Minimum function value

k = 0
while ‖∇f_k‖_∞ > τ do
  if k = 0 then
    p_k = −∇f_k / ‖∇f_k‖    {first direction is steepest descent}
  else
    β_k = ∇f_k^⊺ ∇f_k / (∇f_{k−1}^⊺ ∇f_{k−1})
    p_k = −∇f_k / ‖∇f_k‖ + β_k p_{k−1}    {conjugate gradient direction update}
  end if
  α_k = linesearch(p_k, α_init)    {perform a line search}
  x_{k+1} = x_k + α_k p_k    {update design variables}
  k = k + 1    {increment iteration index}
end while
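A minimal Python sketch of the nonlinear conjugate gradient method follows (ours, not the book's implementation); it uses the β of Eq. 4.57 and SciPy's strong-Wolfe line search, with a crude steepest-descent fallback when the line search fails:

```python
import numpy as np
from scipy.optimize import line_search

def conjugate_gradient(f, grad_f, x0, tau=1e-6, max_iter=1000):
    """Nonlinear conjugate gradient (a sketch of Alg. 4.6)."""
    x = np.asarray(x0, dtype=float)
    g = grad_f(x)
    p = -g / np.linalg.norm(g)
    for _ in range(max_iter):
        if np.max(np.abs(g)) <= tau:
            break
        alpha = line_search(f, grad_f, x, p)[0]  # strong Wolfe line search
        if alpha is None:                        # fallback: reset to steepest descent
            p = -g / np.linalg.norm(g)
            alpha = line_search(f, grad_f, x, p)[0] or 1e-4
        x = x + alpha * p
        g_new = grad_f(x)
        beta = (g_new @ g_new) / (g @ g)         # Eq. 4.57
        p = -g_new / np.linalg.norm(g_new) + beta * p
        g = g_new
    return x, f(x)
```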
Minimizing the same bean function from Ex. 4.11 with the same line search algorithm and settings, we get the optimization path shown in Fig. 4.43 (22 iterations). The changes in direction for the conjugate gradient method are smaller than for steepest descent, and it takes fewer iterations to achieve the same convergence tolerance.

[Fig. 4.43: Conjugate gradient optimization path.]

4.4.3 Newton's Method
Newton's method for minimization starts from a second-order Taylor expansion about the current point,

$$f(x_k + s) \approx f(x_k) + s f'(x_k) + \frac{1}{2} s^2 f''(x_k). \quad (4.61)$$

Compared with the first-order methods, we now include a second-order term to get a quadratic that we can minimize. We minimize this quadratic approximation by differentiating with respect to the step $s$ and setting the derivative to zero, which yields

$$f'(x_k) + s f''(x_k) = 0 \quad \Rightarrow \quad s = -\frac{f'(x_k)}{f''(x_k)}. \quad (4.62)$$
The Newton update is then

$$x_{k+1} = x_k - \frac{f'_k}{f''_k}. \quad (4.63)$$
We could also derive this equation by taking Newton’s method for root
finding (Eq. 3.24) and replacing 𝑟(𝑢) with 𝑓 0(𝑥).
Consider, for example, the function

$$f(x) = (x - 2)^4 + 2x^2 - 4x + 4.$$
For $n$ dimensions, we use the second-order Taylor expansion

$$f(x_k + s) \approx f_k + \nabla f_k^\intercal s + \frac{1}{2}\, s^\intercal H_k s, \quad (4.64)$$

where $s$ is a vector centered at $x_k$. Similar to the one-dimensional case, we can find the step $s_k$ that minimizes this quadratic model by taking the derivative with respect to $s$ and setting it equal to zero:

$$\frac{\mathrm{d} f(x_k + s)}{\mathrm{d} s} = \nabla f_k + H_k s = 0. \quad (4.65)$$

Thus, each Newton step is the solution of a linear system where the matrix is the Hessian, the residual is the gradient, and the design variables replace the states. We can use any of the linear solvers mentioned in Section 3.6 and Appendix B to solve this system.

[Fig. 4.45: Newton's method converges in one iteration on a quadratic function.]
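A minimal Python sketch of the pure Newton iteration follows (ours); each step solves the linear system of Eq. 4.65 with a dense solver and omits the safeguards discussed next:

```python
import numpy as np

def newton_minimize(grad_f, hess_f, x0, tau=1e-6, max_iter=100):
    """Pure Newton's method (a sketch; no line search safeguard)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.max(np.abs(g)) <= tau:
            break
        s = np.linalg.solve(hess_f(x), -g)  # Newton step, Eq. 4.65
        x = x + s
    return x
```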
When minimizing the quadratic function from Ex. 4.10, Newton's method converges in one step for any value of β, as shown in Fig. 4.45. Thus, Newton's method is scale invariant.
Because the function is quadratic, the quadratic “approximation”
from the Taylor series is exact, so we can find the minimum in one
step. It will take more iterations for a general nonlinear function, but
using curvature information generally yields a better search direction
than first-order methods. In addition, Newton’s method provides a
step length embedded in 𝑠 𝑘 because the quadratic model estimates
the stationary point location. Furthermore, Newton’s method exhibits
quadratic convergence.
Although Newton’s method is powerful, it suffers from a few issues
in practice. One issue is that the Newton step does not necessarily
result in a function decrease. This issue can occur if the Hessian is not
positive definite or if the quadratic predictions overshoot because the
actual function has more curvature than predicted by the quadratic
approximation. Both of these possibilities are illustrated in Fig. 4.46.
[Fig. 4.46: Newton's method in its pure form is vulnerable to negative curvature (in which case it might step away from the minimum, left) and overshooting (which might result in a function increase, right).]
If the Hessian is not positive definite, the step might not even be in
a descent direction. Replacing the real Hessian with a positive-definite
Hessian can mitigate this issue. The quasi-Newton methods in the next
section force a positive-definite Hessian by construction.
To fix the overshooting issue, we can use a line search instead of accepting the full Newton step directly.
Minimizing the same bean function from Exs. 4.11 and 4.12, we get the
optimization path shown in Fig. 4.47. Newton’s method takes fewer iterations
than steepest descent (Ex. 4.11) or conjugate gradient (Ex. 4.12) to achieve the
same convergence tolerance. The first quadratic approximation is a saddle
function that steps to the saddle point, away from the minimum of the function.
However, in subsequent iterations, the quadratic approximation becomes
convex, and the steps take us along the valley of the bean function toward the
minimum.
[Fig. 4.47: Newton optimization path on the bean function, with the quadratic approximations and one-dimensional slices at selected iterations.]
4.4.4 Quasi-Newton Methods

Quasi-Newton methods replace the exact Hessian with an approximation $\tilde{H}_k$, yielding the quadratic model

$$\tilde{f}(x_k + p) = f_k + \nabla f_k^\intercal p + \frac{1}{2}\, p^\intercal \tilde{H}_k p, \quad (4.70)$$

whose minimization yields the linear system

$$\tilde{H}_k p_k = -\nabla f_k. \quad (4.71)$$
We solve this linear system for 𝑝 𝑘 , but instead of accepting it as the final
step, we perform a line search in the 𝑝 𝑘 direction. Only after finding a
step size 𝛼 𝑘 that satisfies the strong Wolfe conditions do we update the
point using
𝑥 𝑘+1 = 𝑥 𝑘 + 𝛼 𝑘 𝑝 𝑘 . (4.72)
Quasi-Newton methods update the approximate Hessian at every iteration based on the latest information using an update of the form

$$\tilde{H}_{k+1} = \tilde{H}_k + \Delta \tilde{H}_k, \quad (4.73)$$

where the update Δ$\tilde{H}_k$ is a function of the last two gradients. The first Hessian approximation is usually set to the identity matrix (or a scaled version of it), which yields a steepest-descent direction for the first line search (set $\tilde{H} = I$ in Eq. 4.71 to verify this).
We now develop the requirements for the approximate Hessian
update. Suppose we just obtained the new point 𝑥 𝑘+1 after a line search
starting from 𝑥 𝑘 in the direction 𝑝 𝑘 . We can write the new quadratic
based on an updated Hessian as follows:
$$\tilde{f}(x_{k+1} + p) = f_{k+1} + \nabla f_{k+1}^\intercal p + \frac{1}{2}\, p^\intercal \tilde{H}_{k+1} p. \quad (4.74)$$
We can assume that the new point's function value and gradient are given, but we do not have the new approximate Hessian yet. Taking the gradient of this quadratic with respect to $p$ and requiring it to be consistent with the gradients at the last two points leads to a condition involving the step,

$$s_k = x_{k+1} - x_k = \alpha_k p_k, \quad (4.78)$$

and the difference in the gradients,

$$y_k = \nabla f_{k+1} - \nabla f_k. \quad (4.79)$$

The condition is
$$\tilde{H}_{k+1} s_k = y_k. \quad (4.80)$$
This is called the secant equation and is a fundamental requirement for quasi-Newton methods.‡ The result is intuitive when we recall the meaning of the product of the Hessian with a vector (Eq. 4.12): it is the rate of change of the gradient in the direction defined by that vector. Thus, it makes sense that the rate of change of the curvature predicted by the approximate Hessian should match the difference between the gradients.

[Fig. 4.49: Quasi-Newton methods use the gradient at the endpoint of each step to estimate the curvature in the step direction and update an approximation of the Hessian.]

‡ The secant equation is also known as the quasi-Newton condition.

We need $\tilde{H}$ to be positive definite. Using the secant equation (Eq. 4.80) and the definition of positive definiteness ($s^\intercal H s > 0$), we see that this requirement implies that the predicted curvature is positive along the step; that is,

$$s_k^\intercal y_k > 0. \quad (4.81)$$
This is called the curvature condition, and it is automatically satisfied if
the line search finds a step that satisfies the strong Wolfe conditions.
The secant equation (Eq. 4.80) is a linear system of $n$ equations where the step and the gradients are known. However, there are many more unknowns in the symmetric matrix $\tilde{H}_{k+1}$ than equations, so the update is not unique. The BFGS update restricts the change to a rank-2 correction of the form $\tilde{H}_{k+1} = \tilde{H}_k + \alpha u u^\intercal + \beta v v^\intercal$ (Eq. 4.82), with the choices $u = y_k$ and $v = \tilde{H}_k s_k$. Substituting this form into the secant equation groups the terms into multiples of $y_k$ and $\tilde{H}_k s_k$. Because the vectors $y_k$ and $\tilde{H}_k s_k$ are not parallel in general (because the secant equation applies to $\tilde{H}_{k+1}$, not to $\tilde{H}_k$), the only way to guarantee this equality is to set the terms in parentheses to zero. Thus, the scalar coefficients are

$$\alpha = \frac{1}{y_k^\intercal s_k}, \qquad \beta = -\frac{1}{s_k^\intercal \tilde{H}_k s_k}. \quad (4.86)$$
Substituting these coefficients and the chosen vectors back into Eq. 4.82, we get the BFGS update,

$$\tilde{H}_{k+1} = \tilde{H}_k + \frac{y_k y_k^\intercal}{y_k^\intercal s_k} - \frac{\tilde{H}_k s_k s_k^\intercal \tilde{H}_k}{s_k^\intercal \tilde{H}_k s_k}. \quad (4.87)$$
Although we did not explicitly enforce positive definiteness, the rank 2
update is positive definite, and therefore, all the Hessian approxima-
tions are positive definite, as long as we start with a positive-definite
approximation.
Now recall that we want to solve the linear system that involves this matrix (Eq. 4.71), so it would be more efficient to approximate the inverse of the Hessian directly instead. The inverse can be found analytically from the update (Eq. 4.87) using the Sherman–Morrison–Woodbury formula.§ Defining $\tilde{V}$ as the approximation of the inverse of the Hessian, the update is

§ This formula is also known as the Woodbury matrix identity. Given a matrix and an update to that matrix, it yields an explicit expression for the inverse of the updated matrix in terms of the inverses of the matrix and the update (see Appendix C.3).
$$\tilde{V}_{k+1} = \left(I - \sigma_k s_k y_k^\intercal\right) \tilde{V}_k \left(I - \sigma_k y_k s_k^\intercal\right) + \sigma_k s_k s_k^\intercal, \quad (4.88)$$

where

$$\sigma_k = \frac{1}{y_k^\intercal s_k}. \quad (4.89)$$
Figure 4.51 shows the sizes of the vectors and matrices involved in this equation.

[Fig. 4.51: Sizes of each term of the BFGS update (Eq. 4.88).]

Now we can replace the potentially costly solution of the linear system (Eq. 4.71) with the much cheaper matrix-vector product,
$$p_k = -\tilde{V}_k \nabla f_k. \quad (4.90)$$
A common choice for the initial inverse Hessian approximation is the scaled identity,

$$\tilde{V}_0 = \frac{1}{\|\nabla f_0\|} I, \quad (4.91)$$

which yields a normalized steepest-descent direction for the first step:

$$p_0 = -\tilde{V}_0 \nabla f_0 = -\frac{\nabla f_0}{\|\nabla f_0\|}. \quad (4.92)$$
Inputs:
𝑥 0 : Starting point
𝜏: Convergence tolerance
Outputs:
𝑥 ∗ : Optimal point
𝑓 (𝑥 ∗ ): Minimum function value
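A minimal Python sketch of BFGS with the inverse update follows (ours, not the book's implementation); it applies Eqs. 4.88 to 4.91 directly, which is clear but not the most efficient arrangement:

```python
import numpy as np
from scipy.optimize import line_search

def bfgs(f, grad_f, x0, tau=1e-6, max_iter=200):
    """BFGS with the inverse-Hessian update of Eq. 4.88 (a sketch of Alg. 4.7)."""
    x = np.asarray(x0, dtype=float)
    n = x.size
    g = grad_f(x)
    V = np.eye(n) / np.linalg.norm(g)            # Eq. 4.91
    for _ in range(max_iter):
        if np.max(np.abs(g)) <= tau:
            break
        p = -V @ g                                # Eq. 4.90
        alpha = line_search(f, grad_f, x, p)[0] or 1.0  # Wolfe ensures s@y > 0
        s = alpha * p                             # Eq. 4.78
        x_new = x + s
        g_new = grad_f(x_new)
        y = g_new - g                             # Eq. 4.79
        sigma = 1.0 / (y @ s)                     # Eq. 4.89
        I = np.eye(n)
        V = ((I - sigma * np.outer(s, y)) @ V @ (I - sigma * np.outer(y, s))
             + sigma * np.outer(s, s))            # Eq. 4.88
        x, g = x_new, g_new
    return x, f(x)
```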
Minimizing the same bean function from previous examples using BFGS, we get the optimization path shown in Fig. 4.52 (7 iterations). We also show the corresponding quadratic approximations for a few selected steps of this minimization in Fig. 4.53. Because we generate approximations to the inverse, we invert those approximations to get the Hessian approximation for the purpose of illustration. We initialize the inverse Hessian to the identity matrix, which results in a quadratic with circular contours and a steepest-descent step (Fig. 4.53, left). After two iterations, the inverse Hessian approximation is

$$\tilde{V}_2 = \begin{bmatrix} 0.435747 & -0.202020 \\ -0.202020 & 0.222556 \end{bmatrix}.$$

[Fig. 4.52: BFGS optimization path.]
[Fig. 4.53: BFGS quadratic approximations and one-dimensional slices at selected iterations.]

At this stage, the minimum of the approximation is still higher than the actual one; however, the line search moves past the approximation minimum toward the true minimum.
By the end of the optimization, at $x^* = (1.213412, 0.824123)$, the BFGS estimate is

$$\tilde{V}^* = \begin{bmatrix} 0.276946 & 0.224010 \\ 0.224010 & 0.347847 \end{bmatrix},$$

whereas the exact one is

$$H^{-1}(x^*) = \begin{bmatrix} 0.276901 & 0.223996 \\ 0.223996 & 0.347867 \end{bmatrix}.$$

Now the estimate is much more accurate. In the right plot of Fig. 4.53, we can see that the minimum of the approximation coincides with the actual minimum. The approximation is only accurate locally, worsening away from the minimum.
where

$$\sigma = \frac{1}{s^\intercal y}. \quad (4.94)$$
If we save the sequence of 𝑠 and 𝑦 vectors and specify a starting value
for 𝑉˜ 0 , we can compute any subsequent 𝑉˜ 𝑘 . Of course, what we want
is 𝑉˜ 𝑘 ∇ 𝑓 𝑘 , which we can also compute using an algorithm with the
recurrence relationship. However, such an algorithm would not be
advantageous from the memory-usage perspective because we would
have to store a long sequence of vectors and a starting matrix.
Inputs:
  ∇f_k: Gradient at point x_k
  s_{k−1,...,k−m}: History of steps x_k − x_{k−1}
  y_{k−1,...,k−m}: History of gradient differences ∇f_k − ∇f_{k−1}
Outputs:
  p: Search direction −Ṽ_k ∇f_k

d = ∇f_k
for i = k − 1 to k − m by −1 do
  α_i = σ_i s_i^⊺ d
  d = d − α_i y_i
end for
Ṽ_0 = (s_{k−1}^⊺ y_{k−1} / y_{k−1}^⊺ y_{k−1}) I    {initialize the inverse Hessian approximation as a scaled identity matrix}
d = Ṽ_0 d
for i = k − m to k − 1 do
  β_i = σ_i y_i^⊺ d
  d = d + (α_i − β_i) s_i
end for
p = −d
Using this technique, we no longer need to bear the memory cost of storing a large matrix or incur the computational cost of a large matrix-vector product. Instead, we store a small number of vectors and require fewer vector-vector products (a cost that scales linearly with n rather than quadratically).

[Fig. 4.54: Optimization paths using BFGS and L-BFGS (both take 7 iterations).]

Example 4.16 L-BFGS compared with BFGS for the bean function
Minimizing the same bean function from the previous examples, the
optimization iterations using BFGS and L-BFGS are the same, as shown in
Fig. 4.54. The L-BFGS method is applied to the same sequence using the
last five iterations. The number of variables is too small to benefit from the
limited-memory approach, but we show it in this small problem as an example.
At the same $x^*$ as in Ex. 4.15, the product $\tilde{V}\nabla f$ is estimated using Alg. 4.8 as

$$d^* = \begin{bmatrix} -7.38683 \times 10^{-5} \\ 5.75370 \times 10^{-5} \end{bmatrix},$$

which is consistent with the full BFGS result.
Example 4.17 Minimizing the total potential energy for a spring system

Consider the two-spring system shown in Fig. 4.55. Minimizing the total potential energy yields the equilibrium position of the mass, which leads to the following unconstrained problem:

$$\underset{x_1, x_2}{\text{minimize}} \quad \frac{1}{2} k_1 \left( \sqrt{(\ell_1 + x_1)^2 + x_2^2} - \ell_1 \right)^2 + \frac{1}{2} k_2 \left( \sqrt{(\ell_2 - x_1)^2 + x_2^2} - \ell_2 \right)^2 - m g x_2.$$

[Fig. 4.55: Two-spring system with no applied force (top) and with applied force (bottom).]
The contours of this function are shown in Fig. 4.56 for the case where ℓ1 = 12, ℓ2 = 8, k1 = 1, k2 = 10, and mg = 7. There is a minimum and a maximum.
The minimum represents the position of the mass at the stable equilibrium
condition. The maximum also represents an equilibrium point, but it is unstable.
All methods converge to the minimum, even when starting near the maximum. All four methods use the same parameters, convergence tolerance, and starting point.
point. Depending on the starting point, Newton’s method can become stuck at
the saddle point, and if a line search is not added to safeguard it, it could have
terminated at the maximum instead.
As expected, steepest descent is the least efficient, and the second-order
methods are the most efficient. The number of iterations and the relative
performance are problem dependent and sensitive to the optimization algorithm
parameters, so we should not analyze the number of iterations too closely.
However, these results show the expected trends for most problems.
[Fig. 4.56: Minimizing the total potential energy for the two-spring system using steepest descent (32 iterations), conjugate gradient (27 iterations), quasi-Newton (14 iterations), and Newton (12 iterations).]
[Fig. 4.57: Optimization paths for the Rosenbrock function using steepest descent (10,662 iterations), conjugate gradient (930 iterations), BFGS (36 iterations), and Newton (24 iterations).]
[Fig. 4.58: Convergence of the four methods shows the dramatic difference between the linear convergence of steepest descent, the superlinear convergence of the conjugate gradient method, and the quadratic convergence of the methods that use second-order information.]
The first-order methods take many iterations to satisfy the convergence criterion. The methods that use second-order information are even more efficient, exhibiting quadratic convergence in the last few iterations.
[Fig. 4.59: The contours of the scaled Rosenbrock function (Eq. 4.96) are highly stretched in the x1 direction, by orders of magnitude more than what we can show here.]

Let us attempt to minimize this function starting from x0 = [−5000, −3]. The gradient at this starting point is ∇f(x0) = [−0.0653, −650.0], so the slope in the x2 direction is four orders of magnitude larger than the slope in the x1 direction! Therefore, there is a significant bias toward moving along the
𝑥2 direction but little incentive to move in the 𝑥1 direction. After an exact line
search in the steepest descent direction, we obtain the step to 𝑥 𝐴 = [−5000, 0.25]
as shown in Fig. 4.59. The optimization stops at this point, even though it is
not a minimum. This premature convergence is because 𝜕 𝑓 /𝜕𝑥1 is orders of
magnitude smaller, so both components of the gradient satisfy the optimality
conditions when using a standard relative tolerance.
To address this issue, we scale the design variables as explained in Tip 4.4. Using the scaling s_x = [10⁴, 1], the scaled starting point becomes x̄0 = [−5000, −3] ⊘ [10⁴, 1] = [−0.5, −3]. Before evaluating the function, we need to convert the design variables back to their unscaled values, that is, f(x) = f(x̄ ⊙ s_x).
This scaling of the design variables alone is sufficient to improve the
optimization convergence. Still, let us also scale the objective because it is
large at our starting point (around 900). Dividing the objective by 𝑠 𝑓 = 1000,
the initial gradient becomes ∇ 𝑓 (𝑥0 ) = [−0.00206, −0.6]. This is still not ideally
scaled, but it has much less variation in orders of magnitude—more than
sufficient to solve the problem successfully. The optimizer returns 𝑥¯ ∗ = [1, 1],
where 𝑓¯∗ = 1.57 × 10−12 . When rescaled back to the problem coordinates,
𝑥 ∗ = [104 , 1], 𝑓 ∗ = 1.57 × 10−9 .
In this example, the function derivatives span many orders of magnitude,
so dividing the function by a scalar does not have much effect. Instead, we
could minimize log( 𝑓 ), which allows us to solve the problem even without
scaling 𝑥. If we also scale 𝑥, the number of required iterations for convergence
decreases. Using log( 𝑓 ) as the objective and scaling the design variables as
before yields 𝑥¯ ∗ = [1, 1], where 𝑓¯∗ = −25.28, which in the original problem
space corresponds to 𝑥 ∗ = [104 , 1], where 𝑓 ∗ = 1.05 × 10−11 .
Although this example does not correspond to a physical problem, such
differences in scaling occur frequently in engineering analysis. For example,
optimizing the operating point of a propeller might involve two variables: the
pitch angle and the rotation rate. The angle would typically be specified in
radians (a quantity of order 1) and the rotation rate in rotations per minute
(typically tens of thousands).
4.5 Trust-Region Methods

[Fig. 4.60: Trust-region methods minimize a model within a trust region at each iteration, and then they update the trust-region size and the model before the next iteration. The circles show the trust regions for each iteration.]

The trust-region subproblem solved at each iteration is

$$\begin{aligned} \underset{s}{\text{minimize}} \quad & \tilde{f}(s) \\ \text{subject to} \quad & \|s\| \le \Delta, \end{aligned} \quad (4.97)$$
where 𝑓˜(𝑠) is the local trust-region model, 𝑠 is the step from the current
iteration point, and Δ is the size of the trust region. We use 𝑠 instead
of 𝑝 to indicate that this is a step vector and not simply the direction
vector used in methods based on a line search.
The subproblem (Eq. 4.97) defines the trust region in terms of a norm. The Euclidean norm, ‖s‖2, defines a spherical trust region and is the most common type. Sometimes ∞-norms are used instead because they are easy to apply, but 1-norms are rarely used because they are just as complex as 2-norms but introduce sharp corners that can be problematic.⁸⁴ The shape of the trust region is dictated by the norm (see Fig. A.8) and can significantly affect the convergence rate. The ideal trust-region shape depends on the local function space, and some algorithms allow for the trust-region shape to change throughout the optimization.

[84. Conn et al., Trust Region Methods, 2000.]
Using a quadratic model of the function, the subproblem becomes

$$\begin{aligned} \underset{s}{\text{minimize}} \quad & \tilde{f}(s) = f_k + \nabla f_k^\intercal s + \frac{1}{2}\, s^\intercal \tilde{H}_k s \\ \text{subject to} \quad & \|s\|_2 \le \Delta_k. \end{aligned} \quad (4.98)$$

To decide whether to accept a step and how to resize the trust region, we compare the actual function decrease with the decrease predicted by the model using the ratio

$$r = \frac{f(x) - f(x + s)}{\tilde{f}(0) - \tilde{f}(s)}. \quad (4.99)$$
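The acceptance and resizing logic driven by this ratio can be sketched as follows; the thresholds and factors here are typical values, not necessarily those of the book's Alg. 4.9:

```python
def trust_region_update(r, delta, step_norm, delta_max,
                        eta_low=0.25, eta_high=0.75):
    """Update the trust-region radius from the ratio r of Eq. 4.99.

    The thresholds and resizing factors are common choices from the
    trust-region literature (an illustrative sketch).
    """
    if r < eta_low:                              # poor model: shrink the region
        delta = 0.25 * delta
    elif r > eta_high and abs(step_norm - delta) < 1e-12:
        delta = min(2.0 * delta, delta_max)      # good model at the boundary: grow
    accept = r > 0.0                             # accept any step that reduces f
    return delta, accept
```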
Inputs:
𝑥 0 : Starting point
Δ0 : Initial size of the trust region
Outputs:
𝑥 ∗ : Optimal point
Otherwise, it might take too long to reduce the trust region to an acceptable size over portions of the design space where a smaller trust region is needed. The same convergence criteria used in other gradient-based methods are applicable.∗

∗ Conn et al.⁸⁴ provide more detail on trust-region problems, including trust-region norms and scaling, approaches to solving the trust-region subproblem, extensions to the model, and other important practical considerations.

Example 4.20 Trust-region method applied to the total potential energy of the spring system
Minimizing the total potential energy function from Ex. 4.17 using a trust-
region method starting from the same points as before yields the optimization
path shown in Fig. 4.63. The initial trust region size is Δ = 0.3, and the
maximum allowable is Δmax = 1.5.
[Fig. 4.63: Minimizing the total potential energy for the two-spring system using a trust-region method, shown at different iterations. The local quadratic approximation is overlaid on the function contours, and the trust region is shown as a red circle.]

The first few quadratic approximations do not have a minimum because the function has negative curvature around the starting point, but the trust region prevents steps that are too large. When it gets close enough to the bowl containing the minimum, the quadratic approximation has a minimum, and the trust-region subproblem yields a minimum within the trust region. In the last few iterations, the quadratic is a good model, and therefore the region remains large.
[Fig. 4.64: Trust-region iterations for a second example, shown at k = 12, 17, and 35.]
$$f(x) = \begin{cases} |x| & \text{if } |x| > \Delta x \\[4pt] \dfrac{x^2}{2\Delta x} + \dfrac{\Delta x}{2} & \text{otherwise}, \end{cases} \quad (4.100)$$

where Δx is a user-adjustable parameter representing the half-width of the transition.

[Fig. 4.65: Smoothed absolute value function.]
Piecewise functions are often used in fits to empirical data. Cubic splines or a sigmoid function can blend the transition between two functions smoothly. We can also use the same technique to blend discrete steps (where the two functions are constant values) or implement smooth max or min functions.† For example, a sigmoid can be used to blend two functions, f1(x) and f2(x), together at a transition point x_t using

$$f(x) = f_1(x) + \frac{1}{1 + e^{-h(x - x_t)}}\left(f_2(x) - f_1(x)\right), \quad (4.101)$$

where h is a user-selected parameter that controls how sharply the transition occurs. The left side of Fig. 4.66 shows an example transitioning x and x² with x_t = 0 and h = 50.

† Another option to smooth the max of multiple functions is aggregation, which is detailed in Section 5.7.
[Fig. 4.66: Blending two functions, f1(x) and f2(x), with a sigmoid (left) and with a cubic spline (right).]
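A short Python sketch of the sigmoid blend follows (ours), reproducing the example with x_t = 0 and h = 50:

```python
import numpy as np

def sigmoid_blend(f1, f2, x, xt=0.0, h=50.0):
    """Blend f1 into f2 around xt with sharpness h (Eq. 4.101)."""
    s = 1.0 / (1.0 + np.exp(-h * (x - xt)))
    return f1(x) + s * (f2(x) - f1(x))

x = np.linspace(-0.5, 0.5, 101)
y = sigmoid_blend(lambda x: x, lambda x: x**2, x)  # blend x and x^2 at x = 0
```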
Another approach is to use a cubic spline for the blending. Given a transition point x_t and a half-width Δx, we can compute a cubic spline transition as

$$f(x) = \begin{cases} f_1(x) & \text{if } x < x_1 \\ f_2(x) & \text{if } x > x_2 \\ c_1 x^3 + c_2 x^2 + c_3 x + c_4 & \text{otherwise}, \end{cases} \quad (4.102)$$

where x_1 = x_t − Δx and x_2 = x_t + Δx, and the coefficients c_i are determined by matching the values and derivatives of f_1 at x_1 and of f_2 at x_2.
Tip 4.8 Gradient-based optimization can find the global optimum

Gradient-based methods are local search methods. If the design space is fundamentally multimodal, it may be helpful to augment the gradient-based search with a global search. The simplest and most common approach is to use a multistart approach, where we run a gradient-based search multiple times, starting from different points, as shown in Fig. 4.67. The starting points might be chosen from engineering intuition, randomly generated points, or sampling methods. Running from multiple points increases the likelihood of finding the global optimum, or at least a better optimum than would be found with a single starting point. One advantage of this approach is that it can easily be run in parallel.

[Fig. 4.67: Multiple gradient-based searches started from different points.]
Another approach is to start with a global search strategy (see Chapter 7).
After a suitable initial exploration, the design(s) given by the global search
become starting points for gradient-based optimization(s). This finds points
that satisfy the optimality conditions, which is typically challenging with a
pure gradient-free approach. It also improves the convergence rate and finds
optima more precisely.
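A minimal multistart sketch follows (ours); it samples starting points within user-provided bounds (used only for sampling, since the local searches themselves are unconstrained) and runs SciPy's BFGS from each:

```python
import numpy as np
from scipy.optimize import minimize

def multistart(f, bounds, n_starts=20, seed=0):
    """Multistart gradient-based search (an illustrative sketch).

    bounds: list of (low, high) pairs used only to sample starting points.
    """
    rng = np.random.default_rng(seed)
    lo, hi = np.array(bounds).T
    best = None
    for _ in range(n_starts):
        x0 = rng.uniform(lo, hi)
        res = minimize(f, x0, method="BFGS")   # local gradient-based search
        if best is None or res.fun < best.fun:
            best = res
    return best
```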
4.6 Summary
Problems
4.4 Review Kepler’s wine barrel story from Section 2.2. Approximate
the barrel as a cylinder and find the height and diameter of a
barrel that maximizes its volume for a diagonal measurement of
1 m.
4.5 Consider the function

$$f = x_1^4 + 3x_1^3 + 3x_2^2 - 6x_1 x_2 - 2x_2.$$

Find the critical points analytically, if possible, and classify them.

4.6 Consider a slightly modified version of the function from Prob. 4.5, where we add an $x_2^4$ term to get

$$f = x_1^4 + 3x_1^3 + 3x_2^2 - 6x_1 x_2 - 2x_2 + x_2^4.$$

Can you find the critical points analytically? Plot the function contours. Locate the critical points graphically and classify them.
4.7 Implement the two line search algorithms from Section 4.3, such
that they work in 𝑛 dimensions (𝑥 and 𝑝 can be vectors of any
size).
a. As a first test for your code, reproduce the results from the
examples in Section 4.3 and plot the function and iterations
for both algorithms. For the line search that satisfies the
strong Wolfe conditions, reduce the value of 𝜇2 until you get
an exact line search. How much accuracy can you achieve?
a. For your first test problem, reproduce the results from the
examples in Section 4.4.
b. Minimize the two-dimensional Rosenbrock function (see Appendix D.1.2) using the various algorithms, starting from x = (−1, 2), and compare your results. Compare the total number of evaluations and the number of minor iterations.
4.12 The brachistochrone problem seeks to find the path that minimizes travel time between two points for a particle under the force of gravity.∗ Solve the discretized version of this problem using an optimizer of your choice (see Appendix D.1.7 for a detailed description).

∗ This problem was mentioned in Section 2.2 as one of the problems that inspired developments in calculus of variations.
Constrained Gradient-Based Optimization
5
The general constrained optimization problem is stated as

$$\begin{aligned} \underset{x}{\text{minimize}} \quad & f(x) \\ \text{by varying} \quad & x_i, \quad i = 1, \ldots, n_x \\ \text{subject to} \quad & g_j(x) \le 0, \quad j = 1, \ldots, n_g \\ & h_l(x) = 0, \quad l = 1, \ldots, n_h \\ & \underline{x}_i \le x_i \le \overline{x}_i, \quad i = 1, \ldots, n_x. \end{aligned} \quad (5.1)$$
[Fig. 5.1: Graphical solution for a constrained problem showing contours of the objective, the two constraint curves, and the shaded infeasible region.]

This example problem can be solved graphically, by inspection. We can visualize the contours for this problem because the functions can be evaluated quickly and because it has only two dimensions. If the functions were more expensive, we would not be able to afford the many evaluations needed to plot the contours. If the problem had more dimensions, it would become difficult or impossible to visualize the functions and feasible space fully.
The gradients of the equality constraints can be gathered in the constraint Jacobian,

$$J_h = \frac{\partial h}{\partial x} = \begin{bmatrix} \dfrac{\partial h_1}{\partial x_1} & \cdots & \dfrac{\partial h_1}{\partial x_{n_x}} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial h_{n_h}}{\partial x_1} & \cdots & \dfrac{\partial h_{n_h}}{\partial x_{n_x}} \end{bmatrix} = \begin{bmatrix} \nabla h_1^\intercal \\ \vdots \\ \nabla h_{n_h}^\intercal \end{bmatrix}, \quad (5.2)$$

which is an $(n_h \times n_x)$ matrix whose rows are the constraint gradients.
There are several essential linear algebra concepts for constrained optimization.⁸⁶,⁸⁷ The span of a set of vectors is the space formed by all the points that can be obtained by a linear combination of those vectors. With one vector, this space is a line; with two linearly independent vectors, this space is a two-dimensional plane (see Fig. 5.2); and so on. With n linearly independent vectors, we can obtain any point in n-dimensional space.

[86. Boyd and Vandenberghe, Convex Optimization, 2004.]
[87. Strang, Linear Algebra and its Applications, 2006.]
[Fig. 5.2: The span of one vector (a line) and of two linearly independent vectors (a plane).]
[Fig. 5.4: Hyperplanes and half-spaces in two and three dimensions.]
[Fig. 5.5: The gradient of a function defines the hyperplane tangent to the function isosurface.]
[Fig. 5.6: Polyhedral cones in two and three dimensions.]
For $x^*$ to be a constrained local minimum, we require

$$f(x^* + p) \ge f(x^*) \quad (5.4)$$

for all feasible directions $p$. Given the Taylor series expansion (Eq. 5.3), the only way that this inequality can be satisfied is if

$$\nabla f(x^*)^\intercal p \ge 0. \quad (5.5)$$
$$J_h(x)\, p = 0. \quad (5.8)$$

This equation states that any feasible direction has to lie in the nullspace of the Jacobian of the constraints, $J_h$.

Assuming that $J_h$ has full row rank (i.e., the constraint gradients are linearly independent), the feasible space is a subspace of dimension $n_x - n_h$. For optimization to be possible, we require $n_x > n_h$. Figure 5.8 illustrates a case where $n_x = n_h = 2$, where the feasible space reduces to a point. A single tangent hyperplane is shown on the left side of Fig. 5.9 for the three-dimensional case. For two or more constraints, the feasible space corresponds to the intersection of all the tangent hyperplanes. On the right side of Fig. 5.9, we show the intersection of two tangent hyperplanes in three-dimensional space (a line).

[Fig. 5.8: If we have two equality constraints (n_h = 2) in two-dimensional space (n_x = 2), we are left with no freedom for optimization.]
[Fig. 5.9: the tangent hyperplane ∇ℎᵀ𝑝 = 0 of a single constraint (left) and the intersection 𝐽ℎ𝑝 = 0 of the tangent hyperplanes of two constraints ℎ1 = 0 and ℎ2 = 0 (right), both at a point 𝑥∗.]
In other words, the projection of the objective function gradient onto the
feasible space must vanish. Figure 5.10 illustrates this requirement for a
case with two constraints in three dimensions.
[Fig. 5.10 If the projection of ∇𝑓 onto the feasible space is nonzero, there is a feasible descent direction (left); if the projection is zero, the point is a constrained optimum (right).]
where 𝜆𝑗 are called the Lagrange multipliers.† There is a multiplier associated with each constraint. The sign of the Lagrange multipliers is arbitrary for equality constraints but will be significant later when dealing with inequality constraints.

† Despite our convention of reserving Greek symbols for scalars, we use 𝜆 to represent the 𝑛ℎ-vector of Lagrange multipliers because it is common usage.
where we have reexpressed Eq. 5.10 in matrix form and added the constraint satisfaction condition.

In constrained optimization, it is sometimes convenient to use the Lagrangian function, which is a scalar function defined as

$$
\mathcal{L}(x, \lambda) = f(x) + \lambda^\intercal h(x) .
$$

Setting the derivatives of the Lagrangian with respect to 𝑥 and 𝜆 to zero recovers the optimality and feasibility conditions:

$$
\begin{aligned}
\nabla_x \mathcal{L} &= \nabla f(x) + J_h(x)^\intercal \lambda = 0 \\
\nabla_\lambda \mathcal{L} &= h(x) = 0 ,
\end{aligned}
\qquad (5.13)
$$
$$
\begin{aligned}
\underset{x_1,\, x_2}{\text{minimize}} \quad & f(x_1, x_2) = x_1 + 2 x_2 \\
\text{subject to} \quad & h(x_1, x_2) = \tfrac{1}{4} x_1^2 + x_2^2 - 1 = 0 .
\end{aligned}
$$

The Lagrangian for this problem is

$$
\mathcal{L}(x_1, x_2, \lambda) = x_1 + 2 x_2 + \lambda \left( \tfrac{1}{4} x_1^2 + x_2^2 - 1 \right) .
$$
These two points are shown in Fig. 5.12, together with the objective and
constraint gradients. The optimality conditions (Eq. 5.11) state that the gradient
must be a linear combination of the gradients of the constraints at the optimum.
In the case of one constraint, this means that the two gradients are collinear (which occurs in this example).
[Fig. 5.12: objective contours and constraint ellipse, showing the minimum 𝑥𝐴 and the maximum 𝑥𝐵 with the gradients ∇𝑓 and ∇ℎ at each point.]

the feasible directions; in this case, we can show that it is positive or negative definite in all possible directions. The Hessian is negative definite for 𝑥𝐵, so this is not a minimum; instead, it is a maximum.

Figure 5.13 shows the Lagrangian function (with the optimal Lagrange multiplier we solved for) overlaid on top of the original function and constraint. The unconstrained minimum of the Lagrangian corresponds to the constrained minimum of the original function. The Lagrange multiplier can be visualized as a third dimension coming out of the page. Here we show only the slice for the Lagrange multiplier that solves the optimality conditions.

[Fig. 5.13: Lagrangian contours for the optimal Lagrange multiplier, whose unconstrained minimum 𝑥∗ coincides with the constrained minimum.]
subject to

$$
h(x_1, x_2) = \beta x_1^2 - x_2 = 0 ,
$$

[Fig. 5.14 Three different problems illustrating the meaning of the second-order conditions for constrained problems; the panels show ∇𝑓 and ∇ℎ for 𝛽 = −0.5, 𝛽 = 1/12, and 𝛽 = 0.5.]

For 𝛽 = −0.5, the Hessian of the Lagrangian is positive definite, and we have a minimum. For 𝛽 = 0.5, the Lagrangian has negative curvature in the feasible directions, so the point is not a minimum; we can reduce the objective by moving along the curved constraint. The first-order conditions alone do not capture this possibility because they linearize the constraint. Finally, in the limiting case (𝛽 = 1/12), the curvature of the constraint matches the curvature of the objective, and the curvature of the Lagrangian is zero in the feasible directions. This point is not a minimum either.
$$
\nabla f(x^*)^\intercal p \ge 0 , \qquad (5.16)
$$

which is the same as for the equality constrained case. We use the arc in Fig. 5.15 to show the descent directions, which are in the open half-space defined by the hyperplane tangent to the gradient of the objective.

[Fig. 5.15 The descent directions (∇𝑓ᵀ𝑝 < 0) are in the open half-space defined by the objective function gradient.]
To consider inequality constraints, we use the same linearization as for the equality constraints (Eq. 5.6), but now we enforce an inequality to get

$$
g_j(x + p) \approx g_j(x) + \nabla g_j(x)^\intercal p \le 0 . \qquad (5.17)
$$

For a given candidate point that satisfies all constraints, there are two possibilities to consider for each inequality constraint: whether the constraint is inactive (𝑔𝑗(𝑥) < 0) or active (𝑔𝑗(𝑥) = 0). If a given constraint is inactive, we do not need to add any condition for it because we can take a step 𝑝 in any direction and remain feasible as long as the step is small enough. Thus, we only need to consider the active constraints for the optimality conditions.

[Fig. 5.16 The feasible directions for each constraint are in the closed half-space defined by the inequality constraint gradient; the arc marks the infeasible directions, for which ∇𝑔ᵀ𝑝 ≥ 0.]
For the equality constraint, we found that all first-order feasible directions are in the nullspace of the Jacobian matrix. Inequality constraints are not as restrictive. From Eq. 5.17, if constraint 𝑗 is active (𝑔𝑗(𝑥) = 0), then the nearby point 𝑔𝑗(𝑥 + 𝑝) is only feasible if ∇𝑔𝑗(𝑥)ᵀ𝑝 ≤ 0 for all constraints 𝑗 that are active. In matrix form, we can write 𝐽𝑔(𝑥)𝑝 ≤ 0, where the Jacobian matrix includes only the gradients of the active constraints. Thus, the feasible directions for inequality constraint 𝑗 can be any direction in the closed half-space, corresponding to all directions 𝑝 such that 𝑝ᵀ∇𝑔𝑗 ≤ 0, as shown in Fig. 5.16. In this figure, the arc shows the infeasible directions.

The set of feasible directions that satisfies all active constraints is the intersection of all the closed half-spaces defined by the inequality constraints, that is, all 𝑝 such that 𝐽𝑔(𝑥)𝑝 ≤ 0. This intersection of the feasible directions forms a polyhedral cone, as illustrated in Fig. 5.17 for a two-dimensional case with two constraints.

[Fig. 5.17 Excluding the infeasible directions with respect to each constraint (red arcs) leaves the cone of feasible directions (blue), which is the polar cone of the active constraint gradients cone (gray).]
To find the cone of feasible directions, let us first consider the cone formed by the active inequality constraint gradients (shown in gray in Fig. 5.17). This cone is defined by all vectors 𝑑 such that

$$
d = J_g^\intercal \sigma = \sum_{j=1}^{n_g} \sigma_j \nabla g_j , \quad \text{where} \quad \sigma_j \ge 0 . \qquad (5.18)
$$
[Fig. 5.18 Two possibilities involving active inequality constraints: (1) a feasible descent direction exists, so the point is not an optimum; (2) no feasible descent direction exists, so the point is an optimum.]
𝑔 𝑗 + 𝑠 2𝑗 = 0, 𝑗 = 1, . . . , 𝑛 𝑔 , (5.20)
$$
\nabla_x \mathcal{L} = 0 \;\Rightarrow\;
\frac{\partial \mathcal{L}}{\partial x_i} =
\frac{\partial f}{\partial x_i}
+ \sum_{l=1}^{n_h} \lambda_l \frac{\partial h_l}{\partial x_i}
+ \sum_{j=1}^{n_g} \sigma_j \frac{\partial g_j}{\partial x_i} = 0 ,
\quad i = 1, \ldots, n_x . \qquad (5.22)
$$
$$
\nabla_\lambda \mathcal{L} = 0 \;\Rightarrow\;
\frac{\partial \mathcal{L}}{\partial \lambda_l} = h_l = 0 , \quad l = 1, \ldots, n_h , \qquad (5.23)
$$

which enforces the equality constraints as before. Taking derivatives with respect to the inequality Lagrange multipliers, we get

$$
\nabla_\sigma \mathcal{L} = 0 \;\Rightarrow\;
\frac{\partial \mathcal{L}}{\partial \sigma_j} = g_j + s_j^2 = 0 , \quad j = 1, \ldots, n_g , \qquad (5.24)
$$

$$
\nabla_s \mathcal{L} = 0 \;\Rightarrow\;
\frac{\partial \mathcal{L}}{\partial s_j} = 2 \sigma_j s_j = 0 , \quad j = 1, \ldots, n_g ,
\qquad (5.25)
$$
$$
\begin{aligned}
h &= 0 \\
g + s \odot s &= 0 \\
\sigma \odot s &= 0 \\
\sigma &\ge 0 ,
\end{aligned}
\qquad (5.26)
$$

where ⊙ denotes elementwise multiplication.
The middle and right panes of Fig. 5.19 illustrate cases where 𝑥∗ is also a constrained minimum. However, 𝑥∗ is not a regular point in either case because the gradients of the two constraints are not linearly independent. This means that the gradient of the objective cannot be expressed as a unique linear combination of the constraint gradients. Therefore, we cannot use the KKT conditions, even though 𝑥∗ is a minimum. The problem would be ill-conditioned, and the numerical methods described in this chapter would run into numerical difficulties. Similar to the equality constrained case, this situation is uncommon in practice.

[Fig. 5.19 The KKT conditions apply only to regular points. A point 𝑥∗ is regular when the gradients of the constraints are linearly independent. The middle and right panes illustrate cases where 𝑥∗ is a constrained minimum but not a regular point.]
Consider a variation of the problem in Ex. 5.2 where the equality is replaced by an inequality, as follows:

$$
\begin{aligned}
\underset{x_1,\, x_2}{\text{minimize}} \quad & f(x_1, x_2) = x_1 + 2 x_2 \\
\text{subject to} \quad & g(x_1, x_2) = \tfrac{1}{4} x_1^2 + x_2^2 - 1 \le 0 .
\end{aligned}
$$

The objective function and feasible region are shown in Fig. 5.20.
[Fig. 5.20 Inequality constrained problem with linear objective and feasible space within an ellipse; the minimum 𝑥𝐴 and maximum 𝑥𝐵 are marked with the gradients ∇𝑓 and ∇𝑔.]
Differentiating the Lagrangian with respect to all the variables, we get the first-order optimality conditions

$$
\begin{aligned}
\frac{\partial \mathcal{L}}{\partial x_1} &= 1 + \tfrac{1}{2} \sigma x_1 = 0 \\
\frac{\partial \mathcal{L}}{\partial x_2} &= 2 + 2 \sigma x_2 = 0 \\
\frac{\partial \mathcal{L}}{\partial \sigma} &= \tfrac{1}{4} x_1^2 + x_2^2 - 1 + s^2 = 0 \\
\frac{\partial \mathcal{L}}{\partial s} &= 2 \sigma s = 0 .
\end{aligned}
$$
There are two possibilities in the last (complementary slackness) condition: 𝑠 = 0 (meaning the constraint is active) or 𝜎 = 0 (meaning the constraint is not active). However, we can see that setting 𝜎 = 0 in either of the first two equations does not yield a solution. Assuming that 𝑠 = 0 and 𝜎 ≠ 0, we can solve the equations to obtain

$$
x_A = \begin{bmatrix} x_1 \\ x_2 \\ \sigma \end{bmatrix}
    = \begin{bmatrix} -\sqrt{2} \\ -\sqrt{2}/2 \\ \sqrt{2} \end{bmatrix} ,
\qquad
x_B = \begin{bmatrix} x_1 \\ x_2 \\ \sigma \end{bmatrix}
    = \begin{bmatrix} \sqrt{2} \\ \sqrt{2}/2 \\ -\sqrt{2} \end{bmatrix} .
$$
These are the same critical points as in the equality constrained case of Ex. 5.2,
as shown in Fig. 5.20. However, now the sign of the Lagrange multiplier is
significant.
According to the KKT conditions, the Lagrange multiplier has to be nonneg-
ative. Point 𝑥 𝐴 satisfies this condition. As a result, there is no feasible descent
direction at 𝑥 𝐴 , as shown in Fig. 5.21 (left). The Hessian of the Lagrangian at
this point is the same as in Ex. 5.2, which we have already shown to be positive
definite. Therefore, 𝑥 𝐴 is a minimum.
[Fig. 5.21 At the minimum (left), the Lagrange multiplier is positive, and there is no feasible descent direction. At the critical point 𝑥𝐵 (right), the Lagrange multiplier is negative, and all descent directions are feasible, so this point is not a minimum.]
Unlike the equality constrained problem, we do not need to check the Hes-
sian at point 𝑥 𝐵 because the Lagrange multiplier is negative. As a consequence,
there are feasible descent directions, as shown in Fig. 5.21 (right). Therefore,
𝑥 𝐵 is not a minimum.
Consider a variation of Ex. 5.4 where we add one more inequality constraint, as follows:

$$
\begin{aligned}
\underset{x_1,\, x_2}{\text{minimize}} \quad & f(x_1, x_2) = x_1 + 2 x_2 \\
\text{subject to} \quad & g_1(x_1, x_2) = \tfrac{1}{4} x_1^2 + x_2^2 - 1 \le 0 \\
& g_2(x_2) = -x_2 \le 0 .
\end{aligned}
$$

The feasible region is the top half of the ellipse, as shown in Fig. 5.22. The Lagrangian for this problem is

$$
\mathcal{L}(x, \sigma, s) = x_1 + 2 x_2 + \sigma_1 \left( \tfrac{1}{4} x_1^2 + x_2^2 - 1 + s_1^2 \right) + \sigma_2 \left( -x_2 + s_2^2 \right) .
$$
Differentiating the Lagrangian with respect to all the variables, we get the first-order optimality conditions,

$$
\begin{aligned}
\frac{\partial \mathcal{L}}{\partial x_1} &= 1 + \tfrac{1}{2} \sigma_1 x_1 = 0 \\
\frac{\partial \mathcal{L}}{\partial x_2} &= 2 + 2 \sigma_1 x_2 - \sigma_2 = 0 \\
\frac{\partial \mathcal{L}}{\partial \sigma_1} &= \tfrac{1}{4} x_1^2 + x_2^2 - 1 + s_1^2 = 0 \\
\frac{\partial \mathcal{L}}{\partial \sigma_2} &= -x_2 + s_2^2 = 0 \\
\frac{\partial \mathcal{L}}{\partial s_1} &= 2 \sigma_1 s_1 = 0 \\
\frac{\partial \mathcal{L}}{\partial s_2} &= 2 \sigma_2 s_2 = 0 .
\end{aligned}
$$
We now have two complementary slackness conditions, which yield the four
potential combinations listed in Table 5.1.
[Fig. 5.22 Only one point satisfies the first-order KKT conditions: the minimum 𝑥∗; the candidate points 𝑥𝐵 and 𝑥𝐶 are also shown, with the gradients ∇𝑓, ∇𝑔1, and ∇𝑔2.]
[Fig. 5.23 At the minimum 𝑥∗ (left), the intersection of the feasible directions and descent directions is null, so there is no feasible descent direction. At 𝑥𝐶 (right), there is a cone of descent directions that is also feasible, so it is not a minimum.]
Assuming that both constraints are active yields two possible solutions (𝑥 ∗
and 𝑥 𝐶 ) corresponding to two different Lagrange multipliers. According to the
KKT conditions, the Lagrange multipliers for all active inequality constraints
have to be positive, so only the solution with 𝜎1 = 1 (𝑥 ∗ ) is a candidate for a
minimum. This point corresponds to 𝑥 ∗ in Fig. 5.22. As shown in Fig. 5.23 (left),
there are no feasible descent directions starting from 𝑥 ∗ . The Hessian of the
Lagrangian at 𝑥 ∗ is identical to the previous example and is positive definite
when 𝜎1 is positive. Therefore, 𝑥 ∗ is a minimum.
The other solution for which both constraints are active is point 𝑥 𝐶 in
Fig. 5.22. As shown in Fig. 5.23 (right), there is a cone of feasible descent
directions, and therefore 𝑥 𝐶 is not a minimum.
Assuming that neither constraint is active yields 1 = 0 for the first optimality
condition, so this situation is not possible. Assuming that 𝑔1 is active yields
the solution corresponding to the maximum that we already found in Ex. 5.4,
𝑥 𝐵 . Finally, assuming that only 𝑔2 is active yields no candidate point.
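To make this case enumeration concrete, the following Python sketch (an illustration assuming SymPy, which is not used in the book) solves the stationarity equations of this example for each of the four combinations; treating a constraint as active imposes 𝑔ⱼ = 0, and treating it as inactive imposes 𝜎ⱼ = 0:

import sympy as sp

x1, x2 = sp.symbols("x1 x2", real=True)
sig1, sig2 = sp.symbols("sigma1 sigma2", real=True)

f = x1 + 2 * x2
g1 = x1**2 / 4 + x2**2 - 1
g2 = -x2
L = f + sig1 * g1 + sig2 * g2

# Stationarity: dL/dx1 = 0 and dL/dx2 = 0
stationarity = [sp.diff(L, x1), sp.diff(L, x2)]

for active1 in (True, False):
    for active2 in (True, False):
        eqs = stationarity + [g1 if active1 else sig1,
                              g2 if active2 else sig2]
        # Candidates must also pass the sigma >= 0 and feasibility checks
        print(active1, active2,
              sp.solve(eqs, [x1, x2, sig1, sig2], dict=True))

Screening the printed candidates by the sign conditions leaves only the solution with 𝜎1 = 1 identified above as the minimum 𝑥∗.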
[Figure: one-dimensional illustration of an interior penalty (left) and an exterior penalty (right) added to the objective near the constraint boundary.]

[Fig. 5.27 Quadratic penalty for an equality constrained problem. The minimum of the penalized function 𝑓̂(𝑥; 𝜇) (black dots) approaches the true constrained minimum 𝑥∗ (blue circle) as the penalty parameter 𝜇 increases.]
Inputs:
𝑥 0 : Starting point
𝜇0 > 0: Initial penalty parameter
𝜌 > 1: Penalty increase factor (𝜌 ∼ 1.2 is conservative, 𝜌 ∼ 10 is aggressive)
Outputs:
𝑥 ∗ : Optimal point
𝑓 (𝑥 ∗ ): Corresponding function value
𝑘=0
while not converged do
𝑥∗𝑘 ← minimize 𝑓̂(𝑥; 𝜇𝑘) with respect to 𝑥, starting from 𝑥𝑘
𝜇 𝑘+1 = 𝜌𝜇 𝑘 Increase penalty
𝑥 𝑘+1 = 𝑥 ∗𝑘 Update starting point for next optimization
𝑘 = 𝑘+1
end while
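As an illustration, the following Python sketch (not from the book; it assumes SciPy's BFGS solver for the inner subproblems, with the problem data from Ex. 5.2 and arbitrary 𝜇0 and 𝜌) implements the loop of Alg. 5.1:

import numpy as np
from scipy.optimize import minimize

def f(x):
    return x[0] + 2 * x[1]

def h(x):
    return np.array([x[0]**2 / 4 + x[1]**2 - 1])

def f_hat(x, mu):  # quadratic penalty
    return f(x) + mu / 2 * np.sum(h(x)**2)

x, mu, rho = np.array([2.0, 1.0]), 1.0, 2.0
for k in range(20):  # outer loop of Alg. 5.1
    x = minimize(f_hat, x, args=(mu,), method="BFGS").x  # inner minimization
    mu *= rho  # increase penalty; x warm-starts the next subproblem
print(x)

Warm-starting each subproblem from the previous solution is what keeps the increasingly ill-conditioned inner minimizations tractable.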
$$
\nabla_x \mathcal{L} = \nabla f + J_h^\intercal \lambda = 0 , \qquad (5.38)
$$

$$
\nabla_x \hat{f} = \nabla f + \mu J_h^\intercal h = 0 , \qquad (5.39)
$$
$$
h_j \approx \frac{\lambda_j^*}{\mu} . \qquad (5.40)
$$
[Fig. 5.28 The quadratic penalized function minimum 𝑥∗𝑓̂ approaches the constrained minimum 𝑥∗ as the penalty parameter increases (three panels for increasing 𝜇).]
Consider the equality constrained problem from Ex. 5.2. The penalized function for that case is

$$
\hat{f}(x; \mu) = x_1 + 2 x_2 + \frac{\mu}{2} \left( \tfrac{1}{4} x_1^2 + x_2^2 - 1 \right)^2 . \qquad (5.41)
$$
Figure 5.28 shows this function for different values of the penalty parameter 𝜇. The penalty is active for all points that are infeasible, but the minimum of the penalized function does not coincide with the constrained minimum of the original problem. The penalty parameter needs to be increased for the minimum of the penalized function to approach the correct solution, but this degrades the conditioning of the problem. Therefore, we solve a sequence of problems, starting with a small value of 𝜇 and reusing the optimal point for one solution as the starting point for the next. Figure 5.29 shows that large penalty values are required for high accuracy. In this example, even using a penalty parameter of 𝜇 = 1,000 (which results in extremely skewed contours), the objective value achieves only three digits of accuracy.

[Fig. 5.29 Error in optimal solution |𝑓̂∗ − 𝑓∗| for increasing penalty parameter 𝜇.]
$$
\hat{f}(x; \mu) = f(x) + \frac{\mu}{2} \sum_{j=1}^{n_g} \max\left(0,\, g_j(x)\right)^2 . \qquad (5.42)
$$
For problems with both equality and inequality constraints, the two penalties can be combined as

$$
\hat{f}(x; \mu) = f(x) + \frac{\mu_h}{2} \sum_{l=1}^{n_h} h_l(x)^2 + \frac{\mu_g}{2} \sum_{j=1}^{n_g} \max\left(0,\, g_j(x)\right)^2 . \qquad (5.43)
$$
Consider the inequality constrained problem from Ex. 5.4. The penalized function for that case is

$$
\hat{f}(x; \mu) = x_1 + 2 x_2 + \frac{\mu}{2} \max\left(0,\, \tfrac{1}{4} x_1^2 + x_2^2 - 1\right)^2 .
$$

This function is shown in Fig. 5.31 for different values of the penalty parameter 𝜇. The contours of the feasible region inside the ellipse coincide with the
original function contours. However, outside the feasible region, the contours
change to create a function whose minimum approaches the true constrained
minimum as the penalty parameter increases.
[Fig. 5.31: quadratic penalty for the inequality constrained problem; three panels for increasing 𝜇, with the penalized minimum 𝑥∗𝑓̂ approaching 𝑥∗.]
The considerations on scaling discussed in Tip 4.4 are just as crucial for constrained problems. Similar to scaling the objective function, a good rule of thumb is to normalize each constraint function so that it is of order 1. For constraints, a natural scale is typically already defined by the limits we provide. For example, instead of enforcing 𝑔𝑗(𝑥) ≤ 𝑔max𝑗 directly, we can enforce

$$
\frac{g_j(x)}{g_{\max_j}} - 1 \le 0 . \qquad (5.45)
$$
Augmented Lagrangian

$$
\hat{f}(x; \lambda, \mu) = f(x) + \sum_{j=1}^{n_h} \lambda_j h_j(x) + \frac{\mu}{2} \sum_{j=1}^{n_h} h_j(x)^2 . \qquad (5.46)
$$
$$
\nabla_x \hat{f}(x; \lambda, \mu) = \nabla f(x) + \sum_{j=1}^{n_h} \left( \lambda_j + \mu h_j(x) \right) \nabla h_j = 0 , \qquad (5.47)
$$

$$
\nabla_x \mathcal{L}(x^*, \lambda^*) = \nabla f(x^*) + \sum_{j=1}^{n_h} \lambda_j^* \nabla h_j(x^*) = 0 . \qquad (5.48)
$$

Comparing these two expressions suggests the multiplier estimate

$$
\lambda_j^* \approx \lambda_j + \mu h_j , \qquad (5.49)
$$

which implies that the constraint violation behaves as

$$
h_j \approx \frac{\lambda_j^* - \lambda_j}{\mu} . \qquad (5.52)
$$
Inputs:
𝑥 0 : Starting point
𝜆0 = 0: Initial Lagrange multiplier
𝜇0 > 0: Initial penalty parameter
𝜌 > 1: Penalty increase factor
Outputs:
𝑥 ∗ : Optimal point
𝑓 (𝑥 ∗ ): Corresponding function value
𝑘=0
while not converged do
𝑥∗𝑘 ← minimize 𝑓̂(𝑥; 𝜆𝑘, 𝜇𝑘) with respect to 𝑥, starting from 𝑥𝑘
𝜆 𝑘+1 = 𝜆 𝑘 + 𝜇 𝑘 ℎ(𝑥 𝑘 ) Update Lagrange multipliers
𝜇 𝑘+1 = 𝜌𝜇 𝑘 Increase penalty parameter
𝑥 𝑘+1 = 𝑥 ∗𝑘 Update starting point for next optimization
𝑘 = 𝑘+1
end while
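A Python sketch of Alg. 5.2 for the equality constrained problem of Ex. 5.2 (again assuming SciPy's BFGS solver for the inner minimizations; 𝜇0 and 𝜌 are illustrative values) differs from the quadratic penalty loop only in the multiplier update:

import numpy as np
from scipy.optimize import minimize

def f(x):
    return x[0] + 2 * x[1]

def h(x):
    return np.array([x[0]**2 / 4 + x[1]**2 - 1])

def f_hat(x, lam, mu):  # augmented Lagrangian (Eq. 5.46)
    hx = h(x)
    return f(x) + lam @ hx + mu / 2 * hx @ hx

x, lam, mu, rho = np.array([2.0, 1.0]), np.zeros(1), 0.5, 1.1
for k in range(50):
    x = minimize(f_hat, x, args=(lam, mu), method="BFGS").x
    lam = lam + mu * h(x)  # multiplier update (Eq. 5.49)
    mu *= rho              # modest penalty increase
print(x, lam)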
Consider the inequality constrained problem from Ex. 5.4. Assuming the inequality constraint is active, the augmented Lagrangian (Eq. 5.46) is

$$
\hat{f}(x; \lambda, \mu) = x_1 + 2 x_2 + \lambda \left( \tfrac{1}{4} x_1^2 + x_2^2 - 1 \right) + \frac{\mu}{2} \left( \tfrac{1}{4} x_1^2 + x_2^2 - 1 \right)^2 .
$$

Applying Alg. 5.2, starting with 𝜇 = 0.5 and using 𝜌 = 1.1, we get the iterations shown in Fig. 5.32.
[Fig. 5.32 Augmented Lagrangian applied to the inequality constrained problem: iterations from 𝑥0 converge to 𝑥∗ (three panels).]
Compared with the quadratic penalty in Ex. 5.7, the penalized function is much better conditioned, thanks to the term associated with the Lagrange multiplier. The minimum of the penalized function eventually becomes the minimum of the constrained problem without a large penalty parameter.

As done in Ex. 5.6, we solve a sequence of problems starting with a small value of 𝜇 and reusing the optimal point for one solution as the starting point for the next. In this case, we update the Lagrange multiplier estimate between optimizations as well. Figure 5.33 shows that only modest penalty parameters are needed to achieve tight convergence to the true solution, a significant improvement over the regular quadratic penalty.

[Fig. 5.33 Error in optimal solution |𝑓̂∗ − 𝑓∗| as compared with the true solution as a function of an increasing penalty parameter 𝜇.]
5.4.2 Interior Penalty Methods
Interior penalty methods work the same way as exterior penalty methods—they transform the constrained problem into a series of unconstrained problems. The main difference with interior penalty methods is that they always seek to maintain feasibility. Instead of adding a penalty only when constraints are violated, they add a penalty as the constraint is approached from the feasible region. This type of penalty is particularly desirable if the objective function is ill-defined outside the feasible region. These methods are called interior because the iteration points remain in the interior of the feasible region. They are also referred to as barrier methods because the penalty function acts as a barrier that prevents the iterates from leaving the feasible region.

One possible interior penalty function to enforce 𝑔(𝑥) ≤ 0 is the inverse barrier,

$$
\pi(x) = - \sum_{j=1}^{n_g} \frac{1}{g_j(x)} , \qquad (5.55)
$$

where 𝜋(𝑥) → ∞ as 𝑔𝑗(𝑥) → 0⁻ (the superscript “−” indicates that the constraint approaches zero from the negative, feasible side). A more popular interior penalty function is the

[Fig. 5.34 Two different interior penalty functions: inverse barrier and logarithmic barrier.]
logarithmic barrier,

$$
\pi(x) = - \sum_{j=1}^{n_g} \ln\left(-g_j(x)\right) , \qquad (5.56)
$$

which also approaches infinity as the constraint tends to zero from the feasible side. The penalized function is then

$$
\hat{f}(x; \mu) = f(x) - \mu \sum_{j=1}^{n_g} \ln\left(-g_j(x)\right) . \qquad (5.57)
$$
Inputs:
𝑥 0 : Starting point
𝜇0 > 0: Initial penalty parameter
𝜌 < 1: Penalty decrease factor
Outputs:
𝑥 ∗ : Optimal point
𝑓 (𝑥 ∗ ): Corresponding function value
𝑘=0
while not converged do
𝑥∗𝑘 ← minimize 𝑓̂(𝑥; 𝜇𝑘) with respect to 𝑥, starting from 𝑥𝑘
𝜇 𝑘+1 = 𝜌𝜇 𝑘 Decrease penalty parameter
𝑥 𝑘+1 = 𝑥 ∗𝑘 Update starting point for next optimization
𝑘 = 𝑘+1
end while
Consider the inequality constrained problem from Ex. 5.4. The penalized function for that case using the logarithmic penalty (Eq. 5.57) is

$$
\hat{f}(x; \mu) = x_1 + 2 x_2 - \mu \ln\left( -\tfrac{1}{4} x_1^2 - x_2^2 + 1 \right) .
$$

Figure 5.36 shows this function for different values of the penalty parameter 𝜇. The penalized function is defined only in the feasible space, so we do not plot its contours outside the ellipse.
[Fig. 5.36: logarithmic barrier function for three decreasing values of 𝜇; its minimum 𝑥∗𝑓̂ approaches 𝑥∗ from inside the feasible region.]
tends to zero.93 There are augmented and modified barrier approaches that can avoid the ill-conditioning issue (and other methods that remain ill-conditioned but can still be solved reliably, albeit inefficiently).94 However, these methods have been superseded by the modern interior-point methods discussed in Section 5.6, so we do not elaborate on further improvements to classical penalty methods.

93. Murray, Analytical expressions for the eigenvalues and eigenvectors of the Hessian matrices of barrier and penalty functions, 1971.
94. Forsgren et al., Interior methods for nonlinear optimization, 2002.
$$
J_r(u_k)\, p_u = -r(u_k) , \qquad (5.60)
$$

Differentiating the vector of residuals (Eq. 5.59) with respect to the two concatenated vectors in 𝑢 yields the following block linear system:

$$
\begin{bmatrix} H_\mathcal{L} & J_h^\intercal \\ J_h & 0 \end{bmatrix}
\begin{bmatrix} p_x \\ p_\lambda \end{bmatrix}
=
\begin{bmatrix} -\nabla_x \mathcal{L} \\ -h \end{bmatrix} . \qquad (5.62)
$$
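One Newton iteration on these conditions amounts to assembling and solving this block system. A minimal NumPy sketch (with user-supplied 𝐻ℒ, 𝐽ℎ, ∇𝑥ℒ, and ℎ; an illustration, not the book's implementation) follows:

import numpy as np

def kkt_step(H_L, J_h, grad_L, h):
    n_x, n_h = H_L.shape[0], J_h.shape[0]
    # Assemble the block KKT matrix of Eq. 5.62
    K = np.block([[H_L, J_h.T],
                  [J_h, np.zeros((n_h, n_h))]])
    p = np.linalg.solve(K, -np.concatenate([grad_L, h]))
    return p[:n_x], p[n_x:]  # steps p_x and p_lambda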
This is like the procedure we used in solving the KKT conditions, except that these are linear equations, so we can solve them directly without iterating.
$$
\begin{aligned}
\underset{p}{\text{minimize}} \quad & \tfrac{1}{2}\, p^\intercal H_\mathcal{L}\, p + \nabla_x \mathcal{L}^\intercal p \\
\text{subject to} \quad & J_h p + h = 0 .
\end{aligned}
\qquad (5.69)
$$
$$
\tfrac{1}{2}\, p^\intercal H_\mathcal{L}\, p + \nabla f^\intercal p + \lambda^\intercal J_h p . \qquad (5.70)
$$
Then, we substitute the constraint 𝐽ℎ 𝑝 = −ℎ into the objective:
$$
\tfrac{1}{2}\, p^\intercal H_\mathcal{L}\, p + \nabla f^\intercal p - \lambda^\intercal h . \qquad (5.71)
$$
Now, we can remove the last term in the objective because it does
not depend on the variable (𝑝), resulting in the following equivalent
problem:
$$
\begin{aligned}
\underset{p}{\text{minimize}} \quad & \tfrac{1}{2}\, p^\intercal H_\mathcal{L}\, p + \nabla f^\intercal p \\
\text{subject to} \quad & J_h p + h = 0 .
\end{aligned}
\qquad (5.72)
$$
Using the QP solution method outlined previously results in the
following system of linear equations:
$$
\begin{bmatrix} H_\mathcal{L} & J_h^\intercal \\ J_h & 0 \end{bmatrix}
\begin{bmatrix} p_x \\ \lambda_{k+1} \end{bmatrix}
=
\begin{bmatrix} -\nabla f \\ -h \end{bmatrix} . \qquad (5.73)
$$
$$
\begin{aligned}
\underset{s}{\text{minimize}} \quad & \tfrac{1}{2}\, s^\intercal H_\mathcal{L}\, s + \nabla_x \mathcal{L}^\intercal s \\
\text{subject to} \quad & J_h s + h = 0 \\
& J_g s + g \le 0 .
\end{aligned}
\qquad (5.76)
$$
The determination of the working set could happen in the inner loop,
that is, as part of the inequality constrained QP subproblem (Eq. 5.76).
Alternatively, we could choose a working set in the outer loop and
then solve the QP subproblem with only equality constraints (Eq. 5.69),
where the working-set constraints would be posed as equalities. The
former approach is more common and is discussed here. In that case,
we need consider only the active-set problem in the context of a QP.
Many variations on active-set methods exist; we outline just one such
approach based on a binding-direction method.
The general QP problem we need to solve is as follows:

$$
\begin{aligned}
\underset{x}{\text{minimize}} \quad & \tfrac{1}{2}\, x^\intercal Q x + q^\intercal x \\
\text{subject to} \quad & A x + b = 0 \\
& C x + d \le 0 .
\end{aligned}
\qquad (5.77)
$$
Assume, for the moment, that the working set does not change at
nearby points (i.e., we ignore the constraints outside the working set).
We seek a step 𝑝 to update the design variables as follows: 𝑥 𝑘+1 = 𝑥 𝑘 + 𝑝.
We find 𝑝 by solving the following simplified QP that considers only
the working set:
$$
\begin{aligned}
\underset{p}{\text{minimize}} \quad & \tfrac{1}{2}\, (x_k + p)^\intercal Q (x_k + p) + q^\intercal (x_k + p) \\
\text{subject to} \quad & A (x_k + p) + b = 0 \\
& C_w (x_k + p) + d_w = 0 .
\end{aligned}
\qquad (5.79)
$$
$$
\begin{aligned}
\underset{p}{\text{minimize}} \quad & \tfrac{1}{2}\, p^\intercal Q p + \left( q + Q^\intercal x_k \right)^\intercal p \\
\text{subject to} \quad & A p = 0 \\
& C_w p = 0 .
\end{aligned}
\qquad (5.80)
$$
Figure 5.39 shows the structure of the matrix in this linear system.

[Fig. 5.39 Structure of the matrix for the QP subproblem within the inequality constrained QP solution process, with blocks 𝑄, 𝐴, and 𝐶𝑤.]

Let us consider the case where the solution of this linear system is nonzero. Solving the KKT conditions in Eq. 5.80 ensures that all the constraints in the working set are still satisfied at 𝑥𝑘 + 𝑝. Still, there is no guarantee that the step does not violate some of the constraints outside of our working set. Suppose that 𝐶𝑛 and 𝑑𝑛 define the constraints outside of the working set. If
𝐶 𝑛 (𝑥 𝑘 + 𝑝) + 𝑑𝑛 ≤ 0 (5.82)
for all rows, all the constraints are still satisfied. In that case, we accept
the step 𝑝 and update the design variables as follows:
𝑥 𝑘+1 = 𝑥 𝑘 + 𝑝 . (5.83)
𝑥 𝑘+1 = 𝑥 𝑘 + 𝛼𝑝 . (5.84)
We cannot take the full step (𝛼 = 1), but we would like to take as large
a step as possible while still keeping all the constraints feasible.
Let us consider how to determine the appropriate step size, 𝛼. Substituting the step update (Eq. 5.84) into the inequality constraints outside the working set, we obtain the following:

$$
c_i^\intercal (x_k + \alpha p) + d_i \le 0 . \qquad (5.86)
$$

Expanding and isolating the step-size term yields

$$
\alpha\, c_i^\intercal p \le -\left( c_i^\intercal x_k + d_i \right) . \qquad (5.87)
$$
Inputs:
𝑄, 𝑞, 𝐴, 𝑏, 𝐶, 𝑑: Matrices and vectors defining the QP (Eq. 5.77); 𝑄 must be positive definite
𝜀: Tolerance used for termination and for determining whether constraint is active
Outputs:
𝑥 ∗ : Optimal point
𝑘=0
𝑥 𝑘 = 𝑥0
𝑊𝑘 = {𝑖 : 𝑐𝑖ᵀ𝑥𝑘 + 𝑑𝑖 > −𝜀}, with length(𝑊𝑘) ≤ 𝑛𝑥 (one possible initial working set)
while true do
set 𝐶𝑤 = 𝐶 𝑖,∗ and 𝑑𝑤 = 𝑑 𝑖 for all 𝑖 ∈ 𝑊𝑘 Select rows for working set
Solve the KKT system (Eq. 5.81)
if k𝑝 k < 𝜀 then
if 𝜎 ≥ 0 then Satisfied KKT conditions
𝑥∗ = 𝑥 𝑘
return
else
𝑖 = argmin 𝜎
𝑊𝑘+1 = 𝑊𝑘 \ {𝑖} Remove 𝑖 from working set
𝑥 𝑘+1 = 𝑥 𝑘
end if
else
𝛼=1 Initialize with optimum step
𝐵 = {} Blocking index
for 𝑖 ∉ 𝑊𝑘 do Check constraints outside of working set
if 𝑐𝑖ᵀ𝑝 > 0 then    (potential blocking constraint)
𝛼𝑏 = −(𝑐𝑖ᵀ𝑥𝑘 + 𝑑𝑖) / (𝑐𝑖ᵀ𝑝)    (𝑐𝑖 is a row of 𝐶𝑛)
if 𝛼 𝑏 < 𝛼 then
𝛼 = 𝛼𝑏
𝐵=𝑖 Save or overwrite blocking index
end if
end if
end for
𝑊𝑘+1 = 𝑊𝑘 ∪ {𝐵} Add 𝐵 to working set (if linearly independent)
𝑥 𝑘+1 = 𝑥 𝑘 + 𝛼𝑝
end if
𝑘 = 𝑘+1
end while
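The step-size logic in the else branch of Alg. 5.4 can be written compactly; the sketch below (assuming NumPy, with 𝐶𝑛 and 𝑑𝑛 holding the constraints outside the working set) returns the largest admissible 𝛼 and the blocking index, if any:

import numpy as np

def max_feasible_step(C_n, d_n, x, p):
    alpha, blocking = 1.0, None
    for i in range(C_n.shape[0]):
        c_i = C_n[i]
        if c_i @ p > 0:  # potential blocking constraint
            alpha_b = -(c_i @ x + d_n[i]) / (c_i @ p)
            if alpha_b < alpha:
                alpha, blocking = alpha_b, i  # save or overwrite blocking index
    return alpha, blocking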
we solve the QP formed by the equality constraints and any constraints in the active set (treated as equality constraints). The sequence of iterations is detailed as follows and is plotted in Fig. 5.40:

[Fig. 5.40: active-set QP iterations from 𝑥0 to 𝑥∗.]

𝑘 = 1  The QP subproblem yields 𝑝 = [−1.75, −6.25] and 𝜎 = [0, 0, 0]. Next, we check whether any constraints are blocking at the new point 𝑥 + 𝑝. Because all three constraints are outside of the working set, we check all three. Constraint 1 is potentially blocking (𝑐𝑖ᵀ𝑝 > 0) and leads to 𝛼𝑏 = 0.35955. Constraint 2 is also potentially blocking and leads to …

… values was blocking, so we can take the full step (𝛼 = 1). The new point is 𝑥 = [0.5, 1.0], and the working set is unchanged at 𝑊 = {1}.

𝑘 = 5  The QP yields 𝑝 = [0, 0], 𝜎 = [3, 0, 0]. Because 𝑝 = 0, we check for convergence. All Lagrange multipliers are nonnegative, so the problem
objective is lower and the sum of its constraint violations is lower. The filter consists of all the points that have been found to be non-dominated in the line searches so far. The line search terminates when it finds a point that is not dominated by any point in the current filter. That new point is then added to the filter, and any points that it dominates are removed from the filter.‖

This is only the basic concept. Robust implementation of a filter method requires imposing sufficient decrease conditions, not unlike those in the unconstrained case, and several other modifications. Fletcher et al.99 provide more details on filter methods.

‖ See Section 9.2 for more details on the concept of dominance.
99. Fletcher et al., A brief history of filter methods, 2006.
accepted, the line search ends, and this new point is added to the filter. Unlike the previous case, none of the points in the filter are dominated. Therefore, no points are removed from the filter set, which becomes {(1, 6), (2, 5), (3, 2), (7, 1)}.

3. (4, 3): This point is dominated by a point in the filter, (3, 2). The step is rejected, and the line search continues by selecting a new candidate point. The filter is unchanged.

[Fig. 5.41 Filter method example showing three points in the filter (blue dots); the shaded regions correspond to all the points that are dominated by the filter. The red dots illustrate three different possible outcomes when new points are considered.]
$$
\tilde{H}_{\mathcal{L}_{k+1}} = \tilde{H}_{\mathcal{L}_k}
- \frac{\tilde{H}_{\mathcal{L}_k} s_k s_k^\intercal \tilde{H}_{\mathcal{L}_k}}{s_k^\intercal \tilde{H}_{\mathcal{L}_k} s_k}
+ \frac{y_k y_k^\intercal}{y_k^\intercal s_k} , \qquad (5.91)
$$
where

$$
s_k = x_{k+1} - x_k , \qquad
y_k = \nabla_x \mathcal{L}(x_{k+1}, \lambda_{k+1}) - \nabla_x \mathcal{L}(x_k, \lambda_{k+1}) . \qquad (5.92)
$$
The step in the design variable space, 𝑠 𝑘 , is the step that resulted from
the latest line search. The Lagrange multiplier is fixed to the latest value
when approximating the curvature of the Lagrangian because we only
need the curvature in the space of the design variables.
Recall that for the QP problem (Eq. 5.76) to have a solution, 𝐻̃ℒ𝑘 must be positive definite. To ensure a positive definite approximation, we can use a damped BFGS update.25∗∗ This method replaces 𝑦 with a new vector 𝑟, defined as

$$
r_k = \theta_k y_k + (1 - \theta_k)\, \tilde{H}_{\mathcal{L}_k} s_k , \qquad (5.93)
$$

where

$$
\theta_k =
\begin{cases}
1 & \text{if } s_k^\intercal y_k \ge 0.2\, s_k^\intercal \tilde{H}_{\mathcal{L}_k} s_k \\[1ex]
\dfrac{0.8\, s_k^\intercal \tilde{H}_{\mathcal{L}_k} s_k}{s_k^\intercal \tilde{H}_{\mathcal{L}_k} s_k - s_k^\intercal y_k} & \text{if } s_k^\intercal y_k < 0.2\, s_k^\intercal \tilde{H}_{\mathcal{L}_k} s_k ,
\end{cases}
\qquad (5.94)
$$

which can range from 0 to 1. We then use the same BFGS update formula (Eq. 5.91), except that we replace each 𝑦𝑘 with 𝑟𝑘.

25. Powell, Algorithms for nonlinear constraints that use Lagrangian functions, 1978.
100. Fletcher, Practical Methods of Optimization, 1987.
101. Liu and Nocedal, On the limited memory BFGS method for large scale optimization, 1989.
∗∗ The damped BFGS update is not al[…] when storing a dense Hessian for large problems is prohibitive.101
To better understand this update, let us consider the two extremes
for 𝜃. If 𝜃𝑘 = 0, then Eq. 5.93 in combination with Eq. 5.91 yields
𝐻˜ ℒ 𝑘+1 = 𝐻˜ ℒ 𝑘 ; that is, the Hessian approximation is unmodified. At the
other extreme, 𝜃𝑘 = 1 yields the full BFGS update formula (𝑟 𝑘 is set
to 𝑦 𝑘 ). Thus, the parameter 𝜃𝑘 provides a linear weighting between
keeping the current Hessian approximation and using the full BFGS
update.
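In code, the damped update is a small modification of the standard BFGS formula. A NumPy sketch (with 𝐻 the current approximation and 𝑠, 𝑦 defined in Eq. 5.92; an illustration rather than a production implementation) is:

import numpy as np

def damped_bfgs_update(H, s, y):
    sHs = s @ H @ s
    # Damping factor theta (Eq. 5.94)
    if s @ y >= 0.2 * sHs:
        theta = 1.0
    else:
        theta = 0.8 * sHs / (sHs - s @ y)
    r = theta * y + (1.0 - theta) * (H @ s)  # Eq. 5.93
    Hs = H @ s
    # BFGS update (Eq. 5.91) with y replaced by r
    return H - np.outer(Hs, Hs) / sHs + np.outer(r, r) / (r @ s)

Setting theta = 0 returns H unchanged, and theta = 1 recovers the full BFGS update, consistent with the two extremes discussed above.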
The definition of 𝜃𝑘 (Eq. 5.94) ensures that 𝐻̃ℒ𝑘+1 stays close enough to 𝐻̃ℒ𝑘 and remains positive definite. The damping is activated when the predicted curvature in the latest step is below one-fifth of the curvature predicted by the latest approximate Hessian. This could happen when the function is flattening or when the curvature becomes negative.

5.5.5 Algorithm Overview

We now put together the various pieces in a high-level description of SQP with quasi-Newton approximations in Alg. 5.5.†† For the convergence criterion, we can use an infinity norm of the KKT system residual vector. For better control over the convergence, we can consider two separate tolerances: one for the norm of the optimality and another

†† A few popular SQP implementations include SNOPT,96 Knitro,102 MATLAB's fmincon, and SLSQP.103 The first three are commercial options, whereas SLSQP is open source. There are interfaces in different programming languages for these optimizers, including pyOptSparse (for SNOPT and SLSQP).1
1. Wu et al., pyOptSparse: A Python framework for large-scale constrained nonlinear optimization of sparse systems, 2020.
102. Byrd et al., Knitro: An Integrated Package for Nonlinear Optimization, 2006.
103. Kraft, A software package for sequential quadratic programming, 1988.
for the norm of the feasibility. For problems that only have equality
constraints, we can solve the corresponding QP (Eq. 5.62) instead.
Inputs:
𝑥 0 : Starting point
𝜏opt : Optimality tolerance
𝜏feas : Feasibility tolerance
Outputs:
𝑥 ∗ : Optimal point
𝑓 (𝑥 ∗ ): Corresponding function value
𝜆 𝑘+1 = 𝜆 𝑘 + 𝑝𝜆
𝛼 = linesearch (𝑝 𝑥 , 𝛼 init ) Use merit function or filter (Section 5.5.3)
𝑥 𝑘+1 = 𝑥 𝑘 + 𝛼𝑝 𝑘 Update step
𝑊𝑘+1 = 𝑊𝑘 Active set becomes initial working set for next QP
Evaluate functions ( 𝑓 , 𝑔, ℎ) and derivatives (∇ 𝑓 , 𝐽 𝑔 , 𝐽 ℎ )
| |
∇𝑥 ℒ = ∇ 𝑓 + 𝐽 ℎ 𝜆 + 𝐽 𝑔 𝜎
𝑘 = 𝑘+1
end while
We now solve Ex. 5.2 using the SQP method (Alg. 5.5). We start at
𝑥0 = [2, 1] with an initial Lagrange multiplier 𝜆 = 0 and an initial estimate
[Fig. 5.42 SQP algorithm iterations (three panels), starting from 𝑥0 and converging to 𝑥∗.]
We repeat this process for subsequent iterations, as shown in Fig. 5.42. The gray contours show the QP subproblem (Eq. 5.72) solved at each iteration: the quadratic objective appears as elliptical contours and the linearized constraint as a straight line. The starting point is infeasible, and the iterations remain infeasible until the last few iterations.

This behavior is common for SQP because although it satisfies the linear approximation of the constraints at each step, it does not necessarily satisfy the constraints of the actual problem, which is nonlinear. As the constraint approximation becomes more accurate near the solution, the nonlinear constraint is then satisfied. Figure 5.43 shows the convergence of the Lagrangian gradient norm, with the characteristic quadratic convergence at the end.

[Fig. 5.43 Convergence history of the norm of the Lagrangian gradient, ‖∇𝑥ℒ‖.]
Example 5.13 SQP applied to inequality constrained problem

We now solve the inequality constrained version of the previous example (Ex. 5.4) with the same initial conditions and general approach. The only difference is that rather than solving the linear system of equations (Eq. 5.62), we have to solve an active-set QP problem at each iteration, as outlined in Alg. 5.4. The iteration history and convergence of the norm of the Lagrangian gradient are plotted in Figs. 5.44 and 5.45, respectively.
[Figs. 5.44 and 5.45: iteration history (three panels) and convergence of ‖∇𝑥ℒ‖ for the inequality constrained problem.]
max(𝜎) ≤ 𝜎yield .
𝜎 𝑗 ≤ 𝜎yield , 𝑗 = 1, . . . , 𝑛 𝜎 .
Interior-point methods use concepts from both SQP and interior penalty methods.∗ These methods form an objective similar to the interior penalty but with the key difference that instead of penalizing the constraints directly, they add slack variables to the set of optimization variables and penalize the slack variables. The resulting formulation is as follows:

$$
\begin{aligned}
\underset{x,\, s}{\text{minimize}} \quad & f(x) - \mu_b \sum_{j=1}^{n_g} \ln s_j \\
\text{subject to} \quad & h(x) = 0 \\
& g(x) + s = 0 .
\end{aligned}
\qquad (5.95)
$$

This formulation turns the inequality constraints into equality constraints and thus avoids the combinatorial problem.

∗ The name interior point stems from early methods based on interior penalty methods that assumed that the initial point was feasible. However, modern interior-point methods can start with infeasible points.
Similar to SQP, we apply Newton’s method to solve for the KKT
conditions. However, instead of solving the KKT conditions of the
original problem (Eq. 5.59), we solve the KKT conditions of the interior-
point formulation (Eq. 5.95).
These slack variables in Eq. 5.95 do not need to be squared, as was
done in deriving the KKT conditions, because the logarithm is only
defined for positive 𝑠 values and acts as a barrier preventing negative
values of 𝑠 (although we need to prevent the line search from producing
negative 𝑠 values, as discussed later). Because 𝑠 is always positive, 𝑔(𝑥∗) < 0 at the solution, which satisfies the inequality constraints.
Like penalty method formulations, the interior-point formulation
(Eq. 5.95) is only equivalent to the original constrained problem in the
limit, as 𝜇𝑏 → 0. Thus, as in the penalty methods, we need to solve a
sequence of solutions to this problem where 𝜇𝑏 approaches zero.
First, we form the Lagrangian for this problem as …

… to the original KKT system (Eq. 5.97) and then made it symmetric, we would have obtained a term with 𝑆⁻², which would make the system more challenging than with the 𝑆⁻¹ term in Eq. 5.100. Figure 5.46 shows the structure and block sizes of the matrix.

[Fig. 5.46: block structure of the interior-point system, with row blocks of sizes 𝑛𝑥 (containing 𝐻ℒ, 𝐽ℎᵀ, 𝐽𝑔ᵀ), 𝑛ℎ (containing 𝐽ℎ), and 𝑛𝑔.]
$$
\hat{f}(x) = f(x) - \mu_b \sum_{i=1}^{n_g} \ln s_i + \frac{\mu_p}{2} \left( \| h(x) \|^2 + \| g(x) + s \|^2 \right) , \qquad (5.101)
$$
where 𝜇𝑏 is the barrier parameter from Eq. 5.95, and 𝜇𝑝 is the penalty
parameter. Additionally, we must enforce an 𝛼max in the line search so
that the implicit constraint on 𝑠 > 0 remains enforced. The maximum
allowed step size can be computed prior to the line search because we
know the value of 𝑠 and 𝑝 𝑠 and require that
𝑠 + 𝛼𝑝 𝑠 ≥ 0 . (5.102)
𝑠 + 𝛼 max 𝑝 𝑠 = 𝜏𝑠 , (5.103)
Inputs:
𝑠: Current slack values
𝑝 𝑠 : Proposed step
𝜏: Fractional tolerance (e.g., 0.005)
Outputs:
𝛼max : Maximum feasible step length
𝛼max = 1
for 𝑖 = 1 to 𝑛𝑔 do
    𝛼 = (𝜏 − 1) 𝑠𝑖 / 𝑝𝑠𝑖
    if 𝛼 > 0 then
        𝛼max = min(𝛼max, 𝛼)
    end if
end for
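This fraction-to-the-boundary rule translates directly to code; a minimal Python sketch (an illustration assuming sequences of slack values and their proposed steps) follows:

def max_step_to_boundary(s, p_s, tau=0.005):
    alpha_max = 1.0
    for s_i, p_i in zip(s, p_s):
        if p_i < 0:  # only decreasing slacks limit the step
            alpha = (tau - 1.0) * s_i / p_i  # solves s_i + alpha p_i = tau s_i
            alpha_max = min(alpha_max, alpha)
    return alpha_max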
𝜆 𝑘+1 = 𝜆 𝑘 + 𝛼 𝜎 𝑝𝜆 (5.106)
𝜎 𝑘+1 = 𝜎 𝑘 + 𝛼 𝜎 𝑝 𝜎 . (5.107)
[Fig. 5.47 Numerical solution of the problem solved graphically in Ex. 5.1: sequential quadratic programming (left) and the interior-point method (right), each converging from 𝑥0 to 𝑥∗.]
[Figure: interior-point iteration history (19 iterations) from 𝑥0 to 𝑥∗.]

$$
\nabla_x \mathcal{L}(x_1, x_2) =
\begin{bmatrix} 1 + \tfrac{1}{2} \sigma x_1 \\ 2 + 2 \sigma x_2 \end{bmatrix}
= \begin{bmatrix} 1 \\ 2 \end{bmatrix} ,
$$

and the gradient of the constraint is

$$
\nabla g(x_1, x_2) =
\begin{bmatrix} \tfrac{1}{2} x_1 \\ 2 x_2 \end{bmatrix}
= \begin{bmatrix} -1 \\ -2 \end{bmatrix} .
$$

The interior-point system of equations (Eq. 5.100) at the starting point is …
[Figure: spring system with springs (𝑘1, ℓ1) and (𝑘2, ℓ2) and cables anchored at (𝑥𝑐1, 𝑦𝑐) and (𝑥𝑐2, 𝑦𝑐).]
The optimization paths for SQP and the interior-point method are shown in Fig. 5.50.

[Fig. 5.50 Optimization of constrained spring system: sequential quadratic programming (left) and the interior-point method (right), with the cable constraints ℓrope1 and ℓrope2 and paths from 𝑥0 to 𝑥∗.]
$$
\bar{g}_{\mathrm{KS}}(\rho, g) = \frac{1}{\rho} \ln \left( \sum_{j=1}^{n_g} \exp\left(\rho g_j\right) \right) , \qquad (5.111)
$$

Kreisselmeier and Steinhauser, Systematic control design by optimizing a vector performance index, 1979.
where 𝜌 is an aggregation factor that determines how close this function
is to the maximum function (Eq. 5.110). As 𝜌 → ∞, 𝑔¯ KS (𝜌, 𝑔) → max(𝑔).
However, as 𝜌 increases, the curvature of 𝑔¯ increases, which can cause
ill-conditioning in the optimization.
The exponential function disproportionately weighs the higher
positive values in the constraint vector, but it does so in a smooth way.
Because the exponential function can easily result in overflow, it is
preferable to use the alternate (but equivalent) form of the KS function,
$$
\bar{g}_{\mathrm{KS}}(\rho, g) = \max_j g_j + \frac{1}{\rho} \ln \left( \sum_{j=1}^{n_g} \exp\left( \rho \left( g_j - \max_j g_j \right) \right) \right) . \qquad (5.112)
$$
The value of 𝜌 should be tuned for each problem, but 𝜌 = 100 works
well for many problems.
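An overflow-safe implementation of the shifted form (Eq. 5.112) is only a few lines; the following sketch assumes NumPy:

import numpy as np

def ks_aggregate(g, rho=100.0):
    g_max = np.max(g)
    # Shifting by max(g) keeps every exponent nonpositive, avoiding overflow
    return g_max + np.log(np.sum(np.exp(rho * (g - g_max)))) / rho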
Consider the constrained spring system from Ex. 5.16. Aggregating the two constraints using the KS function, we can formulate a single constraint as

$$
\bar{g}_{\mathrm{KS}}(x_1, x_2) = \frac{1}{\rho} \ln \left( \exp\left(\rho g_1(x_1, x_2)\right) + \exp\left(\rho g_2(x_1, x_2)\right) \right) ,
$$
where

$$
\begin{aligned}
g_1(x_1, x_2) &= \sqrt{(x_1 + x_{c_1})^2 + (x_2 + y_c)^2} - \ell_{c_1} \\
g_2(x_1, x_2) &= \sqrt{(x_1 - x_{c_2})^2 + (x_2 + y_c)^2} - \ell_{c_2} .
\end{aligned}
$$
Figure 5.51 shows the contour of 𝑔¯ KS = 0 for increasing values of the aggregation
parameter 𝜌.
[Fig. 5.51 KS function aggregation of two constraints, shown for 𝜌KS = 2 (𝑓∗KS = −19.448), 𝜌KS = 10 (𝑓∗KS = −21.653), and 𝜌KS = 100 (𝑓∗KS = −22.090). The optimum of the problem with aggregated constraints, 𝑥∗KS, approaches the true optimum 𝑥∗ as the aggregation parameter 𝜌KS increases.]

For the lowest value of 𝜌, the feasible region is reduced, resulting in a conservative optimum. For the highest value of 𝜌, the optimum obtained with constraint aggregation is graphically indistinguishable from the true optimum, and the objective function value approaches the true optimal value of −22.1358.
$$
\bar{g}_{PN}(\rho) = \max_j |g_j| \left( \sum_{j=1}^{n_g} \left( \frac{g_j}{\max_j g_j} \right)^{\rho} \right)^{1/\rho} . \qquad (5.113)
$$
The absolute value in this equation can be an issue if 𝑔 can take both
positive and negative values because the function is not differentiable
in regions where 𝑔 transitions from positive to negative.
A class of aggregation functions known as induced functions was designed to provide more accurate estimates of max(𝑔) for a given value of 𝜌 than the KS and induced norm functions.110 There are two main types of induced functions: one uses exponentials, and the other uses powers. The induced exponential function is given by

$$
\bar{g}_{IE}(\rho) = \frac{\sum_{j=1}^{n_g} g_j \exp\left(\rho g_j\right)}{\sum_{j=1}^{n_g} \exp\left(\rho g_j\right)} . \qquad (5.114)
$$

110. Kennedy and Hicken, Improved constraint-aggregation methods, 2015.
5.8 Summary
Problems
5.2 Let us modify Ex. 5.2 so that the equality constraint is the negative of the original one—that is,

$$
h(x_1, x_2) = -\tfrac{1}{4} x_1^2 - x_2^2 + 1 = 0 .
$$

Classify the critical points and compare them with the original solution. What does that tell you about the significance of the Lagrange multiplier sign?
5.3 Similar to the previous exercise, consider Ex. 5.4 and modify it so that the inequality constraint is the negative of the original one—that is,

$$
g(x_1, x_2) = -\tfrac{1}{4} x_1^2 - x_2^2 + 1 \le 0 .
$$

Classify the critical points and compare them with the original solution.
be stated as follows:

$$
\begin{aligned}
\text{minimize} \quad & 2 \rho \ell \pi R t & & \text{(mass)} \\
\text{by varying} \quad & R, t & & \text{(radius, wall thickness)} \\
\text{subject to} \quad & \frac{F}{2 \pi R t} - \sigma_{\mathrm{yield}} \le 0 & & \text{(yield stress)} \\
& F - \frac{\pi^3 E R^3 t}{4 \ell^2} \le 0 & & \text{(buckling load)}
\end{aligned}
$$
In the formula for the mass in this objective, 𝜌 is the material density, and we assume that 𝑡 ≪ 𝑅. The first constraint is the compressive stress, which is simply the force divided by the cross-sectional area. The second constraint uses Euler's critical buckling load formula, where 𝐸 is the material Young's modulus, and the second moment of area is replaced with the one corresponding to a circular cross section (𝐼 = 𝜋𝑅³𝑡).
Find the optimum 𝑅 and 𝑡 as a function of the other parameters.
Pick reasonable values for the parameters, and verify your solution
graphically. Plot the gradients of the objective and constraints at
the optimum, and verify the Lagrange multipliers graphically.
5.8 Beam with H section. Consider a cantilevered beam with an H-shaped cross section composed of a web and flanges subject to a transverse load, as shown in Fig. 5.53. The objective is to minimize the structural weight by varying the web thickness 𝑡𝑤 and the flange thickness 𝑡𝑏, subject to stress constraints. The other cross-sectional parameters are fixed; the web height ℎ is 250 mm, and the flange width 𝑏 is 125 mm. The axial stress in the flange and the shear stress in the web should not exceed the corresponding yield values (𝜎yield = 200 MPa and 𝜏yield = 116 MPa, respectively). The optimization problem can be stated as follows:

$$
\begin{aligned}
\text{minimize} \quad & 2 b t_b + h t_w & & \text{(mass)} \\
\text{by varying} \quad & t_b, t_w & & \text{(flange and web thicknesses)} \\
\text{subject to} \quad & \frac{P \ell h}{2 I} - \sigma_{\mathrm{yield}} \le 0 & & \text{(axial stress)} \\
& \frac{1.5 P}{h t_w} - \tau_{\mathrm{yield}} \le 0 & & \text{(shear stress)}
\end{aligned}
$$

The second moment of area for the H section is

$$
I = \frac{h^3}{12} t_w + \frac{b}{6} t_b^3 + \frac{h^2 b}{2} t_b .
$$

Find the optimal values of 𝑡𝑏 and 𝑡𝑤 by solving the KKT conditions analytically. Plot the objective contours and constraints to verify your result graphically.

[Fig. 5.53 Cantilever beam with H section: web height ℎ = 250 mm, flange width 𝑏 = 125 mm, thicknesses 𝑡𝑤 and 𝑡𝑏, tip load 𝑃 = 100 kN, length ℓ = 1 m.]
a. Reproduce the results from Ex. 5.12 (SQP) or Ex. 5.15 (interior
point).
b. Solve Prob. 5.3.
c. Solve Prob. 5.11.
d. Compare the computational cost, precision, and robustness
of your optimizer with those of an existing software package.
5.11 Aircraft fuel tank. A jet aircraft needs to carry a streamlined fuel tank with length ℓ and diameter 𝑑. […]

$$
\begin{aligned}
\text{minimize} \quad & D(\ell, d) \\
\text{by varying} \quad & \ell, d \\
\text{subject to} \quad & V_{\mathrm{req}} - V(\ell, d) \le 0 ,
\end{aligned}
$$

where the air density is 𝜌 = 0.55 kg/m³, and the aircraft speed is 𝑣 = 300 m/s. The drag coefficient of an ellipsoid can be estimated as∗

$$
C_D = C_f \left[ 1 + 1.5 \left( \frac{d}{\ell} \right)^{3/2} + 7 \left( \frac{d}{\ell} \right)^3 \right] .
$$
5.12 Solve a variation of Ex. 5.16 where we replace the system of cables with a cable and a rod that resists both tension and compression. The cable is positioned above the spring, as shown in Fig. 5.55, where 𝑥𝑐 = 2 m and 𝑦𝑐 = 3 m, with a maximum length of ℓ𝑐 = 7.0 m. The rod is positioned at 𝑥𝑟 = 2 m and 𝑦𝑟 = 4 m, with a length of ℓ𝑟 = 4.5 m. How does this change the problem…

[Fig. 5.55 Spring system constrained by a cable (𝑥𝑐, 𝑦𝑐, ℓ𝑐) and a rod (𝑥𝑟, 𝑦𝑟, ℓ𝑟).]
5.14 Solve the same three-bar truss optimization problem in Prob. 5.13
by aggregating all the constraints into a single constraint. Try
different aggregation parameters and see how close you can get
to the solution you obtained for Prob. 5.13.
5.16 Solve the same 10-bar truss optimization problem of Prob. 5.15
by aggregating all the constraints into a single constraint. Try
different aggregation parameters and see how close you can get
to the solution you obtained for Prob. 5.15.
$$
M = \frac{(L/b)(b/2)^2}{2} = \frac{L b}{8} .
$$
Now we assume that the wing structure has the H-shaped cross
section from Prob. 5.8 with a constant thickness of 𝑡𝑤 = 𝑡𝑏 = 4 mm.
We relate the cross-section height ℎsec and width 𝑏sec to the chord
as ℎ sec = 0.1𝑐 and 𝑏sec = 0.4𝑐. With these assumptions, we can
compute the second moment of area 𝐼 in terms of 𝑐.
The maximum bending stress is then

$$
\sigma_{\max} = \frac{M h_{\mathrm{sec}}}{2 I} .
$$
Considering the safety factor of 1.5 and the ultimate load factor of 2.5, the stress constraint is

$$
2.5\, \sigma_{\max} - \frac{\sigma_{\mathrm{yield}}}{1.5} \le 0 ,
$$

where 𝜎yield = 200 MPa.
Solve this problem and compare the solution with the uncon-
strained optimum. Plot the objective contours and constraint to
verify your result graphically.
Computing Derivatives
6
The gradient-based optimization methods introduced in Chapters 4 and 5 require the derivatives of the objective and constraints with respect to the design variables, as illustrated in Fig. 6.1. Derivatives also play a central role in other numerical algorithms. For example, the Newton-based methods introduced in Section 3.8 require the derivatives of the residuals.

The accuracy and computational cost of the derivatives are critical for the success of these methods. Gradient-based methods are only efficient when the derivative computation is also efficient. The computation of derivatives can be the bottleneck in the overall optimization procedure, especially when the model solver needs to be called repeatedly. This chapter introduces the various methods for computing derivatives and discusses the relative advantages of each method.

[Fig. 6.1 Efficient derivative computation is crucial for the overall efficiency of gradient-based optimization: the optimizer passes 𝑥 to the model, which returns 𝑓 and 𝑔; the derivative computation returns ∇𝑓 and 𝐽𝑔.]
By the end of this chapter you should be able to:
[Eq. 6.1: the Jacobian is an (𝑛𝑓 × 𝑛𝑥) matrix.]
Consider the following function with two variables and two functions of interest:

$$
f(x) = \begin{bmatrix} f_1(x_1, x_2) \\ f_2(x_1, x_2) \end{bmatrix}
     = \begin{bmatrix} x_1 x_2 + \sin x_1 \\ x_1 x_2 + x_2^2 \end{bmatrix} .
$$

We can differentiate this symbolically to obtain exact reference values:

$$
\frac{\partial f}{\partial x} =
\begin{bmatrix} x_2 + \cos x_1 & x_1 \\ x_2 & x_1 + 2 x_2 \end{bmatrix} .
$$

We evaluate this at 𝑥 = (𝜋/4, 2), which yields

$$
\frac{\partial f}{\partial x} =
\begin{bmatrix} 2.707 & 0.785 \\ 2.000 & 4.785 \end{bmatrix} .
$$
[Fig. 6.2: three equivalent views of a model: a solver drives the residuals 𝑟(𝑥, 𝑢) = 0 to compute the state 𝑢 and the outputs 𝑓(𝑥, 𝑢); equivalently, the computation is a sequence of operations 𝑣1 = 𝑥, 𝑣2(𝑣1), 𝑣3(𝑣1, 𝑣2), …, 𝑓 = 𝑣𝑛(𝑣1, …).]
codes are driven by reading and writing input and output files. However, the
numbers in the files usually have fewer digits than the code’s working precision.
The ideal solution is to modify the code to be called directly and pass the data
through memory. Another solution is to increase the precision in the files.
fixed-point iteration to determine the value of 𝑓 for a given input 𝑥. That means
we start with a guess for 𝑓 on the right-hand side of that expression to estimate
a new value for 𝑓 , and repeat. In this case, convergence typically happens in
about 10 iterations. Arbitrarily, we choose 𝑥 as the initial guess for 𝑓 , resulting
in the following computational procedure:
Input: 𝑥
𝑓 =𝑥
for 𝑖 = 1 to 10 do
𝑓 = sin(𝑥 + 𝑓 )
end for
return 𝑓
dfdx =
cos(x + sin(x + sin(x + sin(x + sin(x + sin(x + sin(x + sin(x + sin(x +
sin (2*x))))))))))*( cos(x + sin(x + sin(x + sin(x + sin(x + sin(x +
sin(x + sin(x + sin (2*x)))))))))*( cos(x + sin(x + sin(x + sin(x +
sin(x + sin(x + sin(x + sin (2*x))))))))*( cos(x + sin(x + sin(x + sin
(x + sin(x + sin(x + sin (2*x)))))))*( cos(x + sin(x + sin(x + sin(x +
sin(x + sin (2*x))))))*( cos(x + sin(x + sin(x + sin(x + sin (2*x)))))
*( cos(x + sin(x + sin(x + sin (2*x))))*( cos(x + sin(x + sin (2*x)))*(
cos(x + sin (2*x))*(2* cos (2*x) + 1) + 1) + 1) + 1) + 1) + 1) + 1) +
1) + 1)
$$
f(x + h \hat{e}_j) = f(x) + h \frac{\partial f}{\partial x_j} + \frac{h^2}{2!} \frac{\partial^2 f}{\partial x_j^2} + \frac{h^3}{3!} \frac{\partial^3 f}{\partial x_j^3} + \ldots , \qquad (6.3)
$$
where 𝑒̂𝑗 is the unit vector in the 𝑗th direction. Solving this for the first derivative, we obtain the finite-difference formula,

$$
\frac{\partial f}{\partial x_j} = \frac{f(x + h \hat{e}_j) - f(x)}{h} + \mathcal{O}(h) , \qquad (6.4)
$$

which approximates the exact derivative,

$$
\frac{\partial f}{\partial x_j} = \lim_{h \to 0} \frac{f(x + h \hat{e}_j) - f(x)}{h} \approx \frac{f(x + h \hat{e}_j) - f(x)}{h} . \qquad (6.5)
$$

The truncation error is 𝒪(ℎ), and therefore this is a first-order approximation. The difference between this approximation and the exact derivative is illustrated in Fig. 6.3.

[Fig. 6.3 Exact derivative compared with a forward finite-difference approximation (Eq. 6.4).]

The backward-difference approximation can be obtained by replacing ℎ with −ℎ to yield

$$
\frac{\partial f}{\partial x_j} = \frac{f(x) - f(x - h \hat{e}_j)}{h} + \mathcal{O}(h) , \qquad (6.6)
$$
see that this estimate is closer to the actual derivative than the forward
difference.
Even more accurate estimates can be derived by combining differ-
ent Taylor series expansions to obtain higher-order truncation error
$$
\frac{\partial^2 f}{\partial x_j^2} = \frac{f(x + 2 h \hat{e}_j) - 2 f(x) + f(x - 2 h \hat{e}_j)}{4 h^2} + \mathcal{O}\left(h^2\right) . \qquad (6.9)
$$
$$
\nabla_p f = \frac{f(x + h p) - f(x)}{h} + \mathcal{O}(h) . \qquad (6.10)
$$
[Fig. 6.5 Computing a directional derivative ∇𝑝𝑓 using a forward finite difference: the function is sampled at 𝑥 and 𝑥 + ℎ𝑝 along the direction 𝑝.]
Table 6.1 lists the data for the forward difference, where we can see the number of digits in the difference Δ𝑓 decreasing with decreasing step size until the difference vanishes entirely.

[Fig. 6.7 As the step size ℎ decreases, the total error in the finite-difference estimates initially decreases because of a reduced truncation error. However, subtractive cancellation takes over when the step is small enough and eventually yields an entirely wrong derivative.]

Tip 6.2 When using finite differencing, always perform a step-size study

In practice, most gradient-based optimizers use finite differences by default to compute the gradients. Given the potential for inaccuracies, finite differences are often the culprit in cases where gradient-based optimizers fail to converge. Although some of these optimizers try to estimate a good step size, there is no substitute for a step-size study by the user. The step-size study must be
Table 6.1 Subtractive cancellation leads to a loss of precision and, ultimately, inaccurate finite-difference estimates.

ℎ        𝑓(𝑥 + ℎ)              Δ𝑓                     d𝑓/d𝑥
10−1     4.9562638252880662    0.4584837713419043     4.58483771
10−2     4.5387928890592475    0.0410128351130856     4.10128351
10−4     4.4981854440562818    0.0004053901101200     4.05390110
10−6     4.4977841073787870    0.0000040534326251     4.05343263
10−8     4.4977800944804409    0.0000000405342790     4.05342799
10−10    4.4977800543515052    0.0000000004053433     4.05344203
10−12    4.4977800539502155    0.0000000000040536     4.05453449
10−14    4.4977800539462027    0.0000000000000409     4.17443857
10−16    4.4977800539461619    0.0000000000000000     0.00000000
10−18    4.4977800539461619    0.0000000000000000     0.00000000
Exact    4.4977800539461619                           4.05342789
performed for all variables and does not necessarily apply to the whole design
space. Therefore, repeating this study for other values of 𝑥 might be required.
Because we do not usually know the exact derivative, we cannot plot the
error as we did in Fig. 6.7. However, we can always tabulate the derivative
estimates as we did in Table 6.1. In the last column, we can see from the pattern
of digits that match the previous step size that ℎ = 10−8 is the best step size in
this case.
Finite-difference approximations are sometimes used with larger steps than would be desirable from an accuracy standpoint to help smooth out numerical noise or discontinuities in the model. This approach sometimes works, but it is better to address these problems within the model whenever possible. Figure 6.8 shows an example of this effect. For this noisy function, the larger step ignores the noise and gives the correct trend, whereas the smaller step results in an estimate with the wrong sign.

[Fig. 6.8: a noisy function sampled near 𝑥 = 2; a forward difference with a large step captures the correct trend, whereas a much smaller step yields a derivative with the wrong sign.]
This is similar to the expression for the convergence criterion in Eq. 4.24.
Although the absolute step size usually differs for each 𝑥 𝑗 , the relative
step size ℎ is often the same and is user-specified.
Inputs:
𝑥: Point about which to compute the gradient
𝑓 : Vector of functions of interest
Outputs:
𝐽: Jacobian of 𝑓 with respect to 𝑥
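A minimal implementation of this algorithm (an illustrative Python sketch assuming NumPy; the relative-step heuristic with a safeguard for small 𝑥𝑗 is one common choice, not the only one) might read:

import numpy as np

def fd_jacobian(f, x, h_rel=1e-8):
    f0 = np.atleast_1d(f(x))
    J = np.zeros((f0.size, x.size))
    for j in range(x.size):
        x_p = x.copy()
        step = h_rel * (1.0 + abs(x_p[j]))  # relative step with safeguard
        x_p[j] += step
        # Forward difference (Eq. 6.4), one column of the Jacobian per variable
        J[:, j] = (np.atleast_1d(f(x_p)) - f0) / step
    return J

The cost is one extra function evaluation per input variable, which is why finite differencing scales poorly with the number of design variables.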
6.5.1 Theory
The complex-step method can also be derived using a Taylor series expansion. Rather than using a real step ℎ, as we did to derive the finite-difference formulas, we use a pure imaginary step, 𝑖ℎ.∗ If 𝑓 is a real function in real variables and is also analytic (differentiable in the complex domain), we can expand it in a Taylor series about a real point 𝑥 as follows:

$$
f(x + i h \hat{e}_j) = f(x) + i h \frac{\partial f}{\partial x_j} - \frac{h^2}{2} \frac{\partial^2 f}{\partial x_j^2} - i \frac{h^3}{6} \frac{\partial^3 f}{\partial x_j^3} + \ldots . \qquad (6.12)
$$

Taking the imaginary parts of both sides of this equation and dividing by ℎ, we have

$$
\frac{\partial f}{\partial x_j} = \frac{\operatorname{Im}\left[ f(x + i h \hat{e}_j) \right]}{h} + \mathcal{O}\left(h^2\right) .
$$

∗ This method originated with the work of Lyness and Moler,112 who developed formulas that use complex arithmetic for computing the derivatives of real functions of arbitrary order with arbitrary order truncation error, much like the Taylor series combination approach in finite differences. Later, Squire and Trapp49 observed that the simplest of these formulas was convenient for computing first derivatives.
49. Squire and Trapp, Using complex variables to estimate derivatives of real functions, 1998.
[Fig. 6.9 Unlike finite differences, the complex-step method is not subject to subtractive cancellation. Therefore, the error is the same as that of the function evaluation (machine zero in this case).]
Table 6.2 For a small enough step, the real part of the complex evaluation is identical to the real evaluation, and the derivative matches to machine precision.

ℎ        Re(𝑓)                 Im(𝑓)/ℎ
10−1     4.4508662116993065    4.0003330384671729
10−2     4.4973069409015318    4.0528918144659292
10−4     4.4977800066307951    4.0534278402854467
10−6     4.4977800539414297    4.0534278938932582
10−8     4.4977800539461619    4.0534278938986201
10−10    4.4977800539461619    4.0534278938986201
10−12    4.4977800539461619    4.0534278938986201
10−14    4.4977800539461619    4.0534278938986210
10−16    4.4977800539461619    4.0534278938986201
10−18    4.4977800539461619    4.0534278938986210
10−200   4.4977800539461619    4.0534278938986201
Exact    4.4977800539461619    4.0534278938986201
Inputs:
𝑥: Point about which to compute the gradient
𝑓 : Function of interest
Outputs:
𝐽: Jacobian of 𝑓 about point 𝑥
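The corresponding implementation is remarkably simple; the sketch below (an illustration assuming NumPy and a function 𝑓 written to accept complex inputs) computes one Jacobian column per perturbation:

import numpy as np

def complex_step_jacobian(f, x, h=1e-200):
    x = x.astype(complex)
    f0 = np.atleast_1d(f(x))
    J = np.zeros((f0.size, x.size))
    for j in range(x.size):
        x_p = x.copy()
        x_p[j] += 1j * h                          # pure imaginary perturbation
        J[:, j] = np.atleast_1d(f(x_p)).imag / h  # derivative to machine precision
    return J

Because there is no subtraction, the step can be made tiny (here 10⁻²⁰⁰, as in Table 6.2) without any loss of precision.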
Once you have made your code complex, the first test you should perform
is to run your code with no imaginary perturbation and verify that no variable
ends up with a nonzero imaginary part. If any number in the code acquires a
nonzero imaginary part, something is wrong, and you must trace the source of
the error. This is a necessary but not sufficient test.
the convergence criterion so that it checks for the convergence of the imaginary part, in addition to the existing check on the real part. The imaginary part, which contains the derivative information, often lags relative to the real part in terms of convergence, as shown in Fig. 6.12. Therefore, if the solver only checks for the real part, it might yield a derivative with a precision lower than the function value. In this example, 𝑓 is the drag coefficient given by a computational fluid dynamics solver and 𝜀 is the relative error for each part.

[Fig. 6.12 The imaginary parts of the variables often lag relative to the real parts in iterative solvers: the relative error 𝜀 of Re 𝑓 drops before that of Im 𝑓 over the iterations.]

50. Martins et al., The complex-step derivative approximation, 2003.
6.6 Algorithmic Differentiation
Algorithmic differentiation (AD)—also known as computational differentiation or automatic differentiation—is a well-known approach based on the systematic application of the chain rule to computer programs.115,116 The derivatives computed with AD can match the precision of the function evaluation. The cost of computing derivatives with AD can be proportional to either the number of variables or the number of functions, depending on the type of AD, making it flexible.

Another attractive feature of AD is that its implementation is largely automatic, thanks to various AD tools. To explain AD, we start by outlining the basic theory with simple examples. Then we explore how the method is implemented in practice with further examples.

115. Griewank, Evaluating Derivatives, 2000.
116. Naumann, The Art of Differentiating Computer Programs—An Introduction to Algorithmic Differentiation, 2011.
6.6.2 Forward-Mode AD

The chain rule for the forward mode can be written as

d𝑣ᵢ/d𝑣ⱼ = Σ_{𝑘=𝑗}^{𝑖−1} (𝜕𝑣ᵢ/𝜕𝑣ₖ)(d𝑣ₖ/d𝑣ⱼ) .   (6.21)

Defining the tangent variables 𝑣̇ᵢ ≡ d𝑣ᵢ/d𝑣ⱼ for a chosen input 𝑣ⱼ, this becomes

𝑣̇ᵢ = Σ_{𝑘=𝑗}^{𝑖−1} (𝜕𝑣ᵢ/𝜕𝑣ₖ) 𝑣̇ₖ .   (6.22)

For a four-variable example with input 𝑥 ≡ 𝑣₁ and output 𝑓 ≡ 𝑣₄, the forward sequence is

𝑣̇₁ = 1
𝑣̇₂ = (𝜕𝑣₂/𝜕𝑣₁) 𝑣̇₁
𝑣̇₃ = (𝜕𝑣₃/𝜕𝑣₁) 𝑣̇₁ + (𝜕𝑣₃/𝜕𝑣₂) 𝑣̇₂   (6.23)
𝑣̇₄ = (𝜕𝑣₄/𝜕𝑣₁) 𝑣̇₁ + (𝜕𝑣₄/𝜕𝑣₂) 𝑣̇₂ + (𝜕𝑣₄/𝜕𝑣₃) 𝑣̇₃ ≡ d𝑓/d𝑥 .

The total-derivative Jacobian of these variables is lower triangular:

      ⎡ 1          0          0          0 ⎤
𝐽𝑣 =  ⎢ d𝑣₂/d𝑣₁    1          0          0 ⎥   (6.24)
      ⎢ d𝑣₃/d𝑣₁    d𝑣₃/d𝑣₂    1          0 ⎥
      ⎣ d𝑣₄/d𝑣₁    d𝑣₄/d𝑣₂    d𝑣₄/d𝑣₃    1 ⎦
By setting the seed 𝑣¤ 1 = 1 and using the forward chain rule (Eq. 6.22), we
have computed the first column of 𝐽𝑣 from top to bottom. This column
corresponds to the tangent with respect to 𝑣1 . Using forward-mode
AD, obtaining derivatives for other outputs is free (e.g., d𝑣3 /d𝑣1 ≡ 𝑣¤ 3
in Eq. 6.23).
However, if we want the derivatives with respect to additional
inputs, we would need to set a different seed and evaluate an entire
set of similar calculations. For example, if we wanted d𝑣4 /d𝑣2 , we
would set the seed as 𝑣¤ 2 = 1 and evaluate the equations for 𝑣¤ 3 and 𝑣¤ 4 ,
where we would now have d𝑣 4 /d𝑣2 = 𝑣¤ 4 . This would correspond to
computing the second column in 𝐽𝑣 (Eq. 6.24).
Thus, the cost of the forward mode scales linearly with the number
of inputs we are interested in and is independent of the number of
outputs.
Consider the function with two inputs and two outputs from Ex. 6.1. We
could evaluate the explicit expressions in this function using only two lines of
code. However, to make the AD process more apparent, we write the code such
that each line has a single unary or binary operation, which is how a computer
ends up evaluating the expression:
𝑣₁ = 𝑣₁(𝑥₁) = 𝑥₁
𝑣₂ = 𝑣₂(𝑥₂) = 𝑥₂
𝑣₃ = 𝑣₃(𝑣₁, 𝑣₂) = 𝑣₁𝑣₂
𝑣₄ = 𝑣₄(𝑣₁) = sin 𝑣₁
𝑣₅ = 𝑣₅(𝑣₃, 𝑣₄) = 𝑣₃ + 𝑣₄ = 𝑓₁
𝑣₆ = 𝑣₆(𝑣₂) = 𝑣₂²
𝑣₇ = 𝑣₇(𝑣₃, 𝑣₆) = 𝑣₃ + 𝑣₆ = 𝑓₂ .
Using the forward mode, we set the seeds 𝑣̇₁ = 1 and 𝑣̇₂ = 0 to obtain the derivatives with respect to 𝑥₁. When using the chain rule (Eq. 6.22), only one or two partial
derivatives are nonzero in each sum because the operations are either unary
or binary in this case. For example, the addition operation that computes
𝑣5 does not depend explicitly on 𝑣2 , so 𝜕𝑣5 /𝜕𝑣2 = 0. To further elaborate,
when evaluating the operation 𝑣5 = 𝑣3 + 𝑣4 , we do not need to know how 𝑣3
was computed; we just need to know the value of the two numbers we are
adding. Similarly, when evaluating the derivative 𝜕𝑣5 /𝜕𝑣2 , we do not need
to know how or whether 𝑣3 and 𝑣4 depended on 𝑣2 ; we just need to know
how this one operation depends on 𝑣2 . So even though symbolic derivatives
are involved in individual operations, the overall process is distinct from
symbolic differentiation. We do not combine all the operations and end up
with a symbolic derivative. We develop a computational procedure to compute
the derivative that ends up with a number for a given input—similar to the
computational procedure that computes the functional outputs and does not
produce a symbolic functional output.
Say we want to compute d 𝑓2 /d𝑥1 , which in our example corresponds to
d𝑣7 /d𝑣1 . The evaluation point is the same as in Ex. 6.1: 𝑥 = (𝜋/4, 2). Using the
chain rule (Eq. 6.22) and considering only the nonzero partial derivative terms,
we get the following sequence:
𝑣̇₁ = 1
𝑣̇₂ = 0
𝑣̇₃ = (𝜕𝑣₃/𝜕𝑣₁) 𝑣̇₁ + (𝜕𝑣₃/𝜕𝑣₂) 𝑣̇₂ = 𝑣₂ · 𝑣̇₁ + 𝑣₁ · 0 = 2
𝑣̇₄ = (𝜕𝑣₄/𝜕𝑣₁) 𝑣̇₁ = cos 𝑣₁ · 𝑣̇₁ = 0.707…
𝑣̇₅ = (𝜕𝑣₅/𝜕𝑣₃) 𝑣̇₃ + (𝜕𝑣₅/𝜕𝑣₄) 𝑣̇₄ = 1 · 𝑣̇₃ + 1 · 𝑣̇₄ = 2.707… ≡ 𝜕𝑓₁/𝜕𝑥₁   (6.25)
𝑣̇₆ = (𝜕𝑣₆/𝜕𝑣₂) 𝑣̇₂ = 2𝑣₂ · 𝑣̇₂ = 0
𝑣̇₇ = (𝜕𝑣₇/𝜕𝑣₃) 𝑣̇₃ + (𝜕𝑣₇/𝜕𝑣₆) 𝑣̇₆ = 1 · 𝑣̇₃ + 1 · 𝑣̇₆ = 2 ≡ 𝜕𝑓₂/𝜕𝑥₁ .
This sequence is illustrated in matrix form in Fig. 6.16. The procedure is equivalent to performing forward substitution in this linear system. We now have a procedure (not a symbolic expression) for computing d𝑓₂/d𝑥₁ for any (𝑥₁, 𝑥₂). The dependencies of these operations are shown in Fig. 6.17 as a computational graph.
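As a concrete illustration, here is a minimal Python sketch of this tangent propagation for the code of Ex. 6.5 (the variable names are ours):

import numpy as np

# Forward-mode tangent propagation for Ex. 6.5/6.6 at x = (pi/4, 2),
# with seeds v1dot = 1, v2dot = 0 (derivatives with respect to x1).
x1, x2 = np.pi / 4, 2.0
v1, v1d = x1, 1.0
v2, v2d = x2, 0.0
v3, v3d = v1 * v2, v2 * v1d + v1 * v2d   # product rule
v4, v4d = np.sin(v1), np.cos(v1) * v1d   # derivative of sin
v5, v5d = v3 + v4, v3d + v4d             # f1 and df1/dx1
v6, v6d = v2**2, 2 * v2 * v2d
v7, v7d = v3 + v6, v3d + v6d             # f2 and df2/dx1
print(v5d, v7d)                          # 2.7071..., 2.0 (matches Eq. 6.25)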
A gradient can be spot-checked by projecting it onto a chosen direction 𝑝 (say, 𝑝 = [1, …, 1]) and then comparing the result to the directional derivative in that direction. If the result matches the reference, then all the gradient elements are most likely correct (it is good practice to try a couple more directions just to be sure). However, if the result does not match, this directional derivative does not reveal which gradient elements are incorrect.
6.6.3 Reverse-Mode AD

The reverse mode is also based on the chain rule but uses the alternative form

d𝑣ᵢ/d𝑣ⱼ = Σ_{𝑘=𝑗+1}^{𝑖} (𝜕𝑣ₖ/𝜕𝑣ⱼ)(d𝑣ᵢ/d𝑣ₖ) .   (6.26)

Defining the adjoint variables 𝑣̄ⱼ ≡ d𝑣ᵢ/d𝑣ⱼ for a chosen output 𝑣ᵢ, this becomes

𝑣̄ⱼ = Σ_{𝑘=𝑗+1}^{𝑖} (𝜕𝑣ₖ/𝜕𝑣ⱼ) 𝑣̄ₖ .   (6.27)
This chain rule propagates the total derivatives backward after setting the reverse seed 𝑣̄ᵢ = 1, as shown in Fig. 6.18. This affects all the variables on which the seeded variable depends.

[Fig. 6.18: The reverse mode propagates derivatives from the seeded output 𝑣̄ᵢ to all the variables on which the seeded output variable depends.]

The reverse-mode variables 𝑣̄ represent the derivatives of one output, 𝑖, with respect to all the input variables (instead of the derivatives of all the outputs with respect to one input, 𝑗, in the forward mode). Once we are done applying the reverse chain rule (Eq. 6.27) for the chosen output variable 𝑣ᵢ, we end up with the total derivatives d𝑣ᵢ/d𝑣ⱼ for all 𝑗 < 𝑖.

Applying the reverse mode to the same four-variable example as before, we get the following sequence of derivative computations (we set 𝑖 = 4 and decrement 𝑗):

𝑣̄₄ = 1
𝑣̄₃ = (𝜕𝑣₄/𝜕𝑣₃) 𝑣̄₄
𝑣̄₂ = (𝜕𝑣₃/𝜕𝑣₂) 𝑣̄₃ + (𝜕𝑣₄/𝜕𝑣₂) 𝑣̄₄   (6.28)
𝑣̄₁ = (𝜕𝑣₂/𝜕𝑣₁) 𝑣̄₂ + (𝜕𝑣₃/𝜕𝑣₁) 𝑣̄₃ + (𝜕𝑣₄/𝜕𝑣₁) 𝑣̄₄ ≡ d𝑓/d𝑥 .
The partial derivatives of 𝑣 must be computed for 𝑣 4 first, then 𝑣3 , and
so on. Therefore, we have to traverse the code in reverse. In practice,
not every variable depends on every other variable, so a computational
graph is created during code evaluation. Then, when computing the
adjoint variables, we traverse the computational graph in reverse. As
before, the derivatives we need to compute in each line are only partial
derivatives.
Recall the Jacobian of the variables,

      ⎡ 1          0          0          0 ⎤
𝐽𝑣 =  ⎢ d𝑣₂/d𝑣₁    1          0          0 ⎥   (6.29)
      ⎢ d𝑣₃/d𝑣₁    d𝑣₃/d𝑣₂    1          0 ⎥
      ⎣ d𝑣₄/d𝑣₁    d𝑣₄/d𝑣₂    d𝑣₄/d𝑣₃    1 ⎦
By setting 𝑣¯ 4 = 1 and using the reverse chain rule (Eq. 6.27), we have
computed the last row of 𝐽𝑣 from right to left. This row corresponds
to the gradient of 𝑓 ≡ 𝑣4 . Using the reverse mode of AD, obtaining
derivatives with respect to additional inputs is free (e.g., d𝑣 4 /d𝑣 2 ≡ 𝑣¯ 2
in Eq. 6.28).
However, if we wanted the derivatives of additional outputs, we
would need to evaluate a different sequence of derivatives. For example,
if we wanted d𝑣₃/d𝑣₁, we would set 𝑣̄₃ = 1 and evaluate the expressions for 𝑣̄₂ and 𝑣̄₁ in Eq. 6.28, where d𝑣₃/d𝑣₁ ≡ 𝑣̄₁. Thus, the cost of
the reverse mode scales linearly with the number of outputs and is
independent of the number of inputs.
One complication with the reverse mode is that the resulting se-
quence of derivatives requires the values of the variables, starting with
the last ones and progressing in reverse. For example, the partial deriva-
tive in the second operation of Eq. 6.28 might involve 𝑣3 . Therefore, the
code needs to run in a forward pass first, and all the variables must be
stored for use in the reverse pass, which increases memory usage.
Suppose we want to compute 𝜕 𝑓2 /𝜕𝑥1 for the function from Ex. 6.5. First,
we need to run the original code (a forward pass) and store the values of all
the variables because they are necessary in the reverse chain rule (Eq. 6.26)
to compute the numerical values of the partial derivatives. Furthermore, the
reverse chain rule requires the information on all the dependencies to determine
which partial derivatives are nonzero. The forward pass and dependencies are
represented by the computational graph shown in Fig. 6.19.
[Fig. 6.19: Computational graph for the forward pass: 𝑥₁, 𝑥₂ → 𝑣₁, 𝑣₂; 𝑣₃ = 𝑣₁𝑣₂; 𝑣₄ = sin 𝑣₁; 𝑣₅ = 𝑣₃ + 𝑣₄ = 𝑓₁; 𝑣₆ = 𝑣₂²; 𝑣₇ = 𝑣₃ + 𝑣₆ = 𝑓₂.]
Using the chain rule (Eq. 6.26) and setting the seed for the desired variable, 𝑣̄₇ = 1, we get

𝑣̄₇ = 1
𝑣̄₆ = (𝜕𝑣₇/𝜕𝑣₆) 𝑣̄₇ = 𝑣̄₇ = 1
𝑣̄₅ = 0   (𝑣₇ does not depend on 𝑣₅)
𝑣̄₄ = (𝜕𝑣₅/𝜕𝑣₄) 𝑣̄₅ = 𝑣̄₅ = 0   (6.30)
𝑣̄₃ = (𝜕𝑣₇/𝜕𝑣₃) 𝑣̄₇ + (𝜕𝑣₅/𝜕𝑣₃) 𝑣̄₅ = 𝑣̄₇ + 𝑣̄₅ = 1
𝑣̄₂ = (𝜕𝑣₆/𝜕𝑣₂) 𝑣̄₆ + (𝜕𝑣₃/𝜕𝑣₂) 𝑣̄₃ = 2𝑣₂𝑣̄₆ + 𝑣₁𝑣̄₃ = 4.785… ≡ 𝜕𝑓₂/𝜕𝑥₂
𝑣̄₁ = (𝜕𝑣₄/𝜕𝑣₁) 𝑣̄₄ + (𝜕𝑣₃/𝜕𝑣₁) 𝑣̄₃ = (cos 𝑣₁) 𝑣̄₄ + 𝑣₂𝑣̄₃ = 2 ≡ 𝜕𝑓₂/𝜕𝑥₁ .
After running the forward evaluation and storing the elements of 𝑣, we can run the reverse pass shown in Fig. 6.20. This reverse pass is illustrated in matrix form in Fig. 6.21. The procedure is equivalent to performing back substitution in this linear system.

[Fig. 6.20: Computational graph for the reverse mode, showing the backward propagation of the derivative of 𝑓₂: 𝑣̄₁ = (cos 𝑣₁)𝑣̄₄ + 𝑣₂𝑣̄₃ = 𝜕𝑓₂/𝜕𝑥₁ and 𝑣̄₂ = 2𝑣₂𝑣̄₆ + 𝑣₁𝑣̄₃ = 𝜕𝑓₂/𝜕𝑥₂.]
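A minimal Python sketch of this forward-then-reverse process for the same example follows (the names are ours; the stored forward values stand in for the computational graph):

import numpy as np

# Reverse-mode AD for Ex. 6.5/6.7: run the forward pass and store the
# variables, then accumulate the adjoints in reverse with the seed vb7 = 1.
x1, x2 = np.pi / 4, 2.0
v1, v2 = x1, x2
v3 = v1 * v2
v4 = np.sin(v1)
v5 = v3 + v4                      # f1
v6 = v2**2
v7 = v3 + v6                      # f2

vb7 = 1.0                         # seed the output f2
vb6 = vb7                         # dv7/dv6 = 1
vb5 = 0.0                         # v7 does not depend on v5
vb4 = vb5                         # dv5/dv4 = 1
vb3 = vb7 + vb5                   # v3 feeds both v7 and v5
vb2 = 2 * v2 * vb6 + v1 * vb3     # = df2/dx2
vb1 = np.cos(v1) * vb4 + v2 * vb3 # = df2/dx1
print(vb1, vb2)                   # 2.0, 4.785... (matches Eq. 6.30)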
evaluating only one more line of code. Conversely, if we want the derivatives of 𝑓₁, a whole new set of computations is needed.

In forward mode, the computation of a given derivative, 𝑣̇ᵢ, requires the partial derivatives of the line of code that computes 𝑣ᵢ with respect to its inputs. In the reverse case, however, to compute a given derivative, 𝑣̄ⱼ, we require the partial derivatives with respect to 𝑣ⱼ of the functions that the current variable 𝑣ⱼ affects. Knowledge of the functions a variable affects is not encoded in that variable's computation, which is why the computational graph is required.

[Fig. 6.21: Matrix form of the reverse pass, with the seed 𝑣̄₇ = 1.]
AD tools that use source code transformation process the whole source
code automatically with a parser and add lines of code that compute
the derivatives. The added code is highlighted in Exs. 6.7 and 6.8.
Running an AD source transformation tool on the code from Ex. 6.2 produces
the code that follows.
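The transformed code is illustrated by the following hedged Python sketch of what a source-transformation tool typically emits for the loop 𝑓 = sin(𝑥 + 𝑓) of Ex. 6.2 (our illustration, not any specific tool's verbatim output):

import numpy as np

# Forward-mode source transformation: the tool inserts a derivative line
# next to each assignment.
def f_and_dfdx(x, xdot=1.0):
    f = x
    fdot = xdot                               # inserted: derivative of f = x
    for _ in range(10):
        fdot = np.cos(x + f) * (xdot + fdot)  # inserted: derivative of sin(x + f)
        f = np.sin(x + f)
    return f, fdot

print(f_and_dfdx(1.0))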
The AD tool added a new line after each variable assignment that computes the corresponding derivative. We can then set the seed, 𝑥̇ = 1, and run the code. As the loop proceeds, 𝑓̇ accumulates the derivative as 𝑓 is successively updated.
The first loop is identical to the original code except for one line. Because the derivatives that accumulate in the reverse loop depend on the intermediate values of the variables, we need to store all the variables in the forward loop. We store and retrieve the variables using a stack, hence the call to “push”.†

The second loop, which runs in reverse, is where the derivatives are computed. We set the reverse seed, 𝑓̄ = 1, and then the adjoint variables accumulate the derivatives back to the start.

† A stack, also known as last in, first out (LIFO), is a data structure that stores a one-dimensional array. We can only add an element to the top of the stack (push) and take the element from the top of the stack (pop).
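A corresponding Python sketch of the reverse-mode transformation of the same loop, with the forward pass pushing intermediate values onto a stack and the reverse pass popping them, might look as follows (our illustration):

import numpy as np

# Reverse-mode source transformation of the loop f = sin(x + f) from Ex. 6.2.
def dfdx_reverse(x, n=10):
    stack = []
    f = x                      # f = x, so df0/dx = 1
    for _ in range(n):
        stack.append(f)        # "push": store the value the reverse pass needs
        f = np.sin(x + f)
    fbar, xbar = 1.0, 0.0      # reverse seed on the output
    for _ in range(n):
        fold = stack.pop()     # retrieve values in reverse order
        t = np.cos(x + fold) * fbar
        xbar += t              # contribution through the explicit x argument
        fbar = t               # contribution through the previous f
    return xbar + fbar         # add the seed of the initial assignment f = x

print(dfdx_reverse(1.0))       # same value as the forward-mode sketch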
Operator Overloading
where we compute the original function value in the first term, and the
second term carries the derivative of the multiplication.
Although we wrote the two parts explicitly in Eq. 6.31, the source
code would only show a normal multiplication, such as 𝑣3 = 𝑣1 · 𝑣 2 .
However, each of these variables would be of the new type and carry the
corresponding 𝑣¤ quantities. By overloading all the required operations,
the computations happen “behind the scenes”, and the source code
does not have to be changed, except to declare all the variables to be of
the new type and to set the seed. Example 6.9 lists the original code
from Ex. 6.2 with notes on the actual computations that are performed
as a result of overloading.
Using the derived data types and operator overloading approach in forward
mode does not change the code listed in Ex. 6.2. The AD tool provides
overloaded versions of the functions in use, which in this case are assignment,
addition, and sine. These functions are overloaded as follows:
𝑣₂ = 𝑣₁  ⇒  (𝑣₂, 𝑣̇₂) = (𝑣₁, 𝑣̇₁)
𝑣₁ + 𝑣₂  ⇒  (𝑣₁, 𝑣̇₁) + (𝑣₂, 𝑣̇₂) ≡ (𝑣₁ + 𝑣₂, 𝑣̇₁ + 𝑣̇₂)
sin(𝑣)  ⇒  sin(𝑣, 𝑣̇) ≡ (sin(𝑣), cos(𝑣)𝑣̇) .
In this case, the source code is unchanged, but additional computations occur
through the overloaded functions. We reproduce the code that follows with
notes on the hidden operations that take place.
Input: 𝑥   (𝑥 is of a new data type with two components, (𝑥, 𝑥̇))
𝑓 = 𝑥   ((𝑓, 𝑓̇) = (𝑥, 𝑥̇) through the overloading of the “=” operation)
for 𝑖 = 1 to 10 do
    𝑓 = sin(𝑥 + 𝑓)   (code is unchanged, but overloading computes the derivative‡)
end for
return 𝑓   (the new data type includes 𝑓̇, which is d𝑓/d𝑥)

‡ The overloading of “+” computes (𝑣, 𝑣̇) = (𝑥 + 𝑓, 𝑥̇ + 𝑓̇), and then the overloading of “sin” computes (𝑓, 𝑓̇) = (sin(𝑣), cos(𝑣)𝑣̇).
We set the seed, 𝑥̇ = 1, and each overloaded operation carries out the corresponding derivative computation behind the scenes. As the loop repeats, 𝑓̇ accumulates the derivative as 𝑓 is successively updated.
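A minimal Python sketch of this dual-number approach follows (our illustration; real AD tools overload many more operations):

import numpy as np

# Forward-mode AD via operator overloading: a dual number carries
# (value, derivative) through the unchanged-looking code of Ex. 6.2.
class Dual:
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, other):
        return Dual(self.val + other.val, self.dot + other.dot)

def dsin(u):
    # Overloaded sine: propagates the tangent alongside the value
    return Dual(np.sin(u.val), np.cos(u.val) * u.dot)

x = Dual(1.0, 1.0)   # seed xdot = 1
f = x
for _ in range(10):
    f = dsin(x + f)
print(f.val, f.dot)  # function value and df/dx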
The source code transformation and the operator overloading approaches each have their relative advantages and disadvantages. The overloading approach is much more elegant because the original code stays practically the same and can be maintained directly. On the other hand, the source transformation approach enlarges the original code and results in less readable code, making it hard to work with. Still, it is easier to see what operations take place when debugging. Instead of maintaining source code transformed by AD, it is advisable to work with the original source and devise a workflow where the parser is rerun before compiling a new version.

One advantage of the source code transformation approach is that it tends to yield faster code and allows more straightforward compile-time optimizations. The overloading approach requires a language that supports user-defined data types and operator overloading, whereas source transformation does not. Developing a source transformation AD tool is usually more challenging than developing the overloading approach because it requires an elaborate parser that understands the source syntax.

121. Bradbury et al., JAX: Composable Transformations of Python+NumPy Programs, 2018.
122. Revels et al., Forward-mode automatic differentiation in Julia, 2016.
123. Neidinger, Introduction to automatic differentiation and MATLAB object-oriented programming, 2010.
124. Betancourt, A geometric theory of higher-order automatic differentiation, 2018.
𝐶̇ = 𝐴̇𝐵 + 𝐴𝐵̇ .   (6.33)

The idea is to use 𝐴̇ and 𝐵̇ from the AD code preceding the operation and then manually implement this formula (bypassing any AD of the code that performs that operation) to obtain 𝐶̇, as shown in Fig. 6.24. Then we can use 𝐶̇ to seed the remainder of the AD code.

The reverse mode of the multiplication yields

𝐴̄ = 𝐶̄𝐵ᵀ ,   𝐵̄ = 𝐴ᵀ𝐶̄ .   (6.34)
[Fig. 6.24: Matrix operations, including the solution of linear systems, can be differentiated manually to bypass more costly AD code. In forward mode, a manual implementation maps (𝐴̇, 𝐵̇) to 𝐶̇ between the forward-AD sections; in reverse mode, it maps 𝐶̄ to (𝐴̄, 𝐵̄) between the reverse-AD sections.]

For the solution of a linear system, 𝐴𝐶 = 𝐵 (so that 𝐶 = 𝐴⁻¹𝐵), the reverse mode yields

𝐵̄ = 𝐴⁻ᵀ𝐶̄ ,   𝐴̄ = −𝐵̄𝐶ᵀ .   (6.36)
approach is only consistent in the limit of a fine discretization. The resulting inconsistencies can mislead the optimization.∗

∗ Peter and Dwight126 compare the continuous and discrete adjoint approaches in more detail. 126. Peter and Dwight, Numerical sensitivity analysis for aerodynamic optimization: A survey of approaches, 2010.
𝑟(𝑢; 𝑥) = 0 ,   (6.37)

where the semicolon denotes that the design variables 𝑥 are fixed when these equations are solved for the state variables 𝑢. Through these equations, 𝑢 is an implicit function of 𝑥. This relationship is represented by the box containing the solver and residual equations in Fig. 6.25.

[Fig. 6.25: The input 𝑥 feeds a solver that drives 𝑟(𝑢; 𝑥) to zero to determine 𝑢; the outputs are 𝑓(𝑥, 𝑢).]

The functions of interest, 𝑓(𝑥, 𝑢), are typically explicit functions of the state variables and the design variables. However, because 𝑢 is an implicit function of 𝑥, 𝑓 is ultimately an implicit function of 𝑥 as well. To compute 𝑓 for a given 𝑥, we must first find 𝑢 such that 𝑟(𝑢; 𝑥) = 0. This is usually the most computationally costly step and requires a solver (see Section 3.6). The residual equations could be nonlinear and
Recall Ex. 3.2, where we introduced the structural model of a truss structure. The residuals in this case are the linear equations

𝑟(𝑢; 𝑥) = 𝐾(𝑥)𝑢 − 𝑞 = 0 ,

where the state variables are the displacements, 𝑢, 𝐾 is the stiffness matrix, and 𝑞 is the vector of external forces. Solving for the displacements requires only a linear solver in this case, but it is still the most costly part of the analysis. Suppose that the design variables are the cross-sectional areas of the truss members. Then, the stiffness matrix is a function of 𝑥, but the external forces are not.
Suppose that the functions of interest are the stresses in each of the truss
members. This is an explicit function of the displacements, which is given by

𝑓(𝑥, 𝑢) ≡ 𝜎(𝑢) = 𝑆𝑢 ,
d𝑓/d𝑥 = 𝜕𝑓/𝜕𝑥 + (𝜕𝑓/𝜕𝑢)(d𝑢/d𝑥) ,   (6.39)

where the result is an (𝑛𝑓 × 𝑛𝑥) matrix.†

In this context, the total derivatives, d𝑓/d𝑥, take into account the change in 𝑢 that is required to keep the residuals of the governing equations (Eq. 6.37) equal to zero. The partial derivatives in Eq. 6.39 represent the variation of 𝑓(𝑥, 𝑢) with respect to changes in 𝑥 or 𝑢 without regard to satisfying the governing equations.

† This chain rule can be derived by writing the total differential of 𝑓 as d𝑓 = (𝜕𝑓/𝜕𝑥) d𝑥 + (𝜕𝑓/𝜕𝑢) d𝑢 and then “dividing” it by d𝑥. See Appendix A.2 for more background on differentials.
To better understand the difference between total and partial deriva-
tives in this context, imagine computing these derivatives using finite
differences with small perturbations. For the total derivatives, we
would perturb 𝑥, re-solve the governing equations to obtain 𝑢, and
then compute 𝑓 , which would account for both dependency paths
in Fig. 6.25. To compute the partial derivatives 𝜕 𝑓 /𝜕𝑥 and 𝜕 𝑓 /𝜕𝑢,
however, we would perturb 𝑥 or 𝑢 and recompute 𝑓 without re-solving
the governing equations. In general, these partial derivative terms are
cheap to compute numerically or can be obtained symbolically.
To find the total derivative d𝑢/d𝑥, we need to consider the governing
equations. Assuming that we are at a point where 𝑟(𝑥, 𝑢) = 0, any
perturbation in 𝑥 must be accompanied by a perturbation in 𝑢 such that
the governing equations remain satisfied. Therefore, the differential of
the residuals can be written as
d𝑟 = (𝜕𝑟/𝜕𝑥) d𝑥 + (𝜕𝑟/𝜕𝑢) d𝑢 = 0 .   (6.40)
This constraint is illustrated in Fig. 6.26 in two dimensions, but keep
in mind that 𝑥, 𝑢, and 𝑟 are vectors in the general case. The governing
equations (Eq. 6.37) map an 𝑛 𝑥 -vector 𝑥 to an 𝑛𝑢 -vector 𝑢. This mapping
defines a hypersurface (also known as a manifold) in the 𝑥–𝑢 space.
where 𝜕𝑟/𝜕𝑥 and d𝑢/d𝑥 are both (𝑛𝑢 × 𝑛𝑥) matrices, and 𝜕𝑟/𝜕𝑢 is a square matrix of size (𝑛𝑢 × 𝑛𝑢). This linear system is useful because if we provide the partial derivatives in this equation (which are cheap to compute), we can solve for the total derivatives d𝑢/d𝑥. Substituting d𝑢/d𝑥 into Eq. 6.39 then yields

d𝑓/d𝑥 = 𝜕𝑓/𝜕𝑥 − (𝜕𝑓/𝜕𝑢)(𝜕𝑟/𝜕𝑢)⁻¹(𝜕𝑟/𝜕𝑥) ,   (6.42)
where all the derivative terms on the right-hand side are partial deriva-
tives. The partial derivatives in this equation can be computed using any
of the methods that we have described earlier: symbolic differentiation,
finite differences, complex step, or AD. Equation 6.42 shows two ways
to compute the total derivatives, which we call the direct method and the
adjoint method.
The direct method (already outlined earlier) consists of solving the
linear system (Eq. 6.41) and substituting d𝑢/d𝑥 into Eq. 6.39. Defining
𝜙 ≡ −d𝑢/d𝑥, we can rewrite Eq. 6.41 as

(𝜕𝑟/𝜕𝑢) 𝜙 = 𝜕𝑟/𝜕𝑥 .   (6.43)

After solving for 𝜙 (one column at a time), we can use it in the total derivative equation (Eq. 6.39) to obtain

d𝑓/d𝑥 = 𝜕𝑓/𝜕𝑥 − (𝜕𝑓/𝜕𝑢) 𝜙 .   (6.44)
Solving the linear system (Eq. 6.43) is typically the most computa-
tionally expensive operation in this procedure. The cost of this approach
scales with the number of inputs 𝑛 𝑥 but is essentially independent
of the number of outputs 𝑛 𝑓 . This is the same scaling behavior as
finite differences and forward-mode AD. However, the constant of
proportionality is typically much smaller in the direct method because
we only need to solve the nonlinear equations 𝑟(𝑢; 𝑥) = 0 once to obtain
the states.
[Fig. 6.27: The total derivatives (Eq. 6.42) can be computed either by solving for 𝜙 (an (𝑛𝑢 × 𝑛𝑥) matrix, the direct method) or by solving for 𝜓ᵀ (an (𝑛𝑓 × 𝑛𝑢) matrix, the adjoint method); the terms in Eq. 6.42 have sizes (𝑛𝑓 × 𝑛𝑥), (𝑛𝑓 × 𝑛𝑥), (𝑛𝑓 × 𝑛𝑢), (𝑛𝑢 × 𝑛𝑢), and (𝑛𝑢 × 𝑛𝑥).]

The adjoint method instead solves for the adjoint matrix 𝜓 through the adjoint equation,

(𝜕𝑟/𝜕𝑢)ᵀ 𝜓 = (𝜕𝑓/𝜕𝑢)ᵀ .   (6.46)
This linear system has no dependence on 𝑥. Each adjoint vector is
associated with a function of interest 𝑓 𝑗 and is found by solving the
adjoint equation (Eq. 6.46) with the corresponding row 𝜕 𝑓 𝑗 /𝜕𝑢. The
solution (𝜓) is then used to compute the total derivative
d𝑓/d𝑥 = 𝜕𝑓/𝜕𝑥 − 𝜓ᵀ (𝜕𝑟/𝜕𝑥) .   (6.47)
This is sometimes called the reverse mode because it is analogous to
reverse-mode AD.
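The following Python sketch contrasts the two methods on a toy problem with randomly generated partial derivatives (the sizes and names are our assumptions); both recover the same total derivatives:

import numpy as np

# Direct (Eqs. 6.43-6.44) versus adjoint (Eqs. 6.46-6.47) on random partials.
rng = np.random.default_rng(0)
nx, nu, nf = 3, 5, 2
drdx = rng.random((nu, nx))                    # ∂r/∂x
drdu = rng.random((nu, nu)) + nu * np.eye(nu)  # ∂r/∂u (kept well conditioned)
dfdx = rng.random((nf, nx))                    # ∂f/∂x
dfdu = rng.random((nf, nu))                    # ∂f/∂u

# Direct method: one linear solve per input (nx right-hand sides)
phi = np.linalg.solve(drdu, drdx)
dfdx_direct = dfdx - dfdu @ phi

# Adjoint method: one linear solve per output (nf right-hand sides)
psi = np.linalg.solve(drdu.T, dfdu.T)
dfdx_adjoint = dfdx - psi.T @ drdx

print(np.allclose(dfdx_direct, dfdx_adjoint))  # True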
Although implementing implicit analytic methods is labor intensive,

127. Martins, Perspectives on aerodynamic design optimization, 2020.
𝜆/𝑚 + cos 𝜆 = 0 .
Figure 6.29 shows the equivalent of Fig. 6.25 in this case. Our goal is to compute
the derivative d 𝑓 /d𝑚. Because 𝜆 is an implicit function of 𝑚, we cannot find
an explicit expression for 𝜆 as a function of 𝑚, substitute that expression into
Eq. 6.48, and then differentiate normally. Fortunately, the implicit analytic
methods allow us to compute this derivative.
Referring back to our nomenclature,

d𝑓/d𝑚 = 2𝜆𝑚 + 𝜆 / (1/𝑚 − sin 𝜆) .
Thus, we obtained the desired derivative despite the implicitly defined function.
Here, it was possible to get an explicit expression for the total derivative, but
generally, it is only possible to get a numeric value.
𝜕𝜎/𝜕𝑢 = 𝑆 ,
which is an (𝑛𝑡 × 𝑛𝑢 ) matrix that we already have from the stress computation.
Now we can use either the direct or adjoint method by replacing the partial
derivatives in the respective equations. The direct linear system (Eq. 6.43)
yields
𝐾𝜙ᵢ = 𝜕(𝐾𝑢)/𝜕𝑥ᵢ ,
where 𝑖 corresponds to each truss member area. Once we have 𝜙 𝑖 , we can use
it to compute the total derivatives of all the stresses with respect to member
The adjoint linear system (Eq. 6.46) yields††

𝐾ᵀ𝜓 = 𝑆ᵀ ,

†† Usually, the stiffness matrix is symmetric, so 𝐾ᵀ = 𝐾.

partial derivatives in the adjoint equations (Eq. 6.46) and total derivative equation (Eq. 6.47).§§

§§ Kenway et al.129 provide more details on this approach and its applications. 129. Kenway et al., Effective Adjoint Approaches for Computational Fluid Dynamics, 2019.

The partial terms 𝜕𝑓/𝜕𝑥 form an (𝑛𝑓 × 𝑛𝑥) matrix, and 𝜕𝑓/𝜕𝑢 is an (𝑛𝑓 × 𝑛𝑢) matrix. These partial derivatives can be computed by identifying the section of the code that computes 𝑓 for a given 𝑥 and 𝑢
and running the AD tool for that section. This produces code that takes 𝑓̄ as an input and outputs 𝑥̄ and 𝑢̄, as shown in Fig. 6.31. Recall that we must first run the entire original code that computes 𝑢 and 𝑓. Then we can run the AD code with the desired seed. Suppose we want the derivative of the 𝑗th component of 𝑓. We would set 𝑓̄ⱼ = 1 and the other elements to zero. After running the AD code, we obtain 𝑥̄ and 𝑢̄, which correspond to the rows of the respective matrix of partial terms, that is,

𝑥̄ = 𝜕𝑓ⱼ/𝜕𝑥 ,   𝑢̄ = 𝜕𝑓ⱼ/𝜕𝑢 .   (6.49)

[Fig. 6.31: Applying reverse AD to the code that computes 𝑓(𝑥, 𝑢) produces code that computes the partial derivatives of 𝑓 with respect to 𝑥 and 𝑢: the reverse code takes the seed 𝑓̄ and returns 𝑥̄ and 𝑢̄.]

Thus, with each run of the AD code, we obtain the derivatives of one function with respect to all design variables and all state variables. One run is required for each element of 𝑓. The reverse mode is advantageous if 𝑛𝑓 < 𝑛𝑥.
The Jacobian 𝜕𝑟/𝜕𝑢 can also be computed using AD. Because 𝜕𝑟/𝜕𝑢
is typically sparse, the techniques covered in Section 6.8 significantly
increase the efficiency of computing this matrix. This is a square matrix,
so neither AD mode has an advantage over the other if we explicitly
compute and store the whole matrix.
However, reverse-mode AD is advantageous when using an iterative method to solve the adjoint linear system (Eq. 6.46). When using an iterative method, we do not form 𝜕𝑟/𝜕𝑢. Instead, we require products of the transpose of this matrix with some vector 𝑣,¶¶

(𝜕𝑟/𝜕𝑢)ᵀ 𝑣 .   (6.50)

¶¶ See Appendix B.4 for more details on iterative solvers.
The elements of 𝑣 act as weights on the residuals and can be interpreted
The final term needed to compute total derivatives with the adjoint method is the last term in Eq. 6.47, which can be written as

𝜓ᵀ (𝜕𝑟/𝜕𝑥) = ((𝜕𝑟/𝜕𝑥)ᵀ 𝜓)ᵀ .   (6.51)

This is yet another transpose vector product that can be obtained using the same reverse AD code for the residuals, except that now the residual seed is 𝑟̄ = 𝜓, and the product we want is given by 𝑥̄.
In sum, it is advantageous to use reverse-mode AD to compute
the partial derivative terms for the adjoint equations, especially if the
adjoint equations are solved using an iterative approach that requires
only matrix-vector products. Similar techniques and arguments apply
for the direct method, except that in that case, forward-mode AD is
advantageous for computing the partial derivatives.
𝜓ⱼᵀ (𝜕𝑟/𝜕𝑥ᵢ) = (𝜕𝑓ⱼ/𝜕𝑢) 𝜙ᵢ .   (6.52)
Each side of this equation yields a scalar that should match to working precision.
The dot-product test verifies that your partial derivatives and the solutions for
the direct and adjoint linear systems are consistent. For AD, the dot-product
test for a code with inputs 𝑥 and outputs 𝑓 is as follows:
𝑥̇ᵀ 𝑥̄ = 𝑥̇ᵀ (𝜕𝑓/𝜕𝑥)ᵀ 𝑓̄ = ((𝜕𝑓/𝜕𝑥) 𝑥̇)ᵀ 𝑓̄ = 𝑓̇ᵀ 𝑓̄ .   (6.53)
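A minimal Python sketch of this dot-product test, using the function from Ex. 6.5 with hand-coded Jacobian products and arbitrary seeds of our choosing, follows:

import numpy as np

# Dot-product test (Eq. 6.53): the forward product xdot·xbar must equal
# the reverse product fbar·fdot to working precision.
def jac(x):
    # Jacobian of f1 = x1*x2 + sin(x1), f2 = x1*x2 + x2**2
    return np.array([[x[1] + np.cos(x[0]), x[0]],
                     [x[1], x[0] + 2 * x[1]]])

x = np.array([np.pi / 4, 2.0])
xdot = np.array([0.3, -1.2])            # arbitrary input seed
fbar = np.array([0.7, 0.4])             # arbitrary output seed
fdot = jac(x) @ xdot                    # forward (tangent) product
xbar = jac(x).T @ fbar                  # reverse (adjoint) product
print(np.isclose(xdot @ xbar, fbar @ fdot))  # True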
[𝑥₁, 𝑥₂, …, 𝑥ⱼ + ℎ, …, 𝑥ₙₓ] .   (6.54)
          ⎡ 𝐽₁₁  0    0    0    0   ⎤
d𝑓/d𝑥 ≡  ⎢ 0    𝐽₂₂  0    0    0   ⎥
          ⎢ 0    0    𝐽₃₃  0    0   ⎥   (6.55)
          ⎢ 0    0    0    𝐽₄₄  0   ⎥
          ⎣ 0    0    0    0    𝐽₅₅ ⎦
For this scenario, the Jacobian can be constructed with one evaluation
rather than 𝑛 𝑥 evaluations. This is because a given output 𝑓𝑖 depends
on only one input 𝑥 𝑖 . We could think of the outputs as 𝑛 𝑥 independent
functions. Thus, for finite differencing, rather than requiring 𝑛 𝑥 input
vectors with 𝑛 𝑥 function evaluations, we can use one input vector, as
follows:
[𝑥₁ + ℎ, 𝑥₂ + ℎ, …, 𝑥₅ + ℎ] ,   (6.56)
allowing us to compute all the nonzero entries in one pass.∗

∗ Curtis et al.130 were the first to show that the number of function evaluations could be reduced for sparse Jacobians. 130. Curtis et al., On the estimation of sparse Jacobian matrices, 1974.

Although the diagonal case is easy to understand, it is a special situation. To generalize this concept, let us consider the following (5 × 6) matrix as an example:
⎡ 𝐽₁₁  0    0    𝐽₁₄  0    𝐽₁₆ ⎤
⎢ 0    0    𝐽₂₃  𝐽₂₄  0    0   ⎥
⎢ 𝐽₃₁  𝐽₃₂  0    0    0    0   ⎥   (6.57)
⎢ 0    0    0    0    𝐽₄₅  0   ⎥
⎣ 0    0    𝐽₅₃  0    𝐽₅₅  𝐽₅₆ ⎦
A subset of columns that does not have more than one nonzero in any given row is said to be structurally orthogonal. In this example,
the following sets of columns are structurally orthogonal: (1, 3), (1,
5), (2, 3), (2, 4, 5), (2, 6), and (4, 5). Structurally orthogonal columns
can be combined, forming a smaller Jacobian that reduces the number of function evaluations required, as sketched in the code below.

[Figure: Jacobian computation time versus problem size; AD with coloring is substantially faster than AD alone.]
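A minimal Python sketch of this idea for the sparsity pattern of Eq. 6.57 follows, using the grouping (1, 3), (2, 4, 5), (6) from the structurally orthogonal sets above (zero-based indices in the code; the toy function 𝑓 and step ℎ are our assumptions):

import numpy as np

# Finite differencing with structurally orthogonal column groups: columns in
# a group are perturbed simultaneously, and each nonzero is recovered from
# the unique perturbed column in its row.
groups = [[0, 2], [1, 3, 4], [5]]   # sets (1,3), (2,4,5), (6) in 1-based terms
rows_of = {0: [0, 2], 1: [2], 2: [1, 4], 3: [0, 1], 4: [3, 4], 5: [0, 4]}

def f(x):  # toy function with the sparsity pattern of Eq. 6.57
    return np.array([x[0] + x[3]**2 + x[5],
                     np.sin(x[2]) + x[3],
                     x[0] * x[1],
                     x[4]**2,
                     x[2] + np.cos(x[4]) + x[5]])

def sparse_fd_jacobian(f, x, h=1e-6):
    J = np.zeros((5, 6))
    f0 = f(x)
    for group in groups:            # 3 extra evaluations instead of 6
        xp = x.copy()
        xp[list(group)] += h
        df = (f(xp) - f0) / h
        for j in group:             # each listed row is perturbed only by column j
            J[rows_of[j], j] = df[rows_of[j]]
    return J

print(np.round(sparse_fd_jacobian(f, np.ones(6)), 4))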
Now that we have introduced all the methods for computing derivatives, we will see how they are connected. For example, we have mentioned that the direct and adjoint methods are analogous to the forward and reverse modes of AD, respectively, but we did not show this mathematically. The unified derivatives equation (UDE) expresses both methods.134 Also, the implicit analytic methods from Section 6.7 assumed one set of implicit equations (𝑟 = 0) and one set of explicit functions (𝑓). The UDE formulates the derivative computation for

134. Martins and Hwang, Review and unification of methods for computing derivatives of multidisciplinary computational models, 2013.
of unknowns,

𝑟ᵢ(𝑢₁, 𝑢₂, …, 𝑢ₙ) = 0 ,   𝑖 = 1, …, 𝑛 ,   (6.60)

and that there is at least one solution 𝑢∗ such that 𝑟(𝑢∗) = 0. Such a solution can be visualized for 𝑛 = 2, as shown in Fig. 6.35.

[Fig. 6.35: Solution of a system of two equations expressed by residuals; the contours 𝑟₁ = 0 and 𝑟₂ = 0 intersect at 𝑢∗, with neighboring contours at 𝑟ᵢ = ±Δ𝑟ᵢ.]

These residuals are general: each one can depend on any subset of the variables 𝑢 and can be truly implicit functions or explicit functions converted to the implicit form (see Section 3.3 and Ex. 3.3). The total differentials for these residuals are

d𝑟ᵢ = (𝜕𝑟ᵢ/𝜕𝑢₁) d𝑢₁ + … + (𝜕𝑟ᵢ/𝜕𝑢ₙ) d𝑢ₙ ,   𝑖 = 1, …, 𝑛 .   (6.61)
These represent first-order changes in 𝑟 due to perturbations in 𝑢. The differentials of 𝑢 can be visualized as perturbations in the space of the variables. The differentials of 𝑟 can be visualized as linear changes to the contour values, as shown in Fig. 6.36. Writing the differentials for all the residuals (Eq. 6.61) in matrix form yields

⎡ 𝜕𝑟₁/𝜕𝑢₁  ⋯  𝜕𝑟₁/𝜕𝑢ₙ ⎤ ⎡ d𝑢₁ ⎤   ⎡ d𝑟₁ ⎤
⎢    ⋮     ⋱     ⋮    ⎥ ⎢  ⋮  ⎥ = ⎢  ⋮  ⎥ .   (6.62)
⎣ 𝜕𝑟ₙ/𝜕𝑢₁  ⋯  𝜕𝑟ₙ/𝜕𝑢ₙ ⎦ ⎣ d𝑢ₙ ⎦   ⎣ d𝑟ₙ ⎦

[Fig. 6.36: The differential d𝑟 can be visualized as a linearized (first-order) change of the contour value.]
(𝜕𝑟/𝜕𝑢)(d𝑢/d𝑟) = 𝐼 .   (6.65)
which is the reverse form of the UDE. Now, each column 𝑗 yields d𝑢ⱼ/d𝑟—the total derivative of one variable with respect to all the residuals. This total derivative is interpreted visually in Fig. 6.38.

[Fig. 6.38: The total derivatives d𝑢₁/d𝑟₁ and d𝑢₁/d𝑟₂ represent the first-order change in 𝑢₁ resulting from perturbations d𝑟₁ and d𝑟₂.]

The usefulness of the total derivative Jacobian d𝑢/d𝑟 might still not be apparent. In the next section, we explain how to set up the UDE to include d𝑓/d𝑥 in the UDE unknowns (d𝑢/d𝑟).
Example 6.14 Computing and interpreting d𝑢/d𝑟

Suppose we want to find the rectangle that is inscribed in the ellipse given by

𝑟₁(𝑢₁, 𝑢₂) = 𝑢₁²/4 + 𝑢₂² − 1 = 0 .

A change in this residual represents a change in the size of the ellipse without changing its proportions. Of all the possible rectangles that can be inscribed in the ellipse, we want the rectangle with an area that is half of that of this ellipse, such that

𝑟₂(𝑢₁, 𝑢₂) = 4𝑢₁𝑢₂ − 𝜋 = 0 .

A change in this residual represents a change in the area of the rectangle. There are two solutions, as shown in the left pane of Fig. 6.39. These solutions can be found using Newton's method, which converges to one solution or the other, depending on the starting guess. We will pick the one on the right, which is [𝑢₁, 𝑢₂] = [1.79944, 0.43647]. The solution represents the coordinates of the rectangle corner that touches the ellipse.
Taking the partial derivatives, we can write the forward UDE (Eq. 6.65) for this problem as follows:

⎡ 𝑢₁/2  2𝑢₂ ⎤ ⎡ d𝑢₁/d𝑟₁  d𝑢₁/d𝑟₂ ⎤   ⎡ 1  0 ⎤
⎣ 4𝑢₂   4𝑢₁ ⎦ ⎣ d𝑢₂/d𝑟₁  d𝑢₂/d𝑟₂ ⎦ = ⎣ 0  1 ⎦ .   (6.67)
[Fig. 6.39: Rectangle inscribed in ellipse. Left: the residual contours and the two solutions. Right: first-order perturbation view showing the interpretation of d𝑢/d𝑟₁ around 𝑢∗.]
Solving this linear system for each of the two right-hand sides, we get

⎡ d𝑢₁/d𝑟₁  d𝑢₁/d𝑟₂ ⎤   ⎡  1.45353  −0.17628 ⎤
⎣ d𝑢₂/d𝑟₁  d𝑢₂/d𝑟₂ ⎦ = ⎣ −0.35257   0.18169 ⎦ .   (6.68)
These derivatives reflect the change in the coordinates of the point where the rectangle touches the ellipse as a result of a perturbation in the size of the ellipse, d𝑟₁, and in the area of the rectangle, d𝑟₂. The right side of Fig. 6.39 shows the visual interpretation of d𝑢/d𝑟₁ as an example.
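A short Python sketch of this example follows: it solves 𝑟(𝑢) = 0 with Newton's method and then solves the forward UDE (Eq. 6.67) with an identity right-hand side (the starting guess is our choice):

import numpy as np

def r(u):
    # Residuals of Ex. 6.14: ellipse and rectangle-area equations
    return np.array([u[0]**2 / 4 + u[1]**2 - 1, 4 * u[0] * u[1] - np.pi])

def drdu(u):
    # Partial derivatives used in both Newton's method and the UDE
    return np.array([[u[0] / 2, 2 * u[1]],
                     [4 * u[1], 4 * u[0]]])

u = np.array([2.0, 0.5])            # starting guess near the right solution
for _ in range(20):
    u = u - np.linalg.solve(drdu(u), r(u))
print(u)                            # [1.79944, 0.43647]

dudr = np.linalg.solve(drdu(u), np.eye(2))
print(dudr)                         # matches Eq. 6.68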
This is a vector with 𝑛 𝑥 + 𝑛𝑢 + 𝑛 𝑓 variables. For the residuals, we
need a vector with the same size. We can obtain this by realizing that
the residuals associated with the inputs and outputs are just explicit
functions that can be written in implicit form. Then, we have
     ⎡ 𝑥 − 𝑥̌        ⎤
𝑟̂ ≡ ⎢ 𝑟 − 𝑟̌(𝑥, 𝑢) ⎥ = 0 .   (6.70)
     ⎣ 𝑓 − 𝑓̌(𝑥, 𝑢) ⎦
Here, we distinguish 𝑥 (the actual variable in the UDE system) from
𝑥ˇ (a given input) and 𝑓 (the variable) from 𝑓ˇ (an explicit function of
𝑥 and 𝑢). Similarly, 𝑟 is the vector of variables associated with the
residual and 𝑟ˇ is the residual function itself. Taking the differential of
the residuals, and considering only one of them to be nonzero at a time,
we obtain,
d𝑥
dˆ𝑟 ≡ d𝑟 . (6.71)
d 𝑓
Using these variable and residual definitions in Eqs. 6.65 and 6.66 yields
the full UDE shown in Fig. 6.40, where the block we ultimately want to
compute is d 𝑓 /d𝑥.
          ⎡ 𝐼          0          0 ⎤ ⎡ 𝐼        0        0 ⎤   ⎡ 𝐼  0  0 ⎤
Forward:  ⎢ −𝜕𝑟̌/𝜕𝑥   −𝜕𝑟̌/𝜕𝑢   0 ⎥ ⎢ d𝑢/d𝑥   d𝑢/d𝑟   0 ⎥ = ⎢ 0  𝐼  0 ⎥
          ⎣ −𝜕𝑓̌/𝜕𝑥   −𝜕𝑓̌/𝜕𝑢   𝐼 ⎦ ⎣ d𝑓/d𝑥   d𝑓/d𝑟   𝐼 ⎦   ⎣ 0  0  𝐼 ⎦

          ⎡ 𝐼   −(𝜕𝑟̌/𝜕𝑥)ᵀ   −(𝜕𝑓̌/𝜕𝑥)ᵀ ⎤ ⎡ 𝐼   (d𝑢/d𝑥)ᵀ   (d𝑓/d𝑥)ᵀ ⎤   ⎡ 𝐼  0  0 ⎤
Reverse:  ⎢ 0   −(𝜕𝑟̌/𝜕𝑢)ᵀ   −(𝜕𝑓̌/𝜕𝑢)ᵀ ⎥ ⎢ 0   (d𝑢/d𝑟)ᵀ   (d𝑓/d𝑟)ᵀ ⎥ = ⎢ 0  𝐼  0 ⎥
          ⎣ 0        0             𝐼     ⎦ ⎣ 0        0           𝐼    ⎦   ⎣ 0  0  𝐼 ⎦

Fig. 6.40 The direct and adjoint methods can be recovered from the UDE.
To compute d𝑓/d𝑥 using the forward UDE (left-hand side of the equation in Fig. 6.40), we can ignore all but three blocks in the total derivatives matrix: 𝐼, d𝑢/d𝑥, and d𝑓/d𝑥. By multiplying these blocks and using the definition 𝜙 ≡ −d𝑢/d𝑥, we recover the direct linear system (Eq. 6.43) and the total derivative equation (Eq. 6.44).

To compute d𝑓/d𝑥 using the reverse UDE (right-hand side of the equation in Fig. 6.40), we can ignore all but three blocks in the total derivatives matrix: 𝐼, d𝑓/d𝑟, and d𝑓/d𝑥. By multiplying these blocks and defining 𝜓 ≡ −d𝑓/d𝑟, we recover the adjoint linear system (Eq. 6.46) and the corresponding total derivative equation (Eq. 6.47). The definition of 𝜓 here is significant because, as mentioned in Section 6.7.2, the adjoint vector is the total derivative of the objective function with respect to the governing equation residuals.
Say we want to compute the total derivatives of the perimeter of the rectangle from Ex. 6.14 with respect to the axes of the ellipse. The equation for the ellipse can be rewritten as

𝑟₃(𝑢₁, 𝑢₂) = 𝑢₁²/𝑥₁² + 𝑢₂²/𝑥₂² − 1 = 0 ,

where 𝑥₁ and 𝑥₂ are the semimajor and semiminor axes of the ellipse, respectively. The baseline values are [𝑥₁, 𝑥₂] = [2, 1]. The residual for the rectangle area is

𝑟₄(𝑢₁, 𝑢₂) = 4𝑢₁𝑢₂ − (𝜋/2)𝑥₁𝑥₂ = 0 .

To add the independent variables 𝑥₁ and 𝑥₂, we write them as residuals in implicit form as

𝑟₁(𝑥₁) = 𝑥₁ − 2 = 0 ,   𝑟₂(𝑥₂) = 𝑥₂ − 1 = 0 .

The output of interest is the perimeter of the rectangle, 𝑓 = 4(𝑢₁ + 𝑢₂), which we also write in implicit form as

𝑟₅(𝑢₁, 𝑢₂) = 𝑓 − 4(𝑢₁ + 𝑢₂) = 0 .
Now we have a system of five equations and five variables, with the dependencies shown in Fig. 6.41. The first two variables, 𝑥, are given, and we can compute 𝑢 and 𝑓 using a solver as before.

[Fig. 6.41: Dependencies of the residuals: 𝑥 enters 𝑟₃(𝑢, 𝑥) = 0 and 𝑟₄(𝑢, 𝑥) = 0, which determine 𝑢, and 𝑟₅(𝑢) = 0 determines 𝑓.]

Taking all the partial derivatives, we get the following forward system:

⎡ 1            0            0         0         0 ⎤ ⎡ 1         0         0         0         0 ⎤
⎢ 0            1            0         0         0 ⎥ ⎢ 0         1         0         0         0 ⎥
⎢ −2𝑢₁²/𝑥₁³   −2𝑢₂²/𝑥₂³   2𝑢₁/𝑥₁²  2𝑢₂/𝑥₂²   0 ⎥ ⎢ d𝑢₁/d𝑥₁  d𝑢₁/d𝑥₂  d𝑢₁/d𝑟₃  d𝑢₁/d𝑟₄   0 ⎥ = 𝐼 .
⎢ −(𝜋/2)𝑥₂   −(𝜋/2)𝑥₁    4𝑢₂       4𝑢₁       0 ⎥ ⎢ d𝑢₂/d𝑥₁  d𝑢₂/d𝑥₂  d𝑢₂/d𝑟₃  d𝑢₂/d𝑟₄   0 ⎥
⎣ 0            0            −4        −4        1 ⎦ ⎣ d𝑓/d𝑥₁   d𝑓/d𝑥₂   d𝑓/d𝑟₃   d𝑓/d𝑟₄    1 ⎦
We only want the two d 𝑓 /d𝑥 terms in this equation. We can either solve this
linear system twice to compute the first two columns, or we can compute both
terms with a single solution of the reverse (transposed) system. Transposing
the system, substituting the numerical values for 𝑥 and 𝑢, and removing the
total derivative terms that we do not need, we get the following system:
⎡ 1   0   −0.80950   −1.57080    0 ⎤ ⎡ d𝑓/d𝑥₁ ⎤   ⎡ 0 ⎤
⎢ 0   1   −0.38101   −3.14159    0 ⎥ ⎢ d𝑓/d𝑥₂ ⎥   ⎢ 0 ⎥
⎢ 0   0    0.89972    1.74588   −4 ⎥ ⎢ d𝑓/d𝑟₃ ⎥ = ⎢ 0 ⎥ .
⎢ 0   0    0.87294    7.19776   −4 ⎥ ⎢ d𝑓/d𝑟₄ ⎥   ⎢ 0 ⎥
⎣ 0   0    0          0          1 ⎦ ⎣    1    ⎦   ⎣ 1 ⎦
Solving this linear system, we obtain

⎡ d𝑓/d𝑥₁ ⎤   ⎡ 3.59888 ⎤
⎢ d𝑓/d𝑥₂ ⎥ = ⎢ 1.74588 ⎥ .
⎢ d𝑓/d𝑟₃ ⎥   ⎢ 4.40385 ⎥
⎣ d𝑓/d𝑟₄ ⎦   ⎣ 0.02163 ⎦
The total derivatives of interest are shown in Fig. 6.42.

We could have obtained the same solution using the adjoint equations from Section 6.7.2. The only difference is the nomenclature because the adjoint vector in this case is 𝜓 = −[d𝑓/d𝑟₃, d𝑓/d𝑟₄]. We can interpret these terms as the change of 𝑓 with respect to changes in the ellipse size and rectangle area, respectively.

[Fig. 6.42: Contours of 𝑓 as a function of 𝑥 and the total derivatives d𝑓/d𝑥₁ and d𝑓/d𝑥₂ at 𝑥 = [2, 1].]
6.9.3 Recovering AD

Now we will see how we can recover AD from the UDE. First, we define the UDE variables associated with each operation or line of code (assuming all loops have been unrolled), such that 𝑢 ≡ 𝑣. Recall from Section 6.6.1 that each variable is an explicit function of the previous ones, so each line of code can be written as the residual
𝑟ᵢ = 𝑣ᵢ − 𝑣̌ᵢ(𝑣₁, …, 𝑣ᵢ₋₁) .   (6.73)
The forward UDE (Eq. 6.65) then becomes

⎡ 1             0        ⋯        0 ⎤ ⎡ 1            0       ⋯       0 ⎤
⎢ −𝜕𝑣̌₂/𝜕𝑣₁    1        ⋱        ⋮ ⎥ ⎢ d𝑣₂/d𝑣₁     1        ⋱      ⋮ ⎥
⎢    ⋮          ⋱       ⋱        0 ⎥ ⎢    ⋮         ⋱       ⋱      0 ⎥ = 𝐼 .   (6.74)
⎣ −𝜕𝑣̌ₙ/𝜕𝑣₁    ⋯   −𝜕𝑣̌ₙ/𝜕𝑣ₙ₋₁   1 ⎦ ⎣ d𝑣ₙ/d𝑣₁    ⋯   d𝑣ₙ/d𝑣ₙ₋₁    1 ⎦
This equation is the matrix form of the AD forward chain rule (Eq. 6.21),
where each column of the total derivative matrix corresponds to the
tangent vector (𝑣¤ ) for the chosen input variable. As observed in Fig. 6.16,
the partial derivatives form a lower triangular matrix. The Jacobian
we ultimately want to compute (d 𝑓 /d𝑥) is composed of a subset of
derivatives in the bottom-left corner near the d𝑣 𝑛 /d𝑣1 term. To compute
these derivatives, we need to perform forward substitution and compute
one column of the total derivative matrix at a time, where each column
is associated with the inputs of interest.
Similarly, the reverse form of the UDE (Eq. 6.66) yields the transpose of Eq. 6.74,

⎡ 1   −𝜕𝑣̌₂/𝜕𝑣₁    ⋯    −𝜕𝑣̌ₙ/𝜕𝑣₁    ⎤ ⎡ 1   d𝑣₂/d𝑣₁    ⋯    d𝑣ₙ/d𝑣₁    ⎤
⎢ 0       1         ⋱        ⋮        ⎥ ⎢ 0       1        ⋱       ⋮      ⎥
⎢ ⋮       ⋱        ⋱    −𝜕𝑣̌ₙ/𝜕𝑣ₙ₋₁ ⎥ ⎢ ⋮       ⋱       ⋱    d𝑣ₙ/d𝑣ₙ₋₁ ⎥ = 𝐼 .   (6.75)
⎣ 0       ⋯        0         1        ⎦ ⎣ 0       ⋯       0         1      ⎦
This is equivalent to the AD reverse chain rule (Eq. 6.26), where each
column of the total derivative matrix corresponds to the gradient vector
(𝑣¯ ) for the chosen output variable. The partial derivatives now form
an upper triangular matrix, as previously shown in Fig. 6.21. The
derivatives of interest are now near the top-right corner of the total
derivative matrix near the d𝑣 𝑛 /d𝑣1 term. To compute these derivatives,
we need to perform back substitutions, computing one column of the
matrix at a time.
When scaling a problem (Tips 4.4 and 5.3), you should be aware that the
scale changes also affect the derivatives. You can apply the derivative methods
of this chapter to the scaled function directly. However, scaling is often done
outside the model because the scaling is specific to the optimization problem. In
this case, you may want to use the original functions and derivatives and make
the necessary modifications in an outer function that provides the objectives
and constraints.
Using the nomenclature introduced in Tip 4.4, we represent the scaled design variables given to the optimizer as 𝑥̄. Then, the unscaled variables are 𝑥 = 𝑠ₓ𝑥̄, and the scaled objective is 𝑓̄ = 𝑓/𝑠𝑓. Thus, the required scaled derivatives are

d𝑓̄/d𝑥̄ = (d𝑓/d𝑥)(𝑠ₓ/𝑠𝑓) .   (6.76)
Tip 6.10 Provide your own derivatives and use finite differences only
as a last resort
Because of the step-size dilemma, finite differences are often the cause of
failed optimizations. To put it more dramatically, finite differences are the root
of all evil. Most gradient-based optimization software uses finite differences
internally as a default if you do not provide your own gradients. Although
some software packages try to find reasonable finite-difference steps, it is easy
to get inaccurate derivatives, which then causes optimization difficulties or
total failure. This is the top reason why beginners give up on gradient-based
optimization!
Instead, you should provide gradients computed using one of the other
methods described in this chapter. In contrast with finite differences, the
derivatives computed by the other methods are usually as accurate as the
function computation. You should also avoid using finite-difference derivatives
as a reference for a definitive verification of the other methods.
If you have to use finite differences as a last resort, make sure to do a step-
size study (see Tip 6.2). You should then provide your own finite-difference
derivatives to the optimization or make sure that the optimizer finite-difference
estimates are acceptable.
6.10 Summary
scaling factor for the forward mode is generally lower than that for
finite differences. The cost of reverse-mode AD is independent of the
number of design variables.
Implicit analytic methods (direct and adjoint) are accurate and
Problems
What does that mean, and how should you show those points on
the plot?
6.5 Suppose you have two airplanes that are flying in a horizontal
plane defined by 𝑥 and 𝑦 coordinates. Both airplanes start at 𝑦 = 0,
but airplane 1 starts at 𝑥 = 0, whereas airplane 2 has a head start
of 𝑥 = Δ𝑥. The airplanes fly at a constant velocity. Airplane 1 has
a velocity of 𝑣1 in the direction of the positive 𝑥-axis, and airplane
2 has a velocity of 𝑣2 at an angle 𝛾 with the 𝑥-axis. The functions
of interest are the distance (𝑑) and the angle (𝜃) between the two
airplanes as a function of time. The independent variables are
Δ𝑥, 𝛾, 𝑣1 , 𝑣2 , and 𝑡. Write the code that computes the functions of
interest (outputs) for a given set of independent variables (inputs).
Use AD to differentiate the code. Choose a set of inputs, compute
the derivatives of all the outputs with respect to the inputs, and
verify them against the complex-step method.
𝐸 − 𝑒 sin(𝐸) = 𝑀 ,
6.7 Compute the derivatives for the 10-bar truss problem described
in Appendix D.2.2 using the direct and adjoint implicit differenti-
ation methods. Compute the derivatives of the objective (mass)
with respect to the design variables (10 cross-sectional areas),
and the derivatives of the constraints (stresses in all 10 bars)
with respect to the design variables (a 10 × 10 Jacobian matrix).
Compute the derivatives using the following:
6.8 You can now solve the 10-bar truss problem (previously solved in
Prob. 5.15) using the derivatives computed in Prob. 6.7. Solve this
optimization problem using both finite-difference derivatives and
derivatives computed using an implicit analytic method. Report
the following:
6.9 Aggregate the constraints for the 10-bar truss problem and extend
the code from Prob. 6.7 to compute the required constraint deriva-
tives using the implicit analytic method that is most advantageous
in this case. Verify your derivatives against the complex-step
method. Solve the optimization problem and compare your re-
sults to the ones you obtained in Prob. 6.8. How close can you get
to the reference solution?
Gradient-Free Optimization
7
Gradient-free algorithms fill an essential role in optimization. The
gradient-based algorithms introduced in Chapter 4 are efficient in
finding local minima for high-dimensional nonlinear problems defined
by continuous smooth functions. However, the assumptions made
for these algorithms are not always valid, which can render these
algorithms ineffective. Also, gradients might not be available when a
function is given as a black box.
In this chapter, we introduce only a few popular representative
gradient-free algorithms. Most are designed to handle unconstrained
functions only, but they can be adapted to solve constrained problems
by using the penalty or filtering methods introduced in Chapter 5. We
start by discussing the problem characteristics relevant to the choice
between gradient-free and gradient-based algorithms and then give an
overview of the types of gradient-free algorithms.
Gradient-free algorithms are easier to get up and running but are much less efficient, particularly as the dimension of the problem increases.
One significant advantage of gradient-free algorithms is that they
do not assume function continuity. For gradient-based algorithms,
function smoothness is essential when deriving the optimality con-
ditions, both for unconstrained functions and constrained functions.
More specifically, the Karush–Kuhn–Tucker (KKT) conditions (Eq. 5.11)
require that the function be continuous in value (𝐶 0 ), gradient (𝐶 1 ), and
Hessian (𝐶 2 ) in at least a small neighborhood of the optimum. If, for
example, the gradient is discontinuous at the optimum, it is undefined,
and the KKT conditions are not valid. Away from optimum points, this
requirement is not as stringent. Although gradient-based algorithms
work on the same continuity assumptions, they can usually tolerate
the occasional discontinuity, as long as it is away from an optimum
point. However, for functions with excessive numerical noise and
discontinuities, gradient-free algorithms might be the only option.
Many considerations are involved when choosing between a gradient-
based and a gradient-free algorithm. Some of these considerations are
common sources of misconception. One problem characteristic often
cited as a reason for choosing gradient-free methods is multimodality.
Design space multimodality can be a result of an objective function
with multiple local minima. In the case of a constrained problem, the
multimodality can arise from the constraints that define disconnected
or nonconvex feasible regions.
As we will see shortly, some gradient-free methods feature a global
search that increases the likelihood of finding the global minimum. This
feature makes gradient-free methods a common choice for multimodal
problems. However, not all gradient-free methods are global search
methods; some perform only a local search. Additionally, even though
gradient-based methods are by themselves local search methods, they
are often combined with global search strategies, as discussed in Tip 4.8.
It is not necessarily true that a global search, gradient-free method is
more likely to find a global optimum than a multistart gradient-based
method. As always, problem-specific testing is needed.
Furthermore, it is assumed far too often that any complex prob-
lem is multimodal, but that is frequently not the case. Although it
might be impossible to prove that a function is unimodal, it is easy to
prove that a function is multimodal simply by finding another local
minimum. Therefore, we should not make any assumptions about
the multimodality of a function until we show definite multiple local
minima. Additionally, we must ensure that perceived local minima are
not artificial minima arising from numerical noise.
[Fig. 7.1: Cost of optimization for an increasing number of design variables in the 𝑛-dimensional Rosenbrock function. A gradient-free algorithm is compared with a gradient-based algorithm with gradients computed analytically; the annotated log–log slopes are 2.52 (gradient-free) and 0.37 (gradient-based), so the gradient-based algorithm has much better scalability.]
[Table 7.1: Classification of gradient-free optimization methods (Nelder–Mead, GPS, MADS, trust region, implicit filtering, DIRECT, MCS, EGO, hit and run, and evolutionary) using the characteristics of Fig. 1.22: local or global, deterministic or stochastic, direct or surrogate, and heuristic.]
but it estimates lower and upper bounds for the optimum by using
the function variation between partitions. MCS is another algorithm
that partitions the design space into boxes, where a limit is imposed on
how small the boxes can get based on the number of times it has been
divided.
Global-search algorithms based on surrogate models are similar to
their local search counterparts. However, they use surrogate models
to reproduce the features of a multimodal function instead of convex
surrogate models. One of the most widely used of these algorithms is
efficient global optimization (EGO), which employs kriging surrogate
models and uses the idea of expected improvement to maximize the
likelihood of finding the optimum more efficiently (surrogate modeling techniques, including kriging, are introduced in Chapter 10, which also describes EGO). Other algorithms use radial basis functions (RBFs) as
the surrogate model and also maximize the probability of improvement
at new iterates.
Stochastic algorithms rely on one or more nondeterministic procedures; they include hit-and-run algorithms and the broad class of evolutionary algorithms. When performing benchmarks of a stochastic algorithm, you should run a large enough number of optimizations to obtain statistically significant results.

Hit-and-run algorithms generate random steps about the current iterate in search of better points. A new point is accepted when it is better than the current one, and this process repeats until the point cannot be improved.

What constitutes an evolutionary algorithm is not well defined.‡ Evolutionary algorithms are inspired by processes that occur in nature or society. There is a plethora of evolutionary algorithms in the literature, thanks to the fertile imagination of the research community and a never-ending supply of phenomena for inspiration.§ These algorithms are more of an analogy of the phenomenon than an actual model. They are, at best, simplified models and, at worst, barely connected to the phenomenon. Nature-inspired algorithms tend to invent a specific terminology for the mathematical terms in the optimization problem. For example, a design point might be called a “member of the population”, or the objective function might be the “fitness”.

The vast majority of evolutionary algorithms are population based, which means they involve a set of points at each iteration instead of a single one (we discuss a genetic algorithm in Section 7.6 and a particle swarm method in Section 7.7). Because the population is spread out in the design space, evolutionary algorithms perform a global search. The stochastic elements in these algorithms contribute to global exploration

‡ Simon140 provides a more comprehensive review of evolutionary algorithms. 140. Simon, Evolutionary Optimization Algorithms, 2013.

§ These algorithms include the following: ant colony optimization, artificial bee colony algorithm, artificial fish swarm, artificial flora optimization algorithm, bacterial foraging optimization, bat algorithm, big bang–big crunch algorithm, biogeography-based optimization, bird mating optimizer, cat swarm optimization, cockroach swarm optimization, cuckoo search, design by shopping paradigm, dolphin echolocation algorithm, elephant herding optimization, firefly algorithm, flower pollination algorithm, fruit fly optimization algorithm, galactic swarm optimization, gray wolf optimizer, grenade explosion method, harmony search algorithm, hummingbird optimization algorithm, hybrid glowworm swarm optimization algorithm, imperialist competitive algorithm, intelligent water drops, invasive weed optimization, mine bomb algorithm, monarch butterfly optimization, moth-flame optimization algorithm, penguin search optimization algorithm, quantum-behaved particle swarm optimization, salp swarm algorithm, teaching–learning-based optimization, whale optimization algorithm, and water cycle algorithm.
The simplex method of Nelder and Mead28 is a deterministic, direct-search method that is among the most cited gradient-free methods. It is also known as the nonlinear simplex—not to be confused with the simplex algorithm used for linear programming, with which it has nothing in common. To avoid ambiguity, we will refer to it as the Nelder–Mead algorithm.

28. Nelder and Mead, A simplex method for function minimization, 1965.

The Nelder–Mead algorithm is based on a simplex, which is a geometric figure defined by a set of 𝑛 + 1 points in the design space of 𝑛 variables, 𝑋 = {𝑥⁽⁰⁾, 𝑥⁽¹⁾, …, 𝑥⁽ⁿ⁾}. Each point 𝑥⁽ⁱ⁾ represents a design (i.e., a full set of design variables). In two dimensions, the simplex is a triangle, and in three dimensions, it becomes a tetrahedron (see Fig. 7.2).

[Fig. 7.2: A simplex for 𝑛 = 3 has four vertices.]

Each optimization iteration corresponds to a different simplex. The algorithm modifies the simplex at each iteration using five simple operations. The sequence of operations to be performed is chosen based on the relative values of the objective function at each of the points.
The first step of the simplex algorithm is to generate 𝑛 + 1 points
based on an initial guess for the design variables. This could be done by
simply adding steps to each component of the initial point to generate
𝑛 new points. However, this will generate a simplex with different edge
lengths, and equal-length edges are preferable. Suppose we want the
length of all sides to be 𝑙 and that the first guess is 𝑥⁽⁰⁾. The remaining
where 𝛼 is a scalar, and 𝑥𝑐 is the centroid of all the points except for the worst one, that is,

𝑥𝑐 = (1/𝑛) Σ_{𝑖=0}^{𝑛−1} 𝑥⁽ⁱ⁾ .   (7.4)
This generates a new point along the line that connects the worst point,
𝑥 (𝑛) , and the centroid of the remaining points, 𝑥 𝑐 . This direction can be
seen as a possible descent direction.
Each iteration aims to replace the worst point with a better one
to form a new simplex. Each iteration always starts with reflection,
which generates a new point using Eq. 7.3 with 𝛼 = 1, as shown in
Fig. 7.4. If the reflected point is better than the best point, then the
“search direction” was a good one, and we go further by performing an
expansion using Eq. 7.3 with 𝛼 = 2. If the reflected point is between the
second-worst and the worst point, then the direction was not great, but
it improved somewhat. In this case, we perform an outside contraction
(𝛼 = 1/2). If the reflected point is worse than our worst point, we try
an inside contraction instead (𝛼 = −1/2). Shrinking is a last-resort
operation that we can perform when nothing along the line connecting
𝑥 (𝑛) and 𝑥 𝑐 produces a better point. This operation consists of reducing
the size of the simplex by moving all the points closer to the best point,
𝑥⁽ⁱ⁾ = 𝑥⁽⁰⁾ + 𝛾(𝑥⁽ⁱ⁾ − 𝑥⁽⁰⁾)   for 𝑖 = 1, …, 𝑛 ,   (7.5)
where 𝛾 = 0.5.
Inputs:
𝑥⁽⁰⁾: Starting point
𝜏ₓ: Simplex size tolerance
𝜏𝑓: Function value standard deviation tolerance
Outputs:
𝑥∗: Optimal point

Create a simplex of 𝑛 + 1 points around 𝑥⁽⁰⁾
while simplex size > 𝜏ₓ or standard deviation of 𝑓 > 𝜏𝑓 do
    Sort 𝑥⁽⁰⁾, …, 𝑥⁽ⁿ⁻¹⁾, 𝑥⁽ⁿ⁾   (order from the lowest (best) to the highest 𝑓(𝑥⁽ʲ⁾))
    𝑥𝑐 = (1/𝑛) Σ_{𝑖=0}^{𝑛−1} 𝑥⁽ⁱ⁾   (the centroid excluding the worst point 𝑥⁽ⁿ⁾, Eq. 7.4)
    𝑥ᵣ = 𝑥𝑐 + (𝑥𝑐 − 𝑥⁽ⁿ⁾)   (reflection, Eq. 7.3 with 𝛼 = 1)
    if 𝑓(𝑥ᵣ) < 𝑓(𝑥⁽⁰⁾) then   (is reflected point better than the best?)
        𝑥ₑ = 𝑥𝑐 + 2(𝑥𝑐 − 𝑥⁽ⁿ⁾)   (expansion, Eq. 7.3 with 𝛼 = 2)
        if 𝑓(𝑥ₑ) < 𝑓(𝑥⁽⁰⁾) then   (is expanded point better than the best?)
            𝑥⁽ⁿ⁾ = 𝑥ₑ   (accept expansion and replace worst point)
        else
            𝑥⁽ⁿ⁾ = 𝑥ᵣ   (accept reflection)
        end if
    else if 𝑓(𝑥ᵣ) ≤ 𝑓(𝑥⁽ⁿ⁻¹⁾) then   (is reflected better than second worst?)
        𝑥⁽ⁿ⁾ = 𝑥ᵣ   (accept reflected point)
    else
        if 𝑓(𝑥ᵣ) > 𝑓(𝑥⁽ⁿ⁾) then   (is reflected point worse than the worst?)
            𝑥ᵢ𝑐 = 𝑥𝑐 − 0.5(𝑥𝑐 − 𝑥⁽ⁿ⁾)   (inside contraction, Eq. 7.3 with 𝛼 = −0.5)
            if 𝑓(𝑥ᵢ𝑐) < 𝑓(𝑥⁽ⁿ⁾) then   (inside contraction better than worst?)
                𝑥⁽ⁿ⁾ = 𝑥ᵢ𝑐   (accept inside contraction)
            else
                for 𝑗 = 1 to 𝑛 do
                    𝑥⁽ʲ⁾ = 𝑥⁽⁰⁾ + 0.5(𝑥⁽ʲ⁾ − 𝑥⁽⁰⁾)   (shrink, Eq. 7.5 with 𝛾 = 0.5)
                end for
            end if
        else
            𝑥ₒ𝑐 = 𝑥𝑐 + 0.5(𝑥𝑐 − 𝑥⁽ⁿ⁾)   (outside contraction, Eq. 7.3 with 𝛼 = 0.5)
            if 𝑓(𝑥ₒ𝑐) < 𝑓(𝑥ᵣ) then   (outside contraction better than reflection?)
                𝑥⁽ⁿ⁾ = 𝑥ₒ𝑐   (accept outside contraction)
            else
                for 𝑗 = 1 to 𝑛 do
                    𝑥⁽ʲ⁾ = 𝑥⁽⁰⁾ + 0.5(𝑥⁽ʲ⁾ − 𝑥⁽⁰⁾)   (shrink)
                end for
            end if
        end if
    end if
    𝑘 = 𝑘 + 1
end while
[Fig. 7.5: Flowchart of Nelder–Mead (Alg. 7.1), showing the decision tree among reflection, expansion, inside and outside contraction, and shrink.]
Like most direct-search methods, Nelder–Mead cannot directly handle constraints. One approach to handling constraints would be to use a penalty method (discussed in Section 5.4) to form an unconstrained problem. In this case, the penalty does not need to be differentiable, so a linear penalty method would suffice.
Figure 7.6 shows the sequence of simplices that results when minimizing
the bean function using a Nelder–Mead simplex. The initial simplex on the
upper left is equilateral. The first iteration is an expansion, followed by an
inside contraction, another reflection, and an inside contraction before the
shrinking. The simplices then shrink dramatically in size, slowly converging to
the minimum.
Using a convergence tolerance of 10⁻⁶ in the difference between 𝑓best and 𝑓worst, the problem took 68 function evaluations.
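A compact Python sketch of Alg. 7.1 follows. It omits the simplex-size tolerance and the equilateral initial simplex, and it uses the 2-D Rosenbrock function in place of the bean function, so it is an illustration rather than a reproduction of Ex. 7.1:

import numpy as np

# Minimal Nelder-Mead: reflection, expansion, contractions, and shrink.
def nelder_mead(f, x0, l=1.0, tol=1e-6, max_iter=500):
    n = len(x0)
    X = [np.array(x0, float)]
    for i in range(n):                    # simple (non-equilateral) start
        x = np.array(x0, float); x[i] += l
        X.append(x)
    F = [f(x) for x in X]
    for _ in range(max_iter):
        order = np.argsort(F)             # sort from best to worst
        X = [X[i] for i in order]; F = [F[i] for i in order]
        if F[-1] - F[0] < tol:
            break
        xc = np.mean(X[:-1], axis=0)      # centroid excluding the worst point
        xr = xc + (xc - X[-1]); fr = f(xr)               # reflection
        if fr < F[0]:
            xe = xc + 2 * (xc - X[-1]); fe = f(xe)       # expansion
            X[-1], F[-1] = (xe, fe) if fe < F[0] else (xr, fr)
        elif fr <= F[-2]:
            X[-1], F[-1] = xr, fr
        elif fr > F[-1]:
            xic = xc - 0.5 * (xc - X[-1]); fic = f(xic)  # inside contraction
            if fic < F[-1]:
                X[-1], F[-1] = xic, fic
            else:                                        # shrink toward best
                X = [X[0]] + [X[0] + 0.5 * (x - X[0]) for x in X[1:]]
                F = [F[0]] + [f(x) for x in X[1:]]
        else:
            xoc = xc + 0.5 * (xc - X[-1]); foc = f(xoc)  # outside contraction
            if foc < fr:
                X[-1], F[-1] = xoc, foc
            else:                                        # shrink toward best
                X = [X[0]] + [X[0] + 0.5 * (x - X[0]) for x in X[1:]]
                F = [F[0]] + [f(x) for x in X[1:]]
    return X[0], F[0]

rosen = lambda x: (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2
print(nelder_mead(rosen, [-1.0, 1.0]))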
2
𝑥0
𝑥2 1
𝑥∗
$D = \{d_1, \ldots, d_4\} , \quad \text{where} \quad d_1 = [1, 0, 0], \;\; d_2 = [0, 1, 0], \;\; d_3 = [0, 0, 1], \;\; d_4 = [-1, -1, -1] . \qquad (7.9)$
Figure 7.8 shows an example maximal set (four vectors) and minimal
set (three vectors) for a two-dimensional problem.
These direction vectors are then used to create a mesh. Given a
current center point 𝑥 𝑘 , which is the best point found so far, and a mesh
size Δ 𝑘 , the mesh is created as follows:
The type of search can change throughout the optimization. Like the
polling phase, the goal of the search phase is to find a better point
(i.e., 𝑓 (𝑥 𝑘+1 ) < 𝑓 (𝑥 𝑘 )) but within a broader domain. We begin with a
search at every iteration. If the search fails to produce a better point, we
continue with a poll. If a better point is identified in either phase, the
iteration ends, and we begin a new search. Optionally, a successful poll
could be followed by another poll. Thus, at each iteration, we might
perform a search and a poll, just a search, or just a poll.
We describe one option for a search procedure based on the same mesh ideas as the polling step. The concept is to extend the mesh throughout the entire domain, as shown in Fig. 7.10. In this example, the mesh size $\Delta_k$ is shared between the search and poll phases. However, it is usually more effective if these sizes are independent. Mathematically, we can define the global mesh as the set

$G = \{ x_k + \Delta_k D z \;\text{ for all } z_i \in \mathbb{Z}^+ \} , \qquad (7.11)$

Fig. 7.10 Meshing strategy extended across the domain. The same directions (and potentially spacing) are repeated at each mesh point, as indicated by the lighter arrows throughout the entire domain.
Inputs:
𝑥 𝑘 : Center point
Δ 𝑘 : Mesh size
𝑥, 𝑥: Lower and upper bounds
𝐷: Column vectors representing positive spanning set
𝑛 𝑠 : Number of search points
𝑓 𝑘 : The function value previously evaluated at 𝑥 𝑘 , that is, 𝑓 𝑘 = 𝑓 (𝑥 𝑘 )
Outputs:
success: True if successful in finding improved point
𝑥 𝑘+1 : New center point
𝑓 𝑘+1 : Corresponding function value
success = false
𝑥 𝑘+1 = 𝑥 𝑘
𝑓 𝑘+1 = 𝑓 𝑘
Construct global mesh 𝐺, using directions 𝐷, mesh size Δ 𝑘 , and bounds 𝑥, 𝑥
for 𝑖 = 1 to 𝑛 𝑠 do
Randomly select 𝑠 ∈ 𝐺
Evaluate 𝑓𝑠 = 𝑓 (𝑠)
if 𝑓𝑠 < 𝑓 𝑘 then
𝑥 𝑘+1 = 𝑠
𝑓 𝑘+1 = 𝑓𝑠
success = true
break
end if
end for
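As an illustration, the random mesh search above can be sketched in Python as follows. Rather than constructing the global mesh $G$ explicitly, this simplified version draws random integer multipliers $z$ (the cap `zmax` is an assumption to keep the candidate set finite) and clips candidates to the bounds.

```python
import numpy as np

def mesh_search(f, xk, fk, Dk, D, lb, ub, ns, zmax=10, rng=None):
    """Sketch of the search phase: sample ns random global-mesh points.

    D is an (n, nd) array whose columns form a positive spanning set.
    """
    rng = np.random.default_rng() if rng is None else rng
    for _ in range(ns):
        z = rng.integers(0, zmax + 1, size=D.shape[1])  # z_i in Z+ (capped)
        s = np.clip(xk + Dk * (D @ z), lb, ub)          # mesh point within bounds
        fs = f(s)
        if fs < fk:                                     # improved point found
            return True, s, fs
    return False, xk, fk                                # search failed
```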
Inputs:
𝑥 0 : Starting point
𝑥, 𝑥: Lower and upper bounds
Δ0 : Starting mesh size
𝑛 𝑠 : Number of search points
𝑘max : Maximum number of iterations
Outputs:
𝑥 ∗ : Best point
𝑓 ∗ : Corresponding function value
GPS can handle linear and nonlinear constraints. For linear con-
straints, one effective strategy is to change the positive spanning di-
rections so that they align with any linear constraints that are nearby
(Fig. 7.11). For nonlinear constraints, penalty approaches (Section 5.4)
are applicable, although the filter method (Section 5.5.3) is another
effective approach.
Fig. 7.11 Mesh direction changed during optimization to align with linear constraints when close to the constraint.

Example 7.2 Minimization of a multimodal function with GPS

In this example, we optimize the Jones function (Appendix D.1.4). We start at 𝑥 = [0, 0] with an initial mesh size of Δ = 0.1. We evaluate two search points at each iteration and run for 12 iterations. The iterations are plotted in Fig. 7.12.
Fig. 7.12 GPS iterations on the Jones function, shown at 𝑘 = 6, 𝑘 = 8, and 𝑘 = 12.
smaller while allowing the poll size (which dictates the maximum magnitude of the step) to remain large. This provides a much denser set of options in poll directions (e.g., the grid points on the right panel of Fig. 7.13). MADS randomly chooses the polling directions from this much larger set of possibilities while maintaining a positive spanning set.†

† The NOMAD software is an open-source implementation of MADS.142

142. Le Digabel, Algorithm 909: NOMAD: Nonlinear optimization with the MADS algorithm, 2011.
and evaluating every point in this grid. This is called an exhaustive search, and the precision of the minimum depends on how fine the grid is. The cost of this brute-force strategy is high and grows exponentially with the number of design variables.
The DIRECT method relies on a grid, but it uses an adaptive meshing
scheme that dramatically reduces the cost. It starts with a single 𝑛-
dimensional hypercube that spans the whole design space—like many
other gradient-free methods, DIRECT requires upper and lower bounds
on all the design variables. Each iteration divides this hypercube into
smaller ones and evaluates the objective function at the center of each
of these. At each iteration, the algorithm only divides rectangles
determined to be potentially optimal. The fundamental strategy in the DIRECT method builds on the Lipschitz constant concepts discussed next.

Lipschitz Constant

Shubert's Algorithm
is updated with the resulting new cones. We then iterate by finding the
two points that minimize the new lower bounding function, evaluating
the function at these points, updating the lower bounding function,
and so on.
The lowest bound on the function increases at each iteration and ultimately converges to the global minimum. At the same time, the segments in 𝑥 decrease in size. The location of the lowest bound can switch between distinct regions as the lower bound in one region increases beyond the lower bound in another region.
The two significant shortcomings of Shubert’s algorithm are that
(1) a Lipschitz constant is usually not available for a general function,
and (2) it is not easily extended to 𝑛 dimensions. The DIRECT algorithm
addresses these two shortcomings.
One-Dimensional DIRECT
Like Shubert's method, DIRECT starts with the domain $[a, b]$. However, instead of sampling the endpoints $a$ and $b$, it samples the midpoint. Consider the closed domain $[a, b]$ shown in Fig. 7.16 (left). For each segment, we evaluate the objective function at the segment's midpoint. In the first segment, which spans the whole domain, the midpoint is $c_0 = (a + b)/2$. Assuming some value of $L$, which is not known and that we will not need, the lower bound on the minimum would be $f(c) - L(b - a)/2$.

Fig. 7.16 The DIRECT algorithm evaluates the middle point (left), and each successive iteration trisects the segments that have the greatest potential (right). The segment half-width is $d = (b - a)/2$, and the lower-bound lines have slopes $+L$ and $-L$.
We want to increase this lower bound on the function minimum
by dividing this segment further. To do this in a regular way that
reuses previously evaluated points and can be repeated indefinitely,
The overall rationale for the potentially optimal criterion is that two
metrics quantify this potential: the size of the segment and the function
value at the center of the segment. The larger the segment is, the greater
the potential for that segment to contain the global minimum. The
lower the function value, the greater that potential is as well. For a set
of segments of the same size, we know that the one with the lowest
function value has the best potential and should be selected. If two
segments have the same function value and different sizes, we should
select the one with the largest size. For a general set of segments with
various sizes and value combinations, there might be multiple segments
that can be considered potentially optimal.
We identify potentially optimal segments as follows. If we draw a
line with a slope corresponding to a Lipschitz constant 𝐿 from any point
in Fig. 7.17, the intersection of this line with the vertical axis is a bound
on the objective function for the corresponding segment. Therefore,
the lowest bound for a given 𝐿 can be found by drawing a line through
the point that achieves the lowest intersection.
However, we do not know 𝐿, and we do not want to assume a value
because we do not want to bias the search. If 𝐿 were high, it would favor
dividing the larger segments. Low values of 𝐿 would result in dividing
the smaller segments. The DIRECT method hinges on considering all possible values of $L$. A segment $j$ is considered potentially optimal if there is some $L > 0$ such that

$f(c_j) - L\, d_j \le f(c_i) - L\, d_i \;\; \text{for all } i , \qquad \text{and} \qquad f(c_j) - L\, d_j \le f_{\min} - \varepsilon \left| f_{\min} \right| , \qquad (7.16)$
where 𝑓min is the best current objective function value, and 𝜀 is a small
positive parameter. The first condition corresponds to finding the
points in the lower convex hull mentioned previously.
The second condition in Eq. 7.16 ensures that the potential minimum
is better than the lowest function value found so far by at least a small
amount. This prevents the algorithm from becoming too local, wasting
function evaluations in search of smaller function improvements. The
parameter 𝜀 balances the search between local and global. A typical
value is $\varepsilon = 10^{-4}$, and its range is usually such that $10^{-7} \le \varepsilon \le 10^{-2}$.
There are efficient algorithms for finding the convex hull of an arbitrary set of points in two dimensions, such as the Jarvis march.144 These algorithms are more than we need because we only require the lower part of the convex hull, so the algorithms can be simplified for our purposes.

144. Jarvis, On the identification of the convex hull of a finite set of points in the plane, 1973.
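The following Python sketch identifies potentially optimal segments directly from the two conditions described above — the existence of an admissible slope $L$ and the $\varepsilon$ improvement over $f_{\min}$ — without building the convex hull explicitly. The $O(n^2)$ loop and the numerical cap on $L$ are simplifications assumed here.

```python
import numpy as np

def potentially_optimal(d, f, eps=1e-4):
    """Return indices of potentially optimal segments.

    d: segment sizes (center-to-endpoint distances)
    f: function values at the segment centers
    """
    d = np.asarray(d, float)
    f = np.asarray(f, float)
    f_min = f.min()
    selected = []
    for j in range(len(d)):
        # Bounds on L such that f[j] - L d[j] <= f[i] - L d[i] for all i
        L_lo, L_hi, ok = 0.0, np.inf, True
        for i in range(len(d)):
            if d[i] < d[j]:
                L_lo = max(L_lo, (f[j] - f[i]) / (d[j] - d[i]))
            elif d[i] > d[j]:
                L_hi = min(L_hi, (f[i] - f[j]) / (d[i] - d[j]))
            elif f[i] < f[j]:
                ok = False      # same size, lower value: j cannot be optimal
        # Second condition: improve on f_min by at least eps*|f_min|
        if ok and L_lo <= L_hi:
            L = min(L_hi, 1e12)  # use the largest admissible slope (capped)
            if f[j] - L * d[j] <= f_min - eps * abs(f_min):
                selected.append(j)
    return selected
```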
As in the Shubert algorithm, the division might switch from one
part of the domain to another, depending on the new function values.
Compared with the Shubert algorithm, the DIRECT algorithm produces
a discontinuous lower bound on the function values, as shown in
Fig. 7.18.
DIRECT in 𝑛 Dimensions

version of DIRECT.143 In 𝑛 dimensions, we deal with hyperrectangles instead of segments. A hyperrectangle can be defined by its center-point position 𝑐 in 𝑛-dimensional space and a half-length in each direction 𝑖, 𝛿𝑒 𝑖 , as shown in Fig. 7.19. The DIRECT algorithm assumes that the initial dimensions are normalized so that we start with a hypercube.

143. Jones, Direct Global Optimization Algorithm, 2009.

Fig. 7.19 Hyperrectangle in three dimensions, where 𝑑 is the maximum distance between the center and the vertices, and 𝛿𝑒 𝑖 is the half-length in each direction 𝑖.
The hyperrectangle is trisected along its longest dimension, and the two new points are evaluated. The values for these three points
are plotted in the second column from the right in the 𝑓 –𝑑 plot, where
the center point is reused, as indicated by the arrow and the matching
color. At this iteration, we have two points that define the convex hull.
In the second iteration, we have three rectangles of the same size, so
we divide the one with the lowest value and evaluate the centers of
the two new rectangles (which are squares in this case). We now have
another column of points in the 𝑓 –𝑑 plot corresponding to a smaller 𝑑
and an additional point that defines the lower convex hull. Because the
convex hull now has two points, we trisect two different rectangles in
the third iteration.
Inputs:
𝑥, 𝑥: Lower and upper bounds
Outputs:
𝑥 ∗ : Optimal point
the final points and convex hull are highlighted. The sequence of rectangles is shown in Fig. 7.22. The algorithm converges to the global minimum after dividing the rectangles around the other local minima a few times.

Fig. 7.21 Potentially optimal rectangles for the DIRECT iterations shown in Fig. 7.22 (function value 𝑓 versus rectangle size 𝑑).
Fig. 7.24 GA iteration steps.

specified for the generation of the initial population, and the size of that population varies. Similarly, there are many possible methods for selecting the parents, generating the offspring, and selecting the survivors. Here, the new population (𝑃 𝑘+1 ) is formed exclusively by the offspring generated from the crossover. However, some GAs add an extra selection process that selects a surviving population of size 𝑛 𝑝 among the population of parents and offspring.

In addition to the flexibility in the various operations, GAs use different methods for representing the design variables. The design variable representation can be used to classify genetic algorithms into two broad categories: binary-encoded and real-encoded genetic algorithms. Binary-encoded algorithms use bits to represent the design variables, whereas the real-encoded algorithms keep the same real value representation
used in most other algorithms. The details of the operations in Alg. 7.5
depend on whether we are using one or the other representation, but
the principles remain the same. In the rest of this section, we describe a
particular way of performing these operations for each of the possible
design variable representations.
Inputs:
𝑥, 𝑥: Lower and upper bounds
Outputs:
𝑥 ∗ : Best point
𝑓 ∗ : Corresponding function value
𝑘 = 0
𝑃 𝑘 = {𝑥 (1) , 𝑥 (2) , . . . , 𝑥 (𝑛 𝑝 ) } Generate initial population
while 𝑘 < 𝑘max do
Compute 𝑓 (𝑥) for all 𝑥 ∈ 𝑃 𝑘 Evaluate objective function
Select 𝑛 𝑝 /2 parent pairs from 𝑃 𝑘 for crossover Selection
Generate a new population of 𝑛 𝑝 offspring (𝑃 𝑘+1 ) Crossover
Randomly mutate some points in the population Mutation
𝑘 = 𝑘+1
end while
$\Delta x = \frac{\overline{x} - \underline{x}}{2^m - 1} . \qquad (7.17)$
To have a more precise representation, we must use more bits.
When using binary-encoded GAs, we do not need to encode the
design variables because they are generated and manipulated directly
in the binary representation. Still, we do need to decode them be-
fore providing them to the evaluation function. To decode a binary
representation, we use

$x = \underline{x} + \left( \sum_{i=0}^{m-1} b_i 2^i \right) \Delta x . \qquad (7.18)$
𝑖 1 2 3 4 5 6 7 8 9 10 11 12
𝑏𝑖 0 0 0 1 0 1 1 0 0 0 0 1
We can use Eq. 7.18 to compute the equivalent real number, which turns out to
be 𝑥 ≈ 32.55.
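Equation 7.18 is straightforward to implement; the sketch below reproduces the value above. The bounds $\underline{x} = -20$ and $\overline{x} = 80$ are assumptions chosen so that the decoded value matches $x \approx 32.55$.

```python
def decode(bits, x_lb, x_ub):
    """Decode a binary representation into a real value (Eqs. 7.17 and 7.18)."""
    m = len(bits)
    dx = (x_ub - x_lb) / (2**m - 1)                    # interval size (Eq. 7.17)
    integer = sum(b * 2**i for i, b in enumerate(bits))
    return x_lb + integer * dx

bits = [0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1]            # b_0, ..., b_11
print(decode(bits, -20.0, 80.0))                       # approximately 32.55
```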
Initial Population
Selection
Figure 7.25 illustrates the process with a small population. Each member of the population ends up in the mating pool zero, one, or two times, with better points more likely to appear in the pool. The best point in the population will always end up in the pool twice, whereas the worst point in the population is always eliminated.

Fig. 7.25 Tournament selection example. The best point in each randomly selected pair is moved into the mating pool.
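One simple way to realize this scheme in Python is to run two random pairings so that every member competes exactly twice, as sketched below (an even population size is assumed); the usage line reuses the fitness values from Fig. 7.25.

```python
import numpy as np

def tournament_selection(f_values, rng=None):
    """Each member enters two random pairings; pair winners join the pool."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(f_values)           # assumed even so that everyone is paired
    pool = []
    for _ in range(2):          # two pairings -> mating pool of size n
        perm = rng.permutation(n)
        for i, j in zip(perm[0::2], perm[1::2]):
            pool.append(i if f_values[i] < f_values[j] else j)
    return pool                 # indices into the population

pool = tournament_selection([12, 10, 7, 15, 2, 6])   # values from Fig. 7.25
```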
$F_i = \frac{-f_i + \Delta F}{\max(1, \Delta F - f_{\text{low}})} , \qquad (7.19)$

where $\Delta F = 1.1 f_{\text{high}} - 0.1 f_{\text{low}}$ is based on the highest and lowest function values in the population, and the denominator is introduced to scale the fitness.

Then, to find the sizes of the sectors in the roulette wheel selection, we take the normalized cumulative sum of the scaled fitness values to compute an interval for each member in the population $j$ as

$S_j = \frac{\sum_{i=1}^{j} F_i}{\sum_{i=1}^{n_p} F_i} . \qquad (7.20)$

This ensures that the probability of a member being selected for reproduction is proportional to its scaled fitness value.
1 and the remaining bits from parent 2. For the second offspring, the first 𝑘 bits are taken from parent 2 and the remaining ones from parent 1.
Initial Population
$x_i = \underline{x}_i + r \left( \overline{x}_i - \underline{x}_i \right) \qquad (7.22)$
Selection
The selection operation does not depend on the design variable encod-
ing. Therefore, we can use one of the selection approaches described
for the binary-encoded GA: tournament or roulette wheel selection.
Crossover
When using real encoding, the term crossover does not accurately
describe the process of creating the two offspring from a pair of points.
Instead, the approaches are more accurately described as a blending,
although the name crossover is still often used.
There are various options for the reproduction of two points encoded
using real numbers. A standard method is linear crossover, which
generates two or more points in the line defined by the two parent
points. One option for linear crossover is to generate the following two
points:
$x_{c1} = 0.5 x_{p1} + 0.5 x_{p2} , \qquad x_{c2} = 2 x_{p2} - x_{p1} , \qquad (7.23)$
where parent 2 is more fit than parent 1 ( 𝑓 (𝑥 𝑝2 ) < 𝑓 (𝑥 𝑝1 )). An example
of this linear crossover approach is shown in Fig. 7.29, where we can
see that child 1 is the average of the two parent points, whereas child 2
is obtained by extrapolating in the direction of the “fitter” parent.
Another option is a simple crossover like the binary case, where a random integer is generated to split the vectors; for example, with a split after the first index:

$x_{p1} = [x_1, x_2, x_3, x_4], \quad x_{p2} = [x_5, x_6, x_7, x_8] \;\Rightarrow\; x_{c1} = [x_1, x_6, x_7, x_8], \quad x_{c2} = [x_5, x_2, x_3, x_4] . \qquad (7.24)$

Fig. 7.29 Linear crossover produces two new points along the line defined by the two parent points.
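Both variants are easy to express in Python, as sketched below; the random split location in the simple crossover is the only free choice.

```python
import numpy as np

def linear_crossover(xp1, xp2):
    """Linear crossover (Eq. 7.23); xp2 is assumed to be the fitter parent."""
    xc1 = 0.5 * xp1 + 0.5 * xp2      # average of the two parents
    xc2 = 2.0 * xp2 - xp1            # extrapolation past the fitter parent
    return xc1, xc2

def simple_crossover(xp1, xp2, rng=None):
    """Single-point crossover of real-valued vectors (Eq. 7.24)."""
    rng = np.random.default_rng() if rng is None else rng
    k = rng.integers(1, len(xp1))    # random split index
    xc1 = np.concatenate([xp1[:k], xp2[k:]])
    xc2 = np.concatenate([xp2[:k], xp1[k:]])
    return xc1, xc2
```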
Figure 7.30 shows the evolution of the population when minimizing the bean function using a bit-encoded GA. The initial population size was 40, and the simulation was run for 50 generations. Figure 7.31 shows the evolution when using a real-encoded GA, but otherwise with the same parameters as the bit-encoded optimization. The real-encoded GA converges faster in this case.

Fig. 7.30 Evolution of the population using a bit-encoded GA to minimize the bean function, where 𝑘 is the generation number (snapshots at 𝑘 = 10, 20, and 50).
Fig. 7.31 Evolution of the population using a real-encoded GA to minimize the bean function, where 𝑘 is the generation number (snapshots at 𝑘 = 4, 6, and 10).

Lagrangian, linear penalty). However, there are additional options for GAs. In the tournament selection, we can use other selection criteria that do not depend on penalty parameters. One such approach for choosing the better of two competitors is as follows:
This concept is a lot like the filter methods discussed in Section 5.5.3.
7.6.4 Convergence
Rigorous mathematical convergence criteria, like those used in gradient-
based optimization, do not apply to GAs. The most common way to
terminate a GA is to specify a maximum number of iterations, which
corresponds to a computational budget. Another similar approach is
to let the algorithm run indefinitely until the user manually terminates
the algorithm, usually by monitoring the trends in population fitness.
these are just design points, the history for each point is relevant to the PSO algorithm, so we adopt the term particle. Each particle moves according to a velocity. This velocity changes according to the past objective function values of that particle and the current objective values of the rest of the particles. Each particle remembers the location where it found its best result so far, and it exchanges information with the swarm about the location where the swarm has found the best result so far.

150. Eberhart and Kennedy, New optimizer using particle swarm theory, 1995.
The position of particle $i$ for iteration $k + 1$ is updated according to

$x^{(i)}_{k+1} = x^{(i)}_k + v^{(i)}_{k+1} \Delta t , \qquad (7.27)$

where $\Delta t$ is a constant artificial time step. The velocity for each particle is updated as follows:

$v^{(i)}_{k+1} = \alpha v^{(i)}_k + \beta \, \frac{x^{(i)}_{\text{best}} - x^{(i)}_k}{\Delta t} + \gamma \, \frac{x_{\text{best}} - x^{(i)}_k}{\Delta t} . \qquad (7.28)$
The first component in this update is the “inertia”, which determines
how similar the new velocity is to the velocity in the previous iteration
through the parameter 𝛼. Typical values for the inertia parameter 𝛼 are
in the interval [0.8, 1.2]. A lower value of 𝛼 reduces the particle’s inertia
and tends toward faster convergence to a minimum. A higher value of 𝛼
increases the particle’s inertia and tends toward increased exploration to
potentially help discover multiple minima. Some methods are adaptive, choosing the value of $\alpha$ based on the optimizer's progress.151

151. Zhan et al., Adaptive particle swarm optimization, 2009.

The second term represents “memory” and is a vector pointing toward the best position particle $i$ has seen in all its iterations so far, $x^{(i)}_{\text{best}}$.
The weight in this term consists of a random number 𝛽 in the interval
[0, 𝛽max ] that introduces a stochastic component to the algorithm. Thus,
𝛽 controls how much influence the best point found by the particle so
far has on the next direction.
The third term represents “social” influence. It behaves similarly
to the memory component, except that 𝑥best is the best point the entire
swarm has found so far, and $\gamma$ is a random number in the interval $[0, \gamma_{\max}]$
that controls how much of an influence this best point has in the next
direction. The relative values of 𝛽 and 𝛾 thus control the tendency
toward local versus global search, respectively. Both 𝛽 max and 𝛾max are
in the interval [0, 2] and are typically closer to 2. Sometimes, rather
than using the best point in the entire swarm, the best point is chosen
within a neighborhood.
Because the time step is artificial, we can eliminate it by multiplying
Eq. 7.28 by Δ𝑡 to yield a step:
$\Delta x^{(i)}_{k+1} = \alpha \, \Delta x^{(i)}_k + \beta \left( x^{(i)}_{\text{best}} - x^{(i)}_k \right) + \gamma \left( x_{\text{best}} - x^{(i)}_k \right) . \qquad (7.29)$

We then use this step to update the particle position for the next iteration:

$x^{(i)}_{k+1} = x^{(i)}_k + \Delta x^{(i)}_{k+1} . \qquad (7.30)$
The three components of the update in Eq. 7.29 are shown in Fig. 7.32
for a two-dimensional case.
Fig. 7.32 Components of the PSO update.
The first step in the PSO algorithm is to initialize the set of particles
(Alg. 7.6). As with a GA, the initial set of points can be determined
randomly or can use a more sophisticated sampling strategy (see
Section 10.2). The velocities are also randomly initialized, generally
using some fraction of the domain size (𝑥 − 𝑥).
Inputs:
𝑥: Variable upper bounds
𝑥: Variable lower bounds
𝛼: Inertia parameter
𝛽 max : Self influence parameter
𝛾max : Social influence parameter
Δ𝑥max : Maximum velocity
Outputs:
𝑥 ∗ : Best point
𝑓 ∗ : Corresponding function value
k = 0
for i = 1 to n do    Loop to initialize all particles
    Generate position x_0^(i) within the specified bounds
    Initialize “velocity” Δx_0^(i)
end for
while not converged do    Main iteration loop
    for i = 1 to n do
        if f(x^(i)) < f(x_best^(i)) then    Best individual points
            x_best^(i) = x^(i)
        end if
        if f(x^(i)) < f(x_best) then    Best swarm point
            x_best = x^(i)
        end if
    end for
    for i = 1 to n do
        Δx_{k+1}^(i) = α Δx_k^(i) + β (x_best^(i) − x_k^(i)) + γ (x_best − x_k^(i))    Eq. 7.29
        Δx_{k+1}^(i) = max(min(Δx_{k+1}^(i), Δx_max), −Δx_max)    Limit velocity
        x_{k+1}^(i) = x_k^(i) + Δx_{k+1}^(i)    Update the particle position
        x_{k+1}^(i) = max(min(x_{k+1}^(i), x̄), x̲)    Enforce bounds
    end for
    k = k + 1
end while
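A compact Python sketch of Alg. 7.6 follows. The fixed iteration budget, the velocity limit of 10 percent of the domain, and the default parameter values are illustrative assumptions, not recommended settings.

```python
import numpy as np

def pso(f, lb, ub, n=40, alpha=0.9, beta_max=1.8, gamma_max=1.8,
        k_max=200, rng=None):
    """Minimal particle swarm sketch following Eqs. 7.29 and 7.30."""
    rng = np.random.default_rng() if rng is None else rng
    lb, ub = np.asarray(lb, float), np.asarray(ub, float)
    dx_max = 0.1 * (ub - lb)                      # assumed velocity limit
    x = rng.uniform(lb, ub, size=(n, lb.size))
    dx = rng.uniform(-dx_max, dx_max, size=x.shape)
    f_p = np.array([f(xi) for xi in x])           # personal best values
    x_p = x.copy()                                # personal best positions
    g = np.argmin(f_p)
    x_g, f_g = x_p[g].copy(), f_p[g]              # swarm best
    for _ in range(k_max):
        beta = rng.uniform(0, beta_max, size=x.shape)
        gamma = rng.uniform(0, gamma_max, size=x.shape)
        dx = alpha * dx + beta * (x_p - x) + gamma * (x_g - x)   # Eq. 7.29
        dx = np.clip(dx, -dx_max, dx_max)                        # limit velocity
        x = np.clip(x + dx, lb, ub)                              # Eq. 7.30 + bounds
        fx = np.array([f(xi) for xi in x])
        better = fx < f_p                          # update personal bests
        x_p[better], f_p[better] = x[better], fx[better]
        g = np.argmin(f_p)
        if f_p[g] < f_g:                           # update swarm best
            x_g, f_g = x_p[g].copy(), f_p[g]
    return x_g, f_g

x_opt, f_opt = pso(lambda x: (x[0] - 1)**2 + (x[1] + 2)**2, [-5, -5], [5, 5])
```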
Figure 7.33 shows the particle movements that result when minimizing the
bean function using a particle swarm method. The initial population size was
40, and the optimization required 600 function evaluations. Convergence was
assumed if the best value found by the population did not improve by more
than 10−4 for three consecutive iterations.
Fig. 7.33 Particle positions while minimizing the bean function, shown at 𝑘 = 5, 𝑘 = 12, and 𝑘 = 17.
Convergence criteria include the distance (sum or norm) between each particle and the best particle, the change in the best particle's objective function value over the last few generations, and the difference between the best and worst member. For PSO, another alternative is to check whether the
velocities for all particles (as measured by a metric such as norm or
mean) are below some tolerance. Some of these criteria that assume
all the particles congregate (distance, velocities) do not work well for
multimodal problems. In those cases, tracking only the best particle’s
objective function value may be more appropriate.
Example 7.9 Comparison of algorithms for a multimodal discontinuous function

By taking the ceiling of the product of the two sine waves, this function creates a checkerboard pattern with 0s and 4s. Adding this function to the Jones function produces the discontinuous pattern shown in Fig. 7.34. This is a one-dimensional slice of constant 𝑥 2 through the optimum of the Jones function; the full two-dimensional contour plot is shown in Fig. 7.35. The global optimum remains the same as for the original function. For the gradient-based algorithm, each gradient evaluation is counted as an evaluation in addition to each function evaluation.

Fig. 7.34 Slice of the Jones function with the added checkerboard pattern.
The six panels of Fig. 7.35 report 179, 119, 99, 2420, 760, and 96 evaluations, respectively.
The resulting optimization paths demonstrate that some gradient-free algorithms effectively handle the discontinuities and find the global minimum. Nelder–Mead converges quickly, but not necessarily to the global minimum. GPS and DIRECT quickly converge to the global minimum. GAs and PSO also find the global minimum, but they require many more evaluations. The gradient-based algorithm (quasi-Newton) with multistart also converges to the global minimum in two of the six random starts.

Fig. 7.35 Convergence paths for gradient-free algorithms compared with a gradient-based algorithm with multistart.
7.8 Summary
Problems
7.3 Program the DIRECT algorithm and perform the following stud-
ies:
7.5 Program the PSO algorithm and perform the following studies:
Minimize the power with respect to span and chord by doing the
following:
Even though a discrete optimization problem limits the options and thus conceptually sounds easier to solve, discrete optimization problems are often much harder to solve than continuous ones.
Unless your optimization problem fits specific forms that are well suited to
discrete optimization, your problem is likely expensive to solve, and it may be
helpful to consider approaches to avoid discrete variables.
Another variation of branch and bound arises from how the tree search is performed. Two common strategies are depth-first and breadth-first.
Inputs:
𝑓best : Best known solution, if any; otherwise 𝑓best = ∞
Outputs:
𝑥 ∗ : Optimal point
𝑓 (𝑥 ∗ ): Optimal function value
To solve this problem, we begin at the first node by solving the linear
relaxation. The binary constraint is removed and instead replaced with
continuous bounds: 0 ≤ 𝑥 𝑖 ≤ 1. The solution to this LP is as follows:
The first branch (see Fig. 8.5) yields a feasible binary solution! The corre-
sponding function value 𝑓 = −4 is saved as the best value so far. There is no
need to continue on this branch because the solution cannot be improved on
this particular branch.
We continue solving along the rest of this row (Fig. 8.6). The third node
in this row yields another binary solution. In this case, the function value is
𝑓 = −4.9, which is better, so this becomes the new best value so far. The second and fourth nodes do not yield binary solutions. Typically, we would have to branch these further, but they have a lower bound that is worse than the best solution so far. Thus, we can prune both of these branches.
All branches have been pruned, so we have solved the original problem:
𝑥 ∗ = [1, 0, 1, 1]
𝑓 ∗ = −4.9.
Fig. 8.7 Search path using a depth-first strategy.
Fig. 8.8 A breadth-first search of the mixed-integer programming example.
𝑥 ∗ = [0, 2, 3, 0.5]
𝑓 ∗ = −13.75.
Greedy algorithms are among the simplest methods for discrete opti-
mization problems. This method is more of a concept than a specific
algorithm. The implementation varies with the application. The idea is
to reduce the problem to a subset of smaller problems (often down to a
single choice) and then make a locally optimal decision. That decision
is locked in, and then the next small decision is made in the same
manner. A greedy algorithm does not revisit past decisions and thus
ignores much of the coupling between design variables.
Fig. 8.9 The greedy algorithm in this weighted directed graph results in a total cost of 15, whereas the best possible cost is 10.
A greedy algorithm simply makes the best choice as if each decision were the only decision to be made. Starting at node 1, we first choose to move to node 3 because that has the lowest cost among the three options (node 2 costs 2, node 3 costs 1, node 4 costs 5). We then choose to move to node 6 because that has the smaller cost of the next two available options (node 6 costs 4, node 7 costs 6), and so on. The path selected by the greedy algorithm is highlighted in the figure and results in a total cost of 15. The global optimum is also highlighted in the figure and has a total cost of 10.
The greedy algorithm used in Ex. 8.6 is easy to apply and scalable
but does not generally find the global optimum. To find that global
optimum, we have to consider the impact of our choices on future
decisions. A method to achieve this for certain problem structures is
discussed in the next section.
Even for a fixed problem, there are many ways to construct a greedy
algorithm. The advantage of the greedy approach is that the algorithms
are easy to construct, and they bound the computational expense of
the problem. One disadvantage of the greedy approach is that it
usually does not find an optimal solution (and in some cases finds the
worst solution!152 ). Furthermore, the solution is not necessarily feasible.

152. Gutin et al., Traveling salesman should not be greedy: domination analysis of greedy-type heuristics for the TSP, 2002.
A few other examples of greedy algorithms are listed below. For the
traveling salesperson problem (Ex. 8.1), always select the nearest city as the
next step. Consider the propeller problem (Ex. 8.2), but with additional discrete variables (number of blades, type of material, and number of shear webs). A greedy method could optimize the discrete variables one at a time, with the others fixed (i.e., optimize the number of blades first, fix that number, then optimize material, and so on). As a final example, consider the grocery store shopping problem discussed in a separate chapter (Ex. 11.1).∗ A few possible greedy strategies exist for that problem as well.

∗ This is a form of the knapsack problem.
procedure fib(𝑛)
if 𝑛 ≤ 1 then
return 𝑛
else
return fib(𝑛 − 1) + fib(𝑛 − 2)
end if
end procedure
𝑓0 = 0
𝑓1 = 1
for 𝑖 = 2 to 𝑛 do
𝑓𝑖 = 𝑓𝑖−1 + 𝑓𝑖−2
end for
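In Python, the two approaches look as follows: `functools.lru_cache` supplies the caching for the top-down (memoized) recursion, and the loop mirrors the bottom-up tabulation above.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fib_memo(n):
    """Top-down: plain recursion plus caching (memoization)."""
    return n if n <= 1 else fib_memo(n - 1) + fib_memo(n - 2)

def fib_tab(n):
    """Bottom-up: tabulation, matching the loop above."""
    f = [0, 1] + [0] * max(0, n - 1)
    for i in range(2, n + 1):
        f[i] = f[i - 1] + f[i - 2]
    return f[n]
```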
where $t$ is a transition function.† At each transition, we compute the cost function $c$.‡ For generality, we specify a cost function that may change at each iteration $i$:

$c_i(s_i, x_i) . \qquad (8.5)$

† For some problems, the transition function is stochastic.
‡ It is common to use discount factors on future costs.
We want to make a set of decisions that minimize the sum of the
current and future costs up to a certain time, which is called the value
function,
Let us solve the graph problem posed in Ex. 8.6 using dynamic programming.
For convenience, we repeat a smaller version of the figure in Fig. 8.12. We use
the tabulation (bottom-up) approach. To do this, we construct a table where we
keep track of the cost to move from this node to the end (node 12) and which
node we should move to next:
Node 1 2 3 4 5 6 7 8 9 10 11 12
Cost
Next
We start from the end. The last node is simple: there is no cost to move from node 12 to the end (we are already there), and there is no next node.

Node  1  2  3  4  5  6  7  8  9  10  11  12
Cost                                      0
Next                                      –

Now we move back one level to consider nodes 9, 10, and 11. These nodes all lead to node 12 and are thus straightforward. We need to be more careful with the formulas as we get to the more complicated cases next.

Fig. 8.12 Small version of Fig. 8.9 for convenience.
Node 1 2 3 4 5 6 7 8 9 10 11 12
Cost 3 6 2 0
Next 12 12 12 –
Now we move back one level to nodes 5, 6, 7, and 8. Using the Bellman equation for node 5, the cost is the minimum, over node 5's outgoing edges, of the edge cost plus the already-computed cost of the destination node.
We have already computed the minimum value for cost(9), cost(10), and cost(11),
so we just look up these values in the table. In this case, the minimum total
value is 3 and is associated with moving to node 11. Similarly, the cost for node
6 is
cost(6) = min[5 + cost(9), 4 + cost(10)]. (8.10)
The result is 8, and it is realized by moving to node 9.
Node 1 2 3 4 5 6 7 8 9 10 11 12
Cost 3 8 3 6 2 0
Next 11 9 12 12 12 –
We repeat this process, moving back and reusing optimal solutions to find
the global optimum. The completed table is as follows:
Node 1 2 3 4 5 6 7 8 9 10 11 12
Cost 10 8 12 9 3 8 7 4 3 6 2 0
Next 2 5 6 8 11 9 11 11 12 12 12 –
From this table, we see that the minimum cost is 10. This cost is achieved
by moving first to node 2. Under node 2, we see that we next go to node 5, then
11, and finally 12. Thus, the tabulation gives us the global minimum for cost
and the design decisions to achieve that.
In its present form, the knapsack problem has a linear objective and
linear constraints, so branch and bound would be a good approach.
However, this problem can also be formulated as a Markov chain, so we
can use dynamic programming. The dynamic programming version
accommodates variations such as stochasticity and other constraints
more easily.
To pose this problem as a Markov chain, we define the state as the
remaining capacity of the knapsack 𝑘 and the number of items we
have already considered. In other words, we are interested in 𝑣(𝑘, 𝑖),
where 𝑣 is the value function (optimal value given the inputs), 𝑘 is
the remaining capacity in the knapsack, and 𝑖 indicates that we have
already considered items 1 through 𝑖 (this does not mean we have
added them all to our knapsack, only that we have considered them).
We iterate through a series of decisions 𝑥 𝑖 deciding whether to take
item 𝑖 or not, which transitions us to a new state where 𝑖 increases and
𝑘 may decrease, depending on whether or not we took the item.
The real problem we are interested in is 𝑣(𝐾, 𝑛), which we solve
using tabulation. Starting at the bottom, we know that 𝑣(𝑘, 0) = 0 for
any 𝑘. This means that no matter the capacity, the value is 0 if we have
not considered any items yet. To work forward, consider a general case
considering item 𝑖, with the assumption that we have already solved
up to item 𝑖 − 1 for any capacity. If item 𝑖 cannot fit in our knapsack
(𝑤 𝑖 > 𝑘), then we cannot take the item. Alternatively, if the weight is
less than the capacity, we need to decide whether to select item 𝑖 or
not. If we do not, then the value is unchanged, and 𝑣(𝑘, 𝑖) = 𝑣(𝑘, 𝑖 − 1).
If we do select item 𝑖, then our value is 𝑐 𝑖 plus the best we could do
with the previous items but with a capacity that was smaller by 𝑤 𝑖 :
𝑣(𝑘, 𝑖) = 𝑐 𝑖 + 𝑣(𝑘 − 𝑤 𝑖 , 𝑖 − 1). Whichever of these decisions yields a
better value is what we should choose.
To determine which items produce this cost, we need to add more
logic. To keep track of the selected items, we define a selection matrix
𝑆 of the same size as 𝑣 (note that this matrix is indexed starting at zero
in both dimensions). Every time we accept an item 𝑖, we register that in
the matrix as 𝑆 𝑘,𝑖 = 1. Algorithm 8.4 summarizes this process.
Inputs:
𝑐 𝑖 : Cost of item 𝑖
𝑤 𝑖 : Weight of item 𝑖
𝐾: Total available capacity
Outputs:
𝑥 ∗ : Optimal selections
𝑣(𝐾, 𝑛): Corresponding cost, 𝑣(𝑘, 𝑖) is the optimal cost for capacity 𝑘 considering items 1
through 𝑖 ; note that indexing starts at 0
for 𝑘 = 0 to 𝐾 do
𝑣(𝑘, 0) = 0 No items considered; value is zero for any capacity
end for
for 𝑖 = 1 to 𝑛 do Iterate forward solving for one additional item at a time
for 𝑘 = 0 to 𝐾 do
if 𝑤 𝑖 > 𝑘 then
𝑣(𝑘, 𝑖) = 𝑣(𝑘, 𝑖 − 1) Weight exceeds capacity; value unchanged
else
if 𝑐 𝑖 + 𝑣(𝑘 − 𝑤 𝑖 , 𝑖 − 1) > 𝑣(𝑘, 𝑖 − 1) then Take item
𝑣(𝑘, 𝑖) = 𝑐 𝑖 + 𝑣(𝑘 − 𝑤 𝑖 , 𝑖 − 1)
𝑆(𝑘, 𝑖) = 1
else Reject item
𝑣(𝑘, 𝑖) = 𝑣(𝑘, 𝑖 − 1)
end if
end if
end for
end for
𝑘=𝐾 Initialize
𝑥 ∗ = {} Initialize solution 𝑥 ∗ as an empty set
for 𝑖 = 𝑛 to 1 by −1 do Loop to determine which items we selected
if 𝑆 𝑘,𝑖 = 1 then
add 𝑖 to 𝑥 ∗ Item 𝑖 was selected
𝑘 = 𝑘 − 𝑤𝑖
end if
end for
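A direct Python transcription of Alg. 8.4 is sketched below, shifted to 0-based list indexing; the usage line matches the example discussed next.

```python
def knapsack(c, w, K):
    """Tabulated knapsack: value matrix v, selection matrix S, and traceback."""
    n = len(c)
    v = [[0] * (n + 1) for _ in range(K + 1)]   # v[k][i]; v[k][0] = 0 for all k
    S = [[0] * (n + 1) for _ in range(K + 1)]   # selection flags
    for i in range(1, n + 1):
        for k in range(K + 1):
            if w[i - 1] > k:
                v[k][i] = v[k][i - 1]           # item cannot fit
            elif c[i - 1] + v[k - w[i - 1]][i - 1] > v[k][i - 1]:
                v[k][i] = c[i - 1] + v[k - w[i - 1]][i - 1]   # take item
                S[k][i] = 1
            else:
                v[k][i] = v[k][i - 1]           # reject item
    x, k = [], K                                # trace back the selections
    for i in range(n, 0, -1):
        if S[k][i] == 1:
            x.append(i)
            k -= w[i - 1]
    return v[K][n], sorted(x)

value, items = knapsack([4, 3, 3, 7, 2], [4, 5, 2, 6, 1], 10)   # 12, [3, 4, 5]
```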
We fill all entries in the matrix 𝑣[𝑘, 𝑖] to extract the last value
𝑣[𝐾, 𝑛]. For small numbers, filling this matrix (or table) is often
illustrated manually, hence the name tabulation. As with the Fibonacci
example, using dynamic programming instead of a fully recursive
solution reduces the complexity from 𝒪(2𝑛 ) to 𝒪(𝐾𝑛), which means it
is pseudolinear. It is only pseudolinear because there is a dependence
on the knapsack size. For small capacities, the problem scales well
even with many items, but as the capacity grows, the problem scales
much less efficiently. Note that the knapsack problem requires integer
weights. Real numbers can be scaled up to integers (e.g., 1.2, 2.4 become
12, 24). Arbitrary precision floats are not feasible given the number of
combinations to search across.
𝑤 𝑖 = [4, 5, 2, 6, 1]
𝑐 𝑖 = [4, 3, 3, 7, 2].
The capacity of our knapsack is 𝐾 = 10. Using Alg. 8.4, we find that the optimal
cost is 12. The value matrix is as follows:
0   0   0   0   0   0
0   0   0   0   0   2
0   0   0   3   3   3
0   0   0   3   3   5
0   4   4   4   4   5
0   4   4   4   4   6
0   4   4   7   7   7
0   4   4   7   7   9
0   4   4   7   10  10
0   4   7   7   10  12
0   4   7   7   11  12

(Rows correspond to capacities 𝑘 = 0, . . . , 10; columns to items considered 𝑖 = 0, . . . , 5.)
The selection matrix is

𝑆 =
0   0   0   0   0   0
0   0   0   0   0   1
0   0   0   1   0   0
0   0   0   1   0   1
0   1   0   0   0   1
0   1   0   0   0   1
0   1   0   1   0   0
0   1   0   1   0   1
0   1   0   1   1   0
0   1   1   0   1   1
0   1   1   0   1   1
Following this algorithm, we find that we selected items 3, 4, and 5 for a total
cost of 12, as expected, and a total weight of 9.
Simulated annealing∗ is a methodology designed for discrete optimization problems. However, it can also be effective for continuous multimodal problems, as we will discuss. The algorithm is inspired by the annealing process of metals. The atoms in a metal form a crystal lattice structure. If the metal is heated, the atoms move around freely. As the metal cools down, the atoms slow down, and if the cooling is slow

∗ First developed by Kirkpatrick et al.153 and Černý.154

153. Kirkpatrick et al., Optimization by simulated annealing, 1983.
154. Černý, Thermodynamical approach to the traveling salesman problem: An efficient simulation algorithm, 1985.
$P = \exp\!\left( \frac{-\left( f(x_{\text{new}}) - f\!\left(x^{(k)}\right) \right)}{T} \right) , \qquad (8.14)$
where Boltzmann’s constant is removed because it is just an arbitrary
scale factor in the optimization context. Otherwise, the state remains
unchanged. Constraints can be handled in this algorithm without
resorting to penalties by rejecting any infeasible step.
We must supply the optimizer with a function that provides a
random neighboring design from the set of possible design configurations.
A neighboring design is usually related to the current design, rather than being a purely random design from the entire set. In defining the
neighborhood structure, we might wish to define transition probabilities
so that all neighbors are not equally likely. This type of structure is
common in Markov chain problems. Because the nature of different
discrete problems varies widely, we cannot provide a generic neighbor-
selecting algorithm, but an example is shown later for the specific case
of a traveling salesperson problem.
Finally, we need to determine the annealing schedule (or cooling
schedule), a process for decreasing the temperature throughout the
optimization. A common approach is an exponential decrease:
$T = T_0 \, \alpha^k , \qquad (8.15)$

where $T_0$ is the initial temperature and $\alpha$ is a constant less than 1. The annealing schedule can substantially impact the algorithm's performance, and some experimentation is required to select an appropriate schedule for the problem at hand. One essential requirement is that the temperature should start high enough to allow for exploration. This should be significantly higher than the maximum expected energy difference between neighboring designs.

156. Andresen and Gordon, Constant thermodynamic speed for minimizing entropy production in thermodynamic processes and simulated annealing, 1994.
Inputs:
𝑥 0 : Starting point
𝑇0 : Initial temperature
Outputs:
𝑥 ∗ : Optimal point

k = 0
while not converged do
    Update temperature T according to the annealing schedule (e.g., Eq. 8.15)
    Generate a random neighboring design x_new
    if f(x_new) ≤ f(x^(k)) then
        x^(k+1) = x_new    Accept better point
    else
        r ∈ U[0, 1]    Randomly draw from uniform distribution
        P = exp( −(f(x_new) − f(x^(k))) / T )    Acceptance probability (Eq. 8.14)
        if P ≥ r then x^(k+1) = x_new else x^(k+1) = x^(k)
    end if
    k = k + 1
end while
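The loop above translates directly into Python. The `neighbor` function is problem specific and supplied by the caller; the default schedule parameters below are taken from the traveling salesperson example that follows.

```python
import math
import random

def simulated_annealing(f, x0, neighbor, T0=10.0, alpha=0.95,
                        n_iter=25_000, cool_every=100):
    """Minimal simulated annealing sketch (Eqs. 8.14 and 8.15)."""
    x, fx = x0, f(x0)
    x_best, f_best = x, fx
    T = T0
    for k in range(n_iter):
        if k > 0 and k % cool_every == 0:
            T *= alpha                       # annealing schedule (Eq. 8.15)
        x_new = neighbor(x)                  # random neighboring design
        f_new = f(x_new)
        # Accept improvements; otherwise accept with probability P (Eq. 8.14)
        if f_new <= fx or random.random() < math.exp(-(f_new - fx) / T):
            x, fx = x_new, f_new
        if fx < f_best:                      # track the best design seen
            x_best, f_best = x, fx
    return x_best, f_best
```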
choose two points and flip the direction of the path segments between those
two points, or (2) randomly choose two points and move the path segments to
follow another randomly chosen point. The distance traveled by the randomly
generated initial set of points is 26.2.
We specify an iteration budget of 25,000 iterations, set the initial temperature
to be 10, and decrease the temperature by a multiplicative factor of 0.95 at every
100 iterations. The right panel of Fig. 8.13 shows the final path, which has a
length of 5.61. The final path might not be the global optimum (remember,
these finite time methods are only approximations of the full combinatorial
search), but the methodology is effective and fast for this problem in finding at
least a near-optimal solution. Figure 8.14 shows the iteration history.
The binary form of a genetic algorithm (GA) can be directly used with
discrete variables. Because the binary form already requires a discrete
representation for the population members, using discrete design
variables is a natural fit. The details of this method were discussed in
Section 7.6.1.
8.8 Summary
Problems
8.2 Branch and bound. Solve the following problem using a manual
branch-and-bound approach (i.e., show each LP subproblem), as
A B C D Limit
Chlorine 0.74 −0.05 1.0 −0.15 97
Sodium hydroxide 0.39 0.4 0.91 0.44 99
Sulfuric acid 0.86 0.89 0.09 0.83 52
Labor (person-hours) 5 7 7 6 1000
𝑤 𝑖 = [2, 5, 3, 4, 6, 1]
𝑐 𝑖 = [5, 3, 1, 5, 7, 2]
a. A greedy algorithm where you take the item with the best
cost-to-weight ratio (that fits within the remaining capacity)
at each iteration
b. Dynamic programming
becomes

$\underset{x}{\text{minimize}} \quad f(x) = \begin{bmatrix} f_1(x) \\ f_2(x) \\ \vdots \\ f_{n_f}(x) \end{bmatrix} , \quad \text{where } n_f \ge 2 . \qquad (9.2)$
The constraints are unchanged unless some of them have been refor-
mulated as objectives. This multiobjective formulation might require
trade-offs when trying to minimize all functions simultaneously be-
cause, beyond some point, further reduction in one objective can only
be achieved by increasing one or more of the other objectives.
One exception occurs if the objectives are independent because they
depend on different sets of design variables. Then, the objectives are
said to be separable, and they can be minimized independently. If there
are constraints, these need to be separable as well. However, separable
objectives and constraints are rare because functions tend to be linked
in engineering systems.
Given that multiobjective optimization requires trade-offs, we need
a new definition of optimality. In the next section, we explain how there
is an infinite number of optimal points, forming a surface in the space of
objective functions. After defining optimality for multiple objectives, we
present several possible methods for solving multiobjective optimization
problems.
9.2 Pareto Optimality

Figure 9.1 shows three designs measured against two objectives that we want to minimize: $f_1$ and $f_2$. Let us first compare design A with design B. From the figure, we see that design A is better than design B in both objectives. In the language of multiobjective optimization, we say that design A dominates design B. One design is said to dominate another when it is at least as good in every objective and better in at least one.

Fig. 9.1 Three designs, 𝐴, 𝐵, and 𝐶, are plotted against two objectives, 𝑓1 and 𝑓2. The region in the shaded rectangle highlights points that are dominated by design 𝐴.
the left side tells us how much power we have to sacrifice for a given reduction in noise. If the slope is steep, as is the case in the figure, we can see that a small sacrifice in maximum power production can be exchanged for significantly reduced noise. However, if more significant noise reductions are sought, then large power reductions are required. Conversely, if the left side of the figure had a flatter slope, we would know that small reductions in noise would require significant decreases in power. Understanding the magnitude of these trade-off sensitivities helps make high-level design decisions.

Fig. 9.3 A notional Pareto front representing power and noise trade-offs for a wind farm optimization problem.
$\sum_{i=1}^{n_f} w_i = 1 . \qquad (9.4)$
of 𝑤 should be used to sweep out the Pareto set evenly, and (3) this method can only return points on the convex portion of the Pareto front.

In Fig. 9.5, we highlight the convex portions of the Pareto front from Fig. 9.4. If we utilize the concept of pushing a line down and to the left, we see that these are the only portions of the Pareto front that can be found using a weighted-sum method.

Fig. 9.5 The convex portions of this Pareto front are highlighted.
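As a sketch, a two-objective weighted-sum sweep can be written as follows, assuming SciPy is available for the inner minimization; the evenly spaced weights and the quadratic test objectives are illustrative choices only.

```python
import numpy as np
from scipy.optimize import minimize

def weighted_sum_front(f1, f2, x0, n_w=11):
    """Sweep the weight to trace the convex portion of a Pareto front."""
    front = []
    for w in np.linspace(0.0, 1.0, n_w):          # weights satisfy Eq. 9.4
        res = minimize(lambda x: w * f1(x) + (1 - w) * f2(x), x0)
        front.append((f1(res.x), f2(res.x)))
    return front

f1 = lambda x: (x[0] - 1)**2 + x[1]**2            # illustrative objectives
f2 = lambda x: (x[0] + 1)**2 + x[1]**2
pts = weighted_sum_front(f1, f2, x0=np.zeros(2))
```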
passes through the anchor points. We space points along this plane
(usually evenly) and, starting from those points, solve optimization
problems that search along directions normal to this plane.
This procedure is shown in Fig. 9.7 for a two-objective case. In this
case, the plane that passes through the anchor points is a line. We
now space points along this line by choosing a vector of weights 𝑏, as
illustrated on the left-hand side of Fig. 9.7. The weights are constrained such that $b_i \in [0, 1]$ and $\sum_i b_i = 1$. If we make $b_i = 1$ and all other entries
zero, then this equation returns one of the anchor points, 𝑓 (𝑥 ∗𝑖 ). For
two objectives, we would set 𝑏 = [𝑤, 1 − 𝑤] and vary 𝑤 in equal steps
between 0 and 1.
Fig. 9.7 A notional example of the NBI method. A plane is created that passes through the single-objective optima (the anchor points), and solutions are sought normal to that plane for a more evenly spaced Pareto front.
$f^* = \begin{bmatrix} f_1(x_1^*) \\ f_2(x_2^*) \\ \vdots \\ f_{n_f}(x_{n_f}^*) \end{bmatrix} , \qquad (9.8)$
$\tilde{n} = -P e , \qquad (9.11)$
$\begin{aligned} \underset{x,\,\alpha}{\text{maximize}} \quad & \alpha \\ \text{subject to} \quad & P b + f^* + \alpha \hat{n} = f(x) \\ & g(x) \le 0 \\ & h(x) = 0 . \end{aligned} \qquad (9.12)$
This means that we find the point farthest away from the anchor-point
plane, starting from a given value for 𝑏, while satisfying the original
problem constraints. The process is then repeated for additional values
of 𝑏 to sweep out the Pareto front.
In contrast to the previously mentioned methods, this method yields
a more uniformly spaced Pareto front, which is desirable for computa-
tional efficiency, albeit at the cost of a more complex methodology.
For most multiobjective design problems, additional complexity
beyond the NBI method is unnecessary. However, even this method
can still have deficiencies for problems with unusual Pareto fronts,
and new methods continue to be developed. For example, the normal
constraint method uses a very similar approach,161 but with inequality
161. Ismail-Yahaya and Messac, Effective
constraints to address a deficiency in the NBI method that occurs when generation of the Pareto frontier using the
normal constraint method, 2002.
the normal line does not cross the Pareto front. This methodology has
undergone various improvements, including better scaling through
162. Messac and Mattson, Normal con-
normalization.162 A more recent improvement performs an even more straint method with guarantee of even repre-
sentation of complete Pareto frontier, 2004.
efficient generation of the Pareto frontier by avoiding regions of the
163. Hancock and Mattson, The smart nor-
Pareto front where minimal trade-offs occur.163 mal constraint method for directly generating
a smart Pareto set, 2013.
Fig. 9.8 Anchor points (2, 3) and (5, 1), the utopia point $f^*$, and the quasi-normal direction $\hat{n}$ for the NBI example.
First, we optimize the objectives one at a time, which in our example results
in the two anchor points shown in Fig. 9.8: 𝑓 (𝑥1∗ ) = (2, 3) and 𝑓 (𝑥2∗ ) = (5, 1).
The utopia point is then
$f^* = \begin{bmatrix} 2 \\ 1 \end{bmatrix} .$

For the matrix $P$, recall that the $i$th column of $P$ is $f(x_i^*) - f^*$:

$P = \begin{bmatrix} 0 & 3 \\ 2 & 0 \end{bmatrix} .$

Our quasi-normal vector is given by $-Pe$ (note that the true normal is $[-2, -3]$):

$\tilde{n} = \begin{bmatrix} -3 \\ -2 \end{bmatrix} .$

We now have all the parameters we need to solve Eq. 9.12.

∗ The first application of an evolutionary algorithm for solving a multiobjective problem was by Schaffer.164

164. Schaffer, Some experiments in machine learning using vector evaluated genetic algorithms, 1984.
9.3.4 Evolutionary Algorithms

Gradient-free methods can, and occasionally do, use all of the previously described methods. However, evolutionary algorithms also enable a fundamentally different approach. Genetic algorithms (GAs), a specific type of evolutionary algorithm, were introduced in Section 7.6.∗

A GA is amenable to an extension that can handle multiple objectives because it keeps track of a large population of designs at each iteration. If we plot two objective functions for a given population of a GA iteration, we get something like that shown in Fig. 9.9. The points represent the current population, and the highlighted points in the lower left are the current nondominated set. As the optimization progresses, the nondominated set moves further down and to the left and eventually converges toward the actual Pareto front.

Fig. 9.9 Population for a multiobjective GA iteration plotted against two objectives. The nondominated set is highlighted at the bottom left and eventually converges toward the Pareto front.
is just the current approximation of the Pareto front). The algorithm recursively divides the population in half and finds the nondominated set for each half separately.

165. Deb, Introduction to evolutionary multiobjective optimization, 2008.
166. Kung et al., On finding the maxima of a set of vectors, 1975.
Inputs:
𝑝: A population sorted by the first objective
Outputs:
𝑓 : The nondominated set for the population
procedure front(𝑝)
if length(𝑝) = 1 then If there is only one point, it is the front
return 𝑝
end if
Split population into two halves 𝑝 𝑡 and 𝑝 𝑏
⊲ Because input was sorted, 𝑝 𝑡 will be superior to 𝑝 𝑏 in the first objective
𝑡 = front(𝑝 𝑡 ) Recursive call to find front for top half
𝑏 = front(𝑝 𝑏 ) Recursive call to find front for bottom half
Initialize 𝑓 with the members from 𝑡 Begin the merged population
for 𝑖 = 1 to length(𝑏) do
dominated = false Track whether anything in 𝑡 dominates 𝑏 𝑖
for 𝑗 = 1 to length(𝑡) do
if 𝑡 𝑗 dominates 𝑏 𝑖 then
dominated = true
break No need to continue search through 𝑡
end if
end for
if not dominated then 𝑏 𝑖 was not dominated by anything in 𝑡
Add 𝑏 𝑖 to 𝑓
end if
end for
return 𝑓
end procedure
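A Python version of this recursive procedure is sketched below. The `dominates` helper uses the standard definition, and sorting by the first objective before the top-level call is the caller's responsibility, as in the pseudocode.

```python
def dominates(a, b):
    """True if a dominates b: no worse in all objectives, better in one."""
    return (all(ai <= bi for ai, bi in zip(a, b))
            and any(ai < bi for ai, bi in zip(a, b)))

def front(p):
    """Nondominated set of p; p must be sorted by the first objective."""
    if len(p) == 1:
        return p                        # a single point is its own front
    half = len(p) // 2
    t = front(p[:half])                 # front of the top half
    b = front(p[half:])                 # front of the bottom half
    f = list(t)                         # merged front starts with t
    for bi in b:
        if not any(dominates(tj, bi) for tj in t):
            f.append(bi)                # nothing in t dominates b_i
    return f

pts = [(6.0, 8.0), (6.0, 4.0), (5.0, 6.0), (3.0, 1.0), (1.0, 2.0)]
nd = front(sorted(pts))                 # [(1.0, 2.0), (3.0, 1.0)]
```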
Inputs:
𝑝: A population
Outputs:
rank: The rank for each member in the population
Inputs:
𝑝: A population
Outputs:
𝑑: Crowding distances
Inputs:
𝑥: Variable upper bounds
𝑥: Variable lower bounds
Outputs:
𝑥 ∗ : Best point
We see that the current nondominated set consists of points D and J and that
there are four different ranks.
Next, we start filling the new population in the order of rank. Our maximum
capacity is 6, so all rank 1 {D, J} and rank 2 {E, H, K} fit. We cannot add rank 3
{A, C, I, L} because the population size would be 9. So far, our new population
consists of {D, J, E, H, K}. To choose which items from rank 3 continue forward,
we compute the crowding distance for the members of rank 3:
A C I L
1.67 ∞ 1.5 ∞
We would then add, in order {C, L, A, I}, but we only have room for one, so we
add C and complete this iteration with a new population of {D, J, E, H, K, C}.
9.4 Summary
Problems
• (20, 4)
• (18, 5)
• (34, 2)
• (19, 6)
𝑓1 𝑓2
6.0 8.0
6.0 4.0
5.0 6.0
2.0 8.0
10.0 5.0
6.0 0.5
8.0 3.0
4.0 9.0
9.0 7.0
8.0 6.0
3.0 1.0
7.0 9.0
1.0 2.0
3.0 7.0
1.5 1.5
4.0 6.5
There are various scenarios for which surrogate models are helpful.
One scenario is when the original model is computationally expensive.
Surrogate models can be queried with minimal computational cost, but
constructing them requires multiple evaluations of the original model.
Suppose the number of evaluations needed to build a sufficiently
accurate surrogate model is less than that needed to optimize the
original model directly. In that case, SBO may be a worthwhile option.
Constructing a surrogate model becomes even more compelling when
it is reused in multiple optimizations.
Surrogate modeling can be effective in handling noisy models
because they create a smooth representation of noisy data. This can be
particularly advantageous when using gradient-based optimization.
One scenario that leads to both expensive evaluation and noisy
output is experimental data. When the model data are experimental
and the optimizer cannot query the experiment in an automated way,
we can construct a surrogate model based on the experimental data.
Then, the optimizer can query the surrogate model in the optimization.
Surrogate models are also helpful when we want to understand
the design space, that is, how the objective and constraints (outputs)
vary with respect to the design variables (inputs). By constructing a
continuous model over discrete data, we obtain functional relationships
that can be visualized more effectively.
When multiple sources of data are available, surrogate models can
fuse the data to build a single model. The data could come from
numerical models with different levels of fidelity or experimental data.
For example, surrogate models can calibrate numerical model data
using experimental data. This is helpful because experimental data is
usually much more scarce than numerical data. The same reasoning
applies to low- versus high-fidelity numerical data.
One potential issue with surrogate models is the curse of dimension-
ality, which refers to poor scalability with the number of inputs. The
larger the number of inputs, the more model evaluations are needed
to construct a surrogate model that is accurate enough. Therefore, the
reasons for using surrogate models cited earlier might not be enough if
the optimization problem has a large number of design variables.
The SBO process is shown in Fig. 10.2. First, we use sampling
methods to choose the initial points to evaluate the function or conduct
experiments. These points are sometimes referred to as training data.
Next, we build a surrogate model from the sampled points. We can
then perform optimization by querying the surrogate model. Based
10.2 Sampling
Fig. 10.3 Contrast between random and Latin hypercube sampling with 50 points using uniform distributions (left: random; right: Latin hypercube sampling).
plan shown on the left of Fig. 10.6. This plan meets our criteria but
clearly does not fill the space and likely will not capture the relationships
between design parameters well. Alternatively, the right side of Fig. 10.6
has a sample in each row and column while also spanning the space
much more effectively.
One benefit of the LHS approach is that rather than relying on the law of large numbers to fill out our chosen probability distributions, we enforce it as a constraint. This method may still require many samples to characterize the design space accurately, but it usually requires far fewer than pure random sampling.
Instead of defining LHS as an optimization problem, a much simpler
approach is typically used in which we ensure one sample per interval,
but we rely on randomness to choose point combinations. Although
this does not necessarily yield a maximum spread, it works well in
practice and is simple to implement. Before discussing the algorithm,
we discuss how to generate other distributions besides just uniform
distributions.
We can convert from uniformly sampled points to an arbitrary
distribution using a technique called inversion sampling. Assume that
we want to generate samples 𝑥 from an arbitrary probability density
function (PDF) 𝑝(𝑥) or, equivalently, from the corresponding cumulative distribution function (CDF) 𝑃(𝑥).∗ The probability integral transform states that for any continuous CDF, 𝑦 = 𝑃(𝑥), the variable 𝑦 is uniformly distributed (a simple proof, but it is not shown here to avoid introducing additional notation). The procedure is to randomly sample from a uniform distribution (e.g., generate 𝑦), then compute the corresponding 𝑥 such that 𝑃(𝑥) = 𝑦, which we denote as 𝑥 = 𝑃⁻¹(𝑦). This latter step is known as an inverse CDF, a percent-point function, or a quantile function.

∗ PDFs and CDFs are reviewed in Appendix A.9.
This process is depicted in Fig. 10.7 for a normal distribution. This
same procedure allows us to use LHS with any distribution, simply by
generating the samples on a uniform distribution.
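To make this concrete, here is a minimal Python sketch (assuming NumPy and SciPy are available); scipy.stats.norm.ppf is the percent-point function, that is, the inverse CDF 𝑃⁻¹:

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
y = rng.uniform(size=1000)           # uniform samples on [0, 1]
x = norm.ppf(y, loc=0.0, scale=1.0)  # x = P^-1(y): normally distributed samples

The same two lines work for any distribution whose inverse CDF is available.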
Algorithm 10.1 Latin hypercube sampling

Inputs:
  𝑛ₛ: Number of samples
  𝑛_d: Number of dimensions
  𝑃 = {𝑃₁, . . . , 𝑃_{𝑛_d}}: (optionally) A set of cumulative distribution functions
Outputs:
  𝑋 = {𝑥^(1), . . . , 𝑥^(𝑛ₛ)}: Set of sample points

for 𝑗 = 1 to 𝑛_d do
    for 𝑖 = 1 to 𝑛ₛ do
        𝑉ᵢⱼ = 𝑖/𝑛ₛ − 𝑅ᵢⱼ/𝑛ₛ, where 𝑅ᵢⱼ ∈ 𝒰[0, 1]    Randomly choose a value in each equally spaced cell from a uniform distribution
    end for
    𝑋∗ⱼ = 𝑃ⱼ⁻¹(𝑉∗ⱼ), where 𝑃ⱼ is a CDF    Evaluate inverse CDF
    Randomly permute the entries of the column 𝑋∗ⱼ    Alternatively, permute the indices 1 . . . 𝑛ₛ in the prior for loop
end for
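A minimal Python sketch of this procedure follows (the function name and the optional inv_cdfs argument are our own; by default, the samples remain uniform on [0, 1]):

import numpy as np

def latin_hypercube(ns, nd, inv_cdfs=None, rng=None):
    """Latin hypercube sampling following Alg. 10.1."""
    gen = np.random.default_rng(rng)
    X = np.empty((ns, nd))
    for j in range(nd):
        i = np.arange(1, ns + 1)
        # one random value in each of the ns equally spaced cells
        V = i / ns - gen.uniform(size=ns) / ns
        col = inv_cdfs[j](V) if inv_cdfs is not None else V
        X[:, j] = gen.permutation(col)  # random pairing across dimensions
    return X

X = latin_hypercube(50, 2)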
An example using Alg. 10.1 for eight points is shown in Fig. 10.8.
Algorithm 10.2 Radical inverse function 𝜙(𝑖, 𝑏)

Inputs:
  𝑖: 𝑖th point in sequence
  𝑏: Base (integer)
Outputs:
  𝜙: Generated point
Halton Sequence
A Halton sequence uses pairwise prime numbers (larger than 1) for the base of each dimension of the problem.† The 𝑖th point in the Halton sequence is

(𝜙(𝑖, 𝑏₁), 𝜙(𝑖, 𝑏₂), . . . , 𝜙(𝑖, 𝑏_{𝑛ₓ})) ,    (10.5)

where the 𝑏ⱼ set is pairwise prime. As an example in two dimensions, Fig. 10.10 shows 30 generated points of the Halton sequence, where 𝑥₁ uses base 2 and 𝑥₂ uses base 3; a subsequent 20 generated points are then added, showing the reuse of existing points.

Fig. 10.10 Halton sequence with base 2 for 𝑥₁ and base 3 for 𝑥₂. First, 30 points are selected (in blue), and then 20 points are added (in red). These points would be identical to 50 points chosen at once.

If the dimensionality of the problem is high, then some of the base combinations lead to points that are highly correlated and thus
undesirable for a sampling plan. For example, the left of Fig. 10.11
shows 50 generated points where 𝑥1 uses base 17, and 𝑥2 uses base 19.
To avoid this issue, we can use a scrambled Halton sequence.
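As a sketch, the radical inverse 𝜙(𝑖, 𝑏) reflects the base-𝑏 digits of 𝑖 about the radix point; the following Python functions (names our own) implement 𝜙 and the Halton points of Eq. 10.5:

def radical_inverse(i, b):
    """phi(i, b): reflect the base-b digits of i about the radix point."""
    phi, f = 0.0, 1.0 / b
    while i > 0:
        i, d = divmod(i, b)
        phi += d * f
        f /= b
    return phi

def halton(n, bases):
    """First n points of the Halton sequence (Eq. 10.5)."""
    return [[radical_inverse(i, b) for b in bases] for i in range(1, n + 1)]

pts = halton(30, (2, 3))  # the base-2/base-3 example of Fig. 10.10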
Fig. 10.11 Halton sequence with base 17 for 𝑥₁ and base 19 for 𝑥₂: standard Halton sequence (left) and scrambled Halton (right).
Hammersley Sequence

The Hammersley sequence is closely related to the Halton sequence; the difference is that the first dimension uses the evenly spaced points 𝑖/𝑛ₛ, while the remaining dimensions follow the Halton construction.

Other Sequences
where 𝑥^(𝑖) is an input vector from the sampling plan, and 𝑓^(𝑖) contains the corresponding outputs from evaluating the model: 𝑓^(𝑖) = 𝑓(𝑥^(𝑖)).
We seek to construct a surrogate model from this data set. Surrogate
models can be based on physics, mathematics, or a combination of
the two. Incorporating known physics into a model is often desirable
to improve model accuracy. However, functional relationships are
unknown for many complex problems, and a data-driven mathematical
model can be more effective.
Surrogate models can be based on interpolation or regression, as illustrated in Fig. 10.13. Interpolation builds a function that exactly matches the provided training data. Regression models do not try to match the training data points exactly; instead, they minimize the error between a smooth trend function and the training data. The nature of the training data can help decide between these two types of surrogate models. Regression is particularly useful when the data are noisy.

Fig. 10.13 Interpolation and regression of training data.
The coefficients are chosen to minimize the error between our predicted function values 𝑓̂ and the actual function values 𝑓^(𝑖). Because we want to minimize both positive and negative errors, we minimize the sum of the squares of the errors (or a weighted sum of squared errors):∗

minimize_𝑤  Σᵢ ( 𝑓̂(𝑤; 𝑥^(𝑖)) − 𝑓^(𝑖) )² .    (10.11)

∗ The choice of minimizing the sum of the squares rather than the sum of the absolute values or some other metric is not arbitrary. The motivation for using the sum of the squares is discussed further in the following section.
Expanding the squared error in matrix form yields

minimize_𝑤  𝑤ᵀΨᵀΨ𝑤 − 2𝑓ᵀΨ𝑤 + 𝑓ᵀ𝑓 .    (10.15)

We can omit the last term from the objective because our optimization variables are 𝑤, and the last term has no 𝑤 dependence:

minimize_𝑤  𝑤ᵀΨᵀΨ𝑤 − 2𝑓ᵀΨ𝑤 .    (10.16)

This is a quadratic objective of the standard form

minimize_𝑥  ½𝑥ᵀ𝑄𝑥 + 𝑞ᵀ𝑥 ,    (10.17)

where

𝑄 = 2ΨᵀΨ    (10.18)
𝑞 = −2Ψᵀ𝑓 .    (10.19)

Setting the gradient of this objective to zero gives the linear system

𝑄𝑥 = −𝑞 .    (10.21)

For this problem, the solution can be written compactly using the pseudoinverse of Ψ:

𝑤 = Ψ†𝑓 .    (10.24)
Tip 10.2 Least squares is not the same as a linear system solution

Consider the quadratic fit discussed in Ex. 10.2. We are provided the data points, 𝑥 and 𝑓, shown as circles in Fig. 10.14. From these data, we construct the matrix Ψ for our basis functions as follows:

Ψ = [ (𝑥^(1))²    𝑥^(1)    1
      (𝑥^(2))²    𝑥^(2)    1
        ⋮           ⋮       ⋮
      (𝑥^(𝑛ₛ))²   𝑥^(𝑛ₛ)   1 ] .

We can then solve for the coefficients 𝑤 using the linear least squares solution (Eq. 10.23). Substituting the coefficients and respective basis functions into Eq. 10.10, we obtain the surrogate model,

𝑓̂(𝑥) = 𝑤₁𝑥² + 𝑤₂𝑥 + 𝑤₃ .

Fig. 10.14 Linear least squares example with a quadratic fit on a one-dimensional function.
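As a sketch in Python (the data values here are hypothetical because the example's table is not reproduced), numpy.linalg.lstsq computes the pseudoinverse solution 𝑤 = Ψ†𝑓 of Eq. 10.24:

import numpy as np

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])  # hypothetical data; the example's
f = np.array([8.0, 3.0, 1.0, 3.0, 9.0])    # actual values are not shown here

Psi = np.column_stack([x**2, x, np.ones_like(x)])  # basis functions
w, *_ = np.linalg.lstsq(Psi, f, rcond=None)        # w = pinv(Psi) @ f

def fhat(xq):
    return w[0] * xq**2 + w[1] * xq + w[2]         # surrogate model (Eq. 10.10)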
𝑤* = (𝐴ᵀ𝐴)⁻¹𝐴ᵀ𝑏
   = (ΨᵀΨ + 𝜇𝐼)⁻¹Ψᵀ𝑓 .    (10.27)
where 𝜀^(𝑖) captures the error associated with the 𝑖th data point. We assume that the error is normally distributed with mean zero and a standard deviation of 𝜎:

𝑝(𝜀^(𝑖)) = (1/(𝜎√(2𝜋))) exp( −(𝜀^(𝑖))² / (2𝜎²) ) .    (10.30)
Maximizing the log likelihood is then equivalent to the following minimization:

maximize_𝑤  − Σ_{𝑖=1}^{𝑛ₛ} ( 𝑓^(𝑖) − 𝑤ᵀ𝑥^(𝑖) )² / (2𝜎²)    (10.37)

⇒    minimize_𝑤  Σ_{𝑖=1}^{𝑛ₛ} ( 𝑓^(𝑖) − 𝑤ᵀ𝑥^(𝑖) )² / (2𝜎²) .    (10.38)
(𝐽ᵣ)ᵢⱼ = 𝜕𝑟ᵢ/𝜕𝑤ⱼ .    (10.43)
This is now the same form as linear least squares (Eq. 10.14), so we can reuse its solution (Eq. 10.23) to solve for the step:

Δ𝑤 = −( 𝐽ᵣᵀ𝐽ᵣ )⁻¹ 𝐽ᵣᵀ𝑟 .    (10.45)
The gradient is

∇𝑒ⱼ = Σᵢ 2𝑟ᵢ (𝜕𝑟ᵢ/𝜕𝑤ⱼ) ,    (10.48)

or in matrix form:

∇𝑒 = 2𝐽ᵣᵀ𝑟 .    (10.49)
If we neglect the second term in the Hessian, then the Newton update is:

𝑤_{𝑘+1} = 𝑤_𝑘 − ½ ( 𝐽ᵣᵀ𝐽ᵣ )⁻¹ 2𝐽ᵣᵀ𝑟
        = 𝑤_𝑘 − ( 𝐽ᵣᵀ𝐽ᵣ )⁻¹ 𝐽ᵣᵀ𝑟 ,    (10.52)
which is the same update as before.
Thus, another interpretation of this method is that a Gauss–Newton
step is a modified Newton step where the second derivatives of the
residual are neglected (and thus, a quasi-Newton approach to estimate
second derivatives is not needed). This method is particularly effective
near convergence because as 𝑟 → 0 (i.e., as we approach the solution to
our residual minimization), the neglected term also approaches zero.
The appeal of this approach is that we can often obtain an accurate
prediction for the Hessian using only the first derivatives because of
the known structure of the objective.
When the second term is not small, the Gauss–Newton step may be too inaccurate. We could use a line search, but the Levenberg–Marquardt algorithm utilizes a different strategy. The idea is to regularize the problem as discussed in the previous section or, in other words, provide the ability to dampen the steps as needed. Each linearized subproblem becomes a damped least-squares problem. When the damping parameter 𝜇 is large, the step tends toward the steepest-descent direction,

Δ𝑤 = −(1/𝜇) 𝐽ᵣᵀ𝑟 .    (10.55)

A further refinement scales the damping by the local curvature, yielding the subproblem

minimize_{Δ𝑤}  ‖𝐽ᵣΔ𝑤 + 𝑟‖₂² + 𝜇‖𝐷Δ𝑤‖₂² ,    (10.56)

where 𝐷 is defined as

𝐷² = diag( 𝐽ᵣᵀ𝐽ᵣ ) .    (10.57)
This matrix scales the objective by the diagonal elements of the Hessian. Thus, when 𝜇 is large and the direction tends toward steepest descent, the components of the gradient are scaled by the curvature, which reduces the amount of zigzagging. The solution to the minimization
problem of Eq. 10.56 is
Δ𝑤 = −( 𝐽ᵣᵀ𝐽ᵣ + 𝜇 diag(𝐽ᵣᵀ𝐽ᵣ) )⁻¹ 𝐽ᵣᵀ𝑟 .    (10.58)
Algorithm 10.3 Levenberg–Marquardt

Inputs:
  𝑥₀: Starting point
  𝜇₀: Initial damping parameter
  𝜌: Damping parameter factor
Outputs:
  𝑥*: Optimal solution

𝑘 = 0
𝑥 = 𝑥₀
𝜇 = 𝜇₀
𝑟, 𝐽 = residual(𝑥)
𝑒 = ‖𝑟‖₂²    Residual error
while |Δ| > 𝜏 do
    𝑠 = −( 𝐽ᵀ𝐽 + 𝜇 diag(𝐽ᵀ𝐽) )⁻¹ 𝐽ᵀ𝑟    Evaluate step
    𝑟ₛ, 𝐽ₛ = residual(𝑥 + 𝑠)
    𝑒ₛ = ‖𝑟ₛ‖₂²
    Δ = 𝑒ₛ − 𝑒    Change in residual error
    if Δ < 0 then    Objective decreased; accept step
        𝑥 = 𝑥 + 𝑠
        𝑟, 𝐽, 𝑒 = 𝑟ₛ, 𝐽ₛ, 𝑒ₛ
        𝜇 = 𝜇/𝜌    Decrease damping
    else    Reject step
        𝜇 = 𝜇 · 𝜌    Increase damping
    end if
    𝑘 = 𝑘 + 1
end while
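A minimal Python transcription of Alg. 10.3 follows, applied to the Rosenbrock function written in residual form, 𝑟 = (1 − 𝑥₁, 10(𝑥₂ − 𝑥₁²)), a standard choice we assume here since the example does not spell it out:

import numpy as np

def residual(x):
    r = np.array([1 - x[0], 10 * (x[1] - x[0]**2)])
    J = np.array([[-1.0, 0.0], [-20 * x[0], 10.0]])  # dr/dx
    return r, J

def levenberg_marquardt(x, mu=0.01, rho=10.0, tau=1e-6):
    r, J = residual(x)
    e = r @ r                     # sum of squared residuals
    delta = np.inf
    while abs(delta) > tau:
        A = J.T @ J
        s = -np.linalg.solve(A + mu * np.diag(np.diag(A)), J.T @ r)
        rs, Js = residual(x + s)
        es = rs @ rs
        delta = es - e
        if delta < 0:             # accept step, decrease damping
            x, r, J, e = x + s, rs, Js, es
            mu /= rho
        else:                     # reject step, increase damping
            mu *= rho
    return x

xstar = levenberg_marquardt(np.array([-1.2, -1.0]))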
In the following example, we use the same starting point as Ex. 4.18
(𝑥0 = [−1.2, −1]), an initial damping parameter of 𝜇 = 0.01, an update factor
of 𝜌 = 10, and a tolerance of 𝜏 = 10−6 (change in sum of squared errors). The
iteration path is shown on the left of Fig. 10.15, and the convergence of the sum
of squared errors is shown on the right side.
Fig. 10.15 Levenberg–Marquardt algorithm applied to the minimization of the Rosenbrock function: iteration history from 𝑥₀ to 𝑥* (left) and convergence of the sum of squared residuals ‖𝑟‖₂² over the 42 iterations (right).
A model with many degrees of freedom can closely fit a given set of data, but the resulting model may have poor predictive ability. In other words, we are fitting noise. The following example illustrates this idea with a one-dimensional function.
Consider the set of training data (Fig. 10.16, left), which we use to create a
surrogate function. This is a one-dimensional problem so that it can be easily
visualized. In general, however, visualization is limited, and determining the
right basis functions to use can be difficult. If we use a polynomial basis, we
might attempt to determine the appropriate order by trying each case (e.g.,
quadratic, cubic, quartic) and measuring the error in our fit (Fig. 10.16, center).
Fig. 10.16 Fitting polynomials of different order to data: the training data (left); the error in fitting the data decreases with the order of the polynomial (center); a 19th-order polynomial fit to the data has low error but poor predictive ability (right).

It seems as if the higher the order of the polynomial, the lower the error. For example, a 20th-order polynomial reduces the error to almost zero. The
problem is that although the error is low on this set of data, the predictive
capability of such a model for other data points is poor. For example, the right
side of Fig. 10.16 shows a 19th-order polynomial fit to the data. The model
passes right through the points, but it does not work well for many of the
points that are not part of the training set (which is the whole purpose of the
surrogate).
The opposite of overfitting is underfitting, which is also a potential issue.
When underfitting, we do not have enough degrees of freedom to create a
useful model (e.g., imagine using a linear fit for the previous example).
1. Randomly split your data into a training set and a validation set
(e.g., a 70–30 split).
2. Train each candidate model (the different options for 𝜓) using
only the training set, but evaluate the error with the validation set.
The error on previously unseen data is called the generalization
error (𝑒 𝑔 in Fig. 10.17).
3. Choose the model with the lowest generalization error, and
optionally retrain that model using all of the data.
Fig. 10.17 Simple cross validation: the data are split into a training set and a test set.
An alternative option that is more involved but uses the data more effectively is called 𝑘-fold cross validation. It is particularly advantageous when we have a small data set where we cannot afford to leave much out. This procedure is illustrated in Fig. 10.18 and consists of the following steps:

1. Randomly split the data into 𝑘 sets.
2. Train each candidate model using all the data except one set, and evaluate the generalization error on the omitted set.
3. Repeat with each of the 𝑘 sets left out in turn, and average the generalization errors across the 𝑘 trials.
The extreme version of this process, when training data are very limited,
is leave-one-out cross validation (i.e., each testing subset consists of one
data point).
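As an illustration, a Python sketch of 𝑘-fold cross validation for selecting a polynomial order might look as follows (assuming data arrays x and f; the function name is our own):

import numpy as np

def kfold_error(x, f, order, k=10, seed=0):
    """Average generalization error of a degree-`order` polynomial fit."""
    idx = np.random.default_rng(seed).permutation(len(x))
    folds = np.array_split(idx, k)
    errs = []
    for i in range(k):
        test = folds[i]
        train = np.hstack([folds[j] for j in range(k) if j != i])
        w = np.polyfit(x[train], f[train], order)           # train
        errs.append(np.mean((np.polyval(w, x[test]) - f[test])**2))  # test
    return np.mean(errs)

# choose the order with the lowest average generalization error:
# best = min(range(1, 20), key=lambda o: kfold_error(x, f, o))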
Fig. 10.18 𝑘-fold cross validation: each of the 𝑘 subsets is used in turn as the test set, yielding generalization errors 𝑒_g that are averaged.

This example continues from Ex. 10.5. First, we perform 𝑘-fold cross validation using 10 divisions. The average error across the divisions using the training data is shown in Fig. 10.19 (with a smaller 𝑦-axis scale on the right). The error increases dramatically as the polynomial order increases. Zooming in on the flat region, we see a range of options with similar errors. Among the similar solutions, we generally prefer the simplest model. In this case, a fourth-order polynomial seems reasonable. A fourth-order polynomial is compared against the data in Fig. 10.20. This model has a much better predictive ability.

Fig. 10.19 Average 𝑘-fold cross-validation error versus polynomial order.

Fig. 10.20 A fourth-order polynomial fit to the data.
Polynomials
where 𝑐 is the center point, and 𝑟 is the radius about the center point. Although the center points can be placed anywhere, we usually choose the sampling data as the centering points:

𝜓^(𝑖) = 𝜓( ‖𝑥 − 𝑥^(𝑖)‖ ) .    (10.60)
This is often a useful choice because it captures the idea that our
ability to predict function behavior is related to how close we are to
known function values (in other words, nearby points are more highly
correlated). This form naturally lends itself to interpolation, although
regularization can be added to allow for regression. Polynomials are
often combined with radial basis functions because the polynomial can
capture global function behavior, while the radial basis functions can
introduce modifications to capture local behavior.
One popular radial basis function is the Gaussian basis:

𝜓^(𝑖)(𝑥) = exp( − Σⱼ 𝜃ⱼ ( 𝑥ⱼ − 𝑥ⱼ^(𝑖) )² ) ,    (10.61)
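A minimal Python sketch of a Gaussian RBF interpolant built on the training points (with 𝜃 fixed rather than tuned, for brevity; function names are our own):

import numpy as np

def rbf_fit(X, f, theta=1.0):
    """Interpolating weights for Gaussian bases (Eq. 10.61) centered at X."""
    d2 = ((X[:, None, :] - X[None, :, :])**2).sum(axis=2)
    Psi = np.exp(-theta * d2)
    return np.linalg.solve(Psi, f)   # exact match at the training points

def rbf_predict(Xq, X, w, theta=1.0):
    d2 = ((Xq[:, None, :] - X[None, :, :])**2).sum(axis=2)
    return np.exp(-theta * d2) @ w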
The surrogate modeling toolbox (SMT)† is a useful Python package for surrogate modeling. († https://smt.readthedocs.io/)
10.4 Kriging
Kriging models the unknown function as a random process 𝑍(𝑥), where the correlation between points is given by a kernel function:

𝐾( 𝑥^(𝑖), 𝑥^(𝑗) ) = corr( 𝑍(𝑥^(𝑖)), 𝑍(𝑥^(𝑗)) ) .    (10.63)

As a matrix, the kernel is represented as 𝐾ᵢⱼ = 𝐾( 𝑥^(𝑖), 𝑥^(𝑗) ). Various kernel functions are used with kriging.∗ The most commonly used kernel is

𝐾( 𝑥^(𝑖), 𝑥^(𝑗) ) = exp( − Σ_{𝑙=1}^{𝑛_d} 𝜃ₗ | 𝑥ₗ^(𝑖) − 𝑥ₗ^(𝑗) |^{𝑝ₗ} ) .    (10.64)

∗ Kernel functions must be chosen such that the resulting kernel matrix is always symmetric and positive definite.

The covariance of the outputs is then (the relevant probability concepts are reviewed in Appendix A.9)

Σᵢⱼ = cov( 𝐹^(𝑖), 𝐹^(𝑗) ) = 𝜎² corr( 𝐹^(𝑖), 𝐹^(𝑗) ) = 𝜎² 𝐾( 𝑥^(𝑖), 𝑥^(𝑗) ) .    (10.67)
This yields the following log likelihood function:

ℓ(𝜇, 𝜎, 𝜃, 𝑝) = −(𝑛ₛ/2) ln(2𝜋) − (𝑛ₛ/2) ln 𝜎² − ½ ln|𝐾| − ( (𝑓 − 𝑒𝜇)ᵀ𝐾⁻¹(𝑓 − 𝑒𝜇) ) / (2𝜎²) .    (10.68)
We can maximize part of this term analytically by taking derivatives with respect to 𝜇 and 𝜎, setting them equal to zero, and solving for their optimal values to obtain:

𝜇* = ( 𝑒ᵀ𝐾⁻¹𝑓 ) / ( 𝑒ᵀ𝐾⁻¹𝑒 )    (10.69)

𝜎*² = ( (𝑓 − 𝑒𝜇*)ᵀ𝐾⁻¹(𝑓 − 𝑒𝜇*) ) / 𝑛ₛ .    (10.70)
We now substitute these values back into the log likelihood function (Eq. 10.68), which yields

ℓ(𝜃, 𝑝) = −(𝑛ₛ/2) ln 𝜎*² − ½ ln|𝐾| .    (10.71)
This function, also called the concentrated likelihood function, only de-
pends on the kernel 𝐾, which depends on 𝜃 and 𝑝.
We cannot solve for optimal values of 𝜃 and 𝑝 analytically. Instead,
we rely on numerical optimization to maximize Eq. 10.71. Because 𝜃 can
vary across a broad range, it is often better to search using logarithmic
scaling. Once we solve that optimization problem, we compute the
mean and variance in Eqs. 10.69 and 10.70.
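As a sketch, the concentrated likelihood of Eq. 10.71 can be evaluated in Python as follows (with 𝑝 = 2 fixed and a small diagonal term added for numerical stability, which is our own assumption); an outer optimizer would then minimize this negative log likelihood over log₁₀ 𝜃:

import numpy as np

def neg_concentrated_likelihood(log10_theta, X, f):
    theta = 10.0 ** np.asarray(log10_theta)
    d2 = ((X[:, None, :] - X[None, :, :])**2 * theta).sum(axis=2)
    K = np.exp(-d2) + 1e-10 * np.eye(len(f))  # kernel, Eq. 10.64 with p = 2
    Kinv = np.linalg.inv(K)
    e = np.ones(len(f))
    mu = (e @ Kinv @ f) / (e @ Kinv @ e)                  # Eq. 10.69
    sig2 = (f - e * mu) @ Kinv @ (f - e * mu) / len(f)    # Eq. 10.70
    _, logdetK = np.linalg.slogdet(K)
    return 0.5 * len(f) * np.log(sig2) + 0.5 * logdetK    # -l, Eq. 10.71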
Now that we have a fitted model, we can make predictions at new
points where we have not sampled. We do this by substituting 𝑥 𝑝 into
a formula called the kriging predictor. The formula is unique, but there
are many ways to derive it. One way to derive it is to find the function
value at 𝑥 𝑝 that is the most consistent with the behavior of the function
captured by the fitted kriging model.
Let 𝑓𝑝 be our guess for the value of the function at 𝑥 𝑝 . One way
to assess the consistency of our guess is to add (𝑥 𝑝 , 𝑓𝑝 ) as an artificial
point to our training data (so that we now have 𝑛 𝑠 + 1 points) and
estimate the likelihood using the parameters from our fitted kriging
model. The likelihood of this augmented data can now be thought of
as a function of 𝑓𝑝 : high values correspond to guessed values of 𝑓𝑝 that
are consistent with function behavior captured by the fitted kriging
model. Therefore, the value of 𝑓𝑝 that maximizes the likelihood of this
augmented data set is a natural way to predict the value of the function.
We augment the kernel matrix with the new point:

𝐾̄ = [ 𝐾   𝑘
       𝑘ᵀ  1 ] ,    (10.72)

where 𝑘 is the correlation of the new point with the training data, given by

𝑘 = [ corr( 𝐹(𝑥^(1)), 𝐹(𝑥_p) ) = 𝐾( 𝑥^(1), 𝑥_p )
      ⋮
      corr( 𝐹(𝑥^(𝑛ₛ)), 𝐹(𝑥_p) ) = 𝐾( 𝑥^(𝑛ₛ), 𝑥_p ) ] .    (10.73)
The 1 in the bottom right of the augmented correlation matrix (Eq. 10.72) is because the correlation of the new variable 𝐹(𝑥_p) with itself is 1. The log likelihood function with these new augmented vectors and the previously determined parameters is as follows (see Eq. 10.68):

ℓ(𝑓_p) = −(𝑛ₛ/2) ln(2𝜋) − (𝑛ₛ/2) ln(𝜎*²) − ½ ln|𝐾̄| − ( (𝑓̄ − 𝑒𝜇*)ᵀ𝐾̄⁻¹(𝑓̄ − 𝑒𝜇*) ) / (2𝜎*²) .
We want to maximize this function with respect to 𝑓_p. Because only the last term depends on 𝑓_p (it is a part of 𝑓̄), we can omit the other terms and formulate the following:

maximize_{𝑓_p}  ℓ(𝑓_p) = − ( (𝑓̄ − 𝑒𝜇*)ᵀ𝐾̄⁻¹(𝑓̄ − 𝑒𝜇*) ) / (2𝜎*²) .    (10.74)
This problem can be solved analytically, yielding the mean value of the
kriging prediction,
𝑓_p = 𝜇* + 𝑘ᵀ𝐾⁻¹( 𝑓 − 𝑒𝜇* ) .    (10.75)
The mean square error of the kriging prediction (that is, the expected squared value of the error) is given by Eq. 10.76. If 𝑥_p is the same as one of the training data points, 𝑥^(𝑖), then 𝑘 is just the 𝑖th column of 𝐾. Hence, 𝐾⁻¹𝑘 is the vector 𝑒ᵢ, with all zeros except for a 1 in the 𝑖th element. In the prediction (Eq. 10.75), 𝑘ᵀ𝐾⁻¹ = 𝑒ᵢᵀ, so the predictor reproduces the training value 𝑓^(𝑖) exactly.
We can fit a kriging model to this data by following the procedure in this section. This includes solving the optimization problem of Eq. 10.71 using a gradient-based method with exact derivatives. We fix 𝑝 = 2 and search for 𝜃 in the range [10⁻³, 10²] with the exponent as the optimization variable.
The resulting interpolation is shown in Fig. 10.21, where we plot the mean line. The shaded area represents the uncertainty corresponding to ±1 standard error. The uncertainty goes to zero at the known data points and is largest when far from known data points.

Fig. 10.21 Kriging model showing the training data (dots), the kriging predictor (blue line), and the confidence interval corresponding to ±1 standard error (shaded areas), compared to the actual function (gray line).

If we can provide the gradients of the function at the training data points (in addition to the function values), we can use that information to build a more accurate kriging model. This approach is called gradient-enhanced kriging (GEK). The methodology is the same as before, except
we add more observed outputs (i.e., in addition to the function values at
the sampled points, we add their gradients). In addition to considering
the correlation between the function values at different sampled points,
the kernel matrix 𝐾 needs to be expanded to consider correlations
between function values and gradients, gradients and function values,
and among gradient components.
We can still use the prediction formula (Eq. 10.75) for the GEK predictor and the mean square error formula (Eq. 10.76) if we plug in “expanded versions” of the outputs 𝑓, the vector 𝑘, the matrix 𝐾, and the vector of 1s, 𝑒.
We expand the output vector to include not just the function values at the sampled points but also their gradients:

𝑓_GEK ≡ [ 𝑓₁ ; . . . ; 𝑓_{𝑛ₛ} ; ∇𝑓₁ ; . . . ; ∇𝑓_{𝑛ₛ} ] .    (10.77)
This vector is of length 𝑛 𝑠 + 𝑛 𝑠 𝑛 𝑑 , where 𝑛 𝑑 is the dimension of 𝑥. The
gradients are usually provided at the same 𝑥 locations as the function
samples, but that is not required.
Recall that the term 𝑒𝜇* in Eq. 10.75 for the kriging predictor represents the expected value of the random variables 𝐹^(1), . . . , 𝐹^(𝑛ₛ). Now that we have expanded the outputs to include the gradients at the sampled points, the mean vector needs to be expanded to include the expected values of ∇𝐹^(𝑖), which are all zero. We can still use 𝑒𝜇* in the formula for the predictor if we use the following definition:

𝑒_GEK ≡ [ 1, . . . , 1, 0, . . . , 0 ]ᵀ ,    (10.78)

where 1 occurs for the first 𝑛ₛ entries, and 0 for the remaining 𝑛ₛ𝑛_d entries.
The additional correlations (between function values and derivatives and between the derivatives themselves) are as follows:

corr( 𝐹(𝑥^(𝑖)), 𝐹(𝑥^(𝑗)) ) = 𝐾ᵢⱼ

corr( 𝐹(𝑥^(𝑖)), 𝜕𝐹(𝑥^(𝑗))/𝜕𝑥ₗ ) = 𝜕𝐾ᵢⱼ / 𝜕𝑥ₗ^(𝑗)

corr( 𝜕𝐹(𝑥^(𝑖))/𝜕𝑥ₗ, 𝐹(𝑥^(𝑗)) ) = 𝜕𝐾ᵢⱼ / 𝜕𝑥ₗ^(𝑖)    (10.79)

corr( 𝜕𝐹(𝑥^(𝑖))/𝜕𝑥ₗ, 𝜕𝐹(𝑥^(𝑗))/𝜕𝑥ₖ ) = 𝜕²𝐾ᵢⱼ / ( 𝜕𝑥ₗ^(𝑖) 𝜕𝑥ₖ^(𝑗) ) .
The (𝑛ₛ × 𝑛ₛ𝑛_d) matrix of first derivatives is

𝐽_K = [ (𝜕𝐾₁₁/𝜕𝑥^(1))ᵀ      . . .   (𝜕𝐾_{1𝑛ₛ}/𝜕𝑥^(𝑛ₛ))ᵀ
           ⋮                  ⋱          ⋮
        (𝜕𝐾_{𝑛ₛ1}/𝜕𝑥^(1))ᵀ   . . .   (𝜕𝐾_{𝑛ₛ𝑛ₛ}/𝜕𝑥^(𝑛ₛ))ᵀ ] ,    (10.82)
and the (𝑛 𝑠 𝑛 𝑑 × 𝑛 𝑠 𝑛 𝑑 ) matrix of second derivatives is
𝐻_K = [ 𝜕²𝐾₁₁/(𝜕𝑥^(1)𝜕𝑥^(1))       . . .   𝜕²𝐾_{1𝑛ₛ}/(𝜕𝑥^(1)𝜕𝑥^(𝑛ₛ))
           ⋮                         ⋱          ⋮
        𝜕²𝐾_{𝑛ₛ1}/(𝜕𝑥^(𝑛ₛ)𝜕𝑥^(1))   . . .   𝜕²𝐾_{𝑛ₛ𝑛ₛ}/(𝜕𝑥^(𝑛ₛ)𝜕𝑥^(𝑛ₛ)) ] .    (10.83)
We can still get the estimates 𝜇∗ and 𝜎∗2 with Eqs. 10.69 and 10.70
using the expanded versions of 𝐾, 𝑒, 𝑓 and replacing 𝑛 𝑠 in Eq. 10.76
with 𝑛 𝑠 (𝑛 𝑑 + 1), which is the new length of the outputs.
The predictor equations (Eqs. 10.75 and 10.76) also apply with the expanded matrices and vectors. However, we also need to expand 𝑘 in these computations to include the correlations between the gradients at the sampled points and the function value at the point 𝑥_p where we make the prediction:

𝑘_GEK ≡ [ 𝑘
          corr( 𝜕𝐹(𝑥^(1))/𝜕𝑥^(1), 𝐹(𝑥_p) ) = 𝜕𝐾(𝑥^(1), 𝑥_p)/𝜕𝑥^(1)
          ⋮
          corr( 𝜕𝐹(𝑥^(𝑛ₛ))/𝜕𝑥^(𝑛ₛ), 𝐹(𝑥_p) ) = 𝜕𝐾(𝑥^(𝑛ₛ), 𝑥_p)/𝜕𝑥^(𝑛ₛ) ] .    (10.84)
We repeat Ex. 10.7 but this time include the gradients (Fig. 10.22). The standard error reduces dramatically between points. The additional information contained in the derivatives significantly helps in creating a more accurate fit.

Fig. 10.22 Gradient-enhanced kriging fit to the data of Ex. 10.7.

Fig. 10.23 Kriging fit to the multimodal Jones function: original function (left) and kriging fit (right).
Like kriging, deep neural nets can be used to approximate highly non-
linear simulations where we do not need to provide a parametric form.
Neural networks follow the same basic steps described for other surro-
gate models but with a unique model leading to specialized approaches
for derivative computation and optimization strategy. Neural networks
loosely mimic the brain, which consists of a vast network of neurons.
In neural networks, each neuron is a node that represents a simple
function. A network defines chains of these simple functions to obtain
composite functions that are much more complex. For example, three
simple functions, 𝑓 (1) , 𝑓 (2) , and 𝑓 (3) , may be chained into the composite
function (or network):
𝑓(𝑥) = 𝑓^(3)( 𝑓^(2)( 𝑓^(1)(𝑥) ) ) .    (10.86)
function. This means that the output from a neuron is a number, and
thus the output from a whole layer can be represented as a vector 𝑥.
We represent the vector of values for layer 𝑘 by 𝑥^(𝑘), and the value for the 𝑖th neuron in layer 𝑘 by 𝑥ᵢ^(𝑘).
Consider a neuron in layer 𝑘. This neuron is connected to many
neurons from the previous layer 𝑘 − 1 (see the first part of Fig. 10.25).
We need to choose a functional form for each neuron in the layer that
takes in the values from the previous layer as inputs. Chaining together
linear functions would yield another linear function. Therefore, some
layers must use nonlinear functions.
The most common choice for hidden layers is a layer of linear functions followed by a layer of functions that create nonlinearity. A neuron in the linear layer produces the following intermediate variable:

𝑧 = Σ_{𝑗=1}^{𝑛} 𝑤ⱼ 𝑥ⱼ^(𝑘−1) + 𝑏 .    (10.87)

In vector form:

𝑧 = 𝑤ᵀ𝑥^(𝑘−1) + 𝑏 .    (10.88)

The first term is a weighted sum of the values from the neurons in the previous layer. The 𝑤 vector contains the weights. The term 𝑏 is the bias, which is an offset that scales the significance of the overall output. These two terms are analogous to the weights used in the previous section but with the constant term separated for convenience. The second column of Fig. 10.25 illustrates the linear (summation and bias) layer.

Fig. 10.26 Activation functions: the sigmoid (top) and the ReLU (bottom), plotted as 𝑎(𝑧) versus 𝑧.
Fig. 10.25 A neuron in layer 𝑘 computes 𝑧ᵢ = Σⱼ 𝑤ⱼ 𝑥ⱼ^(𝑘−1) + 𝑏ᵢ from the previous layer's values and then applies the activation function, 𝑥^(𝑘) = 𝑎(𝑧).
To compute the outputs for all the neurons in this layer, the weights 𝑤 for one neuron form one row in a matrix of weights 𝑊, and we can write:

[ 𝑥₁^(𝑘) ; ⋮ ; 𝑥ᵢ^(𝑘) ; ⋮ ; 𝑥_{𝑛ₖ}^(𝑘) ] = 𝑎( [ 𝑊₁,₁ ⋯ 𝑊₁,ⱼ ⋯ 𝑊₁,𝑛ₖ₋₁ ; ⋮ ; 𝑊ᵢ,₁ ⋯ 𝑊ᵢ,ⱼ ⋯ 𝑊ᵢ,𝑛ₖ₋₁ ; ⋮ ; 𝑊𝑛ₖ,₁ ⋯ 𝑊𝑛ₖ,ⱼ ⋯ 𝑊𝑛ₖ,𝑛ₖ₋₁ ] [ 𝑥₁^(𝑘−1) ; ⋮ ; 𝑥ⱼ^(𝑘−1) ; ⋮ ; 𝑥_{𝑛ₖ₋₁}^(𝑘−1) ] + [ 𝑏₁ ; ⋮ ; 𝑏ᵢ ; ⋮ ; 𝑏_{𝑛ₖ} ] )    (10.92)

or

𝑥^(𝑘) = 𝑎( 𝑊𝑥^(𝑘−1) + 𝑏 ) .    (10.93)

The activation function is applied separately for each row. The following equation is more explicit (where 𝑤ᵢ is the 𝑖th row of 𝑊):

𝑥ᵢ^(𝑘) = 𝑎( 𝑤ᵢᵀ𝑥^(𝑘−1) + 𝑏ᵢ ) .    (10.94)
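A minimal Python sketch of the forward pass of Eq. 10.93 through a small, randomly initialized network (the layer sizes are arbitrary illustrative choices):

import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward(x, layers):
    """layers: list of (W, b, activation); applies Eq. 10.93 per layer."""
    for W, b, a in layers:
        x = a(W @ x + b)    # x^(k) = a(W x^(k-1) + b)
    return x

rng = np.random.default_rng(0)
layers = [(rng.normal(size=(8, 2)), np.zeros(8), relu),   # hidden layer
          (rng.normal(size=(1, 8)), np.zeros(1), lambda z: z)]  # linear output
y = forward(np.array([0.5, -1.0]), layers)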
We now have the objective and variables in place to train the neural
net. As with the other models discussed in this chapter, it is critical to
set aside some data for cross validation.
Although less general, this approach can increase efficiency and stability.
58. Baydin et al., Automatic differentiation in machine learning: A survey, 2018.
The ReLU activation function (Fig. 10.26, bottom) is not differentiable
at 𝑧 = 0, but in practice, this is generally not problematic—primarily
because these methods typically rely on inexact gradients anyway, as
discussed next.
The objective function in Eq. 10.95 consists of a sum of subfunctions, each of which depends on a single data point (𝑥^(𝑖), 𝑓^(𝑖)). Objective functions vary across machine learning applications, but most have this same form:

minimize_𝜃  𝑓(𝜃) ,    (10.96)

where

𝑓(𝜃) = Σ_{𝑖=1}^{𝑛} ℓ( 𝜃; 𝑥^(𝑖), 𝑓^(𝑖) ) = Σ_{𝑖=1}^{𝑛} ℓᵢ(𝜃) .    (10.97)
As previously mentioned, the challenge with these problems is that
we often have large training sets where 𝑛 may be in the billions. That
means that computing the objective can be costly, and computing the
gradient can be even more costly.
If we divide the objective by 𝑛 (which does not change the solution), the objective function becomes an approximation of the expected value (see Appendix A.9):

𝑓(𝜃) = (1/𝑛) Σ_{𝑖=1}^{𝑛} ℓᵢ(𝜃) = E( ℓ(𝜃) ) .    (10.98)
Using a minibatch 𝑆 of size 𝑚, we can estimate the gradient from the subfunction gradients at different training points:

∇_𝜃 𝑓(𝜃) ≈ (1/𝑚) Σ_{𝑖∈𝑆} ∇_𝜃 ℓ( 𝜃; 𝑥^(𝑖), 𝑓^(𝑖) ) .    (10.99)
Thus, we divide the training data into these minibatches and use a new
minibatch to estimate the gradients at each iteration in the optimization.
Fig. 10.27 Minibatches are randomly drawn from the training data.

This approach works well for these specific problems because of the unique form of the objective (Eq. 10.98). As an example, for one
million training samples, a single gradient evaluation would require
evaluating all one million training samples. Alternatively, for a similar
cost, a minibatch approach can update the optimization variables a
million times using the gradient estimated from one training sample
at a time. This latter process usually converges much faster, mainly
because we are only fitting parameters against limited data in these
problems, so we generally do not need to find the exact minimum.
Typically, this gradient is used with steepest descent methods (Sec-
tion 4.4.1), more typically referred to as gradient descent in the machine
learning communities. As discussed in Chapter 4, steepest descent
is not the most effective optimization algorithm. However, steepest
descent with the minibatch updates, called stochastic gradient descent,
has been found to work well in machine learning applications. This
suitability is primarily because (1) many machine learning optimiza-
tions are performed repeatedly, (2) the true objective is difficult to
formalize, and (3) finding the absolute minimum is not as important as
finding a good enough solution quickly. One key difference in stochastic
gradient descent relative to the steepest descent method is that we do
not perform a line search. Instead, the step size (called the learning rate
in machine learning applications) is a preselected value that is usually
decreased between major optimization iterations.
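A Python sketch of stochastic gradient descent with minibatches follows; grad_i is a user-supplied function returning the summed gradient of the subfunctions over a minibatch, and the decay factor is an arbitrary illustrative choice:

import numpy as np

def sgd(grad_i, theta, n, m=32, lr=0.01, epochs=10, seed=0):
    """Minibatch stochastic gradient descent (Eq. 10.99)."""
    gen = np.random.default_rng(seed)
    for _ in range(epochs):
        for idx in np.array_split(gen.permutation(n), max(n // m, 1)):
            theta = theta - lr * grad_i(theta, idx) / len(idx)
        lr *= 0.9   # decay the learning rate between major iterations
    return theta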
Stochastic minibatching is easily applied to first-order methods and has thus driven the development of improvements on stochastic gradient descent, such as momentum, Adagrad, and Adam.177
177. Ruder, An overview of gradient descent optimization algorithms, 2016.
10.6.1 Exploitation
For models that do not provide uncertainty estimates, the only real
option is exploitation. A prediction-based exploitation infill strategy
adds an infill point wherever the surrogate predicts the optimum. The
reasoning behind this approach is that in SBO, we do not necessarily
care about having a globally accurate surrogate; instead, we only care
about having an accurate surrogate near the optimum.
The most logical point to sample is thus the optimum predicted by
the surrogate. Likely, the location predicted by the surrogate will not
be at the true optimum. However, evaluating this point adds valuable
information in the region of interest.
We rebuild the surrogate and re-optimize, repeating the process until
convergence. This approach usually results in the quickest convergence
to an optimum, which is desirable when the actual function is expensive
to evaluate. The downside is that we may converge prematurely to an
inferior local optimum for problems with multiple local optima.
Even though the approach is called exploitation, the optimizer used
on the surrogate can be a global search method (gradient-based or
gradient-free), although it is usually a local search method. If uncer-
tainty is present, using the mean value of the surrogate as the infill
criterion results in essentially an exploitation strategy.
Algorithm 10.4 Surrogate-based optimization with a prediction-based infill

Inputs:
  𝑛ₛ: Number of initial samples
  𝑥, 𝑥̄: Variable lower and upper bounds
  𝜏: Convergence tolerance
Outputs:
  𝑥*: Best point identified
  𝑓*: Corresponding function value

𝑥^(𝑖) = sample(𝑛ₛ, 𝑛_d)    Sample
while not converged do    For example, until the change in 𝑓* falls below 𝜏
    Construct the surrogate 𝑓̂ from the training data
    𝑥*, 𝑓̂* = min 𝑓̂(𝑥)    Perform optimization on the surrogate function
    𝑓_new = 𝑓(𝑥*)    Evaluate true function at predicted optimum
    𝑥^(𝑖) = 𝑥^(𝑖) ∪ 𝑥*    Append new point to training data
    𝑓^(𝑖) = 𝑓^(𝑖) ∪ 𝑓_new    Append corresponding function value
    𝑘 = 𝑘 + 1
end while
The expected improvement for a kriging model can be found analytically:

EI(𝑥) = ( 𝑓* − 𝜇_f(𝑥) ) Φ( ( 𝑓* − 𝜇_f(𝑥) ) / 𝜎_f(𝑥) ) + 𝜎_f(𝑥) 𝜙( ( 𝑓* − 𝜇_f(𝑥) ) / 𝜎_f(𝑥) ) ,    (10.102)

where Φ and 𝜙 are the CDF and PDF, respectively, for the standard normal distribution, and 𝜇_f and 𝜎_f are the mean and standard error functions produced from kriging (Eqs. 10.75 and 10.76).
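Equation 10.102 translates directly into Python (using scipy.stats.norm for Φ and 𝜙; the guard against zero standard error is our own addition):

import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best):
    """EI per Eq. 10.102; mu, sigma come from the kriging predictor."""
    sigma = np.maximum(sigma, 1e-12)   # avoid division by zero at data points
    z = (f_best - mu) / sigma
    return (f_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)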
The algorithm is similar to that of the previous section (Alg. 10.4),
but instead of choosing the minimum of the surrogate, the selected
infill point is the point with the greatest expected improvement. The
corresponding algorithm is detailed in Alg. 10.5.
Algorithm 10.5 Efficient global optimization (EGO)

Inputs:
  𝑛ₛ: Number of initial samples
  𝑥, 𝑥̄: Lower and upper bounds
  𝜏: Minimum expected improvement
Outputs:
  𝑥*: Best point identified
  𝑓*: Corresponding function value
Fig. 10.29 Expected improvement EI(𝑥) evaluated across the domain at iterations 𝑘 = 1, 5, 10, and 12.
10.7 Summary
Problems
10.4 Linear regression. Use the following training data sampled at 𝑥 with
the resulting function value 𝑓 (also tabulated on the resources
website):
10.5 Cross validation. Use the following training data sampled at 𝑥 with the resulting function value 𝑓 (also tabulated on the resources website):
𝑦 = exp(−𝑥) cos(5𝑥),
10.8 Efficient global optimization. Use EGO with the function from the
previous problem, showing the iteration history until the expected
improvement reduces below 0.001.
Convex Optimization
11
General nonlinear optimization problems are difficult to solve. De-
pending on the particular optimization algorithm, they may require
tuning parameters, providing derivatives, adjusting scaling, and trying
multiple starting points. Convex optimization problems do not have
any of those issues and are thus easier to solve. The challenge is that
these problems must meet strict requirements. Even for candidate
problems with the potential to be convex, significant experience is
usually needed to recognize and utilize techniques that reformulate the
problems into an appropriate form.
11.1 Introduction
because the linearization can be updated in the next time step. However,
this fidelity reduction is problematic for design applications.
In design scenarios, the optimization is performed once, and the
design cannot continue to be updated after it is created. For this reason,
convex optimization is less frequently used for design applications, ex-
cept for some limited uses in geometric programming, a topic discussed
in more detail in Section 11.6.
This chapter just introduces convex optimization and is not a replacement for more comprehensive textbooks on the topic.† We focus on understanding what convex optimization is useful for and describing the most widely used forms.

† Boyd and Vandenberghe86 is the most cited textbook on convex optimization.
86. Boyd and Vandenberghe, Convex Optimization, 2004.
The known categories of convex optimization problems include linear programming, quadratic programming, second-order cone programming, semidefinite programming, cone programming, and graph form programming. Each of these categories is a subset of the next (Fig. 11.2).‡

‡ Several references provide examples for these problem categories.
181. Parikh and Boyd, Block splitting for distributed optimization, 2013.
182. Vandenberghe and Boyd, Semidefinite programming, 1996.
183. Vandenberghe and Boyd, Applications of semidefinite programming, 1999.

A problem posed using the functions, operations, and rules specified by disciplined convex programming can be transformed by a software tool into a suitable conic form that can be solved; Section 11.5 describes this procedure.

After covering the three main categories of convex optimization problems, we discuss geometric programming. Geometric programming problems are not convex, but with a change of variables, they can be transformed into an equivalent convex form, thus extending the types of problems that can be solved with convex optimization.

Fig. 11.2 Relationship between various convex optimization problems: linear programming (LP) ⊂ quadratic programming (QP) ⊂ second-order cone programming (SOCP) ⊂ semidefinite programming (SDP) ⊂ cone programming (CP) ⊂ graph form programming (GFP).

11.2 Linear Programming

A linear program (LP) is an optimization problem with a linear objective and linear constraints and can be written as

minimize_𝑥  𝑓ᵀ𝑥
subject to  𝐴𝑥 + 𝑏 = 0    (11.2)
            𝐶𝑥 + 𝑑 ≤ 0 ,

where 𝑓, 𝑏, and 𝑑 are vectors, and 𝐴 and 𝐶 are matrices. All LPs are convex.
Suppose we are shopping and want to find how best to meet our nutritional
needs for the lowest cost. We enumerate all the food options and use the
variable 𝑥 𝑗 to represent how much of food 𝑗 we purchase. The parameter 𝑐 𝑗 is
the cost of a unit amount of food 𝑗. The parameter 𝑁𝑖𝑗 is the amount of nutrient
𝑖 contained in a unit amount of food 𝑗. We need to make sure we have at least
𝑟 𝑖 of nutrient 𝑖 to meet our dietary requirements. We can now formulate the
cost objective as

minimize_𝑥  Σⱼ 𝑐ⱼ𝑥ⱼ = 𝑐ᵀ𝑥 .

To meet the nutritional requirement of nutrient 𝑖, we need to satisfy

Σⱼ 𝑁ᵢⱼ𝑥ⱼ ≥ 𝑟ᵢ    ⇒    𝑁𝑥 ≥ 𝑟 .
If the amount of each food is 𝑥, the cost column is 𝑐, and the nutrient columns are 𝑛₁, 𝑛₂, and 𝑛₃, we can formulate the LP as

minimize_𝑥  𝑐ᵀ𝑥
subject to  5 ≤ 𝑛₁ᵀ𝑥 ≤ 8
            7 ≤ 𝑛₂ᵀ𝑥
            1 ≤ 𝑛₃ᵀ𝑥 ≤ 10
            𝑥 ≤ 4 .
The last constraint ensures that we do not overeat any one item and get tired of it. LP solvers are widely available, and because the inputs of an LP are just a table of numbers, some solvers do not even require a programming language.
The solution for this problem is
suggesting that our optimal diet consists of items B, F, H, and I in the proportions
shown here. The solution reached the upper limit for nutrient 1 and the lower
limit for nutrient 2.
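As an illustration, this diet LP can be posed in a disciplined convex programming tool such as CVXPY; the cost and nutrient numbers below are hypothetical because the example's food table is not reproduced here:

import cvxpy as cp
import numpy as np

c = np.array([2.0, 1.5, 3.0, 2.5])     # hypothetical cost per unit of food
N = np.array([[1.0, 0.5, 2.0, 1.0],    # hypothetical nutrient contents
              [2.0, 1.0, 0.5, 1.5],
              [0.5, 2.0, 1.0, 0.5]])

x = cp.Variable(4, nonneg=True)
constraints = [N[0] @ x >= 5, N[0] @ x <= 8,
               N[1] @ x >= 7,
               N[2] @ x >= 1, N[2] @ x <= 10,
               x <= 4]
prob = cp.Problem(cp.Minimize(c @ x), constraints)
prob.solve()   # x.value holds the optimal amounts of each food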
A quadratic program (QP) has a quadratic objective and linear constraints:

minimize_𝑥  ½𝑥ᵀ𝑄𝑥 + 𝑓ᵀ𝑥
subject to  𝐴𝑥 + 𝑏 = 0
            𝐶𝑥 + 𝑑 ≤ 0 ,

where 𝑄 must be positive semidefinite for the problem to be convex.
The left pane of Fig. 11.3 shows some example data that are both noisy and biased relative to the true (but unknown) underlying curve, represented as a dashed line. Given the data points, we would like to estimate the underlying functional relationship. We assume that the relationship is cubic and write it as

𝑦(𝑥) = 𝑎₁𝑥³ + 𝑎₂𝑥² + 𝑎₃𝑥 + 𝑎₄ .

Fig. 11.3 True function on the left, least squares in the middle, and constrained least squares on the right.

Suppose that we know the upper bound of the function value based on measurements or additional data at a few locations. In this example, assume that we know that 𝑓(−2) ≤ −2, 𝑓(0) ≤ 4, and 𝑓(2) ≤ 26. These requirements
can be posed as linear constraints:
[ (−2)³  (−2)²  −2  1
   0      0      0  1
   2³     2²     2  1 ] [ 𝑎₁ ; 𝑎₂ ; 𝑎₃ ; 𝑎₄ ] ≤ [ −2 ; 4 ; 26 ] .
After adding these linear constraints and retaining a quadratic objective
(the sum of the squared error), the resulting problem is still a QP. The resulting
solution is shown in the right pane of Fig. 11.3, which results in a much more
accurate fit.
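A sketch of this constrained least-squares QP in CVXPY (the synthetic data generation is our own stand-in for the example's data):

import cvxpy as cp
import numpy as np

# hypothetical noisy, biased samples of an unknown cubic
xd = np.linspace(-2, 2, 20)
yd = 2 * xd**3 + xd + 4 + np.random.default_rng(0).normal(2.0, 1.0, xd.size)

A = np.vander(xd, 4)                      # columns: x^3, x^2, x, 1
a = cp.Variable(4)                        # polynomial coefficients
C = np.vander(np.array([-2.0, 0.0, 2.0]), 4)
d = np.array([-2.0, 4.0, 26.0])           # upper bounds at the known points

prob = cp.Problem(cp.Minimize(cp.sum_squares(A @ a - yd)), [C @ a <= d])
prob.solve()   # a.value holds the constrained fit coefficients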
𝑥 𝑡+1 = 𝐴𝑥 𝑡 + 𝐵𝑢𝑡 ,
where 𝑥 𝑡 is the deviation from a desired state at time 𝑡 (e.g., the positions and
velocities of an aircraft), and 𝑢𝑡 represents the control inputs that we want to
optimize (e.g., control surface deflections). This dynamic equation can be used
as a set of linear constraints in an optimization problem, but we must decide
on an objective.
We would like to have small 𝑥 𝑡 because that would mean reducing the error
in our desired state quickly, but we would also like to have small 𝑢𝑡 because
small control inputs require less energy. These are competing objectives, where
a small control input will take longer to minimize error in a state, and vice
versa.
One way to express this objective is as a quadratic function,

minimize_{𝑥,𝑢}  ½ Σ_{𝑡=0}^{𝑛} ( 𝑥ₜᵀ𝑄𝑥ₜ + 𝑢ₜᵀ𝑅𝑢ₜ ) ,
minimize_𝑥  𝑓ᵀ𝑥
subject to  ‖𝐴ᵢ𝑥 + 𝑏ᵢ‖₂ ≤ 𝑐ᵢᵀ𝑥 + 𝑑ᵢ    (11.4)
            𝐺𝑥 + ℎ = 0 .
minimize_𝑥  ½𝑥ᵀ𝑄𝑥 + 𝑓ᵀ𝑥
subject to  𝐴𝑥 + 𝑏 = 0    (11.5)
            ½𝑥ᵀ𝑅ᵢ𝑥 + 𝑐ᵢᵀ𝑥 + 𝑑ᵢ ≤ 0  for 𝑖 = 1, . . . , 𝑚 ,
where 𝑄 and 𝑅 must be positive semidefinite for the QCQP to be
convex. A QCQP reduces to a QP if 𝑅 = 0. We formulated QCQPs
when solving trust-region problems in Section 4.5. However, for trust-
region problems, only an approximate solution method is typically
used.
Every QCQP can be expressed as an SOCP (although not vice versa). The QCQP in Eq. 11.5 can be written in the equivalent form,

minimize_{𝑥,𝛽}  𝛽 ,

subject to second-order cone constraints in 𝑥 and 𝛽. If we square both sides of the first and last constraints, this formulation is exactly equivalent to the QCQP, where 𝑄 = 2𝐹ᵀ𝐹, 𝑓 = 2𝐹ᵀ𝑔, 𝑅ᵢ = 2𝐺ᵢᵀ𝐺ᵢ, 𝑐ᵢ = 2𝐺ᵢᵀℎᵢ, and 𝑑ᵢ = ℎᵢᵀℎᵢ. The matrices 𝐹 and 𝐺ᵢ are the square-root factors of the corresponding quadratic terms.
Table 11.1 Examples of convex functions.

𝑒^(𝑎𝑥)
−𝑥^𝑎 if 0 ≤ 𝑎 ≤ 1; 𝑥^𝑎 otherwise
−log(𝑥)
‖𝑥‖₁, ‖𝑥‖₂, . . .
max(𝑥₁, 𝑥₂, . . . , 𝑥ₙ)
ln( 𝑒^(𝑥₁) + 𝑒^(𝑥₂) + . . . + 𝑒^(𝑥ₙ) )
CVX and its variants are popular free tools for disciplined convex programming with interfaces for multiple programming languages.∗
∗ https://stanford.edu/~boyd/software.html
Suppose we have two sets of labeled data points that we want to classify. We seek a linear function,

𝑓(𝑥) = 𝑎ᵀ𝑥 + 𝛽 ,

that separates the two data sets, or in other words, a function that classifies the objects. For example, if we call one data set 𝑦ᵢ, for 𝑖 = 1, . . . , 𝑛_y, and the other 𝑧ᵢ, for 𝑖 = 1, . . . , 𝑛_z, we need to satisfy the following constraints:

𝑎ᵀ𝑦ᵢ + 𝛽 ≥ 𝜀
𝑎ᵀ𝑧ᵢ + 𝛽 ≤ −𝜀 ,    (11.7)
for some small tolerance 𝜀. In general, there are an infinite number of separating hyperplanes, so we seek the one that maximizes the distance between the points. However, such a problem is not yet well defined because we can multiply 𝑎 and 𝛽 in the previous equations by an arbitrary constant to achieve any separation we want, so we need to normalize or fix some reference dimension (only the ratio of the parameters matters in defining the hyperplane, not their absolute magnitudes). We define the optimization problem as follows:

maximize   𝛾
by varying 𝛾, 𝑎, 𝛽
subject to 𝑎ᵀ𝑦ᵢ + 𝛽 ≥ 𝛾   for 𝑖 = 1, . . . , 𝑛_y
           𝑎ᵀ𝑧ⱼ + 𝛽 ≤ −𝛾  for 𝑗 = 1, . . . , 𝑛_z
           ‖𝑎‖ ≤ 1 .

The last constraint provides a normalization to prevent the problem from being unbounded. This norm constraint is always active (‖𝑎‖ = 1), but we express
𝑎ᵀ𝑦ᵢ + 𝛽 ≥ 1 − 𝑢ᵢ
𝑎ᵀ𝑧ⱼ + 𝛽 ≤ −(1 − 𝑣ⱼ) ,

where we seek to minimize the sum of the entries in 𝑢 and 𝑣. If they sum to 0, we have the original constraints for a completely separable function. However, recall that we are interested in not just creating separation but also in maximizing the distance to the classification boundary. To accomplish this, we use a regularization approach where our two objectives include maximizing the distance from the boundary and maximizing the sum of the classification margins. The width between the two planes 𝑎ᵀ𝑥 + 𝛽 = 1 and 𝑎ᵀ𝑥 + 𝛽 = −1 is 2/‖𝑎‖. Therefore, to maximize the separation distance, we minimize ‖𝑎‖. The optimization problem is defined as follows:†

minimize   ‖𝑎‖ + Σᵢ 𝑢ᵢ + Σⱼ 𝑣ⱼ
by varying 𝑎, 𝛽, 𝑢, 𝑣
subject to 𝑎ᵀ𝑦ᵢ + 𝛽 ≥ (1 − 𝑢ᵢ),  𝑖 = 1, . . . , 𝑛_y
           𝑎ᵀ𝑧ⱼ + 𝛽 ≤ −(1 − 𝑣ⱼ),  𝑗 = 1, . . . , 𝑛_z
           𝑢 ≥ 0
           𝑣 ≥ 0 .

† In the machine learning community, this optimization problem is known as a support vector machine. This problem is an example of supervised learning because the classifications of the training data are provided.
A posynomial has the form

𝑓(𝑥) = Σ_{𝑗=1}^{𝑛} 𝑐ⱼ 𝑥₁^(𝑎₁ⱼ) 𝑥₂^(𝑎₂ⱼ) ⋯ 𝑥ₘ^(𝑎ₘⱼ) ,    (11.9)

where the coefficients 𝑐ⱼ must be positive. For example, aircraft drag can be written as the posynomial

𝐷 = 𝐶_{𝐷𝑝} 𝑞𝑆 + ( 𝐶_L² / (𝜋 𝐴𝑅 𝑒) ) 𝑞𝑆 .
A geometric program (GP) has the form

minimize_𝑥  𝑓₀(𝑥)
subject to  𝑓ᵢ(𝑥) ≤ 1    (11.10)
            ℎᵢ(𝑥) = 1 ,
where 𝑓𝑖 are posynomials, and ℎ 𝑖 are monomials. This problem does not
fit into any of the convex optimization problems defined in the previous
section, and it is not convex. This formulation is useful because we can
convert it into an equivalent convex optimization problem.
First, we take the logarithm of the objective and of both sides of the constraints:

minimize_𝑥  ln 𝑓₀(𝑥)
subject to  ln 𝑓ᵢ(𝑥) ≤ 0    (11.11)
            ln ℎᵢ(𝑥) = 0 .

Let us examine the equality constraints further. Recall that ℎᵢ is a monomial, so writing one of the constraints explicitly results in the following form:

ln( 𝑐𝑥₁^(𝑎₁) 𝑥₂^(𝑎₂) . . . 𝑥ₘ^(𝑎ₘ) ) = 0 .    (11.12)

Using the properties of logarithms, this can be expanded to the equivalent expression:

ln 𝑐 + 𝑎₁ ln 𝑥₁ + 𝑎₂ ln 𝑥₂ + . . . + 𝑎ₘ ln 𝑥ₘ = 0 .    (11.13)

Introducing the change of variables 𝑦ᵢ = ln 𝑥ᵢ, the constraint becomes affine in 𝑦:

𝑎₁𝑦₁ + 𝑎₂𝑦₂ + . . . + 𝑎ₘ𝑦ₘ + ln 𝑐 = 0 ,  or  𝑎ᵀ𝑦 + ln 𝑐 = 0 .    (11.14)
Now consider the inequality constraints, where 𝑓ᵢ is a posynomial:

ln 𝑓ᵢ = ln( Σ_{𝑗=1}^{𝑛} 𝑐ⱼ 𝑥₁^(𝑎₁ⱼ) 𝑥₂^(𝑎₂ⱼ) . . . 𝑥ₘ^(𝑎ₘⱼ) ) .    (11.15)

Because this is a sum of products, we cannot use the logarithm to expand each term. However, we still introduce the same change of variables (expressed as 𝑥ᵢ = 𝑒^(𝑦ᵢ)):

ln 𝑓ᵢ = ln( Σ_{𝑗=1}^{𝑛} 𝑐ⱼ exp(𝑦₁𝑎₁ⱼ) exp(𝑦₂𝑎₂ⱼ) . . . exp(𝑦ₘ𝑎ₘⱼ) )
      = ln( Σ_{𝑗=1}^{𝑛} 𝑐ⱼ exp(𝑦₁𝑎₁ⱼ + 𝑦₂𝑎₂ⱼ + . . . + 𝑦ₘ𝑎ₘⱼ) )    (11.16)
      = ln( Σ_{𝑗=1}^{𝑛} exp(𝑎ⱼᵀ𝑦 + 𝑏ⱼ) ) ,  where 𝑏ⱼ = ln 𝑐ⱼ .
This is a log-sum-exp of an affine function. As mentioned in the previous section, log-sum-exp is convex, and a convex function composed of an affine function is also convex. Thus, the change of variables converts the GP into a convex optimization problem.

Consider maximizing the volume of a box with height 𝑥ₕ, width 𝑥𝑤, and depth 𝑥𝑑, subject to a limit 𝐴 on the total surface area and bounds on the aspect ratio 𝑥𝑤/𝑥𝑑:

maximize   𝑥ₕ 𝑥𝑤 𝑥𝑑
by varying 𝑥ₕ, 𝑥𝑤, 𝑥𝑑
subject to 2(𝑥ₕ𝑥𝑤 + 𝑥ₕ𝑥𝑑 + 𝑥𝑤𝑥𝑑) ≤ 𝐴
           𝛼ₗ ≤ 𝑥𝑤/𝑥𝑑 ≤ 𝛼ₕ .
We can express this problem in GP form (Eq. 11.10):

minimize   𝑥ₕ⁻¹𝑥𝑤⁻¹𝑥𝑑⁻¹
by varying 𝑥ₕ, 𝑥𝑤, 𝑥𝑑
subject to (2/𝐴)𝑥ₕ𝑥𝑤 + (2/𝐴)𝑥ₕ𝑥𝑑 + (2/𝐴)𝑥𝑤𝑥𝑑 ≤ 1
           (1/𝛼ₕ) 𝑥𝑤𝑥𝑑⁻¹ ≤ 1
           𝛼ₗ 𝑥𝑑𝑥𝑤⁻¹ ≤ 1 .
We can now plug this into a GP solver. For this example, we use the
following parameters: 𝛼 𝑙 = 2, 𝛼 ℎ = 8, 𝐴 = 100. The solution is 𝑥 𝑑 = 2.887, 𝑥 ℎ =
3.849, 𝑥 𝑤 = 5.774, with a total volume of 64.16.
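As a sketch, CVXPY can solve this problem directly as a geometric program via its log-log curvature rules (the gp=True flag), without manually performing the change of variables:

import cvxpy as cp

h = cp.Variable(pos=True)   # x_h
w = cp.Variable(pos=True)   # x_w
d = cp.Variable(pos=True)   # x_d
A, al, ah = 100.0, 2.0, 8.0

constraints = [2 * (h * w + h * d + w * d) <= A,   # surface area limit
               w / d >= al,                        # aspect ratio bounds
               w / d <= ah]
prob = cp.Problem(cp.Maximize(h * w * d), constraints)
prob.solve(gp=True)   # solve as a geometric program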
Unfortunately, many other functions do not fit this form (e.g., design variables that can be positive or negative, terms with negative coefficients, trigonometric functions, logarithms, and exponents). GP modelers use various techniques to extend usability, including using a Taylor series across a restricted domain, fitting functions to posynomials,185 and rearranging expressions to other equivalent forms, including implicit relationships. Creativity and some sacrifice in fidelity are usually needed to create a corresponding GP from a general nonlinear programming problem. However, if the sacrifice in fidelity is not too great, there is a significant advantage because the formulation comes with all the benefits of convexity: guaranteed convergence, global optimality, efficiency, no parameter tuning, and limited scaling issues.

185. Hoburg et al., Data fitting with geometric-programming-compatible softmax functions, 2016.
11.7 Summary
Problems
11.2 Solve the following using a convex solver (not a general nonlinear
solver):
11.3 The following foods are available to you at your nearest grocer:
Minimize the amount you spend while making sure you get at
least 5 units of nutrient 1, between 8 and 20 units of nutrient 2,
and between 5 and 30 units of nutrient 3. Also be sure not to buy
more than 4 units of any one food item, just for variety. Determine
the optimal amount of each item to purchase and the total cost.
𝐿 = ½ 𝜌𝑣²𝑏𝑐𝐶_L

as an equality constraint. This is equivalent to a GP-compatible monomial constraint

𝜌𝑣²𝑏𝑐𝐶_L / (2𝐿) = 1 .
Optimization Under Uncertainty
12
A normal probability distribution for 𝑥 with a mean value of 𝑥 = 0.5 and three different standard deviations is shown on the bottom row of Fig. 12.1. For a small variance (𝜎ₓ = 0.01), the expected value function 𝜇_f(𝑥) is indistinguishable from the deterministic function 𝑓(𝑥), and the global minimum is the same for both functions. However, for 𝜎ₓ = 0.2, the minimum of the expected value function is different from that of the deterministic function.

Fig. 12.1 The global minimum of the expected value 𝜇_f can shift depending on the standard deviation of 𝑥, 𝜎ₓ. The bottom row of figures shows the normal probability distributions at 𝑥 = 0.5.

Therefore, the minimum on the right is not as robust as the one on the left. The minimum on the right is in a narrow valley, so the expected value increases rapidly with increased variance. The opposite is true for the minimum on the left. Because it is in a broad valley, the expected value is less sensitive to variability in 𝑥. Thus, a design whose performance changes rapidly with respect to variability is not robust.
Of course, the mean is just one possible statistical output metric. Variance, or standard deviation (𝜎_f), is another common metric. However, directly minimizing the variance is less common because although low variability is often desirable, such an objective has no incentive to improve mean performance and so usually performs poorly. These two metrics represent a trade-off between risk (variance) and reward (mean). The compromise between these two metrics can be quantified through multiobjective optimization (see Chapter 9), which would result in a Pareto front with the notional behavior illustrated in Fig. 12.2. Because both multiobjective optimization and uncertainty quantification are costly, the overall cost of producing such a Pareto front might be prohibitive. Therefore, we might instead seek to minimize the expected value while constraining the variance to a value that the designer can tolerate. Another option is to minimize the mean plus weighted standard deviations.

Fig. 12.2 When designing for robustness, there is an inherent trade-off between risk (represented by the variance, 𝜎_f) and reward (represented by the expected value, 𝜇_f).
the red drag curve shown in Fig. 12.3. The drag is much lower at Mach 0.71 (as requested!), but any deviation from the target Mach number causes significant drag penalties. In other words, the design is not robust.
One way to improve the design is to use multipoint optimization, where we minimize a weighted sum of the drag coefficient evaluated at different Mach numbers. In this case, we use Mach = 0.68, 0.71, and 0.725. Compared with the single-point design, the multipoint design has a higher drag at Mach 0.71 but a lower drag at the other Mach numbers, as shown in Fig. 12.3. Thus, a trade-off in peak performance was required to achieve enhanced robustness.
A multipoint optimization is a simplified example of OUU. Effectively, we have treated the Mach number as a random parameter with a given probability at three discrete values. We then minimized the expected value of the drag. This simple change significantly increased the robustness of the design.

Fig. 12.3 Single-point optimization performs the best at the target speed but poorly away from the condition. Multipoint optimization is more robust to changes in speed.
Example 12.2 Robust wind farm layout optimization

In a wind farm, the wind direction is an uncertain parameter. Figure 12.4 shows a PDF of the wind direction for an actual wind farm, known as a wind rose, which is commonly visualized as shown in the plot on the right. The predominant wind directions are from the west and the south. Because of the variable nature of the wind, it would be challenging to intuit the optimal layout.

Fig. 12.4 Wind direction PDF (wind rose) for an actual wind farm.

Fig. 12.5 Wind farm power as a function of wind direction for two optimization approaches: deterministic optimization using the most probable direction and OUU.
We can also analyze the trade-off in the optimal layouts. The left side of
Fig. 12.6 shows the optimal layout using the deterministic formulation, with
the wind coming from the predominant direction (the direction we optimized
for). The wakes are shown in blue, and the boundaries are depicted with a
dashed line. The optimization spaced the wind turbines out so that there is
minimal wake interference. However, the performance degrades significantly
when the wind changes direction. The right side of Fig. 12.6 shows the same
layout but with the wind coming from the second-most-probable direction. In
this case, many of the turbines are operating in the wake of another turbine
and produce much less power.
In contrast, the robust layout is shown in Fig. 12.7, with the predominant
wind direction on the left and the second-most-probable direction on the right.
In both cases, the wake effects are relatively minor. The turbines are not ideally
placed for the predominant direction, but trading the performance for that
one direction yields better overall performance when considering other wind
directions.
Consider the Barnes problem shown on the left side of Fig. 12.8. The three
red lines are the three nonlinear constraints of the problem, and the red regions
highlight regions of infeasibility. With deterministic inputs, the optimal value
is on the constraint line. An uncertainty ellipse shown around the optimal
point highlights the fact that the solution is not reliable. Any variability in the
inputs can cause one or more constraints to be violated.
Conversely, the right side of Fig. 12.8 shows a reliable optimum, with the
same uncertainty ellipse. In this case, it is much more probable that the design
will satisfy all constraints under the input variations. However, as noted in
the introduction, increased reliability presents a performance trade-off, with a
corresponding increase in the objective function. The higher the reliability we
seek, the more we need to give up on performance.
𝜎(𝑥) ≤ 𝜂𝜎 𝑦 ,
where 𝜂 is a total safety factor that accounts for safety factors from loads,
materials, and failure modes. Of course, not all applications have standards-
driven safety factors already determined. The statistical approach discussed in
this chapter is useful in these situations to obtain reliable designs.
𝜇 𝑓 = 𝑓 (𝜇𝑥 ) . (12.5)
That is, when considering only first-order terms, the mean of the function
is the function evaluated at the mean of the input.
The variance of 𝑓 is given by

𝜎_f² = E[ ( 𝑓(𝜇ₓ) + Σᵢ (𝜕𝑓/𝜕𝑥ᵢ)(𝑥ᵢ − 𝜇ₓᵢ) )² ] − 𝑓(𝜇ₓ)²
     = Σᵢ Σⱼ (𝜕𝑓/𝜕𝑥ᵢ)(𝜕𝑓/𝜕𝑥ⱼ) E[ (𝑥ᵢ − 𝜇ₓᵢ)(𝑥ⱼ − 𝜇ₓⱼ) ] .    (12.6)

If the inputs are mutually independent, the expectations of the cross terms vanish, and the variance simplifies to

𝜎_f² = Σ_{𝑖=1}^{𝑛} ( (𝜕𝑓/𝜕𝑥ᵢ) 𝜎ₓᵢ )² .    (12.8)
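A Python sketch of these first-order estimates, using forward differences for the gradient (the step size is an arbitrary illustrative choice):

import numpy as np

def perturbation_stats(f, mu_x, sigma_x, h=1e-6):
    """First-order estimates: mu_f = f(mu_x) (Eq. 12.5), sigma_f from Eq. 12.8."""
    mu_f = f(mu_x)
    n = len(mu_x)
    g = np.array([(f(mu_x + h * np.eye(n)[i]) - mu_f) / h
                  for i in range(n)])           # forward-difference gradient
    sigma_f = np.sqrt(np.sum((g * sigma_x)**2))  # independent inputs assumed
    return mu_f, sigma_f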
All the design variables are random variables with standard deviations 𝜎ₓ₁ = 𝜎ₓ₂ = 0.033 and 𝜎ₓ₃ = 0.0167. We seek a reliable optimum, where each constraint has a target reliability of 99.865 percent.
First, we compute the deterministic optimum. We then compute the standard deviation of each constraint, using Eq. 12.8, about the deterministic optimum, yielding 𝜎_g1 = 0.081 and 𝜎_g2 = 0.176. Using an inverse CDF function (discussed in Section 10.2.1) shows that a CDF of 0.99865 corresponds to a 𝑧-score of 3. We then re-optimize with the new reliability constraints to obtain the solution.
To check these results, we use Monte Carlo simulations (explained in Section 12.3.3) with 100,000 samples to produce the output histograms shown in Fig. 12.9. The deterministic optimum fails often (‖𝑔(𝑥)‖∞ > 0), so its reliability is a surprisingly poor 34.6 percent. The reliable optimum shifts the distribution to the left, yielding a reliability of 99.75 percent, which is close to our design target.

Fig. 12.9 Histograms of the maximum constraint value ‖𝑔(𝑥*)‖∞ for the deterministic and reliable optima.
is to add a new node between all existing nodes. Thus, the accuracy of
the integral can be improved up to a specified tolerance while reusing
previous function evaluations.
Although straightforward to apply, the Newton–Cotes formulas are
usually much less efficient than Gaussian quadrature, at least for smooth,
nonperiodic functions. Efficiency is highly desirable because the output
functions must be called many times for forward propagation, as well
as throughout the optimization. The Newton–Cotes formulas are based
on fitting polynomials: constant (midpoint), linear (trapezoidal), and quadratic (Simpson's). The weights are adjusted between the different
methods, but the nodes are fixed. Gaussian quadrature includes the
nodes as degrees of freedom selected by the quadrature strategy. The
method approximates the integrand as a polynomial and then efficiently
evaluates the integral for the polynomial exactly. Because some of the
concepts from Gaussian quadrature are used later in this chapter, we
review them here.
An 𝑛-point Gaussian quadrature has 2𝑛 degrees of freedom (𝑛 node
positions and 𝑛 corresponding weights), so it can be used to exactly
integrate any polynomial up to order 2𝑛 − 1 if the weights and nodes are
appropriately chosen. For example, a 2-point Gaussian quadrature can
exactly integrate all polynomials up to order 3. To illustrate, consider
an integral over the bounds −1 to 1 (we will later see that these bounds
can be used as a general representation of any finite bounds through a
change of variables):
∫_{−1}^{1} 𝑓(𝑥) d𝑥 ≈ 𝑤₁𝑓(𝑥₁) + 𝑤₂𝑓(𝑥₂) .    (12.13)
Requiring that a constant function 𝑓(𝑥) = 𝑎 be integrated exactly gives

2𝑎 = 𝑎(𝑤₁ + 𝑤₂) .    (12.14)

Repeating this requirement for 𝑓(𝑥) = 𝑥, 𝑥², and 𝑥³ yields the system

2 = 𝑤₁ + 𝑤₂
0 = 𝑤₁𝑥₁ + 𝑤₂𝑥₂
2/3 = 𝑤₁𝑥₁² + 𝑤₂𝑥₂²    (12.15)
0 = 𝑤₁𝑥₁³ + 𝑤₂𝑥₂³ .

Solving these equations yields 𝑤₁ = 𝑤₂ = 1 and 𝑥₁ = −𝑥₂ = 1/√3. Thus, we have the weights and node positions that integrate a cubic (or lower-order) polynomial exactly.
Legendre polynomials are defined over the interval [−1, 1], but we
can reformulate them for an arbitrary interval [𝑎, 𝑏] through a change
of variables:
𝑏−𝑎 𝑏+𝑎
𝑥= 𝑧+ , (12.21)
2 2
where 𝑧 ∈ [−1, 1].
Using the change of variables, we can write
∫_𝑎^𝑏 𝑓(𝑥) d𝑥 = ∫_{−1}^{1} 𝑓( ((𝑏 − 𝑎)/2) 𝑧 + (𝑏 + 𝑎)/2 ) ((𝑏 − 𝑎)/2) d𝑧 ,    (12.22)
where the node locations and respective weights come from the Legen-
dre polynomials.
Recall that what we are after in this section is not just any generic integral but, rather, metrics such as the expected value,

𝜇_f = ∫ 𝑓(𝑥)𝑝(𝑥) d𝑥 .    (12.24)
Table 12.1 Orthogonal polynomials that correspond to some common probability distributions.

Prob. dist.    Weight function         Polynomial              Support range
Uniform        1                       Legendre                [−1, 1]
Normal         𝑒^(−𝑥²)                 Hermite                 (−∞, ∞)
Exponential    𝑒^(−𝑥)                  Laguerre                [0, ∞)
Beta           (1 − 𝑥)^𝛼 (1 + 𝑥)^𝛽     Jacobi                  (−1, 1)
Gamma          𝑥^𝛼 𝑒^(−𝑥)              Generalized Laguerre    [0, ∞)
If 𝑥 is normally distributed, the integral is given by

𝜇_f = ∫_{−∞}^{∞} 𝑓(𝑥) (1/(𝜎√(2𝜋))) exp( −½ ((𝑥 − 𝜇)/𝜎)² ) d𝑥 .    (12.28)

Fig. 12.11 The first few Hermite polynomials.

We use the change of variables

𝑧 = (𝑥 − 𝜇) / (√2 𝜎) .    (12.29)

Then, the resulting integral becomes

𝜇_f = (1/√𝜋) ∫_{−∞}^{∞} 𝑓( 𝜇 + √2 𝜎𝑧 ) exp(−𝑧²) d𝑧 .    (12.30)

This is now in the appropriate form, so the quadrature rule (using the Hermite nodes and weights) is

𝜇_f = (1/√𝜋) Σ_{𝑖=1}^{𝑛} 𝑤ᵢ 𝑓( 𝜇 + √2 𝜎𝑧ᵢ ) .    (12.31)
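A Python sketch of Eq. 12.31 using the Gauss–Hermite nodes and weights from NumPy (the function must accept array arguments):

import numpy as np

def normal_expectation(f, mu, sigma, n=5):
    """Gauss-Hermite estimate of E[f(x)] for x ~ N(mu, sigma^2), Eq. 12.31."""
    z, w = np.polynomial.hermite.hermgauss(n)  # nodes/weights for weight e^{-z^2}
    return (w @ f(mu + np.sqrt(2) * sigma * z)) / np.sqrt(np.pi)

mu_f = normal_expectation(np.sin, 0.5, 0.1)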
Figure 12.13 compares a two-dimensional full tensor grid using the Clenshaw–Curtis exponential rule (left) with a level 5 sparse grid using the same quadrature strategy (right).

Fig. 12.13 Comparison between a two-dimensional full tensor grid (left) and a level 5 sparse grid using the Clenshaw–Curtis exponential rule (right).
The mean and variance are estimated from the 𝑛 samples as

𝜇_f = (1/𝑛) Σ_{𝑖=1}^{𝑛} 𝑓ᵢ ,    (12.33)

𝜎_f² = (1/(𝑛 − 1)) ( Σ_{𝑖=1}^{𝑛} 𝑓ᵢ² − 𝑛𝜇_f² ) .    (12.34)
statistics like the mean and variance, Monte Carlo generates the output
probability distributions. This is a unique advantage compared with
first-order perturbation and direct quadrature, which provide summary
statistics but not distributions.
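The following sketch contrasts plain random sampling with Latin hypercube sampling for estimating the output mean and standard deviation (Eqs. 12.33 and 12.34); the input distribution and test function are arbitrary stand-ins:

```python
import numpy as np
from scipy.stats import norm, qmc

def mc_stats(f, mu, sigma, n, sampler=None, seed=0):
    """Monte Carlo mean/std (Eqs. 12.33-12.34) for independent normal inputs."""
    d = len(mu)
    if sampler is None:
        u = np.random.default_rng(seed).random((n, d))  # plain random sampling
    else:
        u = sampler.random(n)                           # e.g., LHS or Halton
    x = norm.ppf(u) * sigma + mu                        # map uniforms to normals
    fs = np.apply_along_axis(f, 1, x)
    mean = fs.mean()
    std = np.sqrt((np.sum(fs**2) - n * mean**2) / (n - 1))
    return mean, std

mu, sigma = np.array([1.0, 1.0]), np.array([0.06, 0.2])
f = lambda x: np.exp(x[0]) + np.sin(3 * x[1])           # a stand-in function
print(mc_stats(f, mu, sigma, 10_000))
print(mc_stats(f, mu, sigma, 10_000, qmc.LatinHypercube(d=2, seed=0)))
```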
[Fig. 12.15: Convergence of the mean (left) and standard deviation (right) versus the number of samples using Monte Carlo with random sampling, LHS, and Halton sequences.]
From the data, we conclude that we need about $n = 10^4$ samples to have well-converged statistics. Using $n = 10^4$ yields $\mu = 6.127$, $\sigma = 1.235$, and $r = 0.9914$. Because of the random sampling, these results vary between simulations (except for the Halton sequence in quasi-Monte Carlo, which is deterministic). The production of an output histogram is a key benefit of this method. The histogram of the objective function is shown in Fig. 12.16. Notice that it is not symmetric.
[Fig. 12.16: Histogram of objective function for 10,000 samples.]
12.3.4 Polynomial Chaos

Polynomial chaos (also known as spectral expansions) is a class of forward-propagation methods that take advantage of the inherent smoothness of the outputs of interest using polynomial approximations.§
§ Polynomial chaos is not chaotic and does not actually need polynomials. The name polynomial chaos came about because it was initially derived for use in a physical theory of chaos.198
198. Wiener, The homogeneous chaos, 1938.
The method extends the ideas of Gaussian quadrature to estimate the output function, from which the output distribution and other summary statistics can be efficiently generated. In addition to using orthogonal polynomials to evaluate integrals, we use them to approximate the
output function. As in Gaussian quadrature, the polynomials are
orthogonal with respect to a specified probability distribution (see
Eq. 12.25 and Table 12.1). A general function that depends on uncertain
variables 𝑥 can be represented as a sum of basis functions 𝜓 𝑖 (which
are usually polynomials) with weights 𝛼 𝑖 ,
$$f(x) = \sum_{i=0}^{\infty} \alpha_i \psi_i(x)\,. \qquad (12.35)$$
In practice, we truncate this series to a finite number of terms,
$$f(x) \approx \sum_{i=0}^{n} \alpha_i \psi_i(x)\,. \qquad (12.36)$$
$$\langle \psi_i, \psi_j \rangle = 0 \quad \text{if } i \neq j\,. \qquad (12.38)$$
Compute Statistics
The coefficients 𝛼 𝑖 are constants that can be taken out of the integral, so
we can write
$$\mu_f = \sum_i \alpha_i \int \psi_i(x)\, p(x)\,\mathrm{d}x = \alpha_0 \int \psi_0(x)\, p(x)\,\mathrm{d}x + \alpha_1 \int \psi_1(x)\, p(x)\,\mathrm{d}x + \alpha_2 \int \psi_2(x)\, p(x)\,\mathrm{d}x + \ldots$$
Because the polynomials are orthogonal, all the terms except the first
are zero (see Eq. 12.38). From the definition of a PDF (Eq. A.63), we
know that the first term is 1. Thus, the mean of the function is simply
the zeroth coefficient,
$$\mu_f = \alpha_0\,. \qquad (12.41)$$
We can derive a formula for the variance using a similar approach.
Substituting the polynomial representation (Eq. 12.36) into the definition
of variance and using the same techniques used in deriving the mean,
we obtain
$$\sigma_f^2 = \int \left( \sum_i \alpha_i \psi_i(x) \right)^{\!2} p(x)\,\mathrm{d}x - \alpha_0^2 = \sum_i \alpha_i^2 \int \psi_i(x)^2\, p(x)\,\mathrm{d}x - \alpha_0^2$$
$$\begin{aligned} \sigma_f^2 &= \alpha_0^2 \int \psi_0^2\, p(x)\,\mathrm{d}x + \sum_{i=1}^{n} \alpha_i^2 \int \psi_i(x)^2\, p(x)\,\mathrm{d}x - \alpha_0^2 \\ &= \alpha_0^2 + \sum_{i=1}^{n} \alpha_i^2 \int \psi_i(x)^2\, p(x)\,\mathrm{d}x - \alpha_0^2 \\ &= \sum_{i=1}^{n} \alpha_i^2 \int \psi_i(x)^2\, p(x)\,\mathrm{d}x \\ &= \sum_{i=1}^{n} \alpha_i^2\, \langle \psi_i^2 \rangle\,. \end{aligned} \qquad (12.42)$$
That last step is just the definition of the weighted inner product
(Eq. 12.25), providing the variance in terms of the coefficients and
polynomials:
$$\sigma_f^2 = \sum_{i=1}^{n} \alpha_i^2\, \langle \psi_i^2 \rangle\,. \qquad (12.43)$$
$$\mu_f = \alpha_0 \qquad (12.45)$$
$$\sigma_f^2 = \sum_{i=1}^{n} \alpha_i^2\, \langle \Psi_i^2 \rangle\,. \qquad (12.46)$$
Determine Coefficients
Using the orthogonality property of the basis functions (Eq. 12.38), all
the terms in the summation are zero except for
$$\langle f(x), \psi_i \rangle = \alpha_i\, \langle \psi_i^2 \rangle\,, \qquad (12.49)$$
so each coefficient is given by $\alpha_i = \langle f, \psi_i \rangle / \langle \psi_i^2 \rangle$, where the numerator can be evaluated with quadrature.
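As a one-dimensional sketch of this projection (using NumPy's probabilists' Hermite polynomials, for which $\langle \psi_i^2 \rangle = i!$; the function and input distribution are arbitrary choices), the coefficients can be computed by evaluating Eq. 12.49 with Gauss–Hermite quadrature:

```python
import numpy as np
from math import factorial, sqrt, pi
from numpy.polynomial.hermite_e import hermegauss, hermeval

def pc_coefficients(f, mu, sigma, order, nquad=12):
    """Projection coefficients alpha_i = <f, psi_i> / <psi_i^2> (Eq. 12.49)."""
    z, w = hermegauss(nquad)          # nodes/weights for weight exp(-z^2/2)
    fz = f(mu + sigma * z)            # f evaluated at the mapped nodes
    alpha = []
    for i in range(order + 1):
        c = np.zeros(i + 1)
        c[i] = 1.0
        psi = hermeval(z, c)          # He_i(z) at the quadrature nodes
        inner = np.sum(w * fz * psi) / sqrt(2 * pi)  # <f, psi_i>
        alpha.append(inner / factorial(i))           # <psi_i^2> = i!
    return np.array(alpha)

alpha = pc_coefficients(lambda x: np.exp(np.cos(x)), 1.0, 0.1, order=4)
print(alpha[0])                       # the mean estimate (Eq. 12.41)
```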
the corresponding quadrature strategy or utilize random sequences.‖
201. Feinberg and Langtangen, Chaospy: An open source tool for designing methods of uncertainty quantification, 2015.
where the current iteration is at 𝑥 = [1, 1], and we assume that both design
variables are normally distributed with the following standard deviations:
𝜎 = [0.06, 0.2].
We approximate the function with fourth-order Hermite polynomials.
Using Eq. 12.37, we see that there are 15 basis functions from the various combinations of $H_i H_j$. The integrals for the basis functions (Hermite polynomials) have analytic solutions:
$$\langle \Psi_k^2 \rangle = \langle (H_m H_n)^2 \rangle = m!\, n!\,.$$
We now compute the following double integrals to obtain the coefficients using
Gaussian quadrature:
$$\alpha_k = \frac{1}{\langle \Psi_k^2 \rangle} \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x)\, \Psi_k(x)\, p(x)\,\mathrm{d}x_1\,\mathrm{d}x_2$$
We must be careful with variable definitions because the inputs are not standard
normal distributions. The function 𝑓 is defined over the unnormalized variable
𝑥, whereas our basis functions are defined over a standard normal distribu-
tion: 𝑦 = (𝑥 − 𝜇)/𝜎. The probability distribution in this case is a bivariate,
$$\alpha_k \approx \frac{1}{\pi \langle \Psi_k^2 \rangle} \sum_{i=1}^{n_i} \sum_{j=1}^{n_j} w_i w_j\, f(X_{ij})\, \Psi_k\!\left(\sqrt{2}\, Z_{ij}\right),$$
and $Z = z_1 \otimes z_2$.
In this case, we choose a full tensor product mesh of the fifth order in both dimensions; the evaluation nodes and weights are shown in Fig. 12.17. The computed coefficients include
$$\alpha_0 = 2.1725, \quad \alpha_1 = -0.0586, \quad \alpha_2 = 0.0117, \quad \alpha_3 = -0.00156, \quad \alpha_5 = -0.0250, \quad \alpha_9 = 0.01578\,.$$
[Fig. 12.17: Evaluation nodes with area proportional to weight.]
We can now easily compute the mean and standard deviation as
$$\mu_f = \alpha_0 = 2.1725$$
$$\sigma_f = \sqrt{\sum_{i=1}^{n} \alpha_i^2\, \langle \Psi_i^2 \rangle} = 0.06966\,.$$
In this case, we are able to accurately estimate the mean and standard
deviation with only 25 function evaluations. In contrast, applying Monte Carlo
to this same problem, with LHS, requires about 10,000 function calls to estimate
the mean and over 100,000 function calls to estimate the standard deviation
(with less accuracy).
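For reference, a direct-quadrature sketch with a 5 × 5 tensor-product Gauss–Hermite grid reproduces a mean and standard deviation with only 25 evaluations; the function below is a stand-in, not the one from the example:

```python
import numpy as np

def gh_mean_std_2d(f, mu, sigma, n=5):
    """Mean/std of f(x1, x2) with independent normal inputs via an n-by-n
    tensor-product Gauss-Hermite grid (n*n function evaluations)."""
    z, w = np.polynomial.hermite.hermgauss(n)   # 1-D nodes/weights
    Z1, Z2 = np.meshgrid(z, z)
    W = np.outer(w, w) / np.pi                  # normalized 2-D weights
    X1 = mu[0] + np.sqrt(2.0) * sigma[0] * Z1   # map to physical variables
    X2 = mu[1] + np.sqrt(2.0) * sigma[1] * Z2
    F = f(X1, X2)
    mean = np.sum(W * F)
    return mean, np.sqrt(np.sum(W * F**2) - mean**2)

f = lambda x1, x2: np.exp(x1) * np.cos(x2)      # a stand-in function
print(gh_mean_std_2d(f, [1.0, 1.0], [0.06, 0.2]))
```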
Although direct quadrature would work equally well if all we wanted was
the mean and standard deviation, polynomial chaos gives us a polynomial
approximation of our function near 𝜇𝑥 :
$$\tilde{f}(x) = \sum_i \alpha_i \Psi_i(x)\,.$$
The primary benefit of this new function is that it is very inexpensive to evaluate (whereas the original function is often expensive), so we can use sampling procedures to compute other statistics, such as percentiles or reliability levels, or simply to visualize the output PDF, as shown in Fig. 12.19.
[Fig. 12.19: Output histogram produced by sampling the polynomial expansion.]
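A sketch of this sampling step for a one-dimensional Hermite expansion; the coefficients below are hypothetical placeholders:

```python
import numpy as np
from numpy.polynomial.hermite_e import hermeval

alphas = np.array([2.17, -0.059, 0.012])      # hypothetical PC coefficients
z = np.random.default_rng(0).standard_normal(100_000)
f_tilde = hermeval(z, alphas)                 # cheap surrogate evaluations
print(np.percentile(f_tilde, [5, 50, 95]))    # percentiles from the surrogate
```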
12.4 Summary

Engineering problems are subject to variation under uncertainty. OUU deals with optimization problems where the design variables or other parameters have uncertain variability. Robust design optimization seeks designs that are less sensitive to inherent variability in the objective
function. Common OUU objectives include minimizing the mean or
standard deviation or performing multiobjective trade-offs between the
mean performance and standard deviation. Reliable design optimiza-
tion seeks designs with a reduced probability of failure, considering
the variability in the constraint values. To quantify robustness and
reliability, we need a forward-propagation procedure that propagates
Problems
a. The greater the reliability, the less likely the design is to have
a worse objective function value.
b. Reliability can be handled in a deterministic way using safety
factors, which ensure that the optimum has some margin
before the original constraint is violated.
c. Forward propagation computes the PDFs of the outputs and
inputs for a given numerical model.
d. The computational cost of direct quadrature scales exponen-
tially with the number of random variables, whereas the cost
of Monte Carlo is independent of the number of random
variables.
e. Monte Carlo methods approximate PDFs using random sampling and converge slowly.
f. The first-order perturbation method computes the PDFs
using local Taylor series expansions.
g. Because the first-order perturbation method requires first-
order derivatives to compute the uncertainty metrics, OUU
using the first-order perturbation method requires second-
order derivatives.
h. Polynomial chaos is a forward-propagation technique that
uses polynomial approximations with random coefficients
to model the input uncertainties.
i. The number of basis functions required by polynomial chaos
grows exponentially with the number of uncertain input
variables.
12.3 Using Gaussian quadrature, find the mean and variance of the
function exp(cos(𝑥)) at 𝑥 = 1, assuming 𝑥 is normally distributed
with a standard deviation of 0.1. Determine how many evaluation
points are needed to converge to 5 decimal places. Compare your
results to trapezoidal integration.
12.5 Consider the function in Ex. 12.10. Solve the same problem,
but use Monte Carlo sampling instead. Compare the output
histogram and how many function calls are required to achieve
well-converged results for the mean and variance.
12.6 Repeat Ex. 12.10 using polynomial chaos, except with a uniform
distribution in both dimensions, where the standard deviations
from the example correspond to the half-width of a uniform
distribution.
[Figure: wind farm layout showing turbines $T_1$, $T_2$ (at $y_2$), and $T_3$ (at $y_3$) relative to the wind direction.]
Take the two optimal designs that you found, and then
compare each on the two objectives (deterministic and 95th
percentile). The first row corresponds to the performance of
the optimal deterministic layout. Evaluate the performance
of this layout using the deterministic value for COE and the
95th percentile that accounts for uncertainty. Repeat for the
optimal solution for the OUU case. Discuss your findings.
Multidisciplinary Design Optimization
13
As mentioned in Chapter 1, most engineering systems are multidiscipli-
nary, motivating the development of multidisciplinary design optimiza-
tion (MDO). The analysis of multidisciplinary systems requires coupled
models and coupled solvers. We prefer the term component instead of
discipline or model because it is more general. However, we use these
terms interchangeably depending on the context. When components in
a system represent different physics, the term multiphysics is commonly
used.
All the optimization methods covered so far apply to multidisci-
plinary problems if we view the coupled multidisciplinary analysis
as a single analysis that computes the objective and constraint func-
tions by solving the coupled model for a given set of design variables.
However, there are additional considerations in the solution, derivative
computation, and optimization of coupled systems.
In this chapter, we build on Chapter 3 by introducing models and
solvers for coupled systems. We also expand the derivative computation
methods of Chapter 6 to handle such systems. Finally, we introduce
various MDO architectures, which are different options for formulating
and solving MDO problems.
the uncertainty at a given point in time (recall Fig. 1.3). Although these
benefits still apply when modeling and optimizing a single discipline
or component, broadening the modeling and optimization to the whole
system brings on additional benefits.
Even without performing any optimization, constructing a multi-
disciplinary (coupled) model that considers the whole engineering
system is beneficial. Such a model should ideally consider all the
interactions between the system components. In addition to modeling
physical phenomena, the model should also include other relevant
considerations, such as economics and human factors. The benefit of
such a model is that it better reflects the actual state and performance of
the system when deployed in the real world, as opposed to an isolated
component with assumed boundary conditions. Using such a model,
designers can quantify the actual impact of proposed changes on the
whole system.
When considering optimization, the main benefit of MDO is that op-
timizing the design variables for the various components simultaneously
leads to a better system than when optimizing the design variables
for each component separately. Currently, many engineering systems
are designed and optimized sequentially, which leads to suboptimal
designs. This approach is often used in industry, where engineers are
grouped by discipline, physical subsystem, or both. This might be perceived as the only choice when the engineering system is too complex and the number of engineers too large to coordinate a simultaneous design involving all groups.
different values for those variables each time, in which case they
will not converge. On the other hand, if we let one discipline handle a
shared variable, it will likely converge to a value that violates one or
more constraints from the other disciplines.
By considering the various components and optimizing a multidisci-
plinary performance metric with respect to as many design variables
as possible simultaneously, MDO automatically finds the best trade-off
between the components—this is the key principle of MDO. Suboptimal
designs also result from decisions at the system level that involve power
struggles between designers. In contrast, MDO provides the right
trade-offs because mathematics does not care about politics.
[Figure: coupled aerostructural system, where shape and structural sizing inputs feed an aerodynamic solver and a structural solver that exchange surface pressures and displacements, followed by weight, surface pressure integration (drag, lift), stress computation, and fuel consumption computation.]
[Fig. 13.4: Coupled model composed of three numerical models with states $u = [u_1, u_2, u_3]$ and residuals $r_1(x, u) = 0$, $r_2(x, u) = 0$, $r_3(x, u) = 0$. This coupled model would replace the single model in Fig. 3.21.]
13.2.1 Components
In Section 3.3, we explained how all models can ultimately be written as
a system of residuals, 𝑟(𝑥, 𝑢) = 0. When the system is large or includes
submodels, it might be natural to partition the system into components.
We prefer to use the more general term components instead of disciplines
to refer to the submodels resulting from the partitioning because the
partitioning of the overall model is not necessarily by discipline (e.g.,
aerodynamics, structures). A system model might also be partitioned
by physical system components (e.g., wing, fuselage, or an aircraft
in a fleet) or by different conditions applied to the same model (e.g.,
aerodynamic simulations at different flight conditions).
The partitioning can also be performed within a given discipline
for the same reasons cited previously. In theory, the system model
equations in 𝑟(𝑥, 𝑢) = 0 can be partitioned in any way, but only some
partitions are advantageous or make sense. We denote a partitioning
into 𝑛 components as
$$r(u) = 0 \;\equiv\; \begin{cases} r_1(u_1;\; u_2, \ldots, u_i, \ldots, u_n) = 0 \\ \qquad \vdots \\ r_i(u_i;\; u_1, \ldots, u_{i-1}, u_{i+1}, \ldots, u_n) = 0 \\ \qquad \vdots \\ r_n(u_n;\; u_1, \ldots, u_i, \ldots, u_{n-1}) = 0 \end{cases} \qquad (13.1)$$
Each 𝑟 𝑖 and 𝑢𝑖 are vectors corresponding to the residuals and states of
component 𝑖. The semicolon denotes that we solve each component 𝑖
by driving its residuals (𝑟 𝑖 ) to zero by varying only its states (𝑢𝑖 ) while
keeping the states from all other components constant. We assume this
is possible, but this is not guaranteed in general. We have omitted the
dependency on 𝑥 in Eq. 13.1 because, for now, we just want to find the
state variables that solve the governing equations for a fixed design.
Components can be either implicit or explicit, a concept we introduced
in Section 3.3. To solve an implicit component 𝑖, we need an algorithm
for driving the equation residuals, 𝑟 𝑖 (𝑢1 , . . . , 𝑢𝑖 , . . . , 𝑢𝑛 ), to zero by
varying the states 𝑢𝑖 while the other states (𝑢 𝑗 for all 𝑗 ≠ 𝑖) remain fixed.
This algorithm could involve a matrix factorization for a linear system
or a Newton solver for a nonlinear system.
An explicit component is much easier to solve because that component's states are explicit functions of other components' states. The states of an explicit component can be computed without factorization or iteration. Suppose that the states of a component $i$ are given by the explicit function $u_i = f(u_{j\neq i})$. As previously explained in Section 3.3, we can convert an explicit equation to the residual form by moving the function on the right-hand side to the left-hand side. Then, we obtain a set of residuals of the form $r_i(u) = u_i - f(u_{j\neq i}) = 0$.
where $A$ is the matrix of aerodynamic influence coefficients, and $v$ is a vector of boundary conditions, both of which depend on the wing shape. The state $\Gamma$ is a vector that represents the circulation (vortex strength) at each spanwise position on the wing, as shown on the left-hand side of Fig. 13.5. The lift and drag scalars can be computed explicitly for a given $\Gamma$, so we write these dependencies as $L = L(\Gamma)$ and $D = D(\Gamma)$, omitting the detailed explicit expressions for conciseness.
202. Jasa et al., Open-source coupled aerostructural optimization using Python, 2018.
[Fig. 13.5: Aerostructural wing model showing the aerodynamic state variables (circulations $\Gamma$) on the left and the structural state variables (vertical displacements $d_z$ and rotations $d_\theta$) on the right.]
position on the beam. The states 𝑑 are the displacements and rotations at each
node, as shown on the right-hand side of Fig. 13.5. The weight does not depend
on the states, and it is an explicit function of the beam sizing and shape, so it
does not involve the structural model (Eq. 13.3). The stresses are an explicit
function of the displacements, so we can write 𝜎 = 𝜎(𝑑), where 𝜎 is a vector
whose size is the number of elements.
When we couple these two models, $A$ and $v$ depend on the wing displacements $d$, and $q$ depends on $\Gamma$. We can write all the implicit and explicit equations as residuals:
$$r_1 = A(d)\,\Gamma - v(d)$$
$$r_2 = K d - q(\Gamma)\,.$$
The states of this system are as follows:
$$u = \begin{bmatrix} u_1 \\ u_2 \end{bmatrix} \equiv \begin{bmatrix} \Gamma \\ d \end{bmatrix}.$$
[Fig. 13.6: The aerostructural model couples aerodynamics and structures through a displacement and force transfer.]
This coupled system is illustrated in Fig. 13.6.
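To make the residual form concrete, here is a small stand-in with hypothetical linear operators (A_of_d, v_of_d, K, and q_of_gamma are placeholders, not the actual aerodynamic or structural models):

```python
import numpy as np

n = 3                                            # states per component (arbitrary)
K = np.diag(np.full(n, 10.0))                    # stand-in stiffness matrix
A_of_d = lambda d: np.eye(n) + 0.1 * np.diag(d)  # stand-in influence coefficients
v_of_d = lambda d: np.ones(n) + 0.05 * d         # stand-in boundary conditions
q_of_gamma = lambda g: 0.5 * g                   # stand-in force transfer

def residuals(u):
    """Concatenated residuals r = [r1, r2] for u = [Gamma, d]."""
    gamma, d = u[:n], u[n:]
    r1 = A_of_d(d) @ gamma - v_of_d(d)           # aerodynamics
    r2 = K @ d - q_of_gamma(gamma)               # structures
    return np.concatenate([r1, r2])

print(residuals(np.zeros(2 * n)))                # residuals at the zero guess
```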
[Fig. 13.7: In the general case, a model may require conversions of inputs and outputs distinct from the states that the solver computes.]
Consider the structural model from Ex. 13.2. We wrote 𝑞(Γ) to represent
the dependency of the external forces on the aerodynamic model circulations
to keep the notation simple, but in reality, there should be a separate explicit
component that converts Γ into 𝑞. The circulation translates to a lift force at each
spanwise position, which in turn needs to be distributed consistently to the
nodes of each beam element. Also, the displacements given by the structural
model (translations and rotations of each node) must be converted into a twist
distribution on the wing, which affects the right-hand side of the aerodynamic
model, 𝜃(𝑑). Both of these conversions are explicit functions.
[Fig. 13.9: Residual form (left) and functional form (right) of a nine-component system; the functional form exposes only the output conversions $\hat{u}_1 = U_1(\hat{u}_2, \hat{u}_3)$, $\hat{u}_2 = U_2(\hat{u}_1, \hat{u}_3)$, and $\hat{u}_3 = U_3(\hat{u}_1, \hat{u}_2)$.]
[Fig. 13.10: DSM (left) and directed dependency graph (right) for a four-component system.]
The dependencies among components can be represented as
a graph (Fig. 13.10, right), where the graph nodes are the components,
and the edges represent the information dependency. This graph is
a directed graph because, in general, there are three possibilities for
coupling two components: single coupling one way, single coupling
the other way, and two-way coupling. A directed graph is cyclic when
there are edges that form a closed loop (i.e., a cycle). The graph on
the right of Fig. 13.10 has a single cycle between components B and C.
When there are no closed loops, the graph is acyclic. In this case, the
whole system can be solved by solving each component in turn without
iterating.
The DSM can be viewed as a matrix where the blank entries are
zeros. For real-world systems, this is often a sparse matrix. This means
that in the corresponding DSM, each component depends only on a
subset of all the other components. We can take advantage of the
structure of this sparsity in the solution of coupled systems.
The components in the DSM can be reordered without changing
the solution of the system. This is analogous to reordering sparse
matrices to make linear systems easier to solve. In one extreme case,
reordering could achieve a DSM with no entries below the diagonal. In
that case, we would have only feedforward connections, which means
all dependencies could be resolved in one forward pass (as we will
see in Ex. 13.4). This is analogous to having a linear system where
the matrix is lower triangular, in which case the linear solution can be obtained with forward substitution.
The sparsity of the DSM can be exploited using ideas from sparse linear algebra. For example, reducing the bandwidth of the matrix (i.e., moving nonzero elements closer to the diagonal) can also be helpful. This can be achieved using algorithms such as Cuthill–McKee,203 reverse Cuthill–McKee (RCM), and approximate minimum degree (AMD) ordering.204‡
203. Cuthill and McKee, Reducing the bandwidth of sparse symmetric matrices, 1969.
204. Amestoy et al., An approximate minimum degree ordering algorithm, 1996.
‡ Although these methods were designed for symmetric matrices, they are still useful for non-symmetric ones. Several numerical libraries include these methods.
205. Lambe and Martins, Extensions to the design structure matrix for the description of multidisciplinary design, analysis, and optimization processes, 2012.
We now introduce an extended version of the DSM, called XDSM,205 which we use later in this chapter to show the process in addition to
the data dependencies. Figure 13.11 shows the XDSM for the same four-component system. When showing only the data dependencies, the only difference relative to the DSM is that the coupling variables are labeled explicitly, and the data paths are drawn. In the next section, we add the process to the XDSM.
[Fig. 13.11: XDSM showing data dependencies for the four-component coupled system of Fig. 13.10.]

13.2.5 Solving Coupled Numerical Models

The solution of coupled systems, also known as multidisciplinary analysis (MDA), requires concepts beyond the solvers reviewed in Section 3.6.
In some cases, we have access to the model’s internal states, but we may
want to use a dedicated solver for that model anyway.
Because each model, in general, depends on the outputs of all other
models, we have a coupled dependency that requires a solver to resolve.
This means that the functional form requires two levels: one for the
model solvers and another for the system-level solver. At the system
level, we only deal with the coupling variables ($\hat{u}$), and the internal states ($u$) are hidden.
The rest of this section presents several system-level solvers. We
will refer to each model as a component even though it is a group of
components in general.
Tip 13.2 Avoid coupling components with file input and output
The coupling variables are often passed between components through files.
This is undesirable because of a potential loss in precision (see Tip 13.1) and
because it can substantially slow down the coupled solution.
Instead of using files, pass the coupling variable data through memory
whenever possible. You can do this between codes written in different languages
by wrapping each code using a common language. When using files is
unavoidable, be aware of these issues and mitigate them as much as possible.
[Figure: XDSM for the nonlinear block Jacobi iteration; the Jacobi driver passes the previous iterates $\hat{u}_j$ to all solvers, which can then run in parallel.]
Algorithm 13.1 Nonlinear block Jacobi

Inputs:
$\hat{u}^{(0)} = [\hat{u}_1^{(0)}, \ldots, \hat{u}_m^{(0)}]$: Initial guess for coupling variables
Outputs:
$\hat{u} = [\hat{u}_1, \ldots, \hat{u}_m]$: System-level states

$k = 0$
while $\lVert \hat{u}^{(k)} - \hat{u}^{(k-1)} \rVert_2 > \varepsilon$ or $k = 0$ do    (do not check convergence for first iteration)
    for all $i \in \{1, \ldots, m\}$ do    (can be done in parallel)
        $\hat{u}_i^{(k+1)} \leftarrow$ solve $r_i\bigl(\hat{u}_i^{(k+1)};\; \hat{u}_j^{(k)},\; j \neq i\bigr) = 0$    (solve for component $i$'s states using the states from the previous iteration of other components)
    end for
    $k = k + 1$
end while
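A minimal sketch of Alg. 13.1 in Python, assuming each solvers[i] is a black box that returns component i's states with the other components' states held at the previous iterate:

```python
import numpy as np

def block_jacobi(solvers, u0, tol=1e-10, max_iter=100):
    """Nonlinear block Jacobi (Alg. 13.1). solvers[i](u) solves component i
    given the list u of all components' previous states."""
    u = [np.asarray(ui, dtype=float) for ui in u0]
    for _ in range(max_iter):
        u_new = [solve(u) for solve in solvers]  # independent; parallelizable
        step = np.concatenate(u_new) - np.concatenate(u)
        u = u_new
        if np.linalg.norm(step) <= tol:
            break
    return u
```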
The block Jacobi solver (Alg. 13.1) can also be used when one or
more components are linear solvers. This is useful for computing the
derivatives of the coupled system using implicit analytic methods
because that involves solving a coupled linear system with the same
structure as the coupled model (see Section 13.3.3).
[Figure: XDSM for the nonlinear block Gauss–Seidel iteration; each solver uses the latest available outputs from the components solved earlier in the sweep.]
Each component update can be relaxed as
$$\hat{u}_i^{(k+1)} = \hat{u}_i^{(k)} + \theta^{(k)} \Delta\hat{u}_i^{(k)}\,,$$
where $\theta^{(k)}$ is the relaxation factor, and $\Delta\hat{u}_i^{(k)}$ is the previous update for component $i$. The relaxation factor, $\theta$, could be a fixed value, which would normally be less than 1 to dampen oscillations and avoid divergence.
Aitken's method206 improves on the fixed relaxation approach by adapting $\theta$. The relaxation factor at each iteration changes based on the last two updates according to
$$\theta^{(k)} = \theta^{(k-1)} \left( 1 - \frac{\left(\Delta\hat{u}^{(k)} - \Delta\hat{u}^{(k-1)}\right)^\intercal \Delta\hat{u}^{(k)}}{\bigl\lVert \Delta\hat{u}^{(k)} - \Delta\hat{u}^{(k-1)} \bigr\rVert_2^2} \right). \qquad (13.8)$$
206. Irons and Tuck, A version of the Aitken accelerator for computer iteration, 1969.
Aitken's method usually accelerates convergence and has been shown to work well for nonlinear block Gauss–Seidel with multidisciplinary systems.207 It is advisable to override the value of the relaxation factor given by Eq. 13.8 to keep it between 0.25 and 2.208
207. Kenway et al., Scalable parallel approach for high-fidelity steady-state aeroelastic analysis and derivative computations, 2014.
208. Chauhan et al., An automated selection algorithm for nonlinear solvers in MDO, 2018.
The steps for the full Gauss–Seidel algorithm with Aitken acceleration are listed in Alg. 13.2. Similar to the block Jacobi solver, the block Gauss–Seidel solver can also be used when one or more components are linear solvers. Aitken acceleration can be used in the linear case without modification, and it is still useful.
The order in which the components are solved makes a significant
difference in the efficiency of the Gauss–Seidel method. In the best
possible scenario, the components can be reordered such that there are
no entries in the lower diagonal of the DSM, which means that each
component depends only on previously solved components, and there
are therefore no feedback dependencies (see Ex. 13.4). In this case,
the block Gauss–Seidel method would converge to the solution in one
forward sweep.
In the more general case, even though we might not eliminate
the lower diagonal entries completely, minimizing these entries by
reordering results in better convergence. This reordering can also mean
the difference between convergence and nonconvergence.
Algorithm 13.2 Nonlinear block Gauss–Seidel with Aitken acceleration

$k = 0$
while $\lVert \hat{u}^{(k)} - \hat{u}^{(k-1)} \rVert_2 > \varepsilon$ or $k = 0$ do    (do not check convergence for first iteration)
    for $i = 1, \ldots, m$ do
        $\hat{u}_\text{temp} \leftarrow$ solve $r_i\bigl(\hat{u}_i;\; \hat{u}_1^{(k+1)}, \ldots, \hat{u}_{i-1}^{(k+1)}, \hat{u}_{i+1}^{(k)}, \ldots, \hat{u}_m^{(k)}\bigr) = 0$    (solve for component $i$'s states using the latest states from other components)
        $\Delta\hat{u}_i^{(k)} = \hat{u}_\text{temp} - \hat{u}_i^{(k)}$    (compute step)
        if $k > 0$ then
            $\theta^{(k)} = \theta^{(k-1)} \left(1 - \dfrac{\bigl(\Delta\hat{u}^{(k)} - \Delta\hat{u}^{(k-1)}\bigr)^\intercal \Delta\hat{u}^{(k)}}{\lVert \Delta\hat{u}^{(k)} - \Delta\hat{u}^{(k-1)} \rVert_2^2}\right)$    (update the relaxation factor)
        end if
        $\hat{u}_i^{(k+1)} = \hat{u}_i^{(k)} + \theta^{(k)} \Delta\hat{u}_i^{(k)}$    (update component $i$'s states)
    end for
    $k = k + 1$
end while
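The same interface as the Jacobi sketch supports a Gauss–Seidel sweep with Aitken acceleration; the sketch below applies the relaxation update once per sweep rather than per component, a simplification of Alg. 13.2:

```python
import numpy as np

def gauss_seidel_aitken(solvers, u0, tol=1e-10, max_iter=100):
    """Nonlinear block Gauss-Seidel with Aitken acceleration (cf. Alg. 13.2)."""
    u = [np.asarray(ui, dtype=float) for ui in u0]
    theta, du_prev = 1.0, None
    for _ in range(max_iter):
        du = []
        for i, solve in enumerate(solvers):
            du_i = solve(u) - u[i]       # uses latest states of earlier components
            u[i] = u[i] + theta * du_i   # relaxed update
            du.append(du_i)
        du = np.concatenate(du)
        if du_prev is not None:          # Aitken relaxation factor (Eq. 13.8)
            diff = du - du_prev
            theta *= 1.0 - diff @ du / (diff @ diff)
            theta = min(max(theta, 0.25), 2.0)  # clipping suggested in the text
        du_prev = du
        if np.linalg.norm(du) <= tol:
            break
    return u
```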
Newton's Method

[Fig. 13.15: There are three options for solving a coupled system with Newton's method. The monolithic approach (left) solves for all state variables simultaneously. The block approach (middle) solves the same system as the monolithic approach but solves each component for its states at each iteration. The black-box approach (right) applies Newton's method to the coupling variables.]
A variation on this monolithic Newton approach uses a two-level solver hierarchy, as illustrated on the middle panel of Fig. 13.15. The system-level solver is the same as in the monolithic approach, but each component is solved first using the latest states. The Newton step for each component $i$ is given by
$$\frac{\partial r_i}{\partial u_i} \Delta u_i = -r_i\bigl(u_i;\; u_{j\neq i}\bigr)\,, \qquad (13.10)$$
where 𝑢 𝑗 represents the states from other components (i.e., 𝑗 ≠ 𝑖),
which are fixed at this level. Each component is solved before taking
a step in the entire state vector (Eq. 13.9). The procedure is given
in Alg. 13.3 and illustrated in Fig. 13.16. We call this the full-space
hierarchical Newton approach because the system-level solver iterates the
entire state vector. Solving each component before taking each step in
the full-space Newton iteration acts as a preconditioner. In general, the
monolithic approach is more efficient, and the hierarchical approach is
more robust, but these characteristics are case-dependent.
[Fig. 13.16: XDSM for the full-space hierarchical Newton iteration, where each component is solved with the latest states before each Newton step on the full state vector.]
A third option uses the functional representation illustrated in Fig. 13.9 to solve only for the coupling variables. We call this
the reduced-space hierarchical Newton approach because the system-level
solver iterates only in the space of the coupling variables, which is
smaller than the full space of the state variables. Using this approach,
each component’s solver can be a black box, as in the nonlinear block
Jacobi and Gauss–Seidel solvers. This approach is illustrated on the
right panel of Fig. 13.15. The reduced-space approach is mathemati-
cally equivalent and follows the same iteration path as the full-space
approach if each component solver in the reduced-space approach is
converged well enough.132
132. Gray et al., OpenMDAO: An open-source framework for multidisciplinary design, analysis, and optimization, 2019.
Algorithm 13.3 Full-space hierarchical Newton

Inputs:
$u^{(0)} = [u_1^{(0)}, \ldots, u_n^{(0)}]$: Initial guess for state variables
Outputs:
$u = [u_1, \ldots, u_n]$: System-level states

$k = 0$
while $\lVert r(u^{(k)}) \rVert_2 > \varepsilon$ do
    for all $i \in \{1, \ldots, n\}$ do
        while component $i$ is not converged do
            Compute $\partial r_i / \partial u_i$    (Jacobian block for component $i$ for current state)
            Solve $\dfrac{\partial r_i}{\partial u_i} \Delta u_i = -r_i$    (solve for Newton step for $i$th component)
            $u_i^{(k)} = u_i^{(k)} + \Delta u_i$    (update state variables for component $i$)
        end while
    end for
    Compute $r\bigl(u^{(k)}\bigr)$    (full residual vector for current states)
    Compute $\partial r / \partial u$    (full Jacobian for current states)
    Solve $\dfrac{\partial r}{\partial u} \Delta u = -r$    (coupled Newton system, Eq. 13.9)
    $u^{(k+1)} = u^{(k)} + \Delta u$    (update full state variable vector)
    $k = k + 1$
end while
where we need the partial derivatives of all the residuals with respect to the coupling variables to form the Jacobian matrix $\partial \hat{r} / \partial \hat{u}$. The Jacobian can be found by differentiating Eq. 13.11 with respect to the coupling variables. Then, expanding the concatenated residual and coupling variable vectors yields
variable vectors yields
𝜕𝑈1
𝐼 𝜕𝑈1
Δ𝑢ˆ 1 𝑢ˆ 1 − 𝑈1 (𝑢ˆ 2 , . . . , 𝑢ˆ 𝑚 )
− ··· −
𝜕𝑢ˆ 2 𝜕𝑢ˆ 𝑚
𝜕𝑈 𝜕𝑈2
− 2 Δ𝑢ˆ 2 𝑢ˆ 2 − 𝑈2 (𝑢ˆ 1 , 𝑢ˆ 3 , . . . , 𝑢ˆ 𝑚 )
𝐼 ··· −
𝜕𝑢ˆ 1 𝜕𝑢ˆ 𝑚 =− .
. .. ..
.. .. ..
.
..
. . .
.
𝜕𝑈𝑚
− 𝜕𝑈𝑚 Δ𝑢ˆ 𝑚 𝑢ˆ 𝑚 − 𝑈𝑚 (𝑢ˆ 1 , . . . , 𝑢ˆ 𝑚−1 )
𝜕𝑢ˆ − ··· 𝐼
1 𝜕𝑢ˆ 2
(13.13)
The residuals in the right-hand side of this equation are evaluated at
the current iteration.
The derivatives in the block Jacobian matrix are also computed
at the current iteration. Each row 𝑖 represents the derivatives of the
(potentially implicit) function that computes the outputs of component
𝑖 with respect to all the inputs of that component. The Jacobian matrix
in Eq. 13.13 has the same structure as the DSM (but transposed) and is
often sparse. These derivatives can be computed using the methods
from Chapter 6. These are partial derivatives in the sense that they do
not take into account the coupled system. However, they must take
into account the respective model and can be computed using implicit
analytic methods when the model is implicit.
This Newton solver is shown in Fig. 13.17 and detailed in Alg. 13.4.
Each component corresponds to a set of rows in the block Newton
system (Eq. 13.13). To compute each set of rows, the corresponding
component must be solved, and the derivatives of its outputs with
respect to its inputs must be computed as well. Each set can be computed
in parallel, but once the system is assembled, a step in the coupling
variables is computed by solving the full system (Eq. 13.13).
These coupled Newton methods have similar advantages and dis-
advantages to the plain Newton method. The main advantage is that it
[Fig. 13.17: XDSM for the reduced-space Newton solver, which iterates only on the coupling variables and treats each component solver as a black box.]
Algorithm 13.4 Reduced-space hierarchical Newton

Inputs:
$\hat{u}^{(0)} = [\hat{u}_1^{(0)}, \ldots, \hat{u}_m^{(0)}]$: Initial guess for coupling variables
Outputs:
$\hat{u} = [\hat{u}_1, \ldots, \hat{u}_m]$: System-level states

$k = 0$
while $\lVert \hat{r} \rVert_2 > \varepsilon$ do    (check residual norm for all components)
    for all $i \in \{1, \ldots, m\}$ do    (can be done in parallel)
        $U_i \leftarrow$ compute $U_i\bigl(\hat{u}_{j\neq i}^{(k)}\bigr)$    (solve component $i$ and compute its outputs)
    end for
    $\hat{r} = \hat{u}^{(k)} - U$    (compute all coupling variable residuals)
    Compute $\partial U / \partial \hat{u}$    (Jacobian of coupling variables for current state)
    Solve $\dfrac{\partial \hat{r}}{\partial \hat{u}} \Delta\hat{u} = -\hat{r}$    (coupled Newton system, Eq. 13.13)
    $\hat{u}^{(k+1)} = \hat{u}^{(k)} + \Delta\hat{u}$    (update all coupling variables)
    $k = k + 1$
end while
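A sketch of this reduced-space iteration, where U(u) returns the concatenated component outputs with each component solved internally; for simplicity, the Jacobian $I - \partial U/\partial \hat{u}$ is approximated here by forward differences rather than the analytic methods discussed in the text:

```python
import numpy as np

def reduced_space_newton(U, u0, tol=1e-10, max_iter=50, h=1e-6):
    """Newton on the coupling variables: r(u) = u - U(u) = 0 (Eq. 13.11)."""
    u = np.asarray(u0, dtype=float)
    n = u.size
    for _ in range(max_iter):
        Uu = U(u)
        r = u - Uu
        if np.linalg.norm(r) <= tol:
            break
        J = np.eye(n)
        for j in range(n):               # J = I - dU/du by forward differences
            e = np.zeros(n)
            e[j] = h
            J[:, j] -= (U(u + e) - Uu) / h
        u = u + np.linalg.solve(J, -r)
    return u
```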
where $\Delta u^{(k)}$ is the last step in the states, and $\Delta r^{(k)}$ is the difference between the two latest residual vectors. Because the inverse is provided explicitly, we can find the update by performing the multiplication.
In this case, the linear systems defined by $r_1$ and $r_2$ are small enough to be solved using a direct method, such as LU factorization. Thus, we can solve $r_1$ for $\Gamma$, for a given $d$, and solve $r_2$ for $d$, for a given $\Gamma$. Also, no conversions are involved, so the set of coupling variables is equivalent to the set of state variables ($\hat{u} = u$).
Using the nonlinear block Jacobi method (Alg. 13.1), we start with an initial guess (e.g., $\Gamma = 0$, $d = 0$) and solve $r_1 = 0$ and $r_2 = 0$ separately for the new values of $\Gamma$ and $d$, respectively. Then we use these new values of $\Gamma$ and $d$ to repeat the process until convergence.
Nonlinear block Gauss–Seidel (Alg. 13.2) is similar, but we need to solve the two components in sequence. We can start by solving $r_1 = 0$ for $\Gamma$ with $d = 0$. Then we use the $\Gamma$ obtained from this solution in $r_2$ and solve for a new $d$. We now have a new $d$ to use in $r_1$ to solve for a new $\Gamma$, and so on.
The Jacobian for the Newton system (Eq. 13.9) is
$$\frac{\partial r}{\partial u} = \begin{bmatrix} \dfrac{\partial r_1}{\partial u_1} & \dfrac{\partial r_1}{\partial u_2} \\ \dfrac{\partial r_2}{\partial u_1} & \dfrac{\partial r_2}{\partial u_2} \end{bmatrix} = \begin{bmatrix} A & \dfrac{\partial A}{\partial d}\Gamma - \dfrac{\partial v}{\partial d} \\ -\dfrac{\partial q}{\partial \Gamma} & K \end{bmatrix}.$$
[Fig. 13.18: Spanwise distribution of the lift, wing rotation ($d_\theta$), and vertical displacement ($d_z$) for the coupled aerostructural solution.]
We already have the block diagonal matrices in this Jacobian from the governing
equations, but we need to compute the off-diagonal partial derivative blocks,
which can be done analytically or with algorithmic differentiation (AD).
The solution is shown in Fig. 13.18, where we plot the variation of lift,
vertical displacement, and rotation along the span. The vertical displacements
are a subset of 𝑑, and the rotations are a conversion of a subset of 𝑑 representing
the rotations of the wing section at each spanwise location. The lift is the
vertical force at each spanwise location, which is proportional to Γ times the
wing chord at that location.
The monolithic Newton approach does not converge in this case. We
apply the full-space hierarchical approach (Alg. 13.3), which converges more
reliably. In this case, the reduced-space approach is not used because there is
no distinction between coupling variables and state variables.
In Fig. 13.19, we compare the convergence of the methods introduced in this section.¶ The Jacobi method has the poorest convergence rate and oscillates. The Gauss–Seidel method is much better, and it is even better with Aitken acceleration. Newton has the highest convergence rate, as expected. Broyden performs about as well as Gauss–Seidel in this case.
¶ These results and subsequent results based on the same example were obtained using OpenAeroStruct,202 which was developed using OpenMDAO. The description in these examples is simplified for didactic purposes; check the paper and code for more details.
202. Jasa et al., Open-source coupled aerostructural optimization using Python, 2018.
[Fig. 13.19: Convergence of each solver for the aerostructural system.]
[Fig. 13.20: Example of a solver hierarchy with recursive solvers handling groups of components.]
[Fig. 13.21: There are three main possibilities involving two components: parallel, serial, and coupled.]
[Fig. 13.22: Three examples of a system of four components with a two-level solver hierarchy (parallel, serial, and coupled groupings).]
To solve the system from Ex. 13.3 using hierarchical solvers, we can use the hierarchy shown in Fig. 13.23. We form three groups with three
components each. Each group includes the input and output conversion
components (which are explicit) and one implicit component (which
requires its own solver). Serial solvers can be used to handle the input
and output conversion components. A coupled solver is required to
solve the entire coupled system, but the coupling between the groups
is restricted to the corresponding outputs (components 3, 6, and 9).
Alternatively, we could apply a coupled solver to the functional
representation (Fig. 13.9, right). This would also use two levels of
solvers: a solver within each group and a system-level solver for the
[Fig. 13.23: For the case of Fig. 13.9, we can use a serial evaluation within each of the three groups and require a coupled solver to handle the coupling between the three groups.]
The development of coupled solvers is often done for a specific set of models from scratch, which requires substantial effort. OpenMDAO is an open-source framework that facilitates such efforts by implementing MAUD.132
132. Gray et al., OpenMDAO: An open-source framework for multidisciplinary design, analysis, and optimization, 2019.
All the solvers introduced in this chapter are available in OpenMDAO. This framework
also makes it easier to compute the derivatives of the coupled system, as we
will see in the next section. Users can assemble systems of mixed explicit and
implicit components.
For implicit components, they must give OpenMDAO access to the residual
computations and the corresponding state variables. For explicit components,
OpenMDAO only needs access to the inputs and the outputs, so it supports
black-box models.
OpenMDAO is usually more efficient when the user provides access to
the residuals and state variables instead of treating models as black boxes. A
hierarchy of multiple solvers can be set up in OpenMDAO, as illustrated in
Fig. 13.20. OpenMDAO also provides the necessary interfaces for user-defined
solvers. Finally, OpenMDAO encourages coupling through memory, which is
beneficial for numerical precision (see Tip 13.1) and computational efficiency
(see Tip 13.2).
$$\frac{\mathrm{d}f}{\mathrm{d}x} = \frac{\partial f}{\partial x} - \begin{bmatrix} \dfrac{\partial f}{\partial u_1} & \cdots & \dfrac{\partial f}{\partial u_n} \end{bmatrix} \begin{bmatrix} \phi_1 \\ \vdots \\ \phi_n \end{bmatrix}. \qquad (13.18)$$
Similarly, the adjoint equations (Eq. 6.46) can be written for a coupled
system using the same concatenated state and residual vectors. The
coupled adjoint equations are then
$$\begin{bmatrix} \dfrac{\partial r_1}{\partial u_1}^{\!\intercal} & \cdots & \dfrac{\partial r_n}{\partial u_1}^{\!\intercal} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial r_1}{\partial u_n}^{\!\intercal} & \cdots & \dfrac{\partial r_n}{\partial u_n}^{\!\intercal} \end{bmatrix} \begin{bmatrix} \psi_1 \\ \vdots \\ \psi_n \end{bmatrix} = \begin{bmatrix} \dfrac{\partial f}{\partial u_1}^{\!\intercal} \\ \vdots \\ \dfrac{\partial f}{\partial u_n}^{\!\intercal} \end{bmatrix}. \qquad (13.19)$$
After solving these equations for the coupled-adjoint vector, we can
use the coupled version of the total derivative equation (Eq. 6.47) to
compute the desired derivatives as
$$\frac{\mathrm{d}f}{\mathrm{d}x} = \frac{\partial f}{\partial x} - \begin{bmatrix} \psi_1^\intercal & \cdots & \psi_n^\intercal \end{bmatrix} \begin{bmatrix} \dfrac{\partial r_1}{\partial x} \\ \vdots \\ \dfrac{\partial r_n}{\partial x} \end{bmatrix}. \qquad (13.20)$$
Like the adjoint method from Section 6.7, the coupled adjoint is a powerful approach for computing gradients with respect to many design variables.∗
∗ The coupled-adjoint approach has been implemented for aerostructural problems governed by coupled PDEs207 and demonstrated in wing design optimization.209
207. Kenway et al., Scalable parallel approach for high-fidelity steady-state aeroelastic analysis and derivative computations, 2014.
209. Kenway and Martins, Multipoint high-fidelity aerostructural optimization of a transport aircraft configuration, 2014.
The required partial derivatives are the derivatives of the residuals or outputs of each component with respect to the state variables or inputs of all other components. In practice, the block structure of these partial derivative matrices is sparse, and the matrices themselves are sparse. This sparsity can be exploited using graph coloring to drastically reduce the cost of computing Jacobians at the system or component level, as explained in Section 6.8.
Figure 13.26 shows the structure of the Jacobians in Eq. 13.17 and
Eq. 13.19 for the three-group case from Fig. 13.23. The sparsity structure
of the Jacobian is the transpose of the DSM structure. Because the
Jacobian in Eq. 13.19 is transposed, the Jacobian in the adjoint equation
has the same structure as the DSM.
The structure of the linear system can be exploited in the same
way as for the nonlinear system solution using hierarchical solvers:
serial solvers within each group and a coupled solver for the three
groups. The block Jacobi and Gauss–Seidel methods from Section 13.2.5
are applicable to coupled linear components, so these methods can
be re-used to solve this coupled linear system for the total coupled
derivatives.
The partial derivatives in the coupled Jacobian, the right-hand side
of the linear systems (Eqs. 13.17 and 13.19), and the total derivatives
equations (Eqs. 13.18 and 13.20) can be computed with any of the
methods from Chapter 6. The nature of these derivatives is the same as we have seen previously for implicit analytic methods (Section 6.7). They do not require the solution of the equation and are typically cheap to compute. Ideally, the components would already have analytic derivatives of their outputs with respect to their inputs, which are all the derivatives needed at the system level.
[Fig. 13.26: Jacobian structure for the residual form of the coupled direct (left) and adjoint (right) equations for the three-group system of Fig. 13.23. The structure of the transpose of the Jacobian is the same as that of the DSM.]
The partial derivatives can also be computed using the finite-
difference or complex-step methods. Even though these are not efficient
for cases with many inputs, it might still be more efficient to compute
the partial derivatives with these methods and then solve the coupled
derivative equations instead of performing a finite difference of the
coupled system, as described in Section 13.3.1. The reason is that com-
puting the partial derivatives avoids having to reconverge the coupled
system for every input perturbation. In addition, the coupled system
derivatives should be more accurate when finite differences are used
only to compute the partial derivatives.
Variants of the coupled direct and adjoint methods can also be derived
for the functional form of the system-level representation (Eq. 13.4),
by using the residuals defined for the system-level Newton solver
(Eq. 13.11),
$$\hat{r}_i(\hat{u}) = \hat{u}_i - U_i(\hat{u}_{j\neq i}) = 0\,, \quad i = 1, \ldots, m\,. \qquad (13.21)$$
Recall that driving these residuals to zero relies on a solver for each
component to solve for each component’s states and another solver to
$$\begin{bmatrix} I & -\dfrac{\partial U_1}{\partial \hat{u}_2} & \cdots & -\dfrac{\partial U_1}{\partial \hat{u}_m} \\ -\dfrac{\partial U_2}{\partial \hat{u}_1} & I & \cdots & -\dfrac{\partial U_2}{\partial \hat{u}_m} \\ \vdots & \vdots & \ddots & \vdots \\ -\dfrac{\partial U_m}{\partial \hat{u}_1} & -\dfrac{\partial U_m}{\partial \hat{u}_2} & \cdots & I \end{bmatrix} \begin{bmatrix} \hat{\phi}_1 \\ \hat{\phi}_2 \\ \vdots \\ \hat{\phi}_m \end{bmatrix} = \begin{bmatrix} \dfrac{\partial U_1}{\partial x} \\ \dfrac{\partial U_2}{\partial x} \\ \vdots \\ \dfrac{\partial U_m}{\partial x} \end{bmatrix}, \qquad (13.22)$$
where the Jacobian is identical to the one we derived for the coupled Newton step (Eq. 13.13). Here, $\hat{\phi}_i$ represents the derivatives of the coupling variables from component $i$ with respect to the design variables. The solution can then be used in the following equation to compute the total derivatives:
$$\frac{\mathrm{d}f}{\mathrm{d}x} = \frac{\partial f}{\partial x} - \begin{bmatrix} \dfrac{\partial f}{\partial \hat{u}_1} & \cdots & \dfrac{\partial f}{\partial \hat{u}_m} \end{bmatrix} \begin{bmatrix} \hat{\phi}_1 \\ \vdots \\ \hat{\phi}_m \end{bmatrix}. \qquad (13.23)$$
Similarly, the functional version of the coupled adjoint equations
can be derived as
$$\begin{bmatrix} I & -\dfrac{\partial U_2}{\partial \hat{u}_1}^{\!\intercal} & \cdots & -\dfrac{\partial U_m}{\partial \hat{u}_1}^{\!\intercal} \\ -\dfrac{\partial U_1}{\partial \hat{u}_2}^{\!\intercal} & I & \cdots & -\dfrac{\partial U_m}{\partial \hat{u}_2}^{\!\intercal} \\ \vdots & \vdots & \ddots & \vdots \\ -\dfrac{\partial U_1}{\partial \hat{u}_m}^{\!\intercal} & -\dfrac{\partial U_2}{\partial \hat{u}_m}^{\!\intercal} & \cdots & I \end{bmatrix} \begin{bmatrix} \hat{\psi}_1 \\ \hat{\psi}_2 \\ \vdots \\ \hat{\psi}_m \end{bmatrix} = \begin{bmatrix} \dfrac{\partial f}{\partial \hat{u}_1}^{\!\intercal} \\ \dfrac{\partial f}{\partial \hat{u}_2}^{\!\intercal} \\ \vdots \\ \dfrac{\partial f}{\partial \hat{u}_m}^{\!\intercal} \end{bmatrix}. \qquad (13.24)$$
After solving for the coupled-adjoint vector using the previous equa-
tion, we can use the total derivative equation to compute the desired
derivatives:
$$\frac{\mathrm{d}f}{\mathrm{d}x} = \frac{\partial f}{\partial x} - \begin{bmatrix} \hat{\psi}_1^\intercal & \cdots & \hat{\psi}_m^\intercal \end{bmatrix} \begin{bmatrix} \dfrac{\partial \hat{r}_1}{\partial x} \\ \vdots \\ \dfrac{\partial \hat{r}_m}{\partial x} \end{bmatrix}. \qquad (13.25)$$
Because the coupling variables ($\hat{u}$) are usually a reduction of the internal state variables ($u$), the linear systems in Eqs. 13.22 and 13.24 are
usually much smaller than that of the residual counterparts (Eqs. 13.17
and 13.19). However, unlike the partial derivatives in the residual form,
the partial derivatives in the functional form Jacobian need to account
for the solution of the corresponding component. When viewed at
the component level, these derivatives are actually total derivatives of
the component. When the component is an implicit set of equations,
computing these derivatives with finite-differencing would require
solving the component’s equations for each variable perturbation.
Alternatively, an implicit analytic method (from Section 6.7) could be
applied to the component to compute these derivatives.
Figure 13.27 shows the Jacobian structure in the functional form of the coupled direct method (Eq. 13.22) for the case of Fig. 13.23. The dimension of this Jacobian is smaller than that of the residual form. Recall from Fig. 13.9 that $U_1$ corresponds to $r_3$, $U_2$ corresponds to $r_6$, and $U_3$ corresponds to $r_9$. Thus, the total size of this Jacobian corresponds to the sum of the sizes of components 3, 6, and 9, as opposed to the sum of the sizes of all nine components for the residual form.
[Fig. 13.27: Jacobian structure of the functional form for the three-group case; the 3 × 3 block matrix has identity diagonal blocks and off-diagonal blocks $-\partial U_i/\partial \hat{u}_j$.]
The empty weight 𝑊 only depends on 𝑡 and 𝑏, and the dependence is explicit
(it does not require solving the aerodynamic or structural models). The drag 𝐷
and lift 𝐿 depend on all variables once we account for the coupled system of
equations. The remaining variables are fixed: 𝑅 is the required range, 𝑉 is the
airplane’s cruise speed, and 𝑐 is the specific fuel consumption of the airplane’s
engines. We also need to constrain the stresses in the structure, 𝜎, which are an
explicit function of the displacements (see Ex. 6.12).
To solve this optimization problem using gradient-based optimization, we
need the coupled derivatives of 𝑓 and 𝜎 with respect to 𝛼, 𝑏, 𝜃, and 𝑡. Computing
the derivatives of the aerodynamic and structural models separately is not
sufficient. For example, a perturbation on the twist changes the loads, which
then changes the wing displacements, which requires solving the aerodynamic
model again. Coupled derivatives take this effect into account.
[Fig. 13.28: DSM for the aerostructural system, with residuals $r_\alpha, r_b, r_\theta, r_t, r_\Gamma, r_d, r_W, r_D, r_L, r_\sigma, r_f$ grouped into design variables, intermediate variables, and functions.]
We show the DSM for the system in Fig. 13.28. Because the DSM has the
same sparsity structure as the transpose of the Jacobian, this diagram reflects
the structure of the reverse UDE. The blocks that pertain to the design variables
have unit diagonals because they are independent variables, but they directly
affect the solver blocks. The blocks responsible for solving for Γ and 𝑑 are the
only ones with feedback coupling. The part of the UDE pertaining to Γ and 𝑑
is the Jacobian of residuals for the aerodynamic and structural components,
which we already derived in Ex. 13.5 to apply Newton’s method on the coupled
system. The functions of interest are all explicit components and only depend
directly on the design variables or the state variables. For example, the weight
𝑊 depends only on 𝑡; drag and lift depend only on the converged Γ; 𝜎 depends
on the displacements; and finally, the fuel burn 𝑓 just depends on drag, lift,
and weight. This whole coupled chain of derivatives is computed by solving
the linear system shown in Fig. 13.28.
For brevity, we only discuss the derivatives required to compute the
derivative of fuel burn with respect to span, but the other partial derivatives
would follow the same rationale.
• $\partial r/\partial u$ is identical to what we derived when solving the coupled aerostructural system with Newton's method in Ex. 13.5.
• The total derivative of the fuel burn with respect to span follows from the chain rule:
$$\frac{\mathrm{d}f}{\mathrm{d}b} = \frac{\partial f}{\partial D}\frac{\mathrm{d}D}{\mathrm{d}b} + \frac{\partial f}{\partial L}\frac{\mathrm{d}L}{\mathrm{d}b} + \frac{\partial f}{\partial W}\frac{\mathrm{d}W}{\mathrm{d}b}\,,$$
where the partial derivatives can be obtained by differentiating Eq. 13.26
symbolically, and the total derivatives are part of the coupled linear
system solution.
After computing all the partial derivative terms, we solve either the forward or reverse UDE system. For the derivative with respect to span, neither method has an advantage. However, for the derivatives of fuel burn with respect to the twist and thickness variables, the reverse mode is much more efficient. In this example, $\mathrm{d}f/\mathrm{d}b = -11.0$ kg/m, so each additional meter of span reduced the fuel burn by 11 kg. If we compute this same derivative without coupling (by converging the aerostructural model but not considering the off-diagonal terms in the aerostructural Jacobian), we obtain $\mathrm{d}f/\mathrm{d}b = -17.7$ kg/m, which is significantly different. The derivatives of the fuel burn with respect to the twist distribution and the thickness distribution along the wingspan are plotted in Fig. 13.29, where we can see the difference between coupled and uncoupled derivatives.
[Fig. 13.29: Derivatives of the fuel burn with respect to the spanwise distribution of twist and thickness variables. The coupled derivatives differ from the uncoupled derivatives, especially for the derivatives with respect to structural thicknesses near the wing root.]

13.4 Monolithic MDO Architectures

So far in this chapter, we have extended the models and solvers from Chapter 3 and derivative computation methods from Chapter 6 to coupled systems. We now discuss the options to optimize coupled systems, which are given by various MDO architectures.
Monolithic MDO architectures cast the design problem as a single optimization. The only difference between the different monolithic architectures is the set of design variables that the optimizer is responsible for, which affects the constraint formulation and how the governing equations are solved.
[Fig. 13.30: The MDF architecture relies on an MDA to solve for the coupling and state variables at each optimization iteration. In this case, the MDA uses the block Gauss–Seidel method.]
sense as we are with finding an improved design. However, it is not guaranteed that the design constraints are satisfied if the optimization is terminated early; that depends on whether the optimization algorithm maintains a feasible design point or not.
The main disadvantage of MDF is that it solves an MDA for each
optimization iteration, which requires its own algorithm outside of the
optimization. Implementing an MDA algorithm can be time-consuming
if one is not already in place.
As mentioned in Tip 13.3, a MAUD-based framework such as OpenMDAO can facilitate this. MAUD naturally implements the MDF architecture because it focuses on solving the MDA (Section 13.2.5) and on computing the derivatives corresponding to the MDA (Section 13.3.3).‡
‡ The first application of MAUD was the design optimization of a satellite and its orbit dynamics. The problem consisted of over 25,000 design variables and over 2 million state variables.210
210. Hwang et al., Large-scale multidisciplinary optimization of a small satellite's design and operation, 2014.
When using a gradient-based optimizer, gradient computations are also challenging for MDF because coupled derivatives are required. Finite-difference derivative approximations are easy to implement, but their poor scalability and accuracy are compounded by the MDA, as explained in Section 13.3. Ideally, we would use one of the analytic
Continuing the wing aerostructural problem from Ex. 13.6, we are finally ready to optimize the wing. The MDF formulation is as follows:

minimize      $f$
by varying    $\alpha, b, \theta, t$
subject to    $L - W = 0$
              $2.5|\sigma| - \sigma_\text{yield} \leq 0$
while solving $A(d)\,\Gamma - v(d) = 0$
              $K d - q(\Gamma) = 0$
for $\Gamma, d$.

[Fig. 13.31: The optimization reduces the fuel burn by increasing the span (initial and optimized planforms).]
The structural stresses are constrained to be less than the yield stress of the material by a safety factor (2.5 in this case). In Ex. 13.5, we set up the MDA for the aerostructural problem, and in Ex. 13.6, we set up the coupled derivative computations needed to solve this problem using gradient-based optimization. Solving this optimization resulted in the larger-span wing shown in Fig. 13.31. This larger span increases the structural weight but decreases drag. Although the increase in weight would typically increase the fuel burn, the drag decrease more than compensates for this adverse effect, and the fuel burn ultimately decreases up to this value of span. Beyond this optimal span value, the weight penalty would start to dominate, resulting in a fuel burn increase.
The twist and thickness distributions are shown in Fig. 13.32. The wing twist directly controls the spanwise lift loading. The baseline wing had no twist, which resulted in the loading shown in Fig. 13.33. In this figure, the gray line represents a hypothetical elliptical lift distribution, which results in the theoretical minimum for induced drag. The loading distributions for the level flight (1 g) and maneuver conditions (2.5 g) are indistinguishable. The optimization increases the twist in the midspan and drastically decreases it toward the tip. This twist distribution differentiates the loading at the two conditions: it makes the loading at level flight closer to the elliptical ideal while shifting the loading at the maneuver condition toward the wing root.
[Fig. 13.32: Twist and thickness distributions for the baseline and optimized wings.]
The thickness distribution also changes significantly, as shown in Fig. 13.32. The optimization tailors the thickness by adding more thickness in the spar near the root, where the moments are larger, and thins out the wing much more toward the tip, where the loads decrease. This more radical thickness distribution is enabled by the tailoring of the spanwise lift loading discussed previously.
[Fig. 13.33: Spanwise loading distributions for the baseline and optimized wings at the level flight (1 g) and maneuver (2.5 g) conditions.]
These trades make sense because, at the level flight condition, the optimizer
[Fig. 13.34: Iteration paths of sequential optimization and MDO in the twist–thickness design space, starting from $x_0$.]
To perform sequential optimization for the wing design problem of Ex. 13.1,
we could start by optimizing the aerodynamics by solving the following
problem:
minimize 𝑓
by varying 𝛼, 𝜃
subject to 𝐿−𝑊 = 0.
Here, 𝑊 is constant because the structural thicknesses 𝑡 are fixed, but 𝐿 is a
function of the aerodynamic design variables and states. We cannot include the
span 𝑏 because it is a shared variable, as explained in Section 13.1. Otherwise,
this optimization would tend to increase 𝑏 indefinitely to reduce the lift-induced
drag. Because 𝑓 is a function of 𝐷 and 𝐿, and 𝐿 is constant because 𝐿 = 𝑊, we
could replace the objective with 𝐷.
Once the aerodynamic optimization has converged, the twist distribution
and the forces are fixed, and we then optimize the structure by minimizing
weight subject to stress constraints by solving the following problem:
minimize 𝑓
by varying 𝑡
subject to 2.5|𝜎| − 𝜎yield ≤ 0 .
Because the drag and lift are constant, the objective could be replaced by 𝑊.
Again, we cannot include the span in this problem because it would decrease
indefinitely to reduce the weight and internal loads due to bending.
These two optimizations are repeated until convergence. As shown in
Fig. 13.34, sequential optimization only changes one variable at a time, and it
converges to a point on the constraint with about 3.5° more twist than the true
optimum of the MDO. When including more variables, these differences are
likely to be even larger.
minimize      $f(x;\, \hat{u})$
by varying    $x,\, \hat{u}^t$
subject to    $g(x;\, \hat{u}) \leq 0$
              $h_i^c = \hat{u}_i^t - \hat{u}_i = 0\,, \quad i = 1, \ldots, m \qquad (13.29)$
while solving $r_i\bigl(\hat{u}_i;\; x,\, \hat{u}_{j\neq i}^t\bigr) = 0\,, \quad i = 1, \ldots, m$
for $\hat{u}$.
[Fig. 13.35: The IDF architecture breaks up the MDA by letting the optimizer solve for the coupling variables that satisfy interdisciplinary feasibility.]
One advantage of IDF is that each component can be solved in parallel because they do not depend on each other directly. Another advantage is that if gradient-based optimization is used to solve the
problem, the optimizer is typically more robust and has a better conver-
gence rate than the fixed-point iteration algorithms of Section 13.2.5.
The main disadvantage of IDF is that the optimizer must handle
more variables and constraints compared with the MDF architecture. If
the number of coupling variables is large, the size of the resulting opti-
mization problem may be too large to solve efficiently. This problem can
be mitigated by careful selection of the components or by aggregating
the coupling variables to reduce their dimensionality.
Unlike MDF, IDF does not guarantee a multidisciplinary feasible
state at every design optimization iteration. Multidisciplinary feasibility
is only guaranteed at the end of the optimization through the satisfaction
of the consistency constraints. This is a disadvantage because if the
optimization stops prematurely or we run out of time, we do not have
a valid state for the coupled system.
For the IDF architecture, we need to make copies of the coupling variables
(Γ𝑡 and 𝑑 𝑡 ) and add the corresponding consistency constraints, as highlighted
in the following problem statement:
$$\begin{aligned}
\text{minimize} \quad & f \\
\text{by varying} \quad & \alpha, b, \theta, t, \Gamma^t, d^t \\
\text{subject to} \quad & L = W \\
& 2.5\lvert\sigma\rvert - \sigma_{\text{yield}} \le 0 \\
& \Gamma^t - \Gamma = 0 \\
& d^t - d = 0 \\
\text{while solving} \quad & A(d^t)\,\Gamma - \theta(d^t, \alpha) = 0 \\
& K d - q(\Gamma^t) = 0 \\
\text{for} \quad & \Gamma, d .
\end{aligned}$$
The aerodynamic and structural models are solved independently. The aerody-
namic solver finds Γ for the 𝑑 𝑡 given by the optimizer, and the structural solver
finds 𝑑 for the given Γ𝑡 .
When using gradient-based optimization, we do not require coupled
derivatives, but we do need the derivatives of each model with respect to both
state variables. The derivatives of the consistency constraints are just a unit
matrix when taken with respect to the variable copies and are zero otherwise.
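To illustrate the IDF data flow in code, the sketch below optimizes a hypothetical two-component coupled system with SciPy; the linear component models, the objective, and all names are invented for this example and are not the aerostructural model above:

```python
from scipy.optimize import minimize

def solver1(x, u2t):  # component 1: computes u1 given x and the copy u2t
    return 0.5 * x - 0.3 * u2t

def solver2(x, u1t):  # component 2: computes u2 given x and the copy u1t
    return 0.2 * x + 0.4 * u1t

def objective(z):     # z = [x, u1t, u2t]: design variable plus coupling copies
    x, u1t, u2t = z
    return (solver1(x, u2t) - 1.0) ** 2 + (solver2(x, u1t) + x) ** 2

consistency = [  # the optimizer drives the copies to match the component outputs
    {"type": "eq", "fun": lambda z: solver1(z[0], z[2]) - z[1]},
    {"type": "eq", "fun": lambda z: solver2(z[0], z[1]) - z[2]},
]
res = minimize(objective, x0=[1.0, 0.0, 0.0], constraints=consistency,
               method="SLSQP")
print(res.x)  # (x, u1t, u2t) is consistent at the optimum
```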
coupling variables, and the design optimization problem for the design
variables. All that is required from the model is the computation
of residuals. Because the optimizer is controlling all these variables,
SAND is also known as a full-space approach. SAND can be stated as
follows:

$$\begin{aligned}
\text{minimize} \quad & f(x, \hat{u}, u) \\
\text{by varying} \quad & x, \hat{u}, u \\
\text{subject to} \quad & g(x, \hat{u}) \le 0 \qquad (13.31) \\
& r(x, \hat{u}, u) = 0 .
\end{aligned}$$
Here, we use the representation shown in Fig. 13.7, so there are two
sets of explicit functions that convert the input coupling variables of
the component. The SAND architecture is also applicable to single
components, in which case there are no coupling variables. The XDSM
for SAND is shown in Fig. 13.36.
Fig. 13.36 The SAND architecture lets the optimizer solve for all variables (design, coupling, and state variables), and component solvers are no longer needed.

Because it solves for all variables simultaneously, the SAND architecture can be the most efficient way to get to the optimal solution. In practice, however, it is unlikely that this is advantageous when efficient component solvers are available.
The resulting optimization problem is the largest of all MDO archi-
tectures and requires an optimizer that scales well with the number
of variables. Therefore, a gradient-based optimization algorithm is
likely required, in which case the derivative computation must also
be considered. Fortunately, SAND does not require derivatives of the
coupled system or even total derivatives that account for the component
solution; only partial derivatives of residuals are needed.
SAND is an intrusive approach because it requires access to residuals.
For the SAND approach, we do away completely with the solvers and let
the optimizer find the states. The resulting problem is as follows:
minimize 𝑓
by varying 𝛼, 𝑏, 𝜃, 𝑡, Γ, 𝑑
subject to 𝐿=𝑊
2.5|𝜎| − 𝜎yield ≤ 0
𝐴Γ − 𝜃 = 0
𝐾𝑑 − 𝑞 = 0.
Instead of being solved separately, the models are now solved by the optimizer.
When using gradient-based optimization, the required derivatives are just
partial derivatives of the residuals (the same partial derivatives we would use
for an implicit analytic method).
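The same idea can be sketched in code for SAND: the optimizer owns the design variable and the states, and the model supplies only residuals. The toy linear "analysis" below is hypothetical:

```python
import numpy as np
from scipy.optimize import minimize

def residual(x, u):  # toy analysis residual r(x, u) = A(x) u - b
    A = np.array([[2.0, -1.0], [-2.0, 3.0 + x]])
    b = np.array([0.0, 1.0])
    return A @ u - b

def objective(z):    # z = [x, u1, u2]
    x, u = z[0], z[1:]
    return (u[0] - 0.5) ** 2 + (u[1] - 0.5) ** 2 + 0.1 * x ** 2

cons = [{"type": "eq", "fun": lambda z: residual(z[0], z[1:])}]
res = minimize(objective, x0=[0.0, 0.0, 0.0], constraints=cons, method="SLSQP")
print(res.x)  # the optimizer converges the residuals and the design together
```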
variables) and design variables that affect two or more components directly (called shared design variables).41 We denote the vector of design variables local to component 𝑖 by 𝑥𝑖 and the shared variables by 𝑥0. The full vector of design variables is given by concatenating the shared and local design variables into a single vector $x = \left[x_0^\intercal, x_1^\intercal, \dots, x_m^\intercal\right]^\intercal$, where 𝑚 is the number of components.

41. Martins and Lambe, Multidisciplinary design optimization: A survey of architectures, 2013.

If a constraint can be computed using a single component and satisfied by varying only the local design variables for that component, it is a local constraint; otherwise, it is nonlocal. Similarly to the design variables, we concatenate the constraints as $g = \left[g_0^\intercal, g_1^\intercal, \dots, g_m^\intercal\right]^\intercal$. The same distinction could be applied to the objective function, but we do not usually do this.
The MDO problem representation we use here is shown in Fig. 13.37 for a general three-component system. We use the functional form introduced in Section 13.2.3, where the states in each component are hidden. In this form, the system level only has access to the outputs of each solver, which are the coupling variables and the functions of interest.

[Fig. 13.37: problem representation for a general three-component system, showing the global constraints 𝑔0 and the dependence of each component on the shared variables 𝑥0 and its local variables 𝑥1, 𝑥2, 𝑥3.]

The set of constraints is also split into shared constraints and local ones. Local constraints are computed by the corresponding component and depend only on the variables available in that component. Shared constraints depend on more than one set of coupling variables. These dependencies are also shown in Fig. 13.37.
target values of the coupling and shared design variables. These target
values are then shared with all disciplines during every iteration of
the solution procedure. The complete independence of disciplinary
subproblems combined with the simplicity of the data-sharing protocol
makes this architecture attractive for problems with a small amount of
shared data.
The system-level subproblem modifies the original problem as
follows: (1) local constraints are removed, (2) target coupling variables,
𝑢ˆ 𝑡 , are added as design variables, and (3) a consistency constraint is
added. This optimization problem can be written as follows:
$$\begin{aligned}
\text{minimize} \quad & f\left(x_0, x_1^t, \dots, x_m^t, \hat{u}^t\right) \\
\text{by varying} \quad & x_0, x_1^t, \dots, x_m^t, \hat{u}^t \\
\text{subject to} \quad & g_0\left(x_0, x_1^t, \dots, x_m^t, \hat{u}^t\right) \le 0 \\
& J_i^* = \left\lVert x_{0i}^t - x_0 \right\rVert_2^2 + \left\lVert x_i^t - x_i \right\rVert_2^2 + \left\lVert \hat{u}_i^t - \hat{u}_i\left(x_{0i}^t, x_i, \hat{u}^t_{j \ne i}\right) \right\rVert_2^2 = 0 \quad \text{for } i = 1, \dots, m , \qquad (13.32)
\end{aligned}$$

where $x_{0i}^t$ are copies of the shared design variables that are passed to discipline 𝑖, and $x_i^t$ are copies of the local design variables passed to the system subproblem.
The constraint function 𝐽𝑖∗ is a measure of the inconsistency between
the values requested by the system-level subproblem and the results
from the discipline 𝑖 subproblem. The disciplinary subproblems do not
include the original objective function. Instead, the objective of each
subproblem is to minimize the inconsistency function.
for 𝑢ˆ 𝑖 .
These subproblems are independent of each other and can be solved
in parallel. Thus, the system-level subproblem is responsible for
minimizing the design objective, whereas the discipline subproblems
minimize system inconsistency while satisfying local constraints.
The CO problem statement has been shown to be mathematically equivalent to the original MDO problem.212 There are two versions of the CO architecture: CO1 and CO2. Here, we only present the CO2 version. The XDSM for CO is shown in Fig. 13.38, and the procedure is detailed in Alg. 13.5.

212. Braun and Kroo, Development and application of the collaborative optimization architecture in a multidisciplinary design environment, 1997.
Fig. 13.38 Diagram for the CO architecture.

CO has the organizational advantage of having entirely separate disciplinary subproblems. This is desirable when designers in each discipline want to maintain some autonomy. However, the CO formulation
Here, the structural optimization minimizes the discrepancy between the span wanted by the structures (a decrease) and what the system level requests (which takes into account the opposite trend from aerodynamics). The structural subproblem is fully responsible for satisfying the stress constraints by changing the structural sizing variables 𝑡, which are local variables.
section, except that ATC uses penalties instead of a constraint.214,215 The ATC system-level problem is as follows:

$$\begin{aligned}
\text{minimize} \quad & f_0\left(x, \hat{u}^t\right) + \sum_{i=1}^{m} \Phi_i\left(x_{0i}^t - x_0,\ \hat{u}_i^t - \hat{u}_i\left(x_{0i}^t, x_i, \hat{u}^t\right)\right) + \Phi_0\left(g_0\left(x, \hat{u}^t\right)\right) \qquad (13.34) \\
\text{by varying} \quad & x_0, \hat{u}^t ,
\end{aligned}$$

214. Tosserams et al., An augmented Lagrangian relaxation for analytical target cascading using the alternating direction method of multipliers, 2006.
215. Talgorn and Kokkolaras, Compact implementation of non-hierarchical analytical target cascading for coordinating distributed multidisciplinary design optimization problems, 2017.
[Fig. 13.39: XDSM for the ATC architecture, showing the penalty weight update, the system-level optimization with its system and penalty functions, and the parallel discipline-level optimizations, each with its own solver, discipline functions, and penalty functions.]
$$\begin{aligned}
\text{by varying} \quad & x_{0i}^t, x_i \qquad (13.35) \\
\text{subject to} \quad & g_i\left(x_{0i}^t, x_i; \hat{u}_i\right) \le 0 \\
\text{while solving} \quad & r_i\left(\hat{u}_i; x_{0i}^t, x_i, \hat{u}^t_{j \ne i}\right) = 0 \\
\text{for} \quad & \hat{u}_i .
\end{aligned}$$
The most common penalty functions used in ATC are quadratic
penalty functions (see Section 5.4.1). Appropriate penalty weights are
important for multidisciplinary consistency and convergence.
Fig. 13.40 Diagram for the BLISS architecture.

13.5.4 Asymmetric Subspace Optimization
Asymmetric subspace optimization (ASO) is a distributed MDF archi-
tecture motivated by cases where there is a large discrepancy between
the cost of the disciplinary solvers. The cheaper disciplinary analyses
are replaced by disciplinary design optimizations inside the overall
MDA to reduce the number of more expensive disciplinary analyses.
The system-level optimization subproblem is as follows:
$$\begin{aligned}
\text{minimize} \quad & f(x; \hat{u}) \\
\text{by varying} \quad & x_0, x_k \\
\text{subject to} \quad & g_0(x; \hat{u}) \le 0 \\
& g_k(x; \hat{u}_k) \le 0 \quad \text{for all } k, \qquad (13.38) \\
\text{while solving} \quad & r_k\left(\hat{u}_k; x_k, \hat{u}^t_{j \ne k}\right) = 0 \\
\text{for} \quad & \hat{u}_k .
\end{aligned}$$
$$\begin{aligned}
\text{minimize} \quad & f(x; \hat{u}) \\
\text{by varying} \quad & x_i \\
\text{subject to} \quad & g_i(x_0, x_i; \hat{u}_i) \le 0 \qquad (13.39) \\
\text{while solving} \quad & r_i\left(\hat{u}_i; x_i, \hat{u}^t_{j \ne i}\right) = 0 \\
\text{for} \quad & \hat{u}_i .
\end{aligned}$$
Fig. 13.41 Diagram for the ASO architecture.

For a gradient-based system-level optimizer, the gradients of the objective and constraints must take into account the suboptimization. This requires coupled post-optimality derivative computation, which increases computational and implementation time costs compared with a normal coupled derivative computation. The total optimization cost is only competitive with MDF if the discrepancy between the disciplinary solver costs is large enough.
follows:

$$\begin{aligned}
\text{minimize} \quad & f \\
\text{by varying} \quad & b, \theta \\
\text{subject to} \quad & L - W^* = 0 \\
\text{while solving} \quad & A(d^*)\,\Gamma - \theta(d^*) = 0 \\
\text{for} \quad & \Gamma ,
\end{aligned}$$

where $W^*$ and $d^*$ correspond to values obtained from the structural suboptimization. The suboptimization is formulated as follows:

$$\begin{aligned}
\text{minimize} \quad & f \\
\text{by varying} \quad & t \\
\text{subject to} \quad & 2.5\lvert\sigma\rvert - \sigma_{\text{yield}} \le 0 \\
\text{while solving} \quad & K d - q = 0 \\
\text{for} \quad & d .
\end{aligned}$$
Similar to the sequential optimization, we could replace 𝑓 with 𝑊 in the
suboptimization because the other parameters in 𝑓 are fixed. To solve the
system-level problem with a gradient-based optimizer, we would need post-
optimality derivatives of 𝑊 ∗ with respect to span and Γ.
13.6 Summary

[Fig. 13.42: classification of MDO architectures into monolithic (MDF/MAUD, IDF, SAND) and distributed; the distributed architectures include distributed MDF (BLISS, CSSO, MDOIS, ASO), multilevel (CO, QSD), and penalty (IPD/EPD, ECO) approaches.]
The distributed architectures can be divided according to whether or not they enforce multidisciplinary feasibility (through an MDA of the whole system), as shown in Fig. 13.42. Distributed MDF architectures enforce multidisciplinary feasibility through an MDA. The distributed IDF architectures are like IDF in that no MDA is required. However, they must ensure multidisciplinary feasibility in some other way. Some do this by formulating an appropriate multilevel optimization (such as CO), and others use penalties to ensure this (such as ATC).∗

∗ Martins and Lambe41 describe all of these MDO architectures in detail.
41. Martins and Lambe, Multidisciplinary design optimization: A survey of architectures, 2013.

Several commercial MDO frameworks are available, including Isight/SEE218 by Dassault Systèmes, ModelCenter/CenterLink by Phoenix Integration, modeFRONTIER by Esteco, AML Suite by TechnoSoft, Optimus by Noesis Solutions, and VisualDOC by Vanderplaats Research and Development.219 These frameworks focus on making it easy for users to couple multiple disciplines and use the optimization algorithms through graphical user interfaces. They also provide convenient wrappers to popular commercial engineering tools. Typically, these frameworks use fixed-point iteration to converge the MDA. When derivatives are needed for a gradient-based optimizer, finite-difference approximations are used rather than more accurate analytic derivatives.

218. Golovidov et al., Flexible implementation of approximation concepts in an MDO framework, 1998.
219. Balabanov et al., VisualDOC: A software system for general purpose integration and design optimization, 2002.
Problems
13.3 Consider the DSMs that follow. For each case, what is the lowest
number of feedback loops you can achieve through reordering?
What hierarchy of solvers would you recommend to solve the
coupled problem for each case?
a.–c. [Three DSM figures, not reproduced here; each begins with component A.]
13.4 Consider the “spaghetti” diagram shown in Fig. 13.43. Draw the
equivalent DSM for these dependencies. How can you exploit
the structure in these dependencies? What hierarchy of solvers
would you recommend to solve a coupled system with these
dependencies?
[Fig. 13.43: "spaghetti" dependency diagram among components, including A, B, C, and F.]
$$C_L = C_{L_0} - C_{L,\theta}\,\theta ,$$

where 𝜃 is the angle of deflection at the wing tip. Use $C_{L_0} = 0.4$ and $C_{L,\theta} = 0.1\ \mathrm{rad}^{-1}$. The deflection also depends on the lift. We compute 𝜃 assuming a uniform lift distribution and using simple beam bending theory as

$$\theta = \frac{(L/b)(b/2)^3}{6EI} = \frac{L b^2}{48 E I} .$$

Fig. 13.44 The aerostructural model couples aerodynamics and structures through lift and wing deflection.

The Young's modulus is 𝐸 = 70 GPa. Use the H-shaped cross-section described in Prob. 5.17 to compute the second moment of inertia, 𝐼.
We add the flight speed 𝑣 to the set of design variables and
handle 𝐿 = 𝑊 as a constraint. The objective of the aerostructural
optimization problem is to minimize the power with respect to
𝑥 = [𝑏, 𝑐, 𝑣], subject to 𝐿 = 𝑊.
Solve this problem using MDF, IDF, and a distributed MDO
architecture. Compare the aerostructural optimal solution with
the original solution from Appendix D.1.6 and discuss your
results.
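As a starting point for this problem, the lift-deflection coupling can be converged with a simple fixed-point MDA. In the sketch below, only 𝐶𝐿0, 𝐶𝐿,𝜃, and 𝐸 come from the problem statement; 𝑞, 𝑏, 𝑐, and 𝐼 are placeholder values chosen for illustration:

```python
# Fixed-point MDA for the C_L / theta coupling (a sketch).
CL0, CLtheta, E = 0.4, 0.1, 70e9       # from the problem statement
q, b, c, I = 240.0, 10.0, 1.0, 2e-6    # assumed placeholder parameters
S = b * c

theta = 0.0
for _ in range(100):
    CL = CL0 - CLtheta * theta           # aerodynamics
    L = q * CL * S                       # lift
    theta_new = L * b**2 / (48 * E * I)  # structures (tip deflection angle)
    if abs(theta_new - theta) < 1e-12:
        break
    theta = theta_new
print(CL, L, theta)
```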
Mathematics Background
A
This appendix briefly reviews various mathematical concepts used
throughout the book.
$$f(x + \Delta x) = a_0 + a_1 \Delta x + a_2 \Delta x^2 + \dots + a_k \Delta x^k + \dots \qquad (A.1)$$

$$a_k = \frac{f^{(k)}(x)}{k!} . \qquad (A.3)$$
Substituting this into the polynomial in Eq. A.1 yields the Taylor series

$$f(x + \Delta x) = \sum_{k=0}^{\infty} f^{(k)}(x) \frac{\Delta x^k}{k!} . \qquad (A.4)$$

$$f(x + \Delta x) = \sum_{k=0}^{m} f^{(k)}(x) \frac{\Delta x^k}{k!} + \mathcal{O}\left(\Delta x^{m+1}\right) , \qquad (A.5)$$
use Taylor series expansions of this function about 𝑥 = 0, we get

$$f(\Delta x) = -4 + \Delta x + 2\Delta x^2 - \frac{1}{6}\Delta x^4 + \frac{1}{180}\Delta x^6 - \dots$$

Four different truncations of this series ($n = 1, 2, 4, 6$) are plotted and compared to the exact function in Fig. A.1.
The Taylor series in multiple dimensions is similar to the single-variable case but more complicated. The first derivative of the function becomes a gradient vector, and the second derivatives become a Hessian matrix. Also, we need to define a direction along which we want to approximate the function because that information is not inherent like it is in a one-dimensional function. The Taylor series expansion in 𝑛 dimensions along a direction 𝑝 can be written as

$$f(x + \alpha p) = f(x) + \alpha \sum_{k=1}^{n} \frac{\partial f}{\partial x_k} p_k + \frac{1}{2}\alpha^2 \sum_{k=1}^{n}\sum_{l=1}^{n} \frac{\partial^2 f}{\partial x_k \partial x_l} p_k p_l + \mathcal{O}\left(\alpha^3\right) , \qquad (A.6)$$

where 𝛼 is a scalar that determines how far to go in the direction 𝑝. In matrix form, we can write

$$f(x + \alpha p) = f(x) + \alpha \nabla f(x)^\intercal p + \frac{1}{2}\alpha^2 p^\intercal H(x)\, p + \mathcal{O}\left(\alpha^3\right) , \qquad (A.7)$$

where 𝐻 is the Hessian matrix.

Fig. A.1 Taylor series expansions for the one-dimensional example. The more terms we consider from the Taylor series, the better the approximation.
$$f(x_1, x_2) = (1 - x_1)^2 + (1 - x_2)^2 + \frac{1}{2}\left(2 x_2 - x_1^2\right)^2 .$$

Performing a Taylor series expansion about 𝑥 = [0, −2], we get

$$f(x + \alpha p) = 18 + \alpha \begin{bmatrix} -2 & -14 \end{bmatrix} p + \frac{1}{2}\alpha^2 p^\intercal \begin{bmatrix} 10 & 0 \\ 0 & 6 \end{bmatrix} p .$$

The original function, the linear approximation, and the quadratic approximation are compared in Fig. A.2.
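This expansion is easy to verify numerically. The short check below evaluates the exact function and the quadratic approximation along an assumed direction 𝑝 and shows that the error decays as 𝒪(𝛼³):

```python
import numpy as np

def f(x):
    return (1 - x[0])**2 + (1 - x[1])**2 + 0.5 * (2 * x[1] - x[0]**2)**2

x0 = np.array([0.0, -2.0])
g = np.array([-2.0, -14.0])              # gradient at x0
H = np.diag([10.0, 6.0])                 # Hessian at x0
p = np.array([1.0, 1.0]) / np.sqrt(2.0)  # arbitrary unit direction

for alpha in [1e-1, 1e-2, 1e-3]:
    quad = 18.0 + alpha * g @ p + 0.5 * alpha**2 * p @ H @ p
    print(alpha, f(x0 + alpha * p) - quad)  # error shrinks ~1000x per step
```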
$$\frac{\mathrm{d}}{\mathrm{d}x}\left(f(g(x))\right) = \frac{\mathrm{d}f}{\mathrm{d}g}\frac{\mathrm{d}g}{\mathrm{d}x} . \qquad (A.8)$$

$$\frac{\mathrm{d}}{\mathrm{d}x}\left(f(g(x))\right) = \frac{\mathrm{d}}{\mathrm{d}g}\left(\sin g\right)\frac{\mathrm{d}}{\mathrm{d}x}\left(x^2\right) = \cos\left(x^2\right)(2x) .$$
$$\frac{\mathrm{d}}{\mathrm{d}x}\left(f(g(x), h(x))\right) = \frac{\partial f}{\partial g}\frac{\mathrm{d}g}{\mathrm{d}x} + \frac{\partial f}{\partial h}\frac{\mathrm{d}h}{\mathrm{d}x} , \qquad (A.9)$$

$$\frac{\mathrm{d}f}{\mathrm{d}x} = \frac{\partial f}{\partial x} + \frac{\partial f}{\partial y}\frac{\mathrm{d}y}{\mathrm{d}x} = 2x + 2y\cos(x) = 2x + 2\sin(x)\cos(x) .$$
Notice that the partial derivative and total derivative are quite different. For this
simple case, we could also find the total derivative by direct substitution and
then using an ordinary one-dimensional derivative. Substituting 𝑦(𝑥) = sin(𝑥)
directly into the original expression for 𝑓 gives
$$\mathrm{d}y = f'(x)\,\mathrm{d}x , \qquad (A.11)$$
$$\mathrm{d}f = f'(x)\,\mathrm{d}x . \qquad (A.12)$$
We can solve Ex. A.5 using differentials as follows. Taking the definition of
each function, we write their differentials,
Two matrices can be multiplied only if their inner dimensions are equal (𝑛 in this case).∗ The remaining products discussed in this section are just special cases of matrix multiplication, but they are common enough that we discuss them separately.

Fig. A.3 Matrix product and resulting size.

∗ In this notation, 𝑚 is the number of rows and 𝑛 is the number of columns.
$$\alpha = u^\intercal v = \begin{bmatrix} u_1 & u_2 & \cdots & u_n \end{bmatrix}\begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{bmatrix} = \sum_{i=1}^{n} u_i v_i . \qquad (A.14)$$

Fig. A.4 Dot (or inner) product of two vectors.

The order of multiplication is irrelevant, and therefore, $u^\intercal v = v^\intercal u$.
$$v = \begin{bmatrix} \text{——}\ A_{1*}\ \text{——} \\ \text{——}\ A_{2*}\ \text{——} \\ \vdots \\ \text{——}\ A_{m*}\ \text{——} \end{bmatrix} u , \qquad (A.20)$$

where $A_{i*}$ are the rows of 𝐴. Alternatively, the product can be written as a linear combination of the columns of 𝐴,

$$v = A_{*1} u_1 + A_{*2} u_2 + \dots + A_{*n} u_n , \qquad (A.21)$$

and $A_{*j}$ are the columns of 𝐴.

Fig. A.6 Matrix-vector product.

We can also multiply by a vector on the left, instead of on the right:

$$v^\intercal = u^\intercal A . \qquad (A.22)$$
Fig. A.7 The four fundamental subspaces of linear algebra. An (𝑚 × 𝑛) matrix 𝐴 maps vectors from 𝑛-space to 𝑚-space. When the vector is in the row space of the matrix, it maps to the column space of 𝐴 (𝑥𝑟 → 𝑏). When the vector is in the nullspace of 𝐴, it maps to zero (𝑥𝑛 → 0). Combining the row space and nullspace of 𝐴, we can obtain any vector in 𝑛-dimensional space (𝑥 = 𝑥𝑟 + 𝑥𝑛), which maps to the column space of 𝐴 (𝑥 → 𝑏).
Because this norm is used so often, we often omit the subscript and just write $\lVert x \rVert$. In this book, we sometimes use the square of the 2-norm, which can be written as the dot product,

$$\lVert x \rVert_2^2 = x^\intercal x . \qquad (A.26)$$
Several norms for matrices exist. There are matrix norms similar to the vector norms that we defined previously. Namely,

$$\lVert A \rVert_1 = \max_{1 \le j \le n} \sum_{i=1}^{n} \left\lvert A_{ij} \right\rvert , \qquad \lVert A \rVert_2 = \left(\lambda_{\max}\left(A^\intercal A\right)\right)^{\frac{1}{2}} , \qquad \lVert A \rVert_\infty = \max_{1 \le i \le n} \sum_{j=1}^{n} \left\lvert A_{ij} \right\rvert , \qquad (A.32)$$
Another matrix norm that is useful but not related to any vector norm is the Frobenius norm, which is defined as the square root of the sum of the squares of its elements, that is,

$$\lVert A \rVert_F = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} A_{ij}^2} . \qquad (A.34)$$
Note that

$$\left(A^\intercal\right)^\intercal = A , \qquad (A + B)^\intercal = A^\intercal + B^\intercal , \qquad (AB)^\intercal = B^\intercal A^\intercal . \qquad (A.38)$$

A symmetric matrix is one where the matrix is equal to its transpose:

$$A^\intercal = A \;\Rightarrow\; A_{ij} = A_{ji} . \qquad (A.39)$$
Not all matrices are invertible. Some common properties for inverses are as follows:

$$\left(A^{-1}\right)^{-1} = A$$

A positive-definite matrix satisfies

$$x^\intercal A x > 0 \qquad (A.42)$$

for all nonzero vectors 𝑥. A positive-semidefinite matrix satisfies

$$x^\intercal A x \ge 0 \qquad (A.44)$$

for all nonzero vectors 𝑥. In this case, the eigenvalues are nonnegative, and there is at least one that is zero. A negative-definite matrix satisfies

$$x^\intercal A x < 0 \qquad (A.45)$$

for all nonzero vectors 𝑥. In this case, all the eigenvalues are negative. An indefinite matrix is one that is neither positive definite nor negative definite. Then, there are at least two nonzero vectors 𝑥 and 𝑦 such that
$$f(x) = a^\intercal x + b = \sum_{i=1}^{n} a_i x_i + b_i , \qquad (A.47)$$

where 𝑎, 𝑥, and 𝑏 are vectors of length 𝑛, and 𝑎𝑖, 𝑥𝑖, and 𝑏𝑖 are the 𝑖th elements of 𝑎, 𝑥, and 𝑏, respectively. If we take the partial derivative of each element with respect to an arbitrary element of 𝑥, namely, 𝑥𝑘, we get

$$\frac{\partial}{\partial x_k}\left[\sum_{i=1}^{n} a_i x_i + b_i\right] = a_k . \qquad (A.48)$$

Thus,

$$\nabla_x\left(a^\intercal x + b\right) = a . \qquad (A.49)$$
Recall the quadratic form presented in Appendix A.3.3; we can combine that with a linear term to form a general quadratic function:

$$f(x) = x^\intercal A x + b^\intercal x + c , \qquad (A.50)$$

We now move the diagonal terms back into the sums to get

$$\frac{\partial f}{\partial x_k} = b_k + \sum_{j=1}^{n}\left(x_j a_{jk} + a_{kj} x_j\right) , \qquad (A.54)$$

which in matrix form is

$$\nabla_x f(x) = A^\intercal x + A x + b . \qquad (A.55)$$

If 𝐴 is symmetric, this simplifies to

$$\nabla_x\left(x^\intercal A x + b^\intercal x + c\right) = 2 A x + b . \qquad (A.56)$$
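A quick finite-difference check of Eq. A.55 (random data generated purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))            # nonsymmetric on purpose
b = rng.standard_normal(4)
f = lambda x: x @ A @ x + b @ x + 1.3
grad = lambda x: (A + A.T) @ x + b         # Eq. A.55

x = rng.standard_normal(4)
h = 1e-6
I = np.eye(4)
fd = np.array([(f(x + h * I[i]) - f(x - h * I[i])) / (2 * h) for i in range(4)])
print(np.max(np.abs(fd - grad(x))))        # agrees to ~1e-9
```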
This is actually a sample mean, which would differ from the pop-
ulation mean (the true mean if you could measure every bar). With
enough samples, the sample mean approaches the population mean. In
this brief review, we do not distinguish between sample and population
statistics.
Another important quantity is the variance or standard deviation. This is a measure of spread, or how far away our samples are from the mean. The unbiased∗ estimate of the variance is

$$\sigma_x^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \mu_x\right)^2 , \qquad (A.60)$$

and the standard deviation is just the square root of the variance. A small variance implies that measurements are clustered tightly around the mean, whereas a large variance means that measurements are spread out far from the mean. The variance can also be written in the following mathematically equivalent but more computationally friendly format:

$$\sigma_x^2 = \frac{1}{n-1}\left(\sum_{i=1}^{n} x_i^2 - n\mu_x^2\right) . \qquad (A.61)$$

∗ Unbiased means that the expected value of the sample variance is the same as the true population variance. If 𝑛 were used in the denominator instead of 𝑛 − 1, then the two quantities would differ by a constant.
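Both forms are easy to compare numerically; the sample values below are made up for illustration:

```python
import numpy as np

x = np.array([4.2, 3.9, 4.0, 4.1, 3.8, 4.3])        # made-up measurements
n, mu = len(x), x.mean()
var_sum = np.sum((x - mu) ** 2) / (n - 1)           # Eq. A.60
var_alt = (np.sum(x ** 2) - n * mu ** 2) / (n - 1)  # Eq. A.61
print(var_sum, var_alt, np.var(x, ddof=1))          # all three agree
```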
The total integral of the PDF must be 1 because it contains all possible outcomes (100 percent):

$$\int_{-\infty}^{\infty} p(x)\,\mathrm{d}x = 1 . \qquad (A.63)$$

From the PDF, we can also measure various statistics, such as the mean value:

$$\mu_x = \mathrm{E}(x) = \int_{-\infty}^{\infty} x\, p(x)\,\mathrm{d}x . \qquad (A.64)$$
The mean and variance are the first and second moments of the
distribution. In general, a distribution may require an infinite number
of moments to describe it fully. Higher-order moments are generally
mean centered and are normalized by the standard deviation so that
the 𝑛th normalized moment is computed as follows:
$$\mathrm{E}\left[\left(\frac{x - \mu_x}{\sigma}\right)^n\right] . \qquad (A.68)$$
The third moment is called skewness, and the fourth is called kurtosis,
although these higher-order moments are less commonly used.
The cumulative distribution function (CDF) is the cumulative integral of the PDF, defined as follows:

$$P(x) = \int_{-\infty}^{x} p(t)\,\mathrm{d}t . \qquad (A.69)$$
The capital 𝑃 denotes the CDF, and the lowercase 𝑝 denotes the PDF.
As an example, the PDF and corresponding CDF for the axial strength
are shown in Fig. A.10. The CDF always approaches 1 as 𝑥 → ∞.
Fig. A.10 Comparison between PDF and CDF for a simple example: the PDF (left) and CDF (right) for the axial strength of a rod.
For a normal distribution, the mean and variance are visible in the function, but these quantities are defined for any distribution. Figure A.11 shows two examples (𝜇 = 1, 𝜎 = 0.5 and 𝜇 = 3, 𝜎 = 1.0).

[Fig. A.12: Popular probability distributions besides the normal distribution, including the lognormal and exponential distributions.]
$$\mathrm{corr}(x, y) = \frac{\mathrm{cov}(x, y)}{\sigma_x \sigma_y} . \qquad (A.73)$$
Linear Solvers
B
In Section 3.6, we present an overview of solution methods for discretized systems of equations, followed by an introduction to Newton-based methods for solving nonlinear equations. Here, we review the solvers for linear systems required to solve for each step of Newton-based methods.∗

∗ Trefethen and Bau III220 provide a much more detailed explanation of linear solvers.
220. Trefethen and Bau III, Numerical Linear Algebra, 1997.

B.1 Systems of Linear Equations
𝐴𝑢 = 𝑏 , (B.1)
𝑟(𝑢) = 𝐴𝑢 − 𝑏 = 0. (B.2)
B.2 Conditioning
time, starting with the first one and progressing from left to right. This is done by subtracting multiples of each row from subsequent rows.

Fig. B.1 𝐿𝑈 factorization.
Inputs:
𝐴: Nonsingular square matrix
𝑏: A vector
Outputs:
𝑢: Solution to 𝐴𝑢 = 𝑏

Perform forward substitution to solve 𝐿𝑦 = 𝑏 for 𝑦:

$$y_1 = \frac{b_1}{L_{11}} , \qquad y_i = \frac{1}{L_{ii}}\left(b_i - \sum_{j=1}^{i-1} L_{ij} y_j\right) \quad \text{for } i = 2, \dots, n$$

Perform backward substitution to solve 𝑈𝑢 = 𝑦 for 𝑢:

$$u_n = \frac{y_n}{U_{nn}} , \qquad u_i = \frac{1}{U_{ii}}\left(y_i - \sum_{j=i+1}^{n} U_{ij} u_j\right) \quad \text{for } i = n-1, \dots, 1$$
Although direct methods are usually more efficient and robust, iterative
methods have several advantages:
𝑢 𝑘+1 = 𝑢 𝑘 + 𝑀 −1 (𝑏 − 𝐴𝑢 𝑘 ) . (B.10)
𝑟 (𝑢 𝑘 ) = 𝑏 − 𝐴𝑢 𝑘 , (B.11)
we can write
𝑢 𝑘+1 = 𝑢 𝑘 + 𝑀 −1 𝑟 (𝑢 𝑘 ) . (B.12)
The splitting matrix 𝑀 is fixed and constructed so that it is easy to
invert. The closer 𝑀 −1 is to the inverse of 𝐴, the better the iterations
work. We now introduce three stationary methods corresponding to
three different splitting matrices.
The Jacobi method consists of setting 𝑀 to be a diagonal matrix 𝐷,
where the diagonal entries are those of 𝐴. Then,
𝑢 𝑘+1 = 𝑢 𝑘 + 𝐷 −1 𝑟 (𝑢 𝑘 ) . (B.13)
$$u_i^{(k+1)} = \frac{1}{A_{ii}}\left(b_i - \sum_{j=1,\, j \ne i}^{n_u} A_{ij}\, u_j^{(k)}\right) , \quad i = 1, \dots, n_u . \qquad (B.14)$$
Using this method, each component in 𝑢 𝑘+1 is independent of each
other at a given iteration; they only depend on the previous iteration
values, 𝑢 𝑘 , and can therefore be done in parallel.
The Gauss–Seidel method is obtained by setting 𝑀 to be the lower
triangular portion of 𝐴 and can be written as
$$u_i^{(k+1)} = \frac{1}{A_{ii}}\left(b_i - \sum_{j<i} A_{ij}\, u_j^{(k+1)} - \sum_{j>i} A_{ij}\, u_j^{(k)}\right) , \quad i = 1, \dots, n_u . \qquad (B.16)$$
Unlike the Jacobi iterations, a Gauss–Seidel iteration cannot be per-
formed in parallel because of the terms where 𝑗 < 𝑖, which require
the latest values. Instead, the states must be updated sequentially.
However, the advantage of Gauss–Seidel is that it generally converges
faster than Jacobi iterations.
$$u_i^{(k+1)} = (1 - \omega)\, u_i^{(k)} + \frac{\omega}{A_{ii}}\left(b_i - \sum_{j<i} A_{ij}\, u_j^{(k+1)} - \sum_{j>i} A_{ij}\, u_j^{(k)}\right) , \quad i = 1, \dots, n_u . \qquad (B.18)$$

With the correct value of 𝜔, SOR converges faster than Gauss–Seidel.
Example B.1 Iterative methods applied to a simple linear system.

Suppose we have the following linear system of two equations:

$$\begin{bmatrix} 2 & -1 \\ -2 & 3 \end{bmatrix}\begin{bmatrix} u_1 \\ u_2 \end{bmatrix} = \begin{bmatrix} 0 \\ 1 \end{bmatrix} .$$

This corresponds to the two lines shown in Fig. B.3, where the solution is at their intersection.

Applying the Jacobi iteration (Eq. B.14),

$$u_1^{(k+1)} = \frac{1}{2} u_2^{(k)} , \qquad u_2^{(k+1)} = \frac{1}{3}\left(1 + 2 u_1^{(k)}\right) .$$

Starting with the guess 𝑢(0) = (2, 1), we get the iterations shown in Fig. B.3. The Gauss–Seidel iteration (Eq. B.16) is similar, where the only change is that the second equation uses the latest state from the first one:

$$u_1^{(k+1)} = \frac{1}{2} u_2^{(k)} , \qquad u_2^{(k+1)} = \frac{1}{3}\left(1 + 2 u_1^{(k+1)}\right) .$$

SOR converges even faster for the right values of 𝜔. The result shown here is for 𝜔 = 1.2.

Fig. B.3 Jacobi, Gauss–Seidel, and SOR iterations.
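The two iterations in this example translate directly into code. The sketch below applies the general updates (Eqs. B.13 and B.16) to the same 2 × 2 system:

```python
import numpy as np

A = np.array([[2.0, -1.0], [-2.0, 3.0]])
b = np.array([0.0, 1.0])

def jacobi(u, iters=50):
    for _ in range(iters):
        u = u + (b - A @ u) / np.diag(A)   # M = D (Eq. B.13)
    return u

def gauss_seidel(u, iters=50):
    for _ in range(iters):
        for i in range(len(b)):            # each row uses the latest values
            u[i] = (b[i] - A[i, :i] @ u[:i] - A[i, i + 1:] @ u[i + 1:]) / A[i, i]
    return u

print(jacobi(np.array([2.0, 1.0])))        # -> [0.25, 0.5]
print(gauss_seidel(np.array([2.0, 1.0])))  # -> [0.25, 0.5]
```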
∇ 𝑓 (𝑢) = 𝐴𝑢 − 𝑏 . (B.21)
Thus, the gradient of the quadratic is the residual of the linear system,
𝑟 𝑘 = ∇ 𝑓 (𝑢 𝑘 ) . (B.22)
$$= \frac{1}{2}\left(\sum_{k=0}^{n-1} \alpha_k p_k\right)^\intercal A \left(\sum_{k=0}^{n-1} \alpha_k p_k\right) - b^\intercal\left(\sum_{k=0}^{n-1} \alpha_k p_k\right) \qquad (B.24)$$
$$= \frac{1}{2}\sum_{k=0}^{n-1}\sum_{j=0}^{n-1} \alpha_k \alpha_j\, p_k^\intercal A p_j - \sum_{k=0}^{n-1} \alpha_k\, b^\intercal p_k .$$

$$\frac{1}{2}\sum_{k=0}^{n-1}\sum_{j=0}^{n-1} \alpha_k \alpha_j\, p_k^\intercal A p_j = \frac{1}{2}\sum_{k=0}^{n-1} \alpha_k^2\, p_k^\intercal A p_k . \qquad (B.26)$$
Because each term in this sum involves only one direction 𝑝 𝑘 , we have
reduced the original problem to a series of one-dimensional quadratic
functions that can be minimized one at a time. Each one-dimensional
problem corresponds to minimizing the quadratic with respect to the
step length 𝛼 𝑘 . Differentiating each term and setting it to zero yields
the following:
$$\alpha_k\, p_k^\intercal A p_k - b^\intercal p_k = 0 \;\Rightarrow\; \alpha_k = \frac{b^\intercal p_k}{p_k^\intercal A p_k} . \qquad (B.28)$$
Now, the question is: How do we find this set of directions? There are many sets of directions that satisfy conjugacy. For example, the eigenvectors of 𝐴 satisfy Eq. B.25.∗ However, it is costly to compute the eigenvectors of a matrix. We want a more convenient way to compute a sequence of conjugate vectors.

∗ Suppose we have two eigenvectors, $v_k$ and $v_j$. Then $v_k^\intercal A v_j = v_k^\intercal (\lambda_j v_j) = \lambda_j v_k^\intercal v_j$. This dot product is zero because the eigenvectors of a symmetric matrix are mutually orthogonal.
The conjugate gradient method sets the first direction to the steepest-
descent direction of the quadratic at the first point. Because the gradient
of the function is the residual of the linear system (Eq. B.22), this first
direction is obtained from the residual at the starting point,
𝑝1 = −𝑟 (𝑢0 ) . (B.29)
$$p_{k+1}^\intercal A p_k = 0 . \qquad (B.31)$$
Substituting the new direction $p_{k+1}$ with the update (Eq. B.30), we get

$$\beta_k = \frac{r_{k+1}^\intercal A p_k}{p_k^\intercal A p_k} . \qquad (B.33)$$
By setting this derivative to zero, we can get the step size that minimizes the quadratic along the line to be

$$\alpha_k = -\frac{r_k^\intercal p_k}{p_k^\intercal A p_k} . \qquad (B.35)$$

$$r_k^\intercal p_k = r_k^\intercal\left(-r_k + \beta_k p_{k-1}\right) = -r_k^\intercal r_k + \beta_k\, r_k^\intercal p_{k-1} = -r_k^\intercal r_k . \qquad (B.36)$$

Here we have used the property of the conjugate directions stating that the residual vector is orthogonal to all previous conjugate directions, so that $r_k^\intercal p_i = 0$ for $i = 0, 1, \dots, k-1$.† Thus, we can now write

$$\alpha_k = \frac{r_k^\intercal r_k}{p_k^\intercal A p_k} . \qquad (B.37)$$

† For a proof of this property, see Theorem 5.2 in Nocedal and Wright.79
79. Nocedal and Wright, Numerical Optimization, 2006.
The numerator of the expression for 𝛽 (Eq. B.33) can also be written in terms of the residual alone. Using the expression for the residual (Eq. B.19) and taking the difference between two subsequent residuals, we get

$$r_{k+1}^\intercal A p_k = \frac{1}{\alpha_k}\left(r_{k+1}^\intercal r_{k+1}\right) , \qquad (B.39)$$

where we have used the property that the residual at any conjugate gradient iteration is orthogonal to the residuals at all previous iterations, so $r_{k+1}^\intercal r_k = 0$.‡ Now, using this new numerator and using Eq. B.37 to write the denominator as a function of the previous residual, we obtain

$$\beta_k = \frac{r_k^\intercal r_k}{r_{k-1}^\intercal r_{k-1}} . \qquad (B.40)$$

‡ For a proof of this property, see Theorem 5.3 in Nocedal and Wright.79
We use this result in the nonlinear conjugate gradient method for
function minimization in Section 4.4.2.
The linear conjugate gradient steps are listed in Alg. B.2. The advantage of this method relative to the direct method is that 𝐴 does not need to be stored or given explicitly. Instead, we only need to provide a function that computes matrix-vector products with 𝐴. These products are required to compute residuals (𝑟 = 𝐴𝑢 − 𝑏) and the 𝐴𝑝 term in the computation of 𝛼. Assuming a well-conditioned problem with good enough arithmetic precision, the algorithm should converge to the solution in 𝑛 steps.§

§ Because the linear conjugate gradient method converges in 𝑛 steps, it was originally thought of as a direct method. It was initially dismissed in favor of more efficient direct methods, such as LU factorization. However, the conjugate gradient method was later reframed as an effective iterative method to obtain approximate solutions to large problems.

Algorithm B.2 Linear conjugate gradient

Inputs:
𝑢(0): Starting point
𝜏: Convergence tolerance
Outputs:
𝑢∗: Solution of linear system
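A compact implementation of Alg. B.2 can be written so that 𝐴 enters only through a user-supplied matrix-vector product; the small test system below is an arbitrary choice:

```python
import numpy as np

def conjugate_gradient(matvec, b, u0, tol=1e-10):
    u = u0.copy()
    r = matvec(u) - b              # residual (Eq. B.19)
    p = -r                         # first direction: steepest descent
    rr = r @ r
    for _ in range(len(b)):
        Ap = matvec(p)
        alpha = rr / (p @ Ap)      # step length (Eq. B.37)
        u += alpha * p
        r += alpha * Ap
        rr_new = r @ r
        if np.sqrt(rr_new) < tol:
            break
        p = -r + (rr_new / rr) * p  # direction update with beta (Eq. B.40)
        rr = rr_new
    return u

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
u = conjugate_gradient(lambda v: A @ v, b, np.zeros(2))
print(np.allclose(A @ u, b))       # True (converges in n = 2 steps)
```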
residual (GMRES) method, do not have such restrictions on the matrix.220 Compared with the stationary methods of Appendix B.4.1, Krylov methods have the advantage that they use information gathered throughout the iterations. Instead of using a fixed splitting matrix, Krylov methods effectively vary the splitting so that 𝑀 is changed at each iteration according to some criteria that use the information gathered so far. For this reason, Krylov methods are usually more efficient than stationary methods.

Like stationary iteration methods, Krylov methods do not require forming or storing 𝐴. Instead, the iterations require only matrix-vector products of the form 𝐴𝑣, where 𝑣 is some vector given by the Krylov algorithm. The matrix-vector product could be given by a black box, as shown in Fig. B.2.
For the linear conjugate gradient method (Appendix B.4.2), we
found conjugate directions and minimized the residual of the linear
system in a sequence of these directions.
Krylov subspace methods minimize the residual in a space,
𝑥 0 + 𝒦𝑘 , (B.41)
we do not need an explicit form for 𝑀. The matrix resulting from the product 𝑀⁻¹𝐴 should have a smaller condition number so that the new linear system is better conditioned.

In the extreme case where 𝑀 = 𝐴, that means we have computed the inverse of 𝐴, and we can get 𝑥 explicitly. In another extreme, 𝑀 could be a diagonal matrix with the diagonal elements of 𝐴, which would scale 𝐴 such that the diagonal elements are 1.‖

‖ The splitting matrix 𝑀 we used in the equation for the stationary methods (Appendix B.4.1) is effectively a preconditioner. An 𝑀 using the diagonal entries of 𝐴 corresponds to the Jacobi method (Eq. B.13).

Krylov subspace solvers require three main components: (1) an orthogonal basis for the Krylov subspace, (2) an optimal property that determines the solution within the subspace, and (3) an effective preconditioner. Various Krylov subspace methods are possible, depending on the choice for each of these three components. One of the most popular Krylov subspace methods is GMRES.221∗∗

221. Saad and Schultz, GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems, 1986.
∗∗ GMRES and other Krylov subspace methods are available in most programming languages, including C/C++, Fortran, Julia, MATLAB, and Python.
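For instance, SciPy's GMRES accepts a LinearOperator, so 𝐴 can literally be a black box that only provides matrix-vector products. The tridiagonal operator below is an arbitrary test case:

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, gmres

n = 100

def matvec(v):  # black-box product with a nonsymmetric tridiagonal matrix
    out = 2.0 * v
    out[:-1] -= v[1:]        # superdiagonal -1
    out[1:] -= 0.5 * v[:-1]  # subdiagonal -0.5
    return out

A = LinearOperator((n, n), matvec=matvec)
b = np.ones(n)
u, info = gmres(A, b)        # info == 0 indicates convergence
print(info, np.max(np.abs(matvec(u) - b)))
```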
Quasi-Newton Methods
C
C.1 Broyden’s Method
$$s_k = u_{k+1} - u_k , \qquad (C.2)$$
$$y_k = r_{k+1} - r_k , \qquad (C.3)$$
$$\tilde{J} = \tilde{J}_k + v v^\intercal , \qquad (C.5)$$
yields

$$v v^\intercal = \frac{\left(y_k - \tilde{J}_k s_k\right) s_k^\intercal}{s_k^\intercal s_k} . \qquad (C.7)$$

Substituting this result into the update (Eq. C.5), we get the Jacobian approximation update,

$$\tilde{J}_{k+1} = \tilde{J}_k + \frac{\left(y_k - \tilde{J}_k s_k\right) s_k^\intercal}{s_k^\intercal s_k} , \qquad (C.8)$$

where

$$y_k = r_{k+1} - r_k \qquad (C.9)$$

is the difference in the function values (as opposed to the difference in the gradients used in optimization).

This update can be inverted using the Sherman–Morrison–Woodbury formula (Appendix C.3) to get the more useful update on the inverse of the Jacobian,

$$\tilde{J}_{k+1}^{-1} = \tilde{J}_k^{-1} + \frac{\left(s_k - \tilde{J}_k^{-1} y_k\right) y_k^\intercal}{y_k^\intercal y_k} . \qquad (C.10)$$

We can start with $\tilde{J}_0^{-1} = I$. Similar to the Newton step (Eq. 3.30), the step in Broyden's method is given by solving the linear system. Because the inverse is provided explicitly, we can just perform the multiplication,

$$\Delta u_k = -\tilde{J}_k^{-1} r_k . \qquad (C.11)$$
𝑢 𝑘+1 = 𝑢 𝑘 + Δ𝑢 𝑘 . (C.12)
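Putting Eqs. C.9 to C.12 together yields a short implementation; the two-equation residual used for testing is an arbitrary choice with a root near (0.84, 0.34):

```python
import numpy as np

def broyden(r, u0, iters=50, tol=1e-10):
    """Broyden's method with the inverse update (Eq. C.10), J0^{-1} = I."""
    u = u0.copy()
    Jinv = np.eye(len(u0))
    r_old = r(u)
    for _ in range(iters):
        du = -Jinv @ r_old                 # step (Eq. C.11)
        u = u + du                         # update (Eq. C.12)
        r_new = r(u)
        if np.linalg.norm(r_new) < tol:
            break
        y = r_new - r_old                  # Eq. C.9
        Jinv += np.outer(du - Jinv @ y, y) / (y @ y)  # Eq. C.10
        r_old = r_new
    return u

r = lambda u: np.array([u[0] + 0.5 * np.sin(u[1]) - 1.0,
                        u[1] - 0.5 * np.cos(u[0])])
print(broyden(r, np.zeros(2)))
```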
We need the inverse version of the secant equation (Eq. 4.80), which is

$$\tilde{V}_{k+1} y_k = s_k . \qquad (C.15)$$

$$\tilde{V}_k y_k + \alpha\, s_k s_k^\intercal y_k + \beta\, \tilde{V}_k y_k y_k^\intercal \tilde{V}_k y_k = s_k . \qquad (C.16)$$

$$\tilde{V}_{k+1} = \tilde{V}_k + \frac{1}{y_k^\intercal s_k}\, s_k s_k^\intercal - \frac{1}{y_k^\intercal \tilde{V}_k y_k}\, \tilde{V}_k y_k y_k^\intercal \tilde{V}_k . \qquad (C.17)$$
C.2.2 BFGS
The BFGS update was informally derived in Section 4.4.4. As discussed
previously, obtaining an approximation of the Hessian inverse is a more
efficient way to get the quasi-Newton step.
Similar to DFP, BFGS was originally formally derived by analytically
solving an optimization problem. However, instead of solving the
optimization problem of Eq. C.13, we solve a similar problem using the
Hessian inverse approximation instead. This problem can be stated as
$$\begin{aligned}
\text{minimize} \quad & \left\lVert \tilde{V} - \tilde{V}_k \right\rVert \\
\text{subject to} \quad & \tilde{V} y_k = s_k \qquad (C.20) \\
& \tilde{V} = \tilde{V}^\intercal ,
\end{aligned}$$
$$\tilde{V} = \tilde{V}_k + \alpha v v^\intercal , \qquad (C.23)$$

where we only need one self outer product to produce a rank 1 update (as opposed to two).

Substituting the rank 1 update (Eq. C.23) into the secant equation, we obtain

$$\tilde{V}_k y_k + \alpha v v^\intercal y_k = s_k . \qquad (C.24)$$

Rearranging yields

$$\left(\alpha v^\intercal y_k\right) v = s_k - \tilde{V}_k y_k . \qquad (C.25)$$

Thus, we have to make sure that 𝑣 is in the direction of $s_k - \tilde{V}_k y_k$. The scalar 𝛼 must be such that the scaling of the vectors on both sides of the equation match each other. We define a normalized 𝑣 in the desired direction,

$$v = \frac{s_k - \tilde{V}_k y_k}{\left\lVert s_k - \tilde{V}_k y_k \right\rVert_2} . \qquad (C.26)$$

To find the correct value for 𝛼, we substitute Eq. C.26 into Eq. C.25 to get

$$s_k - \tilde{V}_k y_k = \alpha\, \frac{s_k^\intercal y_k - y_k^\intercal \tilde{V}_k y_k}{\left\lVert s_k - \tilde{V}_k y_k \right\rVert_2^2}\left(s_k - \tilde{V}_k y_k\right) . \qquad (C.27)$$

Solving for 𝛼 yields

$$\alpha = \frac{\left\lVert s_k - \tilde{V}_k y_k \right\rVert_2^2}{s_k^\intercal y_k - y_k^\intercal \tilde{V}_k y_k} . \qquad (C.28)$$

Substituting Eqs. C.26 and C.28 into Eq. C.23, we get the SR1 update

$$\tilde{V} = \tilde{V}_k + \frac{1}{s_k^\intercal y_k - y_k^\intercal \tilde{V}_k y_k}\left(s_k - \tilde{V}_k y_k\right)\left(s_k - \tilde{V}_k y_k\right)^\intercal . \qquad (C.29)$$
$$\alpha_{\text{BFGS}} = 0 , \qquad \beta_{\text{BFGS}} = -\frac{1}{y_k^\intercal s_k} , \qquad \gamma_{\text{BFGS}} = \frac{1}{y_k^\intercal s_k} + \frac{y_k^\intercal \tilde{V}_k y_k}{\left(y_k^\intercal s_k\right)^2} . \qquad (C.33)$$
The formal derivations of the DFP and BFGS methods use the Sherman–
Morrison–Woodbury formula (also known as the Woodbury matrix
identity). Suppose that the inverse of a matrix is known, and then the
matrix is perturbed. The Sherman–Morrison–Woodbury formula gives
the inverse of the perturbed matrix without having to re-invert the
perturbed matrix. We used this formula in Section 4.4.4 to derive the
quasi-Newton update.
One possible perturbation is a rank 1 update of the form
𝐴ˆ = 𝐴 + 𝑢𝑣 | , (C.34)
$$\hat{A}^{-1} = A^{-1} - \frac{A^{-1} u v^\intercal A^{-1}}{1 + v^\intercal A^{-1} u} . \qquad (C.35)$$
This formula can be verified by multiplying Eq. C.34 and Eq. C.35,
which yields the identity matrix.
This formula can be generalized for higher-rank updates as follows:
𝐴ˆ = 𝐴 + 𝑈𝑉 | , (C.36)
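The rank 1 case (Eq. C.35) is easy to verify numerically (random, well-conditioned test data assumed):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 5)) + 5.0 * np.eye(5)  # keep it well conditioned
u = rng.standard_normal(5)
v = rng.standard_normal(5)

Ainv = np.linalg.inv(A)
direct = np.linalg.inv(A + np.outer(u, v))
smw = Ainv - (Ainv @ np.outer(u, v) @ Ainv) / (1.0 + v @ Ainv @ u)  # Eq. C.35
print(np.max(np.abs(direct - smw)))  # ~1e-15
```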
gradient-based optimizer:

$$f(x_1, x_2) = x_1^2 + x_2^2 - \beta x_1 x_2 , \qquad (D.1)$$

ing valley. The large difference between the maximum and minimum curvatures, and the fact that the principal curvature directions change along the valley, makes it a good test for quasi-Newton methods.223 The Rosenbrock function can be extended to 𝑛 dimensions by defining the sum

$$f(x) = \sum_{i=1}^{n-1}\left[100\left(x_{i+1} - x_i^2\right)^2 + \left(1 - x_i\right)^2\right] . \qquad (D.3)$$

Fig. D.2 Rosenbrock function.
223. Rosenbrock, An automatic method for finding the greatest or least value of a function, 1960.
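The 𝑛-dimensional Rosenbrock function is a one-liner, and a quasi-Newton method such as BFGS recovers the minimum at 𝑥 = (1, …, 1); the dimension and starting point below are arbitrary:

```python
import numpy as np
from scipy.optimize import minimize

def rosenbrock(x):  # Eq. D.3
    return np.sum(100.0 * (x[1:] - x[:-1] ** 2) ** 2 + (1.0 - x[:-1]) ** 2)

res = minimize(rosenbrock, np.zeros(6), method="BFGS")
print(res.x)  # -> approximately all ones
```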
$$f(x_1, x_2) = (1 - x_1)^2 + (1 - x_2)^2 + \frac{1}{2}\left(2 x_2 - x_1^2\right)^2 . \qquad (D.4)$$

from different points. There are saddle points, maxima, and minima, with one global minimum. This function, shown in Fig. D.4 along with

Global minimum: $f(x^*) = -13.5320$ at $x^* = (2.6732, -0.6759)$.
Local minima: $f(x) = -9.7770$ at $x = (-0.4495, 2.2928)$; $f(x) = -9.0312$ at $x = (2.4239, 1.9219)$.
$$f(x) = -\sum_{i=1}^{4} \alpha_i \exp\left(-\sum_{j=1}^{3} A_{ij}\left(x_j - P_{ij}\right)^2\right) , \qquad (D.6)$$

where

$$\alpha = [1.0,\ 1.2,\ 3.0,\ 3.2] ,$$

$$A = \begin{bmatrix} 3 & 10 & 30 \\ 0.1 & 10 & 35 \\ 3 & 10 & 30 \\ 0.1 & 10 & 35 \end{bmatrix} , \qquad P = 10^{-4}\begin{bmatrix} 3689 & 1170 & 2673 \\ 4699 & 4387 & 7470 \\ 1091 & 8732 & 5547 \\ 381 & 5743 & 8828 \end{bmatrix} . \qquad (D.7)$$

A slice of the function, at the optimal value of $x_1 = 0.1148$, is shown in Fig. D.5.

Fig. D.5 An 𝑥2–𝑥3 slice of the Hartmann function at 𝑥1 = 0.1148.

Global minimum: $f(x^*) = -3.86278$ at $x^* = (0.11480, 0.55566, 0.85254)$.
𝐿 = 𝑞𝐶 𝐿 𝑆 , (D.10)
𝑆 = 𝑏𝑐 . (D.12)
𝐷 𝑓 = 𝑘𝐶 𝑓 𝑞𝑆wet . (D.13)
In this equation, the Reynolds number is based on the wing chord and
is defined as follows:
$$Re = \frac{\rho v c}{\mu} , \qquad (D.15)$$
where 𝜌 is the air density, and 𝜇 is the air dynamic viscosity. The form
factor, 𝑘, accounts for the effects of pressure drag. The wetted area,
𝑆wet , is the area over which the skin friction drag acts, which is a little
more than twice the planform area. We will use
$$D_i = \frac{L^2}{q \pi b^2 e} , \qquad (D.17)$$
where 𝑒 is the Oswald efficiency factor. The total drag is the sum of
induced and viscous drag, 𝐷 = 𝐷𝑖 + 𝐷 𝑓 .
Our objective function, the power required by the motor for level
flight, is
𝐷𝑣
𝑃(𝑏, 𝑐) = , (D.18)
𝜂
where 𝜂 is the propulsive efficiency. We assume that our electric
propellers have a Gaussian efficiency curve (real efficiency curves are
not Gaussian, but this is simple and will be sufficient for our purposes):
$$\eta = \eta_{\max} \exp\left(\frac{-(v - \bar{v})^2}{2\sigma^2}\right) . \qquad (D.19)$$
This is the same problem that was presented in Ex. 1.2 of Chapter 1. The optimal wingspan and chord are 𝑏 = 25.48 m and 𝑐 = 0.50 m, respectively, given the parameters. The contour and the optimal wing shape are shown in Fig. D.6. Because there are no structural considerations in this problem, the resulting wing has a higher wing aspect ratio than is realistic. This
$$\Delta t = \int_{x_i}^{x_i + \mathrm{d}x} \frac{\sqrt{\mathrm{d}x^2 + \mathrm{d}y^2}}{\sqrt{2g\left(h - y(x) - \mu_k x\right)}} = \int_{x_i}^{x_i + \mathrm{d}x} \frac{\sqrt{1 + \left(\frac{\mathrm{d}y}{\mathrm{d}x}\right)^2}\,\mathrm{d}x}{\sqrt{2g\left(h - y(x) - \mu_k x\right)}} . \qquad (D.22)$$

To discretize this problem, we can divide the path into linear segments. As an example, Fig. D.7 shows the wire divided into four linear segments (five nodes) as an approximation of a continuous wire. The slope $s_i = (\Delta y / \Delta x)_i$ is then a constant along a given segment, and $y(x) = y_i + s_i (x - x_i)$. Making these substitutions results in

$$\Delta t_i = \frac{\sqrt{1 + s_i^2}}{\sqrt{2g}} \int_{x_i}^{x_{i+1}} \frac{\mathrm{d}x}{\sqrt{h - y_i - s_i(x - x_i) - \mu_k x}} . \qquad (D.23)$$

Fig. D.7 A discretized representation of the brachistochrone problem.

Performing the integration and simplifying (many steps omitted here) results in

$$\Delta t_i = \sqrt{\frac{2}{g}}\, \frac{\sqrt{\Delta x_i^2 + \Delta y_i^2}}{\sqrt{h - y_{i+1} - \mu_k x_{i+1}} + \sqrt{h - y_i - \mu_k x_i}} , \qquad (D.24)$$

and the total time is obtained by summing over the segments:

$$T = \sum_{i=1}^{n-1} \Delta t_i . \qquad (D.25)$$
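Eqs. D.24 and D.25 translate directly into an objective for the inner 𝑦 coordinates. The sketch below sets up the frictionless case; the node count, endpoints, and bounds are illustrative choices:

```python
import numpy as np
from scipy.optimize import minimize

g, mu_k, h = 9.81, 0.0, 1.0               # frictionless case
n = 30
x = np.linspace(0.0, 1.0, n)              # fixed horizontal positions
y_start, y_end = 1.0, 0.0                 # fixed endpoints

def total_time(y_inner):                  # Eqs. D.24 and D.25
    y = np.concatenate(([y_start], y_inner, [y_end]))
    dx, dy = np.diff(x), np.diff(y)
    denom = (np.sqrt(h - y[1:] - mu_k * x[1:])
             + np.sqrt(h - y[:-1] - mu_k * x[:-1]))
    return np.sqrt(2.0 / g) * np.sum(np.sqrt(dx**2 + dy**2) / denom)

y0 = np.linspace(y_start, y_end, n)[1:-1]  # straight-line initial guess
res = minimize(total_time, y0, bounds=[(-1.0, 0.99)] * (n - 2))
print(res.fun)  # approaches the analytic cycloid time as n grows
```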
The design variables are the 𝑛−2 positions of the path parameterized
by 𝑦 𝑖 . The endpoints must be fixed; otherwise, the problem is ill-defined,
which is why there are 𝑛 − 2 design variables instead of 𝑛. Note that
𝑥 is a parameter, meaning that it is fixed. We could space the 𝑥 𝑖 any
reasonable way and still find the same underlying optimal curve, but
The analytic solution for the case with friction is more difficult to
derive, but the analytic solution for the frictionless case (𝜇 𝑘 = 0) with
our starting and ending points is as follows:
$$x = a(\theta - \sin\theta) , \qquad y = -a(1 - \cos\theta) + 1 , \qquad (D.27)$$
Fig. D.8 Two-spring system with no applied force (top) and with applied force (bottom).

$$E_p(x_1, x_2) = \frac{1}{2} k_1 (\Delta l_1)^2 + \frac{1}{2} k_2 (\Delta l_2)^2 - m g x_2 , \qquad (D.28)$$

where Δ𝑙1 and Δ𝑙2 are the changes in length for the two springs. With respect to the original lengths, and displacements 𝑥1 and 𝑥2 as shown,
where the partial derivatives of Δ𝑙1 and Δ𝑙2 with respect to 𝑥1 are

$$\frac{\partial(\Delta l_1)}{\partial x_1} = \frac{l_1 + x_1}{\sqrt{(l_1 + x_1)^2 + x_2^2}} , \qquad \frac{\partial(\Delta l_2)}{\partial x_1} = -\frac{l_2 - x_1}{\sqrt{(l_2 - x_1)^2 + x_2^2}} . \qquad (D.31)$$

Fig. D.9 Total potential energy contours for two-spring system.

By letting $\mathcal{L}_1 = \sqrt{(l_1 + x_1)^2 + x_2^2}$ and $\mathcal{L}_2 = \sqrt{(l_2 - x_1)^2 + x_2^2}$, the partial derivative of $E_p$ with respect to $x_1$ can be written as
$$\begin{aligned}
a_1 &= 75.196 & a_2 &= -3.8112 \\
a_3 &= 0.12694 & a_4 &= -2.0567 \times 10^{-3} \\
a_5 &= 1.0345 \times 10^{-5} & a_6 &= -6.8306 \\
a_7 &= 0.030234 & a_8 &= -1.28134 \times 10^{-3} \\
a_9 &= 3.5256 \times 10^{-5} & a_{10} &= -2.266 \times 10^{-7} \\
a_{11} &= 0.25645 & a_{12} &= -3.4604 \times 10^{-3} \\
a_{13} &= 1.3514 \times 10^{-5} & a_{14} &= -28.106 \\
a_{15} &= -5.2375 \times 10^{-6} & a_{16} &= -6.3 \times 10^{-8} \\
a_{17} &= 7.0 \times 10^{-10} & a_{18} &= 3.4054 \times 10^{-4} \\
a_{19} &= -1.6638 \times 10^{-6} & a_{20} &= -2.8673 \\
a_{21} &= 0.0005
\end{aligned}$$
$$\begin{aligned}
f(x_1, x_2) ={}& a_1 + a_2 x_1 + a_3 y_4 + a_4 y_4 x_1 + a_5 y_4^2 + a_6 x_2 + a_7 y_1 \\
& + a_8 x_1 y_1 + a_9 y_1 y_4 + a_{10} y_2 y_4 + a_{11} y_3 + a_{12} x_2 y_3 + a_{13} y_3^2 \\
& + \frac{a_{14}}{x_2 + 1} + a_{15} y_3 y_4 + a_{16} y_1 y_4 x_2 + a_{17} y_1 y_3 y_4 + a_{18} x_1 y_3 \\
& + a_{19} y_1 y_3 + a_{20} \exp\left(a_{21} y_1\right) . \qquad (D.35)
\end{aligned}$$
is bounded from [0, 80] in both dimensions, in which case the global minimum is in a corner of the domain. We prefer the minimum not to be in the corner and so set the bounds to [0, 65] in both dimensions. The contour of this function is plotted in Fig. D.10.
The stress in the truss element can be computed from the equation 𝜎 = 𝑆𝑒𝑑, where 𝜎 is a scalar, 𝑑 is the same vector as before, and the element 𝑆𝑒 matrix (really a row vector because stress is one-dimensional for truss elements) is

$$S_e = \frac{E}{L}\begin{bmatrix} -c & -s & c & s \end{bmatrix} . \qquad (D.39)$$
The global structure (an assembly of multiple finite elements) has the
same equations, 𝐾𝑑 = 𝑞 and 𝜎 = 𝑆𝑑, but now 𝑑 contains displacements
for all of the nodes in the structure, 𝑑 = [𝑥1 , 𝑥2 , . . . , 𝑥 𝑛 ]. If we have 𝑛
nodes and 𝑚 elements, then 𝑞 and 𝑑 are 2𝑛-vectors, 𝐾 is a (2𝑛 × 2𝑛)
matrix, 𝑆 is an (𝑚 × 2𝑛) matrix, and 𝜎 is an 𝑚-vector. The elemental
stiffness and stress matrices are first computed and then assembled into
the global matrices. This is straightforward because the displacements
and forces of the individual elements add linearly.
After we assemble the global matrices, we must remove any degrees
of freedom that are structurally rigid (already known to have zero
displacement). Otherwise, the problem is ill-defined, and the stiffness
matrix will be ill-conditioned.
𝐾𝑑 = 𝑞 . (D.40)
1 Wu, N., Kenway, G., Mader, C. A., Jasa, J., and Martins, J. R. R. A., cited on pp. 15, 199
“PyOptSparse: A Python framework for large-scale constrained
nonlinear optimization of sparse systems,” Journal of Open Source
Software, Vol. 5, No. 54, October 2020, p. 2564.
doi: 10.21105/joss.02564
2 Lyu, Z., Kenway, G. K. W., and Martins, J. R. R. A., “Aerodynamic cited on p. 20
Shape Optimization Investigations of the Common Research Model
Wing Benchmark,” AIAA Journal, Vol. 53, No. 4, April 2015, pp. 968–
985.
doi: 10.2514/1.J053318
3 He, X., Li, J., Mader, C. A., Yildirim, A., and Martins, J. R. R. A., cited on p. 20
“Robust aerodynamic shape optimization—From a circle to an
airfoil,” Aerospace Science and Technology, Vol. 87, April 2019, pp. 48–
61.
doi: 10.1016/j.ast.2019.01.051
4 Betts, J. T., “Survey of numerical methods for trajectory optimiza- cited on p. 26
tion,” Journal of Guidance, Control, and Dynamics, Vol. 21, No. 2, 1998,
pp. 193–207.
doi: 10.2514/2.4231
5 Bryson, A. E. and Ho, Y. C., Applied Optimal Control; Optimization, cited on p. 26
Estimation, and Control. Waltham, MA: Blaisdell Publishing, 1969.
6 Bertsekas, D. P., Dynamic Programming and Optimal Control. Belmont, cited on p. 26
MA: Athena Scientific, 1995.
7 Kepler, J., Nova stereometria doliorum vinariorum (New Solid Geometry cited on p. 34
of Wine Barrels). Linz, Austria: Johannes Planck, 1615.
8 Ferguson, T. S., “Who solved the secretary problem?” Statistical cited on p. 34
Science, Vol. 4, No. 3, August 1989, pp. 282–289.
doi: 10.1214/ss/1177012493
9 Fermat, P. de, Methodus ad disquirendam maximam et minimam cited on p. 35
(Method for the Study of Maxima and Minima). 1636, translated by
Jason Ross.
44 Hwang, J. T. and Martins, J. R. R. A., “A computational architecture cited on pp. 41, 494
for coupling heterogeneous numerical models and computing
coupled derivatives,” ACM Transactions on Mathematical Software,
Vol. 44, No. 4, June 2018, Article 37.
doi: 10.1145/3182393
45 Wright, M. H., “The interior-point revolution in optimization: cited on p. 41
History, recent developments, and lasting consequences,” Bulletin
of the American Mathematical Society, Vol. 42, 2005, pp. 39–56.
doi: 10.1007/978-1-4613-3279-4_23
46 Grant, M., Boyd, S., and Ye, Y., “Disciplined convex programming,” cited on pp. 41, 428
Global Optimization—From Theory to Implementation, Liberti, L. and
Maculan, N., Eds. Boston, MA: Springer, 2006, pp. 155–210.
doi: 10.1007/0-387-30528-9_7
47 Wengert, R. E., “A simple automatic derivative evaluation program,” cited on p. 41
Communications of the ACM, Vol. 7, No. 8, August 1964, pp. 463–464,
issn: 0001-0782.
doi: 10.1145/355586.364791
48 Speelpenning, B., “Compiling fast partial derivatives of functions cited on p. 41
given by algorithms,” PhD dissertation, University of Illinois at
Urbana–Champaign, Champaign, IL, January 1980.
doi: 10.2172/5254402
49 Squire, W. and Trapp, G., “Using complex variables to estimate cited on pp. 42, 231
derivatives of real functions,” SIAM Review, Vol. 40, No. 1, 1998,
pp. 110–112, issn: 0036-1445 (print), 1095-7200 (electronic).
doi: 10.1137/S003614459631241X
50 Martins, J. R. R. A., Sturdza, P., and Alonso, J. J., “The complex- cited on pp. 42, 232, 234, 236
step derivative approximation,” ACM Transactions on Mathematical
Software, Vol. 29, No. 3, September 2003, pp. 245–262.
doi: 10.1145/838250.838251
51 Torczon, V., “On the convergence of pattern search algorithms,” cited on p. 42
SIAM Journal on Optimization, Vol. 7, No. 1, February 1997, pp. 1–25.
doi: 10.1137/S1052623493250780
52 Jones, D., Perttunen, C., and Stuckman, B., “Lipschitzian optimiza- cited on pp. 42, 296
tion without the Lipschitz constant,” Journal of Optimization Theory
and Application, Vol. 79, No. 1, October 1993, pp. 157–181.
doi: 10.1007/BF00941892
53 Jones, D. R. and Martins, J. R. R. A., “The DIRECT algorithm—25 cited on pp. 42, 296
years later,” Journal of Global Optimization, Vol. 79, March 2021,
pp. 521–566.
doi: 10.1007/s10898-020-00952-6
66 Hodges, A., Alan Turing: The Enigma. Princeton, NJ: Princeton cited on p. 44
University Press, 2014.
isbn: 9780691164724
67 Lipsitz, G., How Racism Takes Place. Philadelphia, PA: Temple cited on p. 44
University Press, 2011.
68 Rothstein, R., The Color of Law: A Forgotten History of How Our cited on p. 44
Government Segregated America. New York, NY: Liveright Publishing,
2017.
69 King, L. J., “More than slaves: Black founders, Benjamin Banneker, cited on p. 44
and critical intellectual agency,” Social Studies Research & Practice
(Board of Trustees of the University of Alabama), Vol. 9, No. 3, 2014.
70 Shetterly, M. L., Hidden Figures: The American Dream and the Untold cited on p. 45
Story of the Black Women Who Helped Win the Space Race. New York,
NY: William Morrow and Company, 2016.
71 Box, G. E. P., “Science and statistics,” Journal of the American Statistical cited on p. 48
Association, Vol. 71, No. 356, 1976, pp. 791–799, issn: 01621459.
doi: 10.2307/2286841
72 Wilson, G., Aruliah, D. A., Brown, C. T., Hong, N. P. C., Davis, M., cited on p. 58
Guy, R. T., Haddock, S. H. D., Huff, K. D., Mitchell, I. M., Plumbley,
M. D., Waugh, B., White, E. P., and Wilson, P., “Best practices for
scientific computing,” PLoS Biology, Vol. 12, No. 1, 2014, e1001745.
doi: 10.1371/journal.pbio.1001745
73 Grotker, T., Holtmann, U., Keding, H., and Wloka, M., The De- cited on p. 59
veloper’s Guide to Debugging, 2nd ed. New York, NY: Springer,
2012.
74 Ascher, U. M. and Greif, C., A First Course in Numerical Methods. cited on p. 60
Philadelphia, PA: SIAM, 2011.
75 Saad, Y., Iterative Methods for Sparse Linear Systems, 2nd ed. Philadel- cited on pp. 61, 564
phia, PA: SIAM, 2003.
76 Higgins, T. J., “A note on the history of mixed partial derivatives,” cited on p. 83
Scripta Mathematica, Vol. 7, 1940, pp. 59–62.
77 Hager, W. W. and Zhang, H., “A new conjugate gradient method cited on p. 97
with guaranteed descent and an efficient line search,” SIAM Journal
on Optimization, Vol. 16, No. 1, January 2005, pp. 170–192, issn:
1095-7189.
doi: 10.1137/030601880
78 Moré, J. J. and Thuente, D. J., “Line search algorithms with guaran- cited on p. 101
teed sufficient decrease,” ACM Transactions on Mathematical Software
(TOMS), Vol. 20, No. 3, 1994, pp. 286–307.
doi: 10.1145/192115.192132
79 Nocedal, J. and Wright, S. J., Numerical Optimization, 2nd ed. Berlin: cited on pp. 101, 116, 140, 141, 189,
Springer, 2006. 208, 562, 563
doi: 10.1007/978-0-387-40065-5
80 Broyden, C. G., “The convergence of a class of double-rank min- cited on p. 125
imization algorithms 1. General considerations,” IMA Journal of
Applied Mathematics, Vol. 6, No. 1, 1970, pp. 76–90, issn: 1464-3634.
doi: 10.1093/imamat/6.1.76
81 Fletcher, R., “A new approach to variable metric algorithms,” The cited on p. 125
Computer Journal, Vol. 13, No. 3, March 1970, pp. 317–322, issn:
1460-2067.
doi: 10.1093/comjnl/13.3.317
82 Goldfarb, D., “A family of variable-metric methods derived by cited on p. 125
variational means,” Mathematics of Computation, Vol. 24, No. 109,
January 1970, pp. 23–23, issn: 0025-5718.
doi: 10.1090/s0025-5718-1970-0258249-6
83 Shanno, D. F., “Conditioning of quasi-Newton methods for func- cited on p. 125
tion minimization,” Mathematics of Computation, Vol. 24, No. 111,
September 1970, pp. 647–647, issn: 0025-5718.
doi: 10.1090/s0025-5718-1970-0274029-x
84 Conn, A. R., Gould, N. I. M., and Toint, P. L., Trust Region Methods. cited on pp. 139, 140, 141, 142
Philadelphia, PA: SIAM, 2000.
isbn: 0898714605
85 Steihaug, T., “The conjugate gradient method and trust regions in cited on p. 140
large scale optimization,” SIAM Journal on Numerical Analysis, Vol.
20, No. 3, June 1983, pp. 626–637, issn: 1095-7170.
doi: 10.1137/0720042
86 Boyd, S. P. and Vandenberghe, L., Convex Optimization. Cambridge, cited on pp. 155, 423
UK: Cambridge University Press, March 2004.
isbn: 0521833787
87 Strang, G., Linear Algebra and its Applications, 4th ed. Boston, MA: cited on pp. 155, 542
Cengage Learning, 2006.
isbn: 0030105676
88 Dax, A., “Classroom note: An elementary proof of Farkas’ lemma,” cited on p. 165
SIAM Review, Vol. 39, No. 3, 1997, pp. 503–507.
doi: 10.1137/S0036144594295502
89 Gill, P. E., Murray, W., Saunders, M. A., and Wright, M. H., “Some cited on p. 182
theoretical properties of an augmented Lagrangian merit function,”
SOL 86-6R, Systems Optimization Laboratory, September 1986.
90 Di Pillo, G. and Grippo, L., “A new augmented Lagrangian function cited on p. 182
for inequality constraints in nonlinear programming problems,”
Journal of Optimization Theory and Applications, Vol. 36, No. 4, 1982,
pp. 495–519
doi: 10.1007/BF00940544
91 Birgin, E. G., Castillo, R. A., and MartÍnez, J. M., “Numerical cited on p. 182
comparison of augmented Lagrangian algorithms for nonconvex
problems,” Computational Optimization and Applications, Vol. 31, No.
1, 2005, pp. 31–55
doi: 10.1007/s10589-005-1066-7
92 Rockafellar, R. T., “The multiplier method of Hestenes and Powell cited on p. 182
applied to convex programming,” Journal of Optimization Theory
and Applications, Vol. 12, No. 6, 1973, pp. 555–562
doi: 10.1007/BF00934777
93 Murray, W., “Analytical expressions for the eigenvalues and eigen- cited on p. 186
vectors of the Hessian matrices of barrier and penalty functions,”
Journal of Optimization Theory and Applications, Vol. 7, No. 3, March
1971, pp. 189–196.
doi: 10.1007/bf00932477
94 Forsgren, A., Gill, P. E., and Wright, M. H., “Interior methods for cited on p. 186
nonlinear optimization,” SIAM Review, Vol. 44, No. 4, January 2002,
pp. 525–597.
doi: 10.1137/s0036144502414942
95 Gill, P. E. and Wong, E., “Sequential quadratic programming cited on p. 189
methods,” Mixed Integer Nonlinear Programming, Lee, J. and Leyffer,
S., Eds., ser. The IMA Volumes in Mathematics and Its Applications.
New York, NY: Springer, 2012, Vol. 154.
doi: 10.1007/978-1-4614-1927-3_6
96 Gill, P. E., Murray, W., and Saunders, M. A., “SNOPT: An SQP cited on pp. 189, 196, 199
algorithm for large-scale constrained optimization,” SIAM Review,
Vol. 47, No. 1, 2005, pp. 99–131.
doi: 10.1137/S0036144504446096
97 Fletcher, R. and Leyffer, S., “Nonlinear programming without a cited on p. 197
penalty function,” Mathematical Programming, Vol. 91, No. 2, January
2002, pp. 239–269.
doi: 10.1007/s101070100244
98 Benson, H. Y., Vanderbei, R. J., and Shanno, D. F., “Interior-point cited on p. 197
methods for nonconvex nonlinear programming: Filter methods
and merit functions,” Computational Optimization and Applications,
Vol. 23, No. 2, 2002, pp. 257–272.
doi: 10.1023/a:1020533003783
99 Fletcher, R., Leyffer, S., and Toint, P., “A brief history of filter cited on p. 197
methods,” ANL/MCS-P1372-0906, Argonne National Laboratory,
September 2006.
100 Fletcher, R., Practical Methods of Optimization, 2nd ed. Hoboken, NJ: cited on p. 199
Wiley, 1987.
101 Liu, D. C. and Nocedal, J., “On the limited memory BFGS method cited on p. 199
for large scale optimization,” Mathematical Programming, Vol. 45,
No. 1–3, August 1989, pp. 503–528.
doi: 10.1007/bf01589116
102 Byrd, R. H., Nocedal, J., and Waltz, R. A., “Knitro: An integrated cited on pp. 199, 207
package for nonlinear optimization,” Large-Scale Nonlinear Opti-
mization, Di Pillo, G. and Roma, M., Eds. Boston, MA: Springer US,
2006, pp. 35–59.
doi: 10.1007/0-387-30065-1_4
103 Kraft, D., “A software package for sequential quadratic program- cited on p. 199
ming,” DFVLR-FB 88-28, DLR German Aerospace Center–Institute
for Flight Mechanics, Köln, Germany, 1988.
104 Wächter, A. and Biegler, L. T., “On the implementation of an cited on p. 207
interior-point filter line-search algorithm for large-scale nonlinear
programming,” Mathematical Programming, Vol. 106, No. 1, April
2005, pp. 25–57.
doi: 10.1007/s10107-004-0559-y
105 Byrd, R. H., Hribar, M. E., and Nocedal, J., “An interior point cited on p. 207
algorithm for large-scale nonlinear programming,” SIAM Journal
on Optimization, Vol. 9, No. 4, January 1999, pp. 877–900.
doi: 10.1137/s1052623497325107
106 Wächter, A. and Biegler, L. T., “On the implementation of a primal- cited on p. 207
dual interior point filter line search algorithm for large-scale non-
linear programming,” Mathematical Programming, Vol. 106, No. 1,
2006, pp. 25–57.
107 Gill, P. E., Saunders, M. A., and Wong, E., “On the performance cited on p. 208
of SQP methods for nonlinear optimization,” Modeling and Opti-
mization: Theory and Applications, Defourny, B. and Terlaky, T., Eds.
New York, NY: Springer, 2015, Vol. 147, pp. 95–123.
doi: 10.1007/978-3-319-23699-5_5
108 Kreisselmeier, G. and Steinhauser, R., “Systematic control design by cited on p. 211
optimizing a vector performance index,” IFAC Proceedings Volumes,
Vol. 12, No. 7, September 1979, pp. 113–117, issn: 1474-6670.
doi: 10.1016/s1474-6670(17)65584-8
109 Duysinx, P. and Bendsøe, M. P., “Topology optimization of contin- cited on p. 212
uum structures with local stress constraints,” International Journal
for Numerical Methods in Engineering, Vol. 43, 1998, pp. 1453–1478.
doi: 10.1002/(SICI)1097-0207(19981230)43:8<1453::AID-NME480>3.0.CO;2-2
110 Kennedy, G. J. and Hicken, J. E., “Improved constraint-aggregation cited on p. 212
methods,” Computer Methods in Applied Mechanics and Engineering,
Vol. 289, 2015, pp. 332–354, issn: 0045-7825.
doi: 10.1016/j.cma.2015.02.017
111 Hoerner, S. F., Fluid-Dynamic Drag. Bakersfield, CA: Hoerner Fluid cited on pp. 217, 218
Dynamics, 1965.
112 Lyness, J. N. and Moler, C. B., “Numerical differentiation of analytic cited on p. 231
functions,” SIAM Journal on Numerical Analysis, Vol. 4, No. 2, 1967,
pp. 202–210, issn: 0036-1429 (print), 1095-7170 (electronic).
doi: 10.1137/0704019
113 Lantoine, G., Russell, R. P., and Dargent, T., “Using multicomplex cited on p. 232
variables for automatic computation of high-order derivatives,”
ACM Transactions on Mathematical Software, Vol. 38, No. 3, April
2012, pp. 1–21, issn: 0098-3500.
doi: 10.1145/2168773.2168774
114 Fike, J. A. and Alonso, J. J., “Automatic differentiation through the cited on p. 232
use of hyper-dual numbers for second derivatives,” Recent Advances
in Algorithmic Differentiation, Forth, S., Hovland, P., Phipps, E., Utke,
J., and Walther, A., Eds. Berlin: Springer, 2012, pp. 163–173, isbn:
978-3-642-30023-3.
doi: 10.1007/978-3-642-30023-3_15
115 Griewank, A., Evaluating Derivatives. Philadelphia, PA: SIAM, 2000. cited on pp. 236, 246, 248
doi: 10.1137/1.9780898717761
116 Naumann, U., The Art of Differentiating Computer Programs—An cited on p. 236
Introduction to Algorithmic Differentiation. Philadelphia, PA: SIAM,
2011.
117 Utke, J., Naumann, U., Fagan, M., Tallent, N., Strout, M., Heimbach, cited on p. 249
P., Hill, C., and Wunsch, C., “OpenAD/F: A modular open-source
tool for automatic differentiation of Fortran codes,” ACM Trans-
actions on Mathematical Software, Vol. 34, No. 4, July 2008, issn:
0098-3500.
doi: 10.1145/1377596.1377598
118 Hascoet, L. and Pascual, V., “The Tapenade automatic differentia- cited on p. 249
tion tool: Principles, model, and specification,” ACM Transactions
on Mathematical Software, Vol. 39, No. 3, May 2013, 20:1–20:43, issn:
0098-3500.
doi: 10.1145/2450153.2450158
119 Griewank, A., Juedes, D., and Utke, J., “Algorithm 755: ADOL-C: cited on p. 249
A package for the automatic differentiation of algorithms written
in C/C++,” ACM Transactions on Mathematical Software, Vol. 22, No.
2, June 1996, pp. 131–167, issn: 0098-3500.
doi: 10.1145/229473.229474
120 Wiltschko, A. B., Merriënboer, B. van, and Moldovan, D., “Tangent: cited on p. 249
Automatic differentiation using source code transformation in
Python,” arXiv:1711.02712, 2017.
Url: https://arxiv.org/abs/1711.02712.
121 Bradbury, J., Frostig, R., Hawkins, P., Johnson, M. J., Leary, C., cited on p. 249
Maclaurin, D., Necula, G., Paszke, A., VanderPlas, J., Wanderman-
Milne, S., and Zhang, Q., “JAX: Composable transformations of
Python+NumPy programs,” 2018.
Url: http://github.com/google/jax.
122 Revels, J., Lubin, M., and Papamarkou, T., “Forward-mode auto- cited on p. 249
matic differentiation in Julia,” arXiv:1607.07892, July 2016.
Url: https://arxiv.org/abs/1607.07892.
123 Neidinger, R. D., “Introduction to automatic differentiation and cited on p. 249
MATLAB object-oriented programming,” SIAM Review, Vol. 52,
No. 3, January 2010, pp. 545–563.
doi: 10.1137/080743627
124 Betancourt, M., “A geometric theory of higher-order automatic cited on p. 249
differentiation,” arXiv:1812.11592 [stat.CO], December 2018.
Url: https://arxiv.org/abs/1812.11592.
125 Giles, M., “An extended collection of matrix derivative results for cited on pp. 250, 251
forward and reverse mode algorithmic differentiation,” Report NA-08-01,
Oxford University Computing Laboratory, Oxford, UK, January 2008.
Url: https://people.maths.ox.ac.uk/gilesm/files/NA-08-01.pdf.
126 Peter, J. E. V. and Dwight, R. P., “Numerical sensitivity analysis for cited on p. 251
aerodynamic optimization: A survey of approaches,” Computers
and Fluids, Vol. 39, No. 3, March 2010, pp. 373–391.
doi: 10.1016/j.compfluid.2009.09.013
127 Martins, J. R. R. A., “Perspectives on aerodynamic design optimiza- cited on pp. 256, 441
tion,” Proceedings of the AIAA SciTech Forum. American Institute of
Aeronautics and Astronautics, January 2020.
doi: 10.2514/6.2020-0043
128 Lambe, A. B., Martins, J. R. R. A., and Kennedy, G. J., “An evaluation cited on p. 259
of constraint aggregation strategies for wing box mass minimiza-
tion,” Structural and Multidisciplinary Optimization, Vol. 55, No. 1,
January 2017, pp. 257–277.
doi: 10.1007/s00158-016-1495-1
129 Kenway, G. K. W., Mader, C. A., He, P., and Martins, J. R. R. A., cited on p. 260
“Effective adjoint approaches for computational fluid dynamics,”
Progress in Aerospace Sciences, Vol. 110, October 2019, p. 100542.
doi: 10.1016/j.paerosci.2019.05.002
130 Curtis, A. R., Powell, M. J. D., and Reid, J. K., “On the estimation cited on p. 262
of sparse Jacobian matrices,” IMA Journal of Applied Mathematics,
Vol. 13, No. 1, February 1974, pp. 117–119, issn: 1464-3634.
doi: 10.1093/imamat/13.1.117
131 Gebremedhin, A. H., Manne, F., and Pothen, A., “What color is cited on p. 263
your Jacobian? Graph coloring for computing derivatives,” SIAM
Review, Vol. 47, No. 4, January 2005, pp. 629–705, issn: 1095-7200.
doi: 10.1137/s0036144504444711
132 Gray, J. S., Hwang, J. T., Martins, J. R. R. A., Moore, K. T., and cited on pp. 263, 490, 497, 504, 529
Naylor, B. A., “OpenMDAO: An open-source framework for multi-
disciplinary design, analysis, and optimization,” Structural and
Multidisciplinary Optimization, Vol. 59, No. 4, April 2019, pp. 1075–
1104.
doi: 10.1007/s00158-019-02211-z
133 Ning, A., “Using blade element momentum methods with gradient- cited on p. 264
based design optimization,” Structural and Multidisciplinary Opti-
mization, May 2021.
doi: 10.1007/s00158-021-02883-6
134 Martins, J. R. R. A. and Hwang, J. T., “Review and unification of cited on p. 265
methods for computing derivatives of multidisciplinary compu-
tational models,” AIAA Journal, Vol. 51, No. 11, November 2013,
pp. 2582–2599.
doi: 10.2514/1.J052184
135 Yu, Y., Lyu, Z., Xu, Z., and Martins, J. R. R. A., “On the influence of cited on p. 281
optimization algorithm and starting design on wing aerodynamic
shape optimization,” Aerospace Science and Technology, Vol. 75, April
2018, pp. 183–199.
146 Barricelli, N., “Esempi numerici di processi di evoluzione [Numerical cited on p. 304
examples of evolution processes],” Methodos, 1954, pp. 45–68.
147 De Jong, K. A., “An analysis of the behavior of a class of genetic cited on p. 304
adaptive systems,” PhD dissertation, University of Michigan, Ann
Arbor, MI, 1975.
148 Deb, K., Pratap, A., Agarwal, S., and Meyarivan, T., “A fast and cited on pp. 306, 362
elitist multiobjective genetic algorithm: NSGA-II,” IEEE Transactions
on Evolutionary Computation, Vol. 6, No. 2, April 2002, pp. 182–197.
doi: 10.1109/4235.996017
149 Deb, K., Multi-Objective Optimization Using Evolutionary Algorithms. cited on p. 311
Hoboken, NJ: John Wiley & Sons, 2001.
isbn: 047187339X
150 Eberhart, R. and Kennedy, J., “A new optimizer using particle cited on p. 314
swarm theory,” Proceedings of the Sixth International Symposium
on Micro Machine and Human Science. Institute of Electrical and
Electronics Engineers, 1995, pp. 39–43.
doi: 10.1109/MHS.1995.494215
151 Zhan, Z.-H., Zhang, J., Li, Y., and Chung, H. S.-H., “Adaptive cited on p. 315
particle swarm optimization,” IEEE Transactions on Systems, Man,
and Cybernetics, Part B (Cybernetics), Vol. 39, No. 6, April 2009,
pp. 1362–1381.
doi: 10.1109/TSMCB.2009.2015956
152 Gutin, G., Yeo, A., and Zverovich, A., “Traveling salesman should cited on p. 336
not be greedy: Domination analysis of greedy-type heuristics for
the TSP,” Discrete Applied Mathematics, Vol. 117, No. 1–3, March
2002, pp. 81–86, issn: 0166-218X.
doi: 10.1016/s0166-218x(01)00195-0
153 Kirkpatrick, S., Gelatt, C. D., and Vecchi, M. P., “Optimization cited on p. 345
by simulated annealing,” Science, Vol. 220, No. 4598, May 1983,
pp. 671–680, issn: 1095-9203.
doi: 10.1126/science.220.4598.671
154 Černý, V., “Thermodynamical approach to the traveling salesman cited on p. 345
problem: An efficient simulation algorithm,” Journal of Optimization
Theory and Applications, Vol. 45, No. 1, January 1985, pp. 41–51, issn:
1573-2878.
doi: 10.1007/bf00940812
155 Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., cited on p. 345
and Teller, E., “Equation of state calculations by fast computing
machines,” Journal of Chemical Physics, Vol. 21, No. 6, June 1953,
pp. 1087–1092.
doi: 10.1063/1.1699114
156 Andresen, B. and Gordon, J. M., “Constant thermodynamic speed cited on p. 346
for minimizing entropy production in thermodynamic processes
and simulated annealing,” Physical Review E, Vol. 50, No. 6, Decem-
ber 1994, pp. 4346–4351, issn: 1095-3787.
doi: 10.1103/physreve.50.4346
157 Lin, S., “Computer solutions of the traveling salesman problem,” cited on p. 347
Bell System Technical Journal, Vol. 44, No. 10, December 1965,
pp. 2245–2269, issn: 0005-8580.
doi: 10.1002/j.1538-7305.1965.tb04146.x
158 Press, W. H., Teukolsky, S. A., Vetterling, W. T., and Flannery, B. P., cited on p. 349
Numerical Recipes in
C: The Art of Scientific Computing. Cambridge, UK: Cambridge
University Press, 1992.
isbn: 0521431085
159 Haimes, Y. Y., Lasdon, L. S., and Wismer, D. A., “On a bicriterion cited on p. 358
formulation of the problems of integrated system identification
and system optimization,” IEEE Transactions on Systems, Man, and
Cybernetics, Vol. SMC-1, No. 3, July 1971, pp. 296–297.
doi: 10.1109/tsmc.1971.4308298
160 Das, I. and Dennis, J. E., “Normal-boundary intersection: A new cited on p. 358
method for generating the Pareto surface in nonlinear multicriteria
optimization problems,” SIAM Journal on Optimization, Vol. 8, No.
3, August 1998, pp. 631–657.
doi: 10.1137/s1052623496307510
161 Ismail-Yahaya, A. and Messac, A., “Effective generation of the cited on p. 360
Pareto frontier using the normal constraint method,” Proceedings
of the 40th AIAA Aerospace Sciences Meeting & Exhibit. American
Institute of Aeronautics and Astronautics, January 2002.
doi: 10.2514/6.2002-178
162 Messac, A. and Mattson, C. A., “Normal constraint method with cited on p. 360
guarantee of even representation of complete Pareto frontier,”
AIAA Journal, Vol. 42, No. 10, October 2004, pp. 2101–2111.
doi: 10.2514/1.8977
163 Hancock, B. J. and Mattson, C. A., “The smart normal constraint cited on p. 360
method for directly generating a smart Pareto set,” Structural and
Multidisciplinary Optimization, Vol. 48, No. 4, June 2013, pp. 763–775.
doi: 10.1007/s00158-013-0925-6
164 Schaffer, J. D., “Some experiments in machine learning using cited on p. 361
vector evaluated genetic algorithms,” PhD dissertation, Vanderbilt
University, Nashville, TN, 1984.
175 Han, Z.-H., Zhang, Y., Song, C.-X., and Zhang, K.-S., “Weighted cited on p. 406
gradient-enhanced kriging for high-dimensional surrogate model-
ing and design optimization,” AIAA Journal, Vol. 55, No. 12, August
2017, pp. 4330–4346.
doi: 10.2514/1.J055842
176 Forrester, A., Sobester, A., and Keane, A., Engineering Design via cited on p. 406
Surrogate Modelling: A Practical Guide. Hoboken, NJ: John Wiley &
Sons, 2008.
isbn: 0470770791
177 Ruder, S., “An overview of gradient descent optimization algo- cited on p. 411
rithms,” arXiv:1609.04747, 2016.
Url: http://arxiv.org/abs/1609.04747.
178 Goh, G., “Why momentum really works,” Distill, 2017. cited on p. 412
doi: 10.23915/distill.00006
179 Diamond, S. and Boyd, S., “Convex optimization with abstract lin- cited on p. 421
ear operators,” Proceedings of the 2015 IEEE International Conference
on Computer Vision (ICCV). Institute of Electrical and Electronics
Engineers, December 2015.
doi: 10.1109/iccv.2015.84
180 Lobo, M. S., Vandenberghe, L., Boyd, S., and Lebret, H., “Applica- cited on p. 423
tions of second-order cone programming,” Linear Algebra and Its
Applications, Vol. 284, No. 1–3, November 1998, pp. 193–228.
doi: 10.1016/s0024-3795(98)10032-0
181 Parikh, N. and Boyd, S., “Block splitting for distributed optimiza- cited on p. 423
tion,” Mathematical Programming Computation, Vol. 6, No. 1, October
2013, pp. 77–102.
doi: 10.1007/s12532-013-0061-8
182 Vandenberghe, L. and Boyd, S., “Semidefinite programming,” cited on p. 423
SIAM Review, Vol. 38, No. 1, March 1996, pp. 49–95.
doi: 10.1137/1038003
183 Vandenberghe, L. and Boyd, S., “Applications of semidefinite cited on p. 423
programming,” Applied Numerical Mathematics, Vol. 29, No. 3, March
1999, pp. 283–299.
doi: 10.1016/s0168-9274(98)00098-1
184 Boyd, S., Kim, S.-J., Vandenberghe, L., and Hassibi, A., “A tutorial cited on p. 434
on geometric programming,” Optimization and Engineering, Vol. 8,
No. 1, April 2007, pp. 67–127.
doi: 10.1007/s11081-007-9001-7
185 Hoburg, W., Kirschen, P., and Abbeel, P., “Data fitting with geometric- cited on p. 434
programming-compatible softmax functions,” Optimization and
Engineering, Vol. 17, No. 4, August 2016, pp. 897–918.
doi: 10.1007/s11081-016-9332-3
186 Kirschen, P. G., York, M. A., Ozturk, B., and Hoburg, W. W., “Ap- cited on p. 435
plication of signomial programming to aircraft design,” Journal of
Aircraft, Vol. 55, No. 3, May 2018, pp. 965–987.
doi: 10.2514/1.c034378
187 York, M. A., Hoburg, W. W., and Drela, M., “Turbofan engine cited on p. 435
sizing and tradeoff analysis via signomial programming,” Journal
of Aircraft, Vol. 55, No. 3, May 2018, pp. 988–1003.
doi: 10.2514/1.c034463
188 Stanley, A. P. and Ning, A., “Coupled wind turbine design and cited on p. 441
layout optimization with non-homogeneous wind turbines,” Wind
Energy Science, Vol. 4, No. 1, January 2019, pp. 99–114.
doi: 10.5194/wes-4-99-2019
189 Gagakuma, B., Stanley, A. P. J., and Ning, A., “Reducing wind cited on p. 441
farm power variance from wind direction using wind farm layout
optimization,” Wind Engineering, January 2021.
doi: 10.1177/0309524X20988288
190 Padrón, A. S., Thomas, J., Stanley, A. P. J., Alonso, J. J., and Ning, cited on p. 441
A., “Polynomial chaos to efficiently compute the annual energy
production in wind farm layout optimization,” Wind Energy Science,
Vol. 4, May 2019, pp. 211–231.
doi: 10.5194/wes-4-211-2019
191 Cacuci, D., Sensitivity & Uncertainty Analysis. Boca Raton, FL: cited on p. 447
Chapman and Hall/CRC, May 2003, Vol. 1.
doi: 10.1201/9780203498798
192 Parkinson, A., Sorensen, C., and Pourhassan, N., “A general ap- cited on p. 447
proach for robust optimal design,” Journal of Mechanical Design, Vol.
115, No. 1, 1993, pp. 74–80.
doi: 10.1115/1.2919328
193 Golub, G. H. and Welsch, J. H., “Calculation of Gauss quadrature cited on p. 452
rules,” Mathematics of Computation, Vol. 23, No. 106, 1969, pp. 221–
230, issn: 0025-5718, 1088-6842.
doi: 10.1090/S0025-5718-69-99647-1
194 Wilhelmsen, D. R., “Optimal quadrature for periodic analytic cited on p. 454
functions,” SIAM Journal on Numerical Analysis, Vol. 15, No. 2, 1978,
pp. 291–296, issn: 0036-1429.
doi: 10.1137/0715020
195 Trefethen, L. N. and Weideman, J. A. C., “The exponentially conver- cited on p. 454
gent trapezoidal rule,” SIAM Review, Vol. 56, No. 3, 2014, pp. 385–
458, issn: 0036-1445, 1095-7200.
doi: 10.1137/130932132
196 Johnson, S. G., “Notes on the convergence of trapezoidal-rule cited on p. 454
quadrature,” March 2010.
Url: http://math.mit.edu/~stevenj/trapezoidal.pdf.
197 Smolyak, S. A., “Quadrature and interpolation formulas for tensor cited on p. 455
products of certain classes of functions,” Proceedings of the USSR
Academy of Sciences, Vol. 148, No. 5, 1963, pp. 1042–1045.
doi: 10.3103/S1066369X10030084
198 Wiener, N., “The homogeneous chaos,” American Journal of Mathe- cited on p. 459
matics, Vol. 60, No. 4, October 1938, pp. 897–936.
doi: 10.2307/2371268
199 Eldred, M., Webster, C., and Constantine, P., “Evaluation of non- cited on p. 462
intrusive approaches for Wiener–Askey generalized polynomial
chaos,” Proceedings of the 49th AIAA Structures, Structural Dynamics,
and Materials Conference. American Institute of Aeronautics and
Astronautics, April 2008.
doi: 10.2514/6.2008-1892
200 Adams, B. M., Bohnhoff, W. J., Dalbey, K. R., Ebeida, M. S., Eddy, J. P., cited on p. 463
Eldred, M. S., Hooper, R. W., Hough, P. D., Hu, K. T., Jakeman, J. D.,
Khalil, M., Maupin, K. A., Monschke, J. A., Ridgway, E. M., Rushdi,
A. A., Seidl, D. T., Stephens, J. A., Swiler, L. P., and Winokur, J. G.,
“Dakota, a multilevel parallel object-oriented framework for design
optimization, parameter estimation, uncertainty quantification,
and sensitivity analysis: Version 6.14 user’s manual,” May 2021.
Url: https://dakota.sandia.gov/content/manuals.
201 Feinberg, J. and Langtangen, H. P., “Chaospy: An open source tool cited on p. 463
for designing methods of uncertainty quantification,” Journal of
Computational Science, Vol. 11, November 2015, pp. 46–57.
doi: 10.1016/j.jocs.2015.08.008
202 Jasa, J. P., Hwang, J. T., and Martins, J. R. R. A., “Open-source cited on pp. 477, 494
coupled aerostructural optimization using Python,” Structural and
Multidisciplinary Optimization, Vol. 57, No. 4, April 2018, pp. 1815–
1827.
doi: 10.1007/s00158-018-1912-8
203 Cuthill, E. and McKee, J., “Reducing the bandwidth of sparse cited on p. 482
symmetric matrices,” Proceedings of the 1969 24th National Confer-
ence. New York, NY: Association for Computing Machinery, 1969,
pp. 157–172.
doi: 10.1145/800195.805928
204 Amestoy, P. R., Davis, T. A., and Duff, I. S., “An approximate cited on p. 482
minimum degree ordering algorithm,” SIAM Journal on Matrix
Analysis and Applications, Vol. 17, No. 4, 1996, pp. 886–905.
doi: 10.1137/S0895479894278952
205 Lambe, A. B. and Martins, J. R. R. A., “Extensions to the design cited on p. 482
structure matrix for the description of multidisciplinary design,
analysis, and optimization processes,” Structural and Multidiscipli-
nary Optimization, Vol. 46, August 2012, pp. 273–284.
doi: 10.1007/s00158-012-0763-y
206 Irons, B. M. and Tuck, R. C., “A version of the Aitken accelerator cited on p. 486
for computer iteration,” International Journal for Numerical Methods
in Engineering, Vol. 1, No. 3, 1969, pp. 275–277.
doi: 10.1002/nme.1620010306
207 Kenway, G. K. W., Kennedy, G. J., and Martins, J. R. R. A., “Scalable cited on pp. 486, 500
parallel approach for high-fidelity steady-state aeroelastic analysis
and derivative computations,” AIAA Journal, Vol. 52, No. 5, May
2014, pp. 935–951.
doi: 10.2514/1.J052255
208 Chauhan, S. S., Hwang, J. T., and Martins, J. R. R. A., “An automated cited on p. 486
selection algorithm for nonlinear solvers in MDO,” Structural and
Multidisciplinary Optimization, Vol. 58, No. 2, June 2018, pp. 349–377.
doi: 10.1007/s00158-018-2004-5
209 Kenway, G. K. W. and Martins, J. R. R. A., “Multipoint high-fidelity cited on p. 500
aerostructural optimization of a transport aircraft configuration,”
Journal of Aircraft, Vol. 51, No. 1, January 2014, pp. 144–160.
doi: 10.2514/1.C032150
210 Hwang, J. T., Lee, D. Y., Cutler, J. W., and Martins, J. R. R. A., cited on p. 508
“Large-scale multidisciplinary optimization of a small satellite’s
design and operation,” Journal of Spacecraft and Rockets, Vol. 51, No.
5, September 2014, pp. 1648–1663.
doi: 10.2514/1.A32751
211 Biegler, L. T., Ghattas, O., Heinkenschloss, M., and Bloemen Waan- cited on p. 513
ders, B. van, Eds., Large-Scale PDE-Constrained Optimization. Berlin:
Springer, 2003.
212 Braun, R. D. and Kroo, I. M., “Development and application of cited on pp. 517, 518
the collaborative optimization architecture in a multidisciplinary
design environment,” Multidisciplinary Design Optimization: State of
the Art, Alexandrov, N. and Hussaini, M. Y., Eds. Philadelphia, PA:
SIAM, 1997, pp. 98–116.
doi: 10.5555/888020
213 Kim, H. M., Rideout, D. G., Papalambros, P. Y., and Stein, J. L., cited on p. 520
“Analytical target cascading in automotive vehicle design,” Journal
of Mechanical Design, Vol. 125, No. 3, September 2003, pp. 481–490.
doi: 10.1115/1.1586308
214 Tosserams, S., Etman, L. F. P., Papalambros, P. Y., and Rooda, cited on p. 520
J. E., “An augmented Lagrangian relaxation for analytical target
cascading using the alternating direction method of multipliers,”
Structural and Multidisciplinary Optimization, Vol. 31, No. 3, March
2006, pp. 176–189.
doi: 10.1007/s00158-005-0579-0
215 Talgorn, B. and Kokkolaras, M., “Compact implementation of non- cited on p. 520
hierarchical analytical target cascading for coordinating distributed
multidisciplinary design optimization problems,” Structural and
Multidisciplinary Optimization, Vol. 56, No. 6, 2017, pp. 1597–1602.
doi: 10.1007/s00158-017-1726-0
216 Sobieszczanski-Sobieski, J., Altus, T. D., Phillips, M., and Sandusky, cited on p. 523
R., “Bilevel integrated system synthesis for concurrent and dis-
tributed processing,” AIAA Journal, Vol. 41, No. 10, 2003, pp. 1996–
2003.
doi: 10.2514/2.1889
217 Tedford, N. P. and Martins, J. R. R. A., “Benchmarking multidiscipli- cited on p. 529
nary design optimization algorithms,” Optimization and Engineering,
Vol. 11, No. 1, February 2010, pp. 159–183.
doi: 10.1007/s11081-009-9082-6
218 Golovidov, O., Kodiyalam, S., Marineau, P., Wang, L., and Rohl, cited on p. 530
P., “Flexible implementation of approximation concepts in an
MDO framework,” Proceedings of the 7th AIAA/USAF/NASA/ISSMO
Symposium on Multidisciplinary Analysis and Optimization. American
Institute of Aeronautics and Astronautics, 1998.
doi: 10.2514/6.1998-4959
219 Balabanov, V., Charpentier, C., Ghosh, D. K., Quinn, G., Vander- cited on p. 530
plaats, G., and Venter, G., “VisualDOC: A software system for general
purpose integration and design optimization,” Proceedings of the 9th
AIAA/ISSMO Symposium on Multidisciplinary Analysis and Optimization.
Based on course-tested material, this rigorous yet accessible graduate textbook covers both fun-
damental and advanced optimization theory and algorithms. It spans a wide range of numerical
methods and topics, including gradient-based and gradient-free algorithms, multidisciplinary
design optimization, and optimization under uncertainty, with guidance on selecting the algorithm
best suited to a given application. It also provides an overview of models and how to prepare them
for use with numerical optimization, including derivative computation. Over 400 high-quality visu-
alizations and numerous examples facilitate understanding of the theory, and practical tips address
common issues encountered in engineering design optimization. Numerous end-of-chapter
homework problems, progressing in difficulty, help put knowledge into practice. Accompanied
online by a solutions manual for instructors and source code for problems, this book is ideal for a
one- or two-semester graduate course on optimization in aerospace, civil, mechanical, electrical,
and chemical engineering departments.