
Engineering Design Optimization

Joaquim R. R. A. Martins
University of Michigan

Andrew Ning
Brigham Young University

Compiled on Tuesday 5th October, 2021 at 22:02

Copyright © 2021 Joaquim R. R. A. Martins and Andrew Ning. All rights reserved.

First electronic edition: January 2020.
Contents

Contents i

Preface vi

Acknowledgements viii

1 Introduction 1
1.1 Design Optimization Process 2
1.2 Optimization Problem Formulation 6
1.3 Optimization Problem Classification 17
1.4 Optimization Algorithms 21
1.5 Selecting an Optimization Approach 26
1.6 Notation 28
1.7 Summary 29
Problems 30

2 A Short History of Optimization 33


2.1 The First Problems: Optimizing Length and Area 33
2.2 Optimization Revolution: Derivatives and Calculus 34
2.3 The Birth of Optimization Algorithms 36
2.4 The Last Decades 39
2.5 Toward a Diverse Future 43
2.6 Summary 45

3 Numerical Models and Solvers 46


3.1 Model Development for Analysis versus Optimization 46
3.2 Modeling Process and Types of Errors 47
3.3 Numerical Models as Residual Equations 49
3.4 Discretization of Differential Equations 51
3.5 Numerical Errors 52
3.6 Overview of Solvers 60
3.7 Rate of Convergence 62
3.8 Newton-Based Solvers 65
3.9 Models and the Optimization Problem 69


3.10 Summary 72
Problems 74

4 Unconstrained Gradient-Based Optimization 77


4.1 Fundamentals 78
4.2 Two Overall Approaches to Finding an Optimum 93
4.3 Line Search 95
4.4 Search Direction 109
4.5 Trust-Region Methods 138
4.6 Summary 146
Problems 148

5 Constrained Gradient-Based Optimization 152


5.1 Constrained Problem Formulation 153
5.2 Understanding n-Dimensional Space 155
5.3 Optimality Conditions 157
5.4 Penalty Methods 174
5.5 Sequential Quadratic Programming 186
5.6 Interior-Point Methods 203
5.7 Constraint Aggregation 210
5.8 Summary 213
Problems 214

6 Computing Derivatives 222


6.1 Derivatives, Gradients, and Jacobians 222
6.2 Overview of Methods for Computing Derivatives 224
6.3 Symbolic Differentiation 225
6.4 Finite Differences 226
6.5 Complex Step 231
6.6 Algorithmic Differentiation 236
6.7 Implicit Analytic Methods—Direct and Adjoint 251
6.8 Sparse Jacobians and Graph Coloring 261
6.9 Unified Derivatives Equation 265
6.10 Summary 274
Problems 276

7 Gradient-Free Optimization 279


7.1 When to Use Gradient-Free Algorithms 279
7.2 Classification of Gradient-Free Algorithms 282
7.3 Nelder–Mead Algorithm 285
7.4 Generalized Pattern Search 290
7.5 DIRECT Algorithm 296
7.6 Genetic Algorithms 304

7.7 Particle Swarm Optimization 314


7.8 Summary 319
Problems 321

8 Discrete Optimization 325


8.1 Binary, Integer, and Discrete Variables 325
8.2 Avoiding Discrete Variables 326
8.3 Branch and Bound 328
8.4 Greedy Algorithms 335
8.5 Dynamic Programming 337
8.6 Simulated Annealing 345
8.7 Binary Genetic Algorithms 349
8.8 Summary 349
Problems 350

9 Multiobjective Optimization 353


9.1 Multiple Objectives 353
9.2 Pareto Optimality 355
9.3 Solution Methods 356
9.4 Summary 367
Problems 368

10 Surrogate-Based Optimization 370


10.1 When to Use a Surrogate Model 371
10.2 Sampling 372
10.3 Constructing a Surrogate 381
10.4 Kriging 397
10.5 Deep Neural Networks 406
10.6 Optimization and Infill 412
10.7 Summary 416
Problems 418

11 Convex Optimization 421


11.1 Introduction 421
11.2 Linear Programming 423
11.3 Quadratic Programming 425
11.4 Second-Order Cone Programming 427
11.5 Disciplined Convex Optimization 428
11.6 Geometric Programming 432
11.7 Summary 435
Problems 436

12 Optimization Under Uncertainty 438


12.1 Robust Design 439
12.2 Reliable Design 444
12.3 Forward Propagation 445
12.4 Summary 466
Problems 468

13 Multidisciplinary Design Optimization 471


13.1 The Need for MDO 471
13.2 Coupled Models 474
13.3 Coupled Derivatives Computation 497
13.4 Monolithic MDO Architectures 506
13.5 Distributed MDO Architectures 515
13.6 Summary 529
Problems 531

A Mathematics Background 534


A.1 Taylor Series Expansion 534
A.2 Chain Rule, Total Derivatives, and Differentials 536
A.3 Matrix Multiplication 539
A.4 Four Fundamental Subspaces in Linear Algebra 542
A.5 Vector and Matrix Norms 543
A.6 Matrix Types 545
A.7 Matrix Derivatives 547
A.8 Eigenvalues and Eigenvectors 548
A.9 Random Variables 549

B Linear Solvers 554


B.1 Systems of Linear Equations 554
B.2 Conditioning 555
B.3 Direct Methods 555
B.4 Iterative Methods 557

C Quasi-Newton Methods 566


C.1 Broyden’s Method 566
C.2 Additional Quasi-Newton Approximations 567
C.3 Sherman–Morrison–Woodbury Formula 571

D Test Problems 573


D.1 Unconstrained Problems 573
D.2 Constrained Problems 580

Bibliography 585

Index 608
Preface

Despite its usefulness, design optimization remains underused in
industry. One of the reasons for this is the shortage of design optimization
courses in undergraduate and graduate curricula. This is changing;
today, most top aerospace and mechanical engineering departments in-
clude at least one graduate-level course on numerical optimization. We
have also seen design optimization increasingly used in an expanding
number of industries.
The word engineering in the title reflects the types of problems
and algorithms we focus on, even though the methods are applicable
beyond engineering. In contrast to explicit analytic mathematical
functions, most engineering problems are implemented in complex
multidisciplinary codes that involve implicit functions. Such problems
might require hierarchical solvers and coupled derivative computation.
Furthermore, engineering problems often involve many design variables
and constraints, requiring scalable methods.
The target audience for this book is advanced undergraduate and
beginning graduate students in science and engineering. No previous
exposure to optimization is assumed. Knowledge of linear algebra,
multivariable calculus, and numerical methods is helpful. However,
these subjects’ core concepts are reviewed in an appendix and as needed
in the text. The content of the book spans approximately two semester-
length university courses. Our approach is to start from the most
general case problem and then explain special cases. The first half
of the book covers the fundamentals (along with an optional history
chapter). In contrast, the second half, from Chapter 8 onward, covers
more specialized or advanced topics.
Our philosophy in the exposition is to provide a detailed enough
explanation and analysis of optimization algorithms so that readers
can implement a basic working version. Although we do not encourage
readers to use their implementations instead of existing software for
solving optimization problems, implementing a method is crucial in
understanding the method and its behavior.∗ A deeper knowledge of
these methods is useful for developers, researchers, and those who
want to use numerical optimization more effectively. The problems at
the end of each chapter are designed to provide a gradual progression
in difficulty and eventually require implementing the methods. Some
of the problems are open-ended to encourage students to explore a
given topic on their own. When discussing the various optimization
techniques, we also explain how to avoid the potential pitfalls of using a
particular method and how to employ it more effectively. Practical tips
are included throughout the book to alert the reader to common issues
encountered in engineering design optimization and how to address
them.

∗ In the words of Donald Knuth: “The ultimate test of whether I understand something is if I can explain it to a computer. I can say something to you and you’ll nod your head, but I’m not sure that I explained it well. But the computer doesn’t nod its head. It repeats back exactly what I tell it. In most of life, you can bluff, but not with computers.”
We have created a repository with code, data, templates, and
examples as a supplementary resource for this book:
https://github.com/mdobook/resources. Some of the end-of-chapter exercises refer
to code or data from this repository.
Go forth and optimize!
Acknowledgments

Our workflow was tremendously enhanced by the support of Edmund
Lee and Aaron Lu, who took our sketches and plots and translated
them to high-quality, consistently formatted figures. The layout of this
book was greatly improved based in part on a template provided by
Max Opgenoord. We are indebted to many students and colleagues
who provided feedback and insightful questions on our concepts,
examples, lectures, and manuscript drafts. At the risk of leaving
out some contributors, we wish to express particular gratitude to the
following individuals who helped create examples, problems, solutions,
or content that was incorporated in the book: Tal Dohn, Xiaosong
Du, Sicheng He, Jason Hicken, Donald Jones, Shugo Kaneko, Taylor
McDonnell, Judd Mehr, Santiago Padrón, Sabet Seraj, P. J. Stanley, and
Anil Yildirim. Additionally, the following individuals provided helpful
suggestions and corrections to the manuscript: Eytan Adler, Josh Anibal,
Eliot Aretskin-Hariton, Alexander Coppeans, Alec Gallimore, Philip
Gill, Justin Gray, Christina Harvey, John Hwang, Kevin Jacobsen, Kai
James, Eirikur Jonsson, Matthew Kramer, Alexander Kleb, Michael
Kokkolaras, Yingqian Liao, Sandy Mader, Marco Mangano, Giuliana
Mannarino, Yara Martins, Johannes Norheim, Bernardo Pacini, Malhar
Prajapati, Michael Saunders, Nikhil Shetty, Tamás Terlaky, and Elizabeth
Wong. We are grateful to peer reviewers who provided enthusiastic
encouragement and helpful suggestions and wish to thank our editors
at Cambridge University Press, who quickly and competently offered
corrections. Finally, we express our deepest gratitude to our families
for their loving support.

Joaquim Martins and Andrew Ning

1 Introduction
Optimization is a human instinct. People constantly seek to improve
their lives and the systems that surround them. Optimization is intrinsic
in biology, as exemplified by the evolution of species. Birds optimize
their wings’ shape in real time, and dogs have been shown to find
optimal trajectories. Even more broadly, many laws of physics relate to
optimization, such as the principle of minimum energy. As Leonhard
Euler once wrote, “nothing at all takes place in the universe in which
some rule of maximum or minimum does not appear.”
The term optimization is often used to mean “improvement”, but
mathematically, it is a much more precise concept: finding the best
possible solution by changing variables that can be controlled, often
subject to constraints. Optimization has a broad appeal because it is
applicable in all domains and because of the human desire to make
things better. Any problem where a decision needs to be made can be
cast as an optimization problem.
Although some simple optimization problems can be solved an-
alytically, most practical problems of interest are too complex to be
solved this way. The advent of numerical computing, together with
the development of optimization algorithms, has enabled us to solve
problems of increasing complexity.

By the end of this chapter you should be able to:

1. Understand the design optimization process.


2. Formulate an optimization problem.
3. Identify key characteristics to classify optimization prob-
lems and optimization algorithms.
4. Select an appropriate algorithm for a given optimization
problem.

Optimization problems occur in various areas, such as economics,
political science, management, manufacturing, biology, physics, and
engineering. This book focuses on the application of numerical opti-
mization to the design of engineering systems. Numerical optimization
first emerged in operations research, which deals with problems such as
deciding on the price of a product, setting up a distribution network,
scheduling, or suggesting routes. Other optimization areas include
optimal control and machine learning. Although we do not cover these
other areas specifically in this book, many of the methods we cover are
useful in those areas.
Design optimization problems abound in the various engineering
disciplines, such as wing design in aerospace engineering, process
control in chemical engineering, structural design in civil engineering,
circuit design in electrical engineering, and mechanism design in
mechanical engineering. Most engineering systems rarely work in
isolation and are linked to other systems. This gave rise to the field of
multidisciplinary design optimization (MDO), which applies numerical
optimization techniques to the design of engineering systems that
involve multiple disciplines.
In the remainder of this chapter, we start by explaining the design
optimization process and contrasting it with the conventional design
process (Section 1.1). Then we explain how to formulate optimization
problems and the different types of problems that can arise (Section 1.2).
Because design optimization problems involve functions of different
types, these are also briefly discussed (Section 1.3). (A more detailed
discussion of the numerical models used to compute these functions is
deferred to Chapter 3.) We then provide an overview of the different
optimization algorithms, highlighting the algorithms covered in this
Requirements
book and linking to the relevant sections (Section 1.4). We connect and
algorithm types and problem types by providing guidelines for selecting specifications
the right algorithm for a given problem (Section 1.5). Finally, we
introduce the notation used throughout the book (Section 1.6).
Conceptual
design

1.1 Design Optimization Process


Preliminary
Engineering design is an iterative process that engineers follow to design
develop a product that accomplishes a given task. For any product
beyond a certain complexity, this process involves teams of engineers
and multiple stages with many iterative loops that may be nested. The Detailed
design
engineering teams are formed to tackle different aspects of the product
at different stages.
The design process can be divided into the sequence of phases shown Final
design
in Fig. 1.1. Before the design process begins, we must determine the
requirements and specifications. This might involve market research, Fig. 1.1 Design phases.
1 Introduction 3

an analysis of current similar designs, and interviews with potential


customers. In the conceptual design phase, various concepts for the
system are generated and considered. Because this phase should be
short, it usually relies on simplified models and human intuition. For
more complicated systems, the various subsystems are identified. In
the preliminary design phase, a chosen concept and subsystems are
refined by using better models to guide changes in the design, and
the performance expectations are set. The detailed design phase seeks
to complete the design down to every detail so that it can finally be
manufactured. All of these phases require iteration within themselves.
When severe issues are identified, it may be necessary to “go back to the
drawing board” and regress to an earlier phase. This is just a high-level
view; in practical design, each phase may require multiple iterative
processes.
Design optimization is a tool that can replace an iterative design
process to accelerate the design cycle and obtain better results. To
understand the role of design optimization, consider a simplified
version of the conventional engineering design process with only one
iterative loop, as shown in Fig. 1.2 (top). In this process, engineers make
decisions at every stage based on intuition and background knowledge.
Each of the steps in the conventional design process includes human
decisions that are either challenging or impossible to program into com-
puter code. Determining the product specifications requires engineers
to define the problem and do background research. The design cycle
must start with an initial design, which can be based on past designs or
a new idea. In the conventional design process, this initial design is
analyzed in some way to evaluate its performance. This could involve
numerical modeling or actual building and testing. Engineers then
evaluate the design and decide whether it is good enough or not based
on the results.∗ If the answer is no—which is likely to be the case for at
least the first few iterations—the engineer changes the design based
on intuition, experience, or trade studies. The design is finalized
when it is deemed satisfactory.

∗ The evaluation of a given design in engineering is often called the analysis. Engineers and computer scientists also refer to it as simulation.


The design optimization process can be represented using a flow
diagram similar to that for the conventional design process, as shown in
Fig. 1.2 (bottom). The determination of the specifications and the initial
design are no different from the conventional design process. However,
design optimization requires a formal formulation of the optimization
problem that includes the design variables that are to be changed, the
objective to be minimized, and the constraints that need to be satisfied.
The evaluation of the design is strictly based on numerical values for the
objective and constraints. When a rigorous optimization algorithm is
used, the decision to finalize the design is made only when the current
design satisfies the optimality conditions that ensure that no other
design “close by” is better. The design changes are made automatically
by the optimization algorithm and do not require intervention from
the designer.

[Fig. 1.2: Conventional design process (top) versus design optimization process (bottom).]
This automated process does not usually provide a “push-button”
solution; it requires human intervention and expertise (often more
expertise than in the traditional process). Human decisions are still
needed in the design optimization process. Before running an op-
timization, in addition to determining the specifications and initial
design, engineers need to formulate the design problem. This requires
expertise in both the subject area and numerical optimization. The
designer must decide what the objective is, which parameters can be
changed, and which constraints must be enforced. These decisions
have profound effects on the outcome, so it is crucial that the designer
formulates the optimization problem well.


After running the optimization, engineers must assess the design
because it is unlikely that the first formulation yields a valid and practical
design. After evaluating the optimal design, engineers might decide
to reformulate the optimization problem by changing the objective
function, adding or removing constraints, or changing the set of design
variables. Engineers might also decide to increase the models’ fidelity if
they fail to consider critical physical phenomena, or they might decide
to decrease the fidelity if the models are too expensive to evaluate in an
optimization iteration.
Post-optimality studies are often performed to interpret the optimal
design and the design trends. This might be done by performing pa-
rameter studies, where design variables or other parameters are varied
to quantify their effect on the objective and constraints. Validation of
the result can be done by evaluating the design with higher-fidelity
simulation tools, by performing experiments, or both. It is also possi-
ble to compute post-optimality sensitivities to evaluate which design
variables are the most influential or which constraints drive the design.
These sensitivities can inform where engineers might best allocate
resources to alleviate the driving constraints in future designs.
Design optimization can be used in any of the design phases shown
in Fig. 1.1, where each phase could involve running one or more design
optimizations. We illustrate several advantages of design optimization
in Fig. 1.3, which shows the notional variations of system performance,
cost, and uncertainty as a function of time in design. When using
optimization, the system performance increases more rapidly compared
with the conventional process, achieving a better end result in a shorter
total time. As a result, the cost of the design process is lower. Finally,
the uncertainty in the performance reduces more rapidly as well.

[Fig. 1.3: Compared with the conventional design process, MDO increases the system performance, decreases the design time, reduces the total cost, and reduces the uncertainty at a given point in time.]

Considering multiple disciplines or components using MDO amplifies
the advantages illustrated in Fig. 1.3. The central idea of MDO is to
consider the interactions between components using coupled models
while simultaneously optimizing the design variables from the various
components. In contrast, sequential optimization optimizes one component
at a time. Even when interactions are considered, sequential
optimization might converge to a suboptimal result (see Section 13.1
for more details and examples).
In this book, we tend to frame problems and discussions in the
context of engineering design. However, the optimization methods
are general and are used in other applications that may not be design
problems, such as optimal control, machine learning, and regression.
In other words, we mean “design” in a general sense, where variables
are changed to optimize an objective.

1.2 Optimization Problem Formulation

The design optimization process requires the designer to translate
their intent to a mathematical statement that can then be solved by
an optimization algorithm. Developing this statement has the added
benefit that it helps the designer better understand the problem. Being
methodical in the formulation of the optimization problem is vital
because the optimizer tends to exploit any weaknesses you might have in your
formulation or model. An inadequate problem formulation can either
cause the optimization to fail or cause it to converge to a mathematical
optimum that is undesirable or unrealistic from an engineering point
of view—the proverbial “right answer to the wrong question”.
To formulate design optimization problems, we follow the procedure
outlined in Fig. 1.4. The first step requires writing a description of the
design problem, including a description of the system, and a statement
of all the goals and requirements. At this point, the description does
not necessarily involve optimization concepts and is often vague.

[Fig. 1.4: Steps in optimization problem formulation: 1. Describe the problem; 2. Gather information; 3. Define the design variables; 4. Define the objective; 5. Define the constraints.]

The next step is to gather as much data and information as possible
about the problem. Some information is already specified in the
problem statement, but more research is usually required to find all the
relevant data on the performance requirements and expectations. Raw
data might need to be processed and organized to gather the information
required for the design problem. The more familiar practitioners are
with the problem, the better prepared they will be to develop a sound
formulation to identify eventual issues in the solutions.
At this stage, it is also essential to identify the analysis procedure
and gather information on that as well. The analysis might consist of a
simple model or a set of elaborate tools. All the possible inputs and
outputs of the analysis should be identified, and its limitations should
considered because optimization requires repeated analysis.
It is usually impossible to learn everything about the problem before
proceeding to the next steps, where we define the design variables, objec-
tive, and constraints. Therefore, information gathering and refinement
are ongoing processes in problem formulation.

1.2.1 Design Variables


The next step is to identify the variables that describe the system, the
design variables,∗ which we represent by the column vector:

𝑥 = [𝑥1 , 𝑥2 , . . . , 𝑥𝑛𝑥 ]    (1.1)

∗ Some texts call these decision variables or simply variables.

This vector defines a given design, so different vectors 𝑥 correspond
to different designs. The number of variables, 𝑛 𝑥 , determines the
problem’s dimensionality.
The design variables must not depend on each other or any other
parameter, and the optimizer must be free to choose the elements of
𝑥 independently. This means that in the analysis of a given design,
the variables must be input parameters that remain fixed throughout
the analysis process. Otherwise, the optimizer does not have absolute
control of the design variables. Another possible pitfall is to define
a design variable that happens to be a linear combination of other
variables, which results in an ill-defined optimization problem with
an infinite number of combinations of design variable values that
correspond to the same design.
The choice of variables is usually not unique. For example, a square
shape can be parametrized by the length of its side or by its area, and
different unit systems can be used. The choice of units affects the
problem’s scaling but not the functional form of the problem.
The choice of design variables can affect the functional form of the
objective and constraints. For example, some nonlinear relationships
can be converted to linear ones through a change of variables. It is also
possible to introduce or eliminate discontinuities through the choice of
design variables.
A given set of design variable values defines the system’s design, but
whether this system satisfies all the requirements is a separate question
that will be addressed with the constraints in a later step. However, it
is possible and advisable to define the space of allowable values for
the design variables based on the design problem’s specifications and
physical limitations.
The first consideration in the definition of the allowable design
variable values is whether the design variables are continuous or discrete.
Continuous design variables are real numbers that are allowed to vary
continuously within a specified range with no gaps, which we write as

\underline{x}_i \leq x_i \leq \overline{x}_i , \quad i = 1, \dots , n_x    (1.2)

where \underline{x} and \overline{x} are the lower and upper bounds on the design variables,
respectively. These are also known as bound constraints or side constraints.
Some design variables may be unbounded or bounded on only one
side.
When all the design variables are continuous, the optimization problem
is said to be continuous.† Most of this book focuses on algorithms
that assume continuous design variables.

† This is not to be confused with the continuity of the objective and constraint functions, which we discuss in Section 1.3.


When one or more variables are allowed to have discrete values,
whether real or integer, we have a discrete optimization problem. An
example of a discrete design variable is structural sizing, where only
components of specific thicknesses or cross-sectional areas are available.
Integer design variables are a special case of discrete variables where
the values are integers, such as the number of wheels on a vehicle.
Optimization algorithms that handle discrete variables are discussed
in Chapter 8.
We distinguish the design variable bounds from constraints because
the optimizer has direct control over their values, and they benefit from
a different numerical treatment when solving an optimization problem.
When defining these bounds, we must take care not to unnecessarily
constrain the design space, which would prevent the optimizer from
achieving a better design that is realizable. A smaller allowable range
in the design variable values should make the optimization easier.
However, design variable bounds should be based on actual physical
constraints instead of being artificially limited. An example of a
physical constraint is a lower bound on structural thickness in a weight
minimization problem, where otherwise, the optimizer will discover
that negative sizes yield negative weight. Whenever a design variable
converges to the bound at the optimum, the designer should reconsider
the reasoning for that bound and make sure it is valid. This is because
designers sometimes set bounds that limit the optimization from
obtaining a better objective.
At the formulation stage, we should strive to list as many indepen-
dent design variables as possible. However, it is advisable to start with
a small set of variables when solving a problem for the first time and
then gradually expand the set of design variables.
Some optimization algorithms require the user to provide initial
design variable values. This initial point is usually based on the best
guess the user can produce. This might be an already good design that
the optimization refines further by making small changes. Another
possibility is that the initial guess is a bad design or a “blank slate” that
the optimization changes significantly.

Example 1.1 Design variables for wing design

Consider a wing design problem where the wing planform shape is rectangular.
The planform could be parametrized by the span (𝑏) and the chord (𝑐), as
shown in Fig. 1.5, so that 𝑥 = [𝑏, 𝑐]. However, this choice is not unique.
Two other variables are often used in aircraft design: wing area (𝑆) and
wing aspect ratio (𝐴𝑅), as shown in Fig. 1.6. Because these variables are
not independent (𝑆 = 𝑏𝑐 and 𝐴𝑅 = 𝑏²/𝑆), we cannot just add them to the set
of design variables. Instead, we must pick any two variables out of the four
to parametrize the design because we have four possible variables and two
dependency relationships.

[Fig. 1.5: Wingspan (𝑏) and chord (𝑐).]

[Fig. 1.6: Wing design space for two different sets of design variables, 𝑥 = [𝑏, 𝑐] and 𝑥 = [𝑆, 𝐴𝑅].]
For this wing, the variables must be positive to be physically meaningful,
so we must remember to explicitly bound these variables to be greater than
zero in an optimization. The variables should be bound from below by small
positive values because numerical models are probably not prepared to take
zero values. No upper bound is needed unless the optimization algorithm
requires it.
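The two parameterizations in this example are related by 𝑆 = 𝑏𝑐 and 𝐴𝑅 = 𝑏²/𝑆, so a model written in terms of one set can be driven by the other. The following minimal Python sketch (for illustration only; the numerical values are made up and it is not from the book's code repository) converts between the two sets:

```python
import numpy as np

def bc_to_SAR(b, c):
    """Convert span and chord to wing area and aspect ratio."""
    S = b * c
    AR = b**2 / S  # equivalent to b / c for a rectangular wing
    return S, AR

def SAR_to_bc(S, AR):
    """Convert wing area and aspect ratio back to span and chord."""
    b = np.sqrt(AR * S)
    c = S / b
    return b, c

# Example: b = 8 m, c = 1.5 m  ->  S = 12 m^2, AR = 16/3
S, AR = bc_to_SAR(8.0, 1.5)
b, c = SAR_to_bc(S, AR)
```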

Tip 1.1 Use splines to parameterize curves

Many problems that involve shapes, functional distributions, and paths
are sometimes implemented with a large number of discrete points. However,
these can be represented more compactly with splines. This is a commonly used
technique in optimization because reducing the number of design variables
often speeds up an optimization with little if any loss in the model parameterization
fidelity. Figure 1.7 shows an example spline describing the shape of a
turbine blade. In this example, only four design variables are used to represent
the curved shape.

[Fig. 1.7: Parameterizing the chord distribution of a wing or turbine blade using a spline reduces the number of design variables while still allowing for a wide range of shape changes.]
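As a minimal sketch of this idea (the control-point values and locations below are made up for illustration and are not the data behind Fig. 1.7), a cubic spline through four design variables recovers a smooth chord distribution:

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Four design variables: chord values at fixed control points along the blade.
control_x = np.array([0.0, 0.33, 0.66, 1.0])   # blade fraction (assumed locations)
control_c = np.array([3.5, 4.0, 3.0, 0.8])     # chord [m] -- the design variables

spline = CubicSpline(control_x, control_c)

# Evaluate the full chord distribution wherever the analysis needs it.
blade_fraction = np.linspace(0.0, 1.0, 50)
chord = spline(blade_fraction)
```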
1.2.2 Objective Function
To find the best design, we need an objective function, which is a quantity
that determines if one design is better than another. This function must
be a scalar that is computable for a given design variable vector 𝑥. The
objective function can be minimized or maximized, depending on the
problem. For example, a designer might want to minimize the weight
or cost of a given structure. An example of a function to be maximized
could be the range of a vehicle.

The convention adopted in this book is that the objective function, 𝑓 ,
is to be minimized. This convention does not prevent us from maximizing
a function because we can reformulate it as a minimization problem by
finding the minimum of the negative of 𝑓 and then changing the sign,
as follows:

max[ 𝑓 (𝑥)] = − min[− 𝑓 (𝑥)] .    (1.3)

This transformation is illustrated in Fig. 1.8.‡

‡ Inverting the function (1/ 𝑓 ) is another way to turn a maximization problem into a minimization problem, but it is generally less desirable because it alters the scale of the problem and could introduce a divide-by-zero problem.

The objective function is computed through a numerical model
whose complexity can range from a simple explicit equation to a system
of coupled implicit models (more on this in Chapter 3).
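With an optimizer that only minimizes, such as scipy.optimize.minimize, the transformation in Eq. 1.3 amounts to negating the function and then negating the reported optimum. A minimal sketch (the one-variable function below is a made-up placeholder):

```python
import numpy as np
from scipy.optimize import minimize

def f(x):
    # Placeholder function we would like to maximize.
    return 10.0 - (x[0] - 3.0) ** 2

# Maximize f by minimizing -f ...
res = minimize(lambda x: -f(x), x0=np.array([0.0]))

# ... then flip the sign to recover max f(x) = -min[-f(x)].
f_max = -res.fun
x_star = res.x
```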
The choice of objective function is crucial for successful design
optimization. If the function does not represent the true intent of the
designer, it does not matter how precisely the function and its optimum
point are computed—the mathematical optimum will be non-optimal
from the engineering point of view. A bad choice for the objective
function is a common mistake in design optimization.

[Fig. 1.8: A maximization problem can be transformed into an equivalent minimization problem.]

The choice of objective function is not always obvious. For example,
minimizing the weight of a vehicle might sound like a good idea, but
this might result in a vehicle that is too expensive to manufacture. In
this case, manufacturing cost would probably be a better objective.
However, there is a trade-off between manufacturing cost and the
performance of the vehicle. It might not be obvious which of these
objectives is the most appropriate one because this trade-off depends on
customer preferences. This issue motivates multiobjective optimization,
which is the subject of Chapter 9. Multiobjective optimization does
not yield a single design but rather a range of designs that settle for
different trade-offs between the objectives.
Experimenting with different objectives should be part of the design
exploration process (this is represented by the outer loop in the design
optimization process in Fig. 1.2). Results from optimizing the “wrong”
objective can still yield insights into the design trade-offs and trends
for the system at hand.
In Ex. 1.1, we have the luxury of being able to visualize the design
space because we have only two variables. For more than three variables,
it becomes impossible to visualize the design space. We can also
visualize the objective function for two variables, as shown in Fig. 1.9.
In this figure, we plot the function values using the vertical axis, which
results in a three-dimensional surface. Although plotting the surface
might provide intuition about the function, it is not possible to locate
the points accurately when drawing on a two-dimensional surface.

[Fig. 1.9: A function of two variables (𝑓 = 𝑥₁² + 𝑥₂² in this case) can be visualized by plotting a three-dimensional surface or contour plot.]

Another possibility is to plot the contours of the function, which
are lines of constant value, as shown in Fig. 1.10. We prefer this type
of plot and use it extensively throughout this book because we can
locate points accurately and get the correct proportions in the axes
(in Fig. 1.10, the contours are perfect circles, and the location of the
minimum is clear). Our convention is to represent lower function values
with darker lines and higher values with lighter ones. Unless otherwise
stated, the function variation between two adjacent lines is constant,
and therefore, the closer together the contour lines are, the faster the
function is changing. The equivalent of a contour line in 𝑛-dimensional
space is a hypersurface of constant value with dimensions of 𝑛 − 1,
called an isosurface.

[Fig. 1.10: Contour plot of 𝑓 = 𝑥₁² + 𝑥₂².]
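A contour plot like Fig. 1.10 can be reproduced in a few lines; this sketch uses Matplotlib (the number of levels and other plotting details are arbitrary choices, not the book's):

```python
import numpy as np
import matplotlib.pyplot as plt

x1, x2 = np.meshgrid(np.linspace(-2, 2, 100), np.linspace(-2, 2, 100))
f = x1**2 + x2**2

fig, ax = plt.subplots()
ax.contour(x1, x2, f, levels=15, cmap="gray")  # darker lines at lower values
ax.set_aspect("equal")                          # correct proportions in the axes
ax.plot(0, 0, "ko")                             # minimum at the origin
ax.set_xlabel("$x_1$")
ax.set_ylabel("$x_2$")
plt.show()
```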
Example 1.2 Objective function for wing design


Let us discuss the appropriate objective function for Ex. 1.1 for a small
airplane. A common objective for a wing is to minimize drag. However, this
does not take into account the propulsive efficiency, which is strongly affected
by speed. A better objective might be to minimize the required power, which
balances drag and propulsive efficiency.§

§ The simple models used in this example are described in Appendix D.1.6.

The contours for the required power are shown in Fig. 1.11 for the two

choices of design variable sets discussed in Ex. 1.1. We can locate the minimum
graphically (denoted by the dot). Although the two optimum solutions are
the same, the shapes of the objective function contours are different. In this
case, using the aspect ratio and wing area simplifies the relationship between
the design variables and the objective by aligning the two main curvature
trends with each design variable. Thus, the parameterization can change the
effectiveness of the optimization.

1.5 70

1.2
50

𝑐 0.9 𝐴𝑅
Fig. 1.11 Required power contours
30
for two different choices of design
0.6 variable sets. The optimal wing is
the same for both cases, but the func-
10
0.3
tional form of the objective is simpli-
5 15 25 35 5 10 15 20 25 fied in the one on the right.
𝑏 𝑆

The optimal wing for this problem has an aspect ratio that is much higher
than that typically seen in airplanes or birds. Although the high aspect ratio
increases aerodynamic efficiency, it adversely affects the structural strength,
which we did not consider here. Thus, as in most engineering problems, we
need to add constraints and consider multiple disciplines.

We use mostly two-dimensional examples throughout the book
because we can visualize them conveniently. Such visualizations should
give you an intuition about the methods and problems. However, keep
in mind that general problems have many more dimensions, and only
mathematics can help you in such cases.
Although we can sometimes visualize the variation of the objective
function in a contour plot as in Ex. 1.2, this is not possible for problems
with more design variables or more computationally demanding func-
tion evaluations. This motivates numerical optimization algorithms,
which aim to find the minimum in a multidimensional design space
using as few function evaluations as possible.

1.2.3 Constraints
The vast majority of practical design optimization problems require the
enforcement of constraints. These are functions of the design variables
that we want to restrict in some way. Like the objective function,
constraints are computed through a model whose complexity can vary
widely. The feasible region is the set of points that satisfy all constraints.
We seek to minimize the objective function within this feasible design
space.
When we restrict a function to being equal to a fixed value, we call
this an equality constraint, denoted by ℎ(𝑥) = 0. When the function is
required to be less than or equal to a certain value, we have an inequality
constraint, denoted by 𝑔(𝑥) ≤ 0.¶ Although we use the “less or equal”
convention, some texts and software programs use “greater or equal”
instead. There is no loss of generality with either convention because
we can always multiply the constraint by −1 to convert between the
two.

¶ A strict inequality, 𝑔(𝑥) < 0, is never used because then 𝑥 could be arbitrarily close to the equality. Because the optimum is at 𝑔 = 0 for an active constraint, the exact solution would then be ill-defined from a mathematical perspective. Also, the difference is not meaningful when using finite-precision arithmetic (which is always the case when using a computer).

Tip 1.2 Check the inequality convention

When using optimization software, do not forget to check the convention
for the inequality constraints (i.e., determine whether it is “less than”, “greater
than”, or “allow two-sided constraints”) and convert your constraints as
needed.

Some texts and papers omit the equality constraints without loss
of generality because an equality constraint can be replaced by two
inequality constraints. More specifically, an equality constraint, ℎ(𝑥) =
0, is equivalent to enforcing two inequality constraints, ℎ(𝑥) ≥ 0 and
ℎ(𝑥) ≤ 0.
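These conversions are mechanical. The sketch below (with made-up constraint functions, purely for illustration) shows a "greater or equal" constraint flipped into the 𝑔(𝑥) ≤ 0 convention used here, and an equality constraint rewritten as the two inequalities just described:

```python
def g_tilde(x):
    # Constraint written in the "greater or equal" convention: x[0] - 1 >= 0
    return x[0] - 1.0

def g(x):
    # Same constraint in this book's convention: g(x) <= 0
    return -g_tilde(x)

def h(x):
    # Equality constraint: x[0] + x[1] - 2 = 0
    return x[0] + x[1] - 2.0

def h_as_two_inequalities(x):
    # h(x) = 0 is equivalent to enforcing h(x) <= 0 and -h(x) <= 0
    return [h(x), -h(x)]
```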

Inequality constraints can be active or inactive at the optimum point.


An active inequality constraint means that 𝑔(𝑥 ∗ ) = 0, whereas for an
inactive one, 𝑔(𝑥 ∗ ) < 0. If a constraint is inactive at the optimum, this
constraint could have been removed from the problem with no change
in its solution, as illustrated in Fig. 1.12. In this case, constraints 𝑔2
and 𝑔3 can be removed without affecting the solution of the problem.
Furthermore, active constraints (𝑔1 in this case) can equivalently be
replaced by equality constraints. However, it is difficult to know in
advance which constraints are active or not at the optimum for a general
problem. Constrained optimization is the subject of Chapter 5.

[Fig. 1.12: Two-dimensional problem with one active and two inactive inequality constraints (left). The shaded area indicates regions that are infeasible (i.e., the constraints are violated). If we only had the active single equality constraint in the formulation, we would obtain the same result (right).]

It is possible to overconstrain the problem such that there is no
solution. This can happen as a result of a programming error but can
also occur at the problem formulation stage. For more complicated
design problems, it might not be possible to satisfy all the specified
constraints, even if they seem to make sense. When this happens,
constraints have to be relaxed or removed.
The problem must not be overconstrained, or else there is no feasible
region in the design space over which the function can be minimized.
Thus, the number of independent equality constraints must be less than
or equal to the number of design variables (𝑛 ℎ ≤ 𝑛 𝑥 ). There is no limit
on the number of inequality constraints. However, they must be such
that there is a feasible region, and the number of active constraints plus
the equality constraints must still be less than or equal to the number
of design variables.
The feasible region grows when constraints are removed and shrinks
when constraints are added (unless these constraints are redundant).
As the feasible region grows, the optimum objective function usually
improves or at least stays the same. Conversely, the optimum worsens
or stays the same when the feasible region shrinks.

One common issue in optimization problem formulation is dis-
tinguishing objectives from constraints. For example, we might be
tempted to minimize the stress in a structure, but this would inevitably
result in an overdesigned, heavy structure. Instead, we might want
minimum weight (or cost) with sufficient safety factors on stress, which
can be enforced by an inequality constraint.
Most engineering problems require constraints—often a large num-
ber of them. Although constraints may at first appear limiting, they
enable the optimizer to find useful solutions.
As previously mentioned, some algorithms require the user to
provide an initial guess for the design variable values. Although it
is easy to assign values within the bounds, it might not be as easy to
ensure that the initial design satisfies the constraints. This is not an
issue for most optimization algorithms, but some require starting with
a feasible design.

Example 1.3 Constraints for wing design

We now add a design constraint for the power minimization problem
of Ex. 1.2. The unconstrained optimal wing had unrealistically high aspect
ratios because we did not include structural considerations. If we add an
inequality constraint on the bending stress at the root of the wing for a fixed
amount of material, we get the curve and feasible region shown in Fig. 1.13. The
unconstrained optimum violates this constraint. The constrained optimum
results in a lower span and higher chord, and the constraint is active.

[Fig. 1.13: Minimum-power wing with a constraint on bending stress compared with the unconstrained case.]

As previously mentioned, it is generally not possible to visualize
the design space as shown in Ex. 1.2 and obtain the solution graphically.
In addition to the possibility of a large number of design variables
and computationally expensive objective function evaluations, we now
be expensive to evaluate. Again, this is further motivation for the
optimization techniques covered in this book.

1.2.4 Optimization Problem Statement


Now that we have discussed the design variables, the objective function,
and constraints, we can put them all together in an optimization problem
statement. In words, this statement is as follows: minimize the objective
function by varying the design variables within their bounds subject to the
constraints.

Mathematically, we write this statement as follows:

minimize    f(x)
by varying  \underline{x}_i \leq x_i \leq \overline{x}_i ,   i = 1, \dots , n_x
subject to  g_j(x) \leq 0 ,   j = 1, \dots , n_g
            h_l(x) = 0 ,   l = 1, \dots , n_h        (1.4)

This is the standard formulation used in this book; however, other
books and software manuals might differ from this.‖ For example, they
might use different symbols, use “greater than or equal to” for the
inequality constraint, or maximize instead of minimizing. In any case,
it is possible to convert between standard formulations to get equivalent
problems.

‖ Instead of “by varying”, some textbooks use “with respect to” or “w.r.t.” as shorthand.
All continuous optimization problems with a single objective can
be written in the standard form shown in Eq. 1.4. Although our target
applications are engineering design problems, many other problems
can be stated in this form, and thus, the methods covered in this book
can be used to solve those problems.
The values of the objective and constraint functions for a given set
of design variables are computed through the analysis, which consists
of one or more numerical models. The analysis must be fully automatic
so that multiple optimization cycles can be completed without human
intervention, as shown in Fig. 1.14. The optimizer usually requires an
initial design 𝑥 0 and then queries the analysis for a sequence of designs
until it finds the optimum design, 𝑥 ∗ .

[Fig. 1.14: The analysis computes the objective (𝑓) and constraint values (𝑔, ℎ) for a given set of design variables (𝑥).]
Tip 1.3 Using an optimization software package (𝑥).
The setup of an optimization problem varies depending on the particular
software package, so read the documentation carefully. Most optimization
software requires you to define the objective and constraints as callback functions.
These are passed to the optimizer, which calls them back as needed during the
optimization process. The functions take the design variable values as inputs
∗∗ Optimization software resources in-
and output the function values, as shown in Fig. 1.14. Study the software clude the optimization toolboxes in
documentation for the details on how to use it.∗∗ To make sure you understand MATLAB, scipy.optimize.minimize in
how to use a given optimization package, test it on simple problems for which Python, Optim.jl or Ipopt.jl in Julia,
NLopt for multiple languages, and the
you know the solution first (see Prob. 1.4). Solver add-in in Microsoft Excel. The py-
OptSparse framework provides a common
Python wrapper for many existing opti-
mization codes and facilitates the testing
When the optimizer queries the analysis for a given 𝑥, for most of different methods.1 SNOW.jl wraps a
methods, the constraints do not have to be feasible. The optimizer is few optimizers and multiple derivative
computation methods in Julia.
responsible for changing 𝑥 so that the constraints are satisfied.
1. Wu et al., pyOptSparse: A Python frame-
The objective and constraint functions must depend on the design work for large-scale constrained nonlinear
variables; if a function does not depend on any variable in the whole optimization of sparse systems, 2020.
1 Introduction 16

domain, it can be ignored and should not appear in the problem


statement.
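As a concrete illustration of the standard form in Eq. 1.4 and the callback-function setup described in Tip 1.3, here is a minimal sketch using scipy.optimize.minimize. The two-variable functions are made up for demonstration, and note the sign flip needed because SciPy expects inequality constraints as c(x) ≥ 0 (see Tip 1.2):

```python
import numpy as np
from scipy.optimize import minimize

def f(x):
    # Objective callback: returns a scalar for a given design variable vector.
    return x[0] ** 2 + x[1] ** 2

def g(x):
    # Inequality constraint in this book's convention: g(x) <= 0
    # (here it means x[0] + x[1] >= 1).
    return 1.0 - x[0] - x[1]

x0 = np.array([2.0, 2.0])                      # initial design
bounds = [(0.0, 10.0), (0.0, 10.0)]            # design variable bounds
constraints = [{"type": "ineq", "fun": lambda x: -g(x)}]  # SciPy wants c(x) >= 0

res = minimize(f, x0, bounds=bounds, constraints=constraints)
print(res.x, res.fun)
```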
Ideally, 𝑓 , 𝑔, and ℎ should be computable for all values of 𝑥 that
make physical sense. Lower and upper design variable bounds should
be set to avoid nonphysical designs as much as possible. Even after
taking this precaution, models in the analysis sometimes fail to provide
a solution. A good optimizer can handle such eventualities gracefully.
There are some mathematical transformations that do not change
the solution of the optimization problem (Eq. 1.4). Multiplying either
the objective or the constraints by a constant does not change the optimal
design; it only changes the optimum objective value. Adding a constant
to the objective does not change the solution, but adding a constant to
any constraint changes the feasible space and can change the optimal
design.
Determining an appropriate set of design variables, objective, and
constraints is a crucial aspect of the outer loop shown in Fig. 1.2,
which requires human expertise in engineering design and numerical
optimization.

Tip 1.4 Ease into the problem

It is tempting to set up the full problem and attempt to solve it right
away. This rarely works, especially for a new problem. Before attempting any
optimization, you should run the analysis models and explore the solution
space manually. Particularly if using gradient-based methods, it helps to plot
the output functions across multiple input sweeps to assess if the numerical
outputs display the expected behavior and smoothness.
Instead of solving the full problem, ease into it by setting up the simplest
subproblem possible. If the function evaluations are costly, consider using
computational models that are less costly (but still representative). It is
advisable to start by solving a subproblem with a small set of variables and
then gradually expand it. The removal of some constraints has to be done more
carefully because it might result in an ill-defined problem. For multidisciplinary
problems, you should run optimizations with each component separately before
attempting to solve the coupled problem.
Solving simple problems for which you know the answer (or at least
problems for which you know the trends) helps identify any issues with
the models and problem formulation. Solving a sequence of increasingly
complicated problems gradually builds an understanding of how to solve the
optimization problem and interpret its results.

1.3 Optimization Problem Classification

To choose the most appropriate optimization algorithm for solving a
given optimization problem, we must classify the optimization prob-
lem and know how its attributes affect the efficacy and suitability of
the available optimization algorithms. This is important because no
optimization algorithm is efficient or even appropriate for all types of
problems.
We classify optimization problems based on two main aspects:
the problem formulation and the characteristics of the objective and
constraint functions, as shown in Fig. 1.15.

[Fig. 1.15: Optimization problems can be classified by attributes associated with the different aspects of the problem. The two main aspects are the problem formulation (design variables: continuous, discrete, or mixed; objective: single or multiobjective; constraints: constrained or unconstrained) and the objective and constraint function characteristics (smoothness: continuous or discontinuous; linearity: linear or nonlinear; modality: unimodal or multimodal; convexity: convex or nonconvex; stochasticity: deterministic or stochastic).]

In the problem formulation, the design variables can be either dis-
crete or continuous. Most of this book assumes continuous design
variables, but Chapter 8 provides an introduction to discrete optimiza-
tion. When the design variables include both discrete and continuous
variables, the problem is said to be mixed. Most of the book assumes a
single objective function, but we explain how to solve multiobjective
problems in Chapter 9. Finally, as previously mentioned, unconstrained
problems are rare in engineering design optimization. However, we
explain unconstrained optimization algorithms (Chapter 4) because
they provide the foundation for constrained optimization algorithms
(Chapter 5).
The characteristics of the objective and constraint functions also
determine the type of optimization problem at hand and ultimately
limit the type of optimization algorithm that is appropriate for solving
the optimization problem.
In this section, we will view the function as a “black box”, that is, a
computation for which we only see inputs (including the design vari-
ables) and outputs (including objective and constraints), as illustrated
in Fig. 1.16. When dealing with black-box models, there is limited or no
understanding of the modeling and numerical solution process used
to obtain the function values. We discuss these types of models and
how to solve them in Chapter 3, but here, we can still characterize the
functions based purely on their outputs. The black-box view is common
in real-world applications. This might be because the source code is not
provided, the modeling methods are not described, or simply because
the user does not bother to understand them.

[Fig. 1.16: A model is considered a black box when we only see its inputs (𝑥) and outputs (𝑓(𝑥), 𝑔(𝑥), ℎ(𝑥)).]
In the remainder of this section, we discuss the attributes of objec-
tives and constraints shown in Fig. 1.15. Strictly speaking, many of these
attributes cannot typically be identified from a black-box model. For
example, although the model may appear smooth, we cannot know that
it is smooth everywhere without a more detailed inspection. However,
for this discussion, we assume that the black box’s outputs can be
exhaustively explored so that these characteristics can be identified.

1.3.1 Smoothness

The degree of function smoothness with respect to variations in the
design variables depends on the continuity of the function values and
their derivatives. When the value of the function varies continuously,
the function is said to be 𝐶 0 continuous. If the first derivatives also vary
continuously, then the function is 𝐶 1 continuous, and so on. A function
is smooth when the derivatives of all orders vary continuously everywhere
in its domain. Function smoothness with respect to continuous
design variables affects what type of optimization algorithm can be
used. Figure 1.17 shows one-dimensional examples for a discontinuous
function, a 𝐶 0 function, and a 𝐶 1 function.

[Fig. 1.17: Discontinuous function (top), 𝐶 0 continuous function (middle), and 𝐶 1 continuous function (bottom).]

As we will see later, discontinuities in the function value or derivatives
limit the type of optimization algorithm that can be used because
some algorithms assume 𝐶 0 , 𝐶 1 , and even 𝐶 2 continuity. In practice,
these algorithms usually still work with functions that have only a few
discontinuities that are located away from the optimum.

1.3.2 Linearity
The functions of interest could be linear or nonlinear. When both the objective and constraint functions are linear, the optimization problem is known as a linear optimization problem. These problems are easier to solve than general nonlinear ones, and there are entire books and courses dedicated to the subject. The first numerical optimization algorithms were developed to solve linear optimization problems, and there are many applications in operations research (see Chapter 2). An example of a linear optimization problem is shown in Fig. 1.18.

Fig. 1.18 Example of a linear optimization problem in two dimensions.

When the objective function is quadratic and the constraints are linear, we have a quadratic optimization problem, which is another type of problem for which specialized solution methods exist.∗ Linear optimization and quadratic optimization are covered in Sections 5.5, 11.2, and 11.3.

∗ Historically, optimization problems were referred to as programming problems, so much of the existing literature refers to linear optimization as linear programming and similarly for other types of optimization.

Although many problems can be formulated as linear or quadratic problems, most engineering design problems are nonlinear. However, it is common to have at least a subset of constraints that are linear, and some general nonlinear optimization algorithms take advantage of the techniques used to solve linear and quadratic problems.
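To make the linear case concrete, the following minimal sketch (our own illustrative example; the coefficient values are arbitrary and SciPy is assumed to be available) sets up and solves a small two-variable linear optimization problem with scipy.optimize.linprog:

import numpy as np
from scipy.optimize import linprog

# Minimize c^T x subject to A x <= b and x >= 0 (illustrative data only).
c = np.array([-1.0, -2.0])            # equivalent to maximizing x1 + 2*x2
A = np.array([[1.0, 1.0],             # x1 +   x2 <= 4
              [1.0, 3.0]])            # x1 + 3*x2 <= 6
b = np.array([4.0, 6.0])

result = linprog(c, A_ub=A, b_ub=b, bounds=[(0, None), (0, None)])
print(result.x, result.fun)           # optimal point and objective value

Because both the objective and the constraints are linear, the solver can exploit the problem structure and find the optimum at a vertex of the feasible region far more efficiently than a general nonlinear algorithm would.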

1.3.3 Multimodality and Convexity


Functions can be either unimodal or multimodal. Unimodal functions have a single minimum, whereas multimodal functions have multiple minima. When we find a minimum without knowledge of whether the function is unimodal or not, we can only say that it is a local minimum; that is, this point is better than any point within a small neighborhood. When we know that a local minimum is the best in the whole domain (because we somehow know that the function is unimodal), then this is also the global minimum, as illustrated in Fig. 1.19. Sometimes, the function might be flat around the minimum, in which case we have a weak minimum.

Fig. 1.19 Types of minima: weak local minimum, local minimum, and global minimum.

For functions involving more complicated numerical models, it is usually impossible to prove that the function is unimodal. Proving that such a function is unimodal would require evaluating the function at every point in the domain, which is computationally prohibitive. However, it is much easier to prove multimodality—all we need to do is find two distinct local minima.
Just because a function is complicated or the design space has many dimensions, it does not mean that the function is multimodal. By default, we should not assume that a given function is either unimodal or multimodal. As we explore the problem and solve it starting from different points or using different optimizers, there are two main possibilities.
One possibility is that we find more than one minimum, thus proving that the function is multimodal. To prove this conclusively, we must make sure that the minima do indeed satisfy the mathematical optimality conditions with good enough precision.
The other possibility is that the optimization consistently converges to the same optimum. In this case, we can become increasingly confident that the function is unimodal with every new optimization that converges to the same optimum.†

† For example, Lyu et al.2 and He et al.3 show consistent convergence to the same optimum in an aerodynamic shape optimization problem.
2. Lyu et al., Aerodynamic Shape Optimization Investigations of the Common Research Model Wing Benchmark, 2015.
3. He et al., Robust aerodynamic shape optimization—From a circle to an airfoil, 2019.

Often, we need not be too concerned about the possibility of multiple local minima. From an engineering design point of view, achieving a local optimum that is better than the initial design is already a useful result.
Convexity is a concept related to multimodality. A function is convex if all line segments connecting any two points in the function lie
above the function and never intersect it. Convex functions are always unimodal. Also, all multimodal functions are nonconvex, but not all unimodal functions are convex (see Fig. 1.20).
Convex optimization seeks to minimize convex functions over convex sets. Like linear optimization, convex optimization is another subfield of numerical optimization with many applications. When the objective and constraints are convex functions, we can use specialized formulations and algorithms that are much more efficient than general nonlinear algorithms to find the global optimum, as explained in Chapter 11.

Fig. 1.20 Multimodal functions have multiple minima, whereas unimodal functions have only one minimum. All multimodal functions are nonconvex, but not all unimodal functions are convex.
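As a small illustration of how convexity can be verified in a simple case (our own example, assuming NumPy is available), a quadratic function $f(x) = \frac{1}{2} x^\top Q x + c^\top x$ is convex if and only if its Hessian $Q$ is positive semidefinite, which we can check from its eigenvalues:

import numpy as np

# A quadratic f(x) = 0.5 x^T Q x + c^T x is convex if and only if its
# Hessian Q is positive semidefinite (no negative eigenvalues).
Q = np.array([[4.0, 1.0],
              [1.0, 3.0]])            # illustrative Hessian

eigenvalues = np.linalg.eigvalsh(Q)   # eigenvalues of the symmetric matrix Q
print(eigenvalues, bool(np.all(eigenvalues >= 0)))

For general nonlinear models, no such simple numerical test is available, which is one reason convexity is usually established analytically rather than by inspection of function values.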
1.3.4 Deterministic versus Stochastic

Some functions are inherently stochastic. A stochastic model yields different function values for repeated evaluations with the same input
(Fig. 1.21). For example, the numerical value from a roll of dice is a
stochastic function.
Stochasticity can also arise from deterministic models when the in-
puts are subject to uncertainty. The input variables are then described as
probability distributions, and their uncertainties need to be propagated
through the model. For example, the bending stress in a beam may
follow a deterministic model, but the beam’s geometric properties may
be subject to uncertainty because of manufacturing deviations. For most of this text, we assume that functions are deterministic, except in Chapter 12, where we explain how to perform optimization when the model inputs are uncertain.
1.4 Optimization Algorithms

No single optimization algorithm is effective or even appropriate for


all possible optimization problems. This is why it is important to understand the problem before deciding which optimization algorithm to use. By “effective” algorithm, we mean, first, that the algorithm can solve the problem and, second, that it does so reliably and efficiently.

Fig. 1.21 Deterministic functions yield the same output when evaluated repeatedly for the same input, whereas stochastic functions do not.

Figure 1.22 lists the attributes for the classification of optimization
algorithms, which we cover in more detail in the following discussion.
These attributes are often amalgamated, but they are independent, and
any combination is possible. In this text, we cover a wide variety of
optimization algorithms corresponding to several of these combinations.
However, this overview still does not cover a wide variety of specialized
algorithms designed to solve specific problems where a particular
structure can be exploited.
When multiple models are involved, we also need to consider how
the models are coupled, solved, and integrated with the optimizer.
These considerations lead to different MDO architectures, which may

involve multiple levels of optimization problems. Coupled models and MDO architectures are covered in Chapter 13.

Fig. 1.22 Optimization algorithms can be classified by using the following attributes: order of information (zeroth, first, or second), search (local or global), algorithm (mathematical or heuristic), function evaluation (direct or surrogate model), stochasticity (deterministic or stochastic), and time dependence (static or dynamic). As in the problem classification step, these attributes are independent, and any combination is possible.

1.4.1 Order of Information


At the minimum, an optimization algorithm requires users to provide
the models that compute the objective and constraint values—zeroth-
order information—for any given set of allowed design variables. We
call algorithms that use only these function values gradient-free algo-
rithms (also known as derivative-free or zeroth-order algorithms). We
cover a selection of these algorithms in Chapters 7 and 8. The advantage
of gradient-free algorithms is that the optimization is easier to set up
because they do not need additional computations other than what the
models for the objective and constraints already provide.
Gradient-based algorithms use gradients of both the objective and
constraint functions with respect to the design variables—first-order
information. The gradients provide much richer information about
the function behavior, which the optimizer can use to converge to the
optimum more efficiently. Figure 1.23 shows how the cost of gradient-based versus gradient-free optimization algorithms typically scales when the number of design variables increases. The number of function evaluations required by gradient-free methods increases dramatically, whereas the number of evaluations required by gradient-based methods does not increase as much and is many orders of magnitude lower for the larger numbers of design variables.

Fig. 1.23 Gradient-based algorithms scale much better with the number of design variables. In this example, the gradient-based curve (with exact derivatives) grows from 67 to 206 function calls, but it is overwhelmed by the gradient-free curve, which grows from 103 function calls to over 32,000.

In addition, gradient-based methods use more rigorous criteria for optimality. The gradients are used to establish whether the optimizer converged to a point that satisfies mathematical optimality conditions, something that is difficult to verify in a rigorous way without gradients.
We first cover gradient-based algorithms for unconstrained problems in Chapter 4 and then extend them to constrained problems in Chapter 5. Gradient-based algorithms also include algorithms that use curvature—second-order information. Curvature is even richer information that tells us the rate of the change in the gradient, which provides an idea of where the function might flatten out.
There is a distinction between the order of information provided by
the user and the order of information actually used in the algorithm. For
example, a user might only provide function values to a gradient-based
algorithm and rely on the algorithm to internally estimate gradients.
Optimization algorithms estimate the gradients by requesting addi-
tional function evaluations for finite difference approximations (see
Section 6.4). Gradient-based algorithms can also internally estimate
curvature based on gradient values (see Section 4.4.4).
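To illustrate what such an internal estimate involves, the following sketch (our own minimal example, not the implementation used by any particular optimizer) approximates the gradient with forward finite differences by perturbing one design variable at a time:

import numpy as np

def forward_difference_gradient(func, x, h=1e-6):
    # Approximate the gradient of func at x by perturbing one variable at a
    # time; this costs n + 1 function evaluations for n design variables.
    x = np.asarray(x, dtype=float)
    f0 = func(x)
    grad = np.zeros_like(x)
    for i in range(x.size):
        x_pert = x.copy()
        x_pert[i] += h
        grad[i] = (func(x_pert) - f0) / h
    return grad

def f(x):
    return x[0] ** 2 + 3.0 * x[0] * x[1]   # simple test function

print(forward_difference_gradient(f, [1.0, 2.0]))   # exact gradient is [8, 3]

The extra function evaluations are why gradients estimated this way become expensive as the number of design variables grows, which motivates the more efficient approaches in Chapter 6.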


In theory, gradient-based algorithms require the functions to be
sufficiently smooth (at least 𝐶 1 continuous). However, in practice, they
can tolerate the occasional discontinuity, as long as this discontinuity is
not near the optimum.
We devote a considerable portion of this book to gradient-based
algorithms because they scale favorably with the number of design
variables, and they have rigorous mathematical criteria for optimality.
We also cover the various approaches for computing gradients in detail
because the accurate and efficient computation of these gradients is
crucial for the efficacy and efficiency of these methods (see Chapter 6).
Current state-of-the-art optimization algorithms also use second-
order information to implement Newton-type methods for second-
order convergence. However, these algorithms tend to build second-
order information based on the provided gradients, as opposed to
requiring users to provide the second-order information directly (see
Section 4.4.4).
Because gradient-based methods require accurate gradients and
smooth enough functions, they require more knowledge about the mod-
els and optimization algorithm than gradient-free methods. Chapters 3
through 6 are devoted to making the power of gradient-based methods
more accessible by providing the necessary theoretical and practical
knowledge.

1.4.2 Local versus Global Search


The many ways to search the design space can be classified as being
local or global. A local search takes a series of steps starting from
a single point to form a sequence of points that hopefully converges
to a local optimum. In spite of the name, local methods can traverse
large portions of the design space and can even step between convex
regions (although this happens by chance). A global search tries to
span the whole design space in the hope of finding the global optimum.
As previously mentioned when discussing multimodality, even when
using a global method, we cannot prove that any optimum found is a
global one except for particular cases.
The classification of local versus global searches often gets con-
flated with the gradient-based versus gradient-free attributes because
gradient-based methods usually perform a local search. However, these
should be viewed as independent attributes because it is possible to use
a global search strategy to provide starting points for a gradient-based algorithm. Similarly, some gradient-free algorithms are based on local search strategies.
The choice of search type is intrinsically linked to the modality of
the design space. If the design space is unimodal, then a local search
is sufficient because it converges to the global optimum. If the design
space is multimodal, a local search converges to an optimum that might
be local (or global if we are lucky enough). A global search increases
the likelihood that we converge to a global optimum, but this is by no
means guaranteed.
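One common way to combine the two search types is a multistart strategy: run a local, gradient-based search from several starting points spread over the design space and keep the best local optimum found. The sketch below is our own illustration (the objective, bounds, and number of starts are arbitrary) and assumes SciPy is available:

import numpy as np
from scipy.optimize import minimize

def multistart(func, bounds, n_starts=20, seed=0):
    # Local gradient-based searches from random starting points; the best
    # result is returned, but it is not guaranteed to be the global optimum.
    rng = np.random.default_rng(seed)
    lower, upper = np.array(bounds, dtype=float).T
    best = None
    for _ in range(n_starts):
        x0 = rng.uniform(lower, upper)
        result = minimize(func, x0, method="BFGS")
        if best is None or result.fun < best.fun:
            best = result
    return best

def f(x):
    return np.sin(3 * x[0]) + (x[0] - 0.5) ** 2 + x[1] ** 2   # multimodal in x[0]

print(multistart(f, bounds=[(-2.0, 2.0), (-2.0, 2.0)]).x)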

1.4.3 Mathematical versus Heuristic


There is a big divide regarding the extent to which an algorithm is based
on provable mathematical principles versus heuristics. Optimization
algorithms require an iterative process, which determines the sequence
of points evaluated when searching for an optimum, and optimality
criteria, which determine when the iterative process ends. Heuristics
are rules of thumb or commonsense arguments that are not based on a
strict mathematical rationale.
Gradient-based algorithms are usually based on mathematical prin-
ciples, both for the iterative process and for the optimality criteria.
Gradient-free algorithms are more evenly split between the mathe-
matical and heuristic for both the optimality criteria and the itera-
tive procedure. The mathematical gradient-free algorithms are often
called derivative-free optimization algorithms. Heuristic gradient-free
algorithms include a wide variety of nature-inspired algorithms (see
Section 7.2).
Heuristic optimality criteria are an issue because, strictly speaking,
they do not prove that a given point is a local (let alone global) optimum;
they are only expected to find a point that is “close enough”. This
contrasts with mathematical optimality criteria, which are unambiguous
about (local) optimality and converge to the optimum within the limits
of the working precision. This is not to suggest that heuristic methods
are not useful. Finding a better solution is often desirable regardless of
whether or not it is strictly optimal. Not converging tightly to optimality
criteria does, however, make it harder to compare results from different
methods.
Iterative processes based on mathematical principles tend to be
more efficient than those based on heuristics. However, some heuristic
methods are more robust because they tend to make fewer assumptions
about the modality and smoothness of the functions and handle noisy
functions more effectively.
Most algorithms mix mathematical arguments and heuristics to


some degree. Mathematical algorithms often include constants whose
values end up being tuned based on experience. Conversely, algo-
rithms primarily based on heuristics sometimes include steps with
mathematical justification.

1.4.4 Function Evaluation


The optimization problem setup that we described previously assumes
that the function evaluations are obtained by solving numerical models
of the system. We call these direct function evaluations. However, it is
possible to create surrogate models (also known as metamodels) of these
models and use them in the optimization process. These surrogates can
be interpolation-based or projection-based models. Surrogate-based
optimization is discussed in Chapter 10.
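As a minimal sketch of this idea (our own example, not one of the methods from Chapter 10), we can evaluate the model directly at a few sample points, fit an inexpensive surrogate to those samples, and then optimize the surrogate instead of the original model:

import numpy as np
from scipy.optimize import minimize_scalar

def expensive_model(x):
    # Stand-in for a costly simulation (illustrative only)
    return (x - 1.3) ** 2 + 0.1 * np.sin(5 * x)

# Direct evaluations of the model at a handful of sample points
x_samples = np.linspace(-2.0, 3.0, 8)
f_samples = expensive_model(x_samples)

# Fit a cheap polynomial surrogate (metamodel) to the samples
surrogate = np.poly1d(np.polyfit(x_samples, f_samples, deg=2))

# Optimize the surrogate instead of the expensive model
result = minimize_scalar(surrogate, bounds=(-2.0, 3.0), method="bounded")
print(result.x)   # approximate minimizer predicted by the surrogate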

1.4.5 Stochasticity
This attribute is independent of the stochasticity of the model that
we mentioned previously, and it is strictly related to whether the
optimization algorithm itself contains steps that are determined at
random or not.
A deterministic optimization algorithm always evaluates the same
points and converges to the same result, given the same initial conditions.
In contrast, a stochastic optimization algorithm evaluates a different set
of points if run multiple times from the same initial conditions, even
if the models for the objective and constraints are deterministic. For
example, most evolutionary algorithms include steps determined by
generating random numbers. Gradient-based algorithms are usually
deterministic, but some exceptions exist, such as stochastic gradient
descent (see Section 10.5).

1.4.6 Time Dependence


In this book, we assume that the optimization problem is static. This
means that we formulate the problem as a single optimization and solve
the complete numerical model at each optimization iteration. In contrast,
dynamic optimization problems solve a sequence of optimization problems
to make decisions at different time instances based on information that
becomes available as time progresses.
For some problems that involve time dependence, we can perform
time integration to solve for the entire time history of the states and then
compute the objective and constraint function values for an optimization
iteration. This means that every optimization iteration requires solving
for the entire time history. An example of this type of problem is a
trajectory optimization problem where the design variables are the
coordinates representing the path, and the objective is to minimize the
total energy expended to get to a given destination.4 Although such a problem involves a time dependence, we still classify it as static because we solve a single optimization problem. As a more specific example, consider a car going around a racetrack. We could optimize the time history of the throttle, braking, and steering of a car to get a trajectory that minimizes the total time in a known racetrack for fixed conditions. This is an open-loop optimal control problem because the car control is predetermined and does not react to any disturbances.

4. Betts, Survey of numerical methods for trajectory optimization, 1998.
For dynamic optimization problems (also known as dynamic programming), the design variables are decisions made in a sequence of time steps.5,6 The decision at a given time step is influenced by the decisions and system states from previous steps. Sometimes, the decision at a given time step also depends on a prediction of the states a few steps into the future.

5. Bryson and Ho, Applied Optimal Control; Optimization, Estimation, and Control, 1969.
6. Bertsekas, Dynamic Programming and Optimal Control, 1995.
The car example that we previously mentioned could also be a
dynamic optimization problem if we optimized the throttle, braking,
or steering of a car at each time instance in response to some measured
output. We could, for example, maximize the instantaneous acceleration
based on real-time acceleration sensor information and thus react to
varying conditions, such as surface traction. This is an example of a
closed-loop (or feedback) optimal control problem, a type of dynamic
optimization problem where a control law is optimized for a dynamical
system over a period of time.
Dynamic optimization is not covered in this book, except in the con-
text of discrete optimization (see Section 8.5). Different approaches are
used in general, but many of the concepts covered here are instrumental
in the numerical solution of dynamic optimization and optimal control
problems.

1.5 Selecting an Optimization Approach

This section provides guidance on how to select an appropriate approach


for solving a given optimization problem. This process cannot always
be distilled to a simple decision tree; however, it is still helpful to have a
framework as a first guide. Many of these decisions will become more
apparent as you progress through the book and gain experience, so
you may want to revisit this section periodically. Eventually, selecting
an appropriate methodology will become second nature.
Figure 1.24 outlines one approach to algorithm selection and also


serves as an overview of the chapters in this book. The first two char-
acteristics in the decision tree (convex problem and discrete variables)
are not the most common within the broad spectrum of engineering
optimization problems, but we list them first because they are the more
restrictive in terms of usable optimization algorithms.

Fig. 1.24 Decision tree for selecting an optimization algorithm. Convex? yes: linear optimization, quadratic optimization, etc. (Ch. 11). Otherwise, Discrete? yes (Ch. 8): branch and bound if linear, dynamic programming if a Markov chain, and SA or GA (bit-encoded) otherwise. Otherwise, Differentiable? (Ch. 6) yes: BFGS if unconstrained (Ch. 4), SQP or IP if constrained (Ch. 5), with multistart if multimodal; no: gradient-free methods (Ch. 7), using DIRECT, GPS, GA, PS, etc. if multimodal and Nelder–Mead otherwise. Additional considerations: multiple objectives (Ch. 9), noisy or expensive functions (Ch. 10), uncertainty (Ch. 12), and multiple disciplines (Ch. 13).

The first node asks about convexity. Although it is often not immediately apparent if the problem is convex, with some experience,
we can usually discern whether we should attempt to reformulate it as a
convex problem. In most instances, convexity occurs for problems with
simple objectives and constraints (e.g., linear or quadratic), such as in
control applications where the optimization is performed repeatedly. A
convex problem can be solved with general gradient-based or gradient-
free algorithms, but it would be inefficient not to take advantage of the
convex formulation structure if we can do so.
The next node asks about discrete variables. Problems with discrete
design variables are generally much harder to solve, so we might
consider alternatives that avoid using discrete variables when possible.
For example, a wind turbine’s position in a field could be posed as
a discrete variable within a discrete set of options. Alternatively, we
could represent the wind turbine’s position as a continuous variable
with two continuous coordinate variables. That level of flexibility may
or may not be desirable but generally leads to better solutions. Many
problems are fundamentally discrete, and there is a wide variety of available methods.
Next, we consider whether the model is continuous and differen-
tiable or can be made smooth through model improvements. If the
problem is high dimensional (more than a few tens of variables as a
rule of thumb), gradient-free algorithms are generally intractable and
gradient-based algorithms are preferable. We would either need to
make the model smooth enough to use a gradient-based algorithm or
reduce the problem dimensionality to use a gradient-free algorithm.
Another alternative if the problem is not readily differentiable is to use
surrogate-based optimization (the box labeled “Noisy or expensive” in
Fig. 1.24). If we go the surrogate-based optimization route, we could
still use a gradient-based approach to optimize the surrogate model
because most such models are differentiable. Finally, for problems with
a relatively small number of design variables, gradient-free methods
can be a good fit. Gradient-free methods have the largest variety of
algorithms, and a combination of experience and testing is needed to
determine an appropriate algorithm for the problem at hand.
The bottom row in Fig. 1.24 lists additional considerations: multiple
objectives, surrogate-based optimization for noisy (nondifferentiable) or
computationally expensive functions, optimization under uncertainty
in the design variables and other model parameters, and MDO.

1.6 Notation

We do not use bold font to represent vectors or matrices. Instead, we follow the convention of many optimization and numerical linear algebra books, which try to use Greek letters (e.g., 𝛼 and 𝛽) for scalars, lowercase roman letters (e.g., 𝑥 and 𝑢) for vectors, and capitalized roman letters (e.g., 𝐴 and 𝐻) for matrices. There are exceptions to this notation because of the wide variety of topics covered in this book and a desire not to deviate from the standard conventions used in each field. We explicitly note these exceptions as needed. For example, the objective function 𝑓 is a scalar function and the Lagrange multipliers (𝜆 and 𝜎) are vectors.

Fig. 1.25 An 𝑛-vector, 𝑥, with components 𝑥1 , . . . , 𝑥 𝑖 , . . . , 𝑥 𝑛 (dimensions 𝑛 × 1).

By default, a vector 𝑥 is a column vector, and thus 𝑥 ⊺ is a row vector. We denote the 𝑖th element of the vector as 𝑥 𝑖 , as shown in Fig. 1.25. For more compact notation, we may write a column vector horizontally, with its components separated by commas, for example, 𝑥 = [𝑥1 , 𝑥2 , . . . , 𝑥 𝑛 ]. We refer to a vector with 𝑛 components as an 𝑛-vector, which is equivalent to writing 𝑥 ∈ R𝑛 .
An (𝑛 × 𝑚) matrix has 𝑛 rows and 𝑚 columns, which is equivalent to defining 𝐴 ∈ R𝑛×𝑚 . The matrix element 𝐴 𝑖𝑗 is the element in the 𝑖th row of the 𝑗th column, as shown in Fig. 1.26. Occasionally, additional letters beyond 𝑖 and 𝑗 are needed for indices, but those are explicitly noted when used.

Fig. 1.26 An (𝑛 × 𝑚) matrix, 𝐴, with elements 𝐴 𝑖𝑗 ranging from 𝐴11 to 𝐴𝑛𝑚 .

The subscript 𝑘 usually refers to iteration number. Thus, 𝑥 𝑘 is the complete vector 𝑥 at iteration 𝑘. The subscript zero is used for the same purpose, so 𝑥0 would be the complete vector 𝑥 at the initial iteration. Other subscripts besides those listed are used for naming. A superscript star (𝑥 ∗ ) refers to a quantity at the optimum.

Tip 1.5 Work out the dimensions of the vectors and matrices

As you read this book, we encourage you to work out the dimensions of
the vectors and matrices in the operations within each equation and verify the
dimensions of the result for consistency. This will enhance your understanding
of the equations.
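For example, a quick way to check dimensions numerically (our own sketch using NumPy arrays) is to form the arrays and inspect their shapes; the product of an (𝑛 × 𝑚) matrix and an 𝑚-vector must be an 𝑛-vector:

import numpy as np

n, m = 3, 2
A = np.arange(n * m, dtype=float).reshape(n, m)   # an (n x m) matrix
x = np.array([1.0, -1.0])                         # an m-vector

y = A @ x                         # matrix-vector product
print(A.shape, x.shape, y.shape)  # (3, 2) (2,) (3,): the result is an n-vector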

1.7 Summary

Optimization is compelling, and there are opportunities to apply it


everywhere. Numerical optimization fully automates the design pro-
cess but requires expertise in the problem formulation, optimization
algorithm selection, and the use of that algorithm. Finally, design
expertise is also required to interpret and critically evaluate the results
given by the optimization.
There is no single optimization algorithm that is effective in the
solution of all types of problems. It is crucial to classify the optimization
problem and understand the optimization algorithms’ characteristics
to select the appropriate algorithm to solve the problem.
In seeking a more automated design process, we must not dismiss the
value of engineering intuition, which is often difficult (if not impossible)
to convert into a rigid problem formulation and algorithm.

Problems

1.1 Answer true or false and justify your answer.

a. MDO arose from the need to consider multiple design objectives.
b. The preliminary design phase takes place after the concep-
tual design phase.
c. Design optimization is a completely automated process from
which designers can expect to get their final design.
d. The design variables for a problem consist of all the inputs
needed to compute the objective and constraint functions.
e. The design variables must always be independent of each
other.
f. An optimization algorithm designed for minimization can be
used to maximize an objective function without modifying
the algorithm.
g. Compared with the global optimum of a given problem,
adding more design variables to that problem results in a
global optimum that is no worse than that of the original
problem.
h. Compared with the global optimum objective value of a
given problem, adding more constraints sometimes results
in a better global optimum.
i. A function is 𝐶 1 continuous if its derivative varies continu-
ously.
j. All unimodal functions are convex.
k. Global search algorithms always converge to the global
optimum.
l. Gradient-based methods are largely based on mathematical
principles as opposed to heuristics.
m. Solving a problem that involves a stochastic model requires
a stochastic optimization algorithm.
n. If a problem is multimodal, it requires a gradient-free opti-
mization algorithm.

1.2 Plotting a two-dimensional function. Consider the two-dimensional


function
$f(x_1, x_2) = x_1^3 + 2 x_1 x_2^2 - x_2^3 - 20 x_1 .$
Plot the function contours and find the approximate location of


the minimum point(s). Is there a global minimum? Exploration:
Plot other functions to get an intuition about their trends and
minima. You can start with simple low-order polynomials and
then add higher-order terms, trying different coefficients. Then
you can also try nonalgebraic functions. This will give you an
intuition about the function trends and minima.

1.3 Standard form. Convert the following problem to the standard


formulation (Eq. 1.4):

maximize    $2 x_1^2 - x_1^4 x_2^2 - e^{x_3} + e^{-x_3} + 12$
by varying  $x_1, x_2, x_3$
subject to  $x_1 \geq 1$                                    (1.5)
            $x_2 + x_3 \geq 10$
            $x_1^2 + 3 x_2^2 \leq 4$ .

1.4 Using an unconstrained optimizer. Consider the two-dimensional


function
$f(x_1, x_2) = (1 - x_1)^2 + (1 - x_2)^2 + \frac{1}{2} \left( 2 x_2 - x_1^2 \right)^2 .$
Plot the contours of this function and find the minimum graphi-
cally. Then, use optimization software to find the minimum (see
Tip 1.3). Verify that the optimizer converges to the minimum you
found graphically. Exploration: (1) Try minimizing the function
in Prob. 1.2 starting from different points. (2) Minimize other
functions of your choosing. (3) Study the options provided by the
optimization software and explore different settings.
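If you have not used an optimizer before, the following sketch shows one possible way to call an unconstrained optimizer with SciPy (the objective shown is only a placeholder; substitute the function above, and treat the starting point and method as arbitrary choices):

import numpy as np
from scipy.optimize import minimize

def f(x):
    # Placeholder objective; replace with the function defined in this problem
    return (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2

x0 = np.array([0.0, 0.0])                  # starting point
result = minimize(f, x0, method="BFGS")    # unconstrained gradient-based solver
print(result.x, result.fun, result.nit)    # optimum, objective, iteration count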

1.5 Using a constrained optimizer. Now we add constraints to Prob. 1.4.


The objective is the same, but we now have two inequality con-
straints:

$x_1^2 + x_2^2 \leq 1$
$x_1 - 3 x_2 + \frac{1}{2} \geq 0 ,$
and bound constraints:

𝑥1 ≥ 0, 𝑥2 ≥ 0 .

Plot the constraints and identify the feasible region. Find the
constrained minimum graphically. Use optimization software
to solve the constrained minimization problem. Which of the
inequality constraints and bounds are active at the solution?

1.6 Paper review. Select a paper on design optimization that seems


interesting to you, preferably from a peer-reviewed journal.
Write the full optimization problem statement in the standard
form (Eq. 1.4) for the problem solved in the paper. Classify the
problem according to Fig. 1.15 and the optimization algorithm ac-
cording to Fig. 1.22. Use the decision tree in Fig. 1.24 to determine
if the optimization algorithm was chosen appropriately. Write a
critique of the paper, highlighting its strengths and weaknesses.

1.7 Problem formulation. Choose an engineering system that you


are familiar with, and use the process outlined in Fig. 1.4 to
formulate a problem for the design optimization of that system.
Write the statement in the standard form (Eq. 1.4). Critique your
statement by asking the following: Does the objective function
truly capture the design intent? Are there other objectives that
could be considered? Do the design variables provide enough
freedom? Are the design variables bounded such that nonphysical
designs are prevented? Are you sure you have included all the
constraints needed to get a practical design? Can you think of
any loophole that the optimizer can exploit in your statement?
2 A Short History of Optimization
This chapter provides helpful historical context for the methods dis-
cussed in this book. Nothing else in the book depends on familiarity
with the material in this chapter, so it can be skipped. However, this
history makes connections between the various topics that will enrich
the big picture of optimization as you become familiar with the material
in the rest of the book, so you might want to revisit this chapter.
Optimization has a long history that started with geometry problems
solved by ancient Greek mathematicians. The invention of algebra and
calculus opened the door to many more types of problems, and the
advent of numerical computing increased the range of problems that
could be solved in terms of type and scale.

By the end of this chapter you should be able to:

1. Appreciate a range of historical advances in optimization.


2. Describe current frontiers in optimization.

2.1 The First Problems: Optimizing Length and Area

Ancient Greek and Egyptian mathematicians made numerous contri-


butions to geometry, including solving optimization problems that
involved length and area. They adopted a geometric approach to
solving problems that are now more easily solved using calculus.
Archimedes of Syracuse (287–212 BCE) showed that of all possible
spherical caps of a given surface area, hemispherical caps have the
largest volume. Euclid of Alexandria (325–265 BCE) showed that the
shortest distance between a point and a line is the segment perpendicular
to that line. He also proved that among rectangles of a given perimeter,
the square has the largest area.
Geometric problems involving perimeter and area had practical
value. The classic example of such practicality is Dido’s problem.
According to the legend, Queen Dido, who had fled to Tunis, purchased


from a local leader as much land as could be enclosed by an ox’s hide.


The leader agreed because this seemed like a modest amount of land. To maximize her land area, Queen Dido had the hide cut into narrow strips to make the longest possible string. Then, she intuitively found the curve that maximized the area enclosed by the string: a semicircle with the diameter segment set along the sea coast (see Fig. 2.1). As a result of the maximization, she acquired enough land to found the ancient city of Carthage. Later, Zenodorus (200–140 BCE) proved this optimal solution using geometrical arguments. A rigorous solution to this problem requires using calculus of variations, which was invented much later.

Fig. 2.1 Queen Dido intuitively maximized the area for a given perimeter, thus acquiring enough land to found the city of Carthage.

Geometric optimization problems are also applicable to the laws of
physics. Hero of Alexandria (10–70 CE) derived the law of reflection by finding the shortest path for light reflecting from a mirror, which results in an angle of reflection equal to the angle of incidence (Fig. 2.2).

2.2 Optimization Revolution: Derivatives and Calculus
The scientific revolution generated significant optimization develop-
ments in the seventeenth and eighteenth centuries that intertwined Fig. 2.2 The law of reflection can be
with other developments in mathematics and physics. derived by minimizing the length of
In the early seventeenth century, Johannes Kepler published a book the light beam.

in which he derived the optimal dimensions of a wine barrel.7 He 7. Kepler, Nova stereometria doliorum
vinariorum (New Solid Geometry of Wine
became interested in this problem when he bought a barrel of wine, Barrels), 1615.
and the merchant charged him based on a diagonal length (see Fig. 2.3).
This outraged Kepler because he realized that the amount of wine could
vary for the same diagonal length, depending on the barrel proportions.
Incidentally, Kepler also formulated an optimization problem when
looking for his second wife, seeking to maximize the likelihood of satis-
faction. This “marriage problem” later became known as the “secretary
problem”, which is an optimal-stopping problem that has since been
solved using dynamic optimization (mentioned in Section 1.4.6 and
discussed in Section 8.5).8
Willebrord Snell discovered the law of refraction in 1621, a formula
that describes the relationship between the angles of incidence and
Fig. 2.3 Wine barrels were measured
refraction when light passes through a boundary between two different by inserting a ruler in the tap hole
media, such as air, glass, or water. Whereas Hero minimized the length until it hit the corner.
to derive the law of reflection, Snell minimized time. These laws were 8. Ferguson, Who solved the secretary
generalized by Fermat in the principle of least time (or Fermat’s principle), problem? 1989.

which states that a ray of light going from one point to another follows
the path that takes the least time.

Pierre de Fermat derived Snell’s law by applying the principle of


least time, and in the process, he devised a mathematical technique for
finding maxima and minima using what amounted to derivatives (he
missed the opportunity to generalize the notion of derivative, which
came later in the development of calculus).9 Today, we learn about 9. Fermat, Methodus ad disquirendam
maximam et minimam (Method for the Study
derivatives before learning about optimization, but Fermat did the of Maxima and Minima), 1636.
reverse.
During this period, optimization was not yet considered an im-
portant area of mathematics, and contributions to optimization were
scattered among other areas. Therefore, most mathematicians did not
appreciate seminal contributions in optimization at the time.
In 1669, Isaac Newton wrote about a numerical technique to find
the roots of polynomials by successively linearizing them, achieving
quadratic convergence. In 1687, he used this technique to find the
roots of a nonpolynomial equation (Kepler’s equation),∗ but only after using polynomial expansions. In 1690, Joseph Raphson improved on Newton’s method by keeping all the decimals in each linearization and making it a fully iterative scheme.

∗ Kepler’s equation describes orbits by $E - e \sin(E) = M$, where $M$ is the mean anomaly, $e$ is the eccentricity, and $E$ is the eccentric anomaly. This equation does not have a closed-form solution for $E$.

The multivariable “Newton’s method”
that is widely used today was actually introduced in 1740 by Thomas
Simpson. He generalized the method by using the derivatives (which
allowed for solving nonpolynomial equations without expansions) and
by extending it to a system of two equations and two unknowns.†

† For this reason, Kollerstrom10 argues that the method should be called neither Newton nor Newton–Raphson.
10. Kollerstrom, Thomas Simpson and ‘Newton’s method of approximation’: an enduring myth, 1992.

In 1685, Newton studied a shape optimization problem where he sought the shape of a body of revolution that minimizes fluid drag and even mentioned a possible application to ship design. Although he used the wrong model for computing the drag, he correctly solved what amounted to a calculus of variations problem.
he used the wrong model for computing the drag, he correctly solved
what amounted to a calculus of variations problem.
In 1696, Johann Bernoulli challenged other mathematicians to find the path of a body subject to gravity that minimizes the travel time between two points of different heights. This is now a classic calculus of variations problem called the brachistochrone problem (Fig. 2.4). Bernoulli already had a solution that he kept secret. Five mathematicians responded with solutions: Newton, Jakob Bernoulli (Johann’s brother), Gottfried Wilhelm von Leibniz, Ehrenfried Walther von Tschirnhaus, and Guillaume de l’Hôpital. Newton reportedly started working on the problem as soon as he received it and stayed up all night before sending the solution anonymously to Bernoulli the next day.

Fig. 2.4 Suppose that you have a bead on a wire that goes from 𝐴 to 𝐵. The brachistochrone curve is the shape of the wire that minimizes the time for the bead to slide between the two points under gravity alone. It is faster than a straight-line trajectory or a circular arc.

Starting in 1736, Leonhard Euler derived the general optimality conditions for solving calculus of variations problems, but the derivation included geometric arguments. In 1755, Joseph-Louis Lagrange used a purely analytic approach to derive the same optimality conditions (he was 19 years old at the time!). Euler recognized Lagrange’s derivation,

which uses variations of a function, as a superior approach and adopted


it, calling it “calculus of variations”. This is a second-order partial
differential equation that has become known as the Euler–Lagrange
equation. Lagrange used this equation to develop a reformulation
of classical mechanics in 1788, which became known as Lagrangian
mechanics. When deriving the general equations of equilibrium for
problems with constraints, Lagrange introduced the “method of the
multipliers”.11 Lagrange multipliers eventually became a fundamental 11. Lagrange, Mécanique analytique, 1788.

concept in constrained optimization (see Section 5.3).


In 1784, Gaspard Monge developed a geometric method to solve
a transportation problem. Although the method was not entirely
correct, it established combinatorial optimization, a branch of discrete
optimization (Chapter 8).

2.3 The Birth of Optimization Algorithms

Several more theoretical contributions related to optimization occurred


in the nineteenth century and the early twentieth century. However, it
was not until the 1940s that optimization started to gain traction with
the development of algorithms and their use in practical applications,
thanks to the advent of computer hardware.
In 1805, Adrien-Marie Legendre described the method of least
squares, which was used to predict asteroid orbits and for curve fitting.
Friedrich Gauss published a rigorous mathematical foundation for the
method of least squares and claimed he used it to predict the orbit of
the asteroid Ceres in 1801. Legendre and Gauss engaged in a bitter
dispute on who first developed the method.
In one of his 789 papers, Augustin-Louis Cauchy proposed the
steepest-descent method for solving systems of nonlinear equations.12 12. Cauchy, Méthode générale pour la réso-
lution des systèmes d’équations simultanées,
He did not seem to put much thought into it and promised a “paper 1847.
to follow” on the subject, which never happened. He proposed this
method for solving systems of nonlinear equations, but it is directly
applicable to unconstrained optimization (see Section 4.4.1).
In 1902, Gyula Farkas proved a theorem on the solvability of a system
of linear inequalities. This became known as Farkas’ lemma, which is
crucial in the derivation of the optimality conditions for constrained
problems (see Section 5.3.2). In 1917, Harris Hancock published the first
textbook on optimization, which included the optimality conditions for
multivariable unconstrained and constrained problems.13 13. Hancock, Theory of Minima and Max-
ima, 1917.
In 1932, Karl Menger presented “the messenger problem”,14 an
14. Menger, Das botenproblem, 1932.
optimization problem that seeks to minimize the shortest travel path
that connects a set of destinations, observing that going to the closest

point each time does not, in general, result in the shortest overall path.
This is a combinatorial optimization problem that later became known
as the traveling salesperson problem, one of the most intensively studied
problems in optimization (Chapter 8).
In 1939, William Karush derived the necessary conditions for in-
equality constrained problems in his master’s thesis. His approach
generalized the method of Lagrange multipliers, which only allowed
for equality constraints. Harold Kuhn and Albert Tucker independently
rediscovered these conditions and published their seminal paper in
1951.15 These became known as the Karush–Kuhn–Tucker (KKT) condi- 15. Karush, Minima of functions of several
variables with inequalities as side constraints,
tions, which constitute the foundation of gradient-based constrained 1939.
optimization algorithms (see Section 5.3).
Leonid Kantorovich developed a technique to solve linear program-
ming problems in 1939 after having been given the task of optimizing
production in the Soviet government’s plywood industry. However,
his contribution was neglected for ideological reasons. In the United
States, Tjalling Koopmans rediscovered linear programming in the
early 1940s when working on ship-transportation problems. In 1947,
George Dantzig published the first complete algorithm for solving linear
programming problems—the simplex algorithm.16 In the same year, 16. Dantzig, Linear programming and
extensions, 1998.
von Neumann developed the theory of duality for linear programming
problems. Kantorovich and Koopmans later shared the 1975 Nobel
Memorial Prize in Economic Sciences “for their contributions to the
theory of optimum allocation of resources”. Dantzig was not included,
presumably because his work was more theoretical. The development
of the simplex algorithm and the widespread practical applications of
linear programming sparked a revolution in optimization. The first
international conference on optimization, the International Symposium
on Mathematical Programming, was held in Chicago in 1949.
In 1951, George Box and Kenneth Wilson developed the response-
surface methodology (surrogate modeling), which enables optimization
of systems based on experimental data (as opposed to a physics-based
model). They developed a method to build a quadratic model where
the number of data points scales linearly with the number of inputs
instead of exponentially, striking a balance between accuracy and ease of
application. In the same year, Danie Krige developed a surrogate model
based on a stochastic process, which is now known as the kriging model.
He developed this model in his master’s thesis to estimate the most likely
distribution of gold based on a limited number of borehole samples.17 17. Krige, A statistical approach to some
mine valuation and allied problems on the
These approaches are foundational in surrogate-based optimization Witwatersrand, 1951.
(Chapter 10).
In 1952, Harry Markowitz published a paper on portfolio theory

that formalized the idea of investment diversification, marking the birth


of modern financial economics.18 The theory is based on a quadratic 18. Markowitz, Portfolio selection, 1952.

optimization problem. He received the 1990 Nobel Memorial Prize in


Economic Sciences for developing this theory.
In 1955, Lester Ford and Delbert Fulkerson created the first known
algorithm to solve the maximum-flow problem, which has applications
in transportation, electrical circuits, and data transmission. Although
the problem could already be solved with the simplex algorithm, they
proposed a more efficient algorithm for this specialized problem.
In 1957, Richard Bellman derived the necessary optimality condi-
tions for dynamic programming problems.19 These are expressed in 19. Bellman, Dynamic Programming, 1957.

what became known as the Bellman equation (Section 8.5), which was
first applied to engineering control theory and subsequently became a
core principle in economic theory.
In 1959, William Davidon developed the first quasi-Newton method
for solving nonlinear optimization problems that rely on approxi-
mations of the curvature based on gradient information. He was
motivated by his work at Argonne National Laboratory, where he
used a coordinate-descent method to perform an optimization that
kept crashing the computer before converging. Although Davidon’s
approach was a breakthrough in nonlinear optimization, his original
paper was rejected. It was eventually published more than 30 years
later in the first issue of the SIAM Journal on Optimization.20 Fortunately, 20. Davidon, Variable metric method for
minimization, 1991.
his valuable insight had been recognized well before that by Roger
Fletcher and Michael Powell, who further developed the method.21 The 21. Fletcher and Powell, A rapidly con-
vergent descent method for minimization,
method became known as the Davidon–Fletcher–Powell (DFP) method 1963.
(Section 4.4.4).
Another quasi-Newton approximation method was independently
proposed in 1970 by Charles Broyden, Roger Fletcher, Donald Goldfarb,
and David Shanno, now called the Broyden–Fletcher–Goldfarb–Shanno
(BFGS) method. Larry Armijo, A. Goldstein, and Philip Wolfe developed
the conditions for the line search that ensure convergence in gradient-
based methods (see Section 4.3.2).22 22. Wolfe, Convergence conditions for
ascent methods, 1969.
Leveraging the developments in unconstrained optimization, re-
searchers sought methods for solving constrained problems. Penalty
and barrier methods were developed but fell out of favor because
of numerical issues (see Section 5.4). In another effort to solve non-
linear constrained problems, Robert Wilson proposed the sequential 23. Wilson, A simplicial algorithm for
concave programming, 1963.
quadratic programming (SQP) method in his PhD thesis.23 SQP consists
24. Han, Superlinearly convergent variable
of applying the Newton method to solve the KKT conditions (see Sec- metric algorithms for general nonlinear
programming problems, 1976.
tion 5.5). Shih-Ping Han reinvented SQP in 1976,24 and Michael Powell
25. Powell, Algorithms for nonlinear con-
popularized this method in a series of papers starting from 1977.25 straints that use Lagrangian functions, 1978.

There were attempts to model the natural process of evolution


starting in the 1950s. In 1975, John Holland proposed genetic algorithms
(GAs) to solve optimization problems (Section 7.6).26 Research in GAs 26. Holland, Adaptation in Natural and
Artificial Systems, 1975.
increased dramatically after that, thanks in part to the exponential
increase in computing power.
Hooke and Jeeves27 proposed a gradient-free method called coor- 27. Hooke and Jeeves, ‘Direct search’ so-
lution of numerical and statistical problems,
dinate search. In 1965, Nelder and Mead28 developed the nonlinear 1961.
simplex method, another gradient-free nonlinear optimization based 28. Nelder and Mead, A simplex method
for function minimization, 1965.
on heuristics (Section 7.3).∗
∗ The Nelder–Mead algorithm has no con-
The Mathematical Programming Society was founded in 1973, an nection to the simplex algorithm for linear
international association for researchers active in optimization. It was programming problems mentioned earlier.
renamed the Mathematical Optimization Society in 2010 to reflect the
more modern name for the field.
Narendra Karmarkar presented a revolutionary new method in
1984 to solve large-scale linear optimization problems as much as a
hundred times faster than the simplex method.29 The New York Times 29. Karmarkar, A new polynomial-time
algorithm for linear programming, 1984.
published a related news item on the front page with the headline
“Breakthrough in Problem Solving”. This heralded the age of interior-
point methods, which are related to the barrier methods dismissed in
the 1960s. Interior-point methods were eventually adapted to solve
nonlinear problems (see Section 5.6) and contributed to the unification
of linear and nonlinear optimization.

2.4 The Last Decades

The relentless exponential increase in computer power throughout


the 1980s and beyond has made it possible to perform engineering
design optimization with increasingly sophisticated models, including
multidisciplinary models. The increased computer power has also
been contributing to the gain in popularity of heuristic optimization
algorithms. Computer power has also enabled large-scale deep neural
networks (see Section 10.5), which have been instrumental in the
explosive rise of artificial intelligence (AI).
The field of optimal control flourished after Bellman’s contribution
to dynamic programming. Another important optimality principle for
control, the maximum principle, was derived by Pontryagin et al.30 This 30. Pontryagin et al., The Mathematical
Theory of Optimal Processes, 1961.
principle makes it possible to transform a calculus of variations problem
into a nonlinear optimization problem. Gradient-based nonlinear
optimization algorithms were then used to numerically solve for the
optimal trajectories of rockets and aircraft, with an adjoint method
to compute the gradients of the objective with respect to the control
histories.31 The adjoint method efficiently computes gradients with 31. Bryson Jr, Optimal control—1950 to
1985, 1996.

respect to large numbers of variables and has proven to be useful in other


disciplines. Optimal control then expanded to include the optimization
of feedback control laws that guarantee closed-loop stability. Optimal
control approaches include model predictive control, which is widely
used today.
In 1960, Schmit32 proposed coupling numerical optimization with 32. Schmit, Structural design by systematic
synthesis, 1960.
structural computational models to perform structural design, establish-
ing the field of structural optimization. Five years later, he presented
applications, including aerodynamics and structures, representing
the first known multidisciplinary design optimization (MDO) appli-
cation.33 The direct method for computing gradients for structural 33. Schmit and Thornton, Synthesis of an
airfoil at supersonic Mach number, 1965.
computational models was developed shortly after that,34 eventually
34. Fox, Constraint surface normals for
followed by the adjoint method (Section 6.7).35 In this early work, the structural synthesis techniques, 1965.
design variables were the cross-sectional areas of the members of a truss 35. Arora and Haug, Methods of design
structure. Researchers then added joint positions to the set of design sensitivity analysis in structural optimiza-
tion, 1979.
variables. Structural optimization was generalized further with shape
optimization, which optimizes the shape of arbitrary three-dimensional
structural parts.36 Another significant development was topology op- 36. Haftka and Grandhi, Structural shape
optimization—A survey, 1986.
timization, where a structural layout emerges from a solid block of
material.37 It took many years of further development in algorithms and 37. Eschenauer and Olhoff, Topology
optimization of continuum structures: A
computer hardware for structural optimization to be widely adopted review, 2001.
by industry, but this capability has now made its way to commercial
software.
Aerodynamic shape optimization began when Pironneau38 used 38. Pironneau, On optimum design in fluid
mechanics, 1974.
optimal control techniques to minimize the drag of a body by varying
its shape (the “control” variables). Jameson39 extended the adjoint 39. Jameson, Aerodynamic design via
control theory, 1988.
method with more sophisticated computational fluid dynamics (CFD)
models and applied it to aircraft wing design. CFD-based optimization
applications spread beyond aircraft wing design to the shape optimiza-
tion of wind turbines, hydrofoils, ship hulls, and automobiles. The
adjoint method was then generalized for any discretized system of
equations (see Section 6.7).
MDO developed rapidly in the 1980s following the application
of numerical optimization techniques to structural design. The first
conference in MDO, the Multidisciplinary Analysis and Optimization
Conference, took place in 1985. The earliest MDO applications focused
on coupling the aerodynamics and structures in wing design, and 40. Sobieszczanski–Sobieski and Haftka,
Multidisciplinary aerospace design opti-
other early applications integrated structures and controls.40 The de- mization: Survey of recent developments,
1997.
velopment of MDO methods included efforts toward decomposing the
41. Martins and Lambe, Multidisciplinary
problem into optimization subproblems, leading to distributed MDO design optimization: A survey of architec-
architectures.41 Sobieszczanski–Sobieski42 proposed a formulation for tures, 2013.
42. Sobieszczanski–Sobieski, Sensitivity of
computing the derivatives for coupled systems, which is necessary complex, internally coupled systems, 1990.

when performing MDO with gradient-based optimizers. This concept


was later combined with the adjoint method for the efficient computa-
tion of coupled derivatives.43 More recently, efficient computation of 43. Martins et al., A coupled-adjoint sen-
sitivity analysis method for high-fidelity
coupled derivatives and hierarchical solvers have made it possible to aero-structural design, 2005.
solve large-scale MDO problems44 (Chapter 13). Engineering design 44. Hwang and Martins, A computational
has been focusing on achieving improvements made possible by con- architecture for coupling heterogeneous
numerical models and computing coupled
sidering the interaction of all relevant disciplines. MDO applications derivatives, 2018.
have extended beyond aircraft to the design of bridges, buildings,
automobiles, ships, wind turbines, and spacecraft.
In continuous nonlinear optimization, SQP has remained the state-
of-the-art approach since its popularization in the late 1970s. However,
the interior-point approach, which, as mentioned previously, revolu-
tionized linear optimization, was successfully adapted for the solution
of nonlinear problems and has made great strides since the 1990s.45
Today, both SQP and interior-point methods are considered to be state
of the art.
45. Wright, The interior-point revolution in optimization: History, recent developments, and lasting consequences, 2005.
Interior-point methods have contributed to the connection between
linear and nonlinear optimization, which were treated as entirely
separate fields before 1984. Today, state-of-the-art linear optimization
software packages have options for both the simplex and interior-point
approaches because the best approach depends on the problem.
Convex optimization emerged as a generalization of linear optimization
(Chapter 11). Like linear optimization, it was initially mostly used
in operations research applications,∗ such as transportation, manufacturing,
supply-chain management, and revenue management, and there
were only a few applications in engineering. Since the 1990s, convex
optimization has increasingly been used in engineering applications,
including optimal control, signal processing, communications, and
circuit design. A disciplined convex programming methodology facilitated
this expansion by providing a systematic way to construct convex problems
and convert them to a solvable form.46 New classes of convex optimization
problems have also been developed, such as geometric programming (see
Section 11.6), semidefinite programming, and second-order cone programming.
∗ The field of operations research was established in World War II to aid in making better strategic decisions.
46. Grant et al., Disciplined convex programming, 2006.
As mathematical models became increasingly complex computer
programs, and given the need to differentiate those models when per-
forming gradient-based optimization, new methods were developed
to compute derivatives. Wengert47 was among the first to propose the
automatic differentiation of computer programs (or algorithmic differentiation).
The reverse mode of algorithmic differentiation, which is
equivalent to the adjoint method, was proposed later (see Section 6.6).48
This field has evolved immensely since then, with techniques to handle
more functions and increase efficiency.
47. Wengert, A simple automatic derivative evaluation program, 1964.
48. Speelpenning, Compiling fast partial derivatives of functions given by algorithms, 1980.
Algorithmic differentiation tools

have been developed for an increasing number of programming lan-


guages. One of the more recently developed programming languages,
Julia, features prominent support for algorithmic differentiation. At
the same time, algorithmic differentiation has spread to a wide range
of applications.
Another technique to compute derivatives numerically, the complex-
step derivative approximation, was proposed by Squire and Trapp.49
Soon after, this technique was generalized to computer programs,
applied to CFD, and found to be related to algorithmic differentiation
(see Section 6.5).50
49. Squire and Trapp, Using complex variables to estimate derivatives of real functions, 1998.
50. Martins et al., The complex-step derivative approximation, 2003.
The pattern-search algorithms that Hooke and Jeeves and Nelder
and Mead developed were disparaged by applied mathematicians,
who preferred the rigor and efficiency of the gradient-based methods
developed soon after that. Nevertheless, they were further developed
and remain popular with engineering practitioners because of their
simplicity. Pattern-search methods experienced a renaissance in the 1990s
with the development of convergence proofs that added mathematical
rigor and the availability of more powerful parallel computers.51 Today,
pattern-search methods (Section 7.4) remain a useful option, sometimes
one of the only options, for certain types of optimization problems.
51. Torczon, On the convergence of pattern search algorithms, 1997.
Global optimization algorithms also experienced further develop-
ments. Jones et al.52 developed the DIRECT algorithm, which uses
a rigorous approach to find the global optimum (Section 7.5). This
seminal development was followed by various extensions and improvements.53
52. Jones et al., Lipschitzian optimization without the Lipschitz constant, 1993.
53. Jones and Martins, The DIRECT algorithm—25 years later, 2021.
The first genetic algorithms started the development of the broader
class of evolutionary optimization algorithms inspired by natural and
societal processes. Optimization by simulated annealing (Section 8.6)
represents one of the early examples of this broader perspective.54
Another example is particle swarm optimization (Section 7.7).55
54. Kirkpatrick et al., Optimization by simulated annealing, 1983.
55. Kennedy and Eberhart, Particle swarm optimization, 1995.
Since then, there has been an explosion in the number of evolutionary
algorithms, inspired by any process imaginable (see the sidenote at
the end of Section 7.2 for a partial list). Evolutionary algorithms
have remained heuristic and have not experienced the mathematical
treatment applied to pattern-search methods.
There has been a sustained interest in surrogate models (also known
as metamodels) since the seminal contributions in the 1950s. Kriging
surrogate models are still used and have been the focus of many
improvements, but new techniques, such as radial-basis functions, have
also emerged.56 Surrogate-based optimization is now an area of active
research (Chapter 10).
56. Forrester and Keane, Recent advances in surrogate-based optimization, 2009.
AI has experienced a revolution in the last decade and is connected

to optimization in several ways. The early AI efforts focused on solving


problems that could be described formally using logic and decision
trees. A design optimization problem statement can be viewed as an
example of a formal logic description. Since the 1980s, AI has focused
on machine learning, which uses algorithms and statistics to learn from
data. In the 2010s, machine learning rose explosively owing to the
development of deep learning neural networks, the availability of large
data sets for training the neural networks, and increased computer
power. Today, machine learning solves problems that are difficult to
describe formally, such as face and speech recognition. Deep learning
neural networks learn to map a set of inputs to a set of outputs based
on training data and can be viewed as a type of surrogate model
(see Section 10.5). These networks are trained using optimization
algorithms that minimize the loss function (analogous to model error),
but they require specialized optimization algorithms such as stochastic
gradient descent.57 The gradients for such problems are efficiently
computed with backpropagation, a specialization of the reverse mode
of algorithmic differentiation (AD) (see Section 6.6).58
57. Bottou et al., Optimization methods for large-scale machine learning, 2018.
58. Baydin et al., Automatic differentiation in machine learning: A survey, 2018.

2.5 Toward a Diverse Future

In the history of optimization, there is a glaring lack of diversity in ge-


ography, culture, gender, and race. Many contributions to mathematics
have more diverse origins. This section is just a brief comment on this
diversity and is not meant to be comprehensive. For a deeper analysis
of the topics mentioned here, please see the cited references and other
specialized bibliographies.
One of the oldest known mathematical objects is the Ishango bone,
which originates from Africa and shows the construction of a numeral
system.59 Ancient Egyptians and Babylonians had a profound influence
on ancient Greek mathematics. The Mayan civilization developed a
sophisticated counting system that included zero and made precise
astronomical observations to measure the solar year’s length accurately.60
In China, a textbook called Nine Chapters on the Mathematical Art, the
compilation of which started in 200 BCE, includes a guide on solving
equations using a matrix-based method.61 The word algebra derives
from a book entitled Al-jabr wa’l muqabalah by the Persian mathematician
al-Khwarizmi in the ninth century, the title of which originated
the term algorithm.62 Finally, some of the basic components of calculus
were discovered in India 250 years before Newton’s breakthroughs.63
59. Gerdes, On mathematics in the history of sub-Saharan Africa, 1994.
60. Closs, Native American Mathematics, 1986.
61. Shen et al., The Nine Chapters on the Mathematical Art: Companion and Commentary, 1999.
62. Hodgkin, A History of Mathematics: From Mesopotamia to Modernity, 2005.
63. Joseph, The Crest of the Peacock: Non-European Roots of Mathematics, 2010.

We also must recognize that there has been, and still is, a gender
gap in science, engineering, and mathematics that has prevented

women from having the same opportunities as men. The first known
female mathematician, Hypatia, lived in Alexandria (Egypt) in the
fourth century and was brutally murdered for political motives. In
the eighteenth century, Sophie Germain corresponded with famous
mathematicians under a male pseudonym to avoid gender bias. She
could not get a university degree because she was female but was
nevertheless a pioneer in elasticity theory. Ada Lovelace famously
wrote the first computer program in the nineteenth century.64 In the late
nineteenth century, Sofia Kovalevskaya became the first woman to obtain
a doctorate in mathematics but had to be tutored privately because
she was not allowed to attend lectures. Similarly, Emmy Noether, who
made many fundamental contributions to abstract algebra in the early
twentieth century, had to overcome rules that prevented women from
enrolling in universities and being employed as faculty.65
64. Hollings et al., Ada Lovelace: The Making of a Computer Scientist, 2014.
65. Osen, Women in Mathematics, 1974.

In more recent history, many women made crucial contributions in


computer science. Grace Hopper invented the first compiler and influ-
enced the development of the first high-level programming language
(COBOL). Lois Haibt was part of a small team at IBM that developed
Fortran, an extremely successful programming language that is still
used today. Frances Allen was a pioneer in optimizing compilers (an
altogether different type of optimization from the topic in this book)
and was the first woman to win the Turing Award. Finally, Margaret
Hamilton was the director of a laboratory that developed the flight
software for NASA’s Apollo program and coined the term software
engineering.
Many other researchers have made key contributions despite facing
discrimination. One of the most famous examples is that of mathe-
matician and computer scientist Alan Turing, who was prosecuted in
1952 for having a relationship with another man. His punishment was
chemical castration, which he endured for a time but which ultimately led
him to commit suicide at the age of 41.66
66. Hodges, Alan Turing: The Enigma, 2014.
Some races and ethnicities have been historically underrepresented
in science, engineering, and mathematics. One of the most apparent
disparities has been the lack of representation of African Americans in
the United States in these fields. This underrepresentation is a direct
result of slavery and, among other factors, segregation, redlining, and
anti-black racism.67,68 In the eighteenth-century United States, Benjamin
Banneker, a free African American who was a self-taught mathematician
and astronomer, corresponded directly with Thomas Jefferson and
successfully challenged the morality of the U.S. government’s views on
race and humanity.69
67. Lipsitz, How Racism Takes Place, 2011.
68. Rothstein, The Color of Law: A Forgotten History of How Our Government Segregated America, 2017.
69. King, More than slaves: Black founders, Benjamin Banneker, and critical intellectual agency, 2014.
Historically black colleges and universities were
established in the United States after the American Civil War because

African Americans were denied admission to traditional institutions.


In 1925, Elbert Frank Cox was the first black man to get a PhD in
mathematics, and he then became a professor at Howard University.
Katherine Johnson and fellow female African American mathematicians
Dorothy Vaughan and Mary Jackson played a crucial role in the U.S.
space program despite the open prejudice they had to overcome.70
70. Shetterly, Hidden Figures: The American Dream and the Untold Story of the Black Women Who Helped Win the Space Race, 2016.

“Talent is equally distributed, opportunity is not.”∗ The arc of
recent history has bent toward more diversity and equity,† but it takes
concerted action to bend it. We have much more work to do before
everyone has the same opportunity to contribute to our scientific
progress. Only when that is achieved can we unleash the true potential
of humankind.
∗ Variations of this quote abound; this one is attributed to social entrepreneur Leila Janah.
† A rephrasing of Martin Luther King Jr.’s quote: “The arc of the moral universe is long, but it bends toward justice.”

2.6 Summary

The history of optimization is as old as human civilization and has had


many twists and turns. Ancient geometric optimization problems that
were correctly solved by intuition required mathematical developments
that were only realized much later. The discovery of calculus laid the
foundations for optimization. Computer hardware and algorithms then
enabled the development and deployment of numerical optimization.
Numerical optimization was first motivated by operations research
problems but eventually made its way into engineering design. Soon
after numerical models were developed to simulate engineering systems,
the idea arose to couple those models to optimization algorithms in
an automated cycle to optimize the design of such systems. The
first application was in structural design, but many other engineering
design applications followed, including applications coupling multiple
disciplines, establishing MDO. Whenever a new numerical model
becomes fast enough and sufficiently robust, there is an opportunity to
integrate it with numerical optimization to go beyond simulation and
perform design optimization.
Many insightful connections have been made in the history of
optimization, and the trend has been to unify the theory and methods.
There are no doubt more connections and contributions to be made—
hopefully from a more diverse research community.
3 Numerical Models and Solvers
In the introductory chapter, we discussed function characteristics from
the point of view of the function’s output—the black-box view shown in
Fig. 1.16. Here, we discuss how the function is modeled and computed.
The better your understanding of the model and the more access you
have to its details, the more effectively you can solve the optimization
problem. We explain the errors involved in the modeling process so
that we can interpret optimization results correctly.

By the end of this chapter you should be able to:

1. Identify different types of numerical errors and understand


the limitations of finite-precision arithmetic.
2. Estimate an algorithm’s rate of convergence.
3. Use Newton’s method to solve systems of equations.

3.1 Model Development for Analysis versus Optimization

A good understanding of numerical models and solvers is essential


because numerical optimization demands more from the models and
solvers than does pure analysis. In an analysis or a parametric study, we
may cycle through a range of plausible designs. However, optimization
algorithms seek to explore the design space, and therefore, intermediate
evaluations may use atypical design variable combinations. The
mathematical model, numerical model, and solver must be robust
enough to handle these design variable combinations.
A related issue is that an optimizer exploits errors in ways an engi-
neer would not do in analysis. For example, consider the aerodynamic
analysis of a car. In a parametric study, we might try a dozen designs,
compare the drag, and choose the best one. If we passed this procedure
to an optimizer, it might flatten the car to zero height (the minimum
drag solution) if there are no explicit constraints on interior volume
or structural integrity. Thus, we often need to develop additional


models for optimization. A designer often considers some of these


requirements implicitly and approximately, but we need to model these
requirements explicitly and pose them as constraints in optimization.
Another consideration that affects both the mathematical and the
numerical model is the overall computational cost of optimization. An
analysis might only be run dozens of times, whereas an optimization
often runs the analysis thousands of times. This computational cost
can affect the level of fidelity or discretization we can afford to use.
The level of precision desirable for analysis is often insufficient
for optimization. In an analysis, a few digits of precision may be
sufficient. However, using fewer significant digits limits the types
of optimization algorithms we can employ effectively. Convergence
failures can cause premature termination of algorithms. Noisy outputs
can mislead or terminate an optimization prematurely. A common
source of these errors involves programs that work through input
and output files (see Tip 6.1). Even though the underlying code may
use double-precision arithmetic, output files rarely include all the
significant digits (another separate issue is that reading and writing
files at every iteration considerably slows down the analysis).
Another common source of errors involves converging systems of
equations, as discussed later in this chapter. Optimization generally
requires tighter tolerances than are used for analysis. Sometimes this
is as easy as changing a default tolerance, but other times we need to
rethink the solvers.

3.2 Modeling Process and Types of Errors

Design optimization problems usually involve modeling a physical


system to compute the objective and constraint function values for a
given design. Figure 3.1 shows the steps in the modeling process. Each
of these steps in the modeling process introduces errors.
The physical system represents the reality that we want to model. The
mathematical model can range from simple mathematical expressions
to continuous differential or integral equations for which closed-form
solutions over an arbitrary domain are not possible. Modeling errors
are introduced in the idealizations and approximations performed in
the derivation of the mathematical model. The errors involved in the
rest of the process are numerical errors, which we detail in Section 3.5.
In Section 3.3, we discuss mathematical models in more detail and
establish the notation for representing them.
When a mathematical model is given by differential or integral
equations, we must discretize the continuous equations to obtain the

numerical model. Section 3.4 provides a brief overview of the dis-


cretization process, and Section 3.5.2 defines the associated errors.
The numerical model must then be programmed using a computer
language to develop a numerical solver. Because this process is susceptible
to human error, we discuss strategies for addressing such errors in
Section 3.5.4.
Finally, the solver computes the system state variables using finite-precision
arithmetic, which introduces roundoff errors (see Section 3.5.1).
Section 3.6 includes a brief overview of solvers, and we dedicate a separate
section to Newton-based solvers in Section 3.8 because they are
used later in this book.
The total error in the modeling process is the sum of the modeling
errors and numerical errors. Validation and verification processes
quantify and reduce these errors. Verification ensures that the model
and solver are correctly implemented so that there are no errors in
the code. It also ensures that the errors introduced by discretization
and numerical computations are acceptable. Validation compares
the numerical results with experimental observations of the physical
system, which are themselves subject to experimental errors. By making
these comparisons, we can validate the modeling step of the process
and ensure that the mathematical model idealizations and assumptions
are acceptable.
Fig. 3.1 Physical problems are modeled and then solved numerically to produce function values (steps: physical system, mathematical model, numerical model, solver, finite-precision states).
Modeling and numerical errors relate directly to the concepts of
precision and accuracy. An accurate solution compares well with the
actual physical system (validation), whereas a precise solution means
that the model is programmed and solved correctly (verification).
It is often said that “all models are wrong, but some are useful”.71
71. Box, Science and statistics, 1976.
Because there are always errors involved, we must prioritize which
aspects of a given model should be improved to reduce the overall
error. When developing a new model, we should start with the simplest
model that includes the system’s dominant behavior. Then, we might
selectively add more detail as needed. One common pitfall in numerical
modeling is to confuse precision with accuracy. Increasing precision by
reducing the numerical errors is usually desirable. However, when we
look at the bigger picture, the model might have limited utility if the
modeling errors are more significant than the numerical errors.

Example 3.1 Modeling a structure

As an example of a physical system, consider the timber roof truss structure


shown in Fig. 3.2. A typical mathematical model of such a structure idealizes
the wood as a homogeneous material and the joints as pinned. It is also
common to assume that the loads are applied only at the joints and that the

structure’s weight does not contribute to the loading. Finally, the displacements
are assumed to be small relative to the dimensions of the truss members.
The structure is discretized by pinned bar elements. The discrete governing
equations for any truss structure can be derived using the finite-element method.
This leads to the linear system

𝐾𝑢 = 𝑞 ,

where 𝐾 is the stiffness matrix, 𝑞 is the vector of applied loads, and 𝑢 represents
the displacements that we want to compute. At each joint, there are two degrees
of freedom (horizontal and vertical) that describe the displacement and applied
force. Because there are 9 joints, each with 2 degrees of freedom, the size of
this linear system is 18.

3.3 Numerical Models as Residual Equations

Mathematical models vary significantly in complexity and scale. In


the simplest case, a model can be represented by one or more explicit
functions, which are easily coded and computed. Many examples in
this book use explicit functions for simplicity. In practice, however,
many numerical models are defined by implicit equations.
Implicit equations can be written in the residual form as

𝑟 𝑖 (𝑢1 , . . . , 𝑢𝑛 ) = 0, 𝑖 = 1, . . . , 𝑛 , (3.1)

where 𝑟 is a vector of residuals that has the same size as the vector of
state variables 𝑢. The equations defining the residuals could be any
expression that can be coded in a computer program. No matter how
complex the mathematical model, it can always be written as a set of
equations in this form, which we write more compactly as 𝑟(𝑢) = 0.
Finding the state variables that satisfy this set of equations requires
a solver, as illustrated in Fig. 3.3. We review the various types of solvers
in Section 3.6. Solving a set of implicit equations is more costly than
computing explicit functions, and it is typically the most expensive step
in the optimization cycle.
Fig. 3.3 Numerical models use a solver to find the state variables u that satisfy the governing equations, such that r(u) = 0.
Mathematical models are often referred to as governing equations,
which determine the state (u) of a given physical system at specific
conditions. Many governing equations consist of differential equations,
which require discretization. The discretization process yields implicit
equations that can be solved numerically (see Section 3.4). After
discretization, the governing equations can always be written as a set
of residuals, 𝑟(𝑢) = 0.

Example 3.2 Implicit and explicit equations in structural analysis

The linear system from Ex. 3.1 is an example of a system of implicit equations,
which we can write as a set of residuals by moving the right-hand-side vector
to the left to obtain
𝑟(𝑢) = 𝐾𝑢 − 𝑞 = 0 ,
where 𝑢 represents the state variables. Although the solution for 𝑢 could be
written as an explicit function, 𝑢 = 𝐾 −1 𝑞 , this is usually not done because it
is computationally inefficient and intractable for large-scale systems. Instead,
we use a linear solver that does not explicitly form the inverse of the stiffness
matrix (see Appendix B).
In addition to computing the displacements, we might also want to compute
the axial stress (𝜎) in each of the 15 truss members. This is an explicit function
of the displacements, which is given by the linear relationship

𝜎 = 𝑆𝑢 ,

where 𝑆 is a (15 × 18) matrix.
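
As a minimal illustration of this pattern in code (our own sketch, not part of the book's example), the following solves a made-up 3-degree-of-freedom system instead of the actual 18-degree-of-freedom truss; the entries of K, q, and S are placeholder values.

```python
import numpy as np

# Hypothetical small system standing in for the truss of Ex. 3.1
K = np.array([[ 4.0, -1.0,  0.0],
              [-1.0,  4.0, -1.0],
              [ 0.0, -1.0,  4.0]])   # stiffness matrix (placeholder values)
q = np.array([1.0, 0.0, 2.0])        # applied loads (placeholder values)

# Solve the residual equation r(u) = K u - q = 0 without forming K^{-1}
u = np.linalg.solve(K, q)

# Explicit post-processing: stresses are a linear function of the displacements
S = np.array([[1.0, -1.0,  0.0],
              [0.0,  1.0, -1.0]])    # placeholder stress-recovery matrix
sigma = S @ u

print("residual norm:", np.linalg.norm(K @ u - q))
print("stresses:", sigma)
```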

We can still use the residual notation to represent explicit functions
to write all the functions in a model (implicit and explicit) as 𝑟(𝑢) = 0
without loss of generality. Suppose we have an implicit system of
equations, 𝑟𝑟 (𝑢𝑟 ) = 0, followed by a set of explicit functions whose
output is a vector 𝑢 𝑓 = 𝑓 (𝑢𝑟 ), as illustrated in Fig. 3.4. We can rewrite
the explicit function as a residual by moving all the terms to one side to
get 𝑟 𝑓 (𝑢𝑟 , 𝑢 𝑓 ) = 𝑓 (𝑢𝑟 ) − 𝑢 𝑓 = 0. Then, we can concatenate the residuals
and variables for the implicit and explicit equations as

    r(u) \equiv \begin{bmatrix} r_r(u_r) \\ f(u_r) - u_f \end{bmatrix} = 0,
    \quad \text{where} \quad
    u \equiv \begin{bmatrix} u_r \\ u_f \end{bmatrix} .    (3.2)

Fig. 3.4 A model with implicit and explicit functions.
The solver arrangement would then be as shown in Fig. 3.5.
Fig. 3.5 Explicit functions can be written in residual form and added to the solver.
Even though it is more natural to just evaluate explicit functions
instead of adding them to a solver, in some cases, it is helpful to use
the residual to represent the entire model with the compact notation,
𝑟(𝑢) = 0. This will be helpful in later chapters when we compute
derivatives (Chapter 6) and solve systems that mix multiple implicit
and explicit sets of equations (Chapter 13).

Example 3.3 Expressing an explicit function as an implicit equation

Suppose we have the following mathematical model:

𝑢1² + 2𝑢2 − 1 = 0
𝑢1 + cos(𝑢1 ) − 𝑢2 = 0
𝑓 (𝑢1 , 𝑢2 ) = 𝑢1 + 𝑢2 .

The first two equations are written in implicit form, and the third equation is
given as an explicit function. The first equation could be manipulated to obtain
an explicit function of either 𝑢1 or 𝑢2 . The second equation does not have a
closed-form solution and cannot be written as an explicit function for 𝑢1 . The
third equation is an explicit function of 𝑢1 and 𝑢2 . In this case, we could solve
the first two equations for 𝑢1 and 𝑢2 using a nonlinear solver and then evaluate
𝑓 (𝑢1 , 𝑢2 ). Alternatively, we can write the whole system as implicit residual
equations by defining the value of 𝑓 (𝑢1 , 𝑢2 ) as 𝑢3 ,

𝑟1 (𝑢1 , 𝑢2 ) = 𝑢1² + 2𝑢2 − 1 = 0
𝑟2 (𝑢1 , 𝑢2 ) = 𝑢1 + cos(𝑢1 ) − 𝑢2 = 0
𝑟3 (𝑢1 , 𝑢2 , 𝑢3 ) = 𝑢1 + 𝑢2 − 𝑢3 = 0 .

Then we can use the same nonlinear solver to solve for all three equations
simultaneously.
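
As an illustration of that last point (this sketch is ours, not part of the original example), a basic Newton iteration can solve all three residual equations simultaneously; the starting guess, iteration limit, and tolerance below are arbitrary choices, and Newton's method itself is described in Section 3.8.

```python
import numpy as np

def residual(u):
    """Residuals of Ex. 3.3, written as r(u) = 0."""
    u1, u2, u3 = u
    return np.array([
        u1**2 + 2 * u2 - 1,
        u1 + np.cos(u1) - u2,
        u1 + u2 - u3,
    ])

def jacobian(u):
    """Partial derivatives dr_i/du_j of the residuals."""
    u1, _, _ = u
    return np.array([
        [2 * u1,          2.0,  0.0],
        [1 - np.sin(u1), -1.0,  0.0],
        [1.0,             1.0, -1.0],
    ])

u = np.array([1.0, 1.0, 1.0])          # arbitrary starting guess
for _ in range(20):                     # arbitrary iteration limit
    r = residual(u)
    if np.linalg.norm(r) < 1e-12:       # arbitrary convergence tolerance
        break
    u = u - np.linalg.solve(jacobian(u), r)   # Newton step

print(u, residual(u))
```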

3.4 Discretization of Differential Equations

Many physical systems are modeled by differential equations defined


over a domain. The domain can be spatial (one or more dimensions),
temporal, or both. When time is considered, then we have a dynamic
model. When a differential equation is defined in a domain with one
degree of freedom (one-dimensional in space or time), then we have an
ordinary differential equation (ODE), whereas any domain defined by
more than one variable results in a partial differential equation (PDE).
Differential equations need to be discretized over the domain to be
solved numerically. There are three main methods for the discretization
of differential equations: the finite-difference method, the finite-volume
method, and the finite-element method. The finite-difference method
approximates the derivatives in the differential equations by the value
of the relevant quantities at a discrete number of points in a mesh (see
Fig. 3.6). The finite-volume method is based on the integral form of the
PDEs. It divides the domain into control volumes called cells (which
also form a mesh), and the integral is evaluated for each cell. The values
of the relevant quantities can be defined either at the centroids of the
cells or at the cell vertices. The finite-element model divides the domain
into elements (which are similar to cells) over which the quantities are
interpolated using predefined shape functions. The states are computed
at specific points in the element that are not necessarily at the element
boundaries. Governing equations can also include integrals, which can
be discretized with quadrature rules.

Fig. 3.6 Discretization methods in one spatial dimension: finite difference (mesh points), finite volume (cells), and finite element (elements).
With any of these discretization methods, the final result is a
set of algebraic equations that we can write as r(u) = 0 and solve
for the state variables u. This is a potentially large set of equations
depending on the domain and discretization (e.g., it is common to
have millions of equations in three-dimensional computational fluid
dynamics problems). The number of state variables of the discretized
model is equal to the number of equations for a complete and well-defined
model. In the most general case, the set of equations could be
implicit and nonlinear.
When a problem involves both space and time, the prevailing approach
is to decouple the discretization in space from the discretization
in time—called the method of lines (see Fig. 3.7). The discretization in
space is performed first, yielding an ODE in time. The time derivative
can then be approximated as a finite difference, leading to a time-integration
scheme.
Fig. 3.7 PDEs in space and time are often discretized in space first to yield an ODE in time (panels: PDE, ODE, fully discretized).
The discretization process usually yields implicit algebraic equations
that require a solver to obtain the solution. However, discretization
in some cases yields explicit equations, in which case a solver is not
required.
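
To show how a discretized model ends up in residual form (our own sketch, not from the text), consider a one-dimensional Poisson-type equation d²u/dz² = b(z) on a uniform mesh with zero end values; central finite differences turn it into the linear residuals r(u) = Au − b = 0. The mesh size and source term below are arbitrary.

```python
import numpy as np

n = 9                      # number of interior mesh points (arbitrary)
h = 1.0 / (n + 1)          # uniform mesh spacing on the domain [0, 1]
z = np.linspace(h, 1 - h, n)
b = np.sin(np.pi * z)      # made-up source term

# Central-difference approximation of d2u/dz2 with u = 0 at both ends
A = (np.diag(-2.0 * np.ones(n)) +
     np.diag(np.ones(n - 1), 1) +
     np.diag(np.ones(n - 1), -1)) / h**2

# The discretized governing equations in residual form, r(u) = A u - b = 0
u = np.linalg.solve(A, b)
print("max residual:", np.max(np.abs(A @ u - b)))
```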
3.5 Numerical Errors

Numerical errors (or computation errors) can be categorized into three


main types: roundoff errors, truncation errors, and errors due to coding.
Numerical errors are involved with each of the modeling steps between
the mathematical model and the states (see Fig. 3.1). The error involved
in the discretization step is a type of truncation error. The errors
introduced in the coding step are not usually discussed as numerical
errors, but we include them here because they are a likely source of error
in practice. The errors in the computation step involve both roundoff
and truncation errors. The following subsections describe each of these
error sources.
An absolute error is the magnitude of the difference between the exact
value (𝑥 ∗ ) and the computed value (𝑥), which we can write as |𝑥 − 𝑥 ∗ |.

The relative error is a more intrinsic error measure and is defined as


    \varepsilon = \frac{|x - x^*|}{|x^*|} .    (3.3)

This is the more useful error measure in most cases. When the exact
value 𝑥 ∗ is close to zero, however, this definition breaks down. To
address this, we avoid the division by zero by using

    \varepsilon = \frac{|x - x^*|}{1 + |x^*|} .    (3.4)

This error metric combines the properties of absolute and relative errors.
When |𝑥 ∗ | ≫ 1, this metric is similar to the relative error, but when
|𝑥 ∗ | ≪ 1, it becomes similar to the absolute error.

3.5.1 Roundoff Errors


Roundoff errors stem from the fact that a computer cannot represent
all real numbers with exact precision. Errors in the representation of
each number lead to errors in each arithmetic operation, which in turn
might accumulate throughout a program.
There is an infinite number of real numbers, but not all numbers can
be represented in a computer. When a number cannot be represented
exactly, it is rounded. In addition, a number might be too small or too
large to be represented.
Computers use bits to represent numbers, where each bit is either
0 or 1. Most computers use the Institute of Electrical and Electronics
Engineers (IEEE) standard for representing numbers and performing
finite-precision arithmetic. A typical representation uses 32 bits for
integers and 64 bits for real numbers.
Basic operations that only involve integers and whose result is an
integer do not incur numerical errors. However, there is a limit on the
range of integers that can be represented. When using 32-bit integers,
1 bit is used for the sign, and the remaining 31 bits can be used for
the digits, which results in a range from −2^31 = −2,147,483,648 to
2^31 − 1 = 2,147,483,647. Any operation outside this range would result
in integer overflow.∗
∗ Some programming languages, such as Python, have arbitrary-precision integers and are not subject to this issue, albeit with some performance trade-offs.
Real numbers are represented using scientific notation in base 2:

    x = significand × 2^exponent .    (3.5)

The 64-bit representation is known as the double-precision floating-point


format, where some digits store the significand and others store the
exponent. The greatest positive and negative real numbers that can

be represented using the IEEE double-precision representation are


approximately 10^308 and −10^308. Operations that result in numbers
outside this range result in overflow, which is a fatal error in most
computers and interrupts the program execution.
There is also a limit on how close a number can come to zero,
approximately 10^−324 when using double precision. Numbers smaller
than this result in underflow. The computer sets such numbers to
zero by default, and the program usually proceeds with no harmful
consequences.
One important number to consider in roundoff error analysis is the
machine precision, 𝜀ℳ , which represents the precision of the computa-
tions. This is the smallest positive number 𝜀 such that

1+𝜀 > 1 (3.6)

when calculated using a computer. This number is also known as


machine zero. Typically, the double precision 64-bit representation uses
1 bit for the sign, 11 bits for the exponent, and 52 bits for the significand.
Thus, when using double precision, 𝜀ℳ = 2^−52 ≈ 2.2 × 10^−16 . A
double-precision number has about 16 digits of precision, and a relative
representation error of up to 𝜀ℳ may occur.

Example 3.4 Machine precision

Suppose that three decimal digits are available to represent a number (and
that we use base 10 for simplicity). Then, 𝜀ℳ = 0.005 because any number
smaller than this results in 1 + 𝜀 = 1 when rounded to three digits. For
example, 1.00 + 0.00499 = 1.00499, which rounds to 1.00. On the other hand,
1.00 + 0.005 = 1.005, which rounds to 1.01 and satisfies Eq. 3.6.

Example 3.5 Relative representation error

If we try to store 24.11 using three digits, we get 24.1. The relative error is
    \frac{24.11 - 24.1}{24.11} \approx 0.0004 ,
which is lower than the maximum possible representation error of 𝜀ℳ = 0.005
established in Ex. 3.4.
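
A quick way to see machine precision in practice (our own illustration, not from the text) is to halve a candidate 𝜀 until adding it to 1 no longer changes the result in double precision:

```python
import numpy as np

eps = 1.0
while 1.0 + eps / 2.0 > 1.0:    # stop when adding eps/2 no longer changes 1.0
    eps /= 2.0

print(eps)                       # about 2.2e-16, i.e., 2**-52
print(np.finfo(float).eps)       # NumPy's stored value, for comparison
```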

When operating with numbers that contain errors, the result is


subject to a propagated error. For multiplication and division, the relative
propagated error is approximately the sum of the relative errors of the
respective two operands.

For addition and subtraction, an error can occur even when the
two operands are represented exactly. Before addition and subtraction,
the computer must convert the two numbers to the same exponent.
When adding numbers with different exponents, several digits from
the small number vanish (see Fig. 3.8). If the difference in the two
exponents is greater than the magnitude of the exponent of 𝜀ℳ , the
small number vanishes completely—a consequence of Eq. 3.6. The
relative error incurred in addition is still 𝜀ℳ .

Fig. 3.8 Adding or subtracting numbers of differing exponents results in a loss in the number of digits corresponding to the difference in the exponents.
On the other hand, subtraction can incur much greater relative
errors when subtracting two numbers that have the same exponent and
are close to each other. In this case, the digits that match between the
two numbers cancel each other and reduce the number of significant
digits. When the relative difference between two numbers is less than
machine precision, all digits match, and the subtraction result is zero
(see Fig. 3.9). This is called subtractive cancellation and is a serious issue
when approximating derivatives via finite differences (see Section 6.4).

Fig. 3.9 Subtracting two numbers that are close to each other results in a loss of the digits that match.
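
The following short experiment (ours, not from the text) shows how subtractive cancellation degrades a forward finite-difference derivative as the step size shrinks; the test function and step sizes are arbitrary.

```python
import numpy as np

def f(x):
    return np.exp(x)          # test function; the exact derivative is exp(x)

x0 = 1.0
exact = np.exp(x0)
for h in [1e-2, 1e-5, 1e-8, 1e-11, 1e-14]:
    approx = (f(x0 + h) - f(x0)) / h      # forward difference
    print(f"h = {h:.0e}, relative error = {abs(approx - exact) / exact:.1e}")
# The error first decreases (truncation error shrinks) and then grows again
# as subtractive cancellation in f(x0 + h) - f(x0) takes over.
```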

Sometimes, minor roundoff errors can propagate and result in


much more significant errors. This can happen when a problem is ill-
conditioned or when the algorithm used to solve the problem is unstable.
In both cases, small changes in the inputs cause large changes in the
output. Ill-conditioning is not a consequence of the finite-precision
computations but is a characteristic of the model itself. Stability is a
property of the algorithm used to solve the problem. When a problem

is ill-conditioned, it is challenging to solve it in a stable way. When a


problem is well conditioned, there is a stable algorithm to solve it.

Example 3.6 Effect of roundoff error on function representation

Let us examine the function 𝑓 (𝑥) = 𝑥² − 4𝑥 + 4 near its minimum, at 𝑥 = 2.
If we use double precision and plot many points in a small interval, we can see
that the function exhibits the step pattern shown in Fig. 3.10. The numerical
minimum of this function is anywhere in the interval around 𝑥 = 2 where the
numerical value is zero. This interval is much larger than the machine precision
(𝜀ℳ = 2.2 × 10^−16). An additional error is incurred in the function computation
around 𝑥 = 2 as a result of subtractive cancellation. This illustrates the fact that
all functions are discontinuous when using finite-precision arithmetic.
Fig. 3.10 With double precision, the minimum of this quadratic function is in an interval much larger than machine zero (the computed values step between 0 and about 2 × 10^−15 over 2 ± 5 × 10^−8).
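
A sketch of this experiment in code (ours, not from the example): evaluate the quadratic at closely spaced points around 𝑥 = 2 and see how few distinct values the computed function takes.

```python
import numpy as np

def f(x):
    return x**2 - 4.0 * x + 4.0   # exactly (x - 2)**2, with its minimum at x = 2

x = np.linspace(2 - 5e-8, 2 + 5e-8, 1001)
y = f(x)

# Near the minimum, the computed values collapse onto a few discrete steps
print("distinct computed values:", len(np.unique(y)))
print("points where f evaluates to exactly zero:", np.count_nonzero(y == 0.0))
```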
3.5.2 Truncation Errors
In the most general sense, truncation errors arise from performing a
finite number of operations where an infinite number of operations
would be required to get an exact result.† Truncation errors would
arise even if we could do the arithmetic with infinite precision.
† Roundoff error, discussed in the previous section, is sometimes also referred to as truncation error because digits are truncated. However, we avoid this confusing naming and only use truncation error to refer to a truncation in the number of operations.
When discretizing a mathematical model with partial derivatives as
described in Section 3.4, these are approximated by truncated Taylor
series expansions that ignore higher-order terms. When the model
includes integrals, they are approximated as finite sums. In either case,
a mesh of points where the relevant states and functions are evaluated
is required. Discretization errors generally decrease as the spacing
between the points decreases.

Tip 3.1 Perform a mesh refinement study

When using a model that depends on a mesh, perform a mesh refinement


study. This involves solving the model for increasingly finer meshes to check if
the metrics of interest converge in a stable way and verify that the convergence
rate is as expected for the chosen numerical discretization scheme. A mesh
refinement study is also useful for finding the mesh that provides the best
compromise between computational time and accuracy. This is especially
important in optimization because the model is solved many times.
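
As a small illustration of checking the convergence rate during a mesh refinement study (our own sketch, not part of the tip), compare a quantity of interest computed on successively refined meshes; here the "model" is just a composite trapezoidal integral, so the expected observed order is 2. The function, interval, and mesh sizes are arbitrary.

```python
import numpy as np

def quantity_of_interest(n):
    """Integrate sin(x) on [0, pi] with the composite trapezoidal rule on n intervals."""
    x = np.linspace(0.0, np.pi, n + 1)
    y = np.sin(x)
    h = x[1] - x[0]
    return h * (0.5 * y[0] + y[1:-1].sum() + 0.5 * y[-1])

f_coarse = quantity_of_interest(20)
f_medium = quantity_of_interest(40)
f_fine = quantity_of_interest(80)

# Observed order of convergence for a mesh refinement ratio of 2
p = np.log(abs(f_coarse - f_medium) / abs(f_medium - f_fine)) / np.log(2.0)
print("observed order of convergence:", p)   # should be close to 2
```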

3.5.3 Iterative Solver Tolerance Error


Many methods for solving numerical models involve an iterative proce-
dure that starts with a guess for the states 𝑢 and then improves that

guess at each iteration until reaching a specified convergence tolerance.


The convergence is usually measured by a norm of the residuals, k𝑟(𝑢)k,
which we want to drive to zero. Iterative linear solvers and Newton-type
solvers are examples of iterative methods (see Section 3.6).
A typical convergence history for an iterative solver is shown in
Fig. 3.11. The norm of the residuals decreases gradually until a limit
is reached (near 10^−10 in this case). This limit represents the lowest
error achieved with the iterative solver and is determined by other
sources of error, such as roundoff and truncation errors. If we terminate
before reaching the limit (either by setting a convergence tolerance to a
value higher than 10^−10 or setting an iteration limit to lower than 400
iterations), we incur an additional error. However, it might be desirable
to trade off a less precise solution for a lower computational effort.
Fig. 3.11 Norm of residuals versus the number of iterations for an iterative solver.

Tip 3.2 Find the level of the numerical noise in your model
It is crucial to know the error level in your model because this limits the
type of optimizer you can use and how well you can optimize. In Ex. 3.6, we
saw that if we plot a function at a small enough scale, we can see discrete steps
in the function due to roundoff errors. When accumulating all sources of error
in a more elaborate model (roundoff, truncation, and iterative), we no longer
have a neat step pattern. Instead, we get numerical noise, as shown in Fig. 3.12.
The noise level can be estimated by the amplitude of the oscillations and gives
us the order of magnitude of the total numerical error.

Fig. 3.12 To find the level of numerical noise of a function of interest with respect to an input parameter (left), we magnify both axes by several orders of magnitude and evaluate the function at points that are closely spaced (right). In this example, the noise amplitude is about 10^−8.

3.5.4 Programming Errors


Most of the literature on numerical methods is too optimistic and does
not explicitly discuss programming errors, commonly known as bugs.
Most programmers, especially beginners, underestimate the likelihood
that their code has bugs.

It is helpful to adopt sound programming practices, such as writing


clear, modular code. Clear code has consistent formatting, meaningful
naming of variables and functions, and helpful comments. Modular code
reuses and generalizes functions as much as possible and avoids copying
and pasting sections of code.72 Modular code allows for more flexible
usage. Breaking up programs into smaller functions with well-defined
inputs and outputs makes debugging much more manageable.‡
72. Wilson et al., Best practices for scientific computing, 2014.
‡ The term debugging was used in engineering prior to computers, but Grace Hopper popularized this term in the programming context after a glitch in the Harvard Mark II computer was found to be caused by a moth.
There are different types of bugs relevant to numerical models:
generic programming errors, incorrect memory handling, and algorithmic
or logical errors. Programming errors are the most frequent and
include typos, type errors, copy-and-paste errors, faulty initializations,
missing logic, and default values. In theory, careful programming and
code inspection can avoid these errors, but you must always test your
code in practice. This testing involves comparing your result with a
case where you know the solution—the reference result. You should
start with the simplest representative problem and then build up from
that. Interactive debuggers are helpful because they let you step through
the code and check intermediate variable values.

Tip 3.3 Debugging is a skill that takes practice

The overall attitude toward programming should be that all code has bugs
until it is verified through testing. Programmers who are skilled at debugging
are not necessarily any better at spotting errors by reading code or by stepping
through a debugger than average programmers. Instead, effective programmers
use a systematic approach to narrow down where the problem is occurring.
Beginners often try to debug by running the entire program. Even experi-
enced programmers have a hard time debugging at that level. One primary
strategy discussed in this section is to write modular code. It is much easier
to test and debug small functions. Reliable complex programs are built up
through a series of well-tested modular functions. Sometimes we need to
simplify or break up functions even further to narrow down the problem. We
might need to streamline and remove pieces, make sure a simple case works,
then slowly rebuild the complexity.
You should also become comfortable reading and understanding the error
messages and stack traces produced by the program. These messages seem
obscure at first, but through practice and researching what the error messages
mean, they become valuable information sources.
Of course, you should carefully reread the code, looking for errors, but
reading through it again and again is unlikely to yield new insights. Instead,
it can be helpful to step away from the code and hypothesize the most likely
ways the function could fail. You can then test and eliminate hypotheses to
narrow down the problem.

Memory handling issues are much less frequent than programming


errors, but they are usually more challenging to track. These issues
include memory leaks (a failure to free unused memory), incorrect use
of memory management, buffer overruns (e.g., array bound violations),
and reading uninitialized memory. Memory issues are challenging to
track because they can result in strange behavior in parts of the code that
are far from the source of the error. In addition, they might manifest
themselves in specific conditions that are hard to reproduce consistently.
Memory debuggers are essential tools for addressing memory issues.
They perform detailed bookkeeping of all allocation, deallocation, and
memory access to detect and locate any irregularities.§
§ See Grotker et al.73 for more details on how to debug and profile code.
73. Grotker et al., The Developer's Guide to Debugging, 2012.
Whereas programming errors are due to a mismatch between the
or logical errors is in the programmer’s intent itself. Again, testing is
the key to finding these errors, but you must be sure that the reference
result is correct.

Tip 3.4 Use sound code testing practices

Automated testing takes effort to implement but ultimately saves time,


especially for larger, long-term projects. Unit tests check for the internal
consistency of a small piece (a “unit”) of code and should be implemented as
each piece of code is developed. Integration tests are designed to demonstrate
that different code components work together as expected. Regression testing
consists of running all the tests (usually automatically) anytime the code has
changed to ensure that the changes have not introduced bugs. It is usually
impossible to test for all potential issues, but the more you can test, the more
coverage you have. Whenever a bug has been found, a test should be developed
to catch that same type of bug in the future.
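
As a minimal example of a unit test (our own sketch; the error metric function and file name are hypothetical), a pytest-style test of the error measure from Eq. 3.4 might look like this:

```python
# test_error_metrics.py -- run with `pytest test_error_metrics.py`
import math

def relative_error(x, x_exact):
    """Error metric of Eq. 3.4, which is safe when x_exact is near zero."""
    return abs(x - x_exact) / (1.0 + abs(x_exact))

def test_exact_value_gives_zero_error():
    assert relative_error(3.0, 3.0) == 0.0

def test_small_exact_value_behaves_like_absolute_error():
    # When |x_exact| << 1, the metric reduces to roughly |x - x_exact|
    assert math.isclose(relative_error(1e-3, 0.0), 1e-3, rel_tol=1e-12)

def test_large_exact_value_behaves_like_relative_error():
    # When |x_exact| >> 1, the metric is close to |x - x_exact| / |x_exact|
    assert math.isclose(relative_error(1001.0, 1000.0), 1.0 / 1000.0, rel_tol=1e-2)
```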

Running the analysis within an optimization loop can reveal bugs


that do not manifest themselves in a single analysis. Therefore, you
should only run an optimization test case after you have tested the
analysis code in isolation.
As previously mentioned, there is a higher incentive to reduce the
computational cost of an analysis when it runs in an optimization loop
because it will run many times. When you first write your code, you
should prioritize clarity and correctness as opposed to speed. Once the
code is verified through testing, you should identify any bottlenecks
using a performance profiling tool. Memory performance issues can
also arise from running the analysis in an optimization loop instead
of running a single case. In addition to running a memory debugger,

you can also run a memory profiling tool to identify opportunities to


reduce memory usage.

3.6 Overview of Solvers

There are several methods available for solving the discretized gov-
erning equations (Eq. 3.1). We want to solve the governing equations
for a fixed set of design variables, so 𝑥 will not appear in the solution
algorithms. Our objective is to find the state variables 𝑢 such that
𝑟(𝑢) = 0.
This is not a book about solvers, but it is essential to understand the
characteristics of these solvers because they affect the cost and precision
of the function evaluations in the overall optimization process. Thus,
we provide an overview and some of the most relevant details in this
section.∗ In addition, the solution of coupled systems builds on these
solvers, as we will see in Section 13.2. Finally, some of the optimization
algorithms detailed in later chapters use these solvers.
∗ Ascher and Greif74 provide a more detailed introduction to the numerical methods mentioned in this chapter.
74. Ascher and Greif, A First Course in Numerical Methods, 2011.
There are two main types of solvers, depending on whether the

equations to be solved are linear or nonlinear (Fig. 3.13). Linear solution


methods solve systems of the form 𝑟(𝑢) = 𝐴𝑢 − 𝑏 = 0, where the matrix
𝐴 and vector 𝑏 are not dependent on 𝑢. Nonlinear methods can handle
any algebraic system of equations that can be written as 𝑟(𝑢) = 0.

Fig. 3.13 Overview of solution methods for linear and nonlinear systems. Linear solvers are either direct (LU factorization, Cholesky factorization, QR factorization) or iterative (fixed-point methods such as Jacobi, Gauss–Seidel, and SOR, and Krylov subspace methods such as CG and GMRES). Nonlinear solvers include Newton's method combined with a linear solver and nonlinear variants of fixed-point methods.
Linear systems can be solved directly or iteratively. Direct methods
are based on the concept of Gaussian elimination, which can be
expressed in matrix form as a factorization into lower and upper tri-
angular matrices that are easier to solve (LU factorization). Cholesky
factorization is a more efficient variant of LU factorization that applies
only to symmetric positive-definite matrices.
Whereas direct solvers obtain the solution 𝑢 at the end of a process,
iterative solvers start with a guess for 𝑢 and successively improve it

with each iteration, as illustrated in Fig. 3.14. Iterative methods can


be fixed-point iterations, such as Jacobi, Gauss–Seidel, and successive
over-relaxation (SOR), or Krylov subspace methods. Krylov subspace
methods include the conjugate gradient (CG) and generalized minimum
residual (GMRES) methods.† Direct solvers are well established and
are included in the standard libraries for most programming languages.
Iterative solvers are less widespread in standard libraries, but they are
becoming more commonplace. Appendix B describes linear solvers in
more detail.
† See Saad75 for more details on iterative methods in the context of large-scale numerical models.
75. Saad, Iterative Methods for Sparse Linear Systems, 2003.
Direct methods are the right choice for many problems because
they are generally robust. Also, the solution is guaranteed for a fixed

number of operations, 𝒪(n³) in this case. However, for large systems
where A is sparse, the cost of direct methods can become prohibitive,
whereas iterative methods remain viable. Iterative methods have other
advantages, such as being able to trade between computational cost
and precision. They can also be restarted from a good guess (see
Appendix B.4).
Fig. 3.14 Whereas direct methods only yield the solution at the end of the process, iterative methods produce approximate intermediate results (residual versus effort).

Tip 3.5 Do not compute the inverse of A
Because some numerical libraries have functions to compute A⁻¹, you
might be tempted to do this and then multiply by a vector to compute u = A⁻¹b.
This is a bad idea because finding the inverse is computationally expensive.
Instead, use LU factorization or another method from Fig. 3.13.
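
A short comparison in NumPy (our sketch, with an arbitrary random test system) shows the preferred pattern: call a solve routine rather than forming the inverse explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
A = rng.standard_normal((n, n)) + n * np.eye(n)   # well-conditioned test matrix
b = rng.standard_normal(n)

# Preferred: solve the linear system directly (LU factorization under the hood)
u_solve = np.linalg.solve(A, b)

# Discouraged: form the inverse explicitly and multiply
u_inv = np.linalg.inv(A) @ b

print("difference between the two solutions:", np.linalg.norm(u_solve - u_inv))
print("residual of the direct solve:", np.linalg.norm(A @ u_solve - b))
```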

When it comes to nonlinear solvers, the most efficient methods are


based on Newton’s method, which we explain later in this chapter
(Section 3.8). Newton’s method solves a sequence of problems that
are linearizations of the nonlinear problem about the current iterate.
The linear problem at each Newton iteration can be solved using any
linear solver, as indicated by the incoming arrow in Fig. 3.13. Although
efficient, Newton’s method is not robust in that it does not always
converge. Therefore, it requires modifications so that it can converge
reliably.
Finally, it is possible to adapt linear fixed-point iteration methods to
solve nonlinear equations as well. However, unlike the linear case, it
might not be possible to derive explicit expressions for the iterations in
the nonlinear case. For this reason, fixed-point iteration methods are
often not the best choice for solving a system of nonlinear equations.
However, as we will see in Section 13.2.5, these methods are useful for
solving systems of coupled nonlinear equations.

For time-dependent problems, we require a way to solve for the


time history of the states, 𝑢(𝑡). As mentioned in Section 3.4, the most
popular approach is to decouple the temporal discretization from the
spatial one. By discretizing a PDE in space first, this method formulates
an ODE in time of the following form:

    \frac{du}{dt} = -r(u, t) ,    (3.7)
which is called the semi-discrete form. A time-integration scheme is
then used to solve for the time history. The integration scheme can be
either explicit or implicit, depending on whether it involves evaluating
explicit expressions or solving implicit equations. If a system under a
certain condition has a steady state, these techniques can be used to
solve the steady state (d𝑢/d𝑡 = 0).
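
As a brief sketch (ours, with a made-up scalar model), an explicit time-integration scheme applied to the semi-discrete form of Eq. 3.7 can be as simple as forward Euler, which steps the state with the negative residual; the residual, time step, and initial condition below are arbitrary.

```python
import numpy as np

def r(u, t):
    """Made-up residual for a scalar ODE: du/dt = -r(u, t) = -(u - cos(t))."""
    return u - np.cos(t)

dt = 0.01            # time step (arbitrary)
u = 1.0              # initial condition (arbitrary)
u_history = [u]

for k in range(1000):
    t = k * dt
    u = u + dt * (-r(u, t))   # forward (explicit) Euler update
    u_history.append(u)

print("state at final time:", u_history[-1])
```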

3.7 Rate of Convergence

Iterative solvers compute a sequence of approximate solutions that hope-


fully converge to the exact solution. When characterizing convergence,
we need to first establish if the algorithm converges and, if so, how
fast it converges. The first characteristic relates to the stability of the
algorithm. Here, we focus on the second characteristic quantified
through the rate of convergence.
The cost of iterative algorithms is often measured by counting the
number of iterations required to achieve the solution. Iterative algo-
rithms often require an infinite number of iterations to converge to the
exact solution. In practice, we want to converge to an approximate solu-
tion close enough to the exact one. Determining the rate of convergence
arises from the need to quantify how fast the approximate solution is
approaching the exact one.
In the following, we assume that we have a sequence of points,
𝑥0 , 𝑥1 , . . . , 𝑥 𝑘 , . . ., that represent approximate solutions in the form of
vectors in any dimension and converge to a solution 𝑥 ∗ . Then,

    \lim_{k \to \infty} \|x_k - x^*\| = 0 ,    (3.8)

which means that the norm of the error tends to zero as the number of
iterations tends to infinity.
The rate of convergence of a sequence is of order 𝑝 with asymptotic
error constant 𝛾 when 𝑝 is the largest number that satisfies∗

0 ≤ lim_{𝑘→∞} ‖𝑥 𝑘+1 − 𝑥 ∗ ‖ / ‖𝑥 𝑘 − 𝑥 ∗ ‖^𝑝 = 𝛾 < ∞ .   (3.9)

∗ Some authors refer to 𝑝 as the rate of convergence. Here, we characterize the rate of convergence by two metrics: order and error constant.

Asymptotic here refers to the fact that this is the behavior in the limit,
when we are close to the solution. There is no guarantee that the initial
and intermediate iterations satisfy this condition.
To avoid dealing with limits, let us consider the condition expressed
in Eq. 3.9 at all iterations. We can relate the error from one iteration to
the next by
k𝑥 𝑘+1 − 𝑥 ∗ k = 𝛾𝑘 k𝑥 𝑘 − 𝑥 ∗ k 𝑝 . (3.10)
When 𝑝 = 1, we have linear order of convergence; when 𝑝 = 2, we have
quadratic order of convergence. Quadratic convergence is a highly
valued characteristic for an iterative algorithm, and in practice, orders of
convergence greater than 𝑝 = 2 are usually not worthwhile to consider.
When we have linear convergence, then

k𝑥 𝑘+1 − 𝑥 ∗ k = 𝛾𝑘 k𝑥 𝑘 − 𝑥 ∗ k , (3.11)

where 𝛾𝑘 converges to a constant but varies from iteration to iteration.


In this case, the convergence is highly dependent on the value of the
asymptotic error constant 𝛾. If 𝛾𝑘 > 1, then the sequence diverges—a
situation to be avoided. If 0 < 𝛾𝑘 < 1 for every 𝑘, then the norm of the
error decreases by a constant factor for every iteration. Suppose that
𝛾 = 0.1 for all iterations. Starting with an initial error norm of 0.1, we
get the sequence

10−1 , 10−2 , 10−3 , 10−4 , 10−5 , 10−6 , 10−7 , . . . . (3.12)

Thus, after six iterations, we get six-digit precision. Now suppose that
𝛾 = 0.9. Then we would have

10−1 , 9.0 × 10−2 , 8.1 × 10−2 , 7.3 × 10−2 , 6.6 × 10−2 ,


5.9 × 10−2 , 5.3 × 10−2 , . . . . (3.13)

This corresponds to only one-digit precision after six iterations. It


would take 131 iterations to achieve six-digit precision.
When we have quadratic convergence, then

k𝑥 𝑘+1 − 𝑥 ∗ k = 𝛾𝑘 k𝑥 𝑘 − 𝑥 ∗ k 2 . (3.14)

If 𝛾 = 1, then the error norm sequence with a starting error norm of 0.1
would be
10−1 , 10−2 , 10−4 , 10−8 , . . . . (3.15)
This yields more than six digits of precision in just three iterations!
In this case, the number of correct digits doubles at every iteration.
When 𝛾 > 1, the convergence will not be as fast, but the series will still
converge.

If 𝑝 ≥ 1 and lim 𝑘→∞ 𝛾𝑘 = 0, we have superlinear convergence,


which includes quadratic and higher rates of convergence. There is a
special case of superlinear convergence that is relevant for optimization
algorithms, which is when 𝑝 = 1 and 𝛾 → 0. This case is desirable
because in practice, it behaves similarly to quadratic convergence and
can be achieved by gradient-based algorithms that use first derivatives
(as opposed to second derivatives). In this case, we can write

k𝑥 𝑘+1 − 𝑥 ∗ k = 𝛾𝑘 k𝑥 𝑘 − 𝑥 ∗ k , (3.16)

where lim 𝑘→∞ 𝛾𝑘 = 0. Now we need to consider a sequence of values


for 𝛾 that tends to zero. For example, if 𝛾𝑘 = 1/(𝑘 + 1), starting with an
error norm of 0.1, we get

10−1 , 5 × 10−2 , 1.7 × 10−2 , 4.2 × 10−3 , 8.3 × 10−4 ,
1.4 × 10−4 , 2.0 × 10−5 , . . . .   (3.17)

Thus, we achieve four-digit precision after six iterations. This special


case of superlinear convergence is not quite as good as quadratic
convergence, but it is much better than either of the previous linear
convergence examples.
We plot these sequences in Fig. 3.15. Because the points are just
scalars and the exact solution is zero, the error norm is just 𝑥 𝑘 . The
first plot uses a linear scale, so we cannot see any differences beyond
two digits. To examine the differences more carefully, we need to use a
logarithmic axis for the sequence values, as shown in the plot on the
right. In this scale, each decrease in order of magnitude represents one
more digit of precision.

Fig. 3.15 Sample sequences for linear (𝑝 = 1 with 𝛾 = 0.9 and 𝛾 = 0.1), superlinear (𝑝 = 1, 𝛾 → 0), and quadratic (𝑝 = 2) cases plotted on a linear scale (left) and a logarithmic scale (right).

The linear convergence sequences show up as straight lines in


Fig. 3.15 (right), but the slope of the lines varies widely, depending
on the value of the asymptotic error constant. Quadratic convergence
exhibits an increasing slope, reflecting the doubling of digits for each

iteration. The superlinear sequence exhibits poorer convergence than


the best linear one, but we can see that the slope of the superlinear curve
is increasing, which means that for a large enough 𝑘, it will converge at
a higher rate than the linear one.

Tip 3.6 Use a logarithmic scale when plotting convergence

When using a linear scale plot, you can only see differences in two significant
digits. To reveal changes beyond three digits, you should use a logarithmic
scale. This need frequently occurs in plotting the convergence behavior of
optimization algorithms.

When solving numerical models iteratively, we can monitor the


norm of the residual. Because we know that the residuals should be
zero for an exact solution, we have

k𝑟 𝑘+1 k = 𝛾𝑘 k𝑟 𝑘 k 𝑝 . (3.18)

If we monitor another quantity, we do not usually know the exact


solution. In these cases, we can use the ratio of the step lengths of each
iteration:
‖𝑥 𝑘+1 − 𝑥 ∗ ‖ / ‖𝑥 𝑘 − 𝑥 ∗ ‖ ≈ ‖𝑥 𝑘+1 − 𝑥 𝑘 ‖ / ‖𝑥 𝑘 − 𝑥 𝑘−1 ‖ .   (3.19)
The order of convergence can be estimated numerically with the values
of the last available four iterates using
𝑝 ≈ log10 ( ‖𝑥 𝑘+1 − 𝑥 𝑘 ‖ / ‖𝑥 𝑘 − 𝑥 𝑘−1 ‖ ) / log10 ( ‖𝑥 𝑘 − 𝑥 𝑘−1 ‖ / ‖𝑥 𝑘−1 − 𝑥 𝑘−2 ‖ ) .   (3.20)
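The sketch below shows how Eq. 3.20 can be evaluated from the last four iterates of a sequence; the sample iterates are made up to mimic quadratic convergence.

    import numpy as np

    def convergence_order(x):
        """Estimate the order of convergence p (Eq. 3.20) from the last four iterates."""
        s1 = np.linalg.norm(np.asarray(x[-1]) - np.asarray(x[-2]))
        s2 = np.linalg.norm(np.asarray(x[-2]) - np.asarray(x[-3]))
        s3 = np.linalg.norm(np.asarray(x[-3]) - np.asarray(x[-4]))
        return np.log10(s1 / s2) / np.log10(s2 / s3)

    # Illustrative scalar sequence with roughly quadratic error decay toward zero
    iterates = [1e-1, 1e-2, 1e-4, 1e-8, 1e-16]
    print(convergence_order(iterates))  # close to 2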

Finally, we can also monitor any quantity (function values, state


variables, or design variables) by normalizing the step length in the
same way as Eq. 3.4,
‖𝑥 𝑘+1 − 𝑥 𝑘 ‖ / (1 + ‖𝑥 𝑘 ‖) .   (3.21)

3.8 Newton-Based Solvers

As mentioned in Section 3.6, Newton’s method is the basis for many


nonlinear equation solvers. Newton’s method is also at the core of the
most efficient gradient-based optimization algorithms, so we explain it
here in more detail. We start with the single-variable case for simplicity
and then generalize it to the 𝑛-dimensional case.

We want to find 𝑢 ∗ such that 𝑟(𝑢 ∗ ) = 0, where, for now, 𝑟 and 𝑢


are scalars. Newton’s method for root finding estimates a solution
at each iteration 𝑘 by approximating 𝑟 (𝑢 𝑘 ) to be a linear function.
The linearization is done by taking a Taylor series of 𝑟 about 𝑢 𝑘 and
truncating it to obtain the approximation

𝑟(𝑢 𝑘 + Δ𝑢) ≈ 𝑟(𝑢 𝑘 ) + Δ𝑢 𝑟′(𝑢 𝑘 ) ,   (3.22)

where 𝑟′ = d𝑟/d𝑢. For conciseness, we define 𝑟 𝑘 = 𝑟(𝑢 𝑘 ). Now we can


find the step Δ𝑢 that makes this approximate residual zero,

𝑟 𝑘 + Δ𝑢 𝑟′𝑘 = 0   ⇒   Δ𝑢 = −𝑟 𝑘 /𝑟′𝑘 ,   (3.23)

where we need to assume that 𝑟′𝑘 ≠ 0.


Thus, the update for each step in Newton’s algorithm is

𝑢 𝑘+1 = 𝑢 𝑘 − 𝑟 𝑘 /𝑟′𝑘 .   (3.24)

If 𝑟′𝑘 = 0, the algorithm will not converge because it yields a step to
infinity. Small enough values of 𝑟′𝑘 also cause an issue with large steps,
but the algorithm might still converge.
One useful modification of Newton’s method is to replace the deriva-
tive with a forward finite-difference approximation (see Section 6.4)
based on the residual values of the current and previous iterations,

𝑟′𝑘 ≈ (𝑟 𝑘 − 𝑟 𝑘−1 ) / (𝑢 𝑘 − 𝑢 𝑘−1 ) .   (3.25)

Then, the update is given by

𝑢 𝑘+1 = 𝑢 𝑘 − 𝑟 𝑘 (𝑢 𝑘 − 𝑢 𝑘−1 ) / (𝑟 𝑘 − 𝑟 𝑘−1 ) .   (3.26)

This is the secant method, which is useful when the derivative is not
available. The convergence rate is not quadratic like Newton’s method,
but it is superlinear.

Example 3.7 Newton’s method and the secant method for a single variable

Suppose we want to solve the equation 𝑟(𝑢) = 2𝑢³ + 4𝑢² + 𝑢 − 2 = 0. Because
𝑟′(𝑢) = 6𝑢² + 8𝑢 + 1, the Newton iteration is

𝑢 𝑘+1 = 𝑢 𝑘 − (2𝑢 𝑘³ + 4𝑢 𝑘² + 𝑢 𝑘 − 2) / (6𝑢 𝑘² + 8𝑢 𝑘 + 1) .

When we start with the guess 𝑢0 = 1.5 (left plot in Fig. 3.16), the iterations
are well behaved, and the method converges quadratically. We can see the
Fig. 3.16 Newton iterations starting from different starting points.

geometric interpretation of Newton’s method: For each iteration, it takes the


tangent to the curve and finds the intersection with 𝑟 = 0.
When we start with 𝑢0 = −0.5 (right plot in Fig. 3.16), the first step goes in
the wrong direction but recovers in the second iteration. The third iteration is
close to the point with the zero derivative and takes a large step. In this case,
the iterations recover and then converge normally. However, we can easily
envision a case where an iteration is much closer to the point with the zero
derivative, causing an arbitrarily long step.
We can also use the secant method (Eq. 3.26) for this problem, which gives
the following update:

𝑢 𝑘+1 = 𝑢 𝑘 − (2𝑢 𝑘³ + 4𝑢 𝑘² + 𝑢 𝑘 − 2)(𝑢 𝑘 − 𝑢 𝑘−1 ) / (2𝑢 𝑘³ + 4𝑢 𝑘² + 𝑢 𝑘 − 2𝑢 𝑘−1³ − 4𝑢 𝑘−1² − 𝑢 𝑘−1 ) .

The iterations for the secant method are shown in Fig. 3.17, where we can see
the successive secant lines replacing the exact tangent lines used in Newton’s
method.

Fig. 3.17 Secant method applied to a one-dimensional function.
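The following sketch reproduces the Newton and secant iterations of this example in Python; the starting guesses match those used above, and the implementation is a bare-bones illustration rather than a robust solver.

    def r(u):
        return 2 * u**3 + 4 * u**2 + u - 2

    def r_prime(u):
        return 6 * u**2 + 8 * u + 1

    # Newton's method (Eq. 3.24)
    u = 1.5
    for k in range(10):
        u = u - r(u) / r_prime(u)
        if abs(r(u)) < 1e-12:
            break
    print("Newton:", u)

    # Secant method (Eq. 3.26): needs two starting points but no derivative
    u_prev, u = 1.3, 1.5
    for k in range(20):
        u_next = u - r(u) * (u - u_prev) / (r(u) - r(u_prev))
        u_prev, u = u, u_next
        if abs(r(u)) < 1e-12:
            break
    print("Secant:", u)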

Newton’s method converges quadratically as it approaches the


solution with a convergence constant of

𝛾 = 𝑟″(𝑢 ∗ ) / (2 𝑟′(𝑢 ∗ )) .   (3.27)

This means that if the derivative is close to zero or the curvature tends
to a large number at the solution, Newton’s method will not converge
as well or not at all.
Now we consider the general case where we have 𝑛 nonlinear
equations of 𝑛 unknowns, expressed as 𝑟(𝑢) = 0. Similar to the single-
variable case, we derive the Newton step from a truncated Taylor
series. However, the Taylor series needs to be multidimensional in
both the independent variable and the function. Consider first the
multidimensionality of the independent variable, 𝑢, for a component
of the residuals, 𝑟 𝑖 (𝑢). The first two terms of the Taylor series about
𝑢 𝑘 for a step Δ𝑢 (which is now a vector with arbitrary direction and
magnitude) are

𝑟 𝑖 (𝑢 𝑘 + Δ𝑢) ≈ 𝑟 𝑖 (𝑢 𝑘 ) + Σ_{𝑗=1}^{𝑛} [𝜕𝑟 𝑖 /𝜕𝑢 𝑗 ]_{𝑢=𝑢 𝑘} Δ𝑢 𝑗 .   (3.28)

Because we have 𝑛 residuals, 𝑖 = 1, . . . , 𝑛, we can write the second


term in matrix form as 𝐽Δ𝑢, where 𝐽 is an (𝑛 × 𝑛) square matrix whose
elements are
𝐽 𝑖𝑗 = 𝜕𝑟 𝑖 /𝜕𝑢 𝑗 .   (3.29)

This matrix is called the Jacobian.


Similar to the single-variable case, we want to find the step that
makes the two terms zero, which yields the linear system

𝐽𝑘 Δ𝑢 𝑘 = −𝑟 𝑘 . (3.30)

After solving this linear system, we can update the solution to

𝑢 𝑘+1 = 𝑢 𝑘 + Δ𝑢 𝑘 . (3.31)

Thus, Newton’s method involves solving a sequence of linear systems


given by Eq. 3.30. The linear system can be solved using any of the linear
solvers mentioned in Section 3.6. One popular option for solving the
Newton step is the Krylov method, which results in the Newton–Krylov
method for solving nonlinear systems. Because the Krylov method only
requires matrix-vector products of the form 𝐽𝑣, we can avoid computing
and storing the Jacobian by computing this product directly (using
finite differences or other methods from Chapter 6). In Section 4.4.3 we
adapt Newton’s method to perform function minimization instead of
solving nonlinear equations.
The multivariable version of Newton’s method is subject to the same
issues we uncovered for the single-variable case: it only converges
if the starting point is within a specific region, and it can be subject
to ill-conditioning. To increase the likelihood of convergence from
any starting point, Newton’s method requires a globalization strategy
(see Section 4.2). The ill-conditioning issue has to do with the linear
system (Eq. 3.30) and can be quantified by the condition number of
the Jacobian matrix. Ill-conditioning can be addressed by scaling and
preconditioning.
There is an analog of the secant method for 𝑛 dimensions, which is
called Broyden’s method. This method is much more involved than its
one-dimensional counterpart because it needs to create an approximate
Jacobian based on directional finite-difference derivatives. Broyden’s

method is described in Appendix C.1 and is related to the quasi-Newton


optimization methods of Section 4.4.4.

Example 3.8 Newton’s method applied to two nonlinear equations

Suppose we have the following nonlinear system of two equations:

𝑢2 = 1/𝑢1 ,    𝑢2 = √𝑢1 .

This corresponds to the two lines shown in Fig. 3.18, where the solution is at
their intersection, 𝑢 = (1, 1). (In this example, the two equations are explicit,
and we could solve them by substitution, but they could have been implicit.)
To solve this using Newton’s method, we need to write these as residuals:

𝑟1 = 𝑢2 − 1/𝑢1 = 0
𝑟2 = 𝑢2 − √𝑢1 = 0 .

The Jacobian can be derived analytically, and the Newton step is given by the
linear system

[ 1/𝑢1²        1 ] [Δ𝑢1]      [ 𝑢2 − 1/𝑢1 ]
[ −1/(2√𝑢1)    1 ] [Δ𝑢2]  = − [ 𝑢2 − √𝑢1  ] .

Fig. 3.18 Newton iterations.

Starting from 𝑢 = (2, 3) yields the iterations shown in the following table, with
the quadratic convergence shown in Fig. 3.19.

    𝑢1          𝑢2          ‖𝑢 − 𝑢∗‖       ‖𝑟‖
    2.000 000   3.000 000   2.24           2.96
    0.485 281   0.878 680   5.29 × 10−1    1.20
    0.760 064   0.893 846   2.62 × 10−1    4.22 × 10−1
    0.952 668   0.982 278   5.05 × 10−2    6.77 × 10−2
    0.998 289   0.999 417   1.81 × 10−3    2.31 × 10−3
    0.999 998   0.999 999   2.32 × 10−6    2.95 × 10−6
    1.000 000   1.000 000   3.82 × 10−12   4.87 × 10−12
    1.000 000   1.000 000   0.0            0.0

Fig. 3.19 The norm of the residual exhibits quadratic convergence.
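A minimal Python sketch of these Newton iterations is given below; it forms the residual and Jacobian analytically and solves the linear system of Eq. 3.30 at each iteration.

    import numpy as np

    def residual(u):
        return np.array([u[1] - 1.0 / u[0],
                         u[1] - np.sqrt(u[0])])

    def jacobian(u):
        return np.array([[1.0 / u[0]**2,                1.0],
                         [-1.0 / (2.0 * np.sqrt(u[0])), 1.0]])

    u = np.array([2.0, 3.0])  # starting guess
    for k in range(10):
        du = np.linalg.solve(jacobian(u), -residual(u))  # Newton step (Eq. 3.30)
        u = u + du                                       # update (Eq. 3.31)
        print(k, u, np.linalg.norm(residual(u)))
        if np.linalg.norm(residual(u)) < 1e-12:
            break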

3.9 Models and the Optimization Problem

When performing design optimization, we must compute the values


of the objective and constraint functions in the optimization problem
(Eq. 1.4). Computing these functions usually requires solving a model
for the given design 𝑥 at one or more specific conditions.∗ The model
often includes governing equations that define the state variables 𝑢 as
an implicit function of 𝑥. In other words, for a given 𝑥, there is a 𝑢
that solves 𝑟(𝑢; 𝑥) = 0, as illustrated in Fig. 3.20. Here, the semicolon
in 𝑟(𝑢; 𝑥) indicates that 𝑥 is fixed when the governing equations are
solved for 𝑢.

∗ As previously mentioned, the process of solving a model is also known as the analysis or simulation.
The objective and constraints are typically explicit functions of the
state and design variables, as illustrated in Fig. 3.21 (this is a more
detailed version of Fig. 1.14). There is also an implicit dependence of
the objective and constraint functions on 𝑥 through 𝑢. Therefore, the
objective and constraint functions are ultimately fully determined by
the design variables. In design optimization applications, solving the
governing equations is usually the most computationally intensive part
of the overall optimization process.

Fig. 3.20 For a general model, the state variables 𝑢 are implicit functions of the design variables 𝑥 through the solution of the governing equations.

When we first introduced the general optimization problem (Eq. 1.4),
the governing equations were not included because they were assumed
to be part of the computation of the objective and constraints for a
given 𝑥. However, we can include them in the problem statement for
completeness as follows:

minimize       𝑓 (𝑥; 𝑢)
by varying     𝑥 𝑖              𝑖 = 1, . . . , 𝑛 𝑥
subject to     𝑔 𝑗 (𝑥; 𝑢) ≤ 0    𝑗 = 1, . . . , 𝑛 𝑔
               ℎ 𝑘 (𝑥; 𝑢) = 0    𝑘 = 1, . . . , 𝑛 ℎ          (3.32)
               𝑥̲ 𝑖 ≤ 𝑥 𝑖 ≤ 𝑥̄ 𝑖    𝑖 = 1, . . . , 𝑛 𝑥
while solving  𝑟 𝑙 (𝑢; 𝑥) = 0    𝑙 = 1, . . . , 𝑛 𝑢
by varying     𝑢 𝑙              𝑙 = 1, . . . , 𝑛 𝑢 .

Fig. 3.21 Computing the objective ( 𝑓 ) and constraint functions (𝑔, ℎ) for a given set of design variables (𝑥) usually involves the solution of a numerical model (𝑟 = 0) by varying the state variables (𝑢).

Here, “while solving” means that the governing equations are solved
at each optimization iteration to find a valid 𝑢 for each value of 𝑥.
The semicolon in 𝑓 (𝑥; 𝑢) indicates that 𝑢 is fixed while the optimizer
determines the next value of 𝑥.

Example 3.9 Structural sizing optimization

Recalling the truss problem of Ex. 3.2, suppose we want to minimize the
mass of the structure (𝑚) by varying the cross-sectional areas of the truss
members (𝑎), subject to stress constraints.
The structural mass is an explicit function that can be written as
𝑚 = Σ_{𝑖=1}^{15} 𝜌 𝑎 𝑖 𝑙 𝑖 ,
where 𝜌 is the material density, 𝑎 𝑖 is the cross-sectional area of each member 𝑖,
and 𝑙 𝑖 is the member length. This function depends on the design variables
directly and does not depend on the displacements.

We can write the optimization problem statement as follows:

minimize 𝑚(𝑎)
by varying 𝑎 𝑖 ≥ 𝑎min 𝑖 = 1, . . . , 15
subject to |𝜎 𝑗 (𝑎, 𝑢)| − 𝜎max ≤ 0 𝑗 = 1, . . . , 15
while solving 𝐾𝑢 − 𝑞 = 0 (system of 18 equations)
by varying 𝑢𝑙 𝑙 = 1, . . . , 18 .
The governing equations are a linear set of equations whose solution determines
the displacements (𝑢) of a given design (𝑎) for a load condition (𝑞). We
mentioned previously that the objective and constraint functions are usually
explicit functions of the state variables, design variables, or both. As we saw in
Ex. 3.2, the mass is an explicit function of the cross-sectional areas. In this case,
it does not even depend on the state variables. The constraint function is also
explicit, but in this case, it is just a function of the state variables. This example
illustrates a common situation where the solution of the state variables requires
the solution of implicit equations (structural solver), whereas the constraints
(stresses) and objective (weight) are explicit functions of the states and design
variables.

From a mathematical perspective, the model governing equations


𝑟(𝑥, 𝑢) = 0 can be considered equality constraints in an optimization
problem. Some specialized optimization approaches add these equa-
tions to the optimization problem and let the optimization algorithm
solve both the governing equations and optimization simultaneously.
This is called a full-space approach and is also known as simultaneous
analysis and design (SAND) or one-shot optimization. The approach is
illustrated in Fig. 3.22 and stated as follows:

minimize       𝑓 (𝑥, 𝑢)
by varying     𝑥 𝑖              𝑖 = 1, . . . , 𝑛 𝑥
               𝑢 𝑙              𝑙 = 1, . . . , 𝑛 𝑢
subject to     𝑔 𝑗 (𝑥, 𝑢) ≤ 0    𝑗 = 1, . . . , 𝑛 𝑔          (3.33)
               ℎ 𝑘 (𝑥, 𝑢) = 0    𝑘 = 1, . . . , 𝑛 ℎ
               𝑥̲ 𝑖 ≤ 𝑥 𝑖 ≤ 𝑥̄ 𝑖    𝑖 = 1, . . . , 𝑛 𝑥
               𝑟 𝑙 (𝑥, 𝑢) = 0    𝑙 = 1, . . . , 𝑛 𝑢 .

Fig. 3.22 In the full-space approach, the governing equations are solved by the optimizer by varying the state variables.
This approach is described in more detail in Section 13.4.3.
More generally, the optimization constraints and equations in a
model are interchangeable. Suppose a set of equations in a model can
be satisfied by varying a corresponding set of state variables. In that case,
these equations and variables can be moved to the optimization problem
statement as equality constraints and design variables, respectively.

Unless otherwise stated, we assume that the optimization model gov-


erning equations are solved by a dedicated solver for each optimization
iteration, as stated in Eq. 3.32.

Example 3.10 Structural sizing optimization using a full-space approach

To solve the structural sizing problem (Ex. 3.9) using a full-space approach,
we forgo the linear solver by adding 𝑢 to the set of design variables and letting
the optimizer enforce the governing equations. This results in the following
problem:

minimize 𝑚(𝑎)
by varying 𝑎 𝑖 ≥ 𝑎min 𝑖 = 1, . . . , 15
𝑢𝑙 𝑙 = 1, . . . , 18
subject to |𝜎 𝑗 (𝑎, 𝑢)| − 𝜎max ≤ 0 𝑗 = 1, . . . , 15
𝐾𝑢 − 𝑞 = 0 (system of 18 equations) .

Tip 3.7 Test your analysis before you attempt optimization

Before you optimize, you should be familiar with the analysis (model and
solver) that computes the objective and constraints. If possible, make several
parameter sweeps to see what the functions look like—whether they are smooth,
whether they seem unimodal or not, what the trends are, and the range of
values. You should also get an idea of the computational effort required and if
that varies significantly. Finally, you should test the robustness of the analysis
to different inputs because the optimization is likely to ask for extreme values.

3.10 Summary

It is essential to understand the models that compute the objective and


constraint functions because they directly affect the performance and
effectiveness of the optimization process.
The modeling process introduces several types of numerical errors
associated with each step of the process (discretization, programming,
computation), limiting the achievable precision of the optimization.
Knowing the level of numerical error is necessary to establish what
precision can be achieved in the optimization. Understanding the
types of errors involved helps us find ways to reduce those errors.
Programming errors—“bugs”—are often underestimated; thorough
testing is required to verify that the numerical model is coded correctly.

A lack of understanding of a given model’s numerical errors is often the


cause of a failure in optimization, especially when using gradient-based
algorithms.
Modeling errors arise from discrepancies between the mathematical
model and the actual physical system. Although they do not affect
the optimization process’s performance and precision, modeling errors
affect the accuracy and determine how valid the result is in the real
world. Therefore, model validation and an understanding of modeling
error are also critical.
In engineering design optimization problems, the models usually
involve solving large sets of nonlinear implicit equations. The compu-
tational time required to solve these equations dominates the overall
optimization time, and therefore, solver efficiency is crucial. Solver
robustness is also vital because optimization often asks for designs that
are very different from what a human designer would ask for, which
tests the limits of the model and the solver.
We presented an overview of the various types of solvers available
for linear and nonlinear equations. Newton-type methods are highly
desirable for solving nonlinear equations because they exhibit second-
order convergence. Because Newton-type methods involve solving a
linear system at each iteration, a linear solver is always required. These
solvers are also at the core of several of the optimization algorithms in
later chapters.

Problems

3.1 Answer true or false and justify your answer.

a. A model developed to perform well for analysis will always


do well in a numerical optimization process.
b. Modeling errors have nothing to do with computations.
c. Explicit and implicit equations can always be written in
residual form.
d. Subtractive cancellation is a type of roundoff error.
e. Programming errors can always be eliminated by carefully
reading the code.
f. Quadratic convergence is only better than linear convergence
if the asymptotic convergence error constant is less than or
equal to one.
g. Logarithmic scales are desirable when plotting convergence
because they show errors of all magnitudes.
h. Newton solvers always require a linear solver.
i. Some linear iterative solvers can be used to solve nonlinear
problems.
j. Direct methods allow us to trade between computational
cost and precision, whereas iterative methods do not.
k. Newton’s method requires the derivatives of all the state
variables with respect to the residuals.
l. In the full-space optimization approach, the state variables
become design variables, and the governing equations be-
come constraints.

3.2 Choose an engineering system that you are familiar with and
describe each of the components illustrated in Fig. 3.1 for that
system. List all the options for the mathematical and numerical
models that you can think of, and describe the assumptions for
each model. What type of solver is usually used for each model
(see Section 3.6)? What are the state variables for each model?

3.3 Consider the following mathematical model:

𝑢1²/4 + 𝑢2² = 1
4𝑢1 𝑢2 = 𝜋
𝑓 = 4(𝑢1 + 𝑢2 ) .

Solve this model by hand. Write these equations in residual form


and use a numerical solver to obtain the same solution.

3.4 Reproduce a plot similar to the one shown in Fig. 3.10 for

𝑓 (𝑥) = cos(𝑥) + 1

in the neighborhood of 𝑥 = 𝜋 .

3.5 Consider the residual equation

𝑟(𝑢) = 𝑢 3 − 6𝑢 2 + 12𝑢 − 8 = 0 .

a. Find the solution using your own implementation of New-


ton’s method.
b. Tabulate the residual for each iteration number.
c. What is the lowest error you can achieve?
d. Plot the residual versus the iteration number using a linear
axis. How many digits can you discern in this plot?
e. Make the same plot using a logarithmic axis for the residual
and estimate the rate of convergence. Discuss whether the
rate is as expected or not.
f. Exploration: Try different starting points. Can you find a
predictable trend and explain it?

3.6 Kepler’s equation, which we mentioned in Section 2.2, defines the


relationship between a planet’s polar coordinates and the time
elapsed from a given initial point and is stated as follows:

𝐸 − 𝑒 sin(𝐸) = 𝑀,

where 𝑀 is the mean anomaly (a parameterization of time), 𝐸 is


the eccentric anomaly (a parameterization of polar angle), and 𝑒
is the eccentricity of the elliptical orbit.

a. Use Newton’s method to find 𝐸 when 𝑒 = 0.7 and 𝑀 = 𝜋/2.


b. Devise a fixed-point iteration to solve the same problem.
c. Compare the number of iterations and rate of convergence.
d. Exploration: Plot 𝐸 versus 𝑀 in the interval [0, 2𝜋] for 𝑒 =
[0, 0.1, 0.5, 0.9] and interpret your results physically.

3.7 Consider the equation from Prob. 3.5 where we replace one of the
coefficients with a parameter 𝑎 as follows:

𝑟(𝑢) = 𝑎𝑢 3 − 6𝑢 2 + 12𝑢 − 8 = 0 .

a. Produce a plot similar to Fig. 3.12 by perturbing 𝑎 in the


neighborhood of 𝑎 = 1.2 using a solver convergence tolerance
of |𝑟 | ≤ 10−6 .
b. Exploration: Try smaller tolerances and see how much you
can decrease the numerical noise.

3.8 Reproduce the solution of Ex. 3.8 and then try different initial
guesses. Can you define a distinct region from where Newton’s
method converges?

3.9 Choose a problem that you are familiar with and find the magni-
tude of numerical noise in one or more outputs of interest with
respect to one or more inputs of interest. What means do you
have to decrease the numerical noise? What is the lowest possible
level of noise you can achieve?
4 Unconstrained Gradient-Based Optimization
In this chapter we focus on unconstrained optimization problems with
continuous design variables, which we can write as

minimize   𝑓 (𝑥)   by varying   𝑥 ,   (4.1)

where 𝑥 = [𝑥1 , . . . , 𝑥 𝑛 ] is composed of the design variables that the


optimization algorithm can change.
We solve these problems using gradient information to determine a
series of steps from a starting guess (or initial design) to the optimum, as
shown in Fig. 4.1. We assume the objective function to be nonlinear, 𝐶²
continuous, and deterministic. We do not assume unimodality or multi-
modality, and there is no guarantee that the algorithm finds the global
optimum. Referring to the attributes that classify an optimization prob-
lem (Fig. 1.22), the optimization algorithms discussed in this chapter
range from first to second order, perform a local search, and evaluate the
function directly. The algorithms are based on mathematical principles
rather than heuristics.
Although most engineering design problems are constrained, the
constrained optimization algorithms in Chapter 5 build on the methods
explained in the current chapter.

Fig. 4.1 Gradient-based optimization starts with a guess, 𝑥0 , and takes a sequence of steps in 𝑛-dimensional space that converge to an optimum, 𝑥 ∗ .


By the end of this chapter you should be able to:

1. Understand the significance of gradients, Hessians, and directional derivatives.
2. Mathematically define the optimality conditions for an
unconstrained problem.
3. Describe, implement, and use line-search-based methods.
4. Explain the pros and cons of the various search direction
methods.
5. Understand the trust-region approach and how it contrasts
with the line search approach.

4.1 Fundamentals

To determine the step directions shown in Fig. 4.1, gradient-based


methods need the gradient (first-order information). Some methods
also use curvature (second-order information). Gradients and curvature
are required to build a second-order Taylor series, a fundamental
building block in establishing optimality and developing gradient-
based optimization algorithms.

4.1.1 Derivatives and Gradients


Recall that we are considering a scalar objective function 𝑓 (𝑥), where
𝑥 is the vector of design variables, 𝑥 = [𝑥1 , 𝑥2 , . . . , 𝑥 𝑛 ]. The gradient of
this function, ∇ 𝑓 (𝑥), is a column vector of first-order partial derivatives
of the function with respect to each design variable:

∇ 𝑓 (𝑥) = [𝜕 𝑓 /𝜕𝑥1 , 𝜕 𝑓 /𝜕𝑥2 , . . . , 𝜕 𝑓 /𝜕𝑥 𝑛 ] ,   (4.2)

where each partial derivative is defined as the following limit:

𝜕 𝑓 /𝜕𝑥 𝑖 = lim_{𝜀→0} [ 𝑓 (𝑥1 , . . . , 𝑥 𝑖 + 𝜀, . . . , 𝑥 𝑛 ) − 𝑓 (𝑥1 , . . . , 𝑥 𝑖 , . . . , 𝑥 𝑛 )] / 𝜀 .   (4.3)
Each component in the gradient vector quantifies the function’s local
rate of change with respect to the corresponding design variable, as
shown in Fig. 4.2 for the two-dimensional case. In other words, these
components represent the slope of the function along each coordinate
direction. The gradient is a vector pointing in the direction of the
greatest function increase from the current point.

Fig. 4.2 Components of the gradient vector in the two-dimensional case.

The gradient vectors are normal to the surfaces of constant 𝑓 in
𝑛-dimensional space (isosurfaces). In the two-dimensional case, gradient
vectors are perpendicular to the function contour lines, as shown in
Fig. 4.2.∗

∗ In this book, most of the illustrations and examples are based on two-dimensional problems because they are easy to visualize. However, the principles and methods apply to 𝑛 dimensions.

Example 4.1 Gradient of a polynomial function
Consider the following function of two variables:

𝑓 (𝑥1 , 𝑥2 ) = 𝑥1³ + 2𝑥1 𝑥2² − 𝑥2³ − 20𝑥1 .

The gradient can be obtained using symbolic differentiation, yielding

∇ 𝑓 (𝑥1 , 𝑥2 ) = [3𝑥1² + 2𝑥2² − 20 ,  4𝑥1 𝑥2 − 3𝑥2²] .

This defines the vector field plotted in Fig. 4.3, where each vector points in the
direction of the steepest local increase.

Fig. 4.3 Gradient vector field shows how gradients point toward maxima and away from minima.
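For a simple closed-form function like the one in Ex. 4.1, the symbolic differentiation can be reproduced with a computer algebra system. The sketch below uses SymPy (an assumption—any symbolic tool would do) to generate and evaluate the gradient.

    import sympy as sp

    x1, x2 = sp.symbols("x1 x2")
    f = x1**3 + 2 * x1 * x2**2 - x2**3 - 20 * x1

    grad = [sp.diff(f, v) for v in (x1, x2)]
    print(grad)  # [3*x1**2 + 2*x2**2 - 20, 4*x1*x2 - 3*x2**2]

    # Evaluate the gradient numerically at a point, e.g., (1, 2)
    grad_fun = sp.lambdify((x1, x2), grad)
    print(grad_fun(1.0, 2.0))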

If a function is simple, we can use symbolic differentiation as we


did in Ex. 4.1. However, symbolic differentiation has limited utility
for general engineering models because most models are far more
complicated; they may include loops, conditionals, nested functions,
and implicit equations. Fortunately, there are several methods for com-
puting derivatives numerically; we cover these methods in Chapter 6.
Each gradient component has units that correspond to the units
of the function divided by the units of the corresponding variable.
Because the variables might represent different physical quantities,
each gradient component might have different units.
From an engineering design perspective, it might be helpful to think
about relative changes, where the derivative is given as the percentage
change in the function for a 1 percent increase in the variable. This
relative derivative can be computed by nondimensionalizing both the
function and the variable, that is,
(𝜕 𝑓 /𝜕𝑥)(𝑥/ 𝑓 ) ,   (4.4)
where 𝑓 and 𝑥 are the values of the function and variable, respectively,
at the point where the derivative is computed.

Example 4.2 Interpretation of derivatives for wing design problem

Consider the wing design problem from Ex. 1.1, where the objective function
is the required power (𝑃). For the derivative of power with respect to span
(𝜕𝑃/𝜕𝑏), the units are watts per meter (W/m). For a wing with 𝑐 = 1 m and
𝑏 = 12 m, we have 𝑃 = 1087.85 W and 𝜕𝑃/𝜕𝑏 = −41.65 W/m. This means that
for an increase in span of 1 m, the linear approximation predicts a decrease in
power of 41.65 W (to 𝑃 = 1046.20). However, the actual power at 𝑏 = 13 𝑚 is
1059.77 W because the function is nonlinear (see Fig. 4.4). The relative derivative
for this same design can be computed as (𝜕𝑃/𝜕𝑏)(𝑏/𝑃) = −0.459, which means
that for a 1 percent increase in span, the linear approximation predicts a 0.459
percent decrease in power. The actual decrease is 0.310 percent.

Fig. 4.4 Power versus span and the corresponding derivative.

The gradient components quantify the function’s rate of change in


each coordinate direction (𝑥 𝑖 ), but sometimes we are interested in the
rate of change in a direction that is not a coordinate direction. The rate
of change in a direction 𝑝 is quantified by a directional derivative, defined
as
∇𝑝 𝑓 (𝑥) = lim_{𝜀→0} [ 𝑓 (𝑥 + 𝜀𝑝) − 𝑓 (𝑥)] / 𝜀 .   (4.5)
We can find this derivative by projecting the gradient onto the desired
direction 𝑝 using the dot product

∇𝑝 𝑓 (𝑥) = ∇ 𝑓 | 𝑝 . (4.6)

When 𝑝 is a unit vector aligned with one of the Cartesian coordinates 𝑖,


this dot product yields the corresponding partial derivative 𝜕 𝑓 /𝜕𝑥 𝑖 . A
two-dimensional example of this projection is shown in Fig. 4.5.

Fig. 4.5 Projection of the gradient in an arbitrary unit direction 𝑝.

From the gradient projection, we can see why the gradient is the
direction of the steepest increase. If we use this definition of the dot
product,
∇𝑝 𝑓 (𝑥) = ∇ 𝑓 | 𝑝 = k∇ 𝑓 k k𝑝 k cos 𝜃 , (4.7)
where 𝜃 is the angle between the two vectors, we can see that this is
maximized when 𝜃 = 0◦ . That is, the directional derivative is largest
when 𝑝 points in the same direction as ∇ 𝑓 .
If 𝜃 is in the interval (−90, 90)◦ , the directional derivative is positive
and is thus in a direction of increase, as shown in Fig. 4.6. If 𝜃 is in the
interval (90, 180]◦ , the directional derivative is negative, and 𝑝 points
in a descent direction. Finally, if 𝜃 = ±90◦ , the directional derivative
is 0, and thus the function value does not change for small steps; it
is locally flat in that direction. This condition occurs when ∇ 𝑓 and 𝑝
are orthogonal; therefore, the gradient is orthogonal to the function
isosurfaces.

Fig. 4.6 The gradient ∇ 𝑓 is always orthogonal to contour lines (surfaces), and the directional derivative in the direction 𝑝 is given by ∇ 𝑓 | 𝑝.

To get the correct slope in the original units of 𝑥, the direction should
be normalized as 𝑝ˆ = 𝑝/k𝑝k. However, in some of the gradient-based
algorithms of this chapter, 𝑝 is not normalized because the length
contains useful information. If 𝑝 is not normalized, the slopes and
variable axis are scaled by a constant.

Example 4.3 Directional derivative of a quadratic function

Consider the following function of two variables:

𝑓 (𝑥1 , 𝑥2 ) = 𝑥1² + 2𝑥2² − 𝑥1 𝑥2 .

The gradient can be obtained using symbolic differentiation, yielding

∇ 𝑓 (𝑥1 , 𝑥2 ) = [2𝑥1 − 𝑥2 ,  4𝑥2 − 𝑥1 ] .

At point 𝑥 = [−1, 1], the gradient is

∇ 𝑓 (−1, 1) = [−3, 5] .

Taking the derivative in the normalized direction 𝑝 = [2/√5 , −1/√5], we obtain

∇ 𝑓 | 𝑝 = [−3, 5] · [2/√5 , −1/√5] = −11/√5 ,
which we show in Fig. 4.7 (left). We use a 𝑝 with unit length to get the slope of
the function in the original units.
A projection of the function in the 𝑝 direction can be obtained by plotting
𝑓 along the line defined by 𝑥 + 𝛼𝑝, where 𝛼 is the independent variable, as
shown in Fig. 4.7 (middle). The projected slope of the function in that direction
corresponds to the slope of this single-variable function. The polar plot in
Fig. 4.7 (right) shows how the directional derivative changes with the direction
of 𝑝. The directional derivative has a maximum in the direction of the gradient,
has the largest negative magnitude in the opposite direction, and has zero
values in the directions orthogonal to the gradient.
Fig. 4.7 Function contours and direction 𝑝 (left), one-dimensional slice along 𝑝 (middle), directional derivative for all directions on polar plot (right).

4.1.2 Curvature and Hessians

The rate of change of the gradient—the curvature—is also useful infor-
mation because it tells us if a function’s slope is increasing (positive
curvature), decreasing (negative curvature), or stationary (zero curva-
ture).
In one dimension, the gradient reduces to a scalar (the slope), and
the curvature is also a scalar that can be calculated by taking the second
derivative of the function. To quantify curvature in 𝑛 dimensions, we
need to take the partial derivative of each gradient component 𝑗 with
respect to each coordinate direction 𝑖, yielding

𝜕² 𝑓 /𝜕𝑥 𝑖 𝜕𝑥 𝑗 .   (4.8)

If the function 𝑓 has continuous second partial derivatives, the order of


differentiation does not matter, and the mixed partial derivatives are
equal; thus
𝜕² 𝑓 /𝜕𝑥 𝑖 𝜕𝑥 𝑗 = 𝜕² 𝑓 /𝜕𝑥 𝑗 𝜕𝑥 𝑖 .   (4.9)
This property is known as the symmetry of second derivatives or equality
of mixed partials.†

† The discovery and proof of the symmetry of second derivatives property has a long history.76
76. Higgins, A note on the history of mixed partial derivatives, 1940.

Considering all gradient components and their derivatives with
respect to all coordinate directions results in a second-order tensor. This
tensor can be represented as a square (𝑛 × 𝑛) matrix of second-order
partial derivatives called the Hessian:

𝐻 𝑓 (𝑥) = [ 𝜕² 𝑓 /𝜕𝑥1²       𝜕² 𝑓 /𝜕𝑥1 𝜕𝑥2    · · ·   𝜕² 𝑓 /𝜕𝑥1 𝜕𝑥 𝑛 ]
          [ 𝜕² 𝑓 /𝜕𝑥2 𝜕𝑥1    𝜕² 𝑓 /𝜕𝑥2²       · · ·   𝜕² 𝑓 /𝜕𝑥2 𝜕𝑥 𝑛 ]          (4.10)
          [      ...              ...          ...         ...       ]
          [ 𝜕² 𝑓 /𝜕𝑥 𝑛 𝜕𝑥1   𝜕² 𝑓 /𝜕𝑥 𝑛 𝜕𝑥2   · · ·   𝜕² 𝑓 /𝜕𝑥 𝑛²    ]

The Hessian is expressed in index notation as:

𝐻 𝑓 𝑖𝑗 = 𝜕² 𝑓 /𝜕𝑥 𝑖 𝜕𝑥 𝑗 .   (4.11)
Because of the symmetry of second derivatives, the Hessian is a sym-
metric matrix with 𝑛(𝑛 + 1)/2 independent elements.
Each row 𝑖 of the Hessian is a vector that quantifies the rate of
change of all components 𝑗 of the gradient vector with respect to the
direction 𝑖. On the other hand, each column 𝑗 of the matrix quantifies
the rate of change of component 𝑗 of the gradient vector with respect to
all coordinate directions 𝑖. Because the Hessian is symmetric, the rows
and columns are transposes of each other, and these two interpretations
are equivalent.
We can find the rate of change of the gradient in an arbitrary
normalized direction 𝑝 by taking the product 𝐻𝑝. This yields an 𝑛-
vector that quantifies the rate of change of the gradient in the direction
𝑝, where each component of the vector is the rate of the change of the
corresponding partial derivative with respect to a movement along 𝑝.
Therefore, this product is defined as follows:
𝐻𝑝 = ∇𝑝 (∇ 𝑓 (𝑥)) = lim_{𝜀→0} [∇ 𝑓 (𝑥 + 𝜀𝑝) − ∇ 𝑓 (𝑥)] / 𝜀 .   (4.12)
Because of the symmetry of second derivatives, we can also interpret
this as the rate of change in the directional derivative of the function
along 𝑝 with respect to each of the components of 𝑝.
To find the curvature of the one-dimensional function along a
direction 𝑝, we need to project 𝐻𝑝 onto direction 𝑝 as

∇𝑝 ∇𝑝 𝑓 (𝑥) = 𝑝 | 𝐻𝑝 , (4.13)

which yields a scalar quantity. Again, if we want to get the curvature in


the original units of 𝑥, 𝑝 should be normalized.

For an 𝑛-dimensional Hessian, it is possible to find directions 𝑣 𝑖


(where 𝑖 = 1, . . . , 𝑛) along which the projected curvature aligns with
that direction, that is,
𝐻𝑣 = 𝜅𝑣 . (4.14)
This is an eigenvalue problem whose eigenvectors represent the principal
curvature directions, and the eigenvalues 𝜅 quantify the corresponding
curvatures. If each eigenvector is normalized as 𝑣ˆ = 𝑣/k𝑣k, then the
corresponding 𝜅 is the curvature in the original units.

Example 4.4 Hessian and principal curvature directions of a quadratic

Consider the following quadratic function of two variables:

𝑓 (𝑥1 , 𝑥2 ) = 𝑥1² + 2𝑥2² − 𝑥1 𝑥2 ,

whose contours are shown in Fig. 4.8 (left). These contours are ellipses that
have the same center. The Hessian of this quadratic is

𝐻 = [ 2   −1 ]
    [ −1   4 ] ,

which is constant. To find the curvature in the direction 𝑝 = [−1/2, −√3/2], we
compute

𝑝 | 𝐻𝑝 = [−1/2  −√3/2] [ 2  −1 ; −1  4 ] [−1/2 ; −√3/2] = (7 − √3)/2 .

The principal curvature directions can be computed by solving the eigenvalue
problem (Eq. 4.14). This yields two eigenvalues and two corresponding
eigenvectors,

𝜅 𝐴 = 3 + √2 ,  𝑣 𝐴 = [1 − √2 , 1] ,   and   𝜅 𝐵 = 3 − √2 ,  𝑣 𝐵 = [1 + √2 , 1] .

By plotting the principal curvature directions superimposed on the function


contours (Fig. 4.8, left), we can see that they are aligned with the ellipses’ major
and minor axes. To see how the curvature varies as a function of the direction,
we make a polar plot of the curvature 𝑝 | 𝐻𝑝, where 𝑝 is normalized (Fig. 4.8,
right). The maximum curvature aligns with the first principal curvature
direction, as expected, and the minimum curvature corresponds to the second
principal curvature direction.
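These eigenvalues, eigenvectors, and directional curvatures can be verified numerically; the sketch below uses NumPy’s symmetric eigensolver on the Hessian from this example.

    import numpy as np

    H = np.array([[2.0, -1.0],
                  [-1.0, 4.0]])

    # Principal curvatures (eigenvalues) and directions (eigenvectors), Eq. 4.14
    kappa, v = np.linalg.eigh(H)
    print(kappa)     # approximately [3 - sqrt(2), 3 + sqrt(2)] = [1.586, 4.414]
    print(v)         # columns are the normalized principal curvature directions

    # Curvature along an arbitrary unit direction p (Eq. 4.13)
    p = np.array([-0.5, -np.sqrt(3) / 2])
    print(p @ H @ p)  # (7 - sqrt(3)) / 2, approximately 2.634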
Fig. 4.8 Contours of 𝑓 for Ex. 4.4 and the two principal curvature directions in red. The polar plot shows the curvature, with the eigenvectors pointing at the directions of principal curvature; all other directions have curvature values in between.

Example 4.5 Hessian of two-variable polynomial

Consider the same polynomial from Ex. 4.1. Differentiating the gradient
we obtained previously yields the Hessian:

𝐻(𝑥1 , 𝑥2 ) = [ 6𝑥1    4𝑥2       ]
             [ 4𝑥2    4𝑥1 − 6𝑥2 ] .

We can visualize the variation of the Hessian by plotting the principal curvatures
at different points (Fig. 4.9).

Fig. 4.9 Principal curvature direction and magnitude variation. Solid lines correspond to positive curvature, whereas dashed lines are for negative curvature.

4.1.3 Taylor Series

The Taylor series provides a local approximation to a function and is


the foundation for gradient-based optimization algorithms.
For an 𝑛-dimensional function, the Taylor series can predict the
function along any direction 𝑝. This is done by projecting the gradient

and Hessian onto the desired direction 𝑝 to get an approximation of


the function at any nearby point 𝑥 + 𝑝:‡

𝑓 (𝑥 + 𝑝) = 𝑓 (𝑥) + ∇ 𝑓 (𝑥)| 𝑝 + (1/2) 𝑝 | 𝐻(𝑥)𝑝 + 𝒪(‖𝑝‖³) .   (4.15)

‡ For a more extensive introduction to the Taylor series, see Appendix A.1.
We use a second-order Taylor series (ignoring the cubic term)
because it results in a quadratic, the lowest-order Taylor series that
can have a minimum. For a function that is 𝐶 2 continuous, this
approximation can be made arbitrarily accurate by making k𝑝 k small
enough.
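A small sketch of how such a quadratic model can be formed and evaluated numerically is given below; the gradient and Hessian are supplied by the caller, and the point data used here are made up for illustration.

    import numpy as np

    def taylor_quadratic(f0, grad, hess, p):
        """Second-order Taylor prediction of f(x + p) from values at x (Eq. 4.15)."""
        p = np.asarray(p)
        return f0 + grad @ p + 0.5 * p @ hess @ p

    # Example with made-up data at some point x
    f0 = 3.0
    grad = np.array([-3.0, 5.0])
    hess = np.array([[2.0, -1.0], [-1.0, 4.0]])
    print(taylor_quadratic(f0, grad, hess, [0.1, -0.2]))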

Example 4.6 Second-order Taylor series expansion of two-variable function

Using the gradient and Hessian of the two-variable polynomial from Ex. 4.1
and Ex. 4.5, we can use Eq. 4.15 to construct a second-order Taylor expansion
about 𝑥0 ,

𝑓̃(𝑝) = 𝑓 (𝑥0 ) + [3𝑥1² + 2𝑥2² − 20 ,  4𝑥1 𝑥2 − 3𝑥2²] 𝑝 + (1/2) 𝑝 | [ 6𝑥1  4𝑥2 ; 4𝑥2  4𝑥1 − 6𝑥2 ] 𝑝 .
Figure 4.10 shows the resulting Taylor series expansions about different points.
We perform three expansions, each about three critical points: the minimum
(left), the maximum (middle), and the saddle point (right). The expansion
about the minimum yields a convex quadratic that is a good approximation of
the original function near the minimum but becomes worse as we step farther
away. The expansion about the maximum shows a similar trend except that the
approximation is a concave quadratic. Finally, the expansion about the saddle
point yields a saddle function.

Fig. 4.10 The second-order Taylor series expansion uses the function value, gradient, and Hessian at a point to construct a quadratic model about that point. The model can vary drastically, depending on the function and the point location. The one-dimensional slices are in the 𝑥1 direction.

4.1.4 Optimality Conditions


To find the minimum of a function, we must determine the mathematical
conditions that identify a given point 𝑥 as a minimum. There is only a
limited set of problems for which we can prove global optimality, so
in general, we are only interested in local optimality. A point 𝑥 ∗ is a
local minimum if 𝑓 (𝑥 ∗ ) ≤ 𝑓 (𝑥) for all 𝑥 in the neighborhood of 𝑥 ∗ . In
other words, there must be no descent direction starting from the local
minimum.
A second-order Taylor series expansion about 𝑥 ∗ for small steps of
size 𝑝 yields

𝑓 (𝑥 ∗ + 𝑝) = 𝑓 (𝑥 ∗ ) + ∇ 𝑓 (𝑥 ∗ )| 𝑝 + (1/2) 𝑝 | 𝐻(𝑥 ∗ )𝑝 + . . . .   (4.16)
For 𝑥 ∗ to be an optimal point, we must have 𝑓 (𝑥 ∗ + 𝑝) ≥ 𝑓 (𝑥 ∗ ) for all 𝑝.
This implies that the first- and second-order terms in the Taylor series
have to be nonnegative, that is,

∇ 𝑓 (𝑥 ∗ )| 𝑝 + (1/2) 𝑝 | 𝐻(𝑥 ∗ )𝑝 ≥ 0   for all 𝑝 .   (4.17)
Because the magnitude of 𝑝 is small, we can always find a 𝑝 such
that the first term dominates. Therefore, we require that

∇ 𝑓 (𝑥 ∗ )| 𝑝 ≥ 0 for all 𝑝. (4.18)

Because 𝑝 can be in any arbitrary direction, the only way this inequality
can be satisfied is if all the elements of the gradient are zero (refer to
Fig. 4.6),
∇ 𝑓 (𝑥 ∗ ) = 0 . (4.19)
This is the first-order necessary optimality condition for an unconstrained
problem. This is necessary because if any element of 𝑝 is nonzero, there
are descent directions (e.g., 𝑝 = −∇ 𝑓 ) for which the inequality would
not be satisfied.
Because the gradient term has to be zero, we must now satisfy the
remaining term in the inequality (Eq. 4.17), that is,

𝑝 | 𝐻(𝑥 ∗ )𝑝 ≥ 0 for all 𝑝. (4.20)

From Eq. 4.13, we know that this term represents the curvature in
direction 𝑝, so this means that the function curvature must be positive
or zero when projected in any direction. You may recognize this
inequality as the definition of a positive-semidefinite matrix. In other
words, the Hessian 𝐻(𝑥 ∗ ) must be positive semidefinite.

For a matrix to be positive semidefinite, its eigenvalues must all


be greater than or equal to zero. Recall that the eigenvalues of the
Hessian quantify the principal curvatures, so as long as all the principal
curvatures are greater than or equal to zero, the curvature along an
arbitrary direction is also greater than or equal to zero.
These conditions on the gradient and curvature are necessary condi-
tions for a local minimum but are not sufficient. They are not sufficient
because if the curvature is zero in some direction 𝑝 (i.e., 𝑝 | 𝐻(𝑥 ∗ )𝑝 = 0),
we have no way of knowing if it is a minimum unless we check the
third-order term. In that case, even if it is a minimum, it is a weak
minimum.
The sufficient conditions for optimality require the curvature to be
positive in any direction, in which case we have a strong minimum.
Mathematically, this means that 𝑝 | 𝐻(𝑥 ∗ )𝑝 > 0 for all nonzero 𝑝, which
is the definition of a positive-definite matrix. If 𝐻 is a positive-definite
matrix, every eigenvalue of 𝐻 is positive.§ § For other approaches to determine if a ma-

trix is positive definite, see Appendix A.6.


Figure 4.11 shows some examples of quadratic functions that are
positive definite (all positive eigenvalues), positive semidefinite (non-
negative eigenvalues), indefinite (mixed eigenvalues), and negative
definite (all negative eigenvalues).

Minimum Weak minima Saddle point Maximum


line
Positive definite Positive semidefinite Indefinite Negative definite

In summary, the necessary optimality conditions for an unconstrained Fig. 4.11 Quadratic functions with
different types of Hessians.
optimization problem are

∇ 𝑓 (𝑥 ∗ ) = 0
(4.21)
𝐻(𝑥 ∗ ) is positive semidefinite .

The sufficient optimality conditions are

∇ 𝑓 (𝑥 ∗ ) = 0
(4.22)
𝐻(𝑥 ∗ ) is positive definite .

Example 4.7 Finding minima analytically

Consider the following function of two variables:

𝑓 = 0.5𝑥1⁴ + 2𝑥1³ + 1.5𝑥1² + 𝑥2² − 2𝑥1 𝑥2 .

We can find the minima of this function by solving for the optimality conditions
analytically.
To find the critical points of this function, we solve for the points at which
the gradient is equal to zero,

∇ 𝑓 = [𝜕 𝑓 /𝜕𝑥1 , 𝜕 𝑓 /𝜕𝑥2 ] = [2𝑥1³ + 6𝑥1² + 3𝑥1 − 2𝑥2 ,  2𝑥2 − 2𝑥1 ] = [0, 0] .

From the second equation, we have that 𝑥2 = 𝑥1 . Substituting this into the first
equation yields

𝑥1 (2𝑥1² + 6𝑥1 + 1) = 0 .

The solution of this equation yields three points:

𝑥 𝐴 = [0, 0] ,    𝑥 𝐵 = [−3/2 − √7/2 , −3/2 − √7/2] ,    𝑥 𝐶 = [√7/2 − 3/2 , √7/2 − 3/2] .

To classify these points, we need to compute the Hessian matrix. Differentiating
the gradient, we get

𝐻(𝑥1 , 𝑥2 ) = [ 6𝑥1² + 12𝑥1 + 3    −2 ]
             [ −2                  2 ] .

The Hessian at the first point is

𝐻(𝑥 𝐴 ) = [ 3    −2 ]
          [ −2    2 ] ,

whose eigenvalues are 𝜅1 ≈ 0.438 and 𝜅2 ≈ 4.561. Because both eigenvalues
are positive, this point is a local minimum. For the second point,

𝐻(𝑥 𝐵 ) = [ 3(3 + √7)    −2 ]
          [ −2             2 ] .

The eigenvalues are 𝜅1 ≈ 1.737 and 𝜅2 ≈ 17.200, so this point is another local
minimum. For the third point,

𝐻(𝑥 𝐶 ) = [ 9 − 3√7    −2 ]
          [ −2           2 ] .

The eigenvalues for this Hessian are 𝜅 1 ≈ −0.523 and 𝜅 2 ≈ 3.586, so this point
is a saddle point.
Figure 4.12 shows these three critical points. To find out which of the two
local minima is the global one, we evaluate the function at each of these points.
Because 𝑓 (𝑥 𝐵 ) < 𝑓 (𝑥 𝐴 ), 𝑥 𝐵 is the global minimum.

Fig. 4.12 Minima and saddle point locations.

We may be able to solve the optimality conditions analytically for


simple problems, as we did in Ex. 4.7. However, this is not possible
in general because the resulting equations might not be solvable in
closed form. Therefore, we need numerical methods that solve for these
conditions.
When using a numerical approach, we seek points where ∇ 𝑓 (𝑥 ∗ ) = 0,
but the entries in ∇ 𝑓 do not converge to exactly zero because of finite-
precision arithmetic. Instead, we define convergence for the first
criterion based on the maximum component of the gradient, such that

k∇ 𝑓 k ∞ ≤ 𝜏 , (4.23)

where 𝜏 is some tolerance. A typical absolute tolerance is 𝜏 = 10−6


or a six-order magnitude reduction in gradient when using a relative
tolerance. Absolute and relative criteria are often combined in a metric
such as the following:

‖∇ 𝑓 ‖∞ ≤ 𝜏 (1 + ‖∇ 𝑓0 ‖∞ ) ,   (4.24)

where ∇ 𝑓0 is the gradient at the starting point.


The second optimality condition (that 𝐻 must be positive semidefi-
nite) is not usually checked explicitly. If we satisfy the first condition,
then all we know is that we have reached a stationary point, which

could be a maximum, a minimum, or a saddle point. However, as


shown in Section 4.4, the search directions for the algorithms of this
chapter are always descent directions, and therefore in practice, they
should converge to a local minimum.
For a practical algorithm, other exit conditions are often used besides
the reduction in the norm of the gradient. A function might be poorly
scaled, be noisy, or have other numerical issues that prevent it from
ever satisfying this optimality condition (Eq. 4.24). To prevent the
algorithm from running indefinitely, it is common to set a limit on
the computational budget, such as the number of function calls, the
number of major iterations, or the clock time. Additionally, to detect a
case where the optimizer is not making significant progress and not
likely to improve much further, we might set criteria on the minimum
step size and the minimum change in the objective. Similar to the
conditions on the gradient, the minimum change in step size could be
limited as follows:

k𝑥 𝑘 − 𝑥 𝑘−1 k ∞ < 𝜏𝑥 (1 + k𝑥 𝑘−1 k ∞ ) . (4.25)

The absolute and relative conditions on the objective are of the same
form, although they only use an absolute value rather than a norm
because the objective is scalar.

Tip 4.1 Check the exit message when using an optimizer

Optimizers usually include an exit message when returning a result. Inex-


perienced users often take whatever solution the optimizer returns without
checking this message. However, as discussed previously, the optimizer may
terminate without satisfying first-order optimality (Eq. 4.24). Check the exit
message and study the optimizer’s documentation to make sure you understand
the result. If the message indicates that this is not a definite optimum, you
should investigate further.
You might have to increase the limit on the number of iterations if the
optimization reached this limit. When terminating due to small step sizes
or function changes, you might need to improve your numerical model by
reducing the noise (see Tip 3.2) or by smoothing it (Tip 4.7). Another likely
culprit is scaling (Tip 4.4). Finally, you might want to explore the design space
around the point where the optimizer is stuck (Tip 4.2) and more specifically,
see what is happening with the line search (Tip 4.3).

4.2 Two Overall Approaches to Finding an Optimum

Although the optimality conditions derived in the previous section


can be solved analytically to find the function minima, this analytic
approach is not possible for functions that result from numerical models.
Instead, we need iterative numerical methods to find minima based
only on the function values and gradients.
In Chapter 3, we reviewed methods for solving simultaneous sys-
tems of nonlinear equations, which we wrote as 𝑟(𝑢) = 0. Because
the first-order optimality condition (∇ 𝑓 = 0) can be written in this
residual form (where 𝑟 = ∇ 𝑓 and 𝑢 = 𝑥), we could try to use the solvers
from Chapter 3 directly to solve unconstrained optimization problems.
Although several components of general solvers for 𝑟(𝑢) = 0 are used
in optimization algorithms, these solvers are not the most effective
approaches in their original form. Furthermore, solving ∇ 𝑓 = 0 is not
necessarily sufficient—it finds a stationary point but not necessarily a
minimum. Optimization algorithms require additional considerations
to ensure convergence to a minimum.
Like the iterative solvers from Chapter 3, gradient-based algorithms
start with a guess, 𝑥0 , and generate a series of points, 𝑥1 , 𝑥2 , . . . , 𝑥 𝑘 , . . .
that converge to a local optimum, 𝑥 ∗ , as previously illustrated in Fig. 4.1.
At each iteration, some form of the Taylor series about the current point
is used to find the next point.

Fig. 4.13 Taylor series quadratic models are only guaranteed to be accurate near the point about which the series is expanded (𝑥 𝑘 ).

A truncated Taylor series is, in general, only a good model within a
small neighborhood, as shown in Fig. 4.13, which shows three quadratic
models of the same function based on three different points. All
quadratic approximations match the local gradient and curvature at
the respective points. However, the Taylor series quadratic about the
first point (left plot) yields a quadratic without a minimum (the only
critical point is a saddle point). The second point (middle plot) yields
a quadratic whose minimum is closer to the true minimum. Finally,
the Taylor series about the actual minimum point (right plot) yields a

quadratic with the same minimum, as expected, but we can see how
the quadratic model worsens the farther we are from the point.
Because the Taylor series is only guaranteed to be a good model
locally, we need a globalization strategy to ensure convergence to an
optimum. Globalization here means making the algorithm robust
enough that it can converge to a local minimum when starting from
any point in the domain. This should not be confused with finding the
global minimum, which is a separate issue (see Tip 4.8). There are two
main globalization strategies: line search and trust region.
The line search approach consists of three main steps for every iteration (Fig. 4.14):

1. Choose a suitable search direction from the current point. The choice of search direction is based on a Taylor series approximation.
2. Determine how far to move in that direction by performing a line search.
3. Move to the new point and update all values.

The two first steps can be seen as two separate subproblems. We address the line search subproblem in Section 4.3 and the search direction subproblem in Section 4.4.

Fig. 4.14 Line search approach (flowchart: starting from 𝑥0, find a search direction, perform a line search, update 𝑥, and repeat until 𝑥 is a minimum, returning 𝑥∗).

Trust-region methods also consist of three steps (Fig. 4.15):

1. Create a model about the current point. This model can be based on a Taylor series approximation or another type of surrogate model.
2. Minimize the model within a trust region around the current point to find the step.
3. Move to the new point, update values, and adapt the size of the trust region.

Fig. 4.15 Trust-region approach (flowchart: starting from 𝑥0, create a model, minimize it within the trust region, update 𝑥 and the trust-region size Δ, and repeat until 𝑥 is a minimum, returning 𝑥∗).

We introduce the trust-region approach in Section 4.5. However, we devote more attention to algorithms that use the line search approach because they are more common in general nonlinear optimization. Both line search and trust-region approaches use iterative processes that must be repeated until some convergence criterion is satisfied. The first step in both approaches is usually referred to as a major iteration, whereas the second step might require more function evaluations corresponding to minor iterations.

Tip 4.2 Before optimizing, explore the design space

Before coupling your model solver with an optimizer, it is a good idea to


explore the design space. Ensure that the solver is robust and can handle a wide
variety of inputs within your provided bounds without errors. Plotting the
multidimensional design space is generally impossible, but you can perform a
series of one-dimensional sweeps. From the starting point, plot the objective
with all design variables fixed except one. Vary that design variable across
a range, and repeat that process for several design variables. These one-
dimensional plots can identify issues such as analysis failures, noisy outputs,
and discontinuous outputs, which you can then fix. These issues should
be addressed before attempting to optimize. This same technique can be
helpful when an optimizer becomes stuck; you can plot the behavior in a small
neighborhood around the point of failure (see Tip 4.3).

4.3 Line Search

Gradient-based unconstrained optimization algorithms that use a line search follow Alg. 4.1. We start with a guess 𝑥0 and provide a convergence tolerance 𝜏 for the optimality condition.∗ The final output is an optimal point 𝑥∗ and the corresponding function value 𝑓(𝑥∗). As mentioned in the previous section, there are two main subproblems in line search gradient-based optimization algorithms: choosing the search direction and determining how far to step in that direction. In the next section, we introduce several methods for choosing the search direction. The line search method determines how far to step in the chosen direction and is usually independent of the method for choosing the search direction. Therefore, line search methods can be combined with any method for finding the search direction. However, the search direction method determines the name of the overall optimization algorithm, as we will see in the next section.

∗ This algorithm, and others in this section, use a basic convergence check for simplicity. See the end of Section 4.1.4 for better alternatives and additional exit criteria.

Algorithm 4.1 Gradient-based unconstrained optimization using a line


search
Inputs:
𝑥 0 : Starting point
𝜏: Convergence tolerance
Outputs:
𝑥 ∗ : Optimal point
𝑓 (𝑥 ∗ ): Minimum function value

𝑘=0 Initialize iteration counter


while ‖∇𝑓‖∞ > 𝜏 do    Optimality condition

Determine search direction, 𝑝 𝑘 Use any of the methods from Section 4.4
Determine step length, 𝛼 𝑘 Use a line search algorithm
𝑥 𝑘+1 = 𝑥 𝑘 + 𝛼 𝑘 𝑝 𝑘 Update design variables
𝑘 = 𝑘+1 Increment iteration index
end while

For the line search subproblem, we assume that we are given a starting location at 𝑥𝑘 and a suitable search direction 𝑝𝑘 along which to search (Fig. 4.16). The line search then operates solely on points along direction 𝑝𝑘 starting from 𝑥𝑘, which can be written as

𝑥𝑘+1 = 𝑥𝑘 + 𝛼𝑝𝑘 ,    (4.26)

where the scalar 𝛼 is always positive and represents how far we go in the direction 𝑝𝑘. This equation produces a one-dimensional slice of 𝑛-dimensional space, as illustrated in Fig. 4.17.

Fig. 4.16 The line search starts from a given point 𝑥𝑘 and searches solely along direction 𝑝𝑘.

Fig. 4.17 The line search projects the 𝑛-dimensional problem onto one dimension, where the independent variable is 𝛼.

The line search determines the magnitude of the scalar 𝛼 𝑘 , which in


turn determines the next point in the iteration sequence. Even though
𝑥 𝑘 and 𝑝 𝑘 are 𝑛-dimensional, the line search is a one-dimensional
problem with the goal of selecting 𝛼 𝑘 .
Line search methods require that the search direction 𝑝𝑘 be a descent direction so that ∇𝑓𝑘ᵀ𝑝𝑘 < 0 (see Fig. 4.18). This guarantees that 𝑓 can be reduced by stepping some distance along this direction with a positive 𝛼.

Fig. 4.18 The line search direction must be a descent direction.

The goal of the line search is not to find the value of 𝛼 that minimizes 𝑓(𝑥𝑘 + 𝛼𝑝𝑘) but to find a point that is “good enough” using as few function evaluations as possible. This is because finding the exact minimum along the line would require too many evaluations of the objective function and possibly its gradient. Because the overall optimization needs to find a point in 𝑛-dimensional space, the search direction might change drastically between line searches, so spending too many iterations on each line search is generally not worthwhile.

Consider the bean function whose contours are shown in Fig. 4.19. At point 𝑥𝑘, the direction 𝑝𝑘 is a descent direction. However, it would be wasteful to spend a lot of effort determining the exact minimum in the 𝑝𝑘 direction because it would not take us any closer to the minimum of the overall function (the dot on the right side of the plot). Instead, we should find a point that is good enough and then update the search direction.

Fig. 4.19 The descent direction does not necessarily point toward the minimum, in which case it would be wasteful to do an exact line search.

To simplify the notation for the line search, we define the single-variable function

𝜙(𝛼) = 𝑓(𝑥𝑘 + 𝛼𝑝𝑘) ,    (4.27)

where 𝛼 = 0 corresponds to the start of the line search (𝑥𝑘 in Fig. 4.17), and thus 𝜙(0) = 𝑓(𝑥𝑘). Then, using 𝑥 = 𝑥𝑘 + 𝛼𝑝𝑘, the slope of the
single-variable function is

\phi'(\alpha) = \frac{\partial f(x)}{\partial \alpha} = \sum_{i=1}^{n} \frac{\partial f}{\partial x_i}\,\frac{\partial x_i}{\partial \alpha} .    (4.28)

Substituting the derivatives ∂𝑥𝑖/∂𝛼 = 𝑝𝑘,𝑖 (which follow from 𝑥 = 𝑥𝑘 + 𝛼𝑝𝑘) into this expression results in

𝜙′(𝛼) = ∇𝑓(𝑥𝑘 + 𝛼𝑝𝑘)ᵀ𝑝𝑘 ,    (4.29)

which is the directional derivative along the search direction. The slope
at the start of a given line search is

𝜙′(0) = ∇𝑓𝑘ᵀ𝑝𝑘 .    (4.30)

Because 𝑝𝑘 must be a descent direction, 𝜙′(0) is always negative. Figure 4.20 is a version of the one-dimensional slice from Fig. 4.17 in this notation. The 𝛼 axis and the slopes scale with the magnitude of 𝑝𝑘.

Fig. 4.20 For the line search, we denote the function as 𝜙(𝛼), with the same value as 𝑓. The slope 𝜙′(𝛼) is the gradient of 𝑓 projected onto the search direction.

4.3.1 Sufficient Decrease and Backtracking

The simplest line search algorithm to find a “good enough” point relies on the sufficient decrease condition combined with a backtracking algorithm. The sufficient decrease condition, also known as the Armijo condition, is given by the inequality

𝜙(𝛼) ≤ 𝜙(0) + 𝜇1 𝛼 𝜙′(0) ,    (4.31)

where 𝜇1 is a constant such that 0 < 𝜇1 ≤ 1.† The quantity 𝛼𝜙′(0) represents the expected decrease of the function, assuming the function continued at the same slope. The multiplier 𝜇1 states that Eq. 4.31 will be satisfied as long as we achieve even a small fraction of the expected decrease, as shown in Fig. 4.21. In practice, this constant is several orders of magnitude smaller than 1, typically 𝜇1 = 10⁻⁴. Because 𝑝𝑘 is a descent direction, and thus 𝜙′(0) = ∇𝑓𝑘ᵀ𝑝𝑘 < 0, there is always a positive 𝛼 that satisfies this condition for a smooth function.

† This condition can be problematic near a local minimum because 𝜙(0) and 𝜙(𝛼) are so similar that their subtraction is inaccurate. Hager and Zhang77 introduced a condition with improved accuracy, along with an efficient line search based on a secant method.
77. Hager and Zhang, A new conjugate gradient method with guaranteed descent and an efficient line search, 2005.

Fig. 4.21 The sufficient decrease line has a slope that is a small fraction of the slope at the start of the line search.

The concept is illustrated in Fig. 4.22, which shows a function with a negative slope at 𝛼 = 0 and a sufficient decrease line whose slope is a fraction of that initial slope. When starting a line search, we only know the function value and slope at 𝛼 = 0, so we do not really know how the function varies until we evaluate it. Because we do not want to evaluate the function too many times, the first point whose value is below the sufficient decrease line is deemed acceptable. The sufficient decrease line slope in Fig. 4.22 is exaggerated for illustration purposes; for typical values of 𝜇1, the line is indistinguishable from horizontal when plotted.

Fig. 4.22 Sufficient decrease condition (the acceptable ranges are the intervals of 𝛼 where 𝜙(𝛼) lies on or below the sufficient decrease line).
Line search algorithms require a first guess for 𝛼. As we will see


later, some methods for finding the search direction also provide good
guesses for the step length. However, in many cases, we have no idea
of the scale of the function, so our initial guess may not be suitable. Even if
we do have an educated guess for 𝛼, it is only a guess, and the first step
might not satisfy the sufficient decrease condition.
A straightforward algorithm that is guaranteed to find a step that
satisfies the sufficient decrease condition is backtracking (Alg. 4.2).
This algorithm starts with a maximum step and successively reduces
the step by a constant ratio 𝜌 until it satisfies the sufficient decrease
condition (a typical value is 𝜌 = 0.5). Because the search direction is a
descent direction, we know that we will achieve an acceptable decrease
in function value if we backtrack enough.

Algorithm 4.2 Backtracking line search algorithm

Inputs:

𝛼init > 0: Initial step length
0 < 𝜇1 < 1: Sufficient decrease factor (typically small, e.g., 𝜇1 = 10⁻⁴)
0 < 𝜌 < 1: Backtracking factor (e.g., 𝜌 = 0.5)
Outputs:
𝛼∗: Step size satisfying sufficient decrease condition

𝛼 = 𝛼init
while 𝜙(𝛼) > 𝜙(0) + 𝜇1 𝛼 𝜙′(0) do    Function value is above sufficient decrease line
    𝛼 = 𝜌𝛼    Backtrack
end while
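A minimal Python sketch of Alg. 4.2 (not from the text; the function name and the way 𝜙 is passed in are assumptions) is:

    def backtracking(phi, phi0, dphi0, alpha_init, mu1=1e-4, rho=0.5):
        """Backtracking line search (Alg. 4.2).

        phi   : callable, phi(alpha) = f(x_k + alpha * p_k)
        phi0  : phi(0), the function value at the start of the line search
        dphi0 : phi'(0), the gradient dotted with p_k (must be negative)
        """
        alpha = alpha_init
        while phi(alpha) > phi0 + mu1 * alpha * dphi0:  # above the sufficient decrease line
            alpha *= rho                                # backtrack
        return alpha

Here, 𝜙 can be built from the objective as, for example, phi = lambda a: f(x_k + a * p_k). In practice, a maximum iteration count would be added to guard against noisy functions.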

Although backtracking is guaranteed to find a point that satisfies


sufficient decrease, there are two undesirable scenarios where this
algorithm performs poorly. The first scenario is that the guess for the
initial step is far too large, and the step sizes that satisfy sufficient de-
crease are smaller than the starting step by several orders of magnitude.
Depending on the value of 𝜌, this scenario requires a large number of
backtracking evaluations.
The other undesirable scenario is where our initial guess immedi-
ately satisfies sufficient decrease. However, the function’s slope is still
highly negative, and we could have decreased the function value by
much more if we had taken a larger step. In this case, our guess for the
initial step is far too small.
Even if our original step size is not too far from an acceptable step
size, the basic backtracking algorithm ignores any information we have
about the function values and gradients. It blindly takes a reduced step
based on a preselected ratio 𝜌. We can make more intelligent estimates
of where an acceptable step is based on the evaluated function values
(and gradients, if available). The next section introduces a more
sophisticated line search algorithm that deals with these scenarios
much more efficiently.
Example 4.8 Backtracking line search

Consider the following function:

𝑓(𝑥1, 𝑥2) = 0.1𝑥1⁶ − 1.5𝑥1⁴ + 5𝑥1² + 0.1𝑥2⁴ + 3𝑥2² − 9𝑥2 + 0.5𝑥1𝑥2 .

Suppose we do a line search starting from 𝑥 = [−1.25, 1.25] in the direction 𝑝 = [4, 0.75], as shown in Fig. 4.23. Applying the backtracking algorithm with 𝜇1 = 10⁻⁴ and 𝜌 = 0.7 produces the iterations shown in Fig. 4.24. The sufficient decrease line appears to be horizontal, but that is because the small negative slope cannot be discerned in a plot for typical values of 𝜇1. Using a large initial step of 𝛼init = 1.2 (Fig. 4.24, left), several iterations are required. For a small initial step of 𝛼init = 0.05 (Fig. 4.24, right), the algorithm satisfies sufficient decrease at the first iteration but misses the opportunity for further reductions.

Fig. 4.23 A line search direction for an example problem.

Fig. 4.24 Backtracking using different initial steps.

4.3.2 Strong Wolfe Conditions


One major weakness of the sufficient decrease condition is that it accepts
small steps that marginally decrease the objective function because 𝜇1
in Eq. 4.31 is typically small. We could increase 𝜇1 (i.e., tilt the red
line downward in Fig. 4.22) to prevent these small steps; however, that
would prevent us from taking large steps that still result in a reasonable
decrease. A large step that provides a reasonable decrease is desirable
because large steps generally lead to faster convergence. Therefore, we
want to prevent overly small steps while not making it more difficult
to accept reasonable large steps. We can accomplish this by adding a
second condition to construct a more efficient line search algorithm.
Just like guessing the step size, it is difficult to know in advance how
much of a function value decrease to expect. However, if we compare
the slope of the function at the candidate point with the slope at the
start of the line search, we can get an idea if the function is “bottoming
out”, or flattening, using the sufficient curvature condition:

|𝜙′(𝛼)| ≤ 𝜇2 |𝜙′(0)| .    (4.32)

Fig. 4.25 The sufficient curvature condition requires the function slope magnitude to be a fraction of the initial slope.

This condition requires that the magnitude of the slope at the new point be lower than the magnitude of the slope at the start of the line search by a factor of 𝜇2, as shown in Fig. 4.25. This requirement is called the sufficient curvature condition because by comparing the two slopes, we quantify the function’s rate of change in the slope—the curvature. Typical values of 𝜇2 range from 0.1 to 0.9, and the best value depends on the method for determining the search direction and is also problem

dependent. As 𝜇2 tends to zero, enforcing the sufficient curvature


condition tends toward a point where 𝜙0(𝛼) = 0, which would yield an
exact line search.
The sign of the slope at a point satisfying this condition is not
relevant; all that matters is that the function slope be shallow enough.
The idea is that if the slope 𝜙0(𝛼) is still negative with a magnitude
similar to the slope at the start of the line search, then the step is too
small, and we expect the function to decrease even further by taking
a larger step. If the slope 𝜙0(𝛼) is positive with a magnitude similar
to that at the start of the line search, then the step is too large, and we
expect to decrease the function further by taking a smaller step. On
the other hand, when the slope is shallow enough (either positive or
negative), we assume that the candidate point is near a local minimum,
and additional effort yields only incremental benefits that are wasteful
in the context of the larger problem.
The sufficient decrease and sufficient curvature conditions are
collectively known as the strong Wolfe conditions. Figure 4.26 shows
acceptable intervals that satisfy the strong Wolfe conditions, which are
more restrictive than the sufficient decrease condition (Fig. 4.22).
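As a small illustration (not from the text; the argument names are assumptions), the two conditions can be written as simple Boolean checks on the scalar line search quantities:

    def sufficient_decrease(phi_a, phi0, dphi0, alpha, mu1=1e-4):
        # Eq. 4.31: achieve at least a fraction mu1 of the expected decrease
        return phi_a <= phi0 + mu1 * alpha * dphi0

    def sufficient_curvature(dphi_a, dphi0, mu2=0.9):
        # Eq. 4.32: slope magnitude reduced by a factor mu2
        return abs(dphi_a) <= mu2 * abs(dphi0)

    def strong_wolfe(phi_a, dphi_a, phi0, dphi0, alpha, mu1=1e-4, mu2=0.9):
        return (sufficient_decrease(phi_a, phi0, dphi0, alpha, mu1) and
                sufficient_curvature(dphi_a, dphi0, mu2))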

Fig. 4.26 Steps that satisfy the strong Wolfe conditions.

The sufficient decrease slope must be shallower than the sufficient curvature slope, that is, 0 < 𝜇1 < 𝜇2 < 1. This is to guarantee that there are steps that satisfy both the sufficient decrease and sufficient curvature conditions. Otherwise, the situation illustrated in Fig. 4.27 could take place.

Fig. 4.27 If 𝜇2 < 𝜇1, there might be no point that satisfies the strong Wolfe conditions.

We now present a line search algorithm that finds a step satisfying the strong Wolfe conditions. Enforcing the sufficient curvature condition means we require derivative information (𝜙′), at least using the derivative at the beginning of the line search that we already computed from the gradient. There are various line search algorithms in the literature, including some that are derivative-free. Here, we detail a line search algorithm based on the one developed by Moré and Thuente.78‡

78. Moré and Thuente, Line search algorithms with guaranteed sufficient decrease, 1994.
‡ A similar algorithm is detailed in Chapter 3 of Nocedal and Wright.79
79. Nocedal and Wright, Numerical Optimization, 2006.

The algorithm has two phases:

1. The bracketing phase finds an interval within which we are certain


to find a point that satisfies the strong Wolfe conditions.
2. The pinpointing phase finds a point that satisfies the strong Wolfe
conditions within the interval provided by the bracketing phase.

The bracketing phase is given by Alg. 4.3 and illustrated in Fig. 4.28.
For brevity, we use a notation in the following algorithms where,
for example, 𝜙0 ≡ 𝜙(0) and 𝜙low ≡ 𝜙(𝛼low ). Overall, the bracketing
algorithm increases the step size until it either finds an interval that
must contain a point satisfying the strong Wolfe conditions or a point
that already meets those conditions.
We start the line search with a guess for the step size, which defines
the first interval. For a smooth continuous function, we are guaranteed
to have a minimum within an interval if either of the following hold:

1. The function value at the candidate step is higher than the value
at the start of the line search.
2. The step satisfies sufficient decrease, and the slope is positive.

These two scenarios are illustrated in the top two rows of Fig. 4.28. In
either case, we have an interval within which we can find a point that
satisfies the strong Wolfe conditions using the pinpointing algorithm.
The order of the arguments to the pinpoint function in Alg. 4.3 is significant
because this function assumes that the function value corresponding
to the first 𝛼 is the lower one. The third row in Fig. 4.28 illustrates the
scenario where the point satisfies the strong Wolfe conditions, in which
case the line search is finished.
If the point satisfies sufficient decrease and the slope at that point
is negative, we assume that there are better points farther along the
line, and the algorithm increases the step size. This larger step and the
previous one define a new interval that has moved away from the line
search starting point. We repeat the procedure and check the scenarios
for this new interval. To save function calls, bracketing should return
not just 𝛼 ∗ but also the corresponding function value and gradient to
the outer function.

Algorithm 4.3 Bracketing phase for the line search algorithm

Inputs:
𝛼init > 0: Initial step size
𝜙0, 𝜙′0: Computed in outer routine; passed in to save a function call
0 < 𝜇1 < 1: Sufficient decrease factor
𝜇1 < 𝜇2 < 1: Sufficient curvature factor
𝜎 > 1: Step size increase factor (e.g., 𝜎 = 2)
Outputs:
𝛼∗: Acceptable step size (satisfies the strong Wolfe conditions)

𝛼1 = 0    Define initial bracket
𝛼2 = 𝛼init
𝜙1 = 𝜙0
𝜙′1 = 𝜙′0    Used in pinpoint
first = true
while true do
    𝜙2 = 𝜙(𝛼2)    Also compute 𝜙′2 on this line if the user provides derivatives
    if [𝜙2 > 𝜙0 + 𝜇1 𝛼2 𝜙′0] or [not first and 𝜙2 > 𝜙1] then
        𝛼∗ = pinpoint(𝛼1, 𝛼2, . . .)    1 ⇒ low, 2 ⇒ high
        return 𝛼∗
    end if
    𝜙′2 = 𝜙′(𝛼2)    If not computed previously
    if |𝜙′2| ≤ −𝜇2 𝜙′0 then    Step is acceptable; exit line search
        return 𝛼∗ = 𝛼2
    else if 𝜙′2 ≥ 0 then    Bracketed a minimum
        𝛼∗ = pinpoint(𝛼2, 𝛼1, . . .)    Find acceptable step, 2 ⇒ low, 1 ⇒ high
        return 𝛼∗
    else    Slope is negative
        𝛼1 = 𝛼2
        𝛼2 = 𝜎𝛼2    Increase step
    end if
    first = false
end while

If the bracketing phase does not find a point that satisfies the
strong Wolfe conditions, it finds an interval where we are guaranteed
to find such a point in the pinpointing phase described in Alg. 4.4
and illustrated in Fig. 4.29. The intervals generated by this algorithm,
bounded by 𝛼 low and 𝛼high , always have the following properties:

1. The interval has one or more points that satisfy the strong Wolfe
conditions.
2. Among all the points generated so far that satisfy the sufficient
decrease condition, 𝛼low has the lowest function value.
3. The slope at 𝛼low indicates that the function decreases in the direction of 𝛼high.

The first step of pinpointing is to find a new point within the


given interval. Various techniques can be used to find such a point.
The simplest one is to select the midpoint of the interval (bisection),
Fig. 4.28 Visual representation of the bracketing algorithm (panels: “Minimum bracketed; call pinpoint” and “Conditions are met; line search is done”). The sufficient decrease line is drawn as if 𝛼1 were the starting point for the line search, which is the case for the first line search iteration but not necessarily the case for later iterations.

but this method is limited to a linear convergence rate. It is more


efficient to perform interpolation and select the point that minimizes the
interpolation function, which can be done analytically (see Section 4.3.3).
Using this approach, we can achieve quadratic convergence.
Once we have a new point within the interval, four scenarios are possible, as shown in Fig. 4.29. The first scenario is that 𝜙(𝛼𝑝) is above the sufficient decrease line or greater than or equal to 𝜙(𝛼low). In that scenario, 𝛼𝑝 becomes the new 𝛼high, and we have a new smaller interval.
In the second, third, and fourth scenarios, 𝜙(𝛼𝑝) is below the sufficient decrease line, and 𝜙(𝛼𝑝) < 𝜙(𝛼low). In those scenarios, we check the value of the slope 𝜙′(𝛼𝑝). In the second and third scenarios, we choose the new interval based on the direction in which the slope predicts a local decrease. If the slope is shallow enough (fourth scenario), we have found a point that satisfies the strong Wolfe conditions.

Algorithm 4.4 Pinpoint function for the line search algorithm

Inputs:
𝛼low: Interval endpoint with lower function value
𝛼high: Interval endpoint with higher function value
𝜙0, 𝜙low, 𝜙high, 𝜙′0: Computed in outer routine
𝜙′low, 𝜙′high: One, if not both, computed previously
0 < 𝜇1 < 1: Sufficient decrease factor
𝜇1 < 𝜇2 < 1: Sufficient curvature factor
Outputs:
𝛼∗: Step size satisfying strong Wolfe conditions

𝑘 = 0
while true do
    Find 𝛼𝑝 in interval (𝛼low, 𝛼high)    Use interpolation (see Section 4.3.3); uses 𝜙low, 𝜙high, and 𝜙′ from at least one endpoint
    𝜙𝑝 = 𝜙(𝛼𝑝)    Also evaluate 𝜙′𝑝 if derivatives are available
    if 𝜙𝑝 > 𝜙0 + 𝜇1 𝛼𝑝 𝜙′0 or 𝜙𝑝 > 𝜙low then
        𝛼high = 𝛼𝑝    Also update 𝜙high = 𝜙𝑝, and if using cubic interpolation, 𝜙′high = 𝜙′𝑝
    else
        𝜙′𝑝 = 𝜙′(𝛼𝑝)    If not already computed
        if |𝜙′𝑝| ≤ −𝜇2 𝜙′0 then
            𝛼∗ = 𝛼𝑝
            return 𝛼∗
        else if 𝜙′𝑝 (𝛼high − 𝛼low) ≥ 0 then
            𝛼high = 𝛼low
        end if
        𝛼low = 𝛼𝑝
    end if
    𝑘 = 𝑘 + 1
end while

In theory, the line search given in Alg. 4.3 followed by Alg. 4.4 is
guaranteed to find a step length satisfying the strong Wolfe conditions.
In practice, some additional considerations are needed for improved
robustness. One of these criteria is to ensure that the new point
in the pinpoint algorithm is not so close to an endpoint as to cause
the interpolation to be ill-conditioned. A fallback option in case
the interpolation fails could be a simpler algorithm, such as bisection.
Another criterion is to ensure that the loop does not continue indefinitely
in case finite-precision arithmetic leads to indistinguishable function
value changes. A limit on the number of iterations might be necessary.
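To make the two phases concrete, here is a compact Python sketch of Algs. 4.3 and 4.4 (not the authors' implementation; for simplicity it uses bisection inside pinpoint rather than the interpolation of Section 4.3.3, re-evaluates 𝜙(𝛼low) instead of caching it, and omits some of the robustness safeguards discussed above):

    def line_search_wolfe(phi, dphi, alpha_init=1.0, mu1=1e-4, mu2=0.9,
                          sigma=2.0, max_iter=50):
        """Bracketing (Alg. 4.3) followed by pinpointing (Alg. 4.4)."""
        phi0, dphi0 = phi(0.0), dphi(0.0)

        def pinpoint(a_low, a_high):
            for _ in range(max_iter):
                a_p = 0.5 * (a_low + a_high)            # bisection (interpolation is faster)
                phi_p = phi(a_p)
                if phi_p > phi0 + mu1 * a_p * dphi0 or phi_p > phi(a_low):
                    a_high = a_p                         # new point becomes the high end
                else:
                    dphi_p = dphi(a_p)
                    if abs(dphi_p) <= -mu2 * dphi0:      # strong Wolfe conditions satisfied
                        return a_p
                    if dphi_p * (a_high - a_low) >= 0:
                        a_high = a_low
                    a_low = a_p
            return a_low                                 # fallback

        a1, a2 = 0.0, alpha_init
        phi1 = phi0
        first = True
        for _ in range(max_iter):
            phi2 = phi(a2)
            if phi2 > phi0 + mu1 * a2 * dphi0 or (not first and phi2 > phi1):
                return pinpoint(a1, a2)                  # minimum bracketed
            dphi2 = dphi(a2)
            if abs(dphi2) <= -mu2 * dphi0:
                return a2                                # a2 already acceptable
            if dphi2 >= 0:
                return pinpoint(a2, a1)                  # bracketed; a2 is the low end
            a1, phi1 = a2, phi2                          # slope still negative: move forward
            a2 *= sigma
            first = False
        return a2                                        # fallback if no bracket was found

A production implementation would cache function values, interpolate within the bracket, and guard against degenerate intervals.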

Example 4.9 Line search with bracketing and pinpointing

Let us perform the same line search as in Ex. 4.8 but using bracketing and pinpointing instead of backtracking. In this example, we use quadratic interpolation, the bracketing phase uses a step size increase factor of 𝜎 = 2, and the sufficient curvature factor is 𝜇2 = 0.9. Bracketing is achieved in the first iteration by using a large initial step of 𝛼init = 1.2 (Fig. 4.30, left). Then pinpointing finds an improved point through interpolation. The small initial step of 𝛼init = 0.05 (Fig. 4.30, right) does not satisfy the strong Wolfe conditions, and the bracketing phase moves forward toward a flatter part of the function. The result is a point that is much better than the one obtained with backtracking.

Fig. 4.29 Visual representation of the pinpointing algorithm. The labels in red indicate the new interval endpoints.

Fig. 4.30 Example of a line search iteration with different initial steps.

Tip 4.3 When stuck, plot the line search

When gradient-based optimizers cannot move away from a non-optimal


point, it usually happens during the line search. To understand why the
optimizer is stuck, plot the iterations along the line search, add more points, or
plot the whole line if you can afford to. Even if you have a high-dimensional
problem, you can always plot the line search, which will be understandable
because it is one-dimensional.

4.3.3 Interpolation for Pinpointing


Interpolation is recommended to find a new point within each interval
at the pinpointing phase. Once we have an interpolation function,
we find the new point by determining the analytic minimum of that
function. This accelerates the convergence compared with bisection. We
consider two options: quadratic interpolation and cubic interpolation.
Because we have the function value and derivative at one endpoint
of the interval and at least the function value at the other endpoint, one
option is to perform quadratic interpolation to estimate the minimum
within the interval.
The quadratic can be written as

𝜙̃(𝛼) = 𝑐0 + 𝑐1𝛼 + 𝑐2𝛼² ,    (4.33)

where 𝑐 0 , 𝑐 1 , and 𝑐 2 are constants to be determined by interpolation.


Suppose that we have the function value and the derivative at 𝛼1
and the function value at 𝛼 2 , as illustrated in Fig. 4.31. These values
correspond to 𝛼 low and 𝛼 high in the pinpointing algorithm, but we use
the more generic indices 1 and 2 because the formulas of this section
are not dependent on which one is lower or higher. Then, the boundary
conditions at the endpoints are

𝜙(𝛼1) = 𝑐0 + 𝑐1𝛼1 + 𝑐2𝛼1²
𝜙(𝛼2) = 𝑐0 + 𝑐1𝛼2 + 𝑐2𝛼2²    (4.34)
𝜙′(𝛼1) = 𝑐1 + 2𝑐2𝛼1 .

We can use these three equations to find the three coefficients based
on function and derivative values. Once we have the coefficients for
the quadratic, we can find the minimum of the quadratic analytically
by finding the point 𝛼∗ such that 𝜙̃′(𝛼∗) = 0, which is 𝛼∗ = −𝑐1/(2𝑐2). Substituting the analytic solution for the coefficients as a function of the given values into this expression yields the final expression for the minimizer of the quadratic:

\alpha^* = \frac{2\alpha_1 \left[\phi(\alpha_2) - \phi(\alpha_1)\right] + \phi'(\alpha_1)\left(\alpha_1^2 - \alpha_2^2\right)}{2\left[\phi(\alpha_2) - \phi(\alpha_1) + \phi'(\alpha_1)(\alpha_1 - \alpha_2)\right]} .    (4.35)
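A direct transcription of Eq. 4.35 into Python (the function and argument names are assumptions) is:

    def quadratic_interp_min(a1, a2, phi1, phi2, dphi1):
        """Minimizer of the quadratic through (a1, phi1) and (a2, phi2)
        with slope dphi1 at a1 (Eq. 4.35)."""
        num = 2.0 * a1 * (phi2 - phi1) + dphi1 * (a1**2 - a2**2)
        den = 2.0 * (phi2 - phi1 + dphi1 * (a1 - a2))
        return num / den

If the denominator is close to zero (a nearly linear interval), the result should be discarded in favor of bisection, as noted at the end of this section.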

Performing this quadratic interpolation for successive intervals is similar to the Newton method and also converges quadratically. The pure Newton method also models a quadratic, but it is based on the information at a single point (function value, derivative, and curvature), as opposed to information at two points.

Fig. 4.31 Quadratic interpolation with two function values and one derivative.

If computing additional derivatives is inexpensive, or we already evaluated 𝜙′(𝛼𝑖) (either as part of Alg. 4.3 or as part of checking the strong Wolfe conditions in Alg. 4.4), then we have the function values and derivatives at both points. With these four pieces of information, we can perform a cubic interpolation,
𝜙̃(𝛼) = 𝑐0 + 𝑐1𝛼 + 𝑐2𝛼² + 𝑐3𝛼³ ,    (4.36)

as shown in Fig. 4.32. To determine the four coefficients, we apply the boundary conditions:

𝜙(𝛼1) = 𝑐0 + 𝑐1𝛼1 + 𝑐2𝛼1² + 𝑐3𝛼1³
𝜙(𝛼2) = 𝑐0 + 𝑐1𝛼2 + 𝑐2𝛼2² + 𝑐3𝛼2³    (4.37)
𝜙′(𝛼1) = 𝑐1 + 2𝑐2𝛼1 + 3𝑐3𝛼1²
𝜙′(𝛼2) = 𝑐1 + 2𝑐2𝛼2 + 3𝑐3𝛼2² .

Fig. 4.32 Cubic interpolation with function values and derivatives at endpoints.

Using these four equations, we can find expressions for the four co-
efficients as a function of the four pieces of information. Similar
to the quadratic interpolation function, we can find the solution for
𝜙̃′(𝛼∗) = 𝑐1 + 2𝑐2𝛼∗ + 3𝑐3𝛼∗² = 0 as a function of the coefficients. There could be two valid solutions, but we are only interested in the minimum, for which the curvature is positive; that is, 𝜙̃″(𝛼∗) = 2𝑐2 + 6𝑐3𝛼∗ > 0.
Substituting the coefficients with the expressions obtained from solving
the boundary condition equations and selecting the minimum solution
yields
\alpha^* = \alpha_2 - (\alpha_2 - \alpha_1)\,\frac{\phi'(\alpha_2) + \beta_2 - \beta_1}{\phi'(\alpha_2) - \phi'(\alpha_1) + 2\beta_2} ,    (4.38)

where

\beta_1 = \phi'(\alpha_1) + \phi'(\alpha_2) - 3\,\frac{\phi(\alpha_1) - \phi(\alpha_2)}{\alpha_1 - \alpha_2}
\beta_2 = \operatorname{sign}(\alpha_2 - \alpha_1)\sqrt{\beta_1^2 - \phi'(\alpha_1)\,\phi'(\alpha_2)} .    (4.39)

These interpolations become ill-conditioned if the interval becomes


too small. The interpolation may also lead to points outside the bracket.
In such cases, we can switch to bisection for the problematic iterations.
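A corresponding Python sketch of Eqs. 4.38 and 4.39 (the names are assumptions, and the square-root argument should be guarded in practice) is:

    import numpy as np

    def cubic_interp_min(a1, a2, phi1, phi2, dphi1, dphi2):
        """Minimizer of the cubic matching function values and slopes
        at a1 and a2 (Eqs. 4.38 and 4.39)."""
        beta1 = dphi1 + dphi2 - 3.0 * (phi1 - phi2) / (a1 - a2)
        # beta1**2 - dphi1*dphi2 can be negative; fall back to bisection in that case
        beta2 = np.sign(a2 - a1) * np.sqrt(beta1**2 - dphi1 * dphi2)
        return a2 - (a2 - a1) * (dphi2 + beta2 - beta1) / (dphi2 - dphi1 + 2.0 * beta2)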

4.4 Search Direction

As stated at the beginning of this chapter, each iteration of an uncon-


strained gradient-based algorithm consists of two main steps: deter-
mining the search direction and performing the line search (Alg. 4.1).
The optimization algorithms are named after the method used to find
the search direction, 𝑝 𝑘 , and can use any suitable line search. We start
by introducing two first-order methods that only require the gradient
and then explain two second-order methods that require the Hessian,
or at least an approximation of the Hessian.

4.4.1 Steepest Descent


The steepest-descent method (also called gradient descent) is a simple and
intuitive method for determining the search direction. As discussed in
Section 4.1.1, the gradient points in the direction of steepest increase,
so −∇ 𝑓 points in the direction of steepest descent, as shown in Fig. 4.33.
Thus, our search direction at iteration 𝑘 is simply

𝑝 = −∇ 𝑓 . (4.40)

One major issue with the steepest descent is that, in general, the entries in the gradient and its overall scale can vary greatly depending on the magnitudes of the objective function and design variables. The gradient itself contains no information about an appropriate step length, and therefore the search direction is often better posed as a normalized direction,

𝑝𝑘 = −∇𝑓𝑘 / ‖∇𝑓𝑘‖ .    (4.41)

Fig. 4.33 The steepest-descent direction points in the opposite direction of the gradient.

Algorithm 4.5 provides the complete steepest descent procedure.

Algorithm 4.5 Steepest descent

Inputs:
𝑥 0 : Starting point
𝜏: Convergence tolerance
Outputs:
𝑥 ∗ : Optimal point
𝑓 (𝑥 ∗ ): Minimum function value
𝑘 = 0    Initialize iteration counter
while ‖∇𝑓‖∞ > 𝜏 do    Optimality condition
    𝑝𝑘 = −∇𝑓𝑘 / ‖∇𝑓𝑘‖    Normalized steepest-descent direction
    Estimate 𝛼init from Eq. 4.43
    𝛼𝑘 = linesearch(𝑝𝑘 , 𝛼init)    Perform a line search
    𝑥𝑘+1 = 𝑥𝑘 + 𝛼𝑘 𝑝𝑘    Update design variables
    𝑘 = 𝑘 + 1    Increment iteration index
end while

Regardless of whether we choose to normalize the search direction


or not, the gradient does not provide enough information to inform
a good guess of the initial step size for the line search. As we saw in
Section 4.3, this initial choice has a large impact on the efficiency of
the line search because the first guess could be orders of magnitude
too small or too large. The second-order methods described later in
this section are better in this respect. In the meantime, we can make
a guess of the step size for a given line search based on the result of
the previous one. Assuming that we will obtain a decrease in objective
function at the current line search that is comparable to the previous
one, we can write

𝛼𝑘 ∇𝑓𝑘ᵀ𝑝𝑘 ≈ 𝛼𝑘−1 ∇𝑓𝑘−1ᵀ𝑝𝑘−1 .    (4.42)

Solving for the step length, we obtain the guess

𝛼𝑘 = 𝛼𝑘−1 (∇𝑓𝑘−1ᵀ𝑝𝑘−1) / (∇𝑓𝑘ᵀ𝑝𝑘) .    (4.43)

Although this expression could be simplified for the steepest descent,


we leave it as is so that it is applicable to other methods. If the slope of
the function increases in magnitude relative to the previous line search,
this guess decreases relative to the previous line search step length, and
vice versa. This is just the first step length in the new line search, after
which we proceed as usual.
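The following Python sketch puts Alg. 4.5 and the step-length guess of Eq. 4.43 together (not from the text; the linesearch argument and its signature are assumptions, and any routine from Section 4.3, such as the backtracking sketch shown earlier, could be adapted to it):

    import numpy as np

    def steepest_descent(f, grad, x0, linesearch, tau=1e-6, alpha_init=1.0, max_iter=1000):
        """Steepest descent (Alg. 4.5) with the initial step estimate of Eq. 4.43.

        linesearch(phi, phi0, dphi0, alpha0) must return an acceptable step length.
        """
        x = np.asarray(x0, dtype=float)
        fx, g = f(x), grad(x)
        alpha_prev = slope_prev = None
        for _ in range(max_iter):
            if np.linalg.norm(g, np.inf) <= tau:
                break
            p = -g / np.linalg.norm(g)            # normalized steepest-descent direction
            slope = g @ p                         # phi'(0); always negative here
            alpha0 = alpha_init if alpha_prev is None else alpha_prev * slope_prev / slope  # Eq. 4.43
            phi = lambda a, x=x, p=p: f(x + a * p)
            alpha = linesearch(phi, fx, slope, alpha0)
            x = x + alpha * p
            fx, g = f(x), grad(x)
            alpha_prev, slope_prev = alpha, slope
        return x, fx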
Although steepest descent sounds like the best possible search
direction for decreasing a function, it generally is not. The reason is
that when a function curvature varies significantly with direction, the
gradient alone is a poor representation of function behavior beyond a
small neighborhood, as illustrated previously in Fig. 4.19.

Example 4.10 Steepest descent with varying amount of curvature


Consider the following quadratic function:

𝑓(𝑥1, 𝑥2) = 𝑥1² + 𝛽𝑥2² ,
where 𝛽 can be set to adjust the curvature in the 𝑥2 direction. In Fig. 4.34, we
show this function for 𝛽 = 1, 5, 15. The starting point is 𝑥0 = (10, 1). When
𝛽 = 1 (left), this quadratic has the same curvature in all directions, and the
steepest-descent direction points directly to the minimum. When 𝛽 > 1 (middle
and right), this is no longer the case, and steepest descent shows abrupt changes
in the subsequent search directions. This zigzagging is an inefficient way to
approach the minimum. The higher the difference in curvature, the more
iterations it takes.

Fig. 4.34 Iteration history for a quadratic function, with three different curvatures, using the steepest-descent method with an exact line search (small enough 𝜇2). The method takes 1 iteration for 𝛽 = 1, 32 iterations for 𝛽 = 5, and 111 iterations for 𝛽 = 15.

The behavior shown in Ex. 4.10 is expected, and we can show it mathematically. Assuming we perform an exact line search at each
iteration, this means selecting the optimal value for 𝛼 along the line
search:
\frac{\partial f(x_k + \alpha p_k)}{\partial \alpha} = 0 \;\Rightarrow\;
\frac{\partial f(x_{k+1})}{\partial \alpha} = 0 \;\Rightarrow\;
\frac{\partial f(x_{k+1})}{\partial x_{k+1}}\,\frac{\partial (x_k + \alpha p_k)}{\partial \alpha} = 0 \;\Rightarrow\;    (4.44)
\nabla f_{k+1}^\top p_k = 0 \;\Rightarrow\;
-p_{k+1}^\top p_k = 0 .
Hence, each search direction is orthogonal to the previous one. When
performing an exact line search, the gradient projection in the line search
direction vanishes at the minimum, which means that the gradient is
orthogonal to the search direction, as shown in Fig. 4.35.
As discussed in the last section, exact line searches are not desirable,
so the search directions are not orthogonal. However, the overall
zigzagging behavior still exists.
Fig. 4.35 The gradient projection in the line search direction vanishes at the line search minimum.

Example 4.11 Steepest descent applied to the bean function

We now find the minimum of the bean function,

f(x_1, x_2) = (1 - x_1)^2 + (1 - x_2)^2 + \frac{1}{2}\left(2 x_2 - x_1^2\right)^2 ,

using the steepest-descent algorithm with an exact line search, and a convergence tolerance of ‖∇𝑓‖∞ ≤ 10⁻⁶. The optimization path is shown in Fig. 4.36. Although it takes only a few iterations to get close to the minimum, it takes many more to satisfy the specified convergence tolerance.

Fig. 4.36 Steepest-descent optimization path (34 iterations).

Tip 4.4 Scale the design variables and the objective function

Problem scaling is one of the most crucial considerations in practical
Problem scaling is one of the most crucial considerations in practical
optimization. Steepest descent is susceptible to scaling, as demonstrated in
Ex. 4.10. Even though we will learn about less sensitive methods, poor scaling
can decrease the effectiveness of any method for general nonlinear functions.
A common cause of poor scaling is unit choice. For example, consider a
problem with two types of design variables, where one type is the material
thickness, on the order of 10−6 m, and the other type is the length of the
structure, on the order of 1 m. If both distances are measured in meters, the
derivative in the thickness direction is much larger than the derivative in the
length direction. In other words, the design space would have a valley that
is steep in one direction and shallow in the other. The optimizer would have
great difficulty in navigating this type of design space.
Similarly, if the objective is power and a typical value is 106 W, the gradients
would likely be relatively small, and satisfying convergence tolerances may be
challenging.
A good rule of thumb is to scale the objective function and every design
variable to be around unity. The scaling of the objective is only needed after
the model analysis computes the function and can be written as

𝑓¯ = 𝑓 /𝑠 𝑓 , (4.45)

where 𝑠 𝑓 is the scaling factor, which could be the value of the objective at the
starting point, 𝑓 (𝑥0 ), or another typical value. Multiplying the functions by a

scalar does not change the optimal solution but can significantly improve the
ability of the optimizer to find the optimum.
Scaling the design variables is more involved because scaling them changes the value that the optimizer would pass to the model and thus changes their meaning. In general, we might use different scaling factors for different types of variables, so we represent these as an 𝑛-vector, 𝑠𝑥. Starting with the physical design variables, 𝑥0, we obtain the scaled variables by dividing them by the scaling factors:

𝑥̄0 = 𝑥0 ⊘ 𝑠𝑥 ,    (4.46)

where ⊘ denotes element-wise division. Then, because the optimizer works with the scaled variables, we need to convert them back to physical variables by multiplying them by the scaling factors:

𝑥 = 𝑥̄ ⊙ 𝑠𝑥 ,    (4.47)

where ⊙ denotes element-wise multiplication. Finally, we must also convert the scaled variables to their physical values after the optimization is completed. The complete process is shown in Fig. 4.37.

Fig. 4.37 Scaling works by providing a scaled version of the design variables and objective function to the optimizer. However, the model analysis still needs to work with the original variables and function.

It is not necessary that the objective and all variables be precisely 1—which is impossible to maintain as the optimization progresses. Instead, this heuristic suggests that the objective and all variables should have an order of magnitude
of 1. If one of the variables or functions is expected to vary across multiple
orders of magnitude during an optimization, one effective way to scale is to
take the logarithm. For example, suppose the objective was expected to vary
across multiple orders of magnitude. In that case, we could minimize log( 𝑓 )
instead of minimizing 𝑓.∗

∗ If 𝑓 can be negative, a transformation is required to ensure that the logarithm argument is always positive.

This heuristic still does not guarantee that the derivatives are well scaled, but it provides a reasonable starting point for further fine-tuning of the problem scaling. A scaling example is discussed in Ex. 4.19.
Sometimes, additional adjustment is needed if the objective is far less
sensitive to some of the design variables than others (i.e., the entries in the
gradient span various orders of magnitude). A more appropriate but more
involved approach is to scale the variables and objective function such that the
gradient elements have a similar magnitude (ideally of order 1). Achieving a
well-scaled gradient sometimes requires adjusting inputs and outputs away
from the earlier heuristic. Sometimes this occurs because the objective is much
less sensitive to a particular variable.
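As a sketch of the process in Fig. 4.37 (the function names here are assumptions, not an established API), the scaling can be implemented as a thin wrapper around the model:

    import numpy as np

    def make_scaled_objective(f, x0, s_x, s_f):
        """Wrap an objective so the optimizer sees scaled variables and a scaled objective.

        s_x : array of design-variable scaling factors (Eqs. 4.46 and 4.47)
        s_f : objective scaling factor (Eq. 4.45)
        """
        s_x = np.asarray(s_x, dtype=float)
        xbar0 = np.asarray(x0, dtype=float) / s_x     # scaled starting point (element-wise)

        def fbar(xbar):
            x = np.asarray(xbar) * s_x                # back to physical variables before the model call
            return f(x) / s_f                         # scaled objective returned to the optimizer
        return xbar0, fbar

    def unscale(xbar_opt, s_x):
        """Convert the optimizer's scaled solution back to physical variables."""
        return np.asarray(xbar_opt) * np.asarray(s_x)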

4.4.2 Conjugate Gradient


Steepest descent generally performs poorly, especially if the problem is
not well scaled, like the quadratic example in Fig. 4.34. The conjugate
gradient method updates the search directions such that they do
not zigzag as much. This method is based on the linear conjugate
4 Unconstrained Gradient-Based Optimization 114

gradient method, which was designed to solve linear equations. We


first introduce the linear conjugate gradient method and then adapt it
to the nonlinear case.
For the moment, let us assume that we have the following quadratic objective function:

f(x) = \frac{1}{2} x^\top A x - b^\top x ,    (4.48)

where 𝐴 is a positive-definite Hessian and −𝑏 is the gradient at 𝑥 = 0. The constant term is omitted with no loss of generality because it does not change the location of the minimum. To find the minimum of this quadratic, we require

∇𝑓(𝑥∗) = 𝐴𝑥∗ − 𝑏 = 0 .    (4.49)

Thus, finding the minimum of a quadratic amounts to solving the linear system 𝐴𝑥 = 𝑏, and the residual vector is the gradient of the quadratic.

Fig. 4.38 For a quadratic function with elliptical contours and the principal axis aligned with the coordinate axis, we can find the minimum in 𝑛 steps, where 𝑛 is the number of dimensions, by using a coordinate search (2 iterations in the example shown).

If 𝐴 were a positive-definite diagonal matrix, the contours would be elliptical, as shown in Fig. 4.38 (or hyper-ellipsoids in the 𝑛-dimensional case), and the axes of the ellipses would align with the coordinate directions. In that case, we could converge to the minimum by successively performing an exact line search in each coordinate direction for a total of 𝑛 line searches.

In the more general case (but still assuming 𝐴 to be positive definite), the axes of the ellipses form an orthogonal coordinate system in some other orientation. A coordinate search would no longer work as well in this case, as illustrated in Fig. 4.39.

Fig. 4.39 For a quadratic function with the elliptical principal axis not aligned with the coordinate axis, more iterations are needed to find the minimum using a coordinate search (16 iterations in the example shown).

Recall from Section 4.1.2 that the eigenvectors of the Hessian represent the directions of principal curvature, which correspond to the axes of the ellipses. Therefore, we could successively perform a line search along the direction defined by each eigenvector and again converge to the minimum with 𝑛 line searches, as illustrated in Fig. 4.40. The problem with this approach is that we would have to compute the eigenvectors of 𝐴, a computation whose cost is 𝒪(𝑛³).

Fig. 4.40 We can converge to the minimum of a quadratic function by minimizing along each Hessian eigenvector (2 iterations in the example shown).

Fortunately, the eigenvector directions are not the only set of directions that can minimize the quadratic function in 𝑛 line searches. To find out which directions can achieve this, let us express the path from the origin to the minimum of the quadratic as a sequence of 𝑛 steps with directions 𝑝𝑖 and length 𝛼𝑖,

x^* = \sum_{i=0}^{n-1} \alpha_i p_i .    (4.50)

Thus, we have represented the solution as a linear combination of 𝑛


vectors. Substituting this into the quadratic (Eq. 4.48), we get
f(x^*) = f\!\left(\sum_{i=0}^{n-1} \alpha_i p_i\right)
       = \frac{1}{2}\left(\sum_{i=0}^{n-1} \alpha_i p_i\right)^{\!\top} A \left(\sum_{j=0}^{n-1} \alpha_j p_j\right) - b^\top \sum_{i=0}^{n-1} \alpha_i p_i    (4.51)
       = \frac{1}{2}\sum_{i=0}^{n-1}\sum_{j=0}^{n-1} \alpha_i \alpha_j\, p_i^\top A p_j - \sum_{i=0}^{n-1} \alpha_i\, b^\top p_i .

Suppose that the vectors 𝑝0 , 𝑝1 , . . . , 𝑝 𝑛−1 are conjugate with respect to


𝐴; that is, they have the following property:

𝑝𝑖ᵀ𝐴𝑝𝑗 = 0,  for all 𝑖 ≠ 𝑗.    (4.52)

Then, the double-sum term in Eq. 4.51 can be simplified to a single sum and we can write

f(x^*) = \sum_{i=0}^{n-1}\left(\frac{1}{2}\alpha_i^2\, p_i^\top A p_i - \alpha_i\, b^\top p_i\right) .    (4.53)

Because each term in this sum involves only one direction 𝑝𝑖, we have reduced the original problem to a series of one-dimensional quadratic functions that can be minimized one at a time. Two possible conjugate directions are shown for the two-dimensional case in Fig. 4.41.

Fig. 4.41 By minimizing along a sequence of conjugate directions in turn, we can find the minimum of a quadratic in 𝑛 steps, where 𝑛 is the number of dimensions.

Each one-dimensional problem corresponds to minimizing the quadratic with respect to the step length 𝛼𝑖. Differentiating each term and setting it to zero yields

\alpha_i\, p_i^\top A p_i - b^\top p_i = 0 \;\Rightarrow\; \alpha_i = \frac{b^\top p_i}{p_i^\top A p_i} ,    (4.54)

which corresponds to the result of an exact line search in direction 𝑝𝑖.
There are many possible sets of vectors that are conjugate with respect to 𝐴, including the eigenvectors. The conjugate gradient method finds these directions starting with the steepest-descent direction,

𝑝0 = −∇𝑓(𝑥0) ,    (4.55)

and then finds each subsequent direction using the update,

𝑝𝑘 = −∇𝑓𝑘 + 𝛽𝑘 𝑝𝑘−1 .    (4.56)

Fig. 4.42 The conjugate gradient search direction update combines the steepest-descent direction with the previous conjugate gradient direction.

For a positive 𝛽, the result is a new direction somewhere between the


current steepest descent and the previous search direction, as shown in
Fig. 4.42. The factor 𝛽 is set such that 𝑝 𝑘 and 𝑝 𝑘−1 are conjugate with
respect to 𝐴. One option to compute a 𝛽 that achieves conjugacy is
given by the Fletcher–Reeves formula,

\beta_k = \frac{\nabla f_k^\top \nabla f_k}{\nabla f_{k-1}^\top \nabla f_{k-1}} .    (4.57)

This formula is derived in Appendix B.4 as Eq. B.40 in the context of


linear solvers. Here, we replace the residual of the linear system with
the gradient of the quadratic because they are equivalent. Using the
directions given by Eq. 4.56 and the step lengths given by Eq. 4.54,
we can minimize a quadratic in 𝑛 steps, where 𝑛 is the size of 𝑥.
The minimization shown in Fig. 4.41 starts with the steepest-descent
direction and then computes one update to converge to the minimum in
two iterations using exact line searches. The linear conjugate gradient
method is detailed in Alg. B.2.
However, we are interested in minimizing general nonlinear func-
tions. We can adapt the linear conjugate gradient method to the
nonlinear case by doing the following:

1. Use the gradient of the nonlinear function in the search direction


update (Eq. 4.56) and the expression for 𝛽 (Eq. 4.57). This gradient
can be computed using any of the methods in Chapter 6.
2. Perform an inexact line search instead of doing the exact line
search. This frees us from providing the Hessian vector products
required for an exact line search (see Eq. 4.54). A line search that
satisfies the strong Wolfe conditions is a good choice, but we need
a stricter range in the sufficient decrease and sufficient curvature
parameters (0 < 𝜇1 < 𝜇2 < 1/2).† This stricter requirement on 𝜇2 is necessary with the Fletcher–Reeves formula (Eq. 4.57) to ensure descent directions. As a first guess for 𝛼 in the line search, we can use the same estimate proposed for steepest descent (Eq. 4.43).

† For more details on the line search requirements, see Sec. 5.2 in Nocedal and Wright.79
79. Nocedal and Wright, Numerical Optimization, 2006.

3. Reset the search direction periodically back to the steepest-descent


direction. In practice, resetting is often helpful to remove old
information that is no longer useful. Some methods reset every 𝑛
iterations, motivated by the fact that the linear case only generates
𝑛 conjugate vectors. A more mathematical approach resets the
direction when

\frac{\left|\nabla f_k^\top \nabla f_{k-1}\right|}{\left|\nabla f_k^\top \nabla f_k\right|} \geq 0.1 .    (4.58)

The full procedure is given in Alg. 4.6. As with steepest descent, we


may use normalized search directions.
The nonlinear conjugate gradient method is no longer guaranteed
to converge in 𝑛 steps like its linear counterpart, but it significantly
outperforms the steepest-descent method. The change required relative
to steepest descent is minimal: save information on the search direction
and gradient from the previous iteration, and add the 𝛽 term to the
search direction update. Therefore, there is rarely a reason to prefer
steepest descent. The parameter 𝛽 can be interpreted as a “damping
parameter” that prevents each search direction from varying too much
relative to the previous one. When the function steepens, the damping
becomes larger, and vice versa.
The formula for 𝛽 in Eq. 4.57 is only one of several options. Another
well-known option is the Polak–Ribière formula, which is given by

\beta_k = \frac{\nabla f_k^\top \left(\nabla f_k - \nabla f_{k-1}\right)}{\nabla f_{k-1}^\top \nabla f_{k-1}} .    (4.59)

The conjugate gradient method with the Polak–Ribière formula tends


to converge more quickly than with the Fletcher–Reeves formula, and
this method does not require the more stringent range for 𝜇2 . However,
regardless of the value of 𝜇2 , the strong Wolfe conditions still do not
guarantee that 𝑝 𝑘 is a descent direction (𝛽 might become negative). This
issue can be addressed by forcing 𝛽 to remain nonnegative:

𝛽 ← max(0, 𝛽) . (4.60)

This equation automatically triggers a reset whenever 𝛽 = 0 (see


Eq. 4.56), so in this approach, other checks on resetting can be removed
from Alg. 4.6.

Algorithm 4.6 Nonlinear conjugate gradient

Inputs:
𝑥 0 : Starting point
𝜏: Convergence tolerance
Outputs:
𝑥 ∗ : Optimal point
𝑓 (𝑥 ∗ ): Minimum function value

𝑘 = 0    Initialize iteration counter
while ‖∇𝑓𝑘‖∞ > 𝜏 do    Optimality condition
    if 𝑘 = 0 or reset = true then    First direction, and at resets
        𝑝𝑘 = −∇𝑓𝑘 / ‖∇𝑓𝑘‖
    else
        𝛽𝑘 = (∇𝑓𝑘ᵀ∇𝑓𝑘) / (∇𝑓𝑘−1ᵀ∇𝑓𝑘−1)
        𝑝𝑘 = −∇𝑓𝑘 / ‖∇𝑓𝑘‖ + 𝛽𝑘 𝑝𝑘−1    Conjugate gradient direction update
    end if
    𝛼𝑘 = linesearch(𝑝𝑘 , 𝛼init)    Perform a line search
    𝑥𝑘+1 = 𝑥𝑘 + 𝛼𝑘 𝑝𝑘    Update design variables
    𝑘 = 𝑘 + 1    Increment iteration index
end while
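A minimal Python sketch of Alg. 4.6 is shown below (not the authors' code; the linesearch signature is the same assumed one as in the steepest-descent sketch, and the reset test of Eq. 4.58 is replaced by a simple descent-direction check). The variant argument selects between the Fletcher–Reeves (Eq. 4.57) and Polak–Ribière (Eqs. 4.59 and 4.60) formulas:

    import numpy as np

    def nonlinear_cg(f, grad, x0, linesearch, tau=1e-6, alpha_init=1.0,
                     max_iter=1000, variant="FR"):
        x = np.asarray(x0, dtype=float)
        fx, g = f(x), grad(x)
        p = g_prev = None
        for _ in range(max_iter):
            if np.linalg.norm(g, np.inf) <= tau:
                break
            if p is None:
                p = -g / np.linalg.norm(g)                         # steepest descent at start/reset
            else:
                if variant == "FR":
                    beta = (g @ g) / (g_prev @ g_prev)             # Eq. 4.57
                else:
                    beta = (g @ (g - g_prev)) / (g_prev @ g_prev)  # Eq. 4.59
                    beta = max(0.0, beta)                          # Eq. 4.60
                p = -g / np.linalg.norm(g) + beta * p              # direction update (Eq. 4.56)
            slope = g @ p
            if slope >= 0:                                         # safeguard: reset if not a descent direction
                p = -g / np.linalg.norm(g)
                slope = g @ p
            phi = lambda a, x=x, p=p: f(x + a * p)
            alpha = linesearch(phi, fx, slope, alpha_init)
            x = x + alpha * p
            g_prev = g
            fx, g = f(x), grad(x)
        return x, fx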

Example 4.12 Conjugate gradient applied to the bean function

Minimizing the same bean function from Ex. 4.11 using the same line search algorithm and settings, we get the optimization path shown in Fig. 4.43. The changes in direction for the conjugate gradient method are smaller than for steepest descent, and it takes fewer iterations to achieve the same convergence tolerance.

Fig. 4.43 Conjugate gradient optimization path (22 iterations).

4.4.3 Newton’s Method

The steepest-descent and conjugate gradient methods use only first-order information (the gradient). Newton’s method uses second-order
(curvature) information to get better estimates for search directions. The
main advantage of Newton’s method is that, unlike first-order methods,
it provides an estimate of the step length because the curvature predicts
where the function derivative is zero.
In Section 3.8, we presented Newton’s method for solving nonlinear
equations. Newton’s method for minimizing functions is based on the
same principle, but instead of solving 𝑟(𝑢) = 0, we solve for ∇ 𝑓 (𝑥) = 0.
As in Section 3.8, we can derive Newton’s method for one-dimensional
function minimization from the Taylor series approximation,

f(x_k + s) \approx f(x_k) + s f'(x_k) + \frac{1}{2} s^2 f''(x_k) .    (4.61)
We now include a second-order term to get a quadratic that we can
minimize. We minimize this quadratic approximation by differentiating
with respect to the step 𝑠 and setting the derivative to zero, which yields

f'(x_k) + s f''(x_k) = 0 \;\Rightarrow\; s = -\frac{f'(x_k)}{f''(x_k)} .    (4.62)
Thus, the Newton update is

x_{k+1} = x_k - \frac{f_k'}{f_k''} .    (4.63)

We could also derive this equation by taking Newton’s method for root
finding (Eq. 3.24) and replacing 𝑟(𝑢) with 𝑓 0(𝑥).
Example 4.13 Newton’s method for one-dimensional minimization

Suppose we want to minimize the following single-variable function:

𝑓(𝑥) = (𝑥 − 2)⁴ + 2𝑥² − 4𝑥 + 4 .

The first derivative is

𝑓′(𝑥) = 4(𝑥 − 2)³ + 4𝑥 − 4 ,

and the second derivative is

𝑓″(𝑥) = 12(𝑥 − 2)² + 4 .

Starting from 𝑥0 = 3, we can form the quadratic (Eq. 4.61) using the function value and the first and second derivatives evaluated at that point, as shown in the top plot in Fig. 4.44. Then, the minimum of the quadratic is given analytically by the Newton update (Eq. 4.63). We successively form quadratics at each iteration and minimize them to find the next iteration. This is equivalent to finding the zero of the function’s first derivative, as shown in the bottom plot in Fig. 4.44.

Fig. 4.44 Newton’s method for finding roots can be adapted for function minimization by formulating it to find a zero of the derivative. We step to the minimum of a quadratic at each iteration (top) or equivalently find the root of the function’s first derivative (bottom).

Like the one-dimensional case, we can build an 𝑛-dimensional Taylor series expansion about the current design point:
2 1 iteration
5
where 𝑠 is a vector centered at 𝑥 𝑘 . Similar to the one-dimensional case,
𝑥0
we can find the step 𝑠 𝑘 that minimizes this quadratic model by taking
0 𝑥∗
the derivative with respect to 𝑠 and setting that equal to zero:

d 𝑓 (𝑥 𝑘 + 𝑠)
= ∇ 𝑓𝑘 + 𝐻𝑘 𝑠 = 0 . (4.65) −5
d𝑠
−5 0 5 10
Thus, each Newton step is the solution of a linear system where the 𝑥1

matrix is the Hessian,


Fig. 4.45 Iteration history for a
𝐻 𝑘 𝑠 𝑘 = −∇ 𝑓 𝑘 . (4.66) quadratic function using an exact line
search and Newton’s method. Un-
This linear system is analogous to the one used for solving nonlinear surprisingly, only one iteration is re-
systems with Newton’s method (Eq. 3.30), except that the Jacobian quired.

becomes the Hessian, the residual is the gradient, and the design
variables replace the states. We can use any of the linear solvers
mentioned in Section 3.6 and Appendix B to solve this system.
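A bare-bones Python sketch of the resulting iteration (not from the text; it takes a full Newton step and falls back to steepest descent when the step is not a descent direction—in practice, the line search or trust-region safeguards discussed below would be used instead) is:

    import numpy as np

    def newton_minimize(grad, hess, x0, tau=1e-6, max_iter=100):
        x = np.asarray(x0, dtype=float)
        for _ in range(max_iter):
            g = grad(x)
            if np.linalg.norm(g, np.inf) <= tau:
                break
            s = np.linalg.solve(hess(x), -g)   # Newton step from Eq. 4.66
            if g @ s >= 0:                     # Hessian not positive definite along s
                s = -g                         # fall back to the steepest-descent direction
            x = x + s
        return x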
When minimizing the quadratic function from Ex. 4.10, Newton’s
method converges in one step for any value of 𝛽, as shown in Fig. 4.45.
Thus, Newton’s method is scale invariant.
Because the function is quadratic, the quadratic “approximation”
from the Taylor series is exact, so we can find the minimum in one
step. It will take more iterations for a general nonlinear function, but
using curvature information generally yields a better search direction
than first-order methods. In addition, Newton’s method provides a
step length embedded in 𝑠 𝑘 because the quadratic model estimates
the stationary point location. Furthermore, Newton’s method exhibits
quadratic convergence.
Although Newton’s method is powerful, it suffers from a few issues
in practice. One issue is that the Newton step does not necessarily
result in a function decrease. This issue can occur if the Hessian is not
positive definite or if the quadratic predictions overshoot because the
actual function has more curvature than predicted by the quadratic
approximation. Both of these possibilities are illustrated in Fig. 4.46.

[Fig. 4.46: Newton's method in its pure form is vulnerable to negative curvature
(in which case it might step away from the minimum, left) and overshooting
(which might result in a function increase, right).]

If the Hessian is not positive definite, the step might not even be in
a descent direction. Replacing the real Hessian with a positive-definite
Hessian can mitigate this issue. The quasi-Newton methods in the next
section force a positive-definite Hessian by construction.
To fix the overshooting issue, we can use a line search instead of
blindly accepting the Newton step length. We would set p_k = s_k,
with α_init = 1 as the first guess for the step length. In this case, we
have a much better guess for 𝛼 compared with the steepest-descent
or conjugate gradient cases because this guess is based on the local
curvature. Even if the first step length given by the Newton step
overshoots, the line search would find a point with a lower function
value.
The trust-region methods in Section 4.5 address both of these issues
by minimizing the function approximation within a specified region
around the current iteration.
Another major issue with Newton’s method is that the Hessian can
be difficult or costly to compute. Even if available, the solution of the
linear system in Eq. 4.65 can be expensive. Both of these considerations
motivate the quasi-Newton methods, which we explain next.

Example 4.14 Newton method applied to the bean function

Minimizing the same bean function from Exs. 4.11 and 4.12, we get the
optimization path shown in Fig. 4.47. Newton’s method takes fewer iterations
than steepest descent (Ex. 4.11) or conjugate gradient (Ex. 4.12) to achieve the
same convergence tolerance. The first quadratic approximation is a saddle
function that steps to the saddle point, away from the minimum of the function.
However, in subsequent iterations, the quadratic approximation becomes
convex, and the steps take us along the valley of the bean function toward the
minimum.

[Fig. 4.47: Newton's method minimizes a sequence of quadratic approximations
of the function at each iteration. In this case, it converges in 8 major iterations.]

4.4.4 Quasi-Newton Methods


As mentioned in Section 4.4.3, Newton’s method is efficient because the
second-order information results in better search directions, but it has
the significant shortcoming of requiring the Hessian. Quasi-Newton
methods are designed to address this issue. The basic idea is that
we can use first-order information (gradients) along each step in the
iteration path to build an approximation of the Hessian.
In one dimension, we can adapt the secant method (see Eq. 3.26) for
function minimization. Instead of estimating the first derivative, we
now estimate the second derivative (curvature) using two successive
first derivatives, as follows:
f''_{k+1} = \frac{f'_{k+1} - f'_k}{x_{k+1} - x_k} .    (4.67)
Then we can use this approximation in the Newton step (Eq. 4.63) to
obtain an iterative procedure that requires only first derivatives instead
of first and second derivatives.
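As a sketch (not an algorithm given in the text), the corresponding
one-dimensional quasi-Newton iteration could be written as follows; f1 is a
user-supplied first derivative, and the tolerance and iteration cap are
illustrative.

# Sketch of 1-D quasi-Newton (secant) minimization: the curvature in the
# Newton update (Eq. 4.63) is replaced by the estimate in Eq. 4.67.
def secant_minimize(f1, x0, x1, tol=1e-8, max_iter=100):
    g0, g1 = f1(x0), f1(x1)
    for _ in range(max_iter):
        if abs(g1) <= tol:
            break
        curvature = (g1 - g0) / (x1 - x0)   # secant estimate of f''
        x0, g0 = x1, g1
        x1 = x1 - g1 / curvature            # quasi-Newton step
        g1 = f1(x1)
    return x1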
The quadratic approximation based on this estimate of the
second derivative is
 
\tilde{f}_{k+1}(x_{k+1} + s) = f_{k+1} + s f'_{k+1} + \frac{s^2}{2} \frac{f'_{k+1} - f'_k}{x_{k+1} - x_k} .    (4.68)

Taking the derivative of this approximation with respect to 𝑠, we get


 
\tilde{f}'_{k+1}(x_{k+1} + s) = f'_{k+1} + s \frac{f'_{k+1} - f'_k}{x_{k+1} - x_k} .    (4.69)

For s = 0, which corresponds to x_{k+1}, we get \tilde{f}'_{k+1}(x_{k+1}) = f'_{k+1}, which
tells us that the slope of the approximation matches the slope of the
actual function at x_{k+1}, as expected.
Also, by stepping backward to x_k by setting s = -(x_{k+1} - x_k), we
find that \tilde{f}'_{k+1}(x_k) = f'_k. Thus, the nature of this approximation is such
that it matches the slope of the actual function at the last two points, as
shown in Fig. 4.48.
[Fig. 4.48: The quadratic approximation based on the secant method matches
the slopes at the two last points and the function value at the last point.]

In n dimensions, things are more involved, but the principle is
the same: use first-derivative information from the last two points
to approximate second-derivative information. Instead of iterating
along the x-axis as we would in one dimension, the optimization in n
dimensions follows a sequence of steps (as shown in Fig. 4.1) for the
separate line searches. We have gradients at the endpoints of each step,
so we can take the difference between the gradients at those points to
get the curvature along that direction. The question is: How do we

update the Hessian, which is expressed in the coordinate system of 𝑥,


based on directional curvatures in directions that are not necessarily
aligned with the coordinate system?
Quasi-Newton methods use the quadratic approximation of the
objective function,

\tilde{f}(x_k + p) = f_k + \nabla f_k^\top p + \frac{1}{2} p^\top \tilde{H}_k p ,    (4.70)

where 𝐻˜ is an approximation of the Hessian. Similar to Newton’s


method, we minimize this quadratic with respect to 𝑝, which yields the
linear system
𝐻˜ 𝑘 𝑝 𝑘 = −∇ 𝑓 𝑘 . (4.71)

We solve this linear system for 𝑝 𝑘 , but instead of accepting it as the final
step, we perform a line search in the 𝑝 𝑘 direction. Only after finding a
step size 𝛼 𝑘 that satisfies the strong Wolfe conditions do we update the
point using
𝑥 𝑘+1 = 𝑥 𝑘 + 𝛼 𝑘 𝑝 𝑘 . (4.72)
Quasi-Newton methods update the approximate Hessian at every
iteration based on the latest information using an update of the form

𝐻˜ 𝑘+1 = 𝐻˜ 𝑘 + Δ𝐻˜ 𝑘 , (4.73)

where the update Δ𝐻˜ 𝑘 is a function of the last two gradients. The first
Hessian approximation is usually set to the identity matrix (or a scaled
version of it), which yields a steepest-descent direction for the first line
search (set 𝐻˜ = 𝐼 in Eq. 4.71 to verify this).
We now develop the requirements for the approximate Hessian
update. Suppose we just obtained the new point 𝑥 𝑘+1 after a line search
starting from 𝑥 𝑘 in the direction 𝑝 𝑘 . We can write the new quadratic
based on an updated Hessian as follows:

\tilde{f}(x_{k+1} + p) = f_{k+1} + \nabla f_{k+1}^\top p + \frac{1}{2} p^\top \tilde{H}_{k+1} p .    (4.74)
We can assume that the new point’s function value and gradient are
given, but we do not have the new approximate Hessian yet. Taking
the gradient of this quadratic with respect to 𝑝, we obtain

∇ 𝑓˜ (𝑥 𝑘+1 + 𝑝) = ∇ 𝑓 𝑘+1 + 𝐻˜ 𝑘+1 𝑝 . (4.75)

In the single-variable case, we observed that the quadratic approx-


imation based on the secant method matched the slope of the actual
function at the last two points. Therefore, it is logical to require the

𝑛-dimensional quadratic based on the approximate Hessian to match


the gradient of the actual function at the last two points.
The gradient of the new approximation (Eq. 4.75) matches the
gradient at the new point 𝑥 𝑘+1 by construction (just set 𝑝 = 0). To
find the gradient predicted by the new approximation (Eq. 4.75) at the
previous point 𝑥 𝑘 , we set 𝑝 = 𝑥 𝑘 − 𝑥 𝑘+1 = −𝛼 𝑘 𝑝 𝑘 (which is a backward
step from the end of the last line search to the start of the line search) to
get
∇ 𝑓˜ (𝑥 𝑘+1 − 𝛼 𝑘 𝑝 𝑘 ) = ∇ 𝑓 𝑘+1 − 𝛼 𝑘 𝐻˜ 𝑘+1 𝑝 𝑘 . (4.76)
Now, we enforce that this must be equal to the actual gradient at that
point,
\nabla f_{k+1} - \alpha_k \tilde{H}_{k+1} p_k = \nabla f_k \quad \Rightarrow \quad \alpha_k \tilde{H}_{k+1} p_k = \nabla f_{k+1} - \nabla f_k .    (4.77)
To simplify the notation, we define the step as

𝑠 𝑘 = 𝑥 𝑘+1 − 𝑥 𝑘 = 𝛼 𝑘 𝑝 𝑘 , (4.78)

and the difference in the gradient as

𝑦 𝑘 = ∇ 𝑓 𝑘+1 − ∇ 𝑓 𝑘 . (4.79)

Figure 4.49 shows the step and the corresponding gradients.
Rewriting Eq. 4.77 using this notation, we get

\tilde{H}_{k+1} s_k = y_k .    (4.80)

[Fig. 4.49: Quasi-Newton methods use the gradient at the endpoint of each step
to estimate the curvature in the step direction and update an approximation of
the Hessian.]

This is called the secant equation and is a fundamental requirement
for quasi-Newton methods. The result is intuitive when we recall the
meaning of the product of the Hessian with a vector (Eq. 4.12): it is the
rate of change of the gradient in the direction defined by that vector.
Thus, it makes sense that the change in the gradient predicted by the
approximate Hessian along the step should match the actual difference
between the gradients.‡

‡ The secant equation is also known as the quasi-Newton condition.
We need H̃ to be positive definite. Using the secant equation
(Eq. 4.80) and the definition of positive definiteness (s^⊤ H s > 0), we see
that this requirement implies that the predicted curvature is positive
along the step; that is,
𝑠 𝑘 | 𝑦𝑘 > 0 . (4.81)
This is called the curvature condition, and it is automatically satisfied if
the line search finds a step that satisfies the strong Wolfe conditions.
The secant equation (Eq. 4.80) is a linear system of 𝑛 equations
where the step and the gradients are known. However, there are

𝑛(𝑛 + 1)/2 unknowns in the approximate Hessian matrix (recall that it is


symmetric), so this equation is not sufficient to determine the elements
˜ The requirement of positive definiteness adds one more equation,
of 𝐻.
but those are not enough to determine all the unknowns, leaving us
with an infinite number of possibilities for 𝐻. ˜
To find a unique 𝐻˜ 𝑘+1 , we rationalize that among all the matrices
that satisfy the secant equation (Eq. 4.80), 𝐻˜ 𝑘+1 should be the one
closest to the previous approximate Hessian, 𝐻˜ 𝑘 . This makes sense
intuitively because the curvature information gathered in one step is
limited (because it is along a single direction) and should not change the
Hessian approximation more than necessary to satisfy the requirements.
The original quasi-Newton update, known as DFP, was first pro-
posed by Davidon and then refined by Fletcher and also Powell (see
historical note in Section 2.3).20,21 The DFP update formula has been
superseded by the BFGS formula, which was independently developed
by Broyden, Fletcher, Goldfarb, and Shanno.80–83 BFGS is currently
considered the most effective quasi-Newton update, so we focus on this
update. However, Appendix C.2.1 has more details on DFP.

20. Davidon, Variable metric method for minimization, 1991.
21. Fletcher and Powell, A rapidly convergent descent method for minimization, 1963.
80. Broyden, The convergence of a class of double-rank minimization algorithms 1. General considerations, 1970.
81. Fletcher, A new approach to variable metric algorithms, 1970.
82. Goldfarb, A family of variable-metric methods derived by variational means, 1970.
83. Shanno, Conditioning of quasi-Newton methods for function minimization, 1970.

The formal derivation of the BFGS update formula is rather involved,
so we do not include it here. Instead, we work through an informal
derivation that provides intuition about this update and quasi-Newton
methods in general. We also include more details in Appendix C.2.2.
Recall that quasi-Newton methods add an update to the previous
Hessian approximation (Eq. 4.73). One way to think about an update
that yields a matrix close to the previous one is to consider the rank
of the update, ΔH̃. The lower the rank of the update, the closer the
updated matrix is to the previous one. Also, the curvature information
contained in this update is minimal because we are only gathering
information in one direction for each update. Therefore, we can reason
that the rank of the update matrix should be the lowest possible rank
that satisfies the secant equation (Eq. 4.80).
The update must be symmetric and positive definite to ensure
a symmetric positive-definite Hessian approximation. If we start
with a symmetric positive-definite approximation, then all subsequent
approximations remain symmetric and positive definite. As it turns
out, it is possible to derive a rank 1 update matrix that satisfies the
secant equation, but this update is not guaranteed to be positive definite.
However, we can get positive definiteness with a rank 2 update.
We can obtain a symmetric rank 2 update by adding two symmetric
rank 1 matrices. One convenient way to obtain a symmetric rank 1
matrix is to perform a self outer product of a vector, which takes a vector
of size n and multiplies it with its transpose to obtain an (n × n) matrix,
as shown in Fig. 4.50. Matrices resulting from vector outer products
have rank 1 because all the columns are linearly dependent.

[Fig. 4.50: The self outer product of a vector produces a symmetric (n × n)
matrix of rank 1.]
With two linearly independent vectors (𝑢 and 𝑣), we can get a rank
2 update using
𝐻˜ 𝑘+1 = 𝐻˜ 𝑘 + 𝛼𝑢𝑢 | + 𝛽𝑣𝑣 | , (4.82)
where 𝛼 and 𝛽 are scalar coefficients. Substituting this into the secant
equation (Eq. 4.80), we have

𝐻˜ 𝑘 𝑠 𝑘 + 𝛼𝑢𝑢 | 𝑠 𝑘 + 𝛽𝑣𝑣 | 𝑠 𝑘 = 𝑦 𝑘 . (4.83)

Because the new information about the function is encapsulated in the


vectors 𝑦 and 𝑠, we can reason that 𝑢 and 𝑣 should be based on these
vectors. It turns out that using s on its own does not yield a useful
update (one term cancels out), but H̃s does. Setting u = y and v = H̃s
in Eq. 4.83 yields
\tilde{H}_k s_k + \alpha y_k y_k^\top s_k + \beta \tilde{H}_k s_k \left( \tilde{H}_k s_k \right)^\top s_k = y_k .    (4.84)

Rearranging this equation, we have


 
y_k \left( 1 - \alpha y_k^\top s_k \right) = \tilde{H}_k s_k \left( 1 + \beta s_k^\top \tilde{H}_k s_k \right) .    (4.85)

Because the vectors 𝑦 𝑘 and 𝐻˜ 𝑘 𝑠 𝑘 are not parallel in general (because the
secant equation applies to 𝐻˜ 𝑘+1 , not to 𝐻˜ 𝑘 ), the only way to guarantee
this equality is to set the terms in parentheses to zero. Thus, the scalar
coefficients are
\alpha = \frac{1}{y_k^\top s_k} , \qquad \beta = -\frac{1}{s_k^\top \tilde{H}_k s_k} .    (4.86)

Substituting these coefficients and the chosen vectors back into Eq. 4.82,
we get the BFGS update,

\tilde{H}_{k+1} = \tilde{H}_k + \frac{y_k y_k^\top}{y_k^\top s_k} - \frac{\tilde{H}_k s_k s_k^\top \tilde{H}_k}{s_k^\top \tilde{H}_k s_k} .    (4.87)
Although we did not explicitly enforce positive definiteness, the rank 2
update is positive definite, and therefore, all the Hessian approxima-
tions are positive definite, as long as we start with a positive-definite
approximation.
Now recall that we want to solve the linear system that involves
this matrix (Eq. 4.71), so it would be more efficient to approximate
the inverse of the Hessian directly instead. The inverse can be found
analytically from the update (Eq. 4.87) using the Sherman–Morrison–
Woodbury formula.§

§ This formula is also known as the Woodbury matrix identity. Given a matrix
and an update to that matrix, it yields an explicit expression for the inverse of
the updated matrix in terms of the inverses of the matrix and the update (see
Appendix C.3).

Defining Ṽ as the approximation of the inverse of

the Hessian, the final result is

𝑉˜ 𝑘+1 = (𝐼 − 𝜎 𝑘 𝑠 𝑘 𝑦 𝑘 | ) 𝑉˜ 𝑘 (𝐼 − 𝜎 𝑘 𝑦 𝑘 𝑠 𝑘 | ) + 𝜎 𝑘 𝑠 𝑘 𝑠 𝑘 | , (4.88)

where
\sigma_k = \frac{1}{y_k^\top s_k} .    (4.89)
Figure 4.51 shows the sizes of the vectors and matrices involved in this
equation.

[Fig. 4.51: Sizes of each term of the BFGS update (Eq. 4.88).]

Now we can replace the potentially costly solution of the linear
system (Eq. 4.71) with the much cheaper matrix-vector product,

𝑝 𝑘 = −𝑉˜ 𝑘 ∇ 𝑓 𝑘 , (4.90)

where 𝑉˜ is the estimate for the inverse of the Hessian.


Algorithm 4.7 details the steps for the BFGS algorithm. Unlike
first-order methods, we should not normalize the direction vector 𝑝 𝑘
because the length of the vector is meaningful. Once we have curvature
information, the quasi-Newton step should give a reasonable estimate
of where the function slope flattens. Thus, as advised for Newton’s
method, we set 𝛼 init = 1. Alternatively, this would be equivalent to
using a normalized direction vector and then setting 𝛼 init to the initial
magnitude of 𝑝. However, optimization algorithms in practice use
𝛼 init = 1 to signify that a full (quasi-) Newton step was accepted (see
Tip 4.5).
As discussed previously, we need to start with a positive-definite
estimate to maintain a positive-definite inverse Hessian. Typically, this
is the identity matrix or a weighted identity matrix, for example:

\tilde{V}_0 = \frac{1}{\|\nabla f_0\|} I .    (4.91)

This makes the first step a normalized steepest-descent direction:

p_0 = -\tilde{V}_0 \nabla f_0 = -\frac{\nabla f_0}{\|\nabla f_0\|} .    (4.92)

Algorithm 4.7 BFGS

Inputs:
𝑥 0 : Starting point
𝜏: Convergence tolerance
Outputs:
𝑥 ∗ : Optimal point
𝑓 (𝑥 ∗ ): Minimum function value

𝑘=0 Initialize iteration counter


𝛼init = 1 Initial step length for line search
while k∇ 𝑓 𝑘 k ∞ > 𝜏 do Optimality condition
if 𝑘 = 0 or reset = true then
Ṽ_k = (1 / ‖∇f_k‖) I
else
𝑠 = 𝑥 𝑘 − 𝑥 𝑘−1 Last step
𝑦 = ∇ 𝑓 𝑘 − ∇ 𝑓 𝑘−1 Curvature along last step
σ = 1 / (s^⊤ y)
𝑉˜ 𝑘 = (𝐼 − 𝜎𝑠 𝑦 | ) 𝑉˜ 𝑘−1 (𝐼 − 𝜎𝑦𝑠 | ) + 𝜎𝑠𝑠 | Quasi-Newton update
end if
𝑝 = −𝑉˜ 𝑘 ∇ 𝑓 𝑘 Compute quasi-Newton step
𝛼 = linesearch (𝑝, 𝛼init ) Should satisfy the strong Wolfe conditions
𝑥 𝑘+1 = 𝑥 𝑘 + 𝛼𝑝 Update design variables
𝑘 = 𝑘+1 Increment iteration index
end while

In a practical algorithm, 𝑉˜ might require occasional resets to the


scaled identity matrix. This is because as we iterate in the design
space, curvature information gathered far from the current point might
become irrelevant and even counterproductive. The trigger for this
reset could occur when the directional derivative ∇ 𝑓 | 𝑝 is greater than
some threshold. That would mean that the slope along the search
direction is shallow; in other words, the search direction is close to
orthogonal to the steepest-descent direction.
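To make Alg. 4.7 concrete, the following Python/NumPy sketch implements the
inverse-Hessian update of Eq. 4.88; the simple Armijo backtracking used here is
only a stand-in for a line search satisfying the strong Wolfe conditions, and the
thresholds are illustrative.

import numpy as np

# Sketch of BFGS (Alg. 4.7) using the inverse-Hessian update of Eq. 4.88.
def bfgs(f, grad, x0, tol=1e-6, max_iter=200):
    x = np.asarray(x0, dtype=float)
    g = grad(x)
    I = np.eye(x.size)
    V = I / max(np.linalg.norm(g), 1e-12)      # scaled identity (Eq. 4.91)
    for _ in range(max_iter):
        if np.max(np.abs(g)) <= tol:           # ||∇f||_∞ convergence check
            break
        p = -V @ g                             # search direction (Eq. 4.90)
        alpha = 1.0                            # unit step first (Tip 4.5)
        while alpha > 1e-10 and f(x + alpha * p) > f(x) + 1e-4 * alpha * (g @ p):
            alpha *= 0.5                       # backtrack until sufficient decrease
        s = alpha * p                          # step actually taken
        x_new = x + s
        g_new = grad(x_new)
        y = g_new - g                          # gradient difference
        if s @ y > 1e-12:                      # curvature condition (Eq. 4.81)
            sigma = 1.0 / (s @ y)
            V = (I - sigma * np.outer(s, y)) @ V @ (I - sigma * np.outer(y, s)) \
                + sigma * np.outer(s, s)       # BFGS inverse update (Eq. 4.88)
        x, g = x_new, g_new
    return x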
Another well-known quasi-Newton update is the symmetric rank 1
(SR1) update, which we derive in Appendix C.2.3. Because the update
is rank 1, it does not guarantee positive definiteness. Why would we be
interested in a Hessian approximation that is potentially indefinite? In
practice, the matrices produced by SR1 have been found to approximate
the true Hessian matrix well, often better than BFGS. This alternative is
more common in trust-region methods (see Section 4.5), which depend
more strongly on an accurate Hessian and do not require positive
definiteness. It is also sometimes used for constrained optimization

problems where the Hessian of the Lagrangian is often indefinite, even


at the minimizer.

Example 4.15 BFGS applied to the bean function

Minimizing the same bean function from previous examples using BFGS, we
get the optimization path shown in Fig. 4.52. We also show the corresponding
quadratic approximations for a few selected steps of this minimization in
Fig. 4.53. Because we generate approximations to the inverse, we invert those
approximations to get the Hessian approximation for the purpose of illustration.
We initialize the inverse Hessian to the identity matrix, which results in
a quadratic with circular contours and a steepest-descent step (Fig. 4.53, left).
Using the BFGS update procedure, after two iterations,

x_2 = (0.1197030, -0.043079) ,

and the inverse Hessian approximation is

\tilde{V}_2 = \begin{bmatrix} 0.435747 & -0.202020 \\ -0.202020 & 0.222556 \end{bmatrix} .

The exact inverse Hessian at the same point is

H^{-1}(x_2) = \begin{bmatrix} 0.450435 & 0.035946 \\ 0.035946 & 0.169535 \end{bmatrix} .

[Fig. 4.52: BFGS optimization path (7 iterations).]

[Fig. 4.53: Minimization of the bean function using BFGS. The first quadratic
approximation has circular contours (left). After two iterations, the quadratic
approximation improves, and the step approaches the minimum (middle). Once
converged, the minimum of the quadratic approximation coincides with the
bean function minimum (right).]

The predicted curvature improves, and it results in a good step toward the
minimum, as shown in the middle plot of Fig. 4.53. The one-dimensional
slice reveals how the approximation curvature in the line search direction is

higher than the actual; however, the line search moves past the approximation
minimum toward the true minimum.
By the end of the optimization, at x* = (1.213412, 0.824123), the BFGS
estimate is

\tilde{V}^* = \begin{bmatrix} 0.276946 & 0.224010 \\ 0.224010 & 0.347847 \end{bmatrix} ,

whereas the exact one is

H^{-1}(x^*) = \begin{bmatrix} 0.276901 & 0.223996 \\ 0.223996 & 0.347867 \end{bmatrix} .

Now the estimate is much more accurate. In the right plot of Fig. 4.53, we can
see that the minimum of the approximation coincides with the actual minimum.
The approximation is only accurate locally, worsening away from the minimum.

4.4.5 Limited-Memory Quasi-Newton Methods


When the number of design variables is large (millions or billions), it
might not be possible to store the Hessian inverse approximation matrix
in memory. This motivates limited-memory quasi-Newton methods,
which make it possible to handle such problems. In addition, these
methods also improve the computational efficiency of medium-sized
problems (hundreds or thousands of design variables) with minimal
sacrifice in accuracy.
Recall that we are only interested in the matrix-vector product Ṽ∇f
to find each search direction using Eq. 4.90. As we will see in this
section, we can compute this product without ever actually forming
the matrix Ṽ. We focus on doing this for the BFGS update because
this is the most popular approach (known as L-BFGS), although similar
techniques apply to other quasi-Newton update formulas.
The BFGS update (Eq. 4.88) is a recursive sequence:
 
\tilde{V}_k = \left[ (I - \sigma s y^\top) \tilde{V} (I - \sigma y s^\top) + \sigma s s^\top \right]_{k-1} ,    (4.93)

where

\sigma = \frac{1}{s^\top y} .    (4.94)
If we save the sequence of 𝑠 and 𝑦 vectors and specify a starting value
for 𝑉˜ 0 , we can compute any subsequent 𝑉˜ 𝑘 . Of course, what we want
is 𝑉˜ 𝑘 ∇ 𝑓 𝑘 , which we can also compute using an algorithm with the
recurrence relationship. However, such an algorithm would not be
advantageous from the memory-usage perspective because we would
have to store a long sequence of vectors and a starting matrix.

To reduce the memory usage, we do not store the entire history of


vectors. Instead, we limit the storage to the last 𝑚 vectors for 𝑠 and
𝑦. In practice, 𝑚 is usually between 5 and 20. Next, we make the
starting Hessian diagonal such that we only require vector storage (or
scalar storage if we make all entries in the diagonal equal). A common
choice is to use a scaled identity matrix, which just requires storing one
number,
\tilde{V}_0 = \frac{s^\top y}{y^\top y} I ,    (4.95)
where the 𝑠 and 𝑦 correspond to the previous iteration. Algorithm 4.8
details the procedure.

Algorithm 4.8 Compute search direction using L-BFGS

Inputs:
∇ 𝑓 𝑘 : Gradient at point 𝑥 𝑘
𝑠 𝑘−1,...,𝑘−𝑚 : History of steps 𝑥 𝑘 − 𝑥 𝑘−1
𝑦 𝑘−1,...,𝑘−𝑚 : History of gradient differences ∇ 𝑓 𝑘 − ∇ 𝑓 𝑘−1
Outputs:
𝑝: Search direction −𝑉˜ 𝑘 ∇ 𝑓 𝑘

𝑑 = ∇ 𝑓𝑘
for 𝑖 = 𝑘 − 1 to 𝑘 − 𝑚 by −1 do
𝛼 𝑖 = 𝜎𝑖 𝑠 𝑖 | 𝑑
𝑑 = 𝑑 − 𝛼 𝑖 𝑦𝑖
end for
Ṽ_0 = (s_{k-1}^⊤ y_{k-1} / y_{k-1}^⊤ y_{k-1}) I    Initialize Hessian inverse approximation as a scaled identity matrix
𝑑 = 𝑉˜ 0 𝑑
for 𝑖 = 𝑘 − 𝑚 to 𝑘 − 1 do
𝛽 𝑖 = 𝜎𝑖 𝑦 𝑖 | 𝑑
𝑑 = 𝑑 + (𝛼 𝑖 − 𝛽 𝑖 )𝑠 𝑖
end for
p = −d

Using this technique, we no longer need to bear the memory cost
of storing a large matrix or incur the computational cost of a large
matrix-vector product. Instead, we store a small number of vectors and
require fewer vector-vector products (a cost that scales linearly with n
rather than quadratically).
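A compact Python/NumPy sketch of the two-loop recursion in Alg. 4.8 follows;
the assumption that the step and gradient-difference histories are stored
oldest-first in Python lists is an implementation choice, not part of the
algorithm statement.

import numpy as np

# Sketch of the L-BFGS two-loop recursion (Alg. 4.8). grad_k is the current
# gradient (NumPy array); s_hist and y_hist hold the m most recent step and
# gradient-difference vectors, oldest first.
def lbfgs_direction(grad_k, s_hist, y_hist):
    q = grad_k.copy()
    sigma = [1.0 / (s @ y) for s, y in zip(s_hist, y_hist)]
    alpha = [0.0] * len(s_hist)
    for i in reversed(range(len(s_hist))):     # newest to oldest
        alpha[i] = sigma[i] * (s_hist[i] @ q)
        q -= alpha[i] * y_hist[i]
    s, y = s_hist[-1], y_hist[-1]
    q *= (s @ y) / (y @ y)                     # apply Ṽ_0 = (sᵀy / yᵀy) I (Eq. 4.95)
    for i in range(len(s_hist)):               # oldest to newest
        beta = sigma[i] * (y_hist[i] @ q)
        q += (alpha[i] - beta) * s_hist[i]
    return -q                                  # search direction p = -Ṽ_k ∇f_k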

Example 4.16 L-BFGS compared with BFGS for the bean function
[Fig. 4.54: Optimization paths using BFGS and L-BFGS (7 iterations each).]

Minimizing the same bean function from the previous examples, the
optimization iterations using BFGS and L-BFGS are the same, as shown in
Fig. 4.54. The L-BFGS method is applied to the same sequence using the
last five iterations. The number of variables is too small to benefit from the
limited-memory approach, but we show it in this small problem as an example.
At the same x* as in Ex. 4.15, the product Ṽ∇f is estimated using Alg. 4.8 as

d^* = \begin{bmatrix} -7.38683 \times 10^{-5} \\ 5.75370 \times 10^{-5} \end{bmatrix} ,

whereas the exact value is

\tilde{V}^* \nabla f^* = \begin{bmatrix} -7.49228 \times 10^{-5} \\ 5.90441 \times 10^{-5} \end{bmatrix} .

Example 4.17 Minimizing the total potential energy for a spring system

Many structural mechanics models involve solving an unconstrained energy


minimization problem. Consider a mass supported by two springs, as shown
in Fig. 4.55. Formulating the total potential energy for the system as a function
of the mass position yields the following problem:¶

\underset{x_1, x_2}{\text{minimize}} \quad
\frac{1}{2} k_1 \left( \sqrt{(\ell_1 + x_1)^2 + x_2^2} - \ell_1 \right)^2
+ \frac{1}{2} k_2 \left( \sqrt{(\ell_2 - x_1)^2 + x_2^2} - \ell_2 \right)^2
- m g x_2 .

¶ Appendix D.1.8 has details on this problem.

[Fig. 4.55: Two-spring system with no applied force (top) and with applied
force (bottom).]

The contours of this function are shown in Fig. 4.56 for the case where
𝑙1 = 12, 𝑙2 = 8, 𝑘1 = 1, 𝑘2 = 10, 𝑚 𝑔 = 7. There is a minimum and a maximum.
The minimum represents the position of the mass at the stable equilibrium
condition. The maximum also represents an equilibrium point, but it is unstable.
All methods converge to the minimum when starting near the maximum. All

four methods use the same parameters, convergence tolerance, and starting
point. Depending on the starting point, Newton’s method can become stuck at
the saddle point, and if a line search is not added to safeguard it, it could have
terminated at the maximum instead.
As expected, steepest descent is the least efficient, and the second-order
methods are the most efficient. The number of iterations and the relative
performance are problem dependent and sensitive to the optimization algorithm
parameters, so we should not analyze the number of iterations too closely.
However, these results show the expected trends for most problems.

[Fig. 4.56: Minimizing the total potential for the two-spring system: steepest
descent (32 iterations), conjugate gradient (27 iterations), quasi-Newton (14
iterations), and Newton (12 iterations).]
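As a sketch of how this example might be set up in code, the energy function
can be written directly and handed to an off-the-shelf quasi-Newton optimizer;
here scipy.optimize.minimize with its BFGS option is an assumed choice, and
the starting point is arbitrary.

import numpy as np
from scipy.optimize import minimize

# Total potential energy of the two-spring system (Ex. 4.17), using the
# parameter values quoted in the example.
l1, l2, k1, k2, mg = 12.0, 8.0, 1.0, 10.0, 7.0

def energy(x):
    d1 = np.sqrt((l1 + x[0]) ** 2 + x[1] ** 2) - l1   # stretch of spring 1
    d2 = np.sqrt((l2 - x[0]) ** 2 + x[1] ** 2) - l2   # stretch of spring 2
    return 0.5 * k1 * d1 ** 2 + 0.5 * k2 * d2 ** 2 - mg * x[1]

res = minimize(energy, x0=[0.0, 1.0], method="BFGS")
print(res.x)   # stable equilibrium position of the mass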

Example 4.18 Comparing methods for the Rosenbrock function

We now test the methods on the following more challenging function:


f(x_1, x_2) = (1 - x_1)^2 + 100 \left( x_2 - x_1^2 \right)^2 ,

which is known as the Rosenbrock function. This is a well-known optimization


problem because a narrow, highly curved valley makes it challenging to
minimize.‖ The optimization path and the convergence history for four methods ‖
The “bean” function we used in previ-
starting from 𝑥 = (−1.2, 1.0) are shown in Figs. 4.57 and 4.58, respectively. ous examples is a milder version of the
Rosenbrock function.
All four methods use an inexact line search with the same parameters and
a convergence tolerance of ‖∇f‖_∞ ≤ 10^{-6}. Compared with the previous
two examples, the difference between the steepest-descent and second-order
methods is much more dramatic (two orders of magnitude more iterations!),
owing to the more challenging variation in the curvature (recall Ex. 4.10).

[Fig. 4.57: Optimization paths for the Rosenbrock function using steepest
descent (10,662 iterations), conjugate gradient (930 iterations), BFGS (36
iterations), and Newton (24 iterations).]

The steepest-descent method converges, but it takes many iterations because


it bounces between the steep walls of the valley while making little progress
along the bottom of the valley. The conjugate gradient method is much
more efficient because it damps the steepest-descent oscillations. Eventually,
the conjugate gradient method achieves superlinear convergence near the
optimum, saving many iterations over the last several orders of magnitude in

[Fig. 4.58: Convergence of the four methods shows the dramatic difference
between the linear convergence of steepest descent, the superlinear convergence
of the conjugate gradient method, and the quadratic convergence of the methods
that use second-order information.]

the convergence criterion. The methods that use second-order information are
even more efficient, exhibiting quadratic convergence in the last few iterations.

The number of major iterations is not always an effective way to


compare performance. For example, Newton’s method takes fewer ma-
jor iterations, but each iteration in Newton’s method is more expensive
than each iteration in the quasi-Newton method. This is because New-
ton’s method requires a linear solution, which is an 𝒪(𝑛 3 ) operation, as
opposed to a matrix-vector multiplication, which is an 𝒪(𝑛 2 ) operation.
For a small problem like the two-dimensional Rosenbrock function,
this is an insignificant difference, but this is a significant difference
in computational effort for large problems. Additionally, each major
iteration includes a line search, and depending on the quality of the
search direction, the number of function calls contained in each iteration
will differ.

Tip 4.5 Unit steps indicate good progress

When performing a line search within a quasi-Newton algorithm, we pick


𝛼init = 1 (a unit step) because this corresponds to the minimum if the quadratic
approximation were perfect. When the quadratic approximation matches the
actual function well enough, the line search should exit after the first evaluation.
On the other hand, if the line search takes many iterations, this indicates a poor
match or other numerical difficulties. If difficulties persist over many major
iterations, plot the line search (Tip 4.3).

Example 4.19 Problem scaling

In Tip 4.4, we discussed the importance of scaling. Let us illustrate this


with an example. Consider a stretched version of the Rosenbrock function from
Ex. 4.18:

f(x_1, x_2) = \left( 1 - \frac{x_1}{10^4} \right)^2 + 100 \left( x_2 - \left( \frac{x_1}{10^4} \right)^2 \right)^2 .    (4.96)
The contours of this function have the same characteristics as those of the
original Rosenbrock function shown in Fig. 4.57, but the 𝑥1 axis is stretched,
as shown in Fig. 4.59. Because x_1 is scaled by such a large number (10^4), we
cannot show it using the same scale as the x_2 axis, otherwise the x_2 axis would
disappear. The minimum of this function is at x* = [10^4, 1], where f* = 0.

[Fig. 4.59: The contours of the scaled Rosenbrock function (Eq. 4.96) are highly
stretched in the x_1 direction, by orders of magnitude more than what we can
show here.]

Let us attempt to minimize this function starting from x_0 = [−5000, −3].
The gradient at this starting point is ∇f(x_0) = [−0.0653, −650.0], so the slope
in the x_2 direction is four orders of magnitude larger than the slope in
the x_1 direction! Therefore, there is a significant bias toward moving along the
𝑥2 direction but little incentive to move in the 𝑥1 direction. After an exact line
search in the steepest descent direction, we obtain the step to 𝑥 𝐴 = [−5000, 0.25]
as shown in Fig. 4.59. The optimization stops at this point, even though it is
not a minimum. This premature convergence is because 𝜕 𝑓 /𝜕𝑥1 is orders of
magnitude smaller, so both components of the gradient satisfy the optimality
conditions when using a standard relative tolerance.
To address this issue, we scale the design variables as explained in
Tip 4.4. Using the scaling s_x = [10^4, 1], the scaled starting point becomes
x̄_0 = [−5000/10^4, −3] = [−0.5, −3]. Before evaluating the function, we
need to convert the design variables back to their unscaled values, that is,
f(x) = f(x̄ s_x), where the product is taken element by element.
This scaling of the design variables alone is sufficient to improve the
optimization convergence. Still, let us also scale the objective because it is
large at our starting point (around 900). Dividing the objective by 𝑠 𝑓 = 1000,
the initial gradient becomes ∇ 𝑓 (𝑥0 ) = [−0.00206, −0.6]. This is still not ideally
scaled, but it has much less variation in orders of magnitude—more than
sufficient to solve the problem successfully. The optimizer returns 𝑥¯ ∗ = [1, 1],
where f̄* = 1.57 × 10^{-12}. When rescaled back to the problem coordinates,
x* = [10^4, 1], f* = 1.57 × 10^{-9}.
In this example, the function derivatives span many orders of magnitude,
so dividing the function by a scalar does not have much effect. Instead, we
could minimize log( 𝑓 ), which allows us to solve the problem even without
scaling 𝑥. If we also scale 𝑥, the number of required iterations for convergence

decreases. Using log( 𝑓 ) as the objective and scaling the design variables as
before yields 𝑥¯ ∗ = [1, 1], where 𝑓¯∗ = −25.28, which in the original problem
space corresponds to x* = [10^4, 1], where f* = 1.05 × 10^{-11}.
Although this example does not correspond to a physical problem, such
differences in scaling occur frequently in engineering analysis. For example,
optimizing the operating point of a propeller might involve two variables: the
pitch angle and the rotation rate. The angle would typically be specified in
radians (a quantity of order 1) and the rotation rate in rotations per minute
(typically tens of thousands).
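A minimal sketch of how such scaling might be wrapped around an existing
objective follows; the scale factors match the example above, but the use of
scipy.optimize.minimize is an illustrative assumption.

import numpy as np
from scipy.optimize import minimize

# Sketch of scaling wrappers for the stretched Rosenbrock function (Eq. 4.96).
def f(x):
    return (1 - x[0] / 1e4) ** 2 + 100 * (x[1] - (x[0] / 1e4) ** 2) ** 2

s_x = np.array([1e4, 1.0])   # design-variable scaling
s_f = 1000.0                 # objective scaling

def f_scaled(x_bar):
    return f(x_bar * s_x) / s_f   # unscale variables, scale objective

x_bar0 = np.array([-5000.0, -3.0]) / s_x
res = minimize(f_scaled, x_bar0, method="BFGS")
print(res.x * s_x)           # solution in the original (unscaled) variables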

Poor scaling causes premature convergence for various reasons. In


Ex. 4.19, it was because convergence was based on a tolerance relative
to the starting gradient, and some gradient components were much
smaller than others. When using an absolute tolerance, premature
convergence can occur when the gradients are small to begin with
(because of the scale of the problem, not because they are near an
optimum). When the scaling is poor, the optimizer is even more
dependent on accurate gradients to navigate the narrow regions of
function improvement.
Larger engineering simulations are usually more susceptible to
numerical noise due to iteration loops, solver convergence tolerances,
and longer computational procedures. Another issue arises when the
derivatives are not computed accurately. In these cases, poorly scaled
problems struggle because the line search directions are not accurate
enough to yield a decrease, except for tiny step sizes.
Most practical optimization algorithms terminate early when this
occurs, not because the optimality conditions are satisfied but because
the step sizes or function changes are too small, and progress is stalled
(see Tip 4.1). A lack of attention to scaling is one of the most frequent
causes of poor solutions in engineering optimization problems.

Tip 4.6 Accurate derivatives matter

The effectiveness of gradient-based methods depends strongly on providing


accurate gradients. Convergence difficulties, or apparent multimodal behavior,
are often mistakenly identified as optimization algorithm difficulties or fun-
damental modeling issues when in reality, the numerical issues are caused by
inaccurate gradients. Chapter 6 is devoted to computing accurate derivatives.

4.5 Trust-Region Methods

In Section 4.2, we mentioned two main approaches for unconstrained


gradient-based optimization: line search and trust region. We described
the line search in Section 4.3 and the associated methods for computing
search directions in Section 4.4. Now we describe trust-region methods,
also known as restricted-step methods. The main motivation for trust-
region methods is to address the issues with Newton’s method (see
Section 4.4.3) and quasi-Newton updates that do not guarantee a
positive-definite Hessian approximation (e.g., SR1, which we briefly
described in Section 4.4.4).
The trust-region approach is fundamentally different from the line
search approach because it finds the direction and distance of the
step simultaneously instead of finding the direction first and then the
distance. The trust-region approach builds a model of the function
to be minimized and then minimizes the model within a trust region,
within which we trust the model to be good enough for our purposes.
The most common model is a local quadratic function, but other
models may be used. When using a quadratic model based on the
function value, gradient, and Hessian at the current iteration, the
method is similar to Newton’s method.
The trust region is centered about the current iteration point and
can be defined as an n-dimensional box, sphere, or ellipsoid of a given
size. Each trust-region iteration consists of the following main steps:

1. Create or update the model about the current point.
2. Minimize the model within the trust region.
3. Move to the new point, update values, and adapt the size of the
   trust region.

These steps are illustrated in Fig. 4.60, and they are repeated until
convergence. Figure 4.61 shows the steps to minimize the bean function,
where the circles show the trust regions for each iteration.
[Fig. 4.60: Trust-region methods minimize a model within a trust region for
each iteration, and then they update the trust-region size and the model before
the next iteration.]

The trust-region subproblem solved at each iteration is

\underset{s}{\text{minimize}} \quad \tilde{f}(s) \quad \text{subject to} \quad \|s\| \leq \Delta ,    (4.97)

where 𝑓˜(𝑠) is the local trust-region model, 𝑠 is the step from the current
iteration point, and Δ is the size of the trust region. We use 𝑠 instead
of 𝑝 to indicate that this is a step vector and not simply the direction
vector used in methods based on a line search.
[Fig. 4.61: Path for the trust-region approach showing the circular trust regions
at each step.]

The subproblem (Eq. 4.97) defines the trust region as a norm. The
Euclidean norm, k𝑠 k 2 , defines a spherical trust region and is the most
common type of trust region. Sometimes ∞-norms are used instead
because they are easy to apply, but 1-norms are rarely used because
they are just as complex as 2-norms but introduce sharp corners that
can be problematic.84 The shape of the trust region is dictated by the
norm (see Fig. A.8) and can significantly affect the convergence rate.
The ideal trust-region shape depends on the local function space, and
some algorithms allow for the trust-region shape to change throughout
the optimization.

84. Conn et al., Trust Region Methods, 2000.

4.5.1 Quadratic Model with Spherical Trust Region


Using a quadratic trust-region model and the Euclidean norm, we can
define the more specific subproblem:

\underset{s}{\text{minimize}} \quad \tilde{f}(s) = f_k + \nabla f_k^\top s + \frac{1}{2} s^\top \tilde{H}_k s
\quad \text{subject to} \quad \|s\|_2 \leq \Delta_k ,    (4.98)

where 𝐻˜ 𝑘 is the approximate (or true) Hessian at our current iterate.


This problem has a quadratic objective and quadratic constraints and
is called a quadratically constrained quadratic program (QCQP). If the
problem is unconstrained and 𝐻˜ is positive definite, we can get to the
solution using a single step, s = -\tilde{H}_k^{-1} \nabla f_k. However, because of
the constraints, there is no analytic solution for the QCQP. Although
the problem is still straightforward to solve numerically (because it is
a convex problem; see Section 11.4), it requires an iterative solution
approach with multiple factorizations.
Similar to the line search, where we only obtain a sufficiently
good point instead of finding the exact minimum, in the trust-region

subproblem, we seek an approximate solution to the QCQP. Including


the trust-region constraint allows us to omit the requirement that 𝐻˜ 𝑘
be positive definite, which is used in most quasi-Newton methods. We
do not detail approximate solution methods to the QCQP, but there are
various options.79,84,85

79. Nocedal and Wright, Numerical Optimization, 2006.
84. Conn et al., Trust Region Methods, 2000.
85. Steihaug, The conjugate gradient method and trust regions in large scale optimization, 1983.

Figure 4.62 compares the bean function with a local quadratic
model, which is built using information about the point where the
arrow originates. The trust-region step seeks the minimum of the local
quadratic model within the circular trust region. Unlike line search
methods, as the size of the trust region changes, the direction of the
step (the solution to Eq. 4.98) might also change, as shown on the right
panel of Fig. 4.62.

[Fig. 4.62: Quadratic model (gray contours) compared to the actual function
(blue contours), and two different trust-region sizes (red circles). The trust-region
step s_k finds the minimum of the quadratic model while remaining within the
trust region. The steepest-descent direction p is shown for comparison.]

4.5.2 Trust-Region Sizing Strategy


This section presents an algorithm for updating the size of the trust
region at each iteration. The trust region can grow, shrink, or remain the
same, depending on how well the model predicts the actual function
decrease. The metric we use to assess the model is the actual function
decrease divided by the expected decrease:

r = \frac{f(x) - f(x + s)}{\tilde{f}(0) - \tilde{f}(s)} .    (4.99)

The denominator in this definition is the expected decrease, which is


always positive. The numerator is the actual change in the function,
which could be a reduction or an increase. An 𝑟 value close to unity
means that the model agrees well with the actual function. An 𝑟 value
larger than 1 is fortuitous and means that the actual decrease was even
greater than expected. A negative value of 𝑟 means that the function
actually increased at the expected minimum, and therefore the model
is not suitable.

The trust-region sizing strategy in Alg. 4.9 determines the size of


the trust region at each major iteration 𝑘 based on the value of 𝑟 𝑘 . The
parameters in this algorithm are not derived from any theory; instead,
they are empirical. This example uses the basic procedure from Nocedal
and Wright79 with the parameters recommended by Conn et al.84

79. Nocedal and Wright, Numerical Optimization, 2006.
84. Conn et al., Trust Region Methods, 2000.

Algorithm 4.9 Trust-region algorithm

Inputs:
𝑥 0 : Starting point
Δ0 : Initial size of the trust region
Outputs:
𝑥 ∗ : Optimal point

while not converged do


Compute or estimate the Hessian
Solve (approximately) for 𝑠 𝑘 Use Eq. 4.97
Compute 𝑟 𝑘 Use Eq. 4.99
⊲ Resize trust region
if 𝑟 𝑘 ≤ 0.05 then Poor model
Δ 𝑘+1 = Δ 𝑘 /4 Shrink trust region
𝑠𝑘 = 0 Reject step
else if 𝑟 𝑘 ≥ 0.9 and k𝑠 𝑘 k = Δ 𝑘 then Good model and step to edge
Δ 𝑘+1 = min(2Δ 𝑘 , Δmax ) Expand trust region
else Reasonable model and step within trust region
Δ 𝑘+1 = Δ 𝑘 Maintain trust region size
end if
𝑥 𝑘+1 = 𝑥 𝑘 + 𝑠 𝑘 Update location of trust region
𝑘 = 𝑘+1 Update iteration count
end while
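A rough Python sketch of one iteration of this sizing logic follows;
solve_subproblem is a hypothetical placeholder for any approximate solver of
Eq. 4.98 that returns the step and the predicted decrease, and the numerical
thresholds are the empirical values quoted in Alg. 4.9.

import numpy as np

# Sketch of one iteration of the trust-region strategy in Alg. 4.9.
def trust_region_step(f, x, delta, delta_max, solve_subproblem):
    s, predicted_decrease = solve_subproblem(x, delta)
    r = (f(x) - f(x + s)) / predicted_decrease    # actual/expected decrease (Eq. 4.99)
    if r <= 0.05:                                 # poor model: shrink and reject step
        return x, delta / 4
    if r >= 0.9 and np.linalg.norm(s) >= delta - 1e-12:
        delta = min(2 * delta, delta_max)         # good model, step at the edge: expand
    return x + s, delta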

The initial value of Δ is usually 1, assuming the problem is already


well scaled. One way to rationalize the trust-region method is that the
quadratic approximation of a nonlinear function is guaranteed to be
reasonable only within a limited region around the current point 𝑥 𝑘 .
Thus, we minimize the quadratic function within a region around 𝑥 𝑘
within which we trust the quadratic model.
When our model performs well, we expand the trust region. When it
performs poorly, we shrink the trust region. If we shrink the trust region
sufficiently, our local model will eventually be a good approximation
of the actual function, as dictated by the Taylor series expansion.
We should also set a maximum trust-region size (Δmax ) to prevent
the trust region from expanding too much. Otherwise, it may take
too long to reduce the trust region to an acceptable size over other
portions of the design space where a smaller trust region is needed.
The same convergence criteria used in other gradient-based methods
are applicable.∗

∗ Conn et al.84 provide more detail on trust-region problems, including
trust-region norms and scaling, approaches to solving the trust-region
subproblem, extensions to the model, and other important practical
considerations.

Example 4.20 Trust-region method applied to the total potential energy of the
spring system
Minimizing the total potential energy function from Ex. 4.17 using a trust-
region method starting from the same points as before yields the optimization
path shown in Fig. 4.63. The initial trust region size is Δ = 0.3, and the
maximum allowable is Δmax = 1.5.

[Fig. 4.63: Minimizing the total potential for the two-spring system using a
trust-region method, shown at iterations k = 0, 3, 5, 8, 11, and 15. The local
quadratic approximation is overlaid on the function contours, and the trust
region is shown as a red circle.]

The first few quadratic approximations do not have a minimum because
the function has negative curvature around the starting point, but the trust
region prevents steps that are too large. When it gets close enough to the bowl
containing the minimum, the quadratic approximation has a minimum, and
the trust-region subproblem yields a minimum within the trust region. In the
last few iterations, the quadratic is a good model, and therefore the region
remains large.

Example 4.21 Trust-region method applied to the Rosenbrock function

We now test the trust-region method on the Rosenbrock function. The


overall path is similar to the other second-order methods, as shown in Fig. 4.64.
The initial trust region size is Δ = 1, and the maximum allowable is Δmax = 5.
At any given point, the direction of maximum curvature of the quadratic
approximation matches the maximum curvature across the valley and rotates
as we track the bottom of the valley toward the minimum.

[Fig. 4.64: Minimization of the Rosenbrock function using a trust-region method,
shown at iterations k = 0, 3, 7, 12, 17, and 35.]

4.5.3 Comparison with Line Search Methods


Trust-region methods are typically more strongly dependent on accurate
Hessians than are line search methods. For this reason, they are usually
only effective when exact gradients (or better yet, an exact Hessian)
can be supplied. Many optimization packages require the user to
provide the full Hessian, or at least the gradients, to use a trust-region
approach. Trust-region methods usually require fewer iterations than
quasi-Newton methods with a line search, but each iteration is more
computationally expensive because they require at least one matrix
factorization.

Scaling can also be more challenging with trust-region approaches.


Newton’s method is invariant with scaling, but a Euclidean trust-region
constraint implicitly assumes that the function changes in each direction
at a similar rate. Some enhancements try to address this issue through
elliptical trust regions rather than spherical ones.

Tip 4.7 Smooth model discontinuities

Many models are defined in a piecewise manner, resulting in a discontinu-


ous function value, discontinuous derivative, or both. This can happen even if
the underlying physical behavior is continuous, such as fitting experimental
data using a non-smooth interpolation. The solution is to modify the implemen-
tation so that it is continuous while remaining consistent with the physics. If the
physics is truly discontinuous, it might still be advisable to artificially smooth
the function, as long as there is no significant increase in the modeling error.
Even if the smoothed version is highly nonlinear, having a continuous first
derivative helps the derivative computation and gradient-based optimization.
Some techniques are specific to the problem, but we discuss some examples
here.
The absolute value function can often be tolerated as the outermost level of
the optimization. However, if propagated through subsequent functions, it can
introduce numerical issues from rapid changes in the function. One possibility
to smooth this function is to round off the vertex with a quadratic function, as
shown in Fig. 4.65. If we force continuity in the function and the first derivative,
then the equation of a smooth absolute value is

f(x) =
\begin{cases}
|x| & \text{if } |x| > \Delta x \\
\dfrac{x^2}{2 \Delta x} + \dfrac{\Delta x}{2} & \text{otherwise} ,
\end{cases}
    (4.100)

[Fig. 4.65: Smoothed absolute value function.]
where Δ𝑥 is a user-adjustable parameter representing the half-width of the
transition.
Piecewise functions are often used in fits to empirical data. Cubic splines
or a sigmoid function can blend the transition between two functions smoothly.
We can also use the same technique to blend discrete steps (where the two
functions are constant values) or implement smooth max or min functions.†

† Another option to smooth the max of multiple functions is aggregation, which
is detailed in Section 5.7.

For example, a sigmoid can be used to blend two functions (f_1(x) and f_2(x))
together at a transition point x_t using

f(x) = f_1(x) + \left( f_2(x) - f_1(x) \right) \frac{1}{1 + e^{-h(x - x_t)}} ,    (4.101)
where ℎ is a user-selected parameter that controls how sharply the transition
occurs. The left side of Fig. 4.66 shows an example transitioning 𝑥 and 𝑥 2 with
𝑥 𝑡 = 0 and ℎ = 50.

[Fig. 4.66: Smoothly blending two functions with a sigmoid function (left) and
a cubic spline (right).]

Another approach is to use a cubic spline for the blending. Given a transition
point 𝑥 𝑡 and a half-width Δ𝑥, we can compute a cubic spline transition as



f(x) =
\begin{cases}
f_1(x) & \text{if } x < x_1 \\
f_2(x) & \text{if } x > x_2 \\
c_1 x^3 + c_2 x^2 + c_3 x + c_4 & \text{otherwise} ,
\end{cases}
    (4.102)

where we define x_1 = x_t − Δx and x_2 = x_t + Δx, and the coefficients c are found
by solving the following linear system:

\begin{bmatrix}
x_1^3 & x_1^2 & x_1 & 1 \\
x_2^3 & x_2^2 & x_2 & 1 \\
3 x_1^2 & 2 x_1 & 1 & 0 \\
3 x_2^2 & 2 x_2 & 1 & 0
\end{bmatrix}
\begin{bmatrix} c_1 \\ c_2 \\ c_3 \\ c_4 \end{bmatrix}
=
\begin{bmatrix} f_1(x_1) \\ f_2(x_2) \\ f_1'(x_1) \\ f_2'(x_2) \end{bmatrix} .
    (4.103)
This ensures continuity in the function and the first derivative. The right side
of Fig. 4.66 shows the same two functions and transition location, blended with
a cubic spline using a half-width of 0.05.
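As a sketch, these smoothing techniques might be implemented as follows;
the function names and default parameters are illustrative.

import numpy as np

# Sketches of the smoothing techniques in Tip 4.7.
def smooth_abs(x, dx=0.1):
    # Smoothed absolute value (Eq. 4.100) with transition half-width dx.
    return np.where(np.abs(x) > dx, np.abs(x), x ** 2 / (2 * dx) + dx / 2)

def sigmoid_blend(f1, f2, x, xt=0.0, h=50.0):
    # Blend f1 and f2 at transition point xt using a sigmoid (Eq. 4.101).
    return f1(x) + (f2(x) - f1(x)) / (1.0 + np.exp(-h * (x - xt)))

def cubic_blend_coeffs(f1, f2, df1, df2, xt, dx):
    # Coefficients of the cubic blending polynomial by solving Eq. 4.103.
    x1, x2 = xt - dx, xt + dx
    A = np.array([[x1**3, x1**2, x1, 1.0],
                  [x2**3, x2**2, x2, 1.0],
                  [3*x1**2, 2*x1, 1.0, 0.0],
                  [3*x2**2, 2*x2, 1.0, 0.0]])
    b = np.array([f1(x1), f2(x2), df1(x1), df2(x2)])
    return np.linalg.solve(A, b)   # c1, c2, c3, c4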

Tip 4.8 Gradient-based optimization can find the global optimum

Gradient-based methods are local search methods. If the design space is
fundamentally multimodal, it may be helpful to augment the gradient-based
search with a global search. The simplest and most common approach is to use
a multistart approach, where we run a gradient-based search multiple times,
starting from different points, as shown in Fig. 4.67. The starting points might
be chosen from engineering intuition, randomly generated points, or sampling
methods, such as Latin hypercube sampling (see Section 10.2.1).

[Fig. 4.67: A multistart approach with a gradient-based algorithm finds the
global minimum of the Jones function. We successfully apply the same strategy
to a discontinuous version of this function in Ex. 7.9.]

Convergence testing is needed to determine a suitable number of starting
points. If all points converge to the same optimum and the starting points are
well spaced, this suggests that the design space might not be multimodal after
all. By using multiple starting points, we increase the likelihood that we find
the global optimum, or at least that we find a better optimum than would be
the global optimum, or at least that we find a better optimum than would be of this function in Ex. 7.9.
4 Unconstrained Gradient-Based Optimization 146

found with a single starting point. One advantage of this approach is that it
can easily be run in parallel.
Another approach is to start with a global search strategy (see Chapter 7).
After a suitable initial exploration, the design(s) given by the global search
become starting points for gradient-based optimization(s). This finds points
that satisfy the optimality conditions, which is typically challenging with a
pure gradient-free approach. It also improves the convergence rate and finds
optima more precisely.
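A minimal multistart sketch might look as follows, assuming
scipy.optimize.minimize as the local gradient-based optimizer and uniform
random sampling within user-supplied bounds; both are illustrative choices.

import numpy as np
from scipy.optimize import minimize

# Sketch of a multistart strategy: run a local gradient-based optimization
# from several starting points and keep the best result.
def multistart(f, lower, upper, n_starts=20, seed=0):
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_starts):
        x0 = rng.uniform(lower, upper)           # random starting point
        res = minimize(f, x0, method="BFGS")     # local gradient-based search
        if best is None or res.fun < best.fun:
            best = res
    return best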

4.6 Summary

Gradient-based optimization is powerful because gradients make it


possible to efficiently navigate 𝑛-dimensional space in a series of steps
converging to an optimum. The gradient also determines when the
optimum has been reached, which is when the gradient is zero.
Gradients provide only local information, so an approach that
ensures a function decrease when stepping away from the current point
is required. There are two approaches to ensure this: line search and
trust region. Algorithms based on a line search have two stages: finding
an appropriate search direction and determining how far to step in
that direction. Trust-region algorithms minimize a surrogate function
within a finite region around the current point. The region expands or
contracts, depending on how well the optimization within the previous
iteration went. Gradient-based optimization algorithms based on a line
search are more prevalent than trust-region methods, but trust-region
methods can be effective when second derivatives are available.
There are different options for determining the search direction for
each line search using gradient information. Although the negative
gradient points in the steepest-descent direction, following this direction
is not the best approach because it is prone to oscillations. The conjugate
gradient method dampens these oscillations and thus converges much
faster than steepest descent.
Second-order methods use curvature information, which dramati-
cally improves the rate of convergence. Newton’s method converges
quadratically but requires the Hessian of the function, which can be
prohibitive. Quasi-Newton methods circumvent this requirement by
building an approximation of the inverse of the Hessian based on
changes in the gradients along the optimization path. Quasi-Newton
methods also avoid matrix factorization, requiring matrix-vector multi-
plication instead. Because they are much less costly while achieving
better than linear convergence, quasi-Newton methods are widely

used. Limited-memory quasi-Newton methods can be used when the


problem is too large to fit in computer memory.
The line search in a given direction does not seek to find a minimum
because this is not usually worthwhile. Instead, it seeks to find a “good
enough” point that sufficiently decreases the function and the slope.
Once such a point is found, we select a new search direction and repeat
the process. Second-order methods provide a guess for the first step
length in the line search that further improves overall convergence.
This chapter provides the building blocks for the gradient-based
constrained optimization covered in the next chapter.

Problems

4.1 Answer true or false and justify your answer.

a. Gradient-based optimization requires the function to be
continuous and infinitely differentiable.
b. Gradient-based methods perform a local search.
c. Gradient-based methods are only effective for problems with
one minimum.
d. The dot product of ∇ 𝑓 with a unit vector 𝑝 yields the slope
of 𝑓 along the direction of 𝑝.
e. The Hessian of a unimodal function is positive definite or
positive semidefinite everywhere.
f. Each column 𝑗 of the Hessian quantifies the rate of change
of component 𝑗 of the gradient vector with respect to all
coordinate directions 𝑖.
g. If the function curvature at a point is zero in some direction,
that point cannot be a local minimum.
h. A globalization strategy in a gradient-based algorithm en-
sures convergence to the global minimum.
i. The goal of the line search is to find the minimum along a
given direction.
j. For minimization, the line search must always start in a
descent direction.
k. The direction in the steepest-descent algorithm for a given
iteration is orthogonal to the direction of the previous itera-
tion.
l. Newton’s method is not affected by problem scaling.
m. Quasi-Newton methods approximate the function Hessian
by using gradients.
n. Newton’s method is a good choice among gradient-based
methods because it uses exact second-order information and
therefore converges well from any starting point.
o. The trust-region method does not require a line search.

4.2 Consider the function

𝑓(𝑥1, 𝑥2, 𝑥3) = 𝑥1^2 𝑥2 + 4𝑥2^4 − 𝑥2𝑥3 + 𝑥3^(−1) ,

and answer the following:


a. Find the gradient of this function. Where is the gradient not
defined?
b. Calculate the directional derivative of the function at 𝑥 𝐴 =
(2, −1, 5) in the direction 𝑝 = [6, −2, 3].
c. Find the Hessian of this function. Is the curvature in the
direction 𝑝 positive or negative?
d. Write the second-order Taylor series expansion of this func-
tion. Plot the Taylor series function along the 𝑝 direction
and compare it to the actual function.

4.3 Consider the function from Ex. 4.1,

𝑓(𝑥1, 𝑥2) = 𝑥1^3 + 2𝑥1𝑥2^2 − 𝑥2^3 − 20𝑥1 .     (4.104)

Find the critical points of this function analytically and classify
them. What is the global minimum of this function?

4.4 Review Kepler’s wine barrel story from Section 2.2. Approximate
the barrel as a cylinder and find the height and diameter of a
barrel that maximizes its volume for a diagonal measurement of
1 m.

4.5 Consider the following function:

𝑓 = 𝑥1^4 + 3𝑥1^3 + 3𝑥2^2 − 6𝑥1𝑥2 − 2𝑥2 .

Find the critical points analytically and classify them. Where is
the global minimum? Plot the function contours to verify your
results.

4.6 Consider a slightly modified version of the function from Prob. 4.5,
where we add an 𝑥2^4 term to get

𝑓 = 𝑥1^4 + 𝑥2^4 + 3𝑥1^3 + 3𝑥2^2 − 6𝑥1𝑥2 − 2𝑥2 .

Can you find the critical points analytically? Plot the function
contours. Locate the critical points graphically and classify them.

4.7 Implement the two line search algorithms from Section 4.3, such
that they work in 𝑛 dimensions (𝑥 and 𝑝 can be vectors of any
size).

a. As a first test for your code, reproduce the results from the
examples in Section 4.3 and plot the function and iterations
for both algorithms. For the line search that satisfies the
strong Wolfe conditions, reduce the value of 𝜇2 until you get
an exact line search. How much accuracy can you achieve?

b. Test your code on another easy two-dimensional function,
such as the bean function from Ex. 4.11, starting from different
points and using different directions (but remember that
you must always provide a valid descent direction; other-
wise, the algorithm might not work!). Does it always find a
suitable point? Exploration: Try different values of 𝜇2 and 𝜌
to analyze their effect on the number of iterations.
c. Apply your line search algorithms to the two-dimensional
Rosenbrock function and then the 𝑛-dimensional variant
(see Appendix D.1.2). Again, try different points and search
directions to see how robust the algorithm is, and try to tune
𝜇2 and 𝜌.

4.8 Consider the one-dimensional function

𝑓(𝑥) = − 𝑥 / (𝑥^2 + 2) .

Solve this problem using your line search implementations from
Prob. 4.7. Start from 𝑥0 = 0 and with an initial step of 𝛼0 = −𝑘 𝑓′(𝑥0),
where 𝑘 = 1.

a. How many function evaluations are required for each of the
algorithms? Plot the points where each algorithm terminates
on top of the function.
b. Try a different initial step of 𝑘 = 20 from the same starting
point. Did your algorithms work as expected? Explain the
behaviors.
c. Start from 𝑥 0 = 30 with 𝑘 = 20 and discuss the results.

4.9 Program the steepest-descent, conjugate gradient, and BFGS
algorithms from Section 4.4. You must have a thoroughly tested
line search algorithm from the previous exercise first. For the
gradients, differentiate the functions analytically and compute
them exactly. Solve each problem using your implementations
of the various algorithms, as well as off-the-shelf optimization
software for comparison (a minimal example of calling such
software is sketched after this problem).

a. For your first test problem, reproduce the results from the
examples in Section 4.4.
b. Minimize the two-dimensional Rosenbrock function (see
Appendix D.1.2) using the various algorithms and compare
your results starting from 𝑥 = (−1, 2). Compare the total
number of evaluations. Compare the number of minor
versus major iterations. Discuss the trends. Exploration: Try
different starting points and tuning parameters (e.g., 𝜌 and
𝜇2 in the line search) and compare the number of major and
minor iterations.
c. Benchmark your algorithms on the 𝑛-dimensional variant
of the Rosenbrock function (see Appendix D.1.2). Try 𝑛 = 3
and 𝑛 = 4 first, then 𝑛 = 8, 16, 32, . . .. What is the highest
number of dimensions you can solve? How does the number
of function evaluations scale with the number of variables?
d. Optional: Implement L-BFGS and compare it with BFGS.
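For the off-the-shelf comparison mentioned in Prob. 4.9, one possible choice is
SciPy; the following minimal sketch runs its BFGS implementation on the
two-dimensional Rosenbrock function so that its iteration and evaluation counts
can be compared with your own code.

import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

x0 = np.array([-1.0, 2.0])   # starting point from Prob. 4.9b
res = minimize(rosen, x0, jac=rosen_der, method="BFGS")
print(res.x)                 # should approach (1, 1)
print(res.nit, res.nfev)     # major iterations and function evaluations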

4.10 Implement a trust-region algorithm and apply it to one or more
of the test problems from the previous exercise. Compare the
trust-region results with BFGS and the off-the-shelf software.

4.11 Consider the aircraft wing design problem described in Ap-
pendix D.1.6. Program the model and solve the problem using an
optimizer of your choice. Plot the optimization path and conver-
gence histories. Exploration: Change the model to fit an aircraft
of your choice by picking the appropriate parameter values and
solve the same optimization problem.

4.12 The brachistochrone problem seeks to find the path that minimizes
travel time between two points for a particle under the force of
gravity.∗ Solve the discretized version of this problem using
an optimizer of your choice (see Appendix D.1.7 for a detailed
description).

∗ This problem was mentioned in Section 2.2 as one of the problems
that inspired developments in calculus of variations.

a. Plot the optimal path for the frictionless case with 𝑛 = 10
and compare it to the exact solution (see Appendix D.1.7).
b. Solve the optimal path with friction and plot the resulting
path. Report the travel time between the two points and
compare it to the frictionless case.
c. Study the effect of increased problem dimensionality. Start
with 4 points and double the dimension each time up to
128 points. Plot and discuss the increase in computational
expense with problem size. Example metrics include the
number of major iterations, function evaluations, and com-
putational time. Hint: When solving the higher-dimensional
cases, start with the solution interpolated from a lower-
dimensional case—this is called a warm start.
5 Constrained Gradient-Based Optimization
Engineering design optimization problems are rarely unconstrained. In
this chapter, we explain how to solve constrained problems. The meth-
ods in this chapter build on the gradient-based unconstrained methods
from Chapter 4 and also assume smooth functions. We first introduce
the optimality conditions for a constrained optimization problem and
then focus on three main methods for handling constraints: penalty
methods, sequential quadratic programming (SQP), and interior-point
methods.
Penalty methods are no longer used in constrained gradient-based
optimization because they have been replaced by more effective meth-
ods. Still, the concept of a penalty is useful when thinking about
constraints, partially motivates more sophisticated approaches like
interior-point methods, and is often used with gradient-free optimizers.
SQP and interior-point methods represent the state of the art in
nonlinear constrained optimization. We introduce the basics for these
two optimization methods, but a complete and robust implementation
of these methods requires detailed knowledge of a growing body of
literature that is not covered here.

By the end of this chapter you should be able to:

1. State and understand the optimality conditions for a con-
strained problem.
2. Understand the motivation for and the limitations of
penalty methods.
3. Understand the concepts behind state-of-the-art con-
strained optimization algorithms and use them to solve
real engineering problems.


5.1 Constrained Problem Formulation

We can express a general constrained optimization problem as

minimize    𝑓(𝑥)
by varying  𝑥𝑖,   𝑖 = 1, . . . , 𝑛𝑥
subject to  𝑔𝑗(𝑥) ≤ 0,   𝑗 = 1, . . . , 𝑛𝑔                    (5.1)
            ℎ𝑙(𝑥) = 0,   𝑙 = 1, . . . , 𝑛ℎ
            \underline{𝑥}𝑖 ≤ 𝑥𝑖 ≤ \overline{𝑥}𝑖,   𝑖 = 1, . . . , 𝑛𝑥 ,

where 𝑔(𝑥) is the vector of inequality constraints, ℎ(𝑥) is the vector of
equality constraints, and \underline{𝑥} and \overline{𝑥} are lower and upper design variable
bounds (also known as bound constraints). Both objective and constraint
functions can be nonlinear, but they should be 𝐶 2 continuous to be
solved using gradient-based optimization algorithms. The inequality
constraints are expressed as “less than” without loss of generality
because they can always be converted to “greater than” by putting a
negative sign on 𝑔. We could also eliminate the equality constraints
ℎ = 0 without loss of generality by replacing it with two inequality con-
straints, ℎ ≤ 𝜀 and −ℎ ≤ 𝜀, where 𝜀 is some small number. In practice,
it is desirable to distinguish between equality and inequality constraints
because of numerical precision and algorithm implementation.

Example 5.1 Graphical solution of constrained problem

Consider the following two-variable problem with quadratic objective and
constraint functions:

minimize    𝑓(𝑥1, 𝑥2) = 𝑥1^2 − (1/2)𝑥1 − 𝑥2 − 2
subject to  𝑔1(𝑥1, 𝑥2) = 𝑥1^2 − 4𝑥1 + 𝑥2 + 1 ≤ 0
            𝑔2(𝑥1, 𝑥2) = (1/2)𝑥1^2 + 𝑥2^2 − 𝑥1 − 4 ≤ 0 .

We can plot the contours of the objective function and the constraint lines
(𝑔1 = 0 and 𝑔2 = 0), as shown in Fig. 5.1. We can see the feasible region defined
by the two constraints. The approximate location of the minimum is evident
by inspection. We can visualize the contours for this problem because the
functions can be evaluated quickly and because it has only two dimensions. If
the functions were more expensive, we would not be able to afford the many
evaluations needed to plot the contours. If the problem had more dimensions,
it would become difficult or impossible to visualize the functions and feasible
space fully.

Fig. 5.1 Graphical solution for constrained problem showing contours of the
objective, the two constraint curves, and the shaded infeasible region.
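For a problem like this one, the graphical solution is easy to produce
programmatically. The following is a minimal sketch using NumPy and Matplotlib
(an assumed choice of plotting tools); the functions are the ones written in this
example.

import numpy as np
import matplotlib.pyplot as plt

x1, x2 = np.meshgrid(np.linspace(-2, 4, 400), np.linspace(-2, 4, 400))
f = x1**2 - 0.5 * x1 - x2 - 2
g1 = x1**2 - 4 * x1 + x2 + 1
g2 = 0.5 * x1**2 + x2**2 - x1 - 4

plt.contour(x1, x2, f, 30)                     # objective contours
plt.contour(x1, x2, g1, [0.0], colors="red")   # constraint curve g1 = 0
plt.contour(x1, x2, g2, [0.0], colors="blue")  # constraint curve g2 = 0
plt.xlabel("x1")
plt.ylabel("x2")
plt.show()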

Tip 5.1 Do not mistake constraints for objectives

Practitioners sometimes consider metrics to be objectives when it would be
more appropriate to pose them as constraints. This can lead to a multiobjective
problem, which does not have a single optimum and is costly to solve (more on
this in Chapter 9).
A helpful rule of thumb is to ask yourself if improving that metric indefi-
nitely is desirable or whether there is some threshold after which additional
improvements do not matter. For example, you might state that you want to
maximize the range of an electric car. However, there is probably a threshold
beyond which increasing the range does not improve the car’s desirability (e.g.,
if the range is greater than can be driven in one day). In that case, the range
should be posed as a constraint, and the objective should be another metric,
such as efficiency or profitability.

The constrained problem formulation just described does not dis-
tinguish between nonlinear and linear constraints. It is advantageous
to make this distinction because some algorithms can take advantage
of these differences. However, the methods introduced in this chapter
assume general nonlinear functions.
For unconstrained gradient-based optimization (Chapter 4), we
only require the gradient of the objective, ∇ 𝑓 . To solve a constrained
problem, we also require the gradients of all the constraints. Because
the constraints are vectors, their derivatives yield a Jacobian matrix. For
the equality constraints, the Jacobian is defined as

             ⎡ ∂ℎ1/∂𝑥1     · · ·   ∂ℎ1/∂𝑥𝑛𝑥  ⎤   ⎡ ∇ℎ1^T   ⎤
𝐽ℎ = ∂ℎ/∂𝑥 = ⎢    ⋮          ⋱        ⋮      ⎥ = ⎢    ⋮    ⎥ ,        (5.2)
             ⎣ ∂ℎ𝑛ℎ/∂𝑥1    · · ·   ∂ℎ𝑛ℎ/∂𝑥𝑛𝑥 ⎦   ⎣ ∇ℎ𝑛ℎ^T  ⎦

which is an (𝑛ℎ × 𝑛𝑥) matrix whose rows are the gradients of each
constraint. Similarly, the Jacobian of the inequality constraints is an
(𝑛𝑔 × 𝑛𝑥) matrix.
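To make this structure concrete, the following minimal sketch builds a
constraint Jacobian row by row with forward finite differences (the constraint
function, point, and step size are placeholders chosen for illustration; finite
differences are only one of several options for computing these derivatives).

import numpy as np

def h(x):
    # Placeholder equality constraints, h: R^3 -> R^2
    return np.array([x[0] + x[1]**2 - 1.0,
                     x[0] * x[2] - 2.0])

def constraint_jacobian(h, x, dx=1e-7):
    h0 = h(x)
    J = np.zeros((h0.size, x.size))    # (n_h x n_x)
    for i in range(x.size):
        xp = x.copy()
        xp[i] += dx
        J[:, i] = (h(xp) - h0) / dx    # column i: dh/dx_i
    return J

x = np.array([1.0, 2.0, 3.0])
print(constraint_jacobian(h, x))       # each row is the gradient of one constraint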

Tip 5.2 Do not specify design variable bounds as nonlinear constraints

The design variable bounds in the general nonlinear constrained problem
(Eq. 5.1) are expressed as \underline{𝑥} ≤ 𝑥 ≤ \overline{𝑥}, where \underline{𝑥} is the vector of lower bounds and
\overline{𝑥} is the vector of upper bounds. Bounds are treated differently in optimization
algorithms, so they should be specified as a bound constraint rather than a
general nonlinear constraint. Some bounds stem from physical limitations
on the engineering system. If not otherwise limited, the bounds should be
sufficiently wide not to constrain the problem artificially. It is good practice to
check your optimal solution against your design variable bounds to ensure that
you have not artificially constrained the problem.
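As one concrete illustration of this tip, most optimization libraries accept
bounds separately from general constraints. A minimal sketch using SciPy (one
possible choice; the objective, constraint, and numbers are placeholders):

import numpy as np
from scipy.optimize import minimize, Bounds, NonlinearConstraint

def f(x):
    return x[0]**2 + x[1]**2

def g(x):
    return x[0] + x[1] - 1.0                    # enforce x1 + x2 >= 1

bounds = Bounds([0.0, 0.0], [10.0, 10.0])       # bound constraints, not nonlinear ones
con = NonlinearConstraint(g, 0.0, np.inf)       # general nonlinear constraint
res = minimize(f, np.array([5.0, 5.0]), method="trust-constr",
               bounds=bounds, constraints=[con])
print(res.x)                                    # approx (0.5, 0.5)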

5.2 Understanding n-Dimensional Space

Understanding the optimality conditions and optimization algorithms
for constrained problems requires basic 𝑛-dimensional geometry and
linear algebra concepts. Here, we review the concepts in an informal
way.∗ We sketch the concepts for two and three dimensions to provide
some geometric intuition but keep in mind that the only way to tackle
𝑛 dimensions is through mathematics.

∗ For a more formal introduction to these concepts, see Chapter 2 in Boyd and
Vandenberghe.86 Strang87 provides a comprehensive treatment of linear algebra.
86. Boyd and Vandenberghe, Convex Optimization, 2004.
87. Strang, Linear Algebra and its Applications, 2006.

There are several essential linear algebra concepts for constrained
optimization. The span of a set of vectors is the space formed by all the
points that can be obtained by a linear combination of those vectors.
With one vector, this space is a line, with two linearly independent
vectors, this space is a two-dimensional plane (see Fig. 5.2), and so
on. With 𝑛 linearly independent vectors, we can obtain any point in
𝑛-dimensional space.

Fig. 5.2 Span of one, two, and three vectors in three-dimensional space.

Because matrices are composed of vectors, we can apply the concept
of span to matrices. Suppose we have a rectangular (𝑚 × 𝑛) matrix 𝐴.
For our purposes, we are interested in considering the 𝑚 row vectors in
the matrix. The rank of 𝐴 is the number of linearly independent rows
of 𝐴, and it corresponds to the dimension of the space spanned by the
row vectors of 𝐴.

The nullspace of a matrix 𝐴 is the set of all 𝑛-dimensional vectors 𝑝
such that 𝐴𝑝 = 0. This is a subspace of 𝑛 − 𝑟 dimensions, where 𝑟 is
the rank of 𝐴. One fundamental theorem of linear algebra is that the
nullspace of a matrix contains all the vectors that are perpendicular to the row
space of that matrix and vice versa. This concept is illustrated in Fig. 5.3
for 𝑛 = 3, where 𝑟 = 2, leaving only one dimension for the nullspace.
Any vector 𝑣 that is perpendicular to 𝑝 must be a linear combination of
the rows of 𝐴, so it can be expressed as 𝑣 = 𝛼𝑎1 + 𝛽𝑎2.†

Fig. 5.3 Nullspace of a (2 × 3) matrix 𝐴 of rank 2, where 𝑎1 and 𝑎2 are the
row vectors of 𝐴.

† The subspaces spanned by 𝐴, 𝐴^T, and their respective nullspaces constitute
four fundamental subspaces, which we elaborate on in Appendix A.4.

A hyperplane is a generalization of a plane in 𝑛-dimensional space
and is an essential concept in constrained optimization. In a space of 𝑛
dimensions, a hyperplane is a subspace with at most 𝑛 − 1 dimensions.
In Fig. 5.4, we illustrate hyperplanes in two dimensions (a line) and
three dimensions (a two-dimensional plane); higher dimensions cannot
be visualized, but the mathematical description that follows holds for
any 𝑛.
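These ideas are easy to experiment with numerically. A minimal sketch using
NumPy and SciPy (the matrix is an arbitrary example):

import numpy as np
from scipy.linalg import null_space

A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])     # a (2 x 3) matrix whose rows span a plane

r = np.linalg.matrix_rank(A)        # rank: dimension spanned by the rows
Z = null_space(A)                   # columns form a basis of the nullspace
print(r, Z.shape)                   # 2, (3, 1): nullspace has n - r = 1 dimension
print(A @ Z)                        # ~0: nullspace vectors are perpendicular
                                    # to every row of A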

Fig. 5.4 Hyperplanes and half-spaces in two and three dimensions.

To define a hyperplane of 𝑛 − 1 dimensions, we just need a point
contained in the hyperplane (𝑥0) and a vector (𝑣). Then, the hyperplane
is defined as the set of all points 𝑥 = 𝑥0 + 𝑝 such that 𝑝 | 𝑣 = 0. That is,
the hyperplane is defined by all vectors that are perpendicular to 𝑣. To
define a hyperplane with 𝑛 − 2 dimensions, we would need two vectors,
and so on. In 𝑛 dimensions, a hyperplane of 𝑛 − 1 dimensions divides
the space into two half-spaces: in one of these, 𝑝 | 𝑣 > 0, and in the other,
𝑝 | 𝑣 < 0. Each half-space is closed if it includes the hyperplane (𝑝 | 𝑣 = 0)
and open otherwise.
When we have the isosurface of a function 𝑓 , the function gradient
at a point on the isosurface is locally perpendicular to the isosurface.
The gradient vector defines the tangent hyperplane at that point, which is
the set of points such that 𝑝 | ∇ 𝑓 = 0. In two dimensions, the isosurface
reduces to a contour and the tangent reduces to a line, as shown
in Fig. 5.5 (left). In three dimensions, we have a two-dimensional
hyperplane tangent to an isosurface, as shown in Fig. 5.5 (right).

Fig. 5.5 The gradient of a function defines the hyperplane tangent to the
function isosurface.

The intersection of multiple half-spaces yields a polyhedral cone. A
polyhedral cone is the set of all the points that can be obtained by
the linear combination of a given set of vectors using nonnegative
coefficients. This concept is illustrated in Fig. 5.6 (left) for the two-
dimensional case. In this case, only two vectors are required to define
a cone uniquely. In three dimensions and higher there could be any
number of vectors corresponding to all the possible polyhedral “cross
sections”, as illustrated in Fig. 5.6 (middle and right).

Fig. 5.6 Polyhedral cones in two and three dimensions.

5.3 Optimality Conditions

The optimality conditions for constrained optimization problems are
not as straightforward as those for unconstrained optimization (Sec-
tion 4.1.4). We begin with equality constraints because the mathematics
and intuition are simpler, then add inequality constraints. As in the
case of unconstrained optimization, the optimality conditions for con-
strained problems are used not only for the termination criteria, but
they are also used as the basis for optimization algorithms.

5.3.1 Equality Constraints


First, we review the optimality conditions for an unconstrained problem,
which we derived in Section 4.1.4. For the unconstrained case, we can
take a first-order Taylor series expansion of the objective function with
some step 𝑝 that is small enough that the second-order term is negligible
and write
𝑓 (𝑥 + 𝑝) ≈ 𝑓 (𝑥) + ∇ 𝑓 (𝑥)| 𝑝 . (5.3)
If 𝑥 ∗ is a minimum point, then every point in a small neighborhood
must have a greater value,

𝑓 (𝑥 ∗ + 𝑝) ≥ 𝑓 (𝑥 ∗ ) . (5.4)

Given the Taylor series expansion (Eq. 5.3), the only way that this
inequality can be satisfied is if

∇ 𝑓 (𝑥 ∗ )| 𝑝 ≥ 0 . (5.5)

The condition ∇𝑓^T 𝑝 = 0 defines a hyperplane that contains the
directions along which the first-order variation of the function is zero.
This hyperplane divides the space into an open half-space of directions
where the function decreases (∇ 𝑓 | 𝑝 < 0) and an open half-space where
the function increases (∇ 𝑓 | 𝑝 > 0), as shown in Fig. 5.7. Again, we are
considering first-order variations.
If the problem were unconstrained, the only way to satisfy the
inequality in Eq. 5.5 would be if ∇𝑓(𝑥∗) = 0. That is because for any
nonzero ∇𝑓, there is an open half-space of directions that result in a
function decrease (see Fig. 5.7). This is consistent with the first-order
unconstrained optimality conditions derived in Section 4.1.4.
However, we now have a constrained problem. The function increase
condition (Eq. 5.5) still applies, but 𝑝 must also be a feasible direction.
To find the feasible directions, we can write a first-order Taylor series
expansion for each equality constraint function as

ℎ𝑗(𝑥 + 𝑝) ≈ ℎ𝑗(𝑥) + ∇ℎ𝑗(𝑥)^T 𝑝,   𝑗 = 1, . . . , 𝑛ℎ .        (5.6)

Fig. 5.7 The gradient ∇𝑓(𝑥), which is the direction of steepest function
increase, splits the design space into two halves. Here we highlight the open
half-space of directions that result in function decrease.

Again, the step size is assumed to be small enough so that the higher-
order terms are negligible.
Assuming that 𝑥 is a feasible point, then ℎ𝑗(𝑥) = 0 for all constraints
𝑗, and we are left with the second term in the linearized constraint
(Eq. 5.6). To remain feasible a small step away from 𝑥, we require that
ℎ𝑗(𝑥 + 𝑝) = 0 for all 𝑗. Therefore, first-order feasibility requires that

∇ℎ 𝑗 (𝑥)| 𝑝 = 0, for all 𝑗 = 1, . . . , 𝑛 ℎ , (5.7)

which means that a direction is feasible when it is orthogonal to all equality
constraint gradients. We can write this in matrix form as

𝐽ℎ (𝑥)𝑝 = 0 . (5.8)

This equation states that any feasible direction has to lie in the nullspace
of the Jacobian of the constraints, 𝐽ℎ .
Assuming that 𝐽ℎ has full row rank (i.e., the constraint gradients are
linearly independent), then the feasible space is a subspace of dimension
𝑛𝑥 − 𝑛ℎ. For optimization to be possible, we require 𝑛𝑥 > 𝑛ℎ. Figure 5.8
illustrates a case where 𝑛𝑥 = 𝑛ℎ = 2, where the feasible space reduces
to a single point, and there is no freedom for performing optimization.

Fig. 5.8 If we have two equality constraints (𝑛ℎ = 2) in two-dimensional
space (𝑛𝑥 = 2), we are left with no freedom for optimization.

For one constraint, Eq. 5.8 reduces to a dot product, and the feasible
space corresponds to a tangent hyperplane, as illustrated on the left side
of Fig. 5.9 for the three-dimensional case. For two or more constraints,
the feasible space corresponds to the intersection of all the tangent
hyperplanes. On the right side of Fig. 5.9, we show the intersection of
two tangent hyperplanes in three-dimensional space (a line).

Fig. 5.9 Feasible spaces in three dimensions for one and two constraints.

For constrained optimality, we need to satisfy both ∇𝑓(𝑥∗)^T 𝑝 ≥ 0
(Eq. 5.5) and 𝐽ℎ(𝑥)𝑝 = 0 (Eq. 5.8). For equality constraints, if a direction
𝑝 is feasible, then −𝑝 must also be feasible. Therefore, the only way to
satisfy ∇ 𝑓 (𝑥 ∗ )| 𝑝 ≥ 0 is if ∇ 𝑓 (𝑥)| 𝑝 = 0.
In sum, for 𝑥 ∗ to be a constrained optimum, we require

∇ 𝑓 (𝑥 ∗ )| 𝑝 = 0 for all 𝑝 such that 𝐽ℎ (𝑥 ∗ )𝑝 = 0 . (5.9)

In other words, the projection of the objective function gradient onto the
feasible space must vanish. Figure 5.10 illustrates this requirement for a
case with two constraints in three dimensions.

Fig. 5.10 If the projection of ∇𝑓 onto the feasible space is nonzero, there is a
feasible descent direction (left); if the projection is zero, the point is a
constrained optimum (right).

The constrained optimum conditions (Eq. 5.9) require that ∇𝑓 be
orthogonal to the nullspace of 𝐽ℎ (since 𝑝, as defined, is in the nullspace
of 𝐽ℎ). The row space of a matrix contains all the vectors that are
orthogonal to its nullspace.∗ Because the rows of 𝐽ℎ are the gradients of
the constraints, the objective function gradient must be a linear combination
of the gradients of the constraints. Thus, we can write the requirements
defined in Eq. 5.9 as a single vector equation,

∇𝑓(𝑥∗) = − Σ_{𝑗=1}^{𝑛ℎ} 𝜆𝑗 ∇ℎ𝑗(𝑥∗) ,        (5.10)

where 𝜆𝑗 are called the Lagrange multipliers.† There is a multiplier
associated with each constraint. The sign of the Lagrange multipliers
is arbitrary for equality constraints but will be significant later when
dealing with inequality constraints.

∗ Recall the fundamental theorem of linear algebra illustrated in Fig. 5.3 and
the four subspaces reviewed in Appendix A.4.
† Despite our convention of reserving Greek symbols for scalars, we use 𝜆 to
represent the 𝑛ℎ-vector of Lagrange multipliers because it is common usage.

Therefore, the first-order optimality conditions for the equality
constrained case are

∇𝑓(𝑥∗) = −𝐽ℎ(𝑥)^T 𝜆
ℎ(𝑥) = 0 ,                                     (5.11)

where we have reexpressed Eq. 5.10 in matrix form and added the
constraint satisfaction condition.
In constrained optimization, it is sometimes convenient to use the
Lagrangian function, which is a scalar function defined as

ℒ(𝑥, 𝜆) = 𝑓 (𝑥) + ℎ(𝑥)| 𝜆 . (5.12)

In this function, the Lagrange multipliers are considered to be indepen-
dent variables. Taking the gradient of ℒ with respect to both 𝑥 and 𝜆
and setting them to zero yields

∇𝑥 ℒ = ∇ 𝑓 (𝑥) + 𝐽ℎ (𝑥)| 𝜆 = 0
(5.13)
∇𝜆 ℒ = ℎ(𝑥) = 0 ,

which are the first-order conditions derived in Eq. 5.11.


With the Lagrangian function, we have transformed a constrained
problem into an unconstrained problem by adding new variables,
𝜆. A constrained problem of 𝑛 𝑥 design variables and 𝑛 ℎ equality
constraints was transformed into an unconstrained problem with 𝑛 𝑥 +𝑛 ℎ
variables. Although you might be tempted to simply use the algorithms
of Chapter 4 to minimize the Lagrangian function (Eq. 5.12), some
modifications are needed in the algorithms to solve these problems
effectively (particularly once inequality constraints are introduced).
The derivation of the first-order optimality conditions (Eq. 5.11)
assumes that the gradients of the constraints are linearly independent;
that is, 𝐽ℎ has full row rank. A point satisfying this condition is
called a regular point and is said to satisfy linear independence constraint
qualification. Figure 5.11 illustrates a case where 𝑥∗ is not a regular
point. A special case that does not satisfy constraint qualification is
when one (or more) constraint gradient is zero. In that case, that
constraint is not linearly independent, and the point is not regular.
Fortunately, these situations are uncommon.

Fig. 5.11 The constraint qualification condition does not hold in this case
because the gradients of the two constraints are not linearly independent.

The optimality conditions just described are first-order conditions
that are necessary but not sufficient. To make sure that a point is a
constrained minimum, we also need to satisfy second-order conditions.
For the unconstrained case, the Hessian of the objective function has
to be positive definite. In the constrained case, we need to check the
Hessian of the Lagrangian with respect to the design variables in the

space of feasible directions. The Lagrangian Hessian is

𝐻ℒ = 𝐻𝑓 + Σ_{𝑗=1}^{𝑛ℎ} 𝜆𝑗 𝐻ℎ𝑗 ,                (5.14)

where 𝐻𝑓 is the Hessian of the objective, and 𝐻ℎ𝑗 is the Hessian of
equality constraint 𝑗. The second-order sufficient conditions are as
follows:

𝑝^T 𝐻ℒ 𝑝 > 0   for all 𝑝 such that 𝐽ℎ 𝑝 = 0 .        (5.15)
This ensures that the curvature of the Lagrangian is positive when
projected onto any feasible direction.
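These conditions can be checked numerically at a candidate point. The following
minimal sketch (the gradient, constraint Jacobian, and Lagrangian Hessian
arrays are placeholders supplied by whatever model is being optimized) estimates
the Lagrange multipliers by least squares from Eq. 5.11 and then tests the
projected curvature of Eq. 5.15 on a nullspace basis of 𝐽ℎ.

import numpy as np
from scipy.linalg import null_space

# Placeholder data evaluated at a candidate point x*
grad_f = np.array([1.0, 2.0])
J_h = np.array([[-0.7071, -1.4142]])            # (n_h x n_x)
H_L = np.array([[0.7071, 0.0], [0.0, 2.8284]])  # Lagrangian Hessian

# First-order check: solve grad_f = -J_h^T lambda in the least-squares sense
lam, *_ = np.linalg.lstsq(J_h.T, -grad_f, rcond=None)
print(grad_f + J_h.T @ lam)                     # stationarity residual, ~0

# Second-order check: reduced Hessian on the feasible subspace
Z = null_space(J_h)                             # basis for directions with J_h p = 0
print(np.linalg.eigvalsh(Z.T @ H_L @ Z))        # all eigenvalues > 0 => minimum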

Example 5.2 Equality constrained problem

Consider the following constrained problem featuring a linear objective
function and a quadratic equality constraint:

minimize    𝑓(𝑥1, 𝑥2) = 𝑥1 + 2𝑥2
subject to  ℎ(𝑥1, 𝑥2) = (1/4)𝑥1^2 + 𝑥2^2 − 1 = 0 .

The Lagrangian for this problem is

ℒ(𝑥1, 𝑥2, 𝜆) = 𝑥1 + 2𝑥2 + 𝜆 ((1/4)𝑥1^2 + 𝑥2^2 − 1) .

Differentiating this to get the first-order optimality conditions,

∂ℒ/∂𝑥1 = 1 + (1/2)𝜆𝑥1 = 0
∂ℒ/∂𝑥2 = 2 + 2𝜆𝑥2 = 0
∂ℒ/∂𝜆 = (1/4)𝑥1^2 + 𝑥2^2 − 1 = 0 .

Solving these three equations for the three unknowns (𝑥1, 𝑥2, 𝜆), we obtain two
possible solutions:

𝑥𝐴 = [𝑥1, 𝑥2] = [−√2, −√2/2] ,   𝜆𝐴 = √2 ,
𝑥𝐵 = [𝑥1, 𝑥2] = [√2, √2/2] ,     𝜆𝐵 = −√2 .

These two points are shown in Fig. 5.12, together with the objective and
constraint gradients. The optimality conditions (Eq. 5.11) state that the gradient
must be a linear combination of the gradients of the constraints at the optimum.
In the case of one constraint, this means that the two gradients are colinear
(which occurs in this example).
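The same stationary points can also be found numerically. A minimal sketch
using scipy.optimize.fsolve on the first-order conditions of this example (the
solver choice and the starting guesses are arbitrary):

import numpy as np
from scipy.optimize import fsolve

def stationarity(z):
    x1, x2, lam = z
    return [1.0 + 0.5 * lam * x1,           # dL/dx1 = 0
            2.0 + 2.0 * lam * x2,           # dL/dx2 = 0
            0.25 * x1**2 + x2**2 - 1.0]     # dL/dlam = h = 0

for z0 in ([-1.0, -1.0, 1.0], [1.0, 1.0, -1.0]):
    print(fsolve(stationarity, z0))         # converges to x_A or x_B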

Fig. 5.12 Two points satisfy the first-order optimality conditions; one is a
constrained minimum, and the other is a constrained maximum.

To determine if either of these points is a minimum, we check the second-
order conditions by evaluating the Hessian of the Lagrangian,

𝐻ℒ = [ (1/2)𝜆    0  ]
     [   0      2𝜆  ] .

The Hessian is only positive definite for the case where 𝜆𝐴 = √2, and therefore
𝑥𝐴 is a minimum. Although the Hessian only needs to be positive definite in
the feasible directions, in this case, we can show that it is positive or negative
definite in all possible directions. The Hessian is negative definite for 𝑥𝐵, so
this is not a minimum; instead, it is a maximum.
Figure 5.13 shows the Lagrangian function (with the optimal Lagrange
multiplier we solved for) overlaid on top of the original function and constraint.
The unconstrained minimum of the Lagrangian corresponds to the constrained
minimum of the original function. The Lagrange multiplier can be visualized
as a third dimension coming out of the page. Here we show only the slice for
the Lagrange multiplier that solves the optimality conditions.
as a third dimension coming out of the page. Here we show only the slice for −2 0 2
the Lagrange multiplier that solves the optimality conditions. 𝑥1

Fig. 5.13 The minimum of the La-


grangian function with the optimum

Lagrange multiplier value (𝜆 = 2)
Example 5.3 Second-order conditions for constrained case is the constrained minimum of the
original problem.
Consider the following problem:

minimize 𝑓 (𝑥1 , 𝑥2 ) = 𝑥12 + 3(𝑥2 − 2)2


𝑥 1 ,𝑥2

subject to ℎ(𝑥1 , 𝑥2 ) = 𝛽𝑥 12 − 𝑥2 = 0 ,

where 𝛽 is a parameter that we will vary to change the characteristics of the


constraint.
The Lagrangian for this problem is
 
ℒ(𝑥1 , 𝑥2 , 𝜆) = 𝑥12 + 3(𝑥2 − 2)2 + 𝜆 𝛽𝑥12 − 𝑥2 .
5 Constrained Gradient-Based Optimization 163

Differentiating for the first-order optimality conditions, we get


 
2𝑥1 (1 + 𝜆𝛽)
∇𝑥 ℒ = =0
6(𝑥2 − 2) − 𝜆
∇𝜆 ℒ = 𝛽𝑥12 − 𝑥2 = 0 .
Solving these three equations for the three unknowns (𝑥1 , 𝑥2 , 𝜆), the solution is
[𝑥 1 , 𝑥2 , 𝜆] = [0, 0, −12], which is independent of 𝛽.
To determine if this is a minimum, we must check the second-order
conditions by evaluating the Hessian of the Lagrangian,
 
2(1 − 12𝛽) 0
𝐻ℒ = .
0 6
We only need 𝐻ℒ to be positive definite in the feasible directions. The feasible
directions are all 𝑝 such that 𝐽 ℎ 𝑝 = 0. In this case, 𝐽 ℎ = [2𝛽𝑥1 , −1], yielding
|

𝐽 ℎ (𝑥 ∗ ) = [0, −1]. Therefore, the feasible directions at the solution can be


represented as 𝑝 = [𝛼, 0], where 𝛼 is any real number. For positive curvature
in the feasible directions, we require that
𝑝 | 𝐻ℒ 𝑝 = 2𝛼2 (1 − 12𝛽) > 0 .
Thus, the second-order sufficient condition requires that 𝛽 < 1/12.‡
We plot the constraint and the Lagrangian for three different values of
‡ This happens to be the same condition for
𝛽 in Fig. 5.14. The location of the point satisfying the first-order optimality
a positive-definite 𝐻ℒ in this case, but this
conditions is the same for all three cases, but the curvature of the constraint does not happen in general.
changes the Lagrangian significantly.

𝛽 = −0.5                    𝛽 = 1/12                    𝛽 = 0.5

Fig. 5.14 Three different problems illustrating the meaning of the second-
order conditions for constrained problems.

For 𝛽 = −0.5, the Hessian of the Lagrangian is positive definite, and we
have a minimum. For 𝛽 = 0.5, the Lagrangian has negative curvature in the
feasible directions, so the point is not a minimum; we can reduce the objective
by moving along the curved constraint. The first-order conditions alone do
not capture this possibility because they linearize the constraint. Finally, in the
limiting case (𝛽 = 1/12), the curvature of the constraint matches the curvature
of the objective, and the curvature of the Lagrangian is zero in the feasible
directions. This point is not a minimum either.

5.3.2 Inequality Constraints


We can reuse some of the concepts from the equality constrained
optimality conditions for inequality constrained problems. Recall that
an inequality constraint 𝑗 is feasible when 𝑔 𝑗 (𝑥 ∗ ) ≤ 0 and it is said to be 𝑓
active if 𝑔 𝑗 (𝑥 ∗ ) = 0 and inactive if 𝑔 𝑗 (𝑥 ∗ ) < 0. ∇𝑓

As before, if 𝑥 ∗ is an optimum, any small enough feasible step 𝑝


from the optimum must result in a function increase. Based on the
Taylor series expansion (Eq. 5.3), we get the condition 𝑥

∇ 𝑓 (𝑥 ∗ )| 𝑝 ≥ 0 , (5.16)
∇ 𝑓 |𝑝 < 0
which is the same as for the equality constrained case. We use the Descent directions
arc in Fig. 5.15 to show the descent directions, which are in the open
Fig. 5.15 The descent directions are
half-space defined by the hyperplane tangent to the gradient of the
in the open half-space defined by the
objective. objective function gradient.
To consider inequality constraints, we use the same linearization as
∇𝑔 | 𝑝 ≥ 0
the equality constraints (Eq. 5.6), but now we enforce an inequality to Infeasible
get directions ∇𝑔

𝑔 𝑗 (𝑥 + 𝑝) ≈ 𝑔 𝑗 (𝑥) + ∇𝑔 𝑗 (𝑥)| 𝑝 ≤ 0, 𝑗 = 1, . . . , 𝑛 𝑔 . (5.17)

For a given candidate point that satisfies all constraints, there are 𝑥
Feasible
two possibilities to consider for each inequality constraint: whether
directions
the constraint is inactive (𝑔 𝑗 (𝑥) < 0) or active (𝑔 𝑗 (𝑥) = 0). If a given
𝑔=0
constraint is inactive, we do not need to add any condition for it because
we can take a step 𝑝 in any direction and remain feasible as long as Fig. 5.16 The feasible directions for
the step is small enough. Thus, we only need to consider the active each constraint are in the closed half-
constraints for the optimality conditions. space defined by the inequality con-
straint gradient.
For the equality constraint, we found that all first-order feasible
directions are in the nullspace of the Jacobian matrix. Inequality
|
𝐽 𝑔 𝜎, 𝜎 ≥ 0 ∇𝑔2
directions
constraints are not as restrictive. From Eq. 5.17, if constraint 𝑗 is
active (𝑔 𝑗 (𝑥) = 0), then the nearby point 𝑔 𝑗 (𝑥 + 𝑝) is only feasible if ∇𝑔1

∇𝑔 𝑗 (𝑥)| 𝑝 ≤ 0 for all constraints 𝑗 that are active. In matrix form, we can
write 𝐽 𝑔 (𝑥)𝑝 ≤ 0, where the Jacobian matrix includes only the gradients
𝑥
of the active constraints. Thus, the feasible directions for inequality
Feasible
constraint 𝑗 can be any direction in the closed half-space, corresponding directions
to all directions 𝑝 such that 𝑝 | 𝑔 𝑗 ≤ 0, as shown in Fig. 5.16. In this
figure, the arc shows the infeasible directions.
The set of feasible directions that satisfies all active constraints is Fig. 5.17 Excluding the infeasible di-
rections with respect to each con-
the intersection of all the closed half-spaces defined by the inequality
straint (red arcs) leaves the cone of
constraints, that is, all 𝑝 such that 𝐽 𝑔 (𝑥)𝑝 ≤ 0. This intersection of the feasible directions (blue), which is
feasible directions forms a polyhedral cone, as illustrated in Fig. 5.17 the polar cone of the active constraint
for a two-dimensional case with two constraints. To find the cone of gradients cone (gray).

feasible directions, let us first consider the cone formed by the active
inequality constraint gradients (shown in gray in Fig. 5.17). This cone
is defined by all vectors 𝑑 such that

𝑑 = 𝐽𝑔^T 𝜎 = Σ_{𝑗=1}^{𝑛𝑔} 𝜎𝑗 ∇𝑔𝑗 ,   where 𝜎𝑗 ≥ 0 .        (5.18)

A direction 𝑝 is feasible if 𝑝 | 𝑑 ≤ 0 for all 𝑑 in the cone. The set of all


feasible directions forms the polar cone of the cone defined by Eq. 5.18
and is shown in blue in Fig. 5.17.
Now that we have established some intuition about the feasible
directions, we need to establish under which conditions there is no
feasible descent direction (i.e., we have reached an optimum). In other
words, when is there no intersection between the cone of feasible
directions and the open half-space of descent directions? To answer
this question, we can use Farkas’ lemma. This lemma states that given
a rectangular matrix (𝐽 𝑔 in our case) and a vector with the same size
as the rows of the matrix (∇ 𝑓 in our case), one (and only one) of two
possibilities occurs:§ § Farkas’ lemma has other applications be-

yond optimization and can be written in


1. There exists a 𝑝 such that 𝐽 𝑔 𝑝 ≤ 0 and ∇ 𝑓 | 𝑝 < 0. This means that various equivalent forms. Using the state-
ment by Dax,88 we set 𝐴 = 𝐽 𝑔 , 𝑥 = −𝑝 ,
there is a descent direction that is feasible (Fig. 5.18, left). 𝑐 = −∇ 𝑓 , and 𝑦 = 𝜎 .
2. There exists a 𝜎 such that 𝐽 𝑔 𝜎 = −∇ 𝑓 with 𝜎 ≥ 0 (Fig. 5.18,
|
88. Dax, Classroom note: An elementary
proof of Farkas’ lemma, 1997.
right). This corresponds to optimality because it excludes the first
possibility.

1. A feasible descent direction exists, so the point is not an optimum (left).
2. No feasible descent direction exists, so the point is an optimum (right).

Fig. 5.18 Two possibilities involving active inequality constraints.

The second possibility yields the following optimality criterion for


inequality constraints:

∇ 𝑓 + 𝐽 𝑔 (𝑥)| 𝜎 = 0 , with 𝜎 ≥ 0. (5.19)



Comparing with the corresponding criteria for equality constraints


(Eq. 5.13), we see a similar form. However, 𝜎 corresponds to the
Lagrange multipliers for the inequality constraints and carries the
additional restriction that 𝜎 ≥ 0.
If equality constraints are present, the conditions for the inequality
constraints apply only in the subspace of the directions feasible with
respect to the equality constraints.
Similar to the equality constrained case, we can construct a La-
grangian function whose stationary points are candidates for optimal
points. We need to include all inequality constraints in the optimality
conditions because we do not know in advance which constraints are
active. To represent inequality constraints in the Lagrangian, we replace
them with the equality constraints defined by

𝑔 𝑗 + 𝑠 2𝑗 = 0, 𝑗 = 1, . . . , 𝑛 𝑔 , (5.20)

where 𝑠 𝑗 is a new unknown associated with each inequality constraint


called a slack variable. The slack variable is squared to ensure it is
nonnegative In that way, Eq. 5.20 can only be satisfied when 𝑔 𝑗 is
feasible (𝑔 𝑗 ≤ 0). The significance of the slack variable is that when
𝑠 𝑗 = 0, the corresponding inequality constraint is active (𝑔 𝑗 = 0), and
when 𝑠 𝑗 ≠ 0, the corresponding constraint is inactive.
The Lagrangian including both equality and inequality constraints
is then

ℒ(𝑥, 𝜆, 𝜎, 𝑠) = 𝑓(𝑥) + 𝜆^T ℎ(𝑥) + 𝜎^T (𝑔(𝑥) + 𝑠 ⊙ 𝑠) ,        (5.21)

where 𝜎 represents the Lagrange multipliers associated with the in-
equality constraints. Here, we use ⊙ to represent the element-wise
multiplication of 𝑠.¶

¶ This is a special case of the Hadamard product of two matrices.


Similar to the equality constrained case, we seek a stationary point for
the Lagrangian, but now we have additional unknowns: the inequality
Lagrange multipliers and the slack variables. Taking partial derivatives
of the Lagrangian with respect to each set of unknowns and setting
those derivatives to zero yields the first-order optimality conditions:

𝜕ℒ 𝜕𝑓 Õ
𝑛ℎ
𝜕ℎ 𝑙 Õ 𝜕𝑔 𝑗
𝑛𝑔
∇𝑥 ℒ = 0 ⇒ = + 𝜆𝑙 + 𝜎𝑗 =0
𝜕𝑥 𝑖 𝜕𝑥 𝑖 𝜕𝑥 𝑖 𝜕𝑥 𝑖
𝑙=1 𝑗=1

𝑖 = 1, . . . , 𝑛 𝑥 . (5.22)

This criterion is the same as before but with additional Lagrange


multipliers and constraints. Taking the derivatives with respect to the

equality Lagrange multipliers, we have

𝜕ℒ
∇𝜆 ℒ = 0 ⇒ = ℎ 𝑙 = 0, 𝑙 = 1, . . . , 𝑛 ℎ , (5.23)
𝜕𝜆 𝑙
which enforces the equality constraints as before. Taking derivatives
with respect to the inequality Lagrange multipliers, we get

𝜕ℒ
∇𝜎 ℒ = 0 ⇒ = 𝑔 𝑗 + 𝑠 2𝑗 = 0 𝑗 = 1, . . . , 𝑛 𝑔 , (5.24)
𝜕𝜎 𝑗

which enforces the inequality constraints. Finally, differentiating the


Lagrangian with respect to the slack variables, we obtain

𝜕ℒ
∇𝑠 ℒ = 0 ⇒ = 2𝜎 𝑗 𝑠 𝑗 = 0, 𝑗 = 1, . . . , 𝑛 𝑔 , (5.25)
𝜕𝑠 𝑗

which is called the complementary slackness condition. This condition


helps us to distinguish the active constraints from the inactive ones.
For each inequality constraint, either the Lagrange multiplier is zero
(which means that the constraint is inactive), or the slack variable
is zero (which means that the constraint is active). Unfortunately,
the complementary slackness condition introduces a combinatorial
problem. The complexity of this problem grows exponentially with
the number of inequality constraints because the number of possible
combinations of active versus inactive constraints is 2𝑛 𝑔 .
In addition to the conditions for a stationary point of the Lagrangian
(Eqs. 5.22 to 5.25), recall that we require the Lagrange multipliers for
the active constraints to be nonnegative. Putting all these conditions to-
gether in matrix form, the first-order constrained optimality conditions
are as follows:

∇𝑓 + 𝐽ℎ^T 𝜆 + 𝐽𝑔^T 𝜎 = 0
ℎ = 0
𝑔 + 𝑠 ⊙ 𝑠 = 0                                  (5.26)
𝜎 ⊙ 𝑠 = 0
𝜎 ≥ 0 .

These are called the Karush–Kuhn–Tucker (KKT) conditions. The equality


and inequality constraints are sometimes lumped together using a single
Jacobian matrix (and single Lagrange multiplier vector). This can be
convenient because the expression for the Lagrangian follows the same
form for both cases.
As in the equality constrained case, these first-order conditions are
necessary but not sufficient. The second-order sufficient conditions

require that the Hessian of the Lagrangian must be positive definite in


all feasible directions, that is,

𝑝 | 𝐻ℒ 𝑝 > 0 for all 𝑝 such that:


𝐽ℎ 𝑝 = 0 (5.27)
𝐽𝑔 𝑝 ≤ 0 for the active constraints.

In other words, we only require positive definiteness in the intersection


of the nullspace of the equality constraint Jacobian with the feasibility
cone of the active inequality constraints.
Similar to the equality constrained case, the KKT conditions (Eq. 5.26)
only apply when a point is regular, that is, when it satisfies linear inde-
pendence constraint qualification. However, the linear independence
applies only to the gradients of the inequality constraints that are active
and the equality constraint gradients.
Suppose we have the two constraints shown in the left pane of
Fig. 5.19. For the given objective function contours, point 𝑥 ∗ is a
minimum. At 𝑥 ∗ , the gradients of the two constraints are linearly
independent, and 𝑥 ∗ is thus a regular point. Therefore, we can apply
the KKT conditions at this point.

𝑥∗ is regular        𝑥∗ is not regular        𝑥∗ is not regular

Fig. 5.19 The KKT conditions apply only to regular points. A point 𝑥∗ is
regular when the gradients of the constraints are linearly independent. The
middle and right panes illustrate cases where 𝑥∗ is a constrained minimum
but not a regular point.

The middle and right panes of Fig. 5.19 illustrate cases where 𝑥∗
is also a constrained minimum. However, 𝑥∗ is not a regular point in
either case because the gradients of the two constraints are not linearly
independent. This means that the gradient of the objective cannot be
expressed as a unique linear combination of the constraints. Therefore,
we cannot use the KKT conditions, even though 𝑥∗ is a minimum.
The problem would be ill-conditioned, and the numerical methods
described in this chapter would run into numerical difficulties. Similar
to the equality constrained case, this situation is uncommon in practice.

Example 5.4 Problem with one inequality constraint

Consider a variation of the problem in Ex. 5.2 where the equality is replaced
by an inequality, as follows:

minimize    𝑓(𝑥1, 𝑥2) = 𝑥1 + 2𝑥2
subject to  𝑔(𝑥1, 𝑥2) = (1/4)𝑥1^2 + 𝑥2^2 − 1 ≤ 0 .

The Lagrangian for this problem is

ℒ(𝑥1, 𝑥2, 𝜎, 𝑠) = 𝑥1 + 2𝑥2 + 𝜎 ((1/4)𝑥1^2 + 𝑥2^2 − 1 + 𝑠^2) .

The objective function and feasible region are shown in Fig. 5.20.

Fig. 5.20 Inequality constrained problem with linear objective and feasible
space within an ellipse.

Differentiating the Lagrangian with respect to all the variables, we get the
first-order optimality conditions

∂ℒ/∂𝑥1 = 1 + (1/2)𝜎𝑥1 = 0
∂ℒ/∂𝑥2 = 2 + 2𝜎𝑥2 = 0
∂ℒ/∂𝜎 = (1/4)𝑥1^2 + 𝑥2^2 − 1 + 𝑠^2 = 0
∂ℒ/∂𝑠 = 2𝜎𝑠 = 0 .
There are two possibilities in the last (complementary slackness) condition:
𝑠 = 0 (meaning the constraint is active) and 𝜎 = 0 (meaning the constraint is
not active). However, we can see that setting 𝜎 = 0 in either of the two first
equations does not yield a solution. Assuming that 𝑠 = 0 and 𝜎 ≠ 0, we can
solve the equations to obtain:

𝑥𝐴 = [𝑥1, 𝑥2, 𝜎] = [−√2, −√2/2, √2] ,    𝑥𝐵 = [𝑥1, 𝑥2, 𝜎] = [√2, √2/2, −√2] .

These are the same critical points as in the equality constrained case of Ex. 5.2,
as shown in Fig. 5.20. However, now the sign of the Lagrange multiplier is
significant.
According to the KKT conditions, the Lagrange multiplier has to be nonneg-
ative. Point 𝑥 𝐴 satisfies this condition. As a result, there is no feasible descent
direction at 𝑥 𝐴 , as shown in Fig. 5.21 (left). The Hessian of the Lagrangian at
this point is the same as in Ex. 5.2, which we have already shown to be positive
definite. Therefore, 𝑥 𝐴 is a minimum.

∇𝑓 Infeasible ∇𝑓
directions Fig. 5.21 At the minimum (left), the
∇𝑔 Lagrange multiplier is positive, and
there is no feasible descent direction.
𝑥∗ 𝑥𝐵 At the critical point 𝑥 𝐵 (right), the
Descent
directions Feasible Lagrange multiplier is negative, and
descent all descent directions are feasible, so
Infeasible directions this point is not a minimum.
∇𝑔 directions

Unlike the equality constrained problem, we do not need to check the Hes-
sian at point 𝑥 𝐵 because the Lagrange multiplier is negative. As a consequence,
there are feasible descent directions, as shown in Fig. 5.21 (right). Therefore,
𝑥 𝐵 is not a minimum.
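As a quick numerical cross-check of this example, the problem can be handed to
a general-purpose constrained optimizer. A minimal sketch using SciPy (one
possible choice of solver and settings):

import numpy as np
from scipy.optimize import minimize, NonlinearConstraint

f = lambda x: x[0] + 2.0 * x[1]
g = lambda x: 0.25 * x[0]**2 + x[1]**2 - 1.0          # g(x) <= 0

con = NonlinearConstraint(g, -np.inf, 0.0)
res = minimize(f, np.zeros(2), method="trust-constr", constraints=[con])
print(res.x)   # approx (-1.414, -0.707), i.e., x_A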

Example 5.5 Simple problem with two inequality constraints

Consider a variation of Ex. 5.4 where we add one more inequality constraint,
as follows:

minimize    𝑓(𝑥1, 𝑥2) = 𝑥1 + 2𝑥2
subject to  𝑔1(𝑥1, 𝑥2) = (1/4)𝑥1^2 + 𝑥2^2 − 1 ≤ 0
            𝑔2(𝑥2) = −𝑥2 ≤ 0 .

The feasible region is the top half of the ellipse, as shown in Fig. 5.22.
The Lagrangian for this problem is

ℒ(𝑥, 𝜎, 𝑠) = 𝑥1 + 2𝑥2 + 𝜎1 ((1/4)𝑥1^2 + 𝑥2^2 − 1 + 𝑠1^2) + 𝜎2 (−𝑥2 + 𝑠2^2) .

Differentiating the Lagrangian with respect to all the variables, we get the
first-order optimality conditions,

𝜕ℒ 1
= 1 + 𝜎1 𝑥1 = 0
𝜕𝑥1 2
𝜕ℒ
= 2 + 2𝜎1 𝑥2 − 𝜎2 = 0
𝜕𝑥2
𝜕ℒ 1
= 𝑥12 + 𝑥 22 − 1 + 𝑠 12 = 0
𝜕𝜎1 4

𝜕ℒ
= −𝑥 2 + 𝑠 22 = 0
𝜕𝜎2
𝜕ℒ
= 2𝜎1 𝑠 1 = 0
𝜕𝑠 1
𝜕ℒ
= 2𝜎2 𝑠 2 = 0 .
𝜕𝑠 2
We now have two complementary slackness conditions, which yield the four
potential combinations listed in Table 5.1.

Assumption           Meaning                  𝑥1    𝑥2     𝜎1    𝜎2   𝑠1   𝑠2        Point
𝑠1 = 0, 𝑠2 = 0       𝑔1 and 𝑔2 active         −2    0      1     2    0    0         𝑥∗
                                              2     0      −1    2    0    0         𝑥𝐶
𝜎1 = 0, 𝜎2 = 0       𝑔1 and 𝑔2 inactive       –     –      –     –    –    –         –
𝑠1 = 0, 𝜎2 = 0       𝑔1 active, 𝑔2 inactive   √2    √2/2   −√2   0    0    2^(−1/4)  𝑥𝐵
𝜎1 = 0, 𝑠2 = 0       𝑔1 inactive, 𝑔2 active   –     –      –     –    –    –         –

Table 5.1 Two inequality constraints yield four potential combinations.

Fig. 5.22 Only one point satisfies the first-order KKT conditions.

Fig. 5.23 At the minimum (left), the intersection of the feasible directions and
descent directions is null, so there is no feasible descent direction. At this
point (right), there is a cone of descent directions that is also feasible, so it is
not a minimum.

Assuming that both constraints are active yields two possible solutions (𝑥 ∗
and 𝑥 𝐶 ) corresponding to two different Lagrange multipliers. According to the
KKT conditions, the Lagrange multipliers for all active inequality constraints
have to be positive, so only the solution with 𝜎1 = 1 (𝑥 ∗ ) is a candidate for a
minimum. This point corresponds to 𝑥 ∗ in Fig. 5.22. As shown in Fig. 5.23 (left),
there are no feasible descent directions starting from 𝑥 ∗ . The Hessian of the
Lagrangian at 𝑥 ∗ is identical to the previous example and is positive definite
when 𝜎1 is positive. Therefore, 𝑥 ∗ is a minimum.
The other solution for which both constraints are active is point 𝑥 𝐶 in
Fig. 5.22. As shown in Fig. 5.23 (right), there is a cone of feasible descent
directions, and therefore 𝑥 𝐶 is not a minimum.
Assuming that neither constraint is active yields 1 = 0 for the first optimality
condition, so this situation is not possible. Assuming that 𝑔1 is active yields
the solution corresponding to the maximum that we already found in Ex. 5.4,
𝑥 𝐵 . Finally, assuming that only 𝑔2 is active yields no candidate point.

Although these examples can be solved analytically, they are the


exception rather than the rule. The KKT conditions quickly become
challenging to solve analytically (try solving Ex. 5.1), and as the number
of constraints increases, trying all combinations of active and inactive
constraints becomes intractable. Furthermore, engineering problems
usually involve functions defined by models with implicit equations,
which are impossible to solve analytically. The reason we include
these analytic examples is to gain a better understanding of the KKT
conditions. For the rest of the chapter, we focus on numerical methods,
which are necessary for the vast majority of practical problems.

5.3.3 Meaning of the Lagrange Multipliers


The Lagrange multipliers quantify how much the corresponding con-
straints drive the design. More specifically, a Lagrange multiplier
quantifies the sensitivity of the optimal objective function value 𝑓 (𝑥 ∗ )
to a variation in the value of the corresponding constraint. Here we
explain why that is the case. We discuss only inequality constraints,
but the same analysis applies to equality constraints.
When a constraint is inactive, the corresponding Lagrange multiplier
is zero. This indicates that changing the value of an inactive constraint
does not affect the optimum, as expected. This is only valid to the
first order because the KKT conditions are based on the linearization
of the objective and constraint functions. Because small changes are
assumed in the linearization, we do not consider the case where an
inactive constraint becomes active after perturbation.

Now let us examine the active constraints. Suppose that we want to


quantify the effect of a change in an active (or equality) constraint 𝑔𝑖 on
the optimal objective function value.‖ The differential of 𝑔𝑖 is given by ‖
As an example, we could change the value
of the allowable stress constraint in the
the following dot product: structural optimization problem of Ex. 3.9.
𝜕𝑔𝑖
d𝑔𝑖 = d𝑥 . (5.28)
𝜕𝑥
For all the other constraints 𝑗 that remain unperturbed, which means
that
𝜕𝑔 𝑗
d𝑥 = 0 for all 𝑗 ≠ 𝑖 . (5.29)
𝜕𝑥
This equation states that any movement d𝑥 must be in the nullspace
of the remaining constraints to remain feasible with respect to those
constraints.∗∗ An example with two constraints is illustrated in Fig. 5.24,
where 𝑔1 is perturbed and 𝑔2 remains fixed. The objective and constraint ∗∗ Thiscondition is similar to Eq. 5.7, but
here we apply it to all equality and active
functions are linearized because we are considering first-order changes constraints except for constraint 𝑖 .
represented by the differentials.
From the KKT conditions (Eq. 5.22), we know that at the optimum,
0
𝜕𝑓 𝜕𝑔 𝜕𝑔1
d𝑔
1
=

0
= −𝜎 | . (5.30) 𝜕𝑥 𝑔1
+ 𝑔1
𝜕𝑥 𝜕𝑥
Using this condition, we can write the differential of the objective, 𝜕𝑓
𝑥 ∗ + d𝑥
d 𝑓 = (𝜕 𝑓 /𝜕𝑥) d𝑥, as 𝑥∗ 𝜕𝑥
𝑔2 ≤
𝜕𝑔 0
d 𝑓 = −𝜎| d𝑥 . (5.31) 𝜕𝑔2
𝜕𝑥 𝜕𝑥
According to Eqs. 5.28 and 5.29, the product with d𝑥 is only nonzero
for the perturbed constraint 𝑖 and therefore, Fig. 5.24 Lagrange multipliers can be
interpreted as the change in the op-
𝜕𝑔𝑖 timal objective due a perturbation in
d 𝑓 = −𝜎𝑖 d𝑥 = −𝜎𝑖 d𝑔𝑖 . (5.32) the corresponding constraint. In this
𝜕𝑥
case, we show the effect of perturbing
This leads to the derivative of the optimal 𝑓 with respect to a change in 𝑔1 .
the value of constraint 𝑖:
d𝑓
𝜎𝑖 = − . (5.33)
d𝑔𝑖
Thus, the Lagrange multipliers can predict how much improvement
can be expected if a given constraint is relaxed. For inequality con-
straints, because the Lagrange multipliers are positive at an optimum,
this equation correctly predicts a decrease in the objective function
value when the constraint value is increased.
The derivative defined in Eq. 5.33 has practical value because it tells
us how much a given constraint drives the design. In this interpretation
of the Lagrange multipliers, we need to consider the scaling of the
problem and the units. Still, for similar quantities, they quantify the
relative importance of the constraints.
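This interpretation is easy to verify numerically: re-optimize with a slightly
relaxed constraint and compare the change in the optimal objective with the
prediction of Eq. 5.33. A minimal sketch that reuses the problem from Ex. 5.4
(SciPy is an assumed choice of optimizer):

import numpy as np
from scipy.optimize import minimize, NonlinearConstraint

def optimal_f(relaxation):
    f = lambda x: x[0] + 2.0 * x[1]
    g = lambda x: 0.25 * x[0]**2 + x[1]**2 - 1.0 - relaxation   # relaxed g <= 0
    con = NonlinearConstraint(g, -np.inf, 0.0)
    return minimize(f, np.zeros(2), method="trust-constr",
                    constraints=[con]).fun

dg = 1e-3
df = optimal_f(dg) - optimal_f(0.0)
print(df / dg)        # approx -sigma = -sqrt(2) for this problem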

5.3.4 Post-Optimality Sensitivities


It is sometimes helpful to find sensitivities of the optimal objective func-
tion value with respect to a parameter held fixed during optimization.
Suppose that we have found the optimum for a constrained problem.
Say we have a scalar parameter 𝜌 held fixed in the optimization, but
now want to quantify the effect of a perturbation in that parameter on
the optimal objective value. Perturbing 𝜌 changes the objective and
the constraint functions, so the optimum point moves, as illustrated in
Fig. 5.25. For our current purposes, we use 𝑔 to represent either active
inequality or equality constraints. We assume that the set of active
constraints does not change with a perturbation in 𝜌 like we did when
perturbing the constraint in Section 5.3.3.
The objective function is affected by 𝜌 through a change in 𝑓 itself and 𝑔1𝜌
a change induced by the movement of the constraints. This dependence
𝑥 𝜌∗ 𝑔1
can be written in the total differential form as:
𝜕𝑓 𝜕 𝑓 𝜕𝑔
d𝑓 = d𝜌 + d𝜌 . (5.34) 𝑥∗
𝑔2
𝜌
𝜕𝜌 𝜕𝑔 𝜕𝜌 𝑔2
𝑓𝜌 𝑓

The derivative 𝜕 𝑓 /𝜕𝑔 corresponds to the derivative of the optimal value


of the objective with respect to a perturbation in the constraint, which Fig. 5.25 Post-optimality sensitivities
quantify the change in the optimal
according to Eq. 5.33, is the negative of the Lagrange multipliers. This
objective due to a perturbation of a
means that the post-optimality derivative is parameter that was originally fixed
in the optimization. The optimal ob-
d𝑓 𝜕𝑓 𝜕𝑔 jective value changes due to changes
= − 𝜎| , (5.35) in the optimum point (which moves
d𝜌 𝜕𝜌 𝜕𝜌
to 𝑥 𝜌∗ ) and objective function (which
becomes 𝑓𝜌 .)
where the partial derivatives with respect to 𝜌 can be computed without
re-optimizing.

5.4 Penalty Methods

The concept behind penalty methods is intuitive: to transform a con-
strained problem into an unconstrained one by adding a penalty to
the objective function when constraints are violated or close to being
violated. As mentioned in the introduction to this chapter, penalty
methods are no longer used directly in gradient-based optimization
algorithms because they have difficulty converging to the true solu-
tion. However, these methods are still valuable because (1) they are
simple and thus ease the transition into understanding constrained
optimization; (2) they are useful in some constrained gradient-free
methods (Chapter 7); (3) they can be used as merit functions in line
search algorithms, as discussed in Section 5.5.3; and (4) penalty concepts
are used in interior-point methods, as discussed in Section 5.6. The
penalized function can be written as

𝑓ˆ(𝑥) = 𝑓 (𝑥) + 𝜇𝜋(𝑥) , (5.36)

where 𝜋(𝑥) is a penalty function, and the scalar 𝜇 is a penalty parameter.


This is similar in form to the Lagrangian, but one difference is that 𝜇 is
fixed instead of being a variable.
We can use the unconstrained optimization techniques to minimize
ˆ𝑓 (𝑥). However, instead of just solving a single optimization problem,
penalty methods usually solve a sequence of problems with different
values of 𝜇 to get closer to the actual constrained minimum. We will
see shortly why we need to solve a sequence of problems rather than
just one problem.
Various forms for 𝜋(𝑥) can be used, leading to different penalty
methods. There are two main types of penalty functions: exterior
penalties, which impose a penalty only when constraints are violated,
and interior penalty functions, which impose a penalty that increases as
a constraint is approached.
Figure 5.26 shows both interior and exterior penalties for a two-
dimensional function. The exterior penalty leads to slightly infeasible
solutions, whereas an interior penalty leads to a feasible solution but
underpredicts the objective.

5.4.1 Exterior Penalty Methods


Of the many possible exterior penalty methods, we focus on two of
the most popular ones: quadratic penalties and the augmented La-
grangian method. Quadratic penalties are continuously differentiable
and straightforward to implement, but they suffer from numerical
ill-conditioning. The augmented Lagrangian method is more sophisti-
cated; it is based on the quadratic penalty but adds terms that improve
the numerical properties. Many other penalties are possible, such as
1-norms, which are often used when continuous differentiability is
unnecessary.

Quadratic Penalty Method

For equality constrained problems, the quadratic penalty method takes
the form

    𝑓ˆ(𝑥; 𝜇) = 𝑓(𝑥) + (𝜇/2) Σ𝑖 ℎ𝑖(𝑥)² ,        (5.37)

where the semicolon denotes that 𝜇 is a fixed parameter. The motivation


for a quadratic penalty is that it is simple and results in a function that
is continuously differentiable. The factor of one half is unnecessary but
is included by convention because it eliminates the extra factor of two
when taking derivatives. The penalty is nonzero unless the constraints
are satisfied (ℎ𝑖 = 0), as desired.

[Fig. 5.26: Interior penalties tend to infinity as the constraint is approached
from the feasible side of the constraint (left), whereas exterior penalty
functions activate when the points are not feasible (right). The minimum for
both approaches is different from the true constrained minimum.]

[Fig. 5.27: Quadratic penalty for an equality constrained problem. The
minimum of the penalized function (black dots) approaches the true
constrained minimum (blue circle) as the penalty parameter 𝜇 increases.]

The value of the penalty parameter 𝜇 must be chosen carefully.


Mathematically, we recover the exact solution to the constrained prob-
lem only as 𝜇 tends to infinity (see Fig. 5.27). However, starting with a
large value for 𝜇 is not practical. This is because the larger the value of
𝜇, the larger the Hessian condition number, which corresponds to the


curvature varying greatly with direction (see Ex. 4.10). This behavior
makes the problem difficult to solve numerically.
To solve the problem more effectively, we begin with a small value of
𝜇 and solve the unconstrained problem. We then increase 𝜇 and solve
the new unconstrained problem, using the previous solution as the
starting point. We repeat this process until the optimality conditions (or
some other approximate convergence criteria) are satisfied, as outlined
in Alg. 5.1. By gradually increasing 𝜇 and reusing the solution from
the previous problem, we avoid some of the ill-conditioning issues.
Thus, the original constrained problem is transformed into a sequence
of unconstrained optimization problems.

Algorithm 5.1 Exterior penalty method

Inputs:
𝑥 0 : Starting point
𝜇0 > 0: Initial penalty parameter
𝜌 > 1: Penalty increase factor (𝜌 ∼ 1.2 is conservative, 𝜌 ∼ 10 is aggressive)
Outputs:
𝑥 ∗ : Optimal point
𝑓 (𝑥 ∗ ): Corresponding function value

𝑘=0
while not converged do
𝑥 ∗𝑘 ← minimize 𝑓ˆ(𝑥 𝑘 ; 𝜇 𝑘 )
𝑥𝑘
𝜇 𝑘+1 = 𝜌𝜇 𝑘 Increase penalty
𝑥 𝑘+1 = 𝑥 ∗𝑘 Update starting point for next optimization
𝑘 = 𝑘+1
end while
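For illustration, the loop in Alg. 5.1 can be sketched in a few lines of
Python using SciPy's unconstrained minimizer for the inner problem. This is
a minimal sketch, assuming the quadratic penalty of Eq. 5.37 and a simple
feasibility-based stopping test; the problem data at the bottom match Ex. 5.6,
and the starting point and parameter values are arbitrary.

import numpy as np
from scipy.optimize import minimize

def exterior_penalty(f, h, x0, mu0=1.0, rho=2.0, tol=1e-6, max_outer=30):
    # Alg. 5.1 with the quadratic penalty of Eq. 5.37 (equality constraints only)
    x, mu = np.asarray(x0, dtype=float), mu0
    for _ in range(max_outer):
        fhat = lambda x: f(x) + 0.5 * mu * np.sum(np.asarray(h(x)) ** 2)
        x = minimize(fhat, x).x          # inner unconstrained solve, warm-started
        if np.max(np.abs(h(x))) < tol:   # approximate feasibility check
            break
        mu *= rho                        # increase penalty parameter
    return x

# Problem from Ex. 5.6: f = x1 + 2 x2,  h = x1^2/4 + x2^2 - 1
f = lambda x: x[0] + 2.0 * x[1]
h = lambda x: np.array([0.25 * x[0] ** 2 + x[1] ** 2 - 1.0])
print(exterior_penalty(f, h, x0=[2.0, 1.0]))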

There are three potential issues with the approach outlined in


Alg. 5.1. Suppose the starting value for 𝜇 is too low. In that case, the
penalty might not be enough to overcome a function that is unbounded
from below, and the penalized function has no minimum.
The second issue is that we cannot practically approach 𝜇 → ∞.
Hence, the solution to the problem is always slightly infeasible. By
comparing the optimality condition of the constrained problem,

    ∇𝑥ℒ = ∇𝑓 + 𝐽ℎᵀ𝜆 = 0 ,        (5.38)

and the optimality condition of the penalized function,

    ∇𝑥𝑓ˆ = ∇𝑓 + 𝜇𝐽ℎᵀℎ = 0 ,        (5.39)

we see that for each constraint 𝑗,

    ℎ𝑗 ≈ 𝜆∗𝑗 / 𝜇 .        (5.40)

Because ℎ 𝑗 = 0 at the optimum, 𝜇 must be large to satisfy the constraints.


The third issue has to do with the curvature of the penalized function,
which is directly proportional to 𝜇. The extra curvature is added in a
direction perpendicular to the constraints, making the Hessian of the
penalized function increasingly ill-conditioned as 𝜇 increases. Thus,
the need to increase 𝜇 to improve accuracy directly leads to a function
that is increasingly challenging to minimize.

Example 5.6 Quadratic penalty for equality constrained problem

[Fig. 5.28: The quadratic penalized function minimum approaches the
constrained minimum as the penalty parameter increases (panels for
𝜇 = 0.5, 𝜇 = 3.0, and 𝜇 = 10.0).]

Consider the equality constrained problem from Ex. 5.2. The penalized
function for that case is

    𝑓ˆ(𝑥; 𝜇) = 𝑥1 + 2𝑥2 + (𝜇/2) (𝑥1²/4 + 𝑥2² − 1)² .        (5.41)

Figure 5.28 shows this function for different values of the penalty parameter
𝜇. The penalty is active for all points that are infeasible, but the minimum of
the penalized function does not coincide with the constrained minimum of
the original problem. The penalty parameter needs to be increased for the
minimum of the penalized function to approach the correct solution, but this
results in a poorly conditioned function.

To show the impact of increasing 𝜇, we solve a sequence of problems starting
with a small value of 𝜇 and reusing the optimal point for one solution as the
starting point for the next. Figure 5.29 shows that large penalty values are
required for high accuracy. In this example, even using a penalty parameter of
𝜇 = 1,000 (which results in extremely skewed contours), the objective value
achieves only three digits of accuracy.

[Fig. 5.29: Error in optimal solution for increasing penalty parameter.]

The approach discussed so far handles only equality constraints,


but we can extend it to handle inequality constraints. Instead of adding
a penalty to both sides of the constraints, we add the penalty when the
inequality constraint is violated (i.e., when 𝑔 𝑗 (𝑥) > 0). This behavior
can be achieved by defining a new penalty function as

    𝑓ˆ(𝑥; 𝜇) = 𝑓(𝑥) + (𝜇/2) Σ𝑗=1..𝑛𝑔 max(0, 𝑔𝑗(𝑥))² .        (5.42)

The only difference relative to the equality constraint penalty shown


in Fig. 5.27 is that the penalty is removed on the feasible side of the
inequality constraint, as shown in Fig. 5.30.

[Fig. 5.30: Quadratic penalty for an inequality constrained problem. The
minimum of the penalized function approaches the constrained minimum
from the infeasible side.]

The inequality quadratic penalty can be used together with the


quadratic penalty for equality constraints if we need to handle both
types of constraints:

    𝑓ˆ(𝑥; 𝜇) = 𝑓(𝑥) + (𝜇ℎ/2) Σ𝑙=1..𝑛ℎ ℎ𝑙(𝑥)² + (𝜇𝑔/2) Σ𝑗=1..𝑛𝑔 max(0, 𝑔𝑗(𝑥))² .        (5.43)

The two penalty parameters can be incremented in lockstep or independently.
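For reference, here is a short sketch of the combined penalized function in
Eq. 5.43; the function handles f, h, and g and the two penalty parameters are
placeholders to be supplied by the user.

import numpy as np

def quadratic_penalty(f, h, g, mu_h, mu_g):
    # Returns fhat(x) implementing Eq. 5.43
    def fhat(x):
        h_term = 0.5 * mu_h * np.sum(np.asarray(h(x)) ** 2)
        g_viol = np.maximum(0.0, np.asarray(g(x)))   # only violated inequalities contribute
        g_term = 0.5 * mu_g * np.sum(g_viol ** 2)
        return f(x) + h_term + g_term
    return fhat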

Example 5.7 Quadratic penalty for inequality constrained problem

Consider the inequality constrained problem from Ex. 5.4. The penalized
function for that case is

    𝑓ˆ(𝑥; 𝜇) = 𝑥1 + 2𝑥2 + (𝜇/2) max(0, 𝑥1²/4 + 𝑥2² − 1)² .

This function is shown in Fig. 5.31 for different values of the penalty parameter
𝜇. The contours of the feasible region inside the ellipse coincide with the
original function contours. However, outside the feasible region, the contours
change to create a function whose minimum approaches the true constrained
minimum as the penalty parameter increases.

[Fig. 5.31: The quadratic penalized function minimum approaches the
constrained minimum from the infeasible side (panels for 𝜇 = 0.5, 𝜇 = 3.0,
and 𝜇 = 10.0).]
Tip 5.3 Scaling is also important for constrained problems

The considerations on scaling discussed in Tip 4.4 are just as crucial for
constrained problems. Similar to scaling the objective function, a good
rule of thumb is to normalize each constraint function so that it is of order
1. For constraints, a natural scale is typically already defined by the limits we
provide. For example, instead of

𝑔 𝑗 (𝑥) − 𝑔max 𝑗 ≤ 0 , (5.44)

we can reexpress a scaled version as

    𝑔𝑗(𝑥)/𝑔max 𝑗 − 1 ≤ 0 .        (5.45)

Augmented Lagrangian

As explained previously, the quadratic penalty method requires a


large value of 𝜇 for constraint satisfaction, but the large 𝜇 degrades
the numerical conditioning. The augmented Lagrangian method
helps alleviate this dilemma by adding the quadratic penalty to the
Lagrangian instead of just adding it to the function. The augmented
Lagrangian function for equality constraints is

    𝑓ˆ(𝑥; 𝜆, 𝜇) = 𝑓(𝑥) + Σ𝑗=1..𝑛ℎ 𝜆𝑗 ℎ𝑗(𝑥) + (𝜇/2) Σ𝑗=1..𝑛ℎ ℎ𝑗(𝑥)² .        (5.46)

To estimate the Lagrange multipliers, we can compare the optimality
conditions for the augmented Lagrangian,

    ∇𝑥𝑓ˆ(𝑥; 𝜆, 𝜇) = ∇𝑓(𝑥) + Σ𝑗=1..𝑛ℎ (𝜆𝑗 + 𝜇ℎ𝑗(𝑥)) ∇ℎ𝑗 = 0 ,        (5.47)

to those of the actual Lagrangian,

    ∇𝑥ℒ(𝑥∗, 𝜆∗) = ∇𝑓(𝑥∗) + Σ𝑗=1..𝑛ℎ 𝜆∗𝑗 ∇ℎ𝑗(𝑥∗) = 0 .        (5.48)

Comparing these two conditions suggests the approximation

𝜆∗𝑗 ≈ 𝜆 𝑗 + 𝜇ℎ 𝑗 . (5.49)

Therefore, we update the vector of Lagrange multipliers based on the


current estimate of the Lagrange multipliers and constraint values
using
𝜆 𝑘+1 = 𝜆 𝑘 + 𝜇 𝑘 ℎ(𝑥 𝑘 ) . (5.50)
The complete algorithm is shown in Alg. 5.2.
This approach is an improvement on the plain quadratic penalty
because updating the Lagrange multiplier estimates at each iteration
allows for more accurate solutions without increasing 𝜇 as much. The
augmented Lagrangian approximation for each constraint obtained
from Eq. 5.49 is
    ℎ𝑗 ≈ (1/𝜇)(𝜆∗𝑗 − 𝜆𝑗) .        (5.51)
The corresponding approximation in the quadratic penalty method is

    ℎ𝑗 ≈ 𝜆∗𝑗 / 𝜇 .        (5.52)

The quadratic penalty relies solely on increasing 𝜇 in the denominator to


drive the constraints to zero. However, the augmented Lagrangian also
controls the numerator through the Lagrange multiplier estimate. If the
estimate is reasonably close to the true Lagrange multiplier, then the
numerator becomes small for modest values of 𝜇. Thus, the augmented
Lagrangian can provide a good solution for 𝑥 ∗ while avoiding the
ill-conditioning issues of the quadratic penalty.

Algorithm 5.2 Augmented Lagrangian penalty method

Inputs:
𝑥 0 : Starting point
𝜆0 = 0: Initial Lagrange multiplier
𝜇0 > 0: Initial penalty parameter
𝜌 > 1: Penalty increase factor
Outputs:
𝑥 ∗ : Optimal point
𝑓 (𝑥 ∗ ): Corresponding function value

𝑘=0
while not converged do
𝑥 ∗𝑘 ← minimize 𝑓ˆ(𝑥 𝑘 ; 𝜆 𝑘 , 𝜇 𝑘 )
𝑥𝑘
𝜆 𝑘+1 = 𝜆 𝑘 + 𝜇 𝑘 ℎ(𝑥 𝑘 ) Update Lagrange multipliers
𝜇 𝑘+1 = 𝜌𝜇 𝑘 Increase penalty parameter
𝑥 𝑘+1 = 𝑥 ∗𝑘 Update starting point for next optimization
𝑘 = 𝑘+1
end while
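A minimal Python sketch of Alg. 5.2 for equality constraints, assuming
SciPy's unconstrained minimizer for the inner solves and a simple
feasibility-based stopping test (both are illustrative choices, not part of the
method itself):

import numpy as np
from scipy.optimize import minimize

def augmented_lagrangian(f, h, x0, mu0=1.0, rho=2.0, tol=1e-6, max_outer=30):
    # Alg. 5.2 with the augmented Lagrangian of Eq. 5.46
    x = np.asarray(x0, dtype=float)
    lam = np.zeros(len(np.atleast_1d(h(x))))
    mu = mu0
    for _ in range(max_outer):
        fhat = lambda x: (f(x) + lam @ np.atleast_1d(h(x))
                          + 0.5 * mu * np.sum(np.atleast_1d(h(x)) ** 2))
        x = minimize(fhat, x).x
        lam = lam + mu * np.atleast_1d(h(x))   # multiplier update, Eq. 5.50
        if np.max(np.abs(h(x))) < tol:
            break
        mu *= rho                              # increase penalty parameter
    return x, lam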

So far we have only discussed equality constraints, for which the definition
of the augmented Lagrangian is universal. Example 5.8 (below) includes
an inequality constraint by assuming it is active and treating it like
an equality, but this is not an approach that can be used in general.
Several formulations exist for handling inequality constraints using the
augmented Lagrangian approach.89–91 One well-known approach is
given by:92

    𝑓ˆ(𝑥; 𝜇) = 𝑓(𝑥) + 𝜆ᵀ 𝑔¯(𝑥) + (𝜇/2) ‖𝑔¯(𝑥)‖2² ,        (5.53)

where

    𝑔¯𝑗(𝑥) ≡  ℎ𝑗(𝑥)        for equality constraints
              𝑔𝑗(𝑥)        if 𝑔𝑗 ≥ −𝜆𝑗/𝜇        (5.54)
              −𝜆𝑗/𝜇        otherwise .

89. Gill et al., Some theoretical properties of an augmented Lagrangian merit function, 1986.
90. Di Pillo and Grippo, A new augmented Lagrangian function for inequality constraints in nonlinear programming problems, 1982.
91. Birgin et al., Numerical comparison of augmented Lagrangian algorithms for nonconvex problems, 2005.
92. Rockafellar, The multiplier method of Hestenes and Powell applied to convex programming, 1973.

Example 5.8 Augmented Lagrangian for inequality constrained problem

Consider the inequality constrained problem from Ex. 5.4. Assuming the
inequality constraint is active, the augmented Lagrangian (Eq. 5.46) is

    𝑓ˆ(𝑥; 𝜇) = 𝑥1 + 2𝑥2 + 𝜆 (𝑥1²/4 + 𝑥2² − 1) + (𝜇/2) (𝑥1²/4 + 𝑥2² − 1)² .

Applying Alg. 5.2, starting with 𝜇 = 0.5 and using 𝜌 = 1.1, we get the iterations
shown in Fig. 5.32.

[Fig. 5.32: Augmented Lagrangian applied to the inequality constrained problem
(panels for 𝑘 = 0, 𝜇 = 0.50, 𝜆 = 0.000; 𝑘 = 2, 𝜇 = 0.61, 𝜆 = 1.146; and
𝑘 = 9, 𝜇 = 1.18, 𝜆 = 1.413).]

Compared with the quadratic penalty in Ex. 5.7, the penalized function
is much better conditioned, thanks to the term associated with the Lagrange
multiplier. The minimum of the penalized function eventually becomes the
minimum of the constrained problem without a large penalty parameter.

As done in Ex. 5.6, we solve a sequence of problems starting with a small
value of 𝜇 and reusing the optimal point for one solution as the starting point
for the next. In this case, we update the Lagrange multiplier estimate between
optimizations as well. Figure 5.33 shows that only modest penalty parameters
are needed to achieve tight convergence to the true solution, a significant
improvement over the regular quadratic penalty.
[Fig. 5.33: Error in optimal solution as compared with the true solution as a
function of an increasing penalty parameter.]

5.4.2 Interior Penalty Methods

Interior penalty methods work the same way as exterior penalty
methods—they transform the constrained problem into a series of
unconstrained problems. The main difference with interior penalty
methods is that they always seek to maintain feasibility. Instead of
adding a penalty only when constraints are violated, they add a penalty
as the constraint is approached from the feasible region. This type of
penalty is particularly desirable if the objective function is ill-defined
outside the feasible region. These methods are called interior because
the iteration points remain in the interior of the feasible region. They
are also referred to as barrier methods because the penalty function acts
as a barrier preventing iterates from leaving the feasible region.

One possible interior penalty function to enforce 𝑔(𝑥) ≤ 0 is the
inverse barrier,

    𝜋(𝑥) = − Σ𝑗=1..𝑛𝑔 1/𝑔𝑗(𝑥) ,        (5.55)

where 𝜋(𝑥) → ∞ as 𝑔𝑗(𝑥) → 0⁻ (where the superscript "−" indicates a
left-sided limit). A more popular interior penalty function is the
logarithmic barrier,

    𝜋(𝑥) = − Σ𝑗=1..𝑛𝑔 ln(−𝑔𝑗(𝑥)) ,        (5.56)

which also approaches infinity as the constraint tends to zero from the
feasible side. The penalty function is then

    𝑓ˆ(𝑥; 𝜇) = 𝑓(𝑥) − 𝜇 Σ𝑗=1..𝑛𝑔 ln(−𝑔𝑗(𝑥)) .        (5.57)

These two penalty functions are illustrated in Fig. 5.34.

[Fig. 5.34: Two different interior penalty functions: inverse barrier and
logarithmic barrier.]


Neither of these penalty functions applies when 𝑔 > 0 because they
are designed to be evaluated only within the feasible space. Algorithms
based on these penalties must be prevented from evaluating infeasible
points.
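One simple way to enforce this in code is to have the penalized function of
Eq. 5.57 return an infinite value for infeasible points, so that a backtracking
line search rejects such steps automatically. This is a minimal sketch; the
function handles are placeholders.

import numpy as np

def log_barrier(f, g, mu):
    # Eq. 5.57; returns +inf outside the feasible region so infeasible steps are rejected
    def fhat(x):
        gx = np.asarray(g(x))
        if np.any(gx >= 0.0):
            return np.inf
        return f(x) - mu * np.sum(np.log(-gx))
    return fhat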
Like exterior penalty methods, interior penalty methods must also
solve a sequence of unconstrained problems but with 𝜇 → 0 (see
Alg. 5.3). As the penalty parameter decreases, the region across which
the penalty acts decreases, as shown in Fig. 5.35.

[Fig. 5.35: Logarithmic barrier penalty for an inequality constrained problem.
The minimum of the penalized function (black circles) approaches the true
constrained minimum (blue circle) as the penalty parameter 𝜇 decreases.]

The methodology is the same as described in Alg. 5.1 but with


a decreasing penalty parameter. One major weakness of the method
is that the penalty function is not defined for infeasible points, so a
feasible starting point must be provided. For some problems, providing
a feasible starting point may be difficult or practically impossible.
The optimization must be safeguarded to prevent the algorithm
from becoming infeasible when starting from a feasible point. This
can be achieved by checking the constraint values during the line
search and backtracking if any of them is greater than or equal to zero.
Multiple backtracking iterations might be required.

Algorithm 5.3 Interior penalty method

Inputs:
𝑥 0 : Starting point
𝜇0 > 0: Initial penalty parameter
𝜌 < 1: Penalty decrease factor
Outputs:
𝑥 ∗ : Optimal point
𝑓 (𝑥 ∗ ): Corresponding function value

𝑘=0
while not converged do
𝑥 ∗𝑘 ← minimize 𝑓ˆ(𝑥 𝑘 ; 𝜇 𝑘 )
𝑥𝑘
𝜇 𝑘+1 = 𝜌𝜇 𝑘 Decrease penalty parameter
𝑥 𝑘+1 = 𝑥 ∗𝑘 Update starting point for next optimization
𝑘 = 𝑘+1
end while

Example 5.9 Logarithmic penalty for inequality constrained problem

Consider the inequality constrained problem from Ex. 5.4. The penalized
function for that case using the logarithmic penalty (Eq. 5.57) is

    𝑓ˆ(𝑥; 𝜇) = 𝑥1 + 2𝑥2 − 𝜇 ln(−𝑥1²/4 − 𝑥2² + 1) .

Figure 5.36 shows this function for different values of the penalty parameter 𝜇.
The penalized function is defined only in the feasible space, so we do not plot
its contours outside the ellipse.

[Fig. 5.36: Logarithmic penalty for one inequality constraint. The minimum
of the penalized function approaches the constrained minimum from the
feasible side (panels for 𝜇 = 3.0, 𝜇 = 1.0, and 𝜇 = 0.2).]

Like exterior penalty methods, the Hessian for interior penalty
methods becomes increasingly ill-conditioned as the penalty parameter

tends to zero.93 There are augmented and modified barrier approaches
that can avoid the ill-conditioning issue (and other methods that remain
ill-conditioned but can still be solved reliably, albeit inefficiently).94
However, these methods have been superseded by the modern interior-point
methods discussed in Section 5.6, so we do not elaborate on further
improvements to classical penalty methods.

93. Murray, Analytical expressions for the eigenvalues and eigenvectors of the Hessian matrices of barrier and penalty functions, 1971.
94. Forsgren et al., Interior methods for nonlinear optimization, 2002.

5.5 Sequential Quadratic Programming

SQP is the first of the modern constrained optimization methods we


discuss. SQP is not a single algorithm; instead, it is a conceptual method
from which various specific algorithms are derived. We present the
basic method but mention only a few of the many details needed for
robust practical implementations. We begin with equality constrained
SQP and then add inequality constraints.

5.5.1 Equality Constrained SQP


To derive the SQP method, we start with the KKT conditions for this
problem and treat them as equation residuals that need to be solved.
Recall that the Lagrangian (Eq. 5.12) is

ℒ(𝑥, 𝜆) = 𝑓 (𝑥) + ℎ(𝑥)| 𝜆 . (5.58)

Differentiating this function with respect to the design variables and


Lagrange multipliers and setting the derivatives to zero, we get the
KKT conditions,

    𝑟 = [∇𝑥ℒ(𝑥, 𝜆);  ∇𝜆ℒ(𝑥, 𝜆)] = [∇𝑓(𝑥) + 𝐽ℎᵀ𝜆;  ℎ(𝑥)] = 0 .        (5.59)

Recall that to solve a system of equations 𝑟(𝑢) = 0 using Newton’s


method, we solve a sequence of linear systems,

𝐽𝑟 (𝑢 𝑘 ) 𝑝 𝑢 = −𝑟 (𝑢 𝑘 ) , (5.60)

where 𝐽𝑟 is the Jacobian of derivatives 𝜕𝑟/𝜕𝑢. The step in the variables


is 𝑝𝑢 = 𝑢𝑘+1 − 𝑢𝑘, where the variables are

    𝑢 ≡ [𝑥;  𝜆] .        (5.61)

Differentiating the vector of residuals (Eq. 5.59) with respect to the two
concatenated vectors in 𝑢 yields the following block linear system:

    [𝐻ℒ  𝐽ℎᵀ;  𝐽ℎ  0] [𝑝𝑥;  𝑝𝜆] = [−∇𝑥ℒ;  −ℎ] .        (5.62)

This is a linear system of 𝑛 𝑥 + 𝑛 ℎ equations where the Jacobian matrix is


square. The shape of the matrix and its blocks are as shown in Fig. 5.37.
We solve a sequence of these problems to converge to the optimal design
variables and the corresponding optimal Lagrange multipliers. At each
iteration, we update the design variables and Lagrange multipliers as
follows:

    𝑥𝑘+1 = 𝑥𝑘 + 𝛼𝑘 𝑝𝑥        (5.63)
    𝜆𝑘+1 = 𝜆𝑘 + 𝑝𝜆 .        (5.64)

[Fig. 5.37: Structure and block shapes for the matrix in the SQP system (Eq. 5.62).]
The inclusion of 𝛼 𝑘 suggests that we do not automatically accept the
Newton step (which corresponds to 𝛼 = 1) but instead perform a line
search as previously described in Section 4.3. The function used in the
line search needs some modification, as discussed later in this section.
SQP can be derived in an alternative way that leads to different in-
sights. This alternate approach requires an understanding of quadratic
programming (QP), which is discussed in more detail in Section 11.3
but briefly described here. A QP problem is an optimization problem
with a quadratic objective and linear constraints. In a general form, we
can express any equality constrained QP as
    minimize over 𝑥:   (1/2) 𝑥ᵀ𝑄𝑥 + 𝑞ᵀ𝑥
    subject to:        𝐴𝑥 + 𝑏 = 0 .        (5.65)

A two-dimensional example with one constraint is illustrated in Fig. 5.38.


The constraint is a matrix equation that represents multiple linear equal-
ity constraints—one for every row in 𝐴. We can solve this optimization
problem analytically from the optimality conditions. First, we form the
Lagrangian:

    ℒ(𝑥, 𝜆) = (1/2) 𝑥ᵀ𝑄𝑥 + 𝑞ᵀ𝑥 + 𝜆ᵀ(𝐴𝑥 + 𝑏) .        (5.66)

We now take the partial derivatives and set them equal to zero:

    ∇𝑥ℒ = 𝑄𝑥 + 𝑞 + 𝐴ᵀ𝜆 = 0
    ∇𝜆ℒ = 𝐴𝑥 + 𝑏 = 0 .        (5.67)

We can express those same equations in a block matrix form:

    [𝑄  𝐴ᵀ;  𝐴  0] [𝑥;  𝜆] = [−𝑞;  −𝑏] .        (5.68)

[Fig. 5.38: Quadratic problem in two dimensions.]

This is like the procedure we used in solving the KKT conditions, except
that these are linear equations, so we can solve them directly without
any iteration. As in the unconstrained case, finding the minimum of a


quadratic objective results in a system of linear equations.
As long as 𝑄 is positive definite, the linear system always has
a solution, and it is the global minimum of the QP.∗ The ease with
which a QP can be solved provides a strong motivation for SQP. For a
general constrained problem, we can make a local QP approximation
of the nonlinear model, solve the QP, and repeat this process until
convergence. This method involves iteratively solving a sequence of
quadratic programming problems, hence the name sequential quadratic
programming.

∗ In other words, this is a convex problem. Convex optimization is discussed in Chapter 11.
To form the QP, we use a local quadratic approximation of the
Lagrangian (removing the constant term because it does not change the
solution) and a linear approximation of the constraints for some step
𝑝 near our current point. In other words, we locally approximate the
problem as the following QP:

    minimize over 𝑝:   (1/2) 𝑝ᵀ𝐻ℒ𝑝 + ∇𝑥ℒᵀ𝑝
    subject to:        𝐽ℎ𝑝 + ℎ = 0 .        (5.69)

We substitute the gradient of the Lagrangian into the objective:

    (1/2) 𝑝ᵀ𝐻ℒ𝑝 + ∇𝑓ᵀ𝑝 + 𝜆ᵀ𝐽ℎ𝑝 .        (5.70)

Then, we substitute the constraint 𝐽ℎ𝑝 = −ℎ into the objective:

    (1/2) 𝑝ᵀ𝐻ℒ𝑝 + ∇𝑓ᵀ𝑝 − 𝜆ᵀℎ .        (5.71)

Now, we can remove the last term in the objective because it does
not depend on the variable (𝑝), resulting in the following equivalent
problem:

    minimize over 𝑝:   (1/2) 𝑝ᵀ𝐻ℒ𝑝 + ∇𝑓ᵀ𝑝
    subject to:        𝐽ℎ𝑝 + ℎ = 0 .        (5.72)

Using the QP solution method outlined previously results in the
following system of linear equations:

    [𝐻ℒ  𝐽ℎᵀ;  𝐽ℎ  0] [𝑝𝑥;  𝜆𝑘+1] = [−∇𝑓;  −ℎ] .        (5.73)

Substituting 𝜆𝑘+1 = 𝜆𝑘 + 𝑝𝜆 and multiplying through gives

    [𝐻ℒ  𝐽ℎᵀ;  𝐽ℎ  0] [𝑝𝑥;  𝑝𝜆] + [𝐽ℎᵀ𝜆𝑘;  0] = [−∇𝑓;  −ℎ] .        (5.74)

Subtracting the second term from both sides yields

    [𝐻ℒ  𝐽ℎᵀ;  𝐽ℎ  0] [𝑝𝑥;  𝑝𝜆] = [−∇𝑥ℒ;  −ℎ] ,        (5.75)

which is the same linear system we found from applying Newton’s


method to the KKT conditions (Eq. 5.62).
This derivation relies on the somewhat arbitrary choices of using
a QP as the subproblem and using an approximation of the Lagrangian
with constraints (rather than an approximation of the objective with
constraints or an approximation of the Lagrangian with no constraints).†
Nevertheless, it is helpful to conceptualize the method as solving a
sequence of QPs. This concept will motivate the solution process once
we add inequality constraints.

† The Lagrangian objective can also be considered to be an approximation of the objective along the feasible surface ℎ(𝑥) = 0.95
95. Gill and Wong, Sequential quadratic programming methods, 2012.

5.5.2 Inequality Constraints


Introducing inequality constraints adds complications. For inequality
constraints, we cannot solve the KKT conditions directly as we could
for equality constraints. This is because the KKT conditions include the
complementary slackness conditions 𝜎 𝑗 𝑔 𝑗 = 0, which we cannot solve
directly. Even though the number of equations in the KKT conditions
is equal to the number of unknowns, the complementary conditions do
not provide complete information (they just state that each constraint
is either active or inactive). Suppose we knew which of the inequality
constraints were active (𝑔 𝑗 = 0) and which were inactive (𝜎 𝑗 = 0) at
the optimum. Then, we could use the same approach outlined in the
previous section, treating the active constraints as equality constraints
and ignoring the inactive constraints. Unfortunately, we do not know
which constraints are active at the optimum beforehand in general.
Finding which constraints are active in an iterative way is challenging
because we would have to try all possible combinations of active
constraints. This is intractable if there are many constraints.
A common approach to handling inequality constraints is to use an
active-set method. The active set is the set of constraints that are active at
the optimum (the only ones we ultimately need to enforce). Although
the actual active set is unknown until the solution is found, we can
estimate this set at each iteration. This subset of potentially active
constraints is called the working set. The working set is then updated at
each iteration.

Similar to the SQP developed in the previous section for equality
constraints, we can create an algorithm based on solving a sequence of
QPs that linearize the constraints.‡ We extend the equality constrained
QP (Eq. 5.69) to include the inequality constraints as follows:

    minimize over 𝑠:   (1/2) 𝑠ᵀ𝐻ℒ𝑠 + ∇𝑥ℒᵀ𝑠
    subject to:        𝐽ℎ𝑠 + ℎ = 0        (5.76)
                       𝐽𝑔𝑠 + 𝑔 ≤ 0 .

‡ Linearizing the constraints can sometimes lead to an infeasible QP subproblem; additional techniques are needed to handle such cases.79,96
79. Nocedal and Wright, Numerical Optimization, 2006.
96. Gill et al., SNOPT: An SQP algorithm for large-scale constrained optimization, 2005.

The determination of the working set could happen in the inner loop,
that is, as part of the inequality constrained QP subproblem (Eq. 5.76).
Alternatively, we could choose a working set in the outer loop and
then solve the QP subproblem with only equality constraints (Eq. 5.69),
where the working-set constraints would be posed as equalities. The
former approach is more common and is discussed here. In that case,
we need to consider only the active-set problem in the context of a QP.
Many variations on active-set methods exist; we outline just one such
approach based on a binding-direction method.
The general QP problem we need to solve is as follows:

    minimize over 𝑥:   (1/2) 𝑥ᵀ𝑄𝑥 + 𝑞ᵀ𝑥
    subject to:        𝐴𝑥 + 𝑏 = 0        (5.77)
                       𝐶𝑥 + 𝑑 ≤ 0 .

We assume that 𝑄 is positive definite so that this problem is convex.


Here, 𝑄 corresponds to the Lagrangian Hessian. Using an appropriate
quasi-Newton approximation (which we will discuss in Section 5.5.4)
ensures a positive definite Lagrangian Hessian approximation.
Consider iteration 𝑘 in an SQP algorithm that handles inequality
constraints. At the end of the previous iteration, we have a design point
𝑥 𝑘 and a working set 𝑊𝑘 . The working set in this approach is a set
of row indices corresponding to the subset of inequality constraints
that are active at 𝑥𝑘.§ Then, we consider the corresponding inequality
constraints to be equalities, and we write:

    𝐶𝑤 𝑥𝑘 + 𝑑𝑤 = 0 ,        (5.78)

where 𝐶𝑤 and 𝑑𝑤 correspond to the rows of the inequality constraints
specified in the working set.

§ This is not a universal definition. For example, the constraints in the working set need not be active at 𝑥𝑘 in some approaches.
The constraints in the working set, combined with the equality
constraints, must be linearly independent. Thus, we cannot include
more working-set constraints (plus equality constraints) than design
variables. Although the active set is unique, there can be multiple valid
choices for the working set.

Assume, for the moment, that the working set does not change at
nearby points (i.e., we ignore the constraints outside the working set).
We seek a step 𝑝 to update the design variables as follows: 𝑥 𝑘+1 = 𝑥 𝑘 + 𝑝.
We find 𝑝 by solving the following simplified QP that considers only
the working set:

    minimize over 𝑝:   (1/2) (𝑥𝑘 + 𝑝)ᵀ𝑄(𝑥𝑘 + 𝑝) + 𝑞ᵀ(𝑥𝑘 + 𝑝)
    subject to:        𝐴(𝑥𝑘 + 𝑝) + 𝑏 = 0        (5.79)
                       𝐶𝑤(𝑥𝑘 + 𝑝) + 𝑑𝑤 = 0 .

We solve this QP by varying 𝑝, so after multiplying out the terms


in the objective, we can ignore the terms that do not depend on 𝑝. We
can also simplify the constraints because we know the constraints were
satisfied at the previous iteration (i.e., 𝐴𝑥 𝑘 + 𝑏 = 0 and 𝐶𝑤 𝑥 𝑘 + 𝑑𝑤 = 0).
The simplified problem is as follows:

    minimize over 𝑝:   (1/2) 𝑝ᵀ𝑄𝑝 + (𝑞 + 𝑄ᵀ𝑥𝑘)ᵀ𝑝
    subject to:        𝐴𝑝 = 0        (5.80)
                       𝐶𝑤𝑝 = 0 .

We now have an equality constrained QP that we can solve using the


methods from the previous section. Using Eq. 5.68, the KKT solution
to this problem is as follows:

    [𝑄  𝐴ᵀ  𝐶𝑤ᵀ;  𝐴  0  0;  𝐶𝑤  0  0] [𝑝;  𝜆;  𝜎] = [−𝑞 − 𝑄ᵀ𝑥𝑘;  0;  0] .        (5.81)

Figure 5.39 shows the structure of the matrix in this linear system.

[Fig. 5.39: Structure of the QP subproblem within the inequality constrained
QP solution process.]

Let us consider the case where the solution of this linear system is
nonzero. Solving the KKT conditions in Eq. 5.80 ensures that all the
constraints in the working set are still satisfied at 𝑥𝑘 + 𝑝. Still, there is no
guarantee that the step does not violate some of the constraints outside
of our working set. Suppose that 𝐶 𝑛 and 𝑑𝑛 define the constraints
outside of the working set. If

𝐶 𝑛 (𝑥 𝑘 + 𝑝) + 𝑑𝑛 ≤ 0 (5.82)

for all rows, all the constraints are still satisfied. In that case, we accept
the step 𝑝 and update the design variables as follows:

𝑥 𝑘+1 = 𝑥 𝑘 + 𝑝 . (5.83)

The working set remains unchanged as we proceed to the next iteration.


Otherwise, if some of the constraints are violated, we cannot take
the full step 𝑝; instead, we reduce the step length by a factor 𝛼 as follows:

𝑥 𝑘+1 = 𝑥 𝑘 + 𝛼𝑝 . (5.84)

We cannot take the full step (𝛼 = 1), but we would like to take as large
a step as possible while still keeping all the constraints feasible.
Let us consider how to determine the appropriate step size, 𝛼.
Substituting the step update (Eq. 5.84) into the equality constraints, we
obtain the following:

𝐴(𝑥 𝑘 + 𝛼𝑝) + 𝑏 = 0 . (5.85)

We know that 𝐴𝑥 𝑘 + 𝑏 = 0 from solving the problem at the previous


iteration. Also, we just solved 𝑝 under the condition that 𝐴𝑝 = 0.
Therefore, the equality constraints (Eq. 5.85) remain satisfied for any
choice of 𝛼. By the same logic, the constraints in our working set remain
satisfied for any choice of 𝛼 as well.
Now let us consider the constraints that are not in the working set.
We denote 𝑐 𝑖 as row 𝑖 of the matrix 𝐶 𝑛 (associated with the inequality
constraints outside of the working set). If these constraints are to remain
satisfied, we require

    𝑐𝑖ᵀ(𝑥𝑘 + 𝛼𝑝) + 𝑑𝑖 ≤ 0 .        (5.86)

After rearranging, this condition becomes

    𝛼 𝑐𝑖ᵀ𝑝 ≤ −(𝑐𝑖ᵀ𝑥𝑘 + 𝑑𝑖) .        (5.87)

We do not divide through by 𝑐𝑖ᵀ𝑝 yet because the direction of the
inequality would change depending on its sign. We consider the two
possibilities separately. Because the QP constraints were satisfied at
the previous iteration, we know that 𝑐𝑖ᵀ𝑥𝑘 + 𝑑𝑖 ≤ 0 for all 𝑖. Thus, the
right-hand side is always positive. If 𝑐𝑖ᵀ𝑝 is negative, then the inequality
will be satisfied for any choice of 𝛼. Alternatively, if 𝑐𝑖ᵀ𝑝 is positive, we
can rearrange Eq. 5.87 to obtain the following:

    𝛼𝑖 ≤ −(𝑐𝑖ᵀ𝑥𝑘 + 𝑑𝑖) / (𝑐𝑖ᵀ𝑝) .        (5.88)

This equation determines how large 𝛼 can be without causing one of


the constraints outside of the working set to become active. Because
multiple constraints may become active, we have to evaluate 𝛼 for each
one and choose the smallest 𝛼 among all constraints.
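This check is compact in code. Here is a minimal sketch (the array names are
illustrative) that returns the largest admissible step and the index of the
blocking constraint, if any; it mirrors the inner loop of Alg. 5.4 below.

import numpy as np

def max_feasible_step(C_n, d_n, x_k, p):
    # Largest alpha in (0, 1] allowed by Eq. 5.88 for constraints outside the working set
    alpha, blocking = 1.0, None
    for i in range(C_n.shape[0]):
        cip = C_n[i] @ p
        if cip > 0.0:                                   # potential blocking constraint
            alpha_i = -(C_n[i] @ x_k + d_n[i]) / cip    # Eq. 5.88
            if alpha_i < alpha:
                alpha, blocking = alpha_i, i
    return alpha, blocking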

A constraint for which 𝛼 < 1 is said to be blocking. In other words, if


we had included that constraint in our working set before solving the
QP, it would have changed the solution. We add one of the blocking
constraints to the working set, and proceed to the next iteration.¶

¶ In practice, adding only one constraint to the working set at a time (or removing only one constraint in other steps described later) typically leads to faster convergence.

Now consider the case where the solution to Eq. 5.81 is 𝑝 = 0. If
all inequality constraint Lagrange multipliers are positive (𝜎𝑖 > 0), the
KKT conditions are satisfied and we have solved the original inequality
constrained QP. If one or more 𝜎𝑖 values are negative, additional
iterations are needed. We find the 𝜎𝑖 value that is most negative,
remove constraint 𝑖 from the working set, and proceed to the next
iteration.
As noted previously, all the constraints in the reduced QP (the
equality constraints plus all working-set constraints) must be linearly
independent and thus [𝐴 𝐶𝑤 ]| has full row rank. Otherwise, there
would be no solution to Eq. 5.81. Therefore, the starting working set
might not include all active constraints at 𝑥0 and must instead contain
only a subset, such that linear independence is maintained. Similarly,
when adding a blocking constraint to the working set, we must again
check for linear independence. At a minimum, we need to ensure
that the length of the working set does not exceed 𝑛 𝑥 . The complete
algorithm for solving an inequality constrained QP is shown in Alg. 5.4.

Tip 5.4 Some equality constraints can be posed as inequality con-


straints
Equality constraints are less common in engineering design problems than
inequality constraints. Sometimes we pose a problem as an equality constraint
unnecessarily. For example, the simulation of an aircraft in steady-level flight
may require the lift to equal the weight. Formally, this is an equality constraint,
but it can also be posed as an inequality constraint (lift greater or equal to
weight). There is no advantage to having more lift than required because
it increases drag, so the constraint is always active at the optimum. When
such a constraint is not active at the solution, it can be a helpful indicator that
something is wrong with the formulation, the optimizer, or the assumptions.
Although an equality constraint is more natural from the algorithm perspective,
the flexibility of the inequality constraint might allow the optimizer to explore
the design space more effectively.
Consider another example: a propeller design problem might require a
specified thrust. Although an equality constraint would likely work, it is more
constraining than necessary. If the optimal design were somehow able to
produce excess thrust, we would accept that design. Thus, we should not
formulate the constraint in an unnecessarily restrictive way.

Algorithm 5.4 Active-set solution method for an inequality constrained QP

Inputs:
𝑄, 𝑞, 𝐴, 𝑏, 𝐶, 𝑑: Matrices and vectors defining the QP (Eq. 5.77); 𝑄 must be positive definite
𝜀: Tolerance used for termination and for determining whether constraint is active
Outputs:
𝑥 ∗ : Optimal point

𝑘=0
𝑥 𝑘 = 𝑥0
𝑊𝑘 = 𝑖 for all 𝑖 where (𝑐 𝑖 | 𝑥 𝑘 + 𝑑 𝑖 ) > −𝜀 and length(𝑊𝑘 ) ≤ 𝑛 𝑥 One possible
initial working set
while true do
set 𝐶𝑤 = 𝐶 𝑖,∗ and 𝑑𝑤 = 𝑑 𝑖 for all 𝑖 ∈ 𝑊𝑘 Select rows for working set
Solve the KKT system (Eq. 5.81)
if k𝑝 k < 𝜀 then
if 𝜎 ≥ 0 then Satisfied KKT conditions
𝑥∗ = 𝑥 𝑘
return
else
𝑖 = argmin 𝜎
𝑊𝑘+1 = 𝑊𝑘 \ {𝑖} Remove 𝑖 from working set
𝑥 𝑘+1 = 𝑥 𝑘
end if
else
𝛼=1 Initialize with optimum step
𝐵 = {} Blocking index
for 𝑖 ∉ 𝑊𝑘 do Check constraints outside of working set
if 𝑐𝑖ᵀ𝑝 > 0 then        Potential blocking constraint
𝛼𝑏 = −(𝑐𝑖ᵀ𝑥𝑘 + 𝑑𝑖) / (𝑐𝑖ᵀ𝑝)        𝑐𝑖 is a row of 𝐶𝑛
if 𝛼 𝑏 < 𝛼 then
𝛼 = 𝛼𝑏
𝐵=𝑖 Save or overwrite blocking index
end if
end if
end for
𝑊𝑘+1 = 𝑊𝑘 ∪ {𝐵} Add 𝐵 to working set (if linearly independent)
𝑥 𝑘+1 = 𝑥 𝑘 + 𝛼𝑝
end if
𝑘 = 𝑘+1
end while

Example 5.10 Inequality constrained QP



Let us solve the following problem using the active-set QP algorithm:

    minimize over 𝑥1, 𝑥2:   3𝑥1² + 𝑥2² + 2𝑥1𝑥2 + 𝑥1 + 6𝑥2
    subject to:             2𝑥1 + 3𝑥2 ≥ 4
                            𝑥1 ≥ 0
                            𝑥2 ≥ 0 .

Rewriting in the standard form (Eq. 5.77) yields the following:

    𝑄 = [6  2;  2  2] ,   𝑞 = [1;  6] ,   𝐶 = [−2  −3;  −1  0;  0  −1] ,   𝑑 = [4;  0;  0] .
We arbitrarily chose 𝑥 = [3, 2] as a starting point. Because none of the
constraints are active, the initial working set is empty, 𝑊 = {}. At each iteration,
we solve the QP formed by the equality constraints and any constraints in the
active set (treated as equality constraints). The sequence of iterations is detailed
as follows and is plotted in Fig. 5.40:

𝑘 = 1  The QP subproblem yields 𝑝 = [−1.75, −6.25] and 𝜎 = [0, 0, 0]. Next,
       we check whether any constraints are blocking at the new point 𝑥 + 𝑝.
       Because all three constraints are outside of the working set, we check
       all three. Constraint 1 is potentially blocking (𝑐𝑖ᵀ𝑝 > 0) and leads to
       𝛼𝑏 = 0.35955. Constraint 2 is also potentially blocking and leads to
       𝛼𝑏 = 1.71429. Finally, constraint 3 is also potentially blocking and leads
       to 𝛼𝑏 = 0.32. We choose the constraint with the smallest 𝛼, which is
       constraint 3, and add it to our working set. At the end of the iteration,
       𝑥 = [2.44, 0.0] and 𝑊 = {3}.

𝑘 = 2  The new QP subproblem yields 𝑝 = [−2.60667, 0.0] and 𝜎 = [0, 0, 5.6667].
       Constraints 1 and 2 are outside the working set. Constraint 1 is potentially
       blocking and gives 𝛼𝑏 = 0.1688; constraint 2 is also potentially blocking
       and yields 𝛼𝑏 = 0.9361. Because constraint 1 yields the smaller step, we
       add it to the working set. At the end of the iteration, 𝑥 = [2.0, 0.0] and
       𝑊 = {1, 3}.

𝑘 = 3  The QP subproblem now yields 𝑝 = [0, 0] and 𝜎 = [6.5, 0, −9.5]. Because
       𝑝 = 0, we check for convergence. One of the Lagrange multipliers
       is negative, so this cannot be a solution. We remove the constraint
       associated with the most negative Lagrange multiplier from the working
       set (constraint 3). At the end of the iteration, 𝑥 is unchanged at
       𝑥 = [2.0, 0.0], and 𝑊 = {1}.

𝑘 = 4  The QP yields 𝑝 = [−1.5, 1.0] and 𝜎 = [3, 0, 0]. Constraint 2 is potentially
       blocking and yields 𝛼𝑏 = 1.333 (which means it is not blocking because
       𝛼𝑏 > 1). Constraint 3 is also not blocking (𝑐𝑖ᵀ𝑝 < 0). None of the 𝛼𝑏
       values was blocking, so we can take the full step (𝛼 = 1). The new
       point is 𝑥 = [0.5, 1.0], and the working set is unchanged at 𝑊 = {1}.

𝑘 = 5  The QP yields 𝑝 = [0, 0], 𝜎 = [3, 0, 0]. Because 𝑝 = 0, we check for
       convergence. All Lagrange multipliers are nonnegative, so the problem
       is solved. The solution to the original inequality constrained QP is then
       𝑥∗ = [0.5, 1.0].

[Fig. 5.40: Iteration history for the active-set QP example.]

Because SQP solves a sequence of QPs, an effective approach is to


use the optimal 𝑥 and active set from the previous QP as the starting
point and working set for the next QP. The algorithm outlined in this
section requires both a feasible starting point and a working set of
linearly independent constraints. Although the previous starting point
and working set usually satisfy these conditions, this is not guaranteed,
and adjustments may be necessary.
Algorithms to determine a feasible point are widely used (often by
solving a linear programming problem). There are also algorithms to
remove or add to the constraint matrix as needed to ensure full rank.96

96. Gill et al., SNOPT: An SQP algorithm for large-scale constrained optimization, 2005.
Tip 5.5 Consider reformulating your constraints

There are often multiple mathematically equivalent ways to pose the


problem constraints. Reformulating can sometimes yield equivalent problems
that are significantly easier to solve. In some cases, it can help to add redundant
constraints to avoid areas of the design space that are not useful. Similarly, we
should consider whether the model that computes the objective and constraint
functions should be solved separately or posed as constraints at the optimizer
level (as we did in Eq. 3.33).

5.5.3 Merit Functions and Filters


Similar to what we did in unconstrained optimization, we do not directly
accept the step 𝑝 returned from solving the subproblem (Eq. 5.62 or
Eq. 5.76). Instead, we use 𝑝 as the first step length in a line search.
In the line search for unconstrained problems (Section 4.3), deter-
mining if a point was good enough to terminate the search was based
solely on comparing the objective function value (and the slope when
enforcing the strong Wolfe conditions). For constrained optimization,
we need to make some modifications to these methods and criteria.
In constrained optimization, objective function decrease and fea-
sibility often compete with each other. During a line search, a new
point may decrease the objective but increase the infeasibility, or it may
decrease the infeasibility but increase the objective. We need to take
these two metrics into account to determine the line search termination
criterion.

The Lagrangian is a function that accounts for the two metrics.


However, at a given iteration, we only have an estimate of the Lagrange
multipliers, which can be inaccurate.
One way to combine the objective value with the constraints in a
line search is to use merit functions, which are similar to the penalty
functions introduced in Section 5.4. Common merit functions include
functions that use the norm of constraint violations:

    𝑓ˆ(𝑥; 𝜇) = 𝑓(𝑥) + 𝜇 ‖𝑔¯(𝑥)‖𝑝 ,        (5.89)

where 𝑝 is 1 or 2 and 𝑔¯ are the constraint violations, defined as

    𝑔¯𝑗(𝑥) =  ℎ𝑗(𝑥)            for equality constraints
              max(0, 𝑔𝑗(𝑥))    for inequality constraints .        (5.90)

The augmented Lagrangian from Section 5.4.1 can also be repurposed


for a constrained line search (see Eqs. 5.53 and 5.54).
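As a minimal sketch (assuming the 1-norm variant of Eq. 5.89 and placeholder
function handles), a merit function can be built as follows:

import numpy as np

def merit_l1(f, h, g, mu):
    # l1 merit function of Eqs. 5.89-5.90 for a given penalty parameter mu
    def phi(x):
        viol_h = np.abs(np.asarray(h(x)))            # equality constraint violations
        viol_g = np.maximum(0.0, np.asarray(g(x)))   # inequality constraint violations
        return f(x) + mu * (np.sum(viol_h) + np.sum(viol_g))
    return phi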
Like penalty functions, one downside of merit functions is that it is
challenging to choose a suitable value for the penalty parameter 𝜇. This
parameter needs to be large to ensure feasibility. However, if it is too
large, a full Newton step might not be permitted. This might slow the
convergence unnecessarily. Using the augmented Lagrangian can help,
as discussed in Section 5.4.1. However, there are specific techniques
used in SQP line searches and various safeguarding techniques needed
for robustness.
Filter methods are an alternative to using penalty-based methods in a
line search.97 Filter methods interfere less with the full Newton step
and are effective for both SQP and interior-point methods (which are
introduced in Section 5.6).98,99 The approach is based on concepts from
multiobjective optimization, which is the subject of Chapter 9. In the
filter method, there are two objectives: decrease the objective function
and decrease infeasibility. A point is said to dominate another if its
objective is lower and the sum of its constraint violations is lower. The
filter consists of all the points that have been found to be non-dominated
in the line searches so far. The line search terminates when it finds a
point that is not dominated by any point in the current filter. That new
point is then added to the filter, and any points that it dominates are
removed from the filter.‖

This is only the basic concept. Robust implementation of a filter
method requires imposing sufficient decrease conditions, not unlike
those in the unconstrained case, and several other modifications.
Fletcher et al.99 provide more details on filter methods.

97. Fletcher and Leyffer, Nonlinear programming without a penalty function, 2002.
98. Benson et al., Interior-point methods for nonconvex nonlinear programming: Filter methods and merit functions, 2002.
99. Fletcher et al., A brief history of filter methods, 2006.
‖ See Section 9.2 for more details on the concept of dominance.

Example 5.11 Using a filter

A filter consists of pairs (𝑓(𝑥), ‖𝑔¯‖1), where ‖𝑔¯‖1 is the sum of the constraint


violations (Eq. 5.90). Suppose that the current filter contains the following
three points: {(2, 5), (3, 2), (7, 1)}. None of the points in the filter dominates any
other. These points are plotted as the blue dots in Fig. 5.41, where the shaded
regions correspond to all the points that are dominated by the points in the
filter. During a line search, a new candidate point is evaluated. There are three
possible outcomes. Consider the following three points that illustrate these
three outcomes (corresponding to the labeled points in Fig. 5.41):

1. (1, 4): This point is not dominated by any point in the filter. The step
   is accepted, the line search ends, and this point is added to the filter.
   Because this new point dominates one of the points in the filter, (2, 5),
   that dominated point is removed from the filter. The current set in the
   filter is now {(1, 4), (3, 2), (7, 1)}.

2. (1, 6): This point is not dominated by any point in the filter. The step is
   accepted, the line search ends, and this new point is added to the filter.
   Unlike the previous case, none of the points in the filter are dominated.
   Therefore, no points are removed from the filter set, which becomes
   {(1, 6), (2, 5), (3, 2), (7, 1)}.

3. (4, 3): This point is dominated by a point in the filter, (3, 2). The step
   is rejected, and the line search continues by selecting a new candidate
   point. The filter is unchanged.

[Fig. 5.41: Filter method example showing three points in the filter (blue
dots); the shaded regions correspond to all the points that are dominated by
the filter. The red dots illustrate three different possible outcomes when new
points are considered.]
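The dominance test and filter update described above are straightforward to
code. This is a minimal sketch using the points from Ex. 5.11; the list-based
filter representation is illustrative and omits the sufficient decrease
safeguards mentioned earlier.

def dominates(a, b):
    # Point a = (f, viol) dominates b if its objective and total violation are both lower
    return a[0] < b[0] and a[1] < b[1]

def filter_accept(filter_pts, candidate):
    # Accept candidate if no filter point dominates it; update the filter accordingly
    if any(dominates(p, candidate) for p in filter_pts):
        return False, filter_pts                       # rejected; filter unchanged
    updated = [p for p in filter_pts if not dominates(candidate, p)]
    updated.append(candidate)
    return True, updated

# The three outcomes from Ex. 5.11 (each call tests the same starting filter)
flt = [(2, 5), (3, 2), (7, 1)]
print(filter_accept(flt, (1, 4)))   # accepted, (2, 5) removed
print(filter_accept(flt, (1, 6)))   # accepted, nothing removed
print(filter_accept(flt, (4, 3)))   # rejected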

5.5.4 Quasi-Newton SQP


In the discussion of the SQP method so far, we have assumed that we
have the Hessian of the Lagrangian 𝐻ℒ . Similar to the unconstrained
optimization case, the Hessian might not be available or be too expensive
to compute. Therefore, it is desirable to use a quasi-Newton approach
that approximates the Hessian, as we did in Section 4.4.4.
The difference now is that we need an approximation of the La-
grangian Hessian instead of the objective function Hessian. We denote
this approximation at iteration 𝑘 as 𝐻˜ ℒ 𝑘 .
Similar to the unconstrained case, we can approximate 𝐻˜ ℒ 𝑘 using
the gradients of the Lagrangian and a quasi-Newton update, such as
the Broyden–Fletcher–Goldfarb–Shanno (BFGS) update. Unlike in
unconstrained optimization, we do not want the inverse of the Hessian
directly. Therefore, we use the version of the BFGS formula that
computes the Hessian (Eq. 4.87):

    𝐻˜ℒ𝑘+1 = 𝐻˜ℒ𝑘 − (𝐻˜ℒ𝑘 𝑠𝑘 𝑠𝑘ᵀ 𝐻˜ℒ𝑘) / (𝑠𝑘ᵀ 𝐻˜ℒ𝑘 𝑠𝑘) + (𝑦𝑘 𝑦𝑘ᵀ) / (𝑦𝑘ᵀ 𝑠𝑘) ,        (5.91)

where:
    𝑠𝑘 = 𝑥𝑘+1 − 𝑥𝑘
    𝑦𝑘 = ∇𝑥ℒ(𝑥𝑘+1, 𝜆𝑘+1) − ∇𝑥ℒ(𝑥𝑘, 𝜆𝑘+1) .        (5.92)
The step in the design variable space, 𝑠 𝑘 , is the step that resulted from
the latest line search. The Lagrange multiplier is fixed to the latest value
when approximating the curvature of the Lagrangian because we only
need the curvature in the space of the design variables.
Recall that for the QP problem (Eq. 5.76) to have a solution, 𝐻˜ℒ𝑘
must be positive definite. To ensure a positive definite approximation,
we can use a damped BFGS update.25,∗∗ This method replaces 𝑦 with a
new vector 𝑟, defined as

    𝑟𝑘 = 𝜃𝑘 𝑦𝑘 + (1 − 𝜃𝑘) 𝐻˜ℒ𝑘 𝑠𝑘 ,        (5.93)

where the scalar 𝜃𝑘 is defined as

    𝜃𝑘 = 1                                                if 𝑠𝑘ᵀ𝑦𝑘 ≥ 0.2 𝑠𝑘ᵀ𝐻˜ℒ𝑘𝑠𝑘
    𝜃𝑘 = (0.8 𝑠𝑘ᵀ𝐻˜ℒ𝑘𝑠𝑘) / (𝑠𝑘ᵀ𝐻˜ℒ𝑘𝑠𝑘 − 𝑠𝑘ᵀ𝑦𝑘)              if 𝑠𝑘ᵀ𝑦𝑘 < 0.2 𝑠𝑘ᵀ𝐻˜ℒ𝑘𝑠𝑘 ,        (5.94)

which can range from 0 to 1. We then use the same BFGS update
formula (Eq. 5.91), except that we replace each 𝑦𝑘 with 𝑟𝑘.

25. Powell, Algorithms for nonlinear constraints that use Lagrangian functions, 1978.
∗∗ The damped BFGS update is not always the best approach. There are approaches built around other approximation methods, such as symmetric rank 1 (SR1).100 Limited-memory updates similar to L-BFGS (see Section 4.4.5) can be used when storing a dense Hessian for large problems is prohibitive.101
100. Fletcher, Practical Methods of Optimization, 1987.
101. Liu and Nocedal, On the limited memory BFGS method for large scale optimization, 1989.
To better understand this update, let us consider the two extremes
for 𝜃. If 𝜃𝑘 = 0, then Eq. 5.93 in combination with Eq. 5.91 yields
𝐻˜ ℒ 𝑘+1 = 𝐻˜ ℒ 𝑘 ; that is, the Hessian approximation is unmodified. At the
other extreme, 𝜃𝑘 = 1 yields the full BFGS update formula (𝑟 𝑘 is set
to 𝑦 𝑘 ). Thus, the parameter 𝜃𝑘 provides a linear weighting between
keeping the current Hessian approximation and using the full BFGS
update.
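A direct Python sketch of the damped update (Eqs. 5.91 to 5.94); here H is the
current approximation 𝐻˜ℒ𝑘, and the function is illustrative rather than a
reference implementation:

import numpy as np

def damped_bfgs_update(H, s, y):
    # Damped BFGS update of the Lagrangian Hessian approximation (Eqs. 5.91-5.94)
    sHs = s @ H @ s
    sy = s @ y
    if sy >= 0.2 * sHs:
        theta = 1.0                                   # full BFGS update
    else:
        theta = 0.8 * sHs / (sHs - sy)                # damping factor, Eq. 5.94
    r = theta * y + (1.0 - theta) * (H @ s)           # Eq. 5.93
    Hs = H @ s
    return H - np.outer(Hs, Hs) / sHs + np.outer(r, r) / (r @ s)   # Eq. 5.91 with y -> r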
The definition of 𝜃𝑘 (Eq. 5.94) ensures that 𝐻˜ ℒ 𝑘+1 stays close enough
to 𝐻˜ ℒ 𝑘 and remains positive definite. The damping is activated when
the predicted curvature along the latest step is below one-fifth of the
curvature predicted by the latest approximate Hessian. This could
happen when the function is flattening or when the curvature becomes
negative.

5.5.5 Algorithm Overview

We now put together the various pieces in a high-level description
of SQP with quasi-Newton approximations in Alg. 5.5.†† For the
convergence criterion, we can use an infinity norm of the KKT system
residual vector. For better control over the convergence, we can consider
two separate tolerances: one for the norm of the optimality and another
for the norm of the feasibility. For problems that only have equality
constraints, we can solve the corresponding QP (Eq. 5.62) instead.

†† A few popular SQP implementations include SNOPT,96 Knitro,102 MATLAB's fmincon, and SLSQP.103 The first three are commercial options, whereas SLSQP is open source. There are interfaces in different programming languages for these optimizers, including pyOptSparse (for SNOPT and SLSQP).1
1. Wu et al., pyOptSparse: A Python framework for large-scale constrained nonlinear optimization of sparse systems, 2020.
102. Byrd et al., Knitro: An Integrated Package for Nonlinear Optimization, 2006.
103. Kraft, A software package for sequential quadratic programming, 1988.

Algorithm 5.5 SQP with quasi-Newton approximation

Inputs:
𝑥 0 : Starting point
𝜏opt : Optimality tolerance
𝜏feas : Feasibility tolerance
Outputs:
𝑥 ∗ : Optimal point
𝑓 (𝑥 ∗ ): Corresponding function value

𝜆0 = 0, 𝜎0 = 0 Initial Lagrange multipliers


𝛼init = 1 For line search
Evaluate functions ( 𝑓 , 𝑔, ℎ) and derivatives (∇ 𝑓 , 𝐽 𝑔 , 𝐽 ℎ )
∇𝑥ℒ = ∇𝑓 + 𝐽ℎᵀ𝜆 + 𝐽𝑔ᵀ𝜎
𝑘=0
while ‖∇𝑥ℒ‖∞ > 𝜏opt or ‖ℎ‖∞ > 𝜏feas do
if 𝑘 = 0 or reset = true then
𝐻˜ ℒ0 = 𝐼 Initialize to identity matrix or scaled version (Eq. 4.95)
else
Update 𝐻˜ ℒ 𝑘+1 Compute damped BFGS (Eqs. 5.91 to 5.94)
end if
Solve QP subproblem (Eq. 5.76) for 𝑝 𝑥 , 𝑝𝜆
    minimize over 𝑝𝑥:   (1/2) 𝑝𝑥ᵀ𝐻˜ℒ𝑝𝑥 + ∇𝑥ℒᵀ𝑝𝑥
    subject to:         𝐽ℎ𝑝𝑥 + ℎ = 0
                        𝐽𝑔𝑝𝑥 + 𝑔 ≤ 0

𝜆 𝑘+1 = 𝜆 𝑘 + 𝑝𝜆
𝛼 = linesearch (𝑝 𝑥 , 𝛼 init ) Use merit function or filter (Section 5.5.3)
𝑥 𝑘+1 = 𝑥 𝑘 + 𝛼𝑝 𝑘 Update step
𝑊𝑘+1 = 𝑊𝑘 Active set becomes initial working set for next QP
Evaluate functions ( 𝑓 , 𝑔, ℎ) and derivatives (∇ 𝑓 , 𝐽 𝑔 , 𝐽 ℎ )
∇𝑥ℒ = ∇𝑓 + 𝐽ℎᵀ𝜆 + 𝐽𝑔ᵀ𝜎
𝑘 = 𝑘+1
end while

Example 5.12 SQP applied to equality constrained problem

We now solve Ex. 5.2 using the SQP method (Alg. 5.5). We start at
𝑥0 = [2, 1] with an initial Lagrange multiplier 𝜆 = 0 and an initial estimate

of the Lagrangian Hessian as 𝐻˜ ℒ = 𝐼 for simplicity. The line search uses an


augmented Lagrangian merit function with a fixed penalty parameter (𝜇 = 1)
and a quadratic bracketed search as described in Section 4.3.2. The choice
of merit function and line search has only a small effect in this simple
problem. The gradient of the equality constraint is

    𝐽ℎ = [𝑥1/2   2𝑥2] = [1   2] ,

and differentiating the Lagrangian with respect to 𝑥 yields

    ∇𝑥ℒ = [1 + 𝜆𝑥1/2;   2 + 2𝜆𝑥2] = [1;   2] .
The KKT system to be solved (Eq. 5.62) in the first iteration is

    [1  0  1;   0  1  2;   1  2  0] [𝑠𝑥1;  𝑠𝑥2;  𝑠𝜆] = [−1;  −2;  −1] .
The solution of this system is 𝑠 = [−0.2, −0.4, −0.8]. Using 𝑝 = [−0.2, −0.4], the
full step 𝛼 = 1 satisfies the strong Wolfe conditions, so for the new iteration we
have 𝑥1 = [1.8, 0.6], 𝜆1 = −0.8.
To update the approximate Hessian 𝐻˜ℒ using the damped BFGS update
(Eq. 5.93), we need to compare the values of 𝑠0ᵀ𝑦0 = −0.272 and
𝑠0ᵀ𝐻˜ℒ0𝑠0 = 0.2. Because 𝑠𝑘ᵀ𝑦𝑘 < 0.2 𝑠𝑘ᵀ𝐻˜ℒ𝑘𝑠𝑘, we need to compute the
scalar 𝜃 = 0.339 using Eq. 5.94. This results in a partial BFGS update to
maintain positive
definiteness. After a few iterations, 𝜃 = 1 for the remainder of the optimization,
corresponding to a full BFGS update. The initial estimate for the Lagrangian
Hessian is poor (just a scaled identity matrix), so some damping is necessary.
However, the estimate is greatly improved after a few iterations. Using the
quasi-Newton update in Eq. 5.91, we get the approximate Hessian for the next
iteration as

    𝐻˜ℒ1 = [1.076   −0.275;   −0.275   0.256] .

2 2 2
𝑥0
1 1 1

𝑥2 0 𝑥2 0 𝑥2 0
𝑥∗
−1 −1 −1

−2 −2 −2

−3 −2 −1 0 1 2 3 −3 −2 −1 0 1 2 3 −3 −2 −1 0 1 2 3
𝑥1 𝑥1 𝑥1

𝑘 = 0, 𝜆 = 0.00 𝑘 = 6, 𝜆 = 0.95 𝑘 = 11, 𝜆 = 1.41

We repeat this process for subsequent iterations, as shown in Figure 5.42. Fig. 5.42 SQP algorithm iterations.
The gray contours show the QP subproblem (Eq. 5.72) solved at each itera-
tion: the quadratic objective appears as elliptical contours and the linearized
5 Constrained Gradient-Based Optimization 202

constraint as a straight line. The starting point is infeasible, and the iterations
remain infeasible until the last few iterations.
k∇𝑥 ℒ k
This behavior is common for SQP because although it satisfies the linear 101
approximation of the constraints at each step, it does not necessarily satisfy the
constraints of the actual problem, which is nonlinear. As the constraint approx-
10−2
imation becomes more accurate near the solution, the nonlinear constraint is
then satisfied. Figure 5.43 shows the convergence of the Lagrangian gradient
norm, with the characteristic quadratic convergence at the end. 10−5

10−8
0 5 10
𝑘
Example 5.13 SQP applied to inequality constrained problem

We now solve the inequality constrained version of the previous example (Ex. 5.4) with the same initial conditions and general approach. The only
difference is that rather than solving the linear system of equations Eq. 5.62, we
have to solve an active-set QP problem at each iteration, as outlined in Alg. 5.4.
The iteration history and convergence of the norm of the Lagrangian gradient
are plotted in Figs. 5.44 and 5.45, respectively.

[Fig. 5.44 Iteration history of SQP applied to an inequality constrained problem, with the Lagrangian and the linearized constraint overlaid (with a darker infeasible region); panels at 𝑘 = 0, 𝑘 = 3, and 𝑘 = 7.]

[Fig. 5.45 Convergence history of the norm of the Lagrangian gradient.]

Tip 5.6 How to handle maximum and minimum constraints

Constraints that take the maximum or minimum of a set of quantities


are often desired. For example, the stress in a structure may be evaluated at
many points, and we want to make sure the maximum stress does not exceed a
specified yield stress, such that

max(𝜎) ≤ 𝜎yield .

However, the maximum function is not continuously differentiable (because


the maximum can switch elements between iterations), which may cause
difficulties when using gradient-based optimization. The constraint aggregation
methods from Section 5.7 can enforce such conditions with a smooth function.
Nevertheless, it is challenging for an optimizer to find a point that satisfies the
KKT conditions because the information is reduced to one constraint.
Instead of taking the maximum, you should consider constraining the stress
at all 𝑛 𝜎 points as follows

𝜎 𝑗 ≤ 𝜎yield , 𝑗 = 1, . . . , 𝑛 𝜎 .

Now all constraints are continuously differentiable. The optimizer has 𝑛 𝜎


constraints instead of 1, but that generally provides more information and
makes it easier for the optimizer to satisfy the KKT conditions with more than
one Lagrange multiplier. Even though we have added more constraints, an
active set method makes this efficient because it considers only the critical
constraints.
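As a concrete sketch, the two formulations of this constraint might look as follows in code (the names sigma and sigma_yield are placeholders, and the normalization by the yield stress is a scaling choice, not a requirement):

import numpy as np

def stress_constraints(sigma, sigma_yield):
    # One smooth constraint per evaluation point: sigma_j / sigma_yield - 1 <= 0
    return sigma / sigma_yield - 1.0

def stress_constraint_max(sigma, sigma_yield):
    # Single constraint using the maximum; not continuously differentiable
    return np.max(sigma) / sigma_yield - 1.0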

5.6 Interior-Point Methods

Interior-point methods use concepts from both SQP and interior penalty
methods.∗ These methods form an objective similar to the interior penalty but with the key difference that instead of penalizing the constraints directly, they add slack variables to the set of optimization variables and penalize the slack variables. The resulting formulation is as follows:

minimize (over 𝑥, 𝑠)   𝑓(𝑥) − 𝜇𝑏 Σ𝑗=1^𝑛𝑔 ln 𝑠𝑗
subject to             ℎ(𝑥) = 0                                   (5.95)
                       𝑔(𝑥) + 𝑠 = 0 .

∗ The name interior point stems from early methods based on interior penalty methods that assumed that the initial point was feasible. However, modern interior-point methods can start with infeasible points.
This formulation turns the inequality constraints into equality con-
straints and thus avoids the combinatorial problem.
Similar to SQP, we apply Newton’s method to solve for the KKT
conditions. However, instead of solving the KKT conditions of the

original problem (Eq. 5.59), we solve the KKT conditions of the interior-
point formulation (Eq. 5.95).
These slack variables in Eq. 5.95 do not need to be squared, as was
done in deriving the KKT conditions, because the logarithm is only
defined for positive 𝑠 values and acts as a barrier preventing negative
values of 𝑠 (although we need to prevent the line search from producing
negative 𝑠 values, as discussed later). Because 𝑠 is always positive,
that means that 𝑔(𝑥 ∗ ) < 0 at the solution, which satisfies the inequality
constraints.
Like penalty method formulations, the interior-point formulation
(Eq. 5.95) is only equivalent to the original constrained problem in the
limit, as 𝜇𝑏 → 0. Thus, as in the penalty methods, we need to solve a sequence of these problems with 𝜇𝑏 approaching zero.
First, we form the Lagrangian for this problem as

ℒ(𝑥, 𝜆, 𝜎, 𝑠) = 𝑓(𝑥) − 𝜇𝑏 𝑒ᵀ ln 𝑠 + ℎ(𝑥)ᵀ𝜆 + (𝑔(𝑥) + 𝑠)ᵀ𝜎 ,      (5.96)

where ln 𝑠 is an 𝑛 𝑔 -vector whose components are the logarithms of each


component of 𝑠, and 𝑒 = [1, . . . , 1] is an 𝑛 𝑔 -vector of 1s introduced to
express the sum in vector form. By taking derivatives with respect to 𝑥,
𝜆, 𝜎, and 𝑠, we derive the KKT conditions for this problem as

∇𝑓(𝑥) + 𝐽ℎ(𝑥)ᵀ𝜆 + 𝐽𝑔(𝑥)ᵀ𝜎 = 0
ℎ = 0
𝑔 + 𝑠 = 0                                                          (5.97)
−𝜇𝑏 𝑆⁻¹𝑒 + 𝜎 = 0 ,

where 𝑆 is a diagonal matrix whose diagonal entries are given by the slack variable vector, and therefore [𝑆⁻¹]𝑘𝑘 = 1/𝑠𝑘 . The result is a set of 𝑛𝑥 + 𝑛ℎ + 2𝑛𝑔 equations and the same number of variables.
To get a system of equations that is more favorable for Newton’s
method, we multiply the last equation by 𝑆 to obtain

∇𝑓(𝑥) + 𝐽ℎ(𝑥)ᵀ𝜆 + 𝐽𝑔(𝑥)ᵀ𝜎 = 0
ℎ = 0
𝑔 + 𝑠 = 0                                                          (5.98)
−𝜇𝑏 𝑒 + 𝑆𝜎 = 0 .

We now have a set of residual equations to which we can apply


Newton’s method, just like we did for SQP. Taking the Jacobian of the

residuals in Eq. 5.98, we obtain the linear system

[ 𝐻ℒ(𝑥)   𝐽ℎ(𝑥)ᵀ   𝐽𝑔(𝑥)ᵀ   0 ] [ 𝑠𝑥 ]     [ ∇𝑥ℒ(𝑥, 𝜆, 𝜎) ]
[ 𝐽ℎ(𝑥)   0        0        0 ] [ 𝑠𝜆 ]     [ ℎ(𝑥)         ]
[ 𝐽𝑔(𝑥)   0        0        𝐼 ] [ 𝑠𝜎 ] = − [ 𝑔(𝑥) + 𝑠     ] ,      (5.99)
[ 0       0        𝑆        Σ ] [ 𝑠𝑠 ]     [ 𝑆𝜎 − 𝜇𝑏 𝑒    ]
where Σ is a diagonal matrix whose entries are given by 𝜎, and 𝐼 is
the identity matrix. For numerical efficiency, we make the matrix
symmetric by multiplying the last equation by 𝑆−1 to get the symmetric
linear system, as follows:

[ 𝐻ℒ(𝑥)   𝐽ℎ(𝑥)ᵀ   𝐽𝑔(𝑥)ᵀ   0    ] [ 𝑠𝑥 ]     [ ∇𝑥ℒ(𝑥, 𝜆, 𝜎) ]
[ 𝐽ℎ(𝑥)   0        0        0    ] [ 𝑠𝜆 ]     [ ℎ(𝑥)         ]
[ 𝐽𝑔(𝑥)   0        0        𝐼    ] [ 𝑠𝜎 ] = − [ 𝑔(𝑥) + 𝑠     ] .      (5.100)
[ 0       0        𝐼        𝑆⁻¹Σ ] [ 𝑠𝑠 ]     [ 𝜎 − 𝜇𝑏 𝑆⁻¹𝑒  ]
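To make the block structure concrete, the following is a minimal NumPy sketch (our own illustration, not a production implementation) that assembles and solves Eq. 5.100 for the step; it assumes dense arrays, whereas a practical code would exploit sparsity and a symmetric factorization.

import numpy as np

def interior_point_step(H_L, J_h, J_g, grad_L, h, g, s, sigma, mu_b):
    # Assemble the symmetric system of Eq. 5.100 and solve for (s_x, s_lambda, s_sigma, s_s)
    nx, nh, ng = H_L.shape[0], J_h.shape[0], J_g.shape[0]
    n = nx + nh + 2 * ng
    K = np.zeros((n, n))
    K[:nx, :nx] = H_L
    K[:nx, nx:nx + nh] = J_h.T
    K[:nx, nx + nh:nx + nh + ng] = J_g.T
    K[nx:nx + nh, :nx] = J_h
    K[nx + nh:nx + nh + ng, :nx] = J_g
    K[nx + nh:nx + nh + ng, nx + nh + ng:] = np.eye(ng)
    K[nx + nh + ng:, nx + nh:nx + nh + ng] = np.eye(ng)
    K[nx + nh + ng:, nx + nh + ng:] = np.diag(sigma / s)  # S^-1 * Sigma
    rhs = -np.concatenate([grad_L, h, g + s, sigma - mu_b / s])
    step = np.linalg.solve(K, rhs)
    return np.split(step, [nx, nx + nh, nx + nh + ng])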

The advantage of this equivalent system is that we can use a linear solver specialized for symmetric matrices, which is more efficient than a solver for general linear systems. If we had applied Newton's method to the original KKT system (Eq. 5.97) and then made it symmetric, we would have obtained a term with 𝑆⁻², which would make the system more challenging than with the 𝑆⁻¹ term in Eq. 5.100. Figure 5.46 shows the structure and block sizes of the matrix.

[Fig. 5.46 Structure and shape of the interior-point system matrix from Eq. 5.100, with block rows and columns of sizes 𝑛𝑥, 𝑛ℎ, 𝑛𝑔, and 𝑛𝑔.]

5.6.1 Modifications to the Basic Algorithm

We can reuse many of the concepts covered under SQP, including quasi-Newton estimates of the Lagrangian Hessian and line searches with merit functions or filters. The merit function would usually be modified to a form more consistent with the formulation used in Eq. 5.95. For example, we could write a merit function as follows:

𝑓ˆ(𝑥) = 𝑓(𝑥) − 𝜇𝑏 Σ𝑖=1^𝑛𝑔 ln 𝑠𝑖 + (𝜇𝑝/2) ( ‖ℎ(𝑥)‖² + ‖𝑔(𝑥) + 𝑠‖² ) ,      (5.101)

where 𝜇𝑏 is the barrier parameter from Eq. 5.95, and 𝜇𝑝 is the penalty
parameter. Additionally, we must impose a maximum step size 𝛼max in the line search so that the implicit constraint 𝑠 > 0 is maintained. The maximum
allowed step size can be computed prior to the line search because we
know the value of 𝑠 and 𝑝 𝑠 and require that

𝑠 + 𝛼𝑝 𝑠 ≥ 0 . (5.102)

In practice, we enforce a fractional tolerance so that we do not get too


close to zero. For example, we could enforce the following:

𝑠 + 𝛼 max 𝑝 𝑠 = 𝜏𝑠 , (5.103)

where 𝜏 is a small value (e.g., 𝜏 = 0.005). The maximum step size is


the smallest positive value that satisfies this equation for all entries in
𝑠. A possible algorithm for determining the maximum step size for
feasibility is shown in Alg. 5.6.

Algorithm 5.6 Maximum step size for feasibility

Inputs:
𝑠: Current slack values
𝑝 𝑠 : Proposed step
𝜏: Fractional tolerance (e.g., 0.005)
Outputs:
𝛼max : Maximum feasible step length

𝛼max = 1
for 𝑖 = 1 to 𝑛 𝑔 do
𝛼 = (𝜏 − 1) 𝑠𝑖 / 𝑝𝑠,𝑖
if 𝛼 > 0 then
𝛼max = min(𝛼max , 𝛼)
end if
end for
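A direct Python translation of Alg. 5.6 is sketched below (the guard against a zero step component is our addition; any equivalent vectorized form works just as well):

import numpy as np

def alpha_max_feasibility(s, p_s, tau=0.005):
    # Largest step in (0, 1] such that s + alpha * p_s >= tau * s componentwise
    alpha_max = 1.0
    for s_i, p_i in zip(s, p_s):
        if p_i != 0.0:
            alpha = (tau - 1.0) * s_i / p_i
            if alpha > 0.0:
                alpha_max = min(alpha_max, alpha)
    return alpha_max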

The line search typically uses a simple backtracking approach


because we must enforce a maximum step length. After the line search,
we can update 𝑥 and 𝑠 as follows:

𝑥 𝑘+1 = 𝑥 𝑘 + 𝛼 𝑘 𝑝 𝑥 , where 𝛼 𝑘 ∈ (0, 𝛼 max ] (5.104)


𝑠 𝑘+1 = 𝑠 𝑘 + 𝛼 𝑘 𝑝 𝑠 . (5.105)

The Lagrange multipliers 𝜎 must also remain positive, so the pro-


cedure in Alg. 5.6 is repeated for 𝜎 to find the maximum step length
for the Lagrange multipliers 𝛼 𝜎 . Enforcing a maximum step size for
Lagrange multiplier updates was not necessary for the SQP method
because the QP subproblem handled the enforcement of nonnegative
Lagrange multipliers. We then update both sets of Lagrange multipliers
using this step size:

𝜆 𝑘+1 = 𝜆 𝑘 + 𝛼 𝜎 𝑝𝜆 (5.106)
𝜎 𝑘+1 = 𝜎 𝑘 + 𝛼 𝜎 𝑝 𝜎 . (5.107)

Finally, we need to update the barrier parameter 𝜇𝑏 . The simplest


approach is to decrease it by a multiplicative factor:
𝜇𝑏,𝑘+1 = 𝜌 𝜇𝑏,𝑘 ,      (5.108)

where 𝜌 is typically around 0.2. Better methods are adaptive based on how well the optimizer is progressing. There are other implementation details for improving robustness that can be found in the literature.104,105 The steps for a basic interior-point method are detailed in Alg. 5.7.† This version focuses on a line search approach, but there are variations of interior-point methods that use the trust-region approach.

† IPOPT is an open-source nonlinear interior-point method.106 The commercial packages Knitro102 and fmincon mentioned earlier also include interior-point methods.
104. Wächter and Biegler, On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming, 2005.
105. Byrd et al., An interior point algorithm for large-scale nonlinear programming, 1999.
106. Wächter and Biegler, On the implementation of a primal-dual interior point filter line search algorithm for large-scale nonlinear programming, 2006.

Algorithm 5.7 Interior-point method with a quasi-Newton approximation

Inputs:
𝑥0 : Starting point
𝜏opt : Optimality tolerance
𝜏feas : Feasibility tolerance
Outputs:
𝑥 ∗ : Optimal point
𝑓 (𝑥 ∗ ): Optimal function value

𝜆0 = 0; 𝜎0 = 0 Initial Lagrange multipliers


𝑠0 = 1 Initial slack variables
𝐻˜ ℒ0 = 𝐼 Initialize Hessian of Lagrangian approximation to identity matrix
𝑘=0
while k∇𝑥 ℒ k ∞ > 𝜏opt or k ℎ k ∞ > 𝜏feas do
Evaluate 𝐽 ℎ , 𝐽 𝑔 , ∇𝑥 ℒ
Solve the KKT system (Eq. 5.100) for 𝑝
[ 𝐻˜ℒ𝑘    𝐽ℎᵀ   𝐽𝑔ᵀ   0    ] [ 𝑝𝑥 ]     [ ∇𝑥ℒ(𝑥, 𝜆, 𝜎) ]
[ 𝐽ℎ(𝑥)   0     0     0    ] [ 𝑝𝜆 ]     [ ℎ(𝑥)         ]
[ 𝐽𝑔(𝑥)   0     0     𝐼    ] [ 𝑝𝜎 ] = − [ 𝑔(𝑥) + 𝑠     ]
[ 0       0     𝐼     𝑆⁻¹Σ ] [ 𝑝𝑠 ]     [ 𝜎 − 𝜇𝑏 𝑆⁻¹𝑒  ]

𝛼max = alphamax(𝑠, 𝑝 𝑠 ) Use Alg. 5.6


𝛼 𝑘 = backtrack(𝑝 𝑥 , 𝑝 𝑠 , 𝛼max ) Line search (Alg. 4.2) with merit function (Eq. 5.101)
𝑥 𝑘+1 = 𝑥 𝑘 + 𝛼 𝑘 𝑝 𝑥 Update design variables
𝑠 𝑘+1 = 𝑠 𝑘 + 𝛼 𝑘 𝑝 𝑠 Update slack variables
𝛼 𝜎 = alphamax(𝜎, 𝑝 𝜎 )
𝜆 𝑘+1 = 𝜆 𝑘 + 𝛼 𝜎 𝑝𝜆 Update equality Lagrange multipliers
𝜎 𝑘+1 = 𝜎 𝑘 + 𝛼 𝜎 𝑝 𝜎 Update inequality Lagrange multipliers
Update 𝐻˜ ℒ 𝑘+1 Compute quasi-Newton approximation using Eq. 5.91
𝜇𝑏 = 𝜌𝜇𝑏 Reduce barrier parameter
𝑘 = 𝑘+1
end while

5.6.2 SQP Comparisons and Examples


Both interior-point methods and SQP are considered state-of-the-art
approaches for solving nonlinear constrained optimization problems.
Each of these two methods has its strengths and weaknesses. The KKT
system structure is identical at each iteration for interior-point methods,
so we can exploit this structure for improved computational efficiency.
SQP is not as amenable to this because changes in the working set cause
the system’s structure to change between iterations. The downside of
the interior-point structure is that turning all constraints into equalities
means that all constraints must be included at every iteration, even if
they are inactive. In contrast, active-set SQP only needs to consider a
subset of the constraints, reducing the subproblem size.
Active-set SQP methods are generally more effective for medium-
scale problems, whereas interior-point methods are more effective
for large-scale problems. Interior-point methods are usually more
sensitive to the initial starting point and the scaling of the problem.
Therefore, SQP methods are usually more suitable for solving sequences
of warm-started problems.79,107 These are just general guidelines; both approaches should be considered and tested for a given problem of interest.

79. Nocedal and Wright, Numerical Optimization, 2006.
107. Gill et al., On the performance of SQP methods for nonlinear optimization, 2015.

Example 5.14 Numerical solution of graphical solution example

Recall the constrained problem with a quadratic objective and quadratic


constraints introduced in Ex. 5.1. Instead of finding an approximate solution
graphically or trying to solve this analytically, we can now solve this numerically
using SQP or the interior-point method. The resulting optimization paths are
shown in Fig. 5.47. These results are only illustrative; paths and iterations can
vary significantly with the starting point and algorithmic parameters.

[Fig. 5.47 Numerical solution of the problem solved graphically in Ex. 5.1, showing the optimization paths for sequential quadratic programming (left) and the interior-point method (right).]

Example 5.15 Interior-point method applied to inequality constrained


problem
Here we solve Ex. 5.4 using the interior-point method (Alg. 5.7) starting
from 𝑥0 = [2, 1]. The initial Lagrange multiplier is 𝜎 = 0, and the initial slack
variable is 𝑠 = 1. Starting with a barrier parameter of 𝜇𝑏 = 20 results in the
iterations shown in Fig. 5.48.
For the first iteration, differentiating the Lagrangian with respect to 𝑥 yields

                [ 1 + ½𝜎𝑥1 ]   [ 1 ]
∇𝑥ℒ(𝑥1, 𝑥2) =   [ 2 + 2𝜎𝑥2 ] = [ 2 ] ,

and the gradient of the constraint is

               [ ½𝑥1 ]   [ 1 ]
∇𝑔(𝑥1, 𝑥2) =   [ 2𝑥2 ] = [ 2 ] .

The interior-point system of equations (Eq. 5.100) at the starting point is

[ 1  0  1  0 ] [ 𝑠𝑥1 ]   [ −1 ]
[ 0  1  2  0 ] [ 𝑠𝑥2 ]   [ −2 ]
[ 1  2  0  1 ] [ 𝑠𝜎  ] = [ −2 ] .
[ 0  0  1  0 ] [ 𝑠𝑠  ]   [ 20 ]

[Fig. 5.48 Interior-point algorithm iterations (19 iterations from 𝑥0 to 𝑥∗).]
The solution is 𝑠 = [−21, −42, 20, 103]. Performing a line search in the direction
𝑝 = [−21, −42] yields 𝑥1 = [1.34375, −0.3125]. The Lagrange multiplier and
slack variable are updated to 𝜎1 = 20 and 𝑠 1 = 104, respectively.
To update the approximate Hessian 𝐻˜ ℒ 𝑘 , we use the damped BFGS update
(Eq. 5.93) to ensure that 𝐻˜ℒ𝑘 is positive definite. By comparing 𝑠0ᵀ𝑦0 = 73.21 and 𝑠0ᵀ𝐻˜ℒ0 𝑠0 = 2.15, we can see that 𝑠𝑘ᵀ𝑦𝑘 ≥ 0.2 𝑠𝑘ᵀ𝐻˜ℒ𝑘 𝑠𝑘 , and therefore, we do a full BFGS update with 𝜃0 = 1 and 𝑟0 = 𝑦0 . Using the quasi-Newton update (Eq. 5.91), we get the approximate Hessian:

          [ 1.388    4.306 ]
𝐻˜ℒ1 =    [ 4.306   37.847 ] .

We reduce the barrier parameter 𝜇 by a factor of 2 at each iteration. This process


is repeated for subsequent iterations.
The starting point is infeasible, but the algorithm finds a feasible point after
the first iteration. From then on, it approaches the optimum from within the
feasible region, as shown in Fig. 5.48.

Example 5.16 Constrained spring system

Consider the spring system from Ex. 4.17, which is an unconstrained


optimization problem. We can constrain the spring system by attaching two
cables as shown in Fig. 5.49, where ℓ 𝑐1 = 9 m, ℓ 𝑐2 = 6 m, 𝑦 𝑐 = 2 m, 𝑥 𝑐1 = 7 m,
and 𝑥 𝑐2 = 3 m.

[Fig. 5.49 Spring system constrained by two cables.]

Because the cables do not resist compression forces, they correspond to


inequality constraints, yielding the following problem:
minimize (over 𝑥1, 𝑥2)   ½𝑘1 (√((ℓ1 + 𝑥1)² + 𝑥2²) − ℓ1)² + ½𝑘2 (√((ℓ2 − 𝑥1)² + 𝑥2²) − ℓ2)² − 𝑚𝑔𝑥2
subject to               √((𝑥1 + 𝑥𝑐1)² + (𝑥2 + 𝑦𝑐)²) ≤ ℓ𝑐1
                         √((𝑥1 − 𝑥𝑐2)² + (𝑥2 + 𝑦𝑐)²) ≤ ℓ𝑐2 .

The optimization paths for SQP and the interior-point method are shown in
Fig. 5.50.

[Fig. 5.50 Optimization of the constrained spring system: sequential quadratic programming (left) and interior-point method (right).]

5.7 Constraint Aggregation

As will be discussed in Chapter 6, some derivative computation methods


are efficient for problems with many inputs and few outputs, and others
are advantageous for problems with few inputs and many outputs.
Thus, if we have many design variables and many constraints, there is
no efficient way to compute the required constraint Jacobian.
One workaround is to aggregate the constraints and solve the op-
timization problem with a new set of constraints. Each aggregation

would have the form

𝑔¯ (𝑥) ≡ 𝑔¯ (𝑔(𝑥)) ≤ 0 , (5.109)

where 𝑔¯ is a scalar, and 𝑔 is the vector of constraints we want to


aggregate. One of the properties we want for the aggregation function
is that if any of the original constraints are violated, then 𝑔¯ > 0.
One way to aggregate constraints would be to define the aggregated
constraint function as the maximum of all constraints,

𝑔¯ (𝑥) = max(𝑔(𝑥)) . (5.110)

If max(𝑔(𝑥)) ≤ 0, then we know that all of the components of 𝑔(𝑥) are less than or equal to zero.


However, the maximum function is not differentiable, so it is not
desirable for gradient-based optimization. In the rest of this section,
we introduce several viable functions for constraint aggregation that
are differentiable.
The Kreisselmeier–Steinhauser (KS) aggregation was one of the
first aggregation functions proposed for optimization and is defined as
follows:108

𝑔¯KS(𝜌, 𝑔) = (1/𝜌) ln ( Σ𝑗=1^𝑛𝑔 exp(𝜌𝑔𝑗) ) ,      (5.111)

108. Kreisselmeier and Steinhauser, Systematic control design by optimizing a vector performance index, 1979.
where 𝜌 is an aggregation factor that determines how close this function
is to the maximum function (Eq. 5.110). As 𝜌 → ∞, 𝑔¯ KS (𝜌, 𝑔) → max(𝑔).
However, as 𝜌 increases, the curvature of 𝑔¯ increases, which can cause
ill-conditioning in the optimization.
The exponential function disproportionately weighs the higher
positive values in the constraint vector, but it does so in a smooth way.
Because the exponential function can easily result in overflow, it is
preferable to use the alternate (but equivalent) form of the KS function,

  
𝑔¯KS(𝜌, 𝑔) = max𝑗 𝑔𝑗 + (1/𝜌) ln ( Σ𝑗=1^𝑛𝑔 exp(𝜌 (𝑔𝑗 − max𝑗 𝑔𝑗)) ) .      (5.112)

The value of 𝜌 should be tuned for each problem, but 𝜌 = 100 works
well for many problems.
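A sketch of the KS aggregation in its numerically stable form (Eq. 5.112) follows; the test values are arbitrary:

import numpy as np

def ks_aggregate(g, rho=100.0):
    # Equivalent to (1/rho) * log(sum(exp(rho * g))) but avoids overflow
    g_max = np.max(g)
    return g_max + np.log(np.sum(np.exp(rho * (g - g_max)))) / rho

g = np.array([-1.0, -0.2, 0.3])
print(ks_aggregate(g, rho=100.0))  # slightly above max(g) = 0.3; approaches it as rho grows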

Example 5.17 Constrained spring system with aggregated constraints

Consider the constrained spring system from Ex. 5.16. Aggregating the two
constraints using the KS function, we can formulate a single constraint as
𝑔¯KS(𝑥1, 𝑥2) = (1/𝜌) ln ( exp(𝜌𝑔1(𝑥1, 𝑥2)) + exp(𝜌𝑔2(𝑥1, 𝑥2)) ) ,

where

𝑔1(𝑥1, 𝑥2) = √((𝑥1 + 𝑥𝑐1)² + (𝑥2 + 𝑦𝑐)²) − ℓ𝑐1
𝑔2(𝑥1, 𝑥2) = √((𝑥1 − 𝑥𝑐2)² + (𝑥2 + 𝑦𝑐)²) − ℓ𝑐2 .
Figure 5.51 shows the contour of 𝑔¯ KS = 0 for increasing values of the aggregation
parameter 𝜌.

[Fig. 5.51 KS function aggregation of two constraints for 𝜌KS = 2 (𝑓∗KS = −19.448), 𝜌KS = 10 (𝑓∗KS = −21.653), and 𝜌KS = 100 (𝑓∗KS = −22.090). The optimum of the problem with aggregated constraints, 𝑥∗KS, approaches the true optimum as the aggregation parameter 𝜌KS increases.]

For the lowest value of 𝜌, the feasible region is reduced, resulting in a conservative optimum. For the highest value of 𝜌, the optimum obtained with constraint aggregation is graphically indistinguishable from the true optimum, and the objective function value approaches the true optimal value of −22.1358.

The 𝑝-norm aggregation function is another option for aggregation


and is defined as follows:109

𝑔¯PN(𝜌) = max𝑗 |𝑔𝑗| ( Σ𝑗=1^𝑛𝑔 ( 𝑔𝑗 / max𝑗 |𝑔𝑗| )^𝜌 )^(1/𝜌) .      (5.113)

109. Duysinx and Bendsøe, Topology optimization of continuum structures with local stress constraints, 1998.
The absolute value in this equation can be an issue if 𝑔 can take both
positive and negative values because the function is not differentiable
in regions where 𝑔 transitions from positive to negative.
A class of aggregation functions known as induced functions was
designed to provide more accurate estimates of max(𝑔) for a given
value of 𝜌 than the KS and 𝑝-norm functions.110 There are two main types of induced functions: one uses exponentials, and the other uses powers. The induced exponential function is given by

𝑔¯IE(𝜌) = ( Σ𝑗=1^𝑛𝑔 𝑔𝑗 exp(𝜌𝑔𝑗) ) / ( Σ𝑗=1^𝑛𝑔 exp(𝜌𝑔𝑗) ) .      (5.114)

110. Kennedy and Hicken, Improved constraint-aggregation methods, 2015.

The induced power function is given by


𝑔¯IP(𝜌) = ( Σ𝑗=1^𝑛𝑔 𝑔𝑗^(𝜌+1) ) / ( Σ𝑗=1^𝑛𝑔 𝑔𝑗^𝜌 ) .      (5.115)

The induced power function is only applicable if 𝑔 𝑗 ≥ 0 for 𝑗 = 1, . . . , 𝑛 𝑔 .

5.8 Summary

Most engineering design problems are constrained. When formulating


a problem, practitioners should be critical of their choice of objective
function and constraints. Metrics that should be constraints are often
wrongly formulated as objectives. A constraint should not limit the
design unnecessarily and should reflect the underlying physical reason
for that constraint as much as possible.
The first-order optimality conditions for constrained problems—the
KKT conditions—require the gradient of the objective to be a linear
combination of the gradients of the constraints. This ensures that there
is no feasible descent direction. Each constraint is associated with
a Lagrange multiplier that quantifies how significant that constraint
is at the optimum. For inequality constraints, a Lagrange multiplier
that is zero means that the corresponding constraint is inactive. For
inequality constraints, slack variables quantify how close a constraint
is to becoming active; a slack variable that is zero means that the
corresponding constraint is active. Lagrange multipliers and slack
variables are unknowns that need to be solved together with the
design variables. The complementary slackness condition introduces a
combinatorial problem that is challenging to solve.
Penalty methods solve constrained problems by adding a metric
to the objective function quantifying how much the constraints are
violated. These methods are helpful as a conceptual model and are
used in gradient-free optimization algorithms (Chapter 7). However,
penalty methods only find approximate solutions and are subject to
numerical issues when used with gradient-based optimization.
Methods based on the KKT conditions are preferable. The most
widely used among such methods are SQP and interior-point methods.
These methods apply Newton’s method to the KKT conditions. One
primary difference between these two methods is in the treatment of
inequality constraints. SQP methods distinguish between active and
inactive constraints, treating potentially active constraints as equality
constraints and ignoring the potentially inactive ones. Interior-point
methods add slack variables to force all constraints to behave like
equality constraints.

Problems

5.1 Answer true or false and correct the false statements.

a. Penalty methods are among the most effective methods for


constrained optimization.
b. For an equality constraint in 𝑛-dimensional space, all feasible
directions about a point are perpendicular to the constraint
gradient at that point and define a hyperplane with dimen-
sion 𝑛 − 1.
c. The feasible directions about a point on an inequality con-
straint define an open half-space whose dividing hyperplane
is perpendicular to the gradient of the constraint at that
point.
d. A point is optimal if there is only one feasible direction that
is also a descent direction.
e. For an inequality constrained problem, if we replace the
inequalities that are active at the optimum with equality
constraints and ignore the inactive constraints, we get the
same optimum.
f. For a point to be optimal, the Lagrange multipliers for both
the equality constraint and the active inequality constraints
must be positive.
g. The complementary slackness conditions are easy to solve
for because either the Lagrange multiplier is zero or the slack
variable is zero.
h. At the optimum of a constrained problem, the Hessian of
the Lagrangian function must be positive semidefinite.
i. The Lagrange multipliers represent the change in the objec-
tive function we would get for a perturbation in the constraint
value.
j. SQP seeks to find the solution of the KKT system.
k. Interior-point methods must start with a point in the interior
of the feasible region.
l. Constraint aggregation combines multiple constraints into a
single constraint that is equivalent.

5.2 Let us modify Ex. 5.2 so that the equality constraint is the negative
of the original one—that is,
ℎ(𝑥1, 𝑥2) = −¼𝑥1² − 𝑥2² + 1 = 0 .

Classify the critical points and compare them with the original
solution. What does that tell you about the significance of the
Lagrange multiplier sign?

5.3 Similar to the previous exercise, consider Ex. 5.4 and modify it
so that the inequality constraint is the negative of the original
one—that is,
𝑔(𝑥1, 𝑥2) = −¼𝑥1² − 𝑥2² + 1 ≤ 0 .
Classify the critical points and compare them with the original
solution.

5.4 Consider the following optimization problem:

minimize    𝑥1² + 3𝑥2² + 4
by varying  𝑥1, 𝑥2                                            (5.116)
subject to  𝑥2 ≥ 1
            𝑥1² + 4𝑥2² ≤ 4 .

Find the optimum analytically.

5.5 Find the rectangle of maximum area that can be inscribed in an


ellipse. Give your answer in terms of the ratio of the two areas.
Check that your answer is intuitively correct for the special case
of a rectangle inscribed in a circle.

5.6 In Section 2.1, we mentioned that Euclid showed that among


rectangles of a given perimeter, the square has the largest area.
Formulate the problem and solve it analytically. What are the 𝐹

units in this problem, and what is the physical interpretation of


the Lagrange multiplier? Exploration: Show that if you minimize
the perimeter with an area constrained to the optimal value you
found previously, you get the same solution.

5.7 Column in compression. Consider a thin-walled tubular column subjected to a compression force, as shown in Fig. 5.52. We want to minimize the mass of the column while ensuring that the structure does not yield or buckle under a compression force of magnitude 𝐹. The design variables are the radius of the tube (𝑅) and the wall thickness (𝑡).

[Fig. 5.52 Slender tubular column in compression.]

This design optimization problem can

be stated as follows:
minimize    2𝜌ℓ𝜋𝑅𝑡                        mass
by varying  𝑅, 𝑡                          radius, wall thickness
subject to  𝐹/(2𝜋𝑅𝑡) − 𝜎yield ≤ 0          yield stress
            𝐹 − 𝜋³𝐸𝑅³𝑡/(4ℓ²) ≤ 0           buckling load
In the formula for the mass in this objective, 𝜌 is the material
density, and we assume that 𝑡  𝑅. The first constraint is the
compressive stress, which is simply the force divided by the cross-
sectional area. The second constraint uses Euler’s critical buckling
load formula, where 𝐸 is the material Young’s modulus, and the
second moment of area is replaced with the one corresponding
to a circular cross section (𝐼 = 𝜋𝑅 3 𝑡).
Find the optimum 𝑅 and 𝑡 as a function of the other parameters.
Pick reasonable values for the parameters, and verify your solution
graphically. Plot the gradients of the objective and constraints at
the optimum, and verify the Lagrange multipliers graphically.
5.8 Beam with H section. Consider a cantilevered beam with an H-shaped cross section composed of a web and flanges subject to a transverse load, as shown in Fig. 5.53. The objective is to minimize the structural weight by varying the web thickness 𝑡𝑤 and the flange thickness 𝑡𝑏 , subject to stress constraints. The other cross-sectional parameters are fixed; the web height ℎ is 250 mm, and the flange width 𝑏 is 125 mm. The axial stress in the flange and the shear stress in the web should not exceed the corresponding yield values (𝜎yield = 200 MPa and 𝜏yield = 116 MPa, respectively). The optimization problem can be stated as follows:
minimize    2𝑏𝑡𝑏 + ℎ𝑡𝑤                     mass
by varying  𝑡𝑏 , 𝑡𝑤                        flange and web thicknesses
subject to  𝑃ℓℎ/(2𝐼) − 𝜎yield ≤ 0           axial stress
            1.5𝑃/(ℎ𝑡𝑤) − 𝜏yield ≤ 0         shear stress

[Fig. 5.53 Cantilever beam with H section (𝑃 = 100 kN, ℓ = 1 m).]

The second moment of area for the H section is

𝐼 = (ℎ³/12) 𝑡𝑤 + (𝑏/6) 𝑡𝑏³ + (ℎ²𝑏/2) 𝑡𝑏 .
Find the optimal values of 𝑡𝑏 and 𝑡𝑤 by solving the KKT conditions
analytically. Plot the objective contours and constraints to verify
your result graphically.

5.9 Penalty method implementation. Program one or more penalty


methods from Section 5.4.

a. Solve the constrained problem from Ex. 5.6 as a first test of


your implementation. Use an existing software package for
the optimization subproblem or the unconstrained optimizer
you implemented in Prob. 4.9. How far can you push the
penalty parameter until the optimizer fails? How close can
you get to the exact optimum? Try different starting points
and verify that the algorithm always converges to the same
optimum.
b. Solve Prob. 5.3.
c. Solve Prob. 5.11.
d. Exploration: Solve any other problem from this section or a
problem of your choosing.

5.10 Constrained optimizer implementation. Program an SQP or interior-


point algorithm. You may repurpose the BFGS algorithm that you
implemented in Prob. 4.9. For SQP, start by implementing only
equality constraints, reformulating test problems with inequality
constraints as problems with only equality constraints.

a. Reproduce the results from Ex. 5.12 (SQP) or Ex. 5.15 (interior
point).
b. Solve Prob. 5.3.
c. Solve Prob. 5.11.
d. Compare the computational cost, precision, and robustness
of your optimizer with those of an existing software package.
5.11 Aircraft fuel tank. A jet aircraft needs to carry a streamlined external fuel tank with a required volume. The tank shape is approximated as an ellipsoid (Fig. 5.54).

[Fig. 5.54 Ellipsoid fuel tank with length ℓ and diameter 𝑑.]

We want to minimize the drag of the fuel tank by varying its length and diameter—that is:

minimize 𝐷(ℓ , 𝑑)
by varying ℓ , 𝑑
subject to 𝑉req − 𝑉(ℓ , 𝑑) ≤ 0 .

The drag is given by

𝐷 = ½ 𝜌𝑣²𝐶𝐷 𝑆 ,

where the air density is 𝜌 = 0.55 kg/m3 , and the aircraft speed is
𝑣 = 300 m/s. The drag coefficient of an ellipsoid can be estimated
as∗

𝐶𝐷 = 𝐶𝑓 [ 1 + 1.5 (𝑑/ℓ)^(3/2) + 7 (𝑑/ℓ)³ ] .

∗ Hoerner111 provides this approximation on page 6-17.
111. Hoerner, Fluid-Dynamic Drag, 1965.

We assume a friction coefficient of 𝐶 𝑓 = 0.0035. The drag is


proportional to the surface area of the tank, which, for an ellipsoid,
is

𝑆 = (𝜋/2) 𝑑² ( 1 + (ℓ/(𝑑𝑒)) arcsin 𝑒 ) ,

where 𝑒 = √(1 − 𝑑²/ℓ²) . The volume of the fuel tank is
𝑉 = (𝜋/6) 𝑑² ℓ ,
and the required volume is 𝑉req = 2.5 m3 .
Find the optimum tank length and diameter numerically using
your own optimizer or a software package. Verify your solution
graphically by plotting the objective function contours and the
constraint.

5.12 Solve a variation of Ex. 5.16 where we replace the system of cables
with a cable and a rod that resists both tension and compression.
The cable is positioned above the spring, as shown in Fig. 5.55,
where 𝑥 𝑐 = 2 m, and 𝑦 𝑐 = 3 m, with a maximum length of
ℓ 𝑐 = 7.0 m. The rod is positioned at 𝑥 𝑟 = 2 m and 𝑦𝑟 = 4 m,
with a length of ℓ𝑟 = 4.5 m. How does this change the problem formulation? Does the optimum change?

[Fig. 5.55 Spring system constrained by a cable and a rod.]


5.13 Three-bar truss. Consider the truss shown in Fig. 5.56. The truss is subjected to a load 𝑃, and we want to minimize the mass of the structure subject to stress and buckling constraints.†

† This is a well-known optimization problem formulated by Schmit32 when he first proposed integrating numerical optimization with finite-element structural analysis.
32. Schmit, Structural design by systematic synthesis, 1960.

The axial

stresses in each bar are

𝜎1 = (1/√2) [ 𝑃 cos 𝜃 / 𝐴𝑜 + 𝑃 sin 𝜃 / (𝐴𝑜 + √2 𝐴𝑚) ]
𝜎2 = √2 𝑃 sin 𝜃 / (𝐴𝑜 + √2 𝐴𝑚)
𝜎3 = (1/√2) [ 𝑃 sin 𝜃 / (𝐴𝑜 + √2 𝐴𝑚) − 𝑃 cos 𝜃 / 𝐴𝑜 ] ,

where 𝐴𝑜 is the cross-sectional area of the outer bars 1 and 3, and 𝐴𝑚 is the cross-sectional area of the middle bar 2. The full optimization problem for the three-bar truss is as follows:

minimize    𝜌ℓ (2√2 𝐴𝑜 + 𝐴𝑚)                    mass
by varying  𝐴𝑜 , 𝐴𝑚                              cross-sectional areas
subject to  𝐴min − 𝐴𝑜 ≤ 0                        area lower bounds
            𝐴min − 𝐴𝑚 ≤ 0
            𝜎1 − 𝜎yield ≤ 0                      stress constraints
            𝜎2 − 𝜎yield ≤ 0
            𝜎3 − 𝜎yield ≤ 0
            −𝜎1 − 𝜋²𝐸𝛽𝐴𝑜/(2ℓ²) ≤ 0               buckling constraints
            −𝜎2 − 𝜋²𝐸𝛽𝐴𝑚/(2ℓ²) ≤ 0
            −𝜎3 − 𝜋²𝐸𝛽𝐴𝑜/(2ℓ²) ≤ 0

[Fig. 5.56 Three-bar truss elements (bars 1, 2, 3; ℓ = 0.5 m, 𝜃 = 55 deg, 𝑃 = 500 kN).]
In the buckling constraints, 𝛽 relates the second moment of area to
the area (𝐼 = 𝛽𝐴2 ) and is dependent on the cross-sectional shape
of the bars. Assuming a square cross section, 𝛽 = 1/12. The bars
are made out of an aluminum alloy with the following properties:
𝜌 = 2710 kg/m3 , 𝐸 = 69 GPa, 𝜎yield = 110 MPa.
Find the optimal bar cross-sectional areas using your own opti-
mizer or a software package. Which constraints are active? Verify
your result graphically. Exploration: Try different combinations
of unit magnitudes (e.g., Pa versus MPa for the stresses) for the
functions of interest and the design variables to observe the effect
of scaling.

5.14 Solve the same three-bar truss optimization problem in Prob. 5.13
by aggregating all the constraints into a single constraint. Try

different aggregation parameters and see how close you can get
to the solution you obtained for Prob. 5.13.

5.15 Ten-bar truss. Consider the 10-bar truss structure described in


Appendix D.2.2. The full design optimization problem is as
follows:
minimize    𝜌 Σ𝑖=1^10 𝐴𝑖 ℓ𝑖                      mass
by varying  𝐴𝑖 , 𝑖 = 1, . . . , 10               cross-sectional areas
subject to  𝐴𝑖 ≥ 𝐴min                            minimum area
            |𝜎𝑖| ≤ 𝜎𝑦 ,  𝑖 = 1, . . . , 10        stress constraints

Find the optimal mass and corresponding cross-sectional areas


using your own optimizer or a software package. Show a conver-
gence plot. Report the number of function evaluations and the
number of major iterations. Exploration: Restart from different
starting points. Do you get more than one local minimum? What
can you conclude about the multimodality of the design space?

5.16 Solve the same 10-bar truss optimization problem of Prob. 5.15
by aggregating all the constraints into a single constraint. Try
different aggregation parameters and see how close you can get
to the solution you obtained for Prob. 5.15.

5.17 Consider the aircraft wing design problem described in Ap-


pendix D.1.6. Now we will add a constraint on the bending stress
at the root of the wing, as described in Ex. 1.3.
We derive the bending stress using the one-dimensional beam
bending theory. Assuming that the lift distribution is uniform,
the load per unit length is 𝐿/𝑏. We can consider the wing as a
cantilever of length 𝑏/2. The bending moment at the wing root is

𝑀 = (𝐿/𝑏)(𝑏/2)² / 2 = 𝐿𝑏/8 .

Now we assume that the wing structure has the H-shaped cross
section from Prob. 5.8 with a constant thickness of 𝑡𝑤 = 𝑡𝑏 = 4 mm.
We relate the cross-section height ℎsec and width 𝑏sec to the chord
as ℎ sec = 0.1𝑐 and 𝑏sec = 0.4𝑐. With these assumptions, we can
compute the second moment of area 𝐼 in terms of 𝑐.
The maximum bending stress is then
𝜎max = 𝑀 ℎsec / (2𝐼) .

Considering the safety factor of 1.5 and the ultimate load factor
of 2.5, the stress constraint is
2.5 𝜎max − 𝜎yield/1.5 ≤ 0 ,
where 𝜎yield = 200 MPa.
Solve this problem and compare the solution with the uncon-
strained optimum. Plot the objective contours and constraint to
verify your result graphically.
6 Computing Derivatives
The gradient-based optimization methods introduced in Chapters 4
and 5 require the derivatives of the objective and constraints with
respect to the design variables, as illustrated in Fig. 6.1. Derivatives
also play a central role in other numerical algorithms. For example, the
Newton-based methods introduced in Section 3.8 require the derivatives
of the residuals.
The accuracy and computational cost of the derivatives are critical for the success of these methods. Gradient-based methods are only efficient when the derivative computation is also efficient. The computation of derivatives can be the bottleneck in the overall optimization procedure, especially when the model solver needs to be called repeatedly. This chapter introduces the various methods for computing derivatives and discusses the relative advantages of each method.

[Fig. 6.1 Efficient derivative computation is crucial for the overall efficiency of gradient-based optimization: the optimizer passes 𝑥 to the model, which returns 𝑓, 𝑔, while the derivative computation returns ∇𝑓, 𝐽𝑔.]
By the end of this chapter you should be able to:

1. List the methods for computing derivatives.


2. Explain the pros and cons of these methods.
3. Implement the methods for some computational models.
4. Understand how the methods are connected through the
unified derivatives equation.

6.1 Derivatives, Gradients, and Jacobians

The derivatives we focus on are first-order derivatives of one or more


functions of interest ( 𝑓 ) with respect to a vector of variables (𝑥). In
the engineering optimization literature, the term sensitivity analysis is
often used to refer to the computation of derivatives, and derivatives
are sometimes referred to as sensitivity derivatives or design sensitivities.
Although these terms are not incorrect, we prefer to use the more
specific and concise term derivative.


For the sake of generality, we do not specify which functions we want


to differentiate in this chapter (which could be an objective, constraints,
residuals, or any other function). Instead, we refer to the functions
being differentiated as the functions of interest and represent them as a
vector-valued function, 𝑓 = [ 𝑓1 , 𝑓2 , . . . , 𝑓𝑛 𝑓 ]. Neither do we specify the
variables with respect to which we differentiate (which could be design
variables, state variables, or any other independent variable).
The derivatives of all the functions of interest with respect to all the
variables form a Jacobian matrix,
               [ ∇𝑓1ᵀ   ]   [ 𝜕𝑓1/𝜕𝑥1     ···   𝜕𝑓1/𝜕𝑥𝑛𝑥   ]
𝐽𝑓 = 𝜕𝑓/𝜕𝑥 =   [   ⋮    ] = [    ⋮         ⋱       ⋮       ] ,      (6.1)
               [ ∇𝑓𝑛𝑓ᵀ  ]   [ 𝜕𝑓𝑛𝑓/𝜕𝑥1   ···   𝜕𝑓𝑛𝑓/𝜕𝑥𝑛𝑥  ]

which is an (𝑛 𝑓 × 𝑛 𝑥 ) rectangular matrix where each row corresponds to


the gradient of each function with respect to all the variables. Row 𝑖 of
the Jacobian is the gradient of function 𝑓𝑖 . Each column in the Jacobian
is called the tangent with respect to a given variable 𝑥 𝑗 . The Jacobian
can be related to the ∇ operator as follows:
               [ 𝑓1   ]                         [ 𝜕𝑓1/𝜕𝑥1     ···   𝜕𝑓1/𝜕𝑥𝑛𝑥   ]
𝐽𝑓 = 𝑓 ∇ᵀ =    [  ⋮   ] [ 𝜕/𝜕𝑥1  ···  𝜕/𝜕𝑥𝑛𝑥 ] = [    ⋮         ⋱       ⋮       ] .      (6.2)
               [ 𝑓𝑛𝑓  ]                         [ 𝜕𝑓𝑛𝑓/𝜕𝑥1   ···   𝜕𝑓𝑛𝑓/𝜕𝑥𝑛𝑥  ]

Example 6.1 Jacobian of a vector-valued function

Consider the following function with two variables and two functions of
interest:

         [ 𝑓1(𝑥1, 𝑥2) ]   [ 𝑥1𝑥2 + sin 𝑥1 ]
𝑓(𝑥) =   [ 𝑓2(𝑥1, 𝑥2) ] = [ 𝑥1𝑥2 + 𝑥2²     ] .
We can differentiate this symbolically to obtain exact reference values:

𝜕𝑓/𝜕𝑥 = [ 𝑥2 + cos 𝑥1    𝑥1        ]
        [ 𝑥2              𝑥1 + 2𝑥2  ] .
We evaluate this at 𝑥 = (𝜋/4, 2), which yields

𝜕𝑓/𝜕𝑥 = [ 2.707   0.785 ]
        [ 2.000   4.785 ] .
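This Jacobian is simple enough to reproduce with a symbolic toolbox; the following SymPy sketch is one way to generate the reference values above (our illustration, not part of the book's code).

import sympy as sp

x1, x2 = sp.symbols("x1 x2")
f = sp.Matrix([x1 * x2 + sp.sin(x1), x1 * x2 + x2**2])
J = f.jacobian([x1, x2])                       # symbolic Jacobian
print(J.subs({x1: sp.pi / 4, x2: 2}).evalf(4)) # Matrix([[2.707, 0.7854], [2.000, 4.785]])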

6.2 Overview of Methods for Computing Derivatives

We can classify the methods for computing derivatives according to the


representation used for the numerical model. There are three possible
representations, as shown in Fig. 6.2. In one extreme, we know nothing
about the model and consider it a black box where we can only control
the inputs and observe the outputs (Fig. 6.2, left). In this chapter, we
often refer to 𝑥 as the input variables and 𝑓 as the output variables. When
this is the case, we can only compute derivatives using finite differences
(Section 6.4).
In the other extreme, we have access to the source code used to
compute the functions of interest and perform the differentiation line by
line (Fig. 6.2, right). This is the essence of the algorithmic differentiation
approach (Section 6.6). The complex-step method (Section 6.5) is related
to algorithmic differentiation, as explained in Section 6.6.5.
In the intermediate case, we consider the model residuals and
states (Fig. 6.2, middle), which are the quantities required to derive
and implement implicit analytic methods (Section 6.7). When the
model can be represented with multiple components, we can use a
coupled derivative approach (Section 13.3) where any of these derivative
computation methods can be used for each component.

[Fig. 6.2 Derivative computation methods can consider three different levels of information: function values (left, black box: finite differencing), model states (middle, residuals and states: implicit analytic differentiation), and lines of code (right: algorithmic differentiation).]

Tip 6.1 Identify and mitigate the sources of numerical noise

As mentioned in Tip 3.2, it is vital to determine the level of numerical noise
code (right).
in your model. This is especially important when computing derivatives of the
model because taking the derivative can amplify the noise. There are several
common sources of model numerical noise, some of which we can mitigate.
Iterative solvers can introduce numerical noise when the convergence
tolerance is too high or when they have an inherent limit in their precision
(see Section 3.5.3). When we do not have enough precision, we can reduce the
convergence tolerance or increase the iteration limit.
Another possible source of error is file input and output. Many legacy

codes are driven by reading and writing input and output files. However, the
numbers in the files usually have fewer digits than the code’s working precision.
The ideal solution is to modify the code to be called directly and pass the data
through memory. Another solution is to increase the precision in the files.

6.3 Symbolic Differentiation

Symbolic differentiation is well known and widely used in calculus, but


it is of limited use in the numerical optimization of most engineering
models. Except for the most straightforward cases (e.g., Ex. 6.1), many
engineering models involve a large number of operations, utilize loops
and various conditional logic, are implicitly defined, or involve itera-
tive solvers (see Chapter 3). Although the mathematical expressions
within these iterative procedures are explicit, it is challenging, or even
impossible, to use symbolic differentiation to obtain closed-form math-
ematical expressions for the derivatives of the procedure. Even when
it is possible, these expressions are almost always computationally
inefficient.

Example 6.2 Symbolic differentiation leading to expression swell

Kepler’s equation describes the orbit of a body under gravity, as briefly


discussed in Section 2.2. The following implicit equation can be obtained from
Kepler’s equation:∗ ∗ Here, 𝑓 is the difference between the ec-
centric and mean anomalies, 𝑥 is the mean
𝑓 = sin(𝑥 + 𝑓 ) .
anomaly, and the eccentricity is set to 1.
Thus, 𝑓 is an implicit function of 𝑥. As a simple numerical procedure, we use For more details, see Probs. 3.6 and 6.6.

fixed-point iteration to determine the value of 𝑓 for a given input 𝑥. That means
we start with a guess for 𝑓 on the right-hand side of that expression to estimate
a new value for 𝑓 , and repeat. In this case, convergence typically happens in
about 10 iterations. Arbitrarily, we choose 𝑥 as the initial guess for 𝑓 , resulting
in the following computational procedure:

Input: 𝑥
𝑓 =𝑥
for 𝑖 = 1 to 10 do
𝑓 = sin(𝑥 + 𝑓 )
end for
return 𝑓

Now that we have a computational procedure, we would like to compute the


derivative d 𝑓 /d𝑥. We can use a symbolic math toolbox to find the following
closed-form expression for this derivative:

dfdx =
cos(x + sin(x + sin(x + sin(x + sin(x + sin(x + sin(x + sin(x + sin(x +
sin (2*x))))))))))*( cos(x + sin(x + sin(x + sin(x + sin(x + sin(x +
sin(x + sin(x + sin (2*x)))))))))*( cos(x + sin(x + sin(x + sin(x +
sin(x + sin(x + sin(x + sin (2*x))))))))*( cos(x + sin(x + sin(x + sin
(x + sin(x + sin(x + sin (2*x)))))))*( cos(x + sin(x + sin(x + sin(x +
sin(x + sin (2*x))))))*( cos(x + sin(x + sin(x + sin(x + sin (2*x)))))
*( cos(x + sin(x + sin(x + sin (2*x))))*( cos(x + sin(x + sin (2*x)))*(
cos(x + sin (2*x))*(2* cos (2*x) + 1) + 1) + 1) + 1) + 1) + 1) + 1) +
1) + 1)

This expression is long and is full of redundant calculations. This problem


becomes exponentially worse as the number of iterations in the loop is increased,
so this approach is intractable for computational models of even moderate
complexity—this is known as expression swell. Therefore, we dedicate the rest
of this chapter to methods for computing derivatives numerically.

Symbolic differentiation is still valuable for obtaining derivatives


of simple explicit components within a larger model. Furthermore,
algorithm differentiation (discussed in a later section) relies on symbolic
differentiation to differentiate each line of code in the model.

6.4 Finite Differences

Because of their simplicity, finite-difference methods are a popular


approach to computing derivatives. They are versatile, requiring
nothing more than function values. Finite differences are the only
viable option when we are dealing with black-box functions because
they do not require any knowledge about how the function is evaluated.
Most gradient-based optimization algorithms perform finite differences
by default when the user does not provide the required gradients.
However, finite differences are neither accurate nor efficient.

6.4.1 Finite-Difference Formulas


Finite-difference approximations are derived by combining Taylor
series expansions. It is possible to obtain finite-difference formulas
that estimate an arbitrary order derivative with any order of truncation
error by using the right combinations of these expansions. The simplest
finite-difference formula can be derived directly from a Taylor series
expansion in the 𝑗th direction,

𝑓(𝑥 + ℎ𝑒̂𝑗) = 𝑓(𝑥) + ℎ 𝜕𝑓/𝜕𝑥𝑗 + (ℎ²/2!) 𝜕²𝑓/𝜕𝑥𝑗² + (ℎ³/3!) 𝜕³𝑓/𝜕𝑥𝑗³ + · · · ,      (6.3)

where 𝑒ˆ 𝑗 is the unit vector in the 𝑗th direction. Solving this for the first
derivative, we obtain the finite-difference formula,

𝜕𝑓/𝜕𝑥𝑗 = [𝑓(𝑥 + ℎ𝑒̂𝑗) − 𝑓(𝑥)] / ℎ + 𝒪(ℎ) ,      (6.4)

where ℎ is a small scalar called the finite-difference step size. This


approximation is called the forward difference and is directly related to the definition of a derivative because

𝜕𝑓/𝜕𝑥𝑗 = lim(ℎ→0) [𝑓(𝑥 + ℎ𝑒̂𝑗) − 𝑓(𝑥)] / ℎ ≈ [𝑓(𝑥 + ℎ𝑒̂𝑗) − 𝑓(𝑥)] / ℎ .      (6.5)

The truncation error is 𝒪(ℎ), and therefore this is a first-order approximation. The difference between this approximation and the exact derivative is illustrated in Fig. 6.3.

[Fig. 6.3 Exact derivative compared with a forward finite-difference approximation (Eq. 6.4).]

The backward-difference approximation can be obtained by replac-
ing ℎ with −ℎ to yield

𝜕𝑓/𝜕𝑥𝑗 = [𝑓(𝑥) − 𝑓(𝑥 − ℎ𝑒̂𝑗)] / ℎ + 𝒪(ℎ) ,      (6.6)

which is also a first-order approximation.


Assuming each function evaluation yields the full vector 𝑓 , the
previous formulas compute the 𝑗th column of the Jacobian in Eq. 6.1.
To compute the full Jacobian, we need to loop through each direction
𝑒ˆ 𝑗 , add a step, recompute 𝑓 , and compute a finite difference. Hence, the
cost of computing the complete Jacobian is proportional to the number
of input variables of interest, 𝑛 𝑥 .
For a second-order estimate of the first derivative, we can use the
expansion of 𝑓 (𝑥 − ℎ 𝑒ˆ 𝑗 ) to obtain
𝑓(𝑥 − ℎ𝑒̂𝑗) = 𝑓(𝑥) − ℎ 𝜕𝑓/𝜕𝑥𝑗 + (ℎ²/2!) 𝜕²𝑓/𝜕𝑥𝑗² − (ℎ³/3!) 𝜕³𝑓/𝜕𝑥𝑗³ + · · · .      (6.7)

Then, if we subtract this from the expansion in Eq. 6.3 and solve the resulting equation for the derivative of 𝑓 , we get the central-difference formula,

𝜕𝑓/𝜕𝑥𝑗 = [𝑓(𝑥 + ℎ𝑒̂𝑗) − 𝑓(𝑥 − ℎ𝑒̂𝑗)] / (2ℎ) + 𝒪(ℎ²) .      (6.8)

[Fig. 6.4 Exact derivative compared with a central finite-difference approximation (Eq. 6.8).]

The stencil of points for this formula is shown in Fig. 6.4, where we can

see that this estimate is closer to the actual derivative than the forward
difference.
Even more accurate estimates can be derived by combining differ-
ent Taylor series expansions to obtain higher-order truncation error

terms. This technique is widely used in finite-difference methods


for solving differential equations, where higher-order estimates are
desirable. However, finite-precision arithmetic eventually limits the
achievable accuracy for our purposes (as discussed in the next section).
With double-precision arithmetic, there are not enough significant
digits to realize a significant advantage beyond central difference.
We can also estimate second derivatives (or higher) by combining
Taylor series expansions. For example, adding the expansions for
𝑓 (𝑥 + ℎ) and 𝑓 (𝑥 − ℎ) cancels out the first derivative and third derivative
terms, yielding the second-order approximation to the second-order
derivative,

𝜕²𝑓/𝜕𝑥𝑗² = [𝑓(𝑥 + 2ℎ𝑒̂𝑗) − 2𝑓(𝑥) + 𝑓(𝑥 − 2ℎ𝑒̂𝑗)] / (4ℎ²) + 𝒪(ℎ²) .      (6.9)

The finite-difference method can also be used to compute directional


derivatives, which are the scalar projection of the gradient into a given
direction. To do this, instead of stepping in orthogonal directions to get
the gradient, we need to step in the direction of interest, 𝑝, as shown in
Fig. 6.5. Using the forward difference, for example,

𝑓 (𝑥 + ℎ𝑝) − 𝑓 (𝑥)
∇𝑝 𝑓 = + 𝒪(ℎ) . (6.10)

One application of directional derivatives is to compute the slope in


line searches (Section 4.3).
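A minimal sketch of Eq. 6.10 follows (the quadratic test function here is ours, chosen only to make the check easy):

import numpy as np

def directional_derivative(f, x, p, h=1e-6):
    # Forward-difference estimate of the slope of f along direction p
    return (f(x + h * p) - f(x)) / h

f = lambda x: x[0]**2 + 3 * x[1]**2
print(directional_derivative(f, np.array([1.0, 1.0]), np.array([1.0, 0.0])))  # ~2.0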

[Fig. 6.5 Computing a directional derivative using a forward finite difference.]

6.4.2 The Step-Size Dilemma


When estimating derivatives using finite-difference formulas, we are
faced with the step-size dilemma. Because each estimate has a truncation
error of 𝒪(ℎ) (or 𝒪(ℎ 2 ) when second order), we would like to choose
as small of a step size as possible to reduce this error. However, as the

step size reduces, subtractive cancellation (a roundoff error introduced in


Section 3.5.1) becomes dominant. Given the opposing trends of these
errors, there is an optimal step size for which the sum of the two errors
is at a minimum.
Theoretically, the optimal step size for the forward finite difference is approximately √𝜀𝑓 , where 𝜀𝑓 is the precision of 𝑓 . The error bound is also about √𝜀𝑓 . For the central difference, the optimal step size scales approximately with 𝜀𝑓^(1/3), with an error bound of 𝜀𝑓^(2/3). These step and
error bound estimates are just approximate and assume well-scaled
problems.

Example 6.3 Accuracy of finite differences

To demonstrate the step-size dilemma, consider the following function:

𝑓(𝑥) = 𝑒^𝑥 / √(sin³ 𝑥 + cos³ 𝑥) .

The exact derivative at 𝑥 = 1.5 is computed to 16 digits based on symbolic differentiation as a reference value.
In Fig. 6.6, we show the derivatives given by the forward difference, where we can see that as we decrease the step size, the derivative approaches the exact value, but then it worsens and becomes zero for a small enough step size.

[Fig. 6.6 The forward-difference derivative initially improves as the step decreases but eventually gives a zero derivative for a small enough step size.]

We plot the relative error of the forward- and central-difference formulas for a decreasing step size in Fig. 6.7. As the step decreases, the forward-difference estimate initially converges at a linear rate because its truncation error is 𝒪(ℎ), whereas the central difference converges quadratically. However, as the step reduces below a particular value (about 10−8 for the forward difference and 10−5 for the central difference), subtractive cancellation errors become increasingly significant. These values match the theoretical predictions for the optimal step and error bounds when we set 𝜀𝑓 = 10−16 . When ℎ is so small that no difference exists in the output (for steps smaller than 10−16 ), the finite-difference estimates yield zero (and 𝜀 = 1), which corresponds to 100 percent error.
Table 6.1 lists the data for the forward difference, where we can see the number of digits in the difference Δ𝑓 decreasing with decreasing step size until the difference is zero (for ℎ = 10−17 ).

[Fig. 6.7 As the step size ℎ decreases, the total error in the finite-difference estimates initially decreases because of a reduced truncation error. However, subtractive cancellation takes over for small enough steps.]

Tip 6.2 When using finite differencing, always perform a step-size study

In practice, most gradient-based optimizers use finite differences by default to compute the gradients. Given the potential for inaccuracies, finite differences
are often the culprit in cases where gradient-based optimizers fail to converge. over when the step is small enough
Although some of these optimizers try to estimate a good step size, there is and eventually yields an entirely
no substitute for a step-size study by the user. The step-size study must be wrong derivative.

ℎ 𝑓 (𝑥 + ℎ) Δ𝑓 d 𝑓 /d𝑥
10−1 4.9562638252880662 0.4584837713419043 4.58483771
10−2 4.5387928890592475 0.0410128351130856 4.10128351
10−4 4.4981854440562818 0.0004053901101200 4.05390110
10−6 4.4977841073787870 0.0000040534326251 4.05343263
10−8 4.4977800944804409 0.0000000405342790 4.05342799
10−10 4.4977800543515052 0.0000000004053433 4.05344203
10−12 4.4977800539502155 0.0000000000040536 4.05453449
10−14 4.4977800539462027 0.0000000000000409 4.17443857
10−16 4.4977800539461619 0.0000000000000000 0.00000000 Table 6.1 Subtractive cancellation
10−18 4.4977800539461619 0.0000000000000000 0.00000000 leads to a loss of precision and, ul-
timately, inaccurate finite-difference
Exact 4.4977800539461619 4.05342789 estimates.

performed for all variables and does not necessarily apply to the whole design
space. Therefore, repeating this study for other values of 𝑥 might be required.
Because we do not usually know the exact derivative, we cannot plot the
error as we did in Fig. 6.7. However, we can always tabulate the derivative
estimates as we did in Table 6.1. In the last column, we can see from the pattern
of digits that match the previous step size that ℎ = 10−8 is the best step size in
this case.

Finite-difference approximations are sometimes used with larger steps than would be desirable from an accuracy standpoint to help smooth out numerical noise or discontinuities in the model. This approach sometimes works, but it is better to address these problems within the model whenever possible. Figure 6.8 shows an example of this effect. For this noisy function, the larger step ignores the noise and gives the correct trend, whereas the smaller step results in an estimate with the wrong sign.

[Fig. 6.8 Finite-differencing noisy functions can either smooth the derivative estimates or result in estimates with the wrong trends.]

6.4.3 Practical Implementation

Algorithm 6.1 details a procedure for computing a Jacobian using
forward finite differences. It is usually helpful to scale the step size
based on the value of 𝑥 𝑗 , unless 𝑥 𝑗 is too small. Therefore, we combine
the relative and absolute quantities to obtain the following step size:

Δ𝑥𝑗 = ℎ (1 + |𝑥𝑗|) .      (6.11)

This is similar to the expression for the convergence criterion in Eq. 4.24.
Although the absolute step size usually differs for each 𝑥 𝑗 , the relative
step size ℎ is often the same and is user-specified.

Algorithm 6.1 Forward finite-difference gradient computation of a vector-valued function 𝑓 (𝑥)

Inputs:
𝑥: Point about which to compute the gradient
𝑓 : Vector of functions of interest
Outputs:
𝐽: Jacobian of 𝑓 with respect to 𝑥

𝑓0 = 𝑓 (𝑥) Evaluate reference values


ℎ = 10−6 Relative step size (value should be tuned)
for 𝑗 = 1 to 𝑛 𝑥 do
Δ𝑥 = ℎ(1 + |𝑥 𝑗 |) Step size should be scaled but not smaller than ℎ
𝑥 𝑗 = 𝑥 𝑗 + Δ𝑥 Modify in place for efficiency, but copying vector is also an option
𝑓+ = 𝑓 (𝑥) Evaluate function at perturbed point
𝐽∗,𝑗 = ( 𝑓+ − 𝑓0 )/Δ𝑥    Finite difference yields one column of Jacobian at a time
𝑥 𝑗 = 𝑥 𝑗 − Δ𝑥 Do not forget to reset!
end for
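As a concrete illustration, the following is a minimal Python sketch of Alg. 6.1 (the function name and interface are ours; we assume 𝑓 maps an 𝑛 𝑥 -vector to an 𝑛 𝑓 -vector as a NumPy array). As discussed in Tip 6.2, the relative step ℎ should still be tuned with a step-size study.

import numpy as np

def forward_fd_jacobian(f, x, h=1e-6):
    x = np.array(x, dtype=float)          # local copy of the point
    f0 = np.atleast_1d(f(x))              # evaluate reference values
    J = np.zeros((f0.size, x.size))
    for j in range(x.size):
        dx = h * (1 + abs(x[j]))          # scaled step, Eq. 6.11
        x[j] += dx                        # perturb variable j
        J[:, j] = (np.atleast_1d(f(x)) - f0) / dx   # one Jacobian column per variable
        x[j] -= dx                        # do not forget to reset!
    return J

# Example usage with the two-output function of Ex. 6.5
f = lambda x: np.array([x[0] * x[1] + np.sin(x[0]), x[0] * x[1] + x[1] ** 2])
print(forward_fd_jacobian(f, [np.pi / 4, 2.0]))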

6.5 Complex Step

The complex-step derivative approximation, strangely enough, computes derivatives of real functions using complex variables. Unlike finite differences, the complex-step method requires access to the source code and cannot be applied to black-box components. The complex-step method is accurate but no more efficient than finite differences because the computational cost still scales linearly with the number of variables.

6.5.1 Theory

The complex-step method can also be derived using a Taylor series expansion. Rather than using a real step ℎ, as we did to derive the finite-difference formulas, we use a pure imaginary step, 𝑖 ℎ.∗ If 𝑓 is a real function in real variables and is also analytic (differentiable in the complex domain), we can expand it in a Taylor series about a real point 𝑥 as follows:

𝑓 (𝑥 + 𝑖 ℎ 𝑒ˆ 𝑗 ) = 𝑓 (𝑥) + 𝑖 ℎ (𝜕 𝑓 /𝜕𝑥 𝑗 ) − (ℎ 2 /2)(𝜕 2 𝑓 /𝜕𝑥 𝑗 2 ) − 𝑖 (ℎ 3 /6)(𝜕 3 𝑓 /𝜕𝑥 𝑗 3 ) + . . . .    (6.12)

Taking the imaginary parts of both sides of this equation, we have

Im[ 𝑓 (𝑥 + 𝑖 ℎ 𝑒ˆ 𝑗 )] = ℎ (𝜕 𝑓 /𝜕𝑥 𝑗 ) − (ℎ 3 /6)(𝜕 3 𝑓 /𝜕𝑥 𝑗 3 ) + . . . .    (6.13)

∗ This method originated with the work of Lyness and Moler,112 who developed formulas that use complex arithmetic for computing the derivatives of real functions of arbitrary order with arbitrary order truncation error, much like the Taylor series combination approach in finite differences. Later, Squire and Trapp49 observed that the simplest of these formulas was convenient for computing first derivatives.
49. Squire and Trapp, Using complex variables to estimate derivatives of real functions, 1998.
112. Lyness and Moler, Numerical differentiation of analytic functions, 1967.

Dividing this by ℎ and solving for 𝜕 𝑓 /𝜕𝑥 𝑗 yields the complex-step derivative approximation,†

𝜕 𝑓 /𝜕𝑥 𝑗 = Im[ 𝑓 (𝑥 + 𝑖 ℎ 𝑒ˆ 𝑗 )]/ℎ + 𝒪(ℎ 2 ) ,    (6.14)

which is a second-order approximation. To use this approximation, we must provide a complex number with a perturbation in the imaginary part, compute the original function using complex arithmetic, and take the imaginary part of the output to obtain the derivative.

† This approximation can also be derived from one of the Cauchy–Riemann equations, which are fundamental in complex analysis and express complex differentiability.50
50. Martins et al., The complex-step derivative approximation, 2003.
In practical terms, this means that we must convert the function
evaluation to take complex numbers as inputs and compute complex
outputs. Because we have assumed that 𝑓 (𝑥) is a real function of a
real variable in the derivation of Eq. 6.14, the procedure described
here does not work for models that already involve complex arithmetic.
In Section 6.5.2, we explain how to convert programs to handle the
required complex arithmetic for the complex-step method to work in
general. The complex-step method has been extended to compute exact second derivatives as well.113,114

113. Lantoine et al., Using multicomplex variables for automatic computation of high-order derivatives, 2012.
114. Fike and Alonso, Automatic differentiation through the use of hyper-dual numbers for second derivatives, 2012.

Unlike finite differences, this formula has no subtraction operation and thus no subtractive cancellation error. The only source of numerical error is the truncation error. However, the truncation error can be
eliminated if ℎ is decreased to a small enough value (say, 10−200 ). Then,
the precision of the complex-step derivative approximation (Eq. 6.14)
matches the precision of 𝑓 . This is a tremendous advantage over the
finite-difference approximations (Eqs. 6.4 and 6.8).
Like the finite-difference approach, each evaluation yields a column
of the Jacobian (𝜕 𝑓 /𝜕𝑥 𝑗 ), and the cost of computing all the derivatives is
proportional to the number of design variables. The cost of the complex-
step method is comparable to that of a central difference because we
compute a real and an imaginary part for every number in our code.
If we take the real part of the Taylor series expansion (Eq. 6.12), we
obtain the value of the function on the real axis,

𝑓 (𝑥) = Re[ 𝑓 (𝑥 + 𝑖 ℎ 𝑒ˆ 𝑗 )] + 𝒪(ℎ 2 ) .    (6.15)

Similar to the derivative approximation, we can make the truncation


error disappear by using a small enough ℎ. This means that no separate
evaluation of 𝑓 (𝑥) is required to get the original real value of the
function; we can simply take the real part of the complex evaluation.
What is a “small enough ℎ”? When working with finite-precision
arithmetic, the error can be eliminated entirely by choosing an ℎ so small
that all ℎ 2 terms become zero because of underflow (i.e., ℎ 2 is smaller

than the smallest representable number, which is approximately 10−324


when using double precision; see Section 3.5.1). Eliminating these
squared terms does not affect the accuracy of the derivative carried in
the imaginary part because the squared terms only appear in the error
terms of the complex-step approximation.
At the same time, ℎ must be large enough that the imaginary part (ℎ ·
𝜕 𝑓 /𝜕𝑥) does not underflow. Suppose that 𝜇 is the smallest representable
number. Then, the two requirements result in the following bounds:
𝜇 (𝜕 𝑓 /𝜕𝑥)⁻¹ < ℎ < √𝜇 .    (6.16)

A step size of 10−200 works well for double-precision functions.

Example 6.4 Complex-step accuracy compared with finite differences

To show how the complex-step method works, consider the function in


Ex. 6.3. In addition to the finite-difference relative errors from Fig. 6.7, we plot
the complex-step error in Fig. 6.9.

Fig. 6.9 Relative error 𝜀 versus step size ℎ for the forward difference, the central difference, and the complex step. Unlike finite differences, the complex-step method is not subject to subtractive cancellation. Therefore, the error is the same as that of the function evaluation (machine zero in this case).

The complex-step estimate converges quadratically with decreasing step


size, as predicted by the truncation error term. The relative error reduces
to machine precision at around ℎ = 10−8 and stays at that level. The error
eventually increases when ℎ is so small that the imaginary parts get affected by
underflow (around ℎ = 10−308 in this case).
The real parts and the derivatives of the complex evaluations are listed
in Table 6.2. For a small enough step, the real part is identical to the original
real function evaluation, and the complex-step method yields derivatives that
match to machine precision.
Comparing the best accuracy of each of these approaches, we can see that

ℎ         Re ( 𝑓 )                 Im ( 𝑓 ) /ℎ
10−1      4.4508662116993065      4.0003330384671729
10−2      4.4973069409015318      4.0528918144659292
10−4      4.4977800066307951      4.0534278402854467
10−6      4.4977800539414297      4.0534278938932582
10−8      4.4977800539461619      4.0534278938986201
10−10     4.4977800539461619      4.0534278938986201
10−12     4.4977800539461619      4.0534278938986201
10−14     4.4977800539461619      4.0534278938986210
10−16     4.4977800539461619      4.0534278938986201
10−18     4.4977800539461619      4.0534278938986210
10−200    4.4977800539461619      4.0534278938986201
Exact     4.4977800539461619      4.0534278938986201

Table 6.2 For a small enough step, the real part of the complex evaluation is identical to the real evaluation, and the derivative matches to machine precision.

by using finite differences, we only achieve a fraction of the accuracy that is


obtained by using the complex-step approximation.

6.5.2 Complex-Step Implementation


We can use the complex-step method even when the evaluation of 𝑓 in-
volves the solution of numerical models through computer programs.50 The outer loop for computing the derivatives of multiple functions with respect to all variables (Alg. 6.2) is similar to the one for finite differences. A reference function evaluation is not required, but now the function must handle complex numbers correctly.

50. Martins et al., The complex-step derivative approximation, 2003.

Algorithm 6.2 Computing the gradients of a vector-valued function 𝑓 (𝑥) using the complex-step method

Inputs:
𝑥: Point about which to compute the gradient
𝑓 : Function of interest
Outputs:
𝐽: Jacobian of 𝑓 about point 𝑥

ℎ = 10−200 Typical “small enough” step size


for 𝑗 = 1 to 𝑛 𝑥 do
𝑥𝑗 = 𝑥𝑗 + 𝑖ℎ Add complex step to variable 𝑗
𝑓+ = 𝑓 (𝑥) Evaluate function with complex perturbation
𝐽∗,𝑗 = Im( 𝑓+ )/ℎ    Extract derivatives from imaginary part

𝑥𝑗 = 𝑥𝑗 − 𝑖ℎ Reset perturbed variable
end for
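A minimal Python sketch of Alg. 6.2 follows (the naming is ours; we assume 𝑓 accepts a complex NumPy vector and propagates complex arithmetic correctly, as discussed in Section 6.5.2):

import numpy as np

def complex_step_jacobian(f, x, h=1e-200):
    x = np.array(x, dtype=complex)        # complex copy of the point
    cols = []
    for j in range(x.size):
        x[j] += 1j * h                    # add complex step to variable j
        cols.append(np.imag(np.atleast_1d(f(x))) / h)   # extract derivatives from imaginary part
        x[j] -= 1j * h                    # reset perturbed variable
    return np.column_stack(cols)

f = lambda x: np.array([x[0] * x[1] + np.sin(x[0]), x[0] * x[1] + x[1] ** 2])
print(complex_step_jacobian(f, [np.pi / 4, 2.0]))   # matches the exact Jacobian to machine precision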

The complex-step method can be applied to any model, but modi-


fications might be required. We need the source code for the model
to make sure that the program can handle complex numbers and the
associated arithmetic, that it handles logical operators consistently, and
that certain functions yield the correct derivatives.
First, the program may need to be modified to use complex numbers.
In programming languages like Fortran or C, this involves changing
real-valued type declarations (e.g., double) to complex type declarations
(e.g., double complex). In some languages, such as MATLAB, Python,
and Julia, this is unnecessary because functions are overloaded to
automatically accept either type.
Second, some changes may be required to preserve the correct
logical flow through the program. Relational logic operators (e.g.,
“greater than”, “less than”, “if”, and “else”) are usually not defined
for complex numbers. These operators are often used in programs,
together with conditional statements, to redirect the execution thread.
The original algorithm and its “complexified” version should follow
the same execution thread. Therefore, defining these operators to
compare only the real parts of the arguments is the correct approach.
Functions that choose one argument, such as the maximum or the
minimum values, are based on relational operators. Following the
previous argument, we should determine the maximum and minimum
values based on the real parts alone.
Third, some functions need to be redefined for complex arguments. The most common function that needs to be redefined is the absolute value function. For a complex number, 𝑧 = 𝑥 + 𝑖 𝑦, the absolute value is defined as

|𝑧| = √(𝑥 2 + 𝑦 2 ) ,    (6.17)

as shown in Fig. 6.10. This definition is not complex analytic, which is required in the derivation of the complex-step derivative approximation.

Fig. 6.10 The usual definition of a complex absolute value returns a real number (the length of the vector), which is not compatible with the complex-step method.

As shown in Fig. 6.11, the correct derivatives for the real absolute value function are +1 and −1, depending on whether 𝑥 is greater than or less than zero. The following complex definition of the absolute value yields the correct derivatives:

abs(𝑥 + 𝑖 𝑦) = { −𝑥 − 𝑖 𝑦, if 𝑥 < 0 ;  +𝑥 + 𝑖 𝑦, if 𝑥 ≥ 0 } .    (6.18)

Setting the imaginary part to 𝑦 = ℎ and dividing by ℎ corresponds to the slope of the absolute value function. There is an exception at 𝑥 = 0, where the function is not analytic, but a derivative does not exist in any case. We use the “greater or equal” in the logic so that the approximation yields the correct right-sided derivative at that point.

Fig. 6.11 The absolute value function needs to be redefined such that the imaginary part yields the correct derivatives.
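For example, a minimal Python sketch of this redefinition (the helper name cabs is ours) is:

def cabs(z):
    # Complex-safe absolute value per Eq. 6.18: the imaginary part carries the derivative
    z = complex(z)
    return -z if z.real < 0 else z

# The imaginary part divided by h recovers the slope of |x|
h = 1e-200
print(cabs(-3.0 + 1j * h).imag / h)   # -1.0
print(cabs(+3.0 + 1j * h).imag / h)   # +1.0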

Tip 6.3 Test complexified code by running it with ℎ = 0

Once you have made your code complex, the first test you should perform
is to run your code with no imaginary perturbation and verify that no variable
ends up with a nonzero imaginary part. If any number in the code acquires a
nonzero imaginary part, something is wrong, and you must trace the source of
the error. This is a necessary but not sufficient test.
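A minimal sketch of this test in Python (using the function from Ex. 6.5 as a stand-in for a complexified model) is:

import numpy as np

x = np.array([np.pi / 4, 2.0], dtype=complex)   # complex type, but no imaginary perturbation
f = lambda x: np.array([x[0] * x[1] + np.sin(x[0]), x[0] * x[1] + x[1] ** 2])
out = f(x)
# With h = 0, every output should have an exactly zero imaginary part
assert np.all(out.imag == 0.0), "a spurious imaginary part entered the computation"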

Depending on the programming language, we may need to redefine


some trigonometric functions. This is because some default imple-
mentations, although correct, do not maintain accurate derivatives for
small complex-step sizes. We must replace these with mathematically
equivalent implementations that avoid numerical issues.
Fortunately, we can automate most of these changes by using scripts to process the codes, and in most programming languages, we can easily redefine functions using operator overloading.‡

‡ For more details on the problematic functions and how to implement the complex-step method in various programming languages, see Martins et al.50 A summary, implementation guide, and scripts are available at: http://bit.ly/complexstep
50. Martins et al., The complex-step derivative approximation, 2003.

Tip 6.4 Check the convergence of the imaginary part

When the solver that computes 𝑓 is iterative, it might be necessary to change the convergence criterion so that it checks for the convergence of the imaginary part, in addition to the existing check on the real part. The imaginary part, which contains the derivative information, often lags relative to the real part in terms of convergence, as shown in Fig. 6.12. Therefore, if the solver only checks for the real part, it might yield a derivative with a precision lower than the function value. In this example, 𝑓 is the drag coefficient given by a computational fluid dynamics solver and 𝜀 is the relative error for each part.

Fig. 6.12 The imaginary parts of the variables often lag relative to the real parts in iterative solvers.

6.6 Algorithmic Differentiation

Algorithmic differentiation (AD)—also known as computational differentiation or automatic differentiation—is a well-known approach based on the
systematic application of the chain rule to computer programs.115,116
115. Griewank, Evaluating Derivatives, 2000.
116. Naumann, The Art of Differentiating Computer Programs—An Introduction to Algorithmic Differentiation, 2011.

The derivatives computed with AD can match the precision of the function evaluation. The cost of computing derivatives with AD can be proportional to either the number of variables or the number of
functions, depending on the type of AD, making it flexible.
Another attractive feature of AD is that its implementation is largely
automatic, thanks to various AD tools. To explain AD, we start by
outlining the basic theory with simple examples. Then we explore how
the method is implemented in practice with further examples.

6.6.1 Variables and Functions as Lines of Code


The basic concept of AD is as follows. Even long, complicated codes
consist of a sequence of basic operations (e.g., addition, multiplication,
cosine, exponentiation). Each operation can be differentiated symboli-
cally with respect to the variables in the expression. AD performs this
symbolic differentiation and adds the code that computes the deriva-
tives for each variable in the code. The derivatives of each variable
accumulate in what amounts to a numerical version of the chain rule.
The fundamental building blocks of a code are unary and binary operations. These operations can be combined to obtain more elaborate explicit functions, which are typically expressed in one line of computer code. We represent the variables in the computer code as the sequence 𝑣 = [𝑣1 , . . . , 𝑣 𝑖 , . . . , 𝑣 𝑛 ], where 𝑛 is the total number of variables assigned in the code. One or more of these variables at the start of this sequence are given and correspond to 𝑥, and one or more of the variables toward the end of the sequence are the outputs of interest, 𝑓 , as illustrated in Fig. 6.13. In general, a variable assignment corresponding to a line of code can involve any other variable, including itself, through an explicit function,

𝑣 𝑖 = 𝑣 𝑖 (𝑣1 , 𝑣2 , . . . , 𝑣 𝑖 , . . . , 𝑣 𝑛 ) .    (6.19)

Except for the most straightforward codes, many of the variables in the code are overwritten as a result of iterative loops.

Fig. 6.13 AD considers all the variables in the code, where the inputs 𝑥 are among the first variables, and the outputs 𝑓 are among the last.

To understand AD, it is helpful to imagine a version of the code where all the loops are unrolled. Instead of overwriting variables, we create new versions of those variables, as illustrated in Fig. 6.14. Then, we can represent the computations in the code in a sequence with no loops such that each variable in this larger set only depends on previous variables, and then

𝑣 𝑖 = 𝑣 𝑖 (𝑣1 , 𝑣2 , . . . , 𝑣 𝑖−1 ) .    (6.20)

Fig. 6.14 Unrolling of loops is a useful mental model to understand the derivative propagation in the AD of general code.

Given such a sequence of operations and the derivatives for each operation, we can apply the chain rule to obtain the derivatives for the entire sequence. Unrolling the loops is just a mental model for understanding how the chain rule operates, and it is not explicitly done in practice.
The chain rule can be applied in two ways. In the forward mode, we
choose one input variable and work forward toward the outputs until
we get the desired total derivative. In the reverse mode, we choose one
output variable and work backward toward the inputs until we get the
desired total derivative.

6.6.2 Forward-Mode AD
The chain rule for the forward mode can be written as

d𝑣 𝑖 /d𝑣 𝑗 = ∑_{𝑘=𝑗}^{𝑖−1} (𝜕𝑣 𝑖 /𝜕𝑣 𝑘 ) (d𝑣 𝑘 /d𝑣 𝑗 ) ,    (6.21)

where each partial derivative is obtained by symbolically differentiating


the explicit expression for 𝑣 𝑖 . The total derivatives are the derivatives
with respect to the chosen input 𝑣 𝑗 , which can be computed using this
chain rule.
Using the forward mode, we evaluate a sequence of these expres-
sions by fixing 𝑗 in Eq. 6.21 (effectively choosing one input 𝑣 𝑗 ) and
incrementing 𝑖 to get the derivative of each variable 𝑣 𝑖 . We only need
to sum up to 𝑖 − 1 because of the form of Eq. 6.20, where each 𝑣 𝑖 only
depends on variables that precede it.
For a more convenient notation, we define a new variable that
represents the total derivative of variable 𝑖 with respect to a fixed input
𝑗 as 𝑣¤ 𝑖 ≡ d𝑣 𝑖 /d𝑣 𝑗 and rewrite the chain rule as

𝑣¤ 𝑖 = ∑_{𝑘=𝑗}^{𝑖−1} (𝜕𝑣 𝑖 /𝜕𝑣 𝑘 ) 𝑣¤ 𝑘 .    (6.22)

The chosen input 𝑗 corresponds to the seed, which we set to 𝑣¤ 𝑗 = 1 (using


the definition for 𝑣¤ 𝑗 , we see that this means setting d𝑣 𝑗 /d𝑣 𝑗 = 1). This chain
rule then propagates the total derivatives forward, as shown in Fig. 6.15,
affecting all the variables that depend on the seeded variable.
Once we are done applying the chain rule (Eq. 6.22) for the chosen
input variable 𝑣 𝑗 , we end up with the total derivatives d𝑣 𝑖 /d𝑣 𝑗 for all
𝑖 > 𝑗. The sum in the chain rule (Eq. 6.22) only needs to consider the
nonzero partial derivative terms. If a variable 𝑘 does not explicitly
appear in the expression for 𝑣 𝑖 , then 𝜕𝑣 𝑖 /𝜕𝑣 𝑘 = 0, and there is no need
to consider the corresponding term in the sum. In practice, this means
that only a small number of terms is considered for each sum.
Fig. 6.15 The forward mode propagates derivatives to all the variables that depend on the seeded input variable.

Suppose we have four variables 𝑣1 , 𝑣2 , 𝑣3 , and 𝑣4 , and 𝑥 ≡ 𝑣1 , 𝑓 ≡ 𝑣4 , and we want d 𝑓 /d𝑥. We assume that each variable depends explicitly on all the previous ones. Using the chain rule (Eq. 6.22), we set 𝑗 = 1 (because we want the derivative with respect to 𝑥 ≡ 𝑣1 ) and increment

in 𝑖 to get the sequence of derivatives:

𝑣¤ 1 = 1
𝑣¤ 2 = (𝜕𝑣2 /𝜕𝑣1 ) 𝑣¤ 1
𝑣¤ 3 = (𝜕𝑣3 /𝜕𝑣1 ) 𝑣¤ 1 + (𝜕𝑣3 /𝜕𝑣2 ) 𝑣¤ 2                                        (6.23)
𝑣¤ 4 = (𝜕𝑣4 /𝜕𝑣1 ) 𝑣¤ 1 + (𝜕𝑣4 /𝜕𝑣2 ) 𝑣¤ 2 + (𝜕𝑣4 /𝜕𝑣3 ) 𝑣¤ 3 ≡ d 𝑓 /d𝑥 .

In each step, we just need to compute the partial derivatives of the


current operation 𝑣 𝑖 and then multiply using the total derivatives 𝑣¤
that have already been computed. We move forward by evaluating the
partial derivatives of 𝑣 in the same sequence to evaluate the original
function. This is convenient because all of the unknowns are partial
derivatives, meaning that we only need to compute derivatives based
on the operation at hand (or line of code).
In this abstract example with four variables that depend on each
other sequentially, the Jacobian of the variables with respect to them-
selves is as follows:

 1 0 0 0
 
 d𝑣2 
 1 0 0
 d𝑣1 
𝐽𝑣 =  d𝑣3  .
0
d𝑣 3 (6.24)
 d𝑣1 1
 d𝑣 2 
 d𝑣4 d𝑣 4 d𝑣4 
 1
 d𝑣1 d𝑣 2 d𝑣3 
By setting the seed 𝑣¤ 1 = 1 and using the forward chain rule (Eq. 6.22), we
have computed the first column of 𝐽𝑣 from top to bottom. This column
corresponds to the tangent with respect to 𝑣1 . Using forward-mode
AD, obtaining derivatives for other outputs is free (e.g., d𝑣3 /d𝑣1 ≡ 𝑣¤ 3
in Eq. 6.23).
However, if we want the derivatives with respect to additional
inputs, we would need to set a different seed and evaluate an entire
set of similar calculations. For example, if we wanted d𝑣4 /d𝑣2 , we
would set the seed as 𝑣¤ 2 = 1 and evaluate the equations for 𝑣¤ 3 and 𝑣¤ 4 ,
where we would now have d𝑣 4 /d𝑣2 = 𝑣¤ 4 . This would correspond to
computing the second column in 𝐽𝑣 (Eq. 6.24).
Thus, the cost of the forward mode scales linearly with the number
of inputs we are interested in and is independent of the number of
outputs.

Example 6.5 Forward-mode AD

Consider the function with two inputs and two outputs from Ex. 6.1. We
could evaluate the explicit expressions in this function using only two lines of
code. However, to make the AD process more apparent, we write the code such
that each line has a single unary or binary operation, which is how a computer
ends up evaluating the expression:

𝑣1 = 𝑣1 (𝑣1 ) = 𝑥1
𝑣2 = 𝑣 2 (𝑣2 ) = 𝑥2
𝑣3 = 𝑣 3 (𝑣1 , 𝑣2 ) = 𝑣1 𝑣 2
𝑣4 = 𝑣 4 (𝑣1 ) = sin 𝑣 1
𝑣5 = 𝑣 5 (𝑣3 , 𝑣4 ) = 𝑣3 + 𝑣4 = 𝑓1
𝑣6 = 𝑣 6 (𝑣2 ) = 𝑣22
𝑣 7 = 𝑣 7 (𝑣3 , 𝑣6 ) = 𝑣3 + 𝑣6 = 𝑓2 .

Using the forward mode, set the seed 𝑣¤ 1 = 1, and 𝑣¤ 2 = 0 to obtain the derivatives
with respect to 𝑥1 . When using the chain rule (Eq. 6.22), only one or two partial
derivatives are nonzero in each sum because the operations are either unary
or binary in this case. For example, the addition operation that computes
𝑣5 does not depend explicitly on 𝑣2 , so 𝜕𝑣5 /𝜕𝑣2 = 0. To further elaborate,
when evaluating the operation 𝑣5 = 𝑣3 + 𝑣4 , we do not need to know how 𝑣3
was computed; we just need to know the value of the two numbers we are
adding. Similarly, when evaluating the derivative 𝜕𝑣5 /𝜕𝑣2 , we do not need
to know how or whether 𝑣3 and 𝑣4 depended on 𝑣2 ; we just need to know
how this one operation depends on 𝑣2 . So even though symbolic derivatives
are involved in individual operations, the overall process is distinct from
symbolic differentiation. We do not combine all the operations and end up
with a symbolic derivative. We develop a computational procedure to compute
the derivative that ends up with a number for a given input—similar to the
computational procedure that computes the functional outputs and does not
produce a symbolic functional output.
Say we want to compute d 𝑓2 /d𝑥1 , which in our example corresponds to
d𝑣7 /d𝑣1 . The evaluation point is the same as in Ex. 6.1: 𝑥 = (𝜋/4, 2). Using the
chain rule (Eq. 6.22) and considering only the nonzero partial derivative terms,
we get the following sequence:

𝑣¤ 1 = 1
𝑣¤ 2 = 0
𝑣¤ 3 = (𝜕𝑣3 /𝜕𝑣1 ) 𝑣¤ 1 + (𝜕𝑣3 /𝜕𝑣2 ) 𝑣¤ 2 = 𝑣2 · 𝑣¤ 1 + 𝑣1 · 0 = 2
𝑣¤ 4 = (𝜕𝑣4 /𝜕𝑣1 ) 𝑣¤ 1 = cos 𝑣1 · 𝑣¤ 1 = 0.707 . . .
𝑣¤ 5 = (𝜕𝑣5 /𝜕𝑣3 ) 𝑣¤ 3 + (𝜕𝑣5 /𝜕𝑣4 ) 𝑣¤ 4 = 1 · 𝑣¤ 3 + 1 · 𝑣¤ 4 = 2.707 . . . ≡ 𝜕 𝑓1 /𝜕𝑥1
𝑣¤ 6 = (𝜕𝑣6 /𝜕𝑣2 ) 𝑣¤ 2 = 2𝑣2 · 𝑣¤ 2 = 0                                          (6.25)
𝑣¤ 7 = (𝜕𝑣7 /𝜕𝑣3 ) 𝑣¤ 3 + (𝜕𝑣7 /𝜕𝑣6 ) 𝑣¤ 6 = 1 · 𝑣¤ 3 + 1 · 𝑣¤ 6 = 2 ≡ 𝜕 𝑓2 /𝜕𝑥1 .
This sequence is illustrated in matrix form in Fig. 6.16. The procedure is equivalent to performing forward substitution in this linear system. We now have a procedure (not a symbolic expression) for computing d 𝑓2 /d𝑥1 for any (𝑥1 , 𝑥2 ). The dependencies of these operations are shown in Fig. 6.17 as a computational graph.

Although we set out to compute d 𝑓2 /d𝑥1 , we also obtained d 𝑓1 /d𝑥1 as a by-product. We can obtain the derivatives for all outputs with respect to one input for the same cost as computing the outputs. If we wanted the derivative with respect to the other input, d 𝑓1 /d𝑥2 , a new sequence of calculations would be necessary.

Fig. 6.16 Dependency used in the forward chain rule propagation in Eq. 6.25. The forward mode is equivalent to solving a lower triangular system by forward substitution, where the system is sparse.

Fig. 6.17 Computational graph for the numerical example evaluations, showing the forward propagation of the derivative with respect to 𝑥1 .
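The following Python sketch (the naming is ours) evaluates Ex. 6.5 and propagates the forward-mode derivatives of Eq. 6.25 line by line; with the seed (𝑣¤ 1 , 𝑣¤ 2 ) = (1, 0) it returns d 𝑓1 /d𝑥1 and d 𝑓2 /d𝑥1 :

import numpy as np

def f_forward(x1, x2, x1dot=1.0, x2dot=0.0):
    v1, v1dot = x1, x1dot
    v2, v2dot = x2, x2dot
    v3, v3dot = v1 * v2, v2 * v1dot + v1 * v2dot      # product rule
    v4, v4dot = np.sin(v1), np.cos(v1) * v1dot        # chain rule for sin
    f1, f1dot = v3 + v4, v3dot + v4dot
    v6, v6dot = v2 ** 2, 2 * v2 * v2dot
    f2, f2dot = v3 + v6, v3dot + v6dot
    return (f1, f2), (f1dot, f2dot)

print(f_forward(np.pi / 4, 2.0))   # derivatives with respect to x1: about 2.707 and 2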

So far, we have assumed that we are computing derivatives with


respect to each component of 𝑥. However, just like for finite differences,
we can also compute directional derivatives using forward-mode AD.
We do so by setting the appropriate seed in the 𝑣¤ ’s that correspond to
the inputs in a vectorized manner. Suppose we have 𝑥 ≡ [𝑣1 , . . . , 𝑣 𝑛 𝑥 ].
To compute the derivative with respect to 𝑥 𝑗 , we would set the seed
as the unit vector 𝑣¤ = 𝑒ˆ 𝑗 and follow a similar process for the other
elements. To compute a directional derivative in direction 𝑝, we would
set the seed as 𝑣¤ = 𝑝/‖𝑝‖.

Tip 6.5 Use a directional derivative for quick verification

We can use a directional derivative in arbitrary directions to verify the


gradient computation. The directional derivative is the scalar projection of
the gradient in the chosen direction, that is, ∇ 𝑓 | 𝑝. We can use the directional
derivative to verify the gradient computed by some other method, which is
especially useful when the evaluation of 𝑓 is expensive and we have many
gradient elements. We can verify a gradient by projecting it into some direction

(say, 𝑝 = [1, . . . , 1]) and then comparing it to the directional derivative in that
direction. If the result matches the reference, then all the gradient elements are
most likely correct (it is good practice to try a couple more directions just to be
sure). However, if the result does not match, this directional derivative does
not reveal which gradient elements are incorrect.
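A minimal Python sketch of this check (function names are ours) compares the projection of a gradient onto 𝑝 with a finite-difference directional derivative:

import numpy as np

def check_gradient(f, grad, x, h=1e-6):
    p = np.ones_like(x)              # try p = [1, ..., 1]; repeat with other directions to be sure
    p = p / np.linalg.norm(p)
    proj = grad @ p                  # scalar projection of the gradient onto p
    fd = (f(x + h * p) - f(x)) / h   # forward-difference directional derivative
    return proj, fd                  # these should agree to within the finite-difference error

# Example usage with f1 from Ex. 6.1
f = lambda x: x[0] * x[1] + np.sin(x[0])
x = np.array([np.pi / 4, 2.0])
grad = np.array([x[1] + np.cos(x[0]), x[0]])   # gradient to be verified
print(check_gradient(f, grad, x))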

6.6.3 Reverse-Mode AD
The reverse mode is also based on the chain rule but uses the alternative
form:
d𝑣 𝑖 /d𝑣 𝑗 = ∑_{𝑘=𝑗+1}^{𝑖} (𝜕𝑣 𝑘 /𝜕𝑣 𝑗 ) (d𝑣 𝑖 /d𝑣 𝑘 ) ,    (6.26)

where the summation happens in reverse (starts at 𝑖 and decrements to


𝑗 + 1). This is less intuitive than the forward chain rule, but it is equally
valid. Here, we fix the index 𝑖 corresponding to the output of interest
and decrement 𝑗 until we get the desired derivative.
Similar to the forward-mode total derivative notation (Eq. 6.22), we
define a more convenient notation for the variables that carry the total
derivatives with a fixed 𝑖 as 𝑣¯ 𝑗 ≡ d𝑣 𝑖 /d𝑣 𝑗 , which are sometimes called
adjoint variables. Then we can rewrite the chain rule as

𝑣¯ 𝑗 = ∑_{𝑘=𝑗+1}^{𝑖} (𝜕𝑣 𝑘 /𝜕𝑣 𝑗 ) 𝑣¯ 𝑘 .    (6.27)

This chain rule propagates the total derivatives backward after setting
the reverse seed 𝑣¯ 𝑖 = 1, as shown in Fig. 6.18. This affects all the variables on which the seeded variable depends.

Fig. 6.18 The reverse mode propagates derivatives to all the variables on which the seeded output variable depends.
The reverse-mode variables 𝑣¯ represent the derivatives of one output,
𝑖, with respect to all the input variables (instead of the derivatives of all
the outputs with respect to one input, 𝑗, in the forward mode). Once
we are done applying the reverse chain rule (Eq. 6.27) for the chosen
output variable 𝑣 𝑖 , we end up with the total derivatives d𝑣 𝑖 /d𝑣 𝑗 for all
𝑗 < 𝑖.
Applying the reverse mode to the same four-variable example as
before, we get the following sequence of derivative computations (we
set 𝑖 = 4 and decrement 𝑗):
𝑣¯ 4 = 1
𝑣¯ 3 = (𝜕𝑣4 /𝜕𝑣3 ) 𝑣¯ 4
𝑣¯ 2 = (𝜕𝑣3 /𝜕𝑣2 ) 𝑣¯ 3 + (𝜕𝑣4 /𝜕𝑣2 ) 𝑣¯ 4                                        (6.28)
𝑣¯ 1 = (𝜕𝑣2 /𝜕𝑣1 ) 𝑣¯ 2 + (𝜕𝑣3 /𝜕𝑣1 ) 𝑣¯ 3 + (𝜕𝑣4 /𝜕𝑣1 ) 𝑣¯ 4 ≡ d 𝑓 /d𝑥 .
The partial derivatives of 𝑣 must be computed for 𝑣 4 first, then 𝑣3 , and
so on. Therefore, we have to traverse the code in reverse. In practice,
not every variable depends on every other variable, so a computational
graph is created during code evaluation. Then, when computing the
adjoint variables, we traverse the computational graph in reverse. As
before, the derivatives we need to compute in each line are only partial
derivatives.
Recall the Jacobian of the variables,

 1 0 0 0
 
 d𝑣2 
 1 0 0
 d𝑣1 
𝐽𝑣 =  d𝑣3  .
0
d𝑣 3 (6.29)
 d𝑣1 1
 d𝑣 2 
 d𝑣4 d𝑣 4 d𝑣4 
 1
 d𝑣1 d𝑣 2 d𝑣3 
By setting 𝑣¯ 4 = 1 and using the reverse chain rule (Eq. 6.27), we have
computed the last row of 𝐽𝑣 from right to left. This row corresponds
to the gradient of 𝑓 ≡ 𝑣4 . Using the reverse mode of AD, obtaining
derivatives with respect to additional inputs is free (e.g., d𝑣 4 /d𝑣 2 ≡ 𝑣¯ 2
in Eq. 6.28).
However, if we wanted the derivatives of additional outputs, we
would need to evaluate a different sequence of derivatives. For example,
if we wanted d𝑣3 /d𝑣1 , we would set 𝑣¯ 3 = 1 and evaluate the expressions
for 𝑣¯ 2 and 𝑣¯ 1 in Eq. 6.28, where d𝑣 3 /𝑑𝑣 1 ≡ 𝑣¯ 1 . Thus, the cost of
the reverse mode scales linearly with the number of outputs and is
independent of the number of inputs.
One complication with the reverse mode is that the resulting se-
quence of derivatives requires the values of the variables, starting with
the last ones and progressing in reverse. For example, the partial deriva-
tive in the second operation of Eq. 6.28 might involve 𝑣3 . Therefore, the
code needs to run in a forward pass first, and all the variables must be
stored for use in the reverse pass, which increases memory usage.

Example 6.6 Reverse-mode AD

Suppose we want to compute 𝜕 𝑓2 /𝜕𝑥1 for the function from Ex. 6.5. First,
we need to run the original code (a forward pass) and store the values of all
the variables because they are necessary in the reverse chain rule (Eq. 6.26)
to compute the numerical values of the partial derivatives. Furthermore, the

reverse chain rule requires the information on all the dependencies to determine
which partial derivatives are nonzero. The forward pass and dependencies are
represented by the computational graph shown in Fig. 6.19.

Fig. 6.19 Computational graph for the function.

Using the chain rule (Eq. 6.26) and setting the seed for the desired variable
𝑣¯ 7 = 1, we get

𝑣¯ 7 = 1
𝑣¯ 6 = (𝜕𝑣7 /𝜕𝑣6 ) 𝑣¯ 7 = 𝑣¯ 7 = 1
𝑣¯ 5 = 0
𝑣¯ 4 = (𝜕𝑣5 /𝜕𝑣4 ) 𝑣¯ 5 = 𝑣¯ 5 = 0                                               (6.30)
𝑣¯ 3 = (𝜕𝑣7 /𝜕𝑣3 ) 𝑣¯ 7 + (𝜕𝑣5 /𝜕𝑣3 ) 𝑣¯ 5 = 𝑣¯ 7 + 𝑣¯ 5 = 1
𝑣¯ 2 = (𝜕𝑣6 /𝜕𝑣2 ) 𝑣¯ 6 + (𝜕𝑣3 /𝜕𝑣2 ) 𝑣¯ 3 = 2𝑣2 𝑣¯ 6 + 𝑣1 𝑣¯ 3 = 4.785 = 𝜕 𝑓2 /𝜕𝑥2
𝑣¯ 1 = (𝜕𝑣4 /𝜕𝑣1 ) 𝑣¯ 4 + (𝜕𝑣3 /𝜕𝑣1 ) 𝑣¯ 3 = (cos 𝑣1 ) 𝑣¯ 4 + 𝑣2 𝑣¯ 3 = 2 = 𝜕 𝑓2 /𝜕𝑥1 .
After running the forward evaluation and storing the elements of 𝑣, we can run
the reverse pass shown in Fig. 6.20. This reverse pass is illustrated in matrix
form in Fig. 6.21. The procedure is equivalent to performing back substitution
in this linear system.

Fig. 6.20 Computational graph for the reverse mode, showing the backward propagation of the derivative of 𝑓2 .

Although we set out to evaluate d 𝑓2 /d𝑥 1 , we also computed d 𝑓2 /d𝑥2 as a


by-product. For each output, the derivatives of all inputs come at the cost of

evaluating only one more line of code. Conversely, if we want the derivatives
of 𝑓1 , a whole new set of computations is needed.
In forward mode, the computation of a given derivative, 𝑣¤ 𝑖 , requires the partial derivatives of the line of code that computes 𝑣 𝑖 with respect to its inputs. In the reverse case, however, to compute a given derivative, 𝑣¯ 𝑗 , we require the partial derivatives with respect to 𝑣 𝑗 of the functions that the current variable 𝑣 𝑗 affects. Knowledge of the functions a variable affects is not encoded in that variable's computation, and that is why the computational graph is required.

Fig. 6.21 Dependency used in the reverse chain rule propagation in Eq. 6.30. The reverse mode is equivalent to solving an upper triangular system by backward substitution, where the system is sparse.

Unlike with forward-mode AD and finite differences, it is impossible to compute a directional derivative by setting the appropriate seeds. Although the seeds in the forward mode are associated with the inputs, the seeds for the reverse mode are associated with the outputs. Suppose
we have multiple functions of interest, 𝑓 ≡ [𝑣 𝑛−𝑛 𝑓 , . . . , 𝑣 𝑛 ]. To find the
derivatives of 𝑓1 in a vectorized operation, we would set 𝑣¯ = [1, 0, . . . , 0].
A seed with multiple nonzero elements computes the derivatives of a
weighted function with respect to all the variables, where the weight for
each function is determined by the corresponding 𝑣¯ value.
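The following Python sketch (the naming is ours) reproduces the reverse pass of Eq. 6.30 for Ex. 6.6: the forward pass stores the variables, and the adjoints are then accumulated in reverse from the seed placed on 𝑓2 :

import numpy as np

def grad_f2(x1, x2):
    # forward pass (store every variable)
    v1, v2 = x1, x2
    v3 = v1 * v2
    v4 = np.sin(v1)
    v5 = v3 + v4          # f1
    v6 = v2 ** 2
    v7 = v3 + v6          # f2
    # reverse pass, seeded with the adjoint of v7 set to 1
    v7b = 1.0
    v6b = v7b                              # partial of v7 with respect to v6 is 1
    v5b = 0.0                              # nothing downstream of v5 contributes to f2
    v4b = v5b
    v3b = v7b + v5b
    v2b = 2 * v2 * v6b + v1 * v3b          # = df2/dx2
    v1b = np.cos(v1) * v4b + v2 * v3b      # = df2/dx1
    return (v5, v7), np.array([v1b, v2b])

print(grad_f2(np.pi / 4, 2.0))   # gradient of f2: approximately [2.0, 4.785]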

6.6.4 Forward Mode or Reverse Mode?

Our goal is to compute 𝐽 𝑓 , the (𝑛 𝑓 × 𝑛 𝑥 ) matrix of derivatives of all the functions of interest 𝑓 with respect to all the input variables 𝑥. However, AD computes many other derivatives corresponding to intermediate variables. The complete Jacobian for all the intermediate variables, 𝑣 𝑖 = 𝑣 𝑖 (𝑣1 , 𝑣2 , . . . , 𝑣 𝑖 , . . . , 𝑣 𝑛 ), assuming that the loops have been unrolled, has the structure shown in Figs. 6.22 and 6.23.

The input variables 𝑥 are among the first entries in 𝑣, whereas the functions of interest 𝑓 are among the last entries of 𝑣. For simplicity, let us assume that the entries corresponding to 𝑥 and 𝑓 are contiguous, as previously shown in Fig. 6.13. Then, the derivatives we want (𝐽 𝑓 ) are a block located on the lower left in the much larger matrix (𝐽𝑣 ), as shown in Figs. 6.22 and 6.23. Although we are only interested in this block, AD requires the computation of additional intermediate derivatives.

Fig. 6.22 When 𝑛 𝑥 < 𝑛 𝑓 , the forward mode is advantageous.
Fig. 6.23 When 𝑛 𝑥 > 𝑛 𝑓 , the reverse mode is advantageous.

The main difference between the forward and the reverse approaches is that the forward mode computes the Jacobian column by column, whereas the reverse mode does it row by row. Thus, the cost of the forward mode is proportional to 𝑛 𝑥 , whereas the cost of the reverse mode is proportional to 𝑛 𝑓 . If we have more outputs (e.g., objective and constraints) than inputs (design variables), the forward mode is more efficient, as illustrated in Fig. 6.22. On the other hand, if we have many more inputs than outputs, then the reverse mode is more efficient, as

illustrated in Fig. 6.23. If the number of inputs is similar to the number


of outputs, neither mode has a significant advantage.
In both modes, each forward or reverse pass costs less than 2–3 times
the cost of running the original code in practice. However, because
the reverse mode requires storing a large amount of data, memory
costs also need to be considered. In principle, the required memory is
proportional to the number of variables, but there are techniques that
can reduce the memory usage significantly.∗ ∗ One of the main techniques for reducing
the memory usage of reverse AD is check-
pointing; see Chapter 12 in Griewank.115
6.6.5 AD Implementation 115. Griewank, Evaluating Derivatives,
2000.
There are two main ways to implement AD: by source code transformation
or by operator overloading. The function we used to demonstrate the
issues with symbolic differentiation (Ex. 6.2) can be differentiated much
more easily with AD. In the examples that follow, we use this function
to demonstrate how the forward and reverse mode work using both
source code transformation and operator overloading.

Source Code Transformation

AD tools that use source code transformation process the whole source
code automatically with a parser and add lines of code that compute
the derivatives. The added code is highlighted in Exs. 6.7 and 6.8.

Example 6.7 Source code transformation for forward mode

Running an AD source transformation tool on the code from Ex. 6.2 produces
the code that follows.

Input: 𝑥, 𝑥¤ Set seed 𝑥¤ = 1 to get d 𝑓 /d𝑥


𝑓 =𝑥
𝑓¤ = 𝑥¤ Automatically added by AD tool
for 𝑖 = 1 to 10 do
𝑓¤ = (𝑥¤ + 𝑓¤) · cos(𝑥 + 𝑓 )    Automatically added by AD tool; placed before the assignment so that it uses the old value of 𝑓
𝑓 = sin(𝑥 + 𝑓 )
end for
return 𝑓 , 𝑓¤ d 𝑓 /d𝑥 is given by 𝑓¤

The AD tool added a new line for each variable assignment that computes the corresponding derivative, placing it before the assignment so that it uses the values prior to being overwritten. We can then set the seed, 𝑥¤ = 1, and run the code. As the loop proceeds, 𝑓¤ accumulates the derivative as 𝑓 is successively updated.

Example 6.8 Source code transformation for reverse mode

The reverse-mode AD version of the code from Ex. 6.2 follows.

Input: 𝑥, 𝑓¯ Set 𝑓¯ = 1 to get d 𝑓 /d𝑥


𝑓 =𝑥
for 𝑖 = 1 to 10 do
push( 𝑓 ) Save current value of 𝑓 on top of stack
𝑓 = sin(𝑥 + 𝑓 )
end for
𝑥¯ = 0
for 𝑖 = 10 to 1 do    Reverse loop added by AD tool
𝑓 = pop()    Get value of 𝑓 from top of stack
𝑥¯ = 𝑥¯ + cos(𝑥 + 𝑓 ) · 𝑓¯    Accumulate the direct contribution of 𝑥 in this assignment
𝑓¯ = cos(𝑥 + 𝑓 ) · 𝑓¯
end for
𝑥¯ = 𝑥¯ + 𝑓¯    Contribution of the initial assignment 𝑓 = 𝑥
return 𝑓 , 𝑥¯    d 𝑓 /d𝑥 is given by 𝑥¯

The first loop is identical to the original code except for one line. Because the
derivatives that accumulate in the reverse loop depend on the intermediate
values of the variables, we need to store all the variables in the forward loop.
We store and retrieve the variables using a stack, hence the call to “push”.† † A stack, also known as last in, first out

The second loop, which runs in reverse, is where the derivatives are (LIFO), is a data structure that stores a one-
dimensional array. We can only add an
computed. We set the reverse seed, 𝑓¯ = 1, and then the adjoint variables element to the top of the stack (push) and
accumulate the derivatives back to the start. take the element from the top of the stack
(pop).

Operator Overloading

The operator overloading approach creates a new augmented data


type that stores both the variable value and its derivative. Every
floating-point number 𝑣 is replaced by a new type with two parts (𝑣, 𝑣¤ ),
commonly referred to as a dual number. All standard operations (e.g.,
addition, multiplication, sine) are overloaded such that they compute
𝑣 according to the original function value and 𝑣¤ according to the
derivative of that function. For example, the multiplication operation,
𝑥1 · 𝑥2 , would be defined for the dual-number data type as

(𝑥 1 , 𝑥¤ 1 ) · (𝑥2 , 𝑥¤ 2 ) = (𝑥1 𝑥2 , 𝑥 1 𝑥¤ 2 + 𝑥¤ 1 𝑥2 ) , (6.31)

where we compute the original function value in the first term, and the
second term carries the derivative of the multiplication.
Although we wrote the two parts explicitly in Eq. 6.31, the source
code would only show a normal multiplication, such as 𝑣3 = 𝑣1 · 𝑣 2 .
However, each of these variables would be of the new type and carry the
corresponding 𝑣¤ quantities. By overloading all the required operations,

the computations happen “behind the scenes”, and the source code
does not have to be changed, except to declare all the variables to be of
the new type and to set the seed. Example 6.9 lists the original code
from Ex. 6.2 with notes on the actual computations that are performed
as a result of overloading.

Example 6.9 Operator overloading for forward mode

Using the derived data types and operator overloading approach in forward
mode does not change the code listed in Ex. 6.2. The AD tool provides
overloaded versions of the functions in use, which in this case are assignment,
addition, and sine. These functions are overloaded as follows:

𝑣2 = 𝑣1 ⇒ (𝑣2 , 𝑣¤ 2 ) = (𝑣 1 , 𝑣¤ 1 )
𝑣1 + 𝑣2 ⇒ (𝑣 1 , 𝑣¤ 1 ) + (𝑣2 , 𝑣¤ 2 ) ≡ (𝑣1 + 𝑣2 , 𝑣¤ 1 + 𝑣¤ 2 )
sin(𝑣) ⇒ sin (𝑣, 𝑣¤ ) ≡ (sin(𝑣), cos(𝑣)𝑣¤ ) .

In this case, the source code is unchanged, but additional computations occur
through the overloaded functions. We reproduce the code that follows with
notes on the hidden operations that take place.

Input: 𝑥    𝑥 is of a new data type with two components (𝑥, 𝑥¤ )
𝑓 = 𝑥    ( 𝑓 , 𝑓¤) = (𝑥, 𝑥¤ ) through the overloading of the “=” operation
for 𝑖 = 1 to 10 do
𝑓 = sin(𝑥 + 𝑓 )    Code is unchanged, but overloading computes the derivative‡
end for
return 𝑓    The new data type includes 𝑓¤, which is d 𝑓 /d𝑥

‡ The overloading of “+” computes (𝑣, 𝑣¤ ) = (𝑥 + 𝑓 , 𝑥¤ + 𝑓¤) and then the overloading of “sin” computes ( 𝑓 , 𝑓¤) = (sin(𝑣), cos(𝑣) 𝑣¤ ).

We set the seed, 𝑥¤ = 1, and for each function assignment, we add the corresponding derivative line. As the loops are repeated, 𝑓¤ accumulates the derivative as 𝑓 is successively updated.
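A minimal Python sketch of this operator-overloading approach (not the implementation used by any particular AD tool) stores each value together with its derivative and overloads the needed operations:

import math

class Dual:
    # A dual number (value, derivative) for forward-mode AD
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot

    def __add__(self, other):
        return Dual(self.val + other.val, self.dot + other.dot)

    def __mul__(self, other):
        # multiplication rule of Eq. 6.31
        return Dual(self.val * other.val,
                    self.val * other.dot + self.dot * other.val)

def sin(a):
    return Dual(math.sin(a.val), math.cos(a.val) * a.dot)

# Differentiate the loop from Ex. 6.2: f = sin(x + f), repeated 10 times
x = Dual(1.0, 1.0)          # seed the input derivative with 1
f = Dual(x.val, x.dot)      # f = x
for _ in range(10):
    f = sin(x + f)
print(f.val, f.dot)         # function value and df/dx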

The implementation of the reverse mode using operator overloading is less straightforward and is not detailed here. It requires a new data type that stores the information from the computational graph and the variable values when running the forward pass. This information can be stored using the taping technique. After the forward evaluation using the new type, the “tape” holds the sequence of operations, which is then evaluated in reverse to propagate the reverse-mode seed.§

§ See Sec. 5.4 in Griewank115 for more details on reverse mode using operator overloading.
115. Griewank, Evaluating Derivatives, 2000.

Connection of AD with the Complex-Step Method
The complex-step method from Section 6.5 can be interpreted as forward-
mode AD using operator overloading, where the data type is the

complex number (𝑥, 𝑦) ≡ 𝑥 + 𝑖 𝑦, and the imaginary part 𝑦 carries the


derivative. To see this connection more clearly, let us write the complex
multiplication operation as

𝑓 = (𝑥1 + 𝑖 𝑦1 )(𝑥 2 + 𝑖 𝑦2 ) = (𝑥1 𝑥2 − 𝑦1 𝑦2 ) + 𝑖 (𝑦1 𝑥2 + 𝑥 1 𝑦2 ) . (6.32)

This equation is similar to the overloaded multiplication (Eq. 6.31). The


only difference is that the real part includes the term −𝑦1 𝑦2 , which
corresponds to the second-order error term in Eq. 6.15. In this case, the
complex part gives the exact derivative, but a second-order error might
appear for other operations. As argued before, these errors vanish in
finite-precision arithmetic if the complex step is small enough.

Tip 6.6 AD tools

There are AD tools available for most programming languages, including Fortran,117,118 C/C++,119 Python,120,121 Julia,122 and MATLAB.123 These tools have been extensively developed and provide the user with great functionality, including the calculation of higher-order derivatives, multivariable derivatives, and reverse-mode options. Although some AD tools can be applied recursively to yield higher-order derivatives, this approach is not typically efficient and is sometimes unstable.124

117. Utke et al., OpenAD/F: A modular open-source tool for automatic differentiation of Fortran codes, 2008.
118. Hascoet and Pascual, The Tapenade automatic differentiation tool: Principles, model, and specification, 2013.
119. Griewank et al., Algorithm 755: ADOL-C: A package for the automatic differentiation of algorithms written in C/C++, 1996.
120. Wiltschko et al., Tangent: automatic differentiation using source code transformation in Python, 2017.
121. Bradbury et al., JAX: Composable Transformations of Python+NumPy Programs, 2018.
122. Revels et al., Forward-mode automatic differentiation in Julia, 2016.
123. Neidinger, Introduction to automatic differentiation and MATLAB object-oriented programming, 2010.
124. Betancourt, A geometric theory of higher-order automatic differentiation, 2018.

Source Code Transformation versus Operator Overloading

The source code transformation and the operator overloading approaches each have their relative advantages and disadvantages. The overloading approach is much more elegant because the original code stays practically the same and can be maintained directly. On the other hand, the source transformation approach enlarges the original code and results in less readable code, making it hard to work with. Still, it is easier to see what operations take place when debugging. Instead of maintaining source code transformed by AD, it is advisable to work with the original source and devise a workflow where the parser is rerun before compiling a new version.
One advantage of the source code transformation approach is that
it tends to yield faster code and allows more straightforward compile-
time optimizations. The overloading approach requires a language that
supports user-defined data types and operator overloading, whereas
source transformation does not. Developing a source transformation
AD tool is usually more challenging than developing the overloading
approach because it requires an elaborate parser that understands the
source syntax.

6.6.6 AD Shortcuts for Matrix Operations


The efficiency of AD can be dramatically increased with manually imple-
mented shortcuts. When the code involves matrix operations, manual
implementation of a higher-level differentiation of those operations
is more efficient than the line-by-line AD implementation. Giles125 documents the forward and reverse differentiation of many matrix elementary operations.

125. Giles, An extended collection of matrix derivative results for forward and reverse mode algorithmic differentiation, 2008.

For example, suppose that we have a matrix multiplication 𝐶 = 𝐴𝐵. Then, the forward mode yields

𝐶¤ = 𝐴¤ 𝐵 + 𝐴 𝐵¤ .    (6.33)

The idea is to use 𝐴¤ and 𝐵¤ from the AD code preceding the operation and then manually implement this formula (bypassing any AD of the code that performs that operation) to obtain 𝐶¤ , as shown in Fig. 6.24. Then we can use 𝐶¤ to seed the remainder of the AD code.
The reverse mode of the multiplication yields

𝐴¯ = 𝐶¯ 𝐵 | ,    𝐵¯ = 𝐴| 𝐶¯ .    (6.34)

Similarly, we take 𝐶¯ from the reverse AD code and implement the formula manually to compute 𝐴¯ and 𝐵¯ , which we can use in the
remaining AD code in the reverse procedure.

Fig. 6.24 Matrix operations, including the solution of linear systems, can be differentiated manually to bypass more costly AD code.

One particularly useful (and astounding!) result is the differentiation


of the matrix inverse product. If we have a linear solver such that
𝐶 = 𝐴−1 𝐵, we can bypass the solver in the AD process by using the
following results:  
𝐶¤ = 𝐴⁻¹ ( 𝐵¤ − 𝐴¤ 𝐶 )    (6.35)

for the forward mode and

𝐵¯ = 𝐴−| 𝐶¯ ,    𝐴¯ = − 𝐵¯ 𝐶 |    (6.36)

for the reverse mode.


In addition to deriving the formulas just shown, Giles125 derives formulas for the matrix derivatives of the inverse, determinant, norms, quadratic, polynomial, exponential, eigenvalues and eigenvectors, and singular value decomposition. Taking shortcuts as described here applies more broadly to any case where a part of the process can be differentiated manually to produce a more efficient derivative computation.

125. Giles, An extended collection of matrix derivative results for forward and reverse mode algorithmic differentiation, 2008.
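As an illustration, the following Python sketch (with an assumed random test problem) applies the forward-mode shortcut of Eq. 6.35 to a linear solve and verifies it against a finite difference:

import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.random((n, n)) + n * np.eye(n)   # well-conditioned test matrix
B = rng.random((n, 2))
Adot = rng.random((n, n))                # seeds coming from the preceding AD code
Bdot = rng.random((n, 2))

C = np.linalg.solve(A, B)                    # original operation, C = A^{-1} B
Cdot = np.linalg.solve(A, Bdot - Adot @ C)   # Eq. 6.35 shortcut (one extra solve)

# Verify against a forward finite difference of the perturbed solve
h = 1e-7
Cdot_fd = (np.linalg.solve(A + h * Adot, B + h * Bdot) - C) / h
print(np.max(np.abs(Cdot - Cdot_fd)))        # small, limited by finite-difference error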

6.7 Implicit Analytic Methods—Direct and Adjoint

Direct and adjoint methods—which we refer to jointly as implicit analytic


methods—linearize the model governing equations to obtain a system
of linear equations whose solution yields the desired derivatives. Like
the complex-step method and AD, implicit analytic methods compute
derivatives with a precision matching that of the function evaluation.
The direct method is analogous to forward-mode AD, whereas the
adjoint method is analogous to reverse-mode AD.
Analytic methods can be thought of as lying in between the finite-
difference method and AD in terms of the number of variables involved.
With finite differences, we only need to be aware of inputs and outputs,
whereas AD involves every single variable assignment in the code.
Analytic methods work at the model level and thus require knowledge
of the governing equations and the corresponding state variables.
There are two main approaches to deriving implicit analytic methods:
continuous and discrete. The continuous approach linearizes the
original continuous governing equations, such as a partial differential
equation (PDE), and then discretizes this linearization. The discrete
approach linearizes the governing equations only after they have been
discretized as a set of residual equations, 𝑟(𝑢) = 0.
Each approach has its advantages and disadvantages. The discrete
approach is preferred and is easier to generalize, so we explain this
approach exclusively. One of the primary reasons the discrete approach
is preferred is that the resulting derivatives are consistent with the func-
tion values because they use the same discretization. The continuous and Dwight126 compare the contin-
∗ Peter

approach is only consistent in the limit of a fine discretization. The uous and discrete adjoint approaches in
more detail.
resulting inconsistencies can mislead the optimization.∗
126. Peter and Dwight, Numerical sensitiv-
ity analysis for aerodynamic optimization: A
survey of approaches, 2010.

6.7.1 Residuals and Functions


As mentioned in Chapter 3, a discretized numerical model can be
written as a system of residuals,

𝑟(𝑢; 𝑥) = 0 , (6.37)

where the semicolon denotes that the design variables 𝑥 are fixed when
these equations are solved for the state variables 𝑢. Through these
equations, 𝑢 is an implicit function of 𝑥. This relationship is represented
by the box containing the solver and residual equations in Fig. 6.25.
The functions of interest, 𝑓 (𝑥, 𝑢), are typically explicit functions of the state variables and the design variables. However, because 𝑢 is an implicit function of 𝑥, 𝑓 is ultimately an implicit function of 𝑥 as well. To compute 𝑓 for a given 𝑥, we must first find 𝑢 such that 𝑟(𝑢; 𝑥) = 0. This is usually the most computationally costly step and requires a solver (see Section 3.6). The residual equations could be nonlinear and involve many state variables. In PDE-based models it is common to have millions of states. Once we have solved for the state variables 𝑢, we can compute the functions of interest 𝑓 . The computation of 𝑓 for a given 𝑢 and 𝑥 is usually much cheaper because it does not require a solver. For example, in PDE-based models, computing such functions typically involves an integration of the states over a surface, or some other transformation of the states.

Fig. 6.25 Relationship between functions and design variables for a system involving a solver. The implicit equations 𝑟(𝑢; 𝑥) = 0 define the states 𝑢 for a given 𝑥, so the functions of interest 𝑓 depend explicitly and implicitly on the design variables 𝑥.
other transformation of the states.
To compute d 𝑓 /d𝑥 using finite differences, we would have to use
the solver to find 𝑢 for each perturbation of 𝑥. That means that we
would have to run the solver 𝑛 𝑥 times, which would not scale well when
the solution is costly. AD also requires the propagation of derivatives
through the solution process. As we will see, implicit analytic methods
avoid involving the potentially expensive nonlinear solution in the
derivative computation.

Example 6.10 Residuals and functions in structural analysis

Recall Ex. 3.2, where we introduced the structural model of a truss structure.
The residuals in this case are the linear equations,

𝑟(𝑢) ≡ 𝐾(𝑥)𝑢 − 𝑞 = 0 , (6.38)

where the state variables are the displacements, 𝑢. Solving for the displacement
requires only a linear solver in this case, but it is still the most costly part of the
analysis. Suppose that the design variables are the cross-sectional areas of the
truss members. Then, the stiffness matrix is a function of 𝑥, but the external
forces are not.
Suppose that the functions of interest are the stresses in each of the truss
members. This is an explicit function of the displacements, which is given by

the matrix multiplication

𝑓 (𝑥, 𝑢) ≡ 𝜎(𝑢) = 𝑆𝑢 ,

where 𝑆 is a matrix that depends on 𝑥. This is a much cheaper computation


than solving the linear system (Eq. 6.38).

6.7.2 Direct and Adjoint Derivative Equations


The derivatives we ultimately want to compute are the ones in the
Jacobian d 𝑓 /d𝑥. Given the explicit and implicit dependence of 𝑓 on 𝑥,
we can use the chain rule to write the total derivative Jacobian of 𝑓 as

d 𝑓 /d𝑥 = 𝜕 𝑓 /𝜕𝑥 + (𝜕 𝑓 /𝜕𝑢)(d𝑢/d𝑥) ,    (6.39)

where the result is an (𝑛 𝑓 × 𝑛 𝑥 ) matrix.†

† This chain rule can be derived by writing the total differential of 𝑓 as d 𝑓 = (𝜕 𝑓 /𝜕𝑥) d𝑥 + (𝜕 𝑓 /𝜕𝑢) d𝑢 and then “dividing” it by d𝑥. See Appendix A.2 for more background on differentials.

In this context, the total derivatives, d 𝑓 /d𝑥, take into account the change in 𝑢 that is required to keep the residuals of the governing equations (Eq. 6.37) equal to zero. The partial derivatives in Eq. 6.39 represent the variation of 𝑓 (𝑥, 𝑢) with respect to changes in 𝑥 or 𝑢 without regard to satisfying the governing equations.
To better understand the difference between total and partial deriva-
tives in this context, imagine computing these derivatives using finite
differences with small perturbations. For the total derivatives, we
would perturb 𝑥, re-solve the governing equations to obtain 𝑢, and
then compute 𝑓 , which would account for both dependency paths
in Fig. 6.25. To compute the partial derivatives 𝜕 𝑓 /𝜕𝑥 and 𝜕 𝑓 /𝜕𝑢,
however, we would perturb 𝑥 or 𝑢 and recompute 𝑓 without re-solving
the governing equations. In general, these partial derivative terms are
cheap to compute numerically or can be obtained symbolically.
To find the total derivative d𝑢/d𝑥, we need to consider the governing
equations. Assuming that we are at a point where 𝑟(𝑥, 𝑢) = 0, any
perturbation in 𝑥 must be accompanied by a perturbation in 𝑢 such that
the governing equations remain satisfied. Therefore, the differential of
the residuals can be written as
d𝑟 = (𝜕𝑟/𝜕𝑥) d𝑥 + (𝜕𝑟/𝜕𝑢) d𝑢 = 0 .    (6.40)
This constraint is illustrated in Fig. 6.26 in two dimensions, but keep
in mind that 𝑥, 𝑢, and 𝑟 are vectors in the general case. The governing
equations (Eq. 6.37) map an 𝑛 𝑥 -vector 𝑥 to an 𝑛𝑢 -vector 𝑢. This mapping
defines a hypersurface (also known as a manifold) in the 𝑥–𝑢 space.

The total derivative d 𝑓 /d𝑥 that we ultimately want to compute


represents the effect that a perturbation on 𝑥 has on 𝑓 subject to the
constraint of remaining on this hypersurface, which can be achieved
with the appropriate variation in 𝑢.
To obtain a more useful equation, we rearrange Eq. 6.40 to get the
linear system
(𝜕𝑟/𝜕𝑢)(d𝑢/d𝑥) = − 𝜕𝑟/𝜕𝑥 ,    (6.41)

where 𝜕𝑟/𝜕𝑥 and d𝑢/d𝑥 are both (𝑛𝑢 × 𝑛 𝑥 ) matrices, and 𝜕𝑟/𝜕𝑢 is a square matrix of size (𝑛𝑢 × 𝑛𝑢 ). This linear system is useful because if we provide the partial derivatives in this equation (which are cheap to compute), we can solve for the total derivatives d𝑢/d𝑥 (whose computation would otherwise require re-solving 𝑟(𝑢) = 0). Because d𝑢/d𝑥 is a matrix with 𝑛 𝑥 columns, this linear system needs to be solved for each 𝑥 𝑖 with the corresponding column of the right-hand-side matrix 𝜕𝑟/𝜕𝑥 𝑖 .

Fig. 6.26 The governing equations determine the values of 𝑢 for a given 𝑥. Given a point that satisfies the equations, the appropriate differential in 𝑢 must accompany a differential of 𝑥 about that point for the equations to remain satisfied.
(Eq. 6.41) and substitute the solution for d𝑢/d𝑥 into the total derivative
equation (Eq. 6.39). Then we get

d 𝑓 /d𝑥 = 𝜕 𝑓 /𝜕𝑥 − (𝜕 𝑓 /𝜕𝑢)(𝜕𝑟/𝜕𝑢)⁻¹ (𝜕𝑟/𝜕𝑥) ,    (6.42)
where all the derivative terms on the right-hand side are partial deriva-
tives. The partial derivatives in this equation can be computed using any
of the methods that we have described earlier: symbolic differentiation,
finite differences, complex step, or AD. Equation 6.42 shows two ways
to compute the total derivatives, which we call the direct method and the
adjoint method.
The direct method (already outlined earlier) consists of solving the
linear system (Eq. 6.41) and substituting d𝑢/d𝑥 into Eq. 6.39. Defining
𝜙 ≡ − d𝑢/d𝑥, we can rewrite Eq. 6.41 as

\frac{\partial r}{\partial u} \phi = \frac{\partial r}{\partial x} .     (6.43)

After solving for 𝜙 (one column at the time), we can use it in the total
derivative equation (Eq. 6.39) to obtain,

\frac{df}{dx} = \frac{\partial f}{\partial x} - \frac{\partial f}{\partial u} \phi .     (6.44)

This is sometimes called the forward mode because it is analogous to


forward-mode AD.
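To make the procedure concrete, here is a minimal sketch in Python (not from the original text) of how the direct method could be implemented once the four partial-derivative matrices are available; the function and argument names are illustrative assumptions.

    import numpy as np

    def total_derivatives_direct(dfdx, dfdu, drdx, drdu):
        # Direct (forward) method: solve (dr/du) phi = dr/dx, which amounts to one
        # linear solution per design variable (one per column of dr/dx), and then
        # assemble df/dx = (partial) df/dx - (df/du) phi (Eqs. 6.43 and 6.44).
        phi = np.linalg.solve(drdu, drdx)   # (nu x nx)
        return dfdx - dfdu @ phi            # (nf x nx)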

Solving the linear system (Eq. 6.43) is typically the most computa-
tionally expensive operation in this procedure. The cost of this approach
scales with the number of inputs 𝑛 𝑥 but is essentially independent
of the number of outputs 𝑛 𝑓 . This is the same scaling behavior as
finite differences and forward-mode AD. However, the constant of
proportionality is typically much smaller in the direct method because
we only need to solve the nonlinear equations 𝑟(𝑢; 𝑥) = 0 once to obtain
the states.

[Fig. 6.27: The total derivatives (Eq. 6.42) can be computed either by solving for φ (direct method) or by solving for ψ (adjoint method). The blocks df/dx, ∂f/∂x, ∂f/∂u, (∂r/∂u)^{-1}, and ∂r/∂x have sizes (n_f × n_x), (n_f × n_x), (n_f × n_u), (n_u × n_u), and (n_u × n_x), respectively; φ is (n_u × n_x), and ψ^T is (n_f × n_u).]

The adjoint method changes the linear system that is solved to


compute the total derivatives. Looking at Fig. 6.27, we see that instead
of solving the linear system with 𝜕𝑟/𝜕𝑥 on the right-hand side, we
can solve it with 𝜕 𝑓 /𝜕𝑢 on the right-hand side. This corresponds
to replacing the two Jacobians in the middle with a new matrix of
unknowns,
\psi^\top \equiv \frac{\partial f}{\partial u} \left(\frac{\partial r}{\partial u}\right)^{-1} ,     (6.45)

where the columns of ψ are called the adjoint vectors. Multiplying both sides of Eq. 6.45 by ∂r/∂u on the right and taking the transpose of the whole equation, we obtain the adjoint equation,

\frac{\partial r}{\partial u}^\top \psi = \frac{\partial f}{\partial u}^\top .     (6.46)
This linear system has no dependence on 𝑥. Each adjoint vector is
associated with a function of interest 𝑓 𝑗 and is found by solving the
adjoint equation (Eq. 6.46) with the corresponding row 𝜕 𝑓 𝑗 /𝜕𝑢. The
solution (𝜓) is then used to compute the total derivative

\frac{df}{dx} = \frac{\partial f}{\partial x} - \psi^\top \frac{\partial r}{\partial x} .     (6.47)
This is sometimes called the reverse mode because it is analogous to
reverse-mode AD.
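A corresponding sketch of the adjoint method (again with assumed argument names, mirroring the direct-method sketch above) solves the transposed system once per function of interest:

    import numpy as np

    def total_derivatives_adjoint(dfdx, dfdu, drdx, drdu):
        # Adjoint (reverse) method: solve (dr/du)^T psi = (df/du)^T, which amounts
        # to one linear solution per output (one per row of df/du), and then
        # assemble df/dx = (partial) df/dx - psi^T (dr/dx) (Eqs. 6.46 and 6.47).
        psi = np.linalg.solve(drdu.T, dfdu.T)   # (nu x nf)
        return dfdx - psi.T @ drdx              # (nf x nx)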

As we will see in Section 6.9, the adjoint vectors are equivalent to


the total derivatives d 𝑓 /d𝑟, which quantify the change in the function
of interest given a perturbation in the residual that gets zeroed out by
an appropriate change in 𝑢.‡ ‡ The adjoint vector can also be interpreted

as a Lagrange multiplier vector associated


with equality constraints 𝑟 = 0. Defining
6.7.3 Direct or Adjoint? the Lagrangian ℒ(𝑥, 𝑢) = 𝑓 + 𝜓 | 𝑟 and
differentiating it with respect to 𝑥 , we get
Similar to the direct method, the solution of the adjoint linear system
𝜕ℒ 𝜕𝑓 𝜕𝑟
(Eq. 6.46) tends to be the most expensive operation. Although the linear 𝜕𝑥
=
𝜕𝑥
+ 𝜓|
𝜕𝑥
.
system is of the same size as that of the direct method, the cost of the
Thus, the total derivatives d 𝑓 /d𝑥 are the
adjoint method scales with the number of outputs 𝑛 𝑓 and is essentially derivatives of this Lagrangian.
independent of the number of inputs 𝑛 𝑥 . The comparison between the
Solve 𝑛 𝑥 times
computational cost of the direct and adjoint methods is summarized in
Table 6.3 and illustrated in Fig. 6.28.
= −
Similar to the trade-offs between forward- and reverse-mode AD, if
the number of outputs is greater than the number of inputs, the direct 𝑛𝑥 < 𝑛 𝑓
(forward) method is more efficient (Fig. 6.28, top). On the other hand, if
Solve 𝑛 𝑓 times
the number of inputs is greater than the number of outputs, it is more
efficient to use the adjoint (reverse) method (Fig. 6.28, bottom). When
= −
the number of inputs and outputs is large and similar, neither method
has an advantage, and the cost of computing the full total derivative 𝑛𝑥 > 𝑛 𝑓
Jacobian might be prohibitive. In this case, aggregating the outputs and
using the adjoint method might be effective, as explained in Tip 6.7. Fig. 6.28 Two possibilities for the size
In practice, the adjoint method is implemented much more often of d 𝑓 /d𝑥 in Fig. 6.27. When 𝑛 𝑥 < 𝑛 𝑓 ,
it is advantageous to solve the linear
than the direct method. Although both methods require a similar system with the vector to the right
implementation effort, the direct method competes with methods that of the square matrix because it has
are much more easily implemented, such as finite differencing, complex fewer columns. When 𝑛 𝑥 > 𝑛 𝑓 , it is
step, and forward-mode AD. On the other hand, the adjoint method advantageous to solve the transposed
linear system with the vector to the
only competes with reverse-mode AD, which is plagued by the memory left because it has fewer rows.
issue.

Table 6.3: Cost comparison of computing derivatives with direct and adjoint methods.

Step                           | Direct    | Adjoint
Partial derivative computation | Same      | Same
Linear solution                | n_x times | n_f times
Matrix multiplications         | Same      | Same

Another reason why the adjoint method is more widely used is that many optimization problems have a few functions of interest (one objective and a few constraints) and many design variables. The adjoint method has made it possible to solve optimization problems involving computationally intensive PDE models.§

§ One widespread application of the adjoint method has been in aerodynamic and hydrodynamic shape optimization.127
127. Martins, Perspectives on aerodynamic design optimization, 2020.

Although implementing implicit analytic methods is labor intensive,

it is worthwhile if the differentiated code is used frequently and in


applications that demand repeated evaluations. For such applications,
analytic differentiation with partial derivatives computed using AD is
the recommended approach for differentiating code because it combines
the best features of these methods.

Example 6.11 Differentiating an implicit function

Consider the following simplified equation for the natural frequency of a beam:

f = \lambda m^2 ,     (6.48)

where λ is a function of m through the following relationship:

\frac{\lambda}{m} + \cos\lambda = 0 .

Figure 6.29 shows the equivalent of Fig. 6.25 in this case. Our goal is to compute the derivative df/dm. Because λ is an implicit function of m, we cannot find an explicit expression for λ as a function of m, substitute that expression into Eq. 6.48, and then differentiate normally. Fortunately, the implicit analytic methods allow us to compute this derivative.

Referring back to our nomenclature,

f(x, u) ≡ f(m, λ) = \lambda m^2 ,
r(u; x) ≡ r(λ; m) = \frac{\lambda}{m} + \cos\lambda = 0 ,

where m is the design variable and λ is the state variable.

[Fig. 6.29: Model for Ex. 6.11: m is the input, a solver drives λ/m + cos λ to zero to obtain λ, and f = λm² is the output.]

The partial derivatives that we need for the total derivative computation (Eq. 6.42) are as follows:

\frac{\partial f}{\partial x} = \frac{\partial f}{\partial m} = 2\lambda m , \quad \frac{\partial f}{\partial u} = \frac{\partial f}{\partial \lambda} = m^2 ,
\frac{\partial r}{\partial x} = \frac{\partial r}{\partial m} = -\frac{\lambda}{m^2} , \quad \frac{\partial r}{\partial u} = \frac{\partial r}{\partial \lambda} = \frac{1}{m} - \sin\lambda .

Because this is a problem of only one function of interest and one design variable, there is no distinction between the direct and adjoint methods (forward and reverse), and the linear system solution is simply a division. Substituting these partial derivatives into the total derivative equation (Eq. 6.42) yields

\frac{df}{dm} = 2\lambda m + \frac{\lambda}{\frac{1}{m} - \sin\lambda} .

Thus, we obtained the desired derivative despite the implicitly defined function. Here, it was possible to get an explicit expression for the total derivative, but generally, it is only possible to get a numeric value.
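As a check on this result, the following Python sketch (not part of the original example; the Newton solver and the value m = 3 are assumptions chosen for illustration) solves the residual for λ, evaluates the analytic total derivative, and verifies it with the complex-step method:

    import numpy as np

    def solve_lambda(m, lam=2.0, tol=1e-12):
        # Newton's method on r(lam; m) = lam/m + cos(lam) = 0.
        # Plain complex arithmetic works, which enables the complex-step check below.
        for _ in range(50):
            r = lam / m + np.cos(lam)
            lam -= r / (1.0 / m - np.sin(lam))
            if abs(r) < tol:
                break
        return lam

    def f(m):
        lam = solve_lambda(m)
        return lam * m**2

    m = 3.0
    lam = solve_lambda(m)
    dfdm = 2 * lam * m + lam / (1.0 / m - np.sin(lam))   # implicit analytic result
    dfdm_cs = f(m + 1e-30j).imag / 1e-30                 # complex-step verification
    print(dfdm, dfdm_cs)   # the two values should agree to machine precision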

Example 6.12 Direct and adjoint methods applied to structural analysis

Consider the structural analysis we reintroduced in Ex. 6.10. Let us compute the derivatives of the stresses with respect to the cross-sectional truss member areas and denote the number of degrees of freedom as n_u and the number of truss members as n_t. Figure 6.30 shows the equivalent of Fig. 6.25 for this case. We require four Jacobians of partial derivatives: ∂r/∂x, ∂r/∂u, ∂σ/∂x, and ∂σ/∂u.

[Fig. 6.30: Model for Ex. 6.12: x is the input, a solver drives r = K(x)u − q to zero to obtain u, and the output is σ = Su.]

When differentiating the governing equations with respect to an area x_i, neither the displacements nor the external forces depend directly on the areas,¶ so we obtain

\frac{\partial r}{\partial x_i} = \frac{\partial}{\partial x_i}(Ku - q) = \frac{\partial}{\partial x_i}(Ku) = \frac{\partial K}{\partial x_i} u .

¶ The displacements do change with the areas but only through the solution of the governing equations, which are not considered when taking partial derivatives.

This is a vector of size n_u corresponding to one column of ∂r/∂x. We can compute this term by symbolically differentiating the equations that assemble the stiffness matrix. Alternatively, we could use AD on the function that computes the stiffness matrix or use finite differencing. Using AD, we can employ the techniques described in Section 6.7.4 for an efficient implementation.

It is more efficient to compute the derivative of the product Ku directly instead of differentiating K and then multiplying by u. This avoids storing and subtracting the entire perturbed matrix. We can apply a forward finite difference to the product as follows:

\frac{\partial r}{\partial x_i} \approx \frac{K(x + h \hat{e}_i) u - K(x) u}{h} .

Because the external forces do not depend on the displacements in this case,‖ the partial derivatives of the governing equations with respect to the displacements are given by

\frac{\partial r}{\partial u} = K .

‖ This is not true for large displacements, but we assume small displacements.

We already have the stiffness matrix, so this term does not require any further computations.

The partial derivative of the stresses with respect to the areas is zero (∂σ/∂x = 0) because there is no direct dependence.∗∗ Thus, the partial derivative of the stress with respect to displacements is

\frac{\partial \sigma}{\partial u} = S ,

which is an (n_t × n_u) matrix that we already have from the stress computation.

∗∗ Although ultimately, the areas do change the stresses, they do so only through changes in the displacements.

Now we can use either the direct or adjoint method by replacing the partial derivatives in the respective equations. The direct linear system (Eq. 6.43) yields

K \phi_i = \frac{\partial}{\partial x_i}(Ku) ,

where i corresponds to each truss member area. Once we have φ_i, we can use it to compute the total derivatives of all the stresses with respect to member area i with Eq. 6.44, as follows:

\frac{d\sigma}{dx_i} = -S \phi_i .

The adjoint linear system (Eq. 6.46) yields††

K \psi_j = S_{j,*}^\top ,

where j corresponds to each truss member, and S_{j,*} is the jth row of S. Once we have ψ_j, we can use it to compute the total derivative of the stress in member j with respect to all truss member areas with Eq. 6.47, as follows:

\frac{d\sigma_j}{dx} = -\psi_j^\top \frac{\partial}{\partial x}(Ku) .

†† Usually, the stiffness matrix is symmetric, and K^T = K. This means that the solver for displacements can be repurposed for adjoint computation by setting the right-hand side shown here instead of the loads. For that reason, this right-hand side is sometimes called a pseudo-load.

In this case, there is no advantage in using one method over the other because the number of areas is the same as the number of stresses. However, if we aggregated the stresses as suggested in Tip 6.7, the adjoint would be advantageous.
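The following Python sketch (an illustration, not the book's implementation; the inputs K, S, and the pseudo-load matrix are assumed to be available from the structural code) shows how both methods reduce to a handful of linear solves and matrix products:

    import numpy as np

    def stress_derivatives(K, S, dKu_dx):
        # K      : (nu x nu) stiffness matrix (already assembled)
        # S      : (nt x nu) stress-recovery matrix, sigma = S u
        # dKu_dx : (nu x nx) matrix whose i-th column is the pseudo-load d(K u)/dx_i
        # Direct method: one solve per member area (Eqs. 6.43 and 6.44)
        phi = np.linalg.solve(K, dKu_dx)       # (nu x nx)
        dsigma_dx_direct = -S @ phi            # (nt x nx)
        # Adjoint method: one solve per stress; K^T = K for a symmetric stiffness matrix
        psi = np.linalg.solve(K.T, S.T)        # (nu x nt)
        dsigma_dx_adjoint = -psi.T @ dKu_dx    # (nt x nx)
        return dsigma_dx_direct, dsigma_dx_adjoint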

Tip 6.7 Aggregate outputs to reduce the cost of adjoint or reverse methods
For problems with many outputs and many inputs, there is no efficient way
of computing the Jacobian. This is common in some structural optimization
problems, where the number of stress constraints is similar to the number of
design variables because they are both associated with each structural element
(see Ex. 6.12).
We can address this issue by aggregating the functions of interest as
described in Section 5.7 and then implementing the adjoint method to compute
the gradient. In Ex. 6.12, we would aggregate the stresses in one or more groups
to reduce the number of required adjoint solutions.
We can use these techniques to aggregate any outputs, but in principle,
these outputs should have some relation to each other. For example, they could
be the stresses in a structure (see Ex. 6.12).‡‡
‡‡ Lambe et al.128 provide recommendations on constraint aggregation for structural optimization.
128. Lambe et al., An evaluation of constraint aggregation strategies for wing box mass minimization, 2017.

6.7.4 Adjoint Method with AD Partial Derivatives
Implementing the implicit analytic methods for models involving long,
complicated code requires significant development effort. In this section,
we focus on implementing the adjoint method because it is more widely
used, as explained in Section 6.7.3. We assume that 𝑛 𝑓 < 𝑛 𝑥 , so that the
adjoint method is advantageous.
To ease the implementation of adjoint methods, we recommend a
hybrid adjoint approach where the reverse mode of AD computes the

partial derivatives in the adjoint equations (Eq. 6.46) and total derivative equation (Eq. 6.47).§§

§§ Kenway et al.129 provide more details on this approach and its applications.
129. Kenway et al., Effective Adjoint Approaches for Computational Fluid Dynamics, 2019.

The partial terms ∂f/∂x form an (n_f × n_x) matrix, and ∂f/∂u is an (n_f × n_u) matrix. These partial derivatives can be computed by identifying the section of the code that computes f for a given x and u and running the AD tool for that section. This produces code that takes f̄ as an input and outputs x̄ and ū, as shown in Fig. 6.31. Recall that we must first run the entire original code that computes u and f. Then we can run the AD code with the desired seed. Suppose we want the derivative of the jth component of f. We would set f̄_j = 1 and the other elements to zero. After running the AD code, we obtain x̄ and ū, which correspond to the rows of the respective matrix of partial terms, that is,

\bar{x} = \frac{\partial f_j}{\partial x} , \quad \bar{u} = \frac{\partial f_j}{\partial u} .     (6.49)

[Fig. 6.31: Applying reverse AD to the code that computes f produces code that computes the partial derivatives of f with respect to x and u.]

Thus, with each run of the AD code, we obtain the derivatives of one function with respect to all design variables and all state variables. One run is required for each element of f. The reverse mode is advantageous if n_f < n_x.
The Jacobian 𝜕𝑟/𝜕𝑢 can also be computed using AD. Because 𝜕𝑟/𝜕𝑢
is typically sparse, the techniques covered in Section 6.8 significantly
increase the efficiency of computing this matrix. This is a square matrix,
so neither AD mode has an advantage over the other if we explicitly
compute and store the whole matrix.
However, reverse-mode AD is advantageous when using an iterative
method to solve the adjoint linear system (Eq. 6.46). When using an
iterative method, we do not form 𝜕𝑟/𝜕𝑢. Instead, we require products
of the transpose of this matrix with some vector v,¶¶

\frac{\partial r}{\partial u}^\top v .     (6.50)

¶¶ See Appendix B.4 for more details on iterative solvers.

The elements of v act as weights on the residuals and can be interpreted as a projection onto the direction of v. Suppose we have the reverse AD code for the residual computation, as shown in Fig. 6.32. This code requires a reverse seed r̄, which determines the weights we want on each residual. Typically, a seed would have only one nonzero entry to find partial derivatives (e.g., setting r̄ = [1, 0, . . . , 0] would yield the first row of the Jacobian, ū ≡ ∂r_1/∂u). However, to get the product in Eq. 6.50, we require the seed to be weighted as r̄ = v. Then, we can compute the product by running the reverse AD code once to obtain ū ≡ [∂r/∂u]^T v.

[Fig. 6.32: Applying reverse AD to the code that computes r produces code that computes the partial derivatives of r with respect to x and u.]

The final term needed to compute total derivatives with the adjoint
method is the last term in Eq. 6.47, which can be written as
\psi^\top \frac{\partial r}{\partial x} = \left(\frac{\partial r}{\partial x}^\top \psi\right)^\top .     (6.51)

This is yet another transpose vector product that can be obtained using the same reverse AD code for the residuals, except that now the residual seed is r̄ = ψ, and the product we want is given by x̄.
In sum, it is advantageous to use reverse-mode AD to compute
the partial derivative terms for the adjoint equations, especially if the
adjoint equations are solved using an iterative approach that requires
only matrix-vector products. Similar techniques and arguments apply
for the direct method, except that in that case, forward-mode AD is
advantageous for computing the partial derivatives.

Tip 6.8 Verifying the implementation of derivative computations

Always compare your derivative computation against a different implemen-


tation. You can compare analytic derivatives with finite-difference derivatives,
but that is only a partial verification because finite differences are not accurate
enough. Comparing against the complex-step method or AD is preferable. Still,
finite differences are recommended as an additional check. If you can only use
finite differences, compare two different finite difference approximations.
You should use unit tests to verify each partial derivative term as you are
developing the code (see Tip 3.4) instead of just hoping it all works together
at the end (it usually does not!). One necessary but not sufficient test for the
verification of analytic methods is the dot-product test. For analytic methods,
the dot-product test can be derived from Eq. 6.42. For a chosen variable 𝑥 𝑖 and
function 𝑓 𝑗 , we have the following equality:

\psi_j^\top \frac{\partial r}{\partial x_i} = \frac{\partial f_j}{\partial u} \phi_i .     (6.52)

Each side of this equation yields a scalar that should match to working precision. The dot-product test verifies that your partial derivatives and the solutions for the direct and adjoint linear systems are consistent. For AD, the dot-product test for a code with inputs x and outputs f is as follows:

\dot{x}^\top \bar{x} = \dot{x}^\top \left(\frac{\partial f}{\partial x}\right)^\top \bar{f} = \left(\frac{\partial f}{\partial x}\,\dot{x}\right)^\top \bar{f} = \dot{f}^\top \bar{f} .     (6.53)
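As an illustration of Eq. 6.52, the following sketch (with the same assumed partial-derivative matrices as in the earlier direct and adjoint sketches) performs the dot-product test for one chosen input x_i and output f_j:

    import numpy as np

    def dot_product_test(drdx, drdu, dfdu, i, j):
        # Direct solution associated with x_i and adjoint solution associated with f_j
        phi_i = np.linalg.solve(drdu, drdx[:, i])
        psi_j = np.linalg.solve(drdu.T, dfdu[j, :])
        lhs = psi_j @ drdx[:, i]     # psi_j^T (dr/dx_i)
        rhs = dfdu[j, :] @ phi_i     # (df_j/du) phi_i
        return lhs, rhs              # should match to working precision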

6.8 Sparse Jacobians and Graph Coloring

In this chapter, we have discussed various ways to compute a Jacobian


of a model. If the Jacobian has many zero elements, it is said to be sparse.

In many cases, we can take advantage of that sparsity to significantly


reduce the computational time required to construct the Jacobian.
When applying a forward approach (forward-mode AD, finite
differencing, or complex step), the cost of computing the Jacobian scales
with 𝑛 𝑥 . Each forward pass re-evaluates the model to compute one
column of the Jacobian. For example, when using finite differencing,
𝑛 𝑥 evaluations would be required. To compute the 𝑗th column of the
Jacobian, the input vector would be

[𝑥1 , 𝑥 2 , . . . , 𝑥 𝑗 + ℎ, . . . , 𝑥 𝑛 𝑥 ] . (6.54)

We can significantly reduce the cost of computing the Jacobian


depending on its sparsity pattern. As a simple example, consider a
square diagonal Jacobian:

\frac{df}{dx} \equiv \begin{bmatrix} J_{11} & 0 & 0 & 0 & 0 \\ 0 & J_{22} & 0 & 0 & 0 \\ 0 & 0 & J_{33} & 0 & 0 \\ 0 & 0 & 0 & J_{44} & 0 \\ 0 & 0 & 0 & 0 & J_{55} \end{bmatrix} .     (6.55)

For this scenario, the Jacobian can be constructed with one evaluation
rather than 𝑛 𝑥 evaluations. This is because a given output 𝑓𝑖 depends
on only one input 𝑥 𝑖 . We could think of the outputs as 𝑛 𝑥 independent
functions. Thus, for finite differencing, rather than requiring 𝑛 𝑥 input
vectors with 𝑛 𝑥 function evaluations, we can use one input vector, as
follows:
[𝑥1 + ℎ, 𝑥 2 + ℎ, . . . , 𝑥5 + ℎ] , (6.56)
allowing us to compute all the nonzero entries in one pass.∗

∗ Curtis et al.130 were the first to show that the number of function evaluations could be reduced for sparse Jacobians.
130. Curtis et al., On the estimation of sparse Jacobian matrices, 1974.

Although the diagonal case is easy to understand, it is a special situation. To generalize this concept, let us consider the following (5 × 6) matrix as an example:

\begin{bmatrix} J_{11} & 0 & 0 & J_{14} & 0 & J_{16} \\ 0 & 0 & J_{23} & J_{24} & 0 & 0 \\ J_{31} & J_{32} & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & J_{45} & 0 \\ 0 & 0 & J_{53} & 0 & J_{55} & J_{56} \end{bmatrix} .     (6.57)

A subset of columns that does not have more than one nonzero in any given row is said to be structurally orthogonal. In this example,
the following sets of columns are structurally orthogonal: (1, 3), (1,
5), (2, 3), (2, 4, 5), (2, 6), and (4, 5). Structurally orthogonal columns
can be combined, forming a smaller Jacobian that reduces the number

of forward passes required. This reduced Jacobian is referred to as


compressed. There is more than one way to compress this Jacobian, but
in this case, the minimum number of compressed columns—referred to
as colors—is three. In the following compressed Jacobian, we combine
columns 1 and 3 (blue); columns 2, 4, and 5 (red); and leave column 6
on its own (black):
\begin{bmatrix} J_{11} & 0 & 0 & J_{14} & 0 & J_{16} \\ 0 & 0 & J_{23} & J_{24} & 0 & 0 \\ J_{31} & J_{32} & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & J_{45} & 0 \\ 0 & 0 & J_{53} & 0 & J_{55} & J_{56} \end{bmatrix} \;\Rightarrow\; \begin{bmatrix} J_{11} & J_{14} & J_{16} \\ J_{23} & J_{24} & 0 \\ J_{31} & J_{32} & 0 \\ 0 & J_{45} & 0 \\ J_{53} & J_{55} & J_{56} \end{bmatrix} .     (6.58)
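To show how column compression translates into code, here is a Python sketch of a colored forward-difference Jacobian; the function name, the sparsity-pattern argument, and the groups are assumptions for illustration. For the diagonal Jacobian of Eq. 6.55, a single group containing all columns suffices; for Eq. 6.58, the groups would be columns {1, 3}, {2, 4, 5}, and {6} (0-based indices in the code).

    import numpy as np

    def colored_jacobian_fd(fun, x, sparsity, groups, h=1e-6):
        # sparsity : boolean (nf x nx) array marking the known nonzero pattern
        # groups   : list of lists of structurally orthogonal column indices;
        #            one extra function evaluation per group rather than per column
        f0 = fun(x)
        J = np.zeros(sparsity.shape)
        for cols in groups:
            xp = x.copy()
            xp[cols] += h                 # perturb all columns in the group at once
            df = (fun(xp) - f0) / h
            for j in cols:
                rows = sparsity[:, j]     # each perturbed row maps to exactly one column in the group
                J[rows, j] = df[rows]
        return J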

For finite differencing, complex step, and forward-mode AD, only


compression among columns is possible. Reverse mode AD allows
compression among the rows. The concept is the same, but instead,
we look for structurally orthogonal rows. One such compression is as
follows:
\begin{bmatrix} J_{11} & 0 & 0 & J_{14} & 0 & J_{16} \\ 0 & 0 & J_{23} & J_{24} & 0 & 0 \\ J_{31} & J_{32} & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & J_{45} & 0 \\ 0 & 0 & J_{53} & 0 & J_{55} & J_{56} \end{bmatrix} \;\Rightarrow\; \begin{bmatrix} J_{11} & 0 & 0 & J_{14} & J_{45} & J_{16} \\ 0 & 0 & J_{23} & J_{24} & 0 & 0 \\ J_{31} & J_{32} & J_{53} & 0 & J_{55} & J_{56} \end{bmatrix} .     (6.59)
AD can also be used even more flexibly when both modes are used:
forward passes to evaluate groups of structurally orthogonal columns
and reverse passes to evaluate groups of structurally orthogonal rows.
Rather than taking incremental steps in each direction as is done in
finite differencing, we set the AD seed vector with 1s in the directions
we wish to evaluate, similar to how the seed is set for directional
derivatives, as discussed in Section 6.6.
For these small Jacobians, it is straightforward to determine how to compress the matrix in the best possible way. For a large matrix, this is not so easy. One approach is to use graph coloring. This approach starts by building a graph where the vertices represent the row and column indices, and the edges represent nonzero entries in the Jacobian. Then, algorithms are applied to this graph that estimate the fewest number of "colors" (orthogonal columns) using heuristics. Graph coloring is a large field of research, where derivative computation is one of many applications.†

† Gebremedhin et al.131 provide a review of graph coloring in the context of computing derivatives. Gray et al.132 show how to use graph coloring to compute total coupled derivatives.
131. Gebremedhin et al., What color is your Jacobian? Graph coloring for computing derivatives, 2005.
132. Gray et al., OpenMDAO: An open-source framework for multidisciplinary design, analysis, and optimization, 2019.

Example 6.13 Speed up from sparse derivatives

In static aerodynamic analyses, the forces and moments produced at two


different flow conditions are independent. If there are many different flow
conditions of interest, the resulting Jacobian is sparse. Examples include
evaluating the power produced by a wind turbine at different wind speeds or
assessing an aircraft’s performance throughout a flight envelope. Many other
engineering analyses have a similar structure.
Consider a typical wind turbine blade optimization. The Jacobian of the
functions of interest is fully dense with respect to geometry changes. However,
the part of the Jacobian that contains the derivatives with respect to the various
flow conditions is diagonal, as illustrated on the left side of Fig. 6.33. Blank blocks
represent derivatives that are zero. We can compress the diagonal part of the
Jacobian as shown on the right side of Fig. 6.33.

[Fig. 6.33: Jacobian structure for the wind turbine problem, with columns grouped into geometry and inflow variables and rows corresponding to the outputs. The original Jacobian (left) can be replaced with a compressed one (right).]

To illustrate the potential benefits of using a sparse representation, we time


the Jacobian computation for various sizes of inflow conditions using forward
AD with and without graph coloring (Fig. 6.34). For more than 100 inflow
conditions, the difference in time required exceeds one order of magnitude
(note the log-log scale). Because Jacobians are needed at every iteration in the
optimization, this is a tremendous speedup, enabled by exploiting the sparsity
pattern.133

133. Ning, Using blade element momentum methods with gradient-based design optimization, 2021.
[Fig. 6.34: Jacobian computation time in seconds versus the number of inflow conditions, for forward AD with and without coloring (log–log scale).]

6.9 Unified Derivatives Equation

Now that we have introduced all the methods for computing deriva-
tives, we will see how they are connected. For example, we have
mentioned that the direct and adjoint methods are analogous to the
forward and reverse mode of AD, respectively, but we did not show
this mathematically. The unified derivatives equation (UDE) expresses
both methods.134 Also, the implicit analytic methods from Section 6.7 134. Martins and Hwang, Review and uni-
fication of methods for computing derivatives
assumed one set of implicit equations (𝑟 = 0) and one set of explicit of multidisciplinary computational models,
functions ( 𝑓 ). The UDE formulates the derivative computation for 2013.

systems with mixed sets of implicit and explicit equations.


We first derive the UDE from basic principles and give an intuitive
explanation of the derivative terms. Then, we show how we can use
the UDE to handle implicit and explicit equations. We also show how
the UDE can retrieve the direct and adjoint equations. Finally, we show
how the UDE is connected to AD.

6.9.1 UDE Derivation


Suppose we have a set of n residual equations with the same number of unknowns,

r_i(u_1, u_2, \ldots, u_n) = 0, \quad i = 1, \ldots, n ,     (6.60)

and that there is at least one solution u* such that r(u*) = 0. Such a solution can be visualized for n = 2, as shown in Fig. 6.35. These residuals are general: each one can depend on any subset of the variables u and can be truly implicit functions or explicit functions converted to the implicit form (see Section 3.3 and Ex. 3.3).

[Fig. 6.35: Solution of a system of two equations expressed by residuals.]

The total differentials for these residuals are

dr_i = \frac{\partial r_i}{\partial u_1} du_1 + \ldots + \frac{\partial r_i}{\partial u_n} du_n , \quad i = 1, \ldots, n .     (6.61)

These represent first-order changes in r due to perturbations in u. The differentials of u can be visualized as perturbations in the space of the variables. The differentials of r can be visualized as linear changes to the surface defined by r = 0, as illustrated in Fig. 6.36.

[Fig. 6.36: The differential dr can be visualized as a linearized (first-order) change of the contour value.]

We can write the differentials (Eq. 6.61) in matrix form as

\begin{bmatrix} \dfrac{\partial r_1}{\partial u_1} & \cdots & \dfrac{\partial r_1}{\partial u_n} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial r_n}{\partial u_1} & \cdots & \dfrac{\partial r_n}{\partial u_n} \end{bmatrix} \begin{bmatrix} du_1 \\ \vdots \\ du_n \end{bmatrix} = \begin{bmatrix} dr_1 \\ \vdots \\ dr_n \end{bmatrix} .     (6.62)

The partial derivatives in the matrix are derivatives of the expressions


for 𝑟 with respect to 𝑢 that can be obtained symbolically, and they
are in general functions of 𝑢. The vector of differentials d𝑢 represents
perturbations in 𝑢 that can be solved for a given vector of changes d𝑟.
Now suppose that we are at a solution 𝑢 ∗ , such that 𝑟(𝑢 ∗ ) = 0. All
the partial derivatives (𝜕𝑟/𝜕𝑢) can be evaluated at 𝑢 ∗ . When all entries
in d𝑟 are zero, then the solution of this linear system yields d𝑢 = 0.
This is because if there is no disruption in the residuals that are already
zero, the variables do not need to change either.
How is this linear system useful? With these differentials, we can choose different combinations of dr to obtain any total derivatives that we want. For example, we can get the total derivatives of u with respect to a single residual r_i by keeping dr_i while setting all the other differentials to zero (dr_{j≠i} = 0). The visual interpretation of this total derivative is shown in Fig. 6.37 for n = 2 and i = 1. Setting dr = [0, . . . , 0, dr_i, 0, . . . , 0] in Eq. 6.62 and moving dr_i to the denominator, we obtain the following linear system:∗

∗ As explained in Appendix A.2, we take the liberty of treating differentials algebraically and skip a more rigorous and lengthy proof.

[Fig. 6.37: The total derivatives du_1/dr_1 and du_2/dr_1 represent the first-order changes needed to satisfy a perturbation r_1 = dr_1 while keeping r_2 = 0.]

\begin{bmatrix} \dfrac{\partial r_1}{\partial u_1} & \cdots & \dfrac{\partial r_1}{\partial u_i} & \cdots & \dfrac{\partial r_1}{\partial u_n} \\ \vdots & & \vdots & & \vdots \\ \dfrac{\partial r_i}{\partial u_1} & \cdots & \dfrac{\partial r_i}{\partial u_i} & \cdots & \dfrac{\partial r_i}{\partial u_n} \\ \vdots & & \vdots & & \vdots \\ \dfrac{\partial r_n}{\partial u_1} & \cdots & \dfrac{\partial r_n}{\partial u_i} & \cdots & \dfrac{\partial r_n}{\partial u_n} \end{bmatrix} \begin{bmatrix} \dfrac{du_1}{dr_i} \\ \vdots \\ \dfrac{du_i}{dr_i} \\ \vdots \\ \dfrac{du_n}{dr_i} \end{bmatrix} = \begin{bmatrix} 0 \\ \vdots \\ 1 \\ \vdots \\ 0 \end{bmatrix} .     (6.63)
Doing the same for all 𝑖 = 1, . . . , 𝑛, we get the following 𝑛 linear
systems:

\begin{bmatrix} \dfrac{\partial r_1}{\partial u_1} & \cdots & \dfrac{\partial r_1}{\partial u_n} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial r_n}{\partial u_1} & \cdots & \dfrac{\partial r_n}{\partial u_n} \end{bmatrix} \begin{bmatrix} \dfrac{du_1}{dr_1} & \cdots & \dfrac{du_1}{dr_n} \\ \vdots & \ddots & \vdots \\ \dfrac{du_n}{dr_1} & \cdots & \dfrac{du_n}{dr_n} \end{bmatrix} = \begin{bmatrix} 1 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & 1 \end{bmatrix} .     (6.64)
Solving these linear systems yields the total derivatives of all the
elements of 𝑢 with respect to all the elements of 𝑟. We can write this
more compactly in matrix form as

\frac{\partial r}{\partial u} \frac{du}{dr} = I .     (6.65)

This is the forward form of the UDE.


The total derivatives d𝑢/d𝑟 might not seem like the derivatives
in which we are interested. Based on the implicit analytic methods
derived in Section 6.7.2, these look like derivatives of states with respect
to residuals, not the derivatives that we ultimately want to compute
(d 𝑓 /d𝑥). However, we will soon see that with the appropriate choice of
𝑟 and 𝑢, we can obtain a linear system that solves for the total derivatives
we want.
With Eq. 6.65, we can solve one column at a time. Similar to AD, we can also solve for the rows instead by transposing the systems as†

\frac{\partial r}{\partial u}^\top \frac{du}{dr}^\top = I ,     (6.66)

† Normally, for two matrices A and B, (AB)^T = B^T A^T, but in this case, AB = I ⇒ B = A^{-1} ⇒ B^T = A^{-T} ⇒ A^T B^T = I.

which is the reverse form of the UDE. Now, each column j yields du_j/dr—the total derivative of one variable with respect to all the residuals. This total derivative is interpreted visually in Fig. 6.38. The usefulness of the total derivative Jacobian du/dr might still not be apparent. In the next section, we explain how to set up the UDE to include df/dx in the UDE unknowns (du/dr).

[Fig. 6.38: The total derivatives du_1/dr_1 and du_1/dr_2 represent the first-order change in u_1 resulting from perturbations dr_1 and dr_2.]

Example 6.14 Computing and interpreting du/dr

Suppose we want to find the rectangle that is inscribed in the ellipse given by

r_1(u_1, u_2) = \frac{u_1^2}{4} + u_2^2 - 1 = 0 .

A change in this residual represents a change in the size of the ellipse without changing its proportions. Of all the possible rectangles that can be inscribed in the ellipse, we want the rectangle with an area that is half of that of this ellipse, such that

r_2(u_1, u_2) = 4 u_1 u_2 - \pi = 0 .

A change in this residual represents a change in the area of the rectangle. There are two solutions, as shown in the left pane of Fig. 6.39. These solutions can be found using Newton's method, which converges to one solution or the other, depending on the starting guess. We will pick the one on the right, which is [u_1, u_2] = [1.79944, 0.43647]. The solution represents the coordinates of the rectangle corner that touches the ellipse.

Taking the partial derivatives, we can write the forward UDE (Eq. 6.65) for this problem as follows:

\begin{bmatrix} u_1/2 & 2 u_2 \\ 4 u_2 & 4 u_1 \end{bmatrix} \begin{bmatrix} \dfrac{du_1}{dr_1} & \dfrac{du_1}{dr_2} \\ \dfrac{du_2}{dr_1} & \dfrac{du_2}{dr_2} \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} .     (6.67)

[Fig. 6.39: Rectangle inscribed in ellipse problem. Left: residuals and solution. Right: first-order perturbation view showing the interpretation of du/dr_1.]

Solving this linear system for each of the two right-hand sides, we get

\begin{bmatrix} \dfrac{du_1}{dr_1} & \dfrac{du_1}{dr_2} \\ \dfrac{du_2}{dr_1} & \dfrac{du_2}{dr_2} \end{bmatrix} = \begin{bmatrix} 1.45353 & -0.17628 \\ -0.35257 & 0.18169 \end{bmatrix} .     (6.68)
These derivatives reflect the change in the coordinates of the point where the
rectangle touches the ellipse as a result of a perturbation in the size of the
ellipse, d𝑟1 , and the area of the rectangle d𝑟2 . The right side of Fig. 6.39 shows
the visual interpretation of d𝑢/d𝑟1 as an example.
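The numbers in Eq. 6.68 can be reproduced with a few lines of Python (a verification sketch, not part of the original example):

    import numpy as np

    u1, u2 = 1.79944, 0.43647            # solution of r1 = r2 = 0 from Ex. 6.14
    drdu = np.array([[u1 / 2, 2 * u2],   # partial derivatives in Eq. 6.67
                     [4 * u2, 4 * u1]])
    dudr = np.linalg.solve(drdu, np.eye(2))   # forward UDE: (dr/du)(du/dr) = I
    print(dudr)   # approximately [[1.45353, -0.17628], [-0.35257, 0.18169]]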

6.9.2 UDE for Mixed Implicit and Explicit Components


In the previous section, the UDE was derived based on residual equa-
tions. The equations were written in implicit form, but there was no
assumption on whether the equations were implicit or explicit. Because
we can write an explicit equation in implicit form (see Section 3.3 and
Ex. 3.3), the UDE allows a mix of implicit and explicit set of equations,
which we now call components.
To derive the implicit analytic equations in Section 6.7, we considered
two components: an implicit component that determines 𝑢 by solving
𝑟(𝑢; 𝑥) = 0 and an explicit component that computes the functions of
interest, 𝑓 (𝑥, 𝑢).
We can recover the implicit analytic differentiation equations (di-
rect and adjoint) from the UDE by defining a set of variables that
concatenates the state variables with inputs and outputs as follows:
\hat{u} \equiv \begin{bmatrix} x \\ u \\ f \end{bmatrix} .     (6.69)


This is a vector with 𝑛 𝑥 + 𝑛𝑢 + 𝑛 𝑓 variables. For the residuals, we
need a vector with the same size. We can obtain this by realizing that
the residuals associated with the inputs and outputs are just explicit
functions that can be written in implicit form. Then, we have
\hat{r} \equiv \begin{bmatrix} x - \check{x} \\ r - \check{r}(x, u) \\ f - \check{f}(x, u) \end{bmatrix} = 0 .     (6.70)
Here, we distinguish 𝑥 (the actual variable in the UDE system) from
𝑥ˇ (a given input) and 𝑓 (the variable) from 𝑓ˇ (an explicit function of
𝑥 and 𝑢). Similarly, 𝑟 is the vector of variables associated with the
residual and 𝑟ˇ is the residual function itself. Taking the differential of
the residuals, and considering only one of them to be nonzero at a time,
we obtain,
d\hat{r} \equiv \begin{bmatrix} dx \\ dr \\ df \end{bmatrix} .     (6.71)
Using these variable and residual definitions in Eqs. 6.65 and 6.66 yields
the full UDE shown in Fig. 6.40, where the block we ultimately want to
compute is d 𝑓 /d𝑥.

[Fig. 6.40: The direct and adjoint methods can be recovered from the UDE, which in block form reads]

\begin{bmatrix} I & 0 & 0 \\ -\dfrac{\partial \check{r}}{\partial x} & -\dfrac{\partial \check{r}}{\partial u} & 0 \\ -\dfrac{\partial \check{f}}{\partial x} & -\dfrac{\partial \check{f}}{\partial u} & I \end{bmatrix} \begin{bmatrix} I & 0 & 0 \\ \dfrac{du}{dx} & \dfrac{du}{dr} & 0 \\ \dfrac{df}{dx} & \dfrac{df}{dr} & I \end{bmatrix} = I = \begin{bmatrix} I & -\dfrac{\partial \check{r}}{\partial x}^\top & -\dfrac{\partial \check{f}}{\partial x}^\top \\ 0 & -\dfrac{\partial \check{r}}{\partial u}^\top & -\dfrac{\partial \check{f}}{\partial u}^\top \\ 0 & 0 & I \end{bmatrix} \begin{bmatrix} I & \dfrac{du}{dx}^\top & \dfrac{df}{dx}^\top \\ 0 & \dfrac{du}{dr}^\top & \dfrac{df}{dr}^\top \\ 0 & 0 & I \end{bmatrix}

To compute df/dx using the forward UDE (left-hand side of the equation in Fig. 6.40), we can ignore all but three blocks in the total
derivatives matrix: 𝐼, d𝑢/d𝑥, and d 𝑓 /d𝑥. By multiplying these blocks
and using the definition 𝜙 ≡ − d𝑢/d𝑥, we recover the direct linear
system (Eq. 6.43) and the total derivative equation (Eq. 6.44).
To compute d 𝑓 /d𝑥 using the reverse UDE (right-hand side of
the equation in Fig. 6.40), we can ignore all but three blocks in the
total derivatives matrix: 𝐼, d 𝑓 /d𝑟, and d 𝑓 /d𝑥. By multiplying these
blocks and defining 𝜓 ≡ − d 𝑓 /d𝑟, we recover the adjoint linear system
(Eq. 6.46) and the corresponding total derivative equation (Eq. 6.47). The
definition of 𝜓 here is significant because, as mentioned in Section 6.7.2,
the adjoint vector is the total derivative of the objective function with
respect to the governing equation residuals.

By defining one implicit component (associated with 𝑢) and two


explicit components (associated with 𝑥 and 𝑓 ), we have retrieved
the direct and adjoint methods from the UDE. In general, we can
define an arbitrary number of components, so the UDE provides a
mathematical framework for computing the derivatives of coupled
systems. Furthermore, each component can be implicit or explicit, so
the UDE can handle an arbitrary mix of components. All we need to do
is to include the desired states in the UDE augmented variables vector
(Eq. 6.69) and the corresponding residuals in the UDE residuals vector
(Eq. 6.70). We address coupled systems in Section 13.3.3 and use the
UDE in Section 13.2.6, where we extend it to coupled systems with a
hierarchy of components.

Example 6.15 Computing arbitrary derivatives with the UDE

Say we want to compute the total derivatives of the perimeter of the rectangle
from Ex. 6.14 with respect to the axes of the ellipse. The equation for the ellipse
can be rewritten as
r_3(u_1, u_2) = \frac{u_1^2}{x_1^2} + \frac{u_2^2}{x_2^2} - 1 = 0 ,
where 𝑥1 and 𝑥2 are the semimajor and semiminor axes of the ellipse, respec-
tively. The baseline values are [𝑥1 , 𝑥2 ] = [2, 1]. The residual for the rectangle
area is
r_4(u_1, u_2) = 4 u_1 u_2 - \frac{\pi}{2} x_1 x_2 = 0 .
To add the independent variables 𝑥1 and 𝑥2 , we write them as residuals in
implicit form as

𝑟1 (𝑥1 ) = 𝑥1 − 2 = 0, 𝑟2 (𝑥2 ) = 𝑥2 − 1 = 0 .

The perimeter can be written in implicit form as

𝑟5 (𝑢1 , 𝑢2 ) = 𝑓 − 4(𝑢1 + 𝑢2 ) = 0 .

Now we have a system of five equations and five variables, with the dependencies shown in Fig. 6.41. The first two variables in x are given, and we can compute u and f using a solver as before.

[Fig. 6.41: Dependencies of the residuals: x feeds r_3(u, x) = 0 and r_4(u, x) = 0, which a solver drives to zero to obtain u, and u feeds r_5(u) = 0, which determines f.]

Taking all the partial derivatives, we get the following forward system:

\begin{bmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ -\dfrac{2 u_1^2}{x_1^3} & -\dfrac{2 u_2^2}{x_2^3} & \dfrac{2 u_1}{x_1^2} & \dfrac{2 u_2}{x_2^2} & 0 \\ -\dfrac{\pi}{2} x_2 & -\dfrac{\pi}{2} x_1 & 4 u_2 & 4 u_1 & 0 \\ 0 & 0 & -4 & -4 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ \dfrac{du_1}{dx_1} & \dfrac{du_1}{dx_2} & \dfrac{du_1}{dr_3} & \dfrac{du_1}{dr_4} & 0 \\ \dfrac{du_2}{dx_1} & \dfrac{du_2}{dx_2} & \dfrac{du_2}{dr_3} & \dfrac{du_2}{dr_4} & 0 \\ \dfrac{df}{dx_1} & \dfrac{df}{dx_2} & \dfrac{df}{dr_3} & \dfrac{df}{dr_4} & 1 \end{bmatrix} = I .

We only want the two d 𝑓 /d𝑥 terms in this equation. We can either solve this
linear system twice to compute the first two columns, or we can compute both
terms with a single solution of the reverse (transposed) system. Transposing
the system, substituting the numerical values for 𝑥 and 𝑢, and removing the
total derivative terms that we do not need, we get the following system:

\begin{bmatrix} 1 & 0 & -0.80950 & -1.57080 & 0 \\ 0 & 1 & -0.38101 & -3.14159 & 0 \\ 0 & 0 & 0.89972 & 1.74588 & -4 \\ 0 & 0 & 0.87294 & 7.19776 & -4 \\ 0 & 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} \dfrac{df}{dx_1} \\ \dfrac{df}{dx_2} \\ \dfrac{df}{dr_3} \\ \dfrac{df}{dr_4} \\ 1 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 1 \end{bmatrix} .
Solving this linear system, we obtain

\begin{bmatrix} \dfrac{df}{dx_1} \\ \dfrac{df}{dx_2} \\ \dfrac{df}{dr_3} \\ \dfrac{df}{dr_4} \end{bmatrix} = \begin{bmatrix} 3.59888 \\ 1.74588 \\ 4.40385 \\ 0.02163 \end{bmatrix} .
The total derivatives of interest are shown in Fig. 6.42.

We could have obtained the same solution using the adjoint equations from Section 6.7.2. The only difference is the nomenclature because the adjoint vector in this case is ψ = −[df/dr_3, df/dr_4]. We can interpret these terms as the change of f with respect to changes in the ellipse size and rectangle area, respectively.

[Fig. 6.42: Contours of f as a function of x and the total derivatives at x = [2, 1].]
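A short Python sketch (again only a verification, not part of the original example; it assembles the same matrix as above) reproduces these numbers by solving the transposed system with a single right-hand side:

    import numpy as np

    x1, x2 = 2.0, 1.0
    u1, u2 = 1.79944, 0.43647
    # Partial-derivative matrix of the forward system for u_hat = [x1, x2, u1, u2, f]
    A = np.array([
        [1.0, 0.0, 0.0, 0.0, 0.0],
        [0.0, 1.0, 0.0, 0.0, 0.0],
        [-2 * u1**2 / x1**3, -2 * u2**2 / x2**3, 2 * u1 / x1**2, 2 * u2 / x2**2, 0.0],
        [-np.pi / 2 * x2, -np.pi / 2 * x1, 4 * u2, 4 * u1, 0.0],
        [0.0, 0.0, -4.0, -4.0, 1.0],
    ])
    # Reverse (transposed) UDE with the seed selecting f
    d = np.linalg.solve(A.T, np.array([0.0, 0.0, 0.0, 0.0, 1.0]))
    print(d)   # approximately [3.59888, 1.74588, 4.40385, 0.02163, 1.0]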
6.9.3 Recovering AD
Now we will see how we can recover AD from the UDE. First, we Fig. 6.42 Contours of 𝑓 as a function
of 𝑥 and the total derivatives at 𝑥 =
define the UDE variables associated with each operation or line of code [2, 1].
(assuming all loops have been unrolled), such that 𝑢 ≡ 𝑣 and

𝑣 𝑖 = 𝑣ˇ 𝑖 (𝑣1 , . . . , 𝑣 𝑖−1 ), 𝑖 = 1, . . . , 𝑛 . (6.72)

Recall from Section 6.6.1 that each variable is an explicit function of the
previous ones.

To define the appropriate residuals, we use the same technique from


before to convert an explicit function into implicit form by moving all
the terms in the left-hand side to obtain

𝑟 𝑖 = 𝑣 𝑖 − 𝑣ˇ 𝑖 (𝑣 1 , . . . , 𝑣 𝑖−1 ) . (6.73)

The distinction between 𝑣 and 𝑣ˇ is that 𝑣 represents variables that are


considered independent in the UDE, whereas 𝑣ˇ represents the explicit
expressions. Of course, the values for these become equal once the
system is solved. Similar to the differentials in Eq. 6.71, d𝑟 ≡ d𝑣
Taking the partial derivatives of the residuals (Eq. 6.73) with respect
to 𝑣 (Eq. 6.72), and replacing the total derivatives in the forward form
of the UDE (Eq. 6.65) with the new symbols yields

\begin{bmatrix} 1 & 0 & \cdots & 0 \\ -\dfrac{\partial \check{v}_2}{\partial v_1} & 1 & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ -\dfrac{\partial \check{v}_n}{\partial v_1} & \cdots & -\dfrac{\partial \check{v}_n}{\partial v_{n-1}} & 1 \end{bmatrix} \begin{bmatrix} \dfrac{dv_1}{dv_1} & 0 & \cdots & 0 \\ \dfrac{dv_2}{dv_1} & \dfrac{dv_2}{dv_2} & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ \dfrac{dv_n}{dv_1} & \cdots & \dfrac{dv_n}{dv_{n-1}} & \dfrac{dv_n}{dv_n} \end{bmatrix} = I .     (6.74)
This equation is the matrix form of the AD forward chain rule (Eq. 6.21),
where each column of the total derivative matrix corresponds to the
tangent vector (𝑣¤ ) for the chosen input variable. As observed in Fig. 6.16,
the partial derivatives form a lower triangular matrix. The Jacobian
we ultimately want to compute (d 𝑓 /d𝑥) is composed of a subset of
derivatives in the bottom-left corner near the d𝑣 𝑛 /d𝑣1 term. To compute
these derivatives, we need to perform forward substitution and compute
one column of the total derivative matrix at a time, where each column
is associated with the inputs of interest.
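The connection can be made concrete with a toy three-variable code (an illustration constructed for this sketch, not taken from the text): v1 = x, v2 = v1², v3 = sin(v2) + v1, whose total derivative dv3/dv1 = 2x cos(x²) + 1 is recovered by solving the unit lower-triangular system of Eq. 6.74.

    import numpy as np

    x = 1.3
    v1 = x
    v2 = v1**2
    v3 = np.sin(v2) + v1

    # Matrix of Eq. 6.74: unit diagonal with the negated partials -dv_i/dv_j below it
    L = np.array([[1.0,      0.0,          0.0],
                  [-2 * v1,  1.0,          0.0],
                  [-1.0,     -np.cos(v2),  1.0]])

    D = np.linalg.solve(L, np.eye(3))   # D[i, j] = dv_{i+1} / dv_{j+1}
    print(D[2, 0], 2 * x * np.cos(x**2) + 1.0)   # the two values agree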
Similarly, the reverse form of the UDE (Eq. 6.66) yields the transpose
of Eq. 6.74,

\begin{bmatrix} 1 & -\dfrac{\partial \check{v}_2}{\partial v_1} & \cdots & -\dfrac{\partial \check{v}_n}{\partial v_1} \\ 0 & 1 & \ddots & \vdots \\ \vdots & \ddots & \ddots & -\dfrac{\partial \check{v}_n}{\partial v_{n-1}} \\ 0 & \cdots & 0 & 1 \end{bmatrix} \begin{bmatrix} \dfrac{dv_1}{dv_1} & \dfrac{dv_2}{dv_1} & \cdots & \dfrac{dv_n}{dv_1} \\ 0 & \dfrac{dv_2}{dv_2} & \ddots & \vdots \\ \vdots & \ddots & \ddots & \dfrac{dv_n}{dv_{n-1}} \\ 0 & \cdots & 0 & \dfrac{dv_n}{dv_n} \end{bmatrix} = I .     (6.75)

This is equivalent to the AD reverse chain rule (Eq. 6.26), where each
column of the total derivative matrix corresponds to the gradient vector

(𝑣¯ ) for the chosen output variable. The partial derivatives now form
an upper triangular matrix, as previously shown in Fig. 6.21. The
derivatives of interest are now near the top-right corner of the total
derivative matrix near the d𝑣 𝑛 /d𝑣1 term. To compute these derivatives,
we need to perform back substitutions, computing one column of the
matrix at a time.

Tip 6.9 Scaling affects the derivatives

When scaling a problem (Tips 4.4 and 5.3), you should be aware that the
scale changes also affect the derivatives. You can apply the derivative methods
of this chapter to the scaled function directly. However, scaling is often done
outside the model because the scaling is specific to the optimization problem. In
this case, you may want to use the original functions and derivatives and make
the necessary modifications in an outer function that provides the objectives
and constraints.
Using the nomenclature introduced in Tip 4.4, we represent the scaled
design variables given to the optimizer as x̄. Then, the unscaled variables are x = s_x x̄. Thus, the required scaled derivatives are

\frac{d\bar{f}}{d\bar{x}} = \frac{df}{dx} \frac{s_x}{s_f} .     (6.76)
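For example, a thin wrapper along these lines (a sketch with assumed function signatures) lets the optimizer work with scaled quantities while the model and its derivatives remain unscaled:

    def scaled_objective(f_and_grad, s_x, s_f):
        # f_and_grad(x) returns the unscaled objective and its gradient df/dx
        def wrapped(x_bar):
            f, dfdx = f_and_grad(s_x * x_bar)     # unscale the design variables
            return f / s_f, dfdx * s_x / s_f      # scaled objective and Eq. 6.76 gradient
        return wrapped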

Tip 6.10 Provide your own derivatives and use finite differences only
as a last resort
Because of the step-size dilemma, finite differences are often the cause of
failed optimizations. To put it more dramatically, finite differences are the root
of all evil. Most gradient-based optimization software uses finite differences
internally as a default if you do not provide your own gradients. Although
some software packages try to find reasonable finite-difference steps, it is easy
to get inaccurate derivatives, which then causes optimization difficulties or
total failure. This is the top reason why beginners give up on gradient-based
optimization!
Instead, you should provide gradients computed using one of the other
methods described in this chapter. In contrast with finite differences, the
derivatives computed by the other methods are usually as accurate as the
function computation. You should also avoid using finite-difference derivatives
as a reference for a definitive verification of the other methods.
If you have to use finite differences as a last resort, make sure to do a step-
size study (see Tip 6.2). You should then provide your own finite-difference
derivatives to the optimization or make sure that the optimizer finite-difference
estimates are acceptable.

6.10 Summary

Derivatives are useful in many applications beyond optimization. This


chapter introduced the methods available to compute the first deriva-
tives of the outputs of a model with respect to its inputs. In optimization,
the outputs are usually the objective function and the constraint func-
tions, and the inputs are the design variables. The typical characteristics
of the available methods are compared in Table 6.4.

Accuracy Scalability Ease of Implicit


implementation functions
Symbolic • Hard
Finite difference Easy • Table 6.4 Characteristics of the vari-
Complex step • Intermediate • ous derivative computation methods.
Some of these characteristics are prob-
AD • • Intermediate • lem or implementation dependent, so
Implicit analytic • • Hard • these are not universal.

Symbolic differentiation is accurate but only scalable for simple,


explicit functions of low dimensionality. Therefore, it is necessary to
compute derivatives numerically. Although it is generally intractable
or inefficient for many engineering models, symbolic differentiation is
used by AD at each line of code and in implicit analytic methods to
derive expressions for computing the required partial derivatives.
Finite-difference approximations are popular because they are easy
to implement and can be applied to any model, including black-box
models. The downsides are that these approximations are not accurate,
and the cost scales linearly with the number of variables. Many of
the issues practitioners experience with gradient-based optimization
can be traced to errors in the gradients when algorithms automatically
compute these gradients using finite differences.
The complex-step method is accurate and relatively easy to imple-
ment. It usually requires some changes to the analysis source code, but
this process can be scripted. The main advantage of the complex-step
method is that it produces analytically accurate derivatives. However,
like the finite-difference method, the cost scales linearly with the num-
ber of inputs, and each simulation requires more effort because of the
complex arithmetic.
AD produces analytically accurate derivatives, and many implemen-
tations can be fully automated. The implementation requires access
to the source code but is still relatively straightforward to apply. The
computational cost of forward-mode AD scales with the number of
inputs, and the reverse mode scales with the number of outputs. The
6 Computing Derivatives 275

scaling factor for the forward mode is generally lower than that for
finite differences. The cost of reverse-mode AD is independent of the
number of design variables.
Implicit analytic methods (direct and adjoint) are accurate and scalable but require significant implementation effort. These methods are exact (depending on how the partial derivatives are obtained), and like AD, they provide both forward and reverse modes with the same scaling advantages. Gradient-based optimization using the adjoint method is a powerful combination that scales well with the number of variables, as shown in Fig. 6.43. The disadvantage is that because implicit methods are intrusive, they require considerable development effort.

[Fig. 6.43: Efficient gradient computation with an analytic method improves the scalability of gradient-based algorithms compared to finite differencing (number of function evaluations versus number of design variables). In this case, we show the results for the n-dimensional Rosenbrock, where the cost of computing the derivatives analytically is independent of n.]

A hybrid approach where the partial derivatives for the implicit analytic equations are computed with AD is generally recommended. This hybrid approach is computationally more efficient than AD while reducing the implementation effort of implicit analytic methods and ensuring accuracy.

implicit components. This will be used in Section 13.2.6 to develop a


mathematical framework for solving coupled systems and computing
the corresponding derivatives.

Problems

6.1 Answer true or false and justify your answer.

a. A first-order derivative is only one of many types of sensitiv-


ity analysis.
b. Each column of the Jacobian matrix represents the gradient
of one of the functions of interest with respect to all the
variables.
c. You can only compute derivatives of models for which you
have the source code or, at the very least, understand how
the model computes the functions of interest.
d. Symbolic differentiation is intractable for all but the simplest
models because of expression swell.
e. Finite-difference approximations can compute first deriva-
tives with a precision matching that of the function being
differentiated.
f. The complex-step method can only be used to compute
derivatives of real functions.
g. AD via source code transformation uses a code parser to
differentiate each line of code symbolically.
h. The forward mode of AD computes the derivatives of all
outputs with respect to one input, whereas the reverse mode
computes the derivative of one output with respect to all
inputs.
i. The adjoint method requires the same partial derivatives as
the direct method.
j. Of the two implicit analytic methods, the direct method is
more widely used than the adjoint method because most
problems have more design variables than functions of
interest.
k. Graph coloring makes Jacobians sparse by selectively replac-
ing small-valued entries with zeros to trade accuracy for
speed.
l. The unified derivatives equation can represent implicit ana-
lytic approaches but not AD.

6.2 Reproduce the comparison between the complex-step and finite-


difference methods from Ex. 6.4. Do you get any complex-step
derivatives with zero error compared with the analytic reference?

What does that mean, and how should you show those points on
the plot?

6.3 Compute the derivative using symbolic differentiation and using


algorithmic differentiation (either forward or reverse mode) for
the iterative code in Ex. 6.2. Use a package to facilitate the AD
portion. Most scientific computing languages have AD packages
(see Tip 6.6).

6.4 Write a forward-mode-AD script that computes the derivative of


the function in Ex. 6.3 using operator overloading. You need to
define your own type and provide it with overloaded functions
for exp, sin, cos, sqrt, addition, division, and exponentiation.

6.5 Suppose you have two airplanes that are flying in a horizontal
plane defined by 𝑥 and 𝑦 coordinates. Both airplanes start at 𝑦 = 0,
but airplane 1 starts at 𝑥 = 0, whereas airplane 2 has a head start
of 𝑥 = Δ𝑥. The airplanes fly at a constant velocity. Airplane 1 has
a velocity of 𝑣1 in the direction of the positive 𝑥-axis, and airplane
2 has a velocity of 𝑣2 at an angle 𝛾 with the 𝑥-axis. The functions
of interest are the distance (𝑑) and the angle (𝜃) between the two
airplanes as a function of time. The independent variables are
Δ𝑥, 𝛾, 𝑣1 , 𝑣2 , and 𝑡. Write the code that computes the functions of
interest (outputs) for a given set of independent variables (inputs).
Use AD to differentiate the code. Choose a set of inputs, compute
the derivatives of all the outputs with respect to the inputs, and
verify them against the complex-step method.

6.6 Kepler’s equation, which we mentioned in Section 2.2, defines the


relationship between a planet’s polar coordinates and the time
elapsed from a given initial point through the implicit equation

𝐸 − 𝑒 sin(𝐸) = 𝑀 ,

where 𝑀 is the mean anomaly (a parameterization of time), 𝐸


is the eccentric anomaly (a parameterization of the polar angle),
and 𝑒 is the eccentricity of the elliptical orbit. Suppose that the
function of interest is the difference between the eccentric and
mean anomalies,
𝑓 (𝐸, 𝑀) = 𝐸 − 𝑀 .
Derive an analytic expression for d 𝑓 /d𝑒 and d 𝑓 /d𝑀. Verify your
result against the complex-step method or AD (you will need a
solver for Kepler’s equation, which was the subject of Prob. 3.6).

6.7 Compute the derivatives for the 10-bar truss problem described
in Appendix D.2.2 using the direct and adjoint implicit differenti-
ation methods. Compute the derivatives of the objective (mass)
with respect to the design variables (10 cross-sectional areas),
and the derivatives of the constraints (stresses in all 10 bars)
with respect to the design variables (a 10 × 10 Jacobian matrix).
Compute the derivatives using the following:

a. A finite-difference formula of your choice


b. The complex-step derivative method
c. AD
d. The implicit analytic method (direct and adjoint)

Report the errors relative to the implicit analytic methods. Discuss


your findings and the relative merits of each approach.

6.8 You can now solve the 10-bar truss problem (previously solved in
Prob. 5.15) using the derivatives computed in Prob. 6.7. Solve this
optimization problem using both finite-difference derivatives and
derivatives computed using an implicit analytic method. Report
the following:

a. Convergence plot with two curves for the different derivative


computation approaches on the same plot
b. Number of function calls required to converge for each
method (This metric is more meaningful if you use more
than one starting point and average the results.)

Discuss your findings.

6.9 Aggregate the constraints for the 10-bar truss problem and extend
the code from Prob. 6.7 to compute the required constraint deriva-
tives using the implicit analytic method that is most advantageous
in this case. Verify your derivatives against the complex-step
method. Solve the optimization problem and compare your re-
sults to the ones you obtained in Prob. 6.8. How close can you get
to the reference solution?
7 Gradient-Free Optimization
Gradient-free algorithms fill an essential role in optimization. The
gradient-based algorithms introduced in Chapter 4 are efficient in
finding local minima for high-dimensional nonlinear problems defined
by continuous smooth functions. However, the assumptions made
for these algorithms are not always valid, which can render these
algorithms ineffective. Also, gradients might not be available when a
function is given as a black box.
In this chapter, we introduce only a few popular representative
gradient-free algorithms. Most are designed to handle unconstrained
functions only, but they can be adapted to solve constrained problems
by using the penalty or filtering methods introduced in Chapter 5. We
start by discussing the problem characteristics relevant to the choice
between gradient-free and gradient-based algorithms and then give an
overview of the types of gradient-free algorithms.

By the end of this chapter you should be able to:

1. Identify problems that are well suited for gradient-free


optimization.
2. Describe the characteristics and approach of more than
one gradient-free optimization method.
3. Use gradient-free optimization algorithms to solve real
engineering problems.

7.1 When to Use Gradient-Free Algorithms

Gradient-free algorithms can be useful when gradients are not available, such as when dealing with black-box functions. Although gradients can always be approximated with finite differences, these approximations suffer from potentially significant inaccuracies (see Section 6.4.2). Gradient-based algorithms require a more experienced user because they take more effort to set up and run. Overall, gradient-free algorithms are easier to get up and running but are much less efficient, particularly as the dimension of the problem increases.
One significant advantage of gradient-free algorithms is that they
do not assume function continuity. For gradient-based algorithms,
function smoothness is essential when deriving the optimality con-
ditions, both for unconstrained functions and constrained functions.
More specifically, the Karush–Kuhn–Tucker (KKT) conditions (Eq. 5.11)
require that the function be continuous in value (𝐶 0 ), gradient (𝐶 1 ), and
Hessian (𝐶 2 ) in at least a small neighborhood of the optimum. If, for
example, the gradient is discontinuous at the optimum, it is undefined,
and the KKT conditions are not valid. Away from optimum points, this
requirement is not as stringent. Although gradient-based algorithms
work on the same continuity assumptions, they can usually tolerate
the occasional discontinuity, as long as it is away from an optimum
point. However, for functions with excessive numerical noise and
discontinuities, gradient-free algorithms might be the only option.
Many considerations are involved when choosing between a gradient-
based and a gradient-free algorithm. Some of these considerations are
common sources of misconception. One problem characteristic often
cited as a reason for choosing gradient-free methods is multimodality.
Design space multimodality can be a result of an objective function
with multiple local minima. In the case of a constrained problem, the
multimodality can arise from the constraints that define disconnected
or nonconvex feasible regions.
As we will see shortly, some gradient-free methods feature a global
search that increases the likelihood of finding the global minimum. This
feature makes gradient-free methods a common choice for multimodal
problems. However, not all gradient-free methods are global search
methods; some perform only a local search. Additionally, even though
gradient-based methods are by themselves local search methods, they
are often combined with global search strategies, as discussed in Tip 4.8.
It is not necessarily true that a global search, gradient-free method is
more likely to find a global optimum than a multistart gradient-based
method. As always, problem-specific testing is needed.
Furthermore, it is assumed far too often that any complex prob-
lem is multimodal, but that is frequently not the case. Although it
might be impossible to prove that a function is unimodal, it is easy to
prove that a function is multimodal simply by finding another local
minimum. Therefore, we should not make any assumptions about
the multimodality of a function until we show definite multiple local
minima. Additionally, we must ensure that perceived local minima are
not artificial minima arising from numerical noise.

Another reason often cited for using a gradient-free method is
multiple objectives. Some gradient-free algorithms, such as the genetic
algorithm discussed in this chapter, can be naturally applied to multiple
objectives. However, it is a misconception that gradient-free methods
are always preferable just because there are multiple objectives. This
topic is introduced in Chapter 9.
Another common reason for using gradient-free methods is when
there are discrete design variables. Because the notion of a derivative
with respect to a discrete variable is invalid, gradient-based algorithms
cannot be used directly (although there are ways around this limitation,
as discussed in Chapter 8). Some gradient-free algorithms can handle
discrete variables directly.
The preceding discussion highlights that although multimodality,
multiple objectives, or discrete variables are commonly mentioned as
reasons for choosing a gradient-free algorithm, these are not necessarily
automatic decisions, and careful consideration is needed. Assuming a
choice exists (i.e., the function is not too noisy), one of the most relevant
factors when choosing between a gradient-free and a gradient-based
approach is the dimension of the problem.

[Fig. 7.1 Cost of optimization for an increasing number of design variables in the 𝑛-dimensional Rosenbrock function: number of function evaluations versus number of design variables (log–log scale). A gradient-free algorithm is compared with a gradient-based algorithm, with gradients computed analytically; the curves are annotated with 2.52 and 0.37, respectively. The gradient-based algorithm has much better scalability.]

Figure 7.1 shows how many function evaluations are required to minimize the 𝑛-dimensional Rosenbrock function for varying numbers of design variables. Two classes of algorithms are shown in the plot: gradient-free and gradient-based algorithms. The gradient-based algorithm uses analytic gradients in this case. Although the exact numbers are problem dependent, similar scaling has been observed in large-scale computational fluid dynamics–based optimization.135 The general takeaway is that for small-size problems (usually ≤ 30 variables136), gradient-free methods can be useful in finding a solution.

135. Yu et al., On the influence of optimization algorithm and starting design on wing aerodynamic shape optimization, 2018.
136. Rios and Sahinidis, Derivative-free optimization: a review of algorithms and comparison of software implementations, 2013.

Furthermore, because gradient-free methods usually take much less developer time to use, a gradient-free solution may even be preferable for
these smaller problems. However, if the problem is large in dimension,
then a gradient-based method may be the only viable method despite
the need for more developer time.

Tip 7.1 Choose your bounds carefully for global algorithms

Unlike gradient-based methods, which usually do not require design
variable bounds, global algorithms require these bounds to be set. Because
the global search tends to explore the whole design space within the specified
bounds, the algorithm’s effectiveness diminishes considerably if the variable
bounds are unnecessarily wide.

7.2 Classification of Gradient-Free Algorithms

There is a much wider variety of gradient-free algorithms compared with their gradient-based counterparts. Although gradient-based algorithms tend to perform local searches, have a mathematical rationale, and be deterministic, gradient-free algorithms exhibit different combinations of these characteristics. We list some of the most widely known gradient-free algorithms in Table 7.1 and classify them according to the characteristics introduced in Fig. 1.22.∗

∗ Rios and Sahinidis136 review and benchmark a large selection of gradient-free optimization algorithms.
136. Rios and Sahinidis, Derivative-free optimization: a review of algorithms and comparison of software implementations, 2013.

Table 7.1 Classification of gradient-free optimization methods using the characteristics of Fig. 1.22.

Method              Search   Algorithm      Function evaluation   Stochasticity
Nelder–Mead         Local    Heuristic      Direct                Deterministic
GPS                 Local    Mathematical   Direct                Deterministic
MADS                Local    Mathematical   Direct                Stochastic
Trust region        Local    Mathematical   Surrogate             Deterministic
Implicit filtering  Local    Mathematical   Surrogate             Deterministic
DIRECT              Global   Mathematical   Direct                Deterministic
MCS                 Global   Mathematical   Direct                Deterministic
EGO                 Global   Mathematical   Surrogate             Deterministic
Hit and run         Global   Heuristic      Direct                Stochastic
Evolutionary        Global   Heuristic      Direct                Stochastic

Local search, gradient-free algorithms that use direct function evaluations include the Nelder–Mead algorithm, generalized pattern search (GPS), and mesh-adaptive direct search (MADS). Although classified as local search in the table, the latter two methods are frequently used with globalization approaches. The Nelder–Mead algorithm (which we detail in Section 7.3) is heuristic, whereas the other two are not. GPS and MADS (discussed in Section 7.4) are examples of derivative-free optimization (DFO) algorithms, which, despite the name, do not include all gradient-free algorithms. DFO algorithms are understood to be largely heuristic-free and focus on local search.† GPS is a family of methods that iteratively seek an improvement using a set of points around the current point. In its simplest versions, GPS uses a pattern of points based on the coordinate directions, but more sophisticated versions use a more general set of vectors. MADS improves GPS algorithms by allowing a much larger set of such vectors and improving convergence.

† The textbooks by Conn et al.137 and Audet and Hare138 provide a more extensive treatment of gradient-free optimization algorithms that are based on mathematical criteria. Kokkolaras139 presents a succinct discussion on when to use DFO.
137. Conn et al., Introduction to Derivative-Free Optimization, 2009.
138. Audet and Hare, Derivative-Free and Blackbox Optimization, 2017.
139. Kokkolaras, When, why, and how can derivative-free optimization be useful to computational engineering design? 2020.

Model-based, local search algorithms include trust-region algorithms and implicit filtering. The model is an analytic approximation of the original function (also called a surrogate model), and it should be smooth, easy to evaluate, and accurate in the neighborhood of the current point. The trust-region approach detailed in Section 4.5 can be considered gradient-free if the surrogate model is constructed using just evaluations of the original function without evaluating its gradients. This does not prevent the trust-region algorithm from using gradients of the surrogate model, which can be computed analytically. Implicit filtering methods extend the trust-region method by adding a surrogate model of the function gradient to guide the search. This effectively becomes a gradient-based method applied to the surrogate model instead of evaluating the function directly, as done for the methods in Chapter 4.
Global search algorithms can be broadly classified as deterministic
or stochastic, depending on whether they include random parameter
generation within the optimization algorithm.
Deterministic, global search algorithms can be either direct or
model based. Direct algorithms include Lipschitzian-based parti-
tioning techniques—such as the “divide a hyperrectangle” (DIRECT)
algorithm detailed in Section 7.5 and branch-and-bound search (dis-
cussed in Chapter 8)—and multilevel coordinate search (MCS). The
DIRECT algorithm selectively divides the space of the design variables
into smaller and smaller 𝑛-dimensional boxes—hyperrectangles. It
uses mathematical arguments to decide which boxes should be sub-
divided. Branch-and-bound search also partitions the design space,

but it estimates lower and upper bounds for the optimum by using
the function variation between partitions. MCS is another algorithm
that partitions the design space into boxes, where a limit is imposed on
how small the boxes can get based on the number of times it has been
divided.
Global-search algorithms based on surrogate models are similar to
their local search counterparts. However, they use surrogate models
to reproduce the features of a multimodal function instead of convex
surrogate models. One of the most widely used of these algorithms is
efficient global optimization (EGO), which employs kriging surrogate
models and uses the idea of expected improvement to maximize the
likelihood of finding the optimum more efficiently (surrogate modeling
techniques, including kriging, are introduced in Chapter 10, which also describes EGO). Other algorithms use radial basis functions (RBFs) as
the surrogate model and also maximize the probability of improvement
at new iterates.
Stochastic algorithms rely on one or more nondeterministic procedures; they include hit-and-run algorithms and the broad class of evolutionary algorithms. When performing benchmarks of a stochastic algorithm, you should run a large enough number of optimizations to obtain statistically significant results.

Hit-and-run algorithms generate random steps about the current iterate in search of better points. A new point is accepted when it is better than the current one, and this process repeats until the point cannot be improved.

What constitutes an evolutionary algorithm is not well defined.‡ Evolutionary algorithms are inspired by processes that occur in nature or society. There is a plethora of evolutionary algorithms in the literature, thanks to the fertile imagination of the research community and a never-ending supply of phenomena for inspiration.§ These algorithms are more of an analogy of the phenomenon than an actual model. They are, at best, simplified models and, at worst, barely connected to the phenomenon. Nature-inspired algorithms tend to invent a specific terminology for the mathematical terms in the optimization problem. For example, a design point might be called a “member of the population”, or the objective function might be the “fitness”.

‡ Simon140 provides a more comprehensive review of evolutionary algorithms.
140. Simon, Evolutionary Optimization Algorithms, 2013.
§ These algorithms include the following: ant colony optimization, artificial bee colony algorithm, artificial fish swarm, artificial flora optimization algorithm, bacterial foraging optimization, bat algorithm, big bang–big crunch algorithm, biogeography-based optimization, bird mating optimizer, cat swarm optimization, cockroach swarm optimization, cuckoo search, design by shopping paradigm, dolphin echolocation algorithm, elephant herding optimization, firefly algorithm, flower pollination algorithm, fruit fly optimization algorithm, galactic swarm optimization, gray wolf optimizer, grenade explosion method, harmony search algorithm, hummingbird optimization algorithm, hybrid glowworm swarm optimization algorithm, imperialist competitive algorithm, intelligent water drops, invasive weed optimization, mine bomb algorithm, monarch butterfly optimization, moth-flame optimization algorithm, penguin search optimization algorithm, quantum-behaved particle swarm optimization, salp swarm algorithm, teaching–learning-based optimization, whale optimization algorithm, and water cycle algorithm.

The vast majority of evolutionary algorithms are population based, which means they involve a set of points at each iteration instead of a single one (we discuss a genetic algorithm in Section 7.6 and a particle swarm method in Section 7.7). Because the population is spread out in the design space, evolutionary algorithms perform a global search. The stochastic elements in these algorithms contribute to global exploration

and reduce the susceptibility to getting stuck in local minima. These
features increase the likelihood of getting close to the global minimum
but by no means guarantee it. The algorithm may only get close because
heuristic algorithms have a poor convergence rate, especially in higher
dimensions, and because they lack a first-order mathematical optimality
criterion.
This chapter covers five gradient-free algorithms: the Nelder–Mead
algorithm, GPS, the DIRECT method, genetic algorithms, and particle
swarm optimization. A few other algorithms that can be used for
continuous gradient-free problems (e.g., simulated annealing and
branch and bound) are covered in Chapter 8 because they are more
frequently used to solve discrete problems. In Chapter 10, on surrogate
modeling, we discuss kriging and efficient global optimization.

7.3 Nelder–Mead Algorithm

The simplex method of Nelder and Mead28 is a deterministic, direct-search method that is among the most cited gradient-free methods. It is also known as the nonlinear simplex—not to be confused with the simplex algorithm used for linear programming, with which it has nothing in common. To avoid ambiguity, we will refer to it as the Nelder–Mead algorithm.

28. Nelder and Mead, A simplex method for function minimization, 1965.
The Nelder–Mead algorithm is based on a simplex, which is a geometric figure defined by a set of 𝑛 + 1 points in the design space of 𝑛 variables, 𝑋 = {𝑥 (0) , 𝑥 (1) , . . . , 𝑥 (𝑛) }. Each point 𝑥 (𝑖) represents a design (i.e., a full set of design variables). In two dimensions, the simplex is a triangle, and in three dimensions, it becomes a tetrahedron (see Fig. 7.2).

[Fig. 7.2 A simplex for 𝑛 = 3 has four vertices.]

Each optimization iteration corresponds to a different simplex. The algorithm modifies the simplex at each iteration using five simple operations. The sequence of operations to be performed is chosen based on the relative values of the objective function at each of the points.
The first step of the simplex algorithm is to generate 𝑛 + 1 points
based on an initial guess for the design variables. This could be done by
simply adding steps to each component of the initial point to generate
𝑛 new points. However, this will generate a simplex with different edge
lengths, and equal-length edges are preferable. Suppose we want the length of all sides to be 𝑙 and that the first guess is 𝑥 (0) . The remaining points of the simplex, 𝑥 (1) , . . . , 𝑥 (𝑛) , can be computed by

$$x^{(i)} = x^{(0)} + s^{(i)}, \qquad (7.1)$$



where 𝑠 (𝑖) is a vector whose components 𝑗 are defined by

$$
s^{(i)}_j =
\begin{cases}
\dfrac{l}{n\sqrt{2}}\left(\sqrt{n+1}-1\right) + \dfrac{l}{\sqrt{2}}, & \text{if } j = i \\[1ex]
\dfrac{l}{n\sqrt{2}}\left(\sqrt{n+1}-1\right), & \text{if } j \neq i .
\end{cases}
\qquad (7.2)
$$
Figure 7.3 shows a starting simplex for a two-dimensional problem.

[Fig. 7.3 Starting simplex for 𝑛 = 2.]

At any given iteration, the objective 𝑓 is evaluated for every point, and the points are ordered based on the respective values of 𝑓 , from the lowest to the highest. Thus, in the ordered list of simplex points 𝑋 = {𝑥 (0) , 𝑥 (1) , . . . , 𝑥 (𝑛−1) , 𝑥 (𝑛) }, the best point is 𝑥 (0) , and the worst one is 𝑥 (𝑛) .

The Nelder–Mead algorithm performs five main operations on the simplex to create a new one: reflection, expansion, outside contraction, inside contraction, and shrinking, as shown in Fig. 7.4. Except for shrinking,
each of these operations generates a new point,

$$x = x_c + \alpha \left( x_c - x^{(n)} \right), \qquad (7.3)$$

where 𝛼 is a scalar, and 𝑥 𝑐 is the centroid of all the points except for the worst one, that is,

$$x_c = \frac{1}{n} \sum_{i=0}^{n-1} x^{(i)} . \qquad (7.4)$$

This generates a new point along the line that connects the worst point,
𝑥 (𝑛) , and the centroid of the remaining points, 𝑥 𝑐 . This direction can be
seen as a possible descent direction.
Each iteration aims to replace the worst point with a better one
to form a new simplex. Each iteration always starts with reflection,
which generates a new point using Eq. 7.3 with 𝛼 = 1, as shown in
Fig. 7.4. If the reflected point is better than the best point, then the
“search direction” was a good one, and we go further by performing an
expansion using Eq. 7.3 with 𝛼 = 2. If the reflected point is between the
second-worst and the worst point, then the direction was not great, but
it improved somewhat. In this case, we perform an outside contraction
(𝛼 = 1/2). If the reflected point is worse than our worst point, we try
an inside contraction instead (𝛼 = −1/2). Shrinking is a last-resort
operation that we can perform when nothing along the line connecting
𝑥 (𝑛) and 𝑥 𝑐 produces a better point. This operation consists of reducing
the size of the simplex by moving all the points closer to the best point,
 
$$x^{(i)} = x^{(0)} + \gamma \left( x^{(i)} - x^{(0)} \right) \quad \text{for } i = 1, \ldots, n , \qquad (7.5)$$

where 𝛾 = 0.5.

[Fig. 7.4 Nelder–Mead algorithm operations for 𝑛 = 2: initial simplex, reflection (𝛼 = 1), expansion (𝛼 = 2), outside contraction (𝛼 = 0.5), inside contraction (𝛼 = −0.5), and shrink.]

Algorithm 7.1 details how a new simplex is obtained for each iteration. In each iteration, the focus is on replacing the worst point with a better one instead of improving the best. The corresponding flowchart is shown in Fig. 7.5.
The cost for each iteration is one function evaluation if the reflection
is accepted, two function evaluations if an expansion or contraction is
performed, and 𝑛 + 2 evaluations if the iteration results in shrinking.
Although we could parallelize the 𝑛 evaluations when shrinking, it
would not be worthwhile because the other operations are sequential.
There are several ways to quantify the convergence of the simplex method. One straightforward way is to use the size of the simplex, that is,

$$\Delta_x = \sum_{i=0}^{n-1} \left\| x^{(i)} - x^{(n)} \right\| , \qquad (7.6)$$

and specify that it must be less than a certain tolerance. Another measure of convergence we can use is the standard deviation of the function value,

$$\Delta_f = \sqrt{ \frac{ \sum_{i=0}^{n} \left( f^{(i)} - \bar{f} \right)^2 }{ n+1 } } , \qquad (7.7)$$
where 𝑓¯ is the mean of the 𝑛 + 1 function values. Another possible
convergence criterion is the difference between the best and worst value
in the simplex. Nelder–Mead is known for occasionally converging to
non-stationary points, so you should check the result if possible.
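For reference, Eqs. 7.6 and 7.7 take only a few lines to compute. The following Python (NumPy) sketch is an illustrative helper of ours (the function name is an assumption), assuming the simplex is stored as an (𝑛 + 1) × 𝑛 array ordered from best to worst:

import numpy as np

def simplex_convergence_metrics(X, F):
    """Simplex size (Eq. 7.6) and standard deviation of f (Eq. 7.7).
    X has shape (n+1, n) with the worst point last; F holds the n+1 values."""
    dx = np.sum(np.linalg.norm(X[:-1] - X[-1], axis=1))
    df = np.sqrt(np.sum((F - np.mean(F)) ** 2) / len(F))
    return dx, df

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
F = np.array([1.0, 2.0, 4.0])
print(simplex_convergence_metrics(X, F))   # approximately (2.41, 1.25)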

Algorithm 7.1 Nelder–Mead algorithm

Inputs:
𝑥 (0) : Starting point
𝜏𝑥 : Simplex size tolerance
𝜏 𝑓 : Function value standard deviation tolerance
Outputs:
𝑥 ∗ : Optimal point

for 𝑗 = 1 to 𝑛 do    Create a simplex with edge length 𝑙
    𝑥 (𝑗) = 𝑥 (0) + 𝑠 (𝑗)    𝑠 (𝑗) given by Eq. 7.2
end for
while Δ𝑥 > 𝜏𝑥 or Δ 𝑓 > 𝜏 𝑓 do    Simplex size (Eq. 7.6) and standard deviation (Eq. 7.7)
    Sort {𝑥 (0) , . . . , 𝑥 (𝑛−1) , 𝑥 (𝑛) }    Order from the lowest (best) to the highest 𝑓 (𝑥 (𝑗) )
    𝑥 𝑐 = (1/𝑛) Σ 𝑖=0,...,𝑛−1 𝑥 (𝑖)    Centroid excluding the worst point 𝑥 (𝑛) (Eq. 7.4)
    𝑥 𝑟 = 𝑥 𝑐 + (𝑥 𝑐 − 𝑥 (𝑛) )    Reflection, Eq. 7.3 with 𝛼 = 1
    if 𝑓 (𝑥 𝑟 ) < 𝑓 (𝑥 (0) ) then    Is the reflected point better than the best?
        𝑥 𝑒 = 𝑥 𝑐 + 2(𝑥 𝑐 − 𝑥 (𝑛) )    Expansion, Eq. 7.3 with 𝛼 = 2
        if 𝑓 (𝑥 𝑒 ) < 𝑓 (𝑥 (0) ) then    Is the expanded point better than the best?
            𝑥 (𝑛) = 𝑥 𝑒    Accept expansion and replace worst point
        else
            𝑥 (𝑛) = 𝑥 𝑟    Accept reflection
        end if
    else if 𝑓 (𝑥 𝑟 ) ≤ 𝑓 (𝑥 (𝑛−1) ) then    Is the reflected point better than the second worst?
        𝑥 (𝑛) = 𝑥 𝑟    Accept reflected point
    else
        if 𝑓 (𝑥 𝑟 ) > 𝑓 (𝑥 (𝑛) ) then    Is the reflected point worse than the worst?
            𝑥 𝑖𝑐 = 𝑥 𝑐 − 0.5(𝑥 𝑐 − 𝑥 (𝑛) )    Inside contraction, Eq. 7.3 with 𝛼 = −0.5
            if 𝑓 (𝑥 𝑖𝑐 ) < 𝑓 (𝑥 (𝑛) ) then    Is the inside contraction better than the worst?
                𝑥 (𝑛) = 𝑥 𝑖𝑐    Accept inside contraction
            else
                for 𝑗 = 1 to 𝑛 do
                    𝑥 (𝑗) = 𝑥 (0) + 0.5(𝑥 (𝑗) − 𝑥 (0) )    Shrink, Eq. 7.5 with 𝛾 = 0.5
                end for
            end if
        else
            𝑥 𝑜𝑐 = 𝑥 𝑐 + 0.5(𝑥 𝑐 − 𝑥 (𝑛) )    Outside contraction, Eq. 7.3 with 𝛼 = 0.5
            if 𝑓 (𝑥 𝑜𝑐 ) < 𝑓 (𝑥 𝑟 ) then    Is the outside contraction better than the reflection?
                𝑥 (𝑛) = 𝑥 𝑜𝑐    Accept outside contraction
            else
                for 𝑗 = 1 to 𝑛 do
                    𝑥 (𝑗) = 𝑥 (0) + 0.5(𝑥 (𝑗) − 𝑥 (0) )    Shrink, Eq. 7.5 with 𝛾 = 0.5
                end for
            end if
        end if
    end if
end while


[Fig. 7.5 Flowchart of Nelder–Mead (Alg. 7.1).]

Like most direct-search methods, Nelder–Mead cannot directly handle constraints. One approach to handling constraints would be to use a penalty method (discussed in Section 5.4) to form an unconstrained problem. In this case, the penalty does not need to be differentiable, so a linear penalty method would suffice.

Example 7.1 Nelder–Mead algorithm applied to the bean function

Figure 7.6 shows the sequence of simplices that results when minimizing
the bean function using a Nelder–Mead simplex. The initial simplex on the
upper left is equilateral. The first iteration is an expansion, followed by an
inside contraction, another reflection, and an inside contraction before the
shrinking. The simplices then shrink dramatically in size, slowly converging to
the minimum.
Using a convergence tolerance of 10−6 in the difference between 𝑓best and
𝑓worst , the problem took 68 function evaluations.
[Fig. 7.6 Sequence of simplices that minimize the bean function, from the starting point 𝑥0 to the minimum 𝑥∗.]
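The steps of Alg. 7.1 translate directly into a compact implementation. The following Python (NumPy) code is an illustrative sketch of ours (not the authors' reference implementation); the stopping defaults, the starting-simplex edge length, and the quadratic test function are arbitrary choices:

import numpy as np

def nelder_mead(f, x0, l=1.0, tau_x=1e-6, tau_f=1e-6, max_iter=500):
    n = len(x0)
    # Starting simplex with edge length l (Eqs. 7.1-7.2)
    b = l / (n * np.sqrt(2)) * (np.sqrt(n + 1) - 1)
    X = np.tile(np.asarray(x0, dtype=float), (n + 1, 1))
    for i in range(n):
        X[i + 1] += b
        X[i + 1, i] += l / np.sqrt(2)
    F = np.array([f(x) for x in X])
    for _ in range(max_iter):
        order = np.argsort(F)                 # best (lowest f) first
        X, F = X[order], F[order]
        dx = np.sum(np.linalg.norm(X[:-1] - X[-1], axis=1))   # Eq. 7.6
        df = np.std(F)                                        # Eq. 7.7
        if dx <= tau_x and df <= tau_f:
            break
        xc = np.mean(X[:-1], axis=0)          # centroid excluding worst (Eq. 7.4)
        xr = xc + (xc - X[-1]); fr = f(xr)    # reflection (alpha = 1)
        if fr < F[0]:
            xe = xc + 2 * (xc - X[-1]); fe = f(xe)   # expansion (alpha = 2)
            X[-1], F[-1] = (xe, fe) if fe < F[0] else (xr, fr)
        elif fr <= F[-2]:
            X[-1], F[-1] = xr, fr             # accept reflection
        else:
            if fr > F[-1]:
                xic = xc - 0.5 * (xc - X[-1]); fic = f(xic)  # inside contraction
                if fic < F[-1]:
                    X[-1], F[-1] = xic, fic
                else:                         # shrink toward the best point (Eq. 7.5)
                    X[1:] = X[0] + 0.5 * (X[1:] - X[0])
                    F[1:] = [f(x) for x in X[1:]]
            else:
                xoc = xc + 0.5 * (xc - X[-1]); foc = f(xoc)  # outside contraction
                if foc < fr:
                    X[-1], F[-1] = xoc, foc
                else:                         # shrink toward the best point (Eq. 7.5)
                    X[1:] = X[0] + 0.5 * (X[1:] - X[0])
                    F[1:] = [f(x) for x in X[1:]]
    i = np.argmin(F)
    return X[i], F[i]

# Usage on a simple quadratic with minimum at [3, -1]:
xopt, fopt = nelder_mead(lambda x: (x[0] - 3) ** 2 + (x[1] + 1) ** 2, [0.0, 0.0])
print(xopt, fopt)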

7.4 Generalized Pattern Search

GPS builds upon the ideas of a coordinate search algorithm. In co-
ordinate search, we evaluate points along a mesh aligned with the
coordinate directions, move toward better points, and shrink the mesh
when we find no improvement nearby. Consider a two-dimensional
coordinate search for an unconstrained problem. At a given point
𝑥 𝑘 , we evaluate points that are a distance Δ 𝑘 away in all coordinate
directions, as shown in Fig. 7.7. If the objective function improves at any
of these points (four points in this case), we recenter with 𝑥 𝑘+1 at the
most improved point, keep the mesh size the same at Δ 𝑘+1 = Δ 𝑘 , and
start with the next iteration. Alternatively, if none of the points offers an
improvement, we keep the same center (𝑥 𝑘+1 = 𝑥 𝑘 ) and shrink the mesh
to Δ 𝑘+1 < Δ 𝑘 . This process repeats until it meets some convergence
criteria.
We now explore various ways in which GPS improves coordinate search. Coordinate search moves along coordinate directions, but this is not necessarily desirable. Instead, the GPS search directions only need to form a positive spanning set. Given a set of directions 𝐷 = {𝑑1 , 𝑑2 , . . . , 𝑑𝑛𝑑 }, the set 𝐷 is a positive spanning set if the vectors are linearly independent and a nonnegative linear combination of these vectors spans the 𝑛-dimensional space.∗ Coordinate vectors fulfill this requirement, but there is an infinite number of options. The vectors 𝑑 are referred to as positive spanning directions. We only consider linear combinations with positive multipliers, so in two dimensions, the unit coordinate vectors 𝑒̂1 and 𝑒̂2 are not sufficient to span two-dimensional space; however, 𝑒̂1 , 𝑒̂2 , −𝑒̂1 , and −𝑒̂2 are sufficient.

[Fig. 7.7 Local mesh for a two-dimensional coordinate search at iteration 𝑘.]

∗ Section 5.2 discusses the concept of span and polyhedral cones; Fig. 5.6 is particularly relevant.

For a given dimension 𝑛, the largest number of vectors that could
be used while remaining linearly independent (known as the maximal
set) is 2𝑛. Conversely, the minimum number of possible vectors needed
to span the space (known as the minimal set) is 𝑛 + 1. These sizes are
necessary but not sufficient conditions.
Some algorithms randomly generate a positive spanning set, whereas
other algorithms require the user to specify a set based on knowledge
of the problem. The positive spanning set need not be fixed throughout
the optimization. A common default for a maximal set is the set of
coordinate directions ±𝑒̂𝑖 . In three dimensions, this would be:

$$
D = \{d_1, \ldots, d_6\}, \quad \text{where} \quad
\begin{cases}
d_1 = [1, 0, 0] \\
d_2 = [0, 1, 0] \\
d_3 = [0, 0, 1] \\
d_4 = [-1, 0, 0] \\
d_5 = [0, -1, 0] \\
d_6 = [0, 0, -1] .
\end{cases}
\qquad (7.8)
$$

A potential default minimal set is the positive coordinate directions +ˆ𝑒 𝑖
and a vector filled with −1 (or more generally, the negative sum of the
other vectors). As an example in three dimensions:

$$
D = \{d_1, \ldots, d_4\}, \quad \text{where} \quad
\begin{cases}
d_1 = [1, 0, 0] \\
d_2 = [0, 1, 0] \\
d_3 = [0, 0, 1] \\
d_4 = [-1, -1, -1] .
\end{cases}
\qquad (7.9)
$$

Figure 7.8 shows an example maximal set (four vectors) and minimal set (three vectors) for a two-dimensional problem.

[Fig. 7.8 A maximal set of positive spanning vectors in two dimensions (left) and a minimal set (right).]

These direction vectors are then used to create a mesh. Given a current center point 𝑥 𝑘 , which is the best point found so far, and a mesh size Δ 𝑘 , the mesh is created as follows:

$$x_k + \Delta_k \, d \quad \text{for all } d \in D . \qquad (7.10)$$

For example, in two dimensions, if the current point is 𝑥 𝑘 = [1, 1], the mesh size is Δ 𝑘 = 0.5, and we use the coordinate directions for 𝑑, then the mesh points would be {[1, 1.5], [1, 0.5], [0.5, 1], [1.5, 1]}.
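This small calculation can be scripted directly; the following Python sketch of ours reproduces the mesh of Eq. 7.10 for that example (the function name poll_points is an assumption):

import numpy as np

def poll_points(xk, delta, D):
    """Mesh points x_k + Delta_k * d for all d in D (Eq. 7.10)."""
    return [np.asarray(xk) + delta * np.asarray(d) for d in D]

D = [[1, 0], [0, 1], [-1, 0], [0, -1]]   # coordinate directions
print(poll_points([1.0, 1.0], 0.5, D))
# -> [array([1.5, 1. ]), array([1. , 1.5]), array([0.5, 1. ]), array([1. , 0.5])]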
The evaluation of points in the mesh is called polling or a poll. In
the coordinate search example, we evaluated every point in the mesh,
which is usually inefficient. More typically, we use opportunistic polling,
which terminates polling at the first point that offers an improvement.

Figure 7.9 shows a two-dimensional example where the order of evaluation is 𝑑1 = [1, 0], 𝑑2 = [0, 1], 𝑑3 = [−1, 0], 𝑑4 = [0, −1]. Because we found an improvement at 𝑑2 , we do not continue evaluating 𝑑3 and 𝑑4 . Opportunistic polling may not yield the best point in the mesh, but the improvement in efficiency is usually worth the trade-off. Some algorithms add a user option for utilizing a full poll, in which case all points in the mesh are evaluated. If more than one point offers a reduction, the best one is accepted. Another approach that is sometimes used is called dynamic polling. In this approach, a successful poll reorders the direction vectors so that the direction that was successful last time is checked first in the next poll.

[Fig. 7.9 A two-dimensional example of opportunistic polling with 𝑑1 = [1, 0], 𝑑2 = [0, 1], 𝑑3 = [−1, 0], 𝑑4 = [0, −1]. An improvement in 𝑓 was found at 𝑑2 , so we do not evaluate 𝑑3 and 𝑑4 (shown with a faded color).]

GPS consists of two phases: a search phase and a poll phase. The search phase is global, whereas the poll phase is local. The search phase is intended to be flexible and is not specified by the GPS algorithm. Common options for the search phase include the following:

• No search phase.
• A mesh search, similar to polling but with large spacing across the domain.
• An alternative solver, such as Nelder–Mead or a genetic algorithm.
• A surrogate model, which could then use any number of solvers
that include gradient-based methods. This approach is often used
when the function is expensive, and a lower-fidelity surrogate
can guide the optimizer to promising regions of the larger design
space.
• Random evaluation using a space-filling method (see Section 10.2).

The type of search can change throughout the optimization. Like the
polling phase, the goal of the search phase is to find a better point
(i.e., 𝑓 (𝑥 𝑘+1 ) < 𝑓 (𝑥 𝑘 )) but within a broader domain. We begin with a
search at every iteration. If the search fails to produce a better point, we
continue with a poll. If a better point is identified in either phase, the
iteration ends, and we begin a new search. Optionally, a successful poll
could be followed by another poll. Thus, at each iteration, we might
perform a search and a poll, just a search, or just a poll. Δ𝑘
We describe one option for a search procedure based on the same
mesh ideas as the polling step. The concept is to extend the mesh
throughout the entire domain, as shown in Fig. 7.10. In this example, the
Fig. 7.10 Meshing strategy extended
mesh size Δ 𝑘 is shared between the search and poll phases. However, it across the domain. The same direc-
is usually more effective if these sizes are independent. Mathematically, tions (and potentially spacing) are
we can define the global mesh as the set repeated at each mesh point, as indi-
cated by the lighter arrows through-
𝐺 = {𝑥 𝑘 + Δ 𝑘 𝐷𝑧 for all 𝑧 𝑖 ∈ Z+ }, (7.11) out the entire domain.

where 𝐷 is a matrix whose columns contain the basis vectors 𝑑. The
vector 𝑧 consists of nonnegative integers, and we consider all possible
combinations of integers that fall within the bounds of the domain.
We choose a fixed number of search evaluation points and randomly
select points from the global mesh for the search strategy. If improve-
ment is found among that set, then we recenter 𝑥 𝑘+1 at this improved
point, grow the mesh (Δ 𝑘+1 > Δ 𝑘 ), and end the iteration (and then
restart the search). A simple search phase along these lines is described
in Alg. 7.2 and the main GPS algorithm is shown in Alg. 7.3.

Algorithm 7.2 An example search phase for GPS

Inputs:
𝑥 𝑘 : Center point
Δ 𝑘 : Mesh size
𝑥, 𝑥: Lower and upper bounds
𝐷: Column vectors representing positive spanning set
𝑛 𝑠 : Number of search points
𝑓 𝑘 : The function value previously evaluated, 𝑓 𝑘 = 𝑓 (𝑥 𝑘 )
Outputs:
success: True if successful in finding improved point
𝑥 𝑘+1 : New center point
𝑓 𝑘+1 : Corresponding function value

success = false
𝑥 𝑘+1 = 𝑥 𝑘
𝑓 𝑘+1 = 𝑓 𝑘
Construct global mesh 𝐺, using directions 𝐷, mesh size Δ 𝑘 , and bounds 𝑥, 𝑥
for 𝑖 = 1 to 𝑛 𝑠 do
Randomly select 𝑠 ∈ 𝐺
Evaluate 𝑓𝑠 = 𝑓 (𝑠)
if 𝑓𝑠 < 𝑓 𝑘 then
𝑥 𝑘+1 = 𝑠
𝑓 𝑘+1 = 𝑓𝑠
success = true
break
end if
end for

The convergence of the GPS algorithm is often determined by a
user-specified maximum number of iterations. However, other criteria
are also used, such as a threshold on mesh size or a threshold on the
improvement in the function value over previous iterations.

Algorithm 7.3 Generalized Pattern Search

Inputs:
𝑥 0 : Starting point
𝑥, 𝑥: Lower and upper bounds
Δ0 : Starting mesh size
𝑛 𝑠 : Number of search points
𝑘max : Maximum number of iterations
Outputs:
𝑥 ∗ : Best point
𝑓 ∗ : Corresponding function value

𝐷 = [𝐼, −𝐼] where 𝐼 is (𝑛 × 𝑛)    A coordinate-aligned maximal positive spanning set (for example)
𝑘=0
𝑥 𝑘 = 𝑥0
Evaluate 𝑓 𝑘 = 𝑓 (𝑥 𝑘 )
while 𝑘 < 𝑘max do Or other termination criteria
search_success, 𝑥 𝑘+1 , 𝑓 𝑘+1 = search(𝑥 𝑘 , Δ 𝑘 , 𝑓 𝑘 ) Any search strategy
if search_success then
Δ 𝑘+1 = min(2Δ 𝑘 , Δmax ) Or some other growth rate
𝑘 = 𝑘+1
continue Move on to next iteration
else Poll
poll_success = false
for 𝑗 = 1 to 𝑛 𝑑 do
𝑠 = 𝑥 𝑘 + Δ𝑘 𝑑 𝑗 Where 𝑑 𝑗 is a column of 𝐷
Evaluate 𝑓𝑠 = 𝑓 (𝑠)
if 𝑓𝑠 < 𝑓 𝑘 then
𝑥 𝑘+1 = 𝑠
𝑓 𝑘+1 = 𝑓𝑠
Δ 𝑘+1 = Δ 𝑘
poll_success = true
break
end if
end for
end if
if not poll_success then
𝑥 𝑘+1 = 𝑥 𝑘
𝑓 𝑘+1 = 𝑓 𝑘
Δ 𝑘+1 = 0.5Δ 𝑘 Shrink
end if
𝑘 = 𝑘 + 1
end while
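A stripped-down version of Alg. 7.3 with no search phase (polling only) can be sketched in Python as follows; the bound handling by clipping, the shrink factor of 0.5, and the stopping thresholds are our own simple choices for illustration:

import numpy as np

def gps_poll_only(f, x0, lb, ub, delta0=0.5, max_iter=200, delta_min=1e-6):
    """Minimal GPS sketch: opportunistic polling on the maximal coordinate
    set D = [I, -I], shrinking the mesh when no poll point improves."""
    n = len(x0)
    D = np.vstack([np.eye(n), -np.eye(n)])
    x, fx, delta = np.asarray(x0, dtype=float), f(x0), delta0
    for _ in range(max_iter):
        if delta < delta_min:
            break
        success = False
        for d in D:
            s = np.clip(x + delta * d, lb, ub)   # simple bound handling (our choice)
            fs = f(s)
            if fs < fx:
                x, fx = s, fs                    # accept the first improvement
                success = True
                break
        if not success:
            delta *= 0.5                         # shrink the mesh
    return x, fx

# Usage on a simple quadratic with minimum at [1, 2]:
xopt, fopt = gps_poll_only(lambda x: (x[0] - 1) ** 2 + (x[1] - 2) ** 2,
                           [0.0, 0.0], lb=[-5, -5], ub=[5, 5])
print(xopt, fopt)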

GPS can handle linear and nonlinear constraints. For linear con-
straints, one effective strategy is to change the positive spanning di-
rections so that they align with any linear constraints that are nearby
(Fig. 7.11). For nonlinear constraints, penalty approaches (Section 5.4)
are applicable, although the filter method (Section 5.5.3) is another
effective approach.

[Fig. 7.11 Mesh direction changed during optimization to align with linear constraints when close to the constraint.]

Example 7.2 Minimization of a multimodal function with GPS

In this example, we optimize the Jones function (Appendix D.1.4). We start at 𝑥 = [0, 0] with an initial mesh size of Δ = 0.1. We evaluate two search points at each iteration and run for 12 iterations. The iterations are plotted in Fig. 7.12.

[Fig. 7.12 Convergence history of a GPS algorithm on the multimodal Jones function. Faded points indicate past iterations. Panels show iterations 𝑘 = 0, 2, 4, 6, 8, and 12.]

MADS is a well-known extension of GPS. The main difference between these two methods is in the number of possibilities for polling directions.141 In GPS, the polling directions are relatively restrictive (e.g., left side of Fig. 7.13 for a minimal basis in two dimensions). MADS adds a new sizing parameter called the poll size parameter (Δ𝑘^𝑝) that can be varied independently from the existing mesh size parameter (Δ𝑘^𝑚). These sizes are constrained by Δ𝑘^𝑝 ≥ Δ𝑘^𝑚, so the mesh sizing can become smaller while allowing the poll size (which dictates the maximum magnitude of the step) to remain large. This provides a much denser set of options in poll directions (e.g., the grid points on the right panel of Fig. 7.13). MADS randomly chooses the polling directions from this much larger set of possibilities while maintaining a positive spanning set.†

141. Audet and J. E. Dennis, Mesh adaptive direct search algorithms for constrained optimization, 2006.
† The NOMAD software is an open-source implementation of MADS.142
142. Le Digabel, Algorithm 909: NOMAD: Nonlinear optimization with the MADS algorithm, 2011.

[Fig. 7.13 Typical GPS spanning directions (left). In contrast, MADS randomly selects from many potential spanning directions by utilizing a finer mesh (right).]

7.5 DIRECT Algorithm

The DIRECT algorithm is different from the other gradient-free optimization algorithms in this chapter in that it is based on mathematical arguments.∗ This is a deterministic method guaranteed to converge to the global optimum under conditions that are not too restrictive (although it might require a prohibitive number of function evaluations). DIRECT has been extended to handle constraints without relying on penalty or filtering methods, but here we only explain the algorithm for unconstrained problems.143

∗ Jones et al.52 developed this method, aiming for a global search that did not rely on tunable parameters (e.g., population size in genetic algorithms).53
52. Jones et al., Lipschitzian optimization without the Lipschitz constant, 1993.
53. Jones and Martins, The DIRECT algorithm—25 years later, 2021.
143. Jones, Direct Global Optimization Algorithm, 2009.

One way to ensure that we find the global optimum within a finite design space is by dividing this space into a regular rectangular grid

and evaluating every point in this grid. This is called an exhaustive search,
and the precision of the minimum depends on how fine the grid is. The
cost of this brute-force strategy is high and goes up exponentially with
the number of design variables.
The DIRECT method relies on a grid, but it uses an adaptive meshing
scheme that dramatically reduces the cost. It starts with a single 𝑛-
dimensional hypercube that spans the whole design space—like many
other gradient-free methods, DIRECT requires upper and lower bounds
on all the design variables. Each iteration divides this hypercube into
smaller ones and evaluates the objective function at the center of each
of these. At each iteration, the algorithm only divides rectangles
determined to be potentially optimal. The fundamental strategy in the

DIRECT method is how it determines this subset of potentially optimal
rectangles, which is based on the mathematical concept of Lipschitz
continuity.
We start by explaining Lipschitz continuity and then describe
an algorithm for finding the global minimum of a one-dimensional
function using this concept—Shubert’s algorithm. Although Shubert’s
algorithm is not practical in general, it will help us understand the
mathematical rationale for the DIRECT algorithm. Then we explain the
DIRECT algorithm for one-dimensional functions before generalizing
it for 𝑛 dimensions.

Lipschitz Constant

Consider the single-variable function 𝑓 (𝑥) shown in Fig. 7.14. For a trial point 𝑥 ∗ , we can draw a cone with slope 𝐿 by drawing the lines

$$f_+(x) = f(x^*) + L (x - x^*), \qquad (7.12)$$
$$f_-(x) = f(x^*) - L (x - x^*), \qquad (7.13)$$

to the left and right, respectively. We show this cone in Fig. 7.14 (left), as well as cones corresponding to other values of 𝐿.

[Fig. 7.14 From a given trial point 𝑥 ∗ , we can draw a cone with slope 𝐿 (left). For a function to be Lipschitz continuous, we need all cones with slope 𝐿 to lie under the function for all points in the domain (right).]

A function 𝑓 is said to be Lipschitz continuous if there is a cone slope 𝐿 such that the cones for all possible trial points in the domain remain under the function. This means that there is a positive constant 𝐿 such that

$$| f(x) - f(x^*) | \leq L \, | x - x^* | \quad \text{for all } x, x^* \in D , \qquad (7.14)$$

where 𝐷 is the function domain. Graphically, this condition means that we can draw a cone with slope 𝐿 from any trial point evaluation 𝑓 (𝑥 ∗ ) such that the function is always bounded by the cone, as shown in Fig. 7.14 (right). Any 𝐿 that satisfies Eq. 7.14 is a Lipschitz constant for the corresponding function.

Shubert’s Algorithm

If a Lipschitz constant for a single-variable function is known, Shubert’s
algorithm can find the global minimum of that function. Because the

Lipschitz constant is not available in the general case, the DIRECT
algorithm is designed to not require this constant. However, we explain
Shubert’s algorithm first because it provides some of the basic concepts
used in the DIRECT algorithm.
Shubert’s algorithm starts with a domain within which we want to
find the global minimum—[𝑎, 𝑏] in Fig. 7.15. Using the property of the
Lipschitz constant 𝐿 defined in Eq. 7.14, we know that the function is
always above a cone of slope 𝐿 evaluated at any point in the domain.

[Fig. 7.15 Shubert’s algorithm requires an initial domain and a valid Lipschitz constant and then increases the lower bound of the global minimum with each successive iteration (iterations 𝑘 = 0, 1, 2, and 3 shown).]

Shubert’s algorithm starts by sampling the endpoints of the interval
within which we want to find the global minimum ([𝑎, 𝑏] in Fig. 7.15).
We start by establishing a first lower bound on the global minimum by
finding the cone’s intersection (𝑥1 in Fig. 7.15, 𝑘 = 0) for the extremes of
the domain. We evaluate the function at 𝑥1 and can now draw a cone
about this point to find two more intersections (𝑥2 and 𝑥3 ). Because
these two points always intersect at the same objective lower bound
value, they both need to be evaluated. Each subsequent iteration of
Shubert’s algorithm adds two new points to either side of the current
point. These two points are evaluated, and the lower bounding function

is updated with the resulting new cones. We then iterate by finding the
two points that minimize the new lower bounding function, evaluating
the function at these points, updating the lower bounding function,
and so on.
The lowest bound on the function increases at each iteration and
ultimately converges to the global minimum. At the same time, the
segments in 𝑥 decrease in size. The lower bound can switch from
distinct regions as the lower bound in one region increases beyond the
lower bound in another region.
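The key computation at each iteration—the intersection of the two cones drawn from adjacent sampled points 𝑎 < 𝑏 and the lower bound value there—reduces to two short formulas. The following Python sketch (function name and sample numbers are our own illustrative choices) shows this step:

def cone_intersection(a, fa, b, fb, L):
    """Intersection of the cone rising to the right from (a, fa) with the cone
    rising to the left from (b, fb), for a Lipschitz constant L and a < b."""
    x = 0.5 * (a + b) + (fa - fb) / (2.0 * L)      # new point to evaluate
    bound = 0.5 * (fa + fb) - 0.5 * L * (b - a)    # lower bound value at that point
    return x, bound

# Domain endpoints of [0, 2] with f(0) = 3, f(2) = 1, and L = 4:
print(cone_intersection(0.0, 3.0, 2.0, 1.0, L=4.0))   # (1.25, -2.0)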
The two significant shortcomings of Shubert’s algorithm are that
(1) a Lipschitz constant is usually not available for a general function,
and (2) it is not easily extended to 𝑛 dimensions. The DIRECT algorithm
addresses these two shortcomings.

One-Dimensional DIRECT

Before explaining the 𝑛-dimensional DIRECT algorithm, we introduce
the one-dimensional version based on principles similar to those of the
Shubert algorithm.

[Fig. 7.16 The DIRECT algorithm evaluates the middle point 𝑐 = (𝑎 + 𝑏)/2 of the domain (left), and each successive iteration trisects the segments that have the greatest potential (right).]

Like Shubert’s method, DIRECT starts with the domain [𝑎, 𝑏]. However, instead of sampling the endpoints 𝑎 and 𝑏, it samples the midpoint. Consider the closed domain [𝑎, 𝑏] shown in Fig. 7.16 (left). For each segment, we evaluate the objective function at the segment’s midpoint. In the first segment, which spans the whole domain, the midpoint is 𝑐0 = (𝑎 + 𝑏)/2. Assuming some value of 𝐿, which is not known and that we will not need, the lower bound on the minimum would be 𝑓 (𝑐) − 𝐿(𝑏 − 𝑎)/2.
We want to increase this lower bound on the function minimum
by dividing this segment further. To do this in a regular way that
reuses previously evaluated points and can be repeated indefinitely,

we divide it into three segments, as shown in Fig. 7.16 (right). Now we
have increased the lower bound on the minimum. Unlike the Shubert
algorithm, the lower bound is a discontinuous function across the
segments, as shown in Fig. 7.16 (right).
Instead of continuing to divide every segment into three other
segments, we only divide segments selected according to a potentially
optimal criterion. To better understand this criterion, consider a set of
segments [𝑎 𝑖 , 𝑏 𝑖 ] at a given DIRECT iteration, where segment 𝑖 has a
half-length 𝑑 𝑖 = (𝑏 𝑖 − 𝑎 𝑖 )/2 and a function value 𝑓 (𝑐 𝑖 ) evaluated at the
segment center 𝑐 𝑖 = (𝑎 𝑖 + 𝑏 𝑖 )/2. If we plot 𝑓 (𝑐 𝑖 ) versus 𝑑 𝑖 for a set of
segments, we get the pattern shown in Fig. 7.17.

[Fig. 7.17 Potentially optimal segments in the DIRECT algorithm are identified by the lower convex hull of the 𝑓 (𝑐) versus 𝑑 plot.]

The overall rationale for the potentially optimal criterion is that two
metrics quantify this potential: the size of the segment and the function
value at the center of the segment. The larger the segment is, the greater
the potential for that segment to contain the global minimum. The
lower the function value, the greater that potential is as well. For a set
of segments of the same size, we know that the one with the lowest
function value has the best potential and should be selected. If two
segments have the same function value and different sizes, we should
select the one with the largest size. For a general set of segments with
various sizes and value combinations, there might be multiple segments
that can be considered potentially optimal.
We identify potentially optimal segments as follows. If we draw a
line with a slope corresponding to a Lipschitz constant 𝐿 from any point
in Fig. 7.17, the intersection of this line with the vertical axis is a bound
on the objective function for the corresponding segment. Therefore,
the lowest bound for a given 𝐿 can be found by drawing a line through
the point that achieves the lowest intersection.
However, we do not know 𝐿, and we do not want to assume a value
because we do not want to bias the search. If 𝐿 were high, it would favor
dividing the larger segments. Low values of 𝐿 would result in dividing
the smaller segments. The DIRECT method hinges on considering all

possible values of 𝐿, effectively eliminating the need for this constant.
To eliminate the dependence on 𝐿, we select all the points for which
there is a line with slope 𝐿 that does not go above any other point. This
corresponds to selecting the points that form a lower convex hull, as
shown by the piecewise linear function in Fig. 7.17. This establishes a
lower bound on the function for each segment size.
Mathematically, a segment 𝑗 in the set of current segments 𝑆 is said to be potentially optimal if there is an 𝐿 ≥ 0 such that

$$f(c_j) - L d_j \leq f(c_i) - L d_i \quad \text{for all } i \in S \qquad (7.15)$$
$$f(c_j) - L d_j \leq f_{\min} - \varepsilon \, |f_{\min}| , \qquad (7.16)$$

where 𝑓min is the best current objective function value, and 𝜀 is a small
positive parameter. The first condition corresponds to finding the
points in the lower convex hull mentioned previously.
The second condition in Eq. 7.16 ensures that the potential minimum
is better than the lowest function value found so far by at least a small
amount. This prevents the algorithm from becoming too local, wasting
function evaluations in search of smaller function improvements. The
parameter 𝜀 balances the search between local and global. A typical
value is 𝜀 = 10−4 , and its range is usually such that 10−7 ≤ 𝜀 ≤ 10−2 .
There are efficient algorithms for finding the convex hull of an arbitrary set of points in two dimensions, such as the Jarvis march.144 These algorithms are more than we need because we only require the lower part of the convex hull, so the algorithms can be simplified for our purposes.

144. Jarvis, On the identification of the convex hull of a finite set of points in the plane, 1973.
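As an illustration of Eqs. 7.15 and 7.16, the following Python sketch of ours checks each segment directly for an admissible 𝐿 ≥ 0 instead of constructing the convex hull explicitly; the function name, the default 𝜀, and the example data are assumptions:

import numpy as np

def potentially_optimal(d, fc, eps=1e-4):
    """Return the indices j satisfying Eqs. 7.15 and 7.16 for some L >= 0.
    d[i] is the half-length of segment i and fc[i] the value at its center."""
    d, fc = np.asarray(d, dtype=float), np.asarray(fc, dtype=float)
    fmin = fc.min()
    selected = []
    for j in range(len(d)):
        # Eq. 7.15 for segments of the same size: j must be no worse than them
        if np.any((d == d[j]) & (fc < fc[j])):
            continue
        # Feasible range of L implied by Eq. 7.15 for the other sizes
        smaller, larger = d < d[j], d > d[j]
        L_low = 0.0
        if np.any(smaller):
            L_low = max(L_low, np.max((fc[j] - fc[smaller]) / (d[j] - d[smaller])))
        L_high = np.inf
        if np.any(larger):
            L_high = min(L_high, np.min((fc[larger] - fc[j]) / (d[larger] - d[j])))
        if L_low > L_high:
            continue
        # Eq. 7.16 checked at the largest admissible L (most favorable bound)
        if fc[j] - L_high * d[j] <= fmin - eps * abs(fmin):
            selected.append(j)
    return selected

# Example with four segments; segments 0 and 1 form the lower convex hull:
print(potentially_optimal(d=[0.5, 0.17, 0.17, 0.06], fc=[-1.0, -1.6, -0.8, -1.5]))
# -> [0, 1]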
As in the Shubert algorithm, the division might switch from one
part of the domain to another, depending on the new function values.
Compared with the Shubert algorithm, the DIRECT algorithm produces
a discontinuous lower bound on the function values, as shown in
Fig. 7.18.

[Fig. 7.18 The lower bound for the DIRECT method is discontinuous at the segment boundaries.]

DIRECT in 𝑛 Dimensions

The 𝑛-dimensional DIRECT algorithm is similar to the one-dimensional version but becomes more complex.† The main difference is that we deal with hyperrectangles instead of segments. A hyperrectangle can be defined by its center-point position 𝑐 in 𝑛-dimensional space and a half-length in each direction 𝑖, 𝛿𝑒𝑖 , as shown in Fig. 7.19. The DIRECT algorithm assumes that the initial dimensions are normalized so that we start with a hypercube.

† In this chapter, we present an improved version of DIRECT.143
143. Jones, Direct Global Optimization Algorithm, 2009.

[Fig. 7.19 Hyperrectangle in three dimensions, where 𝑑 is the maximum distance between the center and the vertices, and 𝛿𝑒𝑖 is the half-length in each direction 𝑖.]

To identify the potentially optimal rectangles at a given iteration, we
use exactly the same conditions in Eqs. 7.15 and 7.16, but 𝑐 𝑖 is now the
center of the hyperrectangle, and 𝑑 𝑖 is the maximum distance from the
center to a vertex. The explanation illustrated in Fig. 7.17 still applies
in the 𝑛-dimensional case and still involves simply finding the lower
convex hull of a set of points with different combinations of 𝑓 and 𝑑.
The main complication introduced in the 𝑛-dimensional case is
the division (trisection) of a selected hyperrectangle. The question is
which directions should be divided first. The logic to handle this in
the DIRECT algorithm is to prioritize reducing the dimensions with
the maximum length, ensuring that hyperrectangles do not deviate too
much from the proportions of a hypercube. First, we select the set of the
longest dimensions for division (there are often multiple dimensions
with the same length). Among this set of the longest dimensions, we
select the direction that has been divided the least over the whole history
of the search. If there are still multiple dimensions in the selection, we
simply select the one with the lowest index. Algorithm 7.4 details the
full algorithm.‡

‡ Alg. 7.4 follows the revised version of DIRECT,143 which differs from the original version.145 The original version trisected all the long sides of the selected rectangles instead of just one side.
143. Jones, Direct Global Optimization Algorithm, 2009.
145. Jones et al., Efficient global optimization of expensive black-box functions, 1998.

Figure 7.20 shows the first three iterations for a two-dimensional example and the corresponding visualization of the conditions expressed in Eqs. 7.15 and 7.16. We start with a square that contains the whole domain and evaluate the center point. The value of this point is plotted on the 𝑓 –𝑑 plot on the far right.

The first iteration trisects the starting square along the first dimension and evaluates the two new points. The values for these three points

[Fig. 7.20 DIRECT iterations 1–3 for the two-dimensional case (left, showing the selected rectangles and the trisection and sampling) and corresponding identification of potentially optimal rectangles on the 𝑓 –𝑑 plot (right).]

are plotted in the second column from the right in the 𝑓 –𝑑 plot, where
the center point is reused, as indicated by the arrow and the matching
color. At this iteration, we have two points that define the convex hull.
In the second iteration, we have three rectangles of the same size, so
we divide the one with the lowest value and evaluate the centers of
the two new rectangles (which are squares in this case). We now have
another column of points in the 𝑓 –𝑑 plot corresponding to a smaller 𝑑
and an additional point that defines the lower convex hull. Because the
convex hull now has two points, we trisect two different rectangles in
the third iteration.

Algorithm 7.4 DIRECT in 𝑛-dimensions

Inputs:
𝑥, 𝑥: Lower and upper bounds
Outputs:
𝑥 ∗ : Optimal point

𝑘 = 0    Initialize iteration counter
Normalize bounded space to hypercube and evaluate its center, 𝑐 0
𝑓min = 𝑓 (𝑐0 ) Stores the minimum function value so far
Initialize 𝑡(𝑖) = 0 for 𝑖 = 1, . . . , 𝑛 Counts the times dimension 𝑖 has been trisected
while not converged do

Find set 𝑆 of potentially optimal hyperrectangles


for each hyperrectangle in 𝑆 do
Find the set 𝐼 of dimensions with maximum side length
Select 𝑖 in 𝐼 with the lowest 𝑡(𝑖), breaking ties in favor of lower 𝑖
Divide the rectangle into thirds along dimension 𝑖
𝑡(𝑖) = 𝑡(𝑖) + 1
Evaluate the center points of the outer two hyperrectangles
Update 𝑓min based on these evaluations
end for
𝑘 = 𝑘+1 Increment iteration counter
end while
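The trisection step inside the loop of Alg. 7.4 can be sketched as follows in Python; the rectangle representation (center plus side lengths) and the function name are our own choices for illustration:

import numpy as np

def trisect(center, sides, t):
    """Trisect one hyperrectangle along a single dimension, following Alg. 7.4:
    pick the longest side, breaking ties by the least-divided dimension t[i]
    and then by the lowest index. Returns the three children and the two new
    centers whose function values must be evaluated."""
    center, sides = np.asarray(center, float), np.asarray(sides, float)
    longest = np.flatnonzero(np.isclose(sides, sides.max()))
    i = longest[np.argmin(t[longest])]          # least-divided among the longest
    t[i] += 1
    w = sides[i] / 3.0
    new_sides = sides.copy()
    new_sides[i] = w
    left = center.copy();  left[i]  -= w
    right = center.copy(); right[i] += w
    children = [(left, new_sides.copy()), (center, new_sides.copy()),
                (right, new_sides.copy())]
    return children, [left, right]

# Example: unit square centered at the origin, never divided before
t = np.zeros(2, dtype=int)
children, new_centers = trisect([0.0, 0.0], [1.0, 1.0], t)
print(new_centers)   # [array([-0.333..., 0.]), array([0.333..., 0.])]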

Example 7.3 Minimization of multimodal function with DIRECT

Consider the multimodal Jones function (Appendix D.1.4). Applying the DIRECT method to this function, we get the 𝑓 –𝑑 plot shown in Fig. 7.21, where the final points and convex hull are highlighted. The sequence of rectangles is shown in Fig. 7.22. The algorithm converges to the global minimum after dividing the rectangles around the other local minima a few times.

[Fig. 7.21 Potentially optimal rectangles for the DIRECT iterations shown in Fig. 7.22.]

[Fig. 7.22 The DIRECT method quickly determines the region with the global minimum of the Jones function after briefly exploring the regions with other minima.]
7.6 Genetic Algorithms

Genetic algorithms (GAs) are the most well-known and widely used type of evolutionary algorithm. They were also among the earliest to have been developed.∗ Like many evolutionary algorithms, GAs are population based. The optimization starts with a set of design points (the population) rather than a single starting point, and each optimization iteration updates this set in some way. Each GA iteration is called a generation, each of which has a population with 𝑛𝑝 points. Each point is represented by a chromosome, which contains the values for all the design variables, as shown in Fig. 7.23. Each design variable is represented by a gene. As we will see later, there are different ways for genes to represent the design variables.

∗ The first GA software was written in 1954, followed by other seminal work.146 Initially, these GAs were not developed to perform optimization but rather to model the evolutionary process. GAs were eventually applied to optimization.147
146. Barricelli, Esempi numerici di processi di evoluzione, 1954.
147. Jong, An analysis of the behavior of a class of genetic adaptive systems, 1975.

[Fig. 7.23 Each GA iteration involves a population of design points 𝑥 (0) , 𝑥 (1) , . . . , 𝑥 (𝑛𝑝 ) , where each design is represented by a chromosome, and each design variable is represented by a gene.]
reproduction and evolution using three main steps: (1) selection, (2)
crossover, and (3) mutation. Selection is based on natural selection,
where members of the population that acquire favorable adaptations
are more likely to survive longer and contribute more to the population
gene pool. Crossover is inspired by chromosomal crossover, which is Selection
the exchange of genetic material between chromosomes during sexual
Parents
reproduction. Mutation mimics genetic mutation, which is a permanent
change in the gene sequence that occurs naturally.
Algorithm 7.5 and Fig. 7.24 show how these three steps perform
optimization. Although most GAs follow this general procedure, there
is a great degree of flexibility in how the steps are performed, leading
to many variations in GAs. For example, there is no single method Crossover

specified for the generation of the initial population, and the size of Offspring
that population varies. Similarly, there are many possible methods
for selecting the parents, generating the offspring, and selecting the
survivors. Here, the new population (𝑃𝑘+1 ) is formed exclusively by
the offspring generated from the crossover. However, some GAs add
an extra selection process that selects a surviving population of size 𝑛 𝑝 Mutation
among the population of parents and offspring. Population 𝑃𝑘+1
In addition to the flexibility in the various operations, GAs use differ-
ent methods for representing the design variables. The design variable
representation can be used to classify genetic algorithms into two broad
categories: binary-encoded and real-encoded genetic algorithms. Binary-
encoded algorithms use bits to represent the design variables, whereas
the real-encoded algorithms keep the same real value representation Fig. 7.24 GA iteration steps.

used in most other algorithms. The details of the operations in Alg. 7.5
depend on whether we are using one or the other representation, but
the principles remain the same. In the rest of this section, we describe a
particular way of performing these operations for each of the possible
design variable representations.

Algorithm 7.5 Genetic algorithm

Inputs:
𝑥, 𝑥: Lower and upper bounds
Outputs:
𝑥 ∗ : Best point
𝑓 ∗ : Corresponding function value

𝑘 = 0
𝑃𝑘 = {𝑥(1), 𝑥(2), . . . , 𝑥(𝑛𝑝)}    Generate initial population
while 𝑘 < 𝑘max do
Compute 𝑓 (𝑥) for all 𝑥 ∈ 𝑃 𝑘 Evaluate objective function
Select 𝑛 𝑝 /2 parent pairs from 𝑃 𝑘 for crossover Selection
Generate a new population of 𝑛 𝑝 offspring (𝑃 𝑘+1 ) Crossover
Randomly mutate some points in the population Mutation
𝑘 = 𝑘+1
end while

7.6.1 Binary-Encoded Genetic Algorithms


The original genetic algorithms were based on binary encoding because they more naturally mimic chromosome encoding. Binary-coded GAs are applicable to discrete or mixed-integer problems.† When using binary encoding, we represent each variable as a binary number with 𝑚 bits. Each bit in the binary representation has a location, 𝑖, and a value, 𝑏𝑖 (which is either 0 or 1). If we want to represent a real-valued variable, we first need to consider a finite interval $x \in [\underline{x}, \overline{x}]$, which we can then divide into 2^𝑚 − 1 intervals. The size of the interval is given by

$$\Delta x = \frac{\overline{x} - \underline{x}}{2^m - 1} . \quad (7.17)$$

† One popular binary-encoded genetic algorithm implementation is the elitist nondominated sorting genetic algorithm (NSGA-II; discussed in Section 9.3.4 in connection with multiobjective optimization).148
148. Deb et al., A fast and elitist multiobjective genetic algorithm: NSGA-II, 2002.
To have a more precise representation, we must use more bits.
When using binary-encoded GAs, we do not need to encode the
design variables because they are generated and manipulated directly
in the binary representation. Still, we do need to decode them be-
fore providing them to the evaluation function. To decode a binary
representation, we use

$$x = \underline{x} + \sum_{i=0}^{m-1} b_i \, 2^i \, \Delta x . \quad (7.18)$$

Example 7.4 Binary representation of a real number

Suppose we have a continuous design variable 𝑥 that we want to represent


in the interval [−20, 80] using 12 bits. Then, we have 2^12 − 1 = 4,095 intervals,
and using Eq. 7.17, we get Δ𝑥 ≈ 0.0244. This interval is the error in this
finite-precision representation. For the following sample binary representation:

𝑖 1 2 3 4 5 6 7 8 9 10 11 12
𝑏𝑖 0 0 0 1 0 1 1 0 0 0 0 1

We can use Eq. 7.18 to compute the equivalent real number, which turns out to
be 𝑥 ≈ 32.55.
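The decoding in Eqs. 7.17 and 7.18 is straightforward to implement. Below is a minimal Python sketch (our own illustration, not part of the text); the function name and the choice of storing the least significant bit first are assumptions, but it reproduces the value in this example.

```python
def decode(bits, x_lower, x_upper):
    """Decode a binary chromosome into a real value (Eqs. 7.17 and 7.18).

    bits[0] is taken as the least significant bit, b_0.
    """
    m = len(bits)
    dx = (x_upper - x_lower) / (2**m - 1)  # interval size, Eq. 7.17
    return x_lower + sum(b * 2**i for i, b in enumerate(bits)) * dx  # Eq. 7.18

# Reproduces Ex. 7.4: 12 bits on the interval [-20, 80]
bits = [0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1]
print(decode(bits, -20.0, 80.0))  # approximately 32.55
```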

Initial Population

The first step in a genetic algorithm is to generate an initial set (pop-


ulation) of points. As a rule of thumb, the population size should
be approximately one order of magnitude larger than the number of
design variables, and this size should be tuned.
One popular way to choose the initial population is to do it at random.
Using binary encoding, we can assign each bit in the representation of
the design variables a 50 percent chance of being either 1 or 0. This
can be done by generating a random number 0 ≤ 𝑟 ≤ 1 and setting the
bit to 0 if 𝑟 ≤ 0.5 and 1 if 𝑟 > 0.5. For a population of size 𝑛 𝑝 , with 𝑛
design variables, where each variable is encoded using 𝑚 bits, the total
number of bits that needs to be generated is 𝑛 𝑝 × 𝑛 × 𝑚.
To achieve better spread in a larger dimensional space, the sampling
methods described in Section 10.2 are generally more effective than
random populations.
Although we then need to evaluate the function across many points
(a population), these evaluations can be performed in parallel.

Selection

In this step, we choose points from the population for reproduction


in a subsequent step (called a mating pool). On average, it is desirable
to choose a mating pool that improves in fitness (thus mimicking the
concept of natural selection), but it is also essential to maintain diversity.


In total, we need to generate 𝑛 𝑝 /2 pairs.
The simplest selection method is to randomly select two points from
the population until the requisite number of pairs is complete. This
approach is not particularly effective because there is no mechanism to
move the population toward points with better objective functions.
Tournament selection is a better method that randomly pairs up 𝑛 𝑝
points and selects the best point from each pair to join the mating pool.
The same pairing and selection process is repeated to create 𝑛 𝑝 /2 more
points to complete a mating pool of 𝑛 𝑝 points.
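A minimal sketch of this procedure is shown below (an illustration of the idea, not the implementation used for the book's examples). It assumes an even population size and that lower objective values are better.

```python
import random

def tournament_select(f):
    """Return indices for a mating pool of len(f) members formed by two
    rounds of random pairwise tournaments (each round yields len(f)/2 winners)."""
    pool = []
    for _ in range(2):
        idx = list(range(len(f)))
        random.shuffle(idx)                       # random pairing
        for a, b in zip(idx[::2], idx[1::2]):
            pool.append(a if f[a] < f[b] else b)  # keep the better of each pair
    return pool

print(tournament_select([12.0, 10.0, 7.0, 15.0, 2.0, 6.0]))
```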

Example 7.5 Tournament selection process

Figure 7.25 illustrates the process with a small population. Each member of
the population ends up in the mating pool zero, one, or two times, with better
points more likely to appear in the pool. The best point in the population will
always end up in the pool twice, whereas the worst point in the population is
always eliminated.

Fig. 7.25 Tournament selection example. The best point in each randomly selected pair is moved into the mating pool.

Another standard method is roulette wheel selection. This concept


is patterned after a roulette wheel used in a casino. Better points
are allocated a larger sector on the roulette wheel to have a higher
probability of being selected.
First, the objective function for all the points in the population must
be converted to a fitness value because the roulette wheel needs all
positive values and is based on maximizing rather than minimizing. To
achieve that, we first perform the following conversion to fitness:

$$F_i = \frac{-f_i + \Delta F}{\max\left(1, \Delta F - f_{\text{low}}\right)} , \quad (7.19)$$
where Δ𝐹 = 1.1 𝑓high −0.1 𝑓low is based on the highest and lowest function
values in the population, and the denominator is introduced to scale
the fitness.
Then, to find the sizes of the sectors in the roulette wheel selection,
we take the normalized cumulative sum of the scaled fitness values to
compute an interval for each member in the population 𝑗 as

$$S_j = \frac{\sum_{i=1}^{j} F_i}{\sum_{i=1}^{n_p} F_i} . \quad (7.20)$$

We can now create a mating pool of 𝑛 𝑝 points by turning the roulette


wheel 𝑛 𝑝 times. We do this by generating a random number 0 ≤ 𝑟 ≤ 1
at each turn. The 𝑗th member is copied to the mating pool if

𝑆 𝑗−1 < 𝑟 ≤ 𝑆 𝑗 . (7.21)

This ensures that the probability of a member being selected for repro- 0
0.875
duction is proportional to its scaled fitness value. 𝑥 (4)
𝑥 (1)

Example 7.6 Roulette wheel selection process 0.25


𝑥 (2)
Assume that 𝐹 = [5, 10, 20, 45]. Then 𝑆 = [0.25, 0.3125, 0.875, 1], which
0.3125
divides the “wheel” into four segments, shown graphically in Fig. 7.26. We 𝑥 (3)
would then draw four random numbers (say, 0.6, 0.2, 0.9, 0.7), which would
correspond to the following 𝑛 𝑝 /2 pairs: (𝑥3 and 𝑥1 ), (𝑥4 and 𝑥3 ).
Fig. 7.26 Roulette wheel selection ex-
ample. Fitter members receive a pro-
portionally larger segment on the
wheel.
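Roulette wheel selection maps directly onto a few array operations. The following is a minimal sketch of Eqs. 7.19 to 7.21 (our own illustration, with arbitrary sample data); it assumes a minimization problem, so lower objective values receive larger sectors.

```python
import numpy as np

def roulette_select(f, rng=np.random.default_rng()):
    """Pick len(f) mating-pool members by roulette wheel selection.

    f contains objective values (lower is better); returns member indices.
    """
    f = np.asarray(f, dtype=float)
    dF = 1.1 * f.max() - 0.1 * f.min()
    F = (-f + dF) / max(1.0, dF - f.min())  # scaled fitness, Eq. 7.19
    S = np.cumsum(F) / F.sum()              # sector boundaries, Eq. 7.20
    r = rng.random(f.size)
    return np.searchsorted(S, r)            # smallest j with S_{j-1} < r <= S_j, Eq. 7.21

print(roulette_select(np.array([3.1, 0.4, 1.2, 2.0])))
```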
Crossover

In the reproduction operation, two points (offspring) are generated from a pair of points (parents). Various strategies are possible in genetic algorithms. Single-point crossover usually involves generating a random integer 1 ≤ 𝑘 ≤ 𝑚 − 1 that defines the crossover point. This is illustrated in Fig. 7.27. For one of the offspring, the first 𝑘 bits are taken from parent 1 and the remaining bits from parent 2. For the second offspring, the first 𝑘 bits are taken from parent 2 and the remaining ones from parent 1. Various extensions exist, such as two-point crossover or 𝑛-point crossover.

Fig. 7.27 The crossover point determines which parts of the chromosome from each parent get inherited by each offspring (e.g., parents 11000101 and 10101110 with a crossover point after the sixth bit produce offspring 11000110 and 10101101).
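A minimal sketch of single-point crossover is shown below (our own illustration; the sample parents are the bit strings from Fig. 7.27).

```python
import random

def single_point_crossover(p1, p2):
    """Single-point crossover of two equal-length bit lists (cf. Fig. 7.27)."""
    k = random.randint(1, len(p1) - 1)            # crossover point, 1 <= k <= m - 1
    return p1[:k] + p2[k:], p2[:k] + p1[k:]       # swap tails between parents

parent1 = [1, 1, 0, 0, 0, 1, 0, 1]
parent2 = [1, 0, 1, 0, 1, 1, 1, 0]
print(single_point_crossover(parent1, parent2))
```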

Mutation

Mutation is a random operation performed to change the genetic information and is needed because even though selection and reproduction
effectively recombine existing information, occasionally some useful


genetic information might be lost. The mutation operation protects
against such irrecoverable loss and introduces additional diversity into
the population.
When using bit representation, every bit is assigned a small mutation probability, say 𝑝 = 0.005 ∼ 0.05. This is done by generating a random number 0 ≤ 𝑟 ≤ 1 for each bit, which is changed if 𝑟 < 𝑝. An example is illustrated in Fig. 7.28.

Fig. 7.28 Mutation randomly switches some of the bits with low probability (e.g., 10010110 mutates to 10011110).
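The bit-flip mutation is only a couple of lines of code; here is a minimal sketch (our own illustration, with an arbitrary default probability).

```python
import random

def mutate_bits(chromosome, p=0.02):
    """Flip each bit independently with small probability p (cf. Fig. 7.28)."""
    return [1 - b if random.random() < p else b for b in chromosome]

print(mutate_bits([1, 0, 0, 1, 0, 1, 1, 0], p=0.05))
```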
7.6.2 Real-Encoded Genetic Algorithms

As the name implies, real-encoded GAs represent the design variables


in their original representation as real numbers. This has several
advantages over the binary-encoded approach. First, real encoding
represents numbers up to machine precision rather than being limited
by the number of bits chosen for the binary encoding. Second, it
avoids the “Hamming cliff” issue of binary encoding, which is caused
by the fact that many bits must change to move between adjacent
real numbers (e.g., 0111 to 1000). Third, some real-encoded GAs can
generate points outside the design variable bounds used to create the
initial population; in many problems, the design variables are not
bounded. Finally, it avoids the burden of binary coding and decoding.
The main disadvantage is that integer or discrete variables cannot be
handled. For continuous problems, a real-encoded GA is generally
more efficient than a binary-encoded GA.140 We now describe the required changes to the GA operations in the real-encoded approach.

140. Simon, Evolutionary Optimization Algorithms, 2013.

Initial Population

The most common approach is to pick the 𝑛 𝑝 points using random


sampling within the provided design bounds. Each member is often
chosen at random within some initial bounds. For each design variable
𝑥𝑖, with bounds such that $\underline{x}_i \le x_i \le \overline{x}_i$, we could use

$$x_i = \underline{x}_i + r\left(\overline{x}_i - \underline{x}_i\right) , \quad (7.22)$$

where 𝑟 is a random number such that 0 ≤ 𝑟 ≤ 1. Again, the sam-


pling methods described in Section 10.2 are more effective for higher-
dimensional spaces.

Selection

The selection operation does not depend on the design variable encod-
ing. Therefore, we can use one of the selection approaches described
for the binary-encoded GA: tournament or roulette wheel selection.
Crossover

When using real encoding, the term crossover does not accurately
describe the process of creating the two offspring from a pair of points.
Instead, the approaches are more accurately described as a blending,
although the name crossover is still often used.
There are various options for the reproduction of two points encoded
using real numbers. A standard method is linear crossover, which
generates two or more points in the line defined by the two parent
points. One option for linear crossover is to generate the following two
points:
$$x_{c_1} = 0.5\, x_{p_1} + 0.5\, x_{p_2} , \qquad x_{c_2} = 2\, x_{p_2} - x_{p_1} , \quad (7.23)$$

where parent 2 is more fit than parent 1 (𝑓(𝑥𝑝2) < 𝑓(𝑥𝑝1)). An example of this linear crossover approach is shown in Fig. 7.29, where we can see that child 1 is the average of the two parent points, whereas child 2 is obtained by extrapolating in the direction of the “fitter” parent.

Fig. 7.29 Linear crossover produces two new points along the line defined by the two parent points.

Another option is a simple crossover like the binary case, where a random integer is generated to split the vectors—for example, with a split after the first index:

$$x_{p_1} = [x_1, x_2, x_3, x_4], \quad x_{p_2} = [x_5, x_6, x_7, x_8] \;\Rightarrow\; x_{c_1} = [x_1, x_6, x_7, x_8], \quad x_{c_2} = [x_5, x_2, x_3, x_4] . \quad (7.24)$$

This simple crossover does not generate as much diversity as the


binary case and relies more heavily on effective mutation. Many other
strategies have been devised for real-encoded GAs.149
149. Deb, Multi-Objective Optimization
Using Evolutionary Algorithms, 2001.
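The two crossover options described above are sketched below in Python (our own illustration under the assumption that the second argument is the fitter parent; the split index default is arbitrary).

```python
import numpy as np

def linear_crossover(xp1, xp2):
    """Linear crossover (Eq. 7.23); xp2 is assumed to be the fitter parent."""
    xc1 = 0.5 * xp1 + 0.5 * xp2   # average of the two parents
    xc2 = 2.0 * xp2 - xp1         # extrapolation past the fitter parent
    return xc1, xc2

def simple_crossover(xp1, xp2, k=1):
    """Simple vector-splitting crossover (Eq. 7.24) with split index k."""
    return (np.concatenate([xp1[:k], xp2[k:]]),
            np.concatenate([xp2[:k], xp1[k:]]))

xp1, xp2 = np.array([1.0, 2.0, 3.0, 4.0]), np.array([5.0, 6.0, 7.0, 8.0])
print(linear_crossover(xp1, xp2))
print(simple_crossover(xp1, xp2))
```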
Mutation

As with a binary-encoded GA, mutation should only occur with a small


probability (e.g., 𝑝 = 0.005 ∼ 0.05). However, rather than changing
each bit with probability 𝑝, we now change each design variable with
probability 𝑝.
Many mutation methods rely on random variations around an
existing member, such as a uniform random operator:

𝑥new 𝑖 = 𝑥 𝑖 + (𝑟 𝑖 − 0.5)Δ𝑖 , for 𝑖 = 1, . . . 𝑛 , (7.25)

where 𝑟 𝑖 is a random number between 0 and 1, and Δ𝑖 is a preselected


maximum perturbation in the 𝑖th direction. Many nonuniform methods
exist as well. For example, we can use a normal probability distribution


𝑥new 𝑖 = 𝑥 𝑖 + 𝒩(0, 𝜎𝑖 ), for 𝑖 = 1, . . . 𝑛 , (7.26)
where 𝜎𝑖 is a preselected standard deviation, and random samples are
drawn from the normal distribution. During the mutation operations,
bound checking is necessary to ensure the mutations stay within the
lower and upper limits.
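The uniform mutation of Eq. 7.25, together with bound checking, can be sketched as follows (our own illustration; taking the maximum perturbation Δ𝑖 as 10 percent of the variable range is an arbitrary choice).

```python
import numpy as np

def mutate_real(x, x_lower, x_upper, p=0.02, rng=np.random.default_rng()):
    """Mutate each design variable with probability p using Eq. 7.25,
    then clip the result to the lower and upper bounds."""
    x = np.array(x, dtype=float)
    delta = 0.1 * (x_upper - x_lower)      # assumed maximum perturbation
    for i in range(x.size):
        if rng.random() < p:
            x[i] += (rng.random() - 0.5) * delta[i]   # Eq. 7.25
    return np.clip(x, x_lower, x_upper)               # keep within bounds

x_lower, x_upper = np.array([-2.0, -1.0]), np.array([3.0, 3.0])
print(mutate_real([0.5, 1.0], x_lower, x_upper, p=0.5))
```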

Example 7.7 Genetic algorithm applied to the bean function

Fig. 7.30 Evolution of the population using a bit-encoded GA to minimize the bean function, where 𝑘 is the generation number (snapshots at 𝑘 = 0, 3, 6, 10, 20, and 50).

Figure 7.30 shows the evolution of the population when minimizing the bean function using a bit-encoded GA. The initial population size was 40, and the simulation was run for 50 generations. Figure 7.31 shows the evolution when using a real-encoded GA but otherwise uses the same parameters as the bit-encoded optimization. The real-encoded GA converges faster in this case.

Fig. 7.31 Evolution of the population using a real-encoded GA to minimize the bean function, where 𝑘 is the generation number (snapshots at 𝑘 = 0, 1, 2, 4, 6, and 10).

7.6.3 Constraint Handling


Various approaches exist for handling constraints. Like the Nelder–
Mead method, we can use a penalty method (e.g., augmented La-

grangian, linear penalty). However, there are additional options for GAs. In the tournament selection, we can use other selection criteria that do not depend on penalty parameters. One such approach for choosing the better of two competing points is as follows:

1. Prefer a feasible solution.


2. Among two feasible solutions, choose the one with a better
objective.
3. Among two infeasible solutions, choose the one with a smaller
constraint violation.

This concept is a lot like the filter methods discussed in Section 5.5.3.
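These three rules translate directly into a comparison function that can replace the objective comparison in tournament selection. Below is a minimal sketch (our own illustration), where 𝑔 is a measure of total constraint violation that is zero for feasible points.

```python
def better(f1, g1, f2, g2):
    """Return True if candidate 1 wins under the feasibility rules above.

    f is the objective value and g the total constraint violation (0 if feasible).
    """
    if g1 == 0 and g2 == 0:
        return f1 < f2       # both feasible: better objective wins
    if g1 == 0 or g2 == 0:
        return g1 == 0       # prefer the feasible candidate
    return g1 < g2           # both infeasible: smaller violation wins

print(better(1.2, 0.0, 0.8, 0.3))  # True: feasible beats infeasible
```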

7.6.4 Convergence
Rigorous mathematical convergence criteria, like those used in gradient-
based optimization, do not apply to GAs. The most common way to
terminate a GA is to specify a maximum number of iterations, which
corresponds to a computational budget. Another similar approach is
to let the algorithm run indefinitely until the user manually terminates
the algorithm, usually by monitoring the trends in population fitness.
A more automated approach is to track a running average of the


population’s fitness. However, it can be challenging to decide what
tolerance to apply to this criterion because we generally are not inter-
ested in the average performance. A more direct metric of interest
is the fitness of the best member in the population. However, this
can be a problematic criterion because the best member can disappear
as a result of crossover or mutation. To avoid this and to improve
convergence, many GAs employ elitism. This means that the fittest
population member is retained to guarantee that the population does
not regress. Even without this behavior, the best member often changes
slowly, so the user should not terminate the algorithm unless the best
member has not improved for several generations.

7.7 Particle Swarm Optimization

Like a GA, particle swarm optimization (PSO) is a stochastic population-


based optimization algorithm based on the concept of “swarm intel-
ligence”. Swarm intelligence is the property of a system whereby
the collective behaviors of unsophisticated agents interacting locally
with their environment cause coherent global patterns. In other words:
dumb agents, properly connected into a swarm, can yield smart results.∗

∗ PSO was first proposed by Eberhart and Kennedy.150 Eberhart was an electrical engineer, and Kennedy was a social psychologist.
150. Eberhart and Kennedy, New optimizer using particle swarm theory, 1995.

The “swarm” in PSO is a set of design points (agents or particles) that move in 𝑛-dimensional space, looking for the best solution. Although these are just design points, the history for each point is relevant to
the PSO algorithm, so we adopt the term particle. Each particle moves
according to a velocity. This velocity changes according to the past
objective function values of that particle and the current objective values
of the rest of the particles. Each particle remembers the location where
it found its best result so far, and it exchanges information with the
swarm about the location where the swarm has found the best result
so far.
The position of particle 𝑖 for iteration 𝑘 + 1 is updated according to
$$x^{(i)}_{k+1} = x^{(i)}_{k} + v^{(i)}_{k+1} \Delta t , \quad (7.27)$$

where Δ𝑡 is a constant artificial time step. The velocity for each particle
is updated as follows:
$$v^{(i)}_{k+1} = \alpha v^{(i)}_{k} + \beta \frac{x^{(i)}_{\text{best}} - x^{(i)}_{k}}{\Delta t} + \gamma \frac{x_{\text{best}} - x^{(i)}_{k}}{\Delta t} . \quad (7.28)$$
The first component in this update is the “inertia”, which determines
how similar the new velocity is to the velocity in the previous iteration
through the parameter 𝛼. Typical values for the inertia parameter 𝛼 are
in the interval [0.8, 1.2]. A lower value of 𝛼 reduces the particle’s inertia
and tends toward faster convergence to a minimum. A higher value of 𝛼
increases the particle’s inertia and tends toward increased exploration to
potentially help discover multiple minima. Some methods are adaptive,
choosing the value of 𝛼 based on the optimizer’s progress.151

151. Zhan et al., Adaptive particle swarm optimization, 2009.
The second term represents “memory” and is a vector pointing
toward the best position particle 𝑖 has seen in all its iterations so far, $x^{(i)}_{\text{best}}$.
The weight in this term consists of a random number 𝛽 in the interval
[0, 𝛽max ] that introduces a stochastic component to the algorithm. Thus,
𝛽 controls how much influence the best point found by the particle so
far has on the next direction.
The third term represents “social” influence. It behaves similarly
to the memory component, except that 𝑥best is the best point the entire
swarm has found so far, and 𝛾 is a random number between [0, 𝛾max ]
that controls how much of an influence this best point has in the next
direction. The relative values of 𝛽 and 𝛾 thus control the tendency
toward local versus global search, respectively. Both 𝛽 max and 𝛾max are
in the interval [0, 2] and are typically closer to 2. Sometimes, rather
than using the best point in the entire swarm, the best point is chosen
within a neighborhood.
Because the time step is artificial, we can eliminate it by multiplying
Eq. 7.28 by Δ𝑡 to yield a step:

$$\Delta x^{(i)}_{k+1} = \alpha \Delta x^{(i)}_{k} + \beta \left( x^{(i)}_{\text{best}} - x^{(i)}_{k} \right) + \gamma \left( x_{\text{best}} - x^{(i)}_{k} \right) . \quad (7.29)$$

We then use this step to update the particle position for the next
iteration:
$$x^{(i)}_{k+1} = x^{(i)}_{k} + \Delta x^{(i)}_{k+1} . \quad (7.30)$$
The three components of the update in Eq. 7.29 are shown in Fig. 7.32
for a two-dimensional case.

Fig. 7.32 Components of the PSO update.
The first step in the PSO algorithm is to initialize the set of particles
(Alg. 7.6). As with a GA, the initial set of points can be determined
randomly or can use a more sophisticated sampling strategy (see
Section 10.2). The velocities are also randomly initialized, generally
using some fraction of the domain size (𝑥 − 𝑥).

Algorithm 7.6 Particle swarm optimization algorithm

Inputs:
𝑥: Variable upper bounds
𝑥: Variable lower bounds
𝛼: Inertia parameter
𝛽 max : Self influence parameter
𝛾max : Social influence parameter
Δ𝑥max : Maximum velocity
Outputs:
𝑥 ∗ : Best point
𝑓 ∗ : Corresponding function value

𝑘 = 0
for 𝑖 = 1 to 𝑛 do    Loop to initialize all particles
    Generate position 𝑥0^(𝑖) within specified bounds
    Initialize “velocity” Δ𝑥0^(𝑖)
end for
while not converged do    Main iteration loop
    for 𝑖 = 1 to 𝑛 do
        if 𝑓(𝑥^(𝑖)) < 𝑓(𝑥best^(𝑖)) then    Best individual points
            𝑥best^(𝑖) = 𝑥^(𝑖)
        end if
        if 𝑓(𝑥^(𝑖)) < 𝑓(𝑥best) then    Best swarm point
            𝑥best = 𝑥^(𝑖)
        end if
    end for
    for 𝑖 = 1 to 𝑛 do
        Δ𝑥_{𝑘+1}^(𝑖) = 𝛼 Δ𝑥_𝑘^(𝑖) + 𝛽 (𝑥best^(𝑖) − 𝑥_𝑘^(𝑖)) + 𝛾 (𝑥best − 𝑥_𝑘^(𝑖))
        Δ𝑥_{𝑘+1}^(𝑖) = max(min(Δ𝑥_{𝑘+1}^(𝑖), Δ𝑥max), −Δ𝑥max)    Limit velocity
        𝑥_{𝑘+1}^(𝑖) = 𝑥_𝑘^(𝑖) + Δ𝑥_{𝑘+1}^(𝑖)    Update the particle position
        𝑥_{𝑘+1}^(𝑖) = max(min(𝑥_{𝑘+1}^(𝑖), 𝑥̄), 𝑥)    Enforce bounds
    end for
    𝑘 = 𝑘 + 1
end while
The main loop in the algorithm computes the steps to be added to


each particle and updates their positions. Particles must be prevented
from going beyond the bounds. If a particle reaches a boundary and
has a velocity pointing out of bounds, it is helpful to reset the velocity to
zero or reorient it toward the interior for the next iteration. It is also
helpful to impose a maximum velocity. If the velocity is too large, the
updated positions are unrelated to their previous positions, and the
search is more random. The maximum velocity might also decrease
across iterations to shift from exploration toward exploitation.
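To make the procedure concrete, here is a minimal Python sketch of Alg. 7.6 (our own illustration, not the implementation used for the examples). It uses a fixed iteration count in place of a convergence test, and the parameter defaults, initial velocity scaling, and quadratic test function are arbitrary choices.

```python
import numpy as np

def pso(f, x_low, x_up, n_particles=40, alpha=1.0, beta_max=1.8, gamma_max=1.8,
        max_iter=100, rng=np.random.default_rng(0)):
    """Minimal particle swarm sketch following Alg. 7.6."""
    n = len(x_low)
    x = x_low + rng.random((n_particles, n)) * (x_up - x_low)          # positions
    dx = 0.1 * (x_up - x_low) * (rng.random((n_particles, n)) - 0.5)   # "velocities"
    dx_max = 0.5 * (x_up - x_low)
    fx = np.array([f(xi) for xi in x])
    x_best_i, f_best_i = x.copy(), fx.copy()        # best point seen by each particle
    j = np.argmin(fx)
    x_best, f_best = x[j].copy(), fx[j]             # best point seen by the swarm
    for _ in range(max_iter):
        beta = rng.random((n_particles, 1)) * beta_max
        gamma = rng.random((n_particles, 1)) * gamma_max
        dx = alpha * dx + beta * (x_best_i - x) + gamma * (x_best - x)  # Eq. 7.29
        dx = np.clip(dx, -dx_max, dx_max)                               # limit velocity
        x = np.clip(x + dx, x_low, x_up)                                # Eq. 7.30 + bounds
        fx = np.array([f(xi) for xi in x])
        improved = fx < f_best_i
        x_best_i[improved] = x[improved]
        f_best_i[improved] = fx[improved]
        j = np.argmin(f_best_i)
        if f_best_i[j] < f_best:
            x_best, f_best = x_best_i[j].copy(), f_best_i[j]
    return x_best, f_best

# Minimize a simple quadratic as a stand-in test function
print(pso(lambda x: (x[0] - 1.0)**2 + (x[1] + 2.0)**2,
          np.array([-5.0, -5.0]), np.array([5.0, 5.0])))
```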

Example 7.8 PSO algorithm applied to the bean function

Figure 7.33 shows the particle movements that result when minimizing the
bean function using a particle swarm method. The initial population size was
40, and the optimization required 600 function evaluations. Convergence was
assumed if the best value found by the population did not improve by more
than 10^−4 for three consecutive iterations.

Fig. 7.33 Sequence of PSO iterations that minimize the bean function (snapshots at 𝑘 = 0, 1, 3, 5, 12, and 17).

Several convergence criteria are possible, some of which are similar


to the Nelder–Mead algorithm and GAs. Examples of convergence
criteria include the distance (sum or norm) between each particle and
the best particle, the best particle’s objective function value changes
for the last few generations, and the difference between the best and
worst member. For PSO, another alternative is to check whether the
velocities for all particles (as measured by a metric such as norm or
mean) are below some tolerance. Some of these criteria that assume
all the particles congregate (distance, velocities) do not work well for
multimodal problems. In those cases, tracking only the best particle’s
objective function value may be more appropriate.

Tip 7.2 Compare optimization algorithms fairly

It is challenging to compare different algorithms fairly, especially when they


use different convergence criteria. You can either compare the computational
cost of achieving an objective with a specified accuracy or compare the objective
achieved for a specified computational cost. To compare algorithms that use
different convergence criteria, you can run them for as long as you can afford
using the lowest convergence tolerance possible and tabulate the number of
function evaluations and the respective objective function values. To compare
the computational cost for a specified tolerance, you can determine the number
of function evaluations that each algorithm requires to achieve a given number
of digit agreement in the objective function. Alternatively, you can compare the
objective achieved for the different algorithms for a given number of function
evaluations. Comparison becomes more challenging for constrained problems
because a better objective that is less feasible is not necessarily better. In that
case, you need to make sure that all results are feasible to the same tolerance.
When comparing algorithms that include stochastic procedures (e.g., GAs,
PSO), you should run each optimization multiple times to get statistically
significant data and compare the mean and variance of the performance metrics.
Even for deterministic algorithms, results can vary significantly with starting
points (or other parameters), so running multiple optimizations is preferable.

Example 7.9 Comparison of algorithms for a multimodal discontinuous function

We now return to the Jones function (Appendix D.1.4), but we make it discontinuous by adding the following function:

$$\Delta f = 4 \left\lceil \sin(\pi x_1) \, \sin(\pi x_2) \right\rceil . \quad (7.31)$$

By taking the ceiling of the product of the two sine waves, this function creates a checkerboard pattern with 0s and 4s. For the gradient-based case, each gradient evaluation is counted as an evaluation in addition to each function evaluation. Adding this function to the Jones function produces the discontinuous pattern shown in Fig. 7.34. This is a one-dimensional slice of constant 𝑥2 through the optimum of the Jones function; the full two-dimensional contour plot is shown in Fig. 7.35. The global optimum remains the same as the original function.

Fig. 7.34 Slice of the Jones function with the added checkerboard pattern.

Fig. 7.35 Convergence paths for gradient-free algorithms compared with a gradient-based algorithm with multistart. Function evaluations required: Nelder–Mead, 179; generalized pattern search, 119; DIRECT, 99; genetic algorithm, 2,420; particle swarm optimization, 760; quasi-Newton with multistart, 96.

The resulting optimization paths demonstrate that some gradient-free algorithms effectively handle the discontinuities and find the global minimum. Nelder–Mead converges quickly, but not necessarily to the global minimum. GPS and DIRECT quickly converge to the global minimum. GAs and PSO also find the global minimum, but they require many more evaluations. The gradient-based algorithm (quasi-Newton) with multistart also converges to the global minimum in two of the six random starts.

7.8 Summary

Gradient-free optimization algorithms are needed when the objective


and constraint functions are not smooth enough or when it is not
possible to compute derivatives with enough precision. One major
advantage of gradient-free methods is that they tend to be robust to
numerical noise and discontinuities, making them easier to use than
gradient-based methods.
However, the overall cost of gradient-free optimization is sensitive to


the cost of the function evaluations because they require many iterations
for convergence, and the number of iterations scales poorly with the
number of design variables.
There is a wide variety of gradient-free methods. They can perform
a local or global search, use mathematical or heuristic criteria, and
be deterministic or stochastic. A global search does not guarantee
convergence to the global optimum but increases the likelihood of such
convergence. We should be wary when heuristics establish convergence
because the result might not correspond to the actual mathematical
optimum. Heuristics in the optimization algorithm also limit the rate
of convergence compared with algorithms based on mathematical
principles.
In this chapter, we covered only a small selection of popular gradient-
free algorithms. The Nelder–Mead algorithm is a local search algorithm
based on heuristics and is easy to implement. GPS and DIRECT are
based on mathematical criteria.
Evolutionary algorithms are global search methods based on the
evolution of a population of designs. They stem from appealing
heuristics inspired by natural or societal phenomena and have some
stochastic element in their algorithms. The GAs and PSO algorithms
covered in this chapter are only two examples from the plethora of
evolutionary algorithms that have been invented.
Many of the methods presented in this chapter do not directly
address constrained problems; in those cases, penalty or filtering
methods are typically used to enforce constraints.
Problems

7.1 Answer true or false and justify your answer.


a. Gradient-free optimization algorithms are not as efficient as
gradient-based algorithms, but they converge to the global
optimum.
b. None of the gradient-free algorithms checks the KKT condi-
tions for optimality.
c. The Nelder–Mead algorithm is a deterministic local search
algorithm using heuristic criteria and direct function evalua-
tions.
d. The simplex is a geometric figure defined by a set of 𝑛 points,
where 𝑛 is the dimensionality of the design variable space.
e. The DIRECT algorithm is a deterministic global search al-
gorithm using mathematical criteria and direct function
evaluations.
f. The DIRECT method favors small rectangles with better
function values over large rectangles with worse function
values.
g. Evolutionary algorithms are stochastic global search algo-
rithms based on heuristics and direct function evaluations.
h. GAs start with a population of designs that gradually de-
creases to a single individual design at the optimum.
i. Each design in the initial population of a GA should be
carefully selected to ensure a successful optimization.
j. Stochastic procedures are necessary in the GAs to maintain
population diversity and therefore reduce the likelihood of
getting stuck in local minima.
k. PSO follows a model developed by biologists in the research
of how bee swarms search for pollen and nectar.
l. All evolutionary algorithms are based on either evolutionary
genetics or animal behavior.
7.2 Program the Nelder–Mead algorithm and perform the following
studies:
a. Reproduce the bean function results shown in Ex. 7.1.
b. Add random noise to the function with a magnitude of 10^−4
using a normal distribution and see if that makes a difference
in the convergence of the Nelder–Mead algorithm. Compare
the results to those of a gradient-based algorithm.
c. Consider the following function:

$$f(x_1, x_2, x_3) = |x_1| + 2|x_2| + x_3^2 . \quad (7.32)$$

Minimize this function with the Nelder–Mead algorithm


and a gradient-based algorithm. Discuss your results.
d. Exploration: Study the logic of the Nelder–Mead algorithm
and devise possible improvements. For example, is it a good
idea to be greedier and do multiple expansions?

7.3 Program the DIRECT algorithm and perform the following stud-
ies:

a. Reproduce the Jones function results shown in Ex. 7.3.


b. Use a gradient-based algorithm with a multistart strategy to
minimize the same function. On average, how many different
starting points do you need to find the global minimum?
c. Minimize the Hartmann function (defined in Appendix D.1.5)
using both methods. Compare and discuss your results.
d. Exploration: Develop a hybrid approach that starts with
DIRECT and then switches to the gradient-based algorithm.
Are you able to reduce the computational cost of DIRECT
significantly while converging to the global minimum?

7.4 Program a GA and perform the following studies:

a. Reproduce the bean function results shown in Ex. 7.7.


b. Use your GA to minimize the Hartmann function. Estimate
the rate of convergence and compare the performance of the
GA with a gradient-based algorithm.
c. Study the effect of adding checkerboard steps (Eq. 7.31) with
a suitable magnitude to this function. How does this affect
the performance of the GA and the gradient-based algorithm
compared with the smooth case? Study the effect of reducing
the magnitude of the steps.
d. Exploration: Experiment with different population sizes,
types of crossover, and mutation probability. Can you
improve on your original algorithm? Is that improvement
still observed for other problems?

7.5 Program the PSO algorithm and perform the following studies:

a. Reproduce the bean function results shown in Ex. 7.8.


b. Use your PSO to minimize the 𝑛-dimensional Rosenbrock


function (defined in Appendix D.1.2) with 𝑛 = 4. Estimate
the convergence rate and discuss the performance of PSO
compared with a gradient-based algorithm.
c. Study the effect of adding noise to the objective function for
both algorithms (see Prob. 7.2). Experiment with different
levels of noise.
d. Exploration: Experiment with different population sizes and
with the values of the coefficients in Eq. 7.29. Are you able
to improve the performance of your implementation for
multiple problems?
7.6 Study the effect of increased problem dimensionality using the
𝑛-dimensional Rosenbrock function defined in Appendix D.1.2.
Solve the problem using three approaches:
a. Gradient-free algorithm
b. Gradient-based algorithm with gradients computed using
finite differences
c. Gradient-based algorithm with exact gradients
You can either use an off-the-shelf optimizer or your own im-
plementation. In each case, repeat the minimization for 𝑛 =
2, 4, 8, 16, . . . up to at least 128, and see how far you can get with
each approach. Plot the number of function calls required as a
function of the problem dimension (𝑛) for all three methods on
one figure. Discuss any differences in optimal solutions found by
the various algorithms and dimensions. Compare and discuss
your results.
7.7 Consider the aircraft wing design problem described in Ap-
pendix D.1.6. We add a wrinkle to the drag computation to make
the objective discontinuous. Previously, the approximation for
the skin friction coefficient assumed that the boundary layer on
the wing was fully turbulent. In this assignment, we assume
that the boundary layer is fully laminar when the wing chord
Reynolds number is less than or equal to 𝑅𝑒 = 6 × 10^5. The laminar skin friction coefficient is given by

$$C_f = \frac{1.328}{\sqrt{Re}} .$$

For 𝑅𝑒 > 6 × 10^5, the boundary layer is assumed to be fully
turbulent, and the previous skin friction coefficient approximation
(Eq. D.14) holds.
Minimize the power with respect to span and chord by doing the
following:

a. Implement one gradient-free algorithm of your choice, or


alternatively, make up your own algorithm (and give it a
good name!)
b. Use the quasi-Newton method you programmed in Prob. 4.9.
c. Use an existing optimizer

Discuss the relative performance of these methods as applied to


this problem.
8 Discrete Optimization
Most algorithms in this book assume that the design variables are
continuous. However, sometimes design variables must be discrete.
Common examples of discrete optimization include scheduling, net-
work problems, and resource allocation. This chapter introduces some
techniques for solving discrete optimization problems.

By the end of this chapter you should be able to:

1. Identify problems where you can avoid using discrete


variables.
2. Convert problems with integer variables to ones with
binary variables.
3. Understand the basics of various discrete optimization
algorithms (branch and bound, greedy, dynamic program-
ming, simulated annealing, binary genetic).
4. Identify which algorithms are likely to be most suitable
for a given problem.

8.1 Binary, Integer, and Discrete Variables

Discrete optimization variables can be of three types: binary (sometimes


called zero-one), integer, and discrete. A light switch, for example, can
only be on or off and is a binary decision variable that is either 0 or 1.
The number of wheels on a vehicle is an integer design variable because
it is not useful to build a vehicle with half a wheel. The material in a
structure that is restricted to titanium, steel, or aluminum is an example
of a discrete variable. These cases can all be represented as integers
(including the discrete categories, which can be mapped to integers).
An optimization problem with integer design variables is referred to as
integer programming, discrete optimization, or combinatorial optimization.∗ Problems with both continuous and discrete variables are referred to as mixed-integer programming.

∗ Sometimes subtle differences in meaning are intended, but typically, and in this chapter, these terms can be used interchangeably.

Unfortunately, discrete optimization is nondeterministic polynomial-


time complete, or NP-complete, which means that we can easily verify
a solution, but there is no known approach to find a solution efficiently.
Furthermore, the time required to solve the problem becomes much
worse as the problem size grows.

Example 8.1 The drawback of an exhaustive search

The scaling issue in discrete optimization is illustrated by a well-known


discrete optimization problem: the traveling salesperson problem. Consider a
set of cities represented graphically on the left of Fig. 8.1. The problem is to
find the shortest possible route that visits each city exactly once and returns
to the starting city. The path on the right of Fig. 8.1 shows one such solution
(not necessarily the optimum). If there were only a handful of cities, you could
imagine doing an exhaustive search. You would enumerate all possible paths,
evaluate them, and return the one with the shortest distance. Unfortunately,
this is not a scalable algorithm. The number of possible paths is (𝑛 − 1)!, where
𝑛 is the number of cities. If, for example, we used all 50 U.S. state capitals as the
set of cities, then there would be 49! = 6.08 × 10^62 possible paths! This is such a
large number that we cannot evaluate all paths using an exhaustive search.

Fig. 8.1 Example of the traveling salesperson problem.

It is possible to construct algorithms that find the global optimum


of discrete problems, such as exhaustive searches. Exhaustive search
ideas can also be used for continuous problems (see Section 7.5, for
example, but the cost is much higher). Although an exhaustive search
may eventually arrive at the correct answer, executing that algorithm
to completion is often not practical, as Ex. 8.1 highlights. Discrete
optimization algorithms aim to search the large combinatorial space
more efficiently, often using heuristics and approximate solutions.

8.2 Avoiding Discrete Variables

Even though a discrete optimization problem limits the options and thus
conceptually sounds easier to solve, discrete optimization problems
are usually much more challenging to solve than continuous problems.


Thus, it is often desirable to find ways to avoid using discrete design
variables. There are a few approaches to accomplish this.

Tip 8.1 Avoid discrete variables when possible

Unless your optimization problem fits specific forms that are well suited to
discrete optimization, your problem is likely expensive to solve, and it may be
helpful to consider approaches to avoid discrete variables.

The first approach is an exhaustive search. We just discussed how


exhaustive search scales poorly, but sometimes we have many continu-
ous variables and only a few discrete variables with few options. In
that case, enumerating all options is possible. For each combination
of discrete variables, the optimization is repeated using all continuous
variables. We then choose the best feasible solution among all the opti-
mizations. This approach yields the global optimum, assuming that the
continuous optimization finds the global optimum in the continuous
variable space.

Example 8.2 Evaluate discrete variables exhaustively when the number of


combinations is small
Consider the optimization of a propeller. Although most of the design vari-
ables are continuous (e.g., propeller blade shape, twist, and chord distributions),
the number of blades on a propeller is not. Fortunately, the number of blades
falls within a reasonably small set (e.g., two to six). Assuming there are no other
discrete variables, we could just perform five optimizations corresponding to
each option and choose the best solution among the optima.

A second approach is rounding. We can optimize the discrete design


variables for some problems as if they were continuous and round the
optimal design variable values to integer values afterward. This can be
a reasonable approach if the magnitude of the design variables is large
or if there are many continuous variables and few discrete variables.
After rounding, it is best to repeat the optimization once more, allowing
only the continuous design variables to vary. This process might not
lead to the true optimum, and the solution might not even be feasible.
Furthermore, if the discrete variables are binary, rounding is generally
too crude. However, rounding is an effective and practical approach
for many problems.
Dynamic rounding is a variation of the rounding approach. Rather
than rounding all continuous variables at once, dynamic rounding is an
iterative process. It rounds only one or a subset of the discrete variables,


fixes them, and re-optimizes the remaining variables using continuous
optimization. The process is repeated until all discrete variables are
fixed, followed by one last optimization with the continuous variables.
A third approach to avoiding discrete variables is to change the
parameterization. For example, one approach in wind farm layout
optimization is to parametrize the wind turbine locations as a discrete
set of points on a grid. To turn this into a continuous problem, we could
parametrize the position of each turbine using continuous coordinates.
The trade-off of this continuous parameterization is that we can no
longer change the number of turbines, which is still discrete. To re-
parameterize, sometimes a continuous alternative is readily apparent,
but more often, it requires a good deal of creativity.
Sometimes, an exhaustive search is not feasible, rounding is unac-
ceptable, and a continuous representation is impossible. Fortunately,
there are several techniques for solving discrete optimization problems.

8.3 Branch and Bound

A popular method for solving integer optimization problems is the


branch-and-bound method. Although it is not always the most efficient
method,∗ it is popular because it is robust and applies to a wide variety ∗ Better
methods may exist that leverage
the specific problem structure, some of
of discrete problems. One case where the branch-and-bound method which are discussed in this chapter.
is especially effective is solving convex integer programming problems
where it is guaranteed to find the global optimum. The most common
convex integer problem is a linear integer problem (where all the
objectives and constraints are linear in the design variables). This
method can be extended to nonconvex integer optimization problems,
but it is generally far less effective for those problems and is not
guaranteed to find the global optimum. In this section, we assume linear
mixed-integer problems but include a short discussion on nonconvex
problems.
A linear mixed-integer optimization problem can be expressed as follows:

$$\begin{aligned}
\text{minimize} \quad & c^{\top} x \\
\text{subject to} \quad & \hat{A} x \le \hat{b} \\
& A x + b = 0 \\
& x_i \in \mathbb{Z}^{+} \text{ for some or all } i ,
\end{aligned} \qquad (8.1)$$
where Z+ represents the set of all positive integers, including zero.
8.3.1 Binary Variables


Before tackling the integer variable case, we explore the binary variable
case, where the discrete entries in 𝑥 𝑖 must be 0 or 1. Most integer
problems can be converted to binary problems by adding additional
variables and constraints. Although the new problem is larger, it is
usually far easier to solve.

Example 8.3 Converting an integer problem to a binary one

Consider a problem where an engineering device may use one of 𝑛 different


materials: 𝑦 ∈ (1 . . . 𝑛). Rather than having one design variable 𝑦, we can
convert the problem to have 𝑛 binary variables 𝑥 𝑖 , where 𝑥 𝑖 = 0 if material 𝑖 is
not selected and 𝑥 𝑖 = 1 if material 𝑖 is selected. We would also need to add an
additional linear constraint to make sure that one (and only one) material is
selected:
$$\sum_{i=1}^{n} x_i = 1 .$$

The key to a successful branch-and-bound problem is a good relax-


ation. Relaxation aims to construct an approximation of the original
optimization problem that is easier to solve. Such approximations are
often accomplished by removing constraints.
Many types of relaxation are possible for a given problem, but for lin-
ear mixed-integer programming problems, the most natural relaxation
is to change the integer constraints to continuous bound constraints
(0 ≤ 𝑥 𝑖 ≤ 1). In other words, we solve the corresponding continuous
linear programming problem, also known as an LP (discussed in Sec-
tion 11.2). If the solution to the original LP happens to return all binary
values, that is the final solution, and we terminate the search. If the LP
returns fractional values, then we need to branch.
Branching is done by adding constraints and solving the new
optimization problems. For example, we could branch by adding
constraints on 𝑥1 to the relaxed LP, creating two new optimization
problems: one with the constraint 𝑥1 = 0 and another with the constraint
𝑥1 = 1. This procedure is then repeated with additional branching as
needed.
Figure 8.2 illustrates the branching concept for binary variables. If
we explored all of those branches, this would amount to an exhaustive
search. The main benefit of the branch-and-bound algorithm is that we
can find ways to eliminate branches (referred to as pruning) to narrow
down the search scope.
Fig. 8.2 Enumerating the options for a binary problem with branching (successive levels branch on 𝑥1, 𝑥2, and 𝑥3 taking the values 0 or 1).

There are two ways to prune. If any of the relaxed problems is


infeasible, we know that everything from that node downward (i.e.,
that branch) is also infeasible. Adding more constraints cannot make
an infeasible problem feasible again. Thus, that branch is pruned, and
we go back up the tree. We can also eliminate branches by determining
that a better solution cannot exist on that branch. The algorithm keeps
track of the best solution to the problem found so far.
If one of the relaxed problems returns an objective that is worse
than the best we have found, we can prune that branch. We know this
because adding constraints always leads to a solution that is either the
same or worse, never better (assuming that we find the global optimum,
which is guaranteed for LP problems).
The solution from a relaxed problem provides a lower bound—the
best that could be achieved if continuing on that branch. The logic for
these various possibilities is summarized in Alg. 8.1.
The initial best known solution can be set as 𝑓best = ∞ if nothing is
known, but if a known feasible solution exists (or can be found quickly
by some heuristic), providing any finite best point can speed up the
optimization.
Many variations exist for the branch-and-bound algorithm. One
variation arises from the choice of which variables to branch on at a
given node.
One common strategy is to branch on the variable with the largest
fractional component. For example, if 𝑥ˆ = [1.0, 0.4, 0.9, 0.0], we could
branch on 𝑥2 or 𝑥3 because both are fractional. We hypothesize that
we are more likely to force the algorithm to make faster progress by
branching on variables that are closer to midway between integers. In
this case, that value would be 𝑥2 = 0.4. We would choose to branch on
the value closest to 0.5, that is,

$$\min_{i} \; |x_i - 0.5| . \quad (8.2)$$

Another variation of branch and bound arises from how the tree
search is performed. Two common strategies are depth-first and breadth-
first. A depth-first strategy continues as far down as possible (e.g., by


always branching left) until it cannot go further, and then it follows
right branches. A breadth-first strategy explores all nodes on a given
level before increasing depth. Various other strategies exist. In general,
we do not know beforehand which one is best for a given problem.
Depth-first is a common strategy because, in the absence of more
information about a problem, it is most likely to be the fastest way
to find a solution—reaching the bottom of the tree generally forces a
solution. Finding a solution quickly is desirable because its solution
can then be used as a lower bound on other branches.
The depth-first strategy requires less memory storage because
breadth-first must maintain a longer history as the number of lev-
els increases. In contrast, depth-first only requires node storage equal
to the number of levels.

Algorithm 8.1 Branch-and-bound algorithm

Inputs:
𝑓best : Best known solution, if any; otherwise 𝑓best = ∞
Outputs:
𝑥 ∗ : Optimal point
𝑓 (𝑥 ∗ ): Optimal function value

Let 𝒮 be the set of indices for binary constrained design variables


while branches remain do
Solve relaxed problem for 𝑥̂, 𝑓̂
if relaxed problem is infeasible then
Prune this branch, back up tree
else
if 𝑥ˆ𝑖 ∈ {0, 1} for all 𝑖 ∈ 𝒮 then A solution is found
𝑓best = min( 𝑓best , 𝑓ˆ), back up tree
else
if 𝑓ˆ > 𝑓best then
Prune this branch, back up tree
else A better solution might exist
Branch further
end if
end if
end if
end while
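Algorithm 8.1 can be sketched compactly by using an off-the-shelf LP solver for the relaxation. The following is a minimal, depth-first Python illustration for pure binary problems (our own sketch, not the book's implementation); it assumes SciPy's linprog is available, and the tolerances and recursion structure are arbitrary choices. For the binary problem solved in Ex. 8.4 below, it should return 𝑓∗ = −4.9 at 𝑥∗ = [1, 0, 1, 1].

```python
import numpy as np
from scipy.optimize import linprog

def branch_and_bound(c, A_ub, b_ub, f_best=np.inf, x_best=None, fixed=None):
    """Minimal depth-first branch and bound for a binary LP
    (minimize c @ x subject to A_ub @ x <= b_ub, x_i in {0, 1}), following Alg. 8.1.

    `fixed` maps a variable index to its branched value (0 or 1).
    """
    fixed = fixed or {}
    n = len(c)
    bounds = [(fixed.get(i, 0), fixed.get(i, 1)) for i in range(n)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)  # LP relaxation
    if not res.success:                      # infeasible: prune this branch
        return f_best, x_best
    x, f = res.x, res.fun
    if f > f_best:                           # bounded: cannot beat incumbent
        return f_best, x_best
    frac = np.abs(x - np.round(x))
    if frac.max() < 1e-6:                    # all binary: a feasible solution
        return f, np.round(x)
    i = int(np.argmin(np.abs(x - 0.5) + (frac < 1e-6)))   # most fractional variable (Eq. 8.2)
    for v in (0, 1):                         # branch on x_i = 0, then x_i = 1
        f_best, x_best = branch_and_bound(c, A_ub, b_ub, f_best, x_best,
                                          {**fixed, i: v})
    return f_best, x_best

c = np.array([-2.5, -1.1, -0.9, -1.5])
A = np.array([[4.3, 3.8, 1.6, 2.1], [4.0, 2.0, 1.9, 3.0]])
b = np.array([9.2, 9.0])
print(branch_and_bound(c, A, b))
```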
Example 8.4 A binary branch-and-bound optimization

Consider the following discrete problem with binary design variables:

minimize − 2.5𝑥 1 − 1.1𝑥 2 − 0.9𝑥3 − 1.5𝑥4


subject to 4.3𝑥1 + 3.8𝑥2 + 1.6𝑥3 + 2.1𝑥4 ≤ 9.2
4𝑥1 + 2𝑥2 + 1.9𝑥3 + 3𝑥4 ≤ 9
𝑥 𝑖 ∈ {0, 1} for all 𝑖.

To solve this problem, we begin at the first node by solving the linear
relaxation. The binary constraint is removed and instead replaced with
continuous bounds: 0 ≤ 𝑥 𝑖 ≤ 1. The solution to this LP is as follows:

𝑥 ∗ = [1, 0.5274, 0.4975, 1]


𝑓 ∗ = −5.0279.

There are nonbinary values in the solution, so we need to branch. As


mentioned previously, a typical choice is to branch on the variable with the
most fractional component. In this case, that is 𝑥3 , so we create two additional
problems, which add the constraints 𝑥3 = 0 and 𝑥 3 = 1, respectively (Fig. 8.3).

Fig. 8.3 Initial binary branch: the root relaxation gives 𝑥∗ = [1, 0.53, 0.50, 1] with 𝑓∗ = −5.03, and we branch on 𝑥3 = 0 and 𝑥3 = 1.

Although depth-first was recommended previously, in this example, we


use breadth-first because it yields a more concise example. The depth-first tree
is also shown at the end of the example. We solve both of the problems at this
next level as shown in Fig. 8.4. Neither of these optimizations yields all binary
values, so we have to branch both of them. In this case, the left node branches
on 𝑥 2 (the only fractional component), and the right node also branches on 𝑥2
(the most fractional component).

Fig. 8.4 Solutions along these two branches: 𝑥3 = 0 gives 𝑥∗ = [1, 0.74, 0, 1] with 𝑓∗ = −4.81, and 𝑥3 = 1 gives 𝑥∗ = [1, 0.47, 1, 0.72] with 𝑓∗ = −5.00.

The first branch (see Fig. 8.5) yields a feasible binary solution! The corre-
sponding function value 𝑓 = −4 is saved as the best value so far. There is no
need to continue on this branch because the solution cannot be improved on
this particular branch.
We continue solving along the rest of this row (Fig. 8.6). The third node
in this row yields another binary solution. In this case, the function value is
𝑓 = −4.9, which is better, so this becomes the new best value so far. The second
Fig. 8.5 The first feasible solution: fixing 𝑥3 = 0 and 𝑥2 = 0 yields the binary point 𝑥∗ = [1, 0, 0, 1] with 𝑓∗ = −4.

and fourth nodes do not yield a solution. Typically, we would have to branch
these further, but they have a lower bound that is worse than the best solution
so far. Thus, we can prune both of these branches.

Fig. 8.6 The rest of the solutions on this row: (𝑥3 = 0, 𝑥2 = 1) gives 𝑥∗ = [0.77, 1, 0, 1] with 𝑓∗ = −4.52; (𝑥3 = 1, 𝑥2 = 0) gives the binary point 𝑥∗ = [1, 0, 1, 1] with 𝑓∗ = −4.9; and (𝑥3 = 1, 𝑥2 = 1) gives 𝑥∗ = [0.40, 1, 1, 1] with 𝑓∗ = −4.49.

All branches have been pruned, so we have solved the original problem:

𝑥 ∗ = [1, 0, 1, 1]
𝑓 ∗ = −4.9.

Fig. 8.7 Search path using a depth-first strategy. Branching first on 𝑥3, then 𝑥2, 𝑥1, and 𝑥4, the feasible solutions encountered are 𝑓∗ = −4, 𝑓∗ = −2.6, and 𝑓∗ = −3.6 on the 𝑥3 = 0 side and 𝑓∗ = −4.9 on the 𝑥3 = 1 side; the remaining nodes are bounded or infeasible.

Alternatively, we could have used a depth-first strategy. In this case, it is


less efficient, but in general, the best strategy is not known beforehand. The
depth-first tree for this same example is shown in Fig. 8.7. Feasible solutions to
the problem are shown as 𝑓 ∗ .
8.3.2 Integer Variables


If the problem cannot be cast in binary form, we can use the same proce-
dure with integer variables. Instead of branching with two constraints
(𝑥 𝑖 = 0 and 𝑥 𝑖 = 1), we branch with two inequality constraints that
encourage integer solutions. For example, if the variable we branched
on was 𝑥 𝑖 = 3.4, we would branch with two new problems with the
following constraints: 𝑥 𝑖 ≤ 3 or 𝑥 𝑖 ≥ 4.

Example 8.5 Branch and bound with integer variables

Consider the following problem:

minimize − 𝑥1 − 2𝑥2 − 3𝑥3 − 1.5𝑥4


subject to 𝑥 1 + 𝑥2 + 2𝑥3 + 2𝑥 4 ≤ 10
7𝑥1 + 8𝑥2 + 5𝑥 3 + 𝑥4 = 31.5
𝑥 𝑖 ∈ Z+ for 𝑖 = 1, 2, 3
𝑥4 ≥ 0 .

We begin by solving the LP relaxation, replacing the integer constraints


with a lower bound constraint of zero (𝑥 𝑖 ≥ 0). The solution to that problem is

𝑥 ∗ = [0, 1.1818, 4.4091, 0], 𝑓 ∗ = −15.59 .

We begin by branching on the most fractional value, which is 𝑥3 . We create


two new branches:
• The original LP with the added constraint 𝑥 3 ≤ 4
• The original LP with the added constraint 𝑥 3 ≥ 5
Even though depth-first is usually more efficient, we use breadth-first because
it is easier to display on a figure. The solution to that first problem is

𝑥 ∗ = [0, 1.4, 4, 0.3], 𝑓 ∗ = −15.25.

The second problem is infeasible, so we can prune that branch.


Recall that the last variable is allowed to be continuous, so we now branch
on 𝑥2 by creating two new problems with additional constraints: 𝑥2 ≤ 1 and
𝑥2 ≥ 2.
The problem continues using the same procedure shown in the breadth-
first tree in Fig. 8.8. The figure gives some indication of why solving integer
problems is more time-consuming than solving binary ones. Unlike the binary
case, the same value is revisited with tighter constraints. For example, the
constraint 𝑥3 ≤ 4 is enforced early on. Later, two additional problems are
created with tighter bounds on the same variable: 𝑥3 ≤ 2 and 𝑥 3 ≥ 3. In
general, the same variable could be revisited many times as the constraints are
slowly tightened, whereas in the binary case, each variable is only visited once
because the values can only be 0 or 1.
Fig. 8.8 A breadth search of the mixed-integer programming example. The branches add bound constraints on 𝑥3, 𝑥2, and 𝑥1 in turn; pruned nodes are either infeasible or bounded, and the best solution found is 𝑓∗ = −13.75.

Once all the branches are pruned, we obtain the solution:

𝑥 ∗ = [0, 2, 3, 0.5]
𝑓 ∗ = −13.75.

Nonconvex mixed-integer problems can also be used with the


branch-and-bound method and generally use this latter strategy of
forming two branches of continuous constraints. In this case, the
relaxed problem is not a convex problem, so there is no guarantee that
we have found a lower bound for that branch. Furthermore, the cost
of each suboptimization problem is increased. Thus, for nonconvex
discrete problems, this approach is usually only practical for a relatively
small number of discrete design variables.

8.4 Greedy Algorithms

Greedy algorithms are among the simplest methods for discrete opti-
mization problems. This method is more of a concept than a specific
algorithm. The implementation varies with the application. The idea is
to reduce the problem to a subset of smaller problems (often down to a
single choice) and then make a locally optimal decision. That decision
is locked in, and then the next small decision is made in the same
manner. A greedy algorithm does not revisit past decisions and thus
ignores much of the coupling between design variables.
Example 8.6 A weighted directed graph

As an example, consider the weighted directed graph shown in Fig. 8.9.


This graph might represent a transportation problem for shipping goods,
information flow through a social network, or a supply chain problem. The
objective is to traverse from node 1 to node 12 with the lowest possible total
cost (the numbers above the path segments denote the cost of each path). A
series of discrete choices must be made at each step, and those choices limit the
available options in the next step.

Fig. 8.9 The greedy algorithm in this weighted directed graph results in a total cost of 15, whereas the best possible cost is 10. (The figure highlights both the greedy path and the globally optimal path.)

A greedy algorithm simply makes the best choice as if each decision were the only decision to be made. Starting at node 1, we first choose to move to node 3 because that has the lowest cost among the three options (node 2 costs 2, node 3 costs 1, node 4 costs 5). We then choose to move to node 6 because that has the smaller cost of the two available options (node 6 costs 4, node 7 costs 6), and so on. The path selected by the greedy algorithm is highlighted in the figure and results in a total cost of 15. The global optimum is also highlighted in the figure and has a total cost of 10.
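To make the idea concrete, a minimal Python sketch of this locally optimal decision rule follows; the graph and its costs are hypothetical, not the values of Fig. 8.9:

edges = {1: {2: 2, 3: 1}, 2: {4: 1}, 3: {4: 6}, 4: {}}   # edges[i][j] = cost of moving i -> j
end = 4

node, path, total = 1, [1], 0
while node != end:
    node, cost = min(edges[node].items(), key=lambda kv: kv[1])   # locally cheapest edge only
    path.append(node)
    total += cost

print(path, total)   # [1, 3, 4] with cost 7; the best path is 1 -> 2 -> 4 with cost 3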

The greedy algorithm used in Ex. 8.6 is easy to apply and scalable but does not generally find the global optimum. To find that global optimum, we have to consider the impact of our choices on future decisions. A method to achieve this for certain problem structures is discussed in the next section.
Even for a fixed problem, there are many ways to construct a greedy algorithm. The advantage of the greedy approach is that the algorithms are easy to construct, and they bound the computational expense of the problem. One disadvantage of the greedy approach is that it usually does not find an optimal solution (and in some cases finds the worst solution!152). Furthermore, the solution is not necessarily feasible.
152. Gutin et al., Traveling salesman should not be greedy: domination analysis of greedy-type heuristics for the TSP, 2002.

Despite the disadvantages, greedy algorithms can sometimes quickly find solutions reasonably close to an optimal solution.

Example 8.7 Greedy algorithms

A few other examples of greedy algorithms are listed below. For the traveling salesperson problem (Ex. 8.1), always select the nearest city as the next step. Consider the propeller problem (Ex. 8.2), but with additional discrete variables (number of blades, type of material, and number of shear webs). A greedy method could optimize the discrete variables one at a time, with the others fixed (i.e., optimize the number of blades first, fix that number, then optimize material, and so on). As a final example, consider the grocery store shopping problem discussed in a separate chapter (Ex. 11.1).∗ A few possible greedy algorithms for this problem include: always pick the cheapest food item next, always pick the most nutritious food item next, or always pick the food item with the most nutrition per unit cost.
∗ This is a form of the knapsack problem, which is a classic problem in discrete optimization discussed in more detail in the following section.

8.5 Dynamic Programming

Dynamic programming is a valuable approach for discrete optimization problems with a particular structure. This structure can also be
exploited in continuous optimization problems and problems beyond
optimization. The required structure is that the problem can be posed
as a Markov chain (for continuous problems, this is called a Markov
process). A Markov chain or process satisfies the Markov property,
where a future state can be predicted from the current state without
needing to know the entire history. The concept can be generalized to a
finite number of states (i.e., more than one but not the entire history)
and is called a variable-order Markov chain.
If the Markov property holds, we can transform the problem into a
recursive one. Using recursion, a smaller problem is solved first, and
then larger problems are solved that use the solutions from the smaller
problems.
This approach may sound like a greedy optimization, but it is not.
We are not using a heuristic but fully solving the smaller problems.
Because of the problem structure, we can reuse those solutions. We will
illustrate this in examples. This approach has become particularly useful
in optimal control and some areas of economics and computational
biology. More general design problems, such as the propeller example
(Ex. 8.2), do not fit this type of structure (i.e., choosing the number
of blades cannot be broken up into a smaller problem separate from
choosing the material).

A classic example of a Markov chain is the Fibonacci sequence, defined as follows:

𝑓0 = 0
𝑓1 = 1                    (8.3)
𝑓𝑛 = 𝑓𝑛−1 + 𝑓𝑛−2 .

We can compute the next number in the sequence using only the last two states.∗ We could implement the computation of this sequence using recursion, as shown algorithmically in Alg. 8.2 and graphically in Fig. 8.10 for 𝑓5.

∗ We can also convert this to a standard first-order Markov chain by defining 𝑔𝑛 = 𝑓𝑛−1 and considering our state to be (𝑓𝑛, 𝑔𝑛). Then, each state only depends on the previous state.

Algorithm 8.2 Fibonacci with recursion

procedure fib(𝑛)
if 𝑛 ≤ 1 then
return 𝑛
else
return fib(𝑛 − 1) + fib(𝑛 − 2)
end if
end procedure

Fig. 8.10 Computing the Fibonacci sequence using recursion. The function fib(2) is highlighted as an example to show the repetition that occurs in this recursive procedure.

Although this recursive procedure is simple, it is inefficient. For example, the calculation for fib(2) (highlighted in Fig. 8.10) is repeated
multiple times. There are two main approaches to avoiding this
inefficiency. The first is a top-down procedure called memoization,
where we store previously computed values to avoid having to compute
them again. For example, the first time we need fib(2), we call the fib
function and store the result (the value 1). As we progress down the
tree, if we need fib(2) again, we do not call the function but retrieve
the stored value instead.
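A minimal Python sketch of memoization applied to Alg. 8.2 follows; using the standard library cache decorator is one possible implementation choice, not the only one:

from functools import lru_cache

@lru_cache(maxsize=None)      # store previously computed values
def fib(n):
    if n <= 1:
        return n
    return fib(n - 1) + fib(n - 2)   # repeated calls such as fib(2) now hit the cache

print(fib(5))   # 5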

A bottom-up procedure called tabulation is more common. This


procedure is how we would typically show the Fibonacci sequence. We
start from the bottom ( 𝑓0 ) and work our way forward, computing each
new value using the previous states. Rather than using recursion, this
involves a simple loop, as shown in Alg. 8.3. Whereas memoization fills
entries on demand, tabulation systematically works its way up, filling
in entries. In either case, we reduce the computational complexity of
this algorithm from exponential complexity (approximately 𝒪(2𝑛 )) to
linear complexity (𝒪(𝑛)).

Algorithm 8.3 Fibonacci with tabulation

𝑓0 = 0
𝑓1 = 1
for 𝑖 = 2 to 𝑛 do
𝑓𝑖 = 𝑓𝑖−1 + 𝑓𝑖−2
end for

These procedures can be applied to optimization, but before introducing examples, we formalize the mathematics of the approach. One
main difference in optimization is that we do not have a set formula
like a Fibonacci sequence. Instead, we need to make a design decision
at each state, which changes the next state. For example, in the problem
shown in Fig. 8.9, we decide which path to take.
Mathematically, we express a given state as 𝑠𝑖 and make a design decision 𝑥𝑖, which transitions us to the next state 𝑠𝑖+1 (Fig. 8.11),

𝑠𝑖+1 = 𝑡𝑖(𝑠𝑖, 𝑥𝑖),                    (8.4)

where 𝑡 is a transition function.† At each transition, we compute the cost function 𝑐.‡ For generality, we specify a cost function that may change at each iteration 𝑖:

𝑐𝑖(𝑠𝑖, 𝑥𝑖).                    (8.5)

We want to make a set of decisions that minimize the sum of the current and future costs up to a certain time, which is called the value function,

𝑣(𝑠𝑖) = minimize_{𝑥𝑖, ..., 𝑥𝑛} (𝑐𝑖(𝑠𝑖, 𝑥𝑖) + 𝑐𝑖+1(𝑠𝑖+1, 𝑥𝑖+1) + · · · + 𝑐𝑛(𝑠𝑛, 𝑥𝑛)) ,                    (8.6)

where 𝑛 defines the time horizon up to which we consider the cost. For continuous problems, the time horizon may be infinite. The value function (Eq. 8.6) is the minimum cost, not just the cost for some arbitrary set of decisions.§

Fig. 8.11 Diagram of state transitions in a Markov chain.
† For some problems, the transition function is stochastic.
‡ It is common to use discount factors on future costs.
§ We use 𝑣 and 𝑐 for the scalars in Eq. 8.6 instead of Greek letters because the connection to “value” and “cost” is clearer.

Bellman’s principle of optimality states that because of the structure of the problem (where the next state only depends on the current state), we can determine the best solution at this iteration, 𝑥∗𝑖, if we already know all the optimal future decisions 𝑥∗𝑖+1, . . . , 𝑥∗𝑛. Thus, we can recursively solve this problem from the back (bottom) by determining 𝑥∗𝑛, then 𝑥∗𝑛−1, and so on back to 𝑥∗𝑖. Mathematically, this recursive procedure is captured by the Bellman equation:

𝑣(𝑠𝑖) = minimize_{𝑥𝑖} (𝑐(𝑠𝑖, 𝑥𝑖) + 𝑣(𝑠𝑖+1)) .                    (8.7)

We can also express this equation in terms of our transition function to show the dependence on the current decision:

𝑣(𝑠𝑖) = minimize_{𝑥𝑖} (𝑐(𝑠𝑖, 𝑥𝑖) + 𝑣(𝑡𝑖(𝑠𝑖, 𝑥𝑖))) .                    (8.8)

Example 8.8 Dynamic programming applied to a graph problem

Let us solve the graph problem posed in Ex. 8.6 using dynamic programming.
For convenience, we repeat a smaller version of the figure in Fig. 8.12. We use
the tabulation (bottom-up) approach. To do this, we construct a table where we
keep track of the cost to move from this node to the end (node 12) and which
node we should move to next:

Node 1 2 3 4 5 6 7 8 9 10 11 12
Cost
Next

We start from the end. The last node is simple: there is no cost to move from node 12 to the end (we are already there), and there is no next node.

Node   1   2   3   4   5   6   7   8   9   10   11   12
Cost                                                   0
Next                                                   –

Now we move back one level to consider nodes 9, 10, and 11. These nodes all lead to node 12 and are thus straightforward. We need to be more careful with the formulas as we get to the more complicated cases next.

Node   1   2   3   4   5   6   7   8   9   10   11   12
Cost                                   3    6    2    0
Next                                  12   12   12    –

Fig. 8.12 Small version of Fig. 8.9 for convenience.

Now we move back one level to nodes 5, 6, 7, and 8. Using the Bellman
equation for node 5, the cost is

cost(5) = min[3 + cost(9), 2 + cost(10), 1 + cost(11)]. (8.9)

We have already computed the minimum value for cost(9), cost(10), and cost(11),
so we just look up these values in the table. In this case, the minimum total
value is 3 and is associated with moving to node 11. Similarly, the cost for node
6 is
cost(6) = min[5 + cost(9), 4 + cost(10)]. (8.10)
The result is 8, and it is realized by moving to node 9.

Node   1   2   3   4   5   6   7   8   9   10   11   12
Cost                   3   8           3    6    2    0
Next                  11   9          12   12   12    –

We repeat this process, moving back and reusing optimal solutions to find
the global optimum. The completed table is as follows:

Node    1   2   3   4   5   6   7   8   9   10   11   12
Cost   10   8  12   9   3   8   7   4   3    6    2    0
Next    2   5   6   8  11   9  11  11  12   12   12    –

From this table, we see that the minimum cost is 10. This cost is achieved
by moving first to node 2. Under node 2, we see that we next go to node 5, then
11, and finally 12. Thus, the tabulation gives us the global minimum for cost
and the design decisions to achieve that.
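The same bottom-up tabulation can be written compactly in code. The following Python sketch uses the small hypothetical graph from the greedy sketch in Section 8.4 (not the graph of Fig. 8.9) and assumes that the node labels form a topological order:

edges = {1: {2: 2, 3: 1}, 2: {4: 1}, 3: {4: 6}, 4: {}}   # hypothetical costs, edges[i][j] = cost i -> j
end = 4

cost, nxt = {end: 0}, {end: None}
for node in sorted(edges, reverse=True):     # work backward from the end (tabulation)
    if node == end:
        continue
    # Bellman equation (Eq. 8.7): reuse the optimal costs already computed for future states
    nxt[node], edge = min(edges[node].items(), key=lambda kv: kv[1] + cost[kv[0]])
    cost[node] = edge + cost[nxt[node]]

print(cost[1], nxt)   # minimum cost 3, achieved by following the next-node table 1 -> 2 -> 4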

To illustrate the concepts more generally, consider another classic


problem in discrete optimization—the knapsack problem. In this
problem, we have a fixed set of items we can select from. Each item
has a weight 𝑤 𝑖 and a cost 𝑐 𝑖 . Because the knapsack problem is usually
written as a maximization problem and cost implies minimization,
we should use value instead. However, we proceed with cost to be
consistent with our earlier notation. The knapsack has a fixed capacity
𝐾 (a scalar) that cannot be exceeded.
The objective is to choose the items that yield the highest total
cost subject to the capacity of our knapsack. The design variables 𝑥 𝑖
are either 1 or 0, indicating whether we take or do not take item 𝑖.
This problem has many practical applications, such as shipping, data
transfer, and investment portfolio selection.

The problem can be written as

maximize_{𝑥}   ∑_{𝑖=1}^{𝑛} 𝑐𝑖 𝑥𝑖
subject to     ∑_{𝑖=1}^{𝑛} 𝑤𝑖 𝑥𝑖 ≤ 𝐾                    (8.11)
               𝑥𝑖 ∈ {0, 1} .

In its present form, the knapsack problem has a linear objective and
linear constraints, so branch and bound would be a good approach.
However, this problem can also be formulated as a Markov chain, so we
can use dynamic programming. The dynamic programming version
accommodates variations such as stochasticity and other constraints
more easily.
To pose this problem as a Markov chain, we define the state as the
remaining capacity of the knapsack 𝑘 and the number of items we
have already considered. In other words, we are interested in 𝑣(𝑘, 𝑖),
where 𝑣 is the value function (optimal value given the inputs), 𝑘 is
the remaining capacity in the knapsack, and 𝑖 indicates that we have
already considered items 1 through 𝑖 (this does not mean we have
added them all to our knapsack, only that we have considered them).
We iterate through a series of decisions 𝑥 𝑖 deciding whether to take
item 𝑖 or not, which transitions us to a new state where 𝑖 increases and
𝑘 may decrease, depending on whether or not we took the item.
The real problem we are interested in is 𝑣(𝐾, 𝑛), which we solve
using tabulation. Starting at the bottom, we know that 𝑣(𝑘, 0) = 0 for
any 𝑘. This means that no matter the capacity, the value is 0 if we have
not considered any items yet. To work forward, consider a general case
considering item 𝑖, with the assumption that we have already solved
up to item 𝑖 − 1 for any capacity. If item 𝑖 cannot fit in our knapsack
(𝑤 𝑖 > 𝑘), then we cannot take the item. Alternatively, if the weight is
less than the capacity, we need to decide whether to select item 𝑖 or
not. If we do not, then the value is unchanged, and 𝑣(𝑘, 𝑖) = 𝑣(𝑘, 𝑖 − 1).
If we do select item 𝑖, then our value is 𝑐 𝑖 plus the best we could do
with the previous items but with a capacity that was smaller by 𝑤 𝑖 :
𝑣(𝑘, 𝑖) = 𝑐 𝑖 + 𝑣(𝑘 − 𝑤 𝑖 , 𝑖 − 1). Whichever of these decisions yields a
better value is what we should choose.
To determine which items produce this cost, we need to add more
logic. To keep track of the selected items, we define a selection matrix
𝑆 of the same size as 𝑣 (note that this matrix is indexed starting at zero
in both dimensions). Every time we accept an item 𝑖, we register that in
the matrix as 𝑆 𝑘,𝑖 = 1. Algorithm 8.4 summarizes this process.

Algorithm 8.4 Knapsack with tabulation

Inputs:
𝑐 𝑖 : Cost of item 𝑖
𝑤 𝑖 : Weight of item 𝑖
𝐾: Total available capacity
Outputs:
𝑥 ∗ : Optimal selections
𝑣(𝐾, 𝑛): Corresponding cost, 𝑣(𝑘, 𝑖) is the optimal cost for capacity 𝑘 considering items 1
through 𝑖 ; note that indexing starts at 0

for 𝑘 = 0 to 𝐾 do
𝑣(𝑘, 0) = 0 No items considered; value is zero for any capacity
end for
for 𝑖 = 1 to 𝑛 do Iterate forward solving for one additional item at a time
for 𝑘 = 0 to 𝐾 do
if 𝑤 𝑖 > 𝑘 then
𝑣(𝑘, 𝑖) = 𝑣(𝑘, 𝑖 − 1) Weight exceeds capacity; value unchanged
else
if 𝑐 𝑖 + 𝑣(𝑘 − 𝑤 𝑖 , 𝑖 − 1) > 𝑣(𝑘, 𝑖 − 1) then Take item
𝑣(𝑘, 𝑖) = 𝑐 𝑖 + 𝑣(𝑘 − 𝑤 𝑖 , 𝑖 − 1)
𝑆(𝑘, 𝑖) = 1
else Reject item
𝑣(𝑘, 𝑖) = 𝑣(𝑘, 𝑖 − 1)
end if
end if
end for
end for
𝑘=𝐾 Initialize
𝑥 ∗ = {} Initialize solution 𝑥 ∗ as an empty set
for 𝑖 = 𝑛 to 1 by −1 do Loop to determine which items we selected
if 𝑆 𝑘,𝑖 = 1 then
add 𝑖 to 𝑥 ∗ Item 𝑖 was selected
𝑘 = 𝑘 − 𝑤𝑖
end if
end for

We fill all entries in the matrix 𝑣[𝑘, 𝑖] to extract the last value
𝑣[𝐾, 𝑛]. For small numbers, filling this matrix (or table) is often
illustrated manually, hence the name tabulation. As with the Fibonacci
example, using dynamic programming instead of a fully recursive
solution reduces the complexity from 𝒪(2𝑛 ) to 𝒪(𝐾𝑛), which means it
is pseudolinear. It is only pseudolinear because there is a dependence
on the knapsack size. For small capacities, the problem scales well
even with many items, but as the capacity grows, the problem scales

much less efficiently. Note that the knapsack problem requires integer
weights. Real numbers can be scaled up to integers (e.g., 1.2, 2.4 become
12, 24). Arbitrary precision floats are not feasible given the number of
combinations to search across.

Example 8.9 Knapsack problem with dynamic programming

Consider five items with the following weights and costs:

𝑤 𝑖 = [4, 5, 2, 6, 1]
𝑐 𝑖 = [4, 3, 3, 7, 2].

The capacity of our knapsack is 𝐾 = 10. Using Alg. 8.4, we find that the optimal
cost is 12. The value matrix, with rows corresponding to capacities 𝑘 = 0, . . . , 10 and columns to items 𝑖 = 0, . . . , 5, is as follows:

0   0   0   0   0   0
0   0   0   0   0   2
0   0   0   3   3   3
0   0   0   3   3   5
0   4   4   4   4   5
0   4   4   4   4   6
0   4   4   7   7   7
0   4   4   7   7   9
0   4   4   7  10  10
0   4   7   7  10  12
0   4   7   7  11  12

For this example, the selection matrix 𝑆 is as follows:

0   0   0   0   0   0
0   0   0   0   0   1
0   0   0   1   0   0
0   0   0   1   0   1
0   1   0   0   0   1
0   1   0   0   0   1
0   1   0   1   0   0
0   1   0   1   0   1
0   1   0   1   1   0
0   1   1   0   1   1
0   1   1   0   1   1

Following this algorithm, we find that we selected items 3, 4, and 5 for a total
cost of 12, as expected, and a total weight of 9.
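For reference, a minimal Python sketch of Alg. 8.4 (with illustrative function and variable names) reproduces these numbers:

def knapsack(w, c, K):
    n = len(w)
    v = [[0] * (n + 1) for _ in range(K + 1)]     # v[k][i]: best cost with capacity k, items 1..i
    S = [[0] * (n + 1) for _ in range(K + 1)]     # selection matrix
    for i in range(1, n + 1):
        for k in range(K + 1):
            if w[i - 1] > k:                      # item does not fit; value unchanged
                v[k][i] = v[k][i - 1]
            elif c[i - 1] + v[k - w[i - 1]][i - 1] > v[k][i - 1]:
                v[k][i] = c[i - 1] + v[k - w[i - 1]][i - 1]   # take item i
                S[k][i] = 1
            else:
                v[k][i] = v[k][i - 1]             # reject item i
    items, k = [], K                              # walk back through S to recover the selections
    for i in range(n, 0, -1):
        if S[k][i] == 1:
            items.append(i)
            k -= w[i - 1]
    return v[K][n], sorted(items)

print(knapsack([4, 5, 2, 6, 1], [4, 3, 3, 7, 2], 10))   # (12, [3, 4, 5])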

Like greedy algorithms, dynamic programming is more of a technique than a specific algorithm. The implementation varies with the particular application.

8.6 Simulated Annealing

Simulated annealing∗ is a methodology designed for discrete optimization problems. However, it can also be effective for continuous multimodal problems, as we will discuss.

∗ First developed by Kirkpatrick et al.153 and Černý.154
153. Kirkpatrick et al., Optimization by simulated annealing, 1983.
154. Černý, Thermodynamical approach to the traveling salesman problem: An efficient simulation algorithm, 1985.

The algorithm is inspired by the annealing process of metals. The atoms in a metal form a crystal lattice structure. If the metal is heated, the atoms move around freely. As the metal cools down, the atoms slow down, and if the cooling is slow enough, they reconfigure into a minimum-energy state. Alternatively,
enough, they reconfigure into a minimum-energy state. Alternatively,


if the metal is quenched or cooled rapidly, the metal recrystallizes with
a different higher-energy state (called an amorphous metal).
From statistical mechanics, the Boltzmann distribution (also called
Gibbs distribution) describes the probability of a system occupying a
given energy state:

𝑃(𝑒) ∝ exp(−𝑒 / (𝑘𝐵𝑇)) ,                    (8.12)
where 𝑒 is the energy level, 𝑇 is the temperature, and 𝑘 𝐵 is Boltzmann’s
constant. This equation shows that as the temperature decreases, the
probability of occupying a higher-energy state decreases, but it is not
zero. Therefore, unlike in classical mechanics, an atom could jump
to a higher-energy state with some small probability. This property
imparts an exploratory nature to the optimization algorithm, which
avoids premature convergence to a local minimum. The temperature
level provides some control on the level of expected exploration.
An early approach to simulate this type of probabilistic thermodynamic model was the Metropolis algorithm.155 In the Metropolis algorithm, the probability of transitioning from energy state 𝑒1 to energy state 𝑒2 is formulated as

𝑃 = min ( exp( −(𝑒2 − 𝑒1) / (𝑘𝐵𝑇) ), 1 ) ,                    (8.13)

155. Metropolis et al., Equation of state calculations by fast computing machines, 1953.

where this probability is limited to be no greater than 1. This limit


is needed because the exponent yields a value greater than 1 when
𝑒2 < 𝑒1 , which would be nonsensical. Simulated annealing leverages
this concept in creating an optimization algorithm.
In the optimization analogy, the objective function is the energy
level. Temperature is a parameter controlled by the optimizer, which
begins high and is slowly “cooled” to drive convergence. At a given
iteration, the design variables are given by 𝑥, and the objective (or
energy) is given by 𝑓 (𝑥 (𝑘) ). A new state 𝑥new is selected at random in
the neighborhood of 𝑥. If the energy level decreases, the new state
is accepted. If the energy level increases, the new state might still be

accepted with probability

exp( −( 𝑓(𝑥new) − 𝑓(𝑥(𝑘)) ) / 𝑇 ) ,                    (8.14)
where Boltzmann’s constant is removed because it is just an arbitrary
scale factor in the optimization context. Otherwise, the state remains
unchanged. Constraints can be handled in this algorithm without
resorting to penalties by rejecting any infeasible step.
We must supply the optimizer with a function that provides a
random neighboring design from the set of possible design configurations.
A neighboring design is usually related to the current design instead
of picking a pure random design from the entire set. In defining the
neighborhood structure, we might wish to define transition probabilities
so that all neighbors are not equally likely. This type of structure is
common in Markov chain problems. Because the nature of different
discrete problems varies widely, we cannot provide a generic neighbor-
selecting algorithm, but an example is shown later for the specific case
of a traveling salesperson problem.
Finally, we need to determine the annealing schedule (or cooling
schedule), a process for decreasing the temperature throughout the
optimization. A common approach is an exponential decrease:

𝑇 = 𝑇0 𝛼 𝑘 , (8.15)

where 𝑇0 is the initial temperature, 𝛼 is the cooling rate, and 𝑘 is the


iteration number. The cooling rate 𝛼 is a number close to 1, such as
0.8–0.99. Another simple approach to iterate toward zero temperature
is as follows:

𝑇 = 𝑇0 (1 − 𝑘/𝑘max)^𝛽 ,                    (8.16)
where the exponent 𝛽 is usually in the range of 1–4. A higher exponent
corresponds to spending more time at low temperatures. In many
approaches, the temperature is kept constant for a fixed number of
iterations (or a fixed number of successful moves) before moving to
the next decrease. Many methods are simple schedules with a prede-
termined rate, although more complex adaptive methods also exist.†

† See Andresen and Gordon,156 for example.
156. Andresen and Gordon, Constant thermodynamic speed for minimizing entropy production in thermodynamic processes and simulated annealing, 1994.

The annealing schedule can substantially impact the algorithm's performance, and some experimentation is required to select an appropriate schedule for the problem at hand. One essential requirement is that the temperature should start high enough to allow for exploration. This should be significantly higher than the maximum expected energy

change (change in objective) but not so high that computational time is


wasted with too much random searching. Also, cooling should occur
slowly to improve the ability to recover from a local optimum, imitating
the annealing process instead of the quenching process.
The algorithm is summarized in Alg. 8.5; for simplicity in the
description, the annealing schedule uses an exponential decrease at
every iteration.

Algorithm 8.5 Simulated Annealing

Inputs:
𝑥 0 : Starting point
𝑇0 : Initial temperature
Outputs:
𝑥 ∗ : Optimal point

for 𝑘 = 0 to 𝑘max do                      Simple iteration; convergence metrics can be used instead
    𝑥new = neighbor(𝑥(𝑘))                 Randomly generate from neighbors
    if 𝑓(𝑥new) ≤ 𝑓(𝑥(𝑘)) then             Energy decreased; jump to new state
        𝑥(𝑘+1) = 𝑥new
    else
        𝑟 ∈ 𝒰[0, 1]                       Randomly draw from uniform distribution
        𝑃 = exp( −(𝑓(𝑥new) − 𝑓(𝑥(𝑘))) / 𝑇 )
        if 𝑃 ≥ 𝑟 then                     Probability high enough to jump
            𝑥(𝑘+1) = 𝑥new
        else
            𝑥(𝑘+1) = 𝑥(𝑘)                 Otherwise remain at current state
        end if
    end if
    𝑇 = 𝛼𝑇                                Reduce temperature
end for

Example 8.10 Traveling salesperson with simulated annealing

This example sets up the traveling salesperson problem with 50 points


randomly distributed (from uniform sampling) on a square grid with sides of
length 1 (left of Fig. 8.13). The objective is the total Euclidean distance of a path
that traverses all points and returns to the starting point. The design variables
are a sequence of integers corresponding to the order in which the salesperson
traverses the points.
We generate new neighboring designs using the technique from Lin,157 where one of two options is randomly chosen at each iteration: (1) randomly choose two points and flip the direction of the path segments between those two points, or (2) randomly choose two points and move the path segments to follow another randomly chosen point. The distance traveled by the randomly generated initial set of points is 26.2.
157. Lin, Computer solutions of the traveling salesman problem, 1965.
We specify an iteration budget of 25,000 iterations, set the initial temperature
to be 10, and decrease the temperature by a multiplicative factor of 0.95 at every
100 iterations. The right panel of Fig. 8.13 shows the final path, which has a
length of 5.61. The final path might not be the global optimum (remember,
these finite time methods are only approximations of the full combinatorial
search), but the methodology is effective and fast for this problem in finding at
least a near-optimal solution. Figure 8.14 shows the iteration history.

Fig. 8.13 Initial and final paths for the traveling salesperson problem.
Fig. 8.14 Convergence history of the simulated annealing algorithm.
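A short Python sketch of the first neighbor move (reversing the path segment between two randomly chosen positions) is shown below, assuming NumPy; the second move type and the distance evaluation are omitted:

import numpy as np

def reverse_segment(order, rng):
    i, j = np.sort(rng.choice(len(order), size=2, replace=False))
    new_order = order.copy()
    new_order[i:j + 1] = order[i:j + 1][::-1]   # flip the segment between the two chosen points
    return new_order

rng = np.random.default_rng(0)
order = np.arange(10)                           # current visiting order of 10 points
print(reverse_segment(order, rng))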

The simulated annealing algorithm can be applied to continuous


multimodal problems as well. The motivation is similar because the
initial high temperature permits the optimizer to escape local minima,
whereas a purely descent-based approach would not. By slowly cooling,
the initial exploration gives way to exploitation. The only real change
in the procedure is in the neighbor function. A typical approach is
to generate a random direction and choose a step size proportional
to the temperature. Thus, smaller, more conservative steps are taken
as the algorithm progresses. If bound constraints are present, they
would be enforced at this step. Purely random step directions are

not particularly efficient for many continuous problems, particularly


when most directions are ill-conditioned (e.g., a narrow valley or near
convergence). One variation adopts concepts from the Nelder–Mead
algorithm (Section 7.3) to improve efficiency.158 Overall, simulated annealing has made more impact on discrete problems compared with continuous ones.
158. Press et al., Numerical Recipes in C: The Art of Scientific Computing, 1992.
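A minimal sketch of this continuous variant in Python follows, assuming NumPy; the test function, the normally distributed step, and the cooling parameters are illustrative choices rather than part of Alg. 8.5:

import numpy as np

def simulated_annealing(f, x0, T0=10.0, alpha=0.95, k_max=1000, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    x, T = np.asarray(x0, dtype=float), T0
    for _ in range(k_max):
        x_new = x + T * rng.standard_normal(x.size)        # step size proportional to temperature
        df = f(x_new) - f(x)
        if df <= 0 or rng.random() < np.exp(-df / T):      # acceptance test of Eq. 8.14
            x = x_new
        T *= alpha                                         # exponential cooling (Eq. 8.15)
    return x

f = lambda x: np.sum(x**2) + np.sum(np.sin(5 * x))          # multimodal test function
print(simulated_annealing(f, np.ones(2)))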

8.7 Binary Genetic Algorithms

The binary form of a genetic algorithm (GA) can be directly used with
discrete variables. Because the binary form already requires a discrete
representation for the population members, using discrete design
variables is a natural fit. The details of this method were discussed in
Section 7.6.1.

8.8 Summary

This chapter discussed various strategies for approaching discrete


optimization problems. Some problems can be well approximated
using rounding, can be reparameterized in a continuous way, or only
have a few discrete combinations, allowing for explicit enumeration.
For problems that can be posed as linear (or convex in general), branch
and bound is effective. If the problem can be posed as a Markov chain,
dynamic programming is a useful method.
If none of these categorizations are applicable, then a stochastic
method, such as simulated annealing or GAs, may work well. These
stochastic methods typically struggle as the dimensionality of the
problem increases. However, simulated annealing can scale better for
some problems if there are clever ways to quickly evaluate designs in
the neighborhood, as is done with the traveling salesperson problem.
An alternative to these various algorithms is to use a greedy strategy,
which can scale well. Because this strategy is a heuristic, it usually
results in a loss in solution quality.

Problems

8.1 Answer true or false and justify your answer.

a. All discrete variables can be represented by integers.


b. Discrete optimization algorithms sometimes use heuristics
and find only approximate solutions.
c. The rounding technique solves a discrete optimization prob-
lem with continuous variables and then rounds each result-
ing design variable, objective, and constraint to the nearest
integer.
d. Exhaustive search is the only way to be sure we have found
the global minimum for a problem that involves discrete
variables.
e. The branch-and-bound method is guaranteed to find the
global optimum for convex problems.
f. When using the branch-and-bound method for binary vari-
ables, the same variable might have to be revisited.
g. When using the branch-and-bound method, the breadth-first
strategy requires less memory storage than the depth-first
strategy.
h. Greedy algorithms never reconsider a decision once it has
been made.
i. The Markov property applies when a future state can be
predicted from the current state without needing to know
any previous state.
j. Both memoization and tabulation reduce the computational
complexity of dynamic programming such that it no longer
scales exponentially.
k. Simulated annealing can be used to minimize smooth uni-
modal functions of continuous design variables.
l. Simulated annealing, genetic algorithms, and dynamic pro-
gramming include stochastic procedures.

8.2 Branch and bound. Solve the following problem using a manual
branch-and-bound approach (i.e., show each LP subproblem), as

is done in Ex. 8.4:


maximize 0.5𝑥1 + 2𝑥2 + 3.5𝑥3 + 4.5𝑥 4
subject to 5.5𝑥1 + 0.5𝑥 2 + 3.5𝑥3 + 2.3𝑥4 ≤ 9.2
2𝑥1 + 4𝑥2 + 2𝑥 4 ≤ 8
1𝑥1 + 3𝑥2 + 3𝑥 3 + 4𝑥4 ≤ 4
𝑥 𝑖 ∈ {0, 1} for all 𝑖.

8.3 Solve an integer linear programming problem. A chemical company


produces four types of products: A, B, C, and D. Each requires
labor to produce and uses some combination of chlorine, sulfuric
acid, and sodium hydroxide in the process. The production
process can also produce these chemicals as a by-product, rather
than just consuming them. The chemical mixture and labor
required for the production of the four products are listed in the
following table, along with the availability per day. The market
values for one barrel of A, B, C, and D are $50, $30, $80, and $30,
respectively. Determine the number of barrels of each to produce
to maximize profit using three different approaches:

a. As a continuous linear programming problem with rounding


b. As an integer linear programming problem
c. Exhaustive search

A B C D Limit
Chlorine 0.74 −0.05 1.0 −0.15 97
Sodium hydroxide 0.39 0.4 0.91 0.44 99
Sulfuric acid 0.86 0.89 0.09 0.83 52
Labor (person-hours) 5 7 7 6 1000

Discuss the results.

8.4 Solve a dynamic programming problem. Solve the knapsack problem


with the following weights and costs:

𝑤 𝑖 = [2, 5, 3, 4, 6, 1]
𝑐 𝑖 = [5, 3, 1, 5, 7, 2]

and a capacity of 𝐾 = 12. Maximize the cost subject to the capacity


constraint. Use the following two approaches:

a. A greedy algorithm where you take the item with the best
cost-to-weight ratio (that fits within the remaining capacity)
at each iteration

b. Dynamic programming

8.5 Simulated annealing. Construct a traveling salesperson problem


with 50 randomly generated points. Implement a simulated
annealing algorithm to solve it.

8.6 Binary genetic algorithm. Solve the same problem as previously


(traveling salesperson) with a binary genetic algorithm.
9 Multiobjective Optimization
Up to this point in the book, all of our optimization problem formula-
tions have had a single objective function. In this chapter, we consider
multiobjective optimization problems, that is, problems whose formula-
tions have more than one objective function. Some common examples
of multiobjective optimization include risk versus reward, profit versus
environmental impact, acquisition cost versus operating cost, and drag
versus noise.

By the end of this chapter you should be able to:

1. Identify scenarios where multiobjective optimization is


useful.
2. Understand the concept of dominance and identify a Pareto
set.
3. Use various methods for performing multiobjective opti-
mization and understand the pros and cons of the methods.

9.1 Multiple Objectives

Before discussing how to solve multiobjective problems, we must first


explore what it means to have more than one objective. In some
sense, there is no such thing as a multiobjective optimization problem.
Although many metrics are important to the engineer, only one thing
can be made best at a time. A common technique when presented with
multiple objectives is to assign weights to the various objectives and
combine them into a single objective.
More generally, multiobjective optimization helps explore the trade-
offs between different metrics. Still, if we select one design from
the presented options, we have indirectly chosen a single objective.
However, the corresponding objective function may be difficult to
formulate beforehand.


Tip 9.1 Are you sure you have multiple objectives?

A common pitfall for beginner optimization practitioners is to categorize


a problem as multiobjective without critical evaluation. When considering
whether you should use more than one objective, you should ask whether or not
there is a more fundamental underlying objective or if some of the “objectives”
are actually constraints. Solving a multiobjective problem is much more costly
than solving a single objective one, so you should make sure you need multiple
objectives.

Example 9.1 Selecting an objective

Determining the appropriate objective is often a real challenge. For ex-


ample, in designing an aircraft, we may decide that minimizing drag and
minimizing weight are important. However, these metrics compete with each
other and cannot be minimized simultaneously. Instead, we may conclude
that maximizing range (the distance the aircraft can fly) is the underlying
metric that matters most for our application and appropriately balances the
trade-offs between weight and drag. Or perhaps maximizing range is not the
right metric. Range may be important, but only insofar as we reach some
threshold. Increasing the range does not increase the value because the range
is a constraint. The underlying objective in this scenario may be some other
metric, such as operating costs.

Despite these considerations, there are still good reasons to pursue


a multiobjective problem. A few of the most common reasons are as
follows:

1. Multiobjective optimization can quantify the trade-off between


different objectives. The benefits of this approach will become ap-
parent when we discuss Pareto surfaces and can lead to important
design insights.
2. Multiobjective optimization provides a “family” of designs rather
than a single design. A family of options is desirable when
decision-making needs to be deferred to a later stage as more
information is gathered. For example, an executive team or
higher-fidelity numerical simulations may be used to make later
design decisions.
3. For some problems, the underlying objective is either unknown
or too difficult to compute. For example, cost and environmental
impact may be two important metrics for a new design. Although
the latter could arguably be turned into a cost, doing so may

be too difficult to quantify and add an unacceptable amount of


uncertainty (see Chapter 12).

Mathematically, the only change to our optimization problem formulation is that the objective statement,

minimize_{𝑥}   𝑓(𝑥) ,                    (9.1)

becomes

minimize_{𝑥}   𝑓(𝑥) = [ 𝑓1(𝑥), 𝑓2(𝑥), . . . , 𝑓𝑛𝑓(𝑥) ]ᵀ ,   where 𝑛𝑓 ≥ 2 .                    (9.2)
The constraints are unchanged unless some of them have been refor-
mulated as objectives. This multiobjective formulation might require
trade-offs when trying to minimize all functions simultaneously be-
cause, beyond some point, further reduction in one objective can only
be achieved by increasing one or more of the other objectives.
One exception occurs if the objectives are independent because they
depend on different sets of design variables. Then, the objectives are
said to be separable, and they can be minimized independently. If there
are constraints, these need to be separable as well. However, separable
objectives and constraints are rare because functions tend to be linked
in engineering systems.
Given that multiobjective optimization requires trade-offs, we need
a new definition of optimality. In the next section, we explain how there
is an infinite number of optimal points, forming a surface in the space of
objective functions. After defining optimality for multiple objectives, we
present several possible methods for solving multiobjective optimization
problems.
9.2 Pareto Optimality

With multiple objectives, we have to reconsider what it means for a point to be optimal. In multiobjective optimization, we use the concept of Pareto optimality.
Figure 9.1 shows three designs measured against two objectives that we want to minimize: 𝑓1 and 𝑓2. Let us first compare design A with design B. From the figure, we see that design A is better than design B in both objectives. In the language of multiobjective optimization, we say that design A dominates design B. One design is said to dominate another design if it is superior in all objectives (design A dominates any design in the shaded rectangle). Comparing design A with design C, we note that design A is better in one objective (𝑓1) but worse in the other objective (𝑓2). Neither design dominates the other.

Fig. 9.1 Three designs, 𝐴, 𝐵, and 𝐶, are plotted against two objectives, 𝑓1 and 𝑓2. The region in the shaded rectangle highlights points that are dominated by design 𝐴.
A point is said to be nondominated if none of the other evaluated points dominate it (Fig. 9.2). If a point is nondominated by any point in the entire domain, then that point is called Pareto optimal. This does not imply that this point dominates all other points; it simply means no other point dominates it. The set of all Pareto optimal points is called the Pareto set. The Pareto set refers to the vector of points 𝑥∗, whereas the Pareto front refers to the vector of functions 𝑓(𝑥∗).

Fig. 9.2 A plot of all the evaluated points in the design space plotted against two objectives, 𝑓1 and 𝑓2. The set of red points is not dominated by any other and is thus the current approximation of the Pareto set.

Example 9.2 A Pareto front in wind farm optimization

The Pareto front is a valuable tool to produce design insights. Figure 9.3 shows a notional Pareto front for a wind farm optimization. The two objectives are maximizing power production (shown with a negative sign so that it is minimized) and minimizing noise.
The Pareto front is helpful to understand trade-off sensitivities. For example, the left endpoint represents the maximum power solution, and the right endpoint represents the minimum noise solution. The nature of the curve on the left side tells us how much power we have to sacrifice for a given reduction in noise. If the slope is steep, as is the case in the figure, we can see that a small sacrifice in maximum power production can be exchanged for significantly reduced noise. However, if more significant noise reductions are sought, then large power reductions are required. Conversely, if the left side of the figure had a flatter slope, we would know that small reductions in noise would require significant decreases in power. Understanding the magnitude of these trade-off sensitivities helps make high-level design decisions.

Fig. 9.3 A notional Pareto front representing power and noise trade-offs for a wind farm optimization problem.
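To make the dominance comparison concrete, the following small Python sketch (assuming NumPy and that all objectives are minimized) tests dominance and filters the nondominated points of a set by brute force:

import numpy as np

def dominates(fa, fb):
    """True if design a dominates design b (no worse in every objective, better in at least one)."""
    return np.all(fa <= fb) and np.any(fa < fb)

def nondominated(F):
    """Indices of the rows of F (one row of objective values per design) not dominated by any other."""
    return [i for i in range(len(F))
            if not any(dominates(F[j], F[i]) for j in range(len(F)) if j != i)]

F = np.array([[1.0, 4.0], [2.0, 2.0], [3.0, 3.0], [4.0, 1.0]])
print(nondominated(F))   # [0, 1, 3]; the design [3, 3] is dominated by [2, 2]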

9.3 Solution Methods

Various solution methods exist for solving multiobjective problems.


This chapter does not cover all methods but highlights some of the more
commonly used approaches. These include the weighted-sum method,
the epsilon-constraint method, the normal boundary intersection (NBI)
method, and evolutionary algorithms.

9.3.1 Weighted Sum


The weighted-sum method is easy to use, but it is not particularly
efficient. Other methods exist that are just as simple but have better

performance. It is only introduced because it is well known and is


frequently used. The idea is to combine all of the objectives into one
objective using a weighted sum, which can be written as

𝑓̄(𝑥) = ∑_{𝑖=1}^{𝑛𝑓} 𝑤𝑖 𝑓𝑖(𝑥) ,                    (9.3)

where 𝑛𝑓 is the number of objectives, and the weights are usually normalized such that

∑_{𝑖=1}^{𝑛𝑓} 𝑤𝑖 = 1 .                    (9.4)

If we have two objectives, the objective reduces to

𝑓¯(𝑥) = 𝑤 𝑓1 (𝑥) + (1 − 𝑤) 𝑓2 (𝑥) , (9.5)

where 𝑤 is a weight in [0, 1].


Consider a two-objective case. Points on the Pareto set are determined by choosing a weight 𝑤, completing the optimization for the composite objective, and then repeating the process for a new value of 𝑤. It is straightforward to see that at the extremes 𝑤 = 0 and 𝑤 = 1, the optimization returns the designs that optimize one objective while ignoring the other. The weighted-sum objective forms an equation for a line with the objectives as the ordinates. Conceptually, we can think of this method as choosing a slope for the line (by selecting 𝑤), then pushing that line down and to the left as far as possible until it is just tangent to the Pareto front (Fig. 9.4). With this form of the objective, the slope of the line would be

d𝑓2/d𝑓1 = −𝑤 / (1 − 𝑤) .                    (9.6)

This procedure identifies one point in the Pareto set, and the procedure must then be repeated with a new slope.
The main benefit of this method is that it is easy to use. However, the drawbacks are that (1) uniform spacing in 𝑤 leads to nonuniform spacing along the Pareto set, (2) it is not apparent which values of 𝑤 should be used to sweep out the Pareto set evenly, and (3) this method can only return points on the convex portion of the Pareto front.
In Fig. 9.5, we highlight the convex portions of the Pareto front from Fig. 9.4. If we utilize the concept of pushing a line down and to the left, we see that these are the only portions of the Pareto front that can be found using a weighted-sum method.

Fig. 9.4 The weighted-sum method defines a line for each value of 𝑤 and finds the point tangent to the Pareto front.
Fig. 9.5 The convex portions of this Pareto front are the portions highlighted.
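A brief Python sketch of the weighted-sum sweep follows, assuming SciPy; the two quadratic objectives are placeholders used only for illustration:

import numpy as np
from scipy.optimize import minimize

f1 = lambda x: (x[0] - 1) ** 2 + x[1] ** 2
f2 = lambda x: x[0] ** 2 + (x[1] - 2) ** 2

pareto = []
for w in np.linspace(0, 1, 11):                              # one composite optimization per weight
    res = minimize(lambda x, w=w: w * f1(x) + (1 - w) * f2(x), x0=np.zeros(2))
    pareto.append((f1(res.x), f2(res.x)))                    # one point on the Pareto front
print(pareto)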

9.3.2 Epsilon-Constraint Method


The epsilon-constraint method works by minimizing one objective while setting all other objectives as additional constraints:159

minimize_{𝑥}   𝑓𝑖
subject to     𝑓𝑗 ≤ 𝜀𝑗   for all 𝑗 ≠ 𝑖
               𝑔(𝑥) ≤ 0                    (9.7)
               ℎ(𝑥) = 0 .

159. Haimes et al., On a bicriterion formulation of the problems of integrated system identification and system optimization, 1971.

Then, we must repeat this procedure for different values of 𝜀 𝑗 .


This method is visualized in Fig. 9.6. In this example, we constrain
𝑓1 to be less than or equal to a certain value and minimize 𝑓2 to find the
corresponding point on the Pareto front. We then repeat this procedure
for different values of 𝜀.
One advantage of this method is that the values of 𝜀 correspond directly to the magnitude of one of the objectives, so determining appropriate values for 𝜀 is more intuitive than selecting the weights in the previous method. However, we must be careful to choose values that result in a feasible problem. Another advantage is that this method reveals the nonconvex portions of the Pareto front. Both of these reasons strongly favor using the epsilon-constraint method over the weighted-sum method, especially because this method is not much harder to use. Its main limitation is that, like the weighted-sum method, a uniform spacing in 𝜀 does not generally yield a uniform spacing of the Pareto front (though it is usually much better spaced than weighted-sum), and therefore it might still be inefficient, particularly with more than two objectives.

Fig. 9.6 The vertical line represents an upper bound constraint on 𝑓1. The other objective, 𝑓2, is minimized to reveal one point in the Pareto set. This procedure is then repeated for different constraints on 𝑓1 to sweep out the Pareto set.
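A matching Python sketch of the epsilon-constraint sweep (Eq. 9.7) follows, again assuming SciPy and placeholder quadratic objectives:

import numpy as np
from scipy.optimize import minimize

f1 = lambda x: (x[0] - 1) ** 2 + x[1] ** 2
f2 = lambda x: x[0] ** 2 + (x[1] - 2) ** 2

pareto = []
for eps in np.linspace(0.0, 5.0, 11):                            # sweep the bound f1 <= eps
    con = {"type": "ineq", "fun": lambda x, e=eps: e - f1(x)}    # feasible when f1(x) <= eps
    res = minimize(f2, x0=np.zeros(2), constraints=[con], method="SLSQP")
    if res.success:
        pareto.append((f1(res.x), f2(res.x)))
print(pareto)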

9.3.3 Normal Boundary Intersection


The NBI method is designed to address the issue of nonuniform spacing along the Pareto front.160 We first find the extremes of the Pareto set; in other words, we minimize the objectives one at a time. These extreme points are referred to as anchor points. Next, we construct a plane that passes through the anchor points. We space points along this plane (usually evenly) and, starting from those points, solve optimization problems that search along directions normal to this plane.
160. Das and Dennis, Normal-boundary intersection: A new method for generating the Pareto surface in nonlinear multicriteria optimization problems, 1998.
This procedure is shown in Fig. 9.7 for a two-objective case. In this
case, the plane that passes through the anchor points is a line. We
now space points along this line by choosing a vector of weights 𝑏, as
illustrated on the left-hand side of Fig. 9.7. The weights are constrained such

that 𝑏𝑖 ∈ [0, 1], and ∑𝑖 𝑏𝑖 = 1. If we make 𝑏𝑖 = 1 and all other entries zero, then this equation returns one of the anchor points, 𝑓(𝑥∗𝑖). For two objectives, we would set 𝑏 = [𝑤, 1 − 𝑤] and vary 𝑤 in equal steps between 0 and 1.

Fig. 9.7 A notional example of the NBI method. A plane is created that passes through the single-objective optima (the anchor points), and solutions are sought normal to that plane for a more evenly spaced Pareto front.

Starting with a specific value of 𝑏, we search along a direction


perpendicular to the line defined by the anchor points, represented by
𝛼 𝑛ˆ in Fig. 9.7 (right). We seek to find the point along this direction
that is the farthest away from the anchor points line (a maximization
problem), with the constraint that the point is consistent with the
objective functions. The resulting optimal point found along this
direction is a point on the Pareto front. We then repeat this process for
another set of weighting parameters in 𝑏.
We can see how this method is similar to the epsilon-constraint
method, but instead of searching along lines parallel to one of the axes,
we search along lines perpendicular to the plane defined by the anchor
points. The idea is that even spacing along this plane is more likely to
lead to even spacing along the Pareto front.
Mathematically, we start by determining the anchor points, which
are just single-objective optimization problems. From the anchor
points, we define what is called the utopia point. The utopia point is an
ideal point that cannot be obtained, where every objective reaches its
minimum simultaneously (shown in the lower-left corner of Fig. 9.7):

𝑓∗ = [ 𝑓1(𝑥∗1), 𝑓2(𝑥∗2), . . . , 𝑓𝑛(𝑥∗𝑛) ]ᵀ ,                    (9.8)

where 𝑥 ∗𝑖 denotes the design variables that minimize objective 𝑓𝑖 . The


utopia point defines the equation of a plane that passes through all
anchor points,
𝑃𝑏 + 𝑓 ∗ , (9.9)

where the 𝑖th column of 𝑃 is 𝑓 (𝑥 ∗𝑖 ) − 𝑓 ∗ . A single vector 𝑏, whose length


is given by the number of objectives, defines a point on the plane.
We now define a vector 𝑛̂ that is normal to this plane, in the direction toward the origin. We search along this vector using a step length 𝛼, while maintaining consistency with our objective functions 𝑓(𝑥), yielding

𝑓(𝑥) = 𝑃𝑏 + 𝑓∗ + 𝛼𝑛̂ .                    (9.10)

Computing the exact normal 𝑛̂ is involved, and the vector does not need to be exactly normal. As long as the vector points toward the Pareto front, it will still yield well-spaced points. In practice, a quasi-normal vector is often used, such as

𝑛˜ = −𝑃𝑒 , (9.11)

where 𝑒 is a vector of 1s.


We now solve the following optimization problem, for a given vector
𝑏, to yield a point on the Pareto front:

maximize_{𝑥, 𝛼}   𝛼
subject to        𝑃𝑏 + 𝑓∗ + 𝛼𝑛̂ = 𝑓(𝑥)
                  𝑔(𝑥) ≤ 0                    (9.12)
                  ℎ(𝑥) = 0 .

This means that we find the point farthest away from the anchor-point
plane, starting from a given value for 𝑏, while satisfying the original
problem constraints. The process is then repeated for additional values
of 𝑏 to sweep out the Pareto front.
In contrast to the previously mentioned methods, this method yields
a more uniformly spaced Pareto front, which is desirable for computa-
tional efficiency, albeit at the cost of a more complex methodology.
For most multiobjective design problems, additional complexity
beyond the NBI method is unnecessary. However, even this method
can still have deficiencies for problems with unusual Pareto fronts,
and new methods continue to be developed. For example, the normal
constraint method uses a very similar approach,161 but with inequality constraints to address a deficiency in the NBI method that occurs when the normal line does not cross the Pareto front. This methodology has undergone various improvements, including better scaling through normalization.162 A more recent improvement performs an even more efficient generation of the Pareto frontier by avoiding regions of the Pareto front where minimal trade-offs occur.163

161. Ismail-Yahaya and Messac, Effective generation of the Pareto frontier using the normal constraint method, 2002.
162. Messac and Mattson, Normal constraint method with guarantee of even representation of complete Pareto frontier, 2004.
163. Hancock and Mattson, The smart normal constraint method for directly generating a smart Pareto set, 2013.

Example 9.3 A two-dimensional normal boundary intersection problem

First, we optimize the objectives one at a time, which in our example results in the two anchor points shown in Fig. 9.8: 𝑓(𝑥∗1) = (2, 3) and 𝑓(𝑥∗2) = (5, 1).

Fig. 9.8 Search directions are normal to the line connecting anchor points.

The utopia point is then

𝑓∗ = [2, 1]ᵀ .

For the matrix 𝑃, recall that the 𝑖th column of 𝑃 is 𝑓(𝑥∗𝑖) − 𝑓∗:

𝑃 = [ 0  3 ]
    [ 2  0 ] .

Our quasi-normal vector is given by −𝑃𝑒 (note that the true normal is [−2, −3]):

𝑛̃ = [−3, −2]ᵀ .

We now have all the parameters we need to solve Eq. 9.12.
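These quantities can be assembled in a few lines of NumPy (the array names are illustrative):

import numpy as np

f_anchor = np.array([[2.0, 3.0],        # f(x1*): anchor point from minimizing f1
                     [5.0, 1.0]])       # f(x2*): anchor point from minimizing f2
f_utopia = f_anchor.min(axis=0)         # utopia point [2, 1]
P = (f_anchor - f_utopia).T             # i-th column is f(xi*) - f*, giving [[0, 3], [2, 0]]
n_quasi = -P @ np.ones(2)               # quasi-normal -Pe = [-3, -2]  (Eq. 9.11)

b = np.array([0.8, 0.2])                # weights defining one point on the anchor plane
plane_point = P @ b + f_utopia          # P b + f*  (Eq. 9.9)
print(f_utopia, P, n_quasi, plane_point)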
9.3.4 Evolutionary Algorithms

Gradient-free methods can, and occasionally do, use all of the previously described methods. However, evolutionary algorithms also enable a fundamentally different approach. Genetic algorithms (GAs), a specific type of evolutionary algorithm, were introduced in Section 7.6.∗

∗ The first application of an evolutionary algorithm for solving a multiobjective problem was by Schaffer.164
164. Schaffer, Some experiments in machine learning using vector evaluated genetic algorithms, 1984.

A GA is amenable to an extension that can handle multiple objectives because it keeps track of a large population of designs at each iteration. If we plot two objective functions for a given population of a GA iteration, we get something like that shown in Fig. 9.9. The points represent the current population, and the highlighted points in the lower left are the current nondominated set. As the optimization progresses, the nondominated set moves further down and to the left and eventually converges toward the actual Pareto front.

Fig. 9.9 Population for a multiobjective GA iteration plotted against two objectives. The nondominated set is highlighted at the bottom left and eventually converges toward the Pareto front.
and eventually converges toward the actual Pareto front. Pareto front.

In the multiobjective version of the GA, the reproduction and


mutation phases are unchanged from the single-objective version. The
primary difference is in determining the fitness and the selection
procedure. Here, we provide an overview of one popular approach,
the elitist nondominated sorting genetic algorithm (NSGA-II).†
A step in the algorithm is to find a nondominated set (i.e., the current approximation of the Pareto set), and several algorithms exist to accomplish this. In the following, we use the algorithm by Kung et al.,166 which is one of the fastest. This procedure is described in Alg. 9.1, where “front” is a shorthand for the nondominated set (which is just the current approximation of the Pareto front). The algorithm recursively divides the population in half and finds the nondominated set for each half separately.

† The NSGA-II algorithm was developed by Deb et al.148 Some key developments include using the concept of domination in the selection process, preserving diversity among the nondominated set, and using elitism.165
148. Deb et al., A fast and elitist multiobjective genetic algorithm: NSGA-II, 2002.
165. Deb, Introduction to evolutionary multiobjective optimization, 2008.
166. Kung et al., On finding the maxima of a set of vectors, 1975.

Algorithm 9.1 Find the nondominated set using Kung’s algorithm

Inputs:
𝑝: A population sorted by the first objective
Outputs:
𝑓 : The nondominated set for the population

procedure front(𝑝)
if length(𝑝) = 1 then                       If there is only one point, it is the front
return 𝑝
end if
Split population into two halves 𝑝 𝑡 and 𝑝 𝑏
⊲ Because input was sorted, 𝑝 𝑡 will be superior to 𝑝 𝑏 in the first objective
𝑡 = front(𝑝 𝑡 ) Recursive call to find front for top half
𝑏 = front(𝑝 𝑏 ) Recursive call to find front for bottom half
Initialize 𝑓 with the members from 𝑡 merged population
for 𝑖 = 1 to length(𝑏) do
dominated = false Track whether anything in 𝑡 dominates 𝑏 𝑖
for 𝑗 = 1 to length(𝑡) do
if 𝑡 𝑗 dominates 𝑏 𝑖 then
dominated = true
break No need to continue search through 𝑡
end if
end for
if not dominated then 𝑏 𝑖 was not dominated by anything in 𝑡
Add 𝑏 𝑖 to 𝑓
end if
end for
return 𝑓
end procedure

Before calling the algorithm, the population should be sorted by


the first objective. First, we split the population into two halves, where
the top half is superior to the bottom half in the first objective. Both
populations are recursively fed back through the algorithm to find their
nondominated sets. We then initialize a merged population with the
members of the top half. All members in the bottom half are checked,
and any that are nondominated by any member of the top half are added
to the merged population. Finally, we return the merged population as
the nondominated set.
With NSGA-II, in addition to determining the nondominated set,
we want to rank all members by their dominance depth, which is also
called nondominated sorting. In this approach, all nondominated points
in the population (i.e., the current approximation of the Pareto set) are
given a rank of 1. Those points are then removed from the set, and
the next set of nondominated points is given a rank of 2, and so on.
Figure 9.10 shows a sample population and illustrates the positions of rank = 1
the points with various rank values. There are alternative procedures rank = 2
rank = 3
that perform nondominated sorting directly, but we do not detail them rank ≥ 4
here. This algorithm is summarized in Alg. 9.2. 𝑓2
The new population in the GA is filled by placing all rank 1 points
in the new population, then all rank 2 points, and so on. At some point,
an entire group of constant rank will not fit within the new population.
Points with the same rank are all equivalent as far as Pareto optimality is concerned, so an additional sorting mechanism is needed to determine which members of this group to include.

Fig. 9.10 Points in the population highlighted by rank.

Algorithm 9.2 Perform nondominated sorting

Inputs:
𝑝: A population
Outputs:
rank: The rank for each member in the population

𝑟=1 Initialize current rank


𝑠=𝑝 Set subpopulation as entire population
while length(𝑠) > 0 do
𝑓 = front(sort(𝑠)) Identify the current front
Set rank for every member of 𝑓 to 𝑟
Remove all members of 𝑓 from 𝑠
𝑟 = 𝑟+1 Increment rank
end while

We perform selection within a group that can only partially fit


to preserve diversity. Points in this last group are ordered by their
crowding distance, which is a measure of how spread apart the points
are. The algorithm seeks to preserve points that are well spread. For
each point, a hypercube in objective space is formed around it, which,
in NSGA-II, is referred to as a cuboid. Figure 9.11 shows an example
cuboid considering the rank 3 points from Fig. 9.10. The hypercube
extends to the function values of its nearest neighbors in the function
space. That does not mean that it necessarily touches its neighbors because the two closest neighbors can differ for each objective. The sum of the dimensions of this hypercube is the crowding distance. When summing the dimensions, each dimension is normalized by the maximum range of that objective value. For example, considering only 𝑓1 for the moment, if the objectives were in ascending order, then the contribution of point 𝑖 to the crowding distance would be

𝑑1,𝑖 = (𝑓1,𝑖+1 − 𝑓1,𝑖−1) / (𝑓1,𝑛𝑝 − 𝑓1,1) ,                    (9.13)

where 𝑛𝑝 is the size of the population. Sometimes, instead of using the first and last points in the current objective set, user-supplied values are used for the min and max values of 𝑓 that appear in that denominator. The anchor points (the single-objective optima) are assigned a crowding distance of infinity because we want to give preference to their inclusion. The algorithm for crowding distance is shown in Alg. 9.3.

Fig. 9.11 A cuboid around one point, demonstrating the definition of crowding distance (except that the distances are normalized).

Algorithm 9.3 Crowding distance

Inputs:
𝑝: A population
Outputs:
𝑑: Crowding distances

Initialize 𝑑 with zeros


for 𝑖 = 1 to number of objectives do
Set 𝑓 as a vector containing the 𝑖th objective for each member in 𝑝
𝑠 = sort( 𝑓 ) and let 𝐼 contain the corresponding indices (𝑠 = 𝑓𝐼 )
𝑑 𝐼1 = ∞ Anchor points receive an infinite crowding distance
𝑑𝐼𝑛 = ∞
for 𝑗 = 2 to 𝑛 𝑝 − 1 do Add distance for interior points
𝑑𝐼 𝑗 = 𝑑𝐼 𝑗 + (𝑠 𝑗+1 − 𝑠 𝑗−1 )/(𝑠 𝑛 𝑝 − 𝑠 1 )
end for
end for
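A minimal Python sketch of this crowding distance computation follows, where F is an array with one row per member of a single rank group and one column per objective; the names are illustrative assumptions.

    import numpy as np

    def crowding_distance(F):
        n, m = F.shape
        d = np.zeros(n)
        for j in range(m):
            order = np.argsort(F[:, j])   # members sorted by objective j
            s = F[order, j]
            d[order[0]] = np.inf          # anchor points get infinite distance
            d[order[-1]] = np.inf
            span = s[-1] - s[0]
            if span > 0:                  # normalize by the range of this objective
                for k in range(1, n - 1):
                    d[order[k]] += (s[k + 1] - s[k - 1]) / span
        return d

For the rank 3 members of Ex. 9.4 (A, C, I, L), this gives approximately [1.67, inf, 1.5, inf], matching the values computed in that example.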
We can now put together the overall multiobjective GA, as shown
in Alg. 9.4, where we use the components previously described (non-
dominated set, nondominated sorting, and crowding distance).

Algorithm 9.4 Elitist nondominated sorting genetic algorithm

Inputs:
x̄: Variable upper bounds
x̲: Variable lower bounds
Outputs:
𝑥 ∗ : Best point

Generate initial population


while Stopping criterion is not satisfied do
Using a parent population 𝑃, proceed as a standard GA for selection,
crossover, and mutation, but use a crowded tournament selection to produce
an offspring population 𝑂
𝐶 =𝑃∪𝑂 Combine populations
Compute rank𝑖 for 𝑖 = 1, 2, . . . of 𝐶 using Alg. 9.2
⊲ Fill new parent population with as many whole ranks as possible
𝑃=∅
𝑟=1
while true do
set 𝐹 as all 𝐶 𝑖 with rank𝑖 = 𝑟
if length(𝑃) + length(𝐹) > 𝑛 𝑝 then
break
end if
add 𝐹 to 𝑃
𝑟 = 𝑟+1
end while
⊲ For last rank that does not fit, add by crowding distance
if length(𝑃) < 𝑛 𝑝 then Population is not full
d = crowding(𝐹) Alg. 9.3, using last 𝐹 from terminated previous loop
𝑚 = 𝑛 𝑝 − length(𝑃) Determine how many members to add
Sort 𝐹 by the crowding distance 𝑑 in descending order
Add the first 𝑚 entries from 𝐹 to 𝑃
end if
end while

The crossover and mutation operations remain the same. Tourna-


ment selection (Fig. 7.25) is modified slightly to use this algorithm’s
ranking and crowding metrics. In the tournament, a member with a
lower rank is superior. If two members have the same rank, then the
one with the larger crowding distance is selected. This procedure is
called crowded tournament selection.
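A minimal sketch of the comparison used in crowded tournament selection, assuming the rank and crowding distance of each member have already been computed (the function name is an illustrative assumption):

    def crowded_better(rank_a, dist_a, rank_b, dist_b):
        # member a wins if it has a lower rank, or, on a tie, a larger crowding distance
        if rank_a != rank_b:
            return rank_a < rank_b
        return dist_a > dist_b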
After reproduction and mutation, instead of replacing the parent
generation with the offspring generation, both the parent generation and
the offspring generation are saved as candidates for the next generation.
This strategy is called elitism, which means that the best member in the
population is guaranteed to survive.
The population size is now twice its original size (2𝑛 𝑝 ), and the
selection process must reduce the population back down to size 𝑛 𝑝 . This
is done using the procedure explained previously. The new population
is filled by including all rank 1 members, rank 2 members, and so on
until an entire rank can no longer fit. Inclusion for members of that
last rank is done in the order of the largest crowding distance until
the population is filled. Many variations are possible, so although the
algorithm is based on the concepts of NSGA-II, the details may differ
somewhat.
The main advantage of this multiobjective approach is that if an evo-
lutionary algorithm is appropriate for solving a given single-objective
problem, then the extra information needed for a multiobjective prob-
lem is already there, and solving the multiobjective problem does not
incur much additional computational cost. The pros and cons of this
approach compared to the previous approaches are the same as those
of gradient-based versus gradient-free methods, except that the multi-
objective gradient-based approaches require solving multiple problems
to generate the Pareto front. Still, solving multiple gradient-based
problems may be more efficient than solving one gradient-free problem,
especially for problems with a large number of design variables.

Example 9.4 Filling a new population in NSGA-II

After reproduction and mutation, we are left with a combined population
of parents and offspring. In this small example, the combined population is of
size 12, so we must reduce it back to 6. This example has two objectives, and
the values for each member in the population are shown in the following table,
where we assign a letter to each member. The population is plotted in Fig. 9.12.

          A   B   C   D   E   F   G   H   I   J   K   L
    𝑓1    5   7   10  1   3   10  5   6   9   6   9   4
    𝑓2    8   9   4   4   7   6   10  3   5   1   2   10

First, we compute the ranks using Alg. 9.2, resulting in the following output:

          A   B   C   D   E   F   G   H   I   J   K   L
    rank  3   4   3   1   2   4   4   2   3   1   2   3

Fig. 9.12 Population for Ex. 9.4.

We see that the current nondominated set consists of points D and J and that
there are four different ranks.
Next, we start filling the new population in the order of rank. Our maximum
capacity is 6, so all rank 1 {D, J} and rank 2 {E, H, K} fit. We cannot add rank 3
{A, C, I, L} because the population size would be 9. So far, our new population
consists of {D, J, E, H, K}. To choose which items from rank 3 continue forward,
we compute the crowding distance for the members of rank 3:

A C I L
1.67 ∞ 1.5 ∞

We would then add, in order {C, L, A, I}, but we only have room for one, so we
add C and complete this iteration with a new population of {D, J, E, H, K, C}.

9.4 Summary

Multiobjective optimization is particularly useful in quantifying trade-


off sensitivities between critical metrics. It is also useful when we seek
a family of potential solutions rather than a single solution. Some
scenarios where a family of solutions might be preferred include when
the models used in optimization are low fidelity and higher-fidelity
design tools will be applied or when more investigation is needed.
A multiobjective approach can produce candidate solutions for later
refinement.
The presence of multiple objectives changes what it means for a de-
sign to be optimal. A design is Pareto optimal when it is nondominated
by any other design. The weighted-sum method is perhaps the most
well-known approach, but it is not recommended because other meth-
ods are just as easy and much more efficient. The epsilon-constraint
method is still simple yet almost always preferable to the weighted-sum
method. It typically provides a better spaced Pareto front and can
resolve any nonconvex portions of the front. If we are willing to use a
more complex approach, the normal boundary intersection method is
even more efficient at capturing a well-spaced Pareto front.
Some gradient-free methods, such as a multiobjective GA, can also
generate Pareto fronts. If a gradient-free approach is a good fit in the
single objective version of the problem, adding multiple objectives can
be done with little extra cost. Although gradient-free methods are
sometimes associated with multiobjective problems, gradient-based
algorithms may be more effective for many multiobjective problems.

Problems

9.1 Answer true or false and justify your answer.

a. The solution of multiobjective optimization problems is


usually an infinite number of points.
b. It is advisable to include as many objectives as you can in
your problem formulation to make sure you get the best
possible design.
c. Multiobjective optimization can quantify trade-offs between
objectives and constraints.
d. If the objectives are separable, that means that they can be
minimized independently and that there is no Pareto front.
e. A point 𝐴 dominates point 𝐵 if it is better than 𝐵 in at least
one objective.
f. The Pareto set is the set of points that dominate all other
points in the objective space.
g. When a point is Pareto optimal, you cannot make either of
the objectives better.
h. The weighted-sum method obtains the Pareto front by solv-
ing optimization problems with different objective functions.
i. The epsilon-constraint method obtains the Pareto front by
constraining one objective and minimizing all the others.
j. The utopia point is the point where every objective has a
minimum value.
k. It is not possible to compute a Pareto front with a single-
objective optimizer.
l. Because GAs optimize by evolving a population of diverse
designs, they can be used for multiobjective optimization
without modification.

9.2 Which of the following function value pairs would be Pareto


optimal in a multiobjective minimization problem (there may be
more than one)?

• (20, 4)
• (18, 5)
• (34, 2)
• (19, 6)

9.3 You seek to minimize the following two objectives:


𝑓1(𝑥) = 𝑥1² + 𝑥2²
𝑓2(𝑥) = (𝑥1 − 1)² + 20(𝑥2 − 2)² .

Identify the Pareto front using the weighted-sum method with


11 evenly spaced weights: 0, 0.1, 0.2, . . . , 1. If some parts of the
front are underresolved, discuss how you might select weights
for additional points.
9.4 Repeat Prob. 9.3 with the epsilon-constraint method. Constrain 𝑓1
with 11 evenly spaced points between the anchor points. Contrast
the Pareto front with that of the previous problem, and discuss
whether improving the front with additional points will be easier
with the previous method or with this method.
9.5 Repeat Prob. 9.3 with the normal boundary intersection method
using the following 11 evenly spaced points:
𝑏 = [0, 1], [0.1, 0.9], [0.2, 0.8], . . . , [1, 0].

9.6 Consider a two-objective population with the following combined


parent/offspring population (objective values shown for all 16
members):

𝑓1 𝑓2
6.0 8.0
6.0 4.0
5.0 6.0
2.0 8.0
10.0 5.0
6.0 0.5
8.0 3.0
4.0 9.0
9.0 7.0
8.0 6.0
3.0 1.0
7.0 9.0
1.0 2.0
3.0 7.0
1.5 1.5
4.0 6.5

Develop code based on the NSGA-II procedure and determine


the new population at the end of this iteration. Detail the results
of each step during the process.
10 Surrogate-Based Optimization
A surrogate model, also known as a response surface model or metamodel,
is an approximate model of a functional output that represents a “curve
fit” to some underlying data. The goal of a surrogate model is to build
a model that is much faster to compute than the original function, but
that still retains sufficient accuracy away from known data points.
Surrogate-based optimization (SBO) performs optimization using
the surrogate model, as shown in Fig. 10.1. When used in optimization,
the surrogate might define the full optimization model (i.e., the inputs
are design variables, and the outputs are objective and constraint
functions), or the surrogate could be just a component of the overall
model. SBO is more targeted than the broader field of surrogate
modeling. Instead of aiming for a globally accurate surrogate, SBO just
needs the surrogate model to be accurate enough to lead the optimizer
to the true optimum.

Fig. 10.1 Surrogate-based optimization replaces the original model with a surrogate model in the optimization process.

In SBO, the surrogate model is usually improved during optimization
as needed but can sometimes be constructed beforehand and
remain fixed during optimization. Some optimization algorithms
interrogate both the surrogate model and the original model, an approach
that is sometimes called surrogate-assisted optimization.

By the end of this chapter you should be able to:

1. Identify and describe the steps in surrogate-based opti-


mization.
2. Understand and use sampling methods.
3. Optimize parameters for a given surrogate model.
4. Perform cross-validation as part of model selection.
5. Describe different surrogate-based optimization ap-
proaches and the infill process.


10.1 When to Use a Surrogate Model

There are various scenarios for which surrogate models are helpful.
One scenario is when the original model is computationally expensive.
Surrogate models can be queried with minimal computational cost, but
constructing them requires multiple evaluations of the original model.
Suppose the number of evaluations needed to build a sufficiently
accurate surrogate model is less than that needed to optimize the
original model directly. In that case, SBO may be a worthwhile option.
Constructing a surrogate model becomes even more compelling when
it is reused in multiple optimizations.
Surrogate models can be effective in handling noisy models
because they create a smooth representation of the noisy data. This can be
particularly advantageous when using gradient-based optimization.
One scenario that leads to both expensive evaluation and noisy
output is experimental data. When the model data are experimental
and the optimizer cannot query the experiment in an automated way,
we can construct a surrogate model based on the experimental data.
Then, the optimizer can query the surrogate model in the optimization.
Surrogate models are also helpful when we want to understand
the design space, that is, how the objective and constraints (outputs)
vary with respect to the design variables (inputs). By constructing a
continuous model over discrete data, we obtain functional relationships
that can be visualized more effectively.
When multiple sources of data are available, surrogate models can
fuse the data to build a single model. The data could come from
numerical models with different levels of fidelity or experimental data.
For example, surrogate models can calibrate numerical model data
using experimental data. This is helpful because experimental data are
usually much scarcer than numerical data. The same reasoning
applies to low- versus high-fidelity numerical data.
One potential issue with surrogate models is the curse of dimension-
ality, which refers to poor scalability with the number of inputs. The
larger the number of inputs, the more model evaluations are needed
to construct a surrogate model that is accurate enough. Therefore, the
reasons for using surrogate models cited earlier might not be enough if
the optimization problem has a large number of design variables.
The SBO process is shown in Fig. 10.2. First, we use sampling
methods to choose the initial points to evaluate the function or conduct
experiments. These points are sometimes referred to as training data.
Next, we build a surrogate model from the sampled points. We can
then perform optimization by querying the surrogate model. Based
on the results of the optimization, we include additional points in the
sample and reconstruct the surrogate (infill). We repeat this process
until some convergence criterion or a maximum number of iterations is
reached. In some procedures, infill is omitted; the surrogate is entirely
constructed upfront and not subsequently updated.

Fig. 10.2 Overview of the surrogate-based optimization procedure (sample → construct surrogate → perform optimization → converged? → infill and repeat, or done).

The optimization step can be performed using any of the methods
we covered previously. Because surrogate models are smooth and
their gradients are easily computed, gradient-based optimization is
preferred (see Chapter 4). However, some surrogate models can be
highly multimodal, in which case a global search is preferred, either
using gradient-based with multistart (see Tip 4.8) or a global
gradient-free method (see Chapter 7).

This chapter discusses sampling, constructing a surrogate, and
infill with some associated optimization strategies. We devote separate
sections to two surrogate modeling methods that are more involved
and widely used: kriging and deep neural nets. Many of the concepts
discussed in this chapter have a wide range of applications beyond
optimization.

Tip 10.1 Surrogate models can be useful within your model


In the context of SBO, we usually replace the function evaluation with a
surrogate model, as shown in Fig. 10.1. However, it might not be advantageous
to replace the whole model, but replace only part of that model instead. If a
component of the model is evaluated frequently and does not have too many
inputs, this approach might be worthwhile.
For example, when performing trajectory optimization of an aircraft, we
need to evaluate the lift and drag of the aircraft at each point of the trajectory.
This typically requires many points, and computing the lift and drag at each
point might be prohibitive. Therefore, it might be worthwhile to use surrogate
models that predict the lift and drag as functions of the angle of attack. If the
optimization design variables include parameters that change the lift and drag
characteristics, such as the wing shape, then the surrogate model needs to be
rebuilt at every optimization iteration.

10.2 Sampling

Sampling methods, also known as sampling plans, select the evaluation


points to construct the initial surrogate. These evaluation points must be
chosen carefully. A straightforward approach is full factorial sampling,
where we discretize each dimension and evaluate at all combinations
of the resulting grid. This is not efficient because it scales exponentially
with the number of input variables.

Example 10.1 Full factorial sampling is not scalable

Imagine a numerical model that computes the endurance of an aircraft.


Suppose we only wanted to understand how endurance varied with one
variable, such as wingspan. In that case, we could evaluate the model multiple
times and fit a curve that could predict endurance at wingspans that we did
not directly evaluate. If the model evaluation is computationally expensive, so
that evaluating many points is prohibitive, we might use a relatively coarse
sampling (say, 12 points). As long as the endurance changes are gradual across
the domain, fitting a spline through these few points can generate a useful
predictive model.
Now imagine that we have nine additional input variables that we care
about: wing area, taper ratio, wing root twist, wingtip twist, wing dihedral,
propeller spanwise position, battery size, tail area, and tail longitudinal position.
If we discretized all 10 variables with the same coarse 12 intervals each, a
full factorial sample would require 12¹⁰ model evaluations (approximately 62
billion) to assess all combinations. Thus, this type of sampling plan is not
scalable.

Example 10.1 highlights one of the significant challenges of sampling


methods: the curse of dimensionality. For SBO, even with better
sampling plans, using a large number of variables is costly. We need to
identify the most important or most influential variables. Knowledge
of the particular domain is helpful, as is exploring the magnitude of
the entries in a gradient vector (Chapter 6) across multiple points in the
domain. We can use various strategies to help us decide which variables
matter most, but for our purposes, we assume that the most influential
variables have already been determined so that the dimensionality is
reasonable. Having selected a set of variables, we are now interested in
sampling methods that characterize the design space of interest more
efficiently than full factorial sampling.
In addition to their use in SBO, the sampling strategies discussed in
this section are useful in many other applications, including various
applications discussed in this book: initializing a genetic algorithm
(Section 7.6), particle swarm optimization (Section 7.7) or a multistart
gradient-based algorithm (Tip 4.8), or choosing the points to run
in a Monte Carlo simulation (Section 12.3.3). Because the function
behavior at each sample is independent, we can efficiently parallelize
the evaluation of the functions.

10.2.1 Latin Hypercube Sampling


Latin hypercube sampling (LHS) is a popular sampling method that is
built on a random process but is more effective and efficient than pure
random sampling. Random sampling scales better than full factorial
searches, but it tends to exhibit clustering and requires many points
to reach the desired distribution (i.e., the law of large numbers). For
example, Fig. 10.3 compares 50 randomly generated points across uni-
form distributions in two dimensions versus Latin hypercube sampling.
In random sampling, each sample is independent of past samples, but
in LHS, we choose all samples beforehand to ensure a well-spread
distribution.

𝑥2 𝑥2
1 1

0.8 0.8

0.6 0.6

0.4 0.4

0.2 0.2
Fig. 10.3 Contrast between random
0 0 and Latin hypercube sampling with
0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 50 points using uniform distribu-
𝑥1 𝑥1
tions.
Random Latin hypercube sampling

To describe the methodology, consider two random variables with


bounds, whose design space we can represent as a square. Say we
wanted only eight samples; we could divide the design space into
eight intervals in each dimension, generating the grid of cells shown in
Fig. 10.4.
A full factorial search would identify a point in each cell, but that
does not scale well. To be as efficient as possible and still cover the
variation, we would want each row and each column to have one sample
in it. In other words, the projection of points onto each dimension
should be uniform. For example, the left side of Fig. 10.5 shows the
projection of a uniform LHS onto each dimension. We see that the
points create a uniformly spread histogram.

Fig. 10.4 A two-dimensional design space divided into eight intervals in each dimension.

Fig. 10.5 Example LHS with projections onto the axes: uniform distribution in each direction (left) and normal distribution in each direction (right).
The concept where one and only one point exists in any given row
or column is called a Latin square, and the generalization to higher
dimensions is called a Latin hypercube. There are many ways to achieve
this, and some choices are better than others. Consider the sampling
plan shown on the left of Fig. 10.6. This plan meets our criteria but
clearly does not fill the space and likely will not capture the relationships
between design parameters well. Alternatively, the right side of Fig. 10.6
has a sample in each row and column while also spanning the space
much more effectively.

Fig. 10.6 Contrasting sampling strategies that both fulfill the uniform projection requirement: a sampling strategy whose projection uniformly spans each dimension but does not fill the space well (left), and one that uniformly spans each dimension and fills the space more effectively (right).

LHS can be posed as an optimization problem where we seek to


maximize the distance between the samples. The constraint is that the
projection on each axis must follow a chosen probability distribution.
The specified distribution is often uniform, as in the previous examples,
but it could also be any distribution, such as a normal distribution, as
shown on the right side of Fig. 10.5. This optimization problem does not
have a unique solution, so random processes are used to determine the
combination of points. Additionally, points are not usually placed in
cell centers but at a random location within a given cell to allow for the
possibility of reaching any point in the domain. The advantage of the
LHS approach is that rather than relying on the law of large numbers to
fill out our chosen probability distributions, we enforce it as a constraint.
This method may still require many samples to characterize the design
space accurately, but it usually requires far fewer than pure random
sampling.
Instead of defining LHS as an optimization problem, a much simpler
approach is typically used in which we ensure one sample per interval,
but we rely on randomness to choose point combinations. Although
this does not necessarily yield a maximum spread, it works well in
practice and is simple to implement. Before discussing the algorithm,
we discuss how to generate other distributions besides just uniform
distributions.
We can convert from uniformly sampled points to an arbitrary
distribution using a technique called inversion sampling. Assume that
we want to generate samples 𝑥 from an arbitrary probability density
function (PDF) 𝑝(𝑥) or, equivalently, from the corresponding cumulative
distribution function (CDF) 𝑃(𝑥) (PDFs and CDFs are reviewed in Appendix A.9). The probability integral transform
states that for any continuous CDF, 𝑦 = 𝑃(𝑥), the variable 𝑦 is uniformly
distributed (a simple proof, but it is not shown here to avoid introducing
additional notation). The procedure is to randomly sample from a
uniform distribution (e.g., generate 𝑦), then compute the corresponding
𝑥 such that 𝑃(𝑥) = 𝑦, which we denote as 𝑥 = 𝑃⁻¹(𝑦). This latter step is
known as an inverse CDF, a percent-point function, or a quantile function.
This process is depicted in Fig. 10.7 for a normal distribution. This
same procedure allows us to use LHS with any distribution, simply by
generating the samples on a uniform distribution.
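As a brief illustration, the following Python sketch draws uniform samples and maps them through the inverse CDF of a standard normal distribution; the use of scipy.stats for the inverse CDF is an illustrative choice.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    y = rng.random(1000)   # uniform samples on [0, 1)
    x = norm.ppf(y)        # inverse CDF maps them to normally distributed samples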

Fig. 10.7 An example of inversion sampling with a normal distribution. A few uniform samples are shown on the 𝑦-axis. The points are evaluated by the inverse CDF, represented by the arrows passing through the CDF for a normal distribution. If enough samples are drawn, the resulting distribution will be the PDF of a normal distribution.

A typical algorithm is described in Alg. 10.1. For each axis, we partition
the CDF in 𝑛_𝑠 evenly spaced regions (evenly spaced along the CDF,
which means that each region is equiprobable). We generate a random
number within each evenly spaced interval, where 0 corresponds to the
bottom of the interval and 1 to the top. We then evaluate the inverse
CDF as described previously so that the points match our specified
distribution (the CDF for a uniform distribution is just a line 𝑃(𝑥) = 𝑥,
so the output is not changed). Next, the column of points for that axis
is randomly permuted. This process is repeated for each axis according
to its specified probability distribution.

Algorithm 10.1 Latin hypercube sampling

Inputs:
𝑛 𝑠 : Number of samples
𝑛 𝑑 : Number of dimensions
𝑃 = {𝑃1 , . . . , 𝑃𝑛 𝑑 }: (optionally) A set of cumulative distribution functions
Outputs:
𝑋 = {𝑥1 , . . . , 𝑥 𝑛 𝑠 }: Set of sample points

for 𝑗 = 1 to 𝑛 𝑑 do
for 𝑖 = 1 to 𝑛 𝑠 do
𝑉_𝑖𝑗 = 𝑖/𝑛_𝑠 − 𝑅_𝑖𝑗/𝑛_𝑠 where 𝑅_𝑖𝑗 ∈ U[0, 1]    Randomly choose a value in each equally spaced cell from a uniform distribution
end for
𝑋_{∗𝑗} = 𝑃_𝑗⁻¹(𝑉_{∗𝑗}) where 𝑃_𝑗 is a CDF    Evaluate inverse CDF
Randomly permute the entries of this column 𝑋∗𝑗 Alternatively, permute the
indices 1 . . . 𝑛 𝑠 in the prior for loop
end for
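A minimal Python sketch of Alg. 10.1 for uniform distributions on [0, 1] follows; to target another distribution, pass the corresponding column through its inverse CDF (e.g., scipy.stats.norm.ppf). The function name and interface are illustrative assumptions.

    import numpy as np

    def latin_hypercube(n_s, n_d, rng=None):
        rng = np.random.default_rng() if rng is None else rng
        X = np.zeros((n_s, n_d))
        for j in range(n_d):
            # one random value in each of the n_s equally spaced (equiprobable) cells
            cells = (np.arange(n_s) + rng.random(n_s)) / n_s
            X[:, j] = rng.permutation(cells)   # random pairing across dimensions
        return X

    X = latin_hypercube(8, 2)   # eight samples in two dimensions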

An example using Alg. 10.1 for eight points is shown in Fig. 10.8.
In this example, we use a uniform distribution for 𝑥1 and a normal
distribution for 𝑥2. There is one point in each equiprobable interval. As
stated before, randomness does not necessarily ensure a good spread,
but optimizing the spread is difficult because the function is highly
multimodal. Instead, to encourage high spread, we could generate
multiple Latin hypercube samples with this algorithm and select the
one with the largest sum of the distance between points.

Fig. 10.8 An example from the LHS algorithm showing a uniform distribution in 𝑥1 and a Gaussian distribution in 𝑥2 with eight sample points. The equiprobable bins are shown as grid lines.

10.2.2 Low-Discrepancy Sequences

Low-discrepancy sequences generate deterministic sequences of points
that are well spread. Each new point added in the sequence maintains
low discrepancy—discrepancy refers to the variation in point density
throughout the domain. Hence, a low-discrepancy set is close to even
density (i.e., well spread). These sequences are called quasi-random


because they often serve as suitable replacements for applications
that use random sequences, but they are not random or even pseudo-
random.
An advantage of low-discrepancy sequences over LHS is that most
of the approaches do not require selecting all the samples beforehand.
These methods generate deterministic sequences; in other words, we
generate the same sequence of points whether we choose them before-
hand or add more later. This property is particularly advantageous in
iterative procedures. We may choose an initial sampling plan and add
more well-spread points to the sample later. This is not necessarily an
advantage for the methods of this chapter because the optimization
drives the selection of new points rather than continuing to seek spread-
out samples. However, this feature is useful for other applications, such
as quadrature, Monte Carlo simulations, and other problems where an
iterative sampling process is used to determine statistical convergence
(see Section 12.3). Low-discrepancy sequences add more points that
are well spread without having to throw out the existing samples. Even
in non-iterative procedures, these sampling strategies can be a useful
alternative.
Several of these sequences are built on generalizations of the one-
dimensional van der Corput sequence to more than one dimension. Such
sequences are defined by representing an integer 𝑖 in a given integer
base 𝑏 (the van der Corput sequence is always base 2):

    𝑖 = 𝑎_0 + 𝑎_1 𝑏 + 𝑎_2 𝑏² + . . . + 𝑎_𝑟 𝑏^𝑟 where 𝑎 ∈ [0, 𝑏 − 1] . (10.1)


If the base is 2, this is just a standard binary sequence (Section 7.6.1).
After determining the relevant coefficients (𝑎 𝑗 ), the 𝑖th element of the
sequence is
    𝜙_𝑏(𝑖) = 𝑎_0/𝑏 + 𝑎_1/𝑏² + 𝑎_2/𝑏³ + . . . + 𝑎_𝑟/𝑏^(𝑟+1) . (10.2)
An algorithm to generate an element in this sequence, also known as a
radical inverse function for base 𝑏, is given in Alg. 10.2.
For base 2, the sequence is as follows:
    1/2, 1/4, 3/4, 1/8, 5/8, 3/8, 7/8, 1/16, 9/16, . . . (10.3)
The interval is divided in half, and then each subinterval is also halved,
with new points spreading out across the domain (see Fig. 10.9).
Similarly, for base 3, the interval is split into thirds, then each
subinterval is split into thirds, and so on:
    1/3, 2/3, 1/9, 4/9, 7/9, 2/9, 5/9, 8/9, 1/27, . . . (10.4)
Fig. 10.9 Van der Corput sequence.

Algorithm 10.2 Radical inverse function

Inputs:
𝑖: 𝑖 th point in sequence
𝑏: Base (integer)
Outputs:
𝜙: Generated point

𝑏𝑑 = 𝑏 Base used in denominator


𝜙=0
while 𝑖 > 0 do
𝑎 = mod(𝑖, 𝑏)    Coefficient
𝜙 = 𝜙 + 𝑎/𝑏_𝑑
𝑏_𝑑 = 𝑏_𝑑 · 𝑏    Increase exponent in denominator
𝑖 = Int(𝑖/𝑏)    Integer division
end while 1
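A minimal Python sketch of the radical inverse function follows, along with a Halton point built from it (Eq. 10.5); the names are illustrative assumptions.

    def radical_inverse(i, b):
        phi, denom = 0.0, b
        while i > 0:
            a = i % b          # next digit of i in base b
            phi += a / denom
            denom *= b
            i //= b
        return phi

    # the first base-2 values reproduce Eq. 10.3: 1/2, 1/4, 3/4, 1/8, ...
    print([radical_inverse(i, 2) for i in range(1, 5)])   # [0.5, 0.25, 0.75, 0.125]

    def halton_point(i, bases=(2, 3)):
        return [radical_inverse(i, b) for b in bases]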

Halton Sequence

A Halton sequence uses pairwise prime numbers (larger than 1) for the
base of each dimension of the problem.† The 𝑖th point in the Halton
sequence is

    [ 𝜙(𝑖, 𝑏_1), 𝜙(𝑖, 𝑏_2), . . . , 𝜙(𝑖, 𝑏_{𝑛_𝑥}) ] , (10.5)

where the 𝑏_𝑗 set is pairwise prime. As an example in two dimensions,
Fig. 10.10 shows 30 generated points of the Halton sequence where 𝑥1
uses base 2, and 𝑥2 uses base 3, and then a subsequent 20 generated
points are added (in another color), showing the reuse of existing points.
If the dimensionality of the problem is high, then some of the
base combinations lead to points that are highly correlated and thus
undesirable for a sampling plan. For example, the left of Fig. 10.11
shows 50 generated points where 𝑥1 uses base 17, and 𝑥2 uses base 19.
To avoid this issue, we can use a scrambled Halton sequence.

† A set of numbers is pairwise prime if there is no positive integer that can evenly divide any pair of them, except 1. Typically, though, we just use the first 𝑛_𝑥 prime numbers.

Fig. 10.10 Halton sequence with base 2 for 𝑥1 and base 3 for 𝑥2. First, 30 points are selected (in blue), and then 20 points are added (in red). These would be identical to 50 points chosen at once.

Fig. 10.11 Halton sequence with base 17 for 𝑥1 and base 19 for 𝑥2: standard Halton sequence (left) and scrambled Halton (right).

Scrambling can be accomplished by generating a permutation array


containing a random permutation of the integers 𝑝 = [0, 1, . . . , 𝑏 − 1].
Then, rather than using the integers 𝑎 directly in Eq. 10.2, we use
the entries of 𝑎 as the indices of the permutation array. If 𝑝 is the
permutation array, we have:
    𝜙_𝑏(𝑖) = 𝑝_{𝑎_0}/𝑏 + 𝑝_{𝑎_1}/𝑏² + 𝑝_{𝑎_2}/𝑏³ + . . . + 𝑝_{𝑎_𝑟}/𝑏^(𝑟+1) . (10.6)
The permutation array is fixed for all digits 𝑎 and for all 𝑛 𝑝 points in the
domain. The right side of Fig. 10.11 shows the same example (with base
17 and base 19) but with scrambling to weaken the strong correlations.
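A minimal sketch of the scrambled variant (Eq. 10.6), where the digits index into a fixed random permutation of 0, ..., b − 1; the permutation handling shown here is an illustrative assumption.

    import numpy as np

    def scrambled_radical_inverse(i, b, perm):
        # perm is a permutation of the integers 0..b-1, fixed for all points
        phi, denom = 0.0, b
        while i > 0:
            a = i % b
            phi += perm[a] / denom
            denom *= b
            i //= b
        return phi

    rng = np.random.default_rng(0)
    perm17 = rng.permutation(17)
    x1 = [scrambled_radical_inverse(i, 17, perm17) for i in range(1, 51)]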

Hammersley Sequence
The Hammersley sequence is closely related to the Halton sequence.
However, it provides better spacing if we know beforehand the number
of points (𝑛_𝑝) that we are going to use. This approach only needs 𝑛_𝑥 − 1
bases (still pairwise prime) because the first dimension uses regular
spacing:

    [ 𝑖/𝑛_𝑝, 𝜙(𝑖, 𝑏_1), 𝜙(𝑖, 𝑏_2), . . . , 𝜙(𝑖, 𝑏_{𝑛_𝑥−1}) ] . (10.7)

Because this sequence needs to know the number of points (𝑛_𝑝) beforehand,
it is less useful for iterative procedures. However, the implementation
is straightforward, and so it may still be a useful alternative
to LHS. Figure 10.12 shows 50 points generated from a Hammersley
sequence where the 𝑥2-axis uses base 2.

Fig. 10.12 Hammersley sequence with base 2 for the 𝑥2-axis.

Other Sequences

A wide variety of other low-discrepancy sequences exist. The Faure
sequence is similar to the Halton, but it uses the same base for all
dimensions and uses permutation scrambling for each dimension
instead.167,168 Sobol sequences use base 2 sequences but with a reordering
based on “direction numbers”.169 Niederreiter sequences are effectively
a generalization of Sobol sequences to other bases.170

167. Faure, Discrépance des suites associées à un système de numération (en dimension s), 1982.
168. Faure and Lemieux, Generalized Halton sequences in 2008: A comparative study, 2009.
169. Sobol, On the distribution of points in a cube and the approximate evaluation of integrals, 1967.
170. Niederreiter, Low-discrepancy and low-dispersion sequences, 1988.

10.3 Constructing a Surrogate

Once sampling is completed, we have a list of data points, often called
training data:

    ( 𝑥^(𝑖), 𝑓^(𝑖) ) , (10.8)

where 𝑥^(𝑖) is an input vector from the sampling plan, and 𝑓^(𝑖) contains
the corresponding outputs from evaluating the model: 𝑓^(𝑖) = 𝑓(𝑥^(𝑖)).
We seek to construct a surrogate model from this data set. Surrogate
models can be based on physics, mathematics, or a combination of
the two. Incorporating known physics into a model is often desirable
to improve model accuracy. However, functional relationships are
unknown for many complex problems, and a data-driven mathematical
model can be more effective.
Surrogate-based models can be based on interpolation or regression,
as illustrated in Fig. 10.13. Interpolation builds a function that exactly
matches the provided training data. Regression models do not try to
match the training data points exactly; instead, they minimize the
error between a smooth trend function and the training data. The
nature of the training data can help decide between these two types
of surrogate models. Regression is particularly useful when the data
are noisy. Interpolatory models may produce undesirable oscillations
when fitting the noise. In contrast, regression models can find a smooth
function that is less sensitive to the noise. Interpolation is useful when
the data are highly multimodal (and not noisy). This is because a
regression model may smooth over variations that are actually physical,
whereas an interpolatory model can accurately capture those variations.

Fig. 10.13 Interpolation models match the training data at the provided points, whereas regression models minimize the error between the training data and a function with an assumed trend.
whereas an interpolatory model can accurately capture those variations.
There are two main steps involved in either type of surrogate model.
First, we select a set of basis functions, which represent the form for
the model. Second, we determine the model parameters that provide
the best fit to the provided data. Determining the model parameters
is an optimization problem, which we discuss first. We discuss linear
regression and nonlinear regression, which are techniques for choosing
model parameters for a given set of basis functions. Next, we discuss
cross validation, which is a critical technique for selecting an appropriate
model form. Finally, we discuss common basis functions.

10.3.1 Linear Least Squares Regression


A linear regression model does not mean that the surrogate is linear in
the input variables but rather that the model is linear in its coefficients
(i.e., linear in the parameters we are estimating). For example, the
following equation is a two-dimensional linear regression model, where
we use 𝑓ˆ to represent our estimated model of the function 𝑓 :

    𝑓ˆ(𝑥) = 𝑤_1 𝑥_1² + 𝑤_2 𝑥_1 𝑥_2 + 𝑤_3 exp(𝑥_2) + 𝑤_4 𝑥_1 + 𝑤_5 . (10.9)

This function is highly nonlinear, but it is classified as a linear regression


model because the regression seeks to choose the appropriate values
for the coefficients 𝑤 𝑖 (and the function is linear in 𝑤).
A general linear regression model can be expressed as
    𝑓ˆ = 𝑤ᵀ𝜓(𝑥) = Σ_𝑖 𝑤_𝑖 𝜓_𝑖(𝑥) , (10.10)

where 𝑤 is a vector of weights, and 𝜓 is a vector of basis functions.


In this section, we assume that the basis functions are provided. In
general, the basis functions can be any set of functions that we choose
(and typically they are nonlinear). It is usually desirable for these
functions to be orthogonal.

Example 10.2 Data fitting can be posed as a linear regression model

Consider a quadratic fit: 𝑓ˆ(𝑥) = 𝑎𝑥² + 𝑏𝑥 + 𝑐. This can be posed as a linear
regression model (Eq. 10.10) where the coefficients we wish to estimate are
𝑤 = [𝑎, 𝑏, 𝑐] and the basis functions are 𝜓 = [𝑥², 𝑥, 1]. For a more general
𝑛-dimensional polynomial model, the basis functions would be polynomials
with terms combining the dependencies on all input variables 𝑥 up to a certain
order. For example, for two input variables up to second order, the basis
functions would be 𝜓 = [1, 𝑥1, 𝑥2, 𝑥1𝑥2, 𝑥1², 𝑥2²], and 𝑤 would consist of six
coefficients.

The coefficients are chosen to minimize the error between our
predicted function values 𝑓ˆ and the actual function values 𝑓^(𝑖). Because
we want to minimize both positive and negative errors, we minimize
the sum of the squares of the errors (or a weighted sum of squared
errors):∗

    minimize_𝑤 Σ_𝑖 [ 𝑓ˆ(𝑤; 𝑥^(𝑖)) − 𝑓^(𝑖) ]² . (10.11)

∗ The choice of minimizing the sum of the squares rather than the sum of the absolute values or some other metric is not arbitrary. The motivation for using the sum of the squares is discussed further in the following section.

The solution to this optimization problem is called a least squares solution.


If the regression model is linear, we can simplify this objective and
solve the problem analytically. Recall that 𝑓ˆ = 𝜓 | 𝑤, so the objective
can be written as
    minimize_𝑤 Σ_𝑖 [ 𝜓(𝑥^(𝑖))ᵀ 𝑤 − 𝑓^(𝑖) ]² . (10.12)

We can express this in matrix form by defining the following:

    Ψ = [ — 𝜓(𝑥^(1))ᵀ —
          — 𝜓(𝑥^(2))ᵀ —
                ⋮
          — 𝜓(𝑥^(𝑛_𝑠))ᵀ — ] . (10.13)

The matrix Ψ is of size (𝑛 𝑠 × 𝑛𝑤 ), where 𝑛 𝑠 is the number of samples,


𝑛𝑤 the number of parameters in 𝑤, and 𝑛 𝑠 ≥ 𝑛𝑤 . This means that there
should be more equations than unknowns or that we have sampled
more points than the number of coefficients we need to estimate. This
should make sense because our surrogate function is only an assumed
form and generally not an exact fit to the actual underlying function.
Thus, we need more data to create a good fit.
Then the optimization problem can be written in matrix form as:

    minimize_𝑤 ‖Ψ𝑤 − 𝑓‖₂² . (10.14)

Expanding the squared norm (i.e., ‖𝑥‖₂² = 𝑥ᵀ𝑥) gives

    minimize_𝑤 𝑤ᵀΨᵀΨ𝑤 − 2𝑓ᵀΨ𝑤 + 𝑓ᵀ𝑓 . (10.15)

We can omit the last term from the objective because our optimization
variables are 𝑤, and the last term has no 𝑤 dependence:

    minimize_𝑤 𝑤ᵀΨᵀΨ𝑤 − 2𝑓ᵀΨ𝑤 . (10.16)

This fits the general form for an unconstrained quadratic programming


(QP) problem, as shown in Section 5.5.1:

    minimize_𝑥 (1/2) 𝑥ᵀ𝑄𝑥 + 𝑞ᵀ𝑥 , (10.17)

where

    𝑄 = 2ΨᵀΨ (10.18)
    𝑞 = −2Ψᵀ𝑓 . (10.19)

Recall that an equality constrained QP (of which unconstrained is a


subset) has an analytic solution as long as the QP is positive definite.
In our case, we can show that 𝑄 is positive definite as long as Ψ is full
rank:
    𝑥ᵀ𝑄𝑥 = 2𝑥ᵀΨᵀΨ𝑥 = 2‖Ψ𝑥‖₂² > 0 . (10.20)
This is not surprising because the objective is a sum of squared values.
Referring back to the solution in Section 5.5.1, and removing the portions
associated with the constraints, the solution is

𝑄𝑥 = −𝑞 . (10.21)

In our case, this becomes

    2ΨᵀΨ𝑤 = 2Ψᵀ𝑓 . (10.22)

After simplifying, we have an analytic solution for the weights:

    𝑤 = (ΨᵀΨ)⁻¹ Ψᵀ𝑓 . (10.23)

We sometimes express the linear relationship in Eq. 10.12 as Ψ𝑤 = 𝑓 ,


although the case where there are more equations than unknowns does
not typically have a solution (the problem is overdetermined). Instead, we
seek the solution that minimizes the error ‖Ψ𝑤 − 𝑓‖₂, that is, Eq. 10.23.
The quantity Ψ† = (ΨᵀΨ)⁻¹Ψᵀ is called the pseudoinverse of Ψ (or more
specifically, the Moore–Penrose pseudoinverse), and thus we can write
Eq. 10.23 in the more compact form

𝑤 = Ψ† 𝑓 . (10.24)

This allows for a similar form to solving a linear system of equations


where an inverse would be used instead. In solving both a linear system
and the linear least-squares equation (Eq. 10.23), we do not explicitly
invert a matrix. For linear least squares, a QR factorization is commonly
used for improved numerical conditioning as compared to solving
Eq. 10.23 directly.

Tip 10.2 Least squares is not the same as a linear system solution

In MATLAB or Julia, the backslash operator is overloaded, so you can solve


an overdetermined system of equations 𝐴𝑥 = 𝑏 with x = A\b, but keep in mind
that for an 𝐴 of size (𝑚 × 𝑛), where 𝑚 > 𝑛, this syntax performs a least-squares
solution, not a linear system solution as it would for a full rank (𝑛 × 𝑛) system.
The overloading of this operator is generally not used in other languages; for
example, in Python, rather than using numpy.linalg.solve, you would use
numpy.linalg.lstsq.

Example 10.3 Linear regression

Consider the quadratic fit discussed in Ex. 10.2. We are provided the data
points, 𝑥 and 𝑓, shown as circles in Fig. 10.14. From these data, we construct
the matrix Ψ for our basis functions as follows:

    Ψ = [ (𝑥^(1))²     𝑥^(1)     1
          (𝑥^(2))²     𝑥^(2)     1
              ⋮           ⋮      ⋮
          (𝑥^(𝑛_𝑠))²   𝑥^(𝑛_𝑠)   1 ]

We can then solve for the coefficients 𝑤 using the linear least squares solution
(Eq. 10.23). Substituting the coefficients and respective basis functions into
Eq. 10.10, we obtain the surrogate model,

    𝑓ˆ(𝑥) = 𝑤_1 𝑥² + 𝑤_2 𝑥 + 𝑤_3 ,

which is also plotted in Fig. 10.14 as a solid line.

Fig. 10.14 Linear least squares example with a quadratic fit on a one-dimensional function.
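A minimal Python sketch of this quadratic fit using a least-squares solver follows; the data values are illustrative assumptions (they are not the data plotted in Fig. 10.14).

    import numpy as np

    x = np.array([-2.0, -1.5, -1.0, 0.0, 0.5, 1.0, 1.5, 2.0])
    f = np.array([12.0, 6.5, 3.0, 0.5, 1.2, 3.5, 7.0, 11.5])

    Psi = np.column_stack([x**2, x, np.ones_like(x)])   # basis functions [x^2, x, 1]
    w, *_ = np.linalg.lstsq(Psi, f, rcond=None)         # least-squares solution (Eq. 10.23)

    f_hat = lambda xq: w[0] * xq**2 + w[1] * xq + w[2]  # resulting surrogate model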

A common variation of this approach is to use regularized least


squares, which adds a term in the objective. The new objective is

    minimize_𝑤 ‖Ψ𝑤 − 𝑓‖₂² + 𝜇‖𝑤‖₂² , (10.25)

where 𝜇 is a weight assigned to the second term. This second term


attempts to reduce the magnitudes of the entries in 𝑤 while balancing
the fit in the first term. This approach is particularly beneficial if the
data contain strong outliers or are particularly noisy. The rationale for
this approach is that we may want to accept a higher error (quantified
by the first term) in exchange for smaller values for the coefficients.
This generally leads to simpler, more generalizable models (e.g., by
reducing the influence of some terms). A related extension uses a
second term of the form ‖𝑤 − 𝑤_0‖₂². The idea is that we want a good fit,
while maintaining parameters that are close to some known nominal
values 𝑤_0.
A regularized least squares problem can be solved with the same
linear least squares approach. We can write the previous problem using
concatenated matrices and vectors (stacking Ψ on top of 𝜇𝐼, and 𝑓 on top of a zero vector):

    minimize_𝑤 ‖ [Ψ; 𝜇𝐼] 𝑤 − [𝑓; 0] ‖₂² . (10.26)

This is of the same linear form as before (‖𝐴𝑤 − 𝑏‖₂), so we can reuse
the solution (Eq. 10.23):

    𝑤* = (𝐴ᵀ𝐴)⁻¹ 𝐴ᵀ𝑏 = (ΨᵀΨ + 𝜇𝐼)⁻¹ Ψᵀ𝑓 . (10.27)
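A minimal sketch of the regularized solution in Eq. 10.27, formed here from the normal equations; Psi, f, and mu are assumed to be given.

    import numpy as np

    def regularized_least_squares(Psi, f, mu):
        n_w = Psi.shape[1]
        # (Psi^T Psi + mu I) w = Psi^T f
        return np.linalg.solve(Psi.T @ Psi + mu * np.eye(n_w), Psi.T @ f)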

For linear least squares (with or without regularization), we have


seen that the optimization problem of determining the appropriate
coefficients can be solved analytically. We can also add linear constraints
to the problem (equality or inequality), and the optimization remains
a QP. In that case, the problem is still convex. Although it does not
generally have an analytic solution, we can still quickly find the global
optimum. This topic is discussed in Section 11.3.

10.3.2 Maximum Likelihood Interpretation


This section presents an alternative motivation for the sum of squared
error approach used in the previous section. It is somewhat of a
diversion from the present discussion, but it will be helpful in several
results later in this chapter. In the previous section, we assumed linear
models of the form
    𝑓ˆ^(𝑖) = 𝑤ᵀ𝑥^(𝑖) , (10.28)
where we use 𝑥 for simplicity in writing instead of 𝜓(𝑥 (𝑖) ). The deriva-
tion remains the same for any arbitrary function of 𝑥. The function 𝑓ˆ is
just a model, so we could say that it is equal to our actual observations
𝑓 (𝑖) plus an error term:

    𝑓^(𝑖) = 𝑤ᵀ𝑥^(𝑖) + 𝜀^(𝑖) , (10.29)

where 𝜀 captures the error associated with the 𝑖th data point. We
assume that the error is normally distributed with mean zero and a
standard deviation of 𝜎:
    𝑝(𝜀^(𝑖)) = 1/(𝜎√(2𝜋)) exp( −(𝜀^(𝑖))²/(2𝜎²) ) . (10.30)

The use of Gaussian uncertainty can be motivated by the central


limit theorem, which states that for large sample sizes, the sum of
random variables tends toward a Gaussian distribution regardless of
the original distribution associated with each variable. In other words,
the sample distribution of the sample mean is approximately Gaussian.
Because we assume the error terms to be the sum of various random,
independent perturbations, then by the central limit theorem, we expect
the errors to be normally distributed.
We now substitute Eq. 10.29 into Eq. 10.30 to show the probability
of 𝑓 conditioned on 𝑥 and parameterized by 𝑤:
    𝑝(𝑓^(𝑖) | 𝑥^(𝑖); 𝑤) = 1/(𝜎√(2𝜋)) exp( −(𝑓^(𝑖) − 𝑤ᵀ𝑥^(𝑖))²/(2𝜎²) ) . (10.31)
Once we include all the data points 𝑥 (𝑖) , we would like to compute the
probability of observing 𝑓 conditioned on the inputs 𝑥 for a given set
of parameters in 𝑤. We call this the likelihood function 𝐿(𝑤). In this case,
assuming all errors are independent, the total probability for observing
the outputs is the product of the probability of observing each output:
    𝐿(𝑤) = ∏_{𝑖=1}^{𝑛_𝑠} 1/(𝜎√(2𝜋)) exp( −(𝑓^(𝑖) − 𝑤ᵀ𝑥^(𝑖))²/(2𝜎²) ) . (10.32)

Now we can pose this as an optimization problem where we wish to


find the parameters 𝑤 that maximize the likelihood function; in other
words, we maximize the probability that our model is consistent with
the observed data. Because the objective is a product of multiple terms,
it is helpful to take the logarithm of the objective. Maximizing 𝐿 or
maximizing ℓ = ln(𝐿) does not change the solution to the problem but
makes it easier to solve. We call this the log likelihood function:
    ℓ(𝑤) = ln( ∏_{𝑖=1}^{𝑛_𝑠} 1/(𝜎√(2𝜋)) exp( −(𝑓^(𝑖) − 𝑤ᵀ𝑥^(𝑖))²/(2𝜎²) ) ) (10.33)
         = Σ_{𝑖=1}^{𝑛_𝑠} ln( 1/(𝜎√(2𝜋)) exp( −(𝑓^(𝑖) − 𝑤ᵀ𝑥^(𝑖))²/(2𝜎²) ) ) (10.34)
         = 𝑛_𝑠 ln( 1/(𝜎√(2𝜋)) ) − Σ_{𝑖=1}^{𝑛_𝑠} (𝑓^(𝑖) − 𝑤ᵀ𝑥^(𝑖))²/(2𝜎²) . (10.35)
The first term has no dependence on 𝑤, and so when optimizing ℓ(𝑤),
it is just a scalar term that can be removed as follows:

    maximize_𝑤 ℓ(𝑤) ⇒ (10.36)
    maximize_𝑤 ( − Σ_{𝑖=1}^{𝑛_𝑠} (𝑓^(𝑖) − 𝑤ᵀ𝑥^(𝑖))²/(2𝜎²) ) ⇒ (10.37)
    minimize_𝑤 Σ_{𝑖=1}^{𝑛_𝑠} (𝑓^(𝑖) − 𝑤ᵀ𝑥^(𝑖))²/(2𝜎²) . (10.38)

Similarly, the denominator of the second term has no dependence on 𝑤


and is just a scalar that can also be removed:
    maximize_𝑤 ℓ(𝑤) ⇒ minimize_𝑤 Σ_{𝑖=1}^{𝑛_𝑠} ( 𝑓^(𝑖) − 𝑤ᵀ𝑥^(𝑖) )² . (10.39)

Thus, maximizing the log likelihood function (maximizing the prob-


ability of observing the data) is equivalent to minimizing the sum of
squared errors (the least-squares formulation). This derivation provides
another motivation for using the sum of squared errors in regression.

10.3.3 Nonlinear Least Squares Regression


A surrogate model can be nonlinear in the coefficients. For example,
building on the simple function shown earlier in Eq. 10.9 we can add
the coefficients 𝑤 6 and 𝑤 7 as follows:

    𝑓ˆ(𝑥) = 𝑤_1 𝑥_1^{𝑤_6} + 𝑤_2 𝑥_1 𝑥_2 + 𝑤_3 exp(𝑤_7 𝑥_2) + 𝑤_4 𝑥_1 + 𝑤_5 . (10.40)

The addition of 𝑤 6 and 𝑤 7 makes this function nonlinear in the coeffi-


cients. We can still estimate these parameters, but not analytically.
Equation 10.11 is still relevant; we still seek to minimize the sum of
the squared errors:
    minimize_𝑤 Σ_𝑖 [ 𝑓ˆ(𝑤; 𝑥^(𝑖)) − 𝑓^(𝑖) ]² . (10.41)

For general nonlinear regression models, we cannot write a more


specific form for 𝑓ˆ as we could for the linear case, so we leave the
objective as it is.
This is a nonlinear least-squares problem. The optimization problem is
unconstrained, so any of the methods from Chapter 4 apply. We could
also easily add constraints, for example, bounds on parameters, known
10 Surrogate-Based Optimization 389

relationships between parameters, and so forth, and use the methods


from Chapter 5.
In contrast to the linear case, we need to provide a starting point, our
best guess for the parameters, and we may need to deal with scaling,
noise, multimodality, or any of the other potential challenges of general
nonlinear optimization. Still, this is a relatively straightforward problem
within the broader realm of engineering optimization problems.
Although the methods of Chapter 4 can be used if the problem
remains unconstrained, there are more specialized methods available
that take advantage of the specific structure of the problem. One
popular approach to solving the nonlinear least-squares problem is the
Levenberg–Marquardt algorithm, which we discuss in this section.
As a stepping stone towards the Levenberg–Marquardt algorithm,
we first derive the Gauss–Newton algorithm, which is a modification
of Newton’s method (Section 3.8) for solving nonlinear least-squares
problems. One way to think of this algorithm is as an iterative lin-
earization of the residual. Once it is linearized, we can apply the same
methods we derived for linear least squares. We linearize the residual
𝑟 = 𝑓ˆ(𝑤) − 𝑓 at iteration 𝑘 as

𝑟(𝑤) ≈ 𝑟(𝑤 𝑘 ) + 𝐽𝑟 Δ𝑤 , (10.42)

where Δ𝑤 is the step and the Jacobian is

    𝐽_{𝑟,𝑖𝑗} = ∂𝑟_𝑖 / ∂𝑤_𝑗 . (10.43)

After the linearization, the objective becomes

    minimize_{Δ𝑤} ‖𝐽_𝑟 Δ𝑤 + 𝑟‖₂² . (10.44)

This is now the same form as linear least squares (Eq. 10.14), so we can
reuse its solution (Eq. 10.23) to solve for the step
    Δ𝑤 = −(𝐽_𝑟ᵀ𝐽_𝑟)⁻¹ 𝐽_𝑟ᵀ 𝑟 . (10.45)

We now have an update formula for the coefficients at each iteration:


    𝑤_{𝑘+1} = 𝑤_𝑘 − (𝐽_𝑟ᵀ𝐽_𝑟)⁻¹ 𝐽_𝑟ᵀ 𝑟 . (10.46)

An alternative derivation for this formula is to consider a Newton


step for an unconstrained optimizer. The objective is 𝑒 = Σ_𝑖 𝑟_𝑖², and the
formula for a Newton step (Section 4.4.3) is

    𝑤_{𝑘+1} = 𝑤_𝑘 − 𝐻_𝑒⁻¹ ∇𝑒 . (10.47)



The gradient is
    ∇𝑒_𝑗 = 2 𝑟_𝑖 ∂𝑟_𝑖 / ∂𝑤_𝑗 , (10.48)

or in matrix form:

    ∇𝑒 = 2 𝐽_𝑟ᵀ 𝑟 . (10.49)

The Hessian in index notation is


    𝐻_{𝑒,𝑗𝑘} = 2 (∂𝑟_𝑖/∂𝑤_𝑗)(∂𝑟_𝑖/∂𝑤_𝑘) + 2 𝑟_𝑖 ∂²𝑟_𝑖/(∂𝑤_𝑗 ∂𝑤_𝑘) . (10.50)
We can write it in matrix form as follows:

    𝐻_𝑒 = 2 𝐽_𝑟ᵀ 𝐽_𝑟 + 2 𝑟 𝐻_𝑟 . (10.51)

If we neglect the second term in the Hessian, then the Newton update
is:
    𝑤_{𝑘+1} = 𝑤_𝑘 − (1/2)(𝐽_𝑟ᵀ𝐽_𝑟)⁻¹ (2 𝐽_𝑟ᵀ 𝑟)
            = 𝑤_𝑘 − (𝐽_𝑟ᵀ𝐽_𝑟)⁻¹ 𝐽_𝑟ᵀ 𝑟 , (10.52)
which is the same update as before.
Thus, another interpretation of this method is that a Gauss–Newton
step is a modified Newton step where the second derivatives of the
residual are neglected (and thus, a quasi-Newton approach to estimate
second derivatives is not needed). This method is particularly effective
near convergence because as 𝑟 → 0 (i.e., as we approach the solution to
our residual minimization), the neglected term also approaches zero.
The appeal of this approach is that we can often obtain an accurate
prediction for the Hessian using only the first derivatives because of
the known structure of the objective.
When the second term is not small, then the Gauss–Newton step
may be too inaccurate. We could use a line search, but the Levenberg–
Marquardt algorithm utilizes a different strategy. The idea is to regular-
ize the problem as discussed in the previous section or, in other words,
provide the ability to dampen the steps as needed. Each linearized
subproblem becomes

    minimize_{Δ𝑤} ‖𝐽_𝑟 Δ𝑤 + 𝑟‖₂² + 𝜇‖Δ𝑤‖₂² . (10.53)

Recall that the solution to this problem (see Eq. 10.27) is


    Δ𝑤 = −(𝐽_𝑟ᵀ𝐽_𝑟 + 𝜇𝐼)⁻¹ 𝐽_𝑟ᵀ 𝑟 . (10.54)

If 𝜇 = 0, then we retain the Gauss–Newton step. Conversely, as 𝜇


becomes large, so that the 𝐽_𝑟ᵀ𝐽_𝑟 term is negligible, the step becomes

    Δ𝑤 = −(1/𝜇) 𝐽_𝑟ᵀ 𝑟 . (10.55)

This is precisely the steepest-descent direction for our objective (see


Eq. 10.49), although with a small magnitude because 𝜇 is large. The
parameter 𝜇 provides some control for directions ranging between
Gauss–Newton and steepest descent.
The Levenberg–Marquardt algorithm has been revised to improve
the scaling for components of the gradient that are small. The second
minimization term weights all parameters equally. The scaling can be
improved by multiplying by a diagonal matrix in the regularization as
follows:
    minimize_{Δ𝑤} ‖𝐽_𝑟 Δ𝑤 + 𝑟‖₂² + 𝜇‖𝐷Δ𝑤‖₂² , (10.56)

where 𝐷 is defined as

    𝐷² = diag( 𝐽_𝑟ᵀ 𝐽_𝑟 ) . (10.57)

This matrix scales the objective by the diagonal elements of the Hessian.
Thus, when 𝜇 is large, and the direction tends toward the steepest de-
scent, the components of the gradient are scaled by the curvature, which
reduces the amount of zigzagging. The solution to the minimization
problem of Eq. 10.56 is
    Δ𝑤 = −( 𝐽_𝑟ᵀ 𝐽_𝑟 + 𝜇 diag(𝐽_𝑟ᵀ 𝐽_𝑟) )⁻¹ 𝐽_𝑟ᵀ 𝑟 . (10.58)

Finally, we describe one of the possible heuristics for selecting and


updating the damping parameter 𝜇. After a successful step (a sufficient
reduction in the objective), 𝜇 is decreased by a factor of 𝜌 (𝜇 = 𝜇/𝜌). Conversely,
an unsuccessful step is rejected, and 𝜇 is increased by that factor (𝜇 = 𝜇 · 𝜌).
Rather than returning a scalar objective (Σ_𝑖 𝑟_𝑖²), the user function
should return a vector of the residuals because that vector is needed
in the update steps (Eq. 10.58). A potential convergence metric is a
tolerance on objective value changes between subsequent iterations.
The full procedure is described in Alg. 10.3.

Algorithm 10.3 Levenberg–Marquardt algorithm for solving a nonlinear least


squares problem

Inputs:
𝑥 0 : Starting point
𝜇0 : Initial damping parameter
𝜌: Damping parameter factor
Outputs:
𝑥 ∗ : Optimal solution

𝑘=0
𝑥 = 𝑥0
𝜇 = 𝜇_0
𝑟, 𝐽 = residual(𝑥)
𝑒 = ‖𝑟‖₂²    Residual error
while |Δ| > 𝜏 do
𝑠 = −( 𝐽ᵀ𝐽 + 𝜇 diag(𝐽ᵀ𝐽) )⁻¹ 𝐽ᵀ𝑟    Evaluate step
𝑟_𝑠, 𝐽_𝑠 = residual(𝑥 + 𝑠)
𝑒_𝑠 = ‖𝑟_𝑠‖₂²
Δ = 𝑒𝑠 − 𝑒 Change in residual error
if Δ < 0 then Objective decreased; accept step
𝑥=𝑥+𝑠
𝑟, 𝐽, 𝑒 = 𝑟 𝑠 , 𝐽𝑠 , 𝑒 𝑠
𝜇 = 𝜇/𝜌
else Reject step
𝜇=𝜇·𝜌 Increase damping
end if
𝑘 = 𝑘+1
end while
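A minimal Python sketch of this algorithm follows, using the Rosenbrock residual of Ex. 10.4 below as a test case; the iteration cap and the exact convergence check are illustrative assumptions.

    import numpy as np

    def levenberg_marquardt(residual, x0, mu=0.01, rho=10.0, tol=1e-6, max_iter=500):
        x = np.asarray(x0, dtype=float)
        r, J = residual(x)
        e = r @ r
        for _ in range(max_iter):
            A = J.T @ J
            s = np.linalg.solve(A + mu * np.diag(np.diag(A)), -J.T @ r)   # Eq. 10.58
            r_s, J_s = residual(x + s)
            e_s = r_s @ r_s
            if e_s < e:                 # objective decreased; accept step, reduce damping
                x, r, J = x + s, r_s, J_s
                if abs(e_s - e) < tol:
                    break
                e = e_s
                mu /= rho
            else:                       # reject step, increase damping
                mu *= rho
        return x

    def rosenbrock_residual(x):
        r = np.array([1 - x[0], 10 * (x[1] - x[0]**2)])
        J = np.array([[-1.0, 0.0], [-20.0 * x[0], 10.0]])
        return r, J

    x_opt = levenberg_marquardt(rosenbrock_residual, [-1.2, -1.0])   # should converge near [1, 1]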

Example 10.4 Rosenbrock as a nonlinear least-squares problem

The Rosenbrock function is a sum of squared terms, so it can be posed as a


nonlinear least squares problem:
 
    𝑟(𝑥) = [ 1 − 𝑥_1 ,  10(𝑥_2 − 𝑥_1²) ] .

In the following example, we use the same starting point as Ex. 4.18
(𝑥0 = [−1.2, −1]), an initial damping parameter of 𝜇 = 0.01, an update factor
of 𝜌 = 10, and a tolerance of 𝜏 = 10⁻⁶ (change in sum of squared errors). The
iteration path is shown on the left of Fig. 10.15, and the convergence of the sum
of squared errors is shown on the right side.

Fig. 10.15 Levenberg–Marquardt algorithm applied to the minimization of the Rosenbrock function: iteration history (left) and convergence of the sum of squared residuals (right); 42 iterations.

10.3.4 Cross Validation


The other important consideration for developing a surrogate model
is the choice of the basis functions in 𝜓. In some instances, we may
know something about the model behavior and thus what type of basis
functions should be used, but generally, the best way to determine the
basis functions is through cross validation. Cross validation is also
helpful in characterizing error, even if we already have a chosen set
of basis functions. One of the reasons we use cross validation is to
prevent overfitting.

[Fig. 10.15: Levenberg–Marquardt algorithm applied to the minimization of the
Rosenbrock function. Left: iteration history (42 iterations); right: convergence
of the sum of squared residuals.]

Overfitting occurs when we have too many degrees
of freedom and closely fit a given set of data, but the resulting model
has a poor predictive ability. In other words, we are fitting noise. The
following example illustrates this idea with a one-dimensional function.

Example 10.5 The dangers of overfitting

Consider the set of training data (Fig. 10.16, left), which we use to create a
surrogate function. This is a one-dimensional problem so that it can be easily
visualized. In general, however, visualization is limited, and determining the
right basis functions to use can be difficult. If we use a polynomial basis, we
might attempt to determine the appropriate order by trying each case (e.g.,
quadratic, cubic, quartic) and measuring the error in our fit (Fig. 10.16, center).

[Fig. 10.16: Fitting polynomials of different order to data. Left: training data;
center: the error in fitting the data decreases with the order of the polynomial;
right: a 19th-order polynomial fit to the data has low error but poor predictive
ability.]

It seems as if the higher the order of the polynomial, the lower the error.
For example, a 20th-order polynomial reduces the error to almost zero. The
problem is that although the error is low on this set of data, the predictive
capability of such a model for other data points is poor. For example, the right

side of Fig. 10.16 shows a 19th-order polynomial fit to the data. The model
passes right through the points, but it does not work well for many of the
points that are not part of the training set (which is the whole purpose of the
surrogate).
The opposite of overfitting is underfitting, which is also a potential issue.
When underfitting, we do not have enough degrees of freedom to create a
useful model (e.g., imagine using a linear fit for the previous example).

The solution to the overfitting problem highlighted in Ex. 10.5 is


cross validation. Cross validation means that we use one set of data for
training (creating the model) and a different set of data for assessing
its predictive error. There are many different ways to perform cross
validation; we describe two. Simple cross validation is illustrated in
Fig. 10.17 and consists of the following steps:

1. Randomly split your data into a training set and a validation set
(e.g., a 70–30 split).
2. Train each candidate model (the different options for 𝜓) using
only the training set, but evaluate the error with the validation set.
The error on previously unseen data is called the generalization
error (𝑒 𝑔 in Fig. 10.17).
3. Choose the model with the lowest generalization error, and
optionally retrain that model using all of the data.
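As a sketch of these steps, the following Python snippet performs a single
random 70–30 split and compares one-dimensional polynomial models of different
order; the data arrays, the split ratio, and the use of `numpy.polyfit` are
illustrative assumptions rather than prescribed choices.

```python
import numpy as np

def simple_cross_validation(x, f, orders, train_frac=0.7, seed=0):
    """Select a polynomial order using one random train/validation split."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    n_train = int(train_frac * len(x))
    train, test = idx[:n_train], idx[n_train:]
    errors = {}
    for n in orders:
        coeffs = np.polyfit(x[train], f[train], n)   # train on the training set only
        f_hat = np.polyval(coeffs, x[test])          # predict the validation set
        errors[n] = np.sum((f_hat - f[test]) ** 2)   # generalization error
    best = min(errors, key=errors.get)               # lowest generalization error
    return best, errors
```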

[Fig. 10.17: Simple cross-validation process. The data (𝑥 (𝑖), 𝑓 (𝑖)) are split into
a training set and a test set; the model 𝑓̂ is fit by regression on the training set,
and the generalization error 𝑒𝑔 is evaluated on the test set.]

An alternative option that is more involved but uses the data more
effectively is called 𝑘-fold cross validation. It is particularly advantageous
when we have a small data set where we cannot afford to leave much out.
This procedure is illustrated in Fig. 10.18 and consists of the following
steps:

1. Randomly split your data into 𝑛 sets (e.g., 𝑛 = 10).


2. Train each candidate model using the data from all sets except
one (e.g., 9 of the 10 sets) and use the remaining set for valida-
tion. Repeat for all 𝑛 possible validation sets and average the
performance.
10 Surrogate-Based Optimization 395

3. Choose the model with the lowest average generalization error.


Optionally, retrain with all the data.

The extreme version of this process, when training data are very limited,
is leave-one-out cross validation (i.e., each testing subset consists of one
data point).
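A minimal sketch of 𝑘-fold cross validation for a one-dimensional polynomial
surrogate follows; the choice of 10 folds and the use of `numpy.polyfit` are
assumptions made for illustration.

```python
import numpy as np

def k_fold_error(x, f, order, n_folds=10, seed=0):
    """Average generalization error of a polynomial fit over k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), n_folds)
    errors = []
    for i in range(n_folds):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        coeffs = np.polyfit(x[train], f[train], order)  # train on all folds but one
        f_hat = np.polyval(coeffs, x[test])
        errors.append(np.mean((f_hat - f[test]) ** 2))  # error on the held-out fold
    return np.mean(errors)

# Choose the order with the lowest average generalization error, e.g.,
# best_order = min(range(1, 21), key=lambda n: k_fold_error(x, f, n))
```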

[Fig. 10.18: Diagram of the 𝑘-fold cross-validation process. The data are split
into 𝑛 sets; each set serves once as the test set while the remaining sets are
used for training, giving errors 𝑒𝑔1, . . . , 𝑒𝑔𝑛, which are averaged as
$e_g = \frac{1}{n}\sum_{k=1}^{n} e_{g_k}$.]

Example 10.6 Cross validation helps to avoid overfitting

[Fig. 10.19: Error from 𝑘-fold cross validation versus polynomial order; the
right panel shows the same data with a smaller 𝑦-axis scale.]

This example continues from Ex. 10.5. First, we perform 𝑘-fold cross
validation using 10 divisions. The average error across the divisions using the
training data is shown in Fig. 10.19 (with a smaller 𝑦-axis scale on the right).
The error increases dramatically as the polynomial order increases. Zooming
in on the flat region, we see a range of options with similar errors. Among
the similar solutions, we generally prefer the simplest model. In this case,
a fourth-order polynomial seems reasonable. A fourth-order polynomial is
compared against the data in Fig. 10.20. This model has a much better predictive
ability.

[Fig. 10.20: A fourth-order polynomial fit to the data.]

10.3.5 Common Basis Functions


Although cross validation can help us find the lowest generalization
error among a provided set of basis functions, we still need to determine
what sets of options to consider. This selection is crucial because our
model is only as good as the available options, but increasing the
number of options increases computational time. The possibilities for
basis functions are as numerous as the types of function. As stated
before, it is generally desirable that they form an orthogonal set. We
focus on a few commonly used functions.

Polynomials

Polynomials, of which we have already seen a few examples, are useful


in many applications. However, we typically use low-order polynomials
for regression because high-order polynomials rarely generalize well.
Polynomials can be particularly effective in cases where a knowledge
of the physics suggests them to be an appropriate choice (e.g., drag
varies quadratically with speed). Because a lot of structure is already
built into the model form, fewer data points are needed to create a
reasonable model (e.g., a quadratic function in 𝑛 dimensions needs
at least 𝑛(𝑛 + 1)/2 + 𝑛 + 1 points, so this amounts to 6 points in two
dimensions, 10 points in three dimensions, and so on).
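As an illustration of this point, the following sketch fits the six-term quadratic
basis in two dimensions by linear least squares (the regression step described in
Section 10.3.1); the function names are placeholders.

```python
import numpy as np

def quadratic_basis(x1, x2):
    """Full quadratic basis in two dimensions (6 terms)."""
    return np.column_stack([np.ones_like(x1), x1, x2, x1**2, x1 * x2, x2**2])

def fit_quadratic(x1, x2, f):
    """Least-squares estimate of the six polynomial coefficients."""
    Psi = quadratic_basis(x1, x2)                # basis matrix (one row per sample)
    w, *_ = np.linalg.lstsq(Psi, f, rcond=None)  # solve the linear least-squares problem
    return w

def predict_quadratic(w, x1, x2):
    return quadratic_basis(x1, x2) @ w
```

At least six well-spread samples are needed for this fit to be determined, which
is consistent with the point count given above.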

Radial Basis Functions

Another common type of basis function is a radial basis function. Radial


basis functions are functions that depend on the distance from some
center point and can be written as follows:
   

$$\psi^{(i)} = \psi\left(\left\|x - c^{(i)}\right\|\right) = \psi\left(r^{(i)}\right), \qquad (10.59)$$

where 𝑐 is the center point, and 𝑟 is the radius about the center point.
Although the center points can be placed anywhere, we usually choose
the sampling data as centering points:
 

$$\psi^{(i)} = \psi\left(\left\|x - x^{(i)}\right\|\right). \qquad (10.60)$$

This is often a useful choice because it captures the idea that our
ability to predict function behavior is related to how close we are to
known function values (in other words, nearby points are more highly
correlated). This form naturally lends itself to interpolation, although
regularization can be added to allow for regression. Polynomials are
often combined with radial basis functions because the polynomial can

capture global function behavior, while the radial basis functions can
introduce modifications to capture local behavior.
One popular radial basis function is the Gaussian basis:

$$\psi^{(i)}(x) = \exp\left(-\sum_j \theta_j \left(x_j - x_j^{(i)}\right)^2\right), \qquad (10.61)$$

where 𝜃 𝑗 are the model parameters. One of the forms of kriging


discussed in the following section can be viewed as a radial basis
function model with a Gaussian basis.
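The following sketch is one way to build an interpolating surrogate with the
Gaussian basis of Eq. 10.61, assuming a single shared width 𝜃 rather than
per-dimension parameters 𝜃𝑗 and no regularization; both are simplifications
made for illustration.

```python
import numpy as np

def rbf_fit(X, f, theta=1.0):
    """Interpolating RBF surrogate with Gaussian bases centered at the samples."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)  # pairwise squared distances
    Psi = np.exp(-theta * d2)                                  # Gaussian basis matrix
    w = np.linalg.solve(Psi, f)                                # interpolation weights
    return w

def rbf_predict(X, w, x_new, theta=1.0):
    d2 = np.sum((x_new[None, :] - X) ** 2, axis=1)
    return np.exp(-theta * d2) @ w
```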

Tip 10.3 Surrogate modeling toolbox

The surrogate modeling toolbox (SMT)† is a useful package for surrogate
modeling, with a particular focus on providing derivatives for use in gradient-
based optimization.171 SMT includes surrogate modeling techniques that utilize
gradients as training data to enhance accuracy and scalability with the number
of inputs.172

† https://smt.readthedocs.io/
171. Bouhlel et al., A Python surrogate modeling framework with derivatives, 2019.
172. Bouhlel and Martins, Gradient-enhanced kriging for high-dimensional problems, 2019.

10.4 Kriging

Kriging is a popular surrogate modeling technique that can build


approximations of highly nonlinear engineering simulations. We may
not have a simple parametric form for such simulations that we can use
with regression and expect a good fit. Instead of tuning the parameters
of a functional form that describes what the function is, kriging tunes the
parameters of a statistical model that describes how the function behaves.
The kriging statistical model that approximates 𝑓 consists of two
terms: a function 𝜇(𝑥) that is meant to capture some of the function
behavior and a random variable 𝑍(𝑥). Thus, we can write the kriging
model as

𝐹(𝑥) = 𝜇(𝑥) + 𝑍(𝑥) where 𝑍(𝑥) ∼ 𝒩(0, 𝜎 2 ) . (10.62)

When we evaluate the function we want to approximate at point 𝑥, we


get a scalar value 𝑓 (𝑥). In contrast, when we evaluate the stochastic
process (Eq. 10.62) at 𝑥, we get a random variable 𝐹(𝑥) that has a normal
distribution with mean 𝜇 and variance 𝜎2 . Although we wrote 𝜇 as a
function of 𝑥, most kriging models consider this to be constant because
the random variable term alone is effective in capturing the function
behavior. For the rest of this section, we discuss the case with constant

𝜇, which is called ordinary kriging. Kriging is also referred to as Gaussian


process interpolation, or more generally in the regression case discussed
later in this section, as Gaussian process regression.
The power of the statistical model lies in how it treats the correlation
between the random variables. Although we do not know the exact form
of the error term 𝑍(𝑥), we can still make some reasonable assumptions
about it. Consider two points in a sampling plan, 𝑥 (𝑖) and 𝑥 (𝑗), and
the corresponding terms, 𝑍(𝑥 (𝑖)) and 𝑍(𝑥 (𝑗)). Intuitively, we expect
𝑍(𝑥 (𝑖)) to be close to 𝑍(𝑥 (𝑗)) whenever 𝑥 (𝑖) is close to 𝑥 (𝑗). Therefore, it
seems reasonable to assume that the correlation between 𝑍(𝑥 (𝑖)) and
𝑍(𝑥 (𝑗)) is a function of the distance between the two points.
In kriging, we assume that this correlation is given by a kernel
function 𝐾(𝑥 (𝑖), 𝑥 (𝑗)):

$$K\left(x^{(i)}, x^{(j)}\right) = \operatorname{corr}\left[Z\left(x^{(i)}\right), Z\left(x^{(j)}\right)\right]. \qquad (10.63)$$

As a matrix, the kernel is represented as 𝐾𝑖𝑗 = 𝐾(𝑥 (𝑖), 𝑥 (𝑗)). Various
kernel functions are used with kriging.∗ The most commonly used
kernel function is

$$K\left(x^{(i)}, x^{(j)}\right) = \exp\left(-\sum_{l=1}^{n_d} \theta_l \left| x_l^{(i)} - x_l^{(j)} \right|^{p_l}\right), \qquad (10.64)$$

∗ Kernel functions must be symmetric and positive definite because a covariance
matrix is always symmetric and positive definite.

where 𝜃𝑙 ≥ 0, 0 ≤ 𝑝 𝑙 ≤ 2, and 𝑛 𝑑 is the number of dimensions (i.e., the


length of the vector 𝑥). If every 𝑝 𝑙 = 2, this becomes a Gaussian kernel.
Let us examine how the statistical model 𝐹 defined in Eq. 10.62
captures the typical behavior of the function 𝑓 . The parameter 𝜇
captures the typical value, and 𝜎2 captures the expected variance. The
kernel (or correlation) function (Eq. 10.64) implicitly models continuous
functions. If 𝑓 is continuous, we know that, as |𝑥 (𝑖) − 𝑥 (𝑗) | → 0, then
| 𝑓 (𝑥 (𝑖) ) − 𝑓 (𝑥 (𝑗) )| → 0. This is captured in the kernel function because
as |𝑥 (𝑖) − 𝑥 (𝑗) | → 0, the correlation approaches 1. The parameter 𝜃𝑙
captures how active the function 𝑓 is in the 𝑙th coordinate direction.
A unit difference in variable 𝑙 ($|x_l^{(i)} - x_l^{(j)}| = 1$) has a more significant
impact on the correlation when 𝜃𝑙 is large. The exponent 𝑝 𝑙 describes
the smoothness of the function in the 𝑙th coordinate direction. Values
of 𝑝 𝑙 close to 2 produce smooth functions, whereas values closer to zero
produce functions with more variation.
Kriging surrogate modeling involves two main steps. The first step
consists of using the data to estimate the statistical model parameters

𝜇, 𝜎2 , 𝜃1 , . . . , 𝜃𝑛 𝑑 , and 𝑝1 , . . . , 𝑝 𝑛 𝑑 . The second step consists of making


predictions using the statistical model and these estimated parameter
values.
The parameter estimation uses the same maximum likelihood ap-
proach from Section 10.3.2, but now it is more complicated. Let us
denote the random variable as 𝐹 (𝑖) ≡ 𝐹(𝑥 (𝑖)) and the vector of random
variables as 𝐹 = [𝐹 (1) , . . . , 𝐹 (𝑛𝑠) ], where 𝑛𝑠 is the number of samples.
Similarly, 𝑓 (𝑖) ≡ 𝑓 (𝑥 (𝑖)) and the vector of observed function values is
𝑓 ≡ [ 𝑓 (1) , . . . , 𝑓 (𝑛𝑠) ]. Using this notation, we can say that the vector 𝐹 is
jointly normally distributed. This is also known as a multivariate Gaus-
sian distribution.† The probability density function (PDF) (the likelihood
that 𝐹 = 𝑓 ) is

$$p(f) = \frac{1}{(2\pi)^{n_s/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(f - e\mu)^\top \Sigma^{-1} (f - e\mu)\right), \qquad (10.65)$$

† There are other ways to derive kriging that do not require making assumptions
on the random variable distribution type.

where 𝑒 is a vector of 1s with size 𝑛 𝑠 , and |Σ| is the determinant of the


covariance,

$$\Sigma_{ij} = \sigma^2 K\left(x^{(i)}, x^{(j)}\right). \qquad (10.66)$$

The covariance between two elements 𝐹 (𝑖) and 𝐹 (𝑗) of 𝐹 is related to
correlation by the following definition:‡

$$\Sigma_{ij} = \operatorname{cov}\left(F^{(i)}, F^{(j)}\right) = \sigma^2 \operatorname{corr}\left(F^{(i)}, F^{(j)}\right) = \sigma^2 K\left(x^{(i)}, x^{(j)}\right). \qquad (10.67)$$

‡ Covariance and correlation are briefly reviewed in Appendix A.9.

We assume stationarity of the second moment, that is, the variance 𝜎 2


is constant in the domain.
The statistical model parameters 𝜃1 , . . . , 𝜃𝑛 𝑑 and 𝑝1 , . . . , 𝑝 𝑛 𝑑 enter
the likelihood (Eq. 10.65) via their effect on the kernel 𝐾 (Eq. 10.64) and
hence on the covariance matrix Σ (Eq. 10.66).
We estimate the parameters using the maximum log likelihood
approach from Section 10.3.2; that is, we maximize the probability of
observing our data 𝑓 conditioned on the parameters 𝜇 and Σ. Using
the PDF (Eq. 10.65) with constant 𝜇 and covariance Σ = 𝜎²𝐾
yields the following likelihood function:
 
$$L(\mu, \sigma, \theta, p) = \frac{1}{(2\pi)^{n_s/2}\, \sigma^{n_s} |K|^{1/2}} \exp\left(-\frac{(f - e\mu)^\top K^{-1} (f - e\mu)}{2\sigma^2}\right).$$

We now need to find the parameters 𝜇, 𝜎, 𝜃𝑖 , and 𝑝 𝑖 that maximize


this likelihood function, that is, maximize the probability of our obser-
vations 𝑓 . As before, we take the logarithm to form the log likelihood

function:
$$\ell(\mu, \sigma, \theta, p) = -\frac{n_s}{2}\ln(2\pi) - \frac{n_s}{2}\ln\left(\sigma^2\right) - \frac{1}{2}\ln|K| - \frac{(f - e\mu)^\top K^{-1}(f - e\mu)}{2\sigma^2}. \qquad (10.68)$$
We can maximize part of this term analytically by taking derivatives
with respect to 𝜇 and 𝜎, setting them equal to zero, and solving for their
optimal values to obtain:

$$\mu^* = \frac{e^\top K^{-1} f}{e^\top K^{-1} e} \qquad (10.69)$$

$$\sigma^{*2} = \frac{(f - e\mu^*)^\top K^{-1} (f - e\mu^*)}{n_s}. \qquad (10.70)$$

We now substitute these values back into the log likelihood function
(Eq. 10.68), which yields

$$\ell(\theta, p) = -\frac{n_s}{2}\ln\left(\sigma^{*2}\right) - \frac{1}{2}\ln|K|. \qquad (10.71)$$

This function, also called the concentrated likelihood function, only de-
pends on the kernel 𝐾, which depends on 𝜃 and 𝑝.
We cannot solve for optimal values of 𝜃 and 𝑝 analytically. Instead,
we rely on numerical optimization to maximize Eq. 10.71. Because 𝜃 can
vary across a broad range, it is often better to search using logarithmic
scaling. Once we solve that optimization problem, we compute the
mean and variance in Eqs. 10.69 and 10.70.
Now that we have a fitted model, we can make predictions at new
points where we have not sampled. We do this by substituting 𝑥 𝑝 into
a formula called the kriging predictor. The formula is unique, but there
are many ways to derive it. One way to derive it is to find the function
value at 𝑥 𝑝 that is the most consistent with the behavior of the function
captured by the fitted kriging model.
Let 𝑓𝑝 be our guess for the value of the function at 𝑥 𝑝 . One way
to assess the consistency of our guess is to add (𝑥 𝑝 , 𝑓𝑝 ) as an artificial
point to our training data (so that we now have 𝑛 𝑠 + 1 points) and
estimate the likelihood using the parameters from our fitted kriging
model. The likelihood of this augmented data can now be thought of
as a function of 𝑓𝑝 : high values correspond to guessed values of 𝑓𝑝 that
are consistent with function behavior captured by the fitted kriging
model. Therefore, the value of 𝑓𝑝 that maximizes the likelihood of this
augmented data set is a natural way to predict the value of the function.

This is an optimization problem with a closed-form solution, and the


corresponding formula is the kriging predictor.
Now we outline the derivation of the kriging predictor.§ With the
augmented point, our function values are 𝑓̄ = [ 𝑓 , 𝑓𝑝 ], where 𝑓 is the
𝑛𝑠-vector of function values from the original training data. Then, the
correlation matrix with the additional data point is

$$\bar{K} = \begin{bmatrix} K & k \\ k^\top & 1 \end{bmatrix}, \qquad (10.72)$$

§ Jones173 provides the complete derivation; here we show only a few key steps.
173. Jones, A taxonomy of global optimization methods based on response surfaces, 2001.

where 𝑘 is the correlation of the new point with the training data given
by

$$k = \begin{bmatrix} \operatorname{corr}\left[F\left(x^{(1)}\right), F(x_p)\right] \\ \vdots \\ \operatorname{corr}\left[F\left(x^{(n_s)}\right), F(x_p)\right] \end{bmatrix} = \begin{bmatrix} K\left(x^{(1)}, x_p\right) \\ \vdots \\ K\left(x^{(n_s)}, x_p\right) \end{bmatrix}. \qquad (10.73)$$
The 1 in the bottom right of the augmented correlation matrix (Eq. 10.72)
is because the correlation of the new variable 𝐹(𝑥 𝑝 ) with itself is 1. The
log likelihood function with these new augmented vectors and the
previously determined parameters is as follows (see Eq. 10.68):

$$\ell(f_p) = -\frac{n_s}{2}\ln(2\pi) - \frac{n_s}{2}\ln\left(\sigma^{*2}\right) - \frac{1}{2}\ln|\bar{K}| - \frac{(\bar{f} - e\mu^*)^\top \bar{K}^{-1} (\bar{f} - e\mu^*)}{2\sigma^{*2}}.$$
We want to maximize this function with respect to 𝑓𝑝 . Because only the
last term depends on 𝑓𝑝 (it is part of 𝑓̄ ), we can omit the other terms
and formulate the following:

$$\underset{f_p}{\operatorname{maximize}} \quad \ell(f_p) = -\frac{(\bar{f} - e\mu^*)^\top \bar{K}^{-1} (\bar{f} - e\mu^*)}{2\sigma^{*2}}. \qquad (10.74)$$

This problem can be solved analytically, yielding the mean value of the
kriging prediction,

$$f_p = \mu^* + k^\top K^{-1} (f - e\mu^*). \qquad (10.75)$$

The mean square error of the kriging prediction (that is, the expected
squared value of the error) is given by¶

$$\sigma_p^2 = \sigma^{*2}\left(1 - k^\top K^{-1} k + \frac{\left(1 - k^\top K^{-1} e\right)^2}{e^\top K^{-1} e}\right). \qquad (10.76)$$

¶ The formula for mean squared error does not come from the augmented
likelihood approach but is a byproduct of showing that the kriging predictor is
the “best linear unbiased predictor” for the assumed statistical model.174
174. Sacks et al., Design and analysis of computer experiments, 1989.

One attractive feature of kriging models is that they are interpolatory
and thus match the training data exactly. To see how this is true, if

𝑥 𝑝 is the same as one of the training data points, 𝑥 (𝑖) , then 𝑘 is just the 𝑖th
column of 𝐾. Hence, 𝐾 −1 𝑘 is a vector 𝑒 𝑖 , with all zeros except for 1 in
the 𝑖th element. In the prediction (Eq. 10.75), 𝑘⊤𝐾 −1 = 𝑒 𝑖⊤, and so the
last term is 𝑓 (𝑖) − 𝜇∗ , which means that 𝑓𝑝 = 𝑓 (𝑖) .


In the mean square error (Eq. 10.76), 𝑘⊤𝐾 −1 𝑘 is the same as 𝑘⊤𝑒 𝑖 .
This is the 𝑖th element of 𝑘, which is 1. Therefore, the first two terms in
the brackets in Eq. 10.76 cancel, and the last term is zero, yielding 𝜎𝑝2 = 0.
This is expected; if we already sampled the point, the uncertainty about
its function value should be zero.
When describing a fitted kriging model, we often refer to the
standard error as the square root of this quantity (i.e., $\sqrt{\sigma_p^2}$). The
standard error is directly related to the confidence interval (e.g., ±1
standard error corresponds to a 68 percent confidence interval).
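The two steps (estimating the parameters and making predictions) can be sketched
compactly in Python. The sketch below assumes a Gaussian kernel (𝑝 = 2) with a
single shared 𝜃, selects 𝜃 by a simple search over candidate values of the
concentrated likelihood (Eq. 10.71), and adds a tiny diagonal term for numerical
invertibility; a practical implementation would use per-dimension parameters,
Cholesky factorizations, and gradient-based maximization.

```python
import numpy as np

def kriging_fit(X, f, thetas=np.logspace(-3, 2, 50)):
    """Ordinary kriging with a Gaussian kernel and one shared theta (a simplification)."""
    n = len(f)
    e = np.ones(n)
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    best = None
    for theta in thetas:
        K = np.exp(-theta * d2) + 1e-10 * np.eye(n)             # kernel matrix (Eq. 10.64)
        Ki = np.linalg.inv(K)
        mu = (e @ Ki @ f) / (e @ Ki @ e)                        # Eq. 10.69
        sigma2 = (f - e * mu) @ Ki @ (f - e * mu) / n           # Eq. 10.70
        ell = -0.5 * n * np.log(sigma2) - 0.5 * np.linalg.slogdet(K)[1]  # Eq. 10.71
        if best is None or ell > best[0]:
            best = (ell, theta, mu, sigma2, Ki)
    return best[1:]                                             # theta, mu, sigma2, K inverse

def kriging_predict(X, f, theta, mu, sigma2, Ki, x_p):
    """Kriging predictor and mean square error (Eqs. 10.75 and 10.76)."""
    e = np.ones(len(f))
    k = np.exp(-theta * np.sum((X - x_p) ** 2, axis=1))
    fp = mu + k @ Ki @ (f - e * mu)
    mse = sigma2 * (1 - k @ Ki @ k + (1 - k @ Ki @ e) ** 2 / (e @ Ki @ e))
    return fp, mse

# Usage on the sample points of Ex. 10.7 (which follows)
X = np.array([[0.5], [2.0], [2.5], [9.0], [10.0]])
f = np.exp(-0.1 * X[:, 0]) * np.sin(X[:, 0])
theta, mu, sigma2, Ki = kriging_fit(X, f)
fp, mse = kriging_predict(X, f, theta, mu, sigma2, Ki, np.array([5.0]))
```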

Example 10.7 One-dimensional kriging model

In this example, we consider the decaying sinusoid:

$$f(x) = \exp(-0.1x)\sin(x).$$

We assume, however, that this function is unknown, and we sample at the
following points:

$$x = [0.5, 2, 2.5, 9, 10].$$

We can fit a kriging model to this data by following the procedure in this
section. This includes solving the optimization problem of Eq. 10.71 using a
gradient-based method with exact derivatives. We fix 𝑝 = 2 and search for 𝜃 in
the range [10⁻³, 10²] with the exponent as the optimization variable.
The resulting interpolation is shown in Fig. 10.21, where we plot the mean
line. The shaded area represents the uncertainty corresponding to ±1 standard
error. The uncertainty goes to zero at the known data points and is largest
when far from known data points.
[Fig. 10.21: Kriging model showing the training data (dots), the kriging
predictor (blue line), and the confidence interval corresponding to ±1 standard
error (shaded areas), compared to the actual function (gray line).]

If we can provide the gradients of the function at the training data
points (in addition to the function values), we can use that information
to build a more accurate kriging model. This approach is called gradient-
enhanced kriging (GEK). The methodology is the same as before, except
enhanced kriging (GEK). The methodology is the same as before, except
we add more observed outputs (i.e., in addition to the function values at
the sampled points, we add their gradients). In addition to considering
the correlation between the function values at different sampled points,
the kernel matrix 𝐾 needs to be expanded to consider correlations
between function values and gradients, gradients and function values,
and among gradient components.
We can still use Eq. 10.75 for the GEK predictor and
Eq. 10.76 for the mean square error if we plug in “expanded

versions” of the outputs 𝑓 , the vector 𝑘, the matrix 𝐾, and the vector of
1s, 𝑒.
We expand the output vector to include not just the function values
at the sampled points but also their gradients:

$$f_{\text{GEK}} \equiv \begin{bmatrix} f_1 \\ \vdots \\ f_{n_s} \\ \nabla f_1 \\ \vdots \\ \nabla f_{n_s} \end{bmatrix}. \qquad (10.77)$$
This vector is of length 𝑛 𝑠 + 𝑛 𝑠 𝑛 𝑑 , where 𝑛 𝑑 is the dimension of 𝑥. The
gradients are usually provided at the same 𝑥 locations as the function
samples, but that is not required.
Recall that the term 𝑒𝜇∗ in Eq. 10.75 for the kriging predictor
represents the expected value of the random variables 𝐹 (1) , . . . , 𝐹 (𝑛 𝑠 ) .
Now that we have expanded the outputs to include the gradients at the
sampled points, the mean vector needs to be expanded to include the
expected values of ∇𝐹 (𝑖) , which are all zero. We can still use 𝑒𝜇∗ in the
formula for the predictor if we use the following definition:

𝑒GEK ≡ [1, . . . , 1, 0, . . . , 0] , (10.78)

where 1 occurs for the first 𝑛 𝑠 entries, and 0 for the remaining 𝑛 𝑠 𝑛 𝑑
entries.
The additional correlations (between function values and derivatives
and between the derivatives themselves) are as follows:
    
$$
\begin{aligned}
\operatorname{corr}\left[F\left(x^{(i)}\right), F\left(x^{(j)}\right)\right] &= K_{ij} \\
\operatorname{corr}\left[F\left(x^{(i)}\right), \frac{\partial F\left(x^{(j)}\right)}{\partial x_l}\right] &= \frac{\partial K_{ij}}{\partial x_l^{(j)}} \\
\operatorname{corr}\left[\frac{\partial F\left(x^{(i)}\right)}{\partial x_l}, F\left(x^{(j)}\right)\right] &= \frac{\partial K_{ij}}{\partial x_l^{(i)}} \\
\operatorname{corr}\left[\frac{\partial F\left(x^{(i)}\right)}{\partial x_l}, \frac{\partial F\left(x^{(j)}\right)}{\partial x_k}\right] &= \frac{\partial^2 K_{ij}}{\partial x_l^{(i)} \partial x_k^{(j)}}.
\end{aligned} \qquad (10.79)
$$

Here, we use 𝑙 and 𝑘 to represent a component of a vector, and we
use 𝐾𝑖𝑗 ≡ 𝐾(𝑥 (𝑖) , 𝑥 (𝑗) ) as shorthand. For our particular kernel choice

(Eq. 10.64), these correlations become the following:


$$
\begin{aligned}
K_{ij} &= \exp\left(-\sum_{l=1}^{n_d} \theta_l \left(x_l^{(i)} - x_l^{(j)}\right)^2\right) \\
\frac{\partial K_{ij}}{\partial x_l^{(j)}} &= 2\theta_l \left(x_l^{(i)} - x_l^{(j)}\right) K_{ij} \\
\frac{\partial K_{ij}}{\partial x_l^{(i)}} &= -\frac{\partial K_{ij}}{\partial x_l^{(j)}} \\
\frac{\partial^2 K_{ij}}{\partial x_l^{(i)} \partial x_k^{(j)}} &=
\begin{cases}
-4\,\theta_l \theta_k \left(x_k^{(i)} - x_k^{(j)}\right)\left(x_l^{(i)} - x_l^{(j)}\right) K_{ij} & l \neq k \\
-4\,\theta_l^2 \left(x_l^{(i)} - x_l^{(j)}\right)^2 K_{ij} + 2\theta_l K_{ij} & l = k,
\end{cases}
\end{aligned} \qquad (10.80)
$$

where we used 𝑝 = 2. Putting this all together yields the expanded
correlation matrix:

$$K_{\text{GEK}} \equiv \begin{bmatrix} K & J_K \\ J_K^\top & H_K \end{bmatrix}, \qquad (10.81)$$
where the (𝑛𝑠 × 𝑛𝑠 𝑛𝑑 ) block representing the first derivatives is

$$J_K = \begin{bmatrix} \left(\dfrac{\partial K_{11}}{\partial x^{(1)}}\right)^{\!\top} & \cdots & \left(\dfrac{\partial K_{1 n_s}}{\partial x^{(n_s)}}\right)^{\!\top} \\ \vdots & \ddots & \vdots \\ \left(\dfrac{\partial K_{n_s 1}}{\partial x^{(1)}}\right)^{\!\top} & \cdots & \left(\dfrac{\partial K_{n_s n_s}}{\partial x^{(n_s)}}\right)^{\!\top} \end{bmatrix} \qquad (10.82)$$
and the (𝑛𝑠 𝑛𝑑 × 𝑛𝑠 𝑛𝑑 ) matrix of second derivatives is

$$H_K = \begin{bmatrix} \dfrac{\partial^2 K_{11}}{\partial x^{(1)} \partial x^{(1)}} & \cdots & \dfrac{\partial^2 K_{1 n_s}}{\partial x^{(1)} \partial x^{(n_s)}} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial^2 K_{n_s 1}}{\partial x^{(n_s)} \partial x^{(1)}} & \cdots & \dfrac{\partial^2 K_{n_s n_s}}{\partial x^{(n_s)} \partial x^{(n_s)}} \end{bmatrix}. \qquad (10.83)$$

We can still get the estimates 𝜇∗ and 𝜎∗2 with Eqs. 10.69 and 10.70
using the expanded versions of 𝐾, 𝑒, 𝑓 and replacing 𝑛𝑠 in Eq. 10.70
with 𝑛𝑠 (𝑛𝑑 + 1), which is the new length of the outputs.
The predictor equations (Eqs. 10.75 and 10.76) also apply with the
expanded matrices and vectors. However, we also need to expand 𝑘 in
these computations to include the correlations between the gradients
at the sampled points with the gradient at the point 𝑥 where we make

a prediction. Thus, the expanded 𝑘 is:

$$k_{\text{GEK}} \equiv \begin{bmatrix} k \\[6pt] \operatorname{corr}\left[\dfrac{\partial F\left(x^{(1)}\right)}{\partial x^{(1)}}, F(x_p)\right] = \dfrac{\partial K\left(x^{(1)}, x_p\right)}{\partial x^{(1)}} \\ \vdots \\ \operatorname{corr}\left[\dfrac{\partial F\left(x^{(n_s)}\right)}{\partial x^{(n_s)}}, F(x_p)\right] = \dfrac{\partial K\left(x^{(n_s)}, x_p\right)}{\partial x^{(n_s)}} \end{bmatrix}. \qquad (10.84)$$

Example 10.8 Gradient-enhanced kriging

We repeat Ex. 10.7 but this time include the gradients (Fig. 10.22). The
standard error reduces dramatically between points. The additional information
contained in the derivatives significantly helps in creating a more accurate fit.

[Fig. 10.22: A GEK fit to the input data (circles) and a shaded confidence
interval.]

Example 10.9 Two-dimensional kriging

The Jones function (Appendix D.1.4) is shown on the left in Fig. 10.23. Using
GEK with only 10 training points from a Hammersley sequence (shown as dots),
we created the surrogate model on the right. A reasonable representation of
this multimodal space can be captured even with a small number of samples.

[Fig. 10.23: Kriging fit to the multimodal Jones function. Left: original
function; right: kriging fit.]

One difficulty with GEK is that the kernel matrix quickly grows in
size as the dimension of the problem increases, the number of samples
increases, or both. Various approaches have been proposed to improve
the scaling with higher dimensions, such as a weighted sum of smaller
correlation matrices175 or a partial least squares approach.172

175. Han et al., Weighted gradient-enhanced kriging for high-dimensional surrogate modeling and design optimization, 2017.
172. Bouhlel and Martins, Gradient-enhanced kriging for high-dimensional problems, 2019.
The version of kriging in this section is interpolatory. For noisy
data, a regression approach can be used by modifying the correlation
matrix as follows:
𝐾 reg ≡ 𝐾 + 𝜏𝐼 , (10.85)
with 𝜏 > 0. This adds a positive constant along the diagonal, so the
model no longer correlates perfectly with the provided points. The
parameter 𝜏 is then an additional parameter to estimate in the maximum
likelihood optimization. Even for interpolatory models, this term is
often still added to the covariance matrix with a small constant value
of 𝜏 (near machine precision) to ensure that the correlation matrix is
invertible. This section focused on the most common choices when
using kriging, but many other versions exist.176

176. Forrester et al., Engineering Design via Surrogate Modelling: A Practical Guide, 2008.
10.5 Deep Neural Networks

Like kriging, deep neural nets can be used to approximate highly non-
linear simulations where we do not need to provide a parametric form.
Neural networks follow the same basic steps described for other surro-
gate models but with a unique model leading to specialized approaches
for derivative computation and optimization strategy. Neural networks
loosely mimic the brain, which consists of a vast network of neurons.
In neural networks, each neuron is a node that represents a simple
function. A network defines chains of these simple functions to obtain
composite functions that are much more complex. For example, three
simple functions, 𝑓 (1) , 𝑓 (2) , and 𝑓 (3) , may be chained into the composite
function (or network):
  
$$f(x) = f^{(3)}\left(f^{(2)}\left(f^{(1)}(x)\right)\right). \qquad (10.86)$$

Even though each function may be simple, the composite function


can express complex behavior. Most neural networks are feedforward
networks, meaning that information flows from inputs 𝑥 to outputs 𝑓 .
Recurrent neural networks include feedback connections.
Figure 10.24 shows a diagram of a neural network. Each node
represents a neuron. The neurons are connected between consecutive
layers, forming a dense network. The first layer is the input layer, the
last one is the output layer, and the middle ones are the hidden layers.
The total number of layers is called the network’s depth. Deep neural
networks have many layers, enabling the modeling of complex behavior.
The first and last layers can be viewed as the inputs and outputs
of a surrogate model.

[Fig. 10.24: Deep neural network with two hidden layers, consisting of an input
layer, hidden layers, and an output layer.]

Each neuron in the hidden layer represents a

function. This means that the output from a neuron is a number, and
thus the output from a whole layer can be represented as a vector 𝑥.
We represent the vector of values for layer 𝑘 by 𝑥 (𝑘) , and the value for
the 𝑖th neuron in layer 𝑘 by $x_i^{(k)}$.
Consider a neuron in layer 𝑘. This neuron is connected to many
neurons from the previous layer 𝑘 − 1 (see the first part of Fig. 10.25).
We need to choose a functional form for each neuron in the layer that
takes in the values from the previous layer as inputs. Chaining together
linear functions would yield another linear function. Therefore, some
layers must use nonlinear functions.
The most common choice for hidden layers is a layer of linear
functions followed by a layer of functions that create nonlinearity. A
neuron in the linear layer produces the following intermediate variable:

$$z = \sum_{j=1}^{n} w_j x_j^{(k-1)} + b. \qquad (10.87)$$

In vector form:

$$z = w^\top x^{(k-1)} + b. \qquad (10.88)$$

The first term is a weighted sum of the values from the neurons in the
previous layer. The 𝑤 vector contains the weights. The term 𝑏 is the
bias, which is an offset that scales the significance of the overall output.
These two terms are analogous to the weights used in the previous
section but with the constant term separated for convenience. The
second column of Fig. 10.25 illustrates the linear (summation and bias)
layer.

[Fig. 10.26: Activation functions. Top: the sigmoid function 𝑎(𝑧); bottom: the
ReLU function 𝑎(𝑧).]

[Fig. 10.25: Typical functional form for a neuron in the neural net. The inputs
𝑥 (𝑘−1) are multiplied by weights 𝑤, summed with a bias to give 𝑧, passed
through the activation 𝑥 (𝑘) = 𝑎(𝑧), and sent to the output.]

Next, we pass 𝑧 through an activation function, which we call 𝑎(𝑧).


Historically, one of the most common activation functions has been the
sigmoid function:
$$a(z) = \frac{1}{1 + e^{-z}}. \qquad (10.89)$$
This function is shown in the top plot of Fig. 10.26. The sigmoid function
produces values between 0 and 1, so large negative inputs result in
insignificant outputs (close to 0), and large positive inputs produce
outputs close to 1.
Most modern neural nets use a rectified linear unit (ReLU) as the
activation function:
𝑎(𝑧) = max(0, 𝑧) . (10.90)
This function is shown in the bottom plot of Fig. 10.26. The ReLU
has been found to be far more effective than the sigmoid function in
producing accurate neural nets. This activation function eliminates
negative inputs. Thus, the bias term can be thought of as a threshold
establishing what constitutes a significant value. The final two columns
of Fig. 10.25 illustrate the activation step.
Combining the linear function with the activation function produces

the output for the 𝑖th neuron:


 
$$x_i^{(k)} = a\left(w^\top x^{(k-1)} + b_i\right). \qquad (10.91)$$

To compute the outputs for all the neurons in this layer, the weights
𝑤 for one neuron form one row in a matrix of weights 𝑊 and we can
write:
$$\begin{bmatrix} x_1^{(k)} \\ \vdots \\ x_i^{(k)} \\ \vdots \\ x_{n_k}^{(k)} \end{bmatrix} = a\left( \begin{bmatrix} W_{1,1} & \cdots & W_{1,j} & \cdots & W_{1,n_{k-1}} \\ \vdots & & \vdots & & \vdots \\ W_{i,1} & \cdots & W_{i,j} & \cdots & W_{i,n_{k-1}} \\ \vdots & & \vdots & & \vdots \\ W_{n_k,1} & \cdots & W_{n_k,j} & \cdots & W_{n_k,n_{k-1}} \end{bmatrix} \begin{bmatrix} x_1^{(k-1)} \\ \vdots \\ x_j^{(k-1)} \\ \vdots \\ x_{n_{k-1}}^{(k-1)} \end{bmatrix} + \begin{bmatrix} b_1 \\ \vdots \\ b_i \\ \vdots \\ b_{n_k} \end{bmatrix} \right) \qquad (10.92)$$
or
 
$$x^{(k)} = a\left(W x^{(k-1)} + b\right). \qquad (10.93)$$

The activation function is applied separately for each row. The following
equation is more explicit (where 𝑤 𝑖 is the 𝑖th row of 𝑊):
 
$$x_i^{(k)} = a\left(w_i^\top x^{(k-1)} + b_i\right). \qquad (10.94)$$
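A minimal sketch of this forward pass for a small fully connected network
follows; the random initialization, the ReLU hidden layers, and the linear
output layer are assumptions made for illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)                      # Eq. 10.90

def init_layers(sizes, seed=0):
    """Random weight matrices W and zero biases b for consecutive layer sizes."""
    rng = np.random.default_rng(seed)
    return [(rng.standard_normal((n_out, n_in)), np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def forward(layers, x):
    """Propagate the inputs through the network (Eq. 10.93)."""
    for i, (W, b) in enumerate(layers):
        z = W @ x + b                              # linear layer (Eq. 10.88)
        x = relu(z) if i < len(layers) - 1 else z  # no activation on the output layer
    return x

# The 5-7-7-4 network of Fig. 10.24
layers = init_layers([5, 7, 7, 4])
y = forward(layers, np.ones(5))
```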

This neural net is now parameterized by a number of weights. Like


other surrogate models, we need to determine the optimal value for
these parameters (i.e., train the network) using training data. In the
example of Fig. 10.24, there is a layer of 5 neurons, 7 neurons, 7 neurons,
and then 4 neurons, and so there would be 5 × 7 + 7 × 7 + 7 × 4 weights
and 7 + 7 + 4 bias terms, giving a total of 130 variables. This represents
a small neural net because there are few inputs and few outputs. Large
neural nets can have millions of variables. We need to optimize those
variables to minimize a cost function.
As before, we use a maximum likelihood estimate where we optimize
the parameters 𝜃 (weights and biases in this case) to maximize the
probability of observing the output data 𝑦 conditioned on our inputs
𝑥. As shown in Section 10.3.2, this results in a sum of squared errors
function:
$$\underset{\theta}{\operatorname{minimize}} \quad \sum_{i=1}^{n} \left( \hat{f}\left(\theta; x^{(i)}\right) - f^{(i)} \right)^2. \qquad (10.95)$$

We now have the objective and variables in place to train the neural
net. As with the other models discussed in this chapter, it is critical to
set aside some data for cross validation.

Because the optimization problem (Eq. 10.95) often has a large


number of parameters 𝜃, we generally use a gradient-based optimization
algorithm (however, the algorithms of Chapter 4 are modified, as we will
discuss shortly). To solve Eq. 10.95 using gradient-based optimization,
we require the derivatives of the objective function with respect to the
weights 𝜃. Because the objective is a scalar and the number of weights
is large, reverse-mode algorithmic differentiation (AD) (see Section 6.6)
is ideal to compute the required derivatives.
Reverse-mode AD is known in the machine learning community as
backpropagation.∗ Whereas general-purpose reverse-mode AD operates
at the code level, backpropagation usually operates on larger sets of
operations and data structures defined in machine learning libraries.
Although less general, this approach can increase efficiency and stability.

∗ The machine learning community independently developed backpropagation
before becoming aware of the connection to reverse-mode AD.58
58. Baydin et al., Automatic differentiation in machine learning: A survey, 2018.
The ReLU activation function (Fig. 10.26, bottom) is not differentiable
at 𝑧 = 0, but in practice, this is generally not problematic—primarily
because these methods typically rely on inexact gradients anyway, as
discussed next.
The objective function in Eq. 10.95 consists of a sum of subfunctions,
each of which depends on a single data point (𝑥 (𝑖) , 𝑓 (𝑖) ). Objective
functions vary across machine learning applications, but most have this
same form:
$$\underset{\theta}{\operatorname{minimize}} \quad f(\theta), \qquad (10.96)$$

where

$$f(\theta) = \sum_{i=1}^{n} \ell\left(\theta; x^{(i)}, f^{(i)}\right) = \sum_{i=1}^{n} \ell_i(\theta). \qquad (10.97)$$
As previously mentioned, the challenge with these problems is that
we often have large training sets where 𝑛 may be in the billions. That
means that computing the objective can be costly, and computing the
gradient can be even more costly.
If we divide the objective by 𝑛 (which does not change the solution),
the objective function becomes an approximation of the expected value
(see Appendix A.9):


$$f(\theta) = \frac{1}{n}\sum_{i=1}^{n} \ell_i(\theta) = \mathbb{E}\left(\ell(\theta)\right) \qquad (10.98)$$

From probability theory, we know that we can estimate an expected


value from a smaller set of random samples. For the application of
estimating a gradient, we call this subset of random samples a minibatch
𝑆 = {𝑥 (1) , . . . , 𝑥 (𝑚) }, where 𝑚 is usually between 1 and a few hundred.
The entries 𝑥 (1) , . . . , 𝑥 (𝑚) do not correspond to the first 𝑚 entries but are
drawn randomly from a uniform probability distribution (Fig. 10.27).

Using the minibatch, we can estimate the gradient as the sum of the
subfunction gradients at different training points:

$$\nabla_\theta f(\theta) \approx \frac{1}{m} \sum_{i \in S} \nabla_\theta\, \ell\left(\theta; x^{(i)}, f^{(i)}\right). \qquad (10.99)$$

Thus, we divide the training data into these minibatches and use a new
minibatch to estimate the gradients at each iteration in the optimization.

[Fig. 10.27: Minibatches are randomly drawn from the training data; the data
are divided into training data (split into minibatches) and testing data.]

This approach works well for these specific problems because of
the unique form for the objective (Eq. 10.98). As an example, for one
million training samples, a single gradient evaluation would require
evaluating all one million training samples. Alternatively, for a similar
cost, a minibatch approach can update the optimization variables a
million times using the gradient estimated from one training sample
at a time. This latter process usually converges much faster, mainly
because we are only fitting parameters against limited data in these
problems, so we generally do not need to find the exact minimum.
Typically, this gradient is used with steepest descent methods (Sec-
tion 4.4.1), more commonly referred to as gradient descent in the machine
learning community. As discussed in Chapter 4, steepest descent
is not the most effective optimization algorithm. However, steepest
descent with the minibatch updates, called stochastic gradient descent,
has been found to work well in machine learning applications. This
suitability is primarily because (1) many machine learning optimiza-
tions are performed repeatedly, (2) the true objective is difficult to
formalize, and (3) finding the absolute minimum is not as important as
finding a good enough solution quickly. One key difference in stochastic
gradient descent relative to the steepest descent method is that we do
not perform a line search. Instead, the step size (called the learning rate
in machine learning applications) is a preselected value that is usually
decreased between major optimization iterations.
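The following sketch shows one way to implement stochastic gradient descent with
minibatches; the gradient callback, the batch size, and the decay schedule for
the learning rate are illustrative assumptions.

```python
import numpy as np

def stochastic_gradient_descent(batch_grad, theta0, n_data, batch_size=64,
                                learning_rate=0.01, n_epochs=10, seed=0):
    """Minimal SGD sketch; batch_grad(theta, idx) returns the averaged gradient
    of the subfunctions in the minibatch idx (Eq. 10.99)."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    for epoch in range(n_epochs):
        order = rng.permutation(n_data)              # shuffle, then split into minibatches
        for start in range(0, n_data, batch_size):
            batch = order[start:start + batch_size]
            theta = theta - learning_rate * batch_grad(theta, batch)  # fixed step; no line search
        learning_rate *= 0.9                         # decay the learning rate between epochs
    return theta
```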
Stochastic minibatching is easily applied to first-order methods and
has thus driven the development of improvements on stochastic gradient
descent, such as momentum, Adagrad, and Adam.177

177. Ruder, An overview of gradient descent optimization algorithms, 2016.

these methods may seem somewhat ad hoc, there is mathematical rigor
to many of them.178 Batching makes the gradients noisy, so second-
order methods are generally not pursued. However, ongoing research
is exploring stochastic batch approaches that might effectively leverage
the benefits of second-order methods.

178. Goh, Why momentum really works, 2017.

10.6 Optimization and Infill

Once a surrogate model has been built, optimization may be performed


using the surrogate function values. That is, instead of minimizing the
expensive function 𝑓 (𝑥), we minimize the model 𝑓ˆ(𝑥), as previously
illustrated in Fig. 10.1.
The surrogate model may be static, but more commonly, it is
updated between optimization iterations by adding new training data
and rebuilding the model.
The process by which we select new data points is called infill. There
are two main approaches to infill: prediction-based exploitation and
error-based exploration. Typically, only one infill point is chosen at a
time. The assumption is that evaluating the model is computationally
expensive, but rebuilding and evaluating the surrogate is cheap.

10.6.1 Exploitation
For models that do not provide uncertainty estimates, the only real
option is exploitation. A prediction-based exploitation infill strategy
adds an infill point wherever the surrogate predicts the optimum. The
reasoning behind this approach is that in SBO, we do not necessarily
care about having a globally accurate surrogate; instead, we only care
about having an accurate surrogate near the optimum.
The most logical point to sample is thus the optimum predicted by
the surrogate. The location predicted by the surrogate will likely not
be at the true optimum. However, evaluating this point adds valuable
information in the region of interest.
We rebuild the surrogate and re-optimize, repeating the process until
convergence. This approach usually results in the quickest convergence
to an optimum, which is desirable when the actual function is expensive
to evaluate. The downside is that we may converge prematurely to an
inferior local optimum for problems with multiple local optima.
Even though the approach is called exploitation, the optimizer used
on the surrogate can be a global search method (gradient-based or
gradient-free), although it is usually a local search method. If uncer-
tainty is present, using the mean value of the surrogate as the infill
criterion results in essentially an exploitation strategy.

The algorithm is outlined in Alg. 10.4. Convergence could be based


on a maximum number of iterations or a tolerance for the objective
function’s fractional change.

Algorithm 10.4 Exploitation-driven surrogate-based optimization

Inputs:
𝑛𝑠: Number of initial samples
𝑥, 𝑥̄: Variable lower and upper bounds
𝜏: Convergence tolerance
Outputs:
𝑥*: Best point identified
𝑓*: Corresponding function value

𝑥 (𝑖) = sample(𝑛𝑠 , 𝑛𝑑 )                        Sample
𝑓 (𝑖) = 𝑓 (𝑥 (𝑖) )                              Evaluate function
𝑘 = 0
while 𝑘 < 𝑘max and |𝑓̂* − 𝑓new|/|𝑓̂*| > 𝜏 do
    𝑓̂ = surrogate(𝑥 (𝑖) , 𝑓 (𝑖) )               Construct surrogate model
    𝑥*, 𝑓̂* = min 𝑓̂ (𝑥)                         Perform optimization on the surrogate function
    𝑓new = 𝑓 (𝑥*)                               Evaluate true function at predicted optimum
    𝑥 (𝑖) = 𝑥 (𝑖) ∪ 𝑥*                           Append new point to training data
    𝑓 (𝑖) = 𝑓 (𝑖) ∪ 𝑓new                         Append corresponding function value
    𝑘 = 𝑘 + 1
end while
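One possible translation of Alg. 10.4 into Python is sketched below, assuming
that the sampling plan and the surrogate construction are provided as callbacks
and that SciPy's local optimizer is used to minimize the surrogate.

```python
import numpy as np
from scipy.optimize import minimize

def sbo_exploitation(f, sample, build_surrogate, bounds, n_s=10, tau=1e-3, k_max=30):
    """Minimal sketch of Alg. 10.4. build_surrogate(X, F) must return a callable
    f_hat(x); sample(n_s, bounds) returns an (n_s x n_d) array of samples."""
    X = sample(n_s, bounds)                         # initial sampling plan
    F = np.array([f(x) for x in X])                 # evaluate the expensive function
    for k in range(k_max):
        f_hat = build_surrogate(X, F)               # construct surrogate model
        x0 = X[np.argmin(F)]                        # start from the best sample so far
        res = minimize(f_hat, x0, bounds=bounds)    # optimize the surrogate
        f_new = f(res.x)                            # evaluate true function at predicted optimum
        X = np.vstack([X, res.x])                   # append new point to training data
        F = np.append(F, f_new)
        if abs(res.fun - f_new) / max(abs(res.fun), 1e-12) < tau:
            break                                   # surrogate and true function agree
    i_best = np.argmin(F)
    return X[i_best], F[i_best]
```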

10.6.2 Efficient Global Optimization


An alternative approach to infill uses error-based exploration. This
approach requires using kriging (Section 10.4) or another surrogate
approach that predicts not just function values but also error estimates.
Although many infill metrics exist within this category, we focus on a
popular one called expected improvement, and the associated algorithm,
efficient global optimization (EGO).145

145. Jones et al., Efficient global optimization of expensive black-box functions, 1998.
As stated previously, sampling where the mean is low is an ex-
ploitation strategy, but we do not necessarily want to sample where
the uncertainty is high. That may lead to wasteful function calls in
regions of the design space where the surrogate model is inaccurate
but which are far from any optimum. In effect, this strategy would be
like a larger sampling plan aiming to reduce error everywhere in the
surrogate. Instead, we want to sample where we have the maximum
probability of finding a better point.

Let the best solution we have found so far be 𝑓 ∗ = 𝑓 (𝑥 ∗ ). The


improvement for any new test point 𝑥 is then given by

𝐼(𝑥) = max ( 𝑓 ∗ − 𝑓 (𝑥), 0) . (10.100)

If 𝑓 (𝑥) ≥ 𝑓 ∗ , there is no improvement, but if 𝑓 (𝑥) < 𝑓 ∗ , the improvement


is positive. However, 𝑓 (𝑥) is not a deterministic value in this model but
rather a probability distribution. Thus, the expected improvement is
the expected value (or mean) of the improvement:

𝐸𝐼(𝑥) = E (max( 𝑓 ∗ − 𝑓 (𝑥), 0)) . (10.101)

The expected value for a kriging model can be found analytically as:
   

$$EI(x) = \left(f^* - \mu_f(x)\right)\Phi\left(\frac{f^* - \mu_f(x)}{\sigma_f(x)}\right) + \sigma_f(x)\,\phi\left(\frac{f^* - \mu_f(x)}{\sigma_f(x)}\right), \qquad (10.102)$$

where Φ and 𝜙 are the CDF and PDF, respectively, for the standard
normal distribution, and 𝜇𝑓 and 𝜎𝑓 are the mean and standard error
functions produced from kriging (Eqs. 10.75 and 10.76).
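Eq. 10.102 translates directly into a few lines of code. The sketch below uses
SciPy's standard normal CDF and PDF; the small floor on 𝜎𝑓 is an added safeguard
against division by zero at points that have already been sampled.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(f_best, mu, sigma):
    """Expected improvement (Eq. 10.102) for minimization; mu and sigma are the
    kriging mean and standard error evaluated at the candidate points."""
    sigma = np.maximum(sigma, 1e-12)     # avoid division by zero at sampled points
    z = (f_best - mu) / sigma
    return (f_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)
```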
The algorithm is similar to that of the previous section (Alg. 10.4),
but instead of choosing the minimum of the surrogate, the selected
infill point is the point with the greatest expected improvement. The
corresponding algorithm is detailed in Alg. 10.5.

Algorithm 10.5 Efficient global optimization

Inputs:
𝑛 𝑠 : Number of initial samples
𝑥, 𝑥: Lower and upper bounds
𝜏: Minimum expected improvement
Outputs:
𝑥 ∗ : Best point identified
𝑓 ∗ : Corresponding function value

𝑥 (𝑖) = sample(𝑛 𝑠 , 𝑛 𝑑 ) Sample


𝑓 (𝑖) = 𝑓 (𝑥 (𝑖) ) Evaluate function
𝑓 ∗ = min{ 𝑓 (𝑖) } Best point so far; also update corresponding 𝑥 ∗
𝑘=0
while 𝑘 < 𝑘max and 𝑓𝑒 𝑖 > 𝜏 do
𝜇(𝑥), 𝜎(𝑥) = GP(𝑥 (𝑖) , 𝑓 (𝑖) ) Construct Gaussian process surrogate model
𝑥 𝑘 , 𝑓𝑒 𝑖 = max 𝐸𝐼(𝑥) Maximize expected improvement
𝑓 𝑘 = 𝑓 (𝑥 𝑘 ) Evaluate true function at predicted optimum
𝑓 ∗ = min{ 𝑓 ∗ , 𝑓 𝑘 } Update best point and 𝑥 ∗ if necessary

𝑥 (𝑖) ← [𝑥 (𝑖) , 𝑥 𝑘 ] Add new point to training data


𝑓 (𝑖) ← [ 𝑓 (𝑖) , 𝑓 𝑘 ]
𝑘 = 𝑘+1
end while

Example 10.10 Expected improvement

Consider the same one-dimensional function of Ex. 10.7 using kriging
(without gradients), where the data points and fit are shown again in Fig. 10.28.
The best point we have found so far is denoted in the figure as 𝑥∗, 𝑓∗. For
a Gaussian process model, the fit also provides a 1-standard-error region,
represented by the shaded region in Ex. 10.7.

Now imagine we want to evaluate this function at some new test point,
𝑥test = 3.25. In Fig. 10.28, the full probability distribution for the objective
at 𝑥test is shown in red. This probability distribution occurs at a fixed value
of 𝑥, so we can visualize it in a dimension coming out of the page. The
integral of the shaded red region is the probability of improvement over the
best point. The expected value is similar to the probability of improvement.
However, rather than returning a probability, it returns the expected magnitude
of improvement. That magnitude may be more helpful in defining stopping
criteria than quantifying a probability; that is, if the amount of improvement is
negligible, it does not matter that the associated probability is high.

[Fig. 10.28: At a given test point (𝑥test = 3.25), we highlight the probability
distribution and the expected improvement in the shaded red region.]
Now, let us evaluate the expected improvement not just at 𝑥test , but across
the domain. The result is shown by the red function in the top left of Fig. 10.29.
The highest peak suggests that we expect the largest improvement close to
our best known point at this first iteration. We also see significant potential
for improvement in the middle region of high uncertainty. The expected
improvement metric does not simply capture regions with high uncertainty
but rather regions that are likely to lead to improvement (which may also have
high uncertainty). On the left side of the figure, for example, we anticipate zero
expected improvement. For our next sample, we would choose the location
with the greatest expected improvement, rebuild the surrogate model, and
repeat.
A few select iterations in the convergence process are shown in the remaining
panes of Fig. 10.29. On the top right, after the first promising valley is well
explored, the middle region becomes the most likely location of potential
improvements. Eventually, the potential improvements are minor, below our
convergence threshold, and we terminate (bottom right).

[Fig. 10.29: Expected improvement 𝐸𝐼(𝑥) and the corresponding kriging fit
evaluated across the domain at iterations 𝑘 = 1, 5, 10, and 12.]

10.7 Summary

Surrogate-based optimization can be an effective approach to optimiza-


tion problems where models are expensive to evaluate or noisy. The
first step in building a surrogate model is sampling, in which we select
the points that are evaluated to obtain the training data. Full factorial
searches are too expensive for even a modest number of variables,
and random sampling does not provide good coverage, so we need
techniques that provide good coverage with a small number of sam-
ples. Popular techniques for this kind of sampling include LHS and
low-discrepancy sequences.
The next step is surrogate selection and construction. For a given
choice of basis functions, regression is used to select optimal model

parameters. Cross validation is a critical component of this process. We


want good predictive capability, which means that the models work well
on data that the model has not been trained against. Model selection
often involves trade-offs of more rigid models that do not need as much
training data versus more flexible models that require more training
data. Polynomials are often used for regression problems because
a relatively small number of samples can be used to capture model
behavior. Radial basis functions are more often used for interpolation
because they can handle multimodal behavior but may require more
training data.
Kriging and deep neural nets are two options that model more com-
plex and multimodal design spaces. When using these models, special
considerations are needed for efficiency, such as using symmetric matrix
factorizations and gradients for kriging and using backpropagation
and stochastic gradients for deep neural nets.
The last step of the process is infill, where points are sampled
during optimization to update the surrogate model. Some approaches
are exploitation-based, where we perform optimization using the
surrogate and then use the optimal solution to update the model.
Other approaches are exploration-based, where we sample not just
at the deterministic optimum but also at points where the expected
improvement is high. Exploration-based approaches require surrogate
models that provide uncertainty estimates, such as kriging models.

Problems

10.1 Answer true or false and justify your answer.

a. You should use surrogate-based optimization when a prob-


lem has an expensive simulation and many design variables
because it is immune to the “curse of dimensionality”.
b. Latin hypercube sampling is a random process that is more
efficient than pure random sampling.
c. LHS seeks to minimize the distance between the samples,
with the constraint that the projection on each axis must
follow a chosen probability distribution.
d. Polynomial regressions are not considered to be surrogate
models because they are too simple and do not consider any
of the model physics.
e. There can be some overlap between the training points and
cross-validation points, as long as that overlap is small.
f. Cross validation is a required step in selecting basis functions
for SBO.
g. In addition to modeling the function values, kriging surro-
gate models also provide an estimate of the uncertainty in
the values.
h. A prediction-based exploitation infill strategy adds an infill
point wherever the surrogate predicts the largest error.
i. Maximizing the expected improvement maximizes the prob-
ability of finding a better function value.
j. Neural networks require many nodes with a variety of
sophisticated activation functions to represent challenging
nonlinear models.
k. Backpropagation is the computation of the derivatives of
the neural net error with respect to the activation function
weights using reverse-mode AD.

10.2 Latin hypercube sampling. Implement an LHS sampling algorithm


and plot 20 points across two dimensions with uniform projection
in both dimensions. Overlay the grid to check that one point
occurs in each bin.

10.3 Inversion sampling. Use inversion sampling with Latin hypercube


sampling to create and plot 100 points across two dimensions.
Each dimension should follow a normal distribution with zero

mean and a standard deviation of 1 (cross-terms in covariance


matrix are 0).

10.4 Linear regression. Use the following training data sampled at 𝑥 with
the resulting function value 𝑓 (also tabulated on the resources
website):

𝑥 = [ − 2.0000, −1.7895, −1.5789, −1.3684, −1.1579,


− 0.9474, −0.7368, −0.5263, −0.3158, −0.1053,
0.1053, 0.3158, 0.5263, 0.7368, 0.9474,
1.1579, 1.3684, 1.5789, 1.7895, 2.0000]

𝑓 = [7.7859, 5.9142, 5.3145, 5.4135, 1.9367,


2.1692, 0.9295, 1.8957, −0.4215, 0.8553,
1.7963, 3.0314, 4.4279, 4.1884, 4.0957,
6.5956, 8.2930, 13.9876, 13.5700, 17.7481].

Use linear regression to determine the coefficients for a polynomial


basis of [𝑥 2 , 𝑥, 1] to predict 𝑓 (𝑥). Plot your fit against the training
data and report the coefficients for the polynomial bases.

10.5 Cross validation. Use the following training data sampled at 𝑥 with
resulting function value 𝑓 (also tabulated on resources website):

𝑥 = [ − 3.0, −2.6053, −2.2105, −1.8158, −1.4211,


− 1.0263, −0.6316, −0.2368, 0.1579, 0.5526,
0.9474, 1.3421, 1.7368, 2.1316, 2.5263,
2.9211, 3.3158, 3.7105, 4.1053, 4.5]

𝑓 = [43.1611, 28.1231, 12.9397, 3.7628, −2.5457,


− 4.267, 2.8101, −0.6364, 1.1996, −0.9666,
− 2.7332, −6.7556, −9.4515, −7.0741, −7.6989,
− 8.4743, −7.9017, −2.0284, 11.9544, 33.7997].

a. Create a polynomial surrogate model using the set of poly-


nomial basis functions 𝑥 𝑖 for 𝑖 = 0 to 𝑛. Plot the error in the
surrogate model while increasing 𝑛 (the maximum order of
the polynomial model) from 1 to 20.
b. Plot the polynomial fit for 𝑛 = 16 against the data and
comment on its suitability.

c. Re-create the error plot versus polynomial order using 𝑘-fold


cross validation with 10 divisions. Be sure to limit the 𝑦-axes
to the area of interest.
d. Plot the polynomial fit against the data for a polynomial order
that produces low error under cross validation, and report
the coefficients for the polynomial. Justify your selection.

10.6 Nonlinear least squares. Implement a Levenberg–Marquardt al-


gorithm and demonstrate its performance on the Rosenbrock
function from three different starting points.

10.7 Kriging. Implement kriging (without gradients) and demonstrate


its fit on the following one-dimensional function:

𝑦 = exp(−𝑥) cos(5𝑥),

where 𝑥 ∈ [0, 2.5], using the following five sample points: 𝑥 =


[0, 0.2, 1.0, 1.2, 2.2].

10.8 Efficient global optimization. Use EGO with the function from the
previous problem, showing the iteration history until the expected
improvement reduces below 0.001.
11 Convex Optimization
General nonlinear optimization problems are difficult to solve. De-
pending on the particular optimization algorithm, they may require
tuning parameters, providing derivatives, adjusting scaling, and trying
multiple starting points. Convex optimization problems do not have
any of those issues and are thus easier to solve. The challenge is that
these problems must meet strict requirements. Even for candidate
problems with the potential to be convex, significant experience is
usually needed to recognize and utilize techniques that reformulate the
problems into an appropriate form.

By the end of this chapter you should be able to:

1. Understand the benefits and limitations of convex optimization.
2. Identify and solve linear and quadratic optimization prob-
lems.
3. Formulate and solve convex optimization problems.
4. Identify and solve geometric programming problems.

11.1 Introduction

Convex optimization problems have desirable characteristics that make


them more predictable and easier to solve. Because a convex prob-
lem has provably only one optimum, convex optimization methods
always converge to the global minimum. Solving convex problems is
straightforward and does not require a starting point, parameter tuning,
or derivatives, and such problems scale well up to millions of design
variables.179

179. Diamond and Boyd, Convex optimization with abstract linear operators, 2015.
All we need to solve a convex problem is to set it up appropriately;
there is no need to worry about convergence, local optima, or noisy
functions. Some convex problems are so straightforward that they
are not even recognized as optimization problems and are just thought
of as a function or operation. A familiar example of the latter is the
linear least-squares problem (described previously in Section 10.3.1
and revisited in a subsequent section).
Although these are desirable properties, the catch is that convex
problems must satisfy strict requirements. Namely, the objective and
all inequality constraints must be convex functions, and the equality
constraints must be affine.∗

A function 𝑓 is convex if

    𝑓((1 − 𝜂)𝑥1 + 𝜂𝑥2) ≤ (1 − 𝜂) 𝑓(𝑥1) + 𝜂 𝑓(𝑥2)                 (11.1)

for all 𝑥1 and 𝑥2 in the domain, where 0 ≤ 𝜂 ≤ 1. This requirement is
illustrated in Fig. 11.1 for the one-dimensional case. The right-hand
side of the inequality is just the equation of a line from 𝑓(𝑥1) to 𝑓(𝑥2)
(the blue line), whereas the left-hand side is the function 𝑓(𝑥) evaluated
at all points between 𝑥1 and 𝑥2 (the black curve). The inequality says
that the function must always be below a line joining any two points in
the domain. Stated informally, a convex function looks something like
a bowl.

∗ An affine function consists of a linear transformation and a translation.
Informally, this type of function is often referred to as linear (including in
this book), but strictly, these are distinct concepts. For example, 𝐴𝑥 is a
linear function in 𝑥, whereas 𝐴𝑥 + 𝑏 is an affine function in 𝑥.

Fig. 11.1 Convex function definition in the one-dimensional case: the function
(black line) must be below a line that connects any two points in the domain
(blue line).

Unfortunately, even these strict requirements are not enough. In
general, we cannot identify a given problem as convex or take advantage
of its structure to solve it efficiently and must therefore treat it as
a general nonlinear problem. There are two approaches to taking
advantage of convexity. The first one is to directly formulate the
problem in a known convex form, such as a linear program or a
quadratic program (discussed later in this chapter). The second option
is to use disciplined convex optimization, a specific set of rules and
mathematical functions that we can use to build up a convex problem.
By following these rules, we can automatically translate the problem
into an efficiently solvable form.
Although both of these approaches are straightforward to apply, they
also expose the main weakness of these methods: we need to express
the objective and inequality constraints using only these elementary
functions and operations. In most cases, this requirement means that
the model must be simplified. Often, a problem is not directly expressed
in a convex form, and a combination of experience and creativity is
needed to reformulate the problem in an equivalent manner that is
convex.
Simplifying models usually results in a fidelity reduction. This
is less problematic for optimization problems intended to be solved
repeatedly, such as in optimal control and machine learning, which are
domains in which convex optimization is heavily used. In these cases,
simplification by local linearization, for example, is less problematic

because the linearization can be updated in the next time step. However,
this fidelity reduction is problematic for design applications.
In design scenarios, the optimization is performed once, and the
design cannot continue to be updated after it is created. For this reason,
convex optimization is less frequently used for design applications, ex-
cept for some limited uses in geometric programming, a topic discussed
in more detail in Section 11.6.
This chapter just introduces convex optimization and is not a re-
placement for more comprehensive textbooks on the topic.† We focus on
understanding what convex optimization is useful for and describing
the most widely used forms.

† Boyd and Vandenberghe86 is the most cited textbook on convex optimization.
86. Boyd and Vandenberghe, Convex Optimization, 2004.
The known categories of convex optimization problems include
linear programming, quadratic programming, second-order cone pro-
gramming, semidefinite programming, cone programming, and graph
form programming. Each of these categories is a subset of the next
(Fig. 11.2).‡

We focus on the first three because they are the most widely used,
including in other chapters in this book. The latter three forms are
less frequently formulated directly. Instead, users apply elementary
functions and operations and the rules specified by disciplined convex
programming, and a software tool transforms the problem into a
suitable conic form that can be solved. Section 11.5 describes this
procedure.

After covering the three main categories of convex optimization
problems, we discuss geometric programming. Geometric programming
problems are not convex, but with a change of variables, they can be
transformed into an equivalent convex form, thus extending the types
of problems that can be solved with convex optimization.

Fig. 11.2 Relationship between various convex optimization problems:
linear programming (LP) ⊂ quadratic programming (QP) ⊂ second-order cone
programming (SOCP) ⊂ semidefinite programming (SDP) ⊂ cone programming
(CP) ⊂ graph form programming (GFP).

‡ Several references exist with examples for those categories that we do not
discuss in detail.180–183
180. Lobo et al., Applications of second-order cone programming, 1998.
181. Parikh and Boyd, Block splitting for distributed optimization, 2013.
182. Vandenberghe and Boyd, Semidefinite programming, 1996.
183. Vandenberghe and Boyd, Applications of semidefinite programming, 1999.

11.2 Linear Programming

A linear program (LP) is an optimization problem with a linear objective
and linear constraints and can be written as

    minimize_𝑥   𝑓ᵀ𝑥
    subject to   𝐴𝑥 + 𝑏 = 0                                      (11.2)
                 𝐶𝑥 + 𝑑 ≤ 0 ,

where 𝑓, 𝑏, and 𝑑 are vectors and 𝐴 and 𝐶 are matrices. All LPs are
convex.

Example 11.1 Formulating a linear programming problem

Suppose we are shopping and want to find how best to meet our nutritional
needs for the lowest cost. We enumerate all the food options and use the
variable 𝑥 𝑗 to represent how much of food 𝑗 we purchase. The parameter 𝑐 𝑗 is
the cost of a unit amount of food 𝑗. The parameter 𝑁𝑖𝑗 is the amount of nutrient
𝑖 contained in a unit amount of food 𝑗. We need to make sure we have at least
𝑟 𝑖 of nutrient 𝑖 to meet our dietary requirements. We can now formulate the
cost objective as

    minimize_𝑥   Σ𝑗 𝑐𝑗 𝑥𝑗 = 𝑐ᵀ𝑥 .

To meet the nutritional requirement of nutrient 𝑖, we need to satisfy

    Σ𝑗 𝑁𝑖𝑗 𝑥𝑗 ≥ 𝑟𝑖   ⇒   𝑁𝑥 ≥ 𝑟 .

Finally, we cannot purchase a negative amount of food, so 𝑥 ≥ 0. The objective


and all of the constraints are linear in 𝑥, so this is an LP (where 𝑓 ≡ 𝑐, 𝐶 ≡ −𝑁,
𝑑 ≡ 𝑟 in Eq. 11.2). We do not need to artificially restrict which foods we include
in our initial list of possibilities. The formulation allows the optimizer to select
a given food item 𝑥 𝑖 to be zero (i.e., do not purchase any of that food item),
according to what is optimal.
As a concrete example, consider a simplified version (and a reductionist
view of nutrition) with 10 food options and three nutrients with the amounts
listed in the following table.

Food Cost Nutrient 1 Nutrient 2 Nutrient 3


A 0.46 0.56 0.29 0.48
B 0.54 0.84 0.98 0.55
C 0.40 0.23 0.36 0.78
D 0.39 0.48 0.14 0.59
E 0.49 0.05 0.26 0.79
F 0.03 0.69 0.41 0.84
G 0.66 0.87 0.87 0.01
H 0.26 0.85 0.97 0.77
I 0.05 0.88 0.13 0.13
J 0.60 0.62 0.69 0.10

If the amount of each food is 𝑥, the cost column is 𝑐, and the nutrient
columns are 𝑛1 , 𝑛2 , and 𝑛3 , we can formulate the LP as

    minimize_𝑥   𝑐ᵀ𝑥
    subject to   5 ≤ 𝑛1ᵀ𝑥 ≤ 8
                 7 ≤ 𝑛2ᵀ𝑥
                 1 ≤ 𝑛3ᵀ𝑥 ≤ 10
                 0 ≤ 𝑥 ≤ 4 .

The last constraint ensures that we do not overeat any one item and get tired of
it. LP solvers are widely available, and because the inputs of an LP are just a
table of numbers, some solvers do not even require a programming language.
The solution for this problem is

𝑥 = [0, 1.43, 0, 0, 0, 4.00, 0, 4.00, 0.73, 0] ,

suggesting that our optimal diet consists of items B, F, H, and I in the proportions
shown here. The solution reached the upper limit for nutrient 1 and the lower
limit for nutrient 2.
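
For reference, a problem in this form can be handed directly to an off-the-shelf
LP solver. The following is a minimal sketch (not part of the original example)
using SciPy's linprog: the table columns are transcribed as vectors, and each
two-sided constraint is split into a pair of one-sided inequalities.

```python
# Minimal sketch: the diet LP solved with SciPy's linprog.
import numpy as np
from scipy.optimize import linprog

c  = np.array([0.46, 0.54, 0.40, 0.39, 0.49, 0.03, 0.66, 0.26, 0.05, 0.60])
n1 = np.array([0.56, 0.84, 0.23, 0.48, 0.05, 0.69, 0.87, 0.85, 0.88, 0.62])
n2 = np.array([0.29, 0.98, 0.36, 0.14, 0.26, 0.41, 0.87, 0.97, 0.13, 0.69])
n3 = np.array([0.48, 0.55, 0.78, 0.59, 0.79, 0.84, 0.01, 0.77, 0.13, 0.10])

# 5 <= n1.x <= 8,  7 <= n2.x,  1 <= n3.x <= 10, rewritten as A_ub @ x <= b_ub
A_ub = np.vstack([-n1, n1, -n2, -n3, n3])
b_ub = np.array([-5.0, 8.0, -7.0, -1.0, 10.0])

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, 4)] * 10)
print(res.x, res.fun)   # purchase amounts and total cost
```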

LPs frequently occur with allocation or assignment problems, such
as choosing an optimal portfolio of stocks, deciding what mix of
products to build, deciding what tasks should be assigned to each
worker, or determining which goods to ship to which locations. These
types of problems frequently occur in domains such as operations
research, finance, supply chain management, and transportation.∗

A common consideration with LPs is whether or not the variables
should be discrete. In Ex. 11.1, 𝑥𝑖 is a continuous variable, and purchasing
fractional amounts of food may or may not be possible, depending
on the type of food. Suppose we were performing an optimal stock
allocation. In that case, we can purchase fractional amounts of stock.
However, if we were optimizing how much of each product to manufacture,
it might not be feasible to build 32.4 products. In these cases, we need to
restrict the variables to be integers using integer constraints. These types
of problems require discrete optimization algorithms, which are covered
in Chapter 8. Specifically, we discussed a mixed-integer LP in Section 8.3.

∗ See Section 2.3 for a brief historical background on the development of LP
and its applications.

11.3 Quadratic Programming

A quadratic program (QP) has a quadratic objective and linear constraints.
Quadratic programming was introduced in Section 5.5 in the context of
sequential quadratic programming. A general QP can be expressed as
follows:

    minimize_𝑥   (1/2) 𝑥ᵀ𝑄𝑥 + 𝑓ᵀ𝑥
    subject to   𝐴𝑥 + 𝑏 = 0                                      (11.3)
                 𝐶𝑥 + 𝑑 ≤ 0 .

A QP is only convex if the matrix 𝑄 is positive semidefinite. If 𝑄 = 0, a
QP reduces to an LP.

One of the most common QP examples is least squares regression,
which was discussed previously in Section 10.3.1 and is used in many
applications such as data fitting.
applications such as data fitting.
The linear least-squares problem has an analytic solution if 𝐴 has
full rank, so the machinery of a QP is not necessary. However, we can
add constraints in QP form to solve constrained least squares problems,
which do not have analytic solutions in general.

Example 11.2 A constrained least squares QP

The left pane of Fig. 11.3 shows some example data that are both noisy and
biased relative to the true (but unknown) underlying curve, represented as a
dashed line. Given the data points, we would like to estimate the underlying
functional relationship. We assume that the relationship is cubic and write it as

    𝑦(𝑥) = 𝑎1 𝑥³ + 𝑎2 𝑥² + 𝑎3 𝑥 + 𝑎4 .

We need to estimate the coefficients 𝑎1 , . . . , 𝑎 4 . As discussed previously, this


can be posed as a QP problem or, even more simply, as an analytic problem.
The middle pane of Fig. 11.3 shows the resulting least squares fit.

Fig. 11.3 True function on the left, least squares in the middle, and constrained
least squares on the right.

Suppose that we know the upper bound of the function value based on
measurements or additional data at a few locations. In this example, assume
that we know that 𝑓 (−2) ≤ −2, 𝑓 (0) ≤ 4, and 𝑓 (2) ≤ 26. These requirements
can be posed as linear constraints:

    ⎡(−2)³  (−2)²  −2  1⎤ ⎡𝑎1⎤   ⎡−2⎤
    ⎢  0      0     0  1⎥ ⎢𝑎2⎥ ≤ ⎢ 4⎥ .
    ⎣  2³     2²    2  1⎦ ⎢𝑎3⎥   ⎣26⎦
                          ⎣𝑎4⎦
After adding these linear constraints and retaining a quadratic objective
(the sum of the squared error), the resulting problem is still a QP. The resulting
solution is shown in the right pane of Fig. 11.3, which results in a much more
accurate fit.
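
A constrained least-squares fit like this one takes only a few lines with a
convex-optimization modeling tool. The sketch below (not from the original
example) uses CVXPY with synthetic placeholder data, since the data behind
Fig. 11.3 are not tabulated here; the constraint rows follow the matrix above.

```python
# Minimal sketch: constrained cubic least squares as a QP in CVXPY.
import numpy as np
import cvxpy as cp

xd = np.linspace(-2, 2, 30)                                     # placeholder sample locations
yd = 2 * xd**3 + xd + 3 + np.random.normal(2.0, 3.0, xd.size)   # noisy, biased placeholder data

V = np.vander(xd, 4)                            # columns: x^3, x^2, x, 1
a = cp.Variable(4)                              # polynomial coefficients a1..a4

C = np.vander(np.array([-2.0, 0.0, 2.0]), 4)    # upper bounds at x = -2, 0, 2
d = np.array([-2.0, 4.0, 26.0])

prob = cp.Problem(cp.Minimize(cp.sum_squares(V @ a - yd)), [C @ a <= d])
prob.solve()
print(a.value)   # fitted coefficients
```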

Example 11.3 Linear-quadratic regulator (LQR) controller

Another common example of a QP occurs in optimal control. Consider the


following discrete-time linear dynamic system:

𝑥 𝑡+1 = 𝐴𝑥 𝑡 + 𝐵𝑢𝑡 ,

where 𝑥 𝑡 is the deviation from a desired state at time 𝑡 (e.g., the positions and
velocities of an aircraft), and 𝑢𝑡 represents the control inputs that we want to
optimize (e.g., control surface deflections). This dynamic equation can be used
as a set of linear constraints in an optimization problem, but we must decide
on an objective.
We would like to have small 𝑥 𝑡 because that would mean reducing the error
in our desired state quickly, but we would also like to have small 𝑢𝑡 because
small control inputs require less energy. These are competing objectives, where
a small control input will take longer to minimize error in a state, and vice
versa.
One way to express this objective is as a quadratic function,

    minimize_{𝑥,𝑢}   (1/2) Σ_{𝑡=0}^{𝑛} ( 𝑥𝑡ᵀ𝑄𝑥𝑡 + 𝑢𝑡ᵀ𝑅𝑢𝑡 ) ,

where the weights in 𝑄 and 𝑅 reflect our preferences on how important it is
to have a small state error versus small control inputs.∗ This function has a
form similar to kinetic energy, and the LQR problem could be thought of as
determining the control inputs that minimize the energy expended, subject
to the vehicle dynamics. This choice of the objective function was intentional
because the problem is a convex QP (as long as we choose positive weights).
Because it is convex, this problem can be solved reliably and efficiently, which
are necessary conditions for a robust control law.

∗ This is an example of a multiobjective function, which is explained in Chapter 9.
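
A minimal sketch of this finite-horizon problem as a convex QP is shown below
(not from the original example); the dynamics matrices, weights, horizon, and
initial state are illustrative placeholders rather than a particular vehicle model.

```python
# Minimal sketch: finite-horizon LQR posed as a QP in CVXPY.
import numpy as np
import cvxpy as cp

A = np.array([[1.0, 0.1], [0.0, 1.0]])   # placeholder dynamics
B = np.array([[0.0], [0.1]])
Q = np.eye(2)                            # state-error weight
R = 10.0 * np.eye(1)                     # control-effort weight
n, m, T = 2, 1, 30
x0 = np.array([1.0, 0.0])                # initial state deviation

X = cp.Variable((n, T + 1))
U = cp.Variable((m, T))

cost = 0
constraints = [X[:, 0] == x0]
for t in range(T):
    cost += 0.5 * (cp.quad_form(X[:, t], Q) + cp.quad_form(U[:, t], R))
    constraints.append(X[:, t + 1] == A @ X[:, t] + B @ U[:, t])   # dynamics

prob = cp.Problem(cp.Minimize(cost), constraints)
prob.solve()
print(prob.value)   # U.value holds the optimal control schedule
```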

11.4 Second-Order Cone Programming

A second-order cone program (SOCP) has a linear objective and a second-
order cone constraint:

    minimize_𝑥   𝑓ᵀ𝑥
    subject to   ‖𝐴𝑖𝑥 + 𝑏𝑖‖₂ ≤ 𝑐𝑖ᵀ𝑥 + 𝑑𝑖                          (11.4)
                 𝐺𝑥 + ℎ = 0 .

If 𝐴𝑖 = 0, then this form reduces to an LP.


One useful subset of SOCP is a quadratically constrained quadratic
program (QCQP). A QCQP is the same as a QP but has quadratic

inequality constraints instead of linear ones, that is,

    minimize_𝑥   (1/2) 𝑥ᵀ𝑄𝑥 + 𝑓ᵀ𝑥
    subject to   𝐴𝑥 + 𝑏 = 0                                      (11.5)
                 (1/2) 𝑥ᵀ𝑅𝑖𝑥 + 𝑐𝑖ᵀ𝑥 + 𝑑𝑖 ≤ 0   for 𝑖 = 1, . . . , 𝑚 ,
where 𝑄 and 𝑅 must be positive semidefinite for the QCQP to be
convex. A QCQP reduces to a QP if 𝑅 = 0. We formulated QCQPs
when solving trust-region problems in Section 4.5. However, for trust-
region problems, only an approximate solution method is typically
used.
Every QCQP can be expressed as an SOCP (although not vice versa).
The QCQP in Eq. 11.5 can be written in the equivalent form,

    minimize_{𝑥,𝛽}   𝛽
    subject to       ‖𝐹𝑥 + 𝑔‖₂ ≤ 𝛽                               (11.6)
                     𝐴𝑥 + 𝑏 = 0
                     ‖𝐺𝑖𝑥 + ℎ𝑖‖₂ ≤ 0 .

If we square both sides of the first and last constraints, this formulation
is exactly equivalent to the QCQP where 𝑄 = 2𝐹ᵀ𝐹, 𝑓 = 2𝐹ᵀ𝑔, 𝑅𝑖 = 2𝐺𝑖ᵀ𝐺𝑖,
𝑐𝑖 = 2𝐺𝑖ᵀℎ𝑖, and 𝑑𝑖 = ℎ𝑖ᵀℎ𝑖. The matrices 𝐹 and 𝐺𝑖 are the square roots of
the matrices 𝑄 and 𝑅𝑖, respectively (divided by 2), and would be computed
from a factorization.
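
As a concrete illustration of the SOCP form in Eq. 11.4, the sketch below (not
from the original text) builds a small problem in CVXPY; the data are arbitrary
placeholders, and an extra norm bound on 𝑥 is added only to keep the toy
problem bounded.

```python
# Minimal sketch: a small SOCP in the form of Eq. 11.4, using CVXPY.
import numpy as np
import cvxpy as cp

f  = np.array([1.0, 2.0, -1.0])
A1 = np.array([[1.0, 0.5, 0.0],
               [0.0, 1.0, -0.5]])
b1 = np.array([0.2, -0.1])
c1 = np.array([0.5, 0.0, 1.0])
d1 = 5.0

x = cp.Variable(3)
constraints = [cp.norm(A1 @ x + b1, 2) <= c1 @ x + d1,   # second-order cone constraint
               cp.norm(x, 2) <= 10]                      # keeps the toy problem bounded
prob = cp.Problem(cp.Minimize(f @ x), constraints)
prob.solve()
print(x.value)
```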

11.5 Disciplined Convex Optimization

Disciplined convex optimization builds convex problems using a specific
set of rules and mathematical functions. By following this set of
rules, the problem can be translated automatically into a conic form
that we can efficiently solve using convex optimization algorithms.46
Table 11.1 shows several examples of convex functions that can be used
to build convex problems. Notice that not all functions are continuously
differentiable because this is not a requirement of convexity.

46. Grant et al., Disciplined convex programming, 2006.
Table 11.1 Examples of convex functions.

    𝑒^{𝑎𝑥}
    −𝑥^𝑎 if 0 ≤ 𝑎 ≤ 1;  𝑥^𝑎 otherwise   (for 𝑥 > 0)
    −log(𝑥)
    ‖𝑥‖₁, ‖𝑥‖₂, . . .
    max(𝑥1, 𝑥2, . . . , 𝑥𝑛)
    ln(𝑒^{𝑥1} + 𝑒^{𝑥2} + · · · + 𝑒^{𝑥𝑛})

A disciplined convex problem can be formulated using any of these
functions for the objective and inequality constraints. We can also use
various operations that preserve convexity to build up more complex
functions. Some of the more common operations are as follows:

• Multiplying a convex function by a positive constant
• Adding convex functions
• Composing a convex function with an affine function (i.e., if 𝑓(𝑥)
  is convex, then 𝑓(𝐴𝑥 + 𝑏) is also convex)
• Taking the maximum of two convex functions

Although these functions and operations greatly expand the types


of convex problems that we can solve beyond LPs and QPs, they are
still restrictive within the broader scope of nonlinear optimization. Still,
for objectives and constraints that require only simple mathematical
expressions, there is the possibility that the problem can be posed as a
disciplined convex optimization problem.
The original expression of a problem is often not convex but can be
made convex through a transformation to a mathematically equivalent
problem. These transformation techniques include implementing a
change of variables, adding slack variables, or expressing the objective

in a different form. Successfully recognizing and applying these


techniques is a skill requiring experience.

Tip 11.1 Software for disciplined convex programming

CVX and its variants are free, popular tools for disciplined convex program-
ming with interfaces for multiple programming languages.∗

∗ https://stanford.edu/~boyd/software.html
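
As an illustration (not from the original text), the sketch below builds a small
disciplined convex problem in CVXPY, the Python variant of CVX, using two of
the functions from Table 11.1 composed with affine expressions; the data are
arbitrary placeholders.

```python
# Minimal sketch: a disciplined convex problem built from log-sum-exp and a norm.
import numpy as np
import cvxpy as cp

A = np.array([[1.0, 2.0], [3.0, 1.0]])
b = np.array([1.0, -1.0])

x = cp.Variable(2)
objective = cp.Minimize(cp.log_sum_exp(A @ x + b) + cp.norm(x, 2))
constraints = [cp.sum(x) == 1, x >= -5]

prob = cp.Problem(objective, constraints)
prob.solve()
print(x.value, prob.value)
```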

Example 11.4 A supervised learning classification problem

A classification problem seeks to determine a decision boundary between


two sets of data. For example, given a large set of engineering parts, each
associated with a label identifying whether it was defective or not, we would
like to determine an optimal set of parameters that allow us to predict whether
a new part will be defective or not. First, we have to decide on a set of features, or
properties that we use to characterize each data point. For an engineering part,
for example, these features might include dimensions, weights and moments
of inertia, or surface finish.
If the data are separable, we could find a hyperplane,

    𝑓(𝑥) = 𝑎ᵀ𝑥 + 𝛽 ,

that separates the two data sets, or in other words, a function that classifies the
objects. For example, if we call one data set 𝑦𝑖, for 𝑖 = 1, . . . , 𝑛𝑦, and the other
𝑧𝑖, for 𝑖 = 1, . . . , 𝑛𝑧, we need to satisfy the following constraints:

    𝑎ᵀ𝑦𝑖 + 𝛽 ≥ 𝜀
    𝑎ᵀ𝑧𝑖 + 𝛽 ≤ −𝜀 ,                                              (11.7)

for some small tolerance 𝜀. In general, there are an infinite number of separating
hyperplanes, so we seek the one that maximizes the distance between the points.
However, such a problem is not yet well defined because we can multiply 𝑎 and
𝛽 in the previous equations by an arbitrary constant to achieve any separation
we want, so we need to normalize or fix some reference dimension (only the
ratio of the parameters matters in defining the hyperplane, not their absolute
magnitudes). We define the optimization problem as follows:

maximize 𝛾
by varying 𝛾, 𝑎, 𝛽
subject to 𝑎 | 𝑦 𝑖 + 𝛽 ≥ 𝛾 for 𝑖 = 1 . . . 𝑛 𝑦
𝑎 | 𝑧 𝑗 + 𝛽 ≤ −𝛾 for 𝑗 = 1, . . . , 𝑛 𝑧
k𝑎k ≤ 1 .

The last constraint provides a normalization to prevent the problem from being
unbounded. This norm constraint is always active (‖𝑎‖ = 1), but we express
it as an inequality so that the problem remains convex (recall that equality
constraints must be affine, but inequality constraints can be any convex function).
The objective and inequality constraints are all convex functions, so we can
solve it in a disciplined convex programming environment. Alternatively, in
this case, we could employ a change of variables to put the problem in QP form
if desired.
An example is shown in Fig. 11.4 for data with two features for easy
visualization. The middle line shows the separating hyperplane, and the outer
lines are a distance of 𝛾 away, just passing through a data point from each set.

Fig. 11.4 Two separable data sets are shown as points with two different colors.
A classification boundary with maximum width is shown.

If the data are not completely separable, we need to modify our approach.
Even if the data are separable, outliers may undesirably pull the hyperplane
so that points are closer to the boundary than is necessary. To address these
issues, we need to relax the constraints. As discussed, Eq. 11.7 can always be
multiplied by an arbitrary constant. Therefore, we can equivalently express the
constraints as follows:

    𝑎ᵀ𝑦𝑖 + 𝛽 ≥ 1
    𝑎ᵀ𝑧𝑗 + 𝛽 ≤ −1 .

To relax these constraints, we add nonnegative slack variables, 𝑢𝑖 and 𝑣𝑗:

    𝑎ᵀ𝑦𝑖 + 𝛽 ≥ 1 − 𝑢𝑖
    𝑎ᵀ𝑧𝑗 + 𝛽 ≤ −(1 − 𝑣𝑗) ,

where we seek to minimize the sum of the entries in 𝑢 and 𝑣. If they sum
to 0, we have the original constraints for a completely separable function.
However, recall that we are interested in not just creating separation but also in
maximizing the distance to the classification boundary. To accomplish this, we
use a regularization approach where our two objectives include maximizing
the distance from the boundary and maximizing the sum of the classification
margins. The width between the two planes 𝑎ᵀ𝑥 + 𝛽 = 1 and 𝑎ᵀ𝑥 + 𝛽 = −1 is
2/‖𝑎‖. Therefore, to maximize the separation distance, we minimize ‖𝑎‖. The
optimization problem is defined as follows:†

    minimize     ‖𝑎‖ + 𝜔 ( Σ𝑖 𝑢𝑖 + Σ𝑗 𝑣𝑗 )
    by varying   𝑎, 𝛽, 𝑢, 𝑣
    subject to   𝑎ᵀ𝑦𝑖 + 𝛽 ≥ 1 − 𝑢𝑖 ,      𝑖 = 1, . . . , 𝑛𝑦
                 𝑎ᵀ𝑧𝑗 + 𝛽 ≤ −(1 − 𝑣𝑗) ,   𝑗 = 1, . . . , 𝑛𝑧
                 𝑢 ≥ 0
                 𝑣 ≥ 0 .

Here, 𝜔 is a user-chosen weight reflecting a preference for the trade-offs in
separation margin and stricter classification. The problem is still convex, and
an example is shown in Fig. 11.5 with a weight of 𝜔 = 1.
The methodology can handle nonlinear classifiers by using a different form
with kernel functions like those discussed in Section 10.4.

Fig. 11.5 A classification boundary is shown for nonseparable data using a
regularization approach.

† In the machine learning community, this optimization problem is known as
a support vector machine. This problem is an example of supervised learning
because classification labels were provided. Classification can be done without
labels but requires a different approach under the umbrella of unsupervised
learning.
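
A minimal sketch of the soft-margin formulation above is shown below (not part
of the original example); the two point clouds are randomly generated placeholders,
and 𝜔 = 1 follows the example.

```python
# Minimal sketch: the regularized (soft-margin) classifier solved with CVXPY.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(1)
Y = rng.normal(loc=[2.0, 2.0], scale=0.8, size=(20, 2))     # one placeholder data set
Z = rng.normal(loc=[-1.0, -1.0], scale=0.8, size=(20, 2))   # the other data set
omega = 1.0

a = cp.Variable(2)
beta = cp.Variable()
u = cp.Variable(20, nonneg=True)   # slack variables
v = cp.Variable(20, nonneg=True)

constraints = [Y @ a + beta >= 1 - u,
               Z @ a + beta <= -(1 - v)]
objective = cp.Minimize(cp.norm(a, 2) + omega * (cp.sum(u) + cp.sum(v)))

cp.Problem(objective, constraints).solve()
print(a.value, beta.value)   # classify a new point by the sign of a.x + beta
```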

11.6 Geometric Programming

A geometric program (GP) is not convex but can be transformed into an
equivalent convex problem. GPs are formulated using monomials and
posynomials. A monomial is a function of the following form:

    𝑓(𝑥) = 𝑐 𝑥1^{𝑎1} 𝑥2^{𝑎2} · · · 𝑥𝑚^{𝑎𝑚} ,                      (11.8)

where 𝑐 > 0 and all 𝑥𝑖 > 0. A posynomial is a sum of monomials:

    𝑓(𝑥) = Σ_{𝑗=1}^{𝑛} 𝑐𝑗 𝑥1^{𝑎1𝑗} 𝑥2^{𝑎2𝑗} · · · 𝑥𝑚^{𝑎𝑚𝑗} ,        (11.9)

where all 𝑐𝑗 > 0.

Example 11.5 Monomials and posynomials in engineering models

Monomials and posynomials appear in many engineering models. For
example, the calculation of lift from the definition of the lift coefficient is a
monomial:

    𝐿 = (1/2) 𝐶𝐿 𝜌 𝑉² 𝑆 .

Total incompressible drag, a sum of parasitic and induced drag, is a posynomial:

    𝐷 = 𝐶𝐷𝑝 𝑞 𝑆 + ( 𝐶𝐿² / (𝜋 𝐴𝑅 𝑒) ) 𝑞 𝑆 .

A GP in standard form is written as follows:

    minimize_𝑥   𝑓0(𝑥)
    subject to   𝑓𝑖(𝑥) ≤ 1                                       (11.10)
                 ℎ𝑖(𝑥) = 1 ,

where 𝑓𝑖 are posynomials, and ℎ 𝑖 are monomials. This problem does not
fit into any of the convex optimization problems defined in the previous
section, and it is not convex. This formulation is useful because we can
convert it into an equivalent convex optimization problem.

First, we take the logarithm of the objective and of both sides of the
constraints:

    minimize_𝑥   ln 𝑓0(𝑥)
    subject to   ln 𝑓𝑖(𝑥) ≤ 0                                    (11.11)
                 ln ℎ𝑖(𝑥) = 0 .

Let us examine the equality constraints further. Recall that ℎ𝑖 is a
monomial, so writing one of the constraints explicitly results in the
following form:

    ln( 𝑐 𝑥1^{𝑎1} 𝑥2^{𝑎2} . . . 𝑥𝑚^{𝑎𝑚} ) = 0 .                   (11.12)

Using the properties of logarithms, this can be expanded to the equiva-
lent expression:

    ln 𝑐 + 𝑎1 ln 𝑥1 + 𝑎2 ln 𝑥2 + . . . + 𝑎𝑚 ln 𝑥𝑚 = 0 .           (11.13)

Introducing the change of variables 𝑦𝑖 = ln 𝑥𝑖 results in the following
equality constraint:

    𝑎1 𝑦1 + 𝑎2 𝑦2 + . . . + 𝑎𝑚 𝑦𝑚 + ln 𝑐 = 0 ,   that is,   𝑎ᵀ𝑦 + ln 𝑐 = 0 ,   (11.14)

which is an affine constraint in 𝑦.

The objective and inequality constraints are more complex because
they are posynomials. The expression ln 𝑓𝑖 written in terms of a
posynomial results in the following:

    ln( Σ_{𝑗=1}^{𝑛} 𝑐𝑗 𝑥1^{𝑎1𝑗} 𝑥2^{𝑎2𝑗} . . . 𝑥𝑚^{𝑎𝑚𝑗} ) .         (11.15)

Because this is a sum of products, we cannot use the logarithm to
expand each term. However, we still introduce the same change of
variables (expressed as 𝑥𝑖 = 𝑒^{𝑦𝑖}):

    ln 𝑓𝑖 = ln( Σ_{𝑗=1}^{𝑛} 𝑐𝑗 exp(𝑦1 𝑎1𝑗) exp(𝑦2 𝑎2𝑗) . . . exp(𝑦𝑚 𝑎𝑚𝑗) )
          = ln( Σ_{𝑗=1}^{𝑛} 𝑐𝑗 exp(𝑦1 𝑎1𝑗 + 𝑦2 𝑎2𝑗 + . . . + 𝑦𝑚 𝑎𝑚𝑗) )          (11.16)
          = ln( Σ_{𝑗=1}^{𝑛} exp(𝑎𝑗ᵀ𝑦 + 𝑏𝑗) ) ,   where 𝑏𝑗 = ln 𝑐𝑗 .
This is a log-sum-exp of an affine function. As mentioned in the previous
section, log-sum-exp is convex, and a convex function composed of an

affine function is a convex function. Thus, the objective and inequality


constraints are convex in 𝑦. Because the equality constraints are also
affine, we have a convex optimization problem obtained through a
change of variables.

Example 11.6 Maximizing volume of a box as a geometric program

Suppose we want to maximize the volume of a box with a constraint on the
total surface area (i.e., the material used) and a constraint on the aspect ratio of
the base of the box.∗ We parameterize the box by its height 𝑥ℎ, width 𝑥𝑤, and
depth 𝑥𝑑:

    maximize     𝑥ℎ 𝑥𝑤 𝑥𝑑
    by varying   𝑥ℎ, 𝑥𝑤, 𝑥𝑑
    subject to   2(𝑥ℎ 𝑥𝑤 + 𝑥ℎ 𝑥𝑑 + 𝑥𝑤 𝑥𝑑) ≤ 𝐴
                 𝛼𝑙 ≤ 𝑥𝑤 / 𝑥𝑑 ≤ 𝛼ℎ .

We can express this problem in GP form (Eq. 11.10):

    minimize     𝑥ℎ⁻¹ 𝑥𝑤⁻¹ 𝑥𝑑⁻¹
    by varying   𝑥ℎ, 𝑥𝑤, 𝑥𝑑
    subject to   (2/𝐴) 𝑥ℎ 𝑥𝑤 + (2/𝐴) 𝑥ℎ 𝑥𝑑 + (2/𝐴) 𝑥𝑤 𝑥𝑑 ≤ 1
                 (1/𝛼ℎ) 𝑥𝑤 𝑥𝑑⁻¹ ≤ 1
                 𝛼𝑙 𝑥𝑑 𝑥𝑤⁻¹ ≤ 1 .

We can now plug this into a GP solver. For this example, we use the
following parameters: 𝛼𝑙 = 2, 𝛼ℎ = 8, 𝐴 = 100. The solution is 𝑥𝑑 = 2.887,
𝑥ℎ = 3.849, 𝑥𝑤 = 5.774, with a total volume of 64.16.

∗ Based on an example from Boyd et al.184
184. Boyd et al., A tutorial on geometric programming, 2007.

Unfortunately, many other functions do not fit this form (e.g., design
variables that can be positive or negative, terms with negative coefficients,
trigonometric functions, logarithms, and exponents). GP modelers use
various techniques to extend usability, including using a Taylor series
across a restricted domain, fitting functions to posynomials,185 and
rearranging expressions to other equivalent forms, including implicit
relationships. Creativity and some sacrifice in fidelity are usually needed
to create a corresponding GP from a general nonlinear programming
problem. However, if the sacrifice in fidelity is not too great, there is a
significant advantage because the formulation comes with all the benefits
of convexity: guaranteed convergence, global optimality, efficiency, no
parameter tuning, and limited scaling issues.

185. Hoburg et al., Data fitting with geometric-programming-compatible softmax functions, 2016.

One extension to geometric programming is signomial programming.
A signomial program has the same form, except that the coefficients 𝑐𝑖
can be positive or negative (the design variables 𝑥𝑖 must still be strictly
positive). Unfortunately, this problem cannot be transformed into a
convex one, so a global optimum is no longer guaranteed. Still, a
signomial program can usually be solved using a sequence of geometric
programs, so it is much more efficient than solving the general nonlinear
problem. Signomial programs have been used to extend the range
of design problems that can be solved using geometric programming
techniques.186,187

186. Kirschen et al., Application of signomial programming to aircraft design, 2018.
187. York et al., Turbofan engine sizing and tradeoff analysis via signomial programming, 2018.

Tip 11.2 Software for geometric programming

GPkit† is a freely available software package for posing and solving geometric
programming (and signomial programming) models.

† https://gpkit.readthedocs.io
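
As a sketch of how such a model looks in practice, the box problem from
Ex. 11.6 can be posed in GPkit roughly as follows (this is our illustration, not
part of the original text; variable names and output handling depend on the
GPkit version installed).

```python
# Minimal sketch: the box volume GP from Ex. 11.6 posed in GPkit.
from gpkit import Variable, Model

h = Variable("h")   # height
w = Variable("w")   # width
d = Variable("d")   # depth

A, alpha_l, alpha_h = 100, 2, 8

constraints = [2 * (h * w + h * d + w * d) <= A,   # surface-area limit
               w / d <= alpha_h,                   # aspect-ratio limits
               w / d >= alpha_l]

# Maximize the volume by minimizing its reciprocal, which is a monomial.
m = Model(1 / (h * w * d), constraints)
sol = m.solve(verbosity=0)
print(sol["variables"])   # expect roughly h = 3.85, w = 5.77, d = 2.89
```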

11.7 Summary

Convex optimization problems are highly desirable because they do not


require parameter tuning, starting points, or derivatives and converge
reliably and rapidly to the global optimum. The trade-off is that the form
of the objective and constraints must meet stringent requirements. These
requirements often necessitate simplifying the physics models and
implementing clever reformulations. The reduction in model fidelity is
acceptable in domains where optimizations are performed repeatedly
in time (e.g., controls, machine learning) or for high-level conceptual
design studies. Linear programming and quadratic programming, in
particular, are widely used across many domains and form the basis of
many of the gradient-based algorithms used to solve general nonconvex
problems.

Problems

11.1 Answer true or false and justify your answer.

a. The optimum found through convex optimization is guaranteed
   to be the global optimum.
b. Cone programming problems are a special case of quadratic
programming problems.
c. It is sometimes possible to obtain distinct feasible regions in
linear optimization.
d. A quadratic problem is a problem with a quadratic objective
and quadratic constraints.
e. A quadratic problem is only convex if the Hessian of the
objective function is positive definite.
f. Solving a quadratic problem is easy because the solution can
be obtained analytically.
g. Least squares regression is a type of quadratic programming
problem.
h. Second-order cone programming problems feature a linear
objective and a second-order cone constraint.
i. Disciplined convex optimization builds convex problems by
using convex differentiable functions.
j. It is possible to transform some nonconvex problems into
convex ones by using a change of variables, adding slack
variables, or reformulating the objective function.
k. A geometric program is not convex but can be transformed
into an equivalent convex program.
l. Convex optimization algorithms work well as long as a good
starting point is provided.

11.2 Solve the following using a convex solver (not a general nonlinear
solver):

    minimize     𝑥1² + 3𝑥2²
    subject to   𝑥1 + 4𝑥2 ≥ 2
                 3𝑥1 + 2𝑥2 ≥ 5
                 𝑥1 ≥ 0, 𝑥2 ≥ 0 .

11.3 The following foods are available to you at your nearest grocer:

Food Cost Nutrient 1 Nutrient 2 Nutrient 3


A 7.68 0.16 1.41 2.40
B 9.41 0.47 0.58 3.95
C 6.74 0.87 0.56 1.78
D 3.95 0.62 1.59 4.50
E 3.13 0.29 0.42 2.65
F 6.63 0.46 1.84 0.16
G 5.86 0.28 1.23 4.50
H 0.52 0.25 1.61 4.70
I 2.69 0.28 1.11 3.11
J 1.09 0.26 1.88 1.74

Minimize the amount you spend while making sure you get at
least 5 units of nutrient 1, between 8 and 20 units of nutrient 2,
and between 5 and 30 units of nutrient 3. Also be sure not to buy
more than 4 units of any one food item, just for variety. Determine
the optimal amount of each item to purchase and the total cost.

11.4 Consider the aircraft wing design problem described in Ap-
     pendix D.1.6. Modify or approximate the model as needed to
     formulate it as a GP. Solve the new formulation using a GP solver.
     If you want to make it more challenging, do not read the hints
     that follow. All equations except the Gaussian efficiency curve
     are compatible with GP. However, you may need additional
     optimization variables and constraints. For example, you could
     add 𝐿 and 𝑣 to a set of variables and impose

         𝐿 = (1/2) 𝜌 𝑣² 𝑏 𝑐 𝐶𝐿

     as an equality constraint. This is equivalent to a GP-compatible
     monomial constraint

         𝜌 𝑣² 𝑏 𝑐 𝐶𝐿 / (2𝐿) = 1 .

     The efficiency curve can be approximated by a posynomial func-
     tion. For example, assuming that the optimal speed is 𝑣* ≈ 18 m/s,
     you may use

         4 + 16 (𝜂/𝜂max)¹⁰ = 𝑣 ,

     which is only valid if 𝜂 ∈ [0, 𝜂max] and 𝑣 ∈ [16, 20] m/s.
12 Optimization Under Uncertainty
Uncertainty is always present in engineering design. Manufacturing
processes create deviations from the specifications, operating conditions
vary from the ideal, and some parameters are inherently variable.
Optimization with deterministic inputs can lead to poorly performing
designs. Optimization under uncertainty (OUU) is the optimization of
systems in the presence of random parameters or design variables. The
objective is to produce robust and reliable designs. A design is robust
when the objective function is less sensitive to inherent variability. A
design is reliable when it is less prone to violating a constraint when
accounting for the variability.∗

This chapter discusses how uncertainty can be used in the objective
function to obtain robust designs and how it can be used in constraints
to get reliable designs. We introduce methods that propagate input un-
certainties through a computational model to produce output statistics.

∗ Although we maintain a distinction in this book, some of the literature
includes both of these concepts under the umbrella of “robust optimization”.
We assume familiarity with basic statistics concepts such as expected
value, variance, probability density functions (PDFs), cumulative distri-
bution functions (CDFs), and some common probability distributions.
A brief review of these topics is provided in Appendix A.9 if needed.

By the end of this chapter you should be able to:

1. Define robustness and reliability in the context of optimization
   under uncertainty.
2. Describe and use several strategies for both robust opti-
mization and reliability.
3. Understand the pros and cons for the following forward-
propagation methods: first-order perturbation methods,
direct quadrature, Monte Carlo methods, and polynomial
chaos.
4. Use forward-propagation methods in optimization.


12.1 Robust Design

We call a design robust if its performance is less sensitive to inher-


ent variability. In optimization, “performance” is directly associated
with the objective function. Satisfying the design constraints is a
requirement, but adding a margin to a constraint does not increase
performance in the standard optimization formulation. Thus, for a
robust design, the objective function is less sensitive to variations in
the random design variables and parameters. We can achieve this by
formulating an objective function that considers such variations and
reflects uncertainty.
A common example of robust design is considering the performance
of an engineering device at different operating conditions. If we had
deterministic operating conditions, it would make sense to maximize
the performance for those conditions. For example, suppose we knew
the exact wind speeds and wind directions a sailboat would experience
in a race. In that case, we could optimize the hull and sail design
to minimize the time around the course. Unfortunately, if variability
does exist, the sailboat designed for deterministic conditions will likely
perform poorly in off-design conditions. A better strategy considers the
uncertainty in the operating conditions and maximizes the expected
performance across a range of conditions. A robust design achieves
good performance even with uncertain wind speeds and directions.
There are many options for formulating robust design optimization
problems. The most common OUU objective is to minimize the expected
value of the objective function (min 𝜇 𝑓 (𝑥)). This yields robust designs
because the average performance under variability is considered.
Consider the function shown on the left in Fig. 12.1. If 𝑥 is determin-
istic, minimizing this function yields the global minimum on the right.
Now consider what happens when 𝑥 is uncertain. “Uncertain” means
that 𝑥 is no longer a deterministic input. Instead, it is a random variable
with some probability distribution. For example, 𝑥 = 0.5 represents a
random variable with a mean of 𝜇𝑥 = 0.5. We can compute the average
value of the objective 𝜇 𝑓 at each 𝑥 from the expected value of a function
(Eq. A.65):
    𝜇𝑓(𝑥) = ∫_{−∞}^{∞} 𝑓(𝑧) 𝑝(𝑧) d𝑧 ,   where 𝑝(𝑧) ∼ 𝒩(𝑥, 𝜎𝑥) ,    (12.1)

and 𝑧 is a dummy variable for integration. Repeating this integral at


each 𝑥 value gives the expected value as a function of 𝑥.
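
A minimal numerical sketch of Eq. 12.1 is shown below (not from the original
text). The function f is an arbitrary placeholder with one narrow and one broad
minimum, not the function plotted in Fig. 12.1, but it exhibits the same behavior:
the ranking of the two minima flips once the input variability is large enough.

```python
# Minimal sketch: the robust objective mu_f(x) of Eq. 12.1 by numerical integration.
import numpy as np

def f(x):
    # placeholder: narrow minimum near x = 1, broad minimum near x = 0
    return 1.0 - 0.8 * np.exp(-((x - 1.0) / 0.08) ** 2) - 0.5 * np.exp(-(x / 0.6) ** 2)

def mu_f(x, sigma_x, nz=2001):
    # expected value of f under p(z) ~ N(x, sigma_x), via a simple quadrature sum
    z = np.linspace(x - 6 * sigma_x, x + 6 * sigma_x, nz)
    p = np.exp(-0.5 * ((z - x) / sigma_x) ** 2) / (sigma_x * np.sqrt(2 * np.pi))
    return np.sum(f(z) * p) * (z[1] - z[0])

for sx in [0.01, 0.2]:
    print(sx, mu_f(1.0, sx), mu_f(0.0, sx))
# For sigma_x = 0.01 the narrow minimum has the lower expected value;
# for sigma_x = 0.2 the broad minimum does.
```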
Figure 12.1 shows the expected value of the objective for three
different standard deviations.

Fig. 12.1 The global minimum of the expected value 𝜇𝑓 can shift depending
on the standard deviation of 𝑥, 𝜎𝑥 (panels for 𝜎𝑥 = 0.01, 0.10, and 0.20). The
bottom row of figures shows the normal probability distributions at 𝑥 = 0.5.

The probability distribution of 𝑥 for a mean value of 𝑥 = 0.5 and three
different standard deviations is shown on the bottom row of the figure.
For a small variance (𝜎𝑥 = 0.01), the expected value function 𝜇𝑓(𝑥) is
indistinguishable from the deterministic function 𝑓(𝑥), and the global
minimum is the same for both functions. However, for 𝜎𝑥 = 0.2, the
minimum of the expected
value function is different from that of the deterministic function.
Therefore, the minimum on the right is not as robust as the one on
the left. The minimum one on the right is a narrow valley, so the
expected value increases rapidly with increased variance. The opposite
is true for the minimum on the left. Because it is in a broad valley, the
expected value is less sensitive to variability in 𝑥. Thus, a design whose
performance changes rapidly with respect to variability is not robust.
Of course, the mean is just one possible statistical output metric.
Variance, or standard deviation (𝜎𝑓), is another common metric. How-
ever, directly minimizing the variance is less common because although
low variability is often desirable, such an objective has no incentive to
improve mean performance and so usually performs poorly. These two
metrics represent a trade-off between risk (variance) and reward (mean).
The compromise between these two metrics can be quantified through
multiobjective optimization (see Chapter 9), which would result in a
Pareto front with the notional behavior illustrated in Fig. 12.2. Because
both multiobjective optimization and uncertainty quantification are
costly, the overall cost of producing such a Pareto front might be pro-
hibitive. Therefore, we might instead seek to minimize the expected
value while constraining the variance to a value that the designer
can tolerate. Another option is to minimize the mean plus weighted
standard deviations.

Fig. 12.2 When designing for robustness, there is an inherent trade-off between
risk (represented by the variance, 𝜎𝑓) and reward (represented by the expected
value, 𝜇𝑓).

Many other relevant statistical objectives do not involve statistical


moments like mean or variance. Examples include minimizing the
95th percentile of the distribution or employing a reliability metric,
Pr( 𝑓 (𝑥) > 𝑓crit ), that minimizes the probability that the objective exceeds
some critical value.

Example 12.1 Robust airfoil optimization

Consider an airfoil optimization, where the profile shape of a wing is
optimized to minimize the drag coefficient while constraining the lift coefficient
to be equal to a target value. Figure 12.3 shows how the drag coefficient of an
RAE 2822 airfoil varies with the Mach number (the airplane speed) in blue, as
evaluated by a Navier–Stokes flow solver.∗ This is a typical drag rise curve,
where increasing the Mach number leads to stronger shock waves and an
associated increase in wave drag.

Now let us optimize the airfoil shape so that we can fly faster without a
large increase in drag. Minimizing the drag of this airfoil at Mach 0.71 results in
the red drag curve shown in Fig. 12.3. The drag is much lower at Mach 0.71 (as
requested!), but any deviation from the target Mach number causes significant
drag penalties. In other words, the design is not robust.

One way to improve the design is to use multipoint optimization, where we
minimize a weighted sum of the drag coefficient evaluated at different Mach
numbers. In this case, we use Mach = 0.68, 0.71, 0.725. Compared with the
single-point design, the multipoint design has a higher drag at Mach 0.71 but a
lower drag at the other Mach numbers, as shown in Fig. 12.3. Thus, a trade-off
in peak performance was required to achieve enhanced robustness.

A multipoint optimization is a simplified example of OUU. Effectively, we
have treated the Mach number as a random parameter with a given probability
at three discrete values. We then minimized the expected value of the drag.
This simple change significantly increased the robustness of the design.

Fig. 12.3 Drag coefficient versus Mach number for the baseline, single-point,
and multipoint designs. Single-point optimization performs the best at the
target speed but poorly away from the condition. Multipoint optimization is
more robust to changes in speed.

∗ For more details on this type of problem and on the aerodynamic shape
optimization framework that produced these results, see Martins.127
127. Martins, Perspectives on aerodynamic design optimization, 2020.
Example 12.2 Robust wind farm layout optimization

Wind farm layout optimization is another example of OUU but has a
more involved probability distribution than the multipoint formulation.† The
positions of wind turbines on a wind farm have a substantial impact on overall
performance because their wakes interfere. The primary goal of wind farm
layout optimization is to position the wind turbines to reduce interference and
thus maximize power production. In this example, we optimized the position
of nine turbines subject to the constraints that the turbines must stay within a
specified boundary and must not be too close to any other turbine.

One of the primary challenges of wind farm layout optimization is that
the wind is uncertain and highly variable. To keep this example simple, we
assume that wind speed is constant, and only the wind direction is an uncertain
parameter. Figure 12.4 shows a PDF of the wind direction for an actual wind
farm, known as a wind rose, which is commonly visualized as shown in the
plot on the right. The predominant wind directions are from the west and the
south. Because of the variable nature of the wind, it would be challenging to
intuit the optimal layout.

† See other wind farm OUU problems with coupled farm and turbine
optimization,188 multiobjective trade-offs in mean and variance,189 and more
involved uncertainty quantification techniques discussed later in this chapter.190
188. Stanley and Ning, Coupled wind turbine design and layout optimization with non-homogeneous wind turbines, 2019.
189. Gagakuma et al., Reducing wind farm power variance from wind direction using wind farm layout optimization, 2021.
190. Padrón et al., Polynomial chaos to efficiently compute the annual energy production in wind farm layout optimization, 2019.

Fig. 12.4 Probability density function of wind direction (left) and corresponding
wind rose (right).

We solve this problem using two approaches. The first approach is to


solve the problem deterministically (i.e., ignore the variability). This is usually
done by using mean values for uncertain parameters, often assuming that the
variability is Gaussian or at least symmetric. The wind direction is periodic and
asymmetric, so we optimize using the most probable wind direction (261◦ ).
The second approach is to treat this as an OUU problem. Instead of
maximizing the power for one direction, we maximize the expected value of the
power for all directions. This is straightforward to compute from the definition
of expected value because this is a one-dimensional function. Section 12.3
explains other ways to perform forward propagation.
Figure 12.5 shows the power as a function of wind direction for both cases.
The deterministic approach results in higher power production when the wind
comes from the west (and 180◦ from that), but that power reduces considerably
for other directions. In contrast, the OUU result is less sensitive to changes in
wind direction. The expected value of power is 58.6 MW for the deterministic
case and 66.1 MW for the OUU case, an improvement of over 12 percent.‡

Fig. 12.5 Wind farm power as a function of wind direction for two optimization
approaches: deterministic optimization using the most probable direction and
OUU.

‡ Instead of using expected power directly, wind turbine designers use annual
energy production, which is the expected power multiplied by utilization time.

We can also analyze the trade-off in the optimal layouts. The left side of
Fig. 12.6 shows the optimal layout using the deterministic formulation, with
the wind coming from the predominant direction (the direction we optimized
for). The wakes are shown in blue, and the boundaries are depicted with a
dashed line. The optimization spaced the wind turbines out so that there is
minimal wake interference. However, the performance degrades significantly
when the wind changes direction. The right side of Fig. 12.6 shows the same
layout but with the wind coming from the second-most-probable direction. In
this case, many of the turbines are operating in the wake of another turbine
and produce much less power.

Fig. 12.6 Deterministic cases with the


primary wind direction (left) and the
secondary wind direction (right).

In contrast, the robust layout is shown in Fig. 12.7, with the predominant
wind direction on the left and the second-most-probable direction on the right.
In both cases, the wake effects are relatively minor. The turbines are not ideally
placed for the predominant direction, but trading the performance for that
one direction yields better overall performance when considering other wind
directions.

Fig. 12.7 OUU cases with the pri-


mary wind direction (left) and the
secondary wind direction (right).

12.2 Reliable Design

We call a design reliable when it is less prone to failure under variability.


In other words, the constraints have a lower probability of being violated
under variations in the random design variables and parameters. In
a robust design, we consider the effect of uncertainty on the objective
function. In reliable design, we consider that effect on the constraints.
A common example of reliability is structural safety. Consider
Ex. 3.9, where we formulated a mass minimization subject to stress
constraints. In such structural optimization problems, many of the
stress constraints are active at the optimum. Constraining the stress
to be equal to or below the yield stress value as if this value were
deterministic is probably not a good idea because variations in the
material properties or manufacturing could result in structural failure.
Instead, we might want to include this variability so that we can reduce
the probability of failure.
To generate a reliable design, we want the probability of satisfying
the constraints to exceed some preselected reliability level. Thus, we
change deterministic inequality constraints 𝑔(𝑥) ≤ 0 to ensure that the
probability of constraint satisfaction exceeds a specified reliability level
𝑟, that is,
Pr(𝑔(𝑥) ≤ 0) ≥ 𝑟 . (12.2)
For example, if we set 𝑟 𝑖 = 0.999, then constraint 𝑖 must be satisfied with
a probability of 99.9 percent. Thus, we can explicitly set the reliability
level that we wish to achieve, with associated trade-offs in the level of
performance for the objective function.

Example 12.3 Reliability with the Barnes function

Consider the Barnes problem shown on the left side of Fig. 12.8. The three
red lines are the three nonlinear constraints of the problem, and the red regions
highlight regions of infeasibility. With deterministic inputs, the optimal value
is on the constraint line. An uncertainty ellipse shown around the optimal
point highlights the fact that the solution is not reliable. Any variability in the
inputs can cause one or more constraints to be violated.
Conversely, the right side of Fig. 12.8 shows a reliable optimum, with the
same uncertainty ellipse. In this case, it is much more probable that the design
will satisfy all constraints under the input variations. However, as noted in
the introduction, increased reliability presents a performance trade-off, with a
corresponding increase in the objective function. The higher the reliability we
seek, the more we need to give up on performance.

Fig. 12.8 The deterministic optimum design is on the constraint line (left),
and the constraint might be violated if there is variability. The reliable design
optimum (right) satisfies the constraints despite the variability.

In some engineering disciplines, increasing reliability is handled


simply through safety factors. These safety factors are deterministic
but are usually derived through statistical means.

Example 12.4 Relating safety factors to reliability

If we were constraining the stress (𝜎) in a structure to be less than the


material’s yield stress (𝜎 𝑦 ), we would not want to use a constraint of the
following form:
𝜎(𝑥) ≤ 𝜎 𝑦 .
This would be dangerous because we know there is inherent variability in the
loads and uncertainty in the yield stress of the material. Instead, we often use
a simple safety factor and enforce the following constraint:

𝜎(𝑥) ≤ 𝜂𝜎 𝑦 ,

where 𝜂 is a total safety factor that accounts for safety factors from loads,
materials, and failure modes. Of course, not all applications have standards-
driven safety factors already determined. The statistical approach discussed in
this chapter is useful in these situations to obtain reliable designs.

12.3 Forward Propagation

In the previous sections, we have assumed that we know the statistics
(e.g., mean and standard deviation) of the outputs of interest (objectives
and constraints). However, we generally do not have that information.
Instead, we might only know the PDFs of the inputs.∗ Forward-propagation
methods propagate input uncertainties through a numerical model to
compute output statistics.

∗ Even characterizing input uncertainty might not be straightforward, but for
forward propagation, we assume this information is provided.
Uncertainty quantification is a large field unto itself, and we only
provide an introduction to it in this chapter. We introduce four well-
known nonintrusive methods for forward propagation: first-order
perturbation methods, direct quadrature, Monte Carlo methods, and
polynomial chaos.

12.3.1 First-Order Perturbation Method


Perturbation methods are based on a local Taylor series expansion
of the functional output. In the following, 𝑓 represents an output of
interest, and 𝑥 represents all the random variables (not necessarily all the
variables that 𝑓 depends on). A first-order Taylor series approximation
of 𝑓 about the mean of 𝑥 is given by
    𝑓(𝑥) ≈ 𝑓(𝜇𝑥) + Σ_{𝑖=1}^{𝑛} (∂𝑓/∂𝑥𝑖) (𝑥𝑖 − 𝜇𝑥𝑖) ,              (12.3)

where 𝑛 is the dimensionality of 𝑥. We can estimate the average value


of 𝑓 by taking the expected value of both sides and using the linearity
of expectation as follows:
    𝜇𝑓 = E(𝑓(𝑥))
       ≈ E(𝑓(𝜇𝑥)) + Σ𝑖 E[ (∂𝑓/∂𝑥𝑖) (𝑥𝑖 − 𝜇𝑥𝑖) ]
       = 𝑓(𝜇𝑥) + Σ𝑖 (∂𝑓/∂𝑥𝑖) ( E(𝑥𝑖) − 𝜇𝑥𝑖 )                      (12.4)
       = 𝑓(𝜇𝑥) + Σ𝑖 (∂𝑓/∂𝑥𝑖) ( 𝜇𝑥𝑖 − 𝜇𝑥𝑖 ) .

The last first-order term is zero, so we can write

    𝜇𝑓 = 𝑓(𝜇𝑥) .                                                 (12.5)

That is, when considering only first-order terms, the mean of the function
is the function evaluated at the mean of the input.
The variance of 𝑓 is given by

    𝜎𝑓² = E(𝑓(𝑥)²) − (E(𝑓(𝑥)))²
        ≈ E[ 𝑓(𝜇𝑥)² + 2 𝑓(𝜇𝑥) Σ𝑖 (∂𝑓/∂𝑥𝑖)(𝑥𝑖 − 𝜇𝑥𝑖)
             + Σ𝑖 Σ𝑗 (∂𝑓/∂𝑥𝑖)(∂𝑓/∂𝑥𝑗)(𝑥𝑖 − 𝜇𝑥𝑖)(𝑥𝑗 − 𝜇𝑥𝑗) ] − 𝑓(𝜇𝑥)²      (12.6)
        = Σ𝑖 Σ𝑗 (∂𝑓/∂𝑥𝑖)(∂𝑓/∂𝑥𝑗) E[ (𝑥𝑖 − 𝜇𝑥𝑖)(𝑥𝑗 − 𝜇𝑥𝑗) ] .

The expectation term in this equation is the covariance matrix Σ(𝑥𝑖, 𝑥𝑗),
so we can write this in matrix notation as

    𝜎𝑓² = (∇𝑥 𝑓)ᵀ Σ (∇𝑥 𝑓) .                                      (12.7)



We often assume that each random input variable is mutually indepen-
dent. This is true for the design variables for a well-posed optimization
problem, but the parameters may or may not be independent.

When the parameters are independent (this assumption is often
made even if not strictly true), the covariance matrix is diagonal, and
the variance estimation simplifies to

    𝜎𝑓² = Σ_{𝑖=1}^{𝑛} ( (∂𝑓/∂𝑥𝑖) 𝜎𝑥𝑖 )² .                         (12.8)

These equations are frequently used to propagate errors from experi-
mental measurements. Major limitations of this approach are that (1) it
relies on a linearization (first-order Taylor series), which has limited
accuracy;† (2) it assumes that all uncertain parameters are uncorrelated,
which is true for design variables but is not necessarily true for parame-
ters (this assumption can be relaxed by providing the covariances); and
(3) it implicitly assumes symmetry in the input distributions because
we neglect all higher-order moments (e.g., skewness, kurtosis) and is,
therefore, less applicable for problems that are highly asymmetric, such
as the wind farm example (Ex. 12.2).

We have not assumed that the input or output distributions are
normal probability distributions (i.e., Gaussian). However, we can only
estimate the mean and variance with a first-order series and not the
higher-order moments.

† Higher-order Taylor series can also be used,191 but they are less common
because of their increased complexity.
191. Cacuci, Sensitivity & Uncertainty Analysis, 2003.
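
A minimal sketch of this propagation (Eqs. 12.5 and 12.8) is shown below (not
from the original text); the model f, the input means, and the standard deviations
are illustrative placeholders, and the gradient is approximated by forward
differences for simplicity.

```python
# Minimal sketch: first-order mean and variance propagation for independent inputs.
import numpy as np

def f(x):
    return x[0] ** 2 + 2.0 * x[1] ** 2 + x[0] * x[1]   # placeholder model

mu_x = np.array([1.0, 2.0])        # input means
sigma_x = np.array([0.05, 0.10])   # input standard deviations

mu_f = f(mu_x)                     # Eq. 12.5: mean of f is f at the mean of x

# forward-difference gradient (any method from Chapter 6 could be substituted)
h = 1e-6
grad = np.array([(f(mu_x + h * np.eye(2)[i]) - mu_f) / h for i in range(2)])

sigma_f = np.sqrt(np.sum((grad * sigma_x) ** 2))   # Eq. 12.8
print(mu_f, sigma_f)
```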
The equation for the variance (Eq. 12.8) is straightforward, but
the derivative terms can be challenging when using gradient-based
optimization. The first-order derivatives in Eq. 12.7 can be computed
using any of the methods from Chapter 6. If they are computed
efficiently using a method appropriate to the problem, the forward
propagation is efficient as well. However, second-order derivatives
are required to use gradient-based optimization (assuming some of
the design variables are also random variables). That is because the
uncertain objectives and constraints now contain derivatives, and we
need derivatives of those functions. Because computing accurate second
derivatives is costly, these methods are used less often than the other
techniques discussed in this chapter.
We can use a simpler approach if we ignore variability in the
objective and focus only on the variability in the constraints (reliability-
based optimization). In this case, we can approximate the effect of
the uncertainty by pulling it outside of the optimization iterations.
We demonstrate one such approach, where we make the additional
assumption that each constraint is normally distributed.192

192. Parkinson et al., A general approach for robust optimal design, 1993.

If 𝑔(𝑥) is normally distributed, we can rewrite the probabilistic


constraint (Eq. 12.2) as
𝑔(𝑥) + 𝑧𝜎 𝑔 ≤ 0 , (12.9)
where 𝑧 is chosen for the desired reliability level 𝑟. For example, 𝑧 = 2
implies a reliability level of 97.72 percent (one-sided tail of the normal
distribution). In many cases, an output distribution is reasonably
approximated as normal, but this method tends to be less effective for
cases with nonnormal output.
With multiple active constraints, we must be careful to appropriately
choose the reliability level for each constraint such that the overall
reliability is in the desired range. We often simplify the problem by
assuming that the constraints are uncorrelated. Thus, the total reliability
is the product of the reliabilities of each constraint.
This simplified approach has the following steps:

1. Compute the deterministic optimum.


2. Estimate the standard deviation of each constraint 𝜎 𝑔 using
Eq. 12.8.
3. Adjust the constraints to 𝑔(𝑥)+𝑧𝜎 𝑔 ≤ 0 for some desired reliability
level and re-optimize.
4. Repeat steps 1–3 as needed.

This method is easy to use, and although approximate, the magnitude


of error is usually appropriate for the conceptual design phase. If the
errors are unacceptable, the standard deviation can be computed inside
the optimization. The major limitation of this method is that it only
applies to reliability-based optimization.

Example 12.5 Iterative reliability-based optimization

Consider the following problem:

    minimize     𝑓 = 𝑥1² + 2𝑥2² + 3𝑥3²
    by varying   𝑥1, 𝑥2, 𝑥3
    subject to   𝑔1 = −2𝑥1 − 𝑥2 − 2𝑥3 + 6 ≤ 0
                 𝑔2 = −5𝑥1 + 𝑥2 + 3𝑥3 + 10 ≤ 0 .

All the design variables are random variables with standard deviations 𝜎𝑥1 =
𝜎𝑥2 = 0.033, and 𝜎𝑥3 = 0.0167. We seek a reliable optimum, where each
constraint has a target reliability of 99.865 percent.
First, we compute the deterministic optimum, which is

𝑥 ∗ = [2.3515, 0.375, 0.460], 𝑓 ∗ = 6.448 .



We compute the standard deviation of each constraint, using Eq. 12.8, about
the deterministic optimum, yielding 𝜎 𝑔1 = 0.081, 𝜎 𝑔2 = 0.176. Using an
inverse CDF function (discussed in Section 10.2.1) shows that a CDF of 0.99865
corresponds to a 𝑧-score of 3. We then re-optimize with the new reliability
constraints to obtain the solution:

𝑥 ∗ = [2.462, 0.3836, 0.4673], 𝑓 ∗ = 7.013 .

In this case, we sacrificed approximately 9 percent in the objective value to
obtain a more reliable design. Because there are two constraints, and each had
a target reliability of 99.865 percent, the estimated overall reliability (assuming
independence of constraints) is 99.865 percent × 99.865 percent = 99.73 percent.
To check these results, we use Monte Carlo simulations (explained in
Section 12.3.3) with 100,000 samples to produce the output histograms shown
in Fig. 12.9. The deterministic optimum fails often (‖𝑔(𝑥)‖∞ > 0), so its
reliability is a surprisingly poor 34.6 percent. The reliable optimum shifts the
distribution to the left, yielding a reliability of 99.75 percent, which is close to
our design target.

Fig. 12.9 Histogram of maximum constraint violation ‖𝑔(𝑥∗)‖∞ across 100,000
samples for both the deterministic and reliability-based optimization.

12.3.2 Direct Quadrature
Another approach to estimating statistical outputs of interest is to
apply numerical integration (also known as quadrature) directly to their
definitions. For example:

    𝜇𝑓 = ∫ 𝑓(𝑥) 𝑝(𝑥) d𝑥    (12.10)

    𝜎𝑓² = ∫ 𝑓(𝑥)² 𝑝(𝑥) d𝑥 − 𝜇𝑓² .    (12.11)

Discretizing 𝑥 using 𝑛 points, we get the summation

    ∫ 𝑓(𝑥) d𝑥 ≈ Σ_{𝑖=1}^{𝑛} 𝑓(𝑥𝑖) 𝑤𝑖 .    (12.12)

The quadrature strategy determines the evaluation nodes (𝑥𝑖) and the
corresponding weights (𝑤𝑖).
The most common quadratures originate from composite Newton–
Cotes formulas: the composite midpoint, trapezoidal, and Simpson’s
rules. These methods use equally spaced nodes, a specification that
can be relaxed but still results in a predetermined set of fixed nodes. To
reach a specified level of accuracy, it is often desirable to use nesting. In
this strategy, a refined mesh (smaller spacing between nodes) reuses
nodes from the coarser spacing. For example, a simple nesting strategy

is to add a new node between all existing nodes. Thus, the accuracy of
the integral can be improved up to a specified tolerance while reusing
previous function evaluations.
Although straightforward to apply, the Newton–Cotes formulas are
usually much less efficient than Gaussian quadrature, at least for smooth,
nonperiodic functions. Efficiency is highly desirable because the output
functions must be called many times for forward propagation, as well
as throughout the optimization. The Newton–Cotes formulas are based
on fitting polynomials: constant (midpoint), linear (trapezoidal), and
quadratic (Simpson’s). The weights are adjusted between the different
methods, but the nodes are fixed. Gaussian quadrature includes the
nodes as degrees of freedom selected by the quadrature strategy. The
method approximates the integrand as a polynomial and then efficiently
evaluates the integral for the polynomial exactly. Because some of the
concepts from Gaussian quadrature are used later in this chapter, we
review them here.
An 𝑛-point Gaussian quadrature has 2𝑛 degrees of freedom (𝑛 node
positions and 𝑛 corresponding weights), so it can be used to exactly
integrate any polynomial up to order 2𝑛 − 1 if the weights and nodes are
appropriately chosen. For example, a 2-point Gaussian quadrature can
exactly integrate all polynomials up to order 3. To illustrate, consider
an integral over the bounds −1 to 1 (we will later see that these bounds
can be used as a general representation of any finite bounds through a
change of variables):
    ∫_{−1}^{1} 𝑓(𝑥) d𝑥 ≈ 𝑤1 𝑓(𝑥1) + 𝑤2 𝑓(𝑥2) .    (12.13)

We want this model to be exact for all polynomials up to order 3. If the
actual function were a constant (𝑓(𝑥) = 𝑎), then the integral equation
would result in the following:

    2𝑎 = 𝑎(𝑤1 + 𝑤2) .    (12.14)

Repeating this process for polynomials of order 1, 2, and 3 yields four
equations and four unknowns:

    2   = 𝑤1 + 𝑤2
    0   = 𝑤1𝑥1 + 𝑤2𝑥2
    2/3 = 𝑤1𝑥1² + 𝑤2𝑥2²    (12.15)
    0   = 𝑤1𝑥1³ + 𝑤2𝑥2³ .

Solving these equations yields 𝑤1 = 𝑤2 = 1, 𝑥1 = −𝑥2 = 1/√3. Thus,
we have the weights and node positions that integrate a cubic (or
lower-order) polynomial exactly using just two function evaluations,
that is,

    ∫_{−1}^{1} 𝑓(𝑥) d𝑥 = 𝑓(−1/√3) + 𝑓(1/√3) .    (12.16)
More generally, this means that if we can reasonably approximate a
general function with a cubic polynomial over the interval, we can
provide a good estimate for its integral efficiently.
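
As a quick check of this result, the following sketch evaluates the two-point rule of Eq. 12.16 on an arbitrary cubic (chosen only for illustration):

    import numpy as np

    # Two-point Gauss-Legendre rule on [-1, 1]: nodes +-1/sqrt(3), weights 1.
    nodes = np.array([-1.0, 1.0]) / np.sqrt(3.0)
    f = lambda x: 4*x**3 - 3*x**2 + 2*x - 1   # an arbitrary cubic
    estimate = np.sum(f(nodes))               # w1 = w2 = 1 (Eq. 12.16)
    exact = -4.0                              # analytic integral over [-1, 1]
    print(estimate, exact)                    # both equal -4, to roundoff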
We would like to extend this procedure to any number of points
without the cumbersome approach just applied. The derivation is
lengthy (particularly for the weights), so it is not repeated here, other
than to explain some of the requirements and the results. The derivation
of Gaussian quadrature requires orthogonal polynomials. Two vectors
are orthogonal if their dot product is zero. The definition is similar
for functions, but because functions have an infinite dimension, we
require an integral instead of a summation. Thus, two functions 𝑓 and
𝑔 are orthogonal over an interval 𝑎 to 𝑏 if their inner product is zero.
Different definitions can be used for the inner product. The simplest
definition is as follows:
    ∫_{𝑎}^{𝑏} 𝑓(𝑥) 𝑔(𝑥) d𝑥 = 0 .    (12.17)

For the Gaussian quadrature derivation, we need a set of polynomials
that are not only orthogonal to each other but also to any polynomial of
lower order. For the previous inner product, it turns out that Legendre
polynomials (𝐿𝑛 is a Legendre polynomial of order 𝑛) possess the
desired properties:
    ∫_{−1}^{1} 𝑥^𝑘 𝐿𝑛(𝑥) d𝑥 = 0,  for any 𝑘 < 𝑛 .    (12.18)

Legendre polynomials can be generated by the recurrence relationship,

    𝐿_{𝑛+1}(𝑥) = (2𝑛 + 1)/(𝑛 + 1) 𝑥 𝐿𝑛(𝑥) − 𝑛/(𝑛 + 1) 𝐿_{𝑛−1}(𝑥) ,    (12.19)

where 𝐿0 = 1, and 𝐿1 = 𝑥. Figure 12.10 shows a plot of the first few
Legendre polynomials.

Fig. 12.10 The first few Legendre polynomials.

From the Gaussian quadrature derivation, we find that we can
integrate any polynomial of order 2𝑛 − 1 exactly by choosing the node
positions 𝑥𝑖 as the roots of the Legendre polynomial 𝐿𝑛, with the
corresponding weights given by

    𝑤𝑖 = 2 / [(1 − 𝑥𝑖²) (𝐿𝑛′(𝑥𝑖))²] .    (12.20)

Legendre polynomials are defined over the interval [−1, 1], but we
can reformulate them for an arbitrary interval [𝑎, 𝑏] through a change
of variables:

    𝑥 = (𝑏 − 𝑎)/2 · 𝑧 + (𝑏 + 𝑎)/2 ,    (12.21)

where 𝑧 ∈ [−1, 1].
Using the change of variables, we can write

    ∫_{𝑎}^{𝑏} 𝑓(𝑥) d𝑥 = ∫_{−1}^{1} (𝑏 − 𝑎)/2 · 𝑓((𝑏 − 𝑎)/2 · 𝑧 + (𝑏 + 𝑎)/2) d𝑧 .    (12.22)

Now, applying a quadrature rule, we can approximate the integral as

    ∫_{𝑎}^{𝑏} 𝑓(𝑥) d𝑥 ≈ (𝑏 − 𝑎)/2 Σ_{𝑖=1}^{𝑚} 𝑤𝑖 𝑓((𝑏 − 𝑎)/2 · 𝑧𝑖 + (𝑏 + 𝑎)/2) ,    (12.23)

where the node locations and respective weights come from the Legen-
dre polynomials.
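
A short sketch of Eq. 12.23, assuming NumPy's Gauss–Legendre nodes and weights (numpy.polynomial.legendre.leggauss) are available; the interval and integrand are arbitrary illustrations:

    import numpy as np

    def gauss_legendre_integral(f, a, b, m):
        """Approximate the integral of f over [a, b] with an m-point
        Gauss-Legendre rule mapped from [-1, 1] (Eqs. 12.21-12.23)."""
        z, w = np.polynomial.legendre.leggauss(m)   # nodes and weights on [-1, 1]
        x = 0.5*(b - a)*z + 0.5*(b + a)             # change of variables (Eq. 12.21)
        return 0.5*(b - a) * np.sum(w * f(x))

    # Example: integral of exp(x) over [0, 2]; the exact value is e^2 - 1.
    print(gauss_legendre_integral(np.exp, 0.0, 2.0, 5), np.exp(2) - 1)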
Recall that what we are after in this section is not just any generic
integral but, rather, metrics such as the expected value,

    𝜇𝑓 = ∫ 𝑓(𝑥) 𝑝(𝑥) d𝑥 .    (12.24)

As compared to our original integral (Eq. 12.12), we have an additional
function 𝑝(𝑥), referred to as a weight function. Thus, we extend the
definition of orthogonal polynomials (Eq. 12.17) to orthogonality with
respect to the weight 𝑝(𝑥), also known as a weighted inner product:

    ⟨𝑓, 𝑔⟩ = ∫_{𝑎}^{𝑏} 𝑓(𝑥) 𝑔(𝑥) 𝑝(𝑥) d𝑥 = 0 .    (12.25)

For our purposes, the weight function is 𝑝(𝑥), or it is related to it
through a change of variables.
Orthogonal polynomials for various weight functions are listed in
Table 12.1. The weight function in the table does not always correspond
exactly to the typically used PDF (𝑝(𝑥)), so a change of variables (like
Eq. 12.22) might be needed. The formula described previously is known
as Gauss–Legendre quadrature, whereas the variants listed in Table 12.1
are called Gauss–Hermite, and so on. Formulas and tables with node
locations and corresponding weight values exist for most standard
probability distributions. For any given weight function, we can
generate orthogonal polynomials,193 and we can generate orthogonal
polynomials for general distributions (e.g., ones that were empirically
derived).
193. Golub and Welsch, Calculation of Gauss quadrature rules, 1969.

Table 12.1 Orthogonal polynomials that correspond to some common probability distributions.

Prob. dist.    Weight function         Polynomial             Support range
Uniform        1                       Legendre               [−1, 1]
Normal         𝑒^{−𝑥²}                 Hermite                (−∞, ∞)
Exponential    𝑒^{−𝑥}                  Laguerre               [0, ∞)
Beta           (1 − 𝑥)^𝛼 (1 + 𝑥)^𝛽     Jacobi                 (−1, 1)
Gamma          𝑥^𝛼 𝑒^{−𝑥}              Generalized Laguerre   [0, ∞)

We now provide more details on Gauss–Hermite quadrature because
normal distributions are common. The Hermite polynomials (𝐻𝑛)
follow the recurrence relationship,

    𝐻_{𝑛+1}(𝑥) = 𝑥 𝐻𝑛(𝑥) − 𝑛 𝐻_{𝑛−1}(𝑥) ,    (12.26)

where 𝐻0(𝑥) = 1, and 𝐻1(𝑥) = 𝑥. The first few polynomials are plotted
in Fig. 12.11. For Gauss–Hermite quadrature, the nodes are positioned
at the roots of 𝐻𝑛(𝑥), and their weights are

    𝑤𝑖 = √𝜋 𝑛! / [𝑛 𝐻_{𝑛−1}(√2 𝑥𝑖)]² .    (12.27)

Fig. 12.11 The first few Hermite polynomials.

A coordinate transformation is needed because the standard normal
distribution differs slightly from the weight function in Table 12.1.
For example, if we are seeking an expected value, with 𝑥 normally
distributed, then the integral is given by

    𝜇𝑓 = ∫_{−∞}^{∞} 𝑓(𝑥) (1/(𝜎√(2𝜋))) exp(−(1/2)((𝑥 − 𝜇)/𝜎)²) d𝑥 .    (12.28)
We use the change of variables,

    𝑧 = (𝑥 − 𝜇)/(√2 𝜎) .    (12.29)

Then, the resulting integral becomes

    𝜇𝑓 = (1/√𝜋) ∫_{−∞}^{∞} 𝑓(𝜇 + √2 𝜎𝑧) exp(−𝑧²) d𝑧 .    (12.30)

This is now in the appropriate form, so the quadrature rule (using the
Hermite nodes and weights) is

    𝜇𝑓 = (1/√𝜋) Σ_{𝑖=1}^{𝑛} 𝑤𝑖 𝑓(𝜇 + √2 𝜎𝑧𝑖) .    (12.31)

Example 12.6 Gauss–Hermite quadrature

Suppose we want to compute the expected value 𝜇𝑓 for the one-dimensional
function 𝑓(𝑥) = cos(𝑥²) at 𝑥 = 2, assuming that 𝑥 is normally distributed as
𝑥 ∼ 𝒩(2, 0.2).
Let us use Gauss–Hermite quadrature with an increasing number of
nodes. We plot the absolute value of the error, |𝜀|, relative to the exact result
(𝜇𝑓 = −0.466842330417276) versus the number of quadrature points in Fig. 12.12.
The Gauss–Hermite quadrature converges quickly; with only six points, we
reduce the error to around 10⁻⁶. Trapezoidal integration, by comparison,
requires over 35 function evaluations for a similar error.
In this problem, we could have taken advantage of symmetry, but we are
only interested in the trend (for a smooth function, trapezoidal integration gen-
erally converges at least quadratically, whereas Gaussian quadrature converges
exponentially).
The first-order method of the previous section predicts 𝜇𝑓 = −0.6536, which
is not an acceptable approximation because of the nonlinearity of 𝑓.

Fig. 12.12 Error in the integral as a function of the number of nodes.
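
The example above can be checked with a few lines of code, assuming NumPy's Gauss–Hermite nodes and weights (numpy.polynomial.hermite.hermgauss) and the transformation in Eq. 12.31:

    import numpy as np

    def gauss_hermite_mean(f, mu, sigma, n):
        """Expected value of f(x) for x ~ N(mu, sigma^2) using an n-point
        Gauss-Hermite rule and the change of variables in Eq. 12.31."""
        z, w = np.polynomial.hermite.hermgauss(n)   # physicists' nodes/weights
        return np.sum(w * f(mu + np.sqrt(2)*sigma*z)) / np.sqrt(np.pi)

    f = lambda x: np.cos(x**2)
    for n in (2, 4, 6, 8):
        approx = gauss_hermite_mean(f, 2.0, 0.2, n)
        print(n, approx - (-0.466842330417276))     # error versus the exact value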

Gaussian quadrature does not naturally lead to nesting, which, as
previously mentioned, can increase the accuracy by adding points to a
given quadrature. However, methods such as Gauss–Kronrod quadra-
ture adapt Gaussian quadrature to utilize nesting. Although Gaussian
quadrature is often used to compute one-dimensional integrals effi-
ciently, it is not always the best method. For non-smooth functions,
trapezoidal integration is usually preferable because polynomials are
ill-suited for capturing discontinuities. Additionally, for periodic func-
tions such as the one shown in Fig. 12.4, the trapezoidal rule is better
than Gaussian quadrature, exhibiting exponential convergence.194,195
This is most easily seen by using a Fourier series expansion.196
Clenshaw–Curtis quadrature applies this idea to a general function
by employing a change of variables (𝑥 = cos 𝜃) to create a periodic
function that can then be efficiently integrated with the trapezoidal
rule. Clenshaw–Curtis quadrature also has the advantage that nesting
is straightforward and thus desirable for higher-dimensional functions,
as discussed next.
194. Wilhelmsen, Optimal quadrature for periodic analytic functions, 1978.
195. Trefethen and Weideman, The exponentially convergent trapezoidal rule, 2014.
196. Johnson, Notes on the convergence of trapezoidal-rule quadrature, 2010.
The direct quadrature methods discussed so far focused on inte-
gration in one dimension, but most problems have more than one
random variable. Extending numerical integration to multiple dimen-
sions (also known as cubature) is much more challenging. The most
obvious extension for multidimensional quadrature is a full grid tensor
product. This type of grid is created by discretizing each dimension
and then evaluating at every combination of nodes. Mathematically,
the quadrature formula can be written as

    ∫ · · · ∫ 𝑓(𝑥) d𝑥1 d𝑥2 . . . d𝑥𝑛 ≈ Σ_𝑖 Σ_𝑗 . . . Σ_𝑛 𝑓(𝑥𝑖, 𝑥𝑗, . . . , 𝑥𝑛) 𝑤𝑖 𝑤𝑗 . . . 𝑤𝑛 .    (12.32)

Although conceptually straightforward, this approach is subject to the
curse of dimensionality.‡ The number of points we need to evaluate grows
exponentially with the number of input dimensions.
‡ This is the same issue as with the full factorial sampling used to construct surrogate models in Section 10.2.
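
For example, a minimal sketch of a full tensor-product rule for the mean of a function of two independent normal inputs is shown below. It reuses the one-dimensional Gauss–Hermite rule introduced earlier; the test function here is the one used later in Ex. 12.10, so the result can be compared with that example:

    import numpy as np

    def tensor_gauss_hermite_mean(f, mu, sigma, n):
        """Mean of f(x1, x2) for two independent normal inputs using a full
        n-by-n tensor product of Gauss-Hermite nodes (Eq. 12.32)."""
        z, w = np.polynomial.hermite.hermgauss(n)
        Z1, Z2 = np.meshgrid(z, z, indexing="ij")    # every combination of nodes
        W = np.outer(w, w)                           # products of the 1-D weights
        X1 = mu[0] + np.sqrt(2)*sigma[0]*Z1          # map nodes to each input
        X2 = mu[1] + np.sqrt(2)*sigma[1]*Z2
        return np.sum(W * f(X1, X2)) / np.pi         # (1/sqrt(pi))^2 normalization

    f = lambda x1, x2: 3 + np.cos(3*x1) + np.exp(-2*x2)   # function of Ex. 12.10
    print(tensor_gauss_hermite_mean(f, [1.0, 1.0], [0.06, 0.2], 5))   # about 2.1725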
One approach to dealing with exponential growth is to use a sparse
grid method.197 The basic idea is to neglect higher-order cross terms.
For example, assume that we have a two-dimensional problem and that
both variables used a fifth-degree polynomial in the quadrature strategy.
The cross terms would include terms up to the 10th order. Although we
can integrate these high-order polynomials exactly, their contributions
become negligible beyond a specific order. We specify a maximum
degree that we want to include and remove all higher-order terms
from the evaluation. This method significantly reduces the number of
evaluation nodes, with minimal loss in accuracy.
197. Smolyak, Quadrature and interpolation formulas for tensor products of certain classes of functions, 1963.

Example 12.7 Sparse grid methods for quadrature

Figure 12.13 compares a two-dimensional full tensor grid using the Clenshaw–
Curtis exponential rule (left) with a level 5 sparse grid using the same quadrature
strategy (right).

Fig. 12.13 Comparison between a two-dimensional full tensor grid (left) and a
level 5 sparse grid using the Clenshaw–Curtis exponential rule (right).

For a problem with dimension 𝑑 and 𝑛 sample points in each
dimension, the entire tensor grid has a computational complexity
of 𝒪(𝑛^𝑑). In contrast, the sparse grid method has a complexity of
𝒪(𝑛 (log 𝑛)^{𝑑−1}) with comparable accuracy. This scaling alleviates the
curse of dimensionality to some extent. However, the number of
evaluation points is still strongly dependent on problem dimensionality,
making it intractable in high dimensions.

12.3.3 Monte Carlo Simulation


Monte Carlo simulation is a sampling-based procedure that computes
statistics and output distributions. Sampling methods approximate the
integrals mentioned in the previous section by using the law of large
numbers. The concept is that output probability distributions can be
approximated by running the simulation many times with randomly
sampled inputs from the corresponding probability distributions. There
are three steps:

1. Random sampling. Sample 𝑛 points 𝑥𝑖 from the input probability
distributions using a random number generator.
2. Numerical experimentation. Evaluate the outputs at these points,
𝑓𝑖 = 𝑓 (𝑥 𝑖 ).
3. Statistical analysis. Compute statistics on the discrete output
distribution 𝑓𝑖 .

For example, the discrete form of the mean is

    𝜇𝑓 = (1/𝑛) Σ_{𝑖=1}^{𝑛} 𝑓𝑖 ,    (12.33)

and the unbiased estimate of the variance is computed as

    𝜎𝑓² = 1/(𝑛 − 1) (Σ_{𝑖=1}^{𝑛} 𝑓𝑖² − 𝑛𝜇𝑓²) .    (12.34)

We can also estimate Pr(𝑔(𝑥) ≤ 0) by counting how many times the
constraint was satisfied and dividing by 𝑛. If we evaluate enough
samples, our output statistics converge to the actual values by the law of
large numbers. Therein also lies this method’s disadvantage: it requires
a large number of samples.
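
A minimal sketch of these three steps for normally distributed inputs follows (the sample count is illustrative; the objective and constraint are those used later in Ex. 12.9, so the statistics can be compared with that example):

    import numpy as np

    def monte_carlo(f, g, mu_x, sigma_x, n, seed=0):
        """Monte Carlo forward propagation: sample, evaluate, compute statistics."""
        rng = np.random.default_rng(seed)
        x = rng.normal(mu_x, sigma_x, size=(n, len(mu_x)))  # step 1: random sampling
        fi = np.array([f(xi) for xi in x])                  # step 2: evaluate outputs
        mu_f = np.mean(fi)                                  # Eq. 12.33
        sigma_f = np.std(fi, ddof=1)                        # the 1/(n-1) form of Eq. 12.34
        reliability = np.mean([g(xi) <= 0 for xi in x])     # estimate of Pr(g <= 0)
        return mu_f, sigma_f, reliability

    f = lambda x: x[0]**2 + 2*x[1]**2 + 3*x[2]**2
    g = lambda x: x[0] + x[1] + x[2] - 3.5
    print(monte_carlo(f, g, [1.0, 1.0, 1.0], [0.0, 0.06, 0.2], 10_000))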
Monte Carlo simulation has three main advantages. First, the
convergence rate is independent of the number of inputs. Whether we
have 3 or 300 random input variables, the convergence rate is similar
because we randomize all input variables for each sample. This is
an advantage over direct quadrature for high-dimensional problems
because, unlike quadrature, Monte Carlo does not suffer from the curse
of dimensionality. Second, the algorithm is easy to parallelize because
all of the function evaluations are independent. Third, in addition to

statistics like the mean and variance, Monte Carlo generates the output
probability distributions. This is a unique advantage compared with
first-order perturbation and direct quadrature, which provide summary
statistics but not distributions.

Example 12.8 Monte Carlo applied to a one-dimensional function

Consider the one-dimensional function from Fig. 12.1:

    𝑓(𝑥) = exp(𝑥/2) + 0.2𝑥⁶ − 0.2𝑥⁵ − 0.75𝑥⁴ + 𝑥² .

We compute the expected value function at each 𝑥 location using Monte Carlo
simulation, for 𝜎 = 0.2. Using different numbers of samples, we obtain the
expected value functions plotted in Fig. 12.14. For 100 samples, the noise in
the expected value is visible. The noise decreases as the number of samples
increases. For 100,000 samples, the noise is barely noticeable in the plot.

Fig. 12.14 Monte Carlo requires a large number of samples for an accurate
prediction of the expected value (panels for 𝑛 = 10², 10³, and 10⁵).
The major disadvantage of the Monte Carlo method is that even
though the convergence rate does not depend on the number of inputs,

the convergence rate is slow—𝒪(1/√𝑛). This means that every addi-
tional digit of accuracy requires about 100 times more samples. It is
also hard to know which value of 𝑛 to use a priori. Usually, we need
to determine an appropriate value for 𝑛 through convergence testing
(trying larger 𝑛 values until the statistics converge).
One approach to achieving converged statistics with fewer iterations
is to use Latin hypercube sampling (LHS) or low-discrepancy sequences,
as discussed in Section 10.2. Both methods allow us to approximate the
input distributions with fewer samples. Low-discrepancy sequences
are particularly well suited for this application because convergence
testing is iterative. When combined with low-discrepancy sequences,
the method is called quasi-Monte Carlo, and the scaling improves to

𝒪(1/𝑛). Thus, each additional digit of accuracy requires 10 times as
many samples. Even with better sampling methods, many simulations
are usually required, which can be prohibitive if used as part of an
OUU problem.
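
As a sketch of how such sampling might be set up, assuming SciPy's stats.qmc module (SciPy 1.7 or later) is available, a scrambled Halton sequence can be mapped to the input distributions through the inverse CDF:

    import numpy as np
    from scipy.stats import norm, qmc

    def quasi_monte_carlo_mean(f, mu_x, sigma_x, n):
        """Quasi-Monte Carlo estimate of an output mean: scrambled Halton points
        in [0, 1)^d mapped to normal inputs via the inverse CDF (a sketch)."""
        d = len(mu_x)
        u = qmc.Halton(d=d, scramble=True, seed=0).random(n)   # low-discrepancy points
        x = norm.ppf(u, loc=mu_x, scale=sigma_x)               # map to input distributions
        return np.mean([f(xi) for xi in x])

    f = lambda x: np.sum(x**2)
    print(quasi_monte_carlo_mean(f, np.array([1.0, 1.0]), np.array([0.06, 0.2]), 1000))
    # The analytic mean is mu1^2 + sigma1^2 + mu2^2 + sigma2^2 = 2.0436.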

Example 12.9 Forward propagation with Monte Carlo

Consider a problem with the following objective and constraint:

    𝑓(𝑥) = 𝑥1² + 2𝑥2² + 3𝑥3²
    𝑔(𝑥) = 𝑥1 + 𝑥2 + 𝑥3 − 3.5 ≤ 0 .

Suppose that the current optimization iteration is 𝑥 = [1, 1, 1]. We assume
that the first variable is deterministic, whereas the latter two variables have
uncertainty under a normal distribution with the following standard deviations:
𝜎2 = 0.06 and 𝜎3 = 0.2. We would like to compute the output statistics for 𝑓
(mean, variance, and a histogram) and compute the reliability of the constraint
at this current iteration.
We do not know how many samples we need to get reasonably converged
statistics, so we need to perform a convergence study. For a given number of
samples, we generate random numbers normally distributed with mean 𝑥 𝑖 and
standard deviation 𝜎𝑖 . Then we evaluate the functions and compute the mean
(Eq. 12.33), variance (Eq. 12.34), and reliability of the outputs.
Figure 12.15 shows the convergence of the mean and standard deviation
using a random sampling curve, LHS (Section 10.2.1), and quasi-Monte Carlo
(using Halton sequence sampling from Section 10.2.2). The latter two methods
converge much more quickly than random sampling. LHS performs better for
few samples in this case, but generating the convergence data requires more
function evaluations than quasi-Monte Carlo because an all-new set of sample
points is generated for each 𝑛 (instead of being incrementally generated as in
the Halton sequence for quasi-Monte Carlo). That cost is less problematic for
optimization applications because the convergence testing is only done at the
preprocessing stage. Once a number of samples 𝑛 is chosen for convergence, 𝑛
is fixed throughout the optimization.

Fig. 12.15 Convergence of the mean (left) and standard deviation (right) versus
the number of samples using Monte Carlo (random sampling, LHS, and Halton).

From the data, we conclude that we need about 𝑛 = 10⁴ samples to have
well-converged statistics. Using 𝑛 = 10⁴ yields 𝜇 = 6.127, 𝜎 = 1.235, and
𝑟 = 0.9914. The random sampling of these results varies between simulations
(except for the Halton sequence in quasi-Monte Carlo, which is deterministic).
The production of an output histogram is a key benefit of this method. The
histogram of the objective function is shown in Fig. 12.16. Notice that it is not
normally distributed in this case.
Fig. 12.16 Histogram of objective function for 10,000 samples.

12.3.4 Polynomial Chaos

Polynomial chaos (also known as spectral expansions) is a class of forward-
propagation methods that take advantage of the inherent smoothness
of the outputs of interest using polynomial approximations.§
The method extends the ideas of Gaussian quadrature to estimate the
output function, from which the output distribution and other summary
statistics can be efficiently generated.198 In addition to using orthogonal
polynomials to evaluate integrals, we use them to approximate the
output function. As in Gaussian quadrature, the polynomials are
orthogonal with respect to a specified probability distribution (see
Eq. 12.25 and Table 12.1). A general function that depends on uncertain
variables 𝑥 can be represented as a sum of basis functions 𝜓𝑖 (which
are usually polynomials) with weights 𝛼𝑖,

    𝑓(𝑥) = Σ_{𝑖=0}^{∞} 𝛼𝑖 𝜓𝑖(𝑥) .    (12.35)

§ Polynomial chaos is not chaotic and does not actually need polynomials. The name
polynomial chaos came about because it was initially derived for use in a physical theory
of chaos.
198. Wiener, The homogeneous chaos, 1938.

In practice, we truncate the series after 𝑛 + 1 terms and use

    𝑓(𝑥) ≈ Σ_{𝑖=0}^{𝑛} 𝛼𝑖 𝜓𝑖(𝑥) .    (12.36)

The required number of terms 𝑛 for a given input dimension 𝑑 and
polynomial order 𝑜 is

    𝑛 + 1 = (𝑑 + 𝑜)! / (𝑑! 𝑜!) .    (12.37)

This approach amounts to a truncated generalized Fourier series.
By definition, we choose the first basis function to be 𝜓0 = 1. This
means that the first term in the series is a constant (polynomial of order
0). Because the basis functions are orthogonal, we know that

    ⟨𝜓𝑖, 𝜓𝑗⟩ = 0  if 𝑖 ≠ 𝑗 .    (12.38)

Polynomial chaos consists of three main steps:

1. Select an orthogonal polynomial basis.
2. Compute coefficients to fit the desired function.
3. Compute statistics on the function of interest.
These three steps are described in the following sections. We begin
with the last step because it provides insight for the first two.

Compute Statistics

Using the polynomial approximation (Eq. 12.36) in the definition of the
mean, we obtain

    𝜇𝑓 = ∫_{−∞}^{∞} Σ_𝑖 𝛼𝑖 𝜓𝑖(𝑥) 𝑝(𝑥) d𝑥 .    (12.39)

The coefficients 𝛼𝑖 are constants that can be taken out of the integral, so
we can write

    𝜇𝑓 = Σ_𝑖 𝛼𝑖 ∫ 𝜓𝑖(𝑥) 𝑝(𝑥) d𝑥
       = 𝛼0 ∫ 𝜓0(𝑥) 𝑝(𝑥) d𝑥 + 𝛼1 ∫ 𝜓1(𝑥) 𝑝(𝑥) d𝑥 + 𝛼2 ∫ 𝜓2(𝑥) 𝑝(𝑥) d𝑥 + . . . .

We can multiply all terms by 𝜓0 without changing anything because
𝜓0 = 1, so we can rewrite this expression in terms of the inner product
as

    𝜇𝑓 = 𝛼0 ∫ 𝑝(𝑥) d𝑥 + 𝛼1 ⟨𝜓0, 𝜓1⟩ + 𝛼2 ⟨𝜓0, 𝜓2⟩ + . . . .    (12.40)

Because the polynomials are orthogonal, all the terms except the first
are zero (see Eq. 12.38). From the definition of a PDF (Eq. A.63), we
know that the integral in the first term is 1. Thus, the mean of the function
is simply the zeroth coefficient,

    𝜇𝑓 = 𝛼0 .    (12.41)
We can derive a formula for the variance using a similar approach.
Substituting the polynomial representation (Eq. 12.36) into the definition
of variance and using the same techniques used in deriving the mean,
we obtain
    𝜎𝑓² = ∫ (Σ_𝑖 𝛼𝑖 𝜓𝑖(𝑥))² 𝑝(𝑥) d𝑥 − 𝛼0²
        = Σ_𝑖 𝛼𝑖² ∫ 𝜓𝑖(𝑥)² 𝑝(𝑥) d𝑥 − 𝛼0²
        = 𝛼0² ∫ 𝜓0² 𝑝(𝑥) d𝑥 + Σ_{𝑖=1}^{𝑛} 𝛼𝑖² ∫ 𝜓𝑖(𝑥)² 𝑝(𝑥) d𝑥 − 𝛼0²
        = 𝛼0² + Σ_{𝑖=1}^{𝑛} 𝛼𝑖² ∫ 𝜓𝑖(𝑥)² 𝑝(𝑥) d𝑥 − 𝛼0²    (12.42)
        = Σ_{𝑖=1}^{𝑛} 𝛼𝑖² ∫ 𝜓𝑖(𝑥)² 𝑝(𝑥) d𝑥
        = Σ_{𝑖=1}^{𝑛} 𝛼𝑖² ⟨𝜓𝑖²⟩ .

That last step is just the definition of the weighted inner product
(Eq. 12.25), providing the variance in terms of the coefficients and
polynomials:

    𝜎𝑓² = Σ_{𝑖=1}^{𝑛} 𝛼𝑖² ⟨𝜓𝑖²⟩ .    (12.43)

The inner product ⟨𝜓𝑖²⟩ = ⟨𝜓𝑖, 𝜓𝑖⟩ can often be computed analytically.
For example, using Hermite polynomials with a normal distribution
yields

    ⟨𝐻𝑛²⟩ = 𝑛! .    (12.44)
For cases without analytic solutions, Gaussian quadrature of this inner
product is still straightforward and exact because it only includes
polynomials.
For multiple uncertain variables, the formulas are the same, but we
use multidimensional basis polynomials. Denoting these multidimen-
sional basis polynomials as Ψ𝑖 , we can write

    𝜇𝑓 = 𝛼0    (12.45)

    𝜎𝑓² = Σ_{𝑖=1}^{𝑛} 𝛼𝑖² ⟨Ψ𝑖²⟩ .    (12.46)

The multidimensional basis polynomials are defined by products of one-
dimensional polynomials, as detailed in the next section. Polynomial
chaos computes the mean and variance using these equations and
our definition of the inner product. Other statistics can be estimated
by sampling the polynomial expansion. Because we now have a
simple polynomial representation that no longer requires evaluating
the original (potentially expensive) function 𝑓 , we can use sampling
procedures (e.g., Monte Carlo) to create output distributions without
incurring high costs. Of course, we have to evaluate the function 𝑓 to
generate the coefficients, as we will discuss later.

Selecting an Orthogonal Polynomial Basis

As discussed in Section 12.3.2, we already know appropriate orthog-
onal polynomials for many continuous probability distributions (see
Table 12.1¶). We also have methods to generate other exponentially
convergent polynomial sets for any given empirical distribution.199
¶ Other polynomials can be used, but these polynomials are optimal because they yield exponential convergence.
199. Eldred et al., Evaluation of non-intrusive approaches for Wiener–Askey generalized polynomial chaos, 2008.
The multidimensional basis functions we need are defined by tensor
products. For example, if we had two variables from a uniform
probability distribution (and thus Legendre bases), then the polynomials
up through the second-order terms would be as follows:

    Ψ0(𝑥) = 𝜓0(𝑥1) 𝜓0(𝑥2) = 1
    Ψ1(𝑥) = 𝜓1(𝑥1) 𝜓0(𝑥2) = 𝑥1
    Ψ2(𝑥) = 𝜓0(𝑥1) 𝜓1(𝑥2) = 𝑥2
    Ψ3(𝑥) = 𝜓1(𝑥1) 𝜓1(𝑥2) = 𝑥1𝑥2
    Ψ4(𝑥) = 𝜓2(𝑥1) 𝜓0(𝑥2) = (1/2)(3𝑥1² − 1)
    Ψ5(𝑥) = 𝜓0(𝑥1) 𝜓2(𝑥2) = (1/2)(3𝑥2² − 1) .
The 𝜓1 (𝑥1 )𝜓2 (𝑥2 ) term, for example, does not appear in this list because
it is a third-order polynomial, and we truncated the series after the
second-order terms. We should expect this number of basis functions
because Eq. 12.37 with 𝑑 = 2 and 𝑜 = 2 yields 𝑛 + 1 = 6.
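
Programmatically, a total-order basis like this can be enumerated with multi-indices, where each index gives the one-dimensional polynomial orders multiplied together in a given Ψ𝑖. A small sketch (the function name is illustrative):

    from itertools import product
    from math import comb

    def total_order_multi_indices(d, o):
        """All d-dimensional multi-indices (k1, ..., kd) with k1 + ... + kd <= o;
        each one corresponds to a basis term psi_k1(x1) * ... * psi_kd(xd)."""
        return [k for k in product(range(o + 1), repeat=d) if sum(k) <= o]

    indices = total_order_multi_indices(d=2, o=2)
    print(len(indices), indices)   # 6 terms for d = 2, o = 2, matching Eq. 12.37
    print(comb(2 + 2, 2))          # (d + o)! / (d! o!) written as a binomial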

Determine Coefficients

Now that we have selected an orthogonal polynomial basis, 𝜓𝑖(𝑥), we
need to determine the coefficients 𝛼𝑖 in Eq. 12.36. We discuss two
approaches for determining the coefficients. The first approach is
quadrature, which is also known as spectral projection. The second is
with regression, which is also known as stochastic collocation.
Let us start with the quadrature approach. Beginning with the
polynomial approximation

    𝑓(𝑥) = Σ_𝑖 𝛼𝑖 𝜓𝑖(𝑥) ,    (12.47)

we take the inner product of both sides with respect to 𝜓𝑗,

    ⟨𝑓(𝑥), 𝜓𝑗⟩ = Σ_𝑖 𝛼𝑖 ⟨𝜓𝑖, 𝜓𝑗⟩ .    (12.48)

Using the orthogonality property of the basis functions (Eq. 12.38), all
the terms in the summation are zero except for

    ⟨𝑓(𝑥), 𝜓𝑖⟩ = 𝛼𝑖 ⟨𝜓𝑖²⟩ .    (12.49)

Thus, we can find each coefficient by

    𝛼𝑖 = (1/⟨𝜓𝑖²⟩) ∫ 𝑓(𝑥) 𝜓𝑖(𝑥) 𝑝(𝑥) d𝑥 ,    (12.50)

where we replaced the inner product with the definition given by
Eq. 12.17.
As expected, the zeroth coefficient corresponds to the definition of
the mean,

    𝛼0 = ∫ 𝑓(𝑥) 𝑝(𝑥) d𝑥 .    (12.51)

These coefficients can be obtained through multidimensional quadra-
ture (see Section 12.3.2) or Monte Carlo simulation (Section 12.3.3),
which means that this approach inherits the same limitations of the cho-
sen quadrature approach. However, the process can be more efficient if
the selected basis functions are good approximations of the distribu-
tions. These integrals are usually evaluated using Gaussian quadrature
(e.g., Gauss–Hermite quadrature if 𝑝(𝑥) is a normal distribution).
Suppose all we are interested in is the mean (Eqs. 12.41 and 12.51).
In that case, the polynomial chaos approach amounts to just Gaussian
quadrature. However, if we want to compute other statistical properties
or produce an output PDF, the additional effort of obtaining the higher-
order coefficients produces a polynomial approximation of 𝑓 (𝑥) that
we can then sample to predict other quantities of interest.
It may appear that to estimate 𝑓 (𝑥) (Eq. 12.36), we need to know
𝑓 (𝑥) (Eq. 12.50). The distinction is that we just need to be able to
evaluate 𝑓 (𝑥) at some predefined quadrature points, which in turn
gives a polynomial approximation for any 𝑥.
The second approach to determining the coefficients is regression.
Equation 12.36 is linear, so we can estimate the coefficients using least
squares (although an underdetermined system with regularization can
be used as well). If we evaluate the function 𝑚 times, where 𝑥^(𝑖) is the
𝑖th sample, the resulting linear system is as follows:

    [ 𝜓0(𝑥^(1))  . . .  𝜓𝑛(𝑥^(1)) ] [ 𝛼0 ]   [ 𝑓(𝑥^(1)) ]
    [     ⋮        ⋱        ⋮     ] [  ⋮ ] = [     ⋮    ]    (12.52)
    [ 𝜓0(𝑥^(𝑚))  . . .  𝜓𝑛(𝑥^(𝑚)) ] [ 𝛼𝑛 ]   [ 𝑓(𝑥^(𝑚)) ]

There are software packages that facilitate the use of polynomial chaos methods.200,201
200. Adams et al., Dakota, a multilevel parallel object-oriented framework for design optimization, parameter estimation, uncertainty quantification, and sensitivity analysis: Version 6.14 user's manual, 2021.
201. Feinberg and Langtangen, Chaospy: An open source tool for designing methods of uncertainty quantification, 2015.

As a rule of thumb, the number of sample points 𝑚 should be at least
twice as large as the number of unknowns, 𝑛 + 1. The sampling points,
also known as the collocation points, typically correspond to the nodes in
the corresponding quadrature strategy or utilize random sequences.‖
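
The following is a minimal sketch of the regression approach for a single normally distributed input using the Hermite polynomials of Eq. 12.26; the random collocation points, sample count, and test function are illustrative assumptions rather than the book's implementation:

    import numpy as np
    from math import factorial

    def hermite_basis(z, order):
        """Hermite polynomials H_0..H_order evaluated at z (recurrence of Eq. 12.26)."""
        H = [np.ones_like(z), z.copy()]
        for n in range(1, order):
            H.append(z * H[n] - n * H[n - 1])
        return np.column_stack(H[: order + 1])

    def pce_regression(f, mu, sigma, order=4, m=None, seed=0):
        """Fit polynomial chaos coefficients by least squares (Eq. 12.52) and
        return the mean and standard deviation from Eqs. 12.41 and 12.43."""
        rng = np.random.default_rng(seed)
        m = m or 2 * (order + 1)               # rule of thumb: ~2x the unknowns
        z = rng.standard_normal(m)             # random collocation points
        A = hermite_basis(z, order)            # m x (order + 1) basis matrix
        alpha, *_ = np.linalg.lstsq(A, f(mu + sigma * z), rcond=None)
        norms = np.array([factorial(n) for n in range(order + 1)])   # <H_n^2> = n!
        return alpha[0], np.sqrt(np.sum(alpha[1:]**2 * norms[1:]))

    # Check against Ex. 12.6: f(x) = cos(x^2) with x ~ N(2, 0.2); the mean
    # should be near -0.467.
    print(pce_regression(lambda x: np.cos(x**2), 2.0, 0.2, order=4, m=200))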

Example 12.10 Forward propagation with polynomial chaos

Consider the following objective function:

𝑓 (𝑥) = 3 + cos(3𝑥1 ) + exp(−2𝑥 2 ) ,

where the current iteration is at 𝑥 = [1, 1], and we assume that both design
variables are normally distributed with the following standard deviations:
𝜎 = [0.06, 0.2].
We approximate the function with fourth-order Hermite polynomials.
Using Eq. 12.37, we see that there are 15 basis functions from the various
combinations of 𝐻𝑖 𝐻 𝑗 :

    Ψ0  = 𝐻0(𝑥1) 𝐻0(𝑥2)
    Ψ1  = 𝐻0(𝑥1) 𝐻1(𝑥2) = 𝑥2
    Ψ2  = 𝐻0(𝑥1) 𝐻2(𝑥2) = 𝑥2² − 1
    Ψ3  = 𝐻0(𝑥1) 𝐻3(𝑥2) = 𝑥2³ − 3𝑥2
    Ψ4  = 𝐻0(𝑥1) 𝐻4(𝑥2) = 𝑥2⁴ − 6𝑥2² + 3
    Ψ5  = 𝐻1(𝑥1) 𝐻0(𝑥2) = 𝑥1
    Ψ6  = 𝐻1(𝑥1) 𝐻1(𝑥2) = 𝑥1𝑥2
    Ψ7  = 𝐻1(𝑥1) 𝐻2(𝑥2) = 𝑥1𝑥2² − 𝑥1
    Ψ8  = 𝐻1(𝑥1) 𝐻3(𝑥2) = 𝑥1𝑥2³ − 3𝑥1𝑥2
    Ψ9  = 𝐻2(𝑥1) 𝐻0(𝑥2) = 𝑥1² − 1
    Ψ10 = 𝐻2(𝑥1) 𝐻1(𝑥2) = 𝑥1²𝑥2 − 𝑥2
    Ψ11 = 𝐻2(𝑥1) 𝐻2(𝑥2) = 𝑥1²𝑥2² − 𝑥1² − 𝑥2² + 1
    Ψ12 = 𝐻3(𝑥1) 𝐻0(𝑥2) = 𝑥1³ − 3𝑥1
    Ψ13 = 𝐻3(𝑥1) 𝐻1(𝑥2) = 𝑥1³𝑥2 − 3𝑥1𝑥2
    Ψ14 = 𝐻4(𝑥1) 𝐻0(𝑥2) = 𝑥1⁴ − 6𝑥1² + 3 .

The integrals for the basis functions (Hermite polynomials) have analytic
solutions:

    ⟨Ψ𝑘²⟩ = ⟨(𝐻𝑚 𝐻𝑛)²⟩ = 𝑚! 𝑛! .

We now compute the following double integrals to obtain the coefficients using
Gaussian quadrature:

    𝛼𝑘 = (1/⟨Ψ𝑘²⟩) ∫_{−∞}^{∞} ∫_{−∞}^{∞} 𝑓(𝑥) Ψ𝑘(𝑥) 𝑝(𝑥) d𝑥1 d𝑥2 .

We must be careful with variable definitions because the inputs are not standard
normal distributions. The function 𝑓 is defined over the unnormalized variable
𝑥, whereas our basis functions are defined over a standard normal distribu-
tion: 𝑦 = (𝑥 − 𝜇)/𝜎. The probability distribution in this case is a bivariate,

uncorrelated, normal distribution:


    𝛼𝑘 = (1/⟨Ψ𝑘²⟩) ∫_{−∞}^{∞} ∫_{−∞}^{∞} 𝑓(𝑥) Ψ𝑘((𝑥 − 𝜇)/𝜎) ×
         (1/(2𝜋𝜎1𝜎2)) exp(−((𝑥1 − 𝜇1)/(√2 𝜎1))²) exp(−((𝑥2 − 𝜇2)/(√2 𝜎2))²) d𝑥1 d𝑥2 .

To put this in the proper form for Gauss–Hermite quadrature, we use the
change of variable 𝑧 = (𝑥 − 𝜇)/(√2 𝜎), as follows:

    𝛼𝑘 = (1/(⟨Ψ𝑘²⟩ 𝜋)) ∫_{−∞}^{∞} ∫_{−∞}^{∞} 𝑓(√2 𝜎𝑧 + 𝜇) Ψ𝑘(√2 𝑧) 𝑒^{−𝑧1²} 𝑒^{−𝑧2²} d𝑧1 d𝑧2 .

Applying Gauss–Hermite quadrature, the integral is approximated by

    𝛼𝑘 ≈ (1/(𝜋⟨Ψ𝑘²⟩)) Σ_{𝑖=1}^{𝑛𝑖} Σ_{𝑗=1}^{𝑛𝑗} 𝑤𝑖 𝑤𝑗 𝑓(𝑋𝑖𝑗) Ψ𝑘(√2 𝑍𝑖𝑗) ,

where 𝑛𝑖 and 𝑛𝑗 determine the number of quadrature nodes we choose to
include, and 𝑋𝑖𝑗 is the tensor product

    𝑋 = (√2 𝜎1 𝑧1 + 𝜇1) ⊗ (√2 𝜎2 𝑧2 + 𝜇2) ,

and 𝑍 = 𝑧1 ⊗ 𝑧2 .
In this case, we choose a full tensor product mesh of the fifth order in both
dimensions. The nodes and weights are given by

    𝑧1 = 𝑧2 = [−2.02018, −0.95857, 0.0, 0.95857, 2.02018]
    𝑤1 = 𝑤2 = [0.01995, 0.39362, 0.94531, 0.39362, 0.01995]

and visualized as a tensor product of evaluation points in Fig. 12.17. The
nonzero coefficients (within a tolerance of approximately 10⁻⁴) are as follows:

    𝛼0 = 2.1725
    𝛼1 = −0.0586
    𝛼2 = 0.0117
    𝛼3 = −0.00156
    𝛼5 = −0.0250
    𝛼9 = 0.01578 .

Fig. 12.17 Evaluation nodes with area proportional to weight.

We can now easily compute the mean and standard deviation as

    𝜇𝑓 = 𝛼0 = 2.1725
    𝜎𝑓 = √(Σ_{𝑖=1}^{𝑛} 𝛼𝑖² ⟨Ψ𝑖²⟩) = 0.06966 .

In this case, we are able to accurately estimate the mean and standard
deviation with only 25 function evaluations. In contrast, applying Monte Carlo
to this same problem, with LHS, requires about 10,000 function calls to estimate
the mean and over 100,000 function calls to estimate the standard deviation
(with less accuracy).
Although direct quadrature would work equally well if all we wanted was
the mean and standard deviation, polynomial chaos gives us a polynomial
approximation of our function near 𝜇𝑥:

    𝑓̃(𝑥) = Σ_𝑖 𝛼𝑖 Ψ𝑖(𝑥) .

This fourth-order polynomial is compared to the original function in Fig. 12.18,
where the dot represents the mean of 𝑥.

Fig. 12.18 Original function (left) and polynomial expansion about 𝜇𝑥 (right).

The primary benefit of this new function is that it is very inexpensive to
evaluate (and the original function is often expensive), so we can use sampling
procedures to compute other statistics, such as percentiles or reliability levels,
or simply to visualize the output PDF, as shown in Fig. 12.19.

Fig. 12.19 Output histogram produced by sampling the polynomial expansion.

12.4 Summary

Engineering problems are subject to variation under uncertainty. OUU
deals with optimization problems where the design variables or other
parameters have uncertain variability. Robust design optimization seeks
designs that are less sensitive to inherent variability in the objective
function. Common OUU objectives include minimizing the mean or
standard deviation or performing multiobjective trade-offs between the
mean performance and standard deviation. Reliable design optimiza-
tion seeks designs with a reduced probability of failure, considering
the variability in the constraint values. To quantify robustness and
reliability, we need a forward-propagation procedure that propagates

the probability distributions of the inputs (either design variables or
parameters that are fixed during optimization) to the statistics or prob-
ability distributions of the outputs (objective and constraint functions).
Four classes of forward propagation methods were discussed in this
chapter.∗
∗ This list is not exhaustive. For example, the methods discussed in this chapter are
nonintrusive. Intrusive polynomial chaos uses expansions inside governing equations.
Like intrusive methods for derivative computation (Chapter 6), intrusive methods for
forward propagation require more implementation effort but are more accurate and
efficient.
Perturbation methods use a Taylor series expansion of the output
functions to estimate the mean and variance. These methods can be
efficient for a range of problem sizes, especially if accurate derivatives are
available. Their main weaknesses are that they require derivatives (and
hence second derivatives when using a gradient-based optimization),
only work well with symmetric input probability distributions, and
only provide the mean and variance (for first-order methods).
Direct quadrature uses numerical quadrature to evaluate the sum-
mary statistics. This process is straightforward and effective. Its primary
weakness is that it is limited to low-dimensional problems (number of
random inputs). Sparse grids enable these methods to handle a higher
number of dimensions, but the scaling is still lacking.
Monte Carlo methods approximate the summary statistics and out-
put distributions using random sampling and the law of large numbers.
These methods are straightforward to use and are independent of the
problem dimension. Their major weakness is that they are inefficient.
However, because the alternatives are intractable for a large number
of random inputs, Monte Carlo is an appropriate choice for many
high-dimensional problems.
Polynomial chaos represents uncertain variables as a sum of orthog-
onal basis functions. This method is often a more efficient way to char-
acterize both statistical moments and output distributions. However,
the methodology is usually limited to a small number of dimensions
because the number of required basis functions grows exponentially.

Problems

12.1 Answer true or false and justify your answer.

a. The greater the reliability, the less likely the design is to have
a worse objective function value.
b. Reliability can be handled in a deterministic way using safety
factors, which ensure that the optimum has some margin
before the original constraint is violated.
c. Forward propagation computes the PDFs of the outputs and
inputs for a given numerical model.
d. The computational cost of direct quadrature scales exponen-
tially with the number of random variables, whereas the cost
of Monte Carlo is independent of the number of random
variables.
e. Monte Carlo methods approximate PDFs using random
sampling and converge slowly.
f. The first-order perturbation method computes the PDFs
using local Taylor series expansions.
g. Because the first-order perturbation method requires first-
order derivatives to compute the uncertainty metrics, OUU
using the first-order perturbation method requires second-
order derivatives.
h. Polynomial chaos is a forward-propagation technique that
uses polynomial approximations with random coefficients
to model the input uncertainties.
i. The number of basis functions required by polynomial chaos
grows exponentially with the number of uncertain input
variables.

12.2 Consider the following problem:

    minimize    𝑓 = 𝑥1² + 𝑥2⁴ + 𝑥2 exp(𝑥3)
    subject to  𝑥1² + 𝑥2² + 𝑥3³ ≥ 10
                𝑥1𝑥2 + 𝑥2𝑥3 ≥ 5 .

Assume that all design variables are random variables with
the following standard deviations: 𝜎𝑥1 = 0.1, 𝜎𝑥2 = 0.2, 𝜎𝑥3 =
0.05. Use the iterative reliability-based optimization procedure
to find a reliable optimum with an overall reliability of 99.9
percent. How much did the objective decrease relative to the

deterministic optimum? Check your reliability level with Monte


Carlo simulation.

12.3 Using Gaussian quadrature, find the mean and variance of the
function exp(cos(𝑥)) at 𝑥 = 1, assuming 𝑥 is normally distributed
with a standard deviation of 0.1. Determine how many evaluation
points are needed to converge to 5 decimal places. Compare your
results to trapezoidal integration.

12.4 Repeat the previous problem, but assume a uniform distribution
with a half-width of 0.1.

12.5 Consider the function in Ex. 12.10. Solve the same problem,
but use Monte Carlo sampling instead. Compare the output
histogram and how many function calls are required to achieve
well-converged results for the mean and variance.

12.6 Repeat Ex. 12.10 using polynomial chaos, except with a uniform
distribution in both dimensions, where the standard deviations
from the example correspond to the half-width of a uniform
distribution.

12.7 Robust optimization of a wind farm. We want to find the optimal
turbine layout for a wind farm to minimize the cost of energy
(COE). We will consider a very simplified wind farm with only
three wind turbines. The first turbine will be fixed at (0, 0), and the
𝑥-positions of the back two turbines will be fixed with 4-diameter
spacing between them. The only thing we can change is the
𝑦-position of the two back turbines, as shown in Fig. 12.20 (all
dimensions in this problem are in terms of rotor diameters). In
other words, we just have two design variables: 𝑦2 and 𝑦3 .

Fig. 12.20 Wind farm layout (wind from the left; turbines 𝑇1, 𝑇2, and 𝑇3, with
the back turbines offset by 𝑦2 and 𝑦3 and spaced 4 diameters apart streamwise).

We further simplify by assuming the wind always comes from
the west, as shown in the figure, and is always at a constant speed.
The wake model has a few parameters that define things like its

spread angle and decay rate. We will refer to these parameters as
𝛼, 𝛽, and 𝛿 (knowing exactly what each parameter corresponds to
is not important for our purposes). The supplementary resources
repository contains code for this problem.

a. Run the optimization deterministically, assuming that the
three wake parameters are 𝛼 = 0.1, 𝛽 = 9, and 𝛿 = 5.
Because there are several possible similar solutions, we add
the following constraints: 𝑦 𝑖 ≥ 0 (bound) and 𝑦3 ≥ 𝑦2 (linear).
Do not use [0, 0] as the starting point for the optimization
because that occurs right at a flat spot in the wake (a fixed
point), so you might not make any progress. Report the
optimal spacing that you find.
b. Now assume that the wake parameters are uncertain vari-
ables under some probability distribution. Specifically, we
have the following information for the three parameters:
• 𝛼 is governed by a Weibull distribution with a scale
parameter of 0.1 and a shape parameter of 1.
• 𝛽 is given by a normal distribution with a mean and
standard deviation of 𝜇=9, 𝜎=1.
• 𝛿 is given by a normal distribution with a mean and
standard deviation of 𝜇=5, 𝜎=0.4.
Note that the mean for all of these distributions corresponds
to the deterministic value we used previously.
Using a Monte Carlo method, run an OUU minimizing the
95th percentile for COE.
c. Once you have completed both optimizations, perform a
cross analysis by filling out the four numbers in the table
that follows.
                        Deterministic COE    95th percentile COE
Deterministic layout    [ ]                  [ ]
OUU layout              [ ]                  [ ]

Take the two optimal designs that you found, and then
compare each on the two objectives (deterministic and 95th
percentile). The first row corresponds to the performance of
the optimal deterministic layout. Evaluate the performance
of this layout using the deterministic value for COE and the
95th percentile that accounts for uncertainty. Repeat for the
optimal solution for the OUU case. Discuss your findings.
13 Multidisciplinary Design Optimization
As mentioned in Chapter 1, most engineering systems are multidiscipli-
nary, motivating the development of multidisciplinary design optimiza-
tion (MDO). The analysis of multidisciplinary systems requires coupled
models and coupled solvers. We prefer the term component instead of
discipline or model because it is more general. However, we use these
terms interchangeably depending on the context. When components in
a system represent different physics, the term multiphysics is commonly
used.
All the optimization methods covered so far apply to multidisci-
plinary problems if we view the coupled multidisciplinary analysis
as a single analysis that computes the objective and constraint func-
tions by solving the coupled model for a given set of design variables.
However, there are additional considerations in the solution, derivative
computation, and optimization of coupled systems.
In this chapter, we build on Chapter 3 by introducing models and
solvers for coupled systems. We also expand the derivative computation
methods of Chapter 6 to handle such systems. Finally, we introduce
various MDO architectures, which are different options for formulating
and solving MDO problems.

By the end of this chapter you should be able to:

1. Describe when and why you might want to use MDO.


2. Read and create XDSM diagrams.
3. Compute derivatives of coupled models.
4. Understand the differences between monolithic and dis-
tributed architectures.

13.1 The Need for MDO

In Chapter 1, we mentioned that MDO increases the system perfor-


mance, decreases the design time, reduces the total cost, and reduces


the uncertainty at a given point in time (recall Fig. 1.3). Although these
benefits still apply when modeling and optimizing a single discipline
or component, broadening the modeling and optimization to the whole
system brings on additional benefits.
Even without performing any optimization, constructing a multi-
disciplinary (coupled) model that considers the whole engineering
system is beneficial. Such a model should ideally consider all the
interactions between the system components. In addition to modeling
physical phenomena, the model should also include other relevant
considerations, such as economics and human factors. The benefit of
such a model is that it better reflects the actual state and performance of
the system when deployed in the real world, as opposed to an isolated
component with assumed boundary conditions. Using such a model,
designers can quantify the actual impact of proposed changes on the
whole system.
When considering optimization, the main benefit of MDO is that op-
timizing the design variables for the various components simultaneously
leads to a better system than when optimizing the design variables
for each component separately. Currently, many engineering systems
are designed and optimized sequentially, which leads to suboptimal
designs. This approach is often used in industry, where engineers are
grouped by discipline, physical subsystem, or both. This might be per-
ceived as the only choice when the engineering system is too complex
and the number of engineers too large to coordinate a simultaneous
design involving all groups.
Sequential optimization is analogous to coordinate descent, which
consists of optimizing each variable sequentially, as shown in Fig. 13.1.

Fig. 13.1 Sequential optimization is analogous to coordinate descent.

optimizes distinct sets of variables at a time, but the principle remains
the same. This approach tends to work for unconstrained problems,
although the convergence rate is limited to being linear.
One issue with sequential optimization is that it might converge to
a suboptimal point for a constrained problem. An example of such a
case is shown in Fig. 13.2, where sequential optimization gets stuck at
the constraint because it cannot decrease the objective while remaining
feasible by only moving in one of the directions. In this case, the
optimization must consider both variables simultaneously to find a
feasible descent direction.

Fig. 13.2 Sequential optimization can fail to find the constrained optimum
because the optimization with respect to a set of variables might not see a
feasible descent direction that otherwise exists when considering all variables
simultaneously.

Another issue is that when there are variables that affect multiple
disciplines (called shared design variables), we must make a choice about
which discipline handles those variables. If we let each discipline
optimize the same shared variable, the optimizations likely yield

different values for those variables each time, in which case they
will not converge. On the other hand, if we let one discipline handle a
shared variable, it will likely converge to a value that violates one or
more constraints from the other disciplines.
By considering the various components and optimizing a multidisci-
plinary performance metric with respect to as many design variables
as possible simultaneously, MDO automatically finds the best trade-off
between the components—this is the key principle of MDO. Suboptimal
designs also result from decisions at the system level that involve power
struggles between designers. In contrast, MDO provides the right
trade-offs because mathematics does not care about politics.

Example 13.1 MDO applied to wing design

Consider a multidisciplinary model of an aircraft wing, where the aero-
dynamics and structures disciplines are coupled to solve an aerostructural
analysis and design optimization problem. For a given flow condition, the
aerodynamic solver computes the forces on the wing for a given wing shape,
whereas the structural solver computes the wing displacement for a given set
of applied forces. Thus, these two models are coupled as shown in Fig. 13.3.
For a steady flow condition, there is only one wing shape and a corresponding
set of forces that satisfies both disciplinary models simultaneously.

Fig. 13.3 Multidisciplinary numerical model for an aircraft wing (inputs: shape
and structural sizing; components: aerodynamic solver, structural solver, weight,
surface pressure integration, stress computation, and fuel computation; outputs:
drag, lift, displacements, weight, structural stresses, and fuel consumption).

In the absence of a coupled model, aerodynamicists may have to
assume a fixed wing shape at the flight conditions of interest. Similarly,
structural designers may assume fixed loads in their structural analysis.
However, solving the coupled model is necessary to get the actual flying
shape of the wing and the corresponding performance metrics.
One possible design optimization problem based on these models
would be to minimize the drag by changing the wing shape and the
structural sizing while satisfying a lift constraint and structural stress
constraints. Optimizing the wing shape and structural sizing simulta-
neously yields the best possible result because it finds feasible descent
directions that would not be available with sequential optimization.
Wing shape variables, such as wingspan, are shared design variables be-
cause they affect both the aerodynamics and the structure. They cannot
be optimized by considering aerodynamics or structures separately.

13.2 Coupled Models

As mentioned in Chapter 3, a model is a set of equations that we solve
to predict the state of the engineering system and compute the objective
and constraint function values. More generally, we can have a coupled
model, which consists of multiple models (or components) that depend
on each other’s state variables.
The same steps for formulating a design optimization problem
(Section 1.2) apply in the formulation of MDO problems. The main
difference in MDO problems is that the objective and constraints are
computed by the coupled model. Once such a model is in place,
the design optimization problem statement (Eq. 1.4) applies, with no
changes needed.
A generic example of a coupled model with three components is
illustrated in Fig. 13.4. Here, the states of each component affect all other
components. However, it is common for a component to depend only
on a subset of the other system components. Furthermore, we might
distinguish variables between internal state variables and coupling
variables (more on this in Section 13.2.2).

Fig. 13.4 Coupled model composed of three numerical models, with residuals
𝑟1(𝑥, 𝑢) = 0, 𝑟2(𝑥, 𝑢) = 0, and 𝑟3(𝑥, 𝑢) = 0 and states 𝑢 = [𝑢1, 𝑢2, 𝑢3]. This
coupled model would replace the single model in Fig. 3.21.

Mathematically, a coupled model is no more than a larger set of
equations to be solved, where all the governing equation residuals (𝑟),
the corresponding state variables (𝑢), and all the design variables (𝑥)
are concatenated into single vectors. Then, we can still just write the
whole multidisciplinary model as 𝑟(𝑥, 𝑢) = 0.
However, it is often necessary or advantageous to partition the sys-
tem into smaller components for three main reasons. First, specialized
solvers are often already in place for a given set of governing equations,
which may be more efficient at solving their set of equations than a
general-purpose solver. In addition, some of these solvers might be
black boxes that do not provide an interface for using alternative solvers.
Second, there is an incentive for building the multidisciplinary system
in a modular way. For example, a component might be useful on its own
and should therefore be usable outside the multidisciplinary system.
A modular approach also facilitates the extension of the multi-
disciplinary system and makes it easy to replace the model of a given
discipline with an alternative one. Finally, the overall system of equa-
tions may be more efficiently solved if it is partitioned in a way that
exploits the system structure. These reasons motivate an implementa-
tion of coupled models that is flexible enough to handle a mixture of
different types of models and solvers for each component.

Tip 13.1 Beware of loss of precision when coupling components

Precision can be lost when coupling components, leading to a loss of
precision in the overall coupled system solution. Ideally, the various components
would be coupled through memory, that is, a component can provide a pointer
to or a copy of the variable or array to the other components. If the type (e.g.,
double-precision float) is maintained, then there would be no loss in precision.
However, the number type might not be maintained in some conversions,
so it is crucial to be aware of this possibility and mitigate it. One common issue
is that components need to be coupled through file input and output. Codes do
not usually write all the available digits to the file, causing a loss in precision.
Casting a read variable to another type might also introduce errors. Find the
level of numerical error (Tip 3.2) and mitigate these issues as much as possible.

We start the remainder of this section by defining components in
more detail (Section 13.2.1). We explain how the coupling variables
relate to the state variables (Section 13.2.2) and coupled system formu-
lation (Section 13.2.3). Then, we discuss the coupled system structure
(Section 13.2.4). Finally, we explain methods for solving coupled sys-
tems (Section 13.2.5), including a hierarchical approach that can handle
a mixture of models and solvers (Section 13.2.6).

13.2.1 Components
In Section 3.3, we explained how all models can ultimately be written as
a system of residuals, 𝑟(𝑥, 𝑢) = 0. When the system is large or includes
submodels, it might be natural to partition the system into components.
We prefer to use the more general term components instead of disciplines
to refer to the submodels resulting from the partitioning because the
partitioning of the overall model is not necessarily by discipline (e.g.,
aerodynamics, structures). A system model might also be partitioned
by physical system components (e.g., wing, fuselage, or an aircraft
in a fleet) or by different conditions applied to the same model (e.g.,
aerodynamic simulations at different flight conditions).
The partitioning can also be performed within a given discipline
for the same reasons cited previously. In theory, the system model
equations in 𝑟(𝑥, 𝑢) = 0 can be partitioned in any way, but only some
partitions are advantageous or make sense. We denote a partitioning
into 𝑛 components as




$$
r(u) = 0 \;\equiv\;
\begin{cases}
r_1(u_1;\, u_2, \ldots, u_i, \ldots, u_n) = 0 \\
\quad\vdots \\
r_i(u_i;\, u_1, \ldots, u_{i-1}, u_{i+1}, \ldots, u_n) = 0 \\
\quad\vdots \\
r_n(u_n;\, u_1, \ldots, u_i, \ldots, u_{n-1}) = 0
\end{cases}
\tag{13.1}
$$

Each 𝑟 𝑖 and 𝑢𝑖 are vectors corresponding to the residuals and states of
component 𝑖. The semicolon denotes that we solve each component 𝑖
by driving its residuals (𝑟 𝑖 ) to zero by varying only its states (𝑢𝑖 ) while
keeping the states from all other components constant. We assume this
is possible, but this is not guaranteed in general. We have omitted the
dependency on 𝑥 in Eq. 13.1 because, for now, we just want to find the
state variables that solve the governing equations for a fixed design.
Components can be either implicit or explicit, a concept we introduced
in Section 3.3. To solve an implicit component 𝑖, we need an algorithm
for driving the equation residuals, 𝑟 𝑖 (𝑢1 , . . . , 𝑢𝑖 , . . . , 𝑢𝑛 ), to zero by
varying the states 𝑢𝑖 while the other states (𝑢 𝑗 for all 𝑗 ≠ 𝑖) remain fixed.
This algorithm could involve a matrix factorization for a linear system
or a Newton solver for a nonlinear system.
An explicit component is much easier to solve because that component's states are explicit functions of other components' states. The
states of an explicit component can be computed without factorization
or iteration. Suppose that the states of a component 𝑖 are given by the
explicit function 𝑢𝑖 = 𝑓 (𝑢 𝑗 ) for all 𝑗 ≠ 𝑖. As previously explained in
Section 3.3, we can convert an explicit equation to the residual form by
moving the function on the right-hand side to the left-hand side. Then,
we obtain a set of residuals,

𝑟 𝑖 (𝑢1 , . . . , 𝑢𝑛 ) = 𝑢𝑖 − 𝑓 (𝑢 𝑗 ) for all 𝑗 ≠ 𝑖. (13.2)

Therefore, there is no loss of generality when using the residual notation in Eq. 13.1.
Most disciplines involve a mix of implicit and explicit components
because, as mentioned in Section 3.3 and shown in Fig. 3.21, the state
variables are implicitly defined, whereas the objective function and
constraints are usually explicit functions of the state variables. In
addition, a discipline usually includes functions that convert inputs
and outputs, as discussed in Section 13.2.3.
As we will see in Section 13.2.6, the partitioning of a model can be
hierarchical, where components are gathered in multiple groups. These
groups can be nested to form a hierarchy with multiple levels. Again,
this might be motivated by efficiency, modularity, or both.

Example 13.2 Residuals of the coupled aerostructural problem

Let us formulate models for the aerostructural problem described in Ex. 13.1.∗ A possible model for the aerodynamics is a vortex-lattice model given by the linear system

$$A\,\Gamma = v \,,$$

where 𝐴 is the matrix of aerodynamic influence coefficients, and 𝑣 is a vector of boundary conditions, both of which depend on the wing shape. The state Γ is a vector that represents the circulation (vortex strength) at each spanwise position on the wing, as shown on the left-hand side of Fig. 13.5. The lift and drag scalars can be computed explicitly for a given Γ, so we write these dependencies as 𝐿 = 𝐿(Γ) and 𝐷 = 𝐷(Γ), omitting the detailed explicit expressions for conciseness.

∗ This description omits many details for brevity. Jasa et al.202 describe the aerostructural model in more detail and cite other references on the background theory.
202. Jasa et al., Open-source coupled aerostructural optimization using Python, 2018.

Fig. 13.5 Aerostructural wing model showing the aerodynamic state variables (circulations Γ) on the left and structural state variables (displacements 𝑑𝑧 and rotations 𝑑𝜃) on the right.

A possible model for the structures is a cantilevered beam modeled with Euler–Bernoulli elements,
𝐾𝑑 = 𝑞 , (13.3)
where 𝐾 is the stiffness matrix, which depends on the beam shape and sizing.
The right-hand-side vector 𝑞 represents the applied forces at the spanwise
position on the beam. The states 𝑑 are the displacements and rotations at each
node, as shown on the right-hand side of Fig. 13.5. The weight does not depend
on the states, and it is an explicit function of the beam sizing and shape, so it
does not involve the structural model (Eq. 13.3). The stresses are an explicit
function of the displacements, so we can write 𝜎 = 𝜎(𝑑), where 𝜎 is a vector
whose size is the number of elements.
When we couple these two models, 𝐴 and 𝑣 depend on the wing displacements 𝑑, and 𝑞 depends on Γ. We can write all the implicit and explicit equations as residuals:

$$r_1 = A(d)\,\Gamma - v(d)\,, \qquad r_2 = K d - q(\Gamma)\,.$$

The states of this system are as follows:

$$u = \begin{bmatrix} u_1 \\ u_2 \end{bmatrix} \equiv \begin{bmatrix} \Gamma \\ d \end{bmatrix}.$$

This coupled system is illustrated in Fig. 13.6.

Fig. 13.6 The aerostructural model couples aerodynamics and structures through a displacement and force transfer.
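The following minimal Python sketch illustrates this residual structure numerically. The matrices and coupling terms are made up (two aerodynamic panels and two structural degrees of freedom) purely for illustration; they are not the model used in this chapter's examples.

```python
# Minimal numerical sketch of the residual structure in Ex. 13.2:
#   r1 = A(d) Gamma - v(d),   r2 = K d - q(Gamma)
# All matrices and couplings below are invented for illustration only.
import numpy as np

def aero_residual(gamma, d):
    A = np.array([[2.0, 0.3], [0.3, 2.0]]) + 0.1 * np.diag(d)  # A(d), assumed
    v = np.array([1.0, 0.8]) + 0.05 * d                        # v(d), assumed
    return A @ gamma - v                                        # r1

def struct_residual(d, gamma):
    K = np.array([[4.0, -1.0], [-1.0, 3.0]])                    # stiffness matrix
    q = 0.5 * gamma                                              # q(Gamma), assumed load transfer
    return K @ d - q                                             # r2

u = np.zeros(4)                                                  # u = [Gamma, d]
r = np.concatenate([aero_residual(u[:2], u[2:]), struct_residual(u[2:], u[:2])])
print(r)                                                         # residuals r(u) of the coupled system
```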

13.2.2 Models and Coupling Variables


In MDO, the coupling variables are variables that need to be passed from
the model of one discipline to the others because of interdependencies
in the system. Thus, the coupling variables are the inputs and outputs
of each model. Sometimes, the coupling variables are just the state
variables of one model (or a subset of these) that get passed to another
model, but often we need to convert between the coupling variables
and other variables within the model.
We represent the coupling variables by a vector 𝑢ˆ 𝑖 , where the
subscript 𝑖 denotes the model that computes these variables. In other
words, 𝑢ˆ 𝑖 contains the outputs of model 𝑖. A model 𝑖 can take any
coupling variable vector 𝑢ˆ 𝑗≠𝑖 as one of its inputs, where the subscript
indicates that 𝑗 can be the output from any model except its own.
Figure 13.7 shows the inputs and outputs for a model. The model solves
for the set of its state variables, 𝑢𝑖 . The residuals in the solver depend
on the input variables coming from other models. In general, this is
not a direct dependency, so the model may require an explicit function
(𝑃𝑖 ) that converts the inputs (𝑢ˆ 𝑗≠𝑖 ) to the required parameters 𝑝 𝑖 . These
parameters remain fixed when the model solves its implicit equations
for 𝑢𝑖 .
After the model solves for its state variables (𝑢𝑖 ), there may be
another explicit function (𝑄 𝑖 ) that converts these states to output
variables (𝑢ˆ 𝑖 ) for the other models. The function (𝑄 𝑖 ) typically reduces
the number of output variables relative to the number of internal states, sometimes by orders of magnitude.

Fig. 13.7 In the general case, a model may require conversions of inputs and outputs distinct from the states that the solver computes.
The model shown in Fig. 13.7 can be viewed as an implicit function
that computes its outputs as a function of all the inputs, so we can
write 𝑢ˆ 𝑖 = 𝑈 𝑖 (𝑢ˆ 𝑗≠𝑖 ). The model contains three components: two explicit
and one implicit. We can convert the explicit components to residual
equations using Eq. 13.2 and express the model as three sets of residuals
as shown in Fig. 13.8. The result is a group of three components that
we can represent as 𝑟(𝑢) = 0. This conversion and grouping hint at
a powerful concept that we will use later, which is hierarchy, where
components can be grouped using multiple levels.

Fig. 13.8 The conversion of inputs and outputs can be represented as explicit components with corresponding state variables. Using this form, any model can be entirely expressed as 𝑟(𝑢) = 0. The inputs could be any subset of 𝑢 except for those handled in the component (𝑢𝑖−1, 𝑢𝑖, and 𝑢𝑖+1).
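As a concrete (and entirely hypothetical) illustration of the structure in Fig. 13.7, the sketch below implements one component with an input conversion 𝑃𝑖, an implicit solve for its states, and an output conversion 𝑄𝑖. The particular equations and the use of SciPy's fsolve are assumptions made only for this example.

```python
# Minimal sketch of one component: P_i converts inputs, a solver drives
# r_i(u_i; p_i) to zero, and Q_i converts the states to outputs u_hat_i.
import numpy as np
from scipy.optimize import fsolve

class Component:
    def convert_inputs(self, u_hat_others):          # P_i: coupling inputs -> parameters p_i
        return float(np.atleast_1d(u_hat_others).sum())

    def residual(self, u_i, p_i):                     # r_i(u_i; p_i), an assumed implicit equation
        return u_i**3 + u_i - p_i

    def compute_outputs(self, p_i, u_i):              # Q_i: states -> outputs u_hat_i
        return np.array([u_i.sum()])

    def solve(self, u_hat_others):                    # u_hat_i = U_i(u_hat_{j != i})
        p_i = self.convert_inputs(u_hat_others)
        u_i = fsolve(self.residual, x0=0.0, args=(p_i,))   # drive r_i to zero for fixed p_i
        return self.compute_outputs(p_i, u_i)

comp = Component()
print(comp.solve(np.array([1.0, 2.0])))
```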

Example 13.3 Conversion of inputs and outputs in aerostructural problems

Consider the structural model from Ex. 13.2. We wrote 𝑞(Γ) to represent
the dependency of the external forces on the aerodynamic model circulations
to keep the notation simple, but in reality, there should be a separate explicit
component that converts Γ into 𝑞. The circulation translates to a lift force at each
spanwise position, which in turn needs to be distributed consistently to the
nodes of each beam element. Also, the displacements given by the structural
model (translations and rotations of each node) must be converted into a twist
distribution on the wing, which affects the right-hand side of the aerodynamic
model, 𝜃(𝑑). Both of these conversions are explicit functions.

13.2.3 Residual and Functional Forms


The system-level representation of a coupled system is determined by
the variables that are “seen” and controlled at this level.
Representing all models and variable conversions as 𝑟(𝑢) = 0 leads
to the residual form of the coupled system, already written in Eq. 13.1,
where 𝑛 is the number of components. In this case, the system level
has direct access and control over all the variables. This residual form
is desirable because, as we will see later in this chapter, it enables us to
formulate efficient ways to solve coupled systems and compute their
derivatives.
The functional form is an alternate system-level representation of
the coupled system that considers only the coupling variables and
expresses them as implicit functions of the others. We can write this
form as




$$
\hat{u} = U(\hat{u}) \;\Leftrightarrow\;
\begin{cases}
\hat{u}_1 = U_1(\hat{u}_2, \ldots, \hat{u}_m) \\
\quad\vdots \\
\hat{u}_i = U_i(\hat{u}_1, \ldots, \hat{u}_{i-1}, \hat{u}_{i+1}, \ldots, \hat{u}_m) \\
\quad\vdots \\
\hat{u}_m = U_m(\hat{u}_1, \ldots, \hat{u}_{m-1})
\end{cases}
\tag{13.4}
$$

where 𝑚 is the number of models and 𝑚 ≤ 𝑛. If a model 𝑈 𝑖 is a black
box and we have no access to the residuals and the conversion functions,
this is the only form we can use. In this case, the system-level solver
only iterates the coupling variables 𝑢ˆ and relies on each model 𝑖 to solve
or compute its outputs 𝑢ˆ 𝑖 .
These two forms are shown in Fig. 13.9 for a generic example with
three models (or disciplines). The left of this figure shows the residual
form, where each model is represented as residuals and states, as
in Fig. 13.8. This leads to a system with nine sets of residuals and
corresponding state variables. The number of state variables in each of
these sets is not specified but could be any number.
The functional form of these three models is shown on the right of
Fig. 13.9. In the case where the model is a black box, the residuals and
conversion functions shown in Fig. 13.7 are hidden, and the system
level can only access the coupling variables. In this case, each black box is considered to be a component, as shown on the right of Fig. 13.9.
In an even more general case, these two views can be mixed in a
coupled system. The models in residual form expose residuals and
states, in which case, the model potentially has multiple components
at the system level. The models in functional form only expose inputs
and outputs; in that case, the model is just a single component.
Fig. 13.9 Two system-level views of a coupled system with three solvers. In the residual form, all components and their states are exposed (left); in the functional (black-box) form, only inputs and outputs for each solver are visible (right), where 𝑢ˆ1 ≡ 𝑢3, 𝑢ˆ2 ≡ 𝑢6, and 𝑢ˆ3 ≡ 𝑢9.

13.2.4 Coupled System Structure

To show how multidisciplinary systems are coupled, we use a design structure matrix (DSM), which is sometimes referred to as a dependency structure matrix or an 𝑁² matrix. An example of the DSM for a hypothetical system is shown on the left in Fig. 13.10. In this matrix, the diagonal elements represent the components, and the off-diagonal entries denote coupling variables. A given coupling variable is computed by the component in its row and is passed to the component in its column.†

† In some of the DSM literature, this definition is reversed, where “row” and “column” are interchanged, resulting in a transposed matrix.

As shown in the DSM on the left side of Fig. 13.10, there are generally off-diagonal entries both above and below the diagonal, where the entries above feed forward, whereas entries below feed backward.

Fig. 13.10 Different ways to represent the dependencies of a hypothetical coupled system: a design structure matrix (left) and a directed graph (right).

The mathematical representation of these dependencies is given by
a graph (Fig. 13.10, right), where the graph nodes are the components,
and the edges represent the information dependency. This graph is
a directed graph because, in general, there are three possibilities for
coupling two components: single coupling one way, single coupling
the other way, and two-way coupling. A directed graph is cyclic when
there are edges that form a closed loop (i.e., a cycle). The graph on
the right of Fig. 13.10 has a single cycle between components B and C.
When there are no closed loops, the graph is acyclic. In this case, the
whole system can be solved by solving each component in turn without
iterating.
The DSM can be viewed as a matrix where the blank entries are
zeros. For real-world systems, this is often a sparse matrix. This means
that in the corresponding DSM, each component depends only on a
subset of all the other components. We can take advantage of the
structure of this sparsity in the solution of coupled systems.
The components in the DSM can be reordered without changing
the solution of the system. This is analogous to reordering sparse
matrices to make linear systems easier to solve. In one extreme case,
reordering could achieve a DSM with no entries below the diagonal. In
that case, we would have only feedforward connections, which means
all dependencies could be resolved in one forward pass (as we will
see in Ex. 13.4). This is analogous to having a linear system where
the matrix is lower triangular, in which case the linear solution can be obtained with forward substitution.

The sparsity of the DSM can be exploited using ideas from sparse linear algebra. For example, reducing the bandwidth of the matrix (i.e., moving nonzero elements closer to the diagonal) can also be helpful. This can be achieved using algorithms such as Cuthill–McKee,203 reverse Cuthill–McKee (RCM), and approximate minimum degree (AMD) ordering.204‡

203. Cuthill and McKee, Reducing the bandwidth of sparse symmetric matrices, 1969.
204. Amestoy et al., An approximate minimum degree ordering algorithm, 1996.
‡ Although these methods were designed for symmetric matrices, they are still useful for non-symmetric ones. Several numerical libraries include these methods.

We now introduce an extended version of the DSM, called XDSM,205 which we use later in this chapter to show the process in addition to the data dependencies. Figure 13.11 shows the XDSM for the same four-component system. When showing only the data dependencies, the only difference relative to the DSM is that the coupling variables are labeled explicitly, and the data paths are drawn. In the next section, we add the process to the XDSM.

205. Lambe and Martins, Extensions to the design structure matrix for the description of multidisciplinary design, analysis, and optimization processes, 2012.

Fig. 13.11 XDSM showing data dependencies for the four-component coupled system of Fig. 13.10.

13.2.5 Solving Coupled Numerical Models

The solution of coupled systems, also known as multidisciplinary analysis (MDA), requires concepts beyond the solvers reviewed in Section 3.6
because it usually involves multiple levels of solvers.


When using the residual form described in Section 13.2.3, any solver
(such as a Newton solver) can be used to solve for the state of all
components (the entire vector 𝑢) simultaneously to satisfy 𝑟(𝑢) = 0 for
the coupled system (Eq. 13.1). This is a monolithic solution approach.
When using the functional form, we do not have access to the
internal states of each model and must rely on the model’s solvers to
compute the coupling variables. The model solver is responsible for
computing its output variables for a given set of coupling variables
from other models, that is,

𝑢ˆ 𝑖 = 𝑈 𝑖 (𝑢ˆ 𝑗≠𝑖 ) . (13.5)

In some cases, we have access to the model’s internal states, but we may
want to use a dedicated solver for that model anyway.
Because each model, in general, depends on the outputs of all other
models, we have a coupled dependency that requires a solver to resolve.
This means that the functional form requires two levels: one for the
model solvers and another for the system-level solver. At the system
level, we only deal with the coupling variables (𝑢ˆ), and the internal
states (𝑢) are hidden.
The rest of this section presents several system-level solvers. We
will refer to each model as a component even though it is a group of
components in general.

Tip 13.2 Avoid coupling components with file input and output

The coupling variables are often passed between components through files.
This is undesirable because of a potential loss in precision (see Tip 13.1) and
because it can substantially slow down the coupled solution.
Instead of using files, pass the coupling variable data through memory
whenever possible. You can do this between codes written in different languages
by wrapping each code using a common language. When using files is
unavoidable, be aware of these issues and mitigate them as much as possible.

Nonlinear Block Jacobi

The most straightforward way to solve coupled numerical models (systems of components) is through a fixed-point iteration, which is
analogous to the fixed-point iteration methods mentioned in Section 3.6
and detailed in Appendix B.4.1. The difference here is that instead of
updating one state at a time, we update a vector of coupling variables
at each iteration corresponding to a subset of the coupling variables in
the overall coupled system. Obtaining this vector of coupling variables generally involves the solution of a nonlinear system. Therefore, these
are called nonlinear block variants of the linear fixed-point iteration
methods.
The nonlinear block Jacobi method requires an initial guess for
all coupling variables to start with and calls for the solution of all
components given those guesses. Once all components have been
solved, the coupling variables are updated based on the new values
computed by the components, and all components are solved again.
This iterative process continues until the coupling variables do not
change in subsequent iterations. Because each component takes the
coupling variable values from the previous iteration, which have already
been computed, all components can be solved in parallel without
communication. This algorithm is formalized in Alg. 13.1. When
applied to a system of components, we call it the block Jacobi method,
where block refers to each component.
The nonlinear block Jacobi method is also illustrated using an XDSM
in Fig. 13.12 for three components. The only input is the initial guess
for the coupling variables, 𝑢ˆ(0).§ The MDA block (step 0) is responsible for iterating the system-level analysis loop and for checking if the system has converged. The process line is shown as a thin black line to distinguish it from the data dependency connections (thick gray lines) and follows the sequence of numbered steps. The analyses for each component are all numbered the same (step 1) because they can be done in parallel. Each component returns the coupling variables it computes to the MDA iterator, closing the loop between step 2 and step 1 (denoted as “2 → 1”).

§ In this chapter, we use a superscript for the iteration number instead of a subscript to avoid a clash with the component index.

Fig. 13.12 Nonlinear block Jacobi solver for a three-component coupled system.
Algorithm 13.1 Nonlinear block Jacobi algorithm

Inputs:
𝑢ˆ(0) = [𝑢ˆ1(0), . . . , 𝑢ˆ𝑚(0)]: Initial guess for coupling variables
Outputs:
𝑢ˆ = [𝑢ˆ1, . . . , 𝑢ˆ𝑚]: System-level states

k = 0
while ‖𝑢ˆ(k) − 𝑢ˆ(k−1)‖2 > ε or k = 0 do    ⊳ do not check convergence for the first iteration
    for all i ∈ {1, . . . , m} do    ⊳ can be done in parallel
        𝑢ˆi(k+1) ← solve ri(𝑢ˆi(k+1); 𝑢ˆj(k) for all j ≠ i) = 0    ⊳ solve for component i's states using the other components' states from the previous iteration
    end for
    k = k + 1
end while
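To make the iteration concrete, here is a minimal Python sketch of Alg. 13.1 on a made-up two-component system. The component equations are hypothetical and chosen so that each block solve has a closed form; a real component would call its own solver instead.

```python
# Nonlinear block Jacobi (Alg. 13.1) on a toy two-component system.
import numpy as np

def solve_component_1(u2):          # solves r1(u1; u2) = 3*u1 - sin(u2) - 2 = 0 for u1
    return (np.sin(u2) + 2.0) / 3.0

def solve_component_2(u1):          # solves r2(u2; u1) = 2*u2 - cos(u1) = 0 for u2
    return np.cos(u1) / 2.0

u = np.zeros(2)                     # initial guess for the coupling variables
for k in range(100):
    # Both components use values from iteration k, so they could run in parallel
    u_new = np.array([solve_component_1(u[1]), solve_component_2(u[0])])
    if np.linalg.norm(u_new - u) < 1e-10:
        break
    u = u_new
print(f"converged in {k} iterations to u = {u_new}")
```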

The block Jacobi solver (Alg. 13.1) can also be used when one or
more components are linear solvers. This is useful for computing the
derivatives of the coupled system using implicit analytic methods
because that involves solving a coupled linear system with the same
structure as the coupled model (see Section 13.3.3).

Nonlinear Block Gauss–Seidel

The nonlinear block Gauss–Seidel algorithm is similar to its Jacobi


counterpart. The only difference is that when solving each component,
we use the latest coupling variables available instead of just using the
coupling variables from the previous iteration. We cycle through each
component 𝑖 = 1, . . . , 𝑚 in order. When computing 𝑢ˆ 𝑖 by solving com-
ponent 𝑖, we use the latest available states from the other components.
Figure 13.13 illustrates this process.
Both Gauss–Seidel and Jacobi converge linearly, but Gauss–Seidel
tends to converge more quickly because each equation uses the latest
information available. However, unlike Jacobi, the components can no
longer be solved in parallel.
The convergence of nonlinear block Gauss–Seidel can be improved
by using relaxation. Suppose that 𝑢ˆtemp is the state of component 𝑖 resulting from solving that component given the states of all other components, as we would normally do for each block in the Gauss–Seidel or Jacobi method. If we used this directly, the step would be

$$\Delta\hat{u}_i^{(k)} = \hat{u}_\text{temp} - \hat{u}_i^{(k)} \,. \tag{13.6}$$
Fig. 13.13 Nonlinear block Gauss–Seidel solver for the three-discipline coupled system of Fig. 13.9.

Instead of using that step directly, relaxation updates the variables as

$$\hat{u}_i^{(k+1)} = \hat{u}_i^{(k)} + \theta^{(k)}\, \Delta\hat{u}_i^{(k)} \,, \tag{13.7}$$

where 𝜃(𝑘) is the relaxation factor, and Δ𝑢ˆ𝑖(𝑘) is the previous update
for component 𝑖. The relaxation factor, 𝜃, could be a fixed value,
which would normally be less than 1 to dampen oscillations and avoid
divergence.
Aitken's method206 improves on the fixed relaxation approach by adapting 𝜃. The relaxation factor at each iteration changes based on the last two updates according to

$$\theta^{(k)} = \theta^{(k-1)} \left(1 - \frac{\left(\Delta\hat{u}^{(k)} - \Delta\hat{u}^{(k-1)}\right)^{\top} \Delta\hat{u}^{(k)}}{\left\lVert \Delta\hat{u}^{(k)} - \Delta\hat{u}^{(k-1)} \right\rVert_2^{2}}\right). \tag{13.8}$$

206. Irons and Tuck, A version of the Aitken accelerator for computer iteration, 1969.
Aitken's method usually accelerates convergence and has been shown to work well for nonlinear block Gauss–Seidel with multidisciplinary systems.207 It is advisable to override the value of the relaxation factor given by Eq. 13.8 to keep it between 0.25 and 2.208

207. Kenway et al., Scalable parallel approach for high-fidelity steady-state aeroelastic analysis and derivative computations, 2014.
208. Chauhan et al., An automated selection algorithm for nonlinear solvers in MDO, 2018.

The steps for the full Gauss–Seidel algorithm with Aitken acceleration are listed in Alg. 13.2. Similar to the block Jacobi solver, the block Gauss–Seidel solver can also be used when one or more components are linear solvers. Aitken acceleration can be used in the linear case without modification, and it is still useful.
The order in which the components are solved makes a significant
difference in the efficiency of the Gauss–Seidel method. In the best
possible scenario, the components can be reordered such that there are
no entries in the lower diagonal of the DSM, which means that each
component depends only on previously solved components, and there
are therefore no feedback dependencies (see Ex. 13.4). In this case,
the block Gauss–Seidel method would converge to the solution in one
forward sweep.
In the more general case, even though we might not eliminate
the lower diagonal entries completely, minimizing these entries by
reordering results in better convergence. This reordering can also mean
the difference between convergence and nonconvergence.

Algorithm 13.2 Nonlinear block Gauss–Seidel algorithm with Aitken acceleration

Inputs:
𝑢ˆ(0) = [𝑢ˆ1(0), . . . , 𝑢ˆ𝑚(0)]: Initial guess for coupling variables
𝜃(0): Initial relaxation factor for Aitken acceleration
Outputs:
𝑢ˆ = [𝑢ˆ1, . . . , 𝑢ˆ𝑚]: System-level states

k = 0
while ‖𝑢ˆ(k) − 𝑢ˆ(k−1)‖2 > ε or k = 0 do    ⊳ do not check convergence for the first iteration
    for i = 1, m do
        𝑢ˆtemp ← solve ri(𝑢ˆi(k+1); 𝑢ˆ1(k+1), . . . , 𝑢ˆi−1(k+1), 𝑢ˆi+1(k), . . . , 𝑢ˆm(k)) = 0    ⊳ solve for component i's states using the latest states from the other components
        Δ𝑢ˆi(k) = 𝑢ˆtemp − 𝑢ˆi(k)    ⊳ compute step
        if k > 0 then
            𝜃(k) = 𝜃(k−1) (1 − (Δ𝑢ˆ(k) − Δ𝑢ˆ(k−1))⊤ Δ𝑢ˆ(k) / ‖Δ𝑢ˆ(k) − Δ𝑢ˆ(k−1)‖2²)    ⊳ update the relaxation factor
        end if
        𝑢ˆi(k+1) = 𝑢ˆi(k) + 𝜃(k) Δ𝑢ˆi(k)    ⊳ update component i's states
    end for
    k = k + 1
end while
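The sketch below applies Alg. 13.2 to the same made-up two-component system used in the block Jacobi sketch. For simplicity, the Aitken update of the relaxation factor is applied once per sweep rather than inside the component loop, and the factor is kept within the advised [0.25, 2] range; the component equations are hypothetical.

```python
# Nonlinear block Gauss-Seidel with Aitken acceleration on a toy two-component system.
import numpy as np

def solve_component_1(u2):          # solves r1(u1; u2) = 3*u1 - sin(u2) - 2 = 0
    return (np.sin(u2) + 2.0) / 3.0

def solve_component_2(u1):          # solves r2(u2; u1) = 2*u2 - cos(u1) = 0
    return np.cos(u1) / 2.0

solvers = [solve_component_1, solve_component_2]
u = np.zeros(2)
theta = 1.0                          # initial relaxation factor
du_prev = None
for k in range(100):
    u_old = u.copy()
    du = np.zeros(2)
    for i, solve_i in enumerate(solvers):
        u_temp = solve_i(u[1 - i])   # Gauss-Seidel: uses the latest available states
        du[i] = u_temp - u[i]
        u[i] = u[i] + theta * du[i]  # relaxed update of component i
    if du_prev is not None:          # Aitken update, done once per sweep in this sketch
        diff = du - du_prev
        denom = diff @ diff
        if denom > 1e-30:
            theta = min(max(theta * (1.0 - diff @ du / denom), 0.25), 2.0)
    du_prev = du.copy()
    if np.linalg.norm(u - u_old) < 1e-10:
        break
print(f"converged in {k} iterations to u = {u}")
```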

Example 13.4 Making Gauss–Seidel converge in one pass by reordering com-


ponents
Consider the coupled system of six components with the dependencies
shown on the left in Fig. 13.14. This system includes both feedforward and
feedback dependencies and would normally require an iterative solver. In
this case, however, we can reorder the components as shown on the right in
Fig. 13.14 to eliminate the feedback loops. Then, we only need to solve the
sequence of components E → C → A → D → F → B once to get a converged
coupled solution.
Fig. 13.14 The solution of the components of the system shown on the left can be reordered to get the equivalent system shown on the right. This new system has no feedback loops and can therefore be solved in one pass of a Gauss–Seidel solver.
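A reordering like the one in this example can be found automatically when the dependency graph is acyclic. The following sketch uses a simple topological sort (Kahn's algorithm); the dependency edges listed are hypothetical and only meant to reproduce the flavor of Ex. 13.4.

```python
# Find a feedback-free component ordering by topological sort (Kahn's algorithm).
# An edge (a, b) means component b needs an output of component a (hypothetical edges).
from collections import defaultdict, deque

edges = [("E", "C"), ("C", "A"), ("A", "D"), ("E", "D"), ("D", "F"), ("F", "B"), ("C", "B")]
components = ["A", "B", "C", "D", "E", "F"]

succ = defaultdict(list)
indegree = {c: 0 for c in components}
for a, b in edges:
    succ[a].append(b)
    indegree[b] += 1

order, ready = [], deque(c for c in components if indegree[c] == 0)
while ready:
    c = ready.popleft()
    order.append(c)
    for b in succ[c]:
        indegree[b] -= 1
        if indegree[b] == 0:
            ready.append(b)

if len(order) == len(components):
    print("feedback-free order:", order)   # one Gauss-Seidel sweep suffices
else:
    print("cycles remain; an iterative coupled solve is required")
```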

Newton’s Method

As mentioned previously, Newton's method can be applied to the residual form illustrated in Fig. 13.9 and expressed in Eq. 13.1. Recall
that in this form, we have 𝑛 components and the coupling variables are
part of the state variables. In this case, Newton’s method is as described
in Section 3.8.
Concatenating the residuals and state variables for all components
and applying Newton’s method yields the coupled block Newton
system,
$$
\begin{bmatrix}
\dfrac{\partial r_1}{\partial u_1} & \dfrac{\partial r_1}{\partial u_2} & \cdots & \dfrac{\partial r_1}{\partial u_n} \\[1ex]
\dfrac{\partial r_2}{\partial u_1} & \dfrac{\partial r_2}{\partial u_2} & \cdots & \dfrac{\partial r_2}{\partial u_n} \\[1ex]
\vdots & \vdots & \ddots & \vdots \\[1ex]
\dfrac{\partial r_n}{\partial u_1} & \dfrac{\partial r_n}{\partial u_2} & \cdots & \dfrac{\partial r_n}{\partial u_n}
\end{bmatrix}
\begin{bmatrix} \Delta u_1 \\ \Delta u_2 \\ \vdots \\ \Delta u_n \end{bmatrix}
= -
\begin{bmatrix} r_1 \\ r_2 \\ \vdots \\ r_n \end{bmatrix},
\tag{13.9}
$$

where the matrix is the full Jacobian 𝜕𝑟/𝜕𝑢, the unknowns form the state update Δ𝑢, and the right-hand side is the residual vector 𝑟.
We can solve this linear system to compute the Newton step for all
components’ state variables 𝑢 simultaneously, and then iterate to satisfy
𝑟(𝑢) = 0 for the complete system. This is the monolithic Newton approach
illustrated on the left panel of Fig. 13.15. As with any Newton method,
a globalization strategy (such as a line search) is required to increase
the likelihood of successful convergence when starting far from the
solution (see Section 4.2). Even with such a strategy, Newton’s method
does not necessarily converge robustly.
Fig. 13.15 There are three options for solving a coupled system with Newton's method. The monolithic approach (left) solves for all state variables simultaneously. The block approach (middle) solves the same system as the monolithic approach but solves each component for its states at each iteration. The black-box approach (right) applies Newton's method to the coupling variables.

A variation on this monolithic Newton approach uses a two-level solver hierarchy, as illustrated on the middle panel of Fig. 13.15. The system-level solver is the same as in the monolithic approach, but each component is solved first using the latest states. The Newton step for each component 𝑖 is given by

$$\frac{\partial r_i}{\partial u_i}\, \Delta u_i = -\,r_i\!\left(u_i;\, u_{j\neq i}\right), \tag{13.10}$$
where 𝑢 𝑗 represents the states from other components (i.e., 𝑗 ≠ 𝑖),
which are fixed at this level. Each component is solved before taking
a step in the entire state vector (Eq. 13.9). The procedure is given
in Alg. 13.3 and illustrated in Fig. 13.16. We call this the full-space
hierarchical Newton approach because the system-level solver iterates the
entire state vector. Solving each component before taking each step in
the full space Newton iteration acts as a preconditioner. In general, the
monolithic approach is more efficient, and the hierarchical approach is
more robust, but these characteristics are case-dependent.

Fig. 13.16 Full-space hierarchical Newton solver for a three-component coupled system.

Newton's method can also be applied to the functional form illustrated in Fig. 13.9 to solve only for the coupling variables. We call this
the reduced-space hierarchical Newton approach because the system-level
solver iterates only in the space of the coupling variables, which is
smaller than the full space of the state variables. Using this approach,
each component’s solver can be a black box, as in the nonlinear block
Jacobi and Gauss–Seidel solvers. This approach is illustrated on the
right panel of Fig. 13.15. The reduced-space approach is mathemati-
cally equivalent and follows the same iteration path as the full-space
approach if each component solver in the reduced-space approach is
converged well enough.132

132. Gray et al., OpenMDAO: An open-source framework for multidisciplinary design, analysis, and optimization, 2019.
Algorithm 13.3 Full-space hierarchical Newton

Inputs:
𝑢(0) = [𝑢1(0), . . . , 𝑢𝑛(0)]: Initial guess for the state variables
Outputs:
𝑢 = [𝑢1, . . . , 𝑢𝑛]: System-level states

k = 1    ⊳ iteration counter for the full-space iteration
while ‖𝑟‖2 > ε do    ⊳ check residual norm for all components
    for all i ∈ {1, . . . , n} do    ⊳ can be done in parallel; k is constant in this loop
        while ‖𝑟i‖2 > ε do    ⊳ check residual norm for component i
            Compute 𝑟i(𝑢i(k); 𝑢j≠i(k−1))    ⊳ states for other components are fixed
            Compute 𝜕𝑟i/𝜕𝑢i    ⊳ Jacobian block for component i at the current state
            Solve (𝜕𝑟i/𝜕𝑢i) Δ𝑢i = −𝑟i    ⊳ Newton step for the ith component
            𝑢i(k) = 𝑢i(k) + Δ𝑢i    ⊳ update the state variables for component i
        end while
    end for
    Compute 𝑟(𝑢(k))    ⊳ full residual vector for the current states
    Compute 𝜕𝑟/𝜕𝑢    ⊳ full Jacobian for the current states
    Solve (𝜕𝑟/𝜕𝑢) Δ𝑢 = −𝑟    ⊳ coupled Newton system (Eq. 13.9)
    𝑢(k+1) = 𝑢(k) + Δ𝑢    ⊳ update the full state variable vector
    k = k + 1
end while

To apply the reduced-space Newton's method, we express the functional form (Eq. 13.4) as residuals by using the same technique we used to convert an explicit function to the residual form (Eq. 13.2). This yields

$$\hat{r}_i(\hat{u}) = \hat{u}_i - U_i(\hat{u}_{j\neq i}) \,, \tag{13.11}$$
where 𝑢ˆ𝑖 represents the guesses for the coupling variables, and 𝑈𝑖 represents the actual computed values. For a system of nonlinear residual equations, the Newton step in the coupling variables, Δ𝑢ˆ = 𝑢ˆ(𝑘+1) − 𝑢ˆ(𝑘), can be found by solving the linear system

$$\left.\frac{\partial \hat{r}}{\partial \hat{u}}\right|_{\hat{u}=\hat{u}^{(k)}} \Delta\hat{u} = -\hat{r}\!\left(\hat{u}^{(k)}\right), \tag{13.12}$$

where we need the partial derivatives of all the residuals with respect to
the coupling variables to form the Jacobian matrix 𝜕𝑟ˆ/𝜕𝑢ˆ. The Jacobian
can be found by differentiating Eq. 13.11 with respect to the coupling
variables. Then, expanding the concatenated residuals and coupling
variable vectors yields

$$
\begin{bmatrix}
I & -\dfrac{\partial U_1}{\partial \hat{u}_2} & \cdots & -\dfrac{\partial U_1}{\partial \hat{u}_m} \\[1ex]
-\dfrac{\partial U_2}{\partial \hat{u}_1} & I & \cdots & -\dfrac{\partial U_2}{\partial \hat{u}_m} \\[1ex]
\vdots & \vdots & \ddots & \vdots \\[1ex]
-\dfrac{\partial U_m}{\partial \hat{u}_1} & -\dfrac{\partial U_m}{\partial \hat{u}_2} & \cdots & I
\end{bmatrix}
\begin{bmatrix} \Delta\hat{u}_1 \\ \Delta\hat{u}_2 \\ \vdots \\ \Delta\hat{u}_m \end{bmatrix}
= -
\begin{bmatrix}
\hat{u}_1 - U_1(\hat{u}_2, \ldots, \hat{u}_m) \\
\hat{u}_2 - U_2(\hat{u}_1, \hat{u}_3, \ldots, \hat{u}_m) \\
\vdots \\
\hat{u}_m - U_m(\hat{u}_1, \ldots, \hat{u}_{m-1})
\end{bmatrix}.
\tag{13.13}
$$
The residuals in the right-hand side of this equation are evaluated at
the current iteration.
The derivatives in the block Jacobian matrix are also computed
at the current iteration. Each row 𝑖 represents the derivatives of the
(potentially implicit) function that computes the outputs of component
𝑖 with respect to all the inputs of that component. The Jacobian matrix
in Eq. 13.13 has the same structure as the DSM (but transposed) and is
often sparse. These derivatives can be computed using the methods
from Chapter 6. These are partial derivatives in the sense that they do
not take into account the coupled system. However, they must take
into account the respective model and can be computed using implicit
analytic methods when the model is implicit.
This Newton solver is shown in Fig. 13.17 and detailed in Alg. 13.4.
Each component corresponds to a set of rows in the block Newton
system (Eq. 13.13). To compute each set of rows, the corresponding
component must be solved, and the derivatives of its outputs with
respect to its inputs must be computed as well. Each set can be computed
in parallel, but once the system is assembled, a step in the coupling
variables is computed by solving the full system (Eq. 13.13).
These coupled Newton methods have similar advantages and dis-
advantages to the plain Newton method. The main advantage is that it
converges quadratically once it is close enough to the solution (if the problem is well-conditioned). The main disadvantage is that it might not converge at all, depending on the initial guess. One disadvantage specific to the coupled Newton methods is that they require formulating and solving the coupled linear system (Eq. 13.13) at each iteration.

Fig. 13.17 Reduced-space hierarchical Newton solver for a three-component coupled system.

Algorithm 13.4 Reduced-space hierarchical Newton

Inputs:
𝑢ˆ(0) = [𝑢ˆ1(0), . . . , 𝑢ˆ𝑚(0)]: Initial guess for coupling variables
Outputs:
𝑢ˆ = [𝑢ˆ1, . . . , 𝑢ˆ𝑚]: System-level states

k = 0
while ‖𝑟ˆ‖2 > ε do    ⊳ check residual norm for all components
    for all i ∈ {1, . . . , m} do    ⊳ can be done in parallel
        𝑈i ← compute 𝑈i(𝑢ˆj≠i(k))    ⊳ solve component i and compute its outputs
    end for
    𝑟ˆ = 𝑢ˆ(k) − 𝑈    ⊳ compute all coupling variable residuals
    Compute 𝜕𝑈/𝜕𝑢ˆ    ⊳ Jacobian of the coupling variables at the current state
    Solve (𝜕𝑟ˆ/𝜕𝑢ˆ) Δ𝑢ˆ = −𝑟ˆ    ⊳ coupled Newton system (Eq. 13.13)
    𝑢ˆ(k+1) = 𝑢ˆ(k) + Δ𝑢ˆ    ⊳ update all coupling variables
    k = k + 1
end while
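As an illustration, the sketch below applies a reduced-space Newton iteration in the spirit of Alg. 13.4 to the same made-up two-component system used earlier; the partial derivatives 𝜕𝑈/𝜕𝑢ˆ are written out by hand because the toy outputs have closed forms.

```python
# Reduced-space Newton on the toy two-component coupling.
import numpy as np

def U1(u2):                          # output of component 1 given input u2
    return (np.sin(u2) + 2.0) / 3.0

def U2(u1):                          # output of component 2 given input u1
    return np.cos(u1) / 2.0

u = np.zeros(2)                      # u_hat = [u_hat_1, u_hat_2]
for k in range(20):
    r = u - np.array([U1(u[1]), U2(u[0])])          # r_hat = u_hat - U(u_hat)
    if np.linalg.norm(r) < 1e-12:
        break
    # Jacobian of r_hat: identity minus the partials of U (structure of Eq. 13.13)
    J = np.array([[1.0, -np.cos(u[1]) / 3.0],
                  [np.sin(u[0]) / 2.0, 1.0]])
    u = u + np.linalg.solve(J, -r)                   # Newton update of the coupling variables
print(f"converged in {k} iterations to u = {u}")
```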

If the Jacobian 𝜕𝑟/𝜕𝑢 is not readily available, Broyden's method can approximate the Jacobian inverse (𝐽̃−1) by starting with a guess (say, 𝐽̃0−1 = 𝐼) and then using the update

$$\tilde{J}^{-1\,(k+1)} = \tilde{J}^{-1\,(k)} + \frac{\left(\Delta u^{(k)} - \tilde{J}^{-1\,(k)}\, \Delta r^{(k)}\right) \Delta u^{(k)\top}}{\Delta r^{(k)\top}\, \Delta r^{(k)}}\,, \tag{13.14}$$

where Δ𝑢 (𝑘) is the last step in the states and Δ𝑟 (𝑘) is the difference
between the two latest residual vectors. Because the inverse is provided
explicitly, we can find the update by performing the multiplication,

Δ𝑢 (𝑘) = − 𝐽˜−1 𝑟 (𝑘) . (13.15)

Broyden's method is analogous to the quasi-Newton methods of Section 4.4.4 and is derived in Appendix C.1.
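The sketch below runs a Broyden-type iteration on the same toy coupling residuals. It uses the common variant that updates the inverse Jacobian directly from Δ𝑢 and Δ𝑟 (sometimes called Broyden's second method), starting from the identity; it is an illustrative stand-in rather than a transcription of the update printed above.

```python
# Broyden-type iteration (inverse-Jacobian update) on the toy coupling residuals.
import numpy as np

def residual(u):                           # r_hat(u_hat) for the toy two-component system
    return u - np.array([(np.sin(u[1]) + 2.0) / 3.0, np.cos(u[0]) / 2.0])

u = np.zeros(2)
J_inv = np.eye(2)                          # initial guess for the inverse Jacobian
r = residual(u)
for k in range(100):
    du = -J_inv @ r                        # step from the approximate inverse (Eq. 13.15)
    u_new = u + du
    r_new = residual(u_new)
    if np.linalg.norm(r_new) < 1e-10:
        break
    dr = r_new - r
    J_inv = J_inv + np.outer(du - J_inv @ dr, dr) / (dr @ dr)   # inverse-Jacobian update
    u, r = u_new, r_new
print(f"converged in {k} iterations to u = {u_new}")
```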

Example 13.5 Aerostructural solver comparison

We now apply the coupled solution methods presented in this section to the implicit parts of the aerostructural model, which are the first two residuals from Ex. 13.2,

$$r = \begin{bmatrix} r_1 \\ r_2 \end{bmatrix} = \begin{bmatrix} A(d)\,\Gamma - v(d) \\ K d - q(\Gamma) \end{bmatrix},$$

and the variables are the circulations and displacements,

$$u = \begin{bmatrix} u_1 \\ u_2 \end{bmatrix} = \begin{bmatrix} \Gamma \\ d \end{bmatrix}.$$

In this case, the linear systems defined by 𝑟1 and 𝑟2 are small enough to be solved using a direct method, such as LU factorization. Thus, we can solve 𝑟1 for Γ, for a given 𝑑, and solve 𝑟2 for 𝑑, for a given Γ. Also, no conversions are involved, so the set of coupling variables is equivalent to the set of state variables (𝑢ˆ = 𝑢).

Using the nonlinear block Jacobi method (Alg. 13.1), we start with an initial guess (e.g., Γ = 0, 𝑑 = 0) and solve 𝑟1 = 0 and 𝑟2 = 0 separately for the new values of Γ and 𝑑, respectively. Then we use these new values of Γ and 𝑑 to solve 𝑟1 = 0 and 𝑟2 = 0 again, and so on until convergence.

Nonlinear block Gauss–Seidel (Alg. 13.2) is similar, but we need to solve the two components in sequence. We can start by solving 𝑟1 = 0 for Γ with 𝑑 = 0. Then we use the Γ obtained from this solution in 𝑟2 and solve for a new 𝑑. We now have a new 𝑑 to use in 𝑟1 to solve for a new Γ, and so on.

The Jacobian for the Newton system (Eq. 13.9) is

$$\frac{\partial r}{\partial u} = \begin{bmatrix} \dfrac{\partial r_1}{\partial u_1} & \dfrac{\partial r_1}{\partial u_2} \\[1ex] \dfrac{\partial r_2}{\partial u_1} & \dfrac{\partial r_2}{\partial u_2} \end{bmatrix} = \begin{bmatrix} A & \dfrac{\partial A}{\partial d}\,\Gamma - \dfrac{\partial v}{\partial d} \\[1ex] -\dfrac{\partial q}{\partial \Gamma} & K \end{bmatrix}.$$

Fig. 13.18 Spanwise distribution of the lift, wing rotation (𝑑𝜃), and vertical displacement (𝑑𝑧) for the coupled aerostructural solution.

We already have the block diagonal matrices in this Jacobian from the governing
equations, but we need to compute the off-diagonal partial derivative blocks,
which can be done analytically or with algorithmic differentiation (AD).
The solution is shown in Fig. 13.18, where we plot the variation of lift,
vertical displacement, and rotation along the span. The vertical displacements
are a subset of 𝑑, and the rotations are a conversion of a subset of 𝑑 representing
the rotations of the wing section at each spanwise location. The lift is the
vertical force at each spanwise location, which is proportional to Γ times the
wing chord at that location.
The monolithic Newton approach does not converge in this case. We
apply the full-space hierarchical approach (Alg. 13.3), which converges more
reliably. In this case, the reduced-space approach is not used because there is
no distinction between coupling variables and state variables.
In Fig. 13.19, we compare the convergence of the methods introduced in this section.¶ The Jacobi method has the poorest convergence rate and oscillates. The Gauss–Seidel method is much better, and it is even better with Aitken acceleration. Newton has the highest convergence rate, as expected. Broyden performs about as well as Gauss–Seidel in this case.

¶ These results and subsequent results based on the same example were obtained using OpenAeroStruct,202 which was developed using OpenMDAO. The description in these examples is simplified for didactic purposes; check the paper and code for more details.
202. Jasa et al., Open-source coupled aerostructural optimization using Python, 2018.

Fig. 13.19 Convergence of each solver for the aerostructural system: residual norm ‖𝑟‖ versus iterations for block Jacobi, block Gauss–Seidel, block Gauss–Seidel with Aitken, Broyden, and Newton.

13.2.6 Hierarchical Solvers for Coupled Systems

The coupled solvers we discussed so far already use a two-level hierarchy because they require a solver for each component and a second level that solves the group of components. This hierarchy can be extended to three and more levels by making groups of groups.

Modular analysis and unified derivatives (MAUD) is a mathematical framework developed for this purpose. Using MAUD, we can mix residual and functional forms and seamlessly handle implicit and explicit components.‖

‖ MAUD was developed by Hwang and Martins44 when they realized that the unified derivatives equation (UDE) provides the mathematical foundation for a framework of parallel hierarchical solvers through a small set of user-defined functions. MAUD can also compute the derivatives of coupled systems, as we will see in Section 13.3.3.
44. Hwang and Martins, A computational architecture for coupling heterogeneous numerical models and computing coupled derivatives, 2018.

The hierarchy of solvers can be represented as a tree data structure, where the nodes are the solvers and the leaves are the components, as
shown in Fig. 13.20 for a system of six components and five solvers.
The root node ultimately solves the complete system, and each solver is
responsible for a subsystem and thus handles a subset of the variables.

Fig. 13.20 A system of components can be organized in a solver hierarchy. (The hierarchy shown has recursive solvers at the upper levels, with monolithic solvers and components at the leaves.)

There are two possible types of solvers: monolithic and recursive.


Monolithic solvers can only have components as children and handle all
their variables simultaneously using the residual form. Of the methods
we introduced in the previous section, only monolithic and full-space
Newton (and Broyden) can do this for nonlinear systems. Linear
systems can be solved in a monolithic fashion using a direct solver or
an iterative linear solver, such as a Krylov subspace method. Recursive
solvers, as the name implies, visit all the child nodes in turn. If a child
node turns out to be another recursive solver, it does the same until a
component is reached. The block Jacobi and Gauss–Seidel methods
can be used as recursive solvers for nonlinear and linear systems. The
reduced-space Newton and Broyden methods can also be recursive
solvers. For the hypothetical system shown in Fig. 13.20, the numbers
show the order in which each solver and component would be called.
The hierarchy of solvers should be chosen to exploit the system
structure. MAUD also facilitates parallel computation when subsystems
are uncoupled, which provides further opportunities to exploit the
structure of the problem. Figs. 13.21 and 13.22 show several possibilities.
The three systems in Fig. 13.21 show three different coupling modes.
In the first mode, the two components are independent of each other
and can be solved in parallel using any solvers appropriate for each
of the components. In the serial case, component 2 depends on 1, but
not the other way around. Therefore, we can converge to the coupled
solution using one block Gauss–Seidel iteration. If the dependency
were reversed (feedback but no feedforward), the order of the two
components would be switched. Finally, the fully coupled case requires
an iterative solution using any of the methods from Section 13.2.5.


MAUD is designed to handle these three coupling modes.

Fig. 13.21 There are three main possibilities involving two components: parallel, serial, and coupled.

Figure 13.22 shows three possibilities for a four-component system where two levels of solvers can be used. In the first one (on the left),
we require a coupled solver for components 1 and 2 and another for
components 3 and 4, but no further solving is needed. In the second
(Fig. 13.22, middle), components 1 and 2 as well as components 3 and 4
can be solved serially, but these two groups require a coupled solution.
For the two levels to converge, the serial and coupled solutions are
called repeatedly until the two solvers agree with each other. The third
possibility (Fig. 13.22, right) has two systems that have two independent
components, which can each be solved in parallel, but the overall system
is coupled. With MAUD, we can set up any of these sequences of solvers
through the solver hierarchy tree, as illustrated in Fig. 13.20.

Fig. 13.22 Three examples of a system of four components with a two-level solver hierarchy (combinations of parallel, serial, and coupled groups).

To solve the system from Ex. 13.3 using hierarchical solvers, we can use the hierarchy shown in Fig. 13.23. We form three groups with three
components each. Each group includes the input and output conversion
components (which are explicit) and one implicit component (which
requires its own solver). Serial solvers can be used to handle the input
and output conversion components. A coupled solver is required to
solve the entire coupled system, but the coupling between the groups
is restricted to the corresponding outputs (components 3, 6, and 9).
Alternatively, we could apply a coupled solver to the functional
representation (Fig. 13.9, right). This would also use two levels of
solvers: a solver within each group and a system-level solver for the
coupling of the three groups. However, the system-level solver would handle coupling variables rather than the residuals of each component.

Fig. 13.23 For the case of Fig. 13.9, we can use a serial evaluation within each of the three groups and require a coupled solver to handle the coupling between the three groups.

Tip 13.3 Framework for implementing coupled system solvers

The development of coupled solvers is often done for a specific set of models
from scratch, which requires substantial effort. OpenMDAO is an open-source
framework that facilitates such efforts by implementing MAUD.132 All the solvers introduced in this chapter are available in OpenMDAO. This framework also makes it easier to compute the derivatives of the coupled system, as we will see in the next section. Users can assemble systems of mixed explicit and implicit components.

132. Gray et al., OpenMDAO: An open-source framework for multidisciplinary design, analysis, and optimization, 2019.
For implicit components, users must give OpenMDAO access to the residual computations and the corresponding state variables. For explicit components,
OpenMDAO only needs access to the inputs and the outputs, so it supports
black-box models.
OpenMDAO is usually more efficient when the user provides access to
the residuals and state variables instead of treating models as black boxes. A
hierarchy of multiple solvers can be set up in OpenMDAO, as illustrated in
Fig. 13.20. OpenMDAO also provides the necessary interfaces for user-defined
solvers. Finally, OpenMDAO encourages coupling through memory, which is
beneficial for numerical precision (see Tip 13.1) and computational efficiency
(see Tip 13.2).
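To illustrate what such a framework-based setup can look like, here is a minimal sketch of two coupled explicit components solved with a nonlinear block Gauss–Seidel solver. It assumes the OpenMDAO 3.x API (component classes, connections, and solver assignment as documented); the component equations are the same made-up toy coupling used in the earlier sketches, not a model from this chapter.

```python
# Minimal OpenMDAO sketch (API assumed from OpenMDAO 3.x) of a two-component MDA.
import numpy as np
import openmdao.api as om

class Comp1(om.ExplicitComponent):
    def setup(self):
        self.add_input("u2", val=0.0)
        self.add_output("u1", val=0.0)
        self.declare_partials("u1", "u2", method="fd")
    def compute(self, inputs, outputs):
        outputs["u1"] = (np.sin(inputs["u2"]) + 2.0) / 3.0   # toy equation

class Comp2(om.ExplicitComponent):
    def setup(self):
        self.add_input("u1", val=0.0)
        self.add_output("u2", val=0.0)
        self.declare_partials("u2", "u1", method="fd")
    def compute(self, inputs, outputs):
        outputs["u2"] = np.cos(inputs["u1"]) / 2.0           # toy equation

prob = om.Problem()
model = prob.model
model.add_subsystem("c1", Comp1())
model.add_subsystem("c2", Comp2())
model.connect("c1.u1", "c2.u1")          # feedforward coupling
model.connect("c2.u2", "c1.u2")          # feedback coupling
model.nonlinear_solver = om.NonlinearBlockGS()   # resolves the cycle
prob.setup()
prob.run_model()
print(prob.get_val("c1.u1"), prob.get_val("c2.u2"))
```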

13.3 Coupled Derivatives Computation

The gradient-based optimization algorithms from Chapters 4 and 5 require the derivatives of the objective and constraints with respect to
the design variables. Any of the methods for computing derivatives
from Chapter 6 can be used to compute the derivatives of coupled models, but some modifications are required. The main difference is
that in MDO, the computation of the functions of interest (objective
and constraints) requires the solution of the multidisciplinary model.

13.3.1 Finite Differences


The finite-difference method can be used with no modification, as long
as an MDA is converged well enough for each perturbation in the
design variables. As explained in Section 6.4, the cost of computing
derivatives with the finite-difference method is proportional to the
number of variables. The constant of proportionality can increase
significantly compared with that of a single discipline because the
MDA convergence might be slow (especially if using a block Jacobi or
Gauss–Seidel iteration).
The accuracy of finite-difference derivatives depends directly on the
accuracy of the functions of interest. When the functions are computed
from the solution of a coupled system, their accuracy depends both
on the accuracy of each component and the accuracy of the MDA. To
address the latter, the MDA should be converged well enough.
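In code, this amounts to wrapping the MDA in the function being differentiated, so every perturbed evaluation reconverges the coupled system. The sketch below does this for the toy two-component MDA used earlier with a central difference; the objective and the role of the design variable 𝑥 are made up.

```python
# Finite differencing a coupled model: each objective evaluation reconverges the MDA.
import numpy as np

def mda(x, tol=1e-12):
    u = np.zeros(2)
    for _ in range(200):                       # block Gauss-Seidel MDA (toy equations)
        u_new = np.array([(np.sin(u[1]) + 2.0 + x) / 3.0, np.cos(u[0]) / 2.0])
        if np.linalg.norm(u_new - u) < tol:
            break
        u = u_new
    return u

def objective(x):
    u = mda(x)                                 # converge the MDA for this design
    return u[0] ** 2 + u[1]

x, h = 0.5, 1e-6
dfdx = (objective(x + h) - objective(x - h)) / (2 * h)   # central difference
print(dfdx)
```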

13.3.2 Complex Step and AD


The complex-step method and forward-mode AD can also be used for a
coupled system, but some modifications are required. The complex-step
method requires all components to be able to take complex inputs and
compute the corresponding complex outputs. Similarly, AD requires
inputs and outputs that include derivative information. For a given MDA, if one of these methods is applied to each component and the coupling includes the derivative information, we can compute the derivatives of the coupled system. The propagation of the forward-mode seed (or the complex step) is illustrated in Fig. 13.24 for a system of two components.

Fig. 13.24 Forward mode of AD for a system of two components.

When using AD, manual coupling is required if the components and the coupling are programmed in different languages. The complex-step method can be more straightforward to implement than AD for cases where the models are implemented in different languages, and all the languages support complex arithmetic. Although both of these methods produce accurate derivatives for each component, the accuracy of the derivatives for the coupled system could be compromised by a low level of convergence of the MDA.

Fig. 13.25 Reverse mode of AD for a system of two components.

The reverse mode of AD for coupled systems would be more
involved: after an initial MDA, we would run a reverse MDA to compute the derivatives, as illustrated in Fig. 13.25.

13.3.3 Implicit Analytic Methods


The implicit analytic methods from Section 6.7 (both direct and adjoint)
can also be extended to compute the derivatives of coupled systems.
All the equations derived for a single component in Section 6.7 are
valid for coupled systems if we concatenate the residuals and the state
variables. Furthermore, we can mix explicit and implicit components
using concepts introduced in the UDE. Finally, when using the MAUD
approach, the coupled derivative computation can be done using the
same hierarchy of solvers.

Coupled Derivatives of Residual Representation

In Eq. 13.1, we denoted the coupled system as a series of concatenated residuals, 𝑟𝑖(𝑢) = 0, and variables 𝑢𝑖 corresponding to each component 𝑖 = 1, . . . , 𝑛 as

$$r(u) \equiv \begin{bmatrix} r_1(u) \\ \vdots \\ r_n(u) \end{bmatrix}, \qquad u \equiv \begin{bmatrix} u_1 \\ \vdots \\ u_n \end{bmatrix}, \tag{13.16}$$
where the residual for each component, 𝑟 𝑖 , could depend on all states 𝑢.
To derive the coupled version of the direct and adjoint methods, we
apply them to the concatenated vectors. Thus, the coupled version of
the linear system for the direct method (Eq. 6.43) is
$$
\begin{bmatrix}
\dfrac{\partial r_1}{\partial u_1} & \cdots & \dfrac{\partial r_1}{\partial u_n} \\[1ex]
\vdots & \ddots & \vdots \\[1ex]
\dfrac{\partial r_n}{\partial u_1} & \cdots & \dfrac{\partial r_n}{\partial u_n}
\end{bmatrix}
\begin{bmatrix} \phi_1 \\ \vdots \\ \phi_n \end{bmatrix}
=
\begin{bmatrix} \dfrac{\partial r_1}{\partial x} \\ \vdots \\ \dfrac{\partial r_n}{\partial x} \end{bmatrix},
\tag{13.17}
$$
where 𝜙 𝑖 represents the derivatives of the states from component 𝑖 with
respect to the design variables. Once we have solved for 𝜙, we can
use the coupled equivalent of the total derivative equation (Eq. 6.44) to
compute the derivatives:

$$
\frac{\mathrm{d} f}{\mathrm{d} x} = \frac{\partial f}{\partial x} -
\begin{bmatrix} \dfrac{\partial f}{\partial u_1} & \cdots & \dfrac{\partial f}{\partial u_n} \end{bmatrix}
\begin{bmatrix} \phi_1 \\ \vdots \\ \phi_n \end{bmatrix}.
\tag{13.18}
$$
Similarly, the adjoint equations (Eq. 6.46) can be written for a coupled
system using the same concatenated state and residual vectors. The
coupled adjoint equations involve a corresponding concatenated adjoint vector and can be written as
$$
\begin{bmatrix}
\dfrac{\partial r_1}{\partial u_1}^{\!\top} & \cdots & \dfrac{\partial r_n}{\partial u_1}^{\!\top} \\[1ex]
\vdots & \ddots & \vdots \\[1ex]
\dfrac{\partial r_1}{\partial u_n}^{\!\top} & \cdots & \dfrac{\partial r_n}{\partial u_n}^{\!\top}
\end{bmatrix}
\begin{bmatrix} \psi_1 \\ \vdots \\ \psi_n \end{bmatrix}
=
\begin{bmatrix} \dfrac{\partial f}{\partial u_1}^{\!\top} \\ \vdots \\ \dfrac{\partial f}{\partial u_n}^{\!\top} \end{bmatrix}.
\tag{13.19}
$$
After solving this equation for the coupled-adjoint vector, we can use the coupled version of the total derivative equation (Eq. 6.47) to compute the desired derivatives as
$$
\frac{\mathrm{d} f}{\mathrm{d} x} = \frac{\partial f}{\partial x} -
\begin{bmatrix} \psi_1^{\top} & \cdots & \psi_n^{\top} \end{bmatrix}
\begin{bmatrix} \dfrac{\partial r_1}{\partial x} \\ \vdots \\ \dfrac{\partial r_n}{\partial x} \end{bmatrix}.
\tag{13.20}
$$
Like the adjoint method from Section 6.7, the coupled adjoint is a powerful approach for computing gradients with respect to many design variables.∗

∗ The coupled-adjoint approach has been implemented for aerostructural problems governed by coupled PDEs207 and demonstrated in wing design optimization.209
207. Kenway et al., Scalable parallel approach for high-fidelity steady-state aeroelastic analysis and derivative computations, 2014.
209. Kenway and Martins, Multipoint high-fidelity aerostructural optimization of a transport aircraft configuration, 2014.

The required partial derivatives are the derivatives of the residuals or outputs of each component with respect to the state variables or inputs of all other components. In practice, the block structure of these partial derivative matrices is sparse, and the matrices themselves are sparse. This sparsity can be exploited using graph coloring to drastically reduce the effort of computing Jacobians at the system or component level, as explained in Section 6.8.
Figure 13.26 shows the structure of the Jacobians in Eq. 13.17 and
Eq. 13.19 for the three-group case from Fig. 13.23. The sparsity structure
of the Jacobian is the transpose of the DSM structure. Because the
Jacobian in Eq. 13.19 is transposed, the Jacobian in the adjoint equation
has the same structure as the DSM.
The structure of the linear system can be exploited in the same
way as for the nonlinear system solution using hierarchical solvers:
serial solvers within each group and a coupled solver for the three
groups. The block Jacobi and Gauss–Seidel methods from Section 13.2.5
are applicable to coupled linear components, so these methods can
be re-used to solve this coupled linear system for the total coupled
derivatives.
The partial derivatives in the coupled Jacobian, the right-hand side
of the linear systems (Eqs. 13.17 and 13.19), and the total derivatives
equations (Eqs. 13.18 and 13.20) can be computed with any of the
methods from Chapter 6. The nature of these derivatives is the same as we have seen previously for implicit analytic methods (Section 6.7). They do not require the solution of the equation and are typically cheap to compute. Ideally, the components would already have analytic derivatives of their outputs with respect to their inputs, which are all the derivatives needed at the system level.

Fig. 13.26 Jacobian structure for the residual form of the coupled direct (left) and adjoint (right) equations for the three-group system of Fig. 13.23. The structure of the transpose of the Jacobian is the same as that of the DSM.
The partial derivatives can also be computed using the finite-
difference or complex-step methods. Even though these are not efficient
for cases with many inputs, it might still be more efficient to compute
the partial derivatives with these methods and then solve the coupled
derivative equations instead of performing a finite difference of the
coupled system, as described in Section 13.3.1. The reason is that com-
puting the partial derivatives avoids having to reconverge the coupled
system for every input perturbation. In addition, the coupled system
derivatives should be more accurate when finite differences are used
only to compute the partial derivatives.
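The following minimal sketch illustrates this trade-off on a made-up two-discipline model: finite differencing the coupled system requires reconverging the MDA for every perturbation, whereas the second approach uses only partial derivatives and one coupled linear (direct) solve. The model, numbers, and functions below are invented purely for illustration.

```python
import numpy as np

def mda(x, tol=1e-12):
    """Fixed-point MDA for the toy coupled system u1 = x + 0.3*u2, u2 = x**2 - 0.2*u1."""
    u1, u2 = 0.0, 0.0
    for _ in range(200):
        u1_new = x + 0.3 * u2
        u2_new = x ** 2 - 0.2 * u1_new
        if abs(u1_new - u1) + abs(u2_new - u2) < tol:
            return u1_new, u2_new
        u1, u2 = u1_new, u2_new
    return u1, u2

def f(u1, u2):
    return u1 ** 2 + u2

x = 1.3

# (a) Finite difference of the whole coupled system: reconverge the MDA per perturbation
h = 1e-6
dfdx_fd = (f(*mda(x + h)) - f(*mda(x - h))) / (2 * h)

# (b) Partial derivatives + coupled direct linear system (no reconverging)
u1, u2 = mda(x)
dr_du = np.array([[1.0, -0.3],    # r1 = u1 - x - 0.3*u2
                  [0.2, 1.0]])    # r2 = u2 - x**2 + 0.2*u1
dr_dx = np.array([-1.0, -2.0 * x])
du_dx = np.linalg.solve(dr_du, -dr_dx)   # dr/du du/dx = -dr/dx
df_du = np.array([2.0 * u1, 1.0])
dfdx_direct = df_du @ du_dx              # f has no direct x dependence here

print(dfdx_fd, dfdx_direct)
```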

Coupled Derivatives of Functional Representation

Variants of the coupled direct and adjoint methods can also be derived
for the functional form of the system-level representation (Eq. 13.4),
by using the residuals defined for the system-level Newton solver
(Eq. 13.11),
$$\hat{r}_i(\hat{u}) = \hat{u}_i - U_i\left(\hat{u}_{j\neq i}\right) = 0 , \quad i = 1, \ldots, m . \quad (13.21)$$
Recall that driving these residuals to zero relies on a solver for each component to solve for each component's states and another solver to solve for the coupling variables $\hat{u}$.
Using this new residual definition and the coupling variables, we can derive the functional form of the coupled direct method as
$$\begin{bmatrix}
I & -\dfrac{\partial U_1}{\partial \hat{u}_2} & \cdots & -\dfrac{\partial U_1}{\partial \hat{u}_m} \\
-\dfrac{\partial U_2}{\partial \hat{u}_1} & I & \cdots & -\dfrac{\partial U_2}{\partial \hat{u}_m} \\
\vdots & \vdots & \ddots & \vdots \\
-\dfrac{\partial U_m}{\partial \hat{u}_1} & -\dfrac{\partial U_m}{\partial \hat{u}_2} & \cdots & I
\end{bmatrix}
\begin{bmatrix} \hat{\phi}_1 \\ \hat{\phi}_2 \\ \vdots \\ \hat{\phi}_m \end{bmatrix}
=
\begin{bmatrix} \dfrac{\partial U_1}{\partial x} \\[1ex] \dfrac{\partial U_2}{\partial x} \\ \vdots \\ \dfrac{\partial U_m}{\partial x} \end{bmatrix} ,
\quad (13.22)$$
where the Jacobian is identical to the one we derived for the coupled Newton step (Eq. 13.13). Here, $\hat{\phi}_i$ represents the derivatives of the coupling variables from component $i$ with respect to the design variables.
The solution can then be used in the following equation to compute the total derivatives:
$$\frac{\mathrm{d} f}{\mathrm{d} x} = \frac{\partial f}{\partial x} + \begin{bmatrix} \dfrac{\partial f}{\partial \hat{u}_1} & \cdots & \dfrac{\partial f}{\partial \hat{u}_m} \end{bmatrix} \begin{bmatrix} \hat{\phi}_1 \\ \vdots \\ \hat{\phi}_m \end{bmatrix} . \quad (13.23)$$
Similarly, the functional version of the coupled adjoint equations can be derived as
$$\begin{bmatrix}
I & -{\dfrac{\partial U_2}{\partial \hat{u}_1}}^\intercal & \cdots & -{\dfrac{\partial U_m}{\partial \hat{u}_1}}^\intercal \\
-{\dfrac{\partial U_1}{\partial \hat{u}_2}}^\intercal & I & \cdots & -{\dfrac{\partial U_m}{\partial \hat{u}_2}}^\intercal \\
\vdots & \vdots & \ddots & \vdots \\
-{\dfrac{\partial U_1}{\partial \hat{u}_m}}^\intercal & -{\dfrac{\partial U_2}{\partial \hat{u}_m}}^\intercal & \cdots & I
\end{bmatrix}
\begin{bmatrix} \hat{\psi}_1 \\ \hat{\psi}_2 \\ \vdots \\ \hat{\psi}_m \end{bmatrix}
=
\begin{bmatrix} {\dfrac{\partial f}{\partial \hat{u}_1}}^\intercal \\[1ex] {\dfrac{\partial f}{\partial \hat{u}_2}}^\intercal \\ \vdots \\ {\dfrac{\partial f}{\partial \hat{u}_m}}^\intercal \end{bmatrix} . \quad (13.24)$$
After solving for the coupled-adjoint vector using the previous equation, we can use the total derivative equation to compute the desired derivatives:
$$\frac{\mathrm{d} f}{\mathrm{d} x} = \frac{\partial f}{\partial x} - \begin{bmatrix} \hat{\psi}_1^\intercal & \ldots & \hat{\psi}_m^\intercal \end{bmatrix} \begin{bmatrix} \dfrac{\partial \hat{r}_1}{\partial x} \\ \vdots \\ \dfrac{\partial \hat{r}_m}{\partial x} \end{bmatrix} . \quad (13.25)$$
Because the coupling variables ($\hat{u}$) are usually a reduction of the internal state variables ($u$), the linear systems in Eqs. 13.22 and 13.24 are usually much smaller than those of the residual counterparts (Eqs. 13.17
and 13.19). However, unlike the partial derivatives in the residual form,
the partial derivatives in the functional form Jacobian need to account
for the solution of the corresponding component. When viewed at
the component level, these derivatives are actually total derivatives of
the component. When the component is an implicit set of equations,
computing these derivatives with finite-differencing would require
solving the component’s equations for each variable perturbation.
Alternatively, an implicit analytic method (from Section 6.7) could be
applied to the component to compute these derivatives.
Figure 13.27 shows the Jacobian structure in the functional form of the coupled direct method (Eq. 13.22) for the case of Fig. 13.23. The dimension of this Jacobian is smaller than that of the residual form. Recall from Fig. 13.9 that $U_1$ corresponds to $r_3$, $U_2$ corresponds to $r_6$, and $U_3$ corresponds to $r_9$. Thus, the total size of this Jacobian corresponds to the sum of the sizes of components 3, 6, and 9, as opposed to the sum of the sizes of all nine components for the residual form. However, as mentioned previously, partial derivatives for the functional form are more expensive to compute because they need to account for an implicit solver in each of the three groups.

[Fig. 13.27: Jacobian of coupled derivatives for the functional form of Fig. 13.23:
$\begin{bmatrix} I & -\partial U_1/\partial \hat{u}_2 & -\partial U_1/\partial \hat{u}_3 \\ -\partial U_2/\partial \hat{u}_1 & I & -\partial U_2/\partial \hat{u}_3 \\ -\partial U_3/\partial \hat{u}_1 & -\partial U_3/\partial \hat{u}_2 & I \end{bmatrix}$]

UDE for Coupled Systems

As in the single-component case in Section 6.9, the coupled direct and


adjoint equations derived in this section can be obtained from the
UDE with the appropriate definitions of residuals and variables. The
components corresponding to each block in these equations can also be
implicit or explicit, which provides the flexibility to represent systems
of heterogeneous components.
MAUD implements the linear systems from these coupled direct
and adjoint equations using the UDE. The overall linear system inherits
the hierarchical structure defined for the nonlinear solvers. Instead
of nonlinear solvers, we use linear solvers, such as a direct solver and a Krylov method (both monolithic). As mentioned in Section 13.2.5, the nonlinear
block Jacobi and Gauss–Seidel (both recursive) can be reused to solve
coupled linear systems. Components can be expressed using residual or
functional forms, making it possible to include black-box components.
The example originally used in Chapter 6 to demonstrate how
to compute derivatives with the UDE (Ex. 6.15) can be viewed as a
coupled derivative computation where each equation is a component.
Example 13.6 demonstrates the UDE approach to computing derivatives
by building on the wing design problem presented in Ex. 13.2.
Tip 13.4 Implementing coupled derivative computation

Obtaining derivatives for each component of a multidisciplinary model


and assembling them to compute the coupled derivatives usually requires a
high implementation effort. In addition to implementing hierarchical coupled
solvers (as mentioned in Tip 13.3), the OpenMDAO framework also implements
the MAUD approach to computing coupled derivatives. The linear system
mirrors the hierarchy set up for nonlinear coupled solvers.132 Ideally, users provide the partial derivatives for each component using accurate and efficient methods. However, if derivatives are not available, OpenMDAO can automatically compute them using finite differences or the complex-step method. OpenMDAO also facilitates efficient derivative computation for sparse Jacobians using the graph coloring techniques introduced in Section 6.8.
132. Gray et al., OpenMDAO: An open-source framework for multidisciplinary design, analysis, and optimization, 2019.
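As a minimal OpenMDAO sketch of this tip, the following two toy components and their equations are made up for illustration (only the OpenMDAO calls themselves are real API): one component provides analytic partials, the other declares complex-step partials, and the framework assembles the coupled total derivative.

```python
import openmdao.api as om

class Aero(om.ExplicitComponent):
    """Toy 'aerodynamics' component with analytic partial derivatives."""
    def setup(self):
        self.add_input("d", val=0.0)    # displacement coming from the structure
        self.add_input("b", val=10.0)   # span (design variable)
        self.add_output("L", val=0.0)   # load passed to the structure
        self.declare_partials("L", ["d", "b"])

    def compute(self, inputs, outputs):
        outputs["L"] = inputs["b"] * (1.0 + 0.1 * inputs["d"])

    def compute_partials(self, inputs, partials):
        partials["L", "d"] = 0.1 * inputs["b"]
        partials["L", "b"] = 1.0 + 0.1 * inputs["d"]

class Struct(om.ExplicitComponent):
    """Toy 'structures' component; partials approximated by the complex-step method."""
    def setup(self):
        self.add_input("L", val=0.0)
        self.add_output("d", val=0.0)
        self.declare_partials("d", "L", method="cs")

    def compute(self, inputs, outputs):
        outputs["d"] = 0.01 * inputs["L"]

prob = om.Problem()
prob.model.add_subsystem("aero", Aero(), promotes=["*"])
prob.model.add_subsystem("struct", Struct(), promotes=["*"])
# The components form a coupled cycle, so we need a nonlinear solver for the MDA
# and a linear solver for the coupled derivatives.
prob.model.nonlinear_solver = om.NonlinearBlockGS()
prob.model.linear_solver = om.DirectSolver()

prob.setup(force_alloc_complex=True)  # required for complex-step partials
prob.run_model()
totals = prob.compute_totals(of=["d"], wrt=["b"])
print(totals["d", "b"])
```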

Example 13.6 Aerostructural derivatives

Let us now consider a wing design optimization problem based on the aerostructural model considered in Ex. 13.1.† The design variables are as follows:
† As in Ex. 13.5, these results were obtained using OpenAeroStruct, and the description and equations are simplified for brevity.
𝛼: Angle of attack. This controls the amount of lift produced by the airplane.
𝑏: Wingspan. This is a shared variable because it directly affects both the
aerodynamic and structural models.
𝜃: Twist distribution along the wingspan, represented by a vector. This controls
the relative lift loading in the spanwise direction, which affects the drag
and the load distribution on the structure. It affects the aerodynamic
model but not the structural model (because it is idealized as a beam).
𝑡: Thickness distribution of beam along the wingspan, represented by a vector.
This directly affects the weight and the stiffness. It does not affect the
aerodynamic model.
The objective is to minimize the fuel required for a given range $R$, which can be written as a function of drag, lift, and weight, as follows:
$$f = W \left( \exp\left( \frac{R\,c\,D}{V L} \right) - 1 \right) . \quad (13.26)$$

The empty weight 𝑊 only depends on 𝑡 and 𝑏, and the dependence is explicit
(it does not require solving the aerodynamic or structural models). The drag 𝐷
and lift 𝐿 depend on all variables once we account for the coupled system of
equations. The remaining variables are fixed: 𝑅 is the required range, 𝑉 is the
airplane’s cruise speed, and 𝑐 is the specific fuel consumption of the airplane’s
engines. We also need to constrain the stresses in the structure, 𝜎, which are an
explicit function of the displacements (see Ex. 6.12).
To solve this optimization problem using gradient-based optimization, we
need the coupled derivatives of 𝑓 and 𝜎 with respect to 𝛼, 𝑏, 𝜃, and 𝑡. Computing
the derivatives of the aerodynamic and structural models separately is not
sufficient. For example, a perturbation on the twist changes the loads, which
then changes the wing displacements, which requires solving the aerodynamic
model again. Coupled derivatives take this effect into account.

[Fig. 13.28: DSM of the aerostructural problem, showing the structure of the reverse UDE. Rows and columns: $r_\alpha, r_b, r_\theta, r_t, r_\Gamma, r_d, r_W, r_D, r_L, r_\sigma, r_f$, grouped into design variables, intermediate variables, and functions.]

We show the DSM for the system in Fig. 13.28. Because the DSM has the
same sparsity structure as the transpose of the Jacobian, this diagram reflects
the structure of the reverse UDE. The blocks that pertain to the design variables
have unit diagonals because they are independent variables, but they directly
affect the solver blocks. The blocks responsible for solving for Γ and 𝑑 are the
only ones with feedback coupling. The part of the UDE pertaining to Γ and 𝑑
is the Jacobian of residuals for the aerodynamic and structural components,
which we already derived in Ex. 13.5 to apply Newton’s method on the coupled
system. The functions of interest are all explicit components and only depend
directly on the design variables or the state variables. For example, the weight
𝑊 depends only on 𝑡; drag and lift depend only on the converged Γ; 𝜎 depends
on the displacements; and finally, the fuel burn 𝑓 just depends on drag, lift,
and weight. This whole coupled chain of derivatives is computed by solving
the linear system shown in Fig. 13.28.
For brevity, we only discuss the derivatives required to compute the
derivative of fuel burn with respect to span, but the other partial derivatives
would follow the same rationale.
• 𝜕𝑟/𝜕𝑢 is identical to what we derived when solving the coupled aero-
structural system in Ex. 13.5.


• 𝜕𝑟/𝜕𝑥 has two components, which we can obtain by differentiating the
residuals:
$$\frac{\partial}{\partial b}\left(A\Gamma - v\right) = \frac{\partial A}{\partial b}\,\Gamma - \frac{\partial v}{\partial b} , \qquad \frac{\partial}{\partial b}\left(Kd - q\right) = \frac{\partial K}{\partial b}\, d .$$
• 𝜕 𝑓 /𝜕𝑥 = 𝜕 𝑓 /𝜕𝑏 = 0 because the fuel burn does not depend directly on
the span if we just consider Eq. 13.26. However, it does depend on the
span through 𝑊, 𝐷, and 𝐿. This is where the UDE description is more
general and clearer than the standard direct and adjoint formulation.
By defining the explicit components of the function in the bottom-right
corner, the solution of the linear system yields the chain rule

$$\frac{\mathrm{d} f}{\mathrm{d} b} = \frac{\partial f}{\partial D}\frac{\mathrm{d} D}{\mathrm{d} b} + \frac{\partial f}{\partial L}\frac{\mathrm{d} L}{\mathrm{d} b} + \frac{\partial f}{\partial W}\frac{\mathrm{d} W}{\mathrm{d} b} ,$$
where the partial derivatives can be obtained by differentiating Eq. 13.26
symbolically, and the total derivatives are part of the coupled linear
system solution.
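For example, differentiating Eq. 13.26 with the other quantities held fixed gives the partial derivatives that enter this chain rule:
$$\frac{\partial f}{\partial D} = \frac{W R c}{V L}\exp\left(\frac{R c D}{V L}\right), \qquad
\frac{\partial f}{\partial L} = -\frac{W R c D}{V L^2}\exp\left(\frac{R c D}{V L}\right), \qquad
\frac{\partial f}{\partial W} = \exp\left(\frac{R c D}{V L}\right) - 1 .$$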
After computing all the partial derivative terms, we solve either the forward or reverse UDE system. For the derivative with respect to span, neither method has an advantage. However, for the derivatives of fuel burn with respect to the twist and thickness variables, the reverse mode is much more efficient. In this example, $\mathrm{d}f/\mathrm{d}b = -11.0$ kg/m, so each additional meter of span reduced the fuel burn by 11 kg. If we compute this same derivative without coupling (by converging the aerostructural model but not considering the off-diagonal terms in the aerostructural Jacobian), we obtain $\mathrm{d}f/\mathrm{d}b = -17.7$ kg/m, which is significantly different. The derivatives of the fuel burn with respect to the twist distribution and the thickness distribution along the wingspan are plotted in Fig. 13.29, where we can see the difference between coupled and uncoupled derivatives.

[Fig. 13.29: Derivatives of the fuel burn with respect to the spanwise distribution of twist and thickness variables, comparing coupled and decoupled computations. The coupled derivatives differ from the uncoupled derivatives, especially for the derivatives with respect to structural thicknesses near the wing root.]

13.4 Monolithic MDO Architectures

So far in this chapter, we have extended the models and solvers from Chapter 3 and derivative computation methods from Chapter 6 to coupled systems. We now discuss the options to optimize coupled systems, which are given by various MDO architectures. Monolithic MDO architectures cast the design problem as a single optimization. The only difference between the different monolithic architectures is the set of design variables that the optimizer is responsible for, which affects the constraint formulation and how the governing equations are solved.
13.4.1 Multidisciplinary Feasible


The multidisciplinary design feasible (MDF) architecture is the archi-
tecture that is most similar to a single-discipline problem and usually
the most intuitive for engineers. The design variables, objective, and
constraints are the same as we would expect for a single-discipline
problem. The only difference is that the computation of the objective
and constraints requires solving a coupled system instead of a sin-
gle system of governing equations. Therefore, all the optimization
algorithms covered in the previous chapters can be applied without
modification when using the MDF architecture. This approach is also
called a reduced-space approach because the optimizer does not handle
the space of the state and coupling variables. Instead, it relies on a
solver to find the state variables that satisfy the governing equations
for the current design (see Eq. 3.32).
The resulting optimization problem is as follows:∗
∗ The quantities after the semicolon in the variable dependence correspond to variables that remain fixed in the current context. For simplicity, we omit the design equality constraints ($h = 0$) without loss of generality.

minimize $f(x;\ \hat{u}^*)$
by varying $x$
subject to $g(x;\ \hat{u}^*) \leq 0$    (13.27)
while solving $\hat{r}(\hat{u};\ x) = 0$
for $\hat{u}$.

At each optimization iteration, the optimizer has a multidisciplinary feasible point $\hat{u}^*$ found through the MDA. For a design given by the optimizer ($x$), the MDA finds the internal component states ($u$) and the coupling variables ($\hat{u}$). To denote the MDA solution, we use the residuals of the functional form, where the residuals for component $i$ are†
$$\hat{r}_i(\hat{u}, u_i) = \hat{u}_i - U_i\left(u_i, \hat{u}_{j\neq i}\right) = 0 . \quad (13.28)$$
† These are identical to the residuals of the system-level Newton solver (Eq. 13.11).
Each component is assumed to solve for its state variables 𝑢𝑖 internally.
The MDA finds the coupling variables by solving the coupled system of
components 𝑖 = 1, . . . , 𝑚 using one of the methods from Section 13.2.5.
Then, the objective and constraints can be computed based on the
current design variables and coupling variables. Figure 13.30 shows
an XDSM for MDF with three components. Here we use a nonlinear
block Gauss–Seidel method (see Alg. 13.2) to converge the MDA, but
any other method from Section 13.2.5 could be used.
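To make the MDF data flow concrete, the following minimal sketch uses a hypothetical two-discipline model with scalar couplings and made-up functions (it is not the wing problem): a block Gauss–Seidel MDA is run inside every objective and constraint evaluation of a gradient-based optimizer.

```python
import numpy as np
from scipy.optimize import minimize

def mda(x, tol=1e-10, max_iter=100):
    """Nonlinear block Gauss-Seidel MDA: solve u1 = U1(x, u2), u2 = U2(x, u1)."""
    u1, u2 = 0.0, 0.0
    for _ in range(max_iter):
        u1_new = x[0] + 0.3 * u2           # component 1 "solver"
        u2_new = x[1] ** 2 - 0.2 * u1_new  # component 2 "solver"
        if abs(u1_new - u1) + abs(u2_new - u2) < tol:
            return u1_new, u2_new
        u1, u2 = u1_new, u2_new
    return u1, u2

def objective(x):
    u1, u2 = mda(x)  # a multidisciplinary feasible point at every evaluation
    return (u1 - 1.0) ** 2 + (u2 - 2.0) ** 2 + 0.1 * (x[0] ** 2 + x[1] ** 2)

def constraint(x):
    u1, _ = mda(x)
    return 1.5 - u1  # g(x, u*) <= 0 written as c(x) >= 0 for SLSQP

res = minimize(objective, x0=np.array([0.5, 0.5]), method="SLSQP",
               constraints=[{"type": "ineq", "fun": constraint}])
print(res.x, res.fun)
```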
One advantage of MDF is that the system-level states are physically
compatible if an optimization stops prematurely. This is advantageous
in an engineering design context when time is limited, and we are not
as concerned with finding an optimal design in the strict mathematical
sense as we are with finding an improved design. However, it is not guaranteed that the design constraints are satisfied if the optimization is terminated early; that depends on whether the optimization algorithm maintains a feasible design point or not.

[Fig. 13.30: XDSM for the MDF architecture with three components. The MDF architecture relies on an MDA to solve for the coupling and state variables at each optimization iteration. In this case, the MDA uses the block Gauss–Seidel method.]
The main disadvantage of MDF is that it solves an MDA for each
optimization iteration, which requires its own algorithm outside of the
optimization. Implementing an MDA algorithm can be time-consuming
if one is not already in place.
As mentioned in Tip 13.3, a MAUD-based framework such as Open-
MDAO can facilitate this. MAUD naturally implements the MDF archi-
tecture because it focuses on solving the MDA (Section 13.2.5) and on
computing the derivatives corresponding to the MDA (Section 13.3.3).‡ When using a gradient-based optimizer, gradient computations are also challenging for MDF because coupled derivatives are required. Finite-difference derivative approximations are easy to implement, but their poor scalability and accuracy are compounded by the MDA, as explained in Section 13.3. Ideally, we would use one of the analytic coupled derivative computation methods of Section 13.3, which require a substantial implementation effort. Again, OpenMDAO was developed to facilitate coupled derivative computation (see Tip 13.4).
‡ The first application of MAUD was the design optimization of a satellite and its orbit dynamics. The problem consisted of over 25,000 design variables and over 2 million state variables.210
210. Hwang et al., Large-scale multidisciplinary optimization of a small satellite's design and operation, 2014.
Example 13.7 Aerostructural optimization using MDF

Continuing the wing aerostructural problem from Ex. 13.6, we are finally ready to optimize the wing. The MDF formulation is as follows:

minimize $f$
by varying $\alpha, b, \theta, t$
subject to $L - W = 0$
$2.5|\sigma| - \sigma_\text{yield} \leq 0$
while solving $A(d)\Gamma - v(d, \alpha) = 0$
$Kd - q(\Gamma) = 0$
for $\Gamma, d$.

[Fig. 13.31: Initial and optimized planforms (span from 0 to 20 m); the optimization reduces the fuel burn by increasing the span.]
The structural stresses are constrained to be less than the yield stress of the material by a safety factor (2.5 in this case). In Ex. 13.5, we set up the MDA for the aerostructural problem, and in Ex. 13.6, we set up the coupled derivative computations needed to solve this problem using gradient-based optimization. Solving this optimization resulted in the larger span wing shown in Fig. 13.31. This larger span increases the structural weight, but decreases drag. Although the increase in weight would typically increase the fuel burn, the drag decrease more than compensates for this adverse effect, and the fuel burn ultimately decreases up to this value of span. Beyond this optimal span value, the weight penalty would start to dominate, resulting in a fuel burn increase.

The twist and thickness distributions are shown in Fig. 13.32. The wing twist directly controls the spanwise lift loading. The baseline wing had no twist, which resulted in the loading shown in Fig. 13.33. In this figure, the gray line represents a hypothetical elliptical lift distribution, which results in the theoretical minimum for induced drag. The loading distributions for the level flight (1 g) and maneuver conditions (2.5 g) are indistinguishable. The optimization increases the twist in the midspan and drastically decreases it toward the tip. This twist distribution differentiates the loading at the two conditions: it makes the loading at level flight closer to the elliptical ideal while shifting the loading at the maneuver condition toward the wing root.

The thickness distribution also changes significantly, as shown in Fig. 13.32. The optimization tailors the thickness by adding more thickness in the spar near the root, where the moments are larger, and thins out the wing much more toward the tip, where the loads decrease. This more radical thickness distribution is enabled by the tailoring of the spanwise lift loading discussed previously.

These trades make sense because, at the level flight condition, the optimizer is concerned with minimizing drag, whereas, at the maneuver condition, the optimizer just wants to satisfy the stress constraint for a given total lift.

[Fig. 13.32: Twist ($\theta$) and thickness ($t$) distributions versus normalized spanwise location for the baseline and optimized wings.]

[Fig. 13.33: Lift loading versus normalized spanwise location for the baseline and optimized wings at the 1 g and 2.5 g conditions.]
Example 13.8 Aerostructural sequential optimization

In Section 13.1, we argued that sequential optimization does not, in general,


converge to the true optimum for constrained problems. We now demonstrate
this for a modified version of the wing aerostructural design optimization
problem from Ex. 13.7. One major modification was to reduce the problem
to two design variables to visualize the optimization path: one structural
variable corresponding to a constant spar thickness and one twist variable
corresponding to the wing tip twist, which controls the slope of a linear twist
distribution. The simultaneous optimization of these two variables using the
MDF architecture from Ex. 13.7 yields the path labeled “MDO” in Fig. 13.34.

[Fig. 13.34: Optimization paths in the space of spar thickness (cm) versus wing tip jig twist (deg), starting from $x_0$. Sequential optimization gets stuck at the stress constraint (at $x_\text{seq}$), whereas simultaneous optimization of the aerodynamic and structural variables finds the true multidisciplinary optimum $x^*$.]

To perform sequential optimization for the wing design problem of Ex. 13.1,
we could start by optimizing the aerodynamics by solving the following
problem:
minimize 𝑓
by varying 𝛼, 𝜃
subject to 𝐿−𝑊 = 0.
Here, 𝑊 is constant because the structural thicknesses 𝑡 are fixed, but 𝐿 is a
function of the aerodynamic design variables and states. We cannot include the
span 𝑏 because it is a shared variable, as explained in Section 13.1. Otherwise,
this optimization would tend to increase 𝑏 indefinitely to reduce the lift-induced
drag. Because 𝑓 is a function of 𝐷 and 𝐿, and 𝐿 is constant because 𝐿 = 𝑊, we
could replace the objective with 𝐷.
Once the aerodynamic optimization has converged, the twist distribution
and the forces are fixed, and we then optimize the structure by minimizing
weight subject to stress constraints by solving the following problem:

minimize 𝑓
by varying 𝑡
subject to 2.5|𝜎| − 𝜎yield ≤ 0 .
Because the drag and lift are constant, the objective could be replaced by 𝑊.
Again, we cannot include the span in this problem because it would decrease
indefinitely to reduce the weight and internal loads due to bending.
These two optimizations are repeated until convergence. As shown in
Fig. 13.34, sequential optimization only changes one variable at a time, and it
converges to a point on the constraint with about 3.5 ◦ more twist than the true
optimum of the MDO. When including more variables, these differences are
likely to be even larger.
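The mechanism by which sequential optimization gets stuck can be reproduced on a tiny made-up constrained problem that is unrelated to the wing model: in the following sketch, optimizing one variable at a time stalls at a vertex of the feasible region, whereas optimizing both variables simultaneously finds the better point.

```python
import numpy as np
from scipy.optimize import minimize

# Made-up problem: min -x0 - 2*x1 subject to x0 + x1 <= 1 and 0 <= x0, x1 <= 1.
# The true optimum is (0, 1) with objective -2.
def f(x):
    return -x[0] - 2.0 * x[1]

con = {"type": "ineq", "fun": lambda x: 1.0 - x[0] - x[1]}

# Sequential: optimize x0 with x1 fixed, then x1 with x0 fixed, and repeat.
x = np.array([0.0, 0.0])
for _ in range(5):
    r0 = minimize(lambda a: f([a[0], x[1]]), [x[0]], method="SLSQP",
                  bounds=[(0.0, 1.0)],
                  constraints=[{"type": "ineq", "fun": lambda a: 1.0 - a[0] - x[1]}])
    x[0] = r0.x[0]
    r1 = minimize(lambda b: f([x[0], b[0]]), [x[1]], method="SLSQP",
                  bounds=[(0.0, 1.0)],
                  constraints=[{"type": "ineq", "fun": lambda b: 1.0 - x[0] - b[0]}])
    x[1] = r1.x[0]
print("sequential result:", x)  # stalls at [1, 0]

# Simultaneous optimization of both variables finds [0, 1]
r = minimize(f, [0.0, 0.0], method="SLSQP", bounds=[(0, 1), (0, 1)], constraints=[con])
print("simultaneous result:", r.x)
```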

13.4.2 Individual Discipline Feasible


The individual discipline feasible (IDF) architecture adds independent
copies of the coupling variables to allow component solvers to run
independently and possibly in parallel. These copies are known as
target variables and are controlled by the optimizer, whereas the actual
coupling variables are computed by the corresponding component.
Target variables are denoted by a superscript 𝑡, so the coupling variables
produced by discipline 𝑖 are denoted as 𝑢ˆ 𝑖𝑡 . These variables represent
the current guesses for the coupling variables that are independent
of the corresponding actual coupling variables computed by each
component. To ensure the eventual consistency between the target
coupling variables and the actual coupling variables at the optimum,
we define a set of consistency constraints, ℎ 𝑖𝑐 = 𝑢ˆ 𝑖𝑡 − 𝑢ˆ 𝑖 , which we add to
the optimization problem formulation.
The optimization problem for the IDF architecture is

minimize $f(x;\ \hat{u})$
by varying $x, \hat{u}^t$
subject to $g(x;\ \hat{u}) \leq 0$
$h_i^c = \hat{u}_i^t - \hat{u}_i = 0, \quad i = 1, \ldots, m$    (13.29)
while solving $r_i\left(\hat{u}_i;\ x, \hat{u}^t_{j\neq i}\right) = 0, \quad i = 1, \ldots, m$
for $\hat{u}$.

Each component $i$ is solved independently to compute the corresponding output coupling variables $\hat{u}_i$, where the inputs $\hat{u}^t_{j\neq i}$ are given by the optimizer. Thus, each component drives its residuals to zero to compute
$$\hat{u}_i = U_i\left(x, \hat{u}^t_{j\neq i}\right) . \quad (13.30)$$

The consistency constraint quantifies the difference between the target


coupling variables guessed by the optimizer and the actual coupling
variables computed by the components. The optimizer iterates the


target coupling variables simultaneously with the design variables to
find a multidisciplinary feasible point that is also an optimum. At each
iteration, the objective and constraints are computed using the latest
available coupling variables. Figure 13.35 shows the XDSM for IDF.

x(0) , ût,(0)

0, 3 → 1 :
x∗ 2 : x, ût 1 : x, ût2 , ût3 1 : x, ût1 , ût3 1 : x, ût1 , ût2
Optimization

2:
û∗1 3 : f, g, g c
Functions

1:
û∗2 2 : û1
Solver 1

1:
û∗3 2 : û2
Solver 2

1:
2 : û3
Solver 3

One advantage of IDF is that each component can be solved in Fig. 13.35 The IDF architecture breaks
up the MDA by letting the optimizer
parallel because they do not depend on each other directly. Another
solve for the coupling variables that
advantage is that if gradient-based optimization is used to solve the satisfy interdisciplinary feasibility.
problem, the optimizer is typically more robust and has a better conver-
gence rate than the fixed-point iteration algorithms of Section 13.2.5.
The main disadvantage of IDF is that the optimizer must handle
more variables and constraints compared with the MDF architecture. If
the number of coupling variables is large, the size of the resulting opti-
mization problem may be too large to solve efficiently. This problem can
be mitigated by careful selection of the components or by aggregating
the coupling variables to reduce their dimensionality.
Unlike MDF, IDF does not guarantee a multidisciplinary feasible
state at every design optimization iteration. Multidisciplinary feasibility
is only guaranteed at the end of the optimization through the satisfaction
of the consistency constraints. This is a disadvantage because if the
optimization stops prematurely or we run out of time, we do not have
a valid state for the coupled system.
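For contrast with the MDF sketch shown earlier, a minimal IDF-style sketch of the same hypothetical two-discipline toy model follows: the optimizer now owns the target copies and the consistency constraints, and the two disciplines could be evaluated independently. All functions and numbers are made up for illustration.

```python
import numpy as np
from scipy.optimize import minimize

def U1(x, u2t):
    return x[0] + 0.3 * u2t          # discipline 1 solved using the target u2t

def U2(x, u1t):
    return x[1] ** 2 - 0.2 * u1t     # discipline 2 solved using the target u1t

def objective(z):
    x, u1t, u2t = z[:2], z[2], z[3]
    return (u1t - 1.0) ** 2 + (u2t - 2.0) ** 2 + 0.1 * (x[0] ** 2 + x[1] ** 2)

def consistency(z):
    x, u1t, u2t = z[:2], z[2], z[3]
    return np.array([u1t - U1(x, u2t), u2t - U2(x, u1t)])  # h^c = u^t - u = 0

res = minimize(objective, np.zeros(4), method="SLSQP",
               constraints=[{"type": "eq", "fun": consistency}])
print(res.x)
```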
Example 13.9 Aerostructural optimization using IDF

For the IDF architecture, we need to make copies of the coupling variables
(Γ𝑡 and 𝑑 𝑡 ) and add the corresponding consistency constraints, as highlighted
in the following problem statement:

minimize $f$
by varying $\alpha, b, \theta, t, \Gamma^t, d^t$
subject to $L = W$
$2.5|\sigma| - \sigma_\text{yield} \leq 0$
$\Gamma^t - \Gamma = 0$
$d^t - d = 0$
while solving $A\left(d^t\right)\Gamma - v\left(d^t, \alpha\right) = 0$
$Kd - q\left(\Gamma^t\right) = 0$
for $\Gamma, d$.

The aerodynamic and structural models are solved independently. The aerody-
namic solver finds Γ for the 𝑑 𝑡 given by the optimizer, and the structural solver
finds 𝑑 for the given Γ𝑡 .
When using gradient-based optimization, we do not require coupled
derivatives, but we do need the derivatives of each model with respect to both
state variables. The derivatives of the consistency constraints are just a unit
matrix when taken with respect to the variable copies and are zero otherwise.

13.4.3 Simultaneous Analysis and Design


Simultaneous analysis and design (SAND) extends the idea of IDF by
moving not only the coupling variables to the optimization problem but
also all component states. The SAND architecture requires exposing
all the components in the form of the system-level view previously
introduced in Fig. 13.9. The residuals of the analysis become constraints
for which the optimizer is responsible.§
§ When the residual equations arise from discretized PDEs, we have what is called PDE-constrained optimization.211
211. Biegler et al., Large-Scale PDE-Constrained Optimization, 2003.
This means that component solvers are no longer needed, and the optimizer becomes responsible for simultaneously solving the components for their states, the interdisciplinary compatibility for the

coupling variables, and the design optimization problem for the design
variables. All that is required from the model is the computation
of residuals. Because the optimizer is controlling all these variables,
SAND is also known as a full-space approach. SAND can be stated as
follows:

minimize $f(x, \hat{u}, u)$
by varying $x, \hat{u}, u$
subject to $g(x, \hat{u}) \leq 0$    (13.31)
$r(x, \hat{u}, u) = 0$ .
Here, we use the representation shown in Fig. 13.7, so there are two
sets of explicit functions that convert the input coupling variables of
the component. The SAND architecture is also applicable to single
components, in which case there are no coupling variables. The XDSM
for SAND is shown in Fig. 13.36.

x(0) , û(0) , u(0)

0, 2 → 1 :
x∗ , û∗ 1 : x, û 1 : x, û, u1 1 : x, û, u2 1 : x, û, u3
Optimization

1:
2 : f, g
Functions

1:
2 : r1
Residual 1

1:
2 : r2
Residual 2

1:
2 : r3
Residual 3

Because it solves for all variables simultaneously, the SAND archi- Fig. 13.36 The SAND architecture
lets the optimizer solve for all vari-
tecture can be the most efficient way to get to the optimal solution. In
ables (design, coupling, and state
practice, however, it is unlikely that this is advantageous when efficient variables), and component solvers are
component solvers are available. no longer needed.
The resulting optimization problem is the largest of all MDO archi-
tectures and requires an optimizer that scales well with the number
of variables. Therefore, a gradient-based optimization algorithm is
likely required, in which case the derivative computation must also
be considered. Fortunately, SAND does not require derivatives of the
coupled system or even total derivatives that account for the component
solution; only partial derivatives of residuals are needed.
SAND is an intrusive approach because it requires access to residuals.
These might not be available if components are provided as black boxes.


Rather than computing coupling variables 𝑢ˆ 𝑖 and state variables 𝑢𝑖 by
converging the residuals to zero, each component 𝑖 just computes the
current residuals 𝑟 𝑖 for the current values of the coupling variables 𝑢ˆ
and the component states 𝑢𝑖 .

Example 13.10 Aerostructural optimization using SAND

For the SAND approach, we do away completely with the solvers and let
the optimizer find the states. The resulting problem is as follows:

minimize 𝑓
by varying 𝛼, 𝑏, 𝜃, 𝑡, Γ, 𝑑
subject to 𝐿=𝑊
2.5|𝜎| − 𝜎yield ≤ 0
𝐴Γ − 𝑣 = 0
𝐾𝑑 − 𝑞 = 0.

Instead of being solved separately, the models are now solved by the optimizer.
When using gradient-based optimization, the required derivatives are just
partial derivatives of the residuals (the same partial derivatives we would use
for an implicit analytic method).
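A minimal SAND-style sketch of the same hypothetical toy model used in the earlier sketches is shown below: the optimizer varies the states directly, and the residuals appear as equality constraints. Again, all functions and numbers are made up for illustration.

```python
import numpy as np
from scipy.optimize import minimize

def objective(z):
    x0, x1, u1, u2 = z
    return (u1 - 1.0) ** 2 + (u2 - 2.0) ** 2 + 0.1 * (x0 ** 2 + x1 ** 2)

def residuals(z):
    x0, x1, u1, u2 = z
    return np.array([u1 - (x0 + 0.3 * u2),        # r1(x, u) = 0
                     u2 - (x1 ** 2 - 0.2 * u1)])  # r2(x, u) = 0

res = minimize(objective, np.zeros(4), method="SLSQP",
               constraints=[{"type": "eq", "fun": residuals}])
print(res.x)
```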

13.5 Distributed MDO Architectures

The monolithic MDO architectures we have covered so far form and


solve a single optimization problem. Distributed architectures decom-
pose this single optimization problem into a set of smaller optimization
problems, or disciplinary subproblems, which are then coordinated by a
system-level subproblem. One key requirement for these architectures is
that they must be mathematically equivalent to the original monolithic
problem to converge to the same solution.
There are two primary motivations for distributed architectures.
The first one is the possibility of decomposing the problem to reduce the
computational time. The second motivation is to mimic the structure
of large engineering design teams, where disciplinary groups have the
autonomy to design their subsystems so that MDO is more readily
adopted in industry. Overall, distributed MDO architectures have fallen
short on both of these expectations. Unless a problem has a special
structure, there is no distributed architecture that converges as rapidly
as a monolithic one. In practice, distributed architectures have not been
used much recently.
There are two main types of distributed architectures: those that


enforce multidisciplinary feasibility via an MDA somewhere in the
process and those that enforce multidisciplinary feasibility in some
other way (using constraints or penalties at the system level). This
is analogous to MDF and IDF, respectively, so we name these types
distributed MDF and distributed IDF.∗
∗ Martins and Lambe41 present a more comprehensive description of all MDO architectures, including references to known applications of each architecture.
41. Martins and Lambe, Multidisciplinary design optimization: A survey of architectures, 2013.
In MDO problems, it can be helpful to distinguish between design variables that affect only one component directly (called local design variables) and design variables that affect two or more components directly (called shared design variables). We denote the vector of design variables local to component $i$ by $x_i$ and the shared variables by $x_0$. The full vector of design variables is given by concatenating the shared and local design variables into a single vector $x = \left[x_0^\intercal, x_1^\intercal, \ldots, x_m^\intercal\right]^\intercal$, where $m$ is the number of components.
If a constraint can be computed using a single component and
satisfied by varying only the local design variables for that component,
it is a local constraint; otherwise, it is nonlocal. Similarly to the design variables, we concatenate the constraints as $g = \left[g_0^\intercal, g_1^\intercal, \ldots, g_m^\intercal\right]^\intercal$. The
same distinction could be applied to the objective function, but we do
not usually do this.
The MDO problem representation we use here is shown in Fig. 13.37
for a general three-component system. We use the functional form
introduced in Section 13.2.3, where the states in each component are
hidden. In this form, the system level only has access to the outputs of
each solver, which are the coupling variables and functions of
interest.

[Fig. 13.37: MDO problem nomenclature and dependencies for a general three-component system: each solver takes the shared and local design variables ($x_0, x_i$) and the other coupling variables, and produces its coupling variables $\hat{u}_i$ and local constraints $g_i$; the global constraints $g_0$ and the objective $f$ depend on the design variables and all coupling variables.]
The set of constraints is also split into shared constraints and local
ones. Local constraints are computed by the corresponding component
and depend only on the variables available in that component. Shared
constraints depend on more than one set of coupling variables. These
dependencies are also shown in Fig. 13.37.

13.5.1 Collaborative Optimization


The collaborative optimization (CO) architecture is inspired by how
disciplinary teams work to design complex engineered systems.212
212. Braun and Kroo, Development and application of the collaborative optimization architecture in a multidisciplinary design environment, 1997.
This is a distributed IDF architecture, where the disciplinary optimization problems are formulated to be independent of each other by using

target values of the coupling and shared design variables. These target
values are then shared with all disciplines during every iteration of
the solution procedure. The complete independence of disciplinary
subproblems combined with the simplicity of the data-sharing protocol
makes this architecture attractive for problems with a small amount of
shared data.
The system-level subproblem modifies the original problem as
follows: (1) local constraints are removed, (2) target coupling variables,
𝑢ˆ 𝑡 , are added as design variables, and (3) a consistency constraint is
added. This optimization problem can be written as follows:

minimize $f\left(x_0, x_1^t, \ldots, x_m^t, \hat{u}^t\right)$
by varying $x_0, x_1^t, \ldots, x_m^t, \hat{u}^t$
subject to $g_0\left(x_0, x_1^t, \ldots, x_m^t, \hat{u}^t\right) \leq 0$
$J_i^* = \left\| x_{0i}^t - x_0 \right\|_2^2 + \left\| x_i^t - x_i \right\|_2^2 + \left\| \hat{u}_i^t - \hat{u}_i\left(x_{0i}^t, x_i, \hat{u}^t_{j\neq i}\right) \right\|_2^2 = 0 \quad \text{for } i = 1, \ldots, m ,$    (13.32)

where $x_{0i}^t$
are copies of the shared design variables that are passed to
discipline 𝑖, and 𝑥 𝑖𝑡 are copies of the local design variables passed to
the system subproblem.
The constraint function 𝐽𝑖∗ is a measure of the inconsistency between
the values requested by the system-level subproblem and the results
from the discipline 𝑖 subproblem. The disciplinary subproblems do not
include the original objective function. Instead, the objective of each
subproblem is to minimize the inconsistency function.
For each discipline 𝑖, the subproblem is as follows:



minimize $J_i\left(x_{0i}^t, x_i;\ \hat{u}_i\right)$
by varying $x_{0i}^t, x_i$
subject to $g_i\left(x_{0i}^t, x_i;\ \hat{u}_i\right) \leq 0$    (13.33)
while solving $r_i\left(\hat{u}_i;\ x_{0i}^t, x_i, \hat{u}^t_{j\neq i}\right) = 0$
for $\hat{u}_i$.
These subproblems are independent of each other and can be solved
in parallel. Thus, the system-level subproblem is responsible for
minimizing the design objective, whereas the discipline subproblems
minimize system inconsistency while satisfying local constraints.
The CO problem statement has been shown to be mathematically
equivalent to the original MDO problem.212
212. Braun and Kroo, Development and application of the collaborative optimization architecture in a multidisciplinary design environment, 1997.
There are two versions of the CO architecture: CO$_1$ and CO$_2$. Here, we only present the CO$_2$ version. The XDSM for CO is shown in Fig. 13.38 and the procedure is
detailed in Alg. 13.5.

[Fig. 13.38: XDSM diagram for the CO architecture.]

CO has the organizational advantage of having entirely separate disciplinary subproblems. This is desirable when designers in each
discipline want to maintain some autonomy. However, the CO for-
mulation suffers from numerical ill-conditioning. This is because the


constraint gradients of the system problem at an optimal solution are
all zero vectors, which violates the constraint qualification requirement
for the Karush–Kuhn–Tucker (KKT) conditions (see Section 5.3.1). This
ill-conditioning slows down convergence when using a gradient-based
optimization algorithm or prevents convergence altogether.

Algorithm 13.5 Collaborative optimization

Inputs:
𝑥: Initial design variables
Outputs:
𝑥 ∗ : Optimal variables
𝑓 ∗ : Optimal objective value
𝑔 ∗ : Optimal constraint values

0: Initiate system optimization iteration


repeat
1: Compute system subproblem objectives and constraints
for Each discipline 𝑖 (in parallel) do
1.0: Initiate disciplinary subproblem optimization
repeat
1.1: Evaluate disciplinary analysis
1.2: Compute disciplinary subproblem objective and constraints
1.3: Compute new disciplinary subproblem design point and 𝐽𝑖
until 1.3 → 1.1: Optimization 𝑖 has converged
end for
2: Compute a new system subproblem design point
until 2 → 1: System optimization has converged

Example 13.11 Aerostructural optimization using CO

To apply CO to the wing aerostructural design optimization problem


(Ex. 13.1), we need to set up a system-level optimization problem and two
discipline-level optimization subproblems.
The system-level optimization problem is formulated as follows:
minimize 𝑓
by varying 𝑏 𝑡 , Γ𝑡 , 𝑑 𝑡 , 𝑊 𝑡
subject to 𝐽1∗ ≤ 𝜀
𝐽2∗ ≤ 𝜀,
where 𝜀 is a specified convergence tolerance. The set of variables that are copied
as targets includes the shared design variable (𝑏) and the coupling variables (Γ
and 𝑑).
The aerodynamics subproblem is as follows:

minimize $J_1 \equiv \left(1 - \dfrac{b}{b^t}\right)^2 + \displaystyle\sum_{i=1}^{n_\Gamma} \left(1 - \dfrac{\Gamma_i}{\Gamma_i^t}\right)^2$
by varying $b, \alpha, \theta$
subject to $L - W^t = 0$
while solving $A\Gamma - v = 0$
for $\Gamma$.

In this problem, the aerodynamic optimization minimizes the discrepancy


between the span requested by the system-level optimization (𝑏 𝑡 ) and the span
that aerodynamics is optimizing (𝑏). The same applies to the coupling variables
Γ. The aerodynamics subproblem is fully responsible for optimizing 𝛼 and 𝜃.
The structures subproblem is as follows:

minimize $J_2 \equiv \left(1 - \dfrac{b}{b^t}\right)^2 + \displaystyle\sum_{i=1}^{n_d} \left(1 - \dfrac{d_i}{d_i^t}\right)^2 + \left(1 - \dfrac{W}{W^t}\right)^2$
by varying $b, t$
subject to $2.5|\sigma| - \sigma_\text{yield} \leq 0$
while solving $Kd - q = 0$
for $d$.

Here, the structural optimization minimizes the discrepancy between the span
wanted by the structures (a decrease) versus what the system level requests
(which takes into account the opposite trend from aerodynamics). The structural
subproblem is fully responsible for satisfying the stress constraints by changing
the structural sizing 𝑡, which are local variables.
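The nested structure of CO can also be sketched on a small made-up analytic problem with one shared variable, one local variable per discipline, and no coupling variables. Every system-level function evaluation below triggers the two discipline subproblems, which minimize their inconsistency measures subject to their local constraints; the problem, functions, and tolerance are all invented for illustration.

```python
import numpy as np
from scipy.optimize import minimize

def discipline1(x0, x1t):
    # Minimize inconsistency subject to the local constraint x01 + x1 >= 1
    res = minimize(lambda v: (v[0] - x0) ** 2 + (v[1] - x1t) ** 2,
                   np.array([x0, x1t]), method="SLSQP",
                   constraints=[{"type": "ineq", "fun": lambda v: v[0] + v[1] - 1.0}])
    return res.fun  # J1*

def discipline2(x0, x2t):
    # Local constraint: x02 - x2 <= 2
    res = minimize(lambda v: (v[0] - x0) ** 2 + (v[1] - x2t) ** 2,
                   np.array([x0, x2t]), method="SLSQP",
                   constraints=[{"type": "ineq", "fun": lambda v: 2.0 - (v[0] - v[1])}])
    return res.fun  # J2*

def system_objective(z):
    x0, x1t, x2t = z
    return (x0 - 2.0) ** 2 + x1t ** 2 + x2t ** 2

# Relaxed consistency tolerance (as in Ex. 13.11) to soften the zero-gradient
# equality constraints that cause CO's known ill-conditioning.
eps = 1e-6
cons = [{"type": "ineq", "fun": lambda z: eps - discipline1(z[0], z[1])},
        {"type": "ineq", "fun": lambda z: eps - discipline2(z[0], z[2])}]
res = minimize(system_objective, np.zeros(3), method="SLSQP", constraints=cons)
print(res.x)
```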

13.5.2 Analytical Target Cascading

Analytical target cascading (ATC) is a distributed IDF architecture that uses penalties in the objective function to minimize the difference between the target variables requested by the system-level optimization and the actual variables computed by each discipline.†
† ATC was originally developed as a method to handle design requirements in a system's hierarchical decomposition.213 ATC became an MDO architecture after further developments.214 A MATLAB implementation of ATC is available.215
213. Kim et al., Analytical target cascading in automotive vehicle design, 2003.
214. Tosserams et al., An augmented Lagrangian relaxation for analytical target cascading using the alternating direction method of multipliers, 2006.
215. Talgorn and Kokkolaras, Compact implementation of non-hierarchical analytical target cascading for coordinating distributed multidisciplinary design optimization problems, 2017.
The idea of ATC is similar to the CO architecture in the previous section, except that ATC uses penalties instead of a constraint. The ATC system-level problem is as follows:

minimize $f_0\left(x, \hat{u}^t\right) + \displaystyle\sum_{i=1}^{m} \Phi_i\left(x_{0i}^t - x_0,\ \hat{u}_i^t - \hat{u}_i\left(x_0, x_i, \hat{u}^t\right)\right) + \Phi_0\left(g_0\left(x, \hat{u}^t\right)\right)$    (13.34)
by varying $x_0, \hat{u}^t$ ,
where Φ0 is a penalty relaxation of the shared design constraints, and


Φ𝑖 is a penalty relaxation of the discipline 𝑖 consistency constraints.
Although the most common penalty functions in ATC are quadratic
penalty functions, other penalty functions are possible. As mentioned
in Section 5.4, penalty methods require a good selection of the penalty
weight values to converge quickly and accurately enough. The ATC
architecture converges to the same optimum as other MDO architectures,
provided that the problem is unimodal and all the penalty terms in the
optimization problems approach zero.
Figure 13.39 shows the ATC architecture XDSM, where 𝑤 denotes
the penalty function weights used in the determination of Φ0 and Φ𝑖 .
The details of ATC are described in Alg. 13.6.

[Fig. 13.39: XDSM diagram for the ATC architecture.]
The $i$th discipline subproblem is as follows:

minimize $f_0\left(x_{0i}^t, x_i;\ \hat{u}_i, \hat{u}^t_{j\neq i}\right) + f_i\left(x_{0i}^t, x_i;\ \hat{u}_i\right) + \Phi_i\left(\hat{u}_i^t - \hat{u}_i,\ x_{0i}^t - x_0\right) + \Phi_0\left(g_0\left(x_{0i}^t, x_i;\ \hat{u}_i, \hat{u}^t_{j\neq i}\right)\right)$
by varying $x_{0i}^t, x_i$    (13.35)
subject to $g_i\left(x_{0i}^t, x_i;\ \hat{u}_i\right) \leq 0$
while solving $r_i\left(\hat{u}_i;\ x_{0i}^t, x_i, \hat{u}^t_{j\neq i}\right) = 0$
for $\hat{u}_i$.
The most common penalty functions used in ATC are quadratic
penalty functions (see Section 5.4.1). Appropriate penalty weights are
important for multidisciplinary consistency and convergence.
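For concreteness, a quadratic penalty of the kind mentioned here could take the following generic form, where the elementwise weight vectors $w_{x,i}$ and $w_{u,i}$ and the elementwise product $\circ$ are our notation for this sketch (they are not defined in the text):
$$\Phi_i\left(x_{0i}^t - x_0,\ \hat{u}_i^t - \hat{u}_i\right) = \left\| w_{x,i} \circ \left(x_{0i}^t - x_0\right) \right\|_2^2 + \left\| w_{u,i} \circ \left(\hat{u}_i^t - \hat{u}_i\right) \right\|_2^2 ,$$
with the weights increased between ATC iterations (Step 8 of Alg. 13.6) until the consistency terms are driven sufficiently close to zero.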

Algorithm 13.6 Analytical target cascading

Inputs:
𝑥: Initial design variables
Outputs:
𝑥 ∗ : Optimal variables
𝑓 ∗ : Optimal objective value
𝑐 ∗ : Optimal constraint values

0: Initiate main ATC iteration


repeat
for Each discipline 𝑖 do
1: Initiate discipline optimizer
repeat
2: Evaluate disciplinary analysis
3: Compute discipline objective and constraint functions and
penalty function values
4: Update discipline design variables
until 4 → 2: Discipline optimization has converged
end for
5: Initiate system optimizer
repeat
6: Compute system objective, constraints, and all penalty functions
7: Update system design variables and coupling targets.
until 7 → 6: System optimization has converged
8: Update penalty weights
until 8 → 1: Penalty weights are large enough
13.5.3 Bilevel Integrated System Synthesis


Bilevel integrated system synthesis (BLISS) uses a series of linear ap-
proximations to the original design problem, with bounds on the design
variable steps to prevent the design point from moving so far away
that the approximations are too inaccurate.216
216. Sobieszczanski–Sobieski et al., Bilevel integrated system synthesis for concurrent and distributed processing, 2003.
This is an idea similar to that of the trust-region methods in Section 4.5. These approxima-
tions are constructed at each iteration using coupled derivatives (see
Section 13.3).
BLISS optimizes the local design variables within the discipline
subproblems and the shared variables at the system level. The approach
consists of using a series of linear approximations to the original
optimization problem with limits on the design variable steps to stay
within the region where the linear prediction yields the correct trend.
This idea is similar to that of trust-region methods (see Section 4.5).
The system-level subproblem is formulated as follows:

minimize $\left(f_0^*\right)_0 + \dfrac{\mathrm{d} f_0^*}{\mathrm{d} x_0}\, \Delta x_0$
by varying $\Delta x_0$
subject to $\left(g_0^*\right)_0 + \dfrac{\mathrm{d} g_0^*}{\mathrm{d} x_0}\, \Delta x_0 \leq 0$    (13.36)
$\left(g_i^*\right)_0 + \dfrac{\mathrm{d} g_i^*}{\mathrm{d} x_0}\, \Delta x_0 \leq 0 \quad \text{for } i = 1, \ldots, m$
$\underline{\Delta x_0} \leq \Delta x_0 \leq \overline{\Delta x_0}$ .
The linearization is performed at each iteration using coupled derivative
computation (see Section 13.3). The discipline 𝑖 subproblem is given by
the following:

minimize $\left(f_0\right)_0 + \dfrac{\mathrm{d} f_0}{\mathrm{d} x_i}\, \Delta x_i$
by varying $\Delta x_i$
subject to $\left(g_0\right)_0 + \dfrac{\mathrm{d} g_0}{\mathrm{d} x_i}\, \Delta x_i \leq 0$    (13.37)
$\left(g_i\right)_0 + \dfrac{\mathrm{d} g_i}{\mathrm{d} x_i}\, \Delta x_i \leq 0$
$\underline{\Delta x_i} \leq \Delta x_i \leq \overline{\Delta x_i}$ .
The extra set of constraints in both system-level and discipline subprob-
lems denotes the design variable bounds.
To prevent violation of the disciplinary constraints by changes in
the shared design variables, post-optimality derivatives are required
to solve the system-level subproblem. In this case, the post-optimality


derivatives quantify the change in the optimized disciplinary constraints
with respect to a change in the system design variables, which can be
estimated with the Lagrange multipliers of the active constraints (see
Sections 5.3.3 and 5.3.4).
Figure 13.40 shows the XDSM for BLISS, and the corresponding
steps are listed in Alg. 13.7. Because BLISS uses an MDA, it is a
distributed MDF architecture. As a result of the linear nature of the
optimization problems, repeated interrogation of the objective and
constraint functions is not necessary once we have the gradients. If the
underlying problem is highly nonlinear, the algorithm may converge
slowly. The variable bounds may help the convergence if these bounds
are properly chosen, such as through a trust-region framework.

Algorithm 13.7 Bilevel integrated system synthesis

Inputs:
𝑥: Initial design variables
Outputs:
𝑥 ∗ : Optimal variables
𝑓 ∗ : Optimal objective value
𝑐 ∗ : Optimal constraint values

0: Initiate system optimization


repeat
1: Initiate MDA
repeat
2: Evaluate discipline analyses
3: Update coupling variables
until 3 → 2: MDA has converged
4: Initiate parallel discipline optimizations
for Each discipline 𝑖 do
5: Evaluate discipline analysis
6: Compute objective and constraint function values and derivatives
with respect to local design variables
7: Compute the optimal solutions for the disciplinary subproblem
end for
8: Initiate system optimization
9: Compute objective and constraint function values and derivatives with
respect to shared design variables using post-optimality analysis
10: Compute optimal solution to system subproblem
until 11 → 1: System optimization has converged
[Fig. 13.40: XDSM diagram for the BLISS architecture.]

13.5.4 Asymmetric Subspace Optimization

Asymmetric subspace optimization (ASO) is a distributed MDF archi-
Asymmetric subspace optimization (ASO) is a distributed MDF archi-
tecture motivated by cases where there is a large discrepancy between
the cost of the disciplinary solvers. The cheaper disciplinary analyses
are replaced by disciplinary design optimizations inside the overall
MDA to reduce the number of more expensive disciplinary analyses.
The system-level optimization subproblem is as follows:

minimize $f(x;\ \hat{u})$
by varying $x_0, x_k$
subject to $g_0(x;\ \hat{u}) \leq 0$
$g_k(x;\ \hat{u}_k) \leq 0 \quad \text{for all } k,$    (13.38)
while solving $r_k\left(\hat{u}_k;\ x_k, \hat{u}^t_{j\neq k}\right) = 0$
for $\hat{u}_k$.
The subscript 𝑘 denotes disciplinary information that remains outside


of the MDA. The disciplinary optimization subproblem for discipline 𝑖,
which is resolved inside the MDA, is as follows:

minimize $f(x;\ \hat{u})$
by varying $x_i$
subject to $g_i\left(x_0, x_i;\ \hat{u}_i\right) \leq 0$    (13.39)
while solving $r_i\left(\hat{u}_i;\ x_i, \hat{u}^t_{j\neq i}\right) = 0$
for $\hat{u}_i$.

Figure 13.41 shows a three-discipline case where the third discipline


is replaced with an optimization subproblem. ASO is detailed in Alg. 13.8.
To solve the system-level problem with a gradient-based optimizer, we
require post-optimality derivatives of the objective and constraints with
respect to the subproblem inputs (see Section 5.3.4).

Algorithm 13.8 ASO

Inputs:
𝑥: Initial design variables
Outputs:
𝑥 ∗ : Optimal variables
𝑓 ∗ : Optimal objective value
𝑐 ∗ : Optimal constraint values

0: Initiate system optimization


repeat
1: Initiate MDA
repeat
2: Evaluate analysis 1
3: Evaluate analysis 2
4: Initiate optimization of discipline 3
repeat
5: Evaluate analysis 3
6: Compute discipline 3 objectives and constraints
7: Update local design variables
until 7 → 5: Discipline 3 optimization has converged
8: Update coupling variables
until 8 → 2 MDA has converged
9: Compute objective and constraint function values for all disciplines 1
and 2
10: Update design variables
until 10 → 1: System optimization has converged
[Fig. 13.41: XDSM diagram for the ASO architecture.]

For a gradient-based system-level optimizer, the gradients of the objective and constraints must take into account the suboptimization.
This requires coupled post-optimality derivative computation, which
increases computational and implementation time costs compared
with a normal coupled derivative computation. The total optimization
cost is only competitive with MDF if the discrepancy in cost between the disciplinary solvers is large enough.

Example 13.12 Aerostructural optimization using ASO

Aerostructural optimization is an example of asymmetry in the cost of


the models. When the aerodynamic model consists of computational fluid
dynamics, it is usually much more expensive than a finite-element structural
model. If that is the case, we might be able to solve a structural sizing
optimization in parallel within the time required for an aerodynamic analysis.
In this example, we formulate the system-level optimization problem as
follows:

minimize $f$
by varying $b, \theta$
subject to $L - W^* = 0$
while solving $A\left(d^*\right)\Gamma - v\left(d^*\right) = 0$
for $\Gamma$ ,

where $W^*$ and $d^*$ correspond to values obtained from the structural subopti-
mization. The suboptimization is formulated as follows:

minimize 𝑓
by varying 𝑡
subject to 2.5|𝜎| − 𝜎yield ≤ 0
while solving 𝐾𝑑 − 𝑞 = 0
for 𝑑.
Similar to the sequential optimization, we could replace 𝑓 with 𝑊 in the
suboptimization because the other parameters in 𝑓 are fixed. To solve the
system-level problem with a gradient-based optimizer, we would need post-
optimality derivatives of 𝑊 ∗ with respect to span and Γ.

13.5.5 Other Distributed Architectures


There are other distributed MDF architectures in addition to BLISS
and ASO: concurrent subspace optimization (CSSO) and MDO of
independent subspaces (MDOIS).41
41. Martins and Lambe, Multidisciplinary design optimization: A survey of architectures, 2013.
CSSO requires surrogate models for the analyses for all disciplines. The system-level optimization subproblem is solved based on the

surrogate models and is therefore fast. The discipline-level optimization


subproblem uses the actual analysis from the corresponding discipline
and surrogate models for all other disciplines. The solutions for each
discipline subproblem are used to update the surrogate models.
MDOIS only applies when no shared variables exist. In this case, dis-
cipline subproblems are solved independently, assuming fixed coupling
variables, and then an MDA is performed to update the coupling.
There are also other distributed IDF architectures. Some of these are
similar to CO in that they use a multilevel approach to enforce multi-
disciplinary feasibility: BLISS-2000 and quasi-separable decomposition
(QSD). Other architectures enforce multidisciplinary feasibility with
penalties, like ATC: inexact penalty decomposition (IPD), exact penalty
decomposition (EPD), and enhanced collaborative optimization (ECO).
BLISS-2000 is a variation of BLISS that uses surrogate models to
represent the coupling variables for all disciplines. Each discipline
subproblem minimizes the linearized objective with respect to local


variables subject to local constraints. The system-level subproblem min-
imizes the objective with respect to the shared variables and coupling
variables while enforcing consistency constraints.
When using QSD, the objective and constraint functions are assumed
to depend only on the shared design variables and coupling variables.
Each discipline is assigned a “budget” for a local objective, and the
discipline problems maximize the margin in their local constraints and
the budgeted objective. The system-level subproblem minimizes the
objective and budgets of each discipline while enforcing the shared
constraints and a positive margin for each discipline.
IPD and EPD apply to MDO problems with no shared objectives
or constraints. They are similar to ATC in that copies of the shared
variables are used for every discipline subproblem, and the consistency
constraints are relaxed with a penalty function. Unlike ATC, however,
the more straightforward structure of the discipline subproblems is
exploited to compute post-optimality derivatives to guide the system-
level optimization subproblem.
Like CO, ECO uses copies of the shared variables. The discipline
subproblems minimize quadratic approximations of the objective while
enforcing local constraints and linear models of the nonlocal constraints.
The system-level subproblem minimizes the total violation of all con-
sistency constraints with respect to the shared variables.

13.6 Summary

MDO architectures provide different options for solving MDO problems.


An acceptable MDO architecture must be mathematically equivalent
to the original problem and converge to the same optima. Sequential
optimization, although intuitive, is not mathematically equivalent to
the original problem and yields a design inferior to the MDO optimum.
MDO architectures are divided into two broad categories: mono-
lithic architectures and distributed architectures. Monolithic archi-
tectures solve a single optimization problem, whereas distributed
architectures solve optimization subproblems for each discipline and a
system-level optimization problem. Overall, monolithic architectures
exhibit a much better convergence rate than distributed architectures.217 In the last few years, the vast majority of MDO applications have used monolithic MDO architectures. The MAUD architecture, which can implement MDF, IDF, or a hybrid of the two, successfully solves large-scale MDO problems.132

217. Tedford and Martins, Benchmarking multidisciplinary design optimization algorithms, 2010.
132. Gray et al., OpenMDAO: An open-source framework for multidisciplinary design, analysis, and optimization, 2019.
[Fig. 13.42 Classification of MDO architectures. Monolithic: MDF/MAUD, IDF, SAND. Distributed MDF: BLISS, CSSO, MDOIS, ASO. Distributed IDF, multilevel: CO, QSD; penalty: ATC, IPD/EPD, ECO.]

The distributed architectures can be divided according to whether or not they enforce multidisciplinary feasibility (through an MDA of the whole system), as shown in Fig. 13.42. Distributed MDF architectures enforce multidisciplinary feasibility through an MDA. The distributed IDF architectures are like IDF in that no MDA is required. However, they must ensure multidisciplinary feasibility in some other way. Some do this by formulating an appropriate multilevel optimization (such as CO), and others use penalties to ensure this (such as ATC).∗

∗ Martins and Lambe41 describe all of these MDO architectures in detail.
41. Martins and Lambe, Multidisciplinary design optimization: A survey of architectures, 2013.

Several commercial MDO frameworks are available, including Isight/SEE218 by Dassault Systèmes, ModelCenter/CenterLink by Phoenix Integration, modeFRONTIER by Esteco, AML Suite by TechnoSoft, Optimus by Noesis Solutions, and VisualDOC by Vanderplaats Research and Development.219 These frameworks focus on making it easy for users to couple multiple disciplines and use the optimization algorithms through graphical user interfaces. They also provide convenient wrappers to popular commercial engineering tools. Typically, these frameworks use fixed-point iteration to converge the MDA. When derivatives are needed for a gradient-based optimizer, finite-difference approximations are used rather than more accurate analytic derivatives.

218. Golovidov et al., Flexible implementation of approximation concepts in an MDO framework, 1998.
219. Balabanov et al., VisualDOC: A software system for general purpose integration and design optimization, 2002.
Problems

13.1 Answer true or false and justify your answer.

a. We prefer to use the term component instead of discipline


because it is more general.
b. Local design variables affect only one discipline in the MDO
problem, whereas global variables affect all disciplines.
c. All multidisciplinary models can be written in the functional
form, but not all can be written in the residual form.
d. The coupling variables are a subset of component state
variables.
e. Multidisciplinary models can be represented by directed
cyclic graphs, where the nodes represent components and
the edges represent coupling variables.
f. The nonlinear block Jacobi and Gauss–Seidel methods can
be used with any combination of component solvers.
g. All the derivative computation methods from Chapter 6 can
be implemented for coupled multidisciplinary systems.
h. Implicit analytic methods for derivative computation are
incompatible with the functional form of multidisciplinary
models.
i. The MAUD approach is based on the UDE.
j. The MDF architecture has fewer design variables and more
constraints than IDF.
k. The main difference between monolithic and distributed
MDO architectures is that the distributed architectures per-
form optimization at multiple levels.
l. Sequential optimization is a valid MDO approach, but the
main disadvantage is that it converges slowly.

13.2 Pick a multidisciplinary engineering system from the literature


or formulate one based on your experience.

a. Identify the different analyses and coupling variables.


b. List the design variables and classify them as local or global.
c. Identify the objective and constraint functions.
d. Draw a diagram similar to the one in Fig. 13.37 for your
system.
e. Exploration: Think about the objective that each discipline


would have if considered separately, and discuss the trades
needed to optimize the multidisciplinary objective.

13.3 Consider the DSMs that follow. For each case, what is the lowest
number of feedback loops you can achieve through reordering?
What hierarchy of solvers would you recommend to solve the
coupled problem for each case?

a. [DSM (a) — diagram not reproduced; the first component is A]

b. [DSM (b) — diagram not reproduced; the first component is A]

c. [DSM (c) — diagram not reproduced; the first component is A]

13.4 Consider the “spaghetti” diagram shown in Fig. 13.43. Draw the
equivalent DSM for these dependencies. How can you exploit
the structure in these dependencies? What hierarchy of solvers
would you recommend to solve a coupled system with these
dependencies?

[Fig. 13.43 Graph of dependencies among components A, B, C, D, E, and F.]
13.5 Let us solve a simplified wing aerostructural problem based on


simple equations for the aerodynamics and structures. We reuse
the wing design problem described in Appendix D.1.6, but with
a few modifications.
Suppose the lift coefficient now depends on the wing deflection:

$$C_L = C_{L_0} - C_{L,\theta}\,\theta \, ,$$

where $\theta$ is the angle of deflection at the wing tip. Use $C_{L_0} = 0.4$ and $C_{L,\theta} = 0.1\,\mathrm{rad}^{-1}$. The deflection also depends on the lift. We compute $\theta$ assuming a uniform lift distribution and using simple beam bending theory as

$$\theta = \frac{(L/b)(b/2)^3}{6EI} = \frac{L b^2}{48 E I} \, .$$

The Young's modulus is $E = 70\,\mathrm{GPa}$. Use the H-shaped cross-section described in Prob. 5.17 to compute the second moment of inertia, $I$.

[Fig. 13.44 The aerostructural model couples aerodynamics and structures through lift and wing deflection: aerodynamics, $L = \tfrac{1}{2}\rho v^2 S\, C_L(\theta)$; structures, $\theta = \tfrac{L b^2}{48 E I}$.]
We add the flight speed 𝑣 to the set of design variables and
handle 𝐿 = 𝑊 as a constraint. The objective of the aerostructural
optimization problem is to minimize the power with respect to
𝑥 = [𝑏, 𝑐, 𝑣], subject to 𝐿 = 𝑊.
Solve this problem using MDF, IDF, and a distributed MDO
architecture. Compare the aerostructural optimal solution with
the original solution from Appendix D.1.6 and discuss your
results.
A Mathematics Background
This appendix briefly reviews various mathematical concepts used
throughout the book.

A.1 Taylor Series Expansion

Series expansions are representations of a given function in terms


of a series of other (usually simpler) functions. One common series
expansion is the Taylor series, which is expressed as a polynomial whose
coefficients are based on the derivatives of the original function at a
fixed point.
The Taylor series is a general tool that can be applied whenever
the function has derivatives. We can use this series to estimate the
value of the function near the given point, which is useful when the
function is difficult to evaluate directly. The Taylor series is used to
derive algorithms for finding the zeros of functions and algorithms for
minimizing functions in Chapters 4 and 5.
To derive the Taylor series, we start with an infinite polynomial
series about an arbitrary point, 𝑥, to approximate the value of a function
at 𝑥 + Δ𝑥 using

𝑓 (𝑥 + Δ𝑥) = 𝑎0 + 𝑎1 Δ𝑥 + 𝑎2 Δ𝑥 2 + . . . + 𝑎 𝑘 Δ𝑥 𝑘 + . . . . (A.1)

We can make this approximation exact at Δ𝑥 = 0 by setting the first


coefficient to 𝑓 (𝑥). To find the appropriate value for 𝑎1 , we take the first
derivative to get

$$f'(x + \Delta x) = a_1 + 2 a_2 \Delta x + \ldots + k a_k \Delta x^{k-1} + \ldots , \quad (A.2)$$

which means that we need $a_1 = f'(x)$ to obtain an exact derivative at $x$. To derive the other coefficients, we systematically take the derivative of both sides and find the appropriate value from the first nonzero term (which is always constant). Identifying the pattern yields the general formula for the $k$th-order coefficient:

$$a_k = \frac{f^{(k)}(x)}{k!} . \quad (A.3)$$


Substituting this into the polynomial in Eq. A.1 yields the Taylor series

$$f(x + \Delta x) = \sum_{k=0}^{\infty} f^{(k)}(x) \frac{\Delta x^k}{k!} . \quad (A.4)$$

The series is typically truncated to use terms up to order $m$,

$$f(x + \Delta x) = \sum_{k=0}^{m} f^{(k)}(x) \frac{\Delta x^k}{k!} + \mathcal{O}\left(\Delta x^{m+1}\right) , \quad (A.5)$$

which yields an approximation with a truncation error of order 𝒪(Δ𝑥 𝑚+1 ).


In optimization, it is common to use the first three terms (up to 𝑚 = 2)
to get a quadratic approximation.

Example A.1 Taylor series expansion for single variable

Consider the scalar function of a single variable, $f(x) = x - 4\cos(x)$. If we use Taylor series expansions of this function about $x = 0$, we get

$$f(\Delta x) = -4 + \Delta x + 2\Delta x^2 - \frac{1}{6}\Delta x^4 + \frac{1}{180}\Delta x^6 - \ldots .$$

Four different truncations of this series are plotted and compared to the exact function in Fig. A.1.

[Fig. A.1 Taylor series expansions for the one-dimensional example. The more terms we consider from the Taylor series, the better the approximation.]

The Taylor series in multiple dimensions is similar to the single-variable case but more complicated. The first derivative of the function becomes a gradient vector, and the second derivatives become a Hessian matrix. Also, we need to define a direction along which we want to approximate the function because that information is not inherent like it is in a one-dimensional function. The Taylor series expansion in $n$ dimensions along a direction $p$ can be written as

$$f(x + \alpha p) = f(x) + \alpha \sum_{k=1}^{n} \frac{\partial f}{\partial x_k} p_k + \frac{1}{2} \alpha^2 \sum_{k=1}^{n} \sum_{l=1}^{n} \frac{\partial^2 f}{\partial x_k \partial x_l} p_k p_l + \mathcal{O}\left(\alpha^3\right) , \quad (A.6)$$

where $\alpha$ is a scalar that determines how far to go in the direction $p$. In matrix form, we can write

$$f(x + \alpha p) = f(x) + \alpha \nabla f(x)^\intercal p + \frac{1}{2} \alpha^2 p^\intercal H(x) p + \mathcal{O}\left(\alpha^3\right) , \quad (A.7)$$
where 𝐻 is the Hessian matrix.
Example A.2 Taylor series expansion for two variables

Consider the following function of two variables:

$$f(x_1, x_2) = (1 - x_1)^2 + (1 - x_2)^2 + \frac{1}{2}\left(2 x_2 - x_1^2\right)^2 .$$

Performing a Taylor series expansion about $x = [0, -2]$, we get

$$f(x + \alpha p) = 18 + \alpha \begin{bmatrix} -2 & -14 \end{bmatrix} p + \frac{1}{2} \alpha^2 p^\intercal \begin{bmatrix} 10 & 0 \\ 0 & 6 \end{bmatrix} p .$$

The original function, the linear approximation, and the quadratic approximation are compared in Fig. A.2.

[Fig. A.2 Taylor series approximations for the two-dimensional example: original function, linear approximation, and quadratic approximation.]
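As a quick numerical check of Example A.2 (not from the text; the step size and variable names are illustrative), the following Python sketch evaluates the gradient and Hessian of this function at $x = [0, -2]$ with finite differences and compares them to the values used in the quadratic approximation above.

```python
import numpy as np

def f(x):
    # Two-variable function from Example A.2
    return (1 - x[0])**2 + (1 - x[1])**2 + 0.5 * (2 * x[1] - x[0]**2)**2

x0 = np.array([0.0, -2.0])
h = 1e-5  # finite-difference step (illustrative choice)

# Central-difference gradient
grad = np.zeros(2)
for i in range(2):
    e = np.zeros(2); e[i] = h
    grad[i] = (f(x0 + e) - f(x0 - e)) / (2 * h)

# Central-difference Hessian
H = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        ei = np.zeros(2); ei[i] = h
        ej = np.zeros(2); ej[j] = h
        H[i, j] = (f(x0 + ei + ej) - f(x0 + ei - ej)
                   - f(x0 - ei + ej) + f(x0 - ei - ej)) / (4 * h**2)

print(f(x0))   # 18.0
print(grad)    # approximately [-2, -14]
print(H)       # approximately [[10, 0], [0, 6]]
```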

A.2 Chain Rule, Total Derivatives, and Differentials

The single-variable chain rule is needed for differentiating composite


functions. Given a composite function, 𝑓 (𝑔(𝑥)), the derivative with
respect to the variable 𝑥 is

$$\frac{\mathrm{d}}{\mathrm{d}x}\left( f(g(x)) \right) = \frac{\mathrm{d}f}{\mathrm{d}g} \frac{\mathrm{d}g}{\mathrm{d}x} . \quad (A.8)$$

Example A.3 Single-variable chain rule

Let $f(g(x)) = \sin\left(x^2\right)$. In this case, $f(g) = \sin(g)$, and $g(x) = x^2$. The derivative with respect to $x$ is

$$\frac{\mathrm{d}}{\mathrm{d}x}\left( f(g(x)) \right) = \frac{\mathrm{d}}{\mathrm{d}g}\left( \sin(g) \right) \frac{\mathrm{d}}{\mathrm{d}x}\left( x^2 \right) = \cos\left(x^2\right) (2x) .$$
If a function depends on more than one variable, then we need


to distinguish between partial and total derivatives. For example, if
𝑓 (𝑔(𝑥), ℎ(𝑥)), then 𝑓 is a function of two variables: 𝑔 and ℎ. The
application of the chain rule for this function is

$$\frac{\mathrm{d}}{\mathrm{d}x}\left( f(g(x), h(x)) \right) = \frac{\partial f}{\partial g} \frac{\mathrm{d}g}{\mathrm{d}x} + \frac{\partial f}{\partial h} \frac{\mathrm{d}h}{\mathrm{d}x} , \quad (A.9)$$

where 𝜕/𝜕𝑥 indicates a partial derivative, and d/d𝑥 is a total derivative.


When taking a partial derivative, we take the derivative with respect
to only that variable, treating all other variables as constants. More
generally,
$$\frac{\mathrm{d}}{\mathrm{d}x}\left( f(g_1(x), \ldots, g_n(x)) \right) = \sum_{i=1}^{n} \left( \frac{\partial f}{\partial g_i} \frac{\mathrm{d}g_i}{\mathrm{d}x} \right) . \quad (A.10)$$

Example A.4 Partial versus total derivatives

Consider 𝑓 (𝑥, 𝑦(𝑥)) = 𝑥 2 + 𝑦 2 , where 𝑦(𝑥) = sin(𝑥). The partial derivative


of 𝑓 with respect to 𝑥 is
$$\frac{\partial f}{\partial x} = 2x ,$$

whereas the total derivative of $f$ with respect to $x$ is

$$\frac{\mathrm{d}f}{\mathrm{d}x} = \frac{\partial f}{\partial x} + \frac{\partial f}{\partial y} \frac{\mathrm{d}y}{\mathrm{d}x} = 2x + 2y \cos(x) = 2x + 2 \sin(x) \cos(x) .$$

Notice that the partial derivative and total derivative are quite different. For this
simple case, we could also find the total derivative by direct substitution and
then using an ordinary one-dimensional derivative. Substituting 𝑦(𝑥) = sin(𝑥)
directly into the original expression for 𝑓 gives

$$f(x) = x^2 + \sin^2(x)$$
$$\frac{\mathrm{d}f}{\mathrm{d}x} = 2x + 2 \sin(x) \cos(x) .$$
Example A.5 Multivariable chain rule

Expanding on our single-variable example, let 𝑔(𝑥) = cos(𝑥), ℎ(𝑥) = sin(𝑥),


and 𝑓 (𝑔, ℎ) = 𝑔 2 ℎ 3 . Then, 𝑓 (𝑔(𝑥), ℎ(𝑥)) = cos2 (𝑥) sin3 (𝑥). Applying Eq. A.9,
we have the following:
$$\begin{aligned}
\frac{\mathrm{d}}{\mathrm{d}x}\left( f(g(x), h(x)) \right) &= \frac{\partial f}{\partial g} \frac{\mathrm{d}g}{\mathrm{d}x} + \frac{\partial f}{\partial h} \frac{\mathrm{d}h}{\mathrm{d}x} \\
&= 2 g h^3 \frac{\mathrm{d}g}{\mathrm{d}x} + g^2 3 h^2 \frac{\mathrm{d}h}{\mathrm{d}x} \\
&= -2 g h^3 \sin(x) + g^2 3 h^2 \cos(x) \\
&= -2 \cos(x) \sin^4(x) + 3 \cos^3(x) \sin^2(x) .
\end{aligned}$$
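The following short sympy sketch (illustrative, not from the text) reproduces the total derivative in Example A.5 symbolically, so the chain-rule result above can be checked.

```python
import sympy as sp

x = sp.symbols('x')
g = sp.cos(x)
h = sp.sin(x)
f = g**2 * h**3  # f(g(x), h(x)) = cos^2(x) sin^3(x)

dfdx = sp.diff(f, x)
expected = -2*sp.cos(x)*sp.sin(x)**4 + 3*sp.cos(x)**3*sp.sin(x)**2
print(sp.simplify(dfdx - expected))  # 0, so the results match
```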

The differential of a function represents the linear change in that


function with respect to changes in the independent variable. We
introduce them here because they are helpful for finding total derivatives
of multivariable equations that are implicit.
If function 𝑦 = 𝑓 (𝑥) is differentiable, the differential d𝑦 is

d𝑦 = 𝑓 ′(𝑥) d𝑥 , (A.11)

where d𝑥 is a nonzero real number (considered small) and d𝑦 is an


approximation of the change (due to the linear term in the Taylor
series). We can solve for 𝑓 ′(𝑥) to get 𝑓 ′(𝑥) = d𝑦/d𝑥. This states that
the derivative of 𝑓 with respect to 𝑥 is the differential of 𝑦 divided by
the differential of 𝑥. Strictly speaking, d𝑦/d𝑥 here is not the derivative,
although it is written in the same way. The derivative is a symbol, not a
fraction. However, for our purposes, we will use these representations
interchangeably and treat differentials algebraically. We also write the
differentials of functions as

d 𝑓 = 𝑓 ′(𝑥) d𝑥 . (A.12)

Example A.6 Multivariable chain rule using differentials

We can solve Ex. A.5 using differentials as follows. Taking the definition of
each function, we write their differentials,

d 𝑓 = 2𝑔 ℎ 3 d𝑔 + 3𝑔 2 ℎ 2 dℎ, d𝑔 = − sin(𝑥) d𝑥, dℎ = cos(𝑥) d𝑥 .

Substituting 𝑔, d𝑔, ℎ, and dℎ into the differential of 𝑓 , we obtain

d 𝑓 = 2 cos(𝑥) sin(𝑥)3 (− sin(𝑥) d𝑥) + 3 cos(𝑥)2 sin(𝑥)2 cos(𝑥) d𝑥 .

Simplifying and dividing by d𝑥 yields the total derivative


$$\frac{\mathrm{d}f}{\mathrm{d}x} = -2 \cos(x) \sin^4(x) + 3 \cos^3(x) \sin^2(x) .$$
In Ex. A.5, there is no clear advantage in using differentials. However,


differentials are more straightforward for finding total derivatives of
multivariable implicit equations because there is no need to identify
the independent variables. Given an equation, we just need to (1)
find the differential of the equation and (2) solve for the derivative
of interest. When we want quantities to remain constant, we can set
the corresponding differential to zero. Differentials can be applied to
vectors (say a vector 𝑥 of size 𝑛), yielding a vector of differentials with
the same size (d𝑥 of size 𝑛). We use this technique to derive the unified
derivatives equation (UDE) in Section 6.9.

Example A.7 Total derivatives of an implicit equation

Suppose we have the equation for a circle,


𝑥2 + 𝑦2 = 𝑟2 .
The differential of this equation is
2𝑥 d𝑥 + 2𝑦 d𝑦 = 2𝑟 d𝑟 .
Say we want to find the slope of the tangent of a circle with a fixed radius. Then,
d𝑟 = 0, and we can solve for the derivative d𝑦/d𝑥 as follows:
$$2x\, \mathrm{d}x + 2y\, \mathrm{d}y = 0 \quad \Rightarrow \quad \frac{\mathrm{d}y}{\mathrm{d}x} = -\frac{x}{y} .$$
Another interpretation of this derivative is that it is the first-order change in
𝑦 with respect to a change in 𝑥 subject to the constraint of staying on a circle
(keeping a constant 𝑟). Similarly, we could find the derivative of 𝑥 with respect
to 𝑦 as d𝑥/d𝑦 = −𝑦/𝑥. Furthermore, we can find relationships between any
derivative involving 𝑟, 𝑥, or 𝑦.
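As an illustrative sympy sketch (not from the text), the implicit differentiation in Example A.7 can be reproduced by declaring 𝑦 as a function of 𝑥 and holding 𝑟 fixed:

```python
import sympy as sp

x = sp.symbols('x')
r = sp.symbols('r')            # radius held fixed (dr = 0)
y = sp.Function('y')(x)        # y depends on x

circle = x**2 + y**2 - r**2
# Differentiate the equation with respect to x and solve for dy/dx
dydx = sp.solve(sp.diff(circle, x), sp.Derivative(y, x))[0]
print(dydx)                    # -x/y(x)
```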

A.3 Matrix Multiplication

Consider a matrix $A$ of size $(m \times n)$∗ and a matrix $B$ of size $(n \times p)$. The two matrices can be multiplied together ($C = AB$) as follows:

$$C_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj} , \quad (A.13)$$

where $C$ is an $(m \times p)$ matrix. This multiplication is illustrated in Fig. A.3. Two matrices can be multiplied only if their inner dimensions are equal ($n$ in this case). The remaining products discussed in this section are just special cases of matrix multiplication, but they are common enough that we discuss them separately.

∗ In this notation, $m$ is the number of rows and $n$ is the number of columns.

[Fig. A.3 Matrix product and resulting size: $C_{ij} = A_{i*} B_{*j}$, where $C$ is $(m \times p)$, $A$ is $(m \times n)$, and $B$ is $(n \times p)$.]
A.3.1 Vector-Vector Products


In this book, a vector 𝑢 is a column vector; thus, the row vector is
represented as 𝑢 | . The product of two vectors can be performed in
two ways. The more common is called an inner product (also known
as a dot product or scalar product). The inner product is a functional,
meaning that it is an operator that acts on vectors and produces a
scalar. This product is illustrated in Fig. A.4. In the real vector space
of 𝑛 dimensions, the inner product of two vectors, 𝑢 and 𝑣, whose
dimensions are equal, is defined algebraically as 𝛼 𝑢 𝑣

 𝑣1 
 
   𝑣2  Õ
=
𝑛
|
𝛼 = 𝑢 𝑣 = 𝑢1 𝑢2 . . . 𝑢𝑛  .  = 𝑢𝑖 𝑣 𝑖 . (A.14)
 .. 
  (1 × 1) (1 × 𝑛) (𝑛 × 1)
𝑣 𝑛 
𝑖=1
  Fig. A.4 Dot (or inner) product of two
The order of multiplication is irrelevant, and therefore, vectors.

𝑢|𝑣 = 𝑣|𝑢 . (A.15)


In Euclidean space, where vectors have magnitude and direction, the
inner product is defined as
$$u^\intercal v = \|u\| \, \|v\| \cos(\theta) , \quad (A.16)$$

where ‖·‖ represents the 2-norm (Eq. A.25), and 𝜃 is the angle between
the two vectors.
The outer product takes the two vectors and multiplies them elementwise to produce a matrix, as illustrated in Fig. A.5. Unlike the inner product, the outer product does not require the vectors to be of the same length. The matrix form is as follows:

$$C = u v^\intercal = \begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_m \end{bmatrix} \begin{bmatrix} v_1 & v_2 & \cdots & v_n \end{bmatrix} = \begin{bmatrix} u_1 v_1 & u_1 v_2 & \cdots & u_1 v_n \\ u_2 v_1 & u_2 v_2 & \cdots & u_2 v_n \\ \vdots & \vdots & \ddots & \vdots \\ u_m v_1 & u_m v_2 & \cdots & u_m v_n \end{bmatrix} . \quad (A.17)$$

[Fig. A.5 Outer product of two vectors: an $(m \times 1)$ column vector times a $(1 \times p)$ row vector yields an $(m \times p)$ matrix with $C_{ij} = u_i v_j$.]
The index form is as follows:
(𝑢𝑣 | )𝑖𝑗 = 𝑢𝑖 𝑣 𝑗 . (A.18)
Outer products generate rank 1 matrices. They are used in quasi-
Newton methods (Section 4.4.4 and Appendix C).
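A small numpy sketch (illustrative values, not from the text) of the inner and outer products defined above:

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])

alpha = u @ v          # inner product: sum of u_i * v_i -> 32.0
C = np.outer(u, v)     # outer product: C[i, j] = u[i] * v[j], a 3x3 matrix

print(alpha)
print(np.linalg.matrix_rank(C))  # outer products generate rank-1 matrices -> 1
```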
A.3.2 Matrix-Vector Products


Consider multiplying a matrix 𝐴 of size (𝑚 × 𝑛) by vector 𝑢 of size 𝑛.
The result is a vector of size 𝑚:
$$v = A u \quad \Rightarrow \quad v_i = \sum_{j=1}^{n} A_{ij} u_j . \quad (A.19)$$

This multiplication is illustrated in Fig. A.6. The entries in $v$ are dot products between the rows of $A$ and $u$:

$$v = \begin{bmatrix} \text{——} \; A_{1*} \; \text{——} \\ \text{——} \; A_{2*} \; \text{——} \\ \vdots \\ \text{——} \; A_{m*} \; \text{——} \end{bmatrix} u , \quad (A.20)$$

[Fig. A.6 Matrix-vector product: an $(m \times n)$ matrix times an $(n \times 1)$ vector yields an $(m \times 1)$ vector.]

where 𝐴 𝑖∗ is the 𝑖th row of the matrix 𝐴. Thus, a matrix-vector


product transforms a vector in 𝑛-dimensional space (R𝑛 ) to a vector in
𝑚-dimensional space (R𝑚 ).
A matrix-vector product can be thought of as a linear combination
of the columns of $A$, where the $u_j$ values are the weights:

$$v = \begin{bmatrix} | \\ A_{*1} \\ | \end{bmatrix} u_1 + \begin{bmatrix} | \\ A_{*2} \\ | \end{bmatrix} u_2 + \ldots + \begin{bmatrix} | \\ A_{*n} \\ | \end{bmatrix} u_n , \quad (A.21)$$
and 𝐴∗𝑗 are the columns of 𝐴.
We can also multiply by a vector on the left, instead of on the right:

𝑣 | = 𝑢 | 𝐴. (A.22)

In this case, a row vector is multiplied with a matrix, producing a row


vector.

A.3.3 Quadratic Form (Vector-Matrix-Vector Product)


Another common product is a quadratic form. A quadratic form consists
of a row vector, times a matrix, times a column vector, producing a
scalar:

$$\alpha = u^\intercal A u = \begin{bmatrix} u_1 & u_2 & \ldots & u_n \end{bmatrix} \begin{bmatrix} A_{11} & A_{12} & \cdots & A_{1n} \\ A_{21} & A_{22} & \cdots & A_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ A_{n1} & A_{n2} & \cdots & A_{nn} \end{bmatrix} \begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{bmatrix} . \quad (A.23)$$
The index form is as follows:


$$\alpha = \sum_{i=1}^{n} \sum_{j=1}^{n} u_i A_{ij} u_j . \quad (A.24)$$

In general, a vector-matrix-vector product can have a nonsquare 𝐴


matrix, and the vectors would be two different sizes, but for a quadratic
form, the two vectors 𝑢 are identical, and thus 𝐴 is square. Also, in a
quadratic form, we assume that 𝐴 is symmetric (even if it is not, only the
symmetric part of 𝐴 contributes, so effectively, it acts like a symmetric
matrix).
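The following numpy sketch (illustrative values) evaluates a quadratic form and confirms the remark above that only the symmetric part of $A$ contributes:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [3.0, 4.0]])        # not symmetric
u = np.array([1.0, -2.0])

A_sym = 0.5 * (A + A.T)            # symmetric part of A

print(u @ A @ u)                   # quadratic form with A -> 10.0
print(u @ A_sym @ u)               # same value: only the symmetric part contributes
```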

A.4 Four Fundamental Subspaces in Linear Algebra

This section reviews how the dimensions of a matrix in a linear system


relate to dimensional spaces.∗ These concepts are especially helpful for understanding constrained optimization (Chapter 5) and build on the review in Section 5.2.

∗ Strang87 provides a comprehensive coverage of linear algebra and is credited with popularizing the concept of the "four fundamental subspaces".
87. Strang, Linear Algebra and its Applications, 2006.

A vector space is the set of all points that can be obtained by linear
combinations of a given set of vectors. The vectors are said to span
the vector space. A basis is a set of linearly independent vectors that
generates all points in a vector space. A subspace is a space of lower
dimension than the space that contains it (e.g., a line is a subspace of a
plane).
Two vectors are orthogonal if the angle between them is 90 degrees.
Then, their dot product is zero. A subspace 𝑆1 is orthogonal to another
subspace 𝑆2 if every vector in 𝑆1 is orthogonal to every vector in 𝑆2 .
Consider an (𝑚 × 𝑛) matrix 𝐴. The rank (𝑟) of a matrix 𝐴 is the maxi-
mum number of linearly independent row vectors of 𝐴 or, equivalently,
the maximum number of linearly independent column vectors. The rank
can also be defined as the dimensionality of the vector space spanned
by the rows or columns of 𝐴. For an (𝑚 × 𝑛) matrix, 𝑟 ≤ min(𝑚, 𝑛).
Through a matrix-vector multiplication 𝐴𝑥 = 𝑏, this matrix maps
an 𝑛-vector 𝑥 into an 𝑚-vector 𝑏. Figure A.7 shows this mapping and
illustrates the four fundamental subspaces that we now explain.
The column space of a matrix 𝐴 is the vector space spanned by the vectors in the columns of 𝐴. The dimensionality of this space is given by 𝑟, where 𝑟 ≤ 𝑚, so the column space is a subspace of 𝑚-dimensional space. The row space of a matrix 𝐴 is the vector space spanned by the vectors in the rows of 𝐴 (or equivalently, it is the column space of $A^\intercal$). The dimensionality of this space is given by 𝑟, where 𝑟 ≤ 𝑛, so the row space is a subspace of 𝑛-dimensional space.
[Fig. A.7 The four fundamental subspaces of linear algebra. An $(m \times n)$ matrix $A$ maps vectors from $n$-space to $m$-space. When the vector is in the row space of the matrix, it maps to the column space of $A$ ($x_r \to b$). When the vector is in the nullspace of $A$, it maps to zero ($x_n \to 0$). Combining the row space and nullspace of $A$, we can obtain any vector in $n$-dimensional space ($x = x_r + x_n$), which maps to the column space of $A$ ($x \to b$).]

The nullspace of a matrix 𝐴 is the vector space consisting of all the


vectors that are orthogonal to the rows of 𝐴. Equivalently, the nullspace
of 𝐴 is the vector space of all vectors 𝑥 𝑛 such that 𝐴𝑥 𝑛 = 0. Therefore,
the nullspace is orthogonal to the row space of 𝐴. The dimension of
the nullspace of 𝐴 is 𝑛 − 𝑟.
Combining the nullspace and row space of 𝐴 adds up to the whole
𝑛-dimensional space, that is, 𝑥 = 𝑥 𝑟 + 𝑥 𝑛 , where 𝑥 𝑟 is in the row space
of 𝐴 and 𝑥 𝑛 is in the nullspace of 𝐴.
The left nullspace of a matrix $A$ is the vector space of all $x$ such that $A^\intercal x = 0$. Therefore, the left nullspace is orthogonal to the column space

of 𝐴. The dimension of the left nullspace of 𝐴 is 𝑚−𝑟. Combining the left


nullspace and column space of 𝐴 adds up to the whole 𝑚-dimensional
space.
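As an illustrative numpy sketch (not from the text), bases for the four subspaces of a small matrix can be obtained from its singular value decomposition: the first $r$ left singular vectors span the column space, the remaining ones the left nullspace, and similarly for the right singular vectors.

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0]])   # rank-1 example, m = 2, n = 3

U, s, Vt = np.linalg.svd(A)
r = np.sum(s > 1e-12)              # numerical rank

col_space  = U[:, :r]              # basis for the column space (in R^m)
left_null  = U[:, r:]              # basis for the left nullspace (in R^m)
row_space  = Vt[:r, :].T           # basis for the row space (in R^n)
null_space = Vt[r:, :].T           # basis for the nullspace (in R^n)

print(r)                                  # 1
print(np.allclose(A @ null_space, 0))     # nullspace maps to zero
print(np.allclose(A.T @ left_null, 0))    # left nullspace: A^T y = 0
```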

A.5 Vector and Matrix Norms

Norms give an idea of the magnitude of the entries in vectors and


matrices. They are a generalization of the absolute value for real
numbers. A norm ‖·‖ is a real-valued function with the following properties:

• ‖𝑥‖ ≥ 0 for all 𝑥.
• ‖𝑥‖ = 0 if and only if 𝑥 = 0.
• ‖𝛼𝑥‖ = |𝛼| ‖𝑥‖ for all real numbers 𝛼.
• ‖𝑥 + 𝑦‖ ≤ ‖𝑥‖ + ‖𝑦‖ for all 𝑥 and 𝑦.

Most common matrix norms also have the property that ‖𝑥𝑦‖ ≤ ‖𝑥‖ ‖𝑦‖,
although this is not required in general.
We start by defining vector norms, where the vector is 𝑥 = [𝑥1 , . . . , 𝑥 𝑛 ].


The most familiar norm for vectors is the 2-norm, also known as the
Euclidean norm, which corresponds to the Euclidean length of the vector:
$$\|x\|_2 = \left( \sum_{i=1}^{n} x_i^2 \right)^{\frac{1}{2}} = \left( x_1^2 + x_2^2 + \ldots + x_n^2 \right)^{\frac{1}{2}} . \quad (A.25)$$

Because this norm is used so often, we often omit the subscript and just
write ‖𝑥‖. In this book, we sometimes use the square of the 2-norm, which can be written as the dot product,

$$\|x\|_2^2 = x^\intercal x . \quad (A.26)$$

More generally, we can refer to a class of norms called 𝑝-norms:


$$\|x\|_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{\frac{1}{p}} = \left( |x_1|^p + |x_2|^p + \ldots + |x_n|^p \right)^{\frac{1}{p}} , \quad (A.27)$$

where $1 \le p < \infty$. Of all the $p$-norms, three are most commonly used: the 2-norm (Eq. A.25), the 1-norm, and the $\infty$-norm. From the previous definition, we see that the 1-norm is the sum of the absolute values of all the entries in $x$:

$$\|x\|_1 = \sum_{i=1}^{n} |x_i| = |x_1| + |x_2| + \ldots + |x_n| . \quad (A.28)$$

The application of ∞ in the 𝑝-norm definition is perhaps less obvious,


but as 𝑝 → ∞, the largest term in that sum dominates all of the others.
Raising that quantity to the power of 1/𝑝 causes the exponents to cancel,
leaving only the largest-magnitude component of 𝑥. Thus, the infinity
norm is
$$\|x\|_\infty = \max_i |x_i| . \quad (A.29)$$

The infinity norm is commonly used in optimization convergence criteria.

The vector norms are visualized in Fig. A.8 for $n = 2$. If $x = [1, \ldots, 1]$, then

$$\|x\|_1 = n, \quad \|x\|_2 = n^{\frac{1}{2}}, \quad \|x\|_\infty = 1 . \quad (A.30)$$

It is also possible to assign different weights to each vector component to form a weighted norm:

$$\|x\|_p = \left( w_1 |x_1|^p + w_2 |x_2|^p + \ldots + w_n |x_n|^p \right)^{\frac{1}{p}} . \quad (A.31)$$

[Fig. A.8 Norms for the two-dimensional case: contours of ‖𝑥‖₁, ‖𝑥‖₂, ‖𝑥‖∞, and a general ‖𝑥‖ₚ.]
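A brief numpy sketch (illustrative values) of these vector norms:

```python
import numpy as np

x = np.array([3.0, -4.0, 1.0])

print(np.linalg.norm(x, 1))       # 1-norm: |3| + |4| + |1| = 8
print(np.linalg.norm(x, 2))       # 2-norm: sqrt(9 + 16 + 1) ~= 5.099
print(np.linalg.norm(x, np.inf))  # infinity norm: max(|x_i|) = 4
```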
Several norms for matrices exist. There are matrix norms similar to
the vector norms that we defined previously. Namely,

$$\begin{aligned}
\|A\|_1 &= \max_{1 \le j \le n} \sum_{i=1}^{n} |A_{ij}| \\
\|A\|_2 &= \left( \lambda_{\max}(A^\intercal A) \right)^{\frac{1}{2}} \\
\|A\|_\infty &= \max_{1 \le i \le n} \sum_{j=1}^{n} |A_{ij}| ,
\end{aligned} \quad (A.32)$$

where 𝜆max (𝐴| 𝐴) is the largest eigenvalue of 𝐴| 𝐴. When 𝐴 is a square


symmetric matrix, then

$$\|A\|_2 = |\lambda_{\max}(A)| . \quad (A.33)$$

Another matrix norm that is useful but not related to any vector
norm is the Frobenius norm, which is defined as the square root of the
absolute squares of its elements, that is,
$$\|A\|_F = \sqrt{ \sum_{i=1}^{m} \sum_{j=1}^{n} A_{ij}^2 } . \quad (A.34)$$

The Frobenius norm can be weighted by a matrix 𝑊 as follows:


$$\|A\|_W = \left\| W^{\frac{1}{2}} A W^{\frac{1}{2}} \right\|_F . \quad (A.35)$$

This norm is used in the formal derivation of the Broyden–Fletcher–


Goldfarb–Shanno (BFGS) update formula (see Appendix C).

A.6 Matrix Types

There are several common types of matrices that appear regularly


throughout this book. We review some terminology here.
A diagonal matrix is a matrix where all off-diagonal terms are zero.
In other words, 𝐴 is diagonal if:

𝐴 𝑖𝑗 = 0 for all 𝑖 ≠ 𝑗 . (A.36)

The identity matrix 𝐼 is a special diagonal matrix where all diagonal


components are 1.
The transpose of a matrix is defined as follows:

[𝐴| ]𝑖𝑗 = 𝐴 𝑗𝑖 . (A.37)


Note that
$$\begin{aligned}
(A^\intercal)^\intercal &= A \\
(A + B)^\intercal &= A^\intercal + B^\intercal \\
(AB)^\intercal &= B^\intercal A^\intercal .
\end{aligned} \quad (A.38)$$
A symmetric matrix is one where the matrix is equal to its transpose:

𝐴| = 𝐴 ⇒ 𝐴 𝑖𝑗 = 𝐴 𝑗𝑖 . (A.39)

The inverse of a matrix, 𝐴−1 , satisfies

𝐴𝐴−1 = 𝐼 = 𝐴−1 𝐴 . (A.40)

Not all matrices are invertible. Some common properties for inverses
are as follows:

$$\begin{aligned}
\left( A^{-1} \right)^{-1} &= A \\
(AB)^{-1} &= B^{-1} A^{-1} \\
\left( A^{-1} \right)^\intercal &= \left( A^\intercal \right)^{-1} .
\end{aligned} \quad (A.41)$$

A symmetric matrix 𝐴 is positive definite if and only if

𝑥 | 𝐴𝑥 > 0 (A.42)

for all nonzero vectors 𝑥. One property of positive-definite matrices is


that their inverse is also positive definite.
The positive-definite condition (Eq. A.42) can be challenging to
verify. Still, we can use equivalent definitions that are more practical.
For example, by choosing appropriate 𝑥s, we can derive the neces-
sary conditions for positive definiteness:

$$A_{ii} > 0 \;\; \text{for all } i, \qquad |A_{ij}| < \sqrt{A_{ii} A_{jj}} \;\; \text{for all } i \ne j . \quad (A.43)$$

These are necessary but not sufficient conditions. Thus, if any diagonal element is less than or equal to zero, we know that the matrix is not positive definite.

An equivalent condition to Eq. A.42 is that all the eigenvalues of 𝐴 are positive. This is a sufficient condition.

[Fig. A.9 For 𝐴 to be positive definite, the determinants of the submatrices 𝐴₁, 𝐴₂, . . . , 𝐴ₙ must be greater than zero.]

Another practical condition equivalent to Eq. A.42 is that all the leading principal minors of 𝐴 are positive. A leading principal minor is the determinant of a leading principal submatrix. A leading principal submatrix of order 𝑘, 𝐴ₖ, of an (𝑛 × 𝑛) matrix 𝐴 is obtained by removing the last 𝑛 − 𝑘 rows and columns of 𝐴, as shown in Fig. A.9. Thus, to verify
if 𝐴 is positive definite, we start with 𝑘 = 1, check that 𝐴1 > 0 (only


one element), then check that det(𝐴2 ) > 0, and so on, until det(𝐴𝑛 ) > 0.
Suppose any of the determinants in this sequence is not positive. In
that case, we can stop the process and conclude that 𝐴 is not positive
definite.
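As an illustrative numpy sketch (not from the text), positive definiteness can be checked with either the eigenvalue condition or the leading principal minors described above:

```python
import numpy as np

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])   # symmetric test matrix

# Check via eigenvalues: all must be positive
print(np.all(np.linalg.eigvalsh(A) > 0))           # True

# Check via leading principal minors: det(A_1), det(A_2), ..., det(A_n) > 0
minors = [np.linalg.det(A[:k, :k]) for k in range(1, A.shape[0] + 1)]
print(all(m > 0 for m in minors))                   # True
```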
A positive-semidefinite matrix satisfies

𝑥 | 𝐴𝑥 ≥ 0 (A.44)

for all nonzero vectors 𝑥. In this case, the eigenvalues are nonnegative,
and there is at least one that is zero. A negative-definite matrix satisfies

𝑥 | 𝐴𝑥 < 0 (A.45)

for all nonzero vectors 𝑥. In this case, all the eigenvalues are negative.
An indefinite matrix is one that is neither positive definite nor negative
definite. Then, there are at least two nonzero vectors 𝑥 and 𝑦 such that

𝑥 | 𝐴𝑥 > 0 > 𝑦 | 𝐴𝑦 . (A.46)

A.7 Matrix Derivatives

Let us consider the derivatives of a few common cases: linear and


quadratic functions. Combining the concept of partial derivatives and
matrix forms of equations allows us to find the gradients of matrix
functions. First, let us consider a linear function, 𝑓 , defined as

$$f(x) = a^\intercal x + b = \sum_{i=1}^{n} a_i x_i + b_i , \quad (A.47)$$

where 𝑎, 𝑥, and 𝑏 are vectors of length 𝑛, and 𝑎 𝑖 , 𝑥 𝑖 , and 𝑏 𝑖 are the 𝑖th
elements of 𝑎, 𝑥, and 𝑏, respectively. If we take the partial derivative
of each element with respect to an arbitrary element of 𝑥, namely, 𝑥 𝑘 ,
we get

$$\frac{\partial}{\partial x_k} \left[ \sum_{i=1}^{n} a_i x_i + b_i \right] = a_k . \quad (A.48)$$

Thus,
∇𝑥 (𝑎 | 𝑥 + 𝑏) = 𝑎 . (A.49)
Recall the quadratic form presented in Appendix A.3.3; we can
combine that with a linear term to form a general quadratic function:

𝑓 (𝑥) = 𝑥 | 𝐴𝑥 + 𝑏 | 𝑥 + 𝑐 , (A.50)
where 𝑥, 𝑏, and 𝑐 are still vectors of length 𝑛, and 𝐴 is an 𝑛-by-𝑛


symmetric matrix. In index notation, 𝑓 is as follows:
$$f(x) = \sum_{i=1}^{n} \sum_{j=1}^{n} x_i a_{ij} x_j + b_i x_i + c_i . \quad (A.51)$$

For convenience, we separate the diagonal terms from the off-


diagonal terms, leaving us with
$$f(x) = \sum_{i=1}^{n} \left( a_{ii} x_i^2 + b_i x_i + c_i + \sum_{j \ne i} x_i a_{ij} x_j \right) . \quad (A.52)$$

Now we take the partial derivatives with respect to $x_k$ as before, yielding

$$\frac{\partial f}{\partial x_k} = 2 a_{kk} x_k + b_k + \sum_{j \ne k} x_j a_{jk} + \sum_{j \ne k} a_{kj} x_j . \quad (A.53)$$

We now move the diagonal terms back into the sums to get

$$\frac{\partial f}{\partial x_k} = b_k + \sum_{j=1}^{n} \left( x_j a_{jk} + a_{kj} x_j \right) , \quad (A.54)$$

which we can put back into matrix form as follows:

∇𝑥 𝑓 (𝑥) = 𝐴| 𝑥 + 𝐴𝑥 + 𝑏 . (A.55)

If 𝐴 is symmetric, then 𝐴| = 𝐴, and thus

∇𝑥 (𝑥 | 𝐴𝑥 + 𝑏 | 𝑥 + 𝑐) = 2𝐴𝑥 + 𝑏 . (A.56)
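The following numpy sketch (illustrative values, not from the text) checks the gradient expression in Eq. A.56 against a finite-difference approximation for a symmetric $A$:

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])     # symmetric
b = np.array([1.0, -1.0])
c = 0.5
x = np.array([0.7, -0.3])

f = lambda x: x @ A @ x + b @ x + c

grad_analytic = 2 * A @ x + b    # Eq. A.56

# Forward-difference approximation
h = 1e-6
grad_fd = np.array([(f(x + h * np.eye(2)[i]) - f(x)) / h for i in range(2)])

print(grad_analytic)
print(grad_fd)                   # agrees to about six digits
```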

A.8 Eigenvalues and Eigenvectors

Given an (𝑛 × 𝑛) matrix, if there is a scalar 𝜆 and a nonzero vector 𝑣


that satisfy
𝐴𝑣 = 𝜆𝑣 , (A.57)
then 𝜆 is an eigenvalue of the matrix 𝐴, and 𝑣 is an eigenvector. The
left-hand side of Eq. A.57 is a matrix-vector product that represents a
linear transformation applied to 𝑣. The right-hand side of Eq. A.57 is a
scalar-vector product that represents a vector aligned with 𝑣. Therefore,
the eigenvalue problem (Eq. A.57) answers the question: Which vectors,
when transformed by 𝐴, remain in the same direction, and how much
do their corresponding lengths change in that transformation?
The solutions of the eigenvalue problem (Eq. A.57) are given by the
solutions of the scalar equation,

det (𝐴 − 𝜆𝐼) = 0 . (A.58)


This equation yields a polynomial of degree 𝑛 called the characteristic


equation, whose roots are the eigenvalues of 𝐴.
If 𝐴 is symmetric, it has 𝑛 real eigenvalues (𝜆1 , . . . , 𝜆𝑛 ) and 𝑛
linearly independent eigenvectors (𝑣 1 , . . . , 𝑣 𝑛 ) corresponding to those
eigenvalues. It is possible to choose the eigenvectors to be orthogonal
to each other (i.e., $v_i^\intercal v_j = 0$ for $i \ne j$) and to normalize them (so that $v_i^\intercal v_i = 1$).

We use the eigenvalue problem in Section 4.1.2, where the eigen-


vectors are the directions of principal curvature, and the eigenvalues
quantify the curvature. Eigenvalues are also helpful in determining if a
matrix is positive definite.

A.9 Random Variables

Imagine measuring the axial strength of a rod by performing a tensile


test with many rods, each designed to be identical. Even with “iden-
tical” rods, every time you perform the test, you get a different result
(hopefully with relatively small differences). This variation has many
potential sources, including variation in the manufactured size and
shape, in the composition of the material, and in the contact between
the rod and testing fixture. In this example, we would call the axial
strength a random variable, and the result from one test would be a
random sample. The random variable, axial strength, is a function of
several other random variables, such as bar length, bar diameter, and
material Young’s modulus.
One measurement does not tell us anything about how variable
the axial strength is, but if we perform the test many times, we can
learn a lot about its distribution. From this information, we can infer
various statistical quantities, such as the mean value of the axial strength.
The mean of some variable 𝑥 that is measured 𝑛 times is estimated as
follows:

$$\mu_x = \frac{1}{n} \sum_{i=1}^{n} x_i . \quad (A.59)$$

This is actually a sample mean, which would differ from the pop-
ulation mean (the true mean if you could measure every bar). With
enough samples, the sample mean approaches the population mean. In
this brief review, we do not distinguish between sample and population
statistics.
Another important quantity is the variance or standard deviation. This is a measure of spread, or how far away our samples are from the mean. The unbiased∗ estimate of the variance is

∗ Unbiased means that the expected value of the sample variance is the same as the true population variance. If 𝑛 were used in the denominator instead of 𝑛 − 1, then the two quantities would differ by a constant.
$$\sigma_x^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \mu_x)^2 , \quad (A.60)$$

and the standard deviation is just the square root of the variance. A
small variance implies that measurements are clustered tightly around
the mean, whereas a large variance means that measurements are spread
out far from the mean. The variance can also be written in the following
mathematically equivalent but more computationally-friendly format:
$$\sigma_x^2 = \frac{1}{n - 1} \left( \sum_{i=1}^{n} x_i^2 - n \mu_x^2 \right) . \quad (A.61)$$
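A quick numpy sketch (the data values here are hypothetical) of the sample mean and the unbiased variance estimate:

```python
import numpy as np

x = np.array([995.0, 1001.0, 998.0, 1003.0, 1000.0])  # hypothetical strength measurements

mu = np.mean(x)                              # Eq. A.59
var = np.sum((x - mu)**2) / (len(x) - 1)     # Eq. A.60 (unbiased)

print(mu, var)
print(np.var(x, ddof=1))                     # numpy's unbiased estimate (ddof=1) matches
```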

More generally, we might want to know what the probability is of


getting a bar with a specific axial strength. In our testing, we could
tabulate the frequency of each measurement in a histogram. If done
enough times, it would define a smooth curve, as shown in Fig. A.10.
This curve is called the probability density function (PDF), 𝑝(𝑥), and it
tells us the relative probability of a certain value occurring.
More specifically, a PDF gives the probability of getting a value within a certain range:

$$\mathrm{Prob}(a \le x \le b) = \int_{a}^{b} p(x)\, \mathrm{d}x . \quad (A.62)$$

The total integral of the PDF must be 1 because it contains all possible outcomes (100 percent):

$$\int_{-\infty}^{\infty} p(x)\, \mathrm{d}x = 1 . \quad (A.63)$$

From the PDF, we can also measure various statistics, such as the mean value:

$$\mu_x = \mathrm{E}(x) = \int_{-\infty}^{\infty} x\, p(x)\, \mathrm{d}x . \quad (A.64)$$

This quantity is also referred to as the expected value of 𝑥 (E[𝑥]). The


expected value of a function of a random variable, 𝑓(𝑥), is given by:†

† This is not a definition, but rather uses the expected value definition with a somewhat lengthy derivation.

$$\mu_f = \mathrm{E}(f(x)) = \int_{-\infty}^{\infty} f(x)\, p(x)\, \mathrm{d}x . \quad (A.65)$$

We can also compute the variance, which is the expected value of the squared difference from the mean:

$$\sigma_x^2 = \mathrm{E}\left( (x - \mathrm{E}(x))^2 \right) = \int_{-\infty}^{\infty} (x - \mu_x)^2 p(x)\, \mathrm{d}x , \quad (A.66)$$
or in a mathematically equivalent format:


$$\sigma_x^2 = \int_{-\infty}^{\infty} x^2 p(x)\, \mathrm{d}x - \mu_x^2 . \quad (A.67)$$

The mean and variance are the first and second moments of the
distribution. In general, a distribution may require an infinite number
of moments to describe it fully. Higher-order moments are generally
mean centered and are normalized by the standard deviation so that
the 𝑛th normalized moment is computed as follows:

$$\mathrm{E}\left[ \left( \frac{x - \mu_x}{\sigma} \right)^n \right] . \quad (A.68)$$
The third moment is called skewness, and the fourth is called kurtosis,
although these higher-order moments are less commonly used.
The cumulative distribution function (CDF) is related to the PDF: it is the cumulative integral of the PDF, defined as follows:

$$P(x) = \int_{-\infty}^{x} p(t)\, \mathrm{d}t . \quad (A.69)$$

The capital 𝑃 denotes the CDF, and the lowercase 𝑝 denotes the PDF.
As an example, the PDF and corresponding CDF for the axial strength
are shown in Fig. A.10. The CDF always approaches 1 as 𝑥 → ∞.

[Fig. A.10 Comparison between PDF and CDF for a simple example: the PDF for the axial strength of a rod (left) and the corresponding CDF (right).]

We often fit a named distribution to the PDF of empirical data. One


of the most popular distributions is the normal distribution, also known
as the Gaussian distribution. Its PDF is as follows:

$$p(x; \mu, \sigma^2) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left( \frac{-(x - \mu)^2}{2\sigma^2} \right) . \quad (A.70)$$

For a normal distribution, the mean and variance are visible in the function, but these quantities are defined for any distribution. Figure A.11 shows two normal distributions with different means and standard deviations to illustrate the effect of those parameters.

[Fig. A.11 Two normal distributions (𝜇 = 1, 𝜎 = 0.5 and 𝜇 = 3, 𝜎 = 1.0). Changing the mean causes a shift along the 𝑥-axis. Increasing the standard deviation causes the PDF to spread out.]
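An illustrative scipy sketch evaluating the normal PDF of Eq. A.70 and the corresponding CDF (the parameter values are arbitrary):

```python
from scipy.stats import norm

mu, sigma = 3.0, 1.0
dist = norm(loc=mu, scale=sigma)

print(dist.pdf(3.0))   # peak of the PDF: 1/(sigma*sqrt(2*pi)) ~= 0.3989
print(dist.cdf(3.0))   # CDF at the mean: 0.5
print(dist.cdf(1e6))   # approaches 1 as x -> infinity
```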
Several other popular distributions are shown in Fig. A.12: uni-
form distribution, Weibull distribution, lognormal distribution, and
exponential distribution. These are only a few of many other possible
probability distributions.

[Fig. A.12 Popular probability distributions besides the normal distribution: uniform, Weibull, lognormal, and exponential.]
An extension of variance is the covariance, which measures the


variability between two random variables:

$$\mathrm{cov}(x, y) = \mathrm{E}\left( (x - \mathrm{E}(x))(y - \mathrm{E}(y)) \right) = \mathrm{E}(xy) - \mu_x \mu_y . \quad (A.71)$$

From this definition, we see that the variance is related to covariance


by the following:
𝜎𝑥2 = var(𝑥) = cov(𝑥, 𝑥) . (A.72)
Covariance is often expressed as a matrix, in which case the variance of
each variable appears on the diagonal. The correlation is the covariance
divided by the standard deviations:

$$\mathrm{corr}(x, y) = \frac{\mathrm{cov}(x, y)}{\sigma_x \sigma_y} . \quad (A.73)$$
B Linear Solvers
In Section 3.6, we present an overview of solution methods for dis-
cretized systems of equations, followed by an introduction to Newton-
based methods for solving nonlinear equations. Here, we review the
solvers for linear systems required to solve for each step of Newton-based methods.∗

∗ Trefethen and Bau III220 provides a much more detailed explanation of linear solvers.
220. Trefethen and Bau III, Numerical Linear Algebra, 1997.

B.1 Systems of Linear Equations

If the equations are linear, they can be written as

𝐴𝑢 = 𝑏 , (B.1)

where 𝐴 is a square (𝑛 × 𝑛) matrix and 𝑏 is a vector, neither of


these depends on 𝑢. If this system of equations has a unique solution,
then the system and the matrix 𝐴 are nonsingular. This is equivalent to
saying that 𝐴 has an inverse, 𝐴−1 . If 𝐴−1 does not exist, the matrix and
the system are singular.
A matrix 𝐴 is singular if its rows (or equivalently, its columns) are
linearly dependent (i.e., if one of the rows can be written as a linear
combination of the others).
If the matrix 𝐴 is nonsingular and we know its inverse 𝐴−1 , the
solution of the linear system (Eq. B.1) can be written as 𝑢 = 𝐴−1 𝑏.
However, the numerical methods described here do not form 𝐴−1 . The
main reason for this is that forming 𝐴−1 is expensive: the computational
cost is proportional to 𝑛 3 .
For practical problems with large 𝑛, it is typical for the matrix 𝐴 to be
sparse, that is, for most of its entries to be zeros. An entry 𝐴 𝑖𝑗 represents
the interaction between variables 𝑖 and 𝑗. When solving differential
equations on a discretized grid, for example, a given variable 𝑖 only
interacts with variables 𝑗 in its vicinity in the grid. These interactions
correspond to nonzero entries, whereas all other entries are zero.
Sparse linear systems tend to have a number of nonzero terms that is
proportional to 𝑛. This is in contrast with a dense matrix, which has 𝑛 2
nonzero entries. Solvers should take advantage of sparsity to remain
efficient for large 𝑛.


We rewrite the linear system (Eq. B.1) as a set of residuals,

𝑟(𝑢) = 𝐴𝑢 − 𝑏 = 0. (B.2)

To solve this system of equations, we can use either a direct method


or an iterative method. We explain these briefly in the rest of this
appendix, but we do not cover more advanced techniques that take
advantage of sparsity.

B.2 Conditioning

The distinction between singular and nonsingular systems blurs once


we have to deal with finite-precision arithmetic. Systems that are
nonsingular in the exact sense can still be ill-conditioned: a small change in the data (entries of 𝐴 or 𝑏) results in a large change in the solution. This
large sensitivity of the solution to the problem parameters is an issue
because the parameters themselves have finite precision. Then, any
imprecision in these parameters can lead to significant errors in the
solution, even if no errors are introduced in the numerical solution of
the linear system.
The conditioning of a linear system can be quantified by the condition
number of the matrix, which is defined as the scalar

$$\mathrm{cond}(A) = \|A\| \cdot \left\| A^{-1} \right\| , \quad (B.3)$$

where any matrix norm can be used. Because ‖𝐴‖ · ‖𝐴⁻¹‖ ≥ ‖𝐴𝐴⁻¹‖, we have

$$\mathrm{cond}(A) \ge 1 \quad (B.4)$$
for all matrices. A matrix 𝐴 is well-conditioned if cond(𝐴) is small and
ill-conditioned if cond(𝐴) is large.
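An illustrative numpy sketch of how conditioning affects the solution (the matrix here is a nearly singular example chosen for demonstration):

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 1.0001]])     # nearly singular, hence ill-conditioned
b = np.array([2.0, 2.0001])

print(np.linalg.cond(A))          # large condition number (~4e4)

u1 = np.linalg.solve(A, b)
u2 = np.linalg.solve(A, b + np.array([0.0, 1e-4]))  # tiny change in the data
print(u1, u2)                     # large change in the solution
```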

B.3 Direct Methods

The standard way to solve linear systems of equations with a computer


is Gaussian elimination, which in matrix form is equivalent to LU
factorization. This is a factorization (or decomposition) of 𝐴 such that 𝐴 = 𝐿𝑈, where 𝐿 is a unit lower triangular matrix, and 𝑈 is an upper triangular matrix, as shown in Fig. B.1.

[Fig. B.1 𝐿𝑈 factorization: 𝐴 is written as the product of a unit lower triangular 𝐿 and an upper triangular 𝑈.]

The factorization transforms the matrix 𝐴 into an upper triangular matrix 𝑈 by introducing zeros below the diagonal, one column at a time, starting with the first one and progressing from left to right. This is done by subtracting multiples of each row from subsequent rows.
These operations can be expressed as a sequence of multiplications


with lower triangular matrices 𝐿 𝑖 ,
$$\underbrace{L_{n-1} \cdots L_2 L_1}_{L^{-1}} A = U . \quad (B.5)$$

After completing these operations, we have $U$, and we can find $L$ by computing $L = L_1^{-1} L_2^{-1} \cdots L_{n-1}^{-1}$.
Once we have this factorization, we have 𝐿𝑈𝑢 = 𝑏. Setting 𝑈𝑢 to
𝑦, we can solve 𝐿𝑦 = 𝑏 for 𝑦 by forward substitution. Now we have
𝑈𝑢 = 𝑦, which we can solve by back substitution for 𝑢.

Algorithm B.1 Solving 𝐴𝑢 = 𝑏 by 𝐿𝑈 factorization

Inputs:
𝐴: Nonsingular square matrix
𝑏: A vector
Outputs:
𝑢: Solution to 𝐴𝑢 = 𝑏

Perform forward substitution to solve 𝐿𝑦 = 𝑏 for 𝑦:

$$y_1 = \frac{b_1}{L_{11}}, \qquad y_i = \frac{1}{L_{ii}} \left( b_i - \sum_{j=1}^{i-1} L_{ij} y_j \right) \;\; \text{for } i = 2, \ldots, n$$
Perform backward substitution to solve 𝑈𝑢 = 𝑦 for 𝑢:

$$u_n = \frac{y_n}{U_{nn}}, \qquad u_i = \frac{1}{U_{ii}} \left( y_i - \sum_{j=i+1}^{n} U_{ij} u_j \right) \;\; \text{for } i = n-1, \ldots, 1$$

This process is not stable in general because roundoff errors are


magnified in the backward substitution when diagonal elements of 𝐴
have a small magnitude. This issue is resolved by partial pivoting, which
interchanges rows to obtain more favorable diagonal elements.
Cholesky factorization is an LU factorization specialized for the case
where the matrix 𝐴 is symmetric and positive definite. In this case,
pivoting is unnecessary because the Gaussian elimination is always
stable for symmetric positive-definite matrices. The factorization can
be written as
𝐴 = 𝐿𝐷𝐿| , (B.6)
where 𝐷 = diag[𝑈11 , . . . , 𝑈𝑛𝑛 ]. This can be expressed as the matrix
product
𝐴 = 𝐺𝐺| , (B.7)
where 𝐺 = 𝐿𝐷 1/2 .
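An illustrative scipy sketch (not from the text) of solving a linear system by LU factorization with partial pivoting, and by Cholesky factorization when the matrix is symmetric positive definite:

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve, cho_factor, cho_solve

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])      # symmetric positive definite
b = np.array([1.0, 2.0])

# LU factorization with partial pivoting (general nonsingular A)
lu, piv = lu_factor(A)
u_lu = lu_solve((lu, piv), b)

# Cholesky factorization (A symmetric positive definite)
c, low = cho_factor(A)
u_chol = cho_solve((c, low), b)

print(u_lu, u_chol)             # both give the same solution
```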
B.4 Iterative Methods

Although direct methods are usually more efficient and robust, iterative
methods have several advantages:

• Iterative methods make it possible to trade between computational


cost and precision because they can be stopped at any point and
still yield an approximation of 𝑢. On the other hand, direct
methods only get the solution at the end of the process with the
final precision.
• Iterative methods have the advantage when a good guess for 𝑢
exists. This is often the case in optimization, where the 𝑢 from
the previous optimization iteration can be used as the guess for
the new evaluations (called a warm start).
• Iterative methods do not require forming and manipulating
the matrix 𝐴, which can be computationally costly in terms of
both time and memory. Instead, iterative methods require the
computation of the residuals 𝑟(𝑢) = 𝐴𝑢 − 𝑏 and, in the case of
Krylov subspace methods, products of 𝐴 with a given vector.
Therefore, iterative methods can be more efficient than direct
methods for cases where 𝐴 is large and sparse. All that is needed
is an efficient process to get the product of 𝐴 with a given vector,
as shown in Fig. B.2.

Iterative methods are divided into stationary methods (also known as fixed-point iteration methods) and Krylov subspace methods.

[Fig. B.2 Iterative methods just require a process (which can be a black box) to compute products of 𝐴 with an arbitrary vector 𝑣.]

B.4.1 Jacobi, Gauss–Seidel, and SOR

Fixed-point methods generate a sequence of iterates 𝑢1 , . . . , 𝑢𝑘 , . . . using a function
𝑢 𝑘+1 = 𝐺 (𝑢 𝑘 ) , 𝑘 = 0, 1, . . . (B.8)
starting from an initial guess 𝑢0 . The function 𝐺(𝑢) is devised such
that the iterates converge to the solution 𝑢 ∗ , which satisfies 𝑟(𝑢 ∗ ) = 0.
Many stationary methods can be derived by splitting the matrix such
that 𝐴 = 𝑀 − 𝑁. Then, 𝐴𝑢 = 𝑏 leads to 𝑀𝑢 = 𝑁𝑢 + 𝑏, and substituting
this into the linear system yields

𝑢 = 𝑀 −1 (𝑁𝑢 + 𝑏). (B.9)

Because 𝑁𝑢 = 𝑀𝑢 − 𝐴𝑢, substituting this into the previous equation


results in the iteration

𝑢 𝑘+1 = 𝑢 𝑘 + 𝑀 −1 (𝑏 − 𝐴𝑢 𝑘 ) . (B.10)

Defining the residual at iteration 𝑘 as

𝑟 (𝑢 𝑘 ) = 𝑏 − 𝐴𝑢 𝑘 , (B.11)

we can write
𝑢 𝑘+1 = 𝑢 𝑘 + 𝑀 −1 𝑟 (𝑢 𝑘 ) . (B.12)
The splitting matrix 𝑀 is fixed and constructed so that it is easy to
invert. The closer 𝑀 −1 is to the inverse of 𝐴, the better the iterations
work. We now introduce three stationary methods corresponding to
three different splitting matrices.
The Jacobi method consists of setting 𝑀 to be a diagonal matrix 𝐷,
where the diagonal entries are those of 𝐴. Then,

𝑢 𝑘+1 = 𝑢 𝑘 + 𝐷 −1 𝑟 (𝑢 𝑘 ) . (B.13)

In component form, this can be written as

$$u_{i_{k+1}} = \frac{1}{A_{ii}} \left( b_i - \sum_{j=1,\, j \ne i}^{n_u} A_{ij} u_{j_k} \right) , \quad i = 1, \ldots, n_u . \quad (B.14)$$
Using this method, each component in 𝑢 𝑘+1 is independent of each
other at a given iteration; they only depend on the previous iteration
values, 𝑢 𝑘 , and can therefore be done in parallel.
The Gauss–Seidel method is obtained by setting 𝑀 to be the lower
triangular portion of 𝐴 and can be written as

𝑢 𝑘+1 = 𝑢 𝑘 + 𝐸−1 𝑟(𝑢 𝑘 ), (B.15)

where 𝐸 is the lower triangular matrix. Because of the triangular


matrix structure, each component in 𝑢 𝑘+1 is dependent on the previous
elements in the vector, but the iteration can be performed in a single
forward sweep. Writing this in component form yields

$$u_{i_{k+1}} = \frac{1}{A_{ii}} \left( b_i - \sum_{j<i} A_{ij} u_{j_{k+1}} - \sum_{j>i} A_{ij} u_{j_k} \right) , \quad i = 1, \ldots, n_u . \quad (B.16)$$

Unlike the Jacobi iterations, a Gauss–Seidel iteration cannot be per-
formed in parallel because of the terms where 𝑗 < 𝑖, which require
the latest values. Instead, the states must be updated sequentially.
However, the advantage of Gauss–Seidel is that it generally converges
faster than Jacobi iterations.
The successive over-relaxation (SOR) method uses an update that


is a weighted average of the Gauss–Seidel update and the previous
iteration,
𝑢 𝑘+1 = 𝑢 𝑘 + 𝜔 ((1 − 𝜔) 𝐷 + 𝜔𝐸)−1 𝑟(𝑢 𝑘 ), (B.17)
where 𝜔, the relaxation factor, is a scalar between 1 and 2. Setting 𝜔 = 1
yields the Gauss–Seidel method. SOR in component form is as follows:

$$u_{i_{k+1}} = (1 - \omega)\, u_{i_k} + \frac{\omega}{A_{ii}} \left( b_i - \sum_{j<i} A_{ij} u_{j_{k+1}} - \sum_{j>i} A_{ij} u_{j_k} \right) , \quad i = 1, \ldots, n_u . \quad (B.18)$$
With the correct value of 𝜔, SOR converges faster than Gauss–Seidel.

Example B.1 Iterative methods applied to a simple linear system.

Suppose we have the following linear system of two equations:

$$\begin{bmatrix} 2 & -1 \\ -2 & 3 \end{bmatrix} \begin{bmatrix} u_1 \\ u_2 \end{bmatrix} = \begin{bmatrix} 0 \\ 1 \end{bmatrix} .$$

This corresponds to the two lines shown in Fig. B.3, where the solution is at their intersection. Applying the Jacobi iteration (Eq. B.14),

$$u_{1_{k+1}} = \frac{1}{2} u_{2_k} , \qquad u_{2_{k+1}} = \frac{1}{3} \left( 1 + 2 u_{1_k} \right) .$$

Starting with the guess $u^{(0)} = (2, 1)$, we get the iterations shown in Fig. B.3. The Gauss–Seidel iteration (Eq. B.16) is similar, where the only change is that the second equation uses the latest state from the first one:

$$u_{1_{k+1}} = \frac{1}{2} u_{2_k} , \qquad u_{2_{k+1}} = \frac{1}{3} \left( 1 + 2 u_{1_{k+1}} \right) .$$

As expected, Gauss–Seidel converges faster than the Jacobi iteration, taking a more direct path. The SOR iteration is

$$u_{1_{k+1}} = (1 - \omega)\, u_{1_k} + \frac{\omega}{2} u_{2_k} , \qquad u_{2_{k+1}} = (1 - \omega)\, u_{2_k} + \frac{\omega}{3} \left( 1 + 2 u_{1_{k+1}} \right) .$$

SOR converges even faster for the right values of 𝜔. The result shown here is SOR for 𝜔 = 1.2.

[Fig. B.3 Jacobi, Gauss–Seidel, and SOR iterations on the two-equation example, starting from 𝑢0 = (2, 1) and converging to 𝑢∗.]
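A small Python sketch (illustrative, not from the text) of the Gauss–Seidel iteration applied to the system in Example B.1:

```python
import numpy as np

A = np.array([[2.0, -1.0],
              [-2.0, 3.0]])
b = np.array([0.0, 1.0])
u = np.array([2.0, 1.0])          # starting guess from Example B.1

def gauss_seidel_step(u):
    u = u.copy()
    for i in range(len(b)):
        # Use the latest available values for components already updated
        s = b[i] - A[i, :i] @ u[:i] - A[i, i+1:] @ u[i+1:]
        u[i] = s / A[i, i]
    return u

for _ in range(20):
    u = gauss_seidel_step(u)

print(u)                           # converges to the solution [0.25, 0.5]
```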
B.4.2 Conjugate Gradient Method


The conjugate gradient method applies to linear systems where 𝐴 is
symmetric and positive definite. This method can be adapted to solve
nonlinear minimization problems (see Section 4.4.2).
We want to solve a linear system (Eq. B.2) iteratively. This means
that at a given iteration 𝑢 𝑘 , the residual is not necessarily zero and can
be written as
𝑟 𝑘 = 𝐴𝑢 𝑘 − 𝑏 . (B.19)
Solving this linear system is equivalent to minimizing the quadratic
function
$$f(u) = \frac{1}{2} u^\intercal A u - b^\intercal u . \quad (B.20)$$
This is because the gradient of this function is

∇ 𝑓 (𝑢) = 𝐴𝑢 − 𝑏 . (B.21)

Thus, the gradient of the quadratic is the residual of the linear system,

𝑟 𝑘 = ∇ 𝑓 (𝑢 𝑘 ) . (B.22)

We can express the path from any starting point to a solution 𝑢 ∗ as


a sequence of 𝑛 steps with directions 𝑝 𝑘 and length 𝛼 𝑘 :
$$u^* = \sum_{k=0}^{n-1} \alpha_k p_k . \quad (B.23)$$

Substituting this into the quadratic (Eq. B.20), we get


$$\begin{aligned}
f(u^*) &= f\left( \sum_{k=0}^{n-1} \alpha_k p_k \right) \\
&= \frac{1}{2} \left( \sum_{k=0}^{n-1} \alpha_k p_k \right)^\intercal A \left( \sum_{k=0}^{n-1} \alpha_k p_k \right) - b^\intercal \left( \sum_{k=0}^{n-1} \alpha_k p_k \right) \\
&= \frac{1}{2} \sum_{k=0}^{n-1} \sum_{j=0}^{n-1} \alpha_k \alpha_j p_k^\intercal A p_j - \sum_{k=0}^{n-1} \alpha_k b^\intercal p_k .
\end{aligned} \quad (B.24)$$

The conjugate gradient method uses a set of 𝑛 vectors 𝑝 𝑘 that are


conjugate with respect to matrix 𝐴. Such vectors have the following
property:
𝑝 𝑘 | 𝐴𝑝 𝑗 = 0, for all 𝑘 ≠ 𝑗 . (B.25)
Using this conjugacy property, the double-sum term can be simplified
to a single sum,

$$\frac{1}{2} \sum_{k=0}^{n-1} \sum_{j=0}^{n-1} \alpha_k \alpha_j p_k^\intercal A p_j = \frac{1}{2} \sum_{k=0}^{n-1} \alpha_k^2\, p_k^\intercal A p_k . \quad (B.26)$$
𝑘=0 𝑗=0 𝑘=0
B Linear Solvers 561

Then, Eq. B.24 simplifies to


$$f(u^*) = \sum_{k=0}^{n-1} \left( \frac{1}{2} \alpha_k^2\, p_k^\intercal A p_k - \alpha_k b^\intercal p_k \right) . \quad (B.27)$$

Because each term in this sum involves only one direction 𝑝 𝑘 , we have
reduced the original problem to a series of one-dimensional quadratic
functions that can be minimized one at a time. Each one-dimensional
problem corresponds to minimizing the quadratic with respect to the
step length 𝛼 𝑘 . Differentiating each term and setting it to zero yields
the following:

𝑏| 𝑝 𝑘
𝛼 𝑘 𝑝 𝑘 | 𝐴𝑝 𝑘 − 𝑏 | 𝑝 𝑘 = 0 ⇒ 𝛼 𝑘 = . (B.28)
𝑝 𝑘 | 𝐴𝑝 𝑘

Now, the question is: How do we find this set of directions? There
are many sets of directions that satisfy conjugacy. For example, the
eigenvectors of 𝐴 satisfy Eq. B.25.∗ However, it is costly to compute the eigenvectors of a matrix. We want a more convenient way to compute a sequence of conjugate vectors.

∗ Suppose we have two eigenvectors, 𝑣𝑘 and 𝑣ⱼ. Then $v_k^\intercal A v_j = v_k^\intercal (\lambda_j v_j) = \lambda_j v_k^\intercal v_j$. This dot product is zero because the eigenvectors of a symmetric matrix are mutually orthogonal.
The conjugate gradient method sets the first direction to the steepest-
descent direction of the quadratic at the first point. Because the gradient
of the function is the residual of the linear system (Eq. B.22), this first
direction is obtained from the residual at the starting point,

𝑝1 = −𝑟 (𝑢0 ) . (B.29)

Each subsequent direction is set to a new conjugate direction using


the update
𝑝 𝑘+1 = −𝑟 𝑘+1 + 𝛽 𝑘 𝑝 𝑘 , (B.30)
where 𝛽 is set such that 𝑝 𝑘+1 and 𝑝 𝑘 are conjugate with respect to 𝐴.
We can find the expression for 𝛽 by starting with the conjugacy
property that we want to achieve,

𝑝 𝑘+1 | 𝐴𝑝 𝑘 = 0 . (B.31)

Substituting the new direction 𝑝 𝑘+1 with the update (Eq. B.30), we get

(−𝑟 𝑘+1 + 𝛽 𝑘 𝑝 𝑘 )| 𝐴𝑝 𝑘 = 0 . (B.32)

Expanding the terms and solving for 𝛽, we get

$$\beta_k = \frac{r_{k+1}^\intercal A p_k}{p_k^\intercal A p_k} . \quad (B.33)$$
For each search direction 𝑝 𝑘 , we can perform an exact line search by


minimizing the quadratic analytically. The directional derivative of the
quadratic at a point $x$ along the search direction $p$ is as follows:

$$\begin{aligned}
\frac{\partial f(x + \alpha p)}{\partial \alpha} &= \frac{\partial}{\partial \alpha} \left( \frac{1}{2} (x + \alpha p)^\intercal A (x + \alpha p) - b^\intercal (x + \alpha p) \right) \\
&= p^\intercal A (x + \alpha p) - p^\intercal b \\
&= p^\intercal (A x - b) + \alpha p^\intercal A p \\
&= p^\intercal r(x) + \alpha p^\intercal A p .
\end{aligned} \quad (B.34)$$

By setting this derivative to zero, we can get the step size that minimizes
the quadratic along the line to be
$$\alpha_k = -\frac{r_k^\intercal p_k}{p_k^\intercal A p_k} . \quad (B.35)$$

The numerator can be written as a function of the residual alone.


Replacing 𝑝 𝑘 with the conjugate direction update (Eq. B.30), we get

$$\begin{aligned}
r_k^\intercal p_k &= r_k^\intercal \left( -r_k + \beta_k p_{k-1} \right) \\
&= -r_k^\intercal r_k + \beta_k r_k^\intercal p_{k-1} \\
&= -r_k^\intercal r_k .
\end{aligned} \quad (B.36)$$

Here we have used the property of the conjugate directions stating that the residual vector is orthogonal to all previous conjugate directions, so that $r_k^\intercal p_i = 0$ for $i = 0, 1, \ldots, k - 1$.† Thus, we can now write

$$\alpha_k = \frac{r_k^\intercal r_k}{p_k^\intercal A p_k} . \quad (B.37)$$

† For a proof of this property, see Theorem 5.2 in Nocedal and Wright.79
79. Nocedal and Wright, Numerical Optimization, 2006.

The numerator of the expression for 𝛽 (Eq. B.33) can also be written
in terms of the residual alone. Using the expression for the residual
(Eq. B.19) and taking the difference between two subsequent residuals,
we get

𝑟 𝑘+1 − 𝑟 𝑘 = (𝐴𝑢 𝑘+1 − 𝑏) − (𝐴𝑢 𝑘 − 𝑏) = 𝐴 (𝑢 𝑘+1 − 𝑢 𝑘 )


= 𝐴 (𝑢 𝑘 + 𝛼 𝑘 𝑝 𝑘 − 𝑢 𝑘 ) (B.38)
= 𝛼 𝑘 𝐴𝑝 𝑘 .

Using this result in the numerator of 𝛽 in Eq. B.33, we can write


$$\begin{aligned}
r_{k+1}^\intercal A p_k &= \frac{1}{\alpha_k} r_{k+1}^\intercal (r_{k+1} - r_k) \\
&= \frac{1}{\alpha_k} \left( r_{k+1}^\intercal r_{k+1} - r_{k+1}^\intercal r_k \right) \\
&= \frac{1}{\alpha_k} \left( r_{k+1}^\intercal r_{k+1} \right) ,
\end{aligned} \quad (B.39)$$
where we have used the property that the residual at any conjugate gradient iteration is orthogonal to the residuals at all previous iterations, so $r_{k+1}^\intercal r_k = 0$.‡

‡ For a proof of this property, see Theorem 5.3 in Nocedal and Wright.79
79. Nocedal and Wright, Numerical Optimization, 2006.

Now, using this new numerator and using Eq. B.37 to write the denominator as a function of the previous residual, we obtain

$$\beta_k = \frac{r_k^\intercal r_k}{r_{k-1}^\intercal r_{k-1}} . \quad (B.40)$$
We use this result in the nonlinear conjugate gradient method for
function minimization in Section 4.4.2.
The linear conjugate gradient steps are listed in Alg. B.2. The
advantage of this method relative to the direct method is that 𝐴 does
not need to be stored or given explicitly. Instead, we only need to
provide a function that computes matrix-vector products with 𝐴. These
products are required to compute residuals (𝑟 = 𝐴𝑢 − 𝑏) and the 𝐴𝑝
term in the computation of 𝛼. Assuming a well-conditioned problem
with good enough arithmetic precision, the algorithm should converge
to the solution in 𝑛 steps.§

§ Because the linear conjugate gradient method converges in 𝑛 steps, it was originally thought of as a direct method. It was initially dismissed in favor of more efficient direct methods, such as LU factorization. However, the conjugate gradient method was later reframed as an effective iterative method to obtain approximate solutions to large problems.

Algorithm B.2 Linear conjugate gradient

Inputs:
    u^(0): Starting point
    𝜏: Convergence tolerance
Outputs:
    u^*: Solution of linear system

k = 0                                              Initialize iteration counter
while ||r_k||_inf > 𝜏 do
    if k = 0 then
        p_k = -r_k                                 First direction is steepest descent
    else
        𝛽_k = (r_k^T r_k) / (r_{k-1}^T r_{k-1})
        p_k = -r_k + 𝛽_k p_{k-1}                   Conjugate gradient direction update
    end if
    𝛼_k = (r_k^T r_k) / (p_k^T A p_k)              Step length
    u_{k+1} = u_k + 𝛼_k p_k                        Update variables
    k = k + 1                                      Increment iteration index
end while
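As a concrete illustration, the following is a minimal Python sketch of Alg. B.2 (not the book code repository's implementation). It assumes only a user-supplied function matvec(v) that returns the product 𝐴𝑣, and it obtains each new residual from the update r_{k+1} = r_k + 𝛼_k A p_k (which follows from Eq. B.38), so 𝐴 is never formed explicitly:

    import numpy as np

    def linear_cg(matvec, b, u0, tol=1e-8, max_iter=None):
        # Minimal sketch of Alg. B.2: matvec(v) must return A @ v for an SPD matrix A.
        u = np.array(u0, dtype=float)
        r = matvec(u) - b                      # residual r = A u - b
        p = -r                                 # first direction is steepest descent
        max_iter = max_iter if max_iter is not None else len(b)
        for _ in range(max_iter):
            if np.max(np.abs(r)) <= tol:       # infinity-norm convergence check
                break
            Ap = matvec(p)
            alpha = (r @ r) / (p @ Ap)         # step length (Eq. B.37)
            u = u + alpha * p
            r_new = r + alpha * Ap             # residual update from Eq. B.38
            beta = (r_new @ r_new) / (r @ r)   # Eq. B.40
            p = -r_new + beta * p              # conjugate gradient direction update
            r = r_new
        return u

    # Usage sketch on a small symmetric positive-definite system
    A = np.array([[4.0, 1.0], [1.0, 3.0]])
    b = np.array([1.0, 2.0])
    u = linear_cg(lambda v: A @ v, b, np.zeros(2))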

B.4.3 Krylov Subspace Methods


Krylov subspace methods are a more general class of iterative methods.¶
The conjugate gradient is a special case of a Krylov subspace method that
applies only to symmetric positive-definite matrices. However, more
general Krylov subspace methods, such as the generalized minimum
residual (GMRES) method, do not have such restrictions on the matrix.
¶ This is just an overview of Krylov subspace methods; for more details, see Trefethen and Bau III220 or Saad.75 (75. Saad, Iterative Methods for Sparse Linear Systems, 2003. 220. Trefethen and Bau III, Numerical Linear Algebra, 1997.)
Compared with stationary methods of Appendix B.4.1, Krylov methods
have the advantage that they use information gathered throughout the
iterations. Instead of using a fixed splitting matrix, Krylov methods
effectively vary the splitting so that 𝑀 is changed at each iteration
according to some criteria that use the information gathered so far. For
this reason, Krylov methods are usually more efficient than stationary
methods.
Like stationary iteration methods, Krylov methods do not require
forming or storing 𝐴. Instead, the iterations require only matrix-vector
products of the form 𝐴𝑣, where 𝑣 is some vector given by the Krylov
algorithm. The matrix-vector product could be given by a black box, as
shown in Fig. B.2.
For the linear conjugate gradient method (Appendix B.4.2), we
found conjugate directions and minimized the residual of the linear
system in a sequence of these directions.
Krylov subspace methods minimize the residual in a space,

𝑥 0 + 𝒦𝑘 , (B.41)

where 𝑥0 is the initial guess, and 𝒦𝑘 is the Krylov subspace,

𝒦𝑘 (𝐴; 𝑟0 ) ≡ span{𝑟0 , 𝐴𝑟0 , 𝐴2 𝑟0 , . . . , 𝐴 𝑘−1 𝑟0 } . (B.42)

In other words, a Krylov subspace method seeks a solution that is a


linear combination of the vectors 𝑟0 , 𝐴𝑟0 , . . . , 𝐴 𝑘−1 𝑟0 . The definition
of this particular sequence is convenient because these terms can be
computed recursively with the matrix-vector product black box as
𝑟0 , 𝐴(𝑟0 ), 𝐴(𝐴(𝑟0 )), 𝐴(𝐴(𝐴(𝑟0 ))), . . . . Under certain conditions, it can be
shown that the solution of the linear system of size 𝑛 is contained in
the subspace 𝒦𝑛 .
Krylov subspace methods (including the conjugate gradient method)
converge much faster when using preconditioning. Instead of solving
𝐴𝑥 = 𝑏, we solve
(𝑀 −1 𝐴)𝑥 = 𝑀 −1 𝑏 , (B.43)
where 𝑀 is the preconditioning matrix (or simply preconditioner). The
matrix 𝑀 should be similar to 𝐴 and correspond to a linear system that
is easier to solve. The inverse, 𝑀 −1 , should be available explicitly, and

we do not need an explicit form for 𝑀. The matrix resulting from the
product 𝑀 −1 𝐴 should have a smaller condition number so that the new
linear system is better conditioned.
In the extreme case where 𝑀 = 𝐴, that means we have computed
the inverse of 𝐴, and we can get 𝑥 explicitly. In another extreme, 𝑀
could be a diagonal matrix with the diagonal elements of 𝐴, which
would scale 𝐴 such that the diagonal elements are 1.‖
‖ The splitting matrix 𝑀 we used in the equation for the stationary methods (Appendix B.4.1) is effectively a preconditioner. An 𝑀 using the diagonal entries of 𝐴 corresponds to the Jacobi method (Eq. B.13).
Krylov subspace solvers require three main components: (1) an orthogonal basis
for the Krylov subspace, (2) an optimal property that determines the solution
within the subspace, and (3) an effective preconditioner. Various Krylov
subspace methods are possible, depending on the choice for each of these three
components. One of the most popular Krylov subspace methods is the GMRES.221∗∗
221. Saad and Schultz, GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems, 1986.
∗∗ GMRES and other Krylov subspace methods are available in most programming languages, including C/C++, Fortran, Julia, MATLAB, and Python.
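As an illustration of the black-box interface described above (an example sketch, assuming SciPy is available; it is not part of this appendix), SciPy's GMRES accepts a LinearOperator that supplies only matrix-vector products, and a preconditioner can be passed the same way. The Jacobi (diagonal) preconditioner below is just one simple choice:

    import numpy as np
    from scipy.sparse.linalg import LinearOperator, gmres

    n = 100
    A = np.diag(np.linspace(1.0, 50.0, n)) + 0.01 * np.ones((n, n))
    b = np.ones(n)

    # Black-box matrix-vector product: the solver never needs A explicitly
    A_op = LinearOperator((n, n), matvec=lambda v: A @ v)

    # Jacobi preconditioner: apply M^{-1} v using only the diagonal of A
    d = np.diag(A)
    M_inv = LinearOperator((n, n), matvec=lambda v: v / d)

    x, info = gmres(A_op, b, M=M_inv, atol=1e-10)   # info == 0 indicates convergence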
Quasi-Newton Methods
C
C.1 Broyden’s Method

Broyden’s method is the extension of the secant method (from Section 3.8)
to 𝑛 dimensions.222 It can also be viewed as the analog of the
quasi-Newton methods from Section 4.4.4 for solving equations (as
opposed to finding a minimum).
222. Broyden, A class of methods for solving nonlinear simultaneous equations, 1965.
Using the notation from Chapter 3, suppose we have a set of 𝑛
equations 𝑟(𝑢) = [𝑟1 , . . . , 𝑟𝑛 ] = 0 and 𝑛 unknowns 𝑢 = [𝑢1 , . . . , 𝑢𝑛 ].
Writing a Taylor series expansion of 𝑟(𝑢) and selecting the linear term
of the Taylor series expansion of 𝑟 yields

𝐽𝑘+1 (𝑢 𝑘+1 − 𝑢 𝑘 ) ≈ 𝑟 𝑘+1 − 𝑟 𝑘 , (C.1)

where 𝐽 is the (𝑛 × 𝑛) Jacobian, 𝜕𝑟/𝜕𝑢. Defining the step in 𝑢 as

𝑠 𝑘 = 𝑢 𝑘+1 − 𝑢 𝑘 , (C.2)

and the change in the residuals as

𝑦 𝑘 = 𝑟 𝑘+1 − 𝑟 𝑘 , (C.3)

we can write Eq. C.1 as


𝐽˜𝑘+1 𝑠 𝑘 = 𝑦 𝑘 . (C.4)
This is the equivalent of the secant equation (Eq. 4.80). The difference is
that we now approximate the Jacobian instead of the Hessian. The right-
hand side is the difference between two subsequent function values
(which quantifies the directional derivative along the last step) instead
of the difference between gradients (which quantifies the curvature).
We seek a rank 1 update of the form

𝐽˜ = 𝐽˜𝑘 + 𝑣𝑣 | , (C.5)

where the self outer product 𝑣𝑣 | yields a symmetric matrix of rank 1.


Substituting this update into the required condition (Eq. C.4) yields
 
𝐽˜𝑘 + 𝑣𝑣 | 𝑠 𝑘 = 𝑦 𝑘 . (C.6)


Post-multiplying both sides by s_k^\top, rearranging, and dividing by s_k^\top s_k yields

v v^\top = \frac{\left( y_k - \tilde{J}_k s_k \right) s_k^\top}{s_k^\top s_k} .    (C.7)
Substituting this result into the update (Eq. C.5), we get the Jacobian
approximation update,
 
\tilde{J}_{k+1} = \tilde{J}_k + \frac{\left( y_k - \tilde{J}_k s_k \right) s_k^\top}{s_k^\top s_k} ,    (C.8)

where
𝑦 𝑘 = 𝑟 𝑘+1 − 𝑟 𝑘 (C.9)
is the difference in the function values (as opposed to the difference in
the gradients used in optimization).
This update can be inverted using the Sherman–Morrison–Woodbury
formula (Appendix C.3) to get the more useful update on the inverse of
the Jacobian,
 
\tilde{J}_{k+1}^{-1} = \tilde{J}_k^{-1} + \frac{\left( s_k - \tilde{J}_k^{-1} y_k \right) y_k^\top}{y_k^\top y_k} .    (C.10)

We can start with 𝐽˜0−1 = 𝐼. Similar to the Newton step (Eq. 3.30), the step
in Broyden’s method is given by solving the linear system. Because the
inverse is provided explicitly, we can just perform the multiplication,

Δ𝑢 𝑘 = −𝐽˜−1 𝑟 𝑘 . (C.11)

Then we update the variables as

𝑢 𝑘+1 = 𝑢 𝑘 + Δ𝑢 𝑘 . (C.12)
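A minimal Python sketch of this procedure (not the book code repository's implementation) is shown below; the residual function and starting point are supplied by the user, and the test system is only illustrative:

    import numpy as np

    def broyden(residual, u0, tol=1e-10, max_iter=100):
        # Broyden's method with the inverse Jacobian update of Eq. C.10
        u = np.array(u0, dtype=float)
        V = np.eye(len(u))                 # approximate inverse Jacobian, start with identity
        r = residual(u)
        for _ in range(max_iter):
            if np.linalg.norm(r) <= tol:
                break
            du = -V @ r                    # step (Eq. C.11)
            u = u + du                     # update variables (Eq. C.12)
            r_new = residual(u)
            y = r_new - r                  # change in the residuals (Eq. C.9)
            V = V + np.outer(du - V @ y, y) / (y @ y)   # inverse update (Eq. C.10)
            r = r_new
        return u

    # Usage sketch on a mildly nonlinear system whose Jacobian is close to the identity
    def r_test(u):
        return np.array([u[0] + 0.1 * u[1] ** 2 - 1.0,
                         u[1] + 0.1 * u[0] ** 2 - 2.0])

    u_sol = broyden(r_test, np.zeros(2))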

C.2 Additional Quasi-Newton Approximations

In Section 4.4.4, we introduced the Broyden–Fletcher–Goldfarb–Shanno


(BFGS) quasi-Newton approximation for unconstrained optimization,
which was also used in Section 5.5 for constrained optimization. Here
we expand on that to introduce other quasi-Newton approximations
and generalize them.
To get a unique solution for the approximate Hessian update,
quasi-Newton methods quantify the “closeness” of successive Hessian

approximations by using some norm of the difference between the two


matrices, leading to the following optimization problem:

minimize     \| \tilde{H} - \tilde{H}_k \|
by varying   \tilde{H}
subject to   \tilde{H} = \tilde{H}^\top              (C.13)
             \tilde{H} s_k = y_k ,

where, 𝑦 𝑘 = ∇ 𝑓 𝑘+1 − ∇ 𝑓 𝑘 , and 𝑠 𝑘 = 𝑥 𝑘+1 − 𝑥 𝑘 (the latest step). There


are several possibilities for quantifying the “closeness” between matri-
ces and satisfying the constraints, leading to different quasi-Newton
updates. With a convenient choice of matrix norm, we can solve this
optimization problem analytically to obtain a formula for 𝐻˜ 𝑘+1 as a
function of 𝐻˜ 𝑘 , 𝑠 𝑘 , and 𝑦 𝑘 .
The optimization problem (Eq. C.13) does not enforce a positive-
definiteness constraint. It turns out that the update formula always
produces a 𝐻˜ 𝑘+1 that is positive definite, provided that 𝐻˜ 𝑘 is positive
definite. The fact that the curvature condition (Eq. 4.81) is satisfied for
each step helps with this.

C.2.1 Davidon–Fletcher–Powell Update


The Davidon–Fletcher–Powell (DFP) update can be derived using a
similar approach to that used to derive the BFGS update in Section 4.4.4.
However, instead of starting with the update for the Hessian, we start
with the update to the Hessian inverse,

𝑉˜ 𝑘+1 = 𝑉˜ 𝑘 + 𝛼𝑢𝑢 | + 𝛽𝑣𝑣 | . (C.14)

We need the inverse version of the secant equation (Eq. 4.80), which is

𝑉˜ 𝑘+1 𝑦 𝑘 = 𝑠 𝑘 . (C.15)

Setting 𝑢 = 𝑠 𝑘 and 𝑣 = 𝑉˜ 𝑘 𝑦 𝑘 in the update (Eq. C.14) and substituting it


into the inverse version of the secant equation (Eq. C.15), we get

\tilde{V}_k y_k + \alpha \, s_k s_k^\top y_k + \beta \, \tilde{V}_k y_k y_k^\top \tilde{V}_k y_k = s_k .    (C.16)

We can obtain the coefficients 𝛼 and 𝛽 by rearranging this equation and


using similar arguments to those used in the BFGS update derivation
(see Section 4.4.4). The DFP update for the Hessian inverse approxima-
tion is

\tilde{V}_{k+1} = \tilde{V}_k + \frac{1}{y_k^\top s_k} s_k s_k^\top - \frac{1}{y_k^\top \tilde{V}_k y_k} \tilde{V}_k y_k y_k^\top \tilde{V}_k .    (C.17)

However, the DFP update was originally derived by solving the


optimization problem (Eq. C.13), which minimizes a matrix norm
of the update while enforcing symmetry and the secant equation.
This problem can be solved analytically through the Karush–Kuhn–
Tucker (KKT) conditions and a convenient matrix norm. The weighted
Frobenius norm (Eq. A.35) was the norm used in this case, where the
weights were based on an averaged Hessian inverse. The derivation is
lengthy and is not included here. The final result is the update,
\tilde{H}_{k+1} = \left( I - \sigma_k y_k s_k^\top \right) \tilde{H}_k \left( I - \sigma_k s_k y_k^\top \right) + \sigma_k y_k y_k^\top ,    (C.18)
where
\sigma_k = \frac{1}{y_k^\top s_k} .    (C.19)
This can be inverted using the Sherman–Morrison–Woodbury formula
(Appendix C.3) to get the update on the inverse (Eq. C.17).

C.2.2 BFGS
The BFGS update was informally derived in Section 4.4.4. As discussed
previously, obtaining an approximation of the Hessian inverse is a more
efficient way to get the quasi-Newton step.
Similar to DFP, BFGS was originally formally derived by analytically
solving an optimization problem. However, instead of solving the
optimization problem of Eq. C.13, we solve a similar problem using the
Hessian inverse approximation instead. This problem can be stated as

minimize 𝑉˜ − 𝑉˜ 𝑘
subject to 𝑉˜ 𝑦 𝑘 = 𝑠 𝑘 (C.20)
𝑉˜ = 𝑉˜ | ,

where 𝑉˜ is the updated inverse Hessian that we seek, 𝑉˜ 𝑘 is the inverse


Hessian approximation from the previous step. The first constraint is
known as the secant equation applied to the inverse. The second constraint
enforces symmetric updates. We do not explicitly specify positive
definiteness. The matrix norm is again a weighted Frobenius norm
(Eq. A.35), but now the weights are based on an averaged Hessian
(instead of the inverse for DFP). Solving this optimization problem
(Eq. C.20), the final result is
𝑉˜ 𝑘+1 = (𝐼 − 𝜎 𝑘 𝑠 𝑘 𝑦 𝑘 | ) 𝑉˜ 𝑘 (𝐼 − 𝜎 𝑘 𝑦 𝑘 𝑠 𝑘 | ) + 𝜎 𝑘 𝑠 𝑘 𝑠 𝑘 | , (C.21)
where
\sigma_k = \frac{1}{y_k^\top s_k} .    (C.22)
This is identical to Eq. 4.88.
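A minimal sketch of applying Eq. C.21 in code is shown below; the curvature check before the update is a common safeguard and is an assumption here, not part of Eq. C.21:

    import numpy as np

    def bfgs_inverse_update(V, s, y):
        # BFGS update of the inverse Hessian approximation (Eqs. C.21 and C.22)
        sigma = 1.0 / (y @ s)
        I = np.eye(len(s))
        return (I - sigma * np.outer(s, y)) @ V @ (I - sigma * np.outer(y, s)) \
            + sigma * np.outer(s, s)

    # Usage sketch: one update starting from the identity
    V = np.eye(2)
    s = np.array([0.1, -0.2])   # latest step, x_{k+1} - x_k
    y = np.array([0.3, -0.1])   # gradient difference
    if y @ s > 0:               # curvature condition safeguard (assumed)
        V = bfgs_inverse_update(V, s, y)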

C.2.3 Symmetric Rank 1 Update


The symmetric rank 1 (SR1) update is a quasi-Newton update that is
rank 1 as opposed to the rank 2 update of DFP and BFGS (Eq. C.14). The
SR1 update can be derived formally without solving the optimization
problem of Eq. C.13 because there is only one update that satisfies the
secant equation.
Similar to the rank 2 update of the approximate inverse Hessian
(Eq. 4.82), we construct the update,

𝑉˜ = 𝑉˜ 𝑘 + 𝛼𝑣𝑣 | , (C.23)

where we only need one self outer product to produce a rank 1 update
(as opposed to two).
Substituting the rank 1 update (Eq. C.23) into the secant equation,
we obtain
𝑉˜ 𝑘 𝑦 𝑘 + 𝛼𝑣𝑣 | 𝑦 𝑘 = 𝑠 𝑘 . (C.24)
Rearranging yields
(𝛼𝑣 | 𝑦 𝑘 ) 𝑣 = 𝑠 𝑘 − 𝑉˜ 𝑘 𝑦 𝑘 . (C.25)
Thus, we have to make sure that 𝑣 is in the direction of 𝑠 𝑘 − 𝑉˜ 𝑘 𝑦 𝑘 . The
scalar 𝛼 must be such that the scaling of the vectors on both sides of the
equation match each other. We define a normalized 𝑣 in the desired
direction,
v = \frac{s_k - \tilde{V}_k y_k}{\left\| s_k - \tilde{V}_k y_k \right\|_2} .    (C.26)
To find the correct value for 𝛼, we substitute Eq. C.26 into Eq. C.25 to
get
s_k - \tilde{V}_k y_k = \alpha \, \frac{s_k^\top y_k - y_k^\top \tilde{V}_k y_k}{\left\| s_k - \tilde{V}_k y_k \right\|_2^2} \left( s_k - \tilde{V}_k y_k \right) .    (C.27)
Solving for 𝛼 yields

\alpha = \frac{\left\| s_k - \tilde{V}_k y_k \right\|_2^2}{s_k^\top y_k - y_k^\top \tilde{V}_k y_k} .    (C.28)

Substituting Eqs. C.26 and C.28 into Eq. C.23, we get the SR1 update

\tilde{V} = \tilde{V}_k + \frac{1}{s_k^\top y_k - y_k^\top \tilde{V}_k y_k} \left( s_k - \tilde{V}_k y_k \right) \left( s_k - \tilde{V}_k y_k \right)^\top .    (C.29)

Because it is possible for the denominator in this update to be zero, the


update requires safeguarding. This update is not positive definite in
general because the denominator can be negative.

As in the BFGS method, the search direction at each major iteration


is given by 𝑝 𝑘 = −𝑉˜ 𝑘 ∇ 𝑓 𝑘 and a line search with 𝛼init = 1 determines the
final step length.
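A minimal sketch of the SR1 update with the safeguarding mentioned above is shown below; the specific skip threshold is an assumption, not a prescription from this section:

    import numpy as np

    def sr1_inverse_update(V, s, y, skip_tol=1e-8):
        # SR1 update of the inverse Hessian approximation (Eq. C.29)
        w = s - V @ y                      # s_k - V_k y_k
        denom = w @ y                      # equals s_k^T y_k - y_k^T V_k y_k
        if abs(denom) <= skip_tol * np.linalg.norm(w) * np.linalg.norm(y):
            return V                       # safeguard: skip the update when the denominator is tiny
        return V + np.outer(w, w) / denom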

C.2.4 Unification of SR1, DFP, and BFGS


The SR1, DFP, and BFGS updates for the inverse Hessian approximation
can be expressed using the following more general formula:
  
\tilde{V}_{k+1} = \tilde{V}_k +
\begin{bmatrix} \tilde{V}_k y_k & s_k \end{bmatrix}
\begin{bmatrix} \alpha & \beta \\ \beta & \gamma \end{bmatrix}
\begin{bmatrix} y_k^\top \tilde{V}_k \\ s_k^\top \end{bmatrix} .    (C.30)

For the SR1 method, we have


\alpha_{\mathrm{SR1}} = \frac{1}{y_k^\top s_k - y_k^\top \tilde{V}_k y_k}
\beta_{\mathrm{SR1}}  = -\frac{1}{y_k^\top s_k - y_k^\top \tilde{V}_k y_k}    (C.31)
\gamma_{\mathrm{SR1}} = \frac{1}{y_k^\top s_k - y_k^\top \tilde{V}_k y_k} .

For the DFP method, we have


\alpha_{\mathrm{DFP}} = -\frac{1}{y_k^\top \tilde{V}_k y_k} , \quad \beta_{\mathrm{DFP}} = 0 , \quad \gamma_{\mathrm{DFP}} = \frac{1}{y_k^\top s_k} .    (C.32)

For the BFGS method, we have

\alpha_{\mathrm{BFGS}} = 0 , \quad \beta_{\mathrm{BFGS}} = -\frac{1}{y_k^\top s_k} , \quad \gamma_{\mathrm{BFGS}} = \frac{1}{y_k^\top s_k} + \frac{y_k^\top \tilde{V}_k y_k}{\left( y_k^\top s_k \right)^2} .    (C.33)

C.3 Sherman–Morrison–Woodbury Formula

The formal derivations of the DFP and BFGS methods use the Sherman–
Morrison–Woodbury formula (also known as the Woodbury matrix
identity). Suppose that the inverse of a matrix is known, and then the
matrix is perturbed. The Sherman–Morrison–Woodbury formula gives
the inverse of the perturbed matrix without having to re-invert the
perturbed matrix. We used this formula in Section 4.4.4 to derive the
quasi-Newton update.
One possible perturbation is a rank 1 update of the form

𝐴ˆ = 𝐴 + 𝑢𝑣 | , (C.34)

where 𝑢 and 𝑣 are 𝑛-vectors. This is a rank 1 update to 𝐴 because 𝑢𝑣 |


is an outer product that produces a matrix whose rank is equal to 1 (see
Fig. 4.50).
If 𝐴ˆ is nonsingular, and 𝐴−1 is known, the Sherman–Morrison–
Woodbury formula gives

\hat{A}^{-1} = A^{-1} - \frac{A^{-1} u v^\top A^{-1}}{1 + v^\top A^{-1} u} .    (C.35)
This formula can be verified by multiplying Eq. C.34 and Eq. C.35,
which yields the identity matrix.
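This check is also easy to reproduce numerically; the short sketch below compares Eq. C.35 against a direct inversion of the perturbed matrix for a random example:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 4
    A = rng.random((n, n)) + n * np.eye(n)     # well-conditioned test matrix
    u = rng.random(n)
    v = rng.random(n)

    A_inv = np.linalg.inv(A)
    # Sherman-Morrison formula (Eq. C.35) for the rank 1 perturbation A + u v^T
    A_hat_inv = A_inv - (A_inv @ np.outer(u, v) @ A_inv) / (1.0 + v @ A_inv @ u)

    # Compare against inverting the perturbed matrix directly
    assert np.allclose(A_hat_inv, np.linalg.inv(A + np.outer(u, v)))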
This formula can be generalized for higher-rank updates as follows:

𝐴ˆ = 𝐴 + 𝑈𝑉 | , (C.36)

where 𝑈 and 𝑉 are (𝑛 × 𝑝) matrices for some 𝑝 between 1 and 𝑛. Then,


 
\hat{A}^{-1} = A^{-1} - A^{-1} U \left( I + V^\top A^{-1} U \right)^{-1} V^\top A^{-1} .    (C.37)

Although we need to invert a new matrix, 𝐼 + 𝑉 | 𝐴−1 𝑈 , this matrix is
typically small and can be inverted analytically for 𝑝 = 2 for the rank 2
update, for example.
Test Problems
D
D.1 Unconstrained Problems

D.1.1 Slanted Quadratic Function


This is a smooth two-dimensional function suitable for a first test of a
gradient-based optimizer:

f(x_1, x_2) = x_1^2 + x_2^2 - \beta x_1 x_2 ,    (D.1)

where 𝛽 ∈ [0, 2). A 𝛽 value of zero corresponds to perfectly circular
contours. As 𝛽 increases, the contours become increasingly slanted.
For 𝛽 = 2, the quadratic becomes semidefinite, and there is a line of
weak minima. For 𝛽 > 2, the quadratic is indefinite, and there is no
minimum. An intermediate value of 𝛽 = 3/2 is suitable for first tests
and yields the contours shown in Fig. D.1.
[Fig. D.1: Slanted quadratic function for 𝛽 = 3/2.]
Global minimum: 𝑓 (𝑥 ∗ ) = 0 at 𝑥 ∗ = (0, 0).

D.1.2 Rosenbrock Function


The two-dimensional Rosenbrock function, shown in Fig. D.2, is also
known as Rosenbrock’s valley or banana function. This function was
introduced by Rosenbrock,223 who used it as a benchmark problem for
optimization algorithms.
223. Rosenbrock, An automatic method for finding the greatest or least value of a function, 1960.
The function is defined as follows:

f(x_1, x_2) = (1 - x_1)^2 + 100 \left( x_2 - x_1^2 \right)^2 .    (D.2)

[Fig. D.2: Rosenbrock function.]
This became a classic benchmarking function because of its narrow turning
valley. The large difference between the maximum and minimum
curvatures, and the fact that the principal curvature directions change
along the valley, makes it a good test for quasi-Newton methods.
The Rosenbrock function can be extended to 𝑛 dimensions by
defining the sum,

f(x) = \sum_{i=1}^{n-1} \left[ 100 \left( x_{i+1} - x_i^2 \right)^2 + (1 - x_i)^2 \right] .    (D.3)


Global minimum: 𝑓 (𝑥 ∗ ) = 0.0 at 𝑥 ∗ = (1, 1, . . . , 1).


Local minimum: For 𝑛 ≥ 4, a local minimum exists near 𝑥 = (−1, 1, . . . , 1).
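A minimal Python implementation of Eq. D.3 (a sketch, not the book code repository's version) is:

    import numpy as np

    def rosenbrock(x):
        # n-dimensional Rosenbrock function (Eq. D.3)
        x = np.asarray(x, dtype=float)
        return np.sum(100.0 * (x[1:] - x[:-1] ** 2) ** 2 + (1.0 - x[:-1]) ** 2)

    print(rosenbrock([1.0, 1.0, 1.0]))   # 0.0 at the global minimum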

D.1.3 Bean Function


The “bean” function was developed in this book as a milder version
of the Rosenbrock function: it has the same curved valley as the
Rosenbrock function but without the extreme variations in curvature.
The function, shown in Fig. D.3, is

f(x_1, x_2) = (1 - x_1)^2 + (1 - x_2)^2 + \frac{1}{2} \left( 2 x_2 - x_1^2 \right)^2 .    (D.4)

Global minimum: 𝑓 (𝑥 ∗ ) = 0.09194 at 𝑥 ∗ = (1.21314, 0.82414).
[Fig. D.3: Bean function.]
D.1.4 Jones Function
This is a fourth-order smooth multimodal function that is useful to test
global search algorithms and also gradient-based algorithms starting
from different points. There are saddle points, maxima, and minima,
with one global minimum. This function, shown in Fig. D.4 along with
the local and global minima, is

f(x_1, x_2) = x_1^4 + x_2^4 - 4 x_1^3 - 3 x_2^3 + 2 x_1^2 + 2 x_1 x_2 .    (D.5)

Global minimum: 𝑓 (𝑥 ∗ ) = −13.5320 at 𝑥 ∗ = (2.6732, −0.6759).
Local minima: 𝑓 (𝑥) = −9.7770 at 𝑥 = (−0.4495, 2.2928).
              𝑓 (𝑥) = −9.0312 at 𝑥 = (2.4239, 1.9219).
[Fig. D.4: Jones multimodal function.]


D.1.5 Hartmann Function
The Hartmann function is a three-dimensional smooth function with
multiple local minima:

f(x) = -\sum_{i=1}^{4} \alpha_i \exp\left( -\sum_{j=1}^{3} A_{ij} \left( x_j - P_{ij} \right)^2 \right) ,    (D.6)

where

\alpha = [1.0, 1.2, 3.0, 3.2] ,

A = \begin{bmatrix} 3 & 10 & 30 \\ 0.1 & 10 & 35 \\ 3 & 10 & 30 \\ 0.1 & 10 & 35 \end{bmatrix} , \quad
P = 10^{-4} \begin{bmatrix} 3689 & 1170 & 2673 \\ 4699 & 4387 & 7470 \\ 1091 & 8732 & 5547 \\ 381 & 5743 & 8828 \end{bmatrix} .    (D.7)

A slice of the function, at the optimal value of 𝑥1 = 0.1148, is shown
in Fig. D.5.
[Fig. D.5: An 𝑥2–𝑥3 slice of the Hartmann function at 𝑥1 = 0.1148.]
Global minimum: 𝑓 (𝑥 ∗ ) = −3.86278 at 𝑥 ∗ = (0.11480, 0.55566, 0.85254).
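A minimal implementation of Eqs. D.6 and D.7 (a sketch, not the book code repository's version) is:

    import numpy as np

    ALPHA = np.array([1.0, 1.2, 3.0, 3.2])
    A = np.array([[3.0, 10.0, 30.0],
                  [0.1, 10.0, 35.0],
                  [3.0, 10.0, 30.0],
                  [0.1, 10.0, 35.0]])
    P = 1e-4 * np.array([[3689.0, 1170.0, 2673.0],
                         [4699.0, 4387.0, 7470.0],
                         [1091.0, 8732.0, 5547.0],
                         [ 381.0, 5743.0, 8828.0]])

    def hartmann(x):
        # Three-dimensional Hartmann function (Eq. D.6)
        x = np.asarray(x, dtype=float)
        inner = np.sum(A * (x - P) ** 2, axis=1)   # sum over j for each i
        return -np.sum(ALPHA * np.exp(-inner))

    print(hartmann([0.11480, 0.55566, 0.85254]))   # approximately -3.8628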

D.1.6 Aircraft Wing Design


We want to optimize the rectangular planform wing of a general
aviation-sized aircraft by changing its wingspan and chord (see Ex. 1.1).
In general, we would add many more design variables to a problem
like this, but we are limiting it to a simple two-dimensional problem so
that we can easily visualize the results.
The objective is to minimize the required power, thereby taking into
account drag and propulsive efficiency, which are speed dependent.
The following describes a basic performance estimation methodology
for a low-speed aircraft. Implementing it may not seem like it has
much to do with optimization. The physics is not important for our
purposes, but practice translating equations and concepts into code is
an important element of formulating optimization problems in general.
In level flight, the aircraft must generate enough lift to equal the
required weight, so
𝐿=𝑊. (D.8)
We assume that the total weight consists of a fixed aircraft and payload
weight 𝑊0 and a component of the weight that depends on the wing
area 𝑆—that is,
𝑊 = 𝑊0 + 𝑊𝑠 𝑆 . (D.9)
The wing can produce a certain lift coefficient (𝐶 𝐿 ) and so we must
make the wing area (𝑆) large enough to produce sufficient lift. Using
the definition of lift coefficient, the total lift can be computed as

𝐿 = 𝑞𝐶 𝐿 𝑆 , (D.10)

where 𝑞 is the dynamic pressure and


q = \frac{1}{2} \rho v^2 .    (D.11)
If we use a rectangular wing, then the wing area can be computed from
the wingspan (𝑏) and the chord (𝑐) as

𝑆 = 𝑏𝑐 . (D.12)

The aircraft drag consists of two components: viscous drag and


induced drag. The viscous drag can be approximated as

𝐷 𝑓 = 𝑘𝐶 𝑓 𝑞𝑆wet . (D.13)

For a fully turbulent boundary layer, the skin friction coefficient, 𝐶 𝑓 ,


can be approximated as
C_f = \frac{0.074}{Re^{0.2}} .    (D.14)

In this equation, the Reynolds number is based on the wing chord and
is defined as follows:
Re = \frac{\rho v c}{\mu} ,    (D.15)
where 𝜌 is the air density, and 𝜇 is the air dynamic viscosity. The form
factor, 𝑘, accounts for the effects of pressure drag. The wetted area,
𝑆wet , is the area over which the skin friction drag acts, which is a little
more than twice the planform area. We will use

𝑆wet = 2.05𝑆 . (D.16)

The induced drag is defined as

D_i = \frac{L^2}{q \pi b^2 e} ,    (D.17)

where 𝑒 is the Oswald efficiency factor. The total drag is the sum of
induced and viscous drag, 𝐷 = 𝐷𝑖 + 𝐷 𝑓 .
Our objective function, the power required by the motor for level
flight, is
P(b, c) = \frac{D v}{\eta} ,    (D.18)
where 𝜂 is the propulsive efficiency. We assume that our electric
propellers have a Gaussian efficiency curve (real efficiency curves are
not Gaussian, but this is simple and will be sufficient for our purposes):
 
\eta = \eta_{\max} \exp\left( \frac{-(v - \bar{v})^2}{2 \sigma^2} \right) .    (D.19)

In this problem, the lift coefficient is provided. Therefore, to satisfy


the lift requirement in Eq. D.8, we need to compute the velocity using
Eq. D.11 and Eq. D.10 as
v = \sqrt{\frac{2 L}{\rho C_L S}} .    (D.20)

This is the same problem that was presented in Ex. 1.2 of Chapter 1.
The optimal wingspan and chord are 𝑏 = 25.48 m and 𝑐 = 0.50 m,
respectively, given the parameters. The contour and the optimal wing
shape are shown in Fig. D.6.
[Fig. D.6: Wing design problem with power requirement contour.]
Because there are no structural considerations in this problem, the
resulting wing has a higher wing aspect ratio than is realistic. This
emphasizes the importance of carefully selecting the objective and
including all relevant constraints.

The parameters for this problem are given as follows:

Parameter Value Unit Description


𝜌 1.2 kg/m3 Density of air
𝜇 1.8 × 10−5 kg/(m sec) Viscosity of air
𝑘 1.2 Form factor
𝐶𝐿 0.4 Lift coefficient
𝑒 0.80 Oswald efficiency factor
𝑊0 1,000 N Fixed aircraft weight
𝑊𝑠 8.0 N/m2 Wing area dependent weight
𝜂max 0.8 Peak propulsive efficiency
𝑣¯ 20.0 m/s Flight speed at peak
propulsive efficiency
𝜎 5.0 m/s Standard deviation of
efficiency function
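Because this problem is intended as practice in translating equations into code, one possible Python translation is sketched below (this is not the book code repository's implementation; it simply chains Eqs. D.8 to D.20 with the parameter values from the table above):

    import numpy as np

    # Parameters from the table above
    rho, mu, k, CL, e = 1.2, 1.8e-5, 1.2, 0.4, 0.80
    W0, Ws = 1000.0, 8.0
    eta_max, v_bar, sigma = 0.8, 20.0, 5.0

    def power(b, c):
        # Power required for level flight as a function of span b and chord c
        S = b * c                                   # wing area (Eq. D.12)
        W = W0 + Ws * S                             # total weight (Eq. D.9)
        L = W                                       # level flight (Eq. D.8)
        v = np.sqrt(2.0 * L / (rho * CL * S))       # speed for required lift (Eq. D.20)
        q = 0.5 * rho * v ** 2                      # dynamic pressure (Eq. D.11)
        Re = rho * v * c / mu                       # Reynolds number (Eq. D.15)
        Cf = 0.074 / Re ** 0.2                      # skin friction coefficient (Eq. D.14)
        Swet = 2.05 * S                             # wetted area (Eq. D.16)
        Df = k * Cf * q * Swet                      # viscous drag (Eq. D.13)
        Di = L ** 2 / (q * np.pi * b ** 2 * e)      # induced drag (Eq. D.17)
        D = Df + Di
        eta = eta_max * np.exp(-((v - v_bar) ** 2) / (2.0 * sigma ** 2))  # Eq. D.19
        return D * v / eta                          # Eq. D.18

    print(power(25.48, 0.50))   # power at the reported optimum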

D.1.7 Brachistochrone Problem


The brachistochrone problem is a classic problem proposed by Johann
Bernoulli (see Section 2.2 for the historical background). Although this
was originally solved analytically, we discretize the model and solve
the problem using numerical optimization. This is a useful problem
for benchmarking because you can change the number of dimensions.
A bead is set on a wire that defines a path that we can shape. The
bead starts at some 𝑦-position ℎ with zero velocity. For convenience,
we define the starting point at 𝑥 = 0.
From the law of conservation of energy, we can then find the
velocity of the bead at any other location. The initial potential energy
is converted to kinetic energy, potential energy, and dissipative work
from friction acting along the path length, yielding the following:
m g h = \frac{1}{2} m v^2 + m g y + \int_0^x \mu_k m g \cos\theta \, \mathrm{d}s
0 = \frac{1}{2} v^2 + g (y - h) + \mu_k g x    (D.21)
v = \sqrt{2 g (h - y - \mu_k x)} .

Now that we know the speed of the bead as a function of 𝑥, we can


compute the time it takes to traverse a differential element of length d𝑠:

\Delta t = \int_{x_i}^{x_i + \mathrm{d}x} \frac{\mathrm{d}s}{v(x)}
         = \int_{x_i}^{x_i + \mathrm{d}x} \frac{\sqrt{\mathrm{d}x^2 + \mathrm{d}y^2}}{\sqrt{2 g \left( h - y(x) - \mu_k x \right)}}    (D.22)
         = \int_{x_i}^{x_i + \mathrm{d}x} \frac{\sqrt{1 + \left( \frac{\mathrm{d}y}{\mathrm{d}x} \right)^2} \, \mathrm{d}x}{\sqrt{2 g \left( h - y(x) - \mu_k x \right)}} .
To discretize this problem, we can divide the path into linear
segments. As an example, Fig. D.7 shows the wire divided into four
linear segments (five nodes) as an approximation of a continuous wire.
The slope 𝑠 𝑖 = (Δ𝑦/Δ𝑥)𝑖 is then a constant along a given segment, and
𝑦(𝑥) = 𝑦 𝑖 + 𝑠 𝑖 (𝑥 − 𝑥 𝑖 ). Making these substitutions results in

\Delta t_i = \frac{\sqrt{1 + s_i^2}}{\sqrt{2 g}} \int_{x_i}^{x_{i+1}} \frac{\mathrm{d}x}{\sqrt{h - y_i - s_i (x - x_i) - \mu_k x}} .    (D.23)

[Fig. D.7: A discretized representation of the brachistochrone problem.]
Performing the integration and simplifying (many steps omitted here)
results in

\Delta t_i = \sqrt{\frac{2}{g}} \, \frac{\sqrt{\Delta x_i^2 + \Delta y_i^2}}{\sqrt{h - y_{i+1} - \mu_k x_{i+1}} + \sqrt{h - y_i - \mu_k x_i}} ,    (D.24)

where Δ𝑥 𝑖 = (𝑥 𝑖+1 − 𝑥 𝑖 ), and Δ𝑦 𝑖 = (𝑦 𝑖+1 − 𝑦 𝑖 ). The objective of the


optimization is to minimize the total travel time, so we need to sum up
the travel time across all of our linear segments:

Õ
𝑛−1
𝑇= Δ𝑡 𝑖 . (D.25)
𝑖=1

Minimization is unaffected by multiplying by a constant, so we can


remove the multiplicative constant for simplicity (we see that the
magnitude of the acceleration of gravity has no effect on the optimal
path):
minimize     f = \sum_{i=1}^{n-1} \frac{\sqrt{\Delta x_i^2 + \Delta y_i^2}}{\sqrt{h - y_{i+1} - \mu_k x_{i+1}} + \sqrt{h - y_i - \mu_k x_i}}    (D.26)
by varying   y_i , \quad i = 2, \ldots, n-1 .

The design variables are the 𝑛−2 positions of the path parameterized
by 𝑦 𝑖 . The endpoints must be fixed; otherwise, the problem is ill-defined,
which is why there are 𝑛 − 2 design variables instead of 𝑛. Note that
𝑥 is a parameter, meaning that it is fixed. We could space the 𝑥 𝑖 any
reasonable way and still find the same underlying optimal curve, but

it is easiest to just use uniform spacing. As the dimensionality of the


problem increases, the solution becomes more challenging. We will
use the following specifications:

• Starting point: (𝑥, 𝑦) = (0, 1) m.


• Ending point: (𝑥, 𝑦) = (1, 0) m.
• Kinetic coefficient of friction 𝜇 𝑘 = 0.3.

The analytic solution for the case with friction is more difficult to
derive, but the analytic solution for the frictionless case (𝜇 𝑘 = 0) with
our starting and ending points is as follows:

𝑥 = 𝑎(𝜃 − sin(𝜃))
(D.27)
𝑦 = −𝑎(1 − cos(𝜃)) + 1 ,

where 𝑎 = 0.572917 and 𝜃 ∈ [0, 2.412].
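A minimal sketch of the discretized objective (Eq. D.26) is shown below; it assumes uniformly spaced 𝑥 values, the fixed endpoints and friction coefficient listed above, and takes the interior 𝑦 values as the design variables:

    import numpy as np

    mu_k = 0.3          # kinetic friction coefficient
    h = 1.0             # starting height: starting point (0, 1), ending point (1, 0)

    def travel_time(y_inner):
        # Objective of Eq. D.26 for interior y values with fixed endpoints
        y = np.concatenate(([1.0], np.asarray(y_inner, dtype=float), [0.0]))
        x = np.linspace(0.0, 1.0, len(y))          # uniform x spacing (x is a fixed parameter)
        dx, dy = np.diff(x), np.diff(y)
        root = np.sqrt(h - y - mu_k * x)           # sqrt(h - y_i - mu_k x_i) at each node
        return np.sum(np.sqrt(dx ** 2 + dy ** 2) / (root[1:] + root[:-1]))

    # Usage sketch: a straight-line initial path with 10 interior points
    print(travel_time(np.linspace(1.0, 0.0, 12)[1:-1]))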

D.1.8 Spring System


Consider a connected spring system of two springs with lengths of 𝑙1
and 𝑙2 and stiffnesses of 𝑘 1 and 𝑘2 , fixed at the walls as shown in Fig. D.8.
An object with mass 𝑚 is suspended between the two springs. It will
naturally deform such that the sum of the gravitational and spring
potential energy, 𝐸 𝑝 , is at the minimum.

[Fig. D.8: Two-spring system with no applied force (top) and with applied force (bottom).]

The total potential energy of the spring system is

E_p(x_1, x_2) = \frac{1}{2} k_1 (\Delta l_1)^2 + \frac{1}{2} k_2 (\Delta l_2)^2 - m g x_2 ,    (D.28)
where Δ𝑙1 and Δ𝑙2 are the changes in length for the two springs. With
respect to the original lengths, and displacements 𝑥 1 and 𝑥 2 as shown,

they are defined as


q
Δ𝑙1 = (𝑙1 + 𝑥1 )2 + 𝑥22 − 𝑙1
q (D.29)
2
Δ𝑙2 = (𝑙2 − 𝑥1 ) + 𝑥22 − 𝑙2 .

This can be minimized to determine the final location of the object.


With initial lengths of 𝑙1 = 12 cm, 𝑙2 = 8 cm; spring stiffnesses of
𝑘 1 = 1.0 N/cm, 𝑘2 = 10.0 N/cm; and a force due to gravity of 𝑚 𝑔 = 7 N,
the minimum potential energy is at (𝑥1 , 𝑥2 ) = (2.7852, 6.8996). The
contour of 𝐸 𝑝 with respect to 𝑥1 and 𝑥 2 is shown in Fig. D.9.

[Fig. D.9: Total potential energy contours for the two-spring system.]
The analytic derivatives can also be computed for use in a gradient-
based optimization. The derivative of 𝐸 𝑝 with respect to 𝑥 1 is

\frac{\partial E_p}{\partial x_1} = \frac{1}{2} k_1 \left( 2 \Delta l_1 \frac{\partial (\Delta l_1)}{\partial x_1} \right) + \frac{1}{2} k_2 \left( 2 \Delta l_2 \frac{\partial (\Delta l_2)}{\partial x_1} \right) ,    (D.30)

where the partial derivatives of Δ𝑙1 and Δ𝑙2 with respect to 𝑥1 are

\frac{\partial (\Delta l_1)}{\partial x_1} = \frac{l_1 + x_1}{\sqrt{(l_1 + x_1)^2 + x_2^2}}
\frac{\partial (\Delta l_2)}{\partial x_1} = -\frac{l_2 - x_1}{\sqrt{(l_2 - x_1)^2 + x_2^2}} .    (D.31)
By letting \mathcal{L}_1 = \sqrt{(l_1 + x_1)^2 + x_2^2} and \mathcal{L}_2 = \sqrt{(l_2 - x_1)^2 + x_2^2}, the partial
derivative of 𝐸 𝑝 with respect to 𝑥1 can be written as

\frac{\partial E_p}{\partial x_1} = \frac{k_1 (\mathcal{L}_1 - l_1)(l_1 + x_1)}{\mathcal{L}_1} - \frac{k_2 (\mathcal{L}_2 - l_2)(l_2 - x_1)}{\mathcal{L}_2} .    (D.32)

Similarly, the partial derivative of 𝐸 𝑝 with respect to 𝑥2 can be written


as
\frac{\partial E_p}{\partial x_2} = \frac{k_1 x_2 (\mathcal{L}_1 - l_1)}{\mathcal{L}_1} + \frac{k_2 x_2 (\mathcal{L}_2 - l_2)}{\mathcal{L}_2} - m g .    (D.33)
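A minimal Python sketch of the potential energy and its gradient (Eqs. D.28, D.32, and D.33), using the parameter values given above, is:

    import numpy as np

    l1, l2 = 12.0, 8.0        # undeformed spring lengths [cm]
    k1, k2 = 1.0, 10.0        # spring stiffnesses [N/cm]
    mg = 7.0                  # weight [N]

    def energy(x):
        x1, x2 = x
        L1 = np.sqrt((l1 + x1) ** 2 + x2 ** 2)
        L2 = np.sqrt((l2 - x1) ** 2 + x2 ** 2)
        return 0.5 * k1 * (L1 - l1) ** 2 + 0.5 * k2 * (L2 - l2) ** 2 - mg * x2

    def energy_gradient(x):
        x1, x2 = x
        L1 = np.sqrt((l1 + x1) ** 2 + x2 ** 2)
        L2 = np.sqrt((l2 - x1) ** 2 + x2 ** 2)
        dE_dx1 = k1 * (L1 - l1) * (l1 + x1) / L1 - k2 * (L2 - l2) * (l2 - x1) / L2
        dE_dx2 = k1 * (L1 - l1) * x2 / L1 + k2 * (L2 - l2) * x2 / L2 - mg
        return np.array([dE_dx1, dE_dx2])

    print(energy_gradient([2.7852, 6.8996]))   # approximately zero at the minimum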

D.2 Constrained Problems

D.2.1 Barnes Problem


The Barnes problem was devised in a master’s thesis224 and has been
used in various optimization demonstration studies. It is a good starter
problem because it only has two dimensions for easy visualization
while also including constraints.
224. Barnes, A comparative study of nonlinear optimization codes, 1967.

The objective function contains the following coefficients:

𝑎 1 = 75.196 𝑎 2 = −3.8112
𝑎 3 = 0.12694 𝑎 4 = −2.0567 × 10−3
𝑎 5 = 1.0345 × 10−5 𝑎 6 = −6.8306
𝑎 7 = 0.030234 𝑎 8 = −1.28134 × 10−3
𝑎 9 = 3.5256 × 10−5 𝑎 10 = −2.266 × 10−7
𝑎11 = 0.25645 𝑎 12 = −3.4604 × 10−3
𝑎13 = 1.3514 × 10−5 𝑎 14 = −28.106
𝑎 15 = −5.2375 × 10−6 𝑎 16 = −6.3 × 10−8
𝑎 17 = 7.0 × 10−10 𝑎 18 = 3.4054 × 10−4
𝑎 19 = −1.6638 × 10−6 𝑎20 = −2.8673
𝑎 21 = 0.0005

For convenience, we define the following quantities:

𝑦1 = 𝑥 1 𝑥 2 , 𝑦2 = 𝑦1 𝑥 1 , 𝑦3 = 𝑥22 , 𝑦4 = 𝑥12 (D.34)

The objective function is then:

f(x_1, x_2) = a_1 + a_2 x_1 + a_3 y_4 + a_4 y_4 x_1 + a_5 y_4^2 + a_6 x_2 + a_7 y_1
            + a_8 x_1 y_1 + a_9 y_1 y_4 + a_{10} y_2 y_4 + a_{11} y_3 + a_{12} x_2 y_3 + a_{13} y_3^2    (D.35)
            + \frac{a_{14}}{x_2 + 1} + a_{15} y_3 y_4 + a_{16} y_1 y_4 x_2 + a_{17} y_1 y_3 y_4 + a_{18} x_1 y_3
            + a_{19} y_1 y_3 + a_{20} \exp(a_{21} y_1) .

There are three constraints of the form 𝑔(𝑥) ≤ 0:


g_1 = 1 - \frac{y_1}{700}
g_2 = \frac{y_4}{25^2} - \frac{x_2}{5}    (D.36)
g_3 = \frac{x_1}{500} - 0.11 - \left( \frac{x_2}{50} - 1 \right)^2 .
The problem also has bound constraints. The original formulation
is bounded from [0, 80] in both dimensions, in which case the global
optimum occurs in the corner at 𝑥 ∗ = [80, 80], with a local minimum in
the middle. However, for our usage, we preferred the global optimum
not to be in the corner and so set the bounds to [0, 65] in both dimensions.
The contour of this function is plotted in Fig. D.10.
[Fig. D.10: Barnes function.]
Global minimum: 𝑓 (𝑥 ∗ ) = −31.6368 at 𝑥 ∗ = (49.5263, 19.6228).
Local minimum: 𝑓 (𝑥) = −17.7754 at 𝑥 = (65, 65).



D.2.2 Ten-Bar Truss


The 10-bar truss is a classic optimization problem.225 In this problem,
we want to find the optimal cross-sectional areas for the 10-bar truss
shown in Fig. D.11. A simple truss finite-element code set up for this
particular configuration is available in the book code repository. The
function takes in an array of cross-sectional areas and returns the total
mass and an array of stresses for each truss member.
225. Venkayya, Design of optimum structures, 1971.

[Fig. D.11: Ten-bar truss and element numbers.]

The objective of the optimization is to minimize the mass of the


structure, subject to the constraints that every segment does not yield in
compression or tension. The yield stress of all elements is 25 × 103 psi,
except for member 9, which uses a stronger alloy with a yield stress of
75 × 103 psi. Mathematically, the constraint is

|𝜎𝑖 | ≤ 𝜎 𝑦 𝑖 for 𝑖 = 1, . . . , 10, (D.37)

where the absolute value is needed to handle tension and compression


(with the same yield strength for tension and compression). Abso-
lute values are not differentiable at zero and should be avoided in
gradient-based optimization if possible. Thus, we should put this in
a mathematically equivalent form that avoids absolute value. Each
element should have a cross-sectional area of at least 0.1 in2 for manu-
facturing reasons (bound constraint). When solving this optimization
problem, you may need to scale the objective and constraints.
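For example, one equivalent formulation that avoids the absolute value is to impose two inequalities per member, 𝜎_i − 𝜎_{y_i} ≤ 0 and −𝜎_i − 𝜎_{y_i} ≤ 0. A minimal sketch (independent of the truss code in the book repository) is:

    import numpy as np

    sigma_yield = np.full(10, 25e3)   # psi
    sigma_yield[8] = 75e3             # member 9 uses a stronger alloy

    def stress_constraints(sigma):
        # Return g(x) <= 0 constraints equivalent to |sigma_i| <= sigma_yield_i
        sigma = np.asarray(sigma, dtype=float)
        return np.concatenate((sigma - sigma_yield,     # tension side
                               -sigma - sigma_yield))   # compression side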
Although not needed to solve the problem, an overview of the
equations is provided. A truss element is the simplest type of structural
finite element and only has an axial degree of freedom. The theory and
derivation for truss elements are simple, but for our purposes, we skip
to the result. Given a two-dimensional element oriented arbitrarily in
space (Fig. D.12), we can relate the displacements at the nodes to the
forces at the nodes through a stiffness relationship.
[Fig. D.12: A truss element oriented at some angle 𝜙, where 𝜙 is measured from a horizontal line emanating from the first node, oriented in the positive 𝑥 direction.]
In matrix form, the equation for a given element is 𝐾 𝑒 𝑑 = 𝑞. In

detail, the equation is

\frac{E A}{L}
\begin{bmatrix}
 c^2 & c s & -c^2 & -c s \\
 c s & s^2 & -c s & -s^2 \\
 -c^2 & -c s & c^2 & c s \\
 -c s & -s^2 & c s & s^2
\end{bmatrix}
\begin{bmatrix} u_1 \\ v_1 \\ u_2 \\ v_2 \end{bmatrix}
=
\begin{bmatrix} X_1 \\ Y_1 \\ X_2 \\ Y_2 \end{bmatrix} ,    (D.38)
where the displacement vector is 𝑑 = [𝑢1 , 𝑣1 , 𝑢2 , 𝑣2 ]. The meanings of
the variables in the equation are described in Table D.1.

Table D.1 The variables used in the stiffness equation.


Variable Description
𝑋𝑖 Force in the 𝑥-direction at node 𝑖
𝑌𝑖 Force in the 𝑦-direction at node 𝑖
𝐸 Modulus of elasticity of truss element material
𝐴 Area of truss element cross section
𝐿 Length of truss element
𝑐 cos 𝜙
𝑠 sin 𝜙
𝑢𝑖 Displacement in the 𝑥-direction at node 𝑖
𝑣𝑖 Displacement in the 𝑦-direction at node 𝑖

The stress in the truss element can be computed from the equation
𝜎 = 𝑆 𝑒 𝑑, where 𝜎 is a scalar, 𝑑 is the same vector as before, and the
element 𝑆 𝑒 matrix (really a row vector because stress is one-dimensional
for truss elements) is

𝐸  
𝑆𝑒 = −𝑐 −𝑠 𝑐 𝑠 . (D.39)
𝐿
The global structure (an assembly of multiple finite elements) has the
same equations, 𝐾𝑑 = 𝑞 and 𝜎 = 𝑆𝑑, but now 𝑑 contains displacements
for all of the nodes in the structure, 𝑑 = [𝑥1 , 𝑥2 , . . . , 𝑥 𝑛 ]. If we have 𝑛
nodes and 𝑚 elements, then 𝑞 and 𝑑 are 2𝑛-vectors, 𝐾 is a (2𝑛 × 2𝑛)
matrix, 𝑆 is an (𝑚 × 2𝑛) matrix, and 𝜎 is an 𝑚-vector. The elemental
stiffness and stress matrices are first computed and then assembled into
the global matrices. This is straightforward because the displacements
and forces of the individual elements add linearly.
After we assemble the global matrices, we must remove any degrees
of freedom that are structurally rigid (already known to have zero
displacement). Otherwise, the problem is ill-defined, and the stiffness
matrix will be ill-conditioned.

Given the geometry, materials, and external loading, we can pop-


ulate the stiffness matrix and force vector. We can then solve for the
unknown displacements from

𝐾𝑑 = 𝑞 . (D.40)

With the solved displacements, we can compute the stress in each


element using
𝜎 = 𝑆𝑑 . (D.41)
Bibliography

1 Wu, N., Kenway, G., Mader, C. A., Jasa, J., and Martins, J. R. R. A., cited on pp. 15, 199
“PyOptSparse: A Python framework for large-scale constrained
nonlinear optimization of sparse systems,” Journal of Open Source
Software, Vol. 5, No. 54, October 2020, p. 2564.
doi: 10.21105/joss.02564
2 Lyu, Z., Kenway, G. K. W., and Martins, J. R. R. A., “Aerodynamic cited on p. 20
Shape Optimization Investigations of the Common Research Model
Wing Benchmark,” AIAA Journal, Vol. 53, No. 4, April 2015, pp. 968–
985.
doi: 10.2514/1.J053318
3 He, X., Li, J., Mader, C. A., Yildirim, A., and Martins, J. R. R. A., cited on p. 20
“Robust aerodynamic shape optimization—From a circle to an
airfoil,” Aerospace Science and Technology, Vol. 87, April 2019, pp. 48–
61.
doi: 10.1016/j.ast.2019.01.051
4 Betts, J. T., “Survey of numerical methods for trajectory optimiza- cited on p. 26
tion,” Journal of Guidance, Control, and Dynamics, Vol. 21, No. 2, 1998,
pp. 193–207.
doi: 10.2514/2.4231
5 Bryson, A. E. and Ho, Y. C., Applied Optimal Control; Optimization, cited on p. 26
Estimation, and Control. Waltham, MA: Blaisdell Publishing, 1969.
6 Bertsekas, D. P., Dynamic Programming and Optimal Control. Belmont, cited on p. 26
MA: Athena Scientific, 1995.
7 Kepler, J., Nova stereometria doliorum vinariorum (New Solid Geometry cited on p. 34
of Wine Barrels). Linz, Austria: Johannes Planck, 1615.
8 Ferguson, T. S., “Who solved the secretary problem?” Statistical cited on p. 34
Science, Vol. 4, No. 3, August 1989, pp. 282–289.
doi: 10.1214/ss/1177012493
9 Fermat, P. de, Methodus ad disquirendam maximam et minimam cited on p. 35
(Method for the Study of Maxima and Minima). 1636, translated by
Jason Ross.


10 Kollerstrom, N., “Thomas Simpson and ‘Newton’s method of cited on p. 35


approximation’: An enduring myth,” The British Journal for the
History of Science, Vol. 25, No. 3, 1992, pp. 347–354.
11 Lagrange, J.-L., Mécanique analytique. Paris, France: Jacques Gabay, cited on p. 36
1788, Vol. 1.
12 Cauchy, A.-L., “Méthode générale pour la résolution des systèmes cited on p. 36
d’équations simultanées,” Comptes rendus hebdomadaires des séances
de l’Académie des sciences, Vol. 25, October 1847, pp. 536–538.
13 Hancock, H., Theory of Minima and Maxima. Boston, MA: Ginn and cited on p. 36
Company, 1917.
14 Menger, K., “Das botenproblem,” Ergebnisse eines Mathematischen cited on p. 36
Kolloquiums. Leipzig, Germany: Teubner, 1932, pp. 11–12.
15 Karush, W., “Minima of functions of several variables with inequal- cited on p. 37
ities as side constraints,” Master’s thesis, University of Chicago,
Chicago, IL, 1939.
16 Dantzig, G., Linear programming and extensions. Princeton, NJ: Prince- cited on p. 37
ton University Press, 1998.
isbn: 0691059136
17 Krige, D. G., “A statistical approach to some mine valuation and cited on p. 37
allied problems on the Witwatersrand,” Master’s thesis, University
of the Witwatersrand, Johannesburg, South Africa, 1951.
18 Markowitz, H., “Portfolio selection,” Journal of Finance, Vol. 7, March cited on p. 38
1952, pp. 77–91.
doi: 10.2307/2975974
19 Bellman, R., Dynamic Programming. Princeton, NJ: Princeton Uni- cited on p. 38
versity Press, 1957.
isbn: 9780691146683
20 Davidon, W. C., “Variable metric method for minimization,” SIAM cited on pp. 38, 125
Journal on Optimization, Vol. 1, No. 1, February 1991, pp. 1–17, issn:
1095-7189.
doi: 10.1137/0801001
21 Fletcher, R. and Powell, M. J. D., “A rapidly convergent descent cited on pp. 38, 125
method for minimization,” The Computer Journal, Vol. 6, No. 2,
August 1963, pp. 163–168, issn: 1460-2067.
doi: 10.1093/comjnl/6.2.163
22 Wolfe, P., “Convergence conditions for ascent methods,” SIAM cited on p. 38
Review, Vol. 11, No. 2, 1969, pp. 226–235.
doi: 10.1137/1011036

23 Wilson, R. B., “A simplicial algorithm for concave programming,” cited on p. 38


PhD dissertation, Harvard University, Cambridge, MA, June 1963.
24 Han, S.-P., “Superlinearly convergent variable metric algorithms cited on p. 38
for general nonlinear programming problems,” Mathematical Pro-
gramming, Vol. 11, No. 1, 1976, pp. 263–282.
doi: 10.1007/BF01580395
25 Powell, M. J. D., “Algorithms for nonlinear constraints that use cited on pp. 38, 199
Lagrangian functions,” Mathematical Programming, Vol. 14, No. 1,
December 1978, pp. 224–248.
doi: 10.1007/bf01588967
26 Holland, J. H., Adaptation in Natural and Artificial Systems. Ann cited on p. 39
Arbor, MI: University of Michigan Press, 1975.
27 Hooke, R. and Jeeves, T. A., “‘Direct search’ solution of numerical cited on p. 39
and statistical problems,” Journal of the ACM, Vol. 8, No. 2, 1961,
pp. 212–229.
doi: 10.1145/321062.321069
28 Nelder, J. A. and Mead, R., “A simplex method for function mini- cited on pp. 39, 285
mization,” Computer Journal, Vol. 7, 1965, pp. 308–313.
doi: 10.1093/comjnl/7.4.308
29 Karmarkar, N., “A new polynomial-time algorithm for linear pro- cited on p. 39
gramming,” Proceedings of the Sixteenth Annual ACM Symposium on
Theory of Computing. New York, NY: Association for Computing
Machinery, 1984, pp. 302–311.
doi: 10.1145/800057.808695
30 Pontryagin, L. S., Boltyanskii, V. G., Gamkrelidze, R. V., and cited on p. 39
Mishchenko, E. F., The Mathematical Theory of Optimal Processes.
New York, NY: Interscience Publishers, 1961, translated by K. N.
Triruguf, edited by T. W. Neustadt.
31 Bryson Jr, A. E., “Optimal control—1950 to 1985,” IEEE Control cited on p. 39
Systems Magazine, Vol. 16, No. 3, June 1996, pp. 26–33.
doi: 10.1109/37.506395
32 Schmit, L. A., “Structural design by systematic synthesis,” Proceed- cited on pp. 40, 218
ings of the 2nd National Conference on Electronic Computation. New
York, NY: American Society of Civil Engineers, 1960, pp. 105–122.
33 Schmit, L. A. and Thornton, W. A., “Synthesis of an airfoil at cited on p. 40
supersonic Mach number,” CR 144, National Aeronautics and
Space Administration, January 1965.

34 Fox, R. L., “Constraint surface normals for structural synthesis cited on p. 40


techniques,” AIAA Journal, Vol. 3, No. 8, August 1965, pp. 1517–
1518.
doi: 10.2514/3.3182
35 Arora, J. and Haug, E. J., “Methods of design sensitivity analysis cited on p. 40
in structural optimization,” AIAA Journal, Vol. 17, No. 9, 1979,
pp. 970–974.
doi: 10.2514/3.61260
36 Haftka, R. T. and Grandhi, R. V., “Structural shape optimization—A cited on p. 40
survey,” Computer Methods in Applied Mechanics and Engineering,
Vol. 57, No. 1, 1986, pp. 91–106, issn: 0045-7825.
doi: 10.1016/0045-7825(86)90072-1
37 Eschenauer, H. A. and Olhoff, N., “Topology optimization of cited on p. 40
continuum structures: A review,” Applied Mechanics Reviews, Vol.
54, No. 4, July 2001, pp. 331–390.
doi: 10.1115/1.1388075
38 Pironneau, O., “On optimum design in fluid mechanics,” Journal of cited on p. 40
Fluid Mechanics, Vol. 64, No. 01, 1974, p. 97, issn: 0022-1120.
doi: 10.1017/S0022112074002023
39 Jameson, A., “Aerodynamic design via control theory,” Journal of cited on p. 40
Scientific Computing, Vol. 3, No. 3, September 1988, pp. 233–260.
doi: 10.1007/BF01061285
40 Sobieszczanski–Sobieski, J. and Haftka, R. T., “Multidisciplinary cited on p. 40
aerospace design optimization: Survey of recent developments,”
Structural Optimization, Vol. 14, No. 1, 1997, pp. 1–23.
doi: 10.1007/BF011
41 Martins, J. R. R. A. and Lambe, A. B., “Multidisciplinary design cited on pp. 40, 516, 528, 530
optimization: A survey of architectures,” AIAA Journal, Vol. 51, No.
9, September 2013, pp. 2049–2075.
doi: 10.2514/1.J051895
42 Sobieszczanski–Sobieski, J., “Sensitivity of complex, internally cited on p. 40
coupled systems,” AIAA Journal, Vol. 28, No. 1, 1990, pp. 153–160.
doi: 10.2514/3.10366
43 Martins, J. R. R. A., Alonso, J. J., and Reuther, J. J., “A coupled- cited on p. 41
adjoint sensitivity analysis method for high-fidelity aero-structural
design,” Optimization and Engineering, Vol. 6, No. 1, March 2005,
pp. 33–62.
doi: 10.1023/B:OPTE.0000048536.47956.62

44 Hwang, J. T. and Martins, J. R. R. A., “A computational architecture cited on pp. 41, 494
for coupling heterogeneous numerical models and computing
coupled derivatives,” ACM Transactions on Mathematical Software,
Vol. 44, No. 4, June 2018, Article 37.
doi: 10.1145/3182393
45 Wright, M. H., “The interior-point revolution in optimization: cited on p. 41
History, recent developments, and lasting consequences,” Bulletin
of the American Mathematical Society, Vol. 42, 2005, pp. 39–56.
doi: 10.1007/978-1-4613-3279-4_23
46 Grant, M., Boyd, S., and Ye, Y., “Disciplined convex programming,” cited on pp. 41, 428
Global Optimization—From Theory to Implementation, Liberti, L. and
Maculan, N., Eds. Boston, MA: Springer, 2006, pp. 155–210.
doi: 10.1007/0-387-30528-9_7
47 Wengert, R. E., “A simple automatic derivative evaluation program,” cited on p. 41
Communications of the ACM, Vol. 7, No. 8, August 1964, pp. 463–464,
issn: 0001-0782.
doi: 10.1145/355586.364791
48 Speelpenning, B., “Compiling fast partial derivatives of functions cited on p. 41
given by algorithms,” PhD dissertation, University of Illinois at
Urbana–Champaign, Champaign, IL, January 1980.
doi: 10.2172/5254402
49 Squire, W. and Trapp, G., “Using complex variables to estimate cited on pp. 42, 231
derivatives of real functions,” SIAM Review, Vol. 40, No. 1, 1998,
pp. 110–112, issn: 0036-1445 (print), 1095-7200 (electronic).
doi: 10.1137/S003614459631241X
50 Martins, J. R. R. A., Sturdza, P., and Alonso, J. J., “The complex- cited on pp. 42, 232, 234, 236
step derivative approximation,” ACM Transactions on Mathematical
Software, Vol. 29, No. 3, September 2003, pp. 245–262.
doi: 10.1145/838250.838251
51 Torczon, V., “On the convergence of pattern search algorithms,” cited on p. 42
SIAM Journal on Optimization, Vol. 7, No. 1, February 1997, pp. 1–25.
doi: 10.1137/S1052623493250780
52 Jones, D., Perttunen, C., and Stuckman, B., “Lipschitzian optimiza- cited on pp. 42, 296
tion without the Lipschitz constant,” Journal of Optimization Theory
and Application, Vol. 79, No. 1, October 1993, pp. 157–181.
doi: 10.1007/BF00941892
53 Jones, D. R. and Martins, J. R. R. A., “The DIRECT algorithm—25 cited on pp. 42, 296
years later,” Journal of Global Optimization, Vol. 79, March 2021,
pp. 521–566.
doi: 10.1007/s10898-020-00952-6

54 Kirkpatrick, S., Gelatt, C. D., and Vecchi, M. P., “Optimization by cited on p. 42


simulated annealing,” Science, Vol. 220, No. 4598, 1983, pp. 671–680.
doi: 10.1126/science.220.4598.671
55 Kennedy, J. and Eberhart, R. C., “Particle swarm optimization,” cited on p. 42
Proceedings of the IEEE International Conference on Neural Networks.
Institute of Electrical and Electronics Engineers, 1995, Vol. IV,
pp. 1942–1948.
doi: 10.1007/978-0-387-30164-8_630
56 Forrester, A. I. and Keane, A. J., “Recent advances in surrogate- cited on p. 42
based optimization,” Progress in Aerospace Sciences, Vol. 45, No. 1,
2009, pp. 50–79, issn: 0376-0421.
doi: 10.1016/j.paerosci.2008.11.001
57 Bottou, L., Curtis, F. E., and Nocedal, J., “Optimization methods for cited on p. 43
large-scale machine learning,” SIAM Review, Vol. 60, No. 2, 2018,
pp. 223–311.
doi: 10.1137/16M1080173
58 Baydin, A. G., Pearlmutter, B. A., Radul, A. A., and Siskind, J. M., cited on pp. 43, 410
“Automatic differentiation in machine learning: A survey,” Journal
of Machine Learning Research, Vol. 18, No. 1, January 2018, pp. 5595–
5637.
doi: 10.5555/3122009.3242010
59 Gerdes, P., “On mathematics in the history of sub-Saharan Africa,” cited on p. 43
Historia Mathematica, Vol. 21, No. 3, 1994, pp. 345–376, issn: 0315-
0860.
doi: 10.1006/hmat.1994.1029
60 Closs, M. P., Native American Mathematics. Austin, TX: University of cited on p. 43
Texas Press, 1986.
61 Shen, K., Crossley, J. N., Lun, A. W.-C., and Liu, H., The Nine cited on p. 43
Chapters on the Mathematical Art: Companion and Commentary. Oxford
University Press on Demand, 1999.
62 Hodgkin, L., A History of Mathematics: From Mesopotamia to Moder- cited on p. 43
nity. Oxford University Press on Demand, 2005.
63 Joseph, G. G., The Crest of the Peacock: Non-European Roots of Mathe- cited on p. 43
matics. Princeton, NJ: Princeton University Press, 2010.
64 Hollings, C., Martin, U., and Rice, A., Ada Lovelace: The Making of a cited on p. 44
Computer Scientist. Oxford, UK: Bodleian Library, 2014.
65 Osen, L. M., Women in Mathematics. Cambridge, MA: MIT Press, cited on p. 44
1974.

66 Hodges, A., Alan Turing: The Enigma. Princeton, NJ: Princeton cited on p. 44
University Press, 2014.
isbn: 9780691164724
67 Lipsitz, G., How Racism Takes Place. Philadelphia, PA: Temple cited on p. 44
University Press, 2011.
68 Rothstein, R., The Color of Law: A Forgotten History of How Our cited on p. 44
Government Segregated America. New York, NY: Liveright Publishing,
2017.
69 King, L. J., “More than slaves: Black founders, Benjamin Banneker, cited on p. 44
and critical intellectual agency,” Social Studies Research & Practice
(Board of Trustees of the University of Alabama), Vol. 9, No. 3, 2014.
70 Shetterly, M. L., Hidden Figures: The American Dream and the Untold cited on p. 45
Story of the Black Women Who Helped Win the Space Race. New York,
NY: William Morrow and Company, 2016.
71 Box, G. E. P., “Science and statistics,” Journal of the American Statistical cited on p. 48
Association, Vol. 71, No. 356, 1976, pp. 791–799, issn: 01621459.
doi: 10.2307/2286841
72 Wilson, G., Aruliah, D. A., Brown, C. T., Hong, N. P. C., Davis, M., cited on p. 58
Guy, R. T., Haddock, S. H. D., Huff, K. D., Mitchell, I. M., Plumbley,
M. D., Waugh, B., White, E. P., and Wilson, P., “Best practices for
scientific computing,” PLoS Biology, Vol. 12, No. 1, 2014, e1001745.
doi: 10.1371/journal.pbio.1001745
73 Grotker, T., Holtmann, U., Keding, H., and Wloka, M., The De- cited on p. 59
veloper’s Guide to Debugging, 2nd ed. New York, NY: Springer,
2012.
74 Ascher, U. M. and Greif, C., A First Course in Numerical Methods. cited on p. 60
Philadelphia, PA: SIAM, 2011.
75 Saad, Y., Iterative Methods for Sparse Linear Systems, 2nd ed. Philadel- cited on pp. 61, 564
phia, PA: SIAM, 2003.
76 Higgins, T. J., “A note on the history of mixed partial derivatives,” cited on p. 83
Scripta Mathematica, Vol. 7, 1940, pp. 59–62.
77 Hager, W. W. and Zhang, H., “A new conjugate gradient method cited on p. 97
with guaranteed descent and an efficient line search,” SIAM Journal
on Optimization, Vol. 16, No. 1, January 2005, pp. 170–192, issn:
1095-7189.
doi: 10.1137/030601880

78 Moré, J. J. and Thuente, D. J., “Line search algorithms with guaran- cited on p. 101
teed sufficient decrease,” ACM Transactions on Mathematical Software
(TOMS), Vol. 20, No. 3, 1994, pp. 286–307.
doi: 10.1145/192115.192132
79 Nocedal, J. and Wright, S. J., Numerical Optimization, 2nd ed. Berlin: cited on pp. 101, 116, 140, 141, 189,
Springer, 2006. 208, 562, 563
doi: 10.1007/978-0-387-40065-5
80 Broyden, C. G., “The convergence of a class of double-rank min- cited on p. 125
imization algorithms 1. General considerations,” IMA Journal of
Applied Mathematics, Vol. 6, No. 1, 1970, pp. 76–90, issn: 1464-3634.
doi: 10.1093/imamat/6.1.76
81 Fletcher, R., “A new approach to variable metric algorithms,” The cited on p. 125
Computer Journal, Vol. 13, No. 3, March 1970, pp. 317–322, issn:
1460-2067.
doi: 10.1093/comjnl/13.3.317
82 Goldfarb, D., “A family of variable-metric methods derived by cited on p. 125
variational means,” Mathematics of Computation, Vol. 24, No. 109,
January 1970, pp. 23–23, issn: 0025-5718.
doi: 10.1090/s0025-5718-1970-0258249-6
83 Shanno, D. F., “Conditioning of quasi-Newton methods for func- cited on p. 125
tion minimization,” Mathematics of Computation, Vol. 24, No. 111,
September 1970, pp. 647–647, issn: 0025-5718.
doi: 10.1090/s0025-5718-1970-0274029-x
84 Conn, A. R., Gould, N. I. M., and Toint, P. L., Trust Region Methods. cited on pp. 139, 140, 141, 142
Philadelphia, PA: SIAM, 2000.
isbn: 0898714605
85 Steihaug, T., “The conjugate gradient method and trust regions in cited on p. 140
large scale optimization,” SIAM Journal on Numerical Analysis, Vol.
20, No. 3, June 1983, pp. 626–637, issn: 1095-7170.
doi: 10.1137/0720042
86 Boyd, S. P. and Vandenberghe, L., Convex Optimization. Cambridge, cited on pp. 155, 423
UK: Cambridge University Press, March 2004.
isbn: 0521833787
87 Strang, G., Linear Algebra and its Applications, 4th ed. Boston, MA: cited on pp. 155, 542
Cengage Learning, 2006.
isbn: 0030105676
88 Dax, A., “Classroom note: An elementary proof of Farkas’ lemma,” cited on p. 165
SIAM Review, Vol. 39, No. 3, 1997, pp. 503–507.
doi: 10.1137/S0036144594295502

89 Gill, P. E., Murray, W., Saunders, M. A., and Wright, M. H., “Some cited on p. 182
theoretical properties of an augmented Lagrangian merit function,”
SOL 86-6R, Systems Optimization Laboratory, September 1986.
90 Di Pillo, G. and Grippo, L., “A new augmented Lagrangian function cited on p. 182
for inequality constraints in nonlinear programming problems,”
Journal of Optimization Theory and Applications, Vol. 36, No. 4, 1982,
pp. 495–519
doi: 10.1007/BF00940544
91 Birgin, E. G., Castillo, R. A., and MartÍnez, J. M., “Numerical cited on p. 182
comparison of augmented Lagrangian algorithms for nonconvex
problems,” Computational Optimization and Applications, Vol. 31, No.
1, 2005, pp. 31–55
doi: 10.1007/s10589-005-1066-7
92 Rockafellar, R. T., “The multiplier method of Hestenes and Powell cited on p. 182
applied to convex programming,” Journal of Optimization Theory
and Applications, Vol. 12, No. 6, 1973, pp. 555–562
doi: 10.1007/BF00934777
93 Murray, W., “Analytical expressions for the eigenvalues and eigen- cited on p. 186
vectors of the Hessian matrices of barrier and penalty functions,”
Journal of Optimization Theory and Applications, Vol. 7, No. 3, March
1971, pp. 189–196.
doi: 10.1007/bf00932477
94 Forsgren, A., Gill, P. E., and Wright, M. H., “Interior methods for cited on p. 186
nonlinear optimization,” SIAM Review, Vol. 44, No. 4, January 2002,
pp. 525–597.
doi: 10.1137/s0036144502414942
95 Gill, P. E. and Wong, E., “Sequential quadratic programming cited on p. 189
methods,” Mixed Integer Nonlinear Programming, Lee, J. and Leyffer,
S., Eds., ser. The IMA Volumes in Mathematics and Its Applications.
New York, NY: Springer, 2012, Vol. 154.
doi: 10.1007/978-1-4614-1927-3_6
96 Gill, P. E., Murray, W., and Saunders, M. A., “SNOPT: An SQP cited on pp. 189, 196, 199
algorithm for large-scale constrained optimization,” SIAM Review,
Vol. 47, No. 1, 2005, pp. 99–131.
doi: 10.1137/S0036144504446096
97 Fletcher, R. and Leyffer, S., “Nonlinear programming without a cited on p. 197
penalty function,” Mathematical Programming, Vol. 91, No. 2, January
2002, pp. 239–269.
doi: 10.1007/s101070100244

98 Benson, H. Y., Vanderbei, R. J., and Shanno, D. F., “Interior-point cited on p. 197
methods for nonconvex nonlinear programming: Filter methods
and merit functions,” Computational Optimization and Applications,
Vol. 23, No. 2, 2002, pp. 257–272.
doi: 10.1023/a:1020533003783
99 Fletcher, R., Leyffer, S., and Toint, P., “A brief history of filter cited on p. 197
methods,” ANL/MCS-P1372-0906, Argonne National Laboratory,
September 2006.
100 Fletcher, R., Practical Methods of Optimization, 2nd ed. Hoboken, NJ: cited on p. 199
Wiley, 1987.
101 Liu, D. C. and Nocedal, J., “On the limited memory BFGS method cited on p. 199
for large scale optimization,” Mathematical Programming, Vol. 45,
No. 1–3, August 1989, pp. 503–528.
doi: 10.1007/bf01589116
102 Byrd, R. H., Nocedal, J., and Waltz, R. A., “Knitro: An integrated cited on pp. 199, 207
package for nonlinear optimization,” Large-Scale Nonlinear Opti-
mization, Di Pillo, G. and Roma, M., Eds. Boston, MA: Springer US,
2006, pp. 35–59.
doi: 10.1007/0-387-30065-1_4
103 Kraft, D., “A software package for sequential quadratic program- cited on p. 199
ming,” DFVLR-FB 88-28, DLR German Aerospace Center–Institute
for Flight Mechanics, Koln, Germany, 1988.
104 Wächter, A. and Biegler, L. T., “On the implementation of an cited on p. 207
interior-point filter line-search algorithm for large-scale nonlinear
programming,” Mathematical Programming, Vol. 106, No. 1, April
2005, pp. 25–57.
doi: 10.1007/s10107-004-0559-y
105 Byrd, R. H., Hribar, M. E., and Nocedal, J., “An interior point cited on p. 207
algorithm for large-scale nonlinear programming,” SIAM Journal
on Optimization, Vol. 9, No. 4, January 1999, pp. 877–900.
doi: 10.1137/s1052623497325107
106 Wächter, A. and Biegler, L. T., “On the implementation of a primal- cited on p. 207
dual interior point filter line search algorithm for large-scale non-
linear programming,” Mathematical Programming, Vol. 106, No. 1,
2006, pp. 25–57.
107 Gill, P. E., Saunders, M. A., and Wong, E., “On the performance cited on p. 208
of SQP methods for nonlinear optimization,” Modeling and Opti-
mization: Theory and Applications, Defourny, B. and Terlaky, T., Eds.
New York, NY: Springer, 2015, Vol. 147, pp. 95–123.
doi: 10.1007/978-3-319-23699-5_5

108 Kreisselmeier, G. and Steinhauser, R., “Systematic control design by cited on p. 211
optimizing a vector performance index,” IFAC Proceedings Volumes,
Vol. 12, No. 7, September 1979, pp. 113–117, issn: 1474-6670.
doi: 10.1016/s1474-6670(17)65584-8
109 Duysinx, P. and Bendsøe, M. P., “Topology optimization of contin- cited on p. 212
uum structures with local stress constraints,” International Journal
for Numerical Methods in Engineering, Vol. 43, 1998, pp. 1453–1478.
doi: 10 . 1002 / (SICI ) 1097 - 0207(19981230 ) 43 : 8 % 3C1453 :: AID -
NME480%3E3.0.CO;2-2
110 Kennedy, G. J. and Hicken, J. E., “Improved constraint-aggregation cited on p. 212
methods,” Computer Methods in Applied Mechanics and Engineering,
Vol. 289, 2015, pp. 332–354, issn: 0045-7825.
doi: 10.1016/j.cma.2015.02.017
111 Hoerner, S. F., Fluid-Dynamic Drag. Bakersfield, CA: Hoerner Fluid cited on pp. 217, 218
Dynamics, 1965.
112 Lyness, J. N. and Moler, C. B., “Numerical differentiation of analytic cited on p. 231
functions,” SIAM Journal on Numerical Analysis, Vol. 4, No. 2, 1967,
pp. 202–210, issn: 0036-1429 (print), 1095-7170 (electronic).
doi: 10.1137/0704019
113 Lantoine, G., Russell, R. P., and Dargent, T., “Using multicomplex cited on p. 232
variables for automatic computation of high-order derivatives,”
ACM Transactions on Mathematical Software, Vol. 38, No. 3, April
2012, pp. 1–21, issn: 0098-3500.
doi: 10.1145/2168773.2168774
114 Fike, J. A. and Alonso, J. J., “Automatic differentiation through the cited on p. 232
use of hyper-dual numbers for second derivatives,” Recent Advances
in Algorithmic Differentiation, Forth, S., Hovland, P., Phipps, E., Utke,
J., and Walther, A., Eds. Berlin: Springer, 2012, pp. 163–173, isbn:
978-3-642-30023-3.
doi: 10.1007/978-3-642-30023-3_15
115 Griewank, A., Evaluating Derivatives. Philadelphia, PA: SIAM, 2000. cited on pp. 236, 246, 248
doi: 10.1137/1.9780898717761
116 Naumann, U., The Art of Differentiating Computer Programs—An cited on p. 236
Introduction to Algorithmic Differentiation. Philadelphia, PA: SIAM,
2011.
117 Utke, J., Naumann, U., Fagan, M., Tallent, N., Strout, M., Heimbach, cited on p. 249
P., Hill, C., and Wunsch, C., “OpenAD/F: A modular open-source
tool for automatic differentiation of Fortran codes,” ACM Trans-
actions on Mathematical Software, Vol. 34, No. 4, July 2008, issn:
0098-3500.
doi: 10.1145/1377596.1377598
118 Hascoet, L. and Pascual, V., “The Tapenade automatic differentia- cited on p. 249
tion tool: Principles, model, and specification,” ACM Transactions
on Mathematical Software, Vol. 39, No. 3, May 2013, 20:1–20:43, issn:
0098-3500.
doi: 10.1145/2450153.2450158
119 Griewank, A., Juedes, D., and Utke, J., “Algorithm 755: ADOL-C: cited on p. 249
A package for the automatic differentiation of algorithms written
in C/C++,” ACM Transactions on Mathematical Software, Vol. 22, No.
2, June 1996, pp. 131–167, issn: 0098-3500.
doi: 10.1145/229473.229474
120 Wiltschko, A. B., Merriënboer, B. van, and Moldovan, D., “Tangent: cited on p. 249
Automatic differentiation using source code transformation in
Python,” arXiv:1711.02712, 2017.
Url: https://arxiv.org/abs/1711.02712.
121 Bradbury, J., Frostig, R., Hawkins, P., Johnson, M. J., Leary, C., cited on p. 249
Maclaurin, D., Necula, G., Paszke, A., VanderPlas, J., Wanderman-
Milne, S., and Zhang, Q., “JAX: Composable Transformations of
Python+NumPy Programs,” 2018.
Url: http://github.com/google/jax.
122 Revels, J., Lubin, M., and Papamarkou, T., “Forward-mode auto- cited on p. 249
matic differentiation in Julia,” arXiv:1607.07892, July 2016.
Url: https://arxiv.org/abs/1607.07892.
123 Neidinger, R. D., “Introduction to automatic differentiation and cited on p. 249
MATLAB object-oriented programming,” SIAM Review, Vol. 52,
No. 3, January 2010, pp. 545–563.
doi: 10.1137/080743627
124 Betancourt, M., “A geometric theory of higher-order automatic cited on p. 249
differentiation,” arXiv:1812.11592 [stat.CO], December 2018.
Url: https://arxiv.org/abs/1812.11592.
125 Giles, M., “An extended collection of matrix derivative results for cited on pp. 250, 251
forward and reverse mode algorithmic differentiation,” Oxford,
UK, January 2008.
Url: https://people.maths.ox.ac.uk/gilesm/files/NA-08-01.pdf.
126 Peter, J. E. V. and Dwight, R. P., “Numerical sensitivity analysis for cited on p. 251
aerodynamic optimization: A survey of approaches,” Computers
and Fluids, Vol. 39, No. 3, March 2010, pp. 373–391.
doi: 10.1016/j.compfluid.2009.09.013
127 Martins, J. R. R. A., “Perspectives on aerodynamic design optimiza- cited on pp. 256, 441
tion,” Proceedings of the AIAA SciTech Forum. American Institute of
Aeronautics and Astronautics, January 2020.
doi: 10.2514/6.2020-0043
128 Lambe, A. B., Martins, J. R. R. A., and Kennedy, G. J., “An evaluation cited on p. 259
of constraint aggregation strategies for wing box mass minimiza-
tion,” Structural and Multidisciplinary Optimization, Vol. 55, No. 1,
January 2017, pp. 257–277.
doi: 10.1007/s00158-016-1495-1
129 Kenway, G. K. W., Mader, C. A., He, P., and Martins, J. R. R. A., cited on p. 260
“Effective Adjoint Approaches for Computational Fluid Dynamics,”
Progress in Aerospace Sciences, Vol. 110, October 2019, p. 100 542.
doi: 10.1016/j.paerosci.2019.05.002
130 Curtis, A. R., Powell, M. J. D., and Reid, J. K., “On the estimation cited on p. 262
of sparse Jacobian matrices,” IMA Journal of Applied Mathematics,
Vol. 13, No. 1, February 1974, pp. 117–119, issn: 1464-3634.
doi: 10.1093/imamat/13.1.117
131 Gebremedhin, A. H., Manne, F., and Pothen, A., “What color is cited on p. 263
your Jacobian? Graph coloring for computing derivatives,” SIAM
Review, Vol. 47, No. 4, January 2005, pp. 629–705, issn: 1095-7200.
doi: 10.1137/s0036144504444711
132 Gray, J. S., Hwang, J. T., Martins, J. R. R. A., Moore, K. T., and cited on pp. 263, 490, 497, 504, 529
Naylor, B. A., “OpenMDAO: An open-source framework for multi-
disciplinary design, analysis, and optimization,” Structural and
Multidisciplinary Optimization, Vol. 59, No. 4, April 2019, pp. 1075–
1104.
doi: 10.1007/s00158-019-02211-z
133 Ning, A., “Using blade element momentum methods with gradient- cited on p. 264
based design optimization,” Structural and Multidisciplinary Opti-
mization, May 2021.
doi: 10.1007/s00158-021-02883-6
134 Martins, J. R. R. A. and Hwang, J. T., “Review and unification of cited on p. 265
methods for computing derivatives of multidisciplinary compu-
tational models,” AIAA Journal, Vol. 51, No. 11, November 2013,
pp. 2582–2599.
doi: 10.2514/1.J052184
135 Yu, Y., Lyu, Z., Xu, Z., and Martins, J. R. R. A., “On the influence of cited on p. 281
optimization algorithm and starting design on wing aerodynamic
shape optimization,” Aerospace Science and Technology, Vol. 75, April
2018, pp. 183–199.
doi: 10.1016/j.ast.2018.01.016
136 Rios, L. M. and Sahinidis, N. V., “Derivative-free optimization: A cited on pp. 281, 282
review of algorithms and comparison of software implementations,”
Journal of Global Optimization, Vol. 56, 2013, pp. 1247–1293.
doi: 10.1007/s10898-012-9951-y
137 Conn, A. R., Scheinberg, K., and Vicente, L. N., Introduction to cited on p. 283
Derivative-Free Optimization. Philadelphia, PA: SIAM, 2009.
doi: 10.1137/1.9780898718768
138 Audet, C. and Hare, W., Derivative-Free and Blackbox Optimization. cited on p. 283
New York, NY: Springer, 2017.
doi: 10.1007/978-3-319-68913-5
139 Kokkolaras, M., “When, why, and how can derivative-free opti- cited on p. 283
mization be useful to computational engineering design?” Journal
of Mechanical Design, Vol. 142, No. 1, January 2020, p. 010 301.
doi: 10.1115/1.4045043
140 Simon, D., Evolutionary Optimization Algorithms. Hoboken, NJ: John cited on pp. 284, 310
Wiley & Sons, June 2013.
isbn: 1118659503
141 Audet, C. and Dennis, J. E., Jr., “Mesh adaptive direct search algo- cited on p. 295
rithms for constrained optimization,” SIAM Journal on Optimization,
Vol. 17, No. 1, July 2006, pp. 188–217.
doi: 10.1137/040603371
142 Le Digabel, S., “Algorithm 909: NOMAD: Nonlinear optimization cited on p. 296
with the MADS algorithm,” ACM Transactions on Mathematical
Software, Vol. 37, No. 4, 2011, pp. 1–15.
doi: 10.1145/1916461.1916468
143 Jones, D. R., “Direct global optimization algorithm,” Encyclopedia cited on pp. 296, 302
of Optimization, Floudas, C. A. and Pardalos, P. M., Eds. Boston,
MA: Springer, 2009, pp. 725–735, isbn: 978-0-387-74759-0.
doi: 10.1007/978-0-387-74759-0_128
144 Jarvis, R. A., “On the identification of the convex hull of a finite set cited on p. 301
of points in the plane,” Information Processing Letters, Vol. 2, No. 1,
1973, pp. 18–21.
doi: 10.1016/0020-0190(73)90020-3
145 Jones, D. R., Schonlau, M., and Welch, W. J., “Efficient global cited on pp. 302, 413
optimization of expensive black-box functions,” Journal of Global
Optimization, Vol. 13, 1998, pp. 455–492.
doi: 10.1023/A:1008306431147
146 Barricelli, N., “Esempi numerici di processi di evoluzione,” Metho- cited on p. 304
dos, 1954, pp. 45–68.
147 Jong, K. A. D., “An analysis of the behavior of a class of genetic cited on p. 304
adaptive systems,” PhD dissertation, University of Michigan, Ann
Arbor, MI, 1975.
148 Deb, K., Pratap, A., Agarwal, S., and Meyarivan, T., “A fast and cited on pp. 306, 362
elitist multiobjective genetic algorithm: NSGA-II,” IEEE Transactions
on Evolutionary Computation, Vol. 6, No. 2, April 2002, pp. 182–197.
doi: 10.1109/4235.996017
149 Deb, K., Multi-Objective Optimization Using Evolutionary Algorithms. cited on p. 311
Hoboken, NJ: John Wiley & Sons, 2001.
isbn: 047187339X
150 Eberhart, R. and Kennedy, J., “A new optimizer using particle cited on p. 314
swarm theory,” Proceedings of the Sixth International Symposium
on Micro Machine and Human Science. Institute of Electrical and
Electronics Engineers, 1995, pp. 39–43.
doi: 10.1109/MHS.1995.494215
151 Zhan, Z.-H., Zhang, J., Li, Y., and Chung, H. S.-H., “Adaptive cited on p. 315
particle swarm optimization,” IEEE Transactions on Systems, Man,
and Cybernetics, Part B (Cybernetics), Vol. 39, No. 6, April 2009,
pp. 1362–1381.
doi: 10.1109/TSMCB.2009.2015956
152 Gutin, G., Yeo, A., and Zverovich, A., “Traveling salesman should cited on p. 336
not be greedy: Domination analysis of greedy-type heuristics for
the TSP,” Discrete Applied Mathematics, Vol. 117, No. 1–3, March
2002, pp. 81–86, issn: 0166-218X.
doi: 10.1016/s0166-218x(01)00195-0
153 Kirkpatrick, S., Gelatt, C. D., and Vecchi, M. P., “Optimization cited on p. 345
by simulated annealing,” Science, Vol. 220, No. 4598, May 1983,
pp. 671–680, issn: 1095-9203.
doi: 10.1126/science.220.4598.671
154 Černý, V., “Thermodynamical approach to the traveling salesman cited on p. 345
problem: An efficient simulation algorithm,” Journal of Optimization
Theory and Applications, Vol. 45, No. 1, January 1985, pp. 41–51, issn:
1573-2878.
doi: 10.1007/bf00940812
155 Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., cited on p. 345
and Teller, E., “Equation of state calculations by fast computing
machines,” Journal of Chemical Physics, March 1953.
doi: 10.2172/4390578
156 Andresen, B. and Gordon, J. M., “Constant thermodynamic speed cited on p. 346
for minimizing entropy production in thermodynamic processes
and simulated annealing,” Physical Review E, Vol. 50, No. 6, Decem-
ber 1994, pp. 4346–4351, issn: 1095-3787.
doi: 10.1103/physreve.50.4346
157 Lin, S., “Computer solutions of the traveling salesman problem,” cited on p. 347
Bell System Technical Journal, Vol. 44, No. 10, December 1965,
pp. 2245–2269, issn: 0005-8580.
doi: 10.1002/j.1538-7305.1965.tb04146.x
158 Press, W. H., Teukolsky, S. A., Vetterling, W. T., and Flannery, B. P., cited on p. 349
Numerical Recipes in C: The Art of Scientific Computing. Cambridge,
UK: Cambridge
University Press, 1992.
isbn: 0521431085
159 Haimes, Y. Y., Lasdon, L. S., and Wismer, D. A., “On a bicriterion cited on p. 358
formulation of the problems of integrated system identification
and system optimization,” IEEE Transactions on Systems, Man, and
Cybernetics, Vol. SMC-1, No. 3, July 1971, pp. 296–297.
doi: 10.1109/tsmc.1971.4308298
160 Das, I. and Dennis, J. E., “Normal-boundary intersection: A new cited on p. 358
method for generating the Pareto surface in nonlinear multicriteria
optimization problems,” SIAM Journal on Optimization, Vol. 8, No.
3, August 1998, pp. 631–657.
doi: 10.1137/s1052623496307510
161 Ismail-Yahaya, A. and Messac, A., “Effective generation of the cited on p. 360
Pareto frontier using the normal constraint method,” Proceedings
of the 40th AIAA Aerospace Sciences Meeting & Exhibit. American
Institute of Aeronautics and Astronautics, January 2002.
doi: 10.2514/6.2002-178
162 Messac, A. and Mattson, C. A., “Normal constraint method with cited on p. 360
guarantee of even representation of complete Pareto frontier,”
AIAA Journal, Vol. 42, No. 10, October 2004, pp. 2101–2111.
doi: 10.2514/1.8977
163 Hancock, B. J. and Mattson, C. A., “The smart normal constraint cited on p. 360
method for directly generating a smart Pareto set,” Structural and
Multidisciplinary Optimization, Vol. 48, No. 4, June 2013, pp. 763–775.
doi: 10.1007/s00158-013-0925-6
164 Schaffer, J. D., “Some experiments in machine learning using cited on p. 361
vector evaluated genetic algorithms.” PhD dissertation, Vanderbilt
University, Nashville, TN, 1984.
165 Deb, K., “Introduction to evolutionary multiobjective optimization,” cited on p. 362
Multiobjective Optimization. Berlin: Springer, 2008, pp. 59–96.
doi: 10.1007/978-3-540-88908-3_3
166 Kung, H. T., Luccio, F., and Preparata, F. P., “On finding the maxima cited on p. 362
of a set of vectors,” Journal of the ACM, Vol. 22, No. 4, October 1975,
pp. 469–476.
doi: 10.1145/321906.321910
167 Faure, H., “Discrépance des suites associées à un système de cited on p. 381
numération (en dimension s).” Acta Arithmetica, Vol. 41, 1982,
pp. 337–351.
doi: 10.4064/aa-41-4-337-351
168 Faure, H. and Lemieux, C., “Generalized Halton sequences in 2008: cited on p. 381
A comparative study,” ACM Transactions on Modeling and Computer
Simulation, Vol. 19, No. 4, October 2009, pp. 1–31.
doi: 10.1145/1596519.1596520
169 Sobol, I. M., “On the distribution of points in a cube and the approx- cited on p. 381
imate evaluation of integrals,” USSR Computational Mathematics
and Mathematical Physics, Vol. 7, No. 4, 1967, pp. 86–112.
doi: 10.1016/0041-5553(67)90144-9
170 Niederreiter, H., “Low-discrepancy and low-dispersion sequences,” cited on p. 381
Journal of Number Theory, Vol. 30, No. 1, 1988, pp. 51–70.
doi: 10.1016/0022-314X(88)90025-X
171 Bouhlel, M. A., Hwang, J. T., Bartoli, N., Lafage, R., Morlier, J., and cited on p. 397
Martins, J. R. R. A., “A Python surrogate modeling framework with
derivatives,” Advances in Engineering Software, 2019, p. 102 662, issn:
0965-9978.
doi: 10.1016/j.advengsoft.2019.03.005
172 Bouhlel, M. A. and Martins, J. R. R. A., “Gradient-enhanced kriging cited on pp. 397, 406
for high-dimensional problems,” Engineering with Computers, Vol.
1, No. 35, January 2019, pp. 157–173.
doi: 10.1007/s00366-018-0590-x
173 Jones, D. R., “A taxonomy of global optimization methods based cited on p. 401
on response surfaces,” Journal of Global Optimization, Vol. 21, 2001,
pp. 345–383.
doi: 10.1023/A:1012771025575
174 Sacks, J., Welch, W. J., Mitchell, T. J., and Wynn, H. P., “Design and cited on p. 401
analysis of computer experiments,” Statistical Science, Vol. 4, No. 4,
1989, pp. 409–423, issn: 08834237.
doi: 10.2307/2245858
175 Han, Z.-H., Zhang, Y., Song, C.-X., and Zhang, K.-S., “Weighted cited on p. 406
gradient-enhanced kriging for high-dimensional surrogate model-
ing and design optimization,” AIAA Journal, Vol. 55, No. 12, August
2017, pp. 4330–4346.
doi: 10.2514/1.J055842
176 Forrester, A., Sobester, A., and Keane, A., Engineering Design via cited on p. 406
Surrogate Modelling: A Practical Guide. Hoboken, NJ: John Wiley &
Sons, 2008.
isbn: 0470770791
177 Ruder, S., “An overview of gradient descent optimization algo- cited on p. 411
rithms,” arXiv:1609.04747, 2016.
Url: http://arxiv.org/abs/1609.04747.
178 Goh, G., “Why momentum really works,” Distill, 2017. cited on p. 412
doi: 10.23915/distill.00006
179 Diamond, S. and Boyd, S., “Convex optimization with abstract lin- cited on p. 421
ear operators,” Proceedings of the 2015 IEEE International Conference
on Computer Vision (ICCV). Institute of Electrical and Electronics
Engineers, December 2015.
doi: 10.1109/iccv.2015.84
180 Lobo, M. S., Vandenberghe, L., Boyd, S., and Lebret, H., “Applica- cited on p. 423
tions of second-order cone programming,” Linear Algebra and Its
Applications, Vol. 284, No. 1–3, November 1998, pp. 193–228.
doi: 10.1016/s0024-3795(98)10032-0
181 Parikh, N. and Boyd, S., “Block splitting for distributed optimiza- cited on p. 423
tion,” Mathematical Programming Computation, Vol. 6, No. 1, October
2013, pp. 77–102.
doi: 10.1007/s12532-013-0061-8
182 Vandenberghe, L. and Boyd, S., “Semidefinite programming,” cited on p. 423
SIAM Review, Vol. 38, No. 1, March 1996, pp. 49–95.
doi: 10.1137/1038003
183 Vandenberghe, L. and Boyd, S., “Applications of semidefinite cited on p. 423
programming,” Applied Numerical Mathematics, Vol. 29, No. 3, March
1999, pp. 283–299.
doi: 10.1016/s0168-9274(98)00098-1
184 Boyd, S., Kim, S.-J., Vandenberghe, L., and Hassibi, A., “A tutorial cited on p. 434
on geometric programming,” Optimization and Engineering, Vol. 8,
No. 1, April 2007, pp. 67–127.
doi: 10.1007/s11081-007-9001-7
185 Hoburg, W., Kirschen, P., and Abbeel, P., “Data fitting with geometric- cited on p. 434
programming-compatible softmax functions,” Optimization and
Engineering, Vol. 17, No. 4, August 2016, pp. 897–918.
doi: 10.1007/s11081-016-9332-3
186 Kirschen, P. G., York, M. A., Ozturk, B., and Hoburg, W. W., “Ap- cited on p. 435
plication of signomial programming to aircraft design,” Journal of
Aircraft, Vol. 55, No. 3, May 2018, pp. 965–987.
doi: 10.2514/1.c034378
187 York, M. A., Hoburg, W. W., and Drela, M., “Turbofan engine cited on p. 435
sizing and tradeoff analysis via signomial programming,” Journal
of Aircraft, Vol. 55, No. 3, May 2018, pp. 988–1003.
doi: 10.2514/1.c034463
188 Stanley, A. P. and Ning, A., “Coupled wind turbine design and cited on p. 441
layout optimization with non-homogeneous wind turbines,” Wind
Energy Science, Vol. 4, No. 1, January 2019, pp. 99–114.
doi: 10.5194/wes-4-99-2019
189 Gagakuma, B., Stanley, A. P. J., and Ning, A., “Reducing wind cited on p. 441
farm power variance from wind direction using wind farm layout
optimization,” Wind Engineering, January 2021.
doi: 10.1177/0309524X20988288
190 Padrón, A. S., Thomas, J., Stanley, A. P. J., Alonso, J. J., and Ning, cited on p. 441
A., “Polynomial chaos to efficiently compute the annual energy
production in wind farm layout optimization,” Wind Energy Science,
Vol. 4, May 2019, pp. 211–231.
doi: 10.5194/wes-4-211-2019
191 Cacuci, D., Sensitivity & Uncertainty Analysis. Boca Raton, FL: cited on p. 447
Chapman and Hall/CRC, May 2003, Vol. 1.
doi: 10.1201/9780203498798
192 Parkinson, A., Sorensen, C., and Pourhassan, N., “A general ap- cited on p. 447
proach for robust optimal design,” Journal of Mechanical Design, Vol.
115, No. 1, 1993, p. 74.
doi: 10.1115/1.2919328
193 Golub, G. H. and Welsch, J. H., “Calculation of Gauss quadrature cited on p. 452
rules,” Mathematics of Computation, Vol. 23, No. 106, 1969, pp. 221–
230, issn: 00255718, 10886842.
doi: 10.1090/S0025-5718-69-99647-1
194 Wilhelmsen, D. R., “Optimal quadrature for periodic analytic cited on p. 454
functions,” SIAM Journal on Numerical Analysis, Vol. 15, No. 2, 1978,
pp. 291–296, issn: 00361429.
doi: 10.1137/0715020
195 Trefethen, L. N. and Weideman, J. A. C., “The exponentially conver- cited on p. 454
gent trapezoidal rule,” SIAM Review, Vol. 56, No. 3, 2014, pp. 385–
458, issn: 00361445, 10957200.
doi: 10.1137/130932132
196 Johnson, S. G., “Notes on the convergence of trapezoidal-rule cited on p. 454
quadrature,” March 2010.
Url: http://math.mit.edu/~stevenj/trapezoidal.pdf.
197 Smolyak, S. A., “Quadrature and interpolation formulas for tensor cited on p. 455
products of certain classes of functions,” Proceedings of the USSR
Academy of Sciences, Vol. 148, No. 5, 1963, pp. 1042–1045.
doi: 10.3103/S1066369X10030084
198 Wiener, N., “The homogeneous chaos,” American Journal of Mathe- cited on p. 459
matics, Vol. 60, No. 4, October 1938, p. 897.
doi: 10.2307/2371268
199 Eldred, M., Webster, C., and Constantine, P., “Evaluation of non- cited on p. 462
intrusive approaches for Wiener–Askey generalized polynomial
chaos,” Proceedings of the 49th AIAA Structures, Structural Dynamics,
and Materials Conference. American Institute of Aeronautics and
Astronautics, April 2008.
doi: 10.2514/6.2008-1892
200 Adams, B. M., Bohnhoff, W. J., Dalbey, K. R., Ebeida, M. S., Eddy, J. P., cited on p. 463
Eldred, M. S., Hooper, R. W., Hough, P. D., Hu, K. T., Jakeman, J. D.,
Khalil, M., Maupin, K. A., Monschke, J. A., Ridgway, E. M., Rushdi,
A. A., Seidl, D. T., Stephens, J. A., Swiler, L. P., and Winokur, J. G.,
“Dakota, a multilevel parallel object-oriented framework for design
optimization, parameter estimation, uncertainty quantification,
and sensitivity analysis: Version 6.14 user’s manual,” May 2021.
Url: https://dakota.sandia.gov/content/manuals.
201 Feinberg, J. and Langtangen, H. P., “Chaospy: An open source tool cited on p. 463
for designing methods of uncertainty quantification,” Journal of
Computational Science, Vol. 11, November 2015, pp. 46–57.
doi: 10.1016/j.jocs.2015.08.008
202 Jasa, J. P., Hwang, J. T., and Martins, J. R. R. A., “Open-source cited on pp. 477, 494
coupled aerostructural optimization using Python,” Structural and
Multidisciplinary Optimization, Vol. 57, No. 4, April 2018, pp. 1815–
1827.
doi: 10.1007/s00158-018-1912-8
203 Cuthill, E. and McKee, J., “Reducing the bandwidth of sparse cited on p. 482
symmetric matrices,” Proceedings of the 1969 24th National Confer-
ence. New York, NY: Association for Computing Machinery, 1969,
pp. 157–172.
doi: 10.1145/800195.805928
204 Amestoy, P. R., Davis, T. A., and Duff, I. S., “An approximate cited on p. 482
minimum degree ordering algorithm,” SIAM Journal on Matrix
Analysis and Applications, Vol. 17, No. 4, 1996, pp. 886–905.
doi: 10.1137/S0895479894278952
205 Lambe, A. B. and Martins, J. R. R. A., “Extensions to the design cited on p. 482
structure matrix for the description of multidisciplinary design,
analysis, and optimization processes,” Structural and Multidiscipli-
nary Optimization, Vol. 46, August 2012, pp. 273–284.
doi: 10.1007/s00158-012-0763-y
206 Irons, B. M. and Tuck, R. C., “A version of the Aitken accelerator cited on p. 486
for computer iteration,” International Journal for Numerical Methods
in Engineering, Vol. 1, No. 3, 1969, pp. 275–277.
doi: 10.1002/nme.1620010306
207 Kenway, G. K. W., Kennedy, G. J., and Martins, J. R. R. A., “Scalable cited on pp. 486, 500
parallel approach for high-fidelity steady-state aeroelastic analysis
and derivative computations,” AIAA Journal, Vol. 52, No. 5, May
2014, pp. 935–951.
doi: 10.2514/1.J052255
208 Chauhan, S. S., Hwang, J. T., and Martins, J. R. R. A., “An automated cited on p. 486
selection algorithm for nonlinear solvers in MDO,” Structural and
Multidisciplinary Optimization, Vol. 58, No. 2, June 2018, pp. 349–377.
doi: 10.1007/s00158-018-2004-5
209 Kenway, G. K. W. and Martins, J. R. R. A., “Multipoint high-fidelity cited on p. 500
aerostructural optimization of a transport aircraft configuration,”
Journal of Aircraft, Vol. 51, No. 1, January 2014, pp. 144–160.
doi: 10.2514/1.C032150
210 Hwang, J. T., Lee, D. Y., Cutler, J. W., and Martins, J. R. R. A., cited on p. 508
“Large-scale multidisciplinary optimization of a small satellite’s
design and operation,” Journal of Spacecraft and Rockets, Vol. 51, No.
5, September 2014, pp. 1648–1663.
doi: 10.2514/1.A32751
211 Biegler, L. T., Ghattas, O., Heinkenschloss, M., and Bloemen Waan- cited on p. 513
ders, B. van, Eds., Large-Scale PDE-Constrained Optimization. Berlin:
Springer, 2003.
212 Braun, R. D. and Kroo, I. M., “Development and application of cited on pp. 517, 518
the collaborative optimization architecture in a multidisciplinary
design environment,” Multidisciplinary Design Optimization: State of
the Art, Alexandrov, N. and Hussaini, M. Y., Eds. Philadelphia, PA:
SIAM, 1997, pp. 98–116.
doi: 10.5555/888020
213 Kim, H. M., Rideout, D. G., Papalambros, P. Y., and Stein, J. L., cited on p. 520
“Analytical target cascading in automotive vehicle design,” Journal
of Mechanical Design, Vol. 125, No. 3, September 2003, pp. 481–490.
doi: 10.1115/1.1586308
214 Tosserams, S., Etman, L. F. P., Papalambros, P. Y., and Rooda, cited on p. 520
J. E., “An augmented Lagrangian relaxation for analytical target
cascading using the alternating direction method of multipliers,”
Structural and Multidisciplinary Optimization, Vol. 31, No. 3, March
2006, pp. 176–189.
doi: 10.1007/s00158-005-0579-0
215 Talgorn, B. and Kokkolaras, M., “Compact implementation of non- cited on p. 520
hierarchical analytical target cascading for coordinating distributed
multidisciplinary design optimization problems,” Structural and
Multidisciplinary Optimization, Vol. 56, No. 6, 2017, pp. 1597–1602.
doi: 10.1007/s00158-017-1726-0
216 Sobieszczanski-Sobieski, J., Altus, T. D., Phillips, M., and Sandusky, cited on p. 523
R., “Bilevel integrated system synthesis for concurrent and dis-
tributed processing,” AIAA Journal, Vol. 41, No. 10, 2003, pp. 1996–
2003.
doi: 10.2514/2.1889
217 Tedford, N. P. and Martins, J. R. R. A., “Benchmarking multidiscipli- cited on p. 529
nary design optimization algorithms,” Optimization and Engineering,
Vol. 11, No. 1, February 2010, pp. 159–183.
doi: 10.1007/s11081-009-9082-6
218 Golovidov, O., Kodiyalam, S., Marineau, P., Wang, L., and Rohl, cited on p. 530
P., “Flexible implementation of approximation concepts in an
MDO framework,” Proceedings of the 7th AIAA/USAF/NASA/ISSMO
Symposium on Multidisciplinary Analysis and Optimization. American
Institute of Aeronautics and Astronautics, 1998.
doi: 10.2514/6.1998-4959
219 Balabanov, V., Charpentier, C., Ghosh, D. K., Quinn, G., Vander- cited on p. 530
plaats, G., and Venter, G., “VisualDOC: A software system for general
purpose integration and design optimization,” Proceedings of the 9th
AIAA/ISSMO Symposium on Multidisciplinary Analysis and Optimiza-
tion. American Institute of Aeronautics and Astronautics, 2002.
doi: 10.2514/6.2002-5513
220 Trefethen, L. N. and Bau III, D., Numerical Linear Algebra. Philadel- cited on pp. 554, 564
phia, PA: SIAM, 1997.
isbn: 0898713617
221 Saad, Y. and Schultz, M. H., “GMRES: A generalized minimal cited on p. 565
residual algorithm for solving nonsymmetric linear systems,” SIAM
Journal on Scientific and Statistical Computing, Vol. 7, No. 3, 1986,
pp. 856–869.
doi: 10.1137/0907058
222 Broyden, C. G., “A class of methods for solving nonlinear simul- cited on p. 566
taneous equations,” Mathematics of Computation, Vol. 19, No. 92,
October 1965, pp. 577–593.
doi: 10.1090/S0025-5718-1965-0198670-6
223 Rosenbrock, H. H., “An automatic method for finding the greatest cited on p. 573
or least value of a function,” The Computer Journal, Vol. 3, No. 3,
January 1960, pp. 175–184, issn: 0010-4620.
doi: 10.1093/comjnl/3.3.175
224 Barnes, G. K., “A comparative study of nonlinear optimization cited on p. 580
codes,” Master’s thesis, University of Texas at Austin, 1967.
225 Venkayya, V., “Design of optimum structures,” Computers & Struc- cited on p. 582
tures, Vol. 1, No. 1–2, August 1971, pp. 265–309, issn: 0045-7949.
doi: 10.1016/0045-7949(71)90013-7
Index

absolute value function airfoil optimization, 441
complex-step method, 235 Aitken acceleration, 486, 487,
smoothing, 144 494
accuracy, 48 algorithmic differentiation (AD),
activation functions, 407, 408 41, 224, 236, 494
rectified linear unit (ReLU), adjoint variables, 242
408 checkpointing, 246
sigmoid, 408 computational cost, 245
active constraints, 164, 189 computational graph, 241
active-set method, 189 connection to complex-step
acyclic graph, 482 method, 248
adjoint method, 39, 40, 254 coupled systems, 498
AD partial derivatives, 259 directional derivative, 241
constraint aggregation, 210, forward mode, 237, 238, 498
259 forward vs. reverse, 245
coupled, 500 matrix operations, 250
equations, 255 operator overloading, 246,
structural problem, 258 247, 249
variables, 242, 255 partial derivatives, 259
vector, 255 reverse mode, 43, 237, 242,
verification, 261 410, 498
aerodynamic shape optimiza- scaling, 245
tion, 20, 40, 281, 441 seed, 242, 245
aerostructural shortcuts, 250
analysis, 473 software, 249
model, 477 source code transformation,
affine function, 422, 433 246, 249
aggregation functions taping, 248
𝑝-norm, 212 verification, 261
induced exponential, 212 analysis, 3, 6, 69
induced power, 213 analytic
Kreisselmeier–Steinhauser function, 231
(KS), 211 methods, see implicit ana-
aircraft fuel tank problem, 217 lytic methods

analytical target cascading (ATC), gradient-based algorithms,
520 132, 134
anchor points, 358 gradient-free algorithms, 282
approximate Hessian, 123, 567 MDO architectures, 529
approximate minimum degree stochastic algorithms, 284
(AMD) ordering, 482 BFGS method, 38, 125, 128, 545,
Armijo condition, see sufficient 567, 569, 571
decrease condition damped, 199
artificial intelligence (AI), 39, 43 derivation, 125
artificial minimum, 280 Hessian reset, 128
asymmetric subspace optimiza- limited memory, see L-BFGS
tion (ASO), 525 method
asymptotic error constant, 62 SQP, 198
augmented Lagrangian method, update, 127
175, 180, 197, 313 bilevel integrated system syn-
automatic differentiation, see al- thesis (BLISS), 523
gorithmic differentia- binary
tion (AD) decoding, 306
encoding, 306
back substitution, 244, 556 representation, 306
backpropagation, 43, see also al- variables, 325, 329
gorithmic differentia- binding-direction method, 190
tion (AD) biological reproduction, 305
backtracking, 97, 98, 184 bisection, 107, 109
backward difference, 227 black-box model, 18, 226, 231,
banana function, see Rosenbrock 279, 480, 497, 515
function derivatives, 224
Barnes problem, 444, 580 solver, 503
barrier methods, see interior penalty blocking constraint, 193
methods Boltzmann distribution, 345
basis functions, 381, 382, 396, bound constraints, 7, 153, 154,
459 317
Gaussian, 397 artificial, 155
radial, see radial basis func- brachistochrone problem, 35, 151,
tion (RBF) 577
bean function, 97, 112, 118, 121, bracketing, 102
129, 132, 289, 312, 317, branch-and-bound method, 283,
574 328, 331
Bellman integer variables, 334
equation, 38, 340 relaxation, 329
principle of optimality, 340 breadth-first tree search, 331
benchmarking, 318 Broyden’s method, 68, 492, 566
bugs, 57 step size, 233
testing, 236
calculus of variations, 34–36 trigonometric functions, 236
callback functions, 15 component, 268, 471, 474, 476,
Cauchy–Riemann equations, 232 483, 484
ceiling, 318 explicit, 476, 499, 505
central difference, 227, 229 group, 479, 483
central limit theorem, 387 implicit, 476, 499
chain rule, 236, 237, 536 multiple, 224
forward, 238 composite function, 536
multivariable, 538 computational cost, 49, 62, 371
reverse, 242 AD, 236, 246
characteristic equation, 549 adjoint method, 256
checkerboard pattern, 318 analysis, 59
checkpointing, 246 budget, 92, 313
Cholesky factorization, 60, 556 complex step, 231, 232
chromosome, 305, 309 derivatives, 222
encoding, 306 direct method, 255
classification direct vs. adjoint, 256
convex problems, 423 finite difference, 227
gradient-free algorithms, 282 forward AD, 239
MDO architectures, 530 linear solvers, 61
optimization algorithms, 21 optimization, 22, 47, 281
optimization problems, 17 reverse AD, 243
problem, 430 solvers, 12, 60, 252
stationary points, 90 computational differentiation,
Clenshaw–Curtis quadrature, 454 see algorithmic differ-
collaborative optimization (CO), entiation (AD)
517 computational fluid dynamics
collocation points, 463 (CFD), 40, 281, 441, 527
column space, 542 computational graph, 241, 245
combinatorial optimization, 36, computer code, see source code
37, see also discrete op- conceptual design, 3
timization concurrent subspace optimiza-
complementary slackness con- tion (CSSO), 528
dition, 167, 189 condition number, 555, 565
complex-step method, 42, 224, cone programming, 423
231, 232, 498 confidence interval, 402
absolute value function, 235 conjugacy, 115, 116, 560
accuracy, 233 conjugate gradient method, 114,
connection to AD, 248 115, 560
implementation, 234 Fletcher–Reeves formula, 116
linear, 114, 560, 563 residuals, 65
nonlinear, 116, 117 tolerance, 95, 137, 224
Polak–Ribière formula, 117 convex
reset, 116, 117 function, 20, 422
consistency constraints, 511, 517, hull, 300, 301
521 optimization, 20, 41, 421
constrained optimization, 36, 152 problem, 386
graphical solution, 153 convexity, 20, 27
problem statement, 153 coordinate
constraint qualification, 168, 519 descent, 472
constraints, 12, 312 search, 39, 114
active, 13, 164, 189 search algorithm, 290
aggregation, 203, 210, 259 correlation, 398, 553
blocking, 193 matrix, 404
bound, 7, 153 coupled
consistency, 511, 517, 521 adjoint, 502
equality, 12, 153 Broyden method, 495
equality versus inequality, derivatives, 505
153, 193 model, 474
functions, 12 Newton’s method, 495
handling, 152 solver, 496
inactive, 13, 164, 189 system, 480, 484
inequality, 12, 153 coupling variables, 474, 478, 480,
infeasible, 13 482, 483, 485, 488, 502,
Jacobian, 187 511
reformulation, 196 covariance, 399, 446, 447, 553
scaling, 180 matrix, 447
working set, 189 cross validation, 392, 394, 409
continuity, 18 𝑘-fold, 394
continuous parameterization, 328 leave-one-out, 395
contour simple, 394
perpendicular, 79 crossover, 305, 309, 311
tangent, 156 linear, 311
control law, 427 point, 309
convergence, 313 single-point, 309
criterion, 236 crowded tournament selection,
failure, 47, 95, 137, 273 366
order of, 63, 65 crowding distance, 364
plot, 65 cubature, 454
quadratic, 120 cubic interpolation, 108, 144
rate, 62, see rate of conver- cuboid, 364
gence, 285
cumulative distribution function directional, 81, 97, 228, 241
(CDF), 376, 438, 551 ease of implementation, 274
curse of dimensionality, 371, 373, eigenvalues, 251
455 eigenvectors, 251
curvature, 22, 83, 110, 111, 118 explicit, 265
approximation, 122 first-order, 222
condition, 124, 568 implicit, 265
directional, 84, 123 implicit analytic methods,
maximum, 85 251
principal directions, 85, 114 matrix, 547
curve fit, 370 matrix operations, 250, 251
Cuthill–McKee ordering, 482 methods, 224, 274
CVX, 430 mixed partial, 83
partial, 78, 238, 253, 260,
damped BFGS update, 199 266, 500, 537
data physical interpretation, 80
dependencies, 482 post-optimality, 526, 528
fitting, 382 propagation, 237, 238, 241,
model, 381 242, 252
transfer, 341, 482, 483 relative, 80
debugging, 58 scalability, 274
decaying sinusoid, 402 scaling, 273
decision variables, see design second-order, 83
variables, 325 singular value decomposi-
decomposition, see factorization tion, 251
deep neural networks, 43, 406 sparse, 261, 263
dense matrix, 554 total, 238, 253, 537
dependence verification, 241, 273
implicit, 252 weighted function, 245
dependency structure matrix, descent direction, 96, 164
see design structure ma- design
trix (DSM) constraints, 12
depth-first tree search, 330 cycle, 3
derivative-free optimization (DFO), optimal vs. conventional,
24, 283 4, 5
derivatives, 222 optimization, 3
accuracy, 137, 274 phases, 2
backward propagation, 244 process, 2
black box, 224 sensitivities, see derivatives
computational cost, 274 space visualization, 10, 12,
coupled, 41, 497, 499, 505 14
definition, 227 specifications, 3
design structure matrix (DSM), directed graph, 482
481, 500 weighted, 336
design variables, 6, 252, 475 directional
binary, 325, 329 curvature, 84, 123
bounds, 153, 154, 317 derivative, 81, 97, 228, 241
continuous, 7, 17 disciplinary subproblem, 515,
converting integer to binary, 517
329 discipline, 471, 476
discrete, 8, 17, 27, 281, 310, disciplined convex optimization,
325 422, 428
integer, 310, 325, 334 software, 430
mixed, 17 discontinuity, 7, 19, 230, 280,
parameterization, 9, 328 318, 319
scalability, 22, 281 smoothing, 144
scaling, 112, 136 discrete optimization, 26, 36,
shared, 472 325
units, 112 dynamic programming, 337
detailed design, 3 dynamic rounding, 327
determinant, 546 genetic algorithm, 349
deterministic function, 20 greedy algorithms, 335
DFP method, 38, 125, 568, 571 rounding, 327
update, 568 simulated annealing, 345
diagonal matrix, 545 discrete variables, 27, 281, 310,
Dido’s problem, 33 325
differential, 173, 253, 265, 538 avoiding, 326
differentiation, see also deriva- discretization, 47, 49, 62
tives error, 48, 56
algorithmic, see algorithmic methods, 51
differentiation (AD) divergence, 63
chain rule, see chain rule diversity, 43, 308, 310
numerical, 226 dominance, 197, 355
symbolic, see symbolic dif- depth, 363
ferentiation dominated point, 356
DIRECT algorithm, 42, 283, 296 dot product, 540
𝑛-dimensional, 302, 303 2-norm, 544
one-dimensional, 299 test, 261
direct linear solver, 555 double-precision
direct method, 254 floating-point format, 53
coupled, 499 number, 54
structural problem, 258 dual number, 247
verification, 261 dynamic
direct quadrature, 445, 449 polling, 292
programming, 26, 38, 337, exit conditions, 92
342 expected
rounding, 327 improvement, 413
system, 427 value, 410, 414, 438, 439
expected value, see mean
efficient global optimization (EGO), experimental data, 371
284, 413, 414 experiments, 5
eigenvalues, 85, 89, 251, 547, 548 explicit
eigenvectors, 85, 114, 251, 548, component, 476, 499, 505
561 equation, 265
elitism, 314, 366 function, 49, 478
engineering design, 2, 45, 471, model, 50
472 exploitation, 317, 412
enhanced collaborative optimiza- exploration, 284, 317, 413
tion (ECO), 528 exponential
equality constraints, 12 convergence, 454
equality of mixed partials, 83 distribution, 552
error, 57, 224 function, 211
absolute, 53 expression swell, 226
constant, see asymptotic er- exterior penalty, 175
ror constant extrapolation, 311
discretization, 48, 56
iterative solver tolerance, 56 factorization
modeling, 47 Cholesky, 556
numerical, 47, 48, 52 LU, 555
programming, 57 Farkas’ lemma, 36, 165
propagation, 54 Faure sequence, 381
relative, 53 feasibility tolerance, 200, 207
roundoff, 48, 53, 56, 57, 556 feasible
truncation, 56, 227 descent direction, 165
Euclidean direction, 158
norm, 544 region, 12
space, 540 space, 184
Euler–Lagrange equation, 36 feedback, 487
evolution, 305 Fibonacci sequence, 338
evolutionary algorithms, 39, 42, file input and output, 224, 475,
284, 361 483
GA, 304 filter methods, 197, 313
PSO, 314 finite-difference derivatives, 224,
exact penalty decomposition (EPD), 226, 252, 273, 279, 498,
528 508
exhaustive search, 296, 326, 327 accuracy, 229
backward difference, 227 design (SAND)
central difference, 227, 229 function
coupled, 498 blending, 144
forward difference, 227, 229 constraint, 12
higher-order, 228 explicit, 478
implementation, 230 implicit, 479
optimal step size, 229 of interest, 223, 252
step, 227 objective, 9
step-size dilemma, 228, 229 smoothness, 280
step-size study, 229 functional form, 480, 483, 501
finite-difference discretization, Jacobian, 503
51
finite-element Gauss–Hermite quadrature, 452–
discretization, 49, 51, 582 454, 465
structural model, 527 Gauss–Konrod quadrature, 454
finite-precision arithmetic, 53, Gauss–Legendre quadrature, 452
55, 91, 228, 307 Gauss–Newton algorithm, 389
finite-volume discretization, 51 Gauss–Seidel method
first-order linear, 558
derivatives, 222 nonlinear block, 485, 487,
perturbation methods, 445, 494, 495, 507
446 Gaussian
fitness, 284, 307, 308 basis, 397
fixed-point iteration, 61, 225, 483, distribution, see normal prob-
530, 557 ability distribution
Fletcher–Reeves formula, 116 elimination, 555
floating-point format, 53 kernel, 398
food shopping problem, 424 multivariate distribution, 399
forward difference, 227, 229 process, see kriging, 414
forward propagation, 445 quadrature, 450
direct quadrature, 449 gene, 305
first-order perturbation, 446 generalization error, 394
Monte Carlo, 456 generalized minimum residual
polynomial chaos, 459 (GMRES) method, 564
forward substitution, 241, 482 generalized pattern search (GPS),
four fundamental subspaces, 542 283, 290, 293, 294
Frobenius norm, 545 genetic algorithm (GA), 39, 304,
full factorial sampling, 372 306, 361, 373
full-space hierarchical binary-encoded, 305, 306,
Newton’s method, 489 349
full-space optimization, see si- constraints, 312
multaneous analysis and crossover, 309
discrete optimization, 349 weighted directed, 336
multiobjective, 361, 365 graph form programming, 423
mutation, 309 graphical solution, 14
real-encoded, 305, 310 greedy algorithms, 335
selection, 307 grocery store shopping, 337
geometric programming (GP),
41, 423, 432 H-section beam problem, 216
software, 435 Hadamard product, 166
Gibbs distribution, see Boltzmann half-space, 156, 158, 164
distribution intersection, 164
global Halton sequence, 379
optimization, 42 scrambled, 380
optimum, 19, 20, 24, 145, Hammersley sequence, 380, 405
285, 297, 320, 421 Hamming cliff, 310
search, 23, 145, 280, 282, Hartmann function, 574
284, 320 Hermite polynomials, 453, 464
globalization strategy, 68, 94, Hessian, 84, 109, 114, 120, 143
488 approximation, 109, 122, 123,
governing equations, 49, 60, 69, 125, 567
251, 253, 475 directional curvature, 84
GPkit, 435 eigenvalues, 85
gradient, 77, 78, 120 eigenvectors, 85, 114
normalization, 110 Gauss–Newton algorithm,
scaling, 136 390
gradient-based algorithms, 22, initial approximation, 123
28, 77, 373, 410 interpretation, 84
comparison, 132, 134 inverse approximation, 126,
constrained, 152 568, 569
efficiency, 222 inverse update, 568
unconstrained, 77, 95 Lagrangian, 160, 161, 190
gradient-descent method, see steepest- positive-definite, 120, 124
descent method positive-semidefinite, 88
gradient-enhanced kriging, 402 symmetry, 84
predictor, 402 update, 123, 567
gradient-free algorithms, 22, 28, vector product, 84
279 heuristics, 24, 113, 285
graph, 482 hierarchical solvers, 41, 495, 499
acyclic, 482 hierarchy, 477, 479, 497
coloring, 261, 263, 264, 500, higher-order moments, 447
504 histogram, 374, 449
cyclic, 482 history of optimization, 33
directed, 482 hit-and-run algorithms, 284
human expertise, 4 inactive constraints, 164, 189
hybrid adjoint, 259, 260 indefinite matrix, 547
hypercube, 302 individual discipline feasible (IDF),
hyperplane, 156 511
intersection, 158 induced functions, 212
tangent, 156, 158, 164 inequality
hyperrectangle, 302 constraints, 12, 153
potentially optimal, 302 quadratic penalty, 179
trisection, 302 inertia, 314
hypersurface, 11, 253 inexact penalty decomposition
(IPD), 528
identity matrix, 205, 545 infeasibility, 197
scaled, 131 infeasible
ill-conditioning, 55, 349, 555, directions, 164
583 region, 13
aggregation function, 211 infill, 372, 412
collaborative optimization initial design, 3, 8, 77
(CO), 519 inner product, 451, 540
interpolation, 109 weighted, 452
least squares, 384 input and output conversion,
line search, 105 477
Newton’s method, 68 inputs, 6, 224, 371, 445, 478, 479
penalty function, 175, 181, integer
186 overflow, 53
imaginary step, 231 programming, see discrete
implicit optimization
component, 476, 499 variables, 325, 334
dependence, 252 integer variables, 310
equation, 49, 265, 539 interior penalty methods, 183
filtering, 283, 320 interior-point methods, 39, 41,
function, 49, 70, 252, 479 152, 186, 203
model, 50 line search, 205
implicit analytic methods, 224, with quasi-Newton approx-
251 imation, 207
adjoint, 210, 254, 259 interpolation, 107, 381
coupled, 499 cubic, 108, 144
direct, 254 ill-conditioning, 109
direct vs. adjoint, 256 non-smooth, 144
forward mode, 254 quadratic, 107, 108
reverse mode, 255 intuition, 3, 10, 12, 29
structural problem, 258 invasive weed optimization, 284
verification, 261 inverse
barrier, 183 Kreisselmeier–Steinhauser (KS)
cumulative distribution, 376 function, 211
inversion sampling, 376 kriging, 37, 42, 284, 397, 414
investment portfolio selection, gradient-enhanced, 402
341 kernel, 398, 402
isosurface, 11, 79, 81 ordinary, 398
tangent, 156 predictor, 400
iterations regression-based, 406
major, 94 Krylov subspace methods, 61,
minor, 94 68, 495, 557, 564
iterative kurtosis, 447, 551
linear solvers, 555
solvers, 56, 61, 62, 225, 236 L-BFGS method, 130, 131
Lagrange multipliers, 36, 181,
Jacobi method 187, 197, 524
linear, 558 adjoint interpretation, 256
nonlinear block, 483–485, equality constraints, 159
495 inequality constraints, 166
Jacobian, 68, 119, 154, 223, 227, interior-point method, 206
231, 243, 245, 253, 491, meaning of, 172
566 Lagrangian
compressed, 263 function, 160, 186, 204
constraints, 187 mechanics, 36
coupled, 500 Latin hypercube sampling (LHS),
diagonal, 262 145, 374, 376, 377, 457
inverse, 567 law of large numbers, 456
nullspace, 164 law of reflection, 34
size, 154, 256 law of refraction, 34
sparse, 260–263, 500, 504 leading principal
square, 260 minor, 546
structure, 500, 503 submatrix, 546
transpose, 500 learning rate, 411
Jones function, 145, 295, 304, least squares, 36, 383
318, 574 constrained, 426
discontinuous, 318 linear, 383
nonlinear, 388, 389
Kepler’s equation, 35, 75, 225 regularized, 385
kernel, 398 left nullspace, 543
KKT conditions, 37, 167, 186, legacy codes, 225
280, 519, 569 Legendre polynomial, 451
knapsack problem, 337, 341 Levenberg–Marquardt algorithm,
dynamic programming, 344 389
tabulation, 343
likelihood function, 387 continuity, 297
concentrated, 400 local
line search, 38, 92, 95, 114 constraints, 516
algorithm, 95 design variables, 516
backtracking, 97, 98 optimum, 19, 285
bracketing, 102 search, 23, 77, 282
comparison with trust re- log likelihood function, 387
gion, 94, 143 logarithmic
exact, 96, 101, 115, 116 barrier, 184
ill-conditioning, 105 scale, 65
interior-point methods, 205 logical operators, 235
interpolation, 107 lognormal distribution, 552
Newton’s method, 121 loops, 137, 225, 237
overview, 94 unrolling, 237
pinpointing, 102, 104 low-discrepancy sequence, 377,
plot, 106, 121, 135 457
quasi-Newton method, 123 lower
SQP, 187, 196 convex hull, 300, 301
step length, 98 triangular matrix, 241, 272,
sufficient decrease, 97 482, 555
unit step, 135 LU factorization, 60, 493, 555
linear
conjugate gradient, 563 machine learning, 2, 5, 43
convergence, 63 deep neural networks, 406
direct solvers, 555 hidden layers, 406
function, 422 input layer, 406
independence constraint qual- maximum likelihood, 386
ification, see constraint minibatch, 410
qualification neural networks, 406
iterative solvers, 555 output layer, 406
least squares, 422 support vector machine, 431
mixed-integer programming, machine precision, 54, 310
328, 329 machine zero, see machine pre-
programming (LP), 19, 37, cision
329, 423 major iterations, 94
regression, 382 manifold, 253
solvers, 554 manufacturing, 444
system, 554 Markov
linear-quadratic regulator (LQR), chain, 337, 342
427 variable-order, 337
Lipschitz process, 337
constant, 297 mating pool, 307
matrix MAUD, see modular analysis
bandwidth, 482 and unified derivatives
block diagonal, 494 (MAUD)
column space, 542 maximization as minimization,
condition number, 555, 565 10
dense, 554 maximum
derivatives, 547 curvature, 85
determinant, 546 likelihood, 386, 399
diagonal, 545 log likelihood, 399
factorization, 61, 555 point, 87, 92
Hadamard product, 166 step, 98
identity, 205, 545 MDO architectures, 40, 471, 529
ill-conditioned, 555 ASO, 525
indefinite, 547 ATC, 520
inverse, 61, 546, 554 BLISS, 523
inverse product, 250 BLISS-2000, 528
Jacobian, 223 CO, 517
leading principal minor, 546 CSSO, 528
lower triangular, 241, 272, distributed, 515, 529
482, 555 ECO, 528
multiplication, 539 EPD, 528
negative-definite, 547 IDF, 511
norm, 545 IPD, 528
nullspace, 155, 159, 543 MAUD, 529
positive-definite, 89, 546 MDF, 507, 527
positive-semidefinite, 88, 547 MDOIS, 528
rank, 155, 542 monolithic, 506, 529
reordering, 482, 487 QSD, 528
row space, 542 SAND, 513, 515
scaled identity, 131 MDO frameworks, 530
size notation, 539 MDO of independent subspaces
sparse, 482, 554 (MDOIS), 528
splitting, 557 mean value, 549, 550
stiffness, 583 memoization, 338
symmetric, 546 merit function, 197
symmetric positive-definite, mesh refinement, 56
556, 560 mesh-adaptive direct search (MADS),
transpose, 481, 545 283
upper triangular, 245, 273, metamodel, see surrogate mod-
555 els
vector product, 541 method of lines, 52
well-conditioned, 555 minibatch, 410
minimum, 92 multiobjective optimization, 10,
global vs. local, 19 18, 28, 197, 281, 353,
strong, 89 440
weak, 19 NBI method, 358
minor iterations, 94 epsilon constraint method,
mixed-integer programming, 325 358
linear, 329 weighted-sum method, 356
model, 6, 471, 483 evolutionary algorithms, 361
data-driven, 381 GA, 361
explicit, 50 objectives versus constraints,
implicit, 50 154
inputs and outputs, 479 problem statement, 355
multidisciplinary, 475 multiphysics, 471
optimization considerations, multiple local minima, see mul-
69 timodality
physics-based, 381 multipoint optimization, 441
statistical, 397, 398 multistart, 145, 280, 319, 372,
modeling error, 47 373
modular analysis and unified multivariate Gaussian distribu-
derivatives (MAUD), 494– tion, 399
497, 499, 504, 508, 529 mutation, 305, 309
modularity, 475, 477
monolithic solver, 495 𝑁 2 matrix, see design structure
monomial, 432 matrix (DSM)
Monte Carlo simulation, 373, natural selection, 305, 308
445, 449, 456, 470 negative-definite matrix, 547
multidisciplinary neighboring design, 346
model, 475 Nelder–Mead algorithm, 39, 283,
multidisciplinary analysis (MDA), 285, 288, 312, 349
482, 484, 498, 507, 530 convergence, 287
multidisciplinary design feasi- operations, 286
ble (MDF), 507, 509, simplex, 285, 286
527 neural networks, 43, 406
multidisciplinary design opti- deep, see deep neural net-
mization (MDO), 2, 28, works
40, 471 depth, 406
multidisciplinary model, 39 feedforward, 406
multifidelity models, 371 node, 406
multilevel coordinate search (MCS), recurrent, 406
283 weights, 409
multimodality, 19, 20, 23, 77, Newton’s method, 23, 35, 138,
137, 145, 280, 348 186, 389, 554
computational cost, 135 1-norm, 175, 544
convergence rate, 67 2-norm, 139, 540, 544
coupled, 488 Frobenius, 545
full-space hierarchical, 489 matrix, 545
globalization, 68, 488 visualization, 544
ill-conditioning, 68 weighted, 544
issues, 120 weighted Frobenius, 569
linear system, 68, 119 NP-complete, see polynomial-
minimization, 118 time complete
monolithic, 488 NSGA-II, 306, 362
preconditioning, 68 nullspace, 155, 159, 543
reduced-space hierarchical, Jacobian, 164
490, 492 left, 543
root finding, 61 numerical
scale invariance, 120, 144 conditioning, see ill-conditioning
solver, 65 errors, 47, 48, 52, 475
step, 68, 119 integration, see quadrature
Newton–Cotes formulas, 450 models, see model
Newton–Krylov method, 68 noise, 28, 47, 57, 92, 137,
Niederreiter sequence, 381 224, 230, 280, 319, 385
noisy optimization, 45
data, 385 stability, 55
function, 24, 92, 230, 421
model, 28, 371 objective function, 9, 77
NOMAD, 296 multiple, 281, see multiob-
nondominated jective optimization
point, 356 scaling, 112, 136
set algorithm, 362 selecting, 9, 11
sorting, 363 separable, 355
nonlinear units, 112
block methods, 484 offspring, 305
least squares, 388 one-shot optimization, 71
simplex algorithm, see Nelder– OpenAeroStruct, 494, 504
Mead algorithm OpenMDAO, 494, 497, 504, 508
nonsingular matrix, 554 operations research, 2, 19, 41, 45
normal probability distribution, operator overloading, 236, 246,
312, 375, 387, 397, 447, 247
470, 551 opportunistic polling, 291
uncorrelated, 465 optimal control, 2, 5, 26, 40, 41,
norms, 543 427
∞-norm, 91, 139, 544 optimal-stopping problem, 34
𝑝-norm, 212, 544 optimality, 4, 285
criteria, 22, 24 parallel computation, 307, 484,
dynamic programming, 340 485, 491, 494, 495, 518
Farkas’ lemma, 165 parameterization, 9, 328
first-order equality constrained, parents, 305
160 Pareto
first-order inequality con- anchor points, 358
strained, 165 front, 356, 366, 440
KKT conditions, 167 optimal, 356
second-order constrained, optimality, 355
161, 168 set, 356
tolerance, 200, 207 utopia point, 359
unconstrained, 88, 89 partial
optimization derivatives, 78, 238, 253,
algorithm selection, 26 260, 266, 500, 537
difficulties, 135, 137, 273 differential equation (PDE),
problem classification, 17 51, 62
problem formulation, 5, 6, pivoting, 556
17 particle swarm optimization (PSO),
problem reformulation, 5 42, 314, 316, 373
problem statement, 14 convergence, 317
software, 15, 41, 92, 199 initial population, 316
under uncertainty (OUU), particle position update, 315
28, 438 partitioning, 475, 476
optimum, see minimum pattern-search algorithms, see
order of convergence, 63, 65 also generalized pattern
ordinary search (GPS)
differential equation (ODE), PDE-constrained optimization,
51 513
kriging, 398 penalty function, 175, 197
orthogonal, 451 ATC, 520
columns, 263 exterior, 175
polynomials, 451, 460 interior, 183
search directions, 111 methods, 38, 152, 174, 289,
vectors, 542 312, 320
outer product, 540 parameter, 175, 176, 184,
self, 126, 566, 570 197
outputs, 6, 18, 224, 371, 478 quadratic, 175, 178, 521
overdetermined system, 384, 385 relaxation, 521
overfitting, 392, 393 percent-point function, 376
overflow, 54 physics-based model, 37, 381
integer, 53 pinpointing, 102
plane, 156
Polak–Ribière formula, 117 probability density function (PDF),
polar cone, 165 376, 399, 438, 550, 551
politics, 473 probability distribution, 438, 439
polling, 291 exponential, 552
polyhedral cone, 157, 164 Gaussian, see normal prob-
polynomial chaos, 445, 459 ability distribution
intrusive, 467 lognormal, 552
nonintrusive, 467 uniform, 311, 374, 552
software, 463 Weibull, 552
polynomial-time complete, 326 programming, 58
polynomials, 396 bugs, 57
Hermite, 453, 464 errors, 57
Legendre, 451 language, 15, 42, 44, 48, 53,
orthogonal, 451, 460 235, 236, 249, 385, 483,
quadratic, 396 498
population, 304, 305, 308 modular, 58
initial, 305, 307 profiling, 59
portfolio optimization, 38 testing practices, 59
positive-definite matrix, 89, 125, propagated error, 54
546 pruning, 329
positive-semidefinite matrix, 88, pseudo-load, 259
547
positive spanning QR factorization, 384
directions, 290 quadratic
set, 290 approximation, 118, 122, 123
post-optimality convergence, 63, 120
derivatives, 526, 528 form, 541, 547
sensitivity, 5, 174 function, 114, 138
studies, 5 interpolation, 107, 108
posynomial, 432 penalty, 175, 178, 179
potentially optimal rectangles, programming (QP), 187, 383,
302 423, 425
precision, 20, 47, 48, 53, 54, 56, quadratically constrained quadratic
60, 153, 224, 236, 251, programming (QCQP),
475, 483, 497 139, 427, 428
preconditioner, 489, 564 quadrature, 449, 454
preconditioning, 68, 564 Clenshaw–Curtis, 454
principal curvature directions, direct, 449
85, 114 Gauss–Hermite, 452–454, 465
principle of least time, 34 Gauss–Konrod, 454
principle of minimum energy, 1 Gauss–Legendre, 452
sparse grid, 455
quantile function, 376 plinary design feasible
quasi-Monte Carlo method, 457 (MDF)
quasi-Newton methods, 38, 120, regression, 5, 381
122, 123, 567 linear, 382, 385
BFGS, 38, 125, 567, 569, 571 nonlinear, 388
Broyden, 566 testing, 59
condition, 124 regular point, 160, 168
curvature condition, 124 regularization, 391, 397, 431
DFP, 38, 125, 568, 571 regularized least squares, 385
Hessian reset, 128 relative
L-BFGS, 130, 131 derivatives, 80
SR1, 128, 570, 571 error, 53
unification, 571 step size, 230
quasi-random sequences, 378 relaxation, 329, 485
quasi-separable decomposition factor, 486, 559
(QSD), 528 reliability metric, 441
reliable design, 438, 444
radial basis function (RBF), 284, reordering, 482, 487
396 residual form, 49, 480
radical inverse function, 378 residuals, 49, 50, 120, 224, 252,
random 265, 475–477, 555
sampling, 292, 307, 310, 374, derivatives, 222
456 norm, 65
variable, 397, 399, 439, 549 response surface model, see sur-
rank, 155, 385, 542 rogate models
rate of convergence, 62 restricted-step methods, see trust
linear, 63 region
plot, 64 reverse chain rule, 242
quadratic, 63 reverse Cuthill–McKee (RCM)
superlinear, 64 ordering, 482
real-encoded GA reward, 440
crossover, 311 risk, 440
initial population, 310 robust design, 438, 439
mutation, 311 Rosenbrock function, 573
selection, 310 𝑛-dimensional, 281, 573
rectified linear unit (ReLU), 408 two-dimensional, 134, 136,
recursion, 337 143, 392, 573
recursive solver, 495 roulette wheel selection, 308
reduced-space rounding, 327
Newton’s method, 495 roundoff error, 48, 53, 56, 57,
optimization, see multidisci- 229, 556
row space, 542
saddle point, 91, 92, 574 Hammersley, 380, 405
safety factor, 509 low-discrepancy, 377
sampling, 371 Niederreiter, 381
full factorial, 372 scrambled Halton, 380
inversion, 376 Sobol, 381
plan, 372 van der Corput, 378
random, 292, 307, 310, 374, sequential optimization, 5, 472,
456 510, 528, 529
scalar product, 540 sequential quadratic program-
scale invariance, 120, 144 ming (SQP), 38, 41, 152,
scaled identity matrix, 131 186
scaling, 68, 92 equality constrained, 186
constraints, 180 active set, 189
design variables, 112, 136 inequality constrained, 189
gradient, 136 line search, 187, 196
logarithm, 113 meaning, 188
objective function, 112, 136 quasi-Newton, 200
trust-region method, 144 system, 187
search direction, 109 shape optimization, 35, 40
conjugate, 115 shared design variables, 472, 516
method comparison, 132, Sherman–Morrison–Woodbury
134 formula, 126, 567, 569,
normalization, 117 571
steepest descent, 109 shipping, 341
zigzagging, 111 Shubert’s algorithm, 297
secant side constraints, see bound con-
equation, 124, 569, 570 straints
inverse equation, 568 sigmoid function, 144, 408
method, 66, 122, 566 signomial programming, 435
second-order cone programming simplex, 285
(SOCP), 41, 423, 427 simplex algorithm, 37
seed, see algorithmic differenti- simulated annealing, 42, 345,
ation (AD) 347
self influence, 316 simulation, 3, 69
semidefinite programming (SDP), simultaneous analysis and de-
41, 423 sign (SAND), 71, 72,
sensitivities, see derivatives 513, 515
separable objectives, 355 singular matrix, 554
sequence skewness, 447, 551
Faure, 381 slack variables, 166, 204
Fibonacci, 338 slanted quadratic function, 573
Halton, 379 slope, 78, 82, 97
smooth functions, 152
smoothing discontinuities, 144
smoothness, 18, 24
Sobol sequence, 381
social influence, 315, 316
software, 40, 41
    AD, 249
    engineering, 44
    geometric programming, 435
    MDO frameworks, 530
    optimization, 15, 199
    stochastic gradient descent, 411
    surrogate modeling, 397
solvers, 49
    hierarchical, 41, 495, 497
    iterative, 224, 225, 236
    linear, 60, 61, 554
    monolithic, 495
    Newton, 61, 65
    nonlinear, 61
    overview, 60
    recursive, 495
source code, 224, 231, 235, 248
    transformation, 246, 247, 249
span, 155, 542
sparse
    Jacobian, 260–263
    linear systems, 61, 245, 554
    matrix, 482
spectral
    expansions, see polynomial chaos
    projection, 462
splines, 9
splitting matrix, 558
spring system problem, 132, 142, 209, 579
SQP, see sequential quadratic programming (SQP)
SR1 method, 571
    update, 570
stability, 55
standard
    deviation, see variance
    error, 402
standard deviation, 312
state variables, 49, 69, 224, 252, 474, 483, 488
stationary point, 90, 92, 93, 166
statistical model, 397, 398
steepest-descent method, 36, 109
step length, 98, 135
step-size dilemma, 228, 229
stiffness matrix, 49, 252, 258, 477, 583
stochastic
    algorithms, 25, 284
    collocation, 462
    function, 20
    gradient descent, 43, 411
strong Wolfe conditions, 101, 105, 123
structural
    design problem, 70, 72, 252, 258, 582
    model, 48, 252, 478
    optimization, 40, 70, 444
structurally orthogonal columns, 262
subspace, 156, 542
subsystem, 495
subtractive cancellation, 55, 229
successive over-relaxation (SOR), 61, 559
sufficient curvature condition, 100
sufficient decrease condition, 97, 99
sum of squared errors, 409
superlinear convergence, 64
supervised learning, 431
support vector machine, 431
surrogate modeling toolbox (SMT), 397
surrogate models, 25, 28, 37, 283, 292, 370, 372, 528
    interpolatory, 381
    kriging, 37, 397
    linear regression, 382
    polynomial, 382
    regression, 381
surrogate-assisted optimization, 370
surrogate-based optimization (SBO), 28, 37, 42, 370, 412, 528
swarm, 314
symbolic differentiation, 80, 225, 237, 238, 246
    toolbox, 225
symmetric rank 1 (SR1) method, 128, 570
symmetry of second derivatives, 83
system-level
    optimization, 528
    representation, 480, 501
    solver, 483, 497
    subproblem, 515

tabulation, 339, 340, 343
tangent
    hyperplane, 156, 158, 164
    Jacobian, 223, 239
taping, 248
target variables, 511
Taylor series
    approximation, 86, 87, 94, 118, 446
    complex step, 231
    constraint, 158
    finite differences, 226
    multivariable, 535
    Newton’s method, 67
    single variable, 534
Taylor series expansion, 534
ten-bar truss problem, 220, 582
three-bar truss problem, 218
time
    dependence, 26
    horizon, 339
    integration, 62
topology optimization, 40
total
    derivatives, 238, 253, 499, 500, 537
    differential, see also differential
    potential energy, 132, 579
tournament selection, 308, 313
    multiobjective, 365
trade-offs
    cost vs. performance, 10
    direct vs. adjoint, 256
    forward vs. reverse mode AD, 245
    multidisciplinary, 473
    performance vs. robustness, 441
    risk vs. reward, 353, 440
    weight vs. drag, 354
training data, 371, 381, 393, 409
trajectory optimization, 1, 26, 39, 372
transportation problem, 36
traveling salesperson problem, 37, 326, 347
tree
    breadth-first search, 331
    data structure, 495, 496
    depth-first search, 330
    pruning, 329
trisection, 302
truncation error, 56, 227, 228, 233
trust region, 94, 138, 524
    comparison with line search, 94, 143
    methods, 121, 138, 141–143, 207, 283
    overview, 94
type casting, 475

uncertainty, 5, 20, 355, 387, 402, 438, 439, 444, 472
    quantification, 440, 441, 445
unconstrained optimization, 77
underdetermined system, 463
underfitting, 394
underflow, 54, 232, 233
unified derivatives equation (UDE), 265, 494, 503, 539
    AD, 271
    adjoint method, 269
    derivation, 265
    direct method, 269
    forward, 267
    reverse, 267, 505
uniform distribution, 311, 374, 552
unimodal function, 19, 280
unimodality, 20, 77
unit testing, 59
units, 7
unsupervised learning, 431
upper triangular matrix, 245, 273, 555
utopia point, 359

validation, 5, 48
van der Corput sequence, 378
variable-order Markov chain, 337
variables
    bounds, 524
    coupling, 474, 478, 480, 482, 483, 485, 488, 502, 511
    design, 6, 252, 475
    input, 371, 445, 478, 479
    output, 371, 478
    random, 397
    state, 252, 474, 483, 488
    target, 511
variance, 397, 438, 440, 448, 549, 550
vector, 540
    operations, 540
    space, 542
verification, 48, 241, 261, 273
visualization, 10

warm start, 151, 557
weak minimum, 19
Weibull distribution, 470, 552
weighted
    directed graph, 336
    Frobenius norm, 569
    function, 245
    inner product, 452
    norm, 544
    sum, 356
wind
    farm problem, 27, 40, 328, 356, 441, 442, 469
    rose, 442
wine barrel problem, 34, 149
wing design problem, 8, 40, 80, 473, 493, 504, 509, 513, 515, 519, 527, 575
Wolfe conditions, see strong Wolfe conditions
Woodbury matrix identity, see Sherman–Morrison–Woodbury formula
working set, 189

XDSM diagram, 471, 482
    data dependency lines, 482
    iterator, 484
    process lines, 484

zero-one variables, see binary variables
zigzagging, 111
Based on course-tested material, this rigorous yet accessible graduate textbook covers both fun-
damental and advanced optimization theory and algorithms. It covers a wide range of numerical
methods and topics, including both gradient-based and gradient-free algorithms, multidisciplinary
design optimization, and uncertainty, with instruction on how to determine which algorithm should
be used for a given application. It also provides an overview of models and how to prepare them
for use with numerical optimization, including derivative computation. Over 400 high-quality visu-
alizations and numerous examples facilitate understanding of the theory, and practical tips explain how
to identify and resolve common issues encountered in engineering design optimization.
Numerous end-of-chapter homework problems, progressing in difficulty, help put knowledge into
practice. Accompanied online by a solutions manual for instructors and source code for problems,
this is ideal for a one- or two-semester graduate course on optimization in aerospace, civil, mechan-
ical, electrical, and chemical engineering departments.
